Correlation and Regression
Correlation
Correlation is a statistical technique for measuring the extent of the relationship between
numerical variables. If an increase in one variable tends to be matched by an increase in
another, there is said to be positive correlation, and the two variables are positively
correlated. If an increase in one variable tends to be matched by a decrease in another,
there is said to be negative correlation, and the two variables are negatively correlated.
[Figure: scatter diagram of Annual Maintenance Cost (€) against Vehicle Age]
In the above example, the scatter diagram confirms the expected link between vehicle age
and maintenance cost.
The extent to which two sets of figures are correlated can be measured numerically by
finding the correlation coefficient, r. The formula used to obtain the correlation
coefficient is:
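One standard computational form of the product-moment (Pearson) correlation coefficient, assumed here to be the formula intended, is sketched below. The symbols are notation chosen for this sketch: n is the number of pairs of observations, and x and y are the paired values of the two variables.

```latex
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```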
This formula can be used to calculate the correlation coefficient in the case of vehicle age
and maintenance cost.
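As a minimal sketch of that calculation, the Python below applies the standard raw-sums form of the product-moment correlation coefficient to the twelve (vehicle age, maintenance cost) pairs listed in this section. The function name `pearson_r` and the variable names are illustrative assumptions, not part of the original notes.

```python
from math import sqrt

# Vehicle ages (years) and annual maintenance costs (€) from the example data.
ages = [5, 3, 7, 6, 8, 2, 3, 7, 5, 11, 2, 5]
costs = [345, 280, 755, 655, 695, 325, 420, 950, 650, 800, 300, 400]


def pearson_r(x, y):
    """Product-moment correlation coefficient computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(ages, costs)
print(round(r, 3))  # 0.816
```

The result agrees with the value r = 0.816 used for this example in the notes.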
This coefficient can range in value from -1, which indicates perfect negative correlation, to
+1, which indicates perfect positive correlation. A coefficient of zero indicates that there is
no correlation between the variables. Strong positive correlation can be said to occur when
the correlation coefficient is in the range 0.75 to 1. Strong negative correlation can be said
to occur when the coefficient is in the range -0.75 to -1. In the example above, r = 0.816,
so strong positive correlation exists between vehicle age and maintenance cost.
Note: Correlation does not necessarily indicate cause. A high correlation coefficient merely
indicates that the variables appear to be linked. An indicator of high correlation must be
analysed and interpreted before any firm conclusions regarding cause may be drawn.
The square of the correlation coefficient is known as the coefficient of determination (R²).
It indicates the extent to which the variation in the dependent variable can be explained by
the variation in the independent variable. It is usually expressed as a percentage.
In the example above, R² = (0.816)² = 0.6659. This indicates that about 67% of the variation
in the vehicles’ maintenance costs is explained by the variation in their ages.
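The "variation explained" reading can be checked numerically. The sketch below fits the least-squares line (the technique described under Linear Regression) to the example data and computes R² as one minus the ratio of residual variation to total variation. The variable names are illustrative assumptions.

```python
# Vehicle ages (years) and annual maintenance costs (€) from the example data.
ages = [5, 3, 7, 6, 8, 2, 3, 7, 5, 11, 2, 5]
costs = [345, 280, 755, 655, 695, 325, 420, 950, 650, 800, 300, 400]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(costs) / n

# Least-squares slope (b) and intercept (a) of the line y = a + bx.
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, costs))
b = sxy / sxx
a = mean_y - b * mean_x

# Total variation in cost, and the part left unexplained by the fitted line.
ss_total = sum((y - mean_y) ** 2 for y in costs)
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, costs))

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 4))  # 0.6659
```

For a straight-line fit with an intercept, this quantity equals the square of the correlation coefficient, which is why the two routes give the same figure.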
Linear Regression
Where a scatter diagram suggests that there is a statistical relationship between two
quantitative variables, and this is confirmed by the correlation coefficient, the technique of
linear regression can be used to find the equation of the line that best fits the data.
Vehicle Age (years)    Annual Maintenance Cost (€)
5                      345
3                      280
7                      755
6                      655
8                      695
2                      325
3                      420
7                      950
5                      650
11                     800
2                      300
5                      400
[Figure: scatter diagram of Annual Maintenance Cost (€) against Vehicle Age, with the fitted linear trend line]
The regression line is the line that minimises the sum of the squared vertical distances
between it and the points of the scatter diagram (the ‘least squares’ line). The formulae for
finding the equation of the regression line in the form y = a + bx are:
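The standard least-squares formulae for the slope b and the constant a, assumed here to be the ones intended, are sketched below. As notation for this sketch, n is the number of pairs of observations, and x̄ and ȳ are the means of the x and y values.

```latex
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
a = \frac{\sum y - b\sum x}{n} = \bar{y} - b\,\bar{x}
```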
When these two formulae are applied to the above example (vehicle ages and maintenance
costs), the values found, and the resulting regression equation, are:
a = 176.2
b = 69.7
y = 176.2 + 69.7x
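These values can be reproduced with a few lines of Python. The sketch below applies the standard least-squares formulae for y = a + bx to the example data. The function name `least_squares` is an illustrative assumption.

```python
# Vehicle ages (years) and annual maintenance costs (€) from the example data.
ages = [5, 3, 7, 6, 8, 2, 3, 7, 5, 11, 2, 5]
costs = [345, 280, 755, 655, 695, 325, 420, 950, 650, 800, 300, 400]


def least_squares(x, y):
    """Return (a, b) for the least-squares line y = a + bx."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = least_squares(ages, costs)
print(round(a, 1), round(b, 1))  # 176.2 69.7
```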
The equation can be used to predict the value of the y variable for given values of x. For
example, the company might want to predict the likely annual maintenance cost incurred
for a 9-year-old vehicle. The predicted value can be found by inserting a value of 9 for x in
the regression equation and calculating the resulting y value:
y = 176.2 + 69.7(9) = 803.5
The predicted annual maintenance cost for a 9-year-old vehicle is therefore about €803.50.
The regression equation can also be used to draw conclusions such as ‘for every unit
increase in x, the value of y increases by ...’. In the example we can deduce from the
regression equation that for every additional year of a vehicle’s age, an average of €69.70 is
added to its annual maintenance cost.
Regression equations are frequently used, as above, to predict values of the dependent (y)
variable. How reliable are these predictions? The coefficient of determination (R²) gives an
indication of how much of the variation in the values of the dependent variable can be
attributed to the variation in the independent variable. When the correlation between the
two variables is high (e.g. r = 0.9, R² = 0.81), then significant reliance can be placed
on the equation and predicted values can be considered to be quite accurate. Where
correlation is low (e.g. r = 0.4, R² = 0.16), little reliance can be placed on the
predicted values.
There is a range of other issues to consider when evaluating the reliability of predictions
based on linear regression:
- Little reliance can be placed on predictions based on small data sets (e.g. 10 pairs of
observations or fewer), regardless of the level of correlation.
- Care must be taken when extrapolating (i.e. predicting values of the dependent
variable for values of the independent variable outside the range of the data from
which the regression equation was obtained).
- It should not always be assumed that the relationship between the variables is
linear. Some alternative functional form may better represent the relationship
between the two variables, in which case a prediction based on linear regression
would not be reliable.
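The extrapolation caution can be built into a prediction routine. The sketch below uses the rounded regression equation from the example (y = 176.2 + 69.7x) and flags any prediction made for an age outside the observed range of 2 to 11 years. The function name `predict_cost` and its parameters are illustrative assumptions.

```python
def predict_cost(age, a=176.2, b=69.7, observed_range=(2, 11)):
    """Predict annual maintenance cost (€) and flag extrapolation.

    Returns (predicted_cost, extrapolated), where extrapolated is True
    when the age lies outside the range of ages in the example data.
    """
    low, high = observed_range
    extrapolated = not (low <= age <= high)
    return a + b * age, extrapolated


cost, extrapolated = predict_cost(9)
print(round(cost, 1), extrapolated)  # 803.5 False

cost, extrapolated = predict_cost(15)
print(round(cost, 1), extrapolated)  # 1221.7 True
```

A flagged prediction is not necessarily wrong, but it rests on the assumption that the linear relationship continues beyond the data, which the sample cannot confirm.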
Interpreting output obtained from linear regression in Excel
Pay careful attention to the order in which the terms are presented on the computer screen.
The first row of output contains the regression equation coefficient (the ‘b’) and the
constant term (the ‘a’). The first cell of the third row contains the R-squared term, the
‘coefficient of determination’. This statistic is an indicator of the reliability of the regression
equation as a method of predicting the value of the dependent variable, the ‘y’.
Rank Correlation
Where only ranking orders, rather than actual values, are available, Spearman’s rank
correlation coefficient can give an indicator of correlation. The formula is given below.
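The usual form of Spearman’s rank correlation coefficient, assumed here to be the formula intended, is sketched below. As notation for this sketch, d is the difference between the two ranks of each item and n is the number of ranked pairs. This form assumes there are no tied ranks.

```latex
r_s = 1 - \frac{6\sum d^{2}}{n\left(n^{2} - 1\right)}
```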