
Correlation and Regression

This document discusses correlation and linear regression. Correlation measures the relationship between two variables, with positive correlation indicating that as one variable increases, so does the other, and negative correlation meaning the opposite. Linear regression finds the equation of the line that best fits the data points when graphed. It can then be used to predict values of the dependent variable from the independent variable. The coefficient of determination, R^2, indicates how well the regression line represents the data.

Uploaded by

Aline Brito


Correlation
Correlation is a statistical technique for measuring the extent of the relationship between
numerical variables. If an increase in one variable tends to be matched by an increase in
another, there is said to be positive correlation, and the two variables are positively
correlated. If an increase in one variable tends to be matched by a decrease in another
there is said to be negative correlation, and the two variables are negatively correlated.

A visual indication of a possible statistical relationship between two variables can be
obtained using a scatter diagram. For example, a road haulage company might want to
assess the relationship between the age of a vehicle and its yearly maintenance cost, and it
might produce figures such as those shown below.

Age of Vehicle (years)    Annual Maintenance Cost (€)
5                         345
3                         280
7                         755
6                         655
8                         695
2                         325
3                         420
7                         950
5                         650
11                        800
2                         300
5                         400

These figures could be illustrated by means of a scatter diagram. When preparing a scatter
diagram, it is normal practice to measure the independent variable (here, vehicle age) on the
horizontal (x) axis. The dependent variable (here, maintenance cost), shown on the vertical
(y) axis, changes in response to changes in the independent variable.

[Figure: scatter diagram of Annual Maintenance Cost (€) on the vertical axis against Vehicle Age on the horizontal axis]

In the above example, the scatter diagram confirms the expected link between vehicle age
and maintenance cost.

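A numerical indication of the extent to which two sets of figures are correlated can be
obtained by calculating the correlation coefficient, r. For n pairs of observations (X, Y),
the formula is:

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}} $$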

This formula can be used to calculate the correlation coefficient in the case of vehicle age
and maintenance cost.

Age (X)   Cost (Y)   XY      X^2   Y^2
5         345        1725    25    119025
3         280        840     9     78400
7         755        5285    49    570025
6         655        3930    36    429025
8         695        5560    64    483025
2         325        650     4     105625
3         420        1260    9     176400
7         950        6650    49    902500
5         650        3250    25    422500
11        800        8800    121   640000
2         300        600     4     90000
5         400        2000    25    160000
Totals:
64        6575       40550   420   4176525

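In this case, with n = 12 pairs of observations, the correlation coefficient is:

$$
\begin{aligned}
r &= \frac{12(40550) - (64)(6575)}{\sqrt{\left(12(420) - 64^2\right)\left(12(4176525) - 6575^2\right)}} \\
  &= \frac{486600 - 420800}{\sqrt{(5040 - 4096)(50118300 - 43230625)}} \\
  &= \frac{65800}{\sqrt{(944)(6887675)}} = \frac{65800}{\sqrt{6501965200}} \\
  &= \frac{65800}{80634.76} = 0.816025 \approx 0.816
\end{aligned}
$$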
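
The hand calculation above can be checked with a short script. The following is a minimal sketch in plain Python that applies the same column sums to the twelve (age, cost) pairs from the table; the names `data` and `pearson_r` are illustrative choices, not part of the original handout.

```python
from math import sqrt

# (age in years, annual maintenance cost in euros), copied from the table above
data = [(5, 345), (3, 280), (7, 755), (6, 655), (8, 695), (2, 325),
        (3, 420), (7, 950), (5, 650), (11, 800), (2, 300), (5, 400)]

def pearson_r(pairs):
    """Correlation coefficient computed from the column sums used in the text."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(data)
print(round(r, 3))  # 0.816
```

Working from the sums rather than from deviations about the mean keeps the script line-for-line comparable with the table of XY, X^2 and Y^2 values.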

This coefficient can range in value from -1, which indicates perfect negative correlation, to
+1, which indicates perfect positive correlation. A coefficient of zero indicates that there is
no linear correlation between the variables. Strong positive correlation can be said to occur
when the correlation coefficient is in the range 0.75 to 1. Strong negative correlation can be
said to occur when the coefficient is in the range -0.75 to -1. The example above shows that
strong positive correlation exists between vehicle age and maintenance cost.

Note: Correlation does not necessarily indicate cause. A high correlation coefficient merely
indicates that the variables appear to be linked. An indication of high correlation must be
analysed and interpreted before any firm conclusions regarding cause may be drawn.

The square of the correlation coefficient is known as the coefficient of determination (R^2).
It indicates the extent to which the variation in the dependent variable can be explained by
the variation in the independent variable. It is usually expressed as a percentage.
In the example above, R^2 = (0.816)^2 = 0.6659. This indicates that about 67% of the variation
in the vehicles’ maintenance costs is explained by the variation in their ages.

Linear Regression

Where a scatter diagram suggests that there is a statistical relationship between two
quantitative variables, and this is confirmed by the correlation coefficient, the technique of
linear regression can be used to find the equation of the line that best fits the data.

[Figure: scatter diagram of Annual Maintenance Cost (€) against Vehicle Age, with the fitted linear trend line]

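The regression line is the line that minimises the sum of the squared vertical distances
between it and the points of the scatter diagram (the ‘least squares’ line). The formulae
for finding the equation of the regression line in the form y = a + bx are:

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}, \qquad a = \frac{\sum Y - b\sum X}{n} $$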

When these two formulae are applied to the above example (vehicle ages and maintenance
costs) the values found are;

a = 176.2
b = 69.7

The regression equation is therefore;

y = 176.2 + 69.7x
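
The values of a and b can be reproduced from the data. The sketch below assumes the standard least-squares formulae for the slope and intercept, stated in the code comments, and uses the twelve (age, cost) pairs from the example; `least_squares_line` is an illustrative name.

```python
# (age in years, annual maintenance cost in euros) from the example
data = [(5, 345), (3, 280), (7, 755), (6, 655), (8, 695), (2, 325),
        (3, 420), (7, 950), (5, 650), (11, 800), (2, 300), (5, 400)]

def least_squares_line(pairs):
    """Return (a, b) for the line y = a + b*x fitted by least squares.

    Assumed formulae:
        b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2)
        a = (Sy - b*Sx) / n
    """
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = least_squares_line(data)
print(round(a, 1), round(b, 1))  # 176.2 69.7
```

The slope shares its numerator (65800) with the correlation coefficient, so the sign of b always matches the sign of r.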

The equation can be used to predict the value of the y variable for given values of x. For
example, the company might want to predict the likely annual maintenance cost incurred
for a 9-year-old vehicle. The predicted value can be found by inserting a value of 9 for x in
the regression equation and calculating the resulting y value:

y = 176.2 + 69.7(9) = 176.2 + 627.3 = €803.50

The regression equation can also be used to draw conclusions of the form ‘for every unit
increase in x, the value of y increases by b’. In the example we can deduce from the
regression equation that for every additional year of a vehicle’s age, an average of €69.70 is
added to its annual maintenance cost.
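
Prediction and the interpretation of the slope can both be illustrated in a few lines. This is a small sketch using the rounded coefficients quoted in the text; `predict_cost` is a hypothetical helper name.

```python
INTERCEPT = 176.2  # a, in euros, as quoted in the text
SLOPE = 69.7       # b, in euros per additional year of age

def predict_cost(age_years, a=INTERCEPT, b=SLOPE):
    """Predicted annual maintenance cost (euros) from y = a + b*x."""
    return a + b * age_years

print(round(predict_cost(9), 2))  # 803.5

# One extra year of age adds the slope to the prediction.
print(round(predict_cost(10) - predict_cost(9), 2))  # 69.7
```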

Regression equations are frequently used, as above, to predict values of the dependent (y)
variable. How reliable are these predictions? The coefficient of determination (R^2) gives an
indication of how much of the variation in the values of the dependent variable can be
attributed to the variation in the independent variable. When the correlation between the
two variables is high (e.g. r = 0.9, R-squared = 0.81), then significant reliance can be placed
on the equation and predicted values can be considered to be quite accurate. Where
correlation is low (e.g. r = 0.4, R-squared = 0.16) then little reliance can be placed on the
predicted values.
There is a range of other issues to consider when evaluating the reliability of predictions
based on linear regression:
- Little reliance can be placed on predictions based on small data sets (e.g. 10 pairs of
observations or fewer) regardless of the level of correlation.
- Care must be taken when extrapolating (i.e. predicting values of the dependent
variable for values of the independent variable outside the range of the data from
which the regression equation was obtained).
- It should not always be assumed that the relationship between the variables is
linear. Some alternative functional form may better represent the relationship
between the two variables, in which case a prediction based on linear regression
would not be reliable.
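
The first two checks in the list can be automated before a prediction is trusted. The sketch below is one possible way to do so, not a standard routine: `reliability_warnings` is a hypothetical helper, and the threshold of 10 pairs simply mirrors the rule of thumb given in the list.

```python
# Vehicle ages (years) from which the regression equation was obtained
ages = [5, 3, 7, 6, 8, 2, 3, 7, 5, 11, 2, 5]

def reliability_warnings(x, observed_x, small_sample=10):
    """Return a list of warnings for a prediction made at x."""
    warnings = []
    # Rule of thumb from the text: 10 pairs of observations or fewer is small.
    if len(observed_x) <= small_sample:
        warnings.append("small data set")
    # Extrapolation: x lies outside the range of the observed x values.
    if x < min(observed_x) or x > max(observed_x):
        warnings.append("extrapolation")
    return warnings

print(reliability_warnings(9, ages))   # []
print(reliability_warnings(15, ages))  # ['extrapolation']
```

A 9-year-old vehicle lies inside the observed age range of 2 to 11 years, so the earlier prediction is an interpolation. The third issue, non-linearity, is best judged from the scatter diagram rather than from a simple rule.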

Using Regression in Microsoft Excel (2007)


- Open Microsoft Excel.
- Input the data (the y values and the x values) in columns.
- Select (highlight) the cells where you want the output data to be displayed. You will
need to select five rows. The number of columns selected needs to equal the number of
variables in the table of data, e.g. if there is one dependent (y) variable and one
independent (x) variable, you should select two columns.
- Click on the ‘Formulas’ tab and select the ‘More Functions’ option. From the drop-down
menu select ‘Statistical’ and then scroll to ‘LINEST’.
- In the ‘Function Arguments’ box enter the cell ranges for the y values and the x values.
Type the word ‘TRUE’ in the ‘Const’ and ‘Stats’ boxes.
- Press Control + Shift + Enter.
- The output data will appear in the selected cells.

Interpreting output obtained from linear regression in Excel
Pay careful attention to the order in which the terms are presented on the computer screen.
The first row of output contains the regression equation coefficient (the ‘b’) and the
constant term (the ‘a’), in that order. The first cell of the third row contains the R-squared
term, the ‘coefficient of determination’. This statistic is an indicator of the reliability of the
regression equation as a method of predicting the value of the dependent variable, the ‘y’.
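
To check that the figures Excel reports match the hand calculations, the same three quantities can be recomputed independently. The sketch below is not Excel's implementation; it fits the line from deviations about the means and obtains R-squared from the residuals, as 1 minus the ratio of unexplained to total variation, for the vehicle data used earlier.

```python
# (age in years, annual maintenance cost in euros) from the vehicle example
data = [(5, 345), (3, 280), (7, 755), (6, 655), (8, 695), (2, 325),
        (3, 420), (7, 950), (5, 650), (11, 800), (2, 300), (5, 400)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Slope and intercept from deviations about the means
b = (sum((x - mean_x) * (y - mean_y) for x, y in data)
     / sum((x - mean_x) ** 2 for x, _ in data))
a = mean_y - b * mean_x

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((y - (a + b * x)) ** 2 for x, y in data)
ss_tot = sum((y - mean_y) ** 2 for _, y in data)
r_squared = 1 - ss_res / ss_tot

print(round(b, 1), round(a, 1))  # 69.7 176.2  (first row: b, then a)
print(round(r_squared, 4))       # 0.6659      (third row, first cell)
```

For a regression with a single independent variable, this R-squared equals the square of the correlation coefficient, which is why it agrees with the value found in the correlation section.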

Rank Correlation

Where only ranking orders, rather than actual values, are available, Spearman’s rank
correlation coefficient can give an indicator of correlation. The formula is given below.

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

In the above formula, d represents the observed differences in rankings and n is the number
of pairs of rankings.


Example: 7 students have the following ranking orders for their performances in Statistics
and Economics.

Rank (Statistics)   Rank (Economics)
2                   1
1                   3
4                   7
6                   5
5                   6
3                   2
7                   4

Spearman’s rank correlation coefficient would be found as follows:

Rank (Statistics)   Rank (Economics)   d    d^2
2                   1                  1    1
1                   3                  -2   4
4                   7                  -3   9
6                   5                  1    1
5                   6                  -1   1
3                   2                  1    1
7                   4                  3    9
                                 Total d^2: 26

r = 1 – [6(26)/7(48)] = 1 – [156/336] = 1 – 0.4643 = 0.5357
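
The same calculation can be sketched in Python. The function below uses the formula 1 - 6(sum of d^2)/(n(n^2 - 1)), which assumes there are no tied ranks, as is the case in this example; the variable and function names are illustrative.

```python
# Rankings of the seven students, in the order given in the table
statistics_rank = [2, 1, 4, 6, 5, 3, 7]
economics_rank = [1, 3, 7, 5, 6, 2, 4]

def spearman_rank_correlation(rank_a, rank_b):
    """Spearman's coefficient from two lists of ranks (no ties assumed)."""
    n = len(rank_a)
    sum_d2 = sum((ra - rb) ** 2 for ra, rb in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

rho = spearman_rank_correlation(statistics_rank, economics_rank)
print(round(rho, 4))  # 0.5357
```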
