Correlation and Regression
Example
Find the moment coefficient of kurtosis of the following distribution.
X f
12 1
14 4
16 6
18 10
20 7
22 2
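Mean: $M = \frac{\sum fx}{\sum f} = \frac{528}{30} = 17.6$
Variance: $\sigma^2 = \frac{\sum f(x - M)^2}{\sum f} = \frac{179.20}{30} = 5.973$, so $\sigma^4 = 35.677$
Fourth moment about the mean: $M_4 = \frac{\sum f(x - M)^4}{\sum f} = \frac{2{,}676.74}{30} = 89.22$
Moment coefficient of kurtosis $= \frac{M_4}{\sigma^4} = \frac{89.22}{35.677} = 2.5$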
Note: The coefficient of kurtosis can also be found using the method of
assumed mean.
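The arithmetic in this example can be checked with a short script. This is only a sketch: the helper name `central_moment` is an illustrative assumption, and the frequencies are copied from the table above.

```python
# Frequency distribution from the example: value x -> frequency f
dist = {12: 1, 14: 4, 16: 6, 18: 10, 20: 7, 22: 2}

def central_moment(dist, k):
    """k-th moment about the mean of a frequency distribution."""
    total_f = sum(dist.values())
    mean = sum(x * f for x, f in dist.items()) / total_f
    return sum(f * (x - mean) ** k for x, f in dist.items()) / total_f

variance = central_moment(dist, 2)   # sigma squared
m4 = central_moment(dist, 4)         # fourth moment about the mean
kurtosis = m4 / variance ** 2        # moment coefficient of kurtosis

print(round(variance, 3), round(m4, 2), round(kurtosis, 1))
```

The same helper could be used for the third moment if a skewness figure were needed.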
CHAPTER FIVE
Specific Objectives
At the end of the topic the trainee should be able to:
➢ Draw the scatter diagram;
➢ Differentiate between the various forms of correlation;
➢ Determine the correlation coefficient and interpret it;
➢ Determine the coefficient of determination and interpret it;
➢ Apply the linear regression models.
Introduction
When the relationship between variables is quantitative, the statistical
tool used to discover and measure that relationship and to express it in
a brief formula is known as correlation. It is an important statistical
concept which refers to the interrelationship or association between
variables. The purpose of studying correlation is to establish whether a
relationship exists, so that one can plan and control the inputs
(independent variables) and the outputs (dependent variables).
In business, one may be interested in establishing whether a relationship
exists between the:
i. Amount of fertilizer applied on a given farm and the resulting
harvest
ii. Amount of experience one has and the corresponding
performance
iii. Amount of money spent on advertisement and the expected
incomes after sale of the goods/service
There are two measures of the degree of correlation between two
variables. These are denoted by R (the rank correlation coefficient) and
r (the product moment correlation coefficient).
Significance of the study of correlation
➢ Most variables show some kind of relationship, for example price and
supply, or income and expenditure. With the help of correlation
analysis, the degree of relationship existing between the variables
can be measured.
➢ Once it is known that two variables are closely related, the value of
one can be estimated from the other. This is done with the help of
regression analysis.
➢ It contributes to the understanding of economic behaviour and aids in
locating the critically important variables on which others depend.
➢ Nature has been found to be a multiplicity of interrelated forces.
➢ It helps in determining the degree of relationship between two or
more variables.
Types of correlation
• Positive and negative correlation
• Simple, partial and multiple
• Linear and non linear.
Linear and non linear correlation
The distinction between linear and non linear correlation is based upon the
constancy of the ratio of change between the variables. If the amount of
change in one variable tends to bear a constant ratio to the amount of
change in the other variable, then the correlation is said to be linear. If
such variables are plotted on graph paper, all the plotted points fall on a
straight line. Correlation is non linear, or curvilinear, if the amount
of change in one variable does not bear a constant ratio to the amount of
change in the other variable.
Scatter diagram
The simplest device for studying correlation between two variables is a
special type of dot chart called a dotogram or scatter diagram.
A scatter graph comprises points which have been plotted but are not
joined by line segments.
The pattern of the points usually reveals the type of relationship
existing between the variables.
The following sketch graphs assist in the interpretation of scatter
graphs.
[Figure: scatter diagram of perfect positive correlation; horizontal axis:
independent variable]
NB: The above pattern is referred to as perfect because the points can be
represented by a single straight line, e.g. when measuring the
relationship between volume of sales and profits in a company: the more
the company sells, the higher the profits.
Perfect negative correlation
[Figure: scatter diagram of perfect negative correlation; horizontal axis:
price (independent variable); vertical axis: quantity sold]
This example considers volume of sales in relation to price: the cheaper
the goods, the bigger the sales.
No correlation
[Figure: scatter diagram showing no correlation; the points are scattered
without any pattern]
h) Spurious Correlations
In some rare situations, plotting the data for x and y may show either
positive or negative correlation, yet when the data are analysed in
real life there is no convincing evidence that such a relationship
exists. The relationship therefore exists only in theory and is
referred to as spurious or nonsense correlation, e.g. when high pass
rates of students show a strong relationship with increased accidents.
Correlation coefficient
Correlation coefficients are numerical measures of the correlation
existing between the dependent and the independent variables. They are
better measures of correlation than scatter graphs (diagrams). Correlation
coefficients range between -1 and +1. A correlation coefficient of +1
implies that there is perfect positive correlation. A value of -1 shows
that there is perfect negative correlation. A value of 0 implies no
linear correlation at all.
The following chart is useful in interpreting correlation
coefficients.
Limitation
• The exact degree of correlation between the variables cannot be
established.
Note that this formula can be rearranged into different forms, but the
result is always the same.
Example
The following data was observed and it is required to establish if there
exists a relationship between the two.
X 15 24 25 30 35 40 45 65 70 75
Y 60 45 50 35 42 46 28 20 22 15
Solution
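$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
$$= \frac{-25{,}762}{\sqrt{(39{,}484)(19{,}461)}} = -0.93$$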
The correlation coefficient thus indicates a strong negative linear
association between the two variables.
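The working above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` is an illustrative assumption, and the data are the X and Y values given in the example.

```python
from math import sqrt

x = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]

def pearson_r(x, y):
    """Product moment correlation coefficient from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(x, y)
print(round(r, 2))
```

Working from raw sums mirrors the hand calculation, so the intermediate figures (numerator and the two bracketed terms) can be compared directly with the text.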
NOTE:
• A high value of r (close to +0.9 or –0.9) only shows a strong
association between the two variables; it does not imply a causal
relationship, i.e. that change in one variable causes change in the
other. It is possible to find two variables which produce a high
calculated r yet have no causal relationship. This is known as spurious
or nonsense correlation, e.g. high pass rates in QT in Kenya and
increased inflation in Asian countries.
• A low correlation coefficient does not imply lack of relationship
between the variables but lack of a linear relationship, i.e. there
could exist a curvilinear relation.
• A further problem in interpretation arises from the fact that the r value
here measures the relationship between a single independent variable
and the dependent variable, whereas a particular variable may depend
on several independent variables (e.g. crop yield may depend on
fertilizer used, soil exhaustion, soil acidity level, season of the
year, type of seed, etc.), in which case multiple correlation should
be used instead.
Example
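Student   Q. T. ranking   Law II ranking   d     d²
A         2               3                -1    1
B         7               6                1     1
C         6               4                2     4
D         1               2                -1    1
E         4               5                -1    1
F         3               1                2     4
G         5               8                -3    9
H         8               7                1     1
                                           Σd² = 22
$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(22)}{8(8^2 - 1)} = 0.74$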
Thus we conclude that there is reasonable agreement between the students'
performances in the two types of tests.
NOTE: In this example, if we were given the actual marks we could find
r instead. R varies between +1 and -1.
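As a hedged check of the rank correlation above, the sketch below computes R from the two sets of ranks. The function name `rank_correlation` is an illustrative assumption; the ranks for student A (2 and 3) are taken from the tied-rankings version of this example later in the text.

```python
# (Q. T. ranking, Law II ranking) for students A to H
ranks = {
    "A": (2, 3), "B": (7, 6), "C": (6, 4), "D": (1, 2),
    "E": (4, 5), "F": (3, 1), "G": (5, 8), "H": (8, 7),
}

def rank_correlation(pairs):
    """Rank correlation R = 1 - 6*sum(d^2) / (n*(n^2 - 1)), no ties."""
    n = len(pairs)
    sum_d2 = sum((p - q) ** 2 for p, q in pairs)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

R = rank_correlation(list(ranks.values()))
print(round(R, 2))
```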
Tied Rankings
A slight adjustment to the formula is made if some students tie and have
the same ranking. The adjustment is
$\frac{t^3 - t}{12}$, where t = number of tied rankings.
The adjusted formula becomes
$$R = 1 - \frac{6\left(\sum d^2 + \frac{t^3 - t}{12}\right)}{n(n^2 - 1)}$$
Example
Assume that in our previous example student E & F achieved equal marks in
Q. T. and were given joint 3rd place.
Solution
Student   Q. T. ranking   Law II ranking   d      d²
A         2               3                -1     1
B         7               6                1      1
C         6               4                2      4
D         1               2                -1     1
E         3½              5                -1½    2¼
F         3½              1                2½     6¼
G         5               8                -3     9
H         8               7                1      1
                                           Σd² = 25½
$$R = 1 - \frac{6\left(\sum d^2 + \frac{t^3 - t}{12}\right)}{n(n^2 - 1)} = 1 - \frac{6\left(25\tfrac{1}{2} + \frac{2^3 - 2}{12}\right)}{8(8^2 - 1)} \quad \text{since } t = 2$$
$$= 1 - \frac{156}{504} = 0.69$$
NOTE: It is conventional to show the shared rankings as above, i.e. E, & F
take up the 3rd and 4th rank which are shared between the two as 3½
each.
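The tie adjustment can be sketched in code as follows. The function name `adjusted_rank_correlation` is an illustrative assumption, and the sum of squared differences is recomputed from the table rows rather than taken as given.

```python
# (Q. T. ranking, Law II ranking); E and F share ranks 3 and 4 as 3.5 each
pairs = [(2, 3), (7, 6), (6, 4), (1, 2), (3.5, 5), (3.5, 1), (5, 8), (8, 7)]

def adjusted_rank_correlation(pairs, t):
    """Rank correlation with the (t^3 - t)/12 adjustment for t tied rankings."""
    n = len(pairs)
    sum_d2 = sum((p - q) ** 2 for p, q in pairs)
    adjustment = (t ** 3 - t) / 12
    return 1 - 6 * (sum_d2 + adjustment) / (n * (n ** 2 - 1))

sum_d2 = sum((p - q) ** 2 for p, q in pairs)
R = adjusted_rank_correlation(pairs, t=2)
print(sum_d2, round(R, 2))
```

With t = 2 the adjustment adds only 0.5 to the sum of squared differences, so its effect on R is small for a single tie.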
Coefficient of Determination
This refers to the ratio of the explained variation to the total variation and
is used to measure the strength of the linear relationship. The stronger the
linear relationship the closer the ratio will be to one.
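As a brief illustration, assuming the simple linear case (where the coefficient of determination equals the square of the product moment coefficient r), the value r = -0.93 from the earlier worked example would give

$$r^2 = (-0.93)^2 \approx 0.86$$

that is, roughly 86% of the variation in y would be explained by the linear relationship with x.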
REQUIRED
Calculate the rank correlation coefficient and hence comment briefly on
the value obtained
Contestant   1st ranking   2nd ranking   d     d²
A            6             5             1     1
B            1             3             -2    4
C            3             4             -1    1
D            7             6             1     1
E            8             7             1     1
F            2             1             1     1
G            4             8             -4    16
H            5             2             3     9
J            10            9             +1    1
K            9             10            -1    1
                                         Σd² = 36
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 36}{10(10^2 - 1)} = 1 - \frac{216}{990} = 1 - 0.22 = 0.78$$
Comment: Since the correlation is 0.78, there is a high positive
correlation between the ranks awarded to the contestants
(0.78 > 0 and 0.78 > 0.5).
Example
Contestant   1st assessor   2nd assessor   d      d²
A            1              2              -1     1
B            5.5            3              2.5    6.25
C            3              4              -1     1
D            2              1              1      1
E            4              5              -1     1
F            5.5            6.5            -1     1
G            7              6.5            0.5    0.25
H            8              8              0      0
                                           Σd² = 11.5
(B and F tie for 5th place under the 1st assessor and take rank 5.5 each.)
Required: Compute the rank correlation coefficient
$$\therefore R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 11.5}{8(63)} = 1 - \frac{69}{504} = 1 - 0.14 = 0.86$$
This implies high positive correlation
Example (Rank Correlation Coefficient)
Sometimes numerical data for quantifiable variables are given, and a
rank correlation coefficient is to be worked out from them.
In such a situation, the rank correlation coefficient is determined after
the given values have been converted into ranks. See the following
example.
Example
(Product moment correlation)
The following data was obtained during a social survey conducted in a given
urban area regarding the annual income of given families and the
corresponding expenditures.
Workings:
$\bar{X} = \frac{4020}{10} = 402$, $\bar{Y} = \frac{3550}{10} = 355$
r = 0.89
Comment: The value obtained 0.89 suggests that the correlation between
annual income and annual expenditure is high and positive. This implies
that the more one earns the more one spends.
REGRESSION
BASIC CONCEPTS
Regression refers to the changes which occur in the dependent variable as
a result of changes in the independent variable. Knowledge of regression
is particularly useful in business statistics, where it is necessary to
consider the corresponding changes in dependent variables whenever
independent variables change. Most business activities involve a dependent
variable and one or more independent variables. Knowledge of regression
therefore enables a business statistician to predict or estimate the
expected value of a dependent variable when given the value of an
independent variable. Consider the above example of annual incomes and
annual expenditures: using regression techniques, one can determine the
estimated expenditure of a given family if the annual income is known,
and vice versa.
[Figure: scatter diagram with the line of best fit drawn through the
plotted points]
Example
An investment company advertised the sale of pieces of land at different
prices. The following table shows the pieces of land their acreage and costs
Piece of land   Acreage x (hectares)   Cost y (£000)   xy      x²
A               2.3                    230             529     5.29
B               1.7                    150             255     2.89
C               4.2                    450             1890    17.64
D               3.3                    310             1023    10.89
E               5.2                    550             2860    27.04
F               6.0                    590             3540    36
G               7.3                    740             5402    53.29
H               8.4                    850             7140    70.56
J               5.6                    530             2968    31.36
                Σx = 44.0              Σy = 4400       Σxy = 25607   Σx² = 254.96
Required
Determine the regression equations of
i. y on x and hence estimate the cost of a piece of land with 4.5
hectares
ii. Estimate the expected acreage if the piece of land costs £ 900,000
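The normal equations are
$\sum y = an + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
Solving them gives
Intercept $a = \frac{\sum y - b\sum x}{n}$
Slope $b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2}$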
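A minimal sketch of how the intercept and slope formulas above could be applied to the land data, covering part (i) only. The function name `least_squares` is an illustrative assumption; the acreage and cost figures are those in the table.

```python
# (acreage in hectares, cost in £000) for pieces A to J, from the table
land = [(2.3, 230), (1.7, 150), (4.2, 450), (3.3, 310), (5.2, 550),
        (6.0, 590), (7.3, 740), (8.4, 850), (5.6, 530)]

def least_squares(points):
    """Return (a, b) for the regression line y = a + b*x."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = least_squares(land)
cost_at_4_5 = a + b * 4.5   # part (i): estimated cost of 4.5 hectares, in £000

print(f"y = {a:.2f} + {b:.2f}x")
print(f"Estimated cost of 4.5 hectares: {cost_at_4_5:.1f} (£000)")
```

For part (ii), one option would be to swap the roles of x and y in `least_squares` to obtain the regression of x on y before substituting the cost.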
Example
The calculations for our sample size n = 10 are given below. The linear
regression model is
y = a + bx
[Table of calculations]
Slope b = 2.66
Intercept a = 5.91
We now insert these values in the linear model giving
y = 5.91 + 2.66x
or
Delivery time (mins) = 5.91 + 2.66 (delivery distance in miles)
The slope of the regression line is the estimated number of minutes per
mile needed for a delivery. The intercept is the estimated time to prepare
for the journey and to deliver the goods which is the time needed for each
journey other than the actual traveling time.
Example
Cost of production per week in a large department depends on several
factors;
i. Total numbers of hours worked
ii. Raw material used during the week
iii. Total number of items produced during the week
iv. Number of hours spent on repair and maintenance
It is sensible to use all the identified factors to predict department
costs. A scatter diagram will not show the relationship between the
various factors and total costs.
The linear model for multiple linear regression (the line of best fit) is
of the type
y = α + b1x1 + b2x2 + ... + bnxn
We assume that errors or residuals are negligible.
In order to choose between models, we examine the values of the
multiple correlation coefficient r and the standard deviation of the
residuals σ.
A model which describes the relationship between y and the x's well has a
multiple correlation coefficient r close to 1 and a small value of σ.
Example
Odino Chemicals Limited is aware that its power costs are semi-variable,
and over the last six months these costs have shown the following
relationship with a standard measure of output.
Month   Output (standard units)   Total power costs (£000)
1       12                        6.2
2       18                        8.0
3       19                        8.6
4       20                        10.4
5       24                        10.2
6       30                        12.4
Required
i. Using the method of least squares, determine an appropriate
linear relationship between total power costs and output
ii. If total power costs are related to both output and time (as
measured by the number of the month), the following least
squares regression equation is obtained:
Power costs = 4.42 + (0.82) output + (0.10) month
where the regression coefficients (i.e. 0.82 and 0.10) have t
values of 2.64 and 0.60 respectively and the coefficient of multiple
correlation amounts to 0.976.
Compare the relative merits of this fitted relationship with the one
you determine in (i). Explain (without doing any further analysis)
how you might use the data to forecast total power costs in month
seven.
Solution
a)
Output (x)   Power costs (y)   x²           y²            xy
12           6.2               144          38.44         74.40
18           8.0               324          64.00         144.00
19           8.6               361          73.96         163.40
20           10.4              400          108.16        208.00
24           10.2              576          104.04        244.80
30           12.4              900          153.76        372.00
Σx = 123     Σy = 55.8         Σx² = 2705   Σy² = 542.36  Σxy = 1,206.60
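$$b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{376.2}{1101} = 0.342$$
$$a = \frac{1}{n}\left(\sum y - b\sum x\right) = \frac{1}{6}\left(55.8 - 0.342 \times 123\right) = 2.29$$
(Power costs) = 2.29 + 0.342 (output)
Correlation coefficient:
$$r = \frac{376.2}{\sqrt{1101 \times 140.52}} = 0.96$$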
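The figures for b, a and r can be checked with the sketch below, which uses only the standard library and the six months of data from the table. Variable names are illustrative assumptions. Because b is not rounded before a is computed, the intercept may differ slightly in the second decimal place from the hand calculation.

```python
from math import sqrt

output = [12, 18, 19, 20, 24, 30]             # x: standard units
power = [6.2, 8.0, 8.6, 10.4, 10.2, 12.4]     # y: total power costs, £000

n = len(output)
sum_x, sum_y = sum(output), sum(power)
sum_xy = sum(x * y for x, y in zip(output, power))
sum_x2 = sum(x * x for x in output)
sum_y2 = sum(y * y for y in power)

# Least squares slope and intercept for: power costs = a + b * output
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Product moment correlation between output and power costs
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"b = {b:.3f}, a = {a:.3f}, r = {r:.2f}")
```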
This shows a strong correlation between power costs and output. The
multiple correlation coefficient when both output and time are considered
together is 0.976.
We observe that there has been very little increase in r, which means that
inclusion of the time variable does not improve the correlation
significantly.
The t value for the time variable is only 0.60, which is insignificant
compared with a t value of 2.64 for the output variable.
In fact, if we work out the correlation between output and time, it will
be high. Hence there is no need to include both variables. Inclusion of
time does improve the correlation coefficient, but only by a very small
amount.
If we use the linear regression analysis and attempt to find the linear
relationship between output and time i.e.
Month Output
1 12
2 18
3 19
4 20
5 24
6 30
The values of b and a turn out to be 3.11 and 9.6, i.e. the relationship
is of the form
Output = 9.6 + 3.11 × month
For this equation, the forecast for the 7th month is
Output = 9.6 + 3.11 × 7
= 9.6 + 21.77
= 31.37 units
Using the equation Power costs = 2.29 + 0.34 × output:
Power costs = 2.29 + 0.34 × 31.37
= 2.29 + 10.67
= 12.96, i.e. £ 12,960
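The two-step forecast above (month to output, then output to power costs) can be sketched as follows. The helper name `least_squares` is an illustrative assumption. The script keeps unrounded coefficients, so its final figure is expected to differ slightly from the hand calculation, which rounds b to 0.34 and 3.11.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

months = [1, 2, 3, 4, 5, 6]
output = [12, 18, 19, 20, 24, 30]
power = [6.2, 8.0, 8.6, 10.4, 10.2, 12.4]    # £000

# Step 1: output as a function of month, then forecast month 7
a_out, b_out = least_squares(months, output)
output_7 = a_out + b_out * 7

# Step 2: power costs as a function of output, applied to the forecast
a_pow, b_pow = least_squares(output, power)
power_7 = a_pow + b_pow * output_7

print(f"Forecast output, month 7: {output_7:.2f} units")
print(f"Forecast power costs, month 7: {power_7:.2f} (£000)")
```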
i. Exponential model
$y = ab^x$
Take logs of both sides:
$\log y = \log a + \log b^x$
$\log y = \log a + x\log b$
Let $\log y = Y$, $\log a = A$ and $\log b = B$