
Correlation and Regression-1

This document discusses correlation and regression analysis. Correlation analysis quantifies the relationship between two or more continuous variables without inferring causation. Regression analysis assesses the relationship between an outcome variable and predictor variables. The key objectives covered are defining dependent and independent variables, computing and interpreting correlation coefficients and regression coefficients, and recognizing applications and limitations of correlation and regression analysis techniques.

Uploaded by

KELVIN ADDO
Copyright
© © All Rights Reserved

CORRELATION AND REGRESSION ANALYSIS 1

In this Chapter, we discuss correlation analysis, a technique used to quantify the
interrelation between two or more continuous variables. For example, a correlation coefficient
could be computed in a study carried out to find out whether a person’s height is related to
his or her age.

Regression analysis is a related technique used to assess the relationship between an outcome
variable and one or more risk factors or confounding variables. The outcome variable is also
called the response or dependent variable, and the risk factors and confounders are called the
predictors, or explanatory or independent variables. In regression analysis, the dependent
variable is denoted “Y” and the independent variables are denoted “X”.

LEARNING OBJECTIVES

At the end of this Chapter, you should be able to:

1. Define and identify dependent and independent variables in a study.
2. Compute and interpret a correlation coefficient.
3. Compute and interpret coefficients in a linear regression analysis.
4. Define and compute the coefficient of determination.
5. Calculate the simple linear regression equation for a set of data and know the basic
assumptions behind regression analysis.
6. Recognize regression analysis applications for purposes of description and prediction.
7. Obtain measures of the error involved in using the regression line as a basis of estimation.
8. Recognize some potential problems if regression analysis is used incorrectly.

CORRELATION

Correlation deals with finding the relationship between two quantitative variables without
being able to infer causal relationships. Correlation is a statistical technique used to determine
the degree, and direction to which two variables are related. Correlation expresses the
relationship or interdependence of two sets of variables upon each other in such a way that
the changes in the value of one variable are in sympathy with the changes in the other.

Correlation Coefficient is the numerical measurement showing the degree of correlation
between two variables.

CAUSE AND EFFECT

Correlation does not mean the presence of a cause and effect relationship between the two
distributions. Thus, a correlation between two variables does not necessarily imply that one
causes the other. The “cause and effect” assumption is a fallacy known as cum hoc ergo
propter hoc, Latin for "with this, therefore because of this". For example, when we say that
there is a relationship between price and demand, it does not mean that price “causes” demand.
It means only that as price increases, the quantity demanded tends to decrease, and vice versa.

It is generally assumed that when two variables are correlated, a certain relationship exists
between them. But it is possible for two variables to be statistically correlated and yet
practically unrelated. For example, there cannot be any real relationship between rainfall and
the percentage of passes in an examination, even though a correlation may be found between
them. Such a correlation is called Spurious Correlation, and it arises due to chance.

USEFULNESS OF CORRELATION

Correlation is useful in physical and social sciences. The following are the important uses.
1. Correlation is very useful to economists to study the relationship between variables like
price and quantity demanded. It helps businessmen to estimate costs, sales, price, and
other related variables.
2. Some variables show some kind of relationship; correlation analysis helps in measuring
the degree of relationship between the variables like supply and demand etc.
3. The relation between variables can be verified and tested for significance, with the help of
the correlation analysis.
4. The coefficient of Correlation is a relative measure, and we can compare the relationship
between variables which are expressed in different units.
5. Sampling error can also be calculated.
6. Correlation is the basis for the concept of regression and ratio of variation.

Types of Correlation

Correlation is classified into the following types:

1. Positive and Negative
2. Simple and Multiple
3. Partial and Total
4. Linear and Non-Linear

1. Positive and Negative

The direction of variation of the variables determines whether correlation is positive or
negative.

Correlation is said to be positive when the values of two variables move in the same
direction, so that an increase in the values of one variable is associated with an increase in the
values of the other variable, and a decrease in the values of one variable is associated
with a decrease in the values of the other variable.

Correlation is said to be negative if an increase or decrease in the values of one variable is
associated with a decrease or increase in the values of the other, so that the values move in
opposite directions.

2. Simple and Multiple

When we study only two variables, the relationship is described as simple correlation.
In multiple correlation we study more than two variables simultaneously; for example, the
relationship among price, demand and supply of a commodity.

3. Partial and Total

The study of two variables excluding some other variables is called partial correlation; for
example, when we study price and demand while eliminating the supply side. In total
correlation, all factors are taken into account.

4. Linear and Non-Linear

Correlation is said to be linear if the amount of change in one variable tends to bear a
constant ratio to the amount of change in the other variable. If the ratio of change between
two variables is uniform, then there will be linear correlation between them.
Correlation is said to be non-linear if the amount of change in one variable does not bear a
constant ratio to the amount of change in the other related variable. The emphasis of this text
is linear correlation.

Methods of Studying Correlation

The commonly used methods for studying the correlation between two variables are:

1. Graphical Method
a) Scatter diagram
b) Simple graph

2. Mathematical Method
Karl Pearson’s coefficient of correlation

1. a) Scatter Diagram
This is the simplest way of studying correlation between the two distributions, by plotting the
values on a chart known as scatter diagram. In this method, the given data are plotted on a
graph paper in the form of dots. X variables are plotted on the horizontal axis and y variables
on the vertical axis. Thus, we have the dots and we can know the scatter of the various points;
and this will show the type of correlation.
The following diagrams illustrate the degree and direction of relationship:

[Figure: three scatter diagrams. Diagram 1: positive correlation; Diagram 2: negative
correlation; Diagram 3: no correlation.]

Diagram 1 indicates positive correlation, as it shows that the values of the two variables
move in the same direction.
Diagram 2 indicates negative correlation, as the values of the two variables move in the
reverse direction.
Diagram 3 indicates no correlation.

Simple Correlation Coefficient (r)


It is also called Pearson's correlation or product moment correlation coefficient. It measures
the nature and strength of linear relationship between two variables of the quantitative type.

Assumptions Testing
Correlation analysis has the following underlying assumptions:
• Related Pairs– the data should be collected from related pairs: i.e. if you obtain a score on
an X variable, there must be a score on the Y variable from the same subject.
• Scale of Measurement– data should be interval or ratio in nature.
• Normality– the scores for each variable should be normally distributed. For large data sets,
this assumption may be relaxed.
• Linearity– the relationship between the two variables must be linear. You should first use a
scatter plot to establish whether the data indicate a linear relationship.
• Homogeneity of Variance– the variability in scores for one variable is roughly the same at
all values of the other variable; i.e. it is concerned with how uniformly the scores cluster
about the regression line. That is, the spread of the Y scores should be roughly the same at
every value of X.

HOW TO COMPUTE THE SIMPLE CORRELATION

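To determine the numerical value of the coefficient of correlation, the following formula is
used:

$$r_{x,y} = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Equivalently,

$$r_{x,y} = \frac{\mathrm{Cov}(x,y)}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Where;

$$\mathrm{Cov}(x,y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$

and $\sigma_x$ and $\sigma_y$ are the standard deviations of the X and Y variables
respectively.

Note:

$$\mathrm{Cov}(x,x) = \frac{\sum (X - \bar{X})(X - \bar{X})}{n - 1} = \sigma_x^2 \quad \text{and} \quad \mathrm{Cov}(y,y) = \frac{\sum (Y - \bar{Y})(Y - \bar{Y})}{n - 1} = \sigma_y^2$$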
The sign of r denotes the nature of the association, while the magnitude of r denotes the
strength of the association.
If the sign is positive, the relation is direct (an increase in one variable is
associated with an increase in the other variable and a decrease in one variable is
associated with a decrease in the other variable).
If the sign is negative, the relation is inverse or indirect (an increase in one variable
is associated with a decrease in the other).

Example 1
Yyome, an economic analyst wanted to find the relationship between inflation rate and prime
lending rate. He, therefore, collected data on inflation rate and lending rate over a seven-year
period. The data below represent the inflation rate (x) and prime lending rate (y) over the
seven-year period.

X 3.3 6.2 11.0 9.1 5.8 6.5 7.6


Y 5.2 8.0 10.8 7.9 6.8 6.9 9.0

Compute the product moment correlation coefficient and comment on the results.

Solution 1
        X       Y       xy        x²        y²
        3.3     5.2     17.16     10.89     27.04
        6.2     8.0     49.60     38.44     64.00
       11.0    10.8    118.80    121.00    116.64
        9.1     7.9     71.89     82.81     62.41
        5.8     6.8     39.44     33.64     46.24
        6.5     6.9     44.85     42.25     47.61
        7.6     9.0     68.40     57.76     81.00
Sums   49.5    54.6    410.14    386.79    444.94

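Here, $n = 7$, $\sum x = 49.5$, $\sum y = 54.6$, $\sum xy = 410.14$, $\sum x^2 = 386.79$ and
$\sum y^2 = 444.94$.

Substituting these values into the formula gives

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

$$r = \frac{7(410.14) - (49.5)(54.6)}{\sqrt{\left[7(386.79) - (49.5)^2\right]\left[7(444.94) - (54.6)^2\right]}}$$

$$r = \frac{2870.98 - 2702.70}{\sqrt{(2707.53 - 2450.25)(3114.58 - 2981.16)}}$$

$$r = \frac{168.28}{\sqrt{257.28 \times 133.42}} = \frac{168.28}{185.27} = 0.91$$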

A correlation coefficient of 0.91 shows a very strong positive correlation between inflation
rate (x) and prime lending rate (y).
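
The computational formula used in Example 1 can be checked with a short Python sketch. The function name `pearson_r` and the variable names are illustrative choices, not part of the original text; the data are the inflation and lending rates from the example.

```python
import math

def pearson_r(x, y):
    """Product moment correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Example 1 data: inflation rate (x) and prime lending rate (y)
inflation = [3.3, 6.2, 11.0, 9.1, 5.8, 6.5, 7.6]
lending = [5.2, 8.0, 10.8, 7.9, 6.8, 6.9, 9.0]

r = pearson_r(inflation, lending)
print(round(r, 2))  # 0.91
```

The same function applies unchanged to any other pair of equal-length numeric lists, such as the inspection expenditure and defective parts data in Example 2.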

Example 2
The managers of a company with ten operating plants of similar size producing small
components have observed the following pattern of expenditure on inspection and defective
parts delivered to the customer:

Observation No. 1 2 3 4 5 6 7 8 9 10
Inspection Expenditure/1000 Units 25 30 15 75 40 65 45 24 35 70
Defective parts/1000 Units delivered 50 35 60 15 46 20 28 45 42 22

They are wondering how strong the relationship is between inspection expenditure and the
number of faulty items delivered.
Calculate the product moment correlation coefficient and comment on your results.

Solution

        X      Y      xy       x²       y²
        25     50     1250     625      2500
        30     35     1050     900      1225
        15     60     900      225      3600
        75     15     1125     5625     225
        40     46     1840     1600     2116
        65     20     1300     4225     400
        45     28     1260     2025     784
        24     45     1080     576      2025
        35     42     1470     1225     1764
        70     22     1540     4900     484
Sums    424    363    12815    21926    15123
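
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

$$r = \frac{10(12815) - (424)(363)}{\sqrt{\left[10(21926) - (424)^2\right]\left[10(15123) - (363)^2\right]}}$$

$$r = \frac{128150 - 153912}{\sqrt{(219260 - 179776)(151230 - 131769)}}$$

$$r = \frac{-25762}{\sqrt{768398124}} = \frac{-25762}{27720} = -0.93$$

A correlation coefficient of -0.93 indicates a very strong negative association between
expenditure on inspection and defective parts delivered.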

APPLICATIONS IN BUSINESS
Review
Correlation is a statistical measure of the relationship between two series of numbers
representing data.
Positively Correlated items move in the same direction.
Negatively Correlated items move in opposite directions.
Correlation Coefficient is a measure of the degree of correlation between two series of
numbers representing data.

APPLICATIONS IN FINANCE – DIVERSIFICATION


Correlation and variance have many applications in business. In this text, we look at their
application in investment, specifically in portfolio risk reduction, or diversification.

Keynotes
To reduce overall risk in a portfolio, it is best to combine assets that have a negative (or
low-positive) correlation.
Uncorrelated assets reduce risk somewhat, but not as effectively as combining negatively
correlated assets.
Investing in different investments with high positive correlation will not provide
sufficient diversification.

Consider a portfolio of three assets A, B and C with returns a, b and c respectively. Assuming
that they are equally weighted, the covariances of returns for all possible pairs of assets can
be presented in the covariance matrix

$$\begin{pmatrix} \mathrm{cov}(a,a) & \mathrm{cov}(a,b) & \mathrm{cov}(a,c) \\ \mathrm{cov}(b,a) & \mathrm{cov}(b,b) & \mathrm{cov}(b,c) \\ \mathrm{cov}(c,a) & \mathrm{cov}(c,b) & \mathrm{cov}(c,c) \end{pmatrix}$$

The diagonal entries cov(a,a), cov(b,b) and cov(c,c) give the variances of a, b and c, and
therefore

$$SD_a = \sqrt{\mathrm{cov}(a,a)}, \quad SD_b = \sqrt{\mathrm{cov}(b,b)}, \quad SD_c = \sqrt{\mathrm{cov}(c,c)}$$

Example
Miss Ewurama Gyamfuaa wants to invest part of her student grant and she is considering any
two of the following investment opportunities: A, B, C and D. The covariance matrix of the
historic returns of these investment opportunities is:

[Covariance matrix with rows and columns A, B, C, D; its entries are not reproduced in the
source text.]

Her interest is to diversify the investment to minimize the risk. Compute the correlation
matrix of these investments and advise Ewurama on the best combination.

Solution

$$\mathrm{Corr}(A,B) = \frac{\mathrm{Cov}(A,B)}{SD_A \cdot SD_B} = \frac{\mathrm{Cov}(A,B)}{\sqrt{\mathrm{Cov}(A,A)\,\mathrm{Cov}(B,B)}}$$

Applying this formula to each pair of investments gives:

Corr(A,B) = -0.233      Corr(A,C) = 0.9836
Corr(A,D) = 0.1895      Corr(B,C) = -0.26722
Corr(B,D) = -0.19791    Corr(C,D) = 0.204696

The correlation matrix is therefore:

        A          B          C          D
A       1         -0.2330     0.9836     0.1895
B      -0.2330     1         -0.2672    -0.1979
C       0.9836    -0.2672     1          0.2047
D       0.1895    -0.1979     0.2047     1

Miss Ewurama can minimize risk by combining the two investments with the most negative
correlation. Hence, she should invest in B and C, whose correlation (-0.2672) is the lowest
in the matrix.
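
The conversion from a covariance matrix to a correlation matrix can be sketched in Python as follows. Because the covariance matrix for this example is not available in the text, the matrix below is purely hypothetical, as are the asset labels P, Q, R and the function names; only the formula Corr = Cov / (SD × SD) comes from the text.

```python
import math

def correlation_matrix(cov):
    """Convert a covariance matrix (list of lists) into a correlation matrix."""
    size = len(cov)
    # Standard deviations are the square roots of the diagonal entries.
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)] for i in range(size)]

def most_negative_pair(corr, names):
    """Return (correlation, name_1, name_2) for the pair with the lowest correlation."""
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if best is None or corr[i][j] < best[0]:
                best = (corr[i][j], names[i], names[j])
    return best

# Hypothetical covariance matrix for three assets (NOT the data from the example)
names = ["P", "Q", "R"]
cov = [
    [4.0, -1.0, 3.0],
    [-1.0, 9.0, -3.0],
    [3.0, -3.0, 16.0],
]

corr = correlation_matrix(cov)
best = most_negative_pair(corr, names)
print(best[1], best[2], round(best[0], 4))  # Q R -0.25
```

Choosing the pair with the lowest correlation mirrors the advice in the Keynotes: the more negative the correlation between two assets, the more one tends to offset the other.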

RANK CORRELATION COEFFICIENT


Rank correlation is used under the following circumstances:
(a) When the underlying relationship between the two variables is not necessarily linear.
(b) When one or both of the variables involved is non-numeric, but can be ranked.

This coefficient is also known as the Spearman rank correlation coefficient. It is an
alternative method of measuring correlation and is based on the ranks of the item values.

The following steps can be followed to compute the rank correlation:

STEP 1 Rank the x values (to get the $R_x$ values)
STEP 2 Rank the y values (to get the $R_y$ values)
STEP 3 Find the differences between corresponding ranks and square them
[i.e. $(R_x - R_y)^2$ or $d^2$, where $d = R_{x_i} - R_{y_i}$]
STEP 4 Use the formula

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

to compute the Spearman rank correlation coefficient.

Example 3

A group of 8 business students were tested in Quantitative Methods and Cost Accounting.
Their rankings in the two tests were:

Student A B C D E F G H
Quantitative Methods (Ranking) 2 7 6 1 4 3 5 8
Cost Accounting (Ranking) 3 6 4 2 5 1 8 7

Calculate Spearman’s rank correlation coefficient for the two sets of ranks and comment
on the results.

Solution

Student    QM Ranking (Rx)    C.A. Ranking (Ry)    d = Rx - Ry    d²
A          2                  3                    -1             1
B          7                  6                     1             1
C          6                  4                     2             4
D          1                  2                    -1             1
E          4                  5                    -1             1
F          3                  1                     2             4
G          5                  8                    -3             9
H          8                  7                     1             1
                                                   Σd² =          22

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(22)}{8(8^2 - 1)} = 1 - \frac{132}{504} = 0.74$$
The rank correlation coefficient of 0.74 shows a strong positive relation between students’
performances in the two tests.
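
When the ranks are already given and there are no ties, the four steps reduce to a few lines of Python. This is a minimal sketch; the function name `spearman_from_ranks` is an illustrative choice, and the data are the rankings from Example 3.

```python
def spearman_from_ranks(rx, ry):
    """Spearman rank correlation from two lists of ranks (no tie correction)."""
    n = len(rx)
    # Step 3: squared differences between corresponding ranks
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Step 4: r = 1 - 6 * sum(d^2) / (n (n^2 - 1))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Example 3: rankings of students A to H
qm_ranks = [2, 7, 6, 1, 4, 3, 5, 8]
ca_ranks = [3, 6, 4, 2, 5, 1, 8, 7]

rho = spearman_from_ranks(qm_ranks, ca_ranks)
print(round(rho, 2))  # 0.74
```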

Example 4

A national consumer protection society investigated seven brands of paint to determine their
quality relative to price. The society’s conclusions were ranked according to the following
table:

Brand A B C D E F G
Price/Litre (x) 192 158 135 160 205 139 177
Quality ranking (Ry) 2 6 7 4 3 5 1

Using Spearman’s rank correlation coefficient, determine whether the consumer generally
gets value for money.

Solution 4
Ranking for quality has already been done. Therefore price/litre must be ranked (highest
price = rank 1) so that we can use Spearman’s formula.

Brand    Rx    Ry    d = Rx - Ry    d²
A        2     2      0             0
B        5     6     -1             1
C        7     7      0             0
D        4     4      0             0
E        1     3     -2             4
F        6     5      1             1
G        3     1      2             4
                     Σd² =          10

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(10)}{7(7^2 - 1)} = 1 - \frac{60}{336} = 0.821$$

A coefficient of 0.82 shows a high degree of positive correlation which means that in general
the consumer gets value for money.

TIED RANKINGS

If one or more groups of data items have the same value (known as tied values), the ranks that
would have been allocated separately must be averaged and this average rank given to each
item with this equal value. For example, the five numbers 8, 14, 14, 19, 21 would be
allocated ranks 1, 2.5, 2.5, 4, 5 respectively (since two items have value 14, each must be
allocated the average of ranks 2 and 3).

Example 5
The Department of Public Health under the auspices of the Ministry of Health investigated
the age, weight and diastolic blood pressure of nine women, with the following results:

Age (Years) 69 33 27 45 58 24 51 35 21
Weight (Kg) 64 70 60 102 75 67 76 55 67
Blood Pressure 85 85 70 85 75 85 80 60 55
(mm of mercury)

Calculate the Spearman’s rank correlation coefficient between:

(i) age and blood pressure
(ii) weight and blood pressure and interpret your results (C. A. Nov. 1997)

Solutions
Rank all the variables (highest value = rank 1): Rx for rank of Age; Ry for rank of Weight;
and Rz for rank of Blood pressure.

Age    Weight    Pressure    Age & Blood Pressure      Weight & Blood Pressure
Rx     Ry        Rz          d = Rx - Rz     d²        d = Ry - Rz     d²
1      7         2.5         -1.5            2.25       4.5            20.25
6      4         2.5          3.5            12.25      1.5            2.25
7      8         7            0              0          1.0            1.0
4      1         2.5          1.5            2.25      -1.5            2.25
2      3         6           -4.0            16.0      -3.0            9.0
8      5.5       2.5          5.5            30.25      3.0            9.0
3      2         5           -2.0            4.0       -3.0            9.0
5      9         8           -3.0            9.0        1.0            1.0
9      5.5       9            0              0         -3.5            12.25
                             Σd² =           76.0      Σd² =           66.0

(i) Rank correlation between Age and Blood pressure is given by

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(76)}{9(9^2 - 1)} = 1 - \frac{456}{720} = 0.37$$

(ii) Rank correlation between Weight and Blood pressure is

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(66)}{9(9^2 - 1)} = 1 - \frac{396}{720} = 0.45$$
The rank correlation coefficient of 0.37 in (i) shows a moderate positive relation between age
and blood pressure.

The correlation coefficient of 0.45 in (ii) shows a moderate positive relation between weight
and blood pressure.

It can be said from the results of (i) & (ii) that blood pressure rises as age and weight
increase.
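
The averaging of tied ranks can be automated. The sketch below ranks from the highest value (rank 1) downwards, as in the solution to Example 5, and then applies the uncorrected Spearman formula. The function names are illustrative assumptions, not part of the original text.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; tied values share the mean rank."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1            # first rank position occupied by v
        count = ordered.count(v)                # how many items are tied at v
        rank_of[v] = first + (count - 1) / 2    # mean of positions first..first+count-1
    return [rank_of[v] for v in values]

def spearman(x, y):
    """Spearman rank correlation using average ranks and the uncorrected formula."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Example 5 data
age = [69, 33, 27, 45, 58, 24, 51, 35, 21]
weight = [64, 70, 60, 102, 75, 67, 76, 55, 67]
pressure = [85, 85, 70, 85, 75, 85, 80, 60, 55]

print(average_ranks(pressure))               # [2.5, 2.5, 7.0, 2.5, 6.0, 2.5, 5.0, 8.0, 9.0]
print(round(spearman(age, pressure), 2))     # 0.37
print(round(spearman(weight, pressure), 2))  # 0.45
```

Ranking both variables in the same direction (both descending or both ascending) gives the same coefficient, since only the rank differences enter the formula.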

CORRECTION OF TIED RANKS
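From a practical viewpoint it is often not worth correcting for ties. Use of the correction is
advised if
i) Three or more observations are tied at the same value
ii) The number of pairs of ties is more than ¼ of the number of observations.

The corrected formula is

$$r = 1 - \frac{6\left[\sum d^2 + \sum_{i=1}^{j} \frac{m_i\left(m_i^2 - 1\right)}{12}\right]}{N\left(N^2 - 1\right)}$$

where $m_i$ is the number of equal observations sharing a common rank in the i-th group of
ties and j is the total number of such groups.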

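Example: the table below presents the price (P) and quantity demanded (Q) of a commodity.
Calculate the coefficient of rank correlation between P and Q.
P 80 78 75 75 68 57 60 59
Q 110 111 114 114 114 116 115 117

Ranking each variable from highest to lowest, the two values of 75 in P share rank 3.5
($m_1 = 2$) and the three values of 114 in Q share rank 5 ($m_2 = 3$). The squared rank
differences then sum to $\sum d^2 = 159.50$.

$$r = 1 - \frac{6\left[\sum d^2 + \frac{m_1\left(m_1^2 - 1\right)}{12} + \frac{m_2\left(m_2^2 - 1\right)}{12}\right]}{N\left(N^2 - 1\right)}$$

$$r = 1 - \frac{6\left[159.50 + \frac{2\left(2^2 - 1\right)}{12} + \frac{3\left(3^2 - 1\right)}{12}\right]}{8\left(8^2 - 1\right)}$$

$$r = 1 - \frac{6\left[159.50 + 0.5 + 2.0\right]}{8(64 - 1)}$$

$$r = 1 - \frac{6(162)}{504}$$

$$r = 1 - 1.929 = -0.929$$

r = -0.929 indicates a high degree of negative correlation between P and Q.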
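
The tie-corrected formula can be sketched in Python as below. The helper names (`average_ranks`, `tie_correction`, `spearman_tie_corrected`) are illustrative assumptions; the correction term m(m² − 1)/12 per group of ties is the one given in the text, and the data are the price and quantity figures from the example.

```python
from collections import Counter

def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; tied values share the mean rank."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1
        count = ordered.count(v)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every group of m tied values (m > 1)."""
    return sum(m * (m ** 2 - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_tie_corrected(x, y):
    """Spearman rank correlation with the correction term added to sum(d^2)."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

price = [80, 78, 75, 75, 68, 57, 60, 59]
quantity = [110, 111, 114, 114, 114, 116, 115, 117]

r_pq = spearman_tie_corrected(price, quantity)
print(round(r_pq, 3))  # -0.929
```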

MERITS OF RANK CORRELATION METHOD


1. Simple and easily understandable.
2. Useful when precise measurements of the variables are not given or cannot be obtained,
that is, when the factors under study are qualitative in nature.
3. It is applicable to irregular data, as it does not assume that the data are normally
distributed.

DEMERITS OF RANK CORRELATION METHOD
1. It can be applied to ungrouped data only.
2. The ranking procedure ignores the actual magnitude of the data and as such the results
obtained are only approximate.
3. Computation becomes difficult as the number of paired observations increases.

CONCURRENT DEVIATION METHOD


This is the simplest method of studying correlation between two variables. It gives a general
idea of the direction of covariation between two variables. The coefficient of correlation by
the concurrent deviation method is obtained by the formula:

$$r_c = +\sqrt{\frac{2C - n}{n}}, \quad \text{if } 2C - n \ge 0$$

$$r_c = -\sqrt{-\frac{2C - n}{n}}, \quad \text{if } 2C - n < 0$$

Where:
C = the number of positive concurrent deviations
n = the number of pairs of deviations compared
N = the number of pairs of observations; n is one less than N (n = N - 1)

STEPS
(1) Determine the deviations of the paired series (Dx and Dy). Dx and Dy are determined by
comparing each value in the series with the preceding one. If the value is greater than the
preceding one, the deviation is taken as positive (+); if it is smaller, negative (-); and
if it is equal to the preceding one, the deviation is zero (0).
(2) Multiply the corresponding deviations (Dx × Dy) to get the concurrent deviations.
(3) Count the number of positive concurrent deviations, C.
(4) Find n = N - 1.
(5) If 2C - n is positive, use $r_c = +\sqrt{\frac{2C - n}{n}}$; if 2C - n is negative, use
$r_c = -\sqrt{-\frac{2C - n}{n}}$.

Example
The data below relate to prices and imports of Ayedwe Ltd. Determine the correlation
coefficient using the concurrent deviation method.
Price 368 384 385 361 347 384 395 403 400 385
Imports 22 21 24 20 22 26 24 29 28 27

Solution
Price (X)    Deviation Dx    Imports (Y)    Deviation Dy    Concurrent Deviation (Dx × Dy)
368                          22
384          +               21             -               -
385          +               24             +               +
361          -               20             -               +
347          -               22             +               -
384          +               26             +               +
395          +               24             -               -
403          +               29             +               +
400          -               28             -               +
385          -               27             -               +
                                                            C = 6

n = N – 1 = 10 – 1 = 9

2C – n = 2(6) – 9 = 3, which is positive, so we use

$$r_c = +\sqrt{\frac{2C - n}{n}}$$

$$r_c = \sqrt{\frac{2(6) - 9}{9}} = \sqrt{0.333} = 0.577$$
Therefore, there is a moderate degree of positive correlation between prices and import.
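
The five steps of the concurrent deviation method translate directly into code. The following is a minimal sketch; the function names are illustrative, and the import figures are taken from the worked solution table (where the eighth value is 29).

```python
import math

def sign(value):
    """Return +1, -1 or 0 according to the sign of value."""
    return (value > 0) - (value < 0)

def concurrent_deviation(x, y):
    """Coefficient of correlation by the concurrent deviation method."""
    # Step 1: direction of change of each value relative to the preceding one
    dx = [sign(b - a) for a, b in zip(x, x[1:])]
    dy = [sign(b - a) for a, b in zip(y, y[1:])]
    # Steps 2 and 3: count the positive products Dx * Dy
    c = sum(1 for p, q in zip(dx, dy) if p * q > 0)
    # Step 4: n is one less than the number of observations
    n = len(dx)
    # Step 5: take the root of the absolute value and restore the sign
    t = (2 * c - n) / n
    return math.copysign(math.sqrt(abs(t)), t)

price = [368, 384, 385, 361, 347, 384, 395, 403, 400, 385]
imports = [22, 21, 24, 20, 22, 26, 24, 29, 28, 27]

r_c = concurrent_deviation(price, imports)
print(round(r_c, 3))  # 0.577
```

Using `math.copysign` covers both branches of the formula in one expression: the square root is taken of the absolute value of (2C − n)/n, and the sign of that ratio is then applied to the result.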

MERITS OF CONCURRENT DEVIATION METHOD


1. It is a simple and quick way of determining the nature of correlation.
2. It is very useful when the number of items is very large.
3. It is easy to calculate and useful for the study of short-term fluctuations in a time
series.

DEMERITS OF CONCURRENT DEVIATION METHOD


1. It has a weakness of assigning the same weight to small and big changes in the value
of the series.
2. It gives only a rough indicator of correlation.

REGRESSION ANALYSIS

Regression is a technique used to describe a relationship between two variables in
mathematical terms.
The objectives of regression analysis include the following:
(i) To provide estimates of values of the dependent variable from values of the
independent variable.
(ii) To obtain measures of the error involved in using the regression line as a basis of
estimation.

USES OF REGRESSION ANALYSIS

Regression analysis is of great practical use, even more so than correlation analysis. The
following are some of its uses:

1. Regression analysis helps in establishing a functional relationship between two or
more variables. Once this is established, it can be used for various advanced analytic
purposes.
2. With the use of electronic machines and computers, the tedium of calculating
regression equations, particularly those expressing multiple and non-linear
relationships, has been reduced a great deal.
3. Since most problems of economic analysis are based on cause and effect
relationships, regression analysis is a highly valuable tool in economic and
business research.
4. Regression analysis is very useful for prediction. Once a functional
relationship is known, the value of the dependent variable can be predicted from a
given value of the independent variable.

CORRELATION AND REGRESSION

These two techniques are directed towards a common purpose of establishing the degree and
the direction of relationship between two or more variables, but the methods of doing so are
different. The choice of one or the other will depend on the purpose. In spite of certain
similarities between the two, there are some basic differences in the two approaches,
which are summarized below:

CORRELATION
1. Correlation literally means related or sympathetic movements between the variables.
2. There is a sort of interdependence, which is mutual.
3. There is no cause and effect relationship. It only shows the existence of some association
in the movement of variables.
4. It may be a spurious correlation if the sympathetic movement is on account of the
influence of an outside variable which has no relevance.
5. It is a relative measure showing association between variables.
6. It is used only for testing and verification of the relationship. It tenders only limited
information.
7. It is not very useful for further mathematical treatment.

REGRESSION
1. Regression literally means return to normal, which is true on account of the average of
relationship.
2. It establishes a functional relationship, which is mathematical, showing dependence of
one variable on the other.
3. It may have a cause and effect relationship.
4. It is a mathematical relationship, which should be interpreted suitably.
5. It is an absolute measure of relationship.
6. Besides verification it can also be used for estimation and prediction. It tenders more
comprehensive information.
7. It is very useful for further mathematical treatment.

METHODS OF OBTAINING A REGRESSION EQUATION


The technique used to develop the equation for the straight line which can be used to make
estimates or predictions is called regression analysis. The regression equation is the equation
that defines the relationship between two variables.

The process of obtaining a linear regression equation for a given set of (bivariate) data is
often referred to as fitting a regression line.

There are 3 main methods commonly used to fit a regression line to a given set of bivariate
data. These are

(a) By inspection: This is the simplest method and consists of plotting a scatter diagram
for the relevant data and then drawing in the line that most suitably fits the data.

A scatter diagram is a chart that portrays the relationship between two variables. Note
that the mean point of the data should be plotted and the regression line drawn so that it
passes through this point.

This method suffers from the defect that we cannot get a unique line. Different people would
probably draw different lines using the same data.

(b) By semi-averages: This technique consists of splitting the data into two equal groups,
plotting the mean point for each group and joining these two points with a straight
line.

(c) By method of least squares: The most generally applied curve-fitting technique in
regression analysis is the method of least squares. This method imposes the
requirement that the sum of the squares of the deviations of the observed values of the
dependent variable from the corresponding computed values on the regression line
must be a minimum. Thus, if a straight line is fitted to a set of data by the method of
least squares, it is a “best fit” in the sense that the sum of the squared deviations
$\sum (y - \hat{y})^2$ is the least compared with any other possible straight line. Another
useful characteristic of the least squares straight line is that it passes through the point
of means $(\bar{x}, \bar{y})$ and therefore makes the total of the positive and negative
deviations equal to 0.

The elementary form of a straight line $y = a + bx$ is used, where ‘a’ is a constant and
indicates the y-intercept; ‘b’ is also a constant and indicates the gradient of the line.
The values of a and b are obtained by solving these two simultaneous equations

$$\sum y = an + b\sum x \quad \text{…… (i)}$$

$$\sum xy = a\sum x + b\sum x^2 \quad \text{…… (ii)}$$

which are called the normal equations. Derived from (i) and (ii) are the following
computational formulae for finding b and a:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \quad \text{and} \quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$

Note:
The regression coefficient (b) and the constant (a) can also be computed as follows when the
correlation coefficient (r) and the standard deviations of x ($\sigma_x$) and y ($\sigma_y$)
are known:

Regression equation of y on x: $b_{yx} = r\left(\sigma_y / \sigma_x\right)$ and $a_{yx} = \bar{y} - b_{yx}\,\bar{x}$

Regression equation of x on y: $b_{xy} = r\left(\sigma_x / \sigma_y\right)$ and $a_{xy} = \bar{x} - b_{xy}\,\bar{y}$

ASSUMPTIONS UNDERLYING LINEAR REGRESSION


(i) The independent variable is measured without error. That is, the magnitude of the
measurement error in the independent variable is negligible.
(ii) For each value of x, there is a group of y values, and these y values are normally
distributed.
(iii) The standard deviations of these normal distributions are equal.
(iv) The y values are statistically independent. This means that in the selection of a
sample, the y values chosen for a particular x value do not depend on the y values for
any other x value.

FINDING THE EQUATION OF REGRESSION OF Y ON X USING THE METHOD OF LEAST SQUARES

We are going to use the method of least squares to obtain the equation of regression, since it
is the most generally accepted method and avoids subjective judgements.

Example:

A study was conducted to find the relationship between years of experience (x) and
monthly salary (y), in thousands of cedis, earned by technicians in a very large company. The
data below give the results for a sample of 12 technicians covered by the study:
X 12 16 6 23 27 8 5 19 23 13 16 8
Y 580 580 460 680 760 480 440 680 720 540 660 540

Determine the regression equation of Y on X by the method of least squares and use your
results to estimate the salary of a technician with 15 years’ experience.

Solution
Calculations needed for determining the least squares regression equation
        x      y       xy        x²      y²
        12     580     6960      144     336400
        16     580     9280      256     336400
        6      460     2760      36      211600
        23     680     15640     529     462400
        27     760     20520     729     577600
        8      480     3840      64      230400
        5      440     2200      25      193600
        19     680     12920     361     462400
        23     720     16560     529     518400
        13     540     7020      169     291600
        16     660     10560     256     435600
        8      540     4320      64      291600
Sums    176    7120    112580    3162    4348000

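Let the equation of the regression line be $y = a + bx$.

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{12(112580) - (176)(7120)}{12(3162) - (176)^2}$$

$$b = \frac{1350960 - 1253120}{37944 - 30976} = \frac{97840}{6968} = 14.04$$

$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{7120}{12} - 14.04\left(\frac{176}{12}\right) = 593.33 - 205.92 = 387.41$$

The equation is $y = 387.41 + 14.04x$.

Note: The b value of 14.04 means that for each additional year of experience the monthly
salary of a technician is expected to increase by about 14.04 thousand cedis (GH¢14,040).

When $x = 15$,

$$y = 387.41 + 14.04(15) = 387.41 + 210.6 = 598.01$$

The estimated salary of a technician with 15 years’ experience is about 598.01 thousand
cedis (GH¢598,010).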
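
The computational formulae for b and a can be applied in a few lines of Python. This is a minimal sketch; the function name `least_squares` is an illustrative choice. Because the code keeps b unrounded, the intercept comes out at about 387.39 rather than the 387.41 obtained above with b rounded to 14.04; the prediction at x = 15 still agrees to two decimal places.

```python
def least_squares(x, y):
    """Return (a, b) for the least squares line y = a + b x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Years of experience (x) and monthly salary in thousands of cedis (y)
experience = [12, 16, 6, 23, 27, 8, 5, 19, 23, 13, 16, 8]
salary = [580, 580, 460, 680, 760, 480, 440, 680, 720, 540, 660, 540]

a, b = least_squares(experience, salary)
predicted = a + b * 15
residuals = [yi - (a + b * xi) for xi, yi in zip(experience, salary)]

print(round(b, 2))               # 14.04
print(round(predicted, 2))       # 598.01
print(abs(sum(residuals)) < 1e-6)  # True: positive and negative deviations cancel
```

The last line illustrates the property stated earlier: the least squares line passes through the point of means, so the deviations about it sum to zero.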

Example 2

Apuskeleke, an economic analyst wanted to find the relationship between inflation rate and
prime lending rate. He, therefore, collected data on inflation rate and lending rate over a
seven-year period. The data below represent the inflation rate (x) and prime lending rate (y)
over the seven-year period.
x 3.3 6.2 11.0 9.1 5.8 6.5 7.6
y 5.2 8.0 10.8 7.9 6.8 6.9 9.0

Find the line of best fit for predicting the prime lending rate from the inflation rate. Use your
results to predict the prime lending rate when the inflation rate is 10.5.

Solution 2
Calculations needed to find the line of best fit
x y xy x2
3.3 5.2 17.16 10.89
6.2 8.0 49.60 38.44
11.0 10.8 118.80 121.00
9.1 7.9 71.89 82.81
5.8 6.8 39.44 33.64
6.5 6.9 44.85 42.25 Summations of
7.6 9.0 68.40 57.76 the respective
columns
49.5 54.6 410.14 386.79

Let the line of best fit be y a bx y

n xy x y 7 410.14 49.5 54.6


Where b 2 2
2
n x x 7 386.79 49.5

2870.98 2702.70
2707.53 2450.25
168.28
0.654
257.28

y x 54.6 49.5
And a b 0.654 7.80 4.62 3.18
n n 7 7

y 3.18 0.654x

When $x = 10.5$,

$$y = 3.18 + 0.654(10.5) = 3.18 + 6.867 = 10.047 \approx 10.05$$

The predicted prime lending rate at an inflation rate of 10.5 is 10.05.
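
The same line can also be obtained from deviations about the means, using $b = S_{xy}/S_{xx}$ with $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$, which is algebraically equivalent to the formula used above. The sketch below illustrates this in Python; the function name `fit_by_deviations` is hypothetical and the data are the seven pairs given in the example.

```python
def fit_by_deviations(xs, ys):
    """Return (a, b) for y = a + b*x using deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of deviations
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


inflation = [3.3, 6.2, 11.0, 9.1, 5.8, 6.5, 7.6]   # x
lending = [5.2, 8.0, 10.8, 7.9, 6.8, 6.9, 9.0]     # y

a, b = fit_by_deviations(inflation, lending)
predicted = a + b * 10.5
print(f"y = {a:.2f} + {b:.3f}x; predicted rate at x = 10.5: {predicted:.2f}")
```

The script carries full precision throughout, so its prediction can differ from the hand-rounded figure in the second decimal place.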

COEFFICIENT OF DETERMINATION

The coefficient of determination is the proportion of the variability in one measure that is
explained by the other measure. It is the square of the correlation coefficient ($r^2$).

Example: If the correlation coefficient between x and y is 0.8, the coefficient of determination
is 0.64. This implies that 64% of the variation in y is explained by the variation in x, and
the remaining 36% is explained by other factors. The quantity $1 - r^2$ is referred to as the
coefficient of non-determination. The square root of the coefficient of non-determination is
known as the coefficient of alienation.
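
As a quick illustration of the three quantities just defined, the sketch below computes them for the example value r = 0.8. It is written in Python, and the function name `determination_summary` is a hypothetical helper introduced only for this illustration.

```python
import math


def determination_summary(r):
    """Return (r^2, 1 - r^2, sqrt(1 - r^2)) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    determination = r ** 2                      # share of variation explained
    non_determination = 1.0 - determination     # share left unexplained
    alienation = math.sqrt(non_determination)   # square root of the unexplained share
    return determination, non_determination, alienation


det, non_det, alien = determination_summary(0.8)
print(f"r^2 = {det:.2f}, 1 - r^2 = {non_det:.2f}, alienation = {alien:.2f}")
```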

Properties of the Regression Coefficients


Recall
Regression coefficient of y on x: $b_{yx} = r\,(\sigma_y / \sigma_x)$

Regression coefficient of x on y: $b_{xy} = r\,(\sigma_x / \sigma_y)$

We can write $b_{yx}\, b_{xy} = r(\sigma_y / \sigma_x) \cdot r(\sigma_x / \sigma_y) = r^2$ (R square).

Implications

1. The coefficient of correlation is the geometric mean of the two regression coefficients:

$$r = \pm\sqrt{b_{yx} \times b_{xy}}$$

where r takes the sign common to the two regression coefficients.

2. If $b_{yx}$ is positive then $b_{xy}$ is also positive, and vice versa.
3. If one regression coefficient is greater than one, the other must be less than one.
4. The coefficient of correlation has the same sign as the regression coefficients.
5. The arithmetic mean of $b_{yx}$ and $b_{xy}$ is equal to or greater than the coefficient of correlation:

$$\frac{b_{yx} + b_{xy}}{2} \ge r$$

6. Regression coefficients are independent of origin but not of scale.
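
These properties can be checked numerically. The sketch below, in Python, uses the column totals from the technician salary example earlier in this chapter (n = 12, Σx = 176, Σy = 7120, Σxy = 112580, Σx² = 3162, Σy² = 4348000); the variable names are illustrative assumptions, not notation from the text.

```python
import math

# Column totals from the technician salary example
n = 12
sum_x, sum_y = 176, 7120
sum_xy, sum_x2, sum_y2 = 112580, 3162, 4348000

s_xy = n * sum_xy - sum_x * sum_y
s_xx = n * sum_x2 - sum_x ** 2
s_yy = n * sum_y2 - sum_y ** 2

b_yx = s_xy / s_xx                    # regression coefficient of y on x
b_xy = s_xy / s_yy                    # regression coefficient of x on y
r = s_xy / math.sqrt(s_xx * s_yy)     # Pearson correlation coefficient

# Property 1: product of the two regression coefficients equals r squared
product_equals_r_squared = math.isclose(b_yx * b_xy, r ** 2)
# Properties 2 and 4: both coefficients and r share the same sign
same_sign = (b_yx > 0) == (b_xy > 0) and (b_xy > 0) == (r > 0)
# Property 3: one coefficient exceeds one, so the other is below one
one_below_one = b_yx > 1 and b_xy < 1
# Property 5: arithmetic mean of the coefficients is at least |r|
mean_at_least_r = (b_yx + b_xy) / 2 >= abs(r)

print(product_equals_r_squared, same_sign, one_below_one, mean_at_least_r)
```

Property 6 is not checked here; it could be explored by adding a constant to every x (the slope is unchanged) or multiplying every x by a constant (the slope is rescaled).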

APPLICATIONS

Illustration 1
Prices of inputs for general goods and services are shifting upward as a result of the
general increase in the prices of petroleum products in 2015, and this has resulted in an
increase in the unit prices of goods and services.

Trustee, the operator of Koko Burger in Cape Coast Metro, has contracted AugBlay Advisory
Services to advise him on expected changes in demand and revenue. AugBlay Services has
ascertained a very high negative correlation of -0.95 between the price and demand of Koko
Burger. Weekly demand and the associated prices were sampled for a given period, and
descriptive statistics of the sample showed standard deviations of demand and price of 165.23
and 8.32 respectively. Also, demand and price averaged 403.22 and 15.63 respectively.

As the Financial Manager of AugBlay Advisory Services,

(a) determine the regression equation of Trustee’s Koko Burger operations.


(b) estimate the expected decrease in demand if current price of GHC22 is expected to
increase by 12%.
(c) What is the expected percentage drop in revenue?

SOLUTION

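(a) The regression equation relating quantity sold to price can be stated as

$$Q = a + bP$$

where $Q$ is the quantity sold, $P$ is the unit price, $b$ is the regression coefficient/slope and
$a$ is the regression intercept.
Recall that

$$b_{QP} = r\,\frac{\sigma_Q}{\sigma_P}$$

where $r$ is the correlation coefficient between $Q$ and $P$, and $\sigma_Q$ and $\sigma_P$ are
respectively the standard deviations of $Q$ and $P$.

Note that $r = -0.95$, $\sigma_Q = 165.23$ and $\sigma_P = 8.32$, so

$$b = r\,(\sigma_Q / \sigma_P) = -0.95\left(\frac{165.23}{8.32}\right) = -18.87$$

We can also recall that

$$a_{QP} = \bar{Q} - b\bar{P}$$

where $\bar{Q}$ and $\bar{P}$ are the means of $Q$ and $P$ respectively.
Given that $\bar{Q} = 403.22$, $\bar{P} = 15.63$ and $b = -18.87$,

$$a = \bar{Q} - b\bar{P} = 403.22 - (-18.87)(15.63) = 698.16$$

The resultant regression equation becomes:

$$Q = 698.16 - 18.87P$$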

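(b) At $P$ = GHC22, using the regression equation $Q = 698.16 - 18.87P$:

$$Q = 698.16 - 18.87(22) = 283.02$$

The expected price is $P_1 = 1.12P = 1.12(22) = 24.64$, because prices are expected to increase by 12%, so

$$Q_1 = 698.16 - 18.87(24.64) = 233.20$$

The expected decrease in demand is

$$\Delta Q = Q - Q_1 = 283.02 - 233.20 = 49.82 \approx 50$$

(c) The expected percentage drop in revenue ($EPD$) is

$$EPD = \frac{PQ - P_1Q_1}{PQ} \times 100 = \frac{22(283) - 24.64(233)}{22(283)} \times 100 = 7.79\%$$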
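
The steps of this illustration can be traced in a few lines of code. The sketch below is in Python and mirrors the rounding used in the hand solution (slope and intercept to two decimal places, quantities to whole units in the revenue step); the variable and function names are assumptions made for the illustration.

```python
# Summary statistics given in the illustration
r = -0.95
sd_q, sd_p = 165.23, 8.32        # standard deviations of demand and price
mean_q, mean_p = 403.22, 15.63   # means of demand and price

# Slope and intercept, rounded to two decimals as in the hand solution
b = round(r * sd_q / sd_p, 2)
a = round(mean_q - b * mean_p, 2)


def demand(price):
    """Predicted weekly demand from the fitted line Q = a + b*P."""
    return a + b * price


p0 = 22.0            # current price (GHC)
p1 = 1.12 * p0       # price after a 12% increase
q0 = demand(p0)
q1 = demand(p1)
drop_in_demand = q0 - q1

# Revenue comparison using quantities rounded to whole units
revenue_before = p0 * round(q0)
revenue_after = p1 * round(q1)
pct_drop = (revenue_before - revenue_after) / revenue_before * 100

print(f"Q = {a} + ({b})P")
print(f"demand falls by about {drop_in_demand:.2f} units")
print(f"revenue falls by about {pct_drop:.2f}%")
```

If the quantities are left unrounded in the revenue step, the percentage drop comes out slightly lower than the hand-computed figure, so the rounding convention matters when comparing answers.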

Illustration 2
Apraku, an importer of stationery materials, is considering cutting down the quantity of goods
he imports to reduce the associated cost of the quarterly imports. The table below shows the
quarterly imports and importation costs over the past five years.

Year Quarter Import Volume (Tonnes) Importation Cost (GHC)
2009 1 500 1000
2 511 1020
3 658 1300
4 450 1200
2010 1 560 1560
2 750 1580
3 800 1700
4 510 1900
2011 1 580 2000
2 590 2100
3 600 2300
4 450 1500
2012 1 600 1800
2 680 1566
3 780 1800
4 590 1890
2013 1 456 1500
2 560 2300
3 800 2400
4 600 2300

a. Is there evidence to support Apraku’s view?


b. Establish the cause and effect relationship between import volume and cost; explain the
extent to which the import volumes explain importation cost.

c. Estimate the regression equation of import volumes on importation cost and interpret your
results.
d. Determine the coefficients of non-determination and alienation and interpret them.
e. Given that the relationship between import volume (V) and importation cost (C) is such
that $C = \alpha V^{\beta}$,
i. find $\alpha$ and $\beta$; and
ii. find the value of $C$ when $V = 1000$.

Solution
Apraku’s consideration of cutting down the quantity of goods he imports to reduce the
associated cost of the quarterly imports indicates that he suspects some association
between import volumes and importation cost. This may be validated by
examining the correlation between import volumes and importation cost. We will therefore
calculate the correlation coefficient between import volumes and importation cost.
Let:
X = Import Volume
Y = Importation Cost

Year Quarter X (Tonnes) Y (GHC) XY X² Y²
2009 1 500 1000 500000 250000 1000000
2 511 1020 521220 261121 1040400
3 658 1300 855400 432964 1690000
4 450 1200 540000 202500 1440000
2010 1 560 1560 873600 313600 2433600
2 750 1580 1185000 562500 2496400
3 800 1700 1360000 640000 2890000
4 510 1900 969000 260100 3610000
2011 1 580 2000 1160000 336400 4000000
2 590 2100 1239000 348100 4410000
3 600 2300 1380000 360000 5290000
4 450 1500 675000 202500 2250000
2012 1 600 1800 1080000 360000 3240000
2 680 1566 1064880 462400 2452356
3 780 1800 1404000 608400 3240000
4 590 1890 1115100 348100 3572100
2013 1 456 1500 684000 207936 2250000
2 560 2300 1288000 313600 5290000
3 800 2400 1920000 640000 5760000
4 600 2300 1380000 360000 5290000
TOTAL 12025 34716 21194200 7470221 63644856

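(a) Correlation coefficient:

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

$$= \frac{20(21194200) - (12025)(34716)}{\sqrt{\left[20(7470221) - 12025^2\right]\left[20(63644856) - 34716^2\right]}}$$

$$= \frac{6424100}{\sqrt{(4803795)(67696464)}} \approx \frac{6424100}{18033300} = 0.356$$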

A correlation coefficient of 0.36 shows a low positive correlation between import volume and
importation costs. This is enough evidence to support Apraku’s view.
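
The correlation coefficient can be recomputed directly from the twenty quarterly observations. The following is a minimal sketch in Python; the function name `pearson_r` is a hypothetical helper, and the two lists reproduce the import volume and importation cost columns of the table in quarter order.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Import volume (tonnes) and importation cost (GHC), 2009 Q1 to 2013 Q4
volume = [500, 511, 658, 450, 560, 750, 800, 510, 580, 590,
          600, 450, 600, 680, 780, 590, 456, 560, 800, 600]
cost = [1000, 1020, 1300, 1200, 1560, 1580, 1700, 1900, 2000, 2100,
        2300, 1500, 1800, 1566, 1800, 1890, 1500, 2300, 2400, 2300]

r = pearson_r(volume, cost)
print(f"r = {r:.3f}")
```

Checking that `sum(volume)` and `sum(cost)` match the column totals in the table is a simple way to catch data-entry slips before interpreting r.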

(b) The regression equation of importation costs ($Y$) on import volumes ($X$) could thus be
stated as:
$Y = a + bX$, where $a$ = regression intercept and
$b$ = regression coefficient/slope.
We can recall that

$$b_{YX} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} \quad \text{and} \quad a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$

$$b = \frac{20(21194200) - (12025)(34716)}{20(7470221) - 12025^2} = \frac{6424100}{4803795} = 1.3373$$

$$a = \frac{34716}{20} - 1.3373\left(\frac{12025}{20}\right) = 931.75$$

The resultant regression equation could therefore be defined as:

$$Y = 931.75 + 1.3373X$$

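(c) Similarly, the regression equation of import volumes ($X$) on importation costs ($Y$) could
be stated as: $X = a + bY$.
Given that $r = \sqrt{b_{xy}\, b_{yx}}$, and taking $r = 0.3562$ (to four decimal places),

$$b_{xy} = \frac{r^2}{b_{yx}} = \frac{0.3562^2}{1.3373} = 0.0949$$

Similarly,

$$a = \frac{\sum X}{n} - b\,\frac{\sum Y}{n} = \frac{12025}{20} - 0.0949\left(\frac{34716}{20}\right) = 436.523$$

The regression equation for import volume on importation cost could thus be defined as:

$$X = 436.523 + 0.0949Y$$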
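
Both regression lines can be reproduced from the column totals alone. The sketch below, in Python, computes each slope directly from the sums rather than through $r^2 / b_{yx}$, which avoids carrying a rounded r into the second slope; the variable names are assumptions made for the illustration.

```python
# Column totals from the import volume (X) and importation cost (Y) table
n = 20
sum_x, sum_y = 12025, 34716
sum_xy, sum_x2, sum_y2 = 21194200, 7470221, 63644856

s_xy = n * sum_xy - sum_x * sum_y

# Regression of Y on X: Y = a_yx + b_yx * X
b_yx = s_xy / (n * sum_x2 - sum_x ** 2)
a_yx = sum_y / n - b_yx * sum_x / n

# Regression of X on Y: X = a_xy + b_xy * Y
b_xy = s_xy / (n * sum_y2 - sum_y ** 2)
a_xy = sum_x / n - b_xy * sum_y / n

print(f"Y = {a_yx:.2f} + {b_yx:.4f} X")
print(f"X = {a_xy:.3f} + {b_xy:.4f} Y")
```

Small differences from the hand-computed intercepts in the last decimal places come from intermediate rounding in the manual working.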

(d) (i) Coefficient of non-determination = $1 - r^2$, where $r$ = correlation coefficient = 0.36 [(a)
above].
Coefficient of non-determination $= 1 - 0.36^2 = 0.8704$
This implies that 87.04% of the variation in importation cost is not explained by variation
in import volumes.

(ii) Coefficient of alienation $= \sqrt{1 - r^2}$,

where $1 - r^2$ = coefficient of non-determination = 0.8704 [as computed in (i) above].

Coefficient of alienation $= \sqrt{0.8704} = 0.933$

A coefficient of alienation of 0.933, which is close to 1, indicates a weak linear relationship
between importation cost and import volume; the unexplained share of the variation remains the
87.04% found in (i).

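(e) Calculations needed for fitting $C = \alpha V^{\beta}$, where V = import volume, C = importation cost (IC), X = log V and Y = log C (logarithms to base 10)
Year t V C X Y X² XY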
2009 1 500 1000 2.69897 3 7.284439 8.09691
2 511 1020 2.708421 3.0086 7.335544 8.148556
3 658 1300 2.818226 3.113943 7.942397 8.775796
4 450 1200 2.653213 3.079181 7.039537 8.169722
2010 5 560 1560 2.748188 3.193125 7.552537 8.775307
6 750 1580 2.875061 3.198657 8.265977 9.196335
7 800 1700 2.90309 3.230449 8.427931 9.378284
8 510 1900 2.70757 3.278754 7.330936 8.877455
2011 9 580 2000 2.763428 3.30103 7.636534 9.122159
10 590 2100 2.770852 3.322219 7.677621 9.205378
11 600 2300 2.778151 3.361728 7.718124 9.339388
12 450 1500 2.653213 3.176091 7.039537 8.426845
2012 13 600 1800 2.778151 3.255273 7.718124 9.043639
14 680 1566 2.832509 3.194792 8.023107 9.049276
15 780 1800 2.892095 3.255273 8.364211 9.414556
16 590 1890 2.770852 3.276462 7.677621 9.078591
2013 17 456 1500 2.658965 3.176091 7.070094 8.445115
18 560 2300 2.748188 3.361728 7.552537 9.23866
19 800 2400 2.90309 3.380211 8.427931 9.813057
20 600 2300 2.778151 3.361728 7.718124 9.339388
TOTAL 210 12025 34716 55.44038 64.52533 153.8029 178.9344

The relationship is $C = \alpha V^{\beta}$.
Applying $\log_{10}$ to both sides, the above equation can be restated as:

$$\log C = \log \alpha + \beta \log V$$

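Recall that

$$b_{CV} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$

where $X = \log V$, $Y = \log C$, and $b_{CV} = \beta$.

$$b = \frac{20(178.9344) - (55.44038)(64.52533)}{20(153.8029) - 55.44038^2} = 0.569$$

Also,

$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$

where $a = \log \alpha$ and $b = b_{CV} = \beta$.

$$a = \frac{64.52533}{20} - 0.569\left(\frac{55.44038}{20}\right) = 1.649$$

Since $a = \log \alpha$, $\alpha = \text{antilog}\, a = \text{antilog}(1.649) = 44.566$.
i. Thus, $\alpha \approx 44.57$ and $\beta \approx 0.57$.

Hence, $C = 44.57V^{0.57}$.

ii. Given $V = 1000$,

$$C = 44.57(1000)^{0.57} = \text{GHC } 2285.823$$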
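
The log-linear fit can be reproduced from the raw quarterly data. The sketch below is in Python; the function name `fit_power_law` is a hypothetical helper, and it fits the straight line log C = log α + β log V by least squares on base-10 logarithms, as in the working above.

```python
import math


def fit_power_law(vs, cs):
    """Fit C = alpha * V**beta by least squares on log10 values."""
    xs = [math.log10(v) for v in vs]   # X = log V
    ys = [math.log10(c) for c in cs]   # Y = log C
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    log_alpha = sum_y / n - beta * sum_x / n
    return 10 ** log_alpha, beta       # antilog recovers alpha


# Import volume (tonnes) and importation cost (GHC), 2009 Q1 to 2013 Q4
volume = [500, 511, 658, 450, 560, 750, 800, 510, 580, 590,
          600, 450, 600, 680, 780, 590, 456, 560, 800, 600]
cost = [1000, 1020, 1300, 1200, 1560, 1580, 1700, 1900, 2000, 2100,
        2300, 1500, 1800, 1566, 1800, 1890, 1500, 2300, 2400, 2300]

alpha, beta = fit_power_law(volume, cost)
cost_at_1000 = alpha * 1000 ** beta
print(f"alpha = {alpha:.2f}, beta = {beta:.3f}, C(1000) = {cost_at_1000:.2f}")
```

Because the exponent is not rounded to two decimal places before use, the script's values for α and for C at V = 1000 can differ somewhat from the hand-rounded figures; a power-law prediction is quite sensitive to rounding of β.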

EXCEL APPLICATION
COMMONLY USED BIVARIATE FUNCTIONS IN EXCEL
FUNCTION                                  MEANING
CORREL(y values, x values)                Calculates the Pearson product-moment correlation coefficient between y and x
PEARSON(y values, x values)               Calculates the Pearson product-moment correlation coefficient between y and x
SLOPE(y values, x values)                 Calculates the slope of the regression line of y on x
INTERCEPT(y values, x values)             Calculates the intercept of the regression line of y on x
RSQ(y values, x values)                   Calculates r²
LINEST(y values, x values, TRUE, FALSE)   Calculates the coefficient (slope) and intercept of the regression line of y on x

Illustration of Excel Applications

Exercises

1a) Calculate the value of coefficient of correlation between price and supply.

Price 8 10 15 17 20 22 24 25
Supply 25 30 32 35 37 40 42 45

b) Compute Karl Pearson’s coefficient of correlation between per capita national income and per
capita consumer expenditure from the data given below.

Per capita national income in GHS 249 251 248 252 258 269 271 272 280 275
Per capita consumer expenditure in GHS 237 238 236 240 245 255 254 252 258 251

c) Calculate Karl Pearson’s coefficient of correlation between advertisement cost and sales from the data given below.

Advertisement cost in GHS '000 39 65 62 90 82 75 25 98 36 78
Sales in lakhs of GHS 47 53 58 86 62 68 60 91 51 84

d) The following data relate to annual net income and annual food expenditures (GH¢’m)
for 8 selected families in Sikanti.

Annual Net Income GHS 9000 5000 11000 13000 12000 7000 15000 13000
Food Expenditure GHS 6000 8000 4000 4000 3000 6000 3000 5000

i. Compute the product moment correlation coefficient and the coefficient of
determination and interpret your results.
ii. Determine the line of ‘best fit’ for food expenditure on annual net income, and use
your line to estimate the food expenditure of a family whose annual net income is
GH¢8500.

2. For five cities, data have been collected on number of civil disturbances (riots, strikes and
so on) over the past year and on unemployment rate.

City A B C D E
Unemployment Rate (x) 22 20 10 15 9
Civil disturbance (y) 25 13 10 5 0

(a) Are these variables associated?
(b) Determine the linear regression equation for estimating the number of civil
disturbance, given the rate of unemployment.

3. At Abotareye Company Ltd, staff appraisal is a two-way affair in that subordinates
appraise superiors and vice-versa. A random selection of 10 junior workers is made,
and the performance rating each was given by a particular supervisor is noted together
with the rating each of them assigned to the supervisor. The table below shows the
ratings:

Worker A B C D E F G H I J
Rating of Worker 80 95 83 86 82 75 92 74 75 90
Rating of Supervisor 87 93 87 92 95 78 97 81 76 92

Determine the Spearman rank correlation between the workers’ ratings of the
supervisors and the latter’s ratings of the workers and interpret your results.

4. To investigate the relationship between height and shoe size, the president of the Ladies’
Club at Ahayiaa Rubber Products Ltd collected the data below:

Lady 1 2 3 4 5 6 7 8 9 10
Height (cm) 164 168 167 165 171 171 168 171 169 165
Shoe size 38 39 40 38 39 40 40 40 39 39

Required
(a) Draw a scatter diagram for the data using the horizontal axis for the height and the
vertical, the shoe size.
(b) Determine the linear regression equation for estimating the shoe sizes from the given
heights (Answer: $y = 6.64 + 0.19x$). Use the regression equation to estimate the shoe size of a
lady whose height is 166 cm (Answer: 38 or 39).
