STA 202 Correlation and Regression
CORRELATION ANALYSIS
We have so far discussed the Pearson product-moment correlation coefficient (r) and Spearman's rank correlation coefficient (ρ). We now want to discuss Kendall's tau rank correlation coefficient.
Kendall suggested another way of measuring the correlation between two variables, X and Y, whose data can be ranked. He proposed a procedure which is entirely different from the one proposed by Spearman.
1. Arrange the X-component into its natural (ascending) order. Each X value remains paired with its corresponding Y value.
2. Count the number of Y values that come after the first Y value and are larger than it. Do the same for each of the Y values.
3. Find the sum of all the counts recorded and label it P.
4. Repeat the process, this time counting the values that come after each reference Y value and are smaller than it. Sum all the counts recorded and negate the sum. Label this value Q.
5. Find S by adding P and Q. That is, S = P + Q.
Kendall's tau rank correlation coefficient is then given by

$$\tau = \frac{S}{\frac{1}{2}N(N-1)}$$
Illustrative example:
Find Kendall's tau rank correlation coefficient for the data below.
X 4 10 6 1 12 3 8 9 2 7 5 11
Y 5 8 11 3 12 1 6 7 4 9 2 10
Solution
X 1 2 3 4 5 6 7 8 9 10 11 12
Y 3 4 1 5 2 11 9 6 7 8 10 12
Find P using steps 2 and 3. For example, for the first Y value, 3, we count 9 values after it that are larger than 3.
P= 9+8+9+7+7+1+2+4+3+2+1+0 = 53
Similarly, we find Q by counting, for each Y value, the values after it that are smaller than it.
Q = −(2+2+0+1+0+5+3+0+0+0+0+0) = −13, and S = P + Q = 53 − 13 = 40
$$\tau = \frac{S}{\frac{1}{2}N(N-1)} = \frac{40}{\frac{1}{2}(12)(12-1)} = \frac{40}{66} = 0.606$$
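As a check on the hand computation above, the counting procedure in steps 1 to 5 can be sketched in a few lines of Python. The function name and layout below are my own illustration, not part of the notes; for this example it should reproduce P = 53, Q = −13 and τ ≈ 0.606.

```python
# Kendall's tau by direct counting (steps 1-5), assuming no tied values.
def kendall_tau(x, y):
    # Step 1: sort the pairs so that the X-component is in its natural order.
    pairs = sorted(zip(x, y))
    y_ord = [yi for _, yi in pairs]
    n = len(y_ord)

    # Steps 2-4: for each Y value, count later values that are larger (P)
    # and later values that are smaller (Q, taken as negative).
    P = sum(1 for i in range(n) for j in range(i + 1, n) if y_ord[j] > y_ord[i])
    Q = -sum(1 for i in range(n) for j in range(i + 1, n) if y_ord[j] < y_ord[i])

    # Step 5: S = P + Q, then divide by N(N-1)/2.
    S = P + Q
    return S / (0.5 * n * (n - 1))

x = [4, 10, 6, 1, 12, 3, 8, 9, 2, 7, 5, 11]
y = [5, 8, 11, 3, 12, 1, 6, 7, 4, 9, 2, 10]
print(kendall_tau(x, y))   # approximately 0.606
```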
If some observations are repeated (tied), then the formula for estimating the correlation coefficient is given by

$$\tau = \frac{S}{\sqrt{\left[\frac{1}{2}N(N-1) - T_x\right]\left[\frac{1}{2}N(N-1) - T_y\right]}}$$

where $T_x = \frac{1}{2}\sum t(t-1)$, summed over the groups of t tied X values, and $T_y$ is defined in the same way for Y.
Illustrative example
Given the data below, find the correlation coefficient using Kendall's tau rank method. Interpret your value.
X 10 8 4 5 9 2 1 3 6 7
Y 2.5 4.5 8 6.5 6.5 9 10 4.5 2.5 1
Solution
X 1 2 3 4 5 6 7 8 9 10
Y 10 9 4.5 8 6.5 2.5 1 4.5 6.5 2.5
Following the steps, we obtain P = 9 and Q = −33. Note that tied values are left out of the counts. For example, in considering 4.5 (the 3rd observation in Y), the 8th observation, which is also 4.5, is left out.
S = P + Q = 9 − 33 = −24.
There are no ties among the X values, hence $T_x = 0$.
For the Y variable, there are 3 sets of tied ranks: 2.5 appears 2 times, 4.5 appears 2 times and 6.5 also appears 2 times.

$$T_y = \frac{1}{2}\left[2(2-1) + 2(2-1) + 2(2-1)\right] = 3$$
N = 10
$$\tau = \frac{S}{\sqrt{\left[\frac{1}{2}N(N-1) - T_x\right]\left[\frac{1}{2}N(N-1) - T_y\right]}} = \frac{-24}{\sqrt{\left[\frac{1}{2}(10)(10-1) - 0\right]\left[\frac{1}{2}(10)(10-1) - 3\right]}} = \frac{-24}{\sqrt{(45)(42)}} = -0.552$$

The value of −0.552 indicates a moderate negative relationship between X and Y.
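A similar sketch (again an illustration with my own names, not anything prescribed in the notes) handles the tied case: tied pairs are simply skipped in the counting, and $T_x$, $T_y$ are computed as $\frac{1}{2}\sum t(t-1)$ over each group of t tied values. For this example it should give approximately −0.552.

```python
from collections import Counter
from math import sqrt

# Kendall's tau with a correction for tied observations.
def kendall_tau_tied(x, y):
    pairs = sorted(zip(x, y))               # order by the X-component
    x_ord = [xi for xi, _ in pairs]
    y_ord = [yi for _, yi in pairs]
    n = len(pairs)

    # S = P + Q: pairs tied on X or on Y are simply left out of the counts.
    S = 0
    for i in range(n):
        for j in range(i + 1, n):
            if x_ord[i] == x_ord[j] or y_ord[i] == y_ord[j]:
                continue
            S += 1 if y_ord[j] > y_ord[i] else -1

    # T = (1/2) * sum of t(t-1) over each group of t tied values.
    def tie_term(values):
        return 0.5 * sum(t * (t - 1) for t in Counter(values).values())

    Tx, Ty = tie_term(x), tie_term(y)
    half = 0.5 * n * (n - 1)
    return S / sqrt((half - Tx) * (half - Ty))

x = [10, 8, 4, 5, 9, 2, 1, 3, 6, 7]
y = [2.5, 4.5, 8, 6.5, 6.5, 9, 10, 4.5, 2.5, 1]
print(kendall_tau_tied(x, y))   # approximately -0.552
```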
Try Question
In a job interview, two panellists each scored 10 applicants out of 20. The results are provided in the table below.
Applicants 1 2 3 4 5 6 7 8 9 10
Panellist 1 11 18 13 15 16 17 18 19 12 14
Panellist 2 19 12 17 15 12 18 12 11 15 20
(a) Using the Pearson product-moment correlation coefficient, find how strongly the scores of the two panellists are related.
(b) Also find the correlation coefficient using Kendall's tau rank correlation coefficient. Is the interpretation similar to that in (a)?
(c) Would you recommend that the rankings be used in making the selection? Explain your answer.
COEFFICIENT OF DETERMINATION
The interpretation of the correlation coefficient, r, as strong, weak, moderate, etc., is not precise. A measure that has a more exact meaning is the coefficient of determination. This is obtained by squaring the correlation coefficient.
For instance, if the correlation coefficient is 0.88, then the coefficient of determination is 0.88² = 0.7744 ≈ 0.77, which can be written as 77%. This means that 77% of the total variation in the dependent variable is explained or accounted for by the variation in the independent variable.
Definition: Coefficient of determination is defined as the proportion of the total variation in the
dependent variable, say, Y that is explained by the variation in the independent variable, X.
ESTIMATION
The process of using a single value to estimate a population parameter is known as point estimation. If an interval is used to estimate the population parameter, the process is known as interval estimation.
The specific value used to do the estimation is known as the point estimate or interval estimate, respectively.
The rule that is used to find the estimate is known as the point estimator or interval estimator, respectively.
For example, to estimate the population mean, μ, we use the sample mean, $\bar{x}$. Now, the process of finding it is known as point estimation. The value $\bar{x}$ is the point estimate, and the rule (formula) $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the point estimator for μ.
For each estimate, we can calculate the error that is associated with it. This error is called the standard error of the estimate. For this study we just illustrate it with the arithmetic mean.
Illustrative Example
Find the standard error associated with $\bar{x}$ in estimating μ for the data 3, 5, 7 and 5.
Solution
First find the variance of the estimator $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

$$\operatorname{Var}(\bar{x}) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{var}(x_i), \qquad \operatorname{var}(x_i) = S^2$$

$$\operatorname{Var}(\bar{x}) = \frac{1}{n^2}\sum_{i=1}^{n} S^2 = \frac{nS^2}{n^2} = \frac{S^2}{n}$$

$$S_{\bar{x}} = \frac{S}{\sqrt{n}}$$

Now from the data, $\bar{x} = 5$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 2.667$, so $S = 1.633$.

$$S_{\bar{x}} = \frac{S}{\sqrt{n}} = \frac{1.633}{\sqrt{4}} = 0.8165$$
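A quick numerical check of this example can be written as follows; this is only a sketch with illustrative variable names.

```python
from math import sqrt

data = [3, 5, 7, 5]
n = len(data)

xbar = sum(data) / n                                   # sample mean
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)      # sample variance S^2
se = sqrt(s2) / sqrt(n)                                # standard error S / sqrt(n)

print(xbar, s2, round(sqrt(s2), 3), round(se, 4))      # 5.0  2.667  1.633  0.8165
```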
REGRESSION ANALYSIS
Regression analysis seeks to find a model that best describes the relationship that exists between two or more variables. This relationship could be linear or non-linear. It is represented by a mathematical equation that defines the relationship between the variables.
Regression analysis is divided into two types: linear and non-linear regression analysis. Linear regression is further divided into simple linear regression and multiple linear regression. Likewise, non-linear regression is further divided into simple non-linear regression and multiple non-linear regression analysis.
There are two main types of variables: the response (dependent) variable and the predictor, also known as the explanatory (independent), variable. The main difference between simple linear regression and multiple linear regression is that there is only one predictor variable in simple linear regression, whereas multiple linear regression has two or more predictor variables.
Simple linear regression is a linear equation involving the response variable and only one predictor variable. The equation obtained can be used to make predictions and also to discuss the relationship between the response variable and the predictor variable.
Model
$$y_i = \alpha + \beta x_i + e_i \quad \text{……… (1)}$$

$\hat{y}_i$ is the estimated response variable. Also, $E(e_i) = 0$; that is, the expectation of the error term is equal to zero. Thus, it indicates that the error term can be estimated for each pair of values.
What we need to do now is to estimate the regression coefficients α and β and form the estimated model $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$.
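To make the model in equation (1) concrete, the short sketch below generates data from $y_i = \alpha + \beta x_i + e_i$ with an error term whose expectation is zero. The particular values of α, β and the error spread are arbitrary choices for illustration, not values from the notes.

```python
import random

# Illustrative values only: alpha = 2, beta = 0.5, errors with mean zero.
alpha, beta = 2.0, 0.5
random.seed(1)

x = list(range(1, 11))
e = [random.gauss(0, 1) for _ in x]               # E(e_i) = 0
y = [alpha + beta * xi + ei for xi, ei in zip(x, e)]

for xi, yi in zip(x, y):
    print(xi, round(yi, 2))
```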
We want to study two methods for estimating these coefficients. These are the eyeball fitting method (the graphical method) and the least squares method.
The first thing to do is to plot a scatter diagram of the response variable, say Y, against the predictor variable, say X. That is, a plot of Y on X.
[Graph of Y on X showing a scatter diagram for simple linear regression with a fitted line of the form $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$.]

From the graph, the Y-intercept is the estimate of α, and the gradient or slope of the graph is the estimate of β. That is, $\hat{\beta} = \dfrac{\Delta Y}{\Delta X}$. Given these estimates, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$.
A weakness of this method is that different lines of best fit can be drawn for the same data set by different people. When this happens, the estimates of α and β may be slightly different for different people using the same data set; hence the estimated model, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$, may be different.
Suggestions on how to draw a line of best fit have, however, been made to help draw the line of best fit from the scatter diagram. This helps to reduce the variety of lines that can be drawn by different people for the same data set.
These are the guidelines for drawing the line of best fit:
(a) Plot the scatter diagram of Y on X.
(b) Find the overall centroid, $(\bar{X}, \bar{Y})$, of all the points and plot it.
(c) Draw a vertical line through $\bar{X}$ so that the points are divided into a left group and a right group.
(d) Now find the right centroid, $(\bar{X}_R, \bar{Y}_R)$, by considering all the points on the right side of the line. Also, find the left centroid, $(\bar{X}_L, \bar{Y}_L)$, by considering all the points on the left side of the line.
(e) Plot these two points on the scatter diagram.
(f) Plot the line of best fit by passing it through $(\bar{X}, \bar{Y})$ and then as close to $(\bar{X}_R, \bar{Y}_R)$ and $(\bar{X}_L, \bar{Y}_L)$ as possible.
(g) Now find the Y-intercept and the slope of the line drawn.
By obtaining $\hat{\alpha}$ and $\hat{\beta}$ in this way, you are able to obtain the linear regression model, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$.
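The guidelines above are intended for drawing by hand, but as a rough numerical check on a hand-drawn line the centroids themselves can be computed. The sketch below is my own illustration of the idea, not a formula from the notes: it splits the points about $\bar{X}$, finds the left and right centroids, takes the slope between them, and forces the line through $(\bar{X}, \bar{Y})$.

```python
def eyeball_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n                 # overall centroid

    # Split the points into left and right groups about X-bar.
    left  = [(xi, yi) for xi, yi in zip(x, y) if xi <= xbar]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi > xbar]

    def centroid(points):
        return (sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))

    (xl, yl), (xr, yr) = centroid(left), centroid(right)

    beta_hat = (yr - yl) / (xr - xl)       # slope between the two group centroids
    alpha_hat = ybar - beta_hat * xbar     # force the line through (X-bar, Y-bar)
    return alpha_hat, beta_hat

# Illustrative data only.
print(eyeball_fit([1, 2, 3, 4, 5, 6], [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]))
```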
For the least squares method, we derive formulae for finding $\hat{\alpha}$ and $\hat{\beta}$. The concept relies on minimizing the squared deviations of the observations from the line of best fit:

$$d_1 = y_1 - \hat{y}_1,\; d_2 = y_2 - \hat{y}_2,\; \ldots,\; d_i = y_i - \hat{y}_i$$

$$d_1^2 = (y_1 - \hat{y}_1)^2,\; d_2^2 = (y_2 - \hat{y}_2)^2,\; \ldots,\; d_i^2 = (y_i - \hat{y}_i)^2$$
Now, summing all the squared deviations,

$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \quad \text{but } \hat{y}_i = \alpha + \beta x_i \text{ (see the graph), which implies}$$

$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n}\left(y_i - \alpha - \beta x_i\right)^2$$

Let

$$q = \sum_{i=1}^{n}\left(y_i - \alpha - \beta x_i\right)^2 \quad \text{……… (3)}$$

$$\frac{dq}{d\alpha} = -2\sum_{i=1}^{n}\left(y_i - \alpha - \beta x_i\right)$$

Equating this to zero and making α the subject, we have

$$\alpha = \frac{\displaystyle\sum_{i=1}^{n} y_i - \beta\sum_{i=1}^{n} x_i}{n} \quad \text{……… (4)}$$

$$\frac{dq}{d\beta} = -2\sum_{i=1}^{n} x_i\left(y_i - \alpha - \beta x_i\right)$$

Equating this to zero and making β the subject, we have

$$\beta = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \alpha\sum_{i=1}^{n} x_i}{\displaystyle\sum_{i=1}^{n} x_i^2} \quad \text{……… (5)}$$

Equations (4) and (5) are known as the normal equations for the least squares estimation. We can solve equations (4) and (5) simultaneously for α and β. By substitution we obtain

$$\hat{\beta} = \frac{n\displaystyle\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\displaystyle\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
Illustrative Example
X 3 2 4 2 3 2
Y 2 1 3 1 2 2
Fit a least squares regression line to the data above. Then estimate Y when x = 4, and find also the error of the estimation.
Solution
X Y XY X²
3 2 6 9
2 1 2 4
4 3 12 16
2 1 2 4
3 2 6 9
2 2 4 4
Total 16 11 32 46
$$\hat{\beta} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{6(32) - (16)(11)}{6(46) - (16)^2} = 0.8$$

$$\hat{\alpha} = \frac{\sum_{i=1}^{n} y_i - \hat{\beta}\sum_{i=1}^{n} x_i}{n} = \frac{11 - (0.8)(16)}{6} = -0.30$$

The estimated model is therefore $\hat{y}_i = -0.30 + 0.8 x_i$, so when x = 4, $\hat{y} = -0.30 + 0.8(4) = 2.9$.
The regression coefficient α is the estimated value of the response variable when the predictor variable is zero, that is, when the predictor plays no part.
The regression coefficient β is the change in the response variable for a unit change in the predictor variable.
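The estimates above can be verified with a short script implementing the formulas just derived; the function name and data layout are illustrative only.

```python
def least_squares(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)

    # Slope from the substitution formula, intercept from equation (4).
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    alpha = (sy - beta * sx) / n
    return alpha, beta

x = [3, 2, 4, 2, 3, 2]
y = [2, 1, 3, 1, 2, 2]
alpha, beta = least_squares(x, y)
print(round(alpha, 2), round(beta, 2))   # approximately -0.3 and 0.8
print(round(alpha + beta * 4, 2))        # estimate of Y at x = 4: about 2.9
```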
Try question
The income and savings of families are under study. A sample of ten families was recorded in Ghc as follows:
Savings 1000 2000 2000 5000 5000 6000 7000 8000 7000 9000
Income 36,000 39,000 42,000 45,000 48,000 51,000 54,000 56,000 59,000 60,000
(a) Determine the least squares prediction equation of Income on Savings.
(b) Use the equation to predict the income if the savings happen to be Ghc 10,000.
Standard Error of Estimate for the Line of Regression
The standard error of estimate measures the spread or the dispersion of the observed values around
the line of regression. It is given by
$$S_{Y.X} = \sqrt{\frac{\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}{n-2}}$$
Example
Given the data below estimate the simple linear regression using the least squares method and
hence estimate also the standard error associated with the regression line.
X 4 7 3 6 10
Y 5 12 4 8 11
Solution
The first part is solved as before; the linear regression model is estimated to be $\hat{y}_i = 1.202 + 1.133 x_i$.
X Y Ŷ Y−Ŷ (Y−Ŷ)²
4 5 5.734 -0.734 0.5388
7 12 9.133 2.867 8.2200
3 4 4.601 -0.601 0.3612
6 8 8.000 0.000 0.0000
10 11 12.532 -1.532 2.3470
Totals: ΣX = 30, ΣY = 40, Σ(Y − Ŷ)² = 11.4670
$$S_{Y.X} = \sqrt{\frac{\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}{n-2}} = \sqrt{\frac{11.4670}{5-2}} = 1.9550$$
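Putting the pieces together, the sketch below refits the line by least squares and then computes the standard error of estimate; it should give a value close to 1.955. As before, the names used are my own.

```python
from math import sqrt

def least_squares(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    alpha = (sy - beta * sx) / n
    return alpha, beta

x = [4, 7, 3, 6, 10]
y = [5, 12, 4, 8, 11]

alpha, beta = least_squares(x, y)
fitted = [alpha + beta * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # sum of (Y - Y-hat)^2

s_yx = sqrt(ss_res / (len(x) - 2))   # standard error of estimate
print(round(s_yx, 4))                # approximately 1.955
```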