
STA 202 Correlation and Regression

Kendall's tau rank correlation coefficient is an alternative to Spearman's rank correlation coefficient for measuring the relationship between two variables that can be ranked. It involves ranking the data and counting the number of concordant and discordant pairs to calculate Kendall's tau. The coefficient ranges from -1 to 1, where higher positive values indicate a stronger monotonic relationship and negative values indicate a negative relationship. Kendall's tau can also account for ties in the ranked data. Regression analysis seeks to model relationships between variables through mathematical equations from sample data, with simple linear regression involving one predictor variable and multiple linear regression using two or more predictors.


STA 202: FURTHER STATISTICS.

CORRELATION ANALYSIS

We have so far discussed the Pearson product-moment correlation coefficient (r) and Spearman’s rank correlation coefficient (ρ). We now want to discuss Kendall’s tau rank correlation coefficient.

Kendall’s Tau Rank Correlation Coefficient

Kendall suggested another way of measuring the correlation between two variables, X and Y, whose data can be ranked. He proposed a method entirely different from the one proposed by Spearman.

The following procedure was outlined by Kendall:

1. Arrange the X-component into its natural (ascending) order, keeping each X value paired with its corresponding Y value.
2. For the first number in the Y component, count how many Y values to its right are larger. Do the same for every Y value.
3. Sum all the counts recorded and label the total P.
4. Repeat the process, this time counting the Y values to the right that are smaller than the reference Y value. Sum all the counts recorded and negate the total. Label this value Q.
5. Find S by adding P and Q. That is, S = P + Q.

Kendall’s tau rank correlation coefficient is then

τ = S / [½N(N − 1)]

where N is the number of observations that are ranked on both X and Y.

Illustrative example:

Find the Kendall tau ranked correlation coefficient for the data below.

X 4 10 6 1 12 3 8 9 2 7 5 11
Y 5 8 11 3 12 1 6 7 4 9 2 10

Solution

By step 1, rearrange the table:

X 1 2 3 4 5 6 7 8 9 10 11 12
Y 3 4 1 5 2 11 9 6 7 8 10 12

Find P using steps 2 and 3. For example, for the first Y value, 3, we count 9 values to its right that are larger than 3.

P = 9+8+9+7+7+1+2+4+3+2+1+0 = 53

Similarly, we find Q: for each reference number we count the values to its right that are smaller.

Q = −(2+2+0+1+0+5+3+0+0+0+0+0) = −13, and S = P + Q = 53 − 13 = 40

τ = S / [½N(N − 1)] = 40 / [½ × 12 × (12 − 1)] = 40/66 ≈ 0.606

This indicates a moderately positive relationship between X and Y.
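The counting procedure in steps 1 to 5 can be sketched in code. This is a minimal illustration, not part of the original notes; the function name `kendall_tau_simple` is ours, and it assumes there are no tied values:

```python
# Sketch of Kendall's tau via the P/Q counting procedure
# described above (assumes no tied values in X or Y).

def kendall_tau_simple(x, y):
    # Step 1: sort the pairs so that X is in natural (ascending) order.
    ys = [b for _, b in sorted(zip(x, y))]
    n = len(ys)
    # Steps 2-3: P counts, for each Y value, the larger values to its right.
    P = sum(1 for i in range(n) for j in range(i + 1, n) if ys[j] > ys[i])
    # Step 4: Q counts the smaller values to the right, negated.
    Q = -sum(1 for i in range(n) for j in range(i + 1, n) if ys[j] < ys[i])
    S = P + Q                          # Step 5
    return S / (0.5 * n * (n - 1))

X = [4, 10, 6, 1, 12, 3, 8, 9, 2, 7, 5, 11]
Y = [5, 8, 11, 3, 12, 1, 6, 7, 4, 9, 2, 10]
print(round(kendall_tau_simple(X, Y), 3))  # 0.606
```

Running this on the example data reproduces S = 40 and τ ≈ 0.606 as obtained above.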

Kendall’s tau Rank Correlation Coefficient for Tied Observations

If some observations are repeated, then the formula for estimating the correlation coefficient is given by

τ = S / √{ [½N(N − 1) − Tx][½N(N − 1) − Ty] }

where Tx = ½ Σ tx(tx − 1), with tx the number of times each repeated (tied) value appears in the X variable, and Ty = ½ Σ ty(ty − 1), with ty the number of times each repeated (tied) value appears in the Y variable.

Illustrative example

Given the data below, find the correlation coefficient using Kendall’s tau rank method. Interpret your value.

X 10 8 4 5 9 2 1 3 6 7
Y 2.5 4.5 8 6.5 6.5 9 10 4.5 2.5 1

Solution

First follow the steps in 1 to 5 to obtain S.

X 1 2 3 4 5 6 7 8 9 10
Y 10 9 4.5 8 6.5 2.5 1 4.5 6.5 2.5

Following the steps, we obtain P = 9 and Q = −33. Note that tied values are left out: for example, when considering 4.5 (the 3rd observation in Y), the 8th observation, which is also 4.5, is not counted.

S = P+Q = 9 – 33 = -24.

Now we need to determine Tx and Ty.

There are no ties among the X values, hence Tx = 0.

For the Y variable, there are 3 sets of tied ranks: 2.5 appears 2 times, 4.5 appears 2 times, and 6.5 also appears 2 times.

In each of these ty = 2, the number of tied observations, so

Ty = ½[2(2 − 1) + 2(2 − 1) + 2(2 − 1)] = 3

N = 10

τ = S / √{ [½N(N − 1) − Tx][½N(N − 1) − Ty] } = −24 / √{ [½ × 10 × (10 − 1) − 0][½ × 10 × (10 − 1) − 3] } ≈ −0.552

The two variables X and Y are moderately negatively correlated.
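The tie-corrected computation can also be sketched in code. Again this is our own minimal illustration (the function name `kendall_tau_ties` is not from the notes), and it relies on the tie groups only entering through the Tx and Ty terms:

```python
from collections import Counter
from math import sqrt

# Sketch of Kendall's tau with the tie correction given above.
def kendall_tau_ties(x, y):
    # Order the pairs by X, then score each later Y against each
    # earlier Y: +1 if larger, -1 if smaller, 0 if tied.
    ys = [b for _, b in sorted(zip(x, y))]
    n = len(ys)
    S = sum((ys[j] > ys[i]) - (ys[j] < ys[i])
            for i in range(n) for j in range(i + 1, n))
    # Tie corrections: half the sum of t(t - 1) over tied groups.
    def tie_term(values):
        return 0.5 * sum(t * (t - 1) for t in Counter(values).values())
    half = 0.5 * n * (n - 1)
    return S / sqrt((half - tie_term(x)) * (half - tie_term(y)))

X = [10, 8, 4, 5, 9, 2, 1, 3, 6, 7]
Y = [2.5, 4.5, 8, 6.5, 6.5, 9, 10, 4.5, 2.5, 1]
print(round(kendall_tau_ties(X, Y), 3))  # -0.552
```

On the worked example this gives S = 9 − 33 = −24 and τ ≈ −0.552, matching the hand calculation.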

Try Question

In an interview for a job application, two panels interviewed 10 applicants. The applicants were scored out of 20. The results are provided in the table below.

Applicants 1 2 3 4 5 6 7 8 9 10
Panellist 1 11 18 13 15 16 17 18 19 12 14
Panellist 2 19 12 17 15 12 18 12 11 15 20

(a) Using the Pearson product-moment correlation coefficient, find how strongly the scorings of the panellists are related.
(b) Also, using Kendall’s tau rank correlation coefficient, find the correlation coefficient. Is the interpretation similar to (a)?
(c) Would you recommend the rankings be used in making the selection? Explain your answer.

COEFFICIENT OF DETERMINATION

The interpretation of the correlation coefficient, r, as strong, weak, moderate, etc., is not precise. A measure with a more exact meaning is the coefficient of determination, obtained by squaring the correlation coefficient.

That is, Coefficient of determination (R²) = r².

For instance, if the correlation coefficient is 0.88, then the coefficient of determination is 0.88² = 0.7744 ≈ 0.77, which can be written as 77%. This means about 77% of the total variation in the dependent variable is explained, or accounted for, by the independent variable.

Definition: Coefficient of determination is defined as the proportion of the total variation in the
dependent variable, say, Y that is explained by the variation in the independent variable, X.

Definition: Coefficient of Non-determination is the proportion of the total variation in the dependent variable, say Y, that is not explained by the variation in the independent variable, X.

Coefficient of Non-determination is given by 1 – r2.

It must be noted that both measures cannot be negative.
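A quick check of the arithmetic in the example above (a small sketch, not part of the original notes):

```python
# Coefficient of determination and non-determination for r = 0.88.
r = 0.88
R2 = r ** 2                # coefficient of determination
print(round(R2, 4))        # 0.7744, i.e. about 77% explained
print(round(1 - R2, 4))    # 0.2256, coefficient of non-determination
```

Both quantities lie between 0 and 1, which is why neither can be negative.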

ESTIMATION

The process of using a single value to estimate a population parameter is known as point estimation. If an interval is used to estimate the population parameter, the process is known as interval estimation.

The specific value used to do the estimation is known as the point estimate or interval estimate, respectively.

The rule that is used to find the estimate is known as the point estimator or interval estimator, respectively.

For example, to estimate the population mean, μ, we use the sample mean, x̄. The process of finding it is known as point estimation. The value x̄ is the point estimate, and the rule (formula) x̄ = (1/n) Σ_{i=1}^{n} x_i is the point estimator for μ.

For each estimate, we can calculate the error that is associated with it. This error is called the standard error of the estimate. For this study we illustrate only with the arithmetic mean.

Illustrative Example

Find the standard error that is associated with, x , in estimating ,  , for the data 3, 5, 7 and 5.

Solution

First find the variance of the estimator, x̄ = (1/n) Σ_{i=1}^{n} x_i:

Var(x̄) = Var( (1/n) Σ_{i=1}^{n} x_i ) = (1/n²) Σ_{i=1}^{n} Var(x_i), with Var(x_i) = S²

so Var(x̄) = (1/n²) Σ_{i=1}^{n} S² = nS²/n² = S²/n

Hence the standard error of the mean is

S_x̄ = S/√n

Now, from the data, x̄ = 5 and S² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² = 2.667

⇒ S = 1.633

S_x̄ = S/√n = 1.633/√4 = 0.8165
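The calculation above can be verified with a short sketch (variable names are ours):

```python
from math import sqrt

# Standard error of the sample mean for the data 3, 5, 7, 5,
# using the (n - 1) divisor for the sample variance as above.
data = [3, 5, 7, 5]
n = len(data)
xbar = sum(data) / n
S2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # sample variance
S = sqrt(S2)                                         # sample standard deviation
se = S / sqrt(n)                                     # standard error of the mean
print(round(xbar, 1), round(S2, 3), round(S, 3), round(se, 4))
# 5.0 2.667 1.633 0.8165
```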

REGRESSION ANALYSIS

Regression analysis seeks to find a model that best describes the relationship that exists between two or more variables. This relationship could be linear or non-linear. It is represented by a mathematical equation that defines the relationship between the variables.

Regression analysis is divided into two types, linear and non- linear regression analysis. Linear
regression is further divided into simple linear regression and multiple linear regression. Likewise,
Non-Linear regression is further divided into simple non-linear regression and multiple non- linear
regression analysis.

There are two main types of variables: the response (dependent) variable and the predictor, also known as the explanatory (independent), variable. The main difference between simple linear regression and multiple linear regression is that simple linear regression has only one predictor variable, while multiple linear regression has two or more predictor variables.

This study discusses only simple linear regression.

Simple Linear Regression Analysis

This is a linear equation involving the response variable and only one predictor variable. The equation obtained can be used to make predictions and to discuss the relationship between the response variable and the predictor variable.

Model

y_i = α + βx_i + e_i ……………… (1)

Where y_i is the response variable

x_i is the predictor variable

α, β are the regression coefficients

e_i is the random error (error term)

The estimated model is given by ŷ_i = α + βx_i ………………. (2)

ŷ_i is the estimated response variable. Also, E(e_i) = 0; that is, the expectation of the error term is equal to zero.

Hence we can see that combining (1) and (2) gives y_i − ŷ_i = e_i.

Thus, it indicates that the error term can be estimated for each pair of values.

What we need to do now is to estimate the regression coefficients α and β from the estimated model, ŷ_i = α + βx_i.

Estimation of the Regression Coefficients α and β

We want to study two methods for estimating these coefficients. These are the Eyeball Fitting Method (the graphical method) and the Least-Squares Method.

The Eyeball Fitting Method

The first thing to do is to plot a scatter diagram of the Response variable, say, Y, against, the
Predictor variable , say, X. That is, plot of Y on X.

You then draw a line of best fit.

Graph of Y on X showing a scatter diagram for simple linear regression. The fitted model is of the form ŷ_i = α + βx_i.

From the graph, the Y-intercept is the estimate α̂, and the gradient (slope) of the line is the estimate of β. That is, β̂ = ΔY/ΔX. This gives the estimated model ŷ_i = α̂ + β̂x_i.

A weakness of this method is that different lines of best fit can be drawn for the same data set by different people. When this happens, the estimates of α and β may differ slightly from person to person, and hence the estimated model ŷ_i = α̂ + β̂x_i may differ.

Guidelines have, however, been suggested to help draw the line of best fit from the scatter diagram. These reduce the variation in the lines drawn by different people for the same data set.

These are the guidelines for drawing the line of best fit:

(a) Find the centroid of X and Y. This is the point of means, (X̄, Ȳ).
(b) Draw a vertical line (parallel to the Y-axis) through the point (X̄, Ȳ).
(c) Identify all the points on the right side and all the points on the left side of this vertical line through the centroid.
(d) Now find the right centroid, (X̄_R, Ȳ_R), from all the points on the right side of the line, and the left centroid, (X̄_L, Ȳ_L), from all the points on the left side.
(e) Plot these two points on the scatter diagram.
(f) Draw the line of best fit through (X̄, Ȳ), passing as close to (X̄_R, Ȳ_R) and (X̄_L, Ȳ_L) as possible.
(g) Now find the Y-intercept and the slope of the line drawn.

By obtaining α̂ and β̂, you are able to obtain the linear regression model, ŷ_i = α̂ + β̂x_i.
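The centroids in steps (a) and (d) can be computed rather than read off the plot. A small sketch of our own (the data are the X, Y values from the standard-error example later in these notes, used here only for illustration; points lying exactly on the vertical line are left out of both side centroids):

```python
# Compute the overall, left, and right centroids used to guide
# the eyeball line of best fit (steps (a) and (d) above).
def centroid(points):
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

points = [(4, 5), (7, 12), (3, 4), (6, 8), (10, 11)]
xbar, ybar = centroid(points)
left = centroid([p for p in points if p[0] < xbar])    # left of the vertical line
right = centroid([p for p in points if p[0] > xbar])   # right of the vertical line
print((xbar, ybar), left, right)  # (6.0, 8.0) (3.5, 4.5) (8.5, 11.5)
```

The eyeball line would then be drawn through (6.0, 8.0), staying as close as possible to (3.5, 4.5) and (8.5, 11.5).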

Method of Least Squares

For this method we derive formulas for finding α and β. The concept relies on minimizing the deviations of the observations from the line of best fit.

From the graph, we derive the formulas for finding α and β.

d_1 = y_1 − ŷ_1, d_2 = y_2 − ŷ_2, …, d_i = y_i − ŷ_i

By squaring these deviations, we have

d_1² = (y_1 − ŷ_1)², d_2² = (y_2 − ŷ_2)², …, d_i² = (y_i − ŷ_i)²

Now, summing all the squared deviations,

Σ_{i=1}^{n} d_i² = Σ_{i=1}^{n} (y_i − ŷ_i)², but ŷ_i = α + βx_i (see the graph), so it implies

Σ_{i=1}^{n} d_i² = Σ_{i=1}^{n} (y_i − α − βx_i)²

Let q = Σ_{i=1}^{n} (y_i − α − βx_i)² ………………………………….. (3)
i 1

Now we minimize q by differentiating q (equation 3) with respect to α and β.

dq/dα = −2 Σ_{i=1}^{n} (y_i − α − βx_i); equating this to zero and making α the subject, we have

α = (Σ_{i=1}^{n} y_i − β Σ_{i=1}^{n} x_i) / n …………………. (4)
n

Minimizing q with respect to β:

dq/dβ = −2 Σ_{i=1}^{n} x_i(y_i − α − βx_i); equating this to zero and making β the subject, we have

β = (Σ_{i=1}^{n} x_i y_i − α Σ_{i=1}^{n} x_i) / Σ_{i=1}^{n} x_i² ………………………. (5)
i 1

Equations (4) and (5) are known as the normal equations for the least-squares estimation. We can solve equations (4) and (5) simultaneously for α and β. By substitution we obtain

β̂ = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)
i 1 i 1

Illustrative Example

Find the linear regression equation of Y on X for the data below

X 3 2 4 2 3 2
Y 2 1 3 1 2 2

Estimate Y when x = 4; find also the error of estimation.

Solution

X Y XY X²
3 2 6 9
2 1 2 4
4 3 12 16
2 1 2 4
3 2 6 9
2 2 4 4
Total 16 11 32 46

β̂ = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²) = [6(32) − (16)(11)] / [6(46) − (16)²] = 0.8

α̂ = (Σ y_i − β̂ Σ x_i) / n = [11 − (0.8)(16)] / 6 = −0.30

Hence the linear regression model is given by ŷ_i = −0.30 + 0.80x_i

Now, if X = 4, then the estimate of Y is given by ŷ_i = −0.30 + 0.80(4) = 2.9

Error of estimation: e_i = y_i − ŷ_i, so e_3 = 3 − 2.9 = 0.10 (the third observation has x = 4, y = 3).
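The normal-equation formulas can be applied directly in code; a minimal sketch of our own for the worked example:

```python
# Least-squares estimates for the worked example, using the
# normal-equation formulas derived above.
X = [3, 2, 4, 2, 3, 2]
Y = [2, 1, 3, 1, 2, 2]
n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)
beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
alpha = (sy - beta * sx) / n                        # intercept
print(round(beta, 2), round(alpha, 2))   # 0.8 -0.3
print(round(alpha + beta * 4, 2))        # 2.9 (estimate at x = 4)
```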

Interpretation of Regression Coefficients.

The regression coefficient α is the estimated value of the response variable when the predictor variable is zero.

 is the change in the response variable for a unit change in the predictor variable.

Try question

The income and savings of families are under study. A sample of ten families was recorded in Ghc as follows:

Savings 1000 2000 2000 5000 5000 6000 7000 8000 7000 9000
Income 36,000 39,000 42,000 45,000 48,000 51,000 54,000 56,000 59,000 60,000

(a) Determine the least-squares prediction equation of Income on Savings.
(b) Use the equation to predict the income if the savings happen to be Ghc 10,000.

Standard Error of Estimate for the Line of Regression

The standard error of estimate measures the spread, or dispersion, of the observed values around the line of regression. It is given by

S_{Y·X} = √[ Σ_{i=1}^{n} (Y_i − Ŷ_i)² / (n − 2) ]

Example

Given the data below estimate the simple linear regression using the least squares method and
hence estimate also the standard error associated with the regression line.

X 4 7 3 6 10
Y 5 12 4 8 11
Solution

The first part is solved as before; the linear regression model is estimated to be ŷ_i = 1.202 + 1.133x_i.

X Y Ŷ Y − Ŷ (Y − Ŷ)²
4 5 5.734 −0.734 0.5388
7 12 9.133 2.867 8.2200
3 4 4.601 −0.601 0.3612
6 8 8.000 0.000 0.0000
10 11 12.532 −1.532 2.3470
30 40 11.4670

S_{Y·X} = √[ Σ_{i=1}^{n} (Y_i − Ŷ_i)² / (n − 2) ] = √[11.4670 / (5 − 2)] = 1.9550
