The sample covariance between X and Y is denoted by $S_{XY}$ and given by
$$S_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{n-1}$$
Correlation Analysis: deals with the measurement of the closeness of the relationships that are described in the regression equation.
We say there is correlation between two series of items $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$ if the two series vary together, directly or inversely.
When higher values of X are associated with higher values of Y and lower values of X are
associated with lower values of Y, then the correlation is said to be positive or direct.
Examples:
- Income and expenditure
- Number of hours spent in studying and the score obtained
- Height and weight
- Distance covered and fuel consumed by car.
When higher values of X are associated with lower values of Y and lower values of X are
associated with higher values of Y, then the correlation is said to be negative or inverse.
Examples:
- Demand and supply
- Income and the proportion of income spent on food.
The correlation between X and Y may be one of the following:
1. Perfect positive (r = 1)
2. Positive (r between 0 and 1)
3. No correlation (r = 0)
4. Negative (r between -1 and 0)
5. Perfect negative (r = -1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called “subject” or “independent”
variable, while the effect is called “dependent” variable.
2. Both variables being the result of a common cause. That is, the correlation that exists between the two variables is due to their both being related to some third factor.
Example:
Let X1 = ESLCE result
Y1 = rate of surviving in the University
Y2 = rate of getting a scholarship.
Both X1 & Y1 and X1 & Y2 have high positive correlation; likewise Y1 & Y2 have positive correlation, but they are not directly related: they are related to each other via X1.
3. Chance: The correlation that arises by chance is called spurious correlation.
Examples:
Price of teff in Addis Ababa and grades of students in the USA.
Weight of individuals in Ethiopia and income of individuals in Kenya.
Therefore, while interpreting a correlation coefficient, it is necessary to consider whether there is any plausible relationship between the variables under study.
The correlation coefficient between X and Y, denoted by r, is given by
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
and the shortcut formulas are
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2]\,[n\sum Y^2 - (\sum Y)^2]}}$$
$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2]\,[\sum Y^2 - n\bar{Y}^2]}}$$
Remark: r always lies between -1 and 1 inclusive, and it is symmetric: the correlation of X with Y equals the correlation of Y with X.
Interpretation of r
1. Perfect positive linear relationship (if r = 1)
2. Some positive linear relationship (if r is between 0 and 1)
3. No linear relationship (if r = 0)
4. Some negative linear relationship (if r is between -1 and 0)
5. Perfect negative linear relationship (if r = -1)
Example: The mid semester exam scores (X) and final exam scores (Y) of 10 students are given below. Calculate the simple correlation coefficient.

Student   (X)   (Y)
1         31    31
2         23    29
3         41    34
4         32    35
5         29    25
6         33    35
7         28    33
8         31    42
9         31    31
10        33    34
Solution:
$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2]\,[\sum Y^2 - n\bar{Y}^2]}}
= \frac{10331 - 10(31.2)(32.9)}{\sqrt{(9920 - 10(973.4))\,(11003 - 10(1082.4))}}
= \frac{66.2}{182.5} = 0.363$$
This means the mid semester exam and final exam scores have a weak positive correlation.
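The arithmetic above can be reproduced directly from the data. The following is a minimal Python sketch (assuming only NumPy is available) that applies the shortcut formula to the ten pairs of scores.

```python
import numpy as np

# Mid semester (X) and final exam (Y) scores of the 10 students above
x = np.array([31, 23, 41, 32, 29, 33, 28, 31, 31, 33], dtype=float)
y = np.array([31, 29, 34, 35, 25, 35, 33, 42, 31, 34], dtype=float)

n = len(x)
# Shortcut formula: r = (sum XY - n*Xbar*Ybar) / sqrt((sum X^2 - n*Xbar^2)(sum Y^2 - n*Ybar^2))
num = np.sum(x * y) - n * x.mean() * y.mean()
den = np.sqrt((np.sum(x**2) - n * x.mean()**2) * (np.sum(y**2) - n * y.mean()**2))
print(round(num / den, 3))   # approximately 0.363, matching the hand calculation
```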
Exercise The following data were collected from a certain household on the monthly income (X)
and consumption (Y) for the past 10 months. Compute the simple correlation coefficient.
X: 650 654 720 456 536 853 735 650 536 666
Y: 450 523 235 398 500 632 500 635 450 360
Spearman's coefficient of rank correlation is computed as follows:
i. Rank the values of X and the values of Y separately.
ii. Find the difference of the ranks in each pair; denote them by Di.
iii. Use the following formula
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$$
where r_s = coefficient of rank correlation, D = the difference between paired ranks, and n = the number of pairs.
Example:
Aster and Almaz were asked to rank 7 different types of lipsticks. See whether there is correlation between the tastes of the two ladies.
Lipstick types A B C D E F G
Aster 2 1 4 3 5 7 6
Almaz 1 3 2 4 5 6 7
Solution:

Lipstick   X (R1)   Y (R2)   D = R1 - R2   D^2
A          2        1         1            1
B          1        3        -2            4
C          4        2         2            4
D          3        4        -1            1
E          5        5         0            0
F          7        6         1            1
G          6        7        -1            1
Total                                      12

$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786$$

Yes, there is a positive correlation between the tastes of the two ladies.
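The same result can be checked numerically. Here is a minimal Python sketch (NumPy assumed) that evaluates the rank correlation formula on the two sets of ranks.

```python
import numpy as np

# Ranks given by Aster (R1) and Almaz (R2) to lipsticks A..G
r1 = np.array([2, 1, 4, 3, 5, 7, 6])
r2 = np.array([1, 3, 2, 4, 5, 6, 7])

d = r1 - r2                                   # differences of paired ranks
n = len(r1)
rs = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))  # Spearman's formula
print(round(rs, 3))                           # approximately 0.786
```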
10.3 Simple Linear Regression
- It is advisable to prepare a scatter plot before fitting the model.
- The linear model is:
$$Y = \alpha + \beta X + \varepsilon$$
Where: Y = dependent variable
X = independent variable
$\alpha$ = regression constant
$\beta$ = regression slope
$\varepsilon$ = random disturbance term
$$Y \sim N(\alpha + \beta X, \sigma^2), \qquad \varepsilon \sim N(0, \sigma^2)$$
- Minimizing the sum of squared errors (SSE) gives the least squares estimates
$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$
$$a = \bar{Y} - b\bar{X}$$
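To make the estimates concrete, here is a minimal Python sketch (NumPy assumed; the mid semester/final exam scores from the correlation example above are reused purely as illustrative data) that computes b and a by the formulas just given and predicts Y for a new value of X.

```python
import numpy as np

# Illustrative data: mid semester (X) and final exam (Y) scores used earlier
x = np.array([31, 23, 41, 32, 29, 33, 28, 31, 31, 33], dtype=float)
y = np.array([31, 29, 34, 35, 25, 35, 33, 42, 31, 34], dtype=float)

n = len(x)
# Least squares slope and intercept
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()
print("fitted line: Y_hat = %.4f + %.4f X" % (a, b))

# Predicted final exam score for a mid semester score of 30
print("prediction at X = 30:", round(a + b * 30, 2))
```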
Example 1: The following data show the scores of 12 students in Accounting (X) and Statistics (Y) examinations.

(Table of the 12 pairs of scores X and Y together with the columns X^2, Y^2 and XY; Mean of X = 57.25, Mean of Y = 61.75.)

a) The coefficient of correlation (r) has a value of 0.92. This indicates that the two variables are positively correlated (Y increases as X increases).

b) The fitted regression equation is
$$\hat{Y} = 7.0194 + 0.9560X$$
For a student who scored X = 85 in Accounting, the predicted Statistics score is
$$\hat{Y} = 7.0194 + 0.9560(85) = 88.28$$
Exercise: A car rental agency is interested in studying the relationship between the distance driven in kilometers (Y) and the maintenance cost (X, in Birr) for their cars. The following summarized information is given, based on a sample of size 5.
$$\sum_{i=1}^{5} X_i^2 = 147{,}000{,}000, \qquad \sum_{i=1}^{5} Y_i^2 = 314$$
$$\sum_{i=1}^{5} X_i = 23{,}000, \qquad \sum_{i=1}^{5} Y_i = 36, \qquad \sum_{i=1}^{5} X_i Y_i = 212{,}000$$
a) Find the least squares regression equation of Y on X
b) Compute the correlation coefficient and interpret it.
c) Estimate the maintenance cost of a car which has been driven for 6 km
- To know how far the regression equation has been able to explain the variation in Y, we use a measure called the coefficient of determination ($r^2$), i.e.
$$r^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$$
where r is the simple correlation coefficient.
R-Square
- The $r^2$-value measures the percentage of variation in the values of the dependent variable that can be explained by the variation in the independent variable.
- The $r^2$-value varies from 0 to 1.
- A value of 0.7654 means that 76.54% of the variance in Y can be explained by the changes in X; the remaining 23.46% of the variation in Y is presumed to be due to random variability.
- $r^2$ gives the proportion of the variation in Y explained by the regression of Y on X.
- $1 - r^2$ gives the unexplained proportion and is called the coefficient of indetermination.
Example: For the above problem (Example 1), r = 0.9194, so $r^2 = (0.9194)^2 \approx 0.845$; about 84.5% of the variation in the Statistics scores is explained by the regression on the Accounting scores.
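For a fitted simple linear regression, $\sum(\hat{Y}_i - \bar{Y})^2 / \sum(Y_i - \bar{Y})^2$ equals the square of the simple correlation coefficient, and this can be verified numerically. The sketch below (NumPy assumed, again reusing the exam-score data from the earlier correlation example purely as illustration) computes $r^2$ both ways.

```python
import numpy as np

x = np.array([31, 23, 41, 32, 29, 33, 28, 31, 31, 33], dtype=float)
y = np.array([31, 29, 34, 35, 25, 35, 33, 42, 31, 34], dtype=float)

# Fit Y on X by least squares, then form the fitted values Y_hat
n = len(x)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# Coefficient of determination from the sums of squares, and r squared directly
r2_from_ss = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)
r = np.corrcoef(x, y)[0, 1]
print(round(r2_from_ss, 4), round(r**2, 4))   # the two values agree
```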
Multiple regression analysis examines the simultaneous combination of multiple factors to assess how, and to what extent, they affect a certain outcome.
The value being predicted is termed the dependent variable because its outcome or value depends on the behavior of the other variables. The values of the independent variables are usually ascertained from the population or sample.
The Model
The primary objective of regression is to develop a regression model that explains the relationship between two or more variables in a given population.
The multiple linear regression model with k predictor variables and a response Y can be written as:
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i$$
Where
Y = the dependent variable
$X_1, X_2, \ldots, X_k$ = the independent variables
$\beta_0$ = the regression constant (intercept)
$\beta_1, \beta_2, \ldots, \beta_k$ = the slope coefficients of the independent variables
$\varepsilon_i$ = the random error term
The above equation has one key feature. It assumes that all individuals are drawn from a single population with common population parameters. The term $\varepsilon_i$ is the residual or random error for individual i and represents the deviation of the observed value of the response for this individual from that expected by the model. These error terms are assumed to have a normal distribution with mean zero and variance $\sigma^2$.
The relationship between the response and each predictor should be linear; this can be assessed with scatter plots of Y against each $X_j$, j = 1, 2, …, k. If any plot suggests non-linearity, one may use a suitable transformation to attain linearity.
Another important assumption is the non-existence of multicollinearity: the independent variables should not be related among themselves. At a very basic level, this can be tested by computing the correlation coefficient between each pair of independent variables, as in the sketch after this list.
The error terms follow a normal distribution with constant variance (homoscedasticity), i.e. $\varepsilon_i \sim N(0, \sigma^2)$.
The values of the explanatory variables are fixed (non-stochastic).
There is no autocorrelation among the error terms, i.e. $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.
The error terms and the independent variables are independent, i.e. $Cov(X_j, \varepsilon) = 0$.
The matrix of explanatory variables has full rank k, where k is the number of parameters (columns), and k should be less than the number of observations (n).
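As a rough, illustrative check of the multicollinearity assumption, the pairwise correlations between the independent variables can be computed. The sketch below uses entirely made-up predictor values (NumPy assumed); off-diagonal values close to +1 or -1 would signal a multicollinearity problem.

```python
import numpy as np

# Hypothetical values of three independent variables for 6 observations
x1 = np.array([3.0, 4.0, 2.0, 5.0, 4.0, 3.0])       # e.g. number of bedrooms
x2 = np.array([2.0, 3.0, 1.0, 3.0, 2.0, 2.0])       # e.g. number of bathrooms
x3 = np.array([120., 180., 90., 200., 150., 130.])  # e.g. floor area

X = np.column_stack([x1, x2, x3])
# Correlation matrix of the predictors (one row/column per independent variable)
print(np.round(np.corrcoef(X, rowvar=False), 2))
```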
Examples:
• The selling price of a house can depend on the desirability of the location, the number of
bedrooms, the number of bathrooms, the year the house was built, the square footage of the lot
and a number of other factors.
• The height of a child can depend on the height of the mother, the height of the father, nutrition,
and environmental factors.
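The following minimal sketch (entirely hypothetical numbers, NumPy assumed) shows how such a model could be fitted by least squares for the house-price example, regressing selling price on the number of bedrooms, the number of bathrooms and the floor area.

```python
import numpy as np

# Hypothetical data: selling price (in thousands) with three predictors
bedrooms  = np.array([3, 4, 2, 5, 4, 3, 3, 4], dtype=float)
bathrooms = np.array([2, 3, 1, 3, 2, 2, 1, 3], dtype=float)
area      = np.array([120, 180, 90, 200, 150, 130, 110, 170], dtype=float)
price     = np.array([250, 360, 180, 410, 300, 265, 220, 345], dtype=float)

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(len(price)), bedrooms, bathrooms, area])

# Least squares estimates of beta_0, beta_1, beta_2, beta_3
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
print(np.round(beta, 3))

# Predicted price for a hypothetical 3-bedroom, 2-bathroom house with floor area 140
x_new = np.array([1.0, 3.0, 2.0, 140.0])
print(round(float(x_new @ beta), 1))
```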