Chapter 1: Introduction to Regression Analysis
Introduction
The emphasis of this course is on understanding and modelling linear relationships.
Solution
Test statistic:
\[
\chi^2 = \sum \frac{(O - E)^2}{E} = 8.9546
\]
Conclusion: Since χ² = 8.9546 > 3.84, we reject H0 and conclude that there is
gender bias, that is, admission depends on gender.
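The test statistic above can be sketched in a few lines of Python. This is a minimal illustration, not the calculation from the example: the 2x2 counts are hypothetical, the function name `chi_square_statistic` is ours, and the critical value 3.84 assumes 1 degree of freedom at α = 0.05.

```python
def chi_square_statistic(observed):
    """Return the chi-square statistic, sum of (O - E)^2 / E, for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    return chi2


# Hypothetical 2x2 counts, used only to illustrate the calculation.
table = [[10, 20], [30, 40]]
chi2 = chi_square_statistic(table)
critical_value = 3.84  # assumed: 1 degree of freedom, alpha = 0.05
decision = "reject H0" if chi2 > critical_value else "fail to reject H0"
print(round(chi2, 4), decision)
```

For these hypothetical counts the statistic is well below 3.84, so H0 would not be rejected.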
Y             X
Predictand    Predictor
Regressand    Regressor
Endogenous    Exogenous
Target        Control
Defn: Correlation
Correlation is the intensity or strength of the relationship between two
variables.
Note: Correlation does not describe how two variables are related; it
measures only the strength of their relationship.
When to Apply Regression Analysis
There are conditions which must be satisfied before we can apply regression
analysis
3. Cross-tabulations
Scatter Plot
A scatter plot is a plot of one variable against another. We may use three
variables to get a three-dimensional plot. By looking at the scatter plot, we
get the visual impression about the relationship between the variables, that
is, whether the variables are linearly related or otherwise.
The above scatter diagram shows that X and Y are linearly related.
Correlation
In correlation, both X and Y are random variables and are of equal interest.
We want to determine whether or not there is a linear association between
these two random variables. The most commonly used measure of linear
association between two random variables is the Pearson product moment
correlation coefficient, ρ. Other measures of association include Spearman's
rank correlation coefficient and Kendall's tau. The parameter ρ is defined in
terms of the covariance between X and Y, where covariance is a measure of the
manner in which X and Y vary together.
Defn: Covariance
Let X and Y be random variables with means µX and µY respectively. The
covariance between X and Y, denoted Cov(X, Y), is given by:
\[
\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y)
\]
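Note
1. If large values of X tend to be associated with large values of Y, and
small values of X with small values of Y, then we have positive
covariance.
2. If the reverse of (1) above is true, that is, small values of X tend to be
associated with large values of Y and vice versa, then we have negative
covariance.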
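The identity Cov(X, Y) = E(XY) − E(X)E(Y) can be checked numerically. The sketch below treats a small hypothetical data set as the whole population, so expectations become averages over n; the function names are ours.

```python
def mean(values):
    """Arithmetic mean, used here as the expectation over a finite population."""
    return sum(values) / len(values)


def covariance_definition(x, y):
    """Cov(X, Y) as the mean product of deviations from the means."""
    mx, my = mean(x), mean(y)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])


def covariance_shortcut(x, y):
    """Cov(X, Y) as E(XY) - E(X)E(Y)."""
    return mean([xi * yi for xi, yi in zip(x, y)]) - mean(x) * mean(y)


# Hypothetical data, chosen only to illustrate the identity.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov_def = covariance_definition(x, y)
cov_short = covariance_shortcut(x, y)
print(round(cov_def, 4), round(cov_short, 4))
```

Both forms agree, and the positive value reflects that larger x values tend to go with larger y values in this data set.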
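Covariance is unbounded, that is, it can assume any real value. Its magnitude
also depends on the units in which X and Y are measured, so on its own it is
hard to interpret. To correct this problem, we divide the covariance by the
square root of Var(X)Var(Y) to form Pearson's correlation coefficient:
\[
\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}
\]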
ρ lies between -1 and 1 inclusive.
Note
1. ρ = 1 implies a perfect positive correlation between X and Y .
2. ρ = −1 implies a perfect negative correlation between X and Y .
3. ρ = 0 implies no correlation, that is, X and Y are uncorrelated. This
simply tells us that there is no linear association between X and Y. It
does not mean that X and Y are unrelated. If a relationship exists
between the variables, it is not linear; it may, for example, be
quadratic.
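Point 3 can be illustrated with a short sketch. Here Y is an exact quadratic function of X, yet the sample correlation coefficient is zero. The data are hypothetical, and `pearson_r` is our own helper based on the usual computational formula for the sample coefficient r.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data: y depends on x exactly, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]
r = pearson_r(x, y)
print(r)  # 0.0
```

The x values are symmetric about zero, so the positive and negative products cancel and r is zero even though Y is completely determined by X.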
Similarly,
\[
\mathrm{Var}(Y) = E(Y^2) - E^2(Y)
= \frac{\sum_{i=1}^{n} y_i^2}{n} - \left(\frac{\sum_{i=1}^{n} y_i}{n}\right)^2
= \frac{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n^2}
\]
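Thus
\[
\hat{\rho} = r =
\frac{\dfrac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n^2}}
{\sqrt{\left(\dfrac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n^2}\right)
\left(\dfrac{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n^2}\right)}}
\]
\[
= \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]
\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}
\]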
Since r lies between -1 and 1, the following are guidelines on the
interpretation of r:
r ∈ (0.7, 1) or r ∈ (−1, −0.7) implies a strong or high correlation
r ∈ (0.5, 0.7) or r ∈ (−0.7, −0.5) implies a moderate correlation
r ∈ (0, 0.5) or r ∈ (−0.5, 0) implies a weak correlation
Example 1.1 The following data set relates maize usage to the number
of animals on 10 farms surveyed.
Calculate Pearson's product moment correlation coefficient (r) and
interpret your result.
Solution
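\[
r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]
\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}
\]
From the table, the sample statistics are:
\[
n = 10, \quad \sum_{i=1}^{n} x_i = 58, \quad \sum_{i=1}^{n} y_i = 240, \quad
\sum_{i=1}^{n} x_i y_i = 1554, \quad \sum_{i=1}^{n} x_i^2 = 378, \quad
\sum_{i=1}^{n} y_i^2 = 6436
\]
Therefore, r is given by
\[
r = \frac{10(1554) - (58)(240)}{\sqrt{\left(10(378) - 58^2\right)\left(10(6436) - 240^2\right)}}
= \frac{1620}{\sqrt{(416)(6760)}} = 0.9660 \quad \text{(4 d.p.)}
\]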
Comment: There is a high positive correlation between the number of animals (X)
and maize used (Y).
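The arithmetic in this solution can be checked with a short sketch that works directly from the summary statistics quoted above. The function name `pearson_r_from_sums` is ours; only the six sample statistics come from the example.

```python
from math import sqrt


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r computed from the sums used in the computational formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Sample statistics from Example 1.1.
r = pearson_r_from_sums(n=10, sum_x=58, sum_y=240,
                        sum_xy=1554, sum_x2=378, sum_y2=6436)
print(round(r, 4))  # 0.966
```

Working from the sums rather than the raw observations mirrors the manual calculation, so each intermediate quantity (1620, 416 and 6760) can be compared with the hand-worked values.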
Example 1.2 Using Example 1.1, test the significance of the calculated
Pearson’s correlation coefficient.
Solution
H0 : ρ = 0
H1 : ρ ≠ 0
Test statistic: r
Test statistic: r=0.9660
Manual Procedure
\[
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
\]
Example 1.3 Calculate Spearman's correlation coefficient between maize
used (Y) and number of animals (X) from Example 1.1.
\[
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \qquad \sum_{i=1}^{n} d_i^2 = 1.5
\]
Thus
\[
r_s = 1 - \frac{6(1.5)}{10(10^2 - 1)} = 1 - 0.0091 = 0.9909
\]
Comment: High positive correlation
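The manual procedure can be sketched as follows. The first call reproduces the calculation above from the quoted sum of squared rank differences. The second works from raw data that are hypothetical, since the farm data are not reproduced here. The helper names are ours, and tied values are assumed to receive their average rank.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upwards; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_from_d2(sum_d2, n):
    """Spearman's r_s from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_rs(x, y):
    """Spearman's r_s from raw data using the rank-difference formula."""
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return spearman_from_d2(sum_d2, len(x))


# Value from Example 1.3: sum of d_i^2 = 1.5 with n = 10.
rs_example = spearman_from_d2(sum_d2=1.5, n=10)
print(round(rs_example, 4))  # 0.9909

# Hypothetical raw data, used only to show the ranking step.
rs_hypothetical = spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rs_hypothetical, 4))  # 0.8
```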
Solution
H0 : ρs = 0
H1 : ρs ≠ 0
Test statistic: rs
Rejection criteria: We reject H0 if |rs| > rcrit. In our case n = 10 and,
testing at α = 0.05, we reject H0 if |rs| > 0.648.
Conclusion: Since rs = 0.9909 > 0.648, we reject H0 and conclude that the
correlation between maize usage and number of animals is significant.
Correlation Matrix
Correlation is defined for two variables; however, in some cases we have many
variables. Suppose that we have p variables, X1, X2, ..., Xp. Then one can
arrange the correlation coefficients for every pair of variables in a matrix,
known as a correlation matrix, as follows:
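\[
\rho =
\begin{pmatrix}
1 & \rho_{x_1 x_2} & \cdots & \rho_{x_1 x_p} \\
\rho_{x_2 x_1} & 1 & \cdots & \rho_{x_2 x_p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{x_p x_1} & \rho_{x_p x_2} & \cdots & 1
\end{pmatrix}
\]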
Here $\rho_{x_i x_j}$, for i, j = 1, 2, ..., p, represents the correlation
between variables Xi and Xj. We can tell which variables are correlated by
examining the correlation matrix.
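A sample version of the correlation matrix can be built by applying Pearson's r to every pair of variables. The sketch below uses plain Python and three hypothetical variables; the function names are ours, not part of any library.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def correlation_matrix(columns):
    """p x p matrix whose (i, j) entry is the correlation between columns i and j."""
    p = len(columns)
    return [[1.0 if i == j else pearson_r(columns[i], columns[j])
             for j in range(p)]
            for i in range(p)]


# Hypothetical observations on three variables X1, X2, X3.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
x3 = [5, 4, 3, 2, 1]
matrix = correlation_matrix([x1, x2, x3])
for row in matrix:
    print([round(value, 4) for value in row])
```

The matrix is symmetric with ones on the diagonal, so in practice only the entries above (or below) the diagonal need to be examined.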
Activity 1.1
1. In the field of organizational psychology, extensive study has been made
of different leadership styles. One researcher refers to two extremes as
authoritarian versus democratic; another refers to task-oriented versus
people-oriented; yet others have their own labels for these qualities.
Whatever the label, do these different styles affect the morale of the
subordinates? To address this issue, a researcher established a ranking
scale for worker morale, based on interviews, and grouped the workers
into low, acceptable and high morale categories. These were cross-
classified against the leadership style of the supervisor. The following
contingency table summarizes the results
                 LEADERSHIP STYLE
WORKER MORALE    Authoritarian    Democratic
Low              10               5
Acceptable       8                12
High             6                9
(e) Test, at the α = 0.01 level of significance, the significance of
Pearson's and Spearman's correlation coefficients.