Regression & Correlation
Chapter Outline
• Regression:
Scatter diagram. Equation of regression lines.
Applications of regression lines. Prediction of Y values
using regression line. Residuals. Standard error of
estimate. Multiple regression.
• Correlation:
Correlation coefficient. Properties of correlation
coefficient. Rank correlation. Multiple correlations.
The Scatter Diagram
If we plot the paired observations (Xi, Yi) on a graph,
the resulting set of points is called a scatter diagram.
The regression line passes through the point of means (X̄, Ȳ).
Plot a scatter diagram from the following data.
X : 10 20 30 40 50 58
Y : 1.0 1.5 0.5 0.7 1.4 0.9
Simple Linear Regression
The following equation describes the line fitted to the set of data:
Ŷi = a + bXi
and the line corresponding to the model is
μY|X = a + bX,
where a and b are unknown parameters.
Ŷi = Predicted value of Y for observation i
Xi = Value of X for observation i
a = Y-intercept
b = Slope or regression coefficient
Regression is a process by which we estimate the value of the dependent
variable on the basis of the independent variable. If Y is to be
estimated on the basis of X by means of some equation, we call
this equation the regression equation of Y on X. If X is to be
estimated on the basis of Y, the equation is called the regression
equation of X on Y. A regression line, also called a line of best fit,
is the line for which the sum of squares of the residuals is a minimum.
Y on X
Y = a + byx X (X as independent variable) or Ŷ = Ȳ + byx(X – X̄)
(ii) Unexplained variation = ΣY² − aΣY − bΣXY = Σ(Y − Ŷ)² = 1.1
(iii) Explained variation = Σ(Ŷ − Ȳ)²
Explained variation = Total variation – Unexplained variation
= 30 – 1.1 = 28.9
(iv) Coefficient of determination = r² = Explained variation / Total variation
= 28.9 / 30 = 0.963
(v) Coefficient of correlation = r = √0.963 = 0.981
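The decomposition above can be checked numerically. The sketch below assumes the data are X = 1, 2, 3, 4, 5 and Y = 2, 5, 6, 8, 9 (an assumption: these values reproduce the total variation of 30 and the unexplained variation of 1.1). The function name `variation_summary` is illustrative only.

```python
import math


def variation_summary(xs, ys):
    """Fit Y on X by least squares and split the total variation of Y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx                 # slope byx
    a = y_bar - b * x_bar           # intercept
    fitted = [a + b * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    explained = total - unexplained
    r_squared = explained / total
    r = math.copysign(math.sqrt(r_squared), b)  # r takes the sign of the slope
    return {"a": a, "b": b, "total": total, "unexplained": unexplained,
            "explained": explained, "r_squared": r_squared, "r": r}


# Assumed data (not stated next to the worked figures above).
summary = variation_summary([1, 2, 3, 4, 5], [2, 5, 6, 8, 9])
print(summary)
```

Because r is given the sign of the slope, the same function also works for data with a negative relationship.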
Example # 3: In an experiment to measure the stiffness of a spring, the
length of the spring under different loads was measured as follows:
X = Loads (lb) 3 5 6 9 10 12 15 20 22 28
Y = Length (in) 10 12 15 18 20 22 27 30 32 34
The normal equations for the regression line of X on Y are
ΣX = na + bΣY
ΣXY = aΣY + bΣY²
10a + 220b = 130
220a + 5486b = 3467
Solving these equations simultaneously, we get
b = 0.94, a = –7.68
Hence the desired estimated regression equation is
X = –7.68 + 0.94Y
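The two normal equations of this example can also be solved programmatically. The following is a minimal sketch that builds the sums from the spring data and applies Cramer's rule; the function name `fit_x_on_y` is an assumption made for illustration.

```python
def fit_x_on_y(xs, ys):
    """Solve  sum(X) = n*a + b*sum(Y)  and  sum(XY) = a*sum(Y) + b*sum(Y^2)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_yy = sum(y * y for y in ys)
    det = n * sum_yy - sum_y ** 2          # determinant of the 2x2 system
    a = (sum_x * sum_yy - sum_y * sum_xy) / det
    b = (n * sum_xy - sum_y * sum_x) / det
    return a, b


loads = [3, 5, 6, 9, 10, 12, 15, 20, 22, 28]        # X (lb)
lengths = [10, 12, 15, 18, 20, 22, 27, 30, 32, 34]  # Y (in)
a, b = fit_x_on_y(loads, lengths)
print(round(a, 2), round(b, 2))
```

Without intermediate rounding the intercept comes out near –7.67; the –7.68 quoted above results from substituting the rounded slope b = 0.94.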
(iii) The standard errors of estimate Sy.x and Sx.y are
Question # 1
The marks obtained by 10 students in Midterm exam (X)
and Final exam (Y) are given below:
X 20 22 18 16 14 12 9 25 24 25
Y 50 53 60 72 68 79 47 97 89 82
(i) Estimate the marks in the Final exam if a student who was sick
obtained 19 marks in the midterm exam. Ans: Ŷ = 41.64 + 1.52X, 70.52
Ans: (i) Ŷ = 24.086 + 0.957X (ii) Ŷ = – 652 + 4.8X, (iii) Ŷ = 0.548 + 0.636X
Question # 5 Match the description in the left column with a description
in the right column.
1. Regression line
2. Residual
3. The Y-value of a data point corresponding to Xi
4. The Y-value for a point on the regression line corresponding to Xi
a. Ŷi
b. The line of best fit
c. Yi
d. The difference between the Y-value of the data point and the Y-value
on the line for the same X-value.
Question # 6 A study of the relationship between the IQ’s of husbands and
wives yielded the least-squares equation Ŷ = 48 + 0.5X. Given
that this equation is based on the following data:
X 90 114 102
Y 90 102 Y3
Y 9 6 8 5 2
Question # 13 A study was made by a retail merchant to determine the relation
between weekly advertising expenditures (X) and sales (Y).
X 2 15 30 10 20
Y 7 50 100 40 70
(i) Estimate a and b for the linear regression curve μY|X = a + bX.
(ii) Find a point estimate of μY|35.
Ans: (i) Ŷ = 2.94 + 3.28X (ii) 117.74
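Multiple Regression
Model with p independent variables and an error term e:
Y = a + b1X1 + b2X2 + ... + bpXp + e
Estimated regression equation:
Ŷ = a + b1X1 + b2X2 + ... + bpXp
With two independent variables:
Ŷ = a + b1X1 + b2X2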
Minitab/SPSS Output
                  Coefficients
Intercept         562.1510092
X1 Temperature    -5.436580588
X2 Insulation     -20.01232067
Ŷ = 562.151 − 5.437 X1 − 20.012 X2
For each one-degree increase in temperature, the average amount of
heating oil used decreases by 5.437 gallons, holding insulation constant.
For each one-inch increase in insulation, the average amount of
heating oil used decreases by 20.012 gallons, holding temperature constant.
Example: Cont.
Using the Equation to Make Predictions
Estimate the average amount of heating oil used
for a home if the average temperature is 30° and
the insulation is 6 inches.
Ŷ = 562.151 − 5.437 X1 − 20.012 X2
= 562.151 − 5.437(30) − 20.012(6)
= 278.969
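The substitution above amounts to an intercept plus a sum of coefficient-times-value terms. A minimal sketch follows; the helper name `predict` is hypothetical, and the coefficients are the rounded ones from the fitted equation.

```python
def predict(intercept, coefficients, values):
    """Evaluate Y-hat = a + b1*X1 + b2*X2 + ... for one set of X values."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Heating-oil equation: temperature 30 degrees, insulation 6 inches.
heating_oil = predict(562.151, [-5.437, -20.012], [30, 6])
print(round(heating_oil, 3))
```

The same helper can be reused for any of the prediction questions that give a fitted multiple regression equation.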
Example # 5: A businessman wants to predict the incomes of
restaurants, using two independent variables: the number of
restaurant employees and the restaurant floor area. He collected
the following data.
Income (000) Y Floor area(000 sq.ft) X1 No of employees X2
30.00 10 15
22.00 5 8
16.00 10 12
7.00 3 7
14.00 2 10
Calculate the estimated multiple linear regression equation for the above
data. Predict the income when the floor area is 4 (000 sq.ft) and there
are 7 employees.
        Y    X1   X2   X1²   X2²   X1X2   X1Y   X2Y
        30   10   15   100   225   150    300   450
        22   5    8    25    64    40     110   176
        16   10   12   100   144   120    160   192
        7    3    7    9     49    21     21    49
        14   2    10   4     100   20     28    140
Total   89   30   52   238   582   351    619   1007
Ŷ = −1.33 + 0.38X1 + 1.62X2
Ŷ = −1.33 + 0.38(4) + 1.62(7) = 11.53
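The coefficients of this example can be reproduced from the data. The sketch below is one possible approach, not necessarily the one used on the slide: it solves the least-squares equations for two independent variables using sums of squares and products about the means. The function name `fit_two_predictors` is an assumption.

```python
def fit_two_predictors(y, x1, x2):
    """Least-squares fit of Y-hat = a + b1*X1 + b2*X2."""
    n = len(y)
    m_y, m_1, m_2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    # Sums of squares and cross-products measured from the means.
    s11 = sum((u - m_1) ** 2 for u in x1)
    s22 = sum((v - m_2) ** 2 for v in x2)
    s12 = sum((u - m_1) * (v - m_2) for u, v in zip(x1, x2))
    s1y = sum((u - m_1) * (w - m_y) for u, w in zip(x1, y))
    s2y = sum((v - m_2) * (w - m_y) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = m_y - b1 * m_1 - b2 * m_2
    return a, b1, b2


income = [30, 22, 16, 7, 14]
floor_area = [10, 5, 10, 3, 2]
employees = [15, 8, 12, 7, 10]
a, b1, b2 = fit_two_predictors(income, floor_area, employees)
prediction = a + b1 * 4 + b2 * 7   # floor area 4, 7 employees
print(round(a, 2), round(b1, 2), round(b2, 2), round(prediction, 2))
```

Carrying full precision gives an intercept close to −1.30 rather than −1.33; the difference comes from rounding b1 and b2 before computing a.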
Question # 14
Given the estimated linear model Ŷ = 10 – 2X1 – 14X2 + 6X3
(i) What is the change in Ŷ when X1 increases by 4?
(ii) What is the change in Ŷ when X3 decreases by 1?
(iii) What is the change in Ŷ when X2 decreases by 2?
Ans: (i) Ŷ decreases by 8 (ii) Ŷ decreases by 6 (iii) Ŷ increases by 28
Question # 15
Given the estimated linear model Ŷ = 10 + 2X1 + 12X2 + 8X3
(i) What is the change in Ŷ when X1 increases by 4?
(ii) What is the change in Ŷ when X3 increases by 1?
(iii) What is the change in Ŷ when X2 increases by 2?
Ans: (i) Ŷ increases by 8 (ii) Ŷ increases by 8 (iii) Ŷ increases by 24
Question # 16
A researcher has determined that a significant relationship exists among
an employee’s age (x1), grade point average (x2), and income (Y). The
multiple regression equation is Ŷ = -34127 + 132x1 + 20805x2. Predict the
income of a person who is 32 years old and has a GPA of 3.4.
Ans: Ŷ = $40834
Question # 17
A manufacturer found that a significant relation exists among the numbers
of hours an assembly line employee works per shift (x1), the total number
of items produced (x2), the total number of defective items produced (Y).
The multiple regression equation is Ŷ = 9.6 + 2.2x1 – 1.08x2. Predict the
number of defective items produced by an employee who has worked
nine hours and produced 24 items. Ans: Ŷ = 3.48
Question # 18
A real estate agent found that there is a significant relationship among the
numbers of acres on a farm (x1), the numbers of rooms in a farmhouse
(x2), and the selling price in thousands of dollars (Y) of farms in a specific
area. The regression equation is Ŷ = 44.9 – 0.0266x1 + 7.56x2. Predict the
selling price of a farm that has 371 acres and a farmhouse with six rooms.
Ans: Ŷ = $80.3914 thousand
Question # 19
A medical researcher found a significant relationship among a person’s
age (x1), cholesterol level (x2), sodium level of the blood (x3), and systolic
blood pressure (Y). The regression equation is Ŷ = 97.7 + 0.691x1 + 219x2
– 299x3. Predict the blood pressure of a person who is 35 years old and
has a cholesterol level of 194 milligrams per decilitre (mg/dl) and a sodium
blood level of 142 milliequivalents per litre (mEq/l).
Ans: Ŷ = 149.885 ≈ 150
Correlation Coefficient Formulae
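1.
$$ r = \frac{\dfrac{\sum XY}{n} - \dfrac{\sum X}{n}\cdot\dfrac{\sum Y}{n}}{\sqrt{\left[\dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2\right]\left[\dfrac{\sum Y^2}{n} - \left(\dfrac{\sum Y}{n}\right)^2\right]}} $$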
Coefficient of Determination
The coefficient of determination measures the proportion of the variation of the
dependent variable that is explained by the regression line and the independent
variable. The symbol for the coefficient of determination is r².
Range of Values for the Correlation Coefficient
Strong negative relationship     No linear relationship     Strong positive relationship
-1             -0.5             0             +0.5             +1
How would you explain the following values of the correlation coefficient ‘r’?
-1      Perfect negative correlation between the variables.
+1      Perfect positive correlation between the variables.
0       No linear correlation between the variables.
0.92    Strong positive correlation between the variables.
-0.88   Strong negative correlation between the variables.
0.2     Weak positive correlation between the variables.
-2      This value of ‘r’ is not possible.
Find the correlation coefficient from the given regression coefficients.
1. 1.2 and 0.6
2. -0.76 and -0.82
3. 0.02 and 0.56
Ans:
1. r = 0.8485 Strong positive
2. r = -0.7894 Strong negative
3. r = 0.1058 Weak positive
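These answers follow from the relation r = ±√(byx · bxy), where r takes the common sign of the two regression coefficients. A minimal sketch, with a hypothetical function name:

```python
import math


def r_from_regression_coefficients(b_yx, b_xy):
    """Return r = sign * sqrt(b_yx * b_xy); both coefficients share one sign."""
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    if product > 1:
        raise ValueError("product of regression coefficients cannot exceed 1")
    return math.copysign(math.sqrt(product), b_yx)


for pair in [(1.2, 0.6), (-0.76, -0.82), (0.02, 0.56)]:
    print(pair, round(r_from_regression_coefficients(*pair), 4))
```

The two checks reject inputs that cannot come from a valid pair of regression lines, since r² can never be negative or greater than 1.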
Example # 6 From the following data find the correlation coefficient for
the advertising expenditures (1000s of $) and company sales (1000s of $).
What can you conclude? X : 1 2 3 4 5 Y : 2 5 6 8 9
X = Advertising expenses, Y = Company sales
        X    Y    XY    X²   Y²
        1    2    2     1    4
        2    5    10    4    25
        3    6    18    9    36
        4    8    32    16   64
        5    9    45    25   81
Total   15   30   107   55   210
$$ r = \frac{\dfrac{107}{5} - \left(\dfrac{15}{5}\right)\left(\dfrac{30}{5}\right)}{\sqrt{\left[\dfrac{55}{5} - \left(\dfrac{15}{5}\right)^2\right]\left[\dfrac{210}{5} - \left(\dfrac{30}{5}\right)^2\right]}} $$
r = 0.9815
Because ‘r’ is close to one, there
is a strong positive linear
correlation. As the amount spent
on advertising increases, the
company sales also increase.
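The calculation in this example can be sketched in code using the same "mean of products minus product of means" form of the formula. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient from means of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    mean_y2 = sum(y * y for y in ys) / n
    numerator = mean_xy - mean_x * mean_y
    denominator = math.sqrt((mean_x2 - mean_x ** 2) * (mean_y2 - mean_y ** 2))
    return numerator / denominator


advertising = [1, 2, 3, 4, 5]
sales = [2, 5, 6, 8, 9]
r = pearson_r(advertising, sales)
print(round(r, 4))
```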
Question # 20: If the equations of the least-squares regression lines are:
(i) Y = 20.8 – 0.219X (Y on X), X = 16.2 – 0.785Y (X on Y)
(ii) Y = 2.64 + 0.648X (Y on X), X = -1.91 + 0.917Y (X on Y)
(iii) Y = 15 – 1.96X (Y on X), Y = 15.91 – 2.22X (X on Y)
(iv) Y = 14 + 0.75X (Y on X), Y = 6 + 4X (X on Y)
Find the coefficient of correlation in each case.
Ans: (i) r = -0.415 (ii) r = 0.771 (iii) r = -0.940 (iv) r = 0.433
Question # 25:
The following table gives the distribution of the total population and
those who are wholly or partially blind among them. Find out if there is
any relation between age and blindness.
Age        No. of persons (thousands)   Blind
0 – 9      100                          55
10 – 19    60                           40
20 – 29    40                           40
30 – 39    36                           40
40 – 49    24                           36
50 – 59    11                           22
60 – 69    6                            18
70 – 79    3                            15
Hint: First calculate the blindness per lakh and then correlate with the
midpoints of age groups.
Ans: r = 0.898. Correlation is positive and high, implying that blindness
increases with age.
PRACTICE
(Basic Skills & Concepts)
• What is the general form of the regression line used in statistics?
• Y = a + bX
PRACTICE
True or False
1. A correlation coefficient of -1 implies a perfect linear relationship between the
variables.
2. It is not possible to have a significant correlation by chance alone.
3. The range of ‘r’ is −∞ to +1.
4. Regression equation has two dependent variables.
Answers: 1. True 2. False 3. False 4. False
Rank Correlation
When numerical measurement of a variable is not possible, the observations
are ranked according to the quality they possess. The correlation
obtained between two such sets of ranks is known as rank correlation,
denoted by rs. Its limits are the same as those of simple correlation,
i.e. ±1. It is often called Spearman’s rank correlation coefficient.
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
where d = x – y is the difference between the two ranks of an observation.
Multiple Correlation
Multiple correlation measures the degree of relationship between the
combined influence of a group of variables and a variable which is not
included in that group. Its limits are zero and one, i.e.
0 ≤ R1.23, R2.13, R3.12 ≤ 1.
$$ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}} $$
$$ R_{2.13} = \sqrt{\frac{r_{21}^2 + r_{23}^2 - 2 r_{21} r_{23} r_{13}}{1 - r_{13}^2}} $$
$$ R_{3.12} = \sqrt{\frac{r_{31}^2 + r_{32}^2 - 2 r_{31} r_{32} r_{12}}{1 - r_{12}^2}} $$
where r12 = r21, r13 = r31, r23 = r32
Partial Correlation
It measures the degree of linear relationship between a dependent
variable and one particular independent variable, when all other
independent variables involved are held constant.
Its limits are ±1, i.e. –1 ≤ r12.3, r13.2, r23.1 ≤ +1.
$$ r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} $$
$$ r_{13.2} = \frac{r_{13} - r_{12} r_{32}}{\sqrt{(1 - r_{12}^2)(1 - r_{32}^2)}} $$
$$ r_{23.1} = \frac{r_{23} - r_{21} r_{31}}{\sqrt{(1 - r_{21}^2)(1 - r_{31}^2)}} $$
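Both the multiple and the partial correlation formulas take the three pairwise correlations as input, so they can be sketched as two small functions. The function names and the input values (0.5, 0.8, 0.4) are chosen here for illustration only.

```python
import math


def multiple_r(r_ab, r_ac, r_bc):
    """R_a.bc: correlation of variable a with b and c taken together."""
    return math.sqrt((r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc)
                     / (1 - r_bc ** 2))


def partial_r(r_ab, r_ac, r_bc):
    """r_ab.c: correlation of a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Illustrative inputs: r12 = 0.5, r13 = 0.8, r23 = 0.4.
r12, r13, r23 = 0.5, 0.8, 0.4
R1_23 = multiple_r(r12, r13, r23)   # variable 1 on variables 2 and 3
r12_3 = partial_r(r12, r13, r23)    # variables 1 and 2, holding 3 constant
print(round(R1_23, 3), round(r12_3, 3))
```

The other coefficients are obtained by permuting the arguments, for example `partial_r(r23, r12, r13)` for r23.1 and `multiple_r(r13, r23, r12)` for R3.12.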
Example # 7 From the following data find the Spearman’s rank
correlation coefficient.
X : 11 13 15 12 14 Y : 4 5 3 8 9
        X    Y    a    b    d = a – b   d²
        11   4    5    4    1           1
        13   5    3    3    0           0
        15   3    1    5    -4          16
        12   8    4    2    2           4
        14   9    2    1    1           1
Total                                   22
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 22}{5(25 - 1)} = -0.1 $$
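The ranking and the formula of this example can be sketched as follows. The helper names are hypothetical, rank 1 is given to the largest value as in the table, and the simple ranking helper assumes there are no tied values.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; valid only when no values are tied."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(xs, ys):
    """Spearman's rs = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied data."""
    a = ranks_descending(xs)
    b = ranks_descending(ys)
    n = len(xs)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rs = spearman_rs([11, 13, 15, 12, 14], [4, 5, 3, 8, 9])
print(round(rs, 4))
```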
Question # 26: Given r12 = 0.492, r13 = 0.927, r23 = 0.758 find all partial
and multiple correlation coefficients.
1. r12.3 = - 0.86
2. r23.1 = 0.92
3. r13.2 = 0.98
4. R1.23 = 0.98
5. R2.13 = 0.94
6. R3.12 = 0.99
Rank Correlation for Tied Ranks
For each group of t tied ranks, the correction
$$ T = \frac{t^3 - t}{12} $$
is added to Σd², so that
$$ r_s = 1 - \frac{6\left[\sum d^2 + T\right]}{n(n^2 - 1)} $$
Example # 8: Two members of a selection committee rank eight persons
according to their suitability for promotion as follows. Calculate the rank
correlation.
Persons    A    B     C     D    E    F    G    H
Member 1   1    2.5   2.5   4    5    6    7    8
Member 2   2    4     1     3    6    6    6    8
        a     b    d      d²
        1     2    -1     1
        2.5   4    -1.5   2.25
        2.5   1    1.5    2.25
        4     3    1      1
        5     6    -1     1
        6     6    0      0
        7     6    1      1
        8     8    0      0
Total                     8.5
T = (1/12)(2³ – 2) + (1/12)(3³ – 3) = 2.5
rs = 1 – 6[8.5 + 2.5] / [8(64 – 1)] = 0.869
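The tie correction of this example can be sketched in code. The sketch takes the already-assigned ranks as input and counts how often each rank value repeats; the function names are assumptions made for illustration.

```python
from collections import Counter


def tie_correction(ranks):
    """Sum of (t^3 - t)/12 over every group of t equal ranks."""
    counts = Counter(ranks)
    return sum((t ** 3 - t) / 12 for t in counts.values() if t > 1)


def spearman_with_ties(ranks_a, ranks_b):
    """rs = 1 - 6*(sum(d^2) + T) / (n*(n^2 - 1)), T taken over both rankings."""
    n = len(ranks_a)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(ranks_a, ranks_b))
    correction = tie_correction(ranks_a) + tie_correction(ranks_b)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))


member_1 = [1, 2.5, 2.5, 4, 5, 6, 7, 8]
member_2 = [2, 4, 1, 3, 6, 6, 6, 8]
rs_tied = spearman_with_ties(member_1, member_2)
print(round(rs_tied, 3))
```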
Question # 27: Rank the values and hence find a rank correlation coefficient
between the two sets.
X 7.4 9.0 11.0 2.5 4.6 6.5
Y 8.5 6.1 2.4 6.7 12.6 3.3
Ans: rs = -0.60
Question # 28: Rank the values and hence find a rank correlation coefficient
between the two sets.
X 98 47 63 98 55 40 69 77 63 50 63 99
Y 22 32 18 30 22 18 25 27 35 38 24 22
Ans: rs = -0.0699, T = 5, Σd² = 301
Question # 29: Find R1.23 and r12.3 from the correlation matrix.
$$ \Delta = \begin{pmatrix} 1 & 0.5 & 0.8 \\ 0.5 & 1 & 0.4 \\ 0.8 & 0.4 & 1 \end{pmatrix} $$
Ans: R1.23 = 0.824, r12.3 = 0.327
Question # 30: Given r12 = 0.60, r13 = 0.70, r23 = 0.65, find partial correlation
coefficient between X2 and X3 keeping X1 constant. Also find
multiple correlation coefficient between the variable X3 and the
two independent variables X1 and X2.
Answers
r23.1 = 0.4026
R3.12 = 0.7567
Question # 31: Rank the values and hence find a rank correlation coefficient
between the two sets.
(X) 93 97 94 92 93 97 94
(Y) 44 48 44 47 42 44 48
Answers
rs = 0.2678, T = 4, Σd² = 37
Serial Correlation
It is defined as the correlation between observations ordered in time.
The correlation between yt and yt+1, i.e. the correlation between
successive overlapping pairs, is called the serial correlation of first
order. It is also known as the coefficient of autocorrelation at lag 1.
$$ r_1 = \frac{\sum_{t=1}^{n-1} (Y_t - \bar{Y})(Y_{t+1} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2} $$
where $\bar{Y} = \sum Y / n$.
Autocorrelation: Example # 9
The Office Concept Corp. has acquired a number of office
units (in thousands of square feet) over the last 16 years.
Calculate the first-order serial correlation.
Yt : 1.6 0.8 1.2 0.5 0.9 1.1 1.1 0.6 1.5 0.8 0.9 1.2 0.5 1.3 0.8 1.2
Working table columns: Yt, Yt+1, Yt – Ȳ, Yt+1 – Ȳ, (Yt – Ȳ)(Yt+1 – Ȳ), (Yt – Ȳ)²
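The working columns of this example map directly onto a short computation. The sketch below applies the first-order serial correlation formula to the sixteen yearly values; the function name `serial_correlation_lag1` is illustrative.

```python
def serial_correlation_lag1(series):
    """Lag-1 autocorrelation: products of successive deviations over
    the sum of squared deviations from the overall mean."""
    n = len(series)
    mean = sum(series) / n
    deviations = [y - mean for y in series]
    # Pairs (Y_t, Y_{t+1}) for t = 1 .. n-1.
    numerator = sum(deviations[t] * deviations[t + 1] for t in range(n - 1))
    denominator = sum(d ** 2 for d in deviations)
    return numerator / denominator


office_units = [1.6, 0.8, 1.2, 0.5, 0.9, 1.1, 1.1, 0.6,
                1.5, 0.8, 0.9, 1.2, 0.5, 1.3, 0.8, 1.2]
r1 = serial_correlation_lag1(office_units)
print(round(r1, 4))
```

A negative result indicates that a year above the mean tends to be followed by a year below it, and vice versa.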
(8) When two regression coefficients have same algebraic signs, then r is:
(A) Positive (B) Zero
(C) Negative (D) According to signs
(9) If X is measured in rupees and Y is measured in dollars, then correlation coefficient
r has the unit:
(A) Dollars (B) Rupees
(C) No unit (D) Both (A) & (B)
(10) When two variables move in the same direction, then the correlation is:
(A) Positive (B) Negative
(C) Fractional (D) None of these
Answer: 1. D 2. D 3. B 4. A 5. D 6. D 7. B 8. D 9. C 10. A