
Regression & Correlation

The document discusses regression and correlation methods used to analyze the relationship between dust levels in the atmosphere and emergency room visits for respiratory disorders. Researchers found low but significant correlations between dust levels and bronchitis visits, and higher correlations for sinusitis visits when dust levels exceeded EPA limits.

Uploaded by

222041
Regression & Correlation

Chapter Outline
• Regression:
Scatter diagram. Equation of regression lines.
Applications of regression lines. Prediction of Y values
using regression line. Residuals. Standard error of
estimate. Multiple regression.
• Correlation:
Correlation coefficient. Properties of correlation
coefficient. Rank correlation. Multiple correlations.

Rizwan Yusuf Khan
Associate Professor
Do Dust Storms Affect Respiratory Health?
Southeast Washington state has a long history of
seasonal dust storms. Several researchers decided to see
what effect, if any, these storms had on the respiratory
health of the people living in the area. They undertook
(among other things) to see if there was a relationship
between the amount of dust and sand particles in the air
when the storms occur, and the number of hospital
emergency room visits for respiratory disorders at three
community hospitals in southeast Washington. Using
methods of correlation and regression, which are
explained in this chapter, they were able to determine the
effect of these dust storms on local residents.
The researchers correlated the dust pollutant levels in the
atmosphere and the number of daily emergency room
visits for several respiratory disorders, such as bronchitis,
sinusitis, asthma, and pneumonia. Using the Pearson
correlation coefficient, they found overall a significant but
low correlation, r = 0.13, for bronchitis visits only.
However, they found a much higher correlation value for
sinusitis, P-value = 0.08, when pollutant levels exceeded
maximums set by the Environmental Protection Agency
(EPA). In addition, they found statistically significant
correlation coefficients r = 0.94 for sinusitis visits and r =
0.74 for upper-respiratory-tract infection visits 2 days after
the dust pollutants exceeded the maximum levels set by
the EPA.
Purpose of Regression and Correlation
• Regression is used primarily for prediction.
Regression is used to predict the values of a
dependent (response) variable based on the values of
at least one independent (explanatory) variable.

• Correlation is used to measure the strength of the
association between numerical variables.
Correlation is a statistical method used to
determine whether a relationship exists
between variables. The correlation coefficient is denoted by r.
Terminology and Notation
Before we proceed further, let us dwell briefly on the
matter of terminology and notation. In the literature the
terms dependent variable and independent variable are
described variously. A representative list is:

Dependent Variable      Independent Variable
Explained Variable      Explanatory Variable
Predictand              Predictor
Regressand              Regressor
Response                Stimulus
Endogenous              Exogenous
Outcome                 Covariate
Controlled Variable     Control Variable
The Scatter Diagram
If we plot the paired observations (Xi, Yi) on a graph,
the resulting set of points is called a scatter diagram.
The least-squares regression line passes through the mean point (X̄, Ȳ).
Plot a scatter diagram from the following data.
X : 10 20 30 40 50 58
Y : 1.0 1.5 0.5 0.7 1.4 0.9
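
As a small illustration, the mean point of the data above can be computed directly. This is only a sketch; the names `xs`, `ys` and `mean_point` are chosen here for illustration and are not part of the slides.

```python
# Data from the scatter diagram exercise.
xs = [10, 20, 30, 40, 50, 58]
ys = [1.0, 1.5, 0.5, 0.7, 1.4, 0.9]


def mean_point(x, y):
    """Return (X-bar, Y-bar), the point a least-squares line passes through."""
    return sum(x) / len(x), sum(y) / len(y)


x_bar, y_bar = mean_point(xs, ys)
print(f"mean point = ({x_bar:.1f}, {y_bar:.1f})")
```

The printed point should agree with the mean point marked on the scatter diagram.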

[Figure: plot of all (Xi, Yi) pairs. X axis from 0 to 60, Y axis from 0 to 1.5, with the mean point (34.7, 1) marked.]
The scatter diagrams below show four possible relationships
between two variables:
(a) Positive linear relationship
(b) Negative linear relationship
(c) Curvilinear relationship
(d) No relationship

[Figure: four scatter diagrams, panels (a) to (d).]
Simple Linear Regression
The following equation describes the line fitted to a set of data:

\[ \hat{Y}_i = a + bX_i \]

and the line corresponding to the population model is

\[ \mu_{Y|X} = \alpha + \beta X, \]

where α and β are unknown parameters.
Ŷi = Predicted value of Y for observation i
Xi = Value of X for observation i
a = Y-intercept
b = Slope or regression coefficient
Regression is a process by which we estimate the value of the dependent
variable on the basis of the independent variable. If Y is to be
estimated on the basis of X by means of some equation, we call
this equation the regression equation of Y on X. If X is to be
estimated on the basis of Y, the equation is called the regression
equation of X on Y. A regression line, also called a line of best fit,
is the line for which the sum of squares of the residuals is a minimum.

Y on X

Supposed equation (X as independent variable):

\[ \hat{Y} = a + b_{yx}X \quad \text{or} \quad \hat{Y} = \bar{Y} + b_{yx}(X - \bar{X}) \]

Normal equations are:

\[ \sum Y = na + b_{yx}\sum X, \qquad \sum XY = a\sum X + b_{yx}\sum X^2 \]

Solving these normal equations simultaneously we get the values
of "a" and "byx". We can also calculate "a" and "byx" by
using the formulas

\[ b_{yx} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}} \quad \text{or} \quad b_{yx} = \frac{S_{xy}}{S_x^2} = r\,\frac{S_y}{S_x}, \qquad a = \bar{Y} - b_{yx}\bar{X} \]

Put these values of "a" and "byx" in the supposed equation; the required line is Ŷ = a + byx X. Also

\[ \text{Sum of errors} = \sum (Y - \hat{Y}) = 0 \]
\[ \text{Sum of squares of errors (SSE)} = \sum (Y - \hat{Y})^2 = \sum Y^2 - a\sum Y - b_{yx}\sum XY \]
\[ S_{y.x} = \sqrt{\frac{\sum Y^2 - a\sum Y - b_{yx}\sum XY}{n-2}} \]

X on Y

Supposed equation (Y as independent variable):

\[ \hat{X} = a + b_{xy}Y \]

Normal equations are:

\[ \sum X = na + b_{xy}\sum Y, \qquad \sum XY = a\sum Y + b_{xy}\sum Y^2 \]

Solving these normal equations simultaneously we get the values of "a" and
"bxy". We can also calculate "a" and "bxy" by using the formulas

\[ b_{xy} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum Y^2 - \frac{(\sum Y)^2}{n}} \quad \text{or} \quad b_{xy} = \frac{S_{xy}}{S_y^2} = r\,\frac{S_x}{S_y}, \qquad a = \bar{X} - b_{xy}\bar{Y} \]

Put these values of "a" and "bxy" in the supposed equation; the required line is X̂ = a + bxy Y. Also

\[ \text{Sum of errors} = \sum (X - \hat{X}) = 0 \]
\[ \text{Sum of squares of errors (SSE)} = \sum (X - \hat{X})^2 = \sum X^2 - a\sum X - b_{xy}\sum XY \]
\[ S_{x.y} = \sqrt{\frac{\sum X^2 - a\sum X - b_{xy}\sum XY}{n-2}} \]
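
The computational formulas for the slope, the intercept and the standard error of estimate can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names `fit_line` and `standard_error` and the five-point data set are assumptions made here. Swapping the arguments gives the regression of X on Y.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept a and slope b for predicting y from x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    b = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b


def standard_error(x, y, a, b):
    """Standard error of estimate: sqrt(SSE / (n - 2))."""
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return sqrt(sse / (len(x) - 2))


# Illustrative data (five paired observations).
x = [1, 2, 3, 4, 5]
y = [2, 5, 6, 8, 9]

a_yx, b_yx = fit_line(x, y)          # Y on X
s_yx = standard_error(x, y, a_yx, b_yx)
a_xy, b_xy = fit_line(y, x)          # X on Y: roles of x and y swapped

print(f"Y on X: Y-hat = {a_yx:.2f} + {b_yx:.2f} X, s_y.x = {s_yx:.4f}")
print(f"X on Y: X-hat = {a_xy:.2f} + {b_xy:.4f} Y")
```

Note that the two slopes are different numbers; their product equals r², which is why the two regression lines coincide only when the correlation is perfect.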
Example # 1: (i) Compute the least-squares regression line of Y on X for
the following data. (ii) Compute the standard error of estimate sy.x.
(iii) Estimate Y when X = 11. (iv) Show that Ʃ(Y – Ŷ) = 0.

X   1  2  3  4  5
Y   2  5  6  8  9

       X     Y    XY    X²     Ŷ    Y – Ŷ   (Y – Ŷ)²    Y²
       1     2     2     1    2.6   -0.6      0.36       4
       2     5    10     4    4.3    0.7      0.49      25
       3     6    18     9    6.0    0.0      0.00      36
       4     8    32    16    7.7    0.3      0.09      64
       5     9    45    25    9.4   -0.4      0.16      81
Sum   15    30   107    55     -     0        1.1      210

(i) Equation of the least-squares line is Ŷ = a + bX (supposed equation).

Normal equations are ƩY = na + bƩX, ƩXY = aƩX + bƩX²

5a + 15b = 30
15a + 55b = 107

Multiplying the first equation by 3 gives 15a + 45b = 90; subtracting the second
equation gives –10b = –17, so b = 1.7 and a = 0.9.

Ŷ = 0.9 + 1.7X

(ii) Standard error of estimate:

\[ s_{y.x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n-2}} = \sqrt{\frac{1.1}{3}} = 0.61 \]

(iii) Ŷ = 0.9 + 1.7(11) = 19.6
(iv) Ʃ(Y – Ŷ) = 0 (see the Y – Ŷ column of the table).
Example # 2: Using data from example # 1 calculate
(i) the total variation (ii) the unexplained variation
(iii) the explained variation (iv) the coefficient of determination
(v) the coefficient of correlation.

(i) Total variation:
\[ \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = 210 - \frac{(30)^2}{5} = 30 \]
(ii) Unexplained variation:
\[ \sum (Y - \hat{Y})^2 = \sum Y^2 - a\sum Y - b\sum XY = 1.1 \]
(iii) Explained variation:
\[ \sum (\hat{Y} - \bar{Y})^2 = \text{Total variation} - \text{Unexplained variation} = 30 - 1.1 = 28.9 \]
(iv) Coefficient of determination:
\[ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{28.9}{30} = 0.963 \]
(v) Coefficient of correlation = r = 0.981
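
The decomposition of the total variation can be checked numerically. The following is a sketch only; it takes the fitted line Ŷ = 0.9 + 1.7X from Example # 1 as given, and the variable names are illustrative.

```python
from math import copysign, sqrt

x = [1, 2, 3, 4, 5]
y = [2, 5, 6, 8, 9]
a, b = 0.9, 1.7  # fitted line of Example # 1, taken as given

n = len(y)
total = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n                  # sum (Y - Ybar)^2
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # sum (Y - Yhat)^2
explained = total - unexplained
r_squared = explained / total
r = copysign(sqrt(r_squared), b)  # r carries the sign of the slope

print(f"TV = {total:.1f}, UV = {unexplained:.1f}, EV = {explained:.1f}")
print(f"r^2 = {r_squared:.3f}, r = {r:.3f}")
```

Taking the sign of r from the slope matters when the regression line slopes downward, since the square root alone is always non-negative.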
Example # 3: In an experiment to measure the stiffness of a spring, the
length of the spring under different loads was measured as follows:
X = Load (lb)     3  5  6  9 10 12 15 20 22 28
Y = Length (in)  10 12 15 18 20 22 27 30 32 34

Find the regression equations appropriate for predicting
(i) the length, given the weight on the spring.
(ii) the weight, given the length of the spring.
(iii) Compute the standard errors of estimate Sy.x and Sx.y.
        X     Y     X²     Y²     XY
        3    10      9    100     30
        5    12     25    144     60
        6    15     36    225     90
        9    18     81    324    162
       10    20    100    400    200
       12    22    144    484    264
       15    27    225    729    405
       20    30    400    900    600
       22    32    484   1024    704
       28    34    784   1156    952
Sum   130   220   2288   5486   3467

(i) Equation of the least-squares line Y on X

Ŷ = a + bX (supposed equation)

Normal equations are
ƩY = na + bƩX
ƩXY = aƩX + bƩX²

10a + 130b = 220
130a + 2288b = 3467
Solving these equations simultaneously, we get
b = 1.02
a = 8.74
Hence the desired estimated regression equation is

Ŷ = 8.74 + 1.02X

(ii) Equation of the least-squares line X on Y

X̂ = a + bY (supposed equation)

Normal equations are
ƩX = na + bƩY
ƩXY = aƩY + bƩY²

10a + 220b = 130
220a + 5486b = 3467
Solving these equations simultaneously, we get
b = 0.94, a = –7.68
Hence the desired estimated regression equation is

X̂ = –7.68 + 0.94Y
(iii) Standard errors of estimate Sy.x and Sx.y are

\[ S_{y.x} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}} = \sqrt{\frac{5486 - (8.74)(220) - (1.02)(3467)}{10-2}} = 1.8323 \]

\[ S_{x.y} = \sqrt{\frac{\sum X^2 - a\sum X - b\sum XY}{n-2}} = \sqrt{\frac{2288 - (-7.68)(130) - (0.94)(3467)}{10-2}} = 1.8514 \]
Question # 1
The marks obtained by 10 students in Midterm exam (X)
and Final exam (Y) are given below:
X 20 22 18 16 14 12 9 25 24 25
Y 50 53 60 72 68 79 47 97 89 82
(i) Estimate the Final exam marks of a student who was sick and
obtained 19 marks in the Midterm exam. Ans: Ŷ = 41.64 + 1.52X, 70.52

(ii) A student who migrated after the Midterm exam obtained 75
marks in the Final exam. Estimate his marks in the Midterm exam
had he been admitted earlier. Ans: X̂ = 6.95 + 0.17Y, 19.70

Question # 2 Given the following sets of values:

X  0    1    2    3    4
Y  6.3  4.5  3.3  1.8  1.0
(i) Compute the least-squares regression equation for Y values on X values.
(ii) Find the trend values. (iii) Compute the standard error of estimate, sy.x
(iv) Compute the explained variation & coefficient of correlation.
Ans: a = 6.04, b = – 1.33, ƩY = 16.9, ƩX = 10, n = 5, ƩXY = 20.5, ƩY² = 75.07, ƩX² = 30, sy.x = 0.2938
TV = 17.948, UV = 0.259, EV = 17.689, r² = 0.9854, r = – 0.9927
Question # 3 Determine the estimated regression equation Ŷ = a + bX
in each of the following cases.

Ans: (i) Ŷ = 24.086 + 0.957X (ii) Ŷ = – 652 + 4.8X, (iii) Ŷ = 0.548 + 0.636X

Question # 4 Using data given in example # 3 (i)
(i) Compute Total variation & Unexplained variation
(ii) Compute the explained variation, coefficient of correlation & coefficient of
determination.
Ans: TV = 646, UV = 26.86, EV = 619.14, r = 0.9788, r² = 0.958
Question # 5 Match the description in the left column with a description
in the right column.

1. Regression line
2. Residual
3. The Y-value of a data point corresponding to Xi
4. The Y-value for a point on the regression line corresponding to Xi

a. Ŷi
b. The line of best fit
c. Yi
d. The difference between the Y-value of the data point and the Y-value on the line for the same X-value.
Question # 6 A study of the relationship between the IQ’s of husbands and
wives yielded the least-squares equation Ŷ = 48 + 0.5X. Given
that this equation is based on the following data:
X 90 114 102
Y 90 102 Y3

Where Y3 is missing, find the missing value. Ans: 105


Question # 7 Compute the least-squares regression equation of Y on X for the
following data. What is the regression coefficient and what does it
mean? Compute the standard error of estimate sy.x
X  8 10 12  5  6 13 15 16 17
Y 23 28 36 16 19 41 44 45 50
Ans: Ŷ = 1.47 + 2.831X. The estimated regression coefficient is b = 2.831, which indicates that the value
of Y increases by 2.831 units for a unit increase in X. sy.x = 1.52
Question # 8 The amounts of a chemical compound Y that dissolved in 100 grams
of water at various temperatures X were recorded as follows:

X (°C)    Y (grams), two measurements
   0        8     6
  15       12    10
  30       25    21
  45       31    33

(i) Find the equation of the regression line.
(ii) Estimate the amount of chemical that will dissolve in 100
grams of water at 50 °C. Ans: (i) Ŷ = 5.2 + 0.58X (ii) 34.2

Question # 9 Calculate the regression line for the following data, taking
(i) X as independent variable
(ii) Show that ƩY = ƩŶ and Ʃ(Y – Ŷ) = 0
(iii) Calculate Ʃ(Y – Ŷ)²
(iv) Verify that Ʃ(Y – Ŷ)² = ƩY² – aƩY – bƩXY
X 0 1   2   3   4
Y 1 1.8 3.3 4.5 6.3
Ans: Ŷ = 0.72 + 1.33X, Ʃ(Y – Ŷ)² = 0.2590
Question # 10
(i) A researcher collects the following data, which show the
age of a copy machine (X) and its monthly maintenance cost
(Y) for six machines. Find the least-squares regression lines Y
on X and X on Y.
(ii) Predict the maintenance cost for a 3-year-old machine.
(iii) Predict the age of a machine if its maintenance cost is $112.
Machine               A   B   C   D   E   F
Age, X (years)        1   2   3   4   4   6
Monthly cost ($), Y  62  78  70  90  93 103
Ans: (i) Ŷ = 55.59 + 8.13X, X̂ = – 5.76 + 0.11Y (ii) $79.98 (iii) 6.56 ≈ 7 years old.
Question # 11
A professor in the School of Business in a university polled a dozen
colleagues about the number of professional meetings they attended in the
past five years (X) and the number of papers they submitted to refereed
journals (Y) during the same period. The summary data are given as follows:

\[ \bar{X} = 4, \quad \bar{Y} = 12, \quad \sum_{i=1}^{12} X_i^2 = 232, \quad \sum_{i=1}^{12} X_i Y_i = 318 \]

Fit a simple linear regression model between X and Y by finding the
estimates of intercept and slope. Comment on whether attending more
professional meetings would result in publishing more papers.
Ans. Ŷ = 37.8 – 6.45X. It appears that attending professional meetings would not result in publishing more papers.
Question # 12
Fit a least squares line to the following data taking
(i) X as independent variable (ii) Y as independent variable.
X 2 3 3 0 2
Y 9 6 8 5 2
Ans. Ŷ = 4.66 + 0.67X, X̂ = 1.22 + 0.13Y

Question # 13 A study was made by a retail merchant to determine the relation
between weekly advertising expenditures (X) and sales (Y).
X 2 15  30 10 20
Y 7 50 100 40 70
(i) Estimate α and β for the linear regression line μY|X = α + βX.
(ii) Find a point estimate of μY|35. Ans: (i) Ŷ = 2.94 + 3.28X (ii) 117.74

Properties of Regression Coefficients

The regression coefficients have the following properties.
(i) Regression coefficients bYX and bXY have the same sign.
(ii) Regression coefficients are not symmetrical with respect to X and Y,
i.e. in general bYX ≠ bXY.
(iii) Regression coefficients are independent of origin but not of scale.
(iv) The geometric mean of the two regression coefficients is the coefficient of
correlation, i.e. r = ±√(byx · bxy), with the sign of the regression coefficients.
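Multiple Regression
The relationship between one dependent variable and two or more
independent variables is modelled as a linear function.

Population model:

\[ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \]

where Y is the dependent (response) variable, α is the population Y-intercept,
β1, ..., βp are the population slopes, X1, ..., Xp are the independent
(explanatory) variables and ε is the random error.

Sample model:

\[ Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p + e, \qquad \hat{Y} = a + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p \]

where Y is the dependent (response) variable for the sample, X1, ..., Xp are the
independent (explanatory) variables for the sample and e is the residual.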
Multiple Regression: Example # 4

Develop an equation for estimating the heating oil used for a single-family
home in the month of January, based on average temperature and amount
of insulation in inches.

Oil (gal)   Temp (°F)   Insulation (in)
275.30        40           3
363.80        27           3
164.30        40          10
 40.80        73           6
 94.30        64           6
230.90        34           6
366.70         9           6
300.60         8          10
237.80        23          10
121.40        63           3
 31.40        65          10
203.50        41           6
441.10        21           3
323.00        38           3
 52.50        58          10
Example (continued)

Ŷ = a + b1X1 + b2X2

Minitab/SPSS output:
                   Coefficients
Intercept          562.1510092
X1 Temperature     -5.436580588
X2 Insulation      -20.01232067

Ŷ = 562.151 − 5.437X1 − 20.012X2

For each one-degree increase in temperature, the average amount of
heating oil used decreases by 5.437 gallons, holding insulation constant.

For each one-inch increase in insulation, the use of heating oil
decreases by 20.012 gallons, holding temperature constant.
Example (continued)
Using the Equation to Make Predictions
Estimate the average amount of heating oil used
for a home if the average temperature is 30 °F and
the insulation is 6 inches.
Ŷ = 562.151 − 5.437X1 − 20.012X2
  = 562.151 − 5.437(30) − 20.012(6)
  = 278.969

The estimated heating oil used is 278.97 gallons.
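
The prediction step amounts to substituting values into the fitted equation. A minimal sketch follows, using the rounded coefficients quoted in the example; the function name `predict_oil` is hypothetical.

```python
def predict_oil(temp_f, insulation_in):
    """Estimated January heating oil use (gallons) from the fitted equation.

    Coefficients are the rounded values quoted in the example.
    """
    return 562.151 - 5.437 * temp_f - 20.012 * insulation_in


estimate = predict_oil(temp_f=30, insulation_in=6)
print(f"Estimated heating oil: {estimate:.2f} gallons")
```

Predictions of this kind are only reasonable for temperatures and insulation levels within the range of the observed data.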
Example # 5: A businessman wants to predict the incomes of
restaurants, using two independent variables: the number of
restaurant employees and the restaurant floor area. He collected
the following data.

Income (000) Y    Floor area (000 sq. ft) X1    No. of employees X2
   30.00                  10                          15
   22.00                   5                           8
   16.00                  10                          12
    7.00                   3                           7
   14.00                   2                          10

Calculate the estimated multiple linear regression equation for the above
data. Predict the income when the floor area is 4 (thousand sq. ft) and there are 7
employees.

       Y    X1    X2    X1²    X2²    X1X2    X1Y    X2Y
      30    10    15    100    225    150     300    450
      22     5     8     25     64     40     110    176
      16    10    12    100    144    120     160    192
       7     3     7      9     49     21      21     49
      14     2    10      4    100     20      28    140
Sum   89    30    52    238    582    351     619   1007

Ŷ = –1.33 + 0.38X1 + 1.62X2
For X1 = 4 and X2 = 7: Ŷ = –1.33 + 0.38(4) + 1.62(7) = 11.53
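
One way to obtain the coefficients is to build the normal equations from the column sums and solve them. The sketch below does this in plain Python with Gauss-Jordan elimination; the function names are illustrative, and this is not necessarily how the slides' figures were produced. Solved without intermediate rounding, the intercept comes out near –1.30, while the slide's –1.33 is what results when the slopes are first rounded to 0.38 and 1.62.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_multiple_regression(y, *xs):
    """Return [a, b1, ..., bp] from the normal equations (X'X) beta = X'y."""
    cols = [[1.0] * len(y)] + [[float(v) for v in x] for x in xs]
    A = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve_linear_system(A, rhs)


income = [30, 22, 16, 7, 14]      # Y, thousands
floor_area = [10, 5, 10, 3, 2]    # X1, thousand sq. ft
employees = [15, 8, 12, 7, 10]    # X2

a, b1, b2 = fit_multiple_regression(income, floor_area, employees)
prediction = a + b1 * 4 + b2 * 7
print(f"Y-hat = {a:.2f} + {b1:.2f} X1 + {b2:.2f} X2")
print(f"Predicted income for X1 = 4, X2 = 7: {prediction:.2f}")
```

For larger problems a numerical library would normally be used instead of hand-written elimination; the plain version is shown only to keep the link to the normal equations visible.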
Question # 14
Given the estimated linear model Ŷ = 10 – 2X1 – 14X2 + 6X3
(i) What is the change in Ŷ when X1 increases by 4?
(ii) What is the change in Ŷ when X3 decreases by 1?
(iii) What is the change in Ŷ when X2 decreases by 2?
Ans: (i) Ŷ decreases by 8 (ii) Ŷ decreases by 6 (iii) Ŷ increases by 28

Question # 15
Given the estimated linear model Ŷ = 10 + 2X1 + 12X2 + 8X3
(i) What is the change in Ŷ when X1 increases by 4?
(ii) What is the change in Ŷ when X3 increases by 1?
(iii) What is the change in Ŷ when X2 increases by 2?
Ans: (i) Ŷ increases by 8 (ii) Ŷ increases by 8 (iii) Ŷ increases by 24

Question # 16
A researcher has determined that a significant relationship exists among
an employee’s age (x1), grade point average (x2), and income (Y).The
multiple regression equation is Ŷ = -34127 + 132x1 + 20805x2. Predict the
income of a person who is 32 years old and has a GPA of 3.4.
Ans: Ŷ = $40834

Question # 17
A manufacturer found that a significant relation exists among the numbers
of hours an assembly line employee works per shift (x1), the total number
of items produced (x2), the total number of defective items produced (Y).
The multiple regression equation is Ŷ = 9.6 + 2.2x1 – 1.08x2. Predict the
number of defective items produced by an employee who has worked
nine hours and produced 24 items. Ans: Ŷ = 3.48

Question # 18
A real estate agent found that there is a significant relationship among the
numbers of acres on a farm (x1), the numbers of rooms in a farmhouse
(x2), and the selling price in thousands of dollars (Y) of farms in a specific
area. The regression equation is Ŷ = 44.9 – 0.0266x1 + 7.56x2. Predict the
selling price of a farm that has 371 acres and a farmhouse with six rooms.
Ans: Ŷ = $80.3914 thousand
Question # 19
A medical researcher found a significant relationship among a person’s
age (x1), cholesterol level (x2), sodium level of the blood (x3), and systolic
blood pressure (Y). The regression equation is Ŷ = 97.7 + 0.691x1 +219x2
– 299x3. Predict the blood pressure of a person who is 35 years old and
has a cholesterol level of 194 milligrams per decilitre (mg/dl) and a sodium
blood level of 142 milliequivalents per litre (mEq/l). Ans: Ŷ = 149.885 ≈ 150
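Correlation Coefficient Formulae

\[ 1.\quad r = \frac{\dfrac{\sum XY}{n} - \left(\dfrac{\sum X}{n}\right)\left(\dfrac{\sum Y}{n}\right)}{\sqrt{\left[\dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2\right]\left[\dfrac{\sum Y^2}{n} - \left(\dfrac{\sum Y}{n}\right)^2\right]}} \]

\[ 2.\quad r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} = \frac{\sum XY - n\bar{X}\bar{Y}}{n S_x S_y} \]

\[ 3.\quad r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n S_x S_y} \]

\[ 4.\quad r = \pm\sqrt{b_{xy} \cdot b_{yx}} \]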
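
Formulas 1 and 3 are algebraically equivalent, which can be checked numerically. The following sketch uses illustrative function names and a small five-point data set; neither is prescribed by the slides.

```python
from math import sqrt


def pearson_r_raw(x, y):
    """Formula 1: computed from raw sums divided by n."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(a * a for a in x) / n - mean_x ** 2
    var_y = sum(b * b for b in y) / n - mean_y ** 2
    return cov / sqrt(var_x * var_y)


def pearson_r_dev(x, y):
    """Formula 3: computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [2, 5, 6, 8, 9]
r_raw = pearson_r_raw(x, y)
r_dev = pearson_r_dev(x, y)
print(f"r (raw sums) = {r_raw:.4f}, r (deviations) = {r_dev:.4f}")
```

The deviation form is usually the safer choice numerically, because the raw-sum form subtracts two large, nearly equal quantities when the means are large relative to the spread.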
Properties of Correlation Coefficient
(i) The correlation coefficient is symmetrical with respect to X and Y, i.e. rXY = rYX.
(ii) The correlation coefficient is a pure number. It is free of the units employed.
(iii) The correlation coefficient lies between –1 and +1, i.e. –1 ≤ r ≤ +1.
(iv) The correlation coefficient is independent of origin and scale, i.e. rXY = rUV.
(v) The correlation coefficient is the geometric mean of the two regression coefficients, i.e.
r = ±√(byx · bxy)
(vi) The correlation coefficient has the same sign as the regression coefficients.

Coefficient of Determination
The coefficient of determination is a measure of the variation of the dependent
variable that is explained by the regression line and the independent variable. The
symbol for the coefficient of determination is r².
Range of Values for the Correlation Coefficient

Strong negative relationship     No linear relationship     Strong positive relationship
-1              -0.5                      0                     0.5              +1

How would you explain the following values of the correlation coefficient 'r'?
 -1      Perfect negative correlation b/w the variables.
 +1      Perfect positive correlation b/w the variables.
  0      No linear correlation b/w the variables.
  0.92   Strong positive correlation b/w the variables.
 -0.88   Strong negative correlation b/w the variables.
  0.2    Weak positive correlation b/w the variables.
 -2      The value of 'r' is not possible.

Find the correlation coefficient from the given regression coefficients.
1. 1.2 and 0.6
2. -0.76 and -0.82
3. 0.02 and 0.56

Ans:
1. r = 0.8485 Strong positive
2. r = -0.7894 Strong negative
3. r = 0.1058 Weak positive
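
Recovering r from a pair of regression coefficients is a one-line calculation once the sign is handled. The sketch below is illustrative only; the function name `r_from_coefficients` is an assumption.

```python
from math import copysign, sqrt


def r_from_coefficients(b_yx, b_xy):
    """r = +/- sqrt(b_yx * b_xy), taking the common sign of the coefficients."""
    if b_yx * b_xy < 0:
        raise ValueError("regression coefficients must have the same sign")
    return copysign(sqrt(b_yx * b_xy), b_yx)


r_values = [
    r_from_coefficients(1.2, 0.6),
    r_from_coefficients(-0.76, -0.82),
    r_from_coefficients(0.02, 0.56),
]
for r in r_values:
    print(f"r = {r:.4f}")
```

The sign check reflects the property that the two regression coefficients and r always share the same sign.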
Example # 6 From the following data find the correlation coefficient for
the advertising expenditures (1000s of $) and company sales (1000s of $).
What can you conclude?
X = Advertising expenses, Y = Company sales
X : 1 2 3 4 5
Y : 2 5 6 8 9

       X     Y    XY    X²    Y²
       1     2     2     1     4
       2     5    10     4    25
       3     6    18     9    36
       4     8    32    16    64
       5     9    45    25    81
Sum   15    30   107    55   210

\[ r = \frac{\dfrac{\sum XY}{n} - \left(\dfrac{\sum X}{n}\right)\left(\dfrac{\sum Y}{n}\right)}{\sqrt{\left[\dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2\right]\left[\dfrac{\sum Y^2}{n} - \left(\dfrac{\sum Y}{n}\right)^2\right]}} = \frac{\dfrac{107}{5} - \left(\dfrac{15}{5}\right)\left(\dfrac{30}{5}\right)}{\sqrt{\left[\dfrac{55}{5} - \left(\dfrac{15}{5}\right)^2\right]\left[\dfrac{210}{5} - \left(\dfrac{30}{5}\right)^2\right]}} = 0.9815 \]

Because 'r' is close to one, there
is a strong positive linear
correlation. As the amount spent
on advertising increases, the
company sales also increase.
Question # 20: If the equations of the least-squares regression lines are:
(i) Y = 20.8 – 0.219X (Y on X), X = 16.2 – 0.785Y (X on Y)
(ii) Y = 2.64 + 0.648X (Y on X), X = -1.91 + 0.917Y (X on Y)
(iii) Y = 15 – 1.96X (Y on X), Y = 15.91 – 2.22X (X on Y)
(iv) Y = 14 + 0.75X (Y on X), Y = 6 + 4X (X on Y)
Find the coefficient of correlation in each case.
Ans: (i) r = - 0.415 (ii) r = 0.771 (iii) r = - 0.940 (iv) r = 0.433

Question # 21: (a) Compute the correlation coefficient:
(i) n = 12, ƩX = 800, ƩY = 811, ƩX² = 53418, ƩY² = 54849,
ƩXY = 54107.
(ii) n = 10, ƩXY = 330, X̄ = 5, Ȳ = 6, Sx = 2, Sy = 3
(b) Given r = 0.8, bxy = 0.45, what would be the value of byx?
Ans: (a) (i) r = 0.703 (ii) r = 0.50 (b) 1.42

Question # 22: A computer, while calculating the correlation coefficient
between two variables X and Y from 25 pairs of observations, obtained
the sums: ƩX = 125, ƩX² = 650, ƩY = 100, ƩY² = 460, ƩXY = 508. It
was, however, later discovered at the time of checking that two pairs
had been copied down as (X, Y) = (6, 14) and (8, 6), while the correct
values were (8, 12) and (6, 8).
Obtain the correct value of the coefficient of correlation. Ans: r = 0.67
Question # 23:
Compute the correlation coefficient for the data obtained in the study of
age (X) and blood pressure (Y) for the following pairs of values.
Age (X)              43  48  56  61  67  70
Blood Pressure (Y)  128 120 135 143 141 152
Ans: r = 0.89

Question # 24: Using the following data calculate
(X)  43  48  56  61  67  70
(Y) 128 120 135 143 141 152
(i) Total variation (ii) Unexplained variation
(iii) Explained variation (iv) Coefficient of determination
(v) Coefficient of correlation
Ans: TV = 649.5, UV = 145.5, EV = 504, r² = 0.776, r = 0.881
Question # 25:
The following table gives the distribution of the total population and those
who are wholly or partially blind among them. Find out if there is any
relation between age and blindness.

Age        No. of persons (thousands)    Blind
 0 − 9            100                      55
10 − 19            60                      40
20 − 29            40                      40
30 − 39            36                      40
40 − 49            24                      36
50 − 59            11                      22
60 − 69             6                      18
70 − 79             3                      15

Hint: First calculate the blindness per lakh and then correlate with the
midpoints of age groups.
Ans: r = 0.898. The correlation is positive and high, implying that blindness
increases with age.
PRACTICE
(Basic Skills & Concepts)

• What two things should be done before one performs a regression analysis?
  Ans: A scatter plot should be drawn, and the value of the correlation coefficient should be tested to see whether it is significant.

• What is the general form of the regression line used in statistics?
  Ans: Y = a + bX

PRACTICE
True or False
1. A correlation coefficient of -1 implies a perfect linear relationship b/w the variables. Ans: True
2. It is not possible to have a significant correlation by chance alone. Ans: False
3. The range of 'r' is - to +1. Ans: False
4. Regression equation has two dependent variables. Ans: False
Rank Correlation
When numerical measurement of a variable is not possible, the items
are ranked according to the quality they possess. The correlation
obtained between two such sets of ranks is known as rank correlation,
denoted by rs. The limits of rank correlation are the same as those of
simple correlation, i.e. –1 ≤ rs ≤ +1. This is often called Spearman's rank
correlation coefficient.

\[ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}, \qquad \text{where } d = x - y \text{ is the difference between paired ranks} \]
Multiple Correlation
Multiple correlation measures the degree of relationship between the combined
influence of a group of variables and a variable which is not included
in that group. Its limits are zero and
one, i.e. 0 ≤ R1.23, R2.13, R3.12 ≤ 1.

\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}} \]

\[ R_{2.13} = \sqrt{\frac{r_{21}^2 + r_{23}^2 - 2r_{21}r_{23}r_{13}}{1 - r_{13}^2}} \]

\[ R_{3.12} = \sqrt{\frac{r_{31}^2 + r_{32}^2 - 2r_{31}r_{32}r_{12}}{1 - r_{12}^2}} \]

where r12 = r21, r13 = r31, r23 = r32
Partial Correlation
It measures the degree of linear relationship between a dependent
variable and one particular independent variable, when all other
independent variables involved are held constant.
Its limits are ±1, i.e. –1 ≤ r12.3, r13.2, r23.1 ≤ +1.

\[ r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \]

\[ r_{13.2} = \frac{r_{13} - r_{12}r_{32}}{\sqrt{(1 - r_{12}^2)(1 - r_{32}^2)}} \]

\[ r_{23.1} = \frac{r_{23} - r_{21}r_{31}}{\sqrt{(1 - r_{21}^2)(1 - r_{31}^2)}} \]
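
Both the partial and the multiple correlation coefficients depend only on the three pairwise correlations, so they are easy to sketch in code. The function names and argument order below are assumptions made for illustration; the input values are those of the correlation matrix in Question # 29.

```python
from math import sqrt


def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation r_ab.c: a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def multiple_R(r_ab, r_ac, r_bc):
    """Multiple correlation R_a.bc: a against b and c jointly."""
    numerator = r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc
    return sqrt(numerator / (1 - r_bc ** 2))


r12, r13, r23 = 0.5, 0.8, 0.4  # pairwise correlations of Question # 29

r12_3 = partial_r(r12, r13, r23)    # r12.3
R1_23 = multiple_R(r12, r13, r23)   # R1.23
print(f"r12.3 = {r12_3:.3f}, R1.23 = {R1_23:.3f}")
```

The other coefficients follow by permuting the arguments, for example `partial_r(r23, r12, r13)` for r23.1 and `multiple_R(r13, r23, r12)` for R3.12.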
Example # 7 From the following data find the Spearman's rank
correlation coefficient.
X : 11 13 15 12 14
Y :  4  5  3  8  9

       X     Y   Rank a   Rank b   d = a - b    d²
      11     4     5        4         1          1
      13     5     3        3         0          0
      15     3     1        5        -4         16
      12     8     4        2         2          4
      14     9     2        1         1          1
Sum                                              22

\[ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 22}{5(25 - 1)} = -0.1 \]
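
The ranking and the rank-difference formula can be sketched as follows. The helper names are illustrative, the largest value receives rank 1 as in the example, and the ranking helper assumes there are no tied values.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    a, b = ranks_descending(x), ranks_descending(y)
    n = len(x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


x = [11, 13, 15, 12, 14]
y = [4, 5, 3, 8, 9]
rs = spearman_rs(x, y)
print(f"rs = {rs:.2f}")
```

Ranking in ascending order instead would give the same coefficient, as long as both variables are ranked in the same direction.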
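Question # 26: Given r12 = 0.492, r13 = 0.927, r23 = 0.758 find all partial
and multiple correlation coefficients.

Ans:
1. r12.3 = - 0.86
2. r23.1 = 0.92
3. r13.2 = 0.98
4. R1.23 = 0.98
5. R2.13 = 0.94
6. R3.12 = 0.99

Rank Correlation for Tied Ranks
For each group of t tied ranks, a correction T is added to Ʃd²:

\[ T = \frac{t^3 - t}{12}, \qquad r_s = 1 - \frac{6\left[\sum d^2 + \sum T\right]}{n(n^2 - 1)} \]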
Example # 8: Two members of a selection committee rank eight persons
according to their suitability for promotion as follows. Calculate the rank
correlation.

Persons     A    B    C    D    E    F    G    H
Member 1    1    2.5  2.5  4    5    6    7    8
Member 2    2    4    1    3    6    6    6    8

       a      b      d       d²
       1      2     -1       1
       2.5    4     -1.5     2.25
       2.5    1      1.5     2.25
       4      3      1       1
       5      6     -1       1
       6      6      0       0
       7      6      1       1
       8      8      0       0
Sum                          8.5

\[ \sum T = \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3) = 2.5 \]

\[ r_s = 1 - \frac{6[8.5 + 2.5]}{8(64 - 1)} = 0.869 \]
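
The tie correction can be sketched by counting how often each rank value repeats within each ranking. The function names below are illustrative, and the inputs are the ranks of Example # 8 as given (already averaged for ties).

```python
from collections import Counter


def tie_correction(ranks):
    """Sum of (t^3 - t) / 12 over every group of t tied ranks."""
    return sum((t ** 3 - t) / 12 for t in Counter(ranks).values() if t > 1)


def spearman_tied(a, b):
    """Rank correlation with the tie correction added to sum(d^2)."""
    n = len(a)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    total_T = tie_correction(a) + tie_correction(b)
    return 1 - 6 * (sum_d2 + total_T) / (n * (n ** 2 - 1))


member_1 = [1, 2.5, 2.5, 4, 5, 6, 7, 8]
member_2 = [2, 4, 1, 3, 6, 6, 6, 8]

rs_tied = spearman_tied(member_1, member_2)
print(f"sum T = {tie_correction(member_1) + tie_correction(member_2)}")
print(f"rs = {rs_tied:.3f}")
```

When there are no ties the correction is zero and the function reduces to the ordinary rank-difference formula.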
Question # 27: Rank the values and hence find a rank correlation coefficient
between the two sets.
X 7.4 9.0 11.0 2.5  4.6 6.5
Y 8.5 6.1  2.4 6.7 12.6 3.3
Ans: rs = - 0.60

Question # 28: Rank the values and hence find a rank correlation coefficient
between the two sets.
X 98 47 63 98 55 40 69 77 63 50 63 99
Y 22 32 18 30 22 18 25 27 35 38 24 22
Ans: rs = - 0.0699, ƩT = 5, Ʃd² = 301

Question # 29: Find R1.23 and r12.3 from the correlation matrix.

\[ \Delta = \begin{pmatrix} 1 & 0.5 & 0.8 \\ 0.5 & 1 & 0.4 \\ 0.8 & 0.4 & 1 \end{pmatrix} \]

Ans: R1.23 = 0.824, r12.3 = 0.327

Question # 30: Given r12 = 0.60, r13 = 0.70, r23 = 0.65, find the partial correlation
coefficient between X2 and X3 keeping X1 constant. Also find the
multiple correlation coefficient between the variable X3 and the
two independent variables X1 and X2.
Ans: r23.1 = 0.4026, R3.12 = 0.7567

Question # 31: Rank the values and hence find a rank correlation coefficient
between the two sets.
(X) 93 97 94 92 93 97 94
(Y) 44 48 44 47 42 44 48
Ans: rs = 0.2678, ƩT = 4, Ʃd² = 37
Serial Correlation
It is defined as the correlation between
observations ordered in time periods. The
correlation between yt and yt+1, i.e. the
correlation between successive overlapping
pairs, is called the serial correlation of first
order. It is also known as the coefficient of autocorrelation
at lag 1.

\[ r_1 = \frac{\sum_{t=1}^{n-1} (Y_t - \bar{Y})(Y_{t+1} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}, \qquad \text{where } \bar{Y} = \frac{\sum Y_t}{n} \]
Autocorrelation: Example # 9
The Office Concept Corp. has acquired a number of office
units (in thousands of square feet) over the last 16 years.
Calculate the first-order serial correlation.
Yt: 1.6, 0.8, 1.2, 0.5, 0.9, 1.1, 1.1, 0.6, 1.5, 0.8, 0.9, 1.2, 0.5, 1.3, 0.8, 1.2
Ȳ = ƩYt / n = 16.0 / 16 = 1.0

       Yt     Yt+1   Yt - Ȳ   Yt+1 - Ȳ   (Yt - Ȳ)(Yt+1 - Ȳ)   (Yt - Ȳ)²
       1.6    0.8     0.6      -0.2          -0.12              0.36
       0.8    1.2    -0.2       0.2          -0.04              0.04
       1.2    0.5     0.2      -0.5          -0.10              0.04
       0.5    0.9    -0.5      -0.1           0.05              0.25
       0.9    1.1    -0.1       0.1          -0.01              0.01
       1.1    1.1     0.1       0.1           0.01              0.01
       1.1    0.6     0.1      -0.4          -0.04              0.01
       0.6    1.5    -0.4       0.5          -0.20              0.16
       1.5    0.8     0.5      -0.2          -0.10              0.25
       0.8    0.9    -0.2      -0.1           0.02              0.04
       0.9    1.2    -0.1       0.2          -0.02              0.01
       1.2    0.5     0.2      -0.5          -0.10              0.04
       0.5    1.3    -0.5       0.3          -0.15              0.25
       1.3    0.8     0.3      -0.2          -0.06              0.09
       0.8    1.2    -0.2       0.2          -0.04              0.04
       1.2    ----    0.2      ----           0.00              0.04
Sum   16.0    ----   ----      ----          -0.90              1.64

r1 = -0.90 / 1.64 = -0.5488 ≈ -0.55

Hence the serial correlation of first order between the successive observations is found
to be r1 = -0.55.
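
The first-order serial correlation can be sketched as a short function. The `lag` parameter is an assumption added here as the usual generalisation to lag k (summing products over pairs k periods apart); the slides themselves only give the lag-1 formula.

```python
def serial_correlation(y, lag=1):
    """Autocorrelation at the given lag, using the overall mean of the series."""
    n = len(y)
    mean = sum(y) / n
    numerator = sum((y[t] - mean) * (y[t + lag] - mean) for t in range(n - lag))
    denominator = sum((v - mean) ** 2 for v in y)
    return numerator / denominator


office_units = [1.6, 0.8, 1.2, 0.5, 0.9, 1.1, 1.1, 0.6,
                1.5, 0.8, 0.9, 1.2, 0.5, 1.3, 0.8, 1.2]

r1 = serial_correlation(office_units, lag=1)
print(f"r1 = {r1:.4f}")
```

A negative value, as here, indicates that an above-average year tends to be followed by a below-average one.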
Question # 32: The following noise measurements were recorded at an
intersection in the time order in which they were observed. Calculate the first
serial correlation r1 and the coefficient of autocorrelation at lag 2.

Noise measurements: 65, 64, 63, 61, 60, 58, 63, 64, 62, 64, 63, 63, 62, 60, 62, 64, 66, 68, 68, 69

Ans: r1 = 0.6602, r2 = 0.3413
Question # 33
Circle the correct option i.e. A / B / C / D.
(1) The dependent variable is also called:
(A) Regressor (B) Explanatory variable
(C) Predictor (D) Response variable
(2) The independent variable is also known as:
(A) Regressand (B) Predicted
(C) Explained (D) Predictor
(3) Slope of regression line is also known as:
(A) X− intercept (B) Regression coefficient
(C) Y− Intercept (D) Regressand
(4) In simple regression equation, the number of variables involved is:
(A) 2 (B) 1
(C) 0 (D) 3
(5) Regression coefficient and slope of a line are:
(A) Dependent (B) Independent
(C) Not same (D) Same
(6) If r = 0.34, bXY = 1.9 then bYX =:
(A) 0.74 (B) 1.9
(C) − 0.34 (D) 0.06
(7) When bxy is positive, then byx will be:
(A) Negative (B) Positive
(C) Zero (D) One

(8) When two regression coefficients have same algebraic signs, then r is:
(A) Positive (B) Zero
(C) Negative (D) According to signs
(9) If X is measured in rupees and Y is measured in dollars, then correlation coefficient
r has the unit:
(A) Dollars (B) Rupees
(C) No unit (D) Both (A) & (B)
(10) When two variables move in the same direction, then the correlation is:
(A) Positive (B) Negative
(C) Fractional (D) None of these
Answer: 1. D 2. D 3. B 4. A 5. D 6. D 7. B 8. D 9. C 10. A
