Correlation and Regression
© BGIMS
Outline
Introduction
Scatter Plots
Correlation
Regression
Objectives
Draw a scatter plot for a set of
ordered pairs.
Find the correlation coefficient.
Find the equation of the
regression line.
Find the coefficient of
determination.
Find the standard error of
estimate.
Introduction
Every day we take personal and
professional decisions that are based
on predictions of future events.
To make these forecasts, we rely on the
relationship between what is already
known and what is to be estimated.
Regression and correlation analysis
show us how to determine both the
nature and the strength of a
relationship between two variables.
Significance of the study of
correlation
1. Most variables show some kind of relationship, for example
between price and supply or between income and expenditure.
Correlation analysis expresses the degree of that relationship in
a single figure.
2. Once we know the relationship, we can estimate the
value of one variable given the value of another.
3. Correlation analysis contributes to the understanding of economic
behaviour. In business, correlation analysis enables the
executive to estimate costs, prices, etc.
Types of correlation
Positive and negative
Simple, partial and multiple
Linear and non-linear
Positive and Negative correlation
If two variables vary together in the same
direction or in opposite directions, they are
said to be correlated.
If Y increases consistently as X increases, X and Y
are positively correlated.
If Y decreases as X increases, and Y increases as X
decreases, X and Y are negatively correlated.
Simple, partial and multiple
correlation
When only two variables are studied –
simple correlation.
When three or more variables are studied –
partial or multiple correlation.
In multiple correlation, three or more
variables are studied simultaneously.
In partial correlation, more than two
variables are present but we consider only two
of them (keeping the others constant).
Dependent & Independent
variables
The known variable is called the
independent variable and the variable
we are trying to predict is the
dependent variable.
Scatter Plots - Example
Positive Relationship or correlation
[Figure: scatter plot of Pressure (vertical axis, 120 to 150) against Age (horizontal axis, 40 to 70)]
Scatter Plots - Other Examples
[Figure: scatter plot of Final (vertical axis, 40 to 70) against Number of absences (horizontal axis, 5 to 15)]
Scatter Plots - Other Examples
No Relationship
[Figure: scatter plot of y (vertical axis, 0 to 10) against x (horizontal axis, 0 to 70)]
If the correlation is perfect positive, all
the points lie on a rising straight line;
if the correlation is perfect negative, they
lie on a falling straight line, as
shown in the figures.
Perfect positive correlation
[Figure: all points on a straight line]
Perfect negative correlation
[Figure: all points on a straight line]
Correlation
Correlation is the statistical tool used to
study the relationship between two or
more variables.
Correlation Analysis
Correlation analysis is the statistical
tool to describe the degree to which one
variable is linearly related to another.
The coefficient of determination
The extent, or strength, of the
association that exists between two
variables X and Y.
Sample coefficient of determination:
$$r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$
where $\hat{Y}$ is the value estimated by the regression line and $\bar{Y}$ is the mean of the observed Y values.
Sample coefficient of
determination
$r^2 = 1$ when there is perfect correlation
$r^2 = 0$ when there is no correlation
Note
$r^2$ measures only the strength of a
linear relationship between two
variables.
Correlation Coefficient
The correlation coefficient computed
from the sample data measures the
strength and direction of a relationship
between two variables.
Sample correlation coefficient, r.
Population correlation coefficient, ρ.
Range of Values for the
Correlation Coefficient
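The correlation coefficient ranges from −1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation):
$$-1 \le r \le +1$$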
Coefficient of correlation
$$r = \pm\sqrt{r^2}$$
When the slope b of the estimating equation is positive, r is
the positive square root; when b is negative, r
is the negative square root.
The sign of r indicates the direction of the
relationship between the two variables X and Y.
Karl Pearson’s Correlation
coefficient
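The coefficient of correlation, denoted by r and named
after Karl Pearson, is defined as
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N \sigma_x \sigma_y}$$
Since
$$\operatorname{cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$$
it follows that
$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y}$$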
The formula for r can be simplified as
$$r = \frac{\sum XY - N\bar{X}\bar{Y}}{\sqrt{\sum X^2 - N\bar{X}^2}\,\sqrt{\sum Y^2 - N\bar{Y}^2}}$$
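The two forms of Karl Pearson's formula can be checked against each other numerically. The following is a minimal sketch in Python, using the housing starts and appliance sales figures from the regression example in these slides; the function names are illustrative and not part of the original material.

```python
from math import sqrt

# Housing starts (X) and appliance sales (Y), in thousands,
# taken from the appliance-sales example in these slides.
X = [2, 2.5, 3.2, 3.6, 3.3, 4, 4.2, 4.6, 4.8, 5]
Y = [5, 5.5, 6, 7, 7.2, 7.7, 8.4, 9, 9.71, 10]


def pearson_r_definition(xs, ys):
    """r = cov(X, Y) / (sigma_x * sigma_y), all computed with divisor N."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    sigma_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


def pearson_r_shortcut(xs, ys):
    """Simplified form that works from the raw sums of XY, X^2 and Y^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = (sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
                   * sqrt(sum(y * y for y in ys) - n * y_bar ** 2))
    return numerator / denominator


r_def = pearson_r_definition(X, Y)
r_short = pearson_r_shortcut(X, Y)
print(round(r_def, 4), round(r_short, 4))  # both forms give the same r
```

The shortcut form avoids computing deviations from the mean, which is why it is the usual choice for hand calculation.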
Interpreting r2
The coefficient of determination expresses the
amount of variation in Y that is
explained by the regression line.
What does r=0.6 mean?
If r = 0.6, then $r^2 = 0.36$:
36% of the variation in the amount spent
on movies is explained by the regression line.
A correlation of r = 0.6 between the amount spent on movies
and family income seems
like a fairly strong correlation. But $r^2 = 0.36$ means that family income explains only
36% of the variation in the amount of
money families spend on movies.
Cont..
If you designed your marketing
strategy to appeal only to families with
high incomes, you’d miss a lot of
potential customers.
Instead, try to find what else is
influencing family movie decisions.
Rank Correlation Coefficient
Used when a quantitative measure of certain
factors cannot be fixed, but the
individuals in the group can be
arranged in order, thereby obtaining
for each individual a number
indicating their rank in the group.
The rank correlation coefficient is applied
to a set of ordinal rank numbers, with 1 for
the individual ranked first in quantity or
quality, and so on up to N for the one ranked last.
R is then defined as
$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$
where D is the
difference of ranks between
paired items.
Example
Two managers are asked to rank a
group of employees in order of
potential for eventually becoming top
managers. The rankings are as follows.
Compute the coefficient of rank
correlation and comment on the value.
Employee   Ranking by manager 1   Ranking by manager 2
A          10                     9
B          2                      4
C          1                      2
D          4                      3
E          3                      1
F          6                      5
G          5                      6
H          8                      8
I          7                      7
J          9                      10
N = 10
Employee   Ranking by manager 1 (R1)   Ranking by manager 2 (R2)   D² = (R1 − R2)²
A          10                          9                           1
B          2                           4                           4
C          1                           2                           1
D          4                           3                           1
E          3                           1                           4
F          6                           5                           1
G          5                           6                           1
H          8                           8                           0
I          7                           7                           0
J          9                           10                          1
N = 10                                                             ΣD² = 14
Cont..
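$$R = 1 - \frac{6 \times 14}{10(10^2 - 1)} = 1 - \frac{84}{990} = 1 - 0.085 = 0.915$$
The value is close to +1, so the two managers' rankings agree closely.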
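The calculation above can be reproduced in a few lines. This is a small sketch, assuming Python and an illustrative function name `rank_correlation`; the ranks are those given in the managers' table.

```python
# Ranks given by the two managers to employees A..J (from the table above).
manager_1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager_2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]


def rank_correlation(ranks_1, ranks_2):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), valid when there are no ties."""
    n = len(ranks_1)
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(manager_1, manager_2))
R = rank_correlation(manager_1, manager_2)
print(sum_d2, round(R, 3))  # 14 0.915
```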
Where Ranks are not given
Assign ranks. Then apply the same
formula
Equal ranks or tie in ranks
Assign each individual or entry an
average rank.
Thus if two individuals are ranked equal at
5th place, give the rank (5+6)/2 = 5.5 to
both.
If m is the number of items that share a
common rank, then R is
$$R = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right]}{N^3 - N}$$
with one correction term for each group of tied ranks.
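The tie-handling procedure can be sketched as follows. The scores below are made up purely for illustration (they do not come from the slides), rank 1 is assumed to go to the largest value, and the function names are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def rank_correlation_with_ties(xs, ys):
    """Rank correlation with a (m^3 - m) / 12 correction per group of ties."""
    n = len(xs)
    ranks_x = average_ranks(xs)
    ranks_y = average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    correction = 0.0
    for series in (xs, ys):
        for m in Counter(series).values():
            if m > 1:  # m items share a common rank
                correction += (m ** 3 - m) / 12
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


# Hypothetical scores: each series contains one pair of tied values.
scores_a = [50, 60, 60, 70, 80]
scores_b = [12, 15, 14, 18, 18]

ranks_a = average_ranks(scores_a)
R_ties = rank_correlation_with_ties(scores_a, scores_b)
print(ranks_a, round(R_ties, 3))  # [5.0, 3.5, 3.5, 2.0, 1.0] 0.9
```

When no values are tied, every correction term is zero and the result reduces to the simple formula, since N³ − N = N(N² − 1).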
Regression analysis
Example
Sales of major appliances vary with the new
housing market. When new home sales are
good, so are the sales of dishwashers,
washing machines, dryers and
refrigerators. A trade association compiled
the following historical data (in thousands of
units) on major appliance sales and housing
starts.
Housing starts (thousands)   Appliance sales (thousands)
2                            5
2.5                          5.5
3.2                          6
3.6                          7
3.3                          7.2
4                            7.7
4.2                          8.4
4.6                          9
4.8                          9.71
5                            10
How can we fit this line?
[Figure: scatter plot of the data; vertical axis 0 to 12, horizontal axis 0 to 6]
Cont..
In this case, the data points represent the
relationship between the housing market and
sales of household appliances. The relationship
between X and Y is well described by a straight
line.
The direction of the line indicates
whether the relationship is direct or inverse.
William C. Andrews, an organizational behavior
consultant for Victory Motorcycles, has
designed a test to show the company’s
supervisors the dangers of oversupervising their
workers. A worker from the assembly line is
given a series of complicated tasks to perform.
During the worker’s performance, a supervisor
constantly interrupts the worker to assist him or
her in completing the tasks. The worker, upon
completion of the tasks, is then given a
psychological test designed to measure the
worker’s hostility toward authority (a high score
equals low hostility).
Cont..
Eight different workers were assigned the
tasks and then interrupted for the purpose of
instructional assistance various numbers of
times. Their corresponding scores on the
hostility test are given below. Predict
the expected test score if the worker is
interrupted 18 times.
No. of times worker interrupted   Worker's score on hostility test
5                                 58
10                                41
10                                45
15                                27
15                                26
20                                12
20                                16
25                                3
How can we fit this line?
[Figure: scatter plot of Hostility Score (vertical axis, 0 to 70) against Number of interrupts (horizontal axis, 0 to 30)]
How can we fit a line
mathematically?
The method of least squares
The method of least squares finds the line through the
middle of a set of points in a scatter diagram
such that the sum of the squares of the errors is
a minimum. The estimating line, on which the estimated points $\hat{Y}$ lie, is
$$\hat{Y} = a + bX$$
Slope and Y-intercept of the best-fitting regression line
$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$
$$a = \bar{Y} - b\bar{X}$$
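As an illustration of these two formulas, the sketch below applies them to the worker-interruption data from the Victory Motorcycles example and predicts the score for 18 interruptions. It assumes Python; the function name `fit_line` is illustrative only.

```python
# Number of interruptions (X) and hostility-test score (Y) for the
# eight workers in the Victory Motorcycles example.
interruptions = [5, 10, 10, 15, 15, 20, 20, 25]
scores = [58, 41, 45, 27, 26, 12, 16, 3]


def fit_line(xs, ys):
    """Return (a, b) of the least-squares line Y_hat = a + b * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(interruptions, scores)
predicted = a + b * 18  # expected score after 18 interruptions
print(round(a, 2), round(b, 2), round(predicted, 2))  # 70.5 -2.8 20.1
```

The negative slope indicates an inverse relationship: more interruptions go with a lower score on the test.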
The equation $\hat{Y} = a + bX$ is the regression
equation of Y on X. It gives the most
probable values of Y for given values
of X.
The regression line of X on Y gives the most
probable values of X for given values
of Y, say X = a + bY, with its own constants a and b.
The regression equation of Y on X can also be
represented by
$$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X})$$
where r is the coefficient of correlation.
The regression equation of X on Y can also be
represented by
$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y})$$
Example
The general sales manager of Kiran Enterprises – an
enterprise dealing in the sale of ready-made men’s wear – is
toying with the idea of increasing his sales to Rs. 80,000. On
checking the records of sales during the last 10 years, it was
found that the annual sale proceeds and advertisement
expenditure were highly correlated, to the extent of 0.8. It was
further noted that the annual average sale has been Rs. 45,000
and the annual average advertisement expenditure Rs. 30,000, with
variances of 1600 in sales and 625 in advertisement
expenditure respectively.
In view of the above, how much expenditure on advertisement
would you suggest the general sales manager of the enterprise
incur to meet his sales target?
X - advertisement expenditure
Y - sales
Using the regression equation of X on Y with $\sigma_x = \sqrt{625} = 25$ and $\sigma_y = \sqrt{1600} = 40$:
$$X - 30{,}000 = 0.8 \times \frac{25}{40}\,(Y - 45{,}000)$$
When Y = 80,000,
X = 30,000 + 0.5 × 35,000 = 47,500
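The same answer can be checked with a short script. This sketch assumes the variance of 1600 belongs to sales and that the advertisement variance is 625, the reading that reproduces the slide's answer of 47,500; the variable and function names are illustrative.

```python
from math import sqrt

r = 0.8                # correlation between sales and advertisement expenditure
mean_sales = 45_000    # Y-bar, Rs.
mean_adv = 30_000      # X-bar, Rs.
var_sales = 1600       # assumed to be the variance of sales
var_adv = 625          # assumed variance of advertisement expenditure


def advertisement_for_sales(target_sales):
    """Regression of X on Y: X - X_bar = r * (sigma_x / sigma_y) * (Y - Y_bar)."""
    sigma_x = sqrt(var_adv)
    sigma_y = sqrt(var_sales)
    return mean_adv + r * sigma_x / sigma_y * (target_sales - mean_sales)


required_adv = advertisement_for_sales(80_000)
print(round(required_adv))  # 47500
```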
Example
Suppose BMC is interested in the
relationship between the age of a
garbage truck and the annual repair
expense it should expect to incur. In
order to determine this relationship,
BMC has accumulated information
concerning four of the trucks the city
currently owns.
Cont..
Organize the data as outlined in the table.
Use the equations for a and b to find the
numerical constants for our regression
line.
Truck number   Age of truck in years (X)   Repair expense during last year, in thousands of Rs. (Y)
101            5                           7
102            3                           7
103            3                           6
104            1                           4
b = 0.75
a = 3.75
$$\hat{Y} = 3.75 + 0.75X$$
BMC can estimate the annual repair expense given
the age of a truck.
If the truck is 4 years old, the equation gives
the annual expense as follows:
$$\hat{Y} = 3.75 + 0.75 \times 4 = 6.75$$
Expected annual repair expense = Rs. 6,750
How to measure the reliability of the
estimating equation?
Measured by the standard error of
estimate
It measures the variability, or scatter
of the observed values around the
regression line.
Standard error
$$S_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$
For the above example
Standard error = 0.866 (thousand Rs.), i.e. Rs. 866
If the standard error is zero, we expect the
estimating equation to be a perfect
estimator of the dependent variable.
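The standard error for the BMC truck example can be reproduced as a sketch in Python. The line is refitted from the four data points, and the coefficient of determination is computed from the same residuals; function and variable names are illustrative.

```python
from math import sqrt

# BMC truck data: age in years (X), repair expense in thousands of Rs. (Y).
ages = [5, 3, 3, 1]
expenses = [7, 7, 6, 4]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(expenses) / n

# Least-squares slope and intercept.
b = ((sum(x * y for x, y in zip(ages, expenses)) - n * x_bar * y_bar)
     / (sum(x * x for x in ages) - n * x_bar ** 2))
a = y_bar - b * x_bar

# Residual (unexplained) and total sums of squares.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, expenses))
sst = sum((y - y_bar) ** 2 for y in expenses)

standard_error = sqrt(sse / (n - 2))   # S_e
r_squared = 1 - sse / sst              # sample coefficient of determination

print(a, b, round(standard_error, 3), r_squared)  # 3.75 0.75 0.866 0.75
```

With only four observations the divisor n − 2 is 2, so each residual has a large effect on the standard error.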
Assuming that the observed points are
normally distributed around the regression line,
we can expect
68% of the points within ± 1 Se,
95.5% of the points within ± 2 Se and 99.7% of
the points within ± 3 Se
Multiple regression and
correlation analysis
Cont..
We can use more than one independent
variable to estimate the dependent
variable and thus attempt to increase
the accuracy of the estimate.
This process is called multiple
regression analysis.
Example
Consider the real estate agent who wishes to
relate the number of houses the firm sells in a
month to the amount of her monthly
advertising.
Certainly we can find a simple estimating
equation that relates these two variables.
Could we also improve the accuracy of our
equation by including the number of
salespeople she employs each month ?
Then we can use number of sales agents and the
advertising expenditures to predict monthly
house sales.
Multiple regression equations
$$\hat{Y} = a + b_1 X_1 + b_2 X_2$$
$X_1$, $X_2$ = values of the two independent variables
$\hat{Y}$ = estimated value corresponding to the dependent variable
To obtain a, $b_1$ and $b_2$, solve the normal equations
$$\sum Y = na + b_1 \sum X_1 + b_2 \sum X_2$$
$$\sum X_1 Y = a \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2$$
$$\sum X_2 Y = a \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2$$
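One way to solve the three normal equations is to set them up as a 3 × 3 linear system. The sketch below does this in plain Python with Gaussian elimination. The data are invented for illustration (generated exactly from Y = 1 + 2X₁ + 3X₂ so the recovered constants can be checked), and the function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_two_predictors(x1, x2, y):
    """Build the normal equations for Y_hat = a + b1*X1 + b2*X2 and solve them."""
    n = len(y)

    def s(u, v):
        return sum(p * q for p, q in zip(u, v))

    matrix = [
        [n,       sum(x1),   sum(x2)],
        [sum(x1), s(x1, x1), s(x1, x2)],
        [sum(x2), s(x1, x2), s(x2, x2)],
    ]
    rhs = [sum(y), s(x1, y), s(x2, y)]
    return solve_linear_system(matrix, rhs)


# Made-up illustration data, generated exactly from Y = 1 + 2*X1 + 3*X2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))  # 1.0 2.0 3.0
```

With real data the fit is not exact, and the solved constants are the least-squares estimates rather than the true values.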
Example
In trying to evaluate the effectiveness of
its advertising campaign, a firm
compiled the following information:
Year   Adv. expenditure ('000 Rs.)   Sales (lakh Rs.)
1996   12                            5.0
1997   15                            5.6
1998   15                            5.8
1999   23                            7.0
2000   24                            7.2
2001   38                            8.8
2002   42                            9.2
2003   48                            9.5
Estimate the probable sales when advertisement expenditure is
Rs. 60 thousand.
Cont..
$$\hat{Y} = 3.8719 + 0.1250X$$
When X = 60,
$$\hat{Y} = 3.8719 + 0.1250 \times 60 = 11.37$$
i.e. estimated sales of about Rs. 11.37 lakh.