Introduction to Linear Regression
and Correlation Analysis
The correlation between two random variables, X
and Y, is a measure of the degree of linear
association between the two variables.
The population correlation, denoted by ρ, and the
sample correlation coefficient, denoted by r, can
take on any value from -1 to 1.
Methods of Correlation Analysis
Scatter diagram method
Karl Pearson’s Correlation Coefficient
Spearman’s Rank Correlation Method
Scatter Plots and Correlation
A scatter plot (or scatter diagram) is used to show
the relationship between two variables
Correlation analysis is used to measure strength
of the association (linear relationship) between
two variables
Only concerned with strength of the
relationship
No causal effect is implied
Scatter Plot Examples
[Figure: scatter plots of y against x illustrating linear relationships and curvilinear relationships]
[Figure: scatter plots of y against x illustrating strong relationships and weak relationships]
[Figure: scatter plot of y against x illustrating no relationship]
Correlation Coefficient
The population correlation coefficient ρ (rho)
measures the strength of the association
between the variables
The sample correlation coefficient r is an
estimate of ρ and is used to measure the
strength of the linear relationship in the
sample observations
Features of ρ and r
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative
linear relationship
The closer to 1, the stronger the positive
linear relationship
The closer to 0, the weaker the linear
relationship
Examples of Approximate
r Values
[Figure: five scatter plots of y against x illustrating r = -1, r = -0.6, r = 0, r = +0.3 and r = +1]
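Calculating the Correlation Coefficient by using
Karl Pearson’s method
To measure the intensity of the relationship between the
variables, Karl Pearson proposed a formula known as Karl
Pearson's correlation coefficient:
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} \]
or the algebraic equivalent:
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} \]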
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
Assumptions of using Pearson’s
Correlation Coefficient
Pearson’s correlation coefficient is appropriate
when both variables x and y are measured
on an interval or a ratio scale
Both variables are normally distributed, and there
is a linear relationship between them
There is a cause-and-effect relationship between the forces
affecting the distributions of the two variables (the coefficient
itself does not show which variable, if either, causes the other)
Probable Error and Standard Error of
Coefficient of Correlation
The probable error of r can be used to judge whether the
obtained correlation coefficient is significant:
\[ P.E.(r) = 0.6745 \, \frac{1 - r^2}{\sqrt{n}} \]
If |r| < 6 P.E.(r), the value of r is not significant
If |r| > 6 P.E.(r), the value of r is significant
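The probable-error rule can be sketched in a few lines of Python. This is only an illustration: the function names and the input values r = 0.8, n = 25 are hypothetical, chosen to show the arithmetic.

```python
import math

def probable_error(r: float, n: int) -> float:
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

def is_significant(r: float, n: int) -> bool:
    """Rule of thumb from the slide: r is significant when |r| > 6 * P.E.(r)."""
    return abs(r) > 6 * probable_error(r, n)

# Hypothetical inputs, for illustration only
r, n = 0.8, 25
pe = probable_error(r, n)
print(f"P.E.(r) = {pe:.4f}, 6 P.E.(r) = {6 * pe:.4f}, significant: {is_significant(r, n)}")
```

With these inputs 6 P.E.(r) is well below |r|, so the rule would treat the correlation as significant.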
Coefficient of Determination
The coefficient of determination is denoted by r².
It always has a value between 0 and 1
The coefficient of determination gives the
strength of the relationship between the variables, but
the information about its direction is lost
r² = 0: none of the variation in y can be explained by the
variable x
r² = 1: the values of y are completely explained by x
Examples of Approximate
R² Values
[Figure: two scatter plots with R² = 1] Perfect linear relationship between x and y: 100% of the variation in y is explained by variation in x.
[Figure: two scatter plots with 0 < R² < 1] Weaker linear relationship between x and y: some but not all of the variation in y is explained by variation in x.
[Figure: two scatter plots with R² = 0] No linear relationship between x and y: the value of y does not depend on x (none of the variation in y is explained by variation in x).
Example:
A copier sales manager wants to
determine whether there is a relationship
between the number of sales calls made in a
month and the number of copiers sold in that
month. The manager selects a random sample
of 10 representatives and determines the
number of sales calls each representative made
last month and the copiers sold. The sample
information is given below
Sales calls and copier sales
Sales person   Number of sales calls   No. of copiers sold
Medha          20                      30
Mahathi        40                      60
Nikhil         20                      40
Sai Ram        30                      60
Sathya         10                      30
Sashi          10                      40
Krishna        20                      40
Pavan          20                      50
Raman          20                      30
Hari           30                      70
Calculation Example
        X     Y     XY      X²     Y²
        20    30    600     400    900
        40    60    2400    1600   3600
        20    40    800     400    1600
        30    60    1800    900    3600
        10    30    300     100    900
        10    40    400     100    1600
        20    40    800     400    1600
        20    50    1000    400    2500
        20    30    600     400    900
        30    70    2100    900    4900
Total   220   450   10800   5600   22100
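Calculation Example
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} \]
\[ = \frac{10(10800) - (220)(450)}{\sqrt{\left[10(5600) - (220)^2\right]\left[10(22100) - (450)^2\right]}} = 0.759014 \]
[Figure: scatter plot of copiers sold (Y) against number of sales calls (X)]
There is a positive relationship
between the number of sales calls and
the number of copiers sold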
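The hand calculation for the copier data can be checked with a short Python sketch of the algebraic form of Pearson's formula. The function name `pearson_r` and the list names are illustrative choices, not part of the original example.

```python
import math

# Data from the sales calls / copier sales table
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]      # X: number of sales calls
copiers = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y: copiers sold

def pearson_r(x, y):
    """Pearson's r via the algebraic (raw sums) form of the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(calls, copiers)
print(f"r = {r:.6f}, r^2 = {r ** 2:.4f}")
```

The value of r should agree with the 0.759 obtained by hand; squaring it gives the coefficient of determination for the same data.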
Calculation Example II
        Tree Height (y)   Trunk Diameter (x)   xy     y²      x²
        35                8                    280    1225    64
        49                9                    441    2401    81
        27                7                    189    729     49
        33                6                    198    1089    36
        60                13                   780    3600    169
        21                7                    147    441     49
        45                11                   495    2025    121
        51                12                   612    2601    144
Total   321               73                   3142   14111   713
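Calculation Example II (continued)
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} \]
\[ = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886 \]
[Figure: scatter plot of tree height, y, against trunk diameter, x]
r = 0.886 → relatively strong positive
linear association between x and y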
Excel Correlation Output
Tools / Data Analysis / Correlation…
                 Tree Height   Trunk Diameter
Tree Height      1
Trunk Diameter   0.886231      1
The off-diagonal entry, 0.886231, is the correlation between
Tree Height and Trunk Diameter
Ex: Pepsi Cola is studying the effect of its last advertising
campaign. People chosen at random were called and
asked how many cans of Pepsi Cola they had bought (X) in
the past week and how many advertisements (Y) they
had either read or seen in the past week.
X: 3 7 4 2 0 4 1 2
Y: 11 18 9 4 7 6 3 8
Calculate the coefficient of correlation and the coefficient of
determination.
An economist wanted to find out if there was any
relationship between the unemployment rate in a country
and its inflation rate . Data gathered from 7 countries for
the year 2004 are given below.
Country   Unemployment rate (%)   Inflation rate (%)
A         4.0                     3.2
B         8.5                     8.2
C         5.5                     9.4
D         0.8                     5.1
E         7.3                     10.1
F         5.8                     7.8
G         2.1                     4.7
Find the degree of linear association between a country’s
unemployment and its level of inflation.
Spearman Rank Correlation (ρ)
The correlation between the ranks assigned to the same
individuals on two characteristics is known as rank correlation
To measure the intensity of the relationship
between variables measured on an ordinal scale, we
use the Spearman rank correlation
The Spearman rank correlation lies between -1 and
+1
If the rank correlation coefficient is +1 there is
perfect positive correlation, and if it is -1 there is
perfect negative correlation
Spearman's rank correlation is given by
\[ \rho(x, y) = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \]
where d = R_x - R_y is the difference between the two ranks of an item and
n is the number of pairs of observations
When ranks are tied, a correction factor is added to
∑d², giving
\[ \rho(x, y) = 1 - \frac{6\left[\sum d^2 + \text{correction factor}\right]}{n(n^2 - 1)} \]
where the correction factor is m(m² - 1)/12 for each tied value, and m is the number
of times that value is repeated
Ten Competitors in a beauty contest are ranked by
three judges in the following order.
Judge I :1 6 5 10 3 2 4 9 7 8
Judge II :3 5 8 4 7 10 2 1 6 9
Judge III:6 4 9 8 1 2 3 10 5 7
Determine which pair of judges has the nearest
approach to common tastes in beauty.
Worksheet columns: R1   R2   R3   D1 = R1 - R2   D2 = R1 - R3   D3 = R2 - R3   (D1)²   (D2)²   (D3)²
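One way to work the judges exercise is to apply the untied rank formula to each pair of judges and compare the results. The sketch below assumes the rankings are already ranks with no ties, as given; the dictionary layout and function name are illustrative.

```python
from itertools import combinations

# Ranks given by each judge to the ten competitors
ranks = {
    "Judge I":   [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "Judge II":  [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "Judge III": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

def spearman_rho(rx, ry):
    """Spearman's rank correlation, 1 - 6*sum(d^2) / (n(n^2 - 1)), for untied ranks."""
    n = len(rx)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Rank correlation for every pair of judges
rho = {pair: spearman_rho(ranks[pair[0]], ranks[pair[1]])
       for pair in combinations(ranks, 2)}

for pair, value in rho.items():
    print(pair, round(value, 4))

# The pair with the largest coefficient has the most similar rankings
closest_pair = max(rho, key=rho.get)
print("Closest tastes:", closest_pair)
```

The pair with the highest coefficient is the one with the nearest approach to common tastes; a negative coefficient indicates that two judges tend to rank in opposite orders.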
A financial analyst wanted to find out whether inventory
turnover influences a company's earnings per share (in
%). A random sample of 7 companies listed on a stock
exchange was selected and the following data were
obtained for each.
Company   Inventory turnover (no. of times)   Earnings per share (%)
A         4                                   11
B         5                                   9
C         7                                   13
D         8                                   7
E         6                                   13
F         3                                   8
G         5                                   8
Find the strength of association between inventory
turnover and earnings per share. Interpret the result.
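This data set contains tied values, so it can serve as an illustration of the tie-corrected rank formula. The sketch below assumes that rank correlation is the intended method for this exercise and that tied values receive the average of the ranks they occupy; the helper names are illustrative.

```python
from collections import Counter

turnover = [4, 5, 7, 8, 6, 3, 5]      # inventory turnover, companies A to G
eps = [11, 9, 13, 7, 13, 8, 8]        # earnings per share (%), companies A to G

def average_ranks(values):
    """Ascending ranks; tied values share the mean of the ranks they occupy."""
    first_position = {}
    for position, value in enumerate(sorted(values), start=1):
        first_position.setdefault(value, position)
    counts = Counter(values)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every group of m tied values."""
    return sum(m * (m ** 2 - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho_ties(x, y):
    """Rank correlation with the correction factor added to sum(d^2)."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

rho = spearman_rho_ties(turnover, eps)
print(f"rho = {rho:.4f}")
```

A coefficient close to zero would suggest little association between the two rankings in this sample.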
Coefficient of Determination:
The squared value of the coefficient of correlation
is called the coefficient of determination.
It indicates “the proportion of the total variability
of dependent variable that is accounted for or
explained by the independent variable”.
It always lies between 0 and 1.
The following table gives indices of industrial
production and number of registered unemployed
people (in lakh). Calculate the value of the
correlation coefficient.
Year                  1991  1992  1993  1994  1995  1996  1997  1998
Index of production   100   102   104   107   105   112   103   99
No. of unemployed     15    12    13    11    12    12    19    26
Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable based on
the value of at least one independent variable
Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to
explain
Independent variable: the variable used to
explain the dependent variable
Simple Linear Regression Model
Only one independent variable, x
Relationship between x and y is
described by a linear function
Changes in y are assumed to be caused
by changes in x
Types of Regression Models
[Figure: four scatter plots showing a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]
Population Linear Regression
The population regression model:
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
where
y = dependent variable
β0 = population y intercept
β1 = population slope coefficient
x = independent variable
ε = random error term, or residual
β0 + β1x is the linear component and ε is the random error component
Linear Regression Assumptions
Error values (ε) are statistically independent
Error values are normally distributed for any
given value of x
The probability distribution of the errors is
normal
The probability distribution of the errors has
constant variance
The underlying relationship between the x
variable and the y variable is linear
[Figure: the population regression line y = β0 + β1x + ε, with intercept β0 and slope β1; for a given xi it marks the observed value of y, the predicted value of y, and the random error εi between them]
Estimated Regression Model
The sample regression line provides an estimate of
the population regression line
\[ \hat{y}_i = b_0 + b_1 x \]
where
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable
The individual random error terms ei have a mean of zero
Least Squares Criterion
b0 and b1 are obtained by finding the values
of b0 and b1 that minimize the sum of the
squared residuals
\[ \sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2 \]
The Least Squares Equation
The formulas for b1 and b0 are:
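\[ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \]
algebraic equivalent:
\[ b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}} \]
and
\[ b_0 = \bar{y} - b_1 \bar{x} \]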
Interpretation of the
Slope and the Intercept
b0 is the estimated average value of y
when the value of x is zero
b1 is the estimated change in the
average value of y as a result of a one-
unit change in x
Finding the Least Squares Equation
The coefficients b0 and b1 will usually be
found using computer software, such as
Excel or Minitab
Other regression measures will also be
computed as part of computer-based
regression analysis
Simple Linear Regression Example
A real estate agent wishes to examine the
relationship between the selling price of a home
and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (y) = house price in $1000s
Independent variable (x) = square feet
Sample Data for House Price Model
House Price in $1000s (y)   Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Regression Using Excel
Tools / Data Analysis / Regression
Excel Output
The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10
ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000
              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
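The intercept, slope and R Square in this output can be reproduced from the sample data with the least squares formulas. The following Python sketch uses only the standard library; the function and variable names are illustrative, and it is a check on the arithmetic rather than a replacement for the Excel procedure.

```python
# Sample data for the house price model
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]   # x
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                   # y, in $1000s

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares(square_feet, price)

# R Square = 1 - SSE / SST
y_bar = sum(price) / len(price)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, price))
sst = sum((y - y_bar) ** 2 for y in price)
r_squared = 1 - sse / sst

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}, R Square = {r_squared:.5f}")
```

The residual and total sums of squares computed here correspond to the Residual and Total rows of the ANOVA table.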
Graphical Presentation
House price model: scatter plot and
regression line
[Figure: house price ($1000s) plotted against square feet with the fitted regression line; slope = 0.10977, intercept = 98.248]
house price = 98.24833 + 0.10977 (square feet)
Interpretation of the
Intercept, b0
house price = 98.24833 + 0.10977 (square feet)
b0 is the estimated average value of Y when the
value of X is zero (if x = 0 is in the range of
observed x values)
Here, no houses had 0 square feet, so b0 = 98.24833
just indicates that, for houses within the range of
sizes observed, $98,248.33 is the portion of the
house price not explained by square feet
Interpretation of the
Slope Coefficient, b1
house price = 98.24833 + 0.10977 (square feet)
b1 measures the estimated change in the
average value of Y as a result of a one-
unit change in X
Here, b1 = 0.10977 tells us that the average value of a
house increases by 0.10977($1000) = $109.77
for each additional square foot of size
Example: House Prices
Estimated Regression Equation:
house price = 98.25 + 0.1098 (sq. ft.)
Using the sample of 10 houses (house price in $1000s, y; square feet, x),
predict the price for a house with 2000 square feet
Example: House Prices
(continued)
Predict the price for a house
with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85
The predicted price for a house with 2000
square feet is 317.85($1,000s) = $317,850
Example: Market Trend
In finance, it is of interest to look at
the relationship between Y, a stock’s
average return, and X, the overall
market return. The slope coefficient
computed by linear regression is
called the stock’s beta by investment
analysts. A beta greater than 1
indicates that the stock is relatively
sensitive to changes in the market; a
beta less than 1 indicates that the
stock is relatively insensitive. For the
following data, compute the beta and
suggest the market trend.
Overall Market Return % (X)   Average Return % (Y)
10                            11
12                            15
8                             3
15                            18
9                             10
11                            12
8                             6
10                            7
13                            18
11                            13
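Since beta is the least squares slope of Y on X, it can be computed with the algebraic form of the slope formula. A minimal Python sketch, assuming each row of the table pairs the market return (X) with the stock's average return (Y); the variable names are illustrative:

```python
market = [10, 12, 8, 15, 9, 11, 8, 10, 13, 11]    # X: overall market return %
stock = [11, 15, 3, 18, 10, 12, 6, 7, 18, 13]     # Y: stock's average return %

n = len(market)
sum_x, sum_y = sum(market), sum(stock)
sum_xy = sum(x * y for x, y in zip(market, stock))
sum_x2 = sum(x * x for x in market)

# Slope (beta) and intercept from the algebraic least squares formulas
beta = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
intercept = sum_y / n - beta * sum_x / n

print(f"beta = {beta:.4f}, intercept = {intercept:.4f}")
print("relatively sensitive to the market" if beta > 1 else "relatively insensitive to the market")
```

By the rule stated in the example, a beta above 1 would mark the stock as relatively sensitive to changes in the market.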
Properties of regression lines and
their coefficients:
1. The correlation coefficient is the geometric
mean of the two regression
coefficients (of y on x and of x on y).
2. The sign of the correlation coefficient is the
same as that of the regression coefficients.
3. Regression coefficients are independent
of a change of origin but not of a change of scale.
Problem
The following data give the ages and blood
pressure (BP) of 10 women.
1. Find the correlation coefficient between age and BP
2. Determine the least squares regression
equation of BP on age
3. Estimate the BP of a woman whose age is
45
Data and Calculations
        Age (x)   BP (y)   x²      y²       xy
        56        147      3136    21609    8232
        42        125      1764    15625    5250
        36        118      1296    13924    4248
        47        128      2209    16384    6016
        49        145      2401    21025    7105
        42        140      1764    19600    5880
        60        155      3600    24025    9300
        72        160      5184    25600    11520
        63        149      3969    22201    9387
        55        150      3025    22500    8250
Total   522       1417     28348   202493   75188
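Correlation coefficient
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} \]
\[ = \frac{10(75188) - (522)(1417)}{\sqrt{\left[10(28348) - (522)^2\right]\left[10(202493) - (1417)^2\right]}} \]
r = 0.891679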
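Regression Equation of y on x
\[ \hat{y}_i = b_0 + b_1 x \]
\[ b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x} \]
b1 = 1.11, b0 = 83.755
The regression equation is
ŷ = 83.755 + 1.11x
When x = 45, ŷ = 83.755 + 1.11(45) = 133.705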
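The three answers to this problem can be checked directly from the column totals of the calculation table. The sketch below is one possible check in Python; the variable names are illustrative, and small differences from the slide values arise because the slide rounds b1 to 1.11 before predicting.

```python
import math

# Column totals from the age / BP calculation table
n = 10
sum_x, sum_y = 522, 1417
sum_x2, sum_y2, sum_xy = 28348, 202493, 75188

# Building blocks shared by r and b1
s_xy = n * sum_xy - sum_x * sum_y      # n*sum(xy) - sum(x)*sum(y)
s_xx = n * sum_x2 - sum_x ** 2         # n*sum(x^2) - (sum x)^2
s_yy = n * sum_y2 - sum_y ** 2         # n*sum(y^2) - (sum y)^2

r = s_xy / math.sqrt(s_xx * s_yy)      # correlation coefficient
b1 = s_xy / s_xx                       # slope of BP on age
b0 = sum_y / n - b1 * sum_x / n        # intercept
bp_at_45 = b0 + b1 * 45                # estimated BP at age 45

print(f"r = {r:.6f}, b1 = {b1:.4f}, b0 = {b0:.4f}, BP at 45 = {bp_at_45:.3f}")
```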
Multiple regression Analysis
A linear regression equation with more than one
independent variable is called a multiple
regression model.
The linear regression equation with
k independent variables takes the form:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + \varepsilon \]
where
y is the value of the dependent variable to be estimated
β0 is a constant
β1, β2, ..., βk are the regression coefficients associated
with each of the k independent variables
ε is the random error due to chance.
Let the fitted linear regression equation be
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k \]
which minimizes the sum of squared errors (SSE), \( \sum (y - \hat{y})^2 \),
where
ŷ is the estimated value of the dependent variable y
b1, b2, b3, ..., bk are the partial regression coefficients,
obtained by the principle of least squares.
Let us consider the case of two independent
variables and one dependent variable.
The multiple linear regression model
involving two independent variables is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]
where
y is the dependent variable
x1 and x2 are the independent variables
ε is the random error due to chance
β0 is the y-intercept
β1, β2 are the regression coefficients.
Let the fitted multiple linear regression equation be
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 \]
or
\[ \hat{y} = b_0 + b_{y1.2} x_1 + b_{y2.1} x_2 \]
where
ŷ is the estimated value of the dependent variable y
x1, x2 are the independent variables
b0, b1, b2 are the unknown constants,
determined by the principle of least squares,
which minimizes the sum of squared errors (SSE), \( \sum (y - \hat{y})^2 \)
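The values of b0, b1, b2 can be determined
by solving the following normal equations:
\[ \sum y = n b_0 + b_{y1.2} \sum x_1 + b_{y2.1} \sum x_2 \]
\[ \sum y x_1 = b_0 \sum x_1 + b_{y1.2} \sum x_1^2 + b_{y2.1} \sum x_1 x_2 \]
\[ \sum y x_2 = b_0 \sum x_2 + b_{y1.2} \sum x_1 x_2 + b_{y2.1} \sum x_2^2 \]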
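Let the fitted multiple linear regression equation be
\[ y = b_0 + b_1 x_1 + b_2 x_2 \]
or
\[ y = b_0 + b_{y1.2} x_1 + b_{y2.1} x_2 \quad (1) \]
\[ \bar{y} = b_0 + b_{y1.2} \bar{x}_1 + b_{y2.1} \bar{x}_2 \quad (2) \]
Subtracting (2) from (1):
\[ (y - \bar{y}) = b_{y1.2} (x_1 - \bar{x}_1) + b_{y2.1} (x_2 - \bar{x}_2) \]
\[ Y = b_{y1.2} X_1 + b_{y2.1} X_2 \]
\[ b_{y1.2} = \frac{\sum Y X_1 \sum X_2^2 - \sum Y X_2 \sum X_1 X_2}{\sum X_1^2 \sum X_2^2 - \left(\sum X_1 X_2\right)^2} \]
\[ b_{y2.1} = \frac{\sum Y X_2 \sum X_1^2 - \sum Y X_1 \sum X_1 X_2}{\sum X_1^2 \sum X_2^2 - \left(\sum X_1 X_2\right)^2} \]
where \( Y = y - \bar{y} \), \( X_1 = x_1 - \bar{x}_1 \), \( X_2 = x_2 - \bar{x}_2 \)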
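Relationship between partial regression coefficients and correlation coefficients:
\[ b_{y1.2} = \frac{r_{y1} - r_{y2} r_{12}}{1 - r_{12}^2} \cdot \frac{\sigma_y}{\sigma_1} \]
\[ b_{y2.1} = \frac{r_{y2} - r_{y1} r_{12}}{1 - r_{12}^2} \cdot \frac{\sigma_y}{\sigma_2} \]
where
\[ r_{y1} = \frac{\sum Y X_1}{\sqrt{\sum Y^2 \sum X_1^2}} \quad \text{is the correlation between } y \text{ and } x_1 \]
\[ r_{y2} = \frac{\sum Y X_2}{\sqrt{\sum Y^2 \sum X_2^2}} \quad \text{is the correlation between } y \text{ and } x_2 \]
\[ r_{12} = \frac{\sum X_1 X_2}{\sqrt{\sum X_1^2 \sum X_2^2}} \quad \text{is the correlation between } x_1 \text{ and } x_2 \]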
A marketing manager of a company wants to
predict demand for the product. He strongly
believes that demand is highly influenced by the annual
average price of the product (in units) and
advertising expenditure (Rs in lakh). He has
collected past data to study the effect of these
factors on demand, given below:
Y    4   6   7   9  13  15
X1  15  12   8   6   4   3
X2  30  24  20  14  10   4
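The deviation-form formulas for the two partial regression coefficients can be applied to this data with a short Python sketch. The function name `fit_two_predictors` and the intermediate names are illustrative; Y, X1 and X2 are used exactly as labelled in the table.

```python
y = [4, 6, 7, 9, 13, 15]
x1 = [15, 12, 8, 6, 4, 3]
x2 = [30, 24, 20, 14, 10, 4]

def mean(values):
    return sum(values) / len(values)

def fit_two_predictors(y, x1, x2):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 using deviations from the means."""
    y_bar, x1_bar, x2_bar = mean(y), mean(x1), mean(x2)
    Y = [v - y_bar for v in y]
    X1 = [v - x1_bar for v in x1]
    X2 = [v - x2_bar for v in x2]
    s11 = sum(a * a for a in X1)                 # sum of X1^2
    s22 = sum(a * a for a in X2)                 # sum of X2^2
    s12 = sum(a * b for a, b in zip(X1, X2))     # sum of X1*X2
    sy1 = sum(a * b for a, b in zip(Y, X1))      # sum of Y*X1
    sy2 = sum(a * b for a, b in zip(Y, X2))      # sum of Y*X2
    denom = s11 * s22 - s12 ** 2
    b1 = (sy1 * s22 - sy2 * s12) / denom         # b_y1.2
    b2 = (sy2 * s11 - sy1 * s12) / denom         # b_y2.1
    b0 = y_bar - b1 * x1_bar - b2 * x2_bar       # from equation (2)
    return b0, b1, b2

b0, b1, b2 = fit_two_predictors(y, x1, x2)
print(f"y = {b0:.4f} + {b1:.4f} x1 + {b2:.4f} x2")
```

The same three coefficients would be obtained by solving the three normal equations directly.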
• The following results are obtained from
measurements of length (in mm), volume (in cc)
and weight (in gm) of 300 eggs.
x̄1 = 55.95    x̄2 = 51.48    ȳ = 56.03
σ1 = 2.26     σ2 = 4.39     σy = 4.41
ry1 = 0.578   ry2 = 0.581   r12 = 0.974
Obtain the linear regression equation of egg weight
on its length and volume. Hence estimate the weight of an egg
whose length is 58 mm and volume is 52.5 cc.
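One way to approach the egg problem is through the formulas that express the partial regression coefficients in terms of correlations and standard deviations. The sketch below assumes x1 is length, x2 is volume and y is weight, following the order in which the measurements are listed; the variable and function names are illustrative.

```python
# Summary statistics from the problem (assumed: 1 = length, 2 = volume, y = weight)
x1_bar, x2_bar, y_bar = 55.95, 51.48, 56.03
sd_1, sd_2, sd_y = 2.26, 4.39, 4.41
r_y1, r_y2, r_12 = 0.578, 0.581, 0.974

# Partial regression coefficients from the correlation-based formulas
b_y1_2 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2) * sd_y / sd_1
b_y2_1 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2) * sd_y / sd_2

# Intercept: the fitted plane passes through the point of means
b0 = y_bar - b_y1_2 * x1_bar - b_y2_1 * x2_bar

def predict_weight(length_mm, volume_cc):
    """Estimated egg weight (gm) from length (mm) and volume (cc)."""
    return b0 + b_y1_2 * length_mm + b_y2_1 * volume_cc

weight = predict_weight(58, 52.5)
print(f"b_y1.2 = {b_y1_2:.4f}, b_y2.1 = {b_y2_1:.4f}, b0 = {b0:.4f}")
print(f"estimated weight = {weight:.2f} gm")
```

Because r12 is close to 1, the denominator 1 - r12² is small, so the coefficients are sensitive to rounding in the given correlations.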
The Federal Reserve is performing a preliminary
study to determine the relationship between
certain economic indicators and annual
percentage change in the gross national product
(GNP). Two such indicators being examined are
the amount of the federal government’s deficit (in
billions of dollars) and the Dow Jones Industrial
Average (the mean value over the year). Data for
6 years follow:
Change in GNP 2.5 -1.0 4.0 1.0 1.5 3.0
Federal Deficit 100.0 400.0 120.0 200.0 180.0 80.0
Dow Jones 2850 2100 3300 2400 2550 2700
i. Calculate the least squares equation that best
describes the data.
ii. What % change in GNP would be expected in a year
in which the federal deficit was $240 billion and the
mean Dow Jones value was 3000?
Multiple correlation analysis:
It is a measure of association between a
dependent variable and several independent
variables taken together.
The coefficient of multiple correlation is given by
\[ R_{y.12} = \sqrt{\frac{r_{y1}^2 + r_{y2}^2 - 2 r_{y1} r_{y2} r_{12}}{1 - r_{12}^2}} \]
Its value always lies between 0 and 1.
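As an illustration, the formula can be evaluated with the three correlations quoted in the egg problem (ry1 = 0.578, ry2 = 0.581, r12 = 0.974). The function name below is an illustrative choice.

```python
import math

def multiple_correlation(r_y1: float, r_y2: float, r_12: float) -> float:
    """Coefficient of multiple correlation R_y.12 for two independent variables."""
    numerator = r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12
    return math.sqrt(numerator / (1 - r_12 ** 2))

# Correlations taken from the egg measurements problem
R = multiple_correlation(0.578, 0.581, 0.974)
R_squared = R ** 2    # coefficient of multiple determination

print(f"R = {R:.4f}, R^2 = {R_squared:.4f}")
```

R cannot be smaller than the larger of |ry1| and |ry2|, which gives a quick sanity check on the result.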
Coefficient of multiple determination:
It is the proportion of the total variation in the
dependent variable y that is
accounted for or explained by the independent
variables in the multiple regression model.
The square of the coefficient of multiple correlation
is called the coefficient of multiple determination.