Correlation and Regression Session 20
Introduction:
Correlation and Simple Linear
Regression
By
Prof. Vishal Singh Patyal
Outline
• Scatter diagram
• Karl Pearson’s coefficient of correlation
• Simple Linear Regression using the Ordinary Least
Squares (OLS) Method
• Residual Analysis
• Coefficient of Determination
2021
Sample Covariance
• The sample covariance measures the strength of the
linear relationship between two numerical variables.
• The sample covariance:
$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
• The covariance is only concerned with the strength of
the relationship.
• No causal effect is implied.
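The covariance definition can be checked numerically. The sketch below is only an illustration: it uses the 11-item X and Y data from the correlation example later in these slides, and the function name `sample_covariance` is a hypothetical helper, not part of the lecture.

```python
# Sample covariance computed directly from its definition:
# sum of cross-deviations from the means, divided by n - 1.
x = [7, 5, 8, 3, 6, 10, 12, 4, 9, 15, 18]
y = [21, 15, 24, 9, 18, 30, 36, 12, 27, 45, 54]


def sample_covariance(xs, ys):
    """Return the sample covariance of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return cross / (n - 1)


cov_xy = sample_covariance(x, y)
print(round(cov_xy, 2))
```

The result depends on the units of X and Y, which is one reason the unit-free correlation coefficient is usually preferred for judging strength.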
Correlation
• A scatter plot (or scatter diagram) can be used to
show the relationship between two numerical
variables
• Correlation is a measure of the degree of relatedness
of two variables.
• Correlation analysis is used to measure strength of
the association (linear relationship) between two
variables
– Correlation is only concerned with strength of the
relationship
– No causal effect is implied with correlation
Correlation
• Correlation is a statistical tool that helps to measure
and analyze the degree of relationship between two
variables.
– The degree of relationship between the variables under
consideration is measured through correlation
analysis.
• Correlation analysis deals with the association
between two or more variables.
– The degree of relationship is expressed by the correlation
coefficient r, which ranges from ‐1 to +1 (‐1 ≤ r ≤ +1)
– The direction of change is indicated by a sign.
Types of Correlation
• Positive Correlation
• Negative Correlation
Direction of the Correlation
• The direction is indicated by the sign: (+) or (-)
• Positive relationship – Variables change in the
same direction.
– As X is increasing, Y is increasing
– As X is decreasing, Y is decreasing
E.g., As height increases, so does weight.
• Negative relationship – Variables change in
opposite directions.
– As X is increasing, Y is decreasing
– As X is decreasing, Y is increasing
E.g., As TV time increases, grades decrease
More examples
Positive relationships:
• Study time and grades
• Rain and umbrella
• Temperature vs ice cream sales
• Salary vs spending
Negative relationships:
• Price & quantity demanded
• Exercise & body weight
• Employees are laid off vs satisfaction
Methods of Studying Correlation
• Scatter Diagram Method
• Karl Pearson’s Coefficient of Correlation
Scatter Diagram Method
• A scatter diagram is a graph of observed
plotted points, where each point represents
the values of X & Y as a coordinate.
• It portrays the relationship between these
two variables graphically.
Correlation Graphical View
[Figure: scatter plots illustrating r < 0, r > 0, and r = 0]
Correlation Coefficient ‐ Interpretation
Scatter Diagram
Advantages
• Simple & non-mathematical method
• Not influenced by the size of extreme items
• First step in investigating the relationship
between two variables
Disadvantage
• Cannot give an exact degree of
correlation
Karl Pearson’s Coefficient of Correlation
• Pearson’s ‘r’ is the most common correlation coefficient.
• Karl Pearson’s Coefficient of Correlation is denoted by ‘r’.
• The coefficient of correlation ‘r’ measures the degree of
linear relationship between two variables, say x & y.
• ‐1 ≤ r ≤ +1
• Degree of correlation is expressed by the value of the
coefficient
• Direction of change is indicated by the sign (‐ve) or (+ve)
• Used when both variables are measured on an interval (or ratio) scale
Karl Pearson’s Coefficient of Correlation
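Sum of squares for variable X: $SS_{XX} = \sum X^2 - \frac{(\sum X)^2}{n}$
Sum of squares for variable Y: $SS_{YY} = \sum Y^2 - \frac{(\sum Y)^2}{n}$
Sum of the cross‐products: $SS_{XY} = \sum XY - \frac{(\sum X)(\sum Y)}{n}$
Correlation coefficient: $r = \frac{SS_{XY}}{\sqrt{SS_{XX} \, SS_{YY}}}$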
Example
The following is a set of data from a sample n= 11 items
X 7 5 8 3 6 10 12 4 9 15 18
Y 21 15 24 9 18 30 36 12 27 45 54
Compute the coefficient of correlation
How strong is the relationship between X and Y?
Comment.
Example
SSxx = 217.64
SSyy = 1958.73
SSxy = 652.91
r = 652.91/√(217.64 × 1958.73)
r = 1.00 (every Y value equals 3X, so the points lie exactly on a straight line)
Perfectly positively correlated
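The calculation above can be reproduced with a short script. This is a minimal sketch using the deviation form of the sums of squares; the helper name `sums_of_squares` is an assumption made for illustration. Because every Y in this data set equals 3X, the computed r should come out as 1 up to floating-point rounding.

```python
import math

# Example data from the slide (n = 11 items).
x = [7, 5, 8, 3, 6, 10, 12, 4, 9, 15, 18]
y = [21, 15, 24, 9, 18, 30, 36, 12, 27, 45, 54]


def sums_of_squares(xs, ys):
    """Return (SSxx, SSyy, SSxy) computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((a - x_bar) ** 2 for a in xs)
    ss_yy = sum((b - y_bar) ** 2 for b in ys)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return ss_xx, ss_yy, ss_xy


ss_xx, ss_yy, ss_xy = sums_of_squares(x, y)
r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(ss_xx, 2), round(ss_yy, 2), round(ss_xy, 2), round(r, 4))
```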
Class Exercise
• In an effort to determine whether any
correlation exists between the price of
stocks of airlines, an analyst sampled six
days of activity of the stock market.
• Using the following prices of Delta stock
and Southwest stock, compute the
coefficient of correlation.
Delta Southwest
47.6 15.1
46.3 15.4
50.6 15.9
52.6 15.6
52.4 16.4
52.7 18.1
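$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}} = \frac{4{,}870.11 - \frac{(302.2)(96.5)}{6}}{\sqrt{\left[15{,}259.62 - \frac{(302.2)^2}{6}\right]\left[1{,}557.91 - \frac{(96.5)^2}{6}\right]}} = 0.6445$$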
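The same result can be obtained from the raw sums, which is how the hand calculation is laid out. The sketch below assumes Delta is treated as X and Southwest as Y; the function name `pearson_r_from_sums` is hypothetical.

```python
import math

# Six days of stock prices from the class exercise.
delta = [47.6, 46.3, 50.6, 52.6, 52.4, 52.7]
southwest = [15.1, 15.4, 15.9, 15.6, 16.4, 18.1]


def pearson_r_from_sums(xs, ys):
    """Pearson's r using the computational (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    )
    return numerator / denominator


r_airlines = pearson_r_from_sums(delta, southwest)
print(round(r_airlines, 4))
```

The raw-sums form avoids computing the means first, at the cost of subtracting two large, nearly equal numbers, so it is mainly convenient for hand or spreadsheet work.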
Interpretation of Correlation Coefficient (r)
• The value of correlation coefficient ‘r’ ranges
from ‐1 to +1
• If r = +1, then the correlation between the two
variables is said to be perfect and positive
• If r = ‐1, then the correlation between the two
variables is said to be perfect and negative
• If r = 0, then there exists no correlation between
the variables.
• Unit free
Assumptions of Pearson’s Correlation
Coefficient
• There is a linear relationship between the two
variables, i.e. when the two variables are
plotted on a scatter diagram, the points fall
along a straight line.
• Cause and effect relation exists between
different forces operating on the item of the
two variable series.
Pearson’s Correlation
Advantages
• It summarizes in one value the degree of
correlation and also the direction
of correlation.
Disadvantages
• Always assumes a linear relationship
• Interpreting the value of r is difficult.
• The value of the correlation coefficient is affected
by extreme values.
• Time-consuming method
Class Exercise
Relationship between Anxiety and Test Scores
Anxiety Test score
(X) (Y)
10 2
8 3
2 9
1 7
5 6
6 5
a. Find r and comment on degree of r
b. Draw a scatter plot.
Class Exercise
For a consumer product, sales were considered
to be related to the Consumer Price Index (CPI). Data
for the past 15 months are given below.
a. Find r and comment on degree of r
b. Draw a scatter plot.
Month                            1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
CPI (X)                          100  113  136  144  126  129  157  160  174  191  214  180  215  224  218
Sales (Y), in crores of rupees   10   13.1 15.1 15.8 13.5 14.7 16.5 18   18.5 21.2 23   20   23.3 24.6 24.4
Simple Linear Regression
What is a Variable?
• Simply, something that varies.
• Specifically, variables represent persons or
objects that can be manipulated, controlled,
or merely measured for the sake of research.
• Variation: How much a variable varies. Those
with little variation are called constants.
Independent Variables
• These variables are ones that are more or less
controlled.
• You might manipulate these variables as
needed.
• They still vary, but the variation is relatively
known (like seconds, or days)
• It is easy to determine the next value of an
independent variable.
Dependent Variables
• Dependent variables are not controlled or
manipulated, but instead are simply
measured.
• Dependent Variables depend on what the
independent variable is.
Independent Vs Dependent
Independent                  Dependent
Intentionally manipulated    Intentionally left alone
Controlled                   Measured
Vary at known rate           Vary at unknown rate
Cause                        Effect
Simple Linear Regression
• Regression analysis is the process of constructing a
mathematical model or function that can be used to
predict or determine one variable by another variable
• Regression analysis is used to:
– Predict the value of a dependent variable based on the
value of at least one independent variable
– Explain the impact of changes in an independent variable
on the dependent variable
• Dependent variable
– the variable you wish to explain
• Independent variable
– the variable used to explain the dependent variable
Types of Regression Models
• Simple: Linear, Non-Linear
• Multiple: Linear, Non-Linear
Simple Regression Analysis
• Bivariate (two variables) linear regression
‐‐ the most elementary regression model
–dependent variable, the variable to be
predicted, usually called Y
–independent variable, the predictor or
explanatory variable, usually called X
Regression Models
• Deterministic Regression Model
$Y = \beta_0 + \beta_1 X$
• Probabilistic Regression Model
$Y = \beta_0 + \beta_1 X + \varepsilon$
• $\beta_0$ and $\beta_1$ are population parameters
Equation of the Simple Regression Line
$$\hat{Y} = b_0 + b_1 X$$
where:
$b_0$ = the sample intercept
$b_1$ = the sample slope
$\hat{Y}$ = the predicted value of Y
Least Squares Analysis
• Least squares analysis is a process whereby a regression
model is developed by producing the minimum sum of
the squared error values
• The vertical distance from each point to the line is the
error of the prediction.
• The least squares regression line is the regression line
that results in the smallest sum of errors squared.
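A brief derivation shows where the least squares slope and intercept come from. This is a standard calculus sketch rather than something stated on the slides; it writes the sum of squared errors as a function of $b_0$ and $b_1$ and sets both partial derivatives to zero.

```latex
\begin{align*}
SSE(b_0, b_1) &= \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2 \\
\frac{\partial SSE}{\partial b_0} &= -2 \sum \left( Y_i - b_0 - b_1 X_i \right) = 0
  \;\Rightarrow\; \sum Y = n b_0 + b_1 \sum X \\
\frac{\partial SSE}{\partial b_1} &= -2 \sum X_i \left( Y_i - b_0 - b_1 X_i \right) = 0
  \;\Rightarrow\; \sum XY = b_0 \sum X + b_1 \sum X^2 \\
\text{Solving the two equations:}\quad
b_1 &= \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}},
\qquad b_0 = \bar{Y} - b_1 \bar{X}
\end{align*}
```

The first condition also implies that the residuals of the fitted line sum to zero.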
Least Squares Analysis
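$$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}$$
$$b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n}$$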
Least Squares Analysis
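$$SS_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$
$$SS_{XX} = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$
$$b_1 = \frac{SS_{XY}}{SS_{XX}}$$
$$b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n}$$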
Linear Regression Probabilistic Model
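$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where:
$Y_i$ = dependent variable
$\beta_0$ = population Y intercept
$\beta_1$ = population slope coefficient
$X_i$ = independent variable
$\varepsilon_i$ = random error term
Linear component: $\beta_0 + \beta_1 X_i$
Random error component: $\varepsilon_i$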
Example: Airline Cost Data
Number of Passengers (X) Cost ($1,000) (Y)
61 4.280
63 4.080
67 4.420
69 4.170
70 4.480
74 4.300
76 4.820
81 4.700
86 5.110
91 5.130
95 5.640
97 5.560
Scatter Plot of Airline Cost Data
[Figure: scatter plot of Cost ($1000) against Number of Passengers]
$\sum X = 930$, $\sum Y = 56.69$, $\sum X^2 = 73{,}764$, $\sum XY = 4{,}462.22$
Solving for b1 and b0 of the Regression
Line: Airline Cost Example (Part 2)
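$$SS_{XY} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 4{,}462.22 - \frac{(930)(56.69)}{12} = 68.745$$
$$SS_{XX} = \sum X^2 - \frac{(\sum X)^2}{n} = 73{,}764 - \frac{(930)^2}{12} = 1689$$
$$b_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{68.745}{1689} = .0407$$
$$b_0 = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n} = \frac{56.69}{12} - (.0407)\frac{930}{12} = 1.57$$
$$\hat{Y} = 1.57 + .0407 X$$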
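The slope and intercept for the airline cost data can be reproduced in a few lines. This is a minimal sketch using the raw-sums formulas for SSxy and SSxx; `fit_line` is a hypothetical helper name chosen for illustration.

```python
# Airline cost data: number of passengers (X) and cost in $1,000 (Y).
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.280, 4.080, 4.420, 4.170, 4.480, 4.300,
        4.820, 4.700, 5.110, 5.130, 5.640, 5.560]


def fit_line(xs, ys):
    """Return (b0, b1) of the least squares line Y-hat = b0 + b1 * X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(a * b for a, b in zip(xs, ys)) - sum_x * sum_y / n
    ss_xx = sum(a * a for a in xs) - sum_x ** 2 / n
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = fit_line(passengers, cost)
print(f"Y-hat = {b0:.2f} + {b1:.4f} X")
```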
Regression Line for the Airline Cost Example
[Figure: scatter plot of Cost ($1000) against Number of Passengers with the fitted regression line]
Residual Analysis
• A residual is the difference between the actual
y values and the predicted ŷ values.
• Reflects the error of the regression line at any
given point.
Residual Analysis: Airline Cost Example
Columns: Number of Passengers (X), Cost ($1,000) (Y), Predicted Value (Ŷ), Residual (Y − Ŷ)
∑(Y − Ŷ) = −.001
Residual Analysis for Number of Passengers
• Outliers: data points that lie apart from the rest of the
points.
• They can produce large residuals and affect the regression
line.
Residuals to Test the Assumptions of the
Regression Model
• The assumptions of the regression model
– The model is linear
– The error terms have constant variances
– The error terms are independent
– The error terms are normally distributed
Standard Error of the Estimate
• Residuals represent errors of estimation for
individual points.
• A more useful measurement of error is the
standard error of the estimate.
• The standard error of the estimate, denoted
s_e, is the standard deviation of the errors of the
regression model.
Standard Error of the Estimate
Sum of Squares Error:
$$SSE = \sum (Y - \hat{Y})^2 = \sum Y^2 - b_0 \sum Y - b_1 \sum XY$$
Standard Error of the Estimate:
$$s_e = \sqrt{\frac{SSE}{n - 2}}$$
Determining SSE for the Airline Cost
Example
Number of        Cost ($1,000)   Residual
Passengers (X)   (Y)             (Y − Ŷ)    (Y − Ŷ)²
61               4.28             .227      .05153
63               4.08            -.054      .00292
67               4.42             .123      .01513
69               4.17            -.208      .04326
70               4.48             .061      .00372
74               4.30            -.282      .07952
76               4.82             .157      .02465
81               4.70            -.167      .02789
86               5.11             .040      .00160
91               5.13            -.144      .02074
95               5.64             .204      .04162
97               5.56             .042      .00176
                 ∑(Y − Ŷ) = -.001           ∑(Y − Ŷ)² = .31434
Sum of squares of error = SSE = .31434
Standard Error of the Estimate
for the Airline Cost Example
Sum of Squares Error:
$$SSE = \sum (Y - \hat{Y})^2 = 0.31434$$
Standard Error of the Estimate:
$$s_e = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{0.31434}{10}} = 0.1773$$
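The residuals, SSE, and standard error of the estimate can be computed together. The sketch below refits the line at full precision, so its SSE and s_e may differ from the slide values in the later decimal places, because the slides square residuals that were already rounded to three decimals. Variable names are illustrative.

```python
import math

# Airline cost data: number of passengers (X) and cost in $1,000 (Y).
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.280, 4.080, 4.420, 4.170, 4.480, 4.300,
        4.820, 4.700, 5.110, 5.130, 5.640, 5.560]

n = len(passengers)
x_bar = sum(passengers) / n
y_bar = sum(cost) / n

# Least squares slope and intercept.
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(passengers, cost))
ss_xx = sum((x - x_bar) ** 2 for x in passengers)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Residual = actual Y minus predicted Y-hat.
predicted = [b0 + b1 * x for x in passengers]
residuals = [y - y_hat for y, y_hat in zip(cost, predicted)]

sse = sum(e ** 2 for e in residuals)   # sum of squares of error
s_e = math.sqrt(sse / (n - 2))         # standard error of the estimate

for x, y, e in zip(passengers, cost, residuals):
    print(f"{x:3d}  {y:5.2f}  {e:7.3f}")
print(f"SSE = {sse:.4f}, s_e = {s_e:.4f}")
```

Dividing by n − 2 rather than n reflects the two parameters (b0 and b1) estimated from the data.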
Coefficient of Determination
• The coefficient of determination is the proportion of
variability of the dependent variable (y) accounted for
or explained by the independent variable (x)
• The coefficient of determination ranges from 0 to 1.
• An r² of zero means that the predictor accounts for
none of the variability of the dependent variable and
that there is no regression prediction of y by x.
• An r² of 1 means perfect prediction of y by x and that
100% of the variability of y is accounted for by x.
Coefficient of Determination
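$$SS_{YY} = \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$
$$SS_{YY} = \text{explained variation} + \text{unexplained variation}$$
$$SS_{YY} = SSR + SSE$$
$$1 = \frac{SSR}{SS_{YY}} + \frac{SSE}{SS_{YY}}$$
$$r^2 = \frac{SSR}{SS_{YY}} = 1 - \frac{SSE}{SS_{YY}} = 1 - \frac{SSE}{\sum Y^2 - \frac{(\sum Y)^2}{n}}$$
$$0 \le r^2 \le 1$$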
Coefficient of Determination
for the Airline Cost Example
$$SSE = 0.31434$$
$$SS_{YY} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 270.9251 - \frac{(56.69)^2}{12} = 3.11209$$
$$r^2 = 1 - \frac{SSE}{SS_{YY}} = 1 - \frac{.31434}{3.11209} = .899$$
89.9% of the variability of the cost of flying a Boeing 737 is accounted for
by the number of passengers.
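The arithmetic for the coefficient of determination can be checked as follows. This sketch takes the SSE value of 0.31434 directly from the slides rather than recomputing it, and the variable names are illustrative.

```python
# Cost in $1,000 (Y) for the airline example.
cost = [4.280, 4.080, 4.420, 4.170, 4.480, 4.300,
        4.820, 4.700, 5.110, 5.130, 5.640, 5.560]

sse = 0.31434  # sum of squares of error, as reported on the slides

n = len(cost)
# Total variation in Y: SSyy = sum(Y^2) - (sum Y)^2 / n
ss_yy = sum(y * y for y in cost) - sum(cost) ** 2 / n

# Proportion of the variation in Y explained by X.
r_squared = 1 - sse / ss_yy
print(f"SSyy = {ss_yy:.5f}, r^2 = {r_squared:.3f}")
```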
Example
• A real estate agent wishes to examine the
relationship between the selling price of a
home and its size (measured in square feet).
• A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet
Linear Regression Example Data
House Price in $1000s Square Feet
(Y) (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Steps to be followed
Step 1: Find X*Y and X*X as it was done in the table below.
(X) (Y) X*Y X*X Y*Y
1400 245 343000 1960000 60025
1600 312 499200 2560000 97344
1700 279 474300 2890000 77841
1875 308 577500 3515625 94864
1100 199 218900 1210000 39601
1550 219 339450 2402500 47961
2350 405 951750 5522500 164025
2450 324 793800 6002500 104976
1425 319 454575 2030625 101761
1700 255 433500 2890000 65025
∑X = 17150    ∑Y = 2865    ∑XY = 5085975    ∑X*X = 30983750    ∑Y*Y = 853423
X bar = 1715    Y bar = 286.5
Contd..
Step 2: Find the sum of every column:
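Step 3: Use the following equations to find b1 and b0:
$$b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
$$SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$
$$SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = (n - 1) s_x^2$$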
Contd…
$$\hat{Y}_i = b_0 + b_1 X_i$$
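Steps 1 to 3 can be followed in code for the house price data. This is a minimal sketch that mirrors the hand calculation (column sums first, then SSxy and SSxx); the variable names are assumptions made for illustration.

```python
# House price data: size in square feet (X) and price in $1000s (Y).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)

# Steps 1 and 2: build the product columns and sum every column.
sum_x = sum(square_feet)
sum_y = sum(price)
sum_xy = sum(x * y for x, y in zip(square_feet, price))
sum_x2 = sum(x * x for x in square_feet)

# Step 3: sums of squares, then slope and intercept.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
b1 = ss_xy / ss_xx
b0 = sum_y / n - b1 * sum_x / n

print(f"sum X = {sum_x}, sum Y = {sum_y}, sum XY = {sum_xy}, sum X^2 = {sum_x2}")
print(f"b1 = {b1:.5f}, b0 = {b0:.3f}")
```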
Linear Regression Example Scatter plot
• House price model: scatter plot
[Figure: scatter plot of House Price ($1000s) against Square Feet]
Linear Regression Example
• House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted regression line; annotations mark the slope (= 0.10977) and the intercept]
Linear Regression Example Interpretation
of b0
• b0 is the estimated mean value of Y when the
value of X is zero (if X = 0 is in the range of
observed X values)
• Because the square footage of the house
cannot be 0, the Y intercept has no practical
application.
Linear Regression Example Interpretation
of b1
• b1 measures the estimated change in the mean
value of Y as a result of a one‐unit change in
X
• Here, b1 = .10977 tells us that the mean value
of a house increases by .10977($1000) =
$109.77, on average, for each additional one
square foot of size
Linear Regression Example Making
Predictions
Predicted house price for X = 2000 square feet:
Ŷ = 98.25 + 0.1098(2000)
= 317.85 (in $1000s, i.e. $317,850)
Exercise
The marks in Computer Science (X) (out of 15) and
Physical Education (Y) (out of 10) are given below.
X 14 10 15 11 9 12 6
Y 8 6 4 3 7 5 9
a) Find the least squares regression line in the
form y=a⋅x+b.
b) Plot the given points and the regression line
c) Find the value of y if x =13
THANK YOU