
Correlation and Regression Session 20

This document provides an introduction and outline for a lesson on correlation and simple linear regression. It defines key terms like sample covariance, correlation, scatter plots, Pearson's coefficient of correlation (r), and positive and negative correlation. Examples are given to illustrate different types of correlation between variables. Methods for studying correlation like scatter diagrams and Pearson's r are explained. The assumptions and interpretation of Pearson's r are also summarized.

Uploaded by

soniya hassina

2021

Introduction:
Correlation and Simple Linear 
Regression

By 
Prof. Vishal Singh Patyal

Outline
• Scatter diagram
• Karl Pearson’s coefficient of correlation
• Simple Linear Regression using the Ordinary Least Squares (OLS) method
• Residual Analysis
• Coefficient of Determination


Sample Covariance
• The sample covariance measures the strength of the 
linear relationship between two numerical variables.
• The sample covariance:
$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
• The covariance is only concerned with the strength of 
the relationship. 
• No causal effect is implied.
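
The covariance formula can be sketched in a few lines of Python. This is only an illustration: the function name `sample_covariance` is made up for this sketch, and the data are borrowed from the n = 11 worked example that appears later in this lesson.

```python
# Data reused from the n = 11 worked example later in this lesson.
x = [7, 5, 8, 3, 6, 10, 12, 4, 9, 15, 18]
y = [21, 15, 24, 9, 18, 30, 36, 12, 27, 45, 54]


def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return cross / (n - 1)


cov_xy = sample_covariance(x, y)
print(round(cov_xy, 2))  # about 65.29
```

The positive sign says that Y tends to rise with X. The magnitude depends on the units of X and Y, which is one reason the unit-free correlation coefficient is usually reported instead.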

Correlation
• A scatter plot (or scatter diagram) can be used to 
show the relationship between two numerical 
variables
• Correlation is a measure of the degree of relatedness 
of two variables.
• Correlation analysis is used to measure strength of 
the association (linear relationship) between two 
variables
– Correlation is only concerned with strength of the 
relationship 
– No causal effect is implied with correlation


Correlation
• Correlation is a statistical tool that helps to measure 
and analyze the degree of relationship between two 
variables. 
– The degree of relationship between the variables under 
consideration is measured through correlation analysis.
• Correlation analysis deals with the association 
between two or more variables.
– The degree of relationship is expressed by a coefficient, r, 
which ranges from −1 to +1 (−1 ≤ r ≤ +1).
– The direction of change is indicated by the sign of r.

Types of Correlation 

• Positive correlation
• Negative correlation

Direction of the Correlation

• Positive relationship – Variables change in the same direction.
– As X is increasing, Y is increasing
– As X is decreasing, Y is decreasing
E.g., As height increases, so does weight.
• Negative relationship – Variables change in opposite directions.
– As X is increasing, Y is decreasing
– As X is decreasing, Y is increasing
E.g., As TV time increases, grades decrease.
• The direction is indicated by the sign of r: (+) or (−).

More examples

Positive relationships:
• Study time and grades
• Rain and umbrella
• Temperature vs ice cream sales
• Salary vs spending

Negative relationships:
• Price & quantity demanded
• Exercise & body weight
• Employee layoffs vs satisfaction

Methods of Studying Correlation

• Scatter Diagram Method
• Karl Pearson’s Coefficient of Correlation

Scatter Diagram Method

• A scatter diagram is a graph of observed plotted points, where each point represents the values of X and Y as a coordinate pair.
• It portrays the relationship between these two variables graphically.


Correlation Graphical View

[Figure: three scatter plots illustrating r < 0, r > 0, and r = 0]

Correlation Coefficient ‐ Interpretation


Scatter Diagram
Advantages:
• Simple and non-mathematical method
• Not influenced by the size of extreme items
• First step in investigating the relationship between two variables

Disadvantage:
• Cannot establish the exact degree of correlation

Karl Pearson’s Coefficient of Correlation 

• Pearson’s r is the most common correlation coefficient.
• Karl Pearson’s coefficient of correlation, denoted by r, measures the degree of linear relationship between two variables, say x and y.
• r always lies between −1 and +1: −1 ≤ r ≤ +1
• The degree of correlation is expressed by the value of the coefficient.
• The direction of change is indicated by the sign (−ve) or (+ve).
• It applies when both variables are measured on an interval (or ratio) scale.

Karl Pearson’s Coefficient of Correlation 

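Sum of squares for variable X:
$$SS_{XX} = \sum (X - \bar{X})^2$$

Sum of squares for variable Y:
$$SS_{YY} = \sum (Y - \bar{Y})^2$$

Sum of the cross-products:
$$SS_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$$

Correlation coefficient:
$$r = \frac{SS_{XY}}{\sqrt{SS_{XX}\, SS_{YY}}}$$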

Example

The following data come from a sample of n = 11 items:

X 7 5 8 3 6 10 12 4 9 15 18
Y 21 15 24 9 18 30 36 12 27 45 54

Compute the coefficient of correlation.
How strong is the relationship between X and Y? Comment.

X bar = 8.82, Y bar = 26.45

X       X-Xbar   Y       Y-Ybar   (X-Xbar)^2   (Y-Ybar)^2   (X-Xbar)(Y-Ybar)
7.00    -1.82    21.00   -5.45    3.31         29.75        9.92
5.00    -3.82    15.00   -11.45   14.58        131.21       43.74
8.00    -0.82    24.00   -2.45    0.67         6.02         2.01
3.00    -5.82    9.00    -17.45   33.85        304.66       101.55
6.00    -2.82    18.00   -8.45    7.94         71.48        23.83
10.00   1.18     30.00   3.55     1.40         12.57        4.19
12.00   3.18     36.00   9.55     10.12        91.12        30.37
4.00    -4.82    12.00   -14.45   23.21        208.93       69.64
9.00    0.18     27.00   0.55     0.03         0.30         0.10
15.00   6.18     45.00   18.55    38.21        343.93       114.64
18.00   9.18     54.00   27.55    84.31        758.75       252.92
Total                             217.64       1958.73      652.91
                                  (SSxx)       (SSyy)       (SSxy)

Example

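SSxx = 217.64

SSyy = 1958.73

SSxy = 652.91

$$r = \frac{652.91}{\sqrt{217.64 \times 1958.73}} = \frac{652.91}{652.92} = 1.00$$

X and Y are perfectly positively correlated: every Y value is exactly 3 times the corresponding X value.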
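
The calculation above can be checked with a short Python sketch. The function name `pearson_r` is an illustrative choice, not part of the lesson. It applies the deviation formulas for SSxx, SSyy, and SSxy directly to the example data. Because Y = 3X for every pair, carrying full precision gives r = 1.00.

```python
from math import sqrt

x = [7, 5, 8, 3, 6, 10, 12, 4, 9, 15, 18]
y = [21, 15, 24, 9, 18, 30, 36, 12, 27, 45, 54]


def pearson_r(xs, ys):
    """Pearson's r from the sums of squared deviations and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in xs)
    ss_yy = sum((yi - y_bar) ** 2 for yi in ys)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return ss_xy / sqrt(ss_xx * ss_yy)


r = pearson_r(x, y)
print(round(r, 2))  # 1.0
```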


Class Exercise
• In an effort to determine whether any 
correlation exists between the price of 
stocks of airlines, an analyst sampled six 
days of activity of the stock market.
• Using the following prices of Delta stock 
and Southwest stock, compute the 
coefficient of correlation.


Delta Southwest
47.6 15.1
46.3 15.4
50.6 15.9
52.6 15.6
52.4 16.4
52.7 18.1

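Σx = 302.2, Σy = 96.5, Σxy = 4,870.11, Σx² = 15,259.62, Σy² = 1,557.91

$$r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}} = \frac{4{,}870.11 - \frac{(302.2)(96.5)}{6}}{\sqrt{\left[15{,}259.62 - \frac{(302.2)^2}{6}\right]\left[1{,}557.91 - \frac{(96.5)^2}{6}\right]}} = 0.6445$$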
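
As a rough check of the class exercise, the sketch below uses the raw-sums (computational) form of r rather than deviations from the mean. Variable names such as `sum_xy` are illustrative only.

```python
from math import sqrt

delta = [47.6, 46.3, 50.6, 52.6, 52.4, 52.7]
southwest = [15.1, 15.4, 15.9, 15.6, 16.4, 18.1]

n = len(delta)
sum_x = sum(delta)                                     # about 302.2
sum_y = sum(southwest)                                 # about 96.5
sum_xy = sum(a * b for a, b in zip(delta, southwest))  # about 4870.11
sum_x2 = sum(a * a for a in delta)                     # about 15259.62
sum_y2 = sum(b * b for b in southwest)                 # about 1557.91

# Computational formula: each bracket is a sum of squares written with raw sums.
numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator
print(round(r, 4))  # about 0.6445
```

An r of about 0.64 indicates a moderate positive linear association between the two stock prices over these six days.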

Interpretation of Correlation Coefficient (r)

• The value of the correlation coefficient r ranges from −1 to +1.
• If r = +1, the correlation between the two variables is said to be perfect and positive.
• If r = −1, the correlation between the two variables is said to be perfect and negative.
• If r = 0, there is no linear correlation between the variables.
• r is unit free: it does not depend on the units in which X and Y are measured.

Assumptions of Pearson’s Correlation 
Coefficient 

• There is a linear relationship between the two variables, i.e. when the two variables are plotted on a scatter diagram, the points fall approximately along a straight line.
• A cause and effect relation exists between the different forces operating on the items of the two variable series.

Pearson’s Correlation 
Advantages:
• It summarizes in one value both the degree and the direction of correlation.

Disadvantages:
• Always assumes a linear relationship
• Interpreting the value of r is difficult
• The value of the correlation coefficient is affected by extreme values
• Time-consuming method

Class Exercise
Relationship between Anxiety and Test Scores
Anxiety (X)   Test score (Y)
10            2
8             3
2             9
1             7
5             6
6             5
a. Find r and comment on the degree of correlation.
b. Draw a scatter plot.

Class Exercise
For a consumer product, sales were considered to be related to the Consumer Price Index (CPI). Data for the past 15 months are given below.
a. Find r and comment on the degree of correlation.
b. Draw a scatter plot.

Month: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CPI (X): 100 113 136 144 126 129 157 160 174 191 214 180 215 224 218
Sales (Y), in crores of rupees: 10 13.1 15.1 15.8 13.5 14.7 16.5 18 18.5 21.2 23 20 23.3 24.6 24.4

Simple Linear Regression


What is a Variable?
• Simply, something that varies.
• Specifically, variables represent persons or 
objects that can be manipulated, controlled, 
or merely measured for the sake of research.

• Variation: How much a variable varies.  Those 
with little variation are called constants.


Independent Variables
• These variables are ones that are more or less 
controlled.  
• You might manipulate these variables as 
needed.
• They still vary, but the variation is relatively 
known (like seconds, or days) 
• It is easy to work out the next value of an independent variable.

Dependent Variables

• Dependent variables are not controlled or 
manipulated, but instead are simply 
measured.
• Dependent Variables depend on what the 
independent variable is. 


Independent vs Dependent

Independent variable:
• Intentionally manipulated
• Controlled
• Varies at a known rate
• Cause

Dependent variable:
• Intentionally left alone
• Measured
• Varies at an unknown rate
• Effect

Simple Linear Regression
• Regression analysis is the process of constructing a 
mathematical model or function that can be used to 
predict or determine one variable by another variable 
• Regression analysis is used to:
– Predict the value of a dependent variable based on the 
value of at least one independent variable
– Explain the impact of changes in an independent variable 
on the dependent variable
• Dependent variable
– the variable you wish to explain
• Independent variable
– the variable used to explain the dependent variable


Types of Regression Models

• Simple regression (1 explanatory variable): linear or non-linear
• Multiple regression (2+ explanatory variables): linear or non-linear

Simple Regression Analysis

• Bivariate (two variables) linear regression 
‐‐ the most elementary regression model
–dependent variable, the variable to be 
predicted, usually called Y
–independent variable, the predictor or 
explanatory variable, usually called X


Regression Models
• Deterministic Regression Model

$$Y = \beta_0 + \beta_1 X$$

• Probabilistic Regression Model

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

• $\beta_0$ and $\beta_1$ are population parameters

• $\beta_0$ and $\beta_1$ are estimated by the sample statistics $b_0$ and $b_1$

Equation of the Simple Regression Line

$$\hat{Y} = b_0 + b_1 X$$

where:
$b_0$ = the sample intercept
$b_1$ = the sample slope
$\hat{Y}$ = the predicted value of Y

Least Squares Analysis
• Least squares analysis is a process whereby a regression 
model is developed by producing the minimum sum of 
the squared error values
• The vertical distance from each point to the line is the 
error of the prediction.
• The least squares regression line is the regression line 
that results in the smallest sum of errors squared.

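Least Squares Analysis

$$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}$$

$$b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n}$$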

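Least Squares Analysis

$$SS_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$

$$SS_{XX} = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$

$$b_1 = \frac{SS_{XY}}{SS_{XX}}$$

$$b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n}$$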

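Linear Regression Probabilistic Model

The population regression model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where:
$Y_i$ = dependent variable
$\beta_0$ = population Y intercept
$\beta_1$ = population slope coefficient
$X_i$ = independent variable
$\varepsilon_i$ = random error term

$\beta_0 + \beta_1 X_i$ is the linear component and $\varepsilon_i$ is the random error component.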

Example: Airline Cost Data
Number of Passengers (X) Cost  ($1,000) (Y)
61 4.280
63 4.080
67 4.420
69 4.170
70 4.480
74 4.300
76 4.820
81 4.700
86 5.110
91 5.130
95 5.640
97 5.560

Scatter Plot of Airline Cost Data

[Figure: scatter plot of Cost ($1000) against Number of Passengers]


Solving for b1 and b0 of the Regression Line: Airline Cost Example (Part 1)

Number of Passengers (X)   Cost ($1,000) (Y)   X²      XY
61                         4.28                3,721   261.08
63                         4.08                3,969   257.04
67                         4.42                4,489   296.14
69                         4.17                4,761   287.73
70                         4.48                4,900   313.60
74                         4.30                5,476   318.20
76                         4.82                5,776   366.32
81                         4.70                6,561   380.70
86                         5.11                7,396   439.46
91                         5.13                8,281   466.83
95                         5.64                9,025   535.80
97                         5.56                9,409   539.32

ΣX = 930   ΣY = 56.69   ΣX² = 73,764   ΣXY = 4,462.22

Solving for b1 and b0 of the Regression Line: Airline Cost Example (Part 2)

$$SS_{XY} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 4{,}462.22 - \frac{(930)(56.69)}{12} = 68.745$$

$$SS_{XX} = \sum X^2 - \frac{(\sum X)^2}{n} = 73{,}764 - \frac{(930)^2}{12} = 1689$$

$$b_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{68.745}{1689} = .0407$$

$$b_0 = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n} = \frac{56.69}{12} - (.0407)\frac{930}{12} = 1.57$$

$$\hat{Y} = 1.57 + .0407X$$
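
The airline cost fit can be reproduced with a few lines of Python. This is a minimal sketch of the SS-based least squares formulas; the variable names are illustrative, and no statistics library is assumed.

```python
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30,
        4.82, 4.70, 5.11, 5.13, 5.64, 5.56]  # in $1,000

n = len(passengers)
sum_x = sum(passengers)
sum_y = sum(cost)
sum_xy = sum(x * y for x, y in zip(passengers, cost))
sum_x2 = sum(x * x for x in passengers)

# Sums of squares and cross-products in their computational form.
ss_xy = sum_xy - sum_x * sum_y / n   # about 68.745
ss_xx = sum_x2 - sum_x ** 2 / n      # 1689

b1 = ss_xy / ss_xx                   # slope, about 0.0407
b0 = sum_y / n - b1 * sum_x / n      # intercept, about 1.57
print(round(b1, 4), round(b0, 2))
```

Read in the units of the data, each additional passenger adds roughly $40.70 to the estimated cost, since cost is measured in $1,000.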

Regression Line for the Airline Cost Example

[Figure: scatter plot of Cost ($1000) against Number of Passengers with the fitted regression line]

Residual Analysis

• A residual is the difference between the actual y value and the predicted $\hat{y}$ value.
• It reflects the error of the regression line at any given point.

Residual Analysis: Airline Cost Example

Number of Passengers (X)   Cost ($1,000) (Y)   Predicted Value (Ŷ)   Residual (Y − Ŷ)
61                         4.28                4.053                 .227
63                         4.08                4.134                 -.054
67                         4.42                4.297                 .123
69                         4.17                4.378                 -.208
70                         4.48                4.419                 .061
74                         4.30                4.582                 -.282
76                         4.82                4.663                 .157
81                         4.70                4.867                 -.167
86                         5.11                5.070                 .040
91                         5.13                5.274                 -.144
95                         5.64                5.436                 .204
97                         5.56                5.518                 .042

$$\sum (Y - \hat{Y}) = -.001$$
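
The residual table can be regenerated with the sketch below. It assumes the rounded coefficients b0 = 1.57 and b1 = 0.0407 from the fitted line, which is why the residuals sum to about −.001 rather than exactly zero. With unrounded least squares coefficients the residuals sum to zero up to floating-point error.

```python
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30,
        4.82, 4.70, 5.11, 5.13, 5.64, 5.56]  # in $1,000

b0, b1 = 1.57, 0.0407  # rounded coefficients of the fitted line

predicted = [b0 + b1 * x for x in passengers]
residuals = [y - y_hat for y, y_hat in zip(cost, predicted)]

for x, y, y_hat, e in zip(passengers, cost, predicted, residuals):
    print(f"{x:3d}  {y:5.2f}  {y_hat:6.3f}  {e:7.3f}")

print(round(sum(residuals), 3))  # about -0.001
```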

Residual Analysis for Number of Passengers

• Outliers: data points that lie apart from the rest of the 
points. 
• They can produce large residuals and affect the regression 
line.

Residuals to Test the Assumptions of the 
Regression Model

• The assumptions of the regression model
– The model is linear
– The error terms have constant variance
– The error terms are independent
– The error terms are normally distributed

Standard Error of the Estimate

• Residuals represent errors of estimation for
individual points.
• A more useful measurement of error is the
standard error of the estimate.
• The standard error of the estimate, denoted $s_e$, is a standard deviation of the error of the regression model.

Standard Error of the Estimate

Sum of Squares Error:
$$SSE = \sum (Y - \hat{Y})^2 = \sum Y^2 - b_0 \sum Y - b_1 \sum XY$$

Standard Error of the Estimate:
$$s_e = \sqrt{\frac{SSE}{n - 2}}$$

Determining SSE for the Airline Cost Example

Number of Passengers (X)   Cost ($1,000) (Y)   Residual (Y − Ŷ)   (Y − Ŷ)²
61                         4.28                .227               .05153
63                         4.08                -.054              .00292
67                         4.42                .123               .01513
69                         4.17                -.208              .04326
70                         4.48                .061               .00372
74                         4.30                -.282              .07952
76                         4.82                .157               .02465
81                         4.70                -.167              .02789
86                         5.11                .040               .00160
91                         5.13                -.144              .02074
95                         5.64                .204               .04162
97                         5.56                .042               .00176

Σ(Y − Ŷ) = −.001   Σ(Y − Ŷ)² = .31434

Sum of squares of error = SSE = .31434

Standard Error of the Estimate for the Airline Cost Example

Sum of Squares Error:
$$SSE = \sum (Y - \hat{Y})^2 = 0.31434$$

Standard Error of the Estimate:
$$s_e = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{0.31434}{10}} = 0.1773$$
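
The sketch below computes SSE two ways, directly from the residuals and through the shortcut ΣY² − b0ΣY − b1ΣXY, and then the standard error of the estimate. It keeps b0 and b1 unrounded, because the shortcut form is sensitive to rounding of the coefficients. As a result the figures come out near 0.3141 and 0.1772, slightly below the hand-rounded 0.31434 and 0.1773.

```python
from math import sqrt

passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30,
        4.82, 4.70, 5.11, 5.13, 5.64, 5.56]  # in $1,000

n = len(passengers)
sum_x, sum_y = sum(passengers), sum(cost)
sum_xy = sum(x * y for x, y in zip(passengers, cost))
sum_x2 = sum(x * x for x in passengers)
sum_y2 = sum(y * y for y in cost)

# Unrounded least squares coefficients.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

# SSE from the residuals, and from the shortcut formula.
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(passengers, cost))
sse_shortcut = sum_y2 - b0 * sum_y - b1 * sum_xy

s_e = sqrt(sse_direct / (n - 2))  # standard error of the estimate
print(round(sse_direct, 4), round(sse_shortcut, 4), round(s_e, 4))
```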

Coefficient of Determination

• The coefficient of determination is the proportion of 
variability of the dependent variable (y) accounted for 
or explained by the independent variable (x)
• The coefficient of determination ranges from 0 to 1.
• An $r^2$ of zero means that the predictor accounts for none of the variability of the dependent variable and that there is no regression prediction of y by x.
• An $r^2$ of 1 means perfect prediction of y by x and that 100% of the variability of y is accounted for by x.

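Coefficient of Determination

$$SS_{YY} = \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$

$$SS_{YY} = \text{explained variation} + \text{unexplained variation}$$

$$SS_{YY} = SSR + SSE$$

$$1 = \frac{SSR}{SS_{YY}} + \frac{SSE}{SS_{YY}}$$

$$r^2 = \frac{SSR}{SS_{YY}} = 1 - \frac{SSE}{SS_{YY}} = 1 - \frac{SSE}{\sum Y^2 - \frac{(\sum Y)^2}{n}}$$

$$0 \le r^2 \le 1$$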

Coefficient of Determination for the Airline Cost Example

$$SSE = 0.31434$$

$$SS_{YY} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 270.9251 - \frac{(56.69)^2}{12} = 3.11209$$

$$r^2 = 1 - \frac{SSE}{SS_{YY}} = 1 - \frac{.31434}{3.11209} = .899$$

89.9% of the variability of the cost of flying a Boeing 737 is accounted for by the number of passengers.
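
A short Python sketch of the same calculation follows. It also checks that, in simple linear regression, r² from 1 − SSE/SSyy agrees with the square of Pearson's r, computed as SSxy² / (SSxx · SSyy). Variable names are illustrative.

```python
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30,
        4.82, 4.70, 5.11, 5.13, 5.64, 5.56]  # in $1,000

n = len(passengers)
x_bar = sum(passengers) / n
y_bar = sum(cost) / n

ss_xx = sum((x - x_bar) ** 2 for x in passengers)
ss_yy = sum((y - y_bar) ** 2 for y in cost)  # total variation, about 3.112
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(passengers, cost))

b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(passengers, cost))
r_squared = 1 - sse / ss_yy                  # 1 - unexplained / total
r_squared_from_r = ss_xy ** 2 / (ss_xx * ss_yy)

print(round(r_squared, 3))  # about 0.899
```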

Example

• A real estate agent wishes to examine the 
relationship between the selling price of a 
home and its size (measured in square feet).
• A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet


Linear Regression Example Data
House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700

Steps to be followed
Step 1: Compute X*Y, X*X, and Y*Y for each observation, as in the table below.

X      Y     X*Y      X*X       Y*Y
1400   245   343000   1960000   60025
1600   312   499200   2560000   97344
1700   279   474300   2890000   77841
1875   308   577500   3515625   94864
1100   199   218900   1210000   39601
1550   219   339450   2402500   47961
2350   405   951750   5522500   164025
2450   324   793800   6002500   104976
1425   319   454575   2030625   101761
1700   255   433500   2890000   65025

∑X = 17150   ∑Y = 2865   ∑XY = 5085975   ∑X*X = 30983750   ∑Y*Y = 853423
X bar = 1715   Y bar = 286.5

Steps to be followed (contd.)
Step 2: Find the sum of every column.
Step 3: Use the following equations to find b1 and b0:

$$b_1 = \frac{SS_{xy}}{SS_{xx}}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

$$SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$

$$SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = (n - 1)s_x^2$$

Steps to be followed (contd.)

Step 4: Substitute b0 and b1 into the regression equation:

$$\hat{Y}_i = b_0 + b_1 X_i$$

house price = 98.24833 + 0.10977 (square feet)

Linear Regression Example Scatter plot

[Figure: scatter plot of House Price ($1000s) against Square Feet]

Linear Regression Example

[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted regression line; slope = 0.10977, intercept = 98.248]

Linear Regression Example Interpretation 
of b0

house price = 98.24833 + 0.10977 (square feet)

• b0 is the estimated mean value of Y when the 
value of X is zero (if X = 0 is in the range of 
observed X values)
• Because the square footage of the house 
cannot be 0, the Y intercept has no practical 
application.


Linear Regression Example Interpretation 
of b1

house price = 98.24833 + 0.10977 (square feet)

• b1 measures the estimated change in the mean value of Y as a result of a one-unit change in X.
• Here, b1 = .10977 tells us that the mean value of a house increases by .10977($1000) = $109.77, on average, for each additional square foot of size.

Linear Regression Example Making 
Predictions

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)

= 98.25 + 0.1098(2000)

= 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850.
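
The whole house price example, from the column sums to the prediction, can be sketched as follows. The helper names `fit_line` and `predict` are hypothetical. With unrounded coefficients the prediction for 2000 square feet is about 317.78 ($317,780), slightly below the 317.85 obtained above from the rounded values 98.25 and 0.1098.

```python
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s


def fit_line(xs, ys):
    """Return (b0, b1) of the least squares line y = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


def predict(b0, b1, x):
    """Predicted mean value of y at x."""
    return b0 + b1 * x


b0, b1 = fit_line(square_feet, price)
print(round(b0, 5), round(b1, 5))  # about 98.24833 and 0.10977

predicted_price = predict(b0, b1, 2000)
print(round(predicted_price, 2))   # about 317.78, in $1000s
```

Note that 2000 square feet lies inside the observed range of X (1100 to 2450), so the prediction is an interpolation rather than an extrapolation.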

Exercise
The marks in Computer Science (X) (out of 15) and Physical Education (Y) (out of 10) are given below:
X 14 10 15 11 9 12 6
Y 8 6 4 3 7 5 9

a) Find the least squares regression line in the form y = a⋅x + b.
b) Plot the given points and the regression line.
c) Find the value of y if x = 13.

THANK YOU
