Lecture 4 - 7 - Association Between Variables - Correlation

IIT MADRAS STATISTICS

Uploaded by

BHARGAV RAO

Statistics for Data Science -1

Lecture 4.7: Association between two numerical variables - Correlation

Usha Mohan

Indian Institute of Technology Madras

Association between numerical variables
Measuring association: Correlation

Learning objectives

1. Understand the measure of correlation.
2. Interpret correlation to quantify the strength of association
   between two numerical variables.

Correlation
- A more easily interpreted measure of linear association between
  two numerical variables is correlation.
- It is derived from covariance.
- To find the correlation between two numerical variables x and y,
  divide the covariance between x and y by the product of the
  standard deviations of x and y. The Pearson correlation
  coefficient, r, between x and y is given by

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{\operatorname{cov}(x, y)}{s_x s_y}
$$
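As a minimal sketch of the definition above, the following Python function computes r in both forms: from the sums of deviations, and as the covariance divided by the product of the standard deviations. The name `pearson_r`, the sample data, and the n − 1 divisor are assumptions made for illustration; the divisor cancels in the ratio, so either convention gives the same r.

```python
from math import sqrt

def pearson_r(x, y):
    """Return (r from deviation sums, r from cov / (s_x * s_y))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross products and squared deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    # Form 1: ratio of the deviation sums.
    r_sums = sxy / (sqrt(sxx) * sqrt(syy))
    # Form 2: covariance over the product of standard deviations
    # (assumed n - 1 divisor; it cancels in the ratio).
    cov = sxy / (n - 1)
    s_x = sqrt(sxx / (n - 1))
    s_y = sqrt(syy / (n - 1))
    r_cov = cov / (s_x * s_y)
    return r_sums, r_cov

# Hypothetical data, not taken from the lecture.
r_sums, r_cov = pearson_r([2, 4, 6, 8], [1, 3, 2, 5])
print(round(r_sums, 4), round(r_cov, 4))
```

Both returned values agree up to floating-point error, which is the equality stated in the formula.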

Remark
The units of the standard deviations cancel the units of the
covariance, so the correlation r is a unitless number.

Remark
It can be shown that the correlation always lies between
−1 and +1, that is, −1 ≤ r ≤ +1.
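One standard way to see this bound, sketched here as an illustration rather than as the lecture's own argument, is the Cauchy-Schwarz inequality applied to the deviations. The symbols a_i = x_i − x̄ and b_i = y_i − ȳ are notation introduced only for this sketch:

```latex
\left(\sum_{i=1}^{n} a_i b_i\right)^2
  \le \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right)
\quad\Longrightarrow\quad
r^2 = \frac{\left(\sum_{i=1}^{n} a_i b_i\right)^2}
           {\sum_{i=1}^{n} a_i^2 \,\sum_{i=1}^{n} b_i^2} \le 1
\quad\Longrightarrow\quad
-1 \le r \le 1 .
```

Equality holds exactly when the deviations are proportional, that is, when all the points lie on a straight line.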

Correlation: Example 1

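Age    Height   Sq. devn of x   Sq. devn of y   Product of devns
x      y        (xi − x̄)²       (yi − ȳ)²       (xi − x̄)(yi − ȳ)
1      75       4               309.76          35.2
2      85       1               57.76           7.6
3      94       0               1.96            0
4      101      1               70.56           8.4
5      108      4               237.16          30.8
Total           10              677.2           82

- x̄ = 3, ȳ = 92.6
- sx = 1.58, sy = 13.01
- r = 82 / √(10 × 677.2) = 0.9964, or equivalently
  r = cov(x, y) / (sx × sy) = 20.5 / (1.58 × 13.01) ≈ 0.9964
  (using unrounded standard deviations; the rounded values shown
  give 0.9973).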
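The arithmetic of Example 1 can be reproduced with a short Python sketch. The variable names are chosen here for illustration; the data and the n − 1 divisor (implied by sx = 1.58) come from the example.

```python
from math import sqrt

age = [1, 2, 3, 4, 5]
height = [75, 85, 94, 101, 108]
n = len(age)
x_bar = sum(age) / n
y_bar = sum(height) / n

# Columns of the table: squared deviations and cross products.
sq_dev_x = [(x - x_bar) ** 2 for x in age]
sq_dev_y = [(y - y_bar) ** 2 for y in height]
cross = [(x - x_bar) * (y - y_bar) for x, y in zip(age, height)]

sxx, syy, sxy = sum(sq_dev_x), sum(sq_dev_y), sum(cross)
s_x = sqrt(sxx / (n - 1))
s_y = sqrt(syy / (n - 1))
cov = sxy / (n - 1)

r = sxy / sqrt(sxx * syy)
# Same ratio, but with the standard deviations rounded to two places.
r_rounded = cov / (round(s_x, 2) * round(s_y, 2))

print(round(sxx, 2), round(syy, 2), round(sxy, 2))  # 10.0 677.2 82.0
print(round(s_x, 2), round(s_y, 2), round(cov, 2))  # 1.58 13.01 20.5
print(round(r, 4), round(r_rounded, 4))             # 0.9964 0.9973
```

The last line shows why the covariance form should be evaluated with unrounded standard deviations: rounding sx and sy first shifts the third decimal place.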

Correlation: Example 2

Age    Price    Sq. devn of x   Sq. devn of y   Product of devns
x      y        (xi − x̄)²       (yi − ȳ)²       (xi − x̄)(yi − ȳ)
1      6        4               4               -4
2      5        1               1               -1
3      4        0               0               0
4      3        1               1               -1
5      2        4               4               -4
Total           10              10              -10

- sx = 1.58, sy = 1.58
- r = −10 / (√10 × √10) = −1, or equivalently
  r = −2.5 / (1.58 × 1.58) = −1 (up to rounding of sx and sy).
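Example 2 gives exactly −1 because its points lie on the line y = 7 − x. As a short worked derivation (the symbols a and b are introduced here for illustration), suppose y_i = a + b x_i for every i, with b ≠ 0:

```latex
\bar{y} = a + b\bar{x}, \qquad y_i - \bar{y} = b\,(x_i - \bar{x}),
\qquad
r = \frac{b \sum_{i=1}^{n} (x_i - \bar{x})^2}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}}
  = \frac{b}{|b|} = \pm 1 .
```

With b = −1 < 0, as in Example 2, this gives r = −1.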

Correlation using Google Sheets

Step 1 The function CORREL(series1, series2) returns the value of
the correlation.
For example: if the data for the x-variable (series1) are in cells
A2:A6 and the data for the y-variable (series2) are in cells B2:B6,
then CORREL(A2:A6,B2:B6) returns the value of the Pearson
correlation coefficient.

Section summary

1. Introduced the measure of correlation.
2. Interpreted correlation between variables.

Fitting a line

Learning objectives

1. Summarize the linear association between two variables using
   the equation of a line.
2. Understand the significance of R².

Summarizing the association with a line

- The strength of linear association between the variables was
  measured using covariance and correlation.
- The linear association can be described using the equation of
  a line.

Equation of line using Google Sheets

Step 1 Open the scatter plot.
Step 2 Under the Customize tab, click on Series.
Step 3 Click on Trendline.
Step 4 Under the Label tab, click on Use Equation, and click the
Show R² button.
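The equation and R² that the trendline reports can also be computed directly. The following is a minimal sketch, assuming the linear trendline is the ordinary least-squares line; the function name `fit_line` is hypothetical, and the data are the Age and Height values from Example 1 of the correlation section, reused here for illustration (the lecture does not fit a line to them).

```python
def fit_line(x, y):
    """Least-squares slope, intercept and R^2 for y = slope * x + intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

age = [1, 2, 3, 4, 5]
height = [75, 85, 94, 101, 108]
slope, intercept, r_squared = fit_line(age, height)
print(round(slope, 2), round(intercept, 2), round(r_squared, 4))  # 8.2 68.0 0.9929
```

Under that assumption the fitted line is Height = 8.2 × Age + 68, and R² equals the square of the correlation 0.9964 found for the same data.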

Example 1: Size versus Price of homes: Equation

Equation of the line: Price = 30.5 × Size + 36
R² = 0.647; r = 0.804

Example 2: Age versus Price of cars: Equation

Equation of the line: Price = −0.694 × Age + 9.03
R² = 0.855; r = −0.9247

Example 3: Size versus Price of homes: Equation

Equation of the line: Price = 7.77 × Size + 130
R² = 0.022; r = 0.149
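For a least-squares line with a single explanatory variable, R² is the square of r, and the slope has the same sign as r. The sketch below checks the figures reported in the three examples against both facts. The tolerance of 0.001 is an assumption chosen to allow for the rounding of the reported values.

```python
# (slope, R^2, r) as reported for the three examples.
examples = {
    "Example 1": (30.5, 0.647, 0.804),
    "Example 2": (-0.694, 0.855, -0.9247),
    "Example 3": (7.77, 0.022, 0.149),
}

for name, (slope, r_squared, r) in examples.items():
    same_sign = (slope > 0) == (r > 0)   # slope and r point the same way
    gap = abs(r ** 2 - r_squared)        # difference between r^2 and R^2
    print(name, round(r ** 2, 3), r_squared, same_sign, gap < 0.001)
```

Example 3 illustrates the interpretation: a weak correlation of 0.149 means the line accounts for only about 2% of the variation in price.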

Section summary

1. Equation of a line describing the linear relationship between
   two variables.
2. Interpreting the slope and R² of the line.
