Advanced Data Analytics_2_Correlation and Simpleregression

Uploaded by

spidxrishanth

IMB – ADVANCED DATA ANALYTICS

Correlation and Regression Line

M. Poli
Decision Sciences & Business Analytics
SDA Bocconi School of Management

Mumbai | India
RELATION BETWEEN TWO VARIABLES

• 35 observations of Price and Sales for a product have been collected (for example, from 35 different brands).

• Are these two variables related (i.e. is there a relation between Price and Sales)?

2/36
COVARIANCE

• Covariance measures the direction of the linear relationship between paired x and y values:

$$\mathrm{Cov}(X,Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$$

• Alternative formula: Cov(X,Y) = Mean(X∙Y) – μx ∙ μy

3/36
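The alternative formula follows from expanding the product inside the definition of the covariance. A short derivation, assuming the population (1/N) convention and writing Mean(·) for the average over the N pairs:

```latex
\begin{aligned}
\mathrm{Cov}(X,Y)
  &= \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y) \\
  &= \frac{1}{N}\sum_{i=1}^{N}x_i y_i
     \;-\; \mu_x\,\frac{1}{N}\sum_{i=1}^{N}y_i
     \;-\; \mu_y\,\frac{1}{N}\sum_{i=1}^{N}x_i
     \;+\; \mu_x\mu_y \\
  &= \mathrm{Mean}(X\cdot Y) - \mu_x\mu_y - \mu_x\mu_y + \mu_x\mu_y \\
  &= \mathrm{Mean}(X\cdot Y) - \mu_x\mu_y
\end{aligned}
```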
COVARIANCE

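Product of the distances from the means

[Figure: scatter plot of Sales against Price, split into four quadrants by the mean Price and the mean Sales (μSALES). The product of the two deviations is positive (++ or --) in the upper-right and lower-left quadrants and negative (-+ or +-) in the upper-left and lower-right quadrants.]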
4/36
COVARIANCE

[Figure: three scatter plots of y against x, showing positive, negative, and almost zero covariance.]

5/36
COVARIANCE

[Figure: two scatter plots of y against x, both with covariance ≈ 0: one with no relation, one with a non-linear relation.]

6/36
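A covariance of about zero does not rule out a relation, only a linear one. The sketch below illustrates this with made-up data (the x values and the rule y = x² are assumptions chosen for illustration, not data from the slides): y is fully determined by x, yet the covariance is zero.

```python
from statistics import fmean


def covariance(xs, ys):
    """Population covariance: mean of the products of the deviations from the means."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


# Hypothetical data: an exact but non-linear (quadratic) relation.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

cov_xy = covariance(xs, ys)
print(cov_xy)
```

The positive products on one side of the mean of x cancel the negative products on the other side, which is why the covariance vanishes here.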
COVARIANCE - CALCULATION

10 obs. (N = 10) for Price and Sales.


          Price     Sales          P − P̄     S − S̄      (P − P̄)(S − S̄)
          (euro)    ('000 pcs.)
          43        184.4           9.4      -185.0     -1738.62
          40        279.1           6.4       -90.3      -577.66
          31        244.0          -2.6      -125.4       325.94
          36        314.2           2.4       -55.2      -132.38
          29        382.2          -4.6        12.8       -59.06
          29        450.2          -4.6        80.8      -371.86
          32        423.6          -1.6        54.2       -86.78
          34        410.2           0.4        40.8        16.34
          32        500.4          -1.6       131.0      -209.66
          30        505.3          -3.6       135.9      -489.38
Sum       336.0     3693.6                              -3323.2
Mean      33.6      369.4                                -332.3
$$\mathrm{Cov}(x,y) = \frac{1}{10}\sum_{i=1}^{10}(x_i - 33.6)(y_i - 369.4) = -332.3$$
7/36
COVARIANCE - CALCULATION

10 obs. (N = 10) for Price and Sales.


          Price     Sales          P*S
          (euro)    ('000 pcs.)
          43        184.4          7929.2
          40        279.1          11164.0
          31        244.0          7564.0
          36        314.2          11311.2
          29        382.2          11083.8
          29        450.2          13055.8
          32        423.6          13555.2
          34        410.2          13946.8
          32        500.4          16012.8
          30        505.3          15159.0
Sum       336.0     3693.6         120781.8
Mean      33.6      369.4          12078.2

Alternative formula:
Cov(X,Y) = Mean(X∙Y) – μX ∙ μY = 12078.2 – 33.6*369.4 = -332.3
(The result -332.3 is obtained with the unrounded mean of Sales, 369.36; the means in the table are shown rounded to one decimal.)
8/36
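Both calculations can be checked in a few lines. The sketch below uses the ten Price and Sales observations from the tables; the variable names are illustrative, and the population (divide by N) convention of the slides is assumed.

```python
from statistics import fmean

# The 10 observations from the slides.
price = [43, 40, 31, 36, 29, 29, 32, 34, 32, 30]  # euro
sales = [184.4, 279.1, 244.0, 314.2, 382.2,
         450.2, 423.6, 410.2, 500.4, 505.3]       # '000 pieces

mean_p, mean_s = fmean(price), fmean(sales)

# Definition: mean of the products of the deviations from the means.
cov_def = fmean((p - mean_p) * (s - mean_s) for p, s in zip(price, sales))

# Alternative formula: Mean(X*Y) - mean(X) * mean(Y).
cov_alt = fmean(p * s for p, s in zip(price, sales)) - mean_p * mean_s

print(round(cov_def, 1), round(cov_alt, 1))
```

Both routes should agree with the -332.3 on the slides; the negative sign says that higher prices go with lower sales in this sample.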
LINEAR CORRELATION COEFFICIENT (ρ OR R)

• Measures strength and direction of the linear relationship between paired x and y values

$$\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_x \, \sigma_y}$$

9/36
LINEAR CORRELATION COEFFICIENT (ρ OR R)

−1 <= R (or ρ) <= 1

• ρ = −1: perfect negative linear correlation
• ρ = 0: no linear correlation
• ρ = +1: perfect positive linear correlation

• ρ is not affected by the choice of x and y (symmetric)

10/36
LINEAR CORRELATION COEFFICIENT

[Figure: three scatter plots of y against x: positive (ρ = 0.6), strong positive (ρ = 0.9), and perfect positive (ρ = 1) correlation.]

11/36
LINEAR CORRELATION COEFFICIENT

[Figure: three scatter plots of y against x: negative (ρ = -0.4), strong negative (ρ = -0.8), and perfect negative (ρ = -1) correlation.]

12/36
LINEAR CORRELATION COEFFICIENT

[Figure: two scatter plots of y against x, both with ρ = 0: one with no correlation, one with a non-linear correlation.]

13/36
LINEAR CORRELATION EXAMPLE

Data set – Haagen Dazs


y  Ice cream (lb)    0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
x  Nr children       2      3      3      6      4      2      1      5

[Figure: "Haagen Dazs" scatter plot of Ice cream (vertical axis, 0 to 3.5) against Nr children (horizontal axis, 0 to 7).]
14/36
LINEAR CORRELATION: EXAMPLE

                Y        X       Y²        X²      X*Y
                0.27     2       0.073     4       0.54
                1.41     3       1.988     9       4.23
                2.19     3       4.796     9       6.57
                2.83     6       8.009     36      16.98
                2.19     4       4.796     16      8.76
                1.81     2       3.276     4       3.62
                0.85     1       0.723     1       0.85
                3.05     5       9.303     25      15.25
Sum             14.6     26      32.963    104     56.8
Mean (N = 8)    1.825    3.25    4.120     13      7.1

Cov(X,Y) = 7.1 – 3.25*1.825 = 1.169

Var(Y) = 4.12 – 1.825² = 0.790
StdDev(Y) = σ(Y) = √0.790 = 0.889

Var(X) = 13 – 3.25² = 2.438
StdDev(X) = σ(X) = √2.438 = 1.561

$$R = \rho = \frac{1.169}{0.889 \cdot 1.561} = 0.842$$

15/36
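The same steps (covariance, the two variances, then their ratio) can be sketched in code. This uses the eight observations from the data set and the "mean of squares minus squared mean" shortcut shown on the slide; the variable names are illustrative.

```python
from math import sqrt
from statistics import fmean

children = [2, 3, 3, 6, 4, 2, 1, 5]                            # x
ice_cream = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # y, lb

mean_x, mean_y = fmean(children), fmean(ice_cream)

# Cov(X,Y) = Mean(X*Y) - mean(X) * mean(Y)
cov_xy = fmean(x * y for x, y in zip(children, ice_cream)) - mean_x * mean_y

# Var = Mean(square) - mean squared (population variance)
var_x = fmean(x * x for x in children) - mean_x ** 2
var_y = fmean(y * y for y in ice_cream) - mean_y ** 2

r = cov_xy / (sqrt(var_x) * sqrt(var_y))
print(f"Cov = {cov_xy:.3f}, Var(X) = {var_x:.3f}, Var(Y) = {var_y:.3f}, R = {r:.3f}")
```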
TEST OF CORRELATION

• Tests whether there is a linear relationship between two numerical variables

• Hypotheses
– H0: ρ = 0 (No Correlation)
– H1: ρ ≠ 0 (Correlation)

16/36
TEST OF CORRELATION

$$t^* = \frac{R}{\sqrt{\dfrac{1 - R^2}{n - 2}}}$$

When H0 is true, t* is distributed as a Student's t distribution with n-2 "degrees of freedom".

17/36
STUDENT’S T DISTRIBUTION

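[Figure: density curves of the standard Normal, t (df = 13), and t (df = 5), plotted against t / z and centred on 0. The t curves are bell-shaped and symmetric, with 'fatter' tails than the Normal.]

For n → ∞, Student's t → Normal (0,1).
In practice, when n > 30 the t distribution is approximately Normal.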

18/36
TEST OF CORRELATION: BASIC IDEA

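H0: ρ = 0
Expected value under H0: t = 0
Observed value: t* = 1.2

[Figure: t distribution centred on t = 0, with the observed t* = 1.2 close to the centre.]

We may retain the null hypothesis: ρ = 0

No significant correlation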
19/36
TEST OF CORRELATION: BASIC IDEA

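H0: ρ = 0
Expected value under H0: t = 0
Observed value: t* = 11.0

[Figure: t distribution centred on t = 0, with the observed t* = 11.0 far out in the right tail.]

We reject the null hypothesis: ρ = 0

There is a significant correlation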
20/36
LEVEL OF RISK

• Acceptable max chance of rejecting the Null Hypothesis even when it is true.

Designated as: α (alpha)


Typical values: 0.01, 0.05, 0.10

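1 - α = Level of Significance of the test, in the terminology of these slides (many texts instead call α the significance level and 1 - α the confidence level)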

21/36
LINEAR CORRELATION T-TEST

Level of Risk = α → Level of significance = 1 - α

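[Figure: t distribution under H0, plotted against the t test statistic. The central no-rejection region has area 1 - α and lies between the negative and positive critical values; each tail beyond a critical value is a rejection region of area α/2.]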
22/36
CRITICAL VALUES, USING T-TABLE

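df = 2. Assume: α = .10 → α / 2 = .05

t values by upper tail area:

df      .25      .10      .05
1       1.000    3.078    6.314
2       0.817    1.886    2.920
3       0.765    1.638    2.353

Critical value: t = 2.920 (df = 2, upper tail area .05)

[Figure: t distribution with an upper tail area of .05 to the right of the critical value 2.920.]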

23/36
LINEAR CORRELATION T-TEST

24/36
LINEAR CORRELATION: EXAMPLE

Data set
y Ice cream (lb) 0.27 1.41 2.19 2.83 2.19 1.81 0.85 3.05
x Children 2 3 3 6 4 2 1 5

n = 8
ρ = R = 0.842
Level of Risk: α = 0.05

Test statistic (n - 2 = 6 degrees of freedom):
$$t_6 = \frac{R}{\sqrt{(1 - R^2)/6}} = +3.83$$

25/36
LINEAR CORRELATION: EXAMPLE

From the t-table, with 6 degrees of freedom and a two-tailed test at α = 5%:

Critical values: ±2.447

Calculated value:

+3.83 > +2.447 → rejection region (right tail)

Reject ρ = 0 → significant correlation

26/36
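The whole test can be traced in code. The sketch below recomputes R from the ice cream data (rather than using the rounded 0.842, which would give about 3.82), forms t*, and compares it with the critical value 2.447 read from the t-table; the function and variable names are illustrative.

```python
from math import sqrt
from statistics import fmean

children = [2, 3, 3, 6, 4, 2, 1, 5]                            # x
ice_cream = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # y, lb


def pearson_r(xs, ys):
    """Linear correlation coefficient: Cov(X,Y) / (sigma_x * sigma_y)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov_xy = fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = fmean((x - mean_x) ** 2 for x in xs)
    var_y = fmean((y - mean_y) ** 2 for y in ys)
    return cov_xy / sqrt(var_x * var_y)


n = len(children)
df = n - 2                              # degrees of freedom
r = pearson_r(children, ice_cream)
t_star = r / sqrt((1 - r ** 2) / df)    # test statistic

T_CRITICAL = 2.447  # two-tailed, alpha = 0.05, df = 6 (value taken from the t-table)
reject_h0 = abs(t_star) > T_CRITICAL

print(f"R = {r:.3f}, t* = {t_star:.2f}, reject H0: {reject_h0}")
```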
DIRECT T-TEST: P-VALUE (2 TAILS)

p-value (provided by software):

Prob ( |t| >= |Calculated Value| )
when H0 is true

If: p >= α → retain H0 → no significant correlation

If: p < α → reject H0 → significant correlation

27/36
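Statistical software reports the p-value directly. As a rough illustration of what it computes, the sketch below integrates the Student's t density numerically with Simpson's rule, using only the standard library. The helper names and the step count are assumptions made for this example; a statistics package would normally be used instead.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p_value(t_star, df, steps=2000):
    """P(|T| >= |t_star|) under H0, via Simpson's rule on [0, |t_star|]."""
    upper = abs(t_star)
    h = upper / steps                    # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3         # P(0 <= T <= |t_star|)
    return 1.0 - 2.0 * central_half      # both tails, by symmetry


alpha = 0.05
p_value = two_tailed_p_value(3.83, df=6)  # t* and df from the ice cream example
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

For the ice cream example the p-value falls below α = 0.05, which matches the decision reached with the critical value ±2.447.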
P-VALUE – ANOTHER EXAMPLE

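α = .05 (α/2 = .025 in each tail); 1 tail p-value = .06

[Figure: t distribution with rejection regions of area α/2 = .025 in each tail (Reject-, Reject+) and the retain region between them. The tail areas beyond -t* and +t* are each .06.]

Since .06 > .025, the t* statistic is inside the retain region → correlation not significant.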

28/36
REGRESSION LINE

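[Figure: scatter of observed points (Xi, Yi) with the fitted line. For an observed value Xi, the vertical distance between the observed value Yi and the fitted value Y*i on the line is the "error".]

Fitted value: Y*i = β0 + β1Xi = a + bXi

Regression line: Y* = a + bX or Y* = β0 + β1X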

29/36
REGRESSION LINE – LEAST SQUARES MODEL
(LSM)

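$$\min \sum (\text{Error})^2 \;\rightarrow\; \min \sum (Y - Y^*)^2 \;\rightarrow\; \min \sum \bigl(Y - (a + bX)\bigr)^2$$

Slope: $$b = \beta_1 = \frac{\mathrm{Cov}(X,Y)}{\sigma_X^2}$$

Intercept: $$a = \beta_0 = \mu_y - b\,\mu_x = \mu_y - \beta_1\,\mu_x$$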

30/36
REGRESSION LINE – EXAMPLE

Data set – Haagen Dazs


y  Ice cream (lb)    0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
x  Children          2      3      3      6      4      2      1      5

[Figure: "Haagen Dazs" scatter plot of Ice cream (vertical axis, 0 to 3.5) against Nr children (horizontal axis, 0 to 7); ρ = 0.842.]

31/36
REGRESSION LINE – EXAMPLE

Cov(X,Y) = 7.1 – 3.25*1.825 = 1.169

Var(X) = 13 – 3.25² = 2.438

                Y        X       Y²        X²      X*Y
                0.27     2       0.073     4       0.54
                1.41     3       1.988     9       4.23
                2.19     3       4.796     9       6.57
                2.83     6       8.009     36      16.98
                2.19     4       4.796     16      8.76
                1.81     2       3.276     4       3.62
                0.85     1       0.723     1       0.85
                3.05     5       9.303     25      15.25
Sum             14.6     26      32.963    104     56.8
Mean (N = 8)    1.825    3.25    4.120     13      7.1

32/36
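From Cov(X,Y) and Var(X) the slope and intercept follow directly from the least squares formulas. A minimal sketch with the ice cream data, where `fit_line` is an illustrative helper name rather than a library function:

```python
from statistics import fmean

children = [2, 3, 3, 6, 4, 2, 1, 5]                            # x
ice_cream = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # y, lb


def fit_line(xs, ys):
    """Least squares line y* = a + b*x, with b = Cov(X,Y) / Var(X)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov_xy = fmean(x * y for x, y in zip(xs, ys)) - mean_x * mean_y
    var_x = fmean(x * x for x in xs) - mean_x ** 2
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


a, b = fit_line(children, ice_cream)
print(f"y* = {b:.4f}x + {a:.4f}")
```

With unrounded intermediate values this gives a slope of about 0.4795 and an intercept of about 0.2667.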
REGRESSION LINE – EXAMPLE

[Figure: "Haagen Dazs" scatter plot of Ice cream (vertical axis, 0 to 3.5) against Nr children (horizontal axis, 0 to 7), with the fitted regression line.]

y = 0.4795x + 0.2667
R² = 0.7096

33/36
COEFFICIENT OF DETERMINATION (R²)

• The "power" (reliability) of the regression line is given by the Coefficient of Determination: R², the square of ρ (R)

0 <= R² <= 1

• R² = 0: "0% of variability of Y explained" (meaningless line)
• R² = 1: "100% of variability of Y explained" (perfect line)

34/36
REGRESSION LINE: EXAMPLE

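[Figure: "Haagen Dazs" scatter plot of Ice cream (vertical axis, 0 to 4) against Nr children (horizontal axis, 0 to 8), with the fitted regression line.]

y = 0.4795x + 0.2667
R² = 0.7096

Best "forecast":
Nr. Children = 7 → 0.4795 ∙ 7 + 0.2667 = 3.623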

35/36
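The phrase "variability of Y explained" can be made concrete by computing R² as one minus the ratio of the residual sum of squares to the total sum of squares, then using the fitted line for the forecast. A sketch with the ice cream data (the names `predict`, `ss_res`, and `ss_tot` are illustrative):

```python
from statistics import fmean

children = [2, 3, 3, 6, 4, 2, 1, 5]                            # x
ice_cream = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # y, lb

mean_x, mean_y = fmean(children), fmean(ice_cream)
cov_xy = fmean((x - mean_x) * (y - mean_y) for x, y in zip(children, ice_cream))
var_x = fmean((x - mean_x) ** 2 for x in children)

b = cov_xy / var_x           # slope
a = mean_y - b * mean_x      # intercept


def predict(x):
    """Fitted value y* = a + b*x."""
    return a + b * x


# Share of the variability of Y explained by the line.
ss_res = sum((y - predict(x)) ** 2 for x, y in zip(children, ice_cream))  # unexplained
ss_tot = sum((y - mean_y) ** 2 for y in ice_cream)                        # total
r_squared = 1 - ss_res / ss_tot

forecast = predict(7)
print(f"R^2 = {r_squared:.4f}, forecast for 7 children = {forecast:.3f}")
```

Note that 7 children lies outside the observed range of 1 to 6, so the forecast extrapolates the line beyond the data.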
REGRESSION LINE

[Figure: scatter plot of COMMERCIAL COSTS (vertical axis, 0 to 35,000) against REVENUES (horizontal axis, 0 to 120,000), with the fitted regression line.]

y = 0.1959x + 1386.7
R² = 0.792

36/36
