Advanced Data Analytics 2: Correlation and Simple Regression
M. Poli
Decision Sciences & Business Analytics
SDA Bocconi School of Management
Mumbai | India
RELATION BETWEEN TWO VARIABLES
• 35 observations of the Price and Sales of a product have been collected (for
example, for 35 different brands).
• Are these two variables related, i.e. is there a relation between Price and Sales?
2/36
COVARIANCE
3/36
COVARIANCE
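[Figure: scatter plot of Sales (vertical axis, with the mean μSALES marked) against Price (horizontal axis); the quadrants around the means are labeled with the signs (+/−) of the deviations.]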
4/36
COVARIANCE
[Figure: three scatter plots of y against x, showing Positive, Negative and Almost 0 covariance.]
5/36
COVARIANCE
[Figure: two scatter plots of y against x, both with Covariance ≈ 0: one with no relation, one with a non-linear relation.]
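A minimal sketch of the second case, assuming hypothetical data that follow a parabola exactly: y is fully determined by x, yet the covariance is 0 because positive and negative products of deviations cancel out. The function name `covariance` and the data are illustrative only.

```python
# Population covariance: mean of the products of deviations from the means.
def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical data: y depends on x exactly, but through a parabola.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

cov_xy = covariance(xs, ys)
print(cov_xy)  # 0.0
```

A covariance near 0 therefore rules out a linear relation only, not a relation in general.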
6/36
COVARIANCE - CALCULATION
• Measures strength and direction of the linear relationship between paired x and y values
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
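The formula can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `correlation` and the data are hypothetical, and population (divide by n) versions of the covariance and standard deviations are used.

```python
import math

def correlation(xs, ys):
    """rho = Cov(X, Y) / (sigma_X * sigma_Y), population versions."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy / (sd_x * sd_y)

# Hypothetical data lying exactly on a line.
xs = [1, 2, 3, 4, 5]
rho_up = correlation(xs, [2 * x + 1 for x in xs])     # increasing line
rho_down = correlation(xs, [10 - 3 * x for x in xs])  # decreasing line
print(round(rho_up, 6), round(rho_down, 6))
```

Dividing by the two standard deviations removes the units, so the result does not change if x or y is rescaled by a positive factor.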
9/36
LINEAR CORRELATION COEFFICIENT (ρ OR R)
−1 ≤ R (or ρ) ≤ 1
R = −1: perfect negative linear correlation
R = 0: no linear correlation
R = +1: perfect positive linear correlation
10/36
LINEAR CORRELATION COEFFICIENT
[Figure: three scatter plots of y against x, showing Positive, Strong positive and Perfect positive correlation.]
11/36
LINEAR CORRELATION COEFFICIENT
[Figure: three scatter plots of y against x, showing Negative, Strong negative and Perfect negative correlation.]
12/36
LINEAR CORRELATION COEFFICIENT
[Figure: two scatter plots of y against x: No Correlation (ρ = 0) and Non-linear Correlation (ρ = 0).]
13/36
LINEAR CORRELATION EXAMPLE
[Figure: scatter plot of Ice cream (vertical axis) against Nr children (horizontal axis).]
14/36
LINEAR CORRELATION: EXAMPLE
15/36
TEST OF CORRELATION
• Hypotheses
– H0: ρ = 0 (No Correlation)
– H1: ρ ≠ 0 (Correlation)
16/36
TEST OF CORRELATION
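$$t^* = \frac{R}{\sqrt{\dfrac{1 - R^2}{n - 2}}}$$
When H0 is true, t* follows a Student's t distribution with n − 2 degrees of freedom.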
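As a rough illustration of the statistic, the sketch below evaluates t* for the same R at two sample sizes. The function name and the values R = 0.5, n = 10 and n = 100 are hypothetical, chosen only to show that the same correlation gives a larger t* when it is based on more observations.

```python
import math

def correlation_t_statistic(r, n):
    """t* = r / sqrt((1 - r^2) / (n - 2)) for a sample of n pairs."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Hypothetical values: same correlation, different sample sizes.
t_small = correlation_t_statistic(0.5, 10)
t_large = correlation_t_statistic(0.5, 100)
print(round(t_small, 2), round(t_large, 2))
```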
17/36
STUDENT’S T DISTRIBUTION
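[Figure: density curves of the Standard Normal, t (df = 13) and t (df = 5), plotted against t / z and centered at 0. All are bell-shaped and symmetric; the t curves have "fatter" tails.]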
for n → ∞, Student's t → Normal (0,1)
in practice, when n > 30, t is approximately Normal
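The convergence can be checked numerically. The sketch below is an illustration, not a reference implementation: it integrates the t density with Simpson's rule (a numerical shortcut chosen here to stay within the standard library) and compares the two-tailed probability P(|T| > 2) with the standard normal value. The cutoff 2 and the function names are assumptions made for the example.

```python
import math

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tail_prob(t_abs, df, steps=2000):
    """P(|T| > t_abs), via Simpson's rule on the density over [0, t_abs]."""
    h = t_abs / steps
    total = t_pdf(0, df) + t_pdf(t_abs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # P(-t_abs < T < t_abs), by symmetry
    return 1 - central

normal_tail = math.erfc(2 / math.sqrt(2))  # P(|Z| > 2) for the standard normal
tails = {df: two_tail_prob(2.0, df) for df in (5, 13, 30, 100)}
for df, p in tails.items():
    print(df, round(p, 4))
print("normal", round(normal_tail, 4))
```

The tail probability shrinks toward the normal value as df grows, which is the "fatter tails" effect fading out.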
18/36
TEST OF CORRELATION: BASIC IDEA
[Figure: two sketches of the t distribution under H0: ρ = 0, where the expected value is t = 0. In the first the observed value is t* = 1.2; in the second it is t* = 11.0.]
• Significance level α: the maximum acceptable probability of rejecting the null hypothesis when it is in fact true.
21/36
LINEAR CORRELATION T-TEST
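[Figure: distribution of the t test statistic under H0, with Critical Value (−) and Critical Value (+). The No-Rejection Region between them has probability 1 − α; Rejection Region− and Rejection Region+ in the tails each have probability α/2.]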
22/36
CRITICAL VALUES, USING T-TABLE
df = 2. Assume α = .10 → α/2 = .05
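For df = 2 the critical value can also be obtained without a table, because the t distribution with 2 degrees of freedom has a closed-form CDF, F(t) = 1/2 + t / (2√(2 + t²)). The sketch below inverts it; the function names are hypothetical, and the shortcut works for df = 2 only.

```python
import math

def t2_cdf(t):
    """Closed-form CDF of Student's t with 2 degrees of freedom."""
    return 0.5 + t / (2 * math.sqrt(2 + t * t))

def t2_critical(alpha):
    """Two-tailed critical value: the t with upper-tail area alpha / 2 (df = 2)."""
    u = 1 - alpha  # equals 2 * (1 - alpha / 2) - 1
    return u * math.sqrt(2 / (1 - u * u))

crit = t2_critical(0.10)
print(round(crit, 3))  # 2.92
```

This agrees with the standard t-table entry of 2.920 for df = 2 and an upper-tail area of .05.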
23/36
LINEAR CORRELATION T-TEST
24/36
LINEAR CORRELATION: EXAMPLE
Data set (n = 8)
y  Ice cream (lb)   0.27  1.41  2.19  2.83  2.19  1.81  0.85  3.05
x  Children            2     3     3     6     4     2     1     5
ρ = R = 0.842
Level of risk: α = 0.05
Test statistic (df = n − 2 = 6):
$$t_6 = \frac{R}{\sqrt{(1 - R^2)/6}} = +3.83$$
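The two numbers on this slide can be reproduced from the raw data. The sketch below is one way to do it, using population (divide by n) moments; the variable names are illustrative.

```python
import math

ice_cream = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # y, in lb
children = [2, 3, 3, 6, 4, 2, 1, 5]                           # x

n = len(children)
mean_x = sum(children) / n
mean_y = sum(ice_cream) / n
# Covariance and variances as "mean of products minus product of means".
cov_xy = sum(x * y for x, y in zip(children, ice_cream)) / n - mean_x * mean_y
var_x = sum(x * x for x in children) / n - mean_x ** 2
var_y = sum(y * y for y in ice_cream) / n - mean_y ** 2

r = cov_xy / math.sqrt(var_x * var_y)
t_star = r / math.sqrt((1 - r ** 2) / (n - 2))
print(round(r, 3), round(t_star, 2))  # 0.842 3.83
```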
25/36
LINEAR CORRELATION: EXAMPLE
Calculated value:
26/36
DIRECT T-TEST: P-VALUE (2 TAILS)
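A two-tailed p-value is the probability, under H0, of a t statistic at least as extreme as the observed t* in either direction. The sketch below approximates it for the ice cream example (t* = 3.83, df = 6) by integrating the t density with Simpson's rule. The numerical method and the function names are assumptions made to keep the example within the standard library; statistical software normally provides this directly.

```python
import math

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t_star, df, steps=2000):
    """P(|T| >= |t*|) under H0, using Simpson's rule on [0, |t*|]."""
    upper = abs(t_star)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

p_value = two_tailed_p_value(3.83, 6)  # t* and df from the ice cream example
reject_h0 = p_value < 0.05             # alpha = 0.05, as in the example
print(round(p_value, 4), reject_h0)
```

Comparing the p-value with α gives the same decision as comparing t* with the critical values.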
27/36
P-VALUE – ANOTHER EXAMPLE
[Figure: t distribution with Reject− and Reject+ regions in the tails, each with α/2 = .025, and the Retain region between them.]
The t* statistic falls inside the Retain region → the correlation is not significant.
28/36
REGRESSION LINE
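[Figure: regression line in the (X, Y) plane with an observed value Yi; the gap between the observed value and the line is labeled "Error".]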
29/36
REGRESSION LINE – LEAST SQUARES MODEL
(LSM)
Slope: $$b = \beta_1 = \frac{\mathrm{Cov}(X, Y)}{\sigma_X^2}$$
Intercept: $$a = \beta_0 = \mu_Y - b\,\mu_X = \mu_Y - \beta_1 \mu_X$$
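The two formulas translate directly into code. The sketch below is a minimal illustration with a hypothetical function name and hypothetical points that lie exactly on y = 1 + 2x, so the expected slope and intercept are known in advance.

```python
def least_squares_line(xs, ys):
    """Return (a, b) with b = Cov(X, Y) / Var(X) and a = mean_y - b * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    b = cov_xy / var_x       # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

# Hypothetical points lying exactly on y = 1 + 2x.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```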
30/36
REGRESSION LINE – EXAMPLE
[Figure: scatter plot of Ice cream (vertical axis) against Nr children (horizontal axis); ρ = 0.842.]
31/36
REGRESSION LINE – EXAMPLE
Cov(X,Y) = mean(X*Y) − mean(X) × mean(Y) = 7.1 − 3.25 × 1.825 = 1.169
Var(X) = mean(X²) − mean(X)² = 13 − 3.25² = 2.438

                 Y       X      Y²      X²     X*Y
                 0.27    2     0.073     4     0.54
                 1.41    3     1.988     9     4.23
                 2.19    3     4.796     9     6.57
                 2.83    6     8.009    36    16.98
                 2.19    4     4.796    16     8.76
                 1.81    2     3.276     4     3.62
                 0.85    1     0.723     1     0.85
                 3.05    5     9.303    25    15.25
Sum             14.6    26    32.963   104    56.8
Mean (N = 8)     1.825   3.25  4.120    13     7.1
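Starting from the "Sum" row of the table, the slope and intercept of the least-squares line follow in a few steps. The sketch below only replays that arithmetic; the variable names are illustrative.

```python
n = 8
sum_y, sum_x, sum_x2, sum_xy = 14.6, 26, 104, 56.8  # "Sum" row of the table

mean_y, mean_x = sum_y / n, sum_x / n
cov_xy = sum_xy / n - mean_x * mean_y  # 7.1 - 3.25 * 1.825
var_x = sum_x2 / n - mean_x ** 2       # 13 - 3.25 ** 2
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x
print(round(slope, 4), round(intercept, 4))  # 0.4795 0.2667
```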
32/36
REGRESSION LINE – EXAMPLE
[Figure: "Haagen Dazs". Scatter plot of Ice cream against Nr children with the fitted line y = 0.4795x + 0.2667, R² = 0.7096.]
33/36
COEFFICIENT OF DETERMINATION (R²)
0 ≤ R² ≤ 1
R² = 0: "0% of variability of Y explained" (meaningless line)
R² = 1: "100% of variability of Y explained" (perfect line)
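One common way to read "variability explained" is R² = 1 − SSE / SST, where SST is the total squared deviation of Y from its mean and SSE is the squared error left around the fitted line. The sketch below applies this to the ice cream data with the fitted line from the example; treating this as the definition behind the slide is an assumption, and the variable names are illustrative.

```python
ice_cream = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # y, in lb
children = [2, 3, 3, 6, 4, 2, 1, 5]                           # x
slope, intercept = 0.4795, 0.2667  # fitted line from the example (rounded)

mean_y = sum(ice_cream) / len(ice_cream)
# Total variability of Y around its mean.
sst = sum((y - mean_y) ** 2 for y in ice_cream)
# Variability left unexplained by the line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(children, ice_cream))

r_squared = 1 - sse / sst
print(round(r_squared, 3))
```

For a simple regression with an intercept, this value equals the square of the linear correlation coefficient, here 0.842² ≈ 0.71.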
34/36
REGRESSION LINE: EXAMPLE
[Figure: "Haagen Dazs". Scatter plot of Ice cream against Nr children with the fitted line y = 0.4795x + 0.2667, R² = 0.7096, extended to 8 children.]
Best "forecast": Nr. Children = 7 → 3.623
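The forecast is the fitted line evaluated at the new x value. A minimal sketch, with a hypothetical function name and the coefficients taken from the example:

```python
def predict_ice_cream(nr_children, slope=0.4795, intercept=0.2667):
    """Point forecast from the fitted line y = slope * x + intercept."""
    return intercept + slope * nr_children

forecast = predict_ice_cream(7)
print(round(forecast, 3))  # 3.623
```

Note that 7 children lies outside the observed range of the data (1 to 6), so this forecast is an extrapolation and relies on the linear relation continuing to hold.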
35/36
REGRESSION LINE
[Figure: scatter plot of COMMERCIAL COSTS (vertical axis, 0 to 35,000) against REVENUES (horizontal axis, 0 to 120,000) with the fitted line y = 0.1959x + 1386.7, R² = 0.792.]
36/36