Is The Dependent Variable Related To The Independent Variable?
[Scatter plots of y against x, illustrating different possible relationships between the response and the regressor]
• The fitted line is an estimate of the true relationship
o Similar to sample mean vs. population mean
o e.g. Is the mean blood pressure of the patients higher than that of the normal controls? 2-sample t-tests are based on the 2 sample means.
2.1. Partitioning variability
• Total sum of squares
SST = Σ_{i=1}^{n} (y_i − ȳ)²
o Sample variance of y_i = SST / (n − 1)
o Variation in the observed response
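As a quick illustration, SST and the sample variance can be computed directly from their definitions. This is a minimal sketch; the data below are made up for illustration only.

```python
# Sketch: total sum of squares and its link to the sample variance.
# The data y are hypothetical, not from the notes.
y = [4.1, 5.0, 6.2, 5.5, 4.8, 6.0]
n = len(y)
ybar = sum(y) / n

sst = sum((yi - ybar) ** 2 for yi in y)   # SST = sum of squared deviations
sample_var = sst / (n - 1)                # sample variance of y = SST / (n - 1)

print(sst, sample_var)
```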
[Scatter plot showing the deviations (y_i − ȳ) of the observations from the sample mean ȳ]
• Regression sum of squares
SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)²
o Variation explained by the regression / model
o Variation due to the regression line
[Scatter plot with the fitted regression line, showing the deviations (ŷ_i − ȳ) of the predicted values from ȳ]
o Based on the least squares estimates
ŷ_i = b0 + b1 x_i
o At x = x̄ the fitted value is b0 + b1 x̄ = (ȳ − b1 x̄) + b1 x̄ = ȳ, so the fitted line passes through (x̄, ȳ)
• Residual sum of squares
SSE = Σ_{i=1}^{n} (y_i − ŷ_i)²
o Variation around the regression line
[Scatter plot with the fitted regression line, showing the residuals (y_i − ŷ_i) around the line]
Since
Σ (ŷ_i − ȳ)(y_i − ŷ_i) = Σ ŷ_i (y_i − ŷ_i) − ȳ Σ (y_i − ŷ_i)
= Σ (b0 + b1 x_i)(y_i − ŷ_i)        [the residuals sum to zero: Σ (y_i − ŷ_i) = 0]
= b0 Σ (y_i − ŷ_i) + b1 Σ x_i (y_i − ŷ_i)
= b1 Σ x_i (y_i − b0 − b1 x_i)
= b1 Σ x_i (y_i − ȳ + b1 x̄ − b1 x_i)        [b0 = ȳ − b1 x̄]
= b1 Σ x_i (y_i − ȳ − b1 (x_i − x̄))
= b1 [Σ x_i (y_i − ȳ) − b1 Σ x_i (x_i − x̄)]
= b1 (S_XY − b1 S_XX)
= b1 (S_XY − (S_XY / S_XX) S_XX)
= 0
Therefore
Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)², i.e. SST = SSR + SSE
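The partition can be verified numerically: fit a least squares line and check that SST = SSR + SSE and that the line passes through (x̄, ȳ). A minimal sketch, with made-up data:

```python
# Sketch: numerically check the partition SST = SSR + SSE for a
# least-squares line. The data are hypothetical.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 9.9, 12.3]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                  # least squares slope
b0 = ybar - b1 * xbar           # least squares intercept
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)
ssr = sum((yh - ybar) ** 2 for yh in yhat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))

# The fitted line passes through (xbar, ybar), and SST = SSR + SSE.
print(sst, ssr + sse)
```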
• Degrees of freedom of SST
o df of SST = Σ (y_i − ȳ)² is n − 1
• Degrees of freedom of SSR
o Though there are n squares in the summation Σ (ŷ_i − ȳ)², the ŷ_i are calculated from the x_i and the 2 parameters of the regression equation, i.e. they start with only 2 degrees of freedom
o 1 df is lost because the (ŷ_i − ȳ) are not independent, in that Σ (ŷ_i − ȳ) = 0
• MSE = SSE / df_E = SSE / (n − 2)
o Mean squared error
o Variance due to error
• MSR = SSR / df_R = SSR / 1
o Mean squared regression
o Variance explained by the model
2.2. Significant regression
(a) Hypothesis
• H0: β1 = 0
o E(Y) = β0
o There is no significant regression
o Regressor variable does not influence response (linearly)
• H1: β1 ≠ 0
o X significantly influences the response linearly
o There is a regression of Y on X
o (Linear) trend is detected
o However, this says nothing about goodness of fit or prediction capability
(b) Distribution of sum of squares
• Assume normality
o ε ~ i.i.d. N(0,σ2)
• Then
Sum of squares   Degrees of freedom   Distribution
SSR              1                    σ²χ²_1 (under H0)
SSE              n − 2                σ²χ²_{n−2}
SST              n − 1                σ²χ²_{n−1} (under H0)
• Test statistic: F = MSR / MSE ~ F_{1,n−2} under H0; reject H0 at level α if F > F_{α,1,n−2}
Example
• F-value = 1813 > 5.32 = F_{0.05,1,8}
o p-value < 0.001
o H0 is rejected at the 5% level of significance
o Statistically significant linear trend
[Scatter plot of the example data, which lie close to the fitted line y = 2x + 10]
Example
Shock data
• SST = SYY = 199.06
o df = n – 1 = 15
• SSE = (11.4 − 10.48)² + (11.9 − 9.87)² + … = 71.32
o df = n – 2 = 14
• SSR = 199.06 – 71.32 = 127.74
o df = 1
• ANOVA table
Sum of Squares df Mean Square F
Regression 127.74 1 127.74 25.07
Residual 71.32 14 5.09
Total 199.06 15
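The ANOVA entries above follow mechanically from SST, SSE, and n. A minimal sketch reproducing them from the quoted sums of squares:

```python
# Sketch: rebuild the Shock-data ANOVA quantities from the sums of
# squares quoted above (SST = 199.06, SSE = 71.32, n = 16).
n = 16
sst, sse = 199.06, 71.32
ssr = sst - sse                 # SSR = SST - SSE = 127.74

msr = ssr / 1                   # mean squared regression, df = 1
mse = sse / (n - 2)             # mean squared error, df = n - 2 = 14
f = msr / mse                   # F statistic, ~ F(1, n-2) under H0

print(ssr, round(mse, 2), round(f, 2))
```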
[Scatter plot of Time versus Shocks with the fitted regression line]
Example
CEO data
• The age and salary of the chief executive officers (CEOs) of small companies were determined.
• Small companies were defined as those with annual sales greater than $5 million and less than $350 million.
• Companies were ranked according to 5-year average return on investment.
• This data covers the first 60 ranked firms.
• There are 59 (1 missing) observations on two variables.
• Age (X)
o The age of the CEO in years.
• Salary (Y)
o The salary of chief executive officer (including bonuses) in thousands of dollars.
[Scatter plot of Salary (in thousands of dollars) versus Age for the 59 CEOs]
• Sample means
o x = 51.54 , y = 404.17
• Sum of squares
o S XX = 4676.64 , SYY = 2820832.31 , S XY = 14650.58
• Estimates
o b0 = 242.70, b1 = 3.1327
• Fitted model
o E(Salary) = 242.70 + 3.1327 × Age
• SST = SYY = 2820832.31
o df = n – 1 = 58
• SSE = 2774936
o df = n – 2 = 57
• SSR = 2820832 – 2774936 = 45896
o df = 1
• ANOVA table
Sum of Squares df Mean Square F
Regression 45896 1 45896 0.94
Residual 2774936 57 48683
Total 2820832 58
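All of the CEO-data estimates and ANOVA entries can be recovered from the summary statistics alone. A minimal sketch (small discrepancies in the last digit come from rounding in the quoted values):

```python
# Sketch: recover the CEO-data estimates from the summary statistics
# quoted above (sample means and sums of squares, n = 59).
xbar, ybar = 51.54, 404.17
sxx, syy, sxy = 4676.64, 2820832.31, 14650.58
n = 59

b1 = sxy / sxx                  # slope estimate, about 3.1327
b0 = ybar - b1 * xbar           # intercept estimate, about 242.7
ssr = b1 * sxy                  # SSR = b1 * S_XY, about 45896
sse = syy - ssr                 # SSE = SST - SSR
f = (ssr / 1) / (sse / (n - 2)) # F statistic, about 0.94

print(round(b1, 4), round(b0, 2), round(f, 2))
```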
• 1-sided test
o H0: βj ≤ c (or βj ≥ c) vs. H1: βj > c (or βj < c)
o Test statistic
t = (b_j − c) / SE(b_j)
o Reject H0 if t > tα,n-2 (or t < –tα,n-2)
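As a sketch of the one-sided rule, consider testing H0: β1 ≥ 0 vs. H1: β1 < 0 at level 0.05 with the Shock-data slope estimate b1 = −0.61, SE(b1) = 0.12, and df = 14; the critical value t_{0.05,14} ≈ 1.761 is taken from standard t tables.

```python
# Sketch: one-sided t-test H0: beta1 >= 0 vs H1: beta1 < 0 at level 0.05,
# using the Shock-data slope estimate b1 = -0.61 with SE(b1) = 0.12, df = 14.
b1, se_b1, c = -0.61, 0.12, 0.0

t = (b1 - c) / se_b1            # test statistic t = (b_j - c) / SE(b_j)
t_crit = 1.761                  # t_{0.05,14} from tables

reject = t < -t_crit            # reject H0 when t is far in the lower tail
print(round(t, 2), reject)
```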
Example
Shock data
Parameter Estimate Std. Error t p-value
β0 10.48 1.08 9.73 <.0001
β1 -0.61 0.12 -5.01 0.0002
• df = 14
• | t | = 9.73 > 2.145 = t0.025,14
o Reject H0 at 5% level of significance.
o β0 is not zero
o non-zero intercept
• | t | = 5.01 > 2.145 = t0.025,14
o Reject H0 at 5% level of significance.
o β1 is not zero.
o Significant linear trend
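The two-sided decisions above can be reproduced from the parameter table; the recomputed t values differ slightly from the tabulated 9.73 and −5.01 because the table's estimates and standard errors are rounded.

```python
# Sketch: reproduce the two-sided decisions from the Shock-data
# parameter table (df = 14, critical value t_{0.025,14} = 2.145).
t_crit = 2.145
params = {"beta0": (10.48, 1.08), "beta1": (-0.61, 0.12)}

for name, (est, se) in params.items():
    t = est / se                        # test H0: parameter = 0
    print(name, round(abs(t), 2), abs(t) > t_crit)
```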
Example
CEO data
Parameter Estimate Std. Error t p-value
β0 242.70 168.76 1.44 0.1559
β1 3.13 3.23 0.97 0.3357
• df = 57
• | t | = 1.44 < 2.002 = t0.025,57 (t0.025,60 = 2.000)
o H0 is not rejected at the 5% level of significance
o No significant evidence that β0 differs from zero
o Intercept not significantly different from zero
• | t | = 0.97 < 2.002 = t0.025,57 (t0.025,60 = 2.000)
o H0 is not rejected at the 5% level of significance.
o No significant evidence that β1 differs from zero.
o Insignificant linear trend
• Z ~ N(0, 1)
o p = NORMSDIST(x) gives p = P(Z ≤ x)
o x = NORMSINV(p)
• F_{ndf,ddf}
o p = FDIST(x, ndf, ddf) gives the upper-tail probability p = P(F > x)
o x = FINV(p, ndf, ddf)
• t_{df}
o p = TDIST(x, df, k) / k gives the upper-tail probability p = P(T > x), where k = 1 or 2 is the number of tails
o x = TINV(2p, df), since TINV takes the two-tailed probability
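The standard normal lookups above can also be done in Python without Excel, using the standard library's `statistics.NormalDist` (Python 3.8+); the F and t lookups have no stdlib equivalent and would need a package such as SciPy, which is not used in this sketch.

```python
# Sketch: standard-library analogues of NORMSDIST / NORMSINV.
from statistics import NormalDist

z = NormalDist()                # standard normal N(0, 1)
p = z.cdf(1.96)                 # like NORMSDIST(1.96): P(Z <= 1.96)
x = z.inv_cdf(0.975)           # like NORMSINV(0.975)

print(round(p, 4), round(x, 2))
```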