
Topic 3b

Analysis of variance (ANOVA) approach to regression analysis
Learning objectives
• Apply ANOVA as an (alternative) approach
to testing for a linear association
• Know when to use the t-test and the F-test
• Understand and interpret regression output
from software e.g. Stata
The basic idea
• Break down the variation in Y (“total sum
of squares”) into two components:
– a component that is “due to” the change in X
(“regression sum of squares”)
– a component that is just due to random error
(“error sum of squares”)
• If the regression sum of squares is a large
component of the total sum of squares, it
suggests that there is a linear association.
Y  Y Yˆ  Y  Y  Yˆ 
i i i i
The above decomposition holds for the sum of the
squared deviations, too:
2 2 2

     
n n n

 Y  Y  Y ˆ  Y   Y  Yˆ
i i i i
i 1 i 1 i 1

Total sum of squares (SST): $\sum (Y_i - \bar{Y})^2$

Regression sum of squares (SSR): $\sum (\hat{Y}_i - \bar{Y})^2$

Error sum of squares (SSE): $\sum (Y_i - \hat{Y}_i)^2$

SST = SSR + SSE
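The decomposition can be checked numerically. Below is a quick Python sketch (the data are made up purely for illustration): fit a simple least-squares line, compute the three sums of squares, and confirm SST = SSR + SSE.

```python
# Numerical check of SST = SSR + SSE for a simple least-squares fit.
# The x and y values below are made up purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Ordinary least-squares slope and intercept for a simple regression.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                 # total sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)              # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # error sum of squares

# The identity holds exactly for OLS fits (up to floating-point error).
assert abs(sst - (ssr + sse)) < 1e-9
print(sst, ssr, sse)
```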


Breakdown of degrees of freedom
Degrees of freedom associated with SST: n - 1

Degrees of freedom associated with SSR: k (the number of predictors)

Degrees of freedom associated with SSE: n - k - 1
Analysis of Variance (ANOVA) Table
Example: Mortality and Latitude
The regression equation is Mort = 389 - 5.98 Lat

Predictor Coef SE Coef T P


Constant 389.19 23.81 16.34 0.000
Lat -5.9776 0.5984 -9.99 0.000

S = 19.12 R-Sq = 68.0% R-Sq(adj) = 67.3%

Analysis of Variance

Source DF SS MS F P
Regression 1 36464 36464 99.80 0.000
Residual Error 47 17173 365
Total 48 53637
How to find n?
• Recall the breakdown of degrees of freedom:

$(n-1) = (k) + (n-k-1)$

• In the table above, the total degrees of freedom are 48, so n = 49.
Definitions of Mean Squares
We already know the mean square error (MSE) is defined as:

$\mathrm{MSE} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-k-1} = \frac{\mathrm{SSE}}{n-k-1}$

For a simple regression k = 1, so that:

$\mathrm{MSE} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\mathrm{SSE}}{n-2}$

Similarly, the regression mean square (MSR) is defined as:

$\mathrm{MSR} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{k} = \frac{\mathrm{SSR}}{k}$
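As a quick Python sketch, the mean squares for the Mortality and Latitude example can be recomputed from the SS and DF columns of its ANOVA table (n = 49, k = 1):

```python
# Mean squares recomputed from the Mortality vs. Latitude ANOVA table.
ssr, sse = 36464.0, 17173.0
n, k = 49, 1

msr = ssr / k              # regression mean square: SSR / k
mse = sse / (n - k - 1)    # mean square error: SSE / (n - k - 1)

print(round(msr), round(mse))  # matches the MS column: 36464 and ~365
```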
R-squared

• $R^2 = \mathrm{SSR}/\mathrm{SST}$. Let us check it in the Mortality and Latitude example: $R^2 = 36464/53637 \approx 0.68$.
• Latitude explains 68% of the variation in mortality; the remaining 32% is unexplained. The two shares always sum to 100%.
Adjusted R-squared
• It is adjusted based on the degrees of freedom (df)
• Relevant in multiple regression
• Adjusted R2 can actually get smaller as additional
variables are added to the model.
• As N gets bigger, the difference between R2 and
Adjusted R2 gets smaller and smaller.
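Both quantities are easy to verify by hand. A short Python sketch for the Mortality and Latitude example, using the conventional adjustment formula $1 - (1-R^2)\,(n-1)/(n-k-1)$:

```python
# R-squared and adjusted R-squared for the Mortality vs. Latitude example.
ssr, sst = 36464.0, 53637.0
n, k = 49, 1

r2 = ssr / sst
# Adjusted R-squared penalizes by the degrees of freedom used.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Matches the software output: R-Sq = 68.0%, R-Sq(adj) = 67.3%
print(round(100 * r2, 1), round(100 * adj_r2, 1))
```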
The formal F-test
for slope parameter β1
Null hypothesis H 0: β1 = 0
Alternative hypothesis HA: β1 ≠ 0

Test statistic: $F^* = \dfrac{\mathrm{MSR}}{\mathrm{MSE}}$

P-value = What is the probability that we’d get an F* statistic


as large as we did, if the null hypothesis is true? (One-tailed
test!)
The P-value is determined by comparing F* to an F distribution
with 1 numerator degree of freedom and n-k-1 denominator
degrees of freedom.
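Putting the pieces together for the Mortality and Latitude example, a Python sketch of the F statistic from the sums of squares:

```python
# F statistic for H0: beta1 = 0 in the Mortality vs. Latitude example.
ssr, sse = 36464.0, 17173.0
n, k = 49, 1

msr = ssr / k              # 36464
mse = sse / (n - k - 1)    # ~365.4
f_star = msr / mse

# Matches the ANOVA table: F = 99.80 on (1, 47) degrees of freedom.
print(round(f_star, 2))
```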
Winning times (in seconds) in Men's 200 meter Olympic sprints, 1900-1996.
Are men getting faster?

Row  Year  Men200m
1    1900  22.20
2    1904  21.60
3    1908  22.60
4    1912  21.70
5    1920  22.00
6    1924  21.60
7    1928  21.80
8    1932  21.20
9    1936  20.70
10   1948  21.10
11   1952  20.70
12   1956  20.60
13   1960  20.50
14   1964  20.30
15   1968  19.83
16   1972  20.00
17   1976  20.23
18   1980  20.19
19   1984  19.80
20   1988  19.75
21   1992  20.01
22   1996  19.32
Regression Plot
Men200m = 76.1534 - 0.0283833 Year
S = 0.298134 R-Sq = 89.9 % R-Sq(adj) = 89.4 %

[Scatter plot of Men200m winning time (seconds, roughly 19.5 to 22.5) versus Year (1900 to 2000), with the fitted regression line.]
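The fitted line and R-squared can be reproduced from the data table with a short Python sketch using the standard least-squares formulas:

```python
# Least-squares fit of Men's 200 m winning times on Year (data from the table above).
years = [1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952,
         1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996]
times = [22.20, 21.60, 22.60, 21.70, 22.00, 21.60, 21.80, 21.20, 20.70, 21.10, 20.70,
         20.60, 20.50, 20.30, 19.83, 20.00, 20.23, 20.19, 19.80, 19.75, 20.01, 19.32]

n = len(years)
xbar = sum(years) / n
ybar = sum(times) / n

b1 = sum((x - xbar) * (y - ybar) for x, y in zip(years, times)) \
     / sum((x - xbar) ** 2 for x in years)
b0 = ybar - b1 * xbar

sst = sum((y - ybar) ** 2 for y in times)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, times))
r2 = 1 - sse / sst

# Reproduces the regression plot: Men200m = 76.1534 - 0.0283833 Year, R-Sq = 89.9%
print(round(b0, 4), round(b1, 7), round(100 * r2, 1))
```

The negative slope (about 0.028 seconds shaved off per year) is the quantitative answer to "Are men getting faster?"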
Analysis of Variance Table
DFE = n-k-1 = 22-2 = 20 MSE = SSE/(n-2) = 1.8/20 = 0.09
MSR = SSR/1 = 15.8

Analysis of Variance
Source DF SS MS F P
Regression 1 15.8 15.8 177.7 0.000
Residual Error 20 1.8 0.09
Total 21 17.6

DFTO = n-1 = 22-1 = 21 F* = MSR/MSE = 15.796/0.089 = 177.7

P = Probability that an F(1,20) random variable is greater than 177.7 = 0.000…


For simple linear regression model,
the F-test and t-test are equivalent.
Predictor Coef SE Coef T P
Constant 76.153 4.152 18.34 0.000
Year -0.0284 0.00213 -13.33 0.000

Analysis of Variance
Source DF SS MS F P
Regression 1 15.796 15.796 177.7 0.000
Residual Error 20 1.778 0.089
Total 21 17.574

$(-13.33)^2 = 177.7$, i.e. $\left(t^*_{(n-k-1)}\right)^2 = F^*_{(1,\,n-k-1)}$
Equivalence of F-test to t-test
• For a given α level, the F-test of β1 = 0
versus β1 ≠ 0 is algebraically equivalent to
the two-tailed t-test.
• Will get exactly the same P-values, so…
– If one test rejects H0, then so will the other.
– If one test does not reject H0, then neither will the other.
Should I use the F-test or the t-test?
• The F-test is only appropriate for testing
that the slope differs from 0 (β1 ≠ 0).
• Use the t-test to test that the slope is
positive (β1 > 0) or negative (β1 < 0).
• F-test is more useful for multiple regression
model when we want to test that more than
one slope parameter is 0. Test if β1 and β2
are jointly significant
The F-test using critical values
• Null hypothesis
H0: β1 = β2 = 0
• Alternative hypothesis
HA: at least one of β1, β2 is not 0
• Test statistic
F* = MSR/MSE
• F-critical: read from the F table with k numerator degrees of freedom (column) and n-k-1 denominator degrees of freedom (row)
• When F* > F-critical, reject H0: the regression is statistically significant.
• When F* < F-critical, fail to reject H0: the regression is not statistically significant.
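The decision rule above can be sketched as a small Python helper. The critical value used in the example call is made up; in practice it would be looked up in an F table (or computed from the F distribution) for the appropriate degrees of freedom.

```python
# Decision rule for the overall F-test; the critical value is an input,
# to be looked up in an F table for (k, n-k-1) degrees of freedom.
def f_test_decision(f_star, f_critical):
    """Return the conclusion of the F-test for H0: beta1 = beta2 = 0."""
    if f_star > f_critical:
        return "reject H0: the regression is statistically significant"
    return "fail to reject H0: the regression is not statistically significant"

# Example with the Men200m F statistic and a hypothetical critical value.
print(f_test_decision(177.7, 4.35))
```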
P-values
