
STATS 101B Discussion 4/14

Navin Souda

2024-04-15
Multiple Testing I
▶ So we run ANOVA and find statistically significant differences
in the effects/means
▶ But this tells us nothing about which specific effects or means differ
▶ Isn’t that what the t-tests are for?
▶ Sort of, but we have to be careful about our significance level
▶ For testing a single effect (i.e. a binary factor), we usually use
a significance level of 0.05 - practically, this means that we
have an at most 5% chance (on average) of rejecting the null
hypothesis when the null hypothesis is actually true
▶ But as we include more tests, each of them separately has up to a
5% chance of a false rejection - the probability of them all being
correct simultaneously could actually be much less than 95%
▶ Some boring but important math: let $A_i$ be the event that test $i$ is
correct, for $i = 1, \ldots, p$
▶ Then the probability that at least one test fails is
$P((\cap_i A_i)^c) = 1 - P(\cap_i A_i)$, and if all the tests are
independent, then $1 - P(\cap_i A_i) = 1 - \prod_i P(A_i) = 1 - (1 - \alpha)^p$
Multiple Testing II

▶ i.e. as $p$ increases, the probability of at least one test being
incorrect gets closer and closer to one (unless the tests are
completely dependent)
▶ This leads us to the concept of multiple testing correction,
which aims to adjust α so that either $P((\cap_i A_i)^c)$ (known as
the family-wise error rate (FWER)) or a slightly different
quantity, the false discovery rate, is controlled at α across all the
tests simultaneously
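
A quick R sketch of the formula above - under the independence
assumption, the chance of at least one false rejection grows fast
with the number of tests:

alpha <- 0.05
p <- c(1, 5, 10, 20, 50)
round(1 - (1 - alpha)^p, 3)  # FWER if nothing is corrected
## 0.050 0.226 0.401 0.642 0.923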
Multiple Testing Correction I
▶ Bonferroni Method
▶ The simplest and most conservative method, makes no
assumptions about the tests
▶ With p tests, just run each test at significance level α/p
▶ This will often end up giving us a FWER strictly below α (i.e. the
method is conservative), but as mentioned it is very simple
▶ Will skip the math but feel free to ask in OH/through email if
interested
▶ pairwise.t.test() with p.adjust.method = "bonferroni" in
R (see the sketch after the Tukey HSD bullets below)
▶ Tukey’s Honestly Significant Differences (HSD)
▶ Calculate a standardized statistic for the difference between
each pair of means, which will tell us whether that particular
difference is significant
▶ $T = \dfrac{\bar{x}_{i.} - \bar{x}_{j.}}{\sqrt{\frac{MS_{\text{within}}}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$
▶ $\bar{x}_{i.}$ and $n_i$ are the mean and size of group $i$; $MS_{\text{within}}$ is the
within-group mean square found by one-way ANOVA
Multiple Testing Correction II

▶ This follows a distribution called the studentized range
distribution; you will need R or a table to find the p-value
▶ TukeyHSD() in R
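
As a minimal sketch of both corrections in R (using the dat data
frame - response, treatment1 - from the example that follows):

# Bonferroni: pairwise t-tests with p-values adjusted for the
# number of comparisons
pairwise.t.test(dat$response, dat$treatment1,
                p.adjust.method = "bonferroni")

# Tukey HSD: studentized-range-based comparisons of all pairs,
# computed from a fitted one-way ANOVA
fit1 <- aov(response ~ treatment1, data = dat)
TukeyHSD(fit1, conf.level = 0.95)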
Basic Factorial Design
▶ In a factorial design, we have two or more treatments and an
observation for each combination of the treatment groups (or
rather, we randomly assign each experimental unit to a
treatment combination)
▶ The general concepts are the same as with a single factor, but
we have to consider the interaction between the factors as well
▶ We can look for interactions visually using an interaction plot
(example later), and also test their significance using ANOVA
▶ Going to ignore the details of calculating the ANOVA table,
can see lecture slides for the derivation - essentially it boils down
to finding the between-group SS for each treatment individually,
as well as for the combination of treatments for the interaction
(see the sketch after this list)
▶ Model assessment
▶ This is basically the same as it was in 101A - we’re looking for
independent, normally distributed residuals with constant
variance - the usual tools such as residual plots, leverage, and
Cook’s distance will be useful
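
As a sketch, the full factorial model with interaction is one line in
R; the * operator expands to both main effects plus their
interaction (dat is the data frame from the example that follows):

# response ~ treatment1 * treatment2 is shorthand for
# treatment1 + treatment2 + treatment1:treatment2
fit2 <- aov(response ~ treatment1 * treatment2, data = dat)
summary(fit2)                          # ANOVA table, incl. interaction row
model.tables(fit2, type = "effects")   # estimated effects by level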
Example

We’re going to perform an experiment with 2 treatments - the first
has 4 levels, and the second has 3. We want to determine whether
these treatments have an effect on our response, as well as whether
they interact with each other. Finally, we want to determine which
groups of each treatment cause the greatest difference in response.

## treatment1 treatment2 response
## 1 1 a 401.6951
## 2 1 b 504.5292
## 3 1 c 408.6511
## 4 2 a 361.1613
## 5 2 b 351.1951
## 6 2 c 347.7072
Example (cont.)
[Plot: response (y, ~250-500) by treatment1 (x, levels 1-4)]
Example (cont.)
[Plot: response (y, ~250-500) by treatment2 (x, levels a-c)]
Example (cont.)
We might want to start by determining whether it is useful to consider
the interaction term. Based on the following plots, does it seem like the
interaction effect would be significant? Why or why not?
[Interaction plot: response (y, ~300-500), one line per treatment1 level (1-4)]
Example (cont.)
[Interaction plot: response (y, ~300-500) vs. treatment1 (x, levels 1-4), one line per treatment2 level (a-c)]
Example (cont.)

We probably want to start by determining whether it is useful to
consider the interaction term. Based on the preceding plots, does it
seem like the interaction effect would be significant? Why or why
not?
A: We probably should include the interaction term, as it is clear
that for different groups of treatment 1, the groups of treatment 2
have varying responses (as seen in the first plot); group 1 of
treatment 1 in particular seems to interact with treatment 2.
Extra Q: What would we expect the plots to look like if there was
no interaction?
A: The lines corresponding to different levels would be “parallel”,
i.e. the effect of treatment 2 is the same regardless of the level of
treatment 1 and vice versa
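
A minimal sketch of how such a plot can be drawn in base R
(assuming the dat data frame from this example):

# one line per level of treatment1; roughly parallel lines
# suggest little or no interaction
with(dat, interaction.plot(x.factor = treatment2,
                           trace.factor = treatment1,
                           response = response))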
Example (cont.) I

Now that we’ve decided to include the interaction term, let’s fit
our ANOVA model. Based on the resulting output, what can we
say about the significance of our treatments with regard to the
response? Which levels of each treatment are significant?

## Df Sum Sq Mean Sq F value Pr(>F)
## treatment1 3 166409 55470 177.112 < 2e-16 ***
## treatment2 2 4907 2453 7.833 0.00241 **
## treatment1:treatment2 6 16040 2673 8.536 5.05e-05 ***
## Residuals 24 7517 313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' '
Example (cont.) II
## Tables of effects
##
## treatment1
## treatment1
## 1 2 3 4
## 99.54 19.11 -35.06 -83.59
##
## treatment2
## treatment2
## a b c
## -9.948 16.385 -6.437
##
## treatment1:treatment2
## treatment2
## treatment1 a b c
## 1 -21.35 47.61 -26.26
## 2 8.77 -24.14 15.37
## 3 -6.74 -10.56 17.31
## 4 19.33 -12.91 -6.42
Example (cont.)

Now that we’ve decided to include the interaction term, let’s fit
our ANOVA model. Based on the resulting output, what can we
say about the significance of our treatments with regard to the
response? Which levels of each treatment are significant?
A: Both treatments are significant, as well as their interaction. But
the ANOVA table alone can’t tell us which specific levels differ - for
that we need the pairwise comparisons on the following slides.
Example (cont.) I
According to the Tukey HSD output, which levels of treatment 1
and treatment 2 have significant differences? Does the Bonferroni
adjustment agree with the Tukey HSD? Why or why not?

## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response ~ treatment1 + treatment2 +
##
## $treatment1
## diff lwr upr p adj
## 2-1 -80.42544 -103.43923 -57.41165 0.00e+00
## 3-1 -134.59933 -157.61312 -111.58554 0.00e+00
## 4-1 -183.12940 -206.14319 -160.11561 0.00e+00
## 3-2 -54.17389 -77.18768 -31.16010 5.80e-06
## 4-2 -102.70396 -125.71775 -79.69017 0.00e+00
## 4-3 -48.53007 -71.54386 -25.51628 3.01e-05
Example (cont.) II
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response ~ treatment1 + treatment2 +
##
## $treatment2
## diff lwr upr p adj
## b-a 26.333460 8.290942 44.375977 0.0035479
## c-a 3.511213 -14.531304 21.553731 0.8785812
## c-b -22.822246 -40.864764 -4.779729 0.0113900

##
## Pairwise comparisons using t tests with pooled SD
##
## data: dat$response and dat$treatment1
##
## 1 2 3
Example (cont.) III
## 2 1.5e-05 - -
## 3 3.9e-10 0.0032 -
## 4 1.5e-13 1.6e-07 0.0095
##
## P value adjustment method: bonferroni

##
## Pairwise comparisons using t tests with pooled SD
##
## data: dat$response and dat$treatment2
##
## a b
## b 1 -
## c 1 1
##
## P value adjustment method: bonferroni
Example (cont.)

According to the Tukey HSD output, which levels of treatment 1
and treatment 2 have significant differences? Does the Bonferroni
adjustment agree with the Tukey HSD? Why or why not?
A: For treatment 1, all levels are significantly different. For
treatment 2, levels a and b, and levels b and c, are significantly
different. For treatment 1, the Bonferroni procedure and the Tukey
HSD agree, but not for treatment 2. This is likely because the
pairwise t-tests use a pooled SD that ignores treatment 1 and the
interaction, so their error variance is badly inflated.
Extra Q: Write a line of R code that we could use to identify the
significantly different interaction terms.
A: (returns a logical vector, TRUE for significantly different pairs)
TukeyHSD(example_model, "treatment1:treatment2")$`treatment1:treatment2`[, "p adj"] < 0.05
Example (cont.)
Comment on the residuals of the model.
[Diagnostic plots: Residuals vs Fitted, Q-Q Residuals, Scale-Location, and Constant Leverage: Residuals vs Factor Levels (treatment1); observations 15, 23, and 35 flagged]

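These four panels are the default diagnostics that plot() produces
for a fitted aov object; as a sketch (using the fit2 object assumed in
the earlier factorial-design sketch):

par(mfrow = c(2, 2))  # arrange the four panels in a 2x2 grid
plot(fit2)            # Residuals vs Fitted, Q-Q, Scale-Location, and
                      # (constant leverage) Residuals vs Factor Levels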