Two Sample T-Test

$$\bar{X}_n - \bar{Y}_m \sim N\!\left(\mu_x - \mu_y,\ \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}\right)$$
But…

As before, you usually have to use the sample SD, since you won't know the true SD ahead of time… so, again, this becomes a T-distribution.

Estimated standard error of the difference:

$$SE(\bar{x} - \bar{y}) = \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}$$

Just plug in the sample standard deviations for each group.
Case 1: un-pooled variance
The sample variances for each group:

$$s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n-1} \quad\text{and so}\quad (n-1)s_x^2 = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2$$

$$s_y^2 = \frac{\sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{m-1} \quad\text{and so}\quad (m-1)s_y^2 = \sum_{i=1}^{m}(y_i - \bar{y}_m)^2$$

Combining them gives the pooled variance estimate:

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$

Degrees of Freedom! (the n + m − 2 in the denominator)
Estimated standard error (using pooled variance estimate):

$$SE(\bar{x} - \bar{y}) = \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}$$

The degrees of freedom are n + m − 2, where:

$$s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$
Case 2: t-test, pooled variances

$$T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}} \sim t_{n+m-2}$$

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$$
Alternate calculation formula: t-test, pooled variance

$$T = \frac{\bar{X}_n - \bar{Y}_m}{s_p\sqrt{\frac{m+n}{mn}}} \sim t_{n+m-2}$$

because:

$$\sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}} = \sqrt{s_p^2\left(\frac{1}{n} + \frac{1}{m}\right)} = \sqrt{s_p^2\left(\frac{m+n}{mn}\right)} = s_p\sqrt{\frac{m+n}{mn}}$$
Pooled vs. unpooled variance

Rule of Thumb: Use pooled unless you have a reason not to.

Pooled gives you more degrees of freedom.

Pooled has an extra assumption: variances are equal between the two groups.

SAS automatically tests this assumption for you (the "Equality of Variances" test). If p<.05, this suggests unequal variances, and it is better to use the unpooled t-test (see the sketch below).
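For reference, a minimal SAS sketch (the dataset name sat and the variables gender and score are hypothetical placeholders):

proc ttest data=sat;
   class gender;   * two-level grouping variable;
   var score;      * continuous outcome;
run;

PROC TTEST prints both the pooled and the Satterthwaite (unpooled) results, along with the "Equality of Variances" (folded F) test used in the rule of thumb above.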
Example: two-sample t-test
In 1980, some researchers reported that
“men have more mathematical ability than
women” as evidenced by the 1979 SATs,
where a sample of 30 random male
adolescents had a mean score ± 1 standard
deviation of 436±77 and 30 random female
adolescents scored lower: 416±81 (genders
were similar in educational backgrounds,
socio-economic status, and age). Do you
agree with the authors’ conclusions?
Data Summary

Group            n    Sample Mean   Sample Standard Deviation
Group 1: women   30   416           81
Group 2: men     30   436           77
Two-sample t-test
1. Define your hypotheses (null,
alternative)
H0: ♂-♀ math SAT = 0
Ha: ♂-♀ math SAT ≠ 0 [two-sided]
Two-sample t-test

2. Specify your null distribution:
F and M have similar standard deviations/variances, so make a "pooled" estimate of variance:

$$s_p^2 = \frac{(n-1)s_m^2 + (m-1)s_f^2}{n+m-2} = \frac{(29)77^2 + (29)81^2}{58} = 6245$$

The estimated standard error of the difference is $\sqrt{\frac{6245}{30} + \frac{6245}{30}} \approx 20.4$, so the observed difference of 436 − 416 = 20 gives $t_{58} = 20/20.4 \approx 0.98$, and the two-sided p-value is:

pval=(1-probt(.98, 58))*2;
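The whole calculation can be scripted from the summary statistics alone; a minimal SAS sketch (numbers from the data summary above):

data _null_;
   sp2  = (29*77**2 + 29*81**2)/58;   * pooled variance, 6245;
   se   = sqrt(sp2/30 + sp2/30);      * standard error, about 20.4;
   t    = (436 - 416)/se;             * observed t, about 0.98;
   pval = (1 - probt(t, 58))*2;       * two-sided p, about 0.33;
   put sp2= se= t= pval=;
run;

Since p ≈ .33, the observed 20-point difference is well within expected sampling variability.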
Example 2: Difference in means

Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.
The Experiment
(note: exact numbers have been altered)
Difference=4 points
What does a 4-point
difference mean?
Before we perform any formal statistical
analysis on these data, we already have
a lot of information.
Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.
Is the association statistically
significant?
This 4-point difference could reflect a
true effect or it could be a fluke.
The question: is a 4-point difference
bigger or smaller than the expected
sampling variability?
Hypothesis testing

Step 1: Assume the null hypothesis. With 18 "gifted" and 72 control students, df = 18 + 72 − 2 = 88:

$$s_p^2 \approx 4.0$$

$$\bar{X}_{gifted} - \bar{X}_{control} \sim T_{88}\!\left(0,\ \sqrt{\frac{4}{18} + \frac{4}{72}}\right) = T_{88}(0,\ 0.52)$$
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true—computer simulation:

[Histogram of simulated differences under the null; the standard error is about 0.52.]
3. Empirical data
Observed difference in our experiment =
12.2-8.2 = 4.0
4. P-value

A t-curve with 88 df's has slightly wider cut-offs for 95% area (t = 1.99) than a normal curve (Z = 1.96).

$$t_{88} = \frac{12.2 - 8.2}{.52} = \frac{4}{.52} \approx 7.7$$

p-value < .0001
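Mirroring the earlier probt call, a quick SAS check (a sketch, using the observed t of about 7.7):

data _null_;
   pval = (1 - probt(7.7, 88))*2;   * two-sided p, far below .0001;
   put pval=;
run;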
Visually…

If we ran this study 1000 times, we wouldn't expect to get 1 result as big as a difference of 4 (under the null hypothesis).
5. Reject null!
Conclusion: I.Q. scores can bias
expectancies in the teachers’ minds and
cause them to unintentionally treat
“bright” students differently from those
seen as less bright.
Confidence interval (more information!!)

95% CI for the difference: 4.0 ± 1.99(.52) = (3.0 – 5.0)
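A sketch of the same interval in SAS, using tinv for the t cut-off:

data _null_;
   tcrit = tinv(.975, 88);   * about 1.99;
   lower = 4.0 - tcrit*.52;
   upper = 4.0 + tcrit*.52;
   put lower= upper=;        * about (3.0, 5.0);
run;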
Data Summary

$$t_{71} = \frac{8.2 - 0}{.29} \approx 28$$

p-value < .0001
Normality assumption of ttest

If the distribution of the trait is normal, it's fine to use a t-test.

But if the underlying distribution is not normal and the sample size is small (rule of thumb: you need n>30 per group if the distribution is not too skewed; n>100 if it is really skewed), the Central Limit Theorem hasn't had a chance to kick in, and you cannot use the t-test.

Note: otherwise, the t-test is very robust against the normality assumption!
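One way to check the assumption in SAS (a sketch; the dataset trial and variable y are hypothetical placeholders):

proc univariate data=trial normal;
   var y;          * NORMAL option adds formal tests of normality;
   histogram y;    * visual check of the distribution;
run;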
Alternative tests when normality is violated: Non-parametric tests

Continuous outcome (means): are the observations independent or correlated?

Outcome variable: Continuous (e.g. pain scale, cognitive function)

If independent:
- Ttest: compares means between two independent groups
- ANOVA: compares means between more than two independent groups
- Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
- Linear regression: multivariate regression technique used when the outcome is continuous

If correlated:
- Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
- Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
- Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size):
- Wilcoxon signed-rank test: non-parametric alternative to the paired ttest
- Wilcoxon rank-sum test (=Mann-Whitney U test): non-parametric alternative to the ttest
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
Non-parametric tests

Hypothetical RESULTS: the Atkins group loses an average of 34.5 lbs.

[Histogram: percent of subjects by weight change, values ranging from −30 to +20 lbs.]

Atkins:

[Histogram: percent of subjects by weight change, values ranging from −300 to +20 lbs; note the extreme outlier near −300.]
t-test inappropriate…
Comparing the mean weight loss of the
two groups is not appropriate here.
The distributions do not appear to be
normally distributed.
Moreover, there is an extreme outlier
(this outlier influences the mean a great
deal).
Wilcoxon rank-sum test

RANK the values, 1 being the least weight loss and 20 being the most weight loss.

Atkins values: +4, +3, 0, −3, −4, −5, −11, −14, −15, −300
Atkins ranks:   1,  2, 3,  4,  5,  6,   9,  11,  12,  20

J. Craig values: −8, −10, −12, −16, −18, −20, −21, −24, −26, −30
J. Craig ranks:   7,   8,  10,  13,  14,  15,  16,  17,  18,  19
Wilcoxon rank-sum test

Sum of Atkins ranks:
1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73

Sum of Jenny Craig's ranks:
7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
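SAS can assign the ranks for you; a sketch (the dataset diets with variables group and wtchange is hypothetical, but matches the layout of the data above):

proc rank data=diets descending out=ranked;
   var wtchange;   * DESCENDING: rank 1 = least weight loss, as above;
   ranks wtrank;
run;

proc means data=ranked sum;
   class group;
   var wtrank;     * sum of ranks within each diet group;
run;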
Outcome variable: Binary or categorical (e.g. fracture, yes/no)

If independent:
- Chi-square test: compares proportions between two or more groups
- Relative risks: odds ratios or risk ratios
- Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

If correlated:
- McNemar's chi-square test: compares binary outcome between two correlated groups (e.g., before and after)
- Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
- GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives for sparse data:
- Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)
- McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
Difference in proportions (special
case of chi-square test)
Null distribution of a difference in proportions

Standard error of a proportion:

$$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$

Difference of proportions:

$$\hat{p}_1 - \hat{p}_2 \sim N\!\left(p_1 - p_2,\ \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}\right)$$

The difference follows a normal distribution because the binomial can be approximated with a normal.
Absolute risk: Difference in proportions exposed

                 Smoker (E)   Non-smoker (~E)   Total
Stroke (D)           15             35            50
No Stroke (~D)        8             42            50

$$P(E|D) - P(E|\sim D) = \frac{15}{50} - \frac{8}{50} = 30\% - 16\% = 14\%$$
Difference in proportions exposed

$$Z = \frac{.14 - 0}{\sqrt{\frac{.23 \times .77}{50} + \frac{.23 \times .77}{50}}} = \frac{.14}{.084} = 1.67$$

(where .23 = (15 + 8)/100 is the pooled proportion exposed)
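A quick SAS check of the two-sided p-value for Z = 1.67 (a sketch):

data _null_;
   pval = (1 - probnorm(1.67))*2;   * about .095, not significant at .05;
   put pval=;
run;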
Any antidepressant drug ever: 120 (46%) vs. 448 (36%)

Difference = 46% − 36% = 10%
Is the association statistically
significant?
This 10% difference could reflect a true
association or it could be a fluke in this
particular sample.
The question: is 10% bigger or smaller
than the expected sampling variability?
Hypothesis testing

Step 1: Assume the null hypothesis.

[Simulated null distribution: the standard error is about 3.3%.]

Hypothesis Testing

Step 3: Do an experiment:

$$Z = \frac{.10}{.033} = 3.0; \qquad p = .003$$
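Checking p = .003 in SAS (a sketch):

data _null_;
   pval = (1 - probnorm(3.0))*2;   * about .0027;
   put pval=;
run;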
P-value from our simulation…

With only 50 cases and 50 controls, the standard error is about 10%.

If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time).
Two-tailed p-value

Two-tailed p-value = 17% × 2 = 34%
Practice problem…
1. Hypotheses:
H0: p♂-p♀= 0
Ha: p♂-p♀≠ 0 [two-sided]
2. Null distribution of the difference of two proportions:

$$\hat{p}_f - \hat{p}_m \sim N\!\left(0,\ \sqrt{\frac{\frac{63}{75}(1-\frac{63}{75})}{37} + \frac{\frac{63}{75}(1-\frac{63}{75})}{38}}\right)$$

$$\sqrt{\frac{.84(.16)}{37} + \frac{.84(.16)}{38}} = .085$$

3. Observed difference in our experiment = .97 − .71 = .26

4. Calculate the p-value of what you observed:

$$Z = \frac{.26 - 0}{.085} = 3.06$$

data _null_;
pval=(1-probnorm(3.06))*2;
put pval;
run;
Key two-sample Hypothesis Tests…

Test for H0: μx − μy = 0 (σ² unknown, but roughly equal):

$$t_{n_x+n_y-2} = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}};\qquad s_p^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}$$

Test for H0: p1 − p2 = 0:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}};\qquad \bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1+n_2}$$
Corresponding confidence intervals…

For a difference in means, 2 independent samples (σ²'s unknown but roughly equal):

$$(\bar{x} - \bar{y}) \pm t_{n_x+n_y-2,\ \alpha/2}\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}$$

For a difference in proportions, 2 independent samples:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}$$
Appendix: details of rank-sum
test…
Wilcoxon Rank-sum test
For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team 1 (X) and team 2 don't differ in talent, the ranks ought to be spread evenly between the two groups, e.g.…

Let T1 and T2 be the sums of within-group ranks (T1 for the smaller group). Their total is fixed:

$$T_1 + T_2 = \sum_{i=1}^{n_1+n_2} i = \frac{(n_1+n_2)(n_1+n_2+1)}{2}$$

Expanding the numerator:

$$\frac{(n_1+n_2)(n_1+n_2+1)}{2} = \frac{n_1^2 + n_1n_2 + n_1 + n_1n_2 + n_2^2 + n_2}{2} = n_1n_2 + \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2}$$

$$\text{e.g., here: } T_1 + T_2 = \sum_{i=1}^{13} i = \frac{(13)(14)}{2} = 91 = 55 + 6 + 30$$
Take-home point:

$$T_1 + T_2 = n_1n_2 + \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2}$$
It turns out that, if the null hypothesis is true (ranks evenly interspersed between the groups), the difference between the two groups' sums of ranks, T2 − T1, is exactly equal to the difference between the two "reference" sums n2(n2+1)/2 and n1(n1+1)/2:

$$\sum_{i=1}^{10} i = \frac{10(11)}{2} = 55 \qquad\qquad \sum_{i=1}^{3} i = \frac{3(4)}{2} = 6$$

The difference between these reference sums is 55 − 6 = 49.

With evenly interspersed ranks:
T1 = 3 + 7 + 11 = 21
T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70

The difference between the sums of the ranks of the two groups is also 70 − 21 = 49 if ranks are evenly interspersed (null is true). Magic!
Define:

$$U_2 = n_1n_2 + \frac{n_2(n_2+1)}{2} - T_2$$

$$U_1 = n_1n_2 + \frac{n_1(n_1+1)}{2} - T_1$$

Here, under the null: U2 = 30 + 55 − 70 = 15 and U1 = 30 + 6 − 21 = 15.

$$E(U_2 - U_1) = E\left[\left(\frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}\right) - (T_2 - T_1)\right] = 0$$

The U's should be equal to each other, and each will equal n1n2/2:

U1 + U2 = n1n2 (always)
Under the null hypothesis, U1 = U2 = U0
E(U1 + U2) = 2E(U0) = n1n2
E(U0) = n1n2/2
So, the test statistic here is not quite the difference in the
sum-of-ranks of the 2 groups
It’s the smaller observed U value: U0
For small n’s, take U0, and get p-value directly from a U table.
For large enough n's (>10 per group)…

$$E(U_0) = \frac{n_1n_2}{2}$$

$$Z = \frac{U_0 - E(U_0)}{\sqrt{Var(U_0)}} = \frac{U_0 - \frac{n_1n_2}{2}}{\sqrt{Var(U_0)}}$$

$$Var(U_0) = \frac{n_1n_2(n_1+n_2+1)}{12}$$
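A sketch of this normal approximation in SAS (n1, n2, and u0 below are hypothetical example values, not data from this lecture):

data _null_;
   n1 = 12; n2 = 14; u0 = 45;       * hypothetical inputs;
   eu   = n1*n2/2;                  * E(U0);
   varu = n1*n2*(n1 + n2 + 1)/12;   * Var(U0);
   z    = (u0 - eu)/sqrt(varu);     * z <= 0 since U0 is the smaller U;
   pval = probnorm(z)*2;            * two-sided p-value;
   put z= pval=;
run;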
Add observed data to the example…

Example: If the girls on the two gymnastics teams were ranked as follows:

Team 1: 1, 5, 7 (observed T1 = 13)
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13 (observed T2 = 78)

Are the teams significantly different?

Total sum of ranks = 13·14/2 = 91; n1n2 = 3·10 = 30

Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 (each should equal about 15), and U0 = 15.

U1 = 30 + 6 − 13 = 23
U2 = 30 + 55 − 78 = 7
U0 = 7

Not quite statistically significant in the U table… p = .1084 (see attached) × 2 for a two-tailed test.
Example problem 2
A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig
(low-cal, low-fat). The following weight changes were obtained; note
they are very skewed because someone lost 100 pounds; the mean loss
for Atkins is going to look higher because of the bozo, but does that
mean the diet is better overall? Conduct a Mann-Whitney U test to
compare ranks.
Atkins    Jenny Craig
-100      -11
-8        -15
-4        -5
+5        +6
+8        -20
+2
Answer

Ranks (here, 1 = most weight lost):

Atkins: -100 → 1, -8 → 5, -4 → 7, +5 → 9, +8 → 11, +2 → 8
Jenny Craig: -11 → 4, -15 → 3, -5 → 6, +6 → 10, -20 → 2

Sum of ranks for JC = 25 (n = 5)
Sum of ranks for Atkins = 41 (n = 6)

n1n2 = 5·6 = 30

Under the null hypothesis: expect U1 − U2 = 0, U1 + U2 = 30, and U0 = 15.

U1 = 30 + 15 − 25 = 20
U2 = 30 + 21 − 41 = 10
U0 = 10; n1 = 5, n2 = 6

Go to the Mann-Whitney chart… p = .2143 × 2 ≈ .43
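In practice, SAS will run the whole rank-sum test; a sketch using the data above (the exact p-value may differ slightly from the chart value):

data diets;
   input group $ wtchange @@;
   datalines;
A -100 A -8 A -4 A 5 A 8 A 2
J -11 J -15 J -5 J 6 J -20
;
run;

proc npar1way data=diets wilcoxon;
   class group;
   var wtchange;
   exact wilcoxon;   * exact test, appropriate for these small n's;
run;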