Two Sample T-Test

The two-sample t-test is used to determine whether the means of two independent groups are statistically significantly different from each other. It compares the difference between two sample means to the standard error of that difference. The test can be performed with either pooled or unpooled variance estimates, depending on whether the variances are assumed to be equal between the groups. With pooled variance, the degrees of freedom are larger, increasing the power of the test.

The two-sample t-test


 Is the difference in means that we observe between two groups more than we'd expect to see based on chance alone?
The standard error of the difference of two means

Recall: Var(A − B) = Var(A) + Var(B) if A and B are independent!

σ(x̄ − ȳ) = sqrt( σx²/n + σy²/m )

**First add the variances, and then take the square root of the sum to get the standard error.
Shown by simulation:

Each of two samples of 30 (with SD = 5) has standard error

SE = 5/sqrt(30) ≈ .91

Difference of the two samples:

SE(diff) = sqrt( 25/30 + 25/30 ) ≈ 1.29
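The simulation above can be sketched in stdlib Python (my own illustration, not the original slides' code; the sample sizes and SD come from the slide):

```python
import random
import statistics

random.seed(42)

def simulate_diff_se(n=30, m=30, sd=5.0, reps=5000):
    """Simulate the sampling distribution of the difference of two
    sample means and return its empirical standard deviation (the SE)."""
    diffs = []
    for _ in range(reps):
        x = [random.gauss(0, sd) for _ in range(n)]
        y = [random.gauss(0, sd) for _ in range(m)]
        diffs.append(statistics.mean(x) - statistics.mean(y))
    return statistics.stdev(diffs)

# Theory says SE(diff) = sqrt(25/30 + 25/30), about 1.29;
# the empirical value should land close to that.
empirical = simulate_diff_se()
print(round(empirical, 2))
```
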
Distribution of differences

If X̄ and Ȳ are the averages of n and m subjects, respectively:

X̄n − Ȳm ~ N( μx − μy, sqrt( σx²/n + σy²/m ) )
But…
 As before, you usually have to use the sample SD, since you won't know the true SD ahead of time…
 So, again, it becomes a t-distribution…
Estimated standard error of the difference…

s(x̄ − ȳ) = sqrt( sx²/n + sy²/m )

Just plug in the sample standard deviations for each group.
Case 1: un-pooled variance

Question: What are your degrees of freedom here?


Answer: Not obvious!
Case 1: t-test, unpooled variances

T = (X̄n − Ȳm) / sqrt( sx²/n + sy²/m ) ~ t

It is complicated to figure out the degrees of freedom here! A good approximation is the harmonic mean of the sample sizes (or SAS will tell you!):

df ≈ 2 / (1/n + 1/m)
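As a sketch, the unpooled t statistic and the slide's harmonic-mean df shortcut in Python (illustrative only; the summary numbers are borrowed from the SAT example later in the slides):

```python
import math

def unpooled_t(xbar, ybar, sx, sy, n, m):
    """Unpooled two-sample t statistic from summary statistics."""
    se = math.sqrt(sx**2 / n + sy**2 / m)
    return (xbar - ybar) / se

def harmonic_mean_df(n, m):
    """The slide's rough df approximation: the harmonic mean of n and m."""
    return 2 / (1 / n + 1 / m)

# SAT example: means 436 vs 416, SDs 77 and 81, n = m = 30
t = unpooled_t(436, 416, 77, 81, 30, 30)
df = harmonic_mean_df(30, 30)
print(round(t, 2), df)
```

With equal sample sizes the harmonic mean is just n, so df ≈ 30 here; SAS's Satterthwaite approximation would give a similar but not identical value.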
Case 2: pooled variance

If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).

Pooling variances:

sx² = Σ(xi − x̄n)²/(n − 1), so (n − 1)sx² = Σ(xi − x̄n)²
sy² = Σ(yi − ȳm)²/(m − 1), so (m − 1)sy² = Σ(yi − ȳm)²

sp² = [ (n − 1)sx² + (m − 1)sy² ] / (n + m − 2)
    = [ Σ(xi − x̄n)² + Σ(yi − ȳm)² ] / (n + m − 2)

Degrees of Freedom: n + m − 2
Estimated standard error (using pooled variance estimate)

s(x̄ − ȳ) = sqrt( sp²/n + sp²/m )

The degrees of freedom are n + m − 2, where:

sp² = [ Σ(xi − x̄n)² + Σ(yi − ȳm)² ] / (n + m − 2)
Case 2: t-test, pooled variances

T = (X̄n − Ȳm) / sqrt( sp²/n + sp²/m ) ~ t(n+m−2)

sp² = [ (n − 1)sx² + (m − 1)sy² ] / (n + m − 2)
Alternate calculation formula: t-test, pooled variance

T = (X̄n − Ȳm) / ( sp · sqrt( (m + n)/(mn) ) ) ~ t(n+m−2)

since sp²/m + sp²/n = sp²(1/m + 1/n) = sp²( (n + m)/(mn) ), whose square root is sp · sqrt( (n + m)/(mn) ).
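A quick numerical check (my own illustration, using the pooled variance from the SAT example) that the two forms of the pooled standard error agree:

```python
import math

sp2, n, m = 6245.0, 30, 30   # pooled variance and sample sizes from the SAT example
sp = math.sqrt(sp2)

se_long = math.sqrt(sp2 / n + sp2 / m)          # sqrt(sp^2/n + sp^2/m)
se_short = sp * math.sqrt((n + m) / (n * m))    # sp * sqrt((n+m)/(nm))
print(round(se_long, 3), round(se_short, 3))
```

Both give the same standard error (about 20.4 here), as the algebra on the slide shows.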
Pooled vs. unpooled variance

Rule of Thumb: Use pooled unless you have a reason not to.
Pooled gives you more degrees of freedom.
Pooled has an extra assumption: that the variances are equal between the two groups.
SAS automatically tests this assumption for you ("Equality of Variances" test). If p < .05, this suggests unequal variances, and it is better to use the unpooled t-test.
Example: two-sample t-test
 In 1980, some researchers reported that
“men have more mathematical ability than
women” as evidenced by the 1979 SAT’s,
where a sample of 30 random male
adolescents had a mean score ± 1 standard
deviation of 436±77 and 30 random female
adolescents scored lower: 416±81 (genders
were similar in educational backgrounds,
socio-economic status, and age). Do you
agree with the authors’ conclusions?
Data Summary

Group            n    Sample Mean   Sample Standard Deviation
Group 1: women   30   416           81
Group 2: men     30   436           77
Two-sample t-test
1. Define your hypotheses (null,
alternative)
H0: ♂-♀ math SAT = 0
Ha: ♂-♀ math SAT ≠ 0 [two-sided]
Two-sample t-test

2. Specify your null distribution:
F and M have similar standard deviations/variances, so make a "pooled" estimate of variance.

sp² = [ (n − 1)sm² + (m − 1)sf² ] / (n + m − 2) = [ (29)77² + (29)81² ] / 58 = 6245

M̄30 − F̄30 ~ T58( 0, sqrt( 6245/30 + 6245/30 ) = 20.4 )
Two-sample t-test
3. Observed difference in our experiment = 20
points
Two-sample t-test

4. Calculate the p-value of what you observed:

T58 = (20 − 0)/20.4 = .98

data _null_;
pval=(1-probt(.98, 58))*2;
put pval;
run;
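The same calculation can be sketched in Python (my own illustration, not the slides' SAS). Since df = 58 is large, the two-sided p-value is approximated here with the normal tail via `math.erfc`; SAS's `probt` should give a very similar value:

```python
import math

# Summary data from the slide: men 436 +/- 77, women 416 +/- 81, n = 30 each
sp2 = (29 * 77**2 + 29 * 81**2) / 58       # pooled variance = 6245
se = math.sqrt(sp2 / 30 + sp2 / 30)        # about 20.4
t = (436 - 416) / se                       # about 0.98

# Two-sided p-value, normal approximation to t with 58 df
p = math.erfc(abs(t) / math.sqrt(2))
print(round(t, 2), round(p, 2))
```

The p-value is around 0.33, so this 20-point difference is well within chance variability.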
Example 2: Difference in means

 Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.
The Experiment
(note: exact numbers have been altered)

 Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
 Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18).
 BUT: the children on the teachers' lists had actually been randomly assigned to the list.
 At the end of the year, the same I.Q. test was re-administered.
Example 2

 Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?
 What will we actually compare?
One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.
Results:

The standard deviation of change scores was 2.0 in both groups. This affects statistical significance…

                      "Academic bloomers" (n=18)   Controls (n=72)
Change in IQ score:   12.2 (2.0)                   8.2 (2.0)

Difference = 4 points
What does a 4-point
difference mean?
 Before we perform any formal statistical
analysis on these data, we already have
a lot of information.
 Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.
Is the association statistically
significant?
 This 4-point difference could reflect a
true effect or it could be a fluke.
 The question: is a 4-point difference
bigger or smaller than the expected
sampling variability?
Hypothesis testing

Step 1: Assume the null hypothesis.

Null hypothesis: There is no difference between "academic bloomers" and normal students (= the difference in IQ change is 0).
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true.

 These predictions can be made by mathematical theory or by computer simulation.
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true (math theory):

sp² = 4.0

μ"gifted" − μcontrol ~ T88( 0, sqrt( 4/18 + 4/72 ) = 0.52 )
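The null-distribution standard error above, and the test statistic that follows in step 4, take two lines of Python (my own sketch using the slide's numbers):

```python
import math

# IQ-change example: SD of change scores = 2.0 in both groups, n = 18 vs 72
sp2 = 2.0**2
se = math.sqrt(sp2 / 18 + sp2 / 72)   # about 0.53 (the slide rounds to 0.52)
t = (12.2 - 8.2) / se                 # about 7.6, with 88 df
print(round(se, 2), round(t, 1))
```

A t statistic this far out in the tail gives p < .0001 regardless of rounding.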
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true (computer simulation):

 In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
 I used computer simulation to take 1000 samples of 18 treated and 72 controls.
Computer Simulation Results

[Histogram of simulated differences: standard error is about 0.52]
3. Empirical data

Observed difference in our experiment = 12.2 − 8.2 = 4.0

4. P-value

A t-curve with 88 df's has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).

t88 = (12.2 − 8.2)/.52 = 4/.52 ≈ 7.7

p-value < .0001
Visually…

If we ran this study 1000 times, we wouldn't expect to get even 1 result as big as a difference of 4 (under the null hypothesis).
5. Reject null!
 Conclusion: I.Q. scores can bias
expectancies in the teachers’ minds and
cause them to unintentionally treat
“bright” students differently from those
seen as less bright.
Confidence interval (more information!!)

95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)

A t-curve with 88 df's has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).
What if our standard deviation
had been higher?
 The standard deviation for change
scores in treatment and control were
each 2.0. What if change scores had
been much more variable—say a
standard deviation of 10.0 (for both)?
[Simulation results:]
 With a std. dev. in change scores of 2.0, the standard error is 0.52.
 With a std. dev. in change scores of 10.0, the standard error is 2.58.

With a std. dev. of 10.0…
LESS STATISTICAL POWER!

With a standard error of 2.58: if we ran this study 1000 times, we would expect to get +4.0 or −4.0 a full 12% of the time. P-value = .12
Don't forget: The paired T-test

 Did the control group in the previous experiment improve at all during the year?
 Do not apply a two-sample t-test to answer this question!
 After − Before yields a single sample of differences…
 …a "within-group" rather than "between-group" comparison…
Continuous outcome (means): Are the observations independent or correlated?

Outcome variable: Continuous (e.g., pain scale, cognitive function)

Independent observations:
 T-test: compares means between two independent groups
 ANOVA: compares means between more than two independent groups
 Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
 Linear regression: multivariate regression technique used when the outcome is continuous

Correlated observations:
 Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
 Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
 Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size):
 Wilcoxon sign-rank test: non-parametric alternative to the paired t-test
 Wilcoxon sum-rank test (=Mann-Whitney U test): non-parametric alternative to the t-test
 Kruskal-Wallis test: non-parametric alternative to ANOVA
 Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
Data Summary

Group             n    Sample Mean   Sample Standard Deviation
Group 1: Change   72   +8.2          2.0
Did the control group in the previous experiment improve at all during the year?

t71 = (8.2 − 0) / (2/sqrt(72)) = 8.2/.24 ≈ 35

p-value < .0001
Normality assumption of the t-test

 If the distribution of the trait is normal, it's fine to use a t-test.
 But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem takes some time to kick in, and you cannot use the t-test. (Rule of thumb: you need n > 30 per group if the distribution is not too skewed; n > 100 if it is really skewed.)
 Note: otherwise, the t-test is very robust against violations of the normality assumption!
Alternative tests when normality
is violated: Non-parametric tests
(Same summary table of tests for continuous outcomes as shown earlier: the non-parametric alternatives are the Wilcoxon sign-rank test, the Wilcoxon sum-rank/Mann-Whitney U test, the Kruskal-Wallis test, and the Spearman rank correlation coefficient.)
Non-parametric tests

 t-tests require your outcome variable to be normally distributed (or close enough), for small samples.
 Non-parametric tests are based on RANKS instead of means and standard deviations (= "population parameters").
Example: non-parametric tests

10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig.

Hypothetical RESULTS:
Atkins group loses an average of 34.5 lbs.
J. Craig group loses an average of 18.5 lbs.
Conclusion: Atkins is better?
Example: non-parametric tests

BUT, take a closer look at the individual data…

Atkins, change in weight (lbs):
+4, +3, 0, -3, -4, -5, -11, -14, -15, -300

J. Craig, change in weight (lbs):
-8, -10, -12, -16, -18, -20, -21, -24, -26, -30
[Histogram: Jenny Craig weight changes (percent of dieters vs. weight change), spread fairly evenly between −30 and −8 lbs]

[Histogram: Atkins weight changes (percent of dieters vs. weight change), clustered near 0 lbs with one extreme outlier at −300 lbs]
t-test inappropriate…
 Comparing the mean weight loss of the
two groups is not appropriate here.
 The distributions do not appear to be
normally distributed.
 Moreover, there is an extreme outlier
(this outlier influences the mean a great
deal).
Wilcoxon rank-sum test

 RANK the values, 1 being the least weight loss and 20 being the most weight loss.
 Atkins: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300 → ranks 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
 J. Craig: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30 → ranks 7, 8, 10, 13, 14, 15, 16, 17, 18, 19
Wilcoxon rank-sum test

 Sum of Atkins ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
 Sum of Jenny Craig's ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
 Jenny Craig clearly ranked higher!
 P-value* (from computer) = .018

*For details of the statistical test, see the appendix of these slides…
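The ranking step can be reproduced in a few lines of Python (my own sketch; the data are the slide's, and since all 20 values are distinct, a simple value-to-rank dict is safe):

```python
atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jcraig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Rank all 20 values as on the slide: 1 = least weight loss (largest value),
# 20 = most weight loss (smallest value). No ties here, so a dict works.
combined = sorted(atkins + jcraig, reverse=True)
rank = {value: i + 1 for i, value in enumerate(combined)}

t_atkins = sum(rank[v] for v in atkins)
t_jcraig = sum(rank[v] for v in jcraig)
print(t_atkins, t_jcraig)
```

Note how the −300 outlier contributes only one rank (20) instead of dragging the whole mean: that is exactly why the rank-based test is robust here.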
Binary or categorical outcomes (proportions): Are the observations independent or correlated?

Outcome variable: Binary or categorical (e.g., fracture, yes/no)

Independent observations:
 Chi-square test: compares proportions between two or more groups
 Relative risks: odds ratios or risk ratios
 Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:
 McNemar's chi-square test: compares a binary outcome between two correlated groups (e.g., before and after)
 Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
 GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
 Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)
 McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
Difference in proportions (special case of chi-square test)

Null distribution of a difference in proportions

Standard error of a proportion = sqrt( p(1 − p)/n )

Standard error can be estimated by sqrt( p̂(1 − p̂)/n ) (still normally distributed)

Standard error of the difference of two proportions =
sqrt( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ), or sqrt( p̄(1 − p̄)/n1 + p̄(1 − p̄)/n2 ) where p̄ = (n1p̂1 + n2p̂2)/(n1 + n2)

The variance of a difference is the sum of the variances (as with a difference in means). Using the pooled p̄ is analogous to the pooled variance in the t-test.
Null distribution of a difference in proportions

Difference of proportions ~ N( p1 − p2, sqrt( p(1 − p)/n1 + p(1 − p)/n2 ) )

This follows a normal distribution because the binomial can be approximated with a normal.

Difference in proportions test

Null hypothesis: The difference in proportions is 0.

Z = ( p̂1 − p̂2 ) / sqrt( p̄(1 − p̄)/n1 + p̄(1 − p̄)/n2 )

where p̄ = (n1p̂1 + n2p̂2)/(n1 + n2) (just the average proportion)

p̂1 = proportion in group 1; p̂2 = proportion in group 2
n1 = number in group 1; n2 = number in group 2

Recall, the variance of a proportion is p(1 − p)/n. Use the average (or pooled) proportion in the standard error formula because, under the null hypothesis, the groups have equal proportions.
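A minimal Python sketch of this Z test (my own illustration), exercised with the case-control numbers from the next slide (15/50 smokers among strokes vs. 8/50 among non-strokes):

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Z statistic for H0: p1 - p2 = 0, using the pooled proportion
    in the standard error (valid under the null hypothesis)."""
    p = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    return (p1 - p2) / se

# Case-control example: 30% vs 16% exposed
z = two_proportion_z(15 / 50, 8 / 50, 50, 50)
print(round(z, 2))
```

This gives Z ≈ 1.66 (the slide rounds the standard error to .084 and reports 1.67); either way the result falls short of the 1.96 cutoff.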
Recall case-control example:

                 Smoker (E)   Non-smoker (~E)   Total
Stroke (D)       15           35                50
No Stroke (~D)   8            42                50

Absolute risk: Difference in proportions exposed

P(E|D) − P(E|~D) = 15/50 − 8/50 = 30% − 16% = 14%
Difference in proportions exposed

Z = (14% − 0%) / sqrt( .23(.77)/50 + .23(.77)/50 ) = .14/.084 = 1.67

95% CI: 0.14 ± 1.96(.084) = −0.03 to 0.31

Example 2: Difference in proportions

 Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents?

Example modified from: "Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults"; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.
Example 2: Difference in
Proportions
 Design: Case-control study
 Methods: Researchers used Medicaid records
to compare prescription histories between
263 children and teenagers (6-18 years) who
had attempted suicide and 1241 controls who
had never attempted suicide (all subjects
suffered from depression).
 Statistical question: Is a history of use of
antidepressants more common among cases
than controls?
Example 2

 Statistical question: Is a history of use of antidepressants more common among suicide-attempt cases than controls?
 What will we actually compare?
Proportion of cases who used antidepressants in the past vs. proportion of controls who did.
Results

                               No. (%) of cases (n=263)   No. (%) of controls (n=1241)
Any antidepressant drug ever   120 (46%)                  448 (36%)

Difference = 10%
Is the association statistically
significant?
 This 10% difference could reflect a true
association or it could be a fluke in this
particular sample.
 The question: is 10% bigger or smaller
than the expected sampling variability?
Hypothesis testing
Step 1: Assume the null hypothesis.

Null hypothesis: There is no association


between antidepressant use and suicide
attempts in the target population (= the
difference is 0%)
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true

p̂cases − p̂controls ~ N( 0, σ = sqrt( (568/1504)(1 − 568/1504)/263 + (568/1504)(1 − 568/1504)/1241 ) = .033 )

Also: Computer Simulation Results

[Histogram of simulated differences: standard error is about 3.3%]
Hypothesis Testing

Step 3: Do an experiment

We observed a difference of 10% between cases and controls.

Hypothesis Testing

Step 4: Calculate a p-value

Z = .10/.033 = 3.0; p = .003
P-value from our simulation…

When we ran this study 1000 times, we got 1 result as big or bigger than 10%. We also got 3 results as small or smaller than −10%.

P-value

From our simulation, we estimate the p-value to be: 4/1000 or .004
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.

Here we reject the null.


Alternative hypothesis: There is an association
between antidepressant use and suicide in the
target population.
What would a lack of statistical significance mean?

 If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation…

[Simulated sampling distributions: with 263 cases and 1241 controls, the standard error is about 3.3%; with 50 cases and 50 controls, the standard error is about 10%.]

With only 50 cases and 50 controls… If we ran this study 1000 times (standard error about 10%), we would expect to get values of 10% or higher 170 times (or 17% of the time).

Two-tailed p-value = 17% × 2 = 34%
Practice problem…

An August 2003 research article in Developmental


and Behavioral Pediatrics reported the following
about a sample of UK kids: when given a choice
of a non-branded chocolate cereal vs. CoCo Pops,
97% (36) of 37 girls and 71% (27) of 38 boys
preferred the CoCo Pops. Is this evidence that
girls are more likely to choose brand-named
products?
Answer

(The null says the p's are equal, so estimate the standard error using the overall observed p.)

1. Hypotheses:
H0: p♂ − p♀ = 0
Ha: p♂ − p♀ ≠ 0 [two-sided]

2. Null distribution of the difference of two proportions:

p̂f − p̂m ~ N( 0, sqrt( (63/75)(1 − 63/75)/37 + (63/75)(1 − 63/75)/38 ) = sqrt( .84(.16)/37 + .84(.16)/38 ) = .085 )

3. Observed difference in our experiment = .97 − .71 = .26

4. Calculate the p-value of what you observed: Z = (.26 − 0)/.085 = 3.06

data _null_;
pval=(1-probnorm(3.06))*2;
put pval;
run;
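The same answer in Python (my own sketch, using the slide's counts; the two-sided p-value comes from the normal tail via `math.erfc`, matching SAS's `probnorm` approach):

```python
import math

girls_p, boys_p = 36 / 37, 27 / 38        # 97% vs 71% choosing CoCo Pops
pooled = (36 + 27) / (37 + 38)            # overall p = 63/75 = 0.84

se = math.sqrt(pooled * (1 - pooled) / 37 + pooled * (1 - pooled) / 38)
z = (girls_p - boys_p) / se               # about 3.1 (slide rounds to 3.06)
p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal p-value
print(round(z, 2), round(p, 4))
```

The p-value is around .002, so the girls-vs-boys difference in brand preference is statistically significant.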
Key two-sample Hypothesis Tests…

Test for H0: μx − μy = 0 (σ² unknown, but roughly equal):

t(nx+ny−2) = ( x̄ − ȳ ) / sqrt( sp²/nx + sp²/ny ); sp² = [ (nx − 1)sx² + (ny − 1)sy² ] / (nx + ny − 2)

Test for H0: p1 − p2 = 0:

Z = ( p̂1 − p̂2 ) / sqrt( p̄(1 − p̄)/n1 + p̄(1 − p̄)/n2 ); p̄ = (n1p̂1 + n2p̂2)/(n1 + n2)
Corresponding confidence intervals…

For a difference in means, 2 independent samples (σ²'s unknown but roughly equal):

( x̄ − ȳ ) ± t(nx+ny−2, α/2) · sqrt( sp²/nx + sp²/ny )

For a difference in proportions, 2 independent samples:

( p̂1 − p̂2 ) ± Z(α/2) · sqrt( p̄(1 − p̄)/n1 + p̄(1 − p̄)/n2 )
Appendix: details of rank-sum test…

Wilcoxon Rank-sum test

Rank all of the observations in order from 1 to n.
T1 is the sum of the ranks from the smaller group (n1); T2 is the sum of the ranks from the larger group (n2).

U1 = n1n2 + n1(n1 + 1)/2 − T1
U2 = n1n2 + n2(n2 + 1)/2 − T2
U0 = min(U1, U2)

For n1 ≤ 10, n2 ≤ 10: find P(U ≤ U0) in Mann-Whitney U tables, with n2 = the bigger of the 2 groups.

For larger samples: Z = ( U0 − n1n2/2 ) / sqrt( n1n2(n1 + n2 + 1)/12 )
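The U formulas translate directly into Python (my own sketch), checked here against the gymnastics example worked later in this appendix (T1 = 13 for team 1's ranks 1, 5, 7; T2 = 91 − 13 = 78):

```python
def mann_whitney_u(t1, t2, n1, n2):
    """U statistics from the two rank sums; the test statistic U0 is
    the smaller of the two (n1 = smaller group, n2 = larger group)."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return min(u1, u2)

# Gymnastics example: n1 = 3, n2 = 10, T1 = 13, T2 = 78
u0 = mann_whitney_u(13, 78, 3, 10)
print(u0)
```

This reproduces U0 = 7, the value looked up in the U table on the later slide.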
Example
 For example, if team 1 and team 2 (two gymnastic
teams) are competing, and the judges rank all the
individuals in the competition, how can you tell if
team 1 has done significantly better than team 2 or
vice versa?
T1  sum of ranks of group 1 (smaller)
Answer T2  sum of ranks of group 2 (larger)
 Intuition: under the null hypothesis of no difference between the
two groups…
 If n1=n2, the sums of T1 and T2 should be equal.
 But if n1 ≠n2, then T2 (n2=bigger group) should automatically be
bigger. But how much bigger under the null?

 For example, if team 1 has 3 people and team 2 has 10, we could
rank all 13 participants from 1 to 13 on individual performance. If
team1 (X) and team2 don’t differ in talent, the ranks ought to be
spread evenly among the two groups, e.g.…

 1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team1


ranks 3rd, 7th, and 11th)
Remember this?

T1 + T2 = Σ(i = 1 to n1+n2) i = (n1 + n2)(n1 + n2 + 1)/2
        = (n1² + n1n2 + n1 + n1n2 + n2² + n2)/2
        = n1(n1 + 1)/2 + n2(n2 + 1)/2 + n1n2

e.g., here: T1 + T2 = Σ(i = 1 to 13) i = (13)(14)/2 = 91 = 55 + 6 + 30

The sum of within-group ranks for the smaller group is Σ(i = 1 to n1) i = n1(n1 + 1)/2; for the larger group, Σ(i = 1 to n2) i = n2(n2 + 1)/2.
Take-home point:

T1 + T2 = n1(n1 + 1)/2 + n2(n2 + 1)/2 + n1n2

It turns out that, if the null hypothesis is true, the difference between the larger-group sum of ranks and the smaller-group sum of ranks is exactly equal to the difference between the two within-group rank sums:

Σ(i = 1 to 10) i = 10(11)/2 = 55 and Σ(i = 1 to 3) i = 3(4)/2 = 6; the difference between the within-group rank sums is 55 − 6 = 49.

The difference between the rank sums of the two groups is also 49 if the ranks are evenly interspersed (null is true):
T1 = 3 + 7 + 11 = 21; T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70; 70 − 21 = 49. Magic!

Under the null,

T2 − T1 = n2(n2 + 1)/2 − n1(n1 + 1)/2

Combining this with T1 + T2 = n1(n1 + 1)/2 + n2(n2 + 1)/2 + n1n2 (from the earlier slide) and solving:

T2 = n2(n2 + 1)/2 + n1n2/2
T1 = n1(n1 + 1)/2 + n1n2/2

Define new statistics:

U2 = n2(n2 + 1)/2 + n1n2 − T2     (here, under the null: U2 = 55 + 30 − 70 = 15)
U1 = n1(n1 + 1)/2 + n1n2 − T1     (here, under the null: U1 = 6 + 30 − 21 = 15)

Their sum should equal n1n2: here, U1 + U2 = 30.

Under the null hypothesis, U1 should equal U2:

E(U2 − U1) = E[ ( n2(n2 + 1)/2 − n1(n1 + 1)/2 ) − (T2 − T1) ] = 0

The U's should be equal to each other, and each will equal n1n2/2:
U1 + U2 = n1n2, so under the null, E(U1) = E(U2) = E(U0) = n1n2/2.

So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups; it's the smaller observed U value: U0.
For small n's, take U0 and get the p-value directly from a U table.
For large enough n's (>10 per group)…

Z = ( U0 − E(U0) ) / sqrt( Var(U0) )

where E(U0) = n1n2/2 and Var(U0) = n1n2(n1 + n2 + 1)/12.
Add observed data to the example…

Example: If the girls on the two gymnastics teams were ranked as follows:
Team 1: 1, 5, 7 → Observed T1 = 13
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13 → Observed T2 = 78

Are the teams significantly different?
Total sum of ranks = 13(14)/2 = 91; n1n2 = 3 × 10 = 30

Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null), so U0 ≈ 15.

U1 = 30 + 6 − 13 = 23
U2 = 30 + 55 − 78 = 7
U0 = 7

Not quite statistically significant in the U table… p = .1084 (see attached) × 2 for a two-tailed test.
Example problem 2

A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds; the mean loss for Atkins is going to look higher because of the bozo, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.

Atkins   Jenny Craig
-100     -11
-8       -15
-4       -5
+5       +6
+8       -20
+2
Answer

Ranks (1 = most weight lost):

Atkins: 1, 5, 7, 9, 11, 8 → sum of ranks = 41 (n = 6)
Jenny Craig: 4, 3, 6, 10, 2 → sum of ranks = 25 (n = 5)

n1n2 = 5 × 6 = 30

Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30, so U0 ≈ 15.

U1 = 30 + 15 − 25 = 20
U2 = 30 + 21 − 41 = 10

U0 = 10; n1 = 5, n2 = 6
Go to Mann-Whitney chart… p = .2143 × 2 = .42
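The whole answer can be checked end-to-end in Python (my own sketch; the 11 weight changes are all distinct, so a simple value-to-rank dict suffices):

```python
atkins = [-100, -8, -4, 5, 8, 2]
jcraig = [-11, -15, -5, 6, -20]

# Rank all 11 changes, 1 = most weight lost (smallest value); no ties here.
rank = {v: i + 1 for i, v in enumerate(sorted(atkins + jcraig))}

t_jc = sum(rank[v] for v in jcraig)    # smaller group, n1 = 5
t_atk = sum(rank[v] for v in atkins)   # larger group,  n2 = 6
n1, n2 = 5, 6

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t_jc
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t_atk
u0 = min(u1, u2)
print(t_jc, t_atk, u0)
```

This reproduces the rank sums of 25 and 41 and U0 = 10, which the U table converts to a two-tailed p of about .42: no evidence of a real difference between the diets once the outlier is reduced to a rank.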
