Nonparametric Tests in R
B N Mandal
I.A.S.R.I., Library Avenue, New Delhi – 110 012
bnmandal@iasri.res.in
Introduction
Nonparametric, or distribution-free, tests are so called because the assumptions underlying their
use are “fewer and weaker than those associated with parametric tests” (Siegel & Castellan,
1988, p. 34). To put it another way, nonparametric tests require fewer assumptions about the
shapes of the underlying population distributions. For this reason, they are often used in place of
parametric tests when one feels that the assumptions of the parametric test have been too grossly
violated (e.g., if the distributions are too severely skewed). The purpose of this note is to
demonstrate how R can be used to perform nonparametric tests.
Sign Test
The sign test is one of the simplest nonparametric tests. It is for use with 2 repeated (or
correlated) measures (see the example below), and measurement is assumed to be at least
ordinal. For each subject, subtract the 2nd score from the 1st, and write down the sign of the
difference. (That is, write “-” if the difference score is negative, and “+” if it is positive.
Subjects with a zero difference carry no sign and are conventionally dropped.) The usual null
hypothesis for this test is that there is no difference between the two treatments. If this is so,
then the number of + signs (or - signs, for that matter) should have a binomial distribution with
p = .5, and N = the number of subjects with a nonzero difference. In other words, the sign test is
just a binomial test with + and - in place of Head and Tail (or Success and Failure), i.e., a sign
test is used to decide whether success and failure are equally likely in a binomial distribution.
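The procedure just described can be sketched in a few lines of R. The scores below are hypothetical values chosen only for illustration, and the variable names (first, second, sign_result) are not from the original note.

```r
# Hypothetical paired scores for 10 subjects (illustrative values only).
first  <- c(12, 15,  9, 14, 11, 13, 10, 16, 12, 14)
second <- c(10, 15, 11, 12,  9, 10,  8, 13, 13, 11)

# Sign of each difference (1st minus 2nd); zero differences carry no sign.
d       <- first - second
n_plus  <- sum(d > 0)
n_minus <- sum(d < 0)
n       <- n_plus + n_minus

# Under the null hypothesis the number of "+" signs is Binomial(n, 0.5).
sign_result <- binom.test(n_plus, n, p = 0.5)
print(sign_result)
```

Counting the signs by hand and passing the counts to binom.test keeps the link between the sign test and the binomial test explicit.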
Example
A food product company has invented a new product, and would like to find out if it will be as
popular as the existing favorite product. For this purpose, its research department arranges 18
participants for taste testing. Each participant tries both products in random order before giving
his or her opinion. It turns out that 5 of the participants like the new product better, and the rest
prefer the old one. At the .05 significance level, can we reject the notion that the two products
are equally popular?
The null hypothesis is that the products are equally popular. Here we apply the binom.test
function with 5 successes out of 18 trials. As the p-value turns out to be 0.09625, and is greater
than the .05 significance level, we do not reject the null hypothesis.
data: 5 and 18
number of successes = 5, number of trials = 18,
p-value = 0.09625
alternative hypothesis: true probability of success is not equal to 0.5
At the .05 significance level, we do not reject the notion that the two products are equally popular.
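For reference, a call of the following form reproduces the taste-test result; this is a minimal sketch, and the object names taste_result and p_manual are illustrative. The second step cross-checks the p-value directly from the binomial distribution.

```r
# 5 of the 18 participants prefer the new product; test p = 0.5 (two-sided).
taste_result <- binom.test(5, 18, p = 0.5)
print(taste_result)

# Cross-check: the Binomial(18, 0.5) distribution is symmetric, so the
# two-sided p-value is twice the lower tail probability P(X <= 5).
p_manual <- 2 * pbinom(5, size = 18, prob = 0.5)
print(p_manual)
```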
Example
Barley yields from the same fields in the years 1931 (Y1) and 1932 (Y2) are recorded for different
locations (Loc) and varieties (Var).
Loc Var Y1 Y2
UF M 81 80.7
UF S 105.4 82.3
UF V 119.7 80.4
UF T 109.7 87.2
UF P 98.3 84.2
W M 146.6 100.4
W S 142 115.5
W V 150.7 112.2
W T 191.5 147.7
W P 145.7 108.1
M M 82.3 103.1
M S 77.3 105.1
M V 78.4 116.5
M T 131.3 139.9
M P 89.6 129.6
C M 119.8 98.9
C S 121.4 61.9
C V 124 96.2
C T 140.8 125.5
C P 124.8 75.7
GR M 98.9 66.4
GR S 89 49.9
GR V 69.1 96.7
GR T 89.3 61.9
GR P 104.1 80.3
D M 86.9 67.7
D S 77.1 66.7
D V 78.9 67.4
D T 101.8 91.8
D P 96 94.1
Without assuming the data to have a normal distribution, test at the .05 significance level whether
the barley yields of 1931 and 1932 have identical distributions.
The null hypothesis is that the barley yields of the two sample years come from identical
populations. To test the hypothesis, we apply the wilcox.test function, which performs the
Wilcoxon signed-rank test when the samples are matched. For the paired test, we set the "paired"
argument to TRUE. As the p-value turns out to be 0.005318, and is less than the .05 significance
level, we reject the null hypothesis.
> barley=read.csv(file.choose())
> attach(barley)
> wilcox.test(Y1,Y2,paired=TRUE)
data: Y1 and Y2
V = 368.5, p-value = 0.005318
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(Y1, Y2, paired = TRUE) :
cannot compute exact p-value with ties
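As a self-contained alternative to reading a CSV file interactively, the yields from the table can be typed in directly. This is a sketch only: the object names y1, y2 and barley_result are illustrative, and the call avoids attach() by passing the vectors explicitly.

```r
# Barley yields transcribed from the table above (1931 = y1, 1932 = y2).
y1 <- c( 81.0, 105.4, 119.7, 109.7,  98.3, 146.6, 142.0, 150.7, 191.5, 145.7,
         82.3,  77.3,  78.4, 131.3,  89.6, 119.8, 121.4, 124.0, 140.8, 124.8,
         98.9,  89.0,  69.1,  89.3, 104.1,  86.9,  77.1,  78.9, 101.8,  96.0)
y2 <- c( 80.7,  82.3,  80.4,  87.2,  84.2, 100.4, 115.5, 112.2, 147.7, 108.1,
        103.1, 105.1, 116.5, 139.9, 129.6,  98.9,  61.9,  96.2, 125.5,  75.7,
         66.4,  49.9,  96.7,  61.9,  80.3,  67.7,  66.7,  67.4,  91.8,  94.1)

# Paired (signed-rank) test. Tied absolute differences can make R fall back
# to a normal approximation and issue a warning; that is expected here.
barley_result <- suppressWarnings(wilcox.test(y1, y2, paired = TRUE))
print(barley_result)
```

The statistic V is the sum of the ranks of the positive differences, so a large V relative to its maximum of 30 * 31 / 2 = 465 indicates that the 1931 yields tend to exceed the 1932 yields.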
Mann-Whitney-Wilcoxon Test
Two data samples are independent if they come from distinct populations and the samples do not
affect each other. Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to follow the normal distribution.
Example
The seasonal rainfall at two stations is given below. Without assuming the data to have a normal
distribution, test whether the distribution of rainfall at the two stations is the same.
Station A Station B
1011.07 496.44
1066.82 541.76
610.8 1562.01
1111.44 2515.12
955.68 1133.99
1203.84 300.33
1600.32 482.55
555.9 503.22
1302.95 2744.23
182.34 1232.22
1233.2
1402.09
> rainfall=read.csv(file.choose())
> attach(rainfall)
To test the hypothesis, we apply the wilcox.test function to compare the independent samples. The
null hypothesis that the two rainfall distributions are identical is rejected if the p-value is less
than the .05 significance level. For these data the p-value is well above .05, so we do not reject
the null hypothesis.
> wilcox.test(Station_A,Station_B)
At the .05 significance level, we find no evidence that the rainfall distributions at the two
stations differ.
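The same test can be run without a CSV file by entering the values directly. This sketch uses only the ten rows of the table that list a value for both stations, because the station of the last two single values is not clear from the table; the object names are illustrative.

```r
# Seasonal rainfall, first ten rows of the table (both stations present).
station_a <- c(1011.07, 1066.82,  610.80, 1111.44,  955.68,
               1203.84, 1600.32,  555.90, 1302.95,  182.34)
station_b <- c( 496.44,  541.76, 1562.01, 2515.12, 1133.99,
                300.33,  482.55,  503.22, 2744.23, 1232.22)

# Two independent samples: Mann-Whitney-Wilcoxon (rank-sum) test.
rain_result <- wilcox.test(station_a, station_b)
print(rain_result)

# W counts the (a, b) pairs in which the Station A value exceeds the
# Station B value; under the null hypothesis it is near half of all pairs.
w_manual <- sum(outer(station_a, station_b, ">"))
print(w_manual)
```

With ten values per station there are 100 pairs, so a W close to 50 is what the null hypothesis predicts.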
Kruskal-Wallis Test
A collection of data samples are independent if they come from unrelated populations and the
samples do not affect each other. Using the Kruskal-Wallis Test, we can decide whether the
population distributions are identical without assuming them to follow the normal distribution.
In the built-in data set named airquality, the daily air quality measurements in New York, May to
September 1973, are recorded. The ozone density is presented in the data frame column Ozone.
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
Without assuming the data to have a normal distribution, test at the .05 significance level whether
the monthly ozone density in New York has identical distributions from May to September 1973.
The null hypothesis is that the monthly ozone density is the same from May to September. To test
the hypothesis, we apply the kruskal.test function to compare the independent monthly data, with
Ozone as the response and Month as the grouping variable. The p-value turns out to be nearly zero
(6.901e-06). Hence we reject the null hypothesis.
At the .05 significance level, we conclude that the monthly ozone densities in New York from May
to September 1973 do not come from identical populations.
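A call of the following form carries out this test on the built-in airquality data. It is a minimal sketch using the formula interface of kruskal.test; the object name kw_result is illustrative.

```r
# Kruskal-Wallis test of Ozone across the five months (May to September).
# Rows with a missing Ozone value are dropped by the default NA handling.
kw_result <- kruskal.test(Ozone ~ Month, data = airquality)
print(kw_result)

# Number of non-missing ozone readings available in each month.
print(table(airquality$Month[!is.na(airquality$Ozone)]))
```

The degrees of freedom equal the number of groups minus one, here 5 - 1 = 4.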
References
R Development Core Team (2011). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.R-project.org/.
Siegel, S., & Castellan, N.J. (1988). Nonparametric statistics for the behavioral sciences (2nd
Ed.). New York, NY: McGraw-Hill.