6
Nonparametric Methods
CHAPTER VI; SECTION A: INTRODUCTION TO NONPARAMETRIC METHODS
Purposes of Nonparametric Methods:
Nonparametric methods are uniquely useful for testing nominal (categorical) and ordinal
(ordered) scaled data--situations where parametric tests are not generally available. An
important second use is when an underlying assumption for a parametric method has been
violated. In this case, the interval/ratio scale data can be easily transformed into ordinal scale
data and the counterpart nonparametric method can be used.
Inferential and Descriptive Statistics: The nonparametric methods described in this chapter
are used for both inferential and descriptive statistics. Inferential statistics use data to draw
inferences (i.e., derive conclusions) or to make predictions. In this chapter, nonparametric
inferential statistical methods are used to draw conclusions about one or more populations from
which the data samples have been taken. Descriptive statistics aren't used to make
predictions but to describe the data. This is often best done using graphical methods.
Examples: An analyst or engineer might be interested to assess the evidence regarding:
1. The difference between the mean/median accident rates of several marked and
unmarked crosswalks (when the parametric Student's t test is invalid because the sample
distributions are not normal).
2. The differences in average absolute error between two types of models for
forecasting traffic flow (when analysis of variance is invalid because the distribution of errors
is not normal).
3. The relationship between the airport site evaluation ordinal rankings of two sets of
judges, i.e., citizens and airport professionals.
4. The differences between neighborhood districts in their use of a regional mall for
purposes of planning transit routes.
5. The comparison of accidents before and during roadway construction to investigate if
factors such as roadway grade, day of week, weather, etc. have an impact on the
differences.
6. The association between ordinal variables, e.g., area type and speed limit, to
eliminate intercorrelated independent variables when estimating models that predict the
number of utility pole accidents.
7. The relative goodness of fit of possible predictive models to the observed data for
expected accident rates for rail-highway crossings.
8. The relative goodness of fit of hypothetical probability distributions, e.g., lognormal
and Weibull, to actual air quality data with the intent of using the distributions to predict
the number of days on which observed ozone and carbon monoxide concentrations exceed
National Ambient Air Quality Standards.
Kullback, S. and John C. Keegel. (1985). Red Turn Arrow: An Information-Theoretic Evaluation.
Journal of Transportation Engineering, V.111, N.4, July, pp. 441-452. American Society of Civil
Engineers. (Inappropriate use of Chi-Square Test for Independence)
Zegeer, Charles V., and Martin R. Parker Jr. (1984). Effect of Traffic and Roadway Features on
Utility Pole Accidents. Transportation Research Record #970 pp. 65-76. National Academy of
Sciences. (Kendall's Tau Measure of Association for Ordinal Data)
Traffic
Smith, Brian L. and Michael J. Demetsky. (1997). Traffic Flow Forecasting: Comparison of
Modeling Approaches. Journal of Transportation Engineering, V.123, N.4, July/August, pp. 261-266.
American Society of Civil Engineers. (Wilcoxon Matched-Pairs Signed-Ranks Test for
Two Dependent Samples)
Transit
Ross, Thomas J. and Eugene M. Wilson. (1977). Activity Based Transit Routing. Transportation
Engineering Journal, V.103, N.TE5, September, pp. 565-573. American Society of Civil
Engineers. (Chi-Square Test for Independence)
Planning
Jarvis, John J., V. Ed Unger, Charles C. Schimpeler, and Joseph C. Corradino. (1976). Multiple
Criteria Theory and Airport Site Evaluation. Journal of Urban Planning and Development Division,
V.102, N.UP1, August, pp. 187-197. American Society of Civil Engineers. (Kendall's Tau
Measure of Association for Ordinal Data)
Nonparametric References:
W. J. Conover. Practical Nonparametric Statistics. Third Edition, John Wiley & Sons, New York,
1999.
W. J. Conover. Practical Nonparametric Statistics. Second Edition, John Wiley & Sons, New York,
1980.
Richard A. Johnson. Miller & Freund's Probability and Statistics for Engineers. Prentice Hall,
Englewood Cliffs, New Jersey, 1994.
Douglas C. Montgomery. Design and Analysis of Experiments. Fourth Edition, John Wiley & Sons,
New York, 1997.
David J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press,
New York, 1997.
As a first look, X1 appears to have a linear relationship with Y1 while X2 appears to have a
nonlinear relationship. Hypothesizing such apparent relationships is useful in selecting a
preliminary model type and relationship. Many statistical software packages make it easy for
the user to study such relationships among all the variables in the data. This method is
typically called pairwise scatter plots but other terms are also used. The three variables
explored in the previous scatter plots are used again to plot the pairwise scatter plots shown
below.
[Scatter plots of Y1 versus X1 and Y1 versus X2]
By using pairwise scatter plots, the relationships among all the variables may be explored with
a single plot. However, as the number of variables increases, the individual plots in a single
display become too small to be useful. In this case, the variables can be plotted in subsets.
[Pairwise scatter plot matrix for variables Y1, X1, and X2]
Contour plots and surface plots provide another view of the three-dimensional
surface created by the data. In both of these graphs the values between the data values are
interpolated. The user is cautioned that these interpolations are not validated and a smooth
transition from one point to the next may not be a true representation of the data. Like all
graphical methods, these should be used only to obtain a first look at the data, and when
appropriate, aid in developing preliminary hypotheses--along with other available information--for
modeling the data.
Figure 19: Contour and Surface Plots for Variables Y1, X1, and X2
The ten observed values of X1 and X2 are:

    X1:   25   28   36   41   44   53   54   61   66   72
    X2:   25   28   36   41   44   43   39   35   26   21
In order to create a frequency distribution for a variable, appropriate classes must be selected
for it. Using both X1 and X2 as examples, classes of 0 to 10, 11 to 20, etc. are chosen and a
frequency table constructed as shown below.
Class                X1 Frequency    X2 Frequency
 0 - 10                   0               0
11 - 20                   0               0
21 - 30                   2               4
31 - 40                   1               3
41 - 50                   2               3
51 - 60                   2               0
61 - 70                   2               0
71 - 80                   1               0
Total                    10              10

Variance               252.00           67.73
Standard Deviation      15.87            8.23
The variance and standard deviation are also shown in the table along with the frequency
distribution. These are measures of the spread of the data about the mean. If all the values are
bunched close to the mean, then the spread is small. Likewise, the spread is large if all the
values are scattered widely about their mean. A measure of the spread of data is useful to
supplement the mean in describing the data. If a set of numbers x1, x2, ..., xn has a mean x-bar,
the differences x1 - x-bar, x2 - x-bar, ..., xn - x-bar are called the deviations from the mean. Because the
sum of these deviations is always zero, an alternative approach is to square each deviation. The
sample variance, s^2, is essentially the average of the squared deviations from the mean:

    variance = s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)
By dividing the sum of squares by its degrees of freedom, n - 1, an unbiased estimate of the
population variance is obtained. Notice that s^2 has the wrong units, i.e., not the same units as
the variable itself. To correct this, the standard deviation is defined as the square root of the
variance, which has the same units as the data. Unlike the variance estimate, the estimated
standard deviation is biased, and the bias becomes larger as sample sizes become smaller.
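As a quick check of the hand computations, the sample variance and standard deviation of X1 and X2 can be computed directly. The minimal sketch below (Python with NumPy, assuming the ten observations listed earlier) reproduces the values shown in the frequency table.

```python
import numpy as np

# The ten observations of X1 and X2 listed earlier in this section.
x1 = np.array([25, 28, 36, 41, 44, 53, 54, 61, 66, 72])
x2 = np.array([25, 28, 36, 41, 44, 43, 39, 35, 26, 21])

for name, x in (("X1", x1), ("X2", x2)):
    n = len(x)
    deviations = x - x.mean()              # deviations from the mean (they sum to zero)
    s2 = np.sum(deviations**2) / (n - 1)   # sample variance, divisor n - 1
    s = np.sqrt(s2)                        # sample standard deviation
    print(f"{name}: variance = {s2:.2f}, std dev = {s:.2f}")
# Expected output (matching the table): X1: 252.00 and 15.87; X2: 67.73 and 8.23
```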
While these quantitative descriptive statistics are useful, it is often valuable to provide graphical
representations of them. The frequency distributions can be shown graphically using frequency
histograms as shown below for X1 and X2.
[Frequency histograms of X1 and X2: class frequency versus variable value]
As can be seen from these two frequency histograms, X2 has a much smaller spread (standard
deviation = 8.23) than does X1 (standard deviation = 15.87). Another plot, the boxplot, shows
the quartiles of the data. Boxplots for X1 and X2 are shown in Figure 21.
Different statistical software packages typically depict Boxplots in different ways, but the one
shown here is typical. Boxplots are particularly effective when several are placed side-by-side
for comparison. The shaded area indicates the middle half of the data. The center line inside
this shaded area is drawn at the median value. The upper edge of the shaded area is the value
of the upper quartile and the lower edge is the value of the lower quartile. Lines extend from the
shaded areas to the maximum and minimum values of the data indicated by horizontal lines.
[Figure 21: Boxplots of X1 and X2]
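The histograms and boxplots above are straightforward to reproduce. A minimal sketch in Python with Matplotlib is shown below, assuming the same X1 and X2 values listed earlier; the bin edges only approximate the 0-10, 11-20, ... classes used in the text.

```python
import numpy as np
import matplotlib.pyplot as plt

x1 = np.array([25, 28, 36, 41, 44, 53, 54, 61, 66, 72])
x2 = np.array([25, 28, 36, 41, 44, 43, 39, 35, 26, 21])

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
bins = np.arange(0, 90, 10)   # classes of width 10 spanning 0 to 80
axes[0].hist(x1, bins=bins, edgecolor="black")
axes[0].set(title="Frequency Histogram of X1", xlabel="Variable X1", ylabel="Class Frequency")
axes[1].hist(x2, bins=bins, edgecolor="black")
axes[1].set(title="Frequency Histogram of X2", xlabel="Variable X2", ylabel="Class Frequency")
axes[2].boxplot([x1, x2], whis=(0, 100))   # whiskers to the min and max, as described in the text
axes[2].set_xticklabels(["X1", "X2"])
axes[2].set_title("Boxplots of X1 and X2")
plt.tight_layout()
plt.show()
```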
Hypotheses About Population Means/Medians
(Independent Samples / Dependent Samples)

The differences from the hypothesized mean/median m are computed as

    D_i = X_i - m,    i = 1, 2, ..., n
All differences of zero are omitted. Let the number of pairs remaining be denoted by n', n' < n.
Ranks from 1 to n' are assigned to the n' differences. The smallest absolute difference |Di | is
ranked 1, the second smallest |Di | is ranked 2, and so forth. The largest absolute difference is
ranked n'. If groups of absolute differences are equal to each other, assign a rank to each equal
to the average of the ranks they would have otherwise been assigned. For example, if four
absolute differences are equal and would hold ranks 8, 9, 10, and 11, each is assigned the rank
of 9.5, which is the average of 8, 9, 10, and 11.
Although the absolute difference is used to obtain the rankings, the sign of Di is still used in the
test statistic. Ri, called the signed rank, is defined as follows:
Ri = the rank assigned to |Di | if Di is positive.
Ri = the negative of the rank assigned to |Di | if Di is negative.
The test statistic T+ is the sum of the positive signed ranks when there are no ties and n' < 50.
Lower quantiles w_p of the exact distribution of T+ are given in Table C-8, under the null hypothesis
that the Di's have mean 0.

    T+ = \sum (R_i where D_i is positive)

Based on the relationship that the sum of the ranks of the absolute differences is equal to
n'(n' + 1)/2, the upper quantiles w_{1-p} are found by the relationship

    w_{1-p} = n'(n' + 1)/2 - w_p
If there are many ties, or if n' > 50, the normal approximation test statistic T is used, which uses
all of the signed ranks, with their + and - signs. Quantiles of the approximate distribution of T
are given in a Normal Distribution Table.

    T = \sum_{i=1}^{n'} R_i / \sqrt{ \sum_{i=1}^{n'} R_i^2 }
For the lower-tailed test, reject the null hypothesis Ho at level α if T+ (or T) is less than its α
quantile from Table C-8 for T+ (or the Normal Table C-1 for T). Otherwise, accept Ho (meaning
the median (or mean) of X is greater than or equal to m). The p-value, approximated from the
normal distribution, can be found by

    p-value = P( Z \le ( \sum_{i=1}^{n'} R_i + 1 ) / \sqrt{ \sum_{i=1}^{n'} R_i^2 } )

The two-tailed p-value is twice the smaller of the one-tailed p-values calculated above.
Computational Example: (Adapted from Conover (1999, p. 356-357)) Thirty observations on
the random variable X are measured in order to test the hypothesis that E (X), the mean of X, is
no smaller than 30 (lower-tailed test).
Ho: E(X) (the mean) >= m
Ha: E(X) (the mean) < m
The observations, the differences, Di = (Xi - m), and the ranks of the absolute differences |Di |
are listed below. The thirty observations were ordered first for convenience.
  Xi      Di = (Xi - 30)   Rank of |Di |       Xi      Di = (Xi - 30)   Rank of |Di |
 23.8         -6.2              17            35.9         +5.9              15*
 26.0         -4.0              11            36.1         +6.1              16*
 26.9         -3.1               8            36.4         +6.4              18*
 27.4         -2.6               6            36.6         +6.6              19*
 28.0         -2.0               5            37.2         +7.2              20*
 30.3         +0.3               1*           37.3         +7.3              21*
 30.7         +0.7               2*           37.9         +7.9              22*
 31.2         +1.2               3*           38.2         +8.2              23*
 31.3         +1.3               4*           39.6         +9.6              24*
 32.8         +2.8               7*           40.6        +10.6              25*
 33.2         +3.2               9*           41.1        +11.1              26*
 33.9         +3.9              10*           42.3        +12.3              27*
 34.3         +4.3              12*           42.8        +12.8              28*
 34.9         +4.9              13*           44.0        +14.0              29*
 35.0         +5.0              14*           45.8        +15.8              30*

(* indicates a rank associated with a positive difference)
There are no ties in the data nor is the sample size greater than 50. Therefore, from Table C-8,
Quantiles of Wilcoxon Signed Ranks Test Statistic, for n = 30, the 0.05 quantile is 152. The
critical region of size 0.05 corresponds to values of the test statistic less than 152. The test
statistic T+ = 418. This is the sum of all the Ranks, which have positive differences, as noted in
the table by asterisks. Since T+ is not within the critical region, Ho is accepted, and the analyst
concludes that the mean of X is greater than or equal to 30.
The approximate p-value is calculated by the following equation. Recall that the summation of
the squares of a set of numbers from 1 to N is equal to [N (N+1) (2N + 1) / 6].
    z = ( \sum_{i=1}^{n'} R_i + 1 ) / \sqrt{ \sum_{i=1}^{n'} R_i^2 }
      = (371 + 1) / \sqrt{ (30)(30 + 1)(2(30) + 1)/6 }

    p-value = P( Z \le 3.8900 )
The normal distribution table shows that this p-value is greater than 0.999; the data provide no
evidence against the null hypothesis that the mean of X is greater than or equal to 30.
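The test statistic T+ is easy to cross-check by script. The sketch below (Python, using scipy.stats.rankdata and the thirty differences Di from the table) reproduces T+ = 418 and compares it with the tabled 0.05 quantile of 152 cited above.

```python
import numpy as np
from scipy.stats import rankdata

# The thirty differences Di = Xi - 30 from the table above.
d = np.array([-6.2, -4.0, -3.1, -2.6, -2.0, 0.3, 0.7, 1.2, 1.3, 2.8,
              3.2, 3.9, 4.3, 4.9, 5.0, 5.9, 6.1, 6.4, 6.6, 7.2,
              7.3, 7.9, 8.2, 9.6, 10.6, 11.1, 12.3, 12.8, 14.0, 15.8])

d = d[d != 0]                   # drop any zero differences (none in this example)
ranks = rankdata(np.abs(d))     # ranks of |Di|, with average ranks for ties
t_plus = ranks[d > 0].sum()     # sum of ranks belonging to positive differences

print(t_plus)                   # 418.0, as in the worked example
# Lower-tailed test at alpha = 0.05: reject Ho only if T+ < 152 (Table C-8 quantile for n' = 30).
print("reject Ho" if t_plus < 152 else "accept Ho")
```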
For the Mann-Whitney test for two independent samples, the test statistic T is the sum of the
ranks assigned to the first sample:

    T = \sum_{i=1}^{n} R(X_i)
If there are many ties, the test statistic T1 is used, which simply subtracts the mean from T and
divides by the standard deviation:

    T1 = ( T - n(N + 1)/2 ) / \sqrt{ nm/(N(N - 1)) \sum_{i=1}^{N} R_i^2 - nm(N + 1)^2/(4(N - 1)) }

where \sum R_i^2 is the sum of the squares of all N of the ranks or average ranks actually used in both
samples.
Lower quantiles w_p of the exact distribution of T are given for n and m values of 20 or less in
Table C-6. Upper quantiles w_{1-p} are found by the relationship

    w_{1-p} = n(n + m + 1) - w_p

Perhaps more convenient is the use of T', which can be used with the lower quantiles in Table C-6
whenever an upper-tailed test is desired:

    T' = n(N + 1) - T
When there are many ties in the data, T1 is used, which is approximately a standard normal
random variable. Therefore, the quantiles for T1 are found in Table C-1, which is the standard
normal table.

When n or m is greater than 20 (and there are still no ties), the approximate quantiles are found
by the normal approximation

    w_p \approx n(N + 1)/2 + z_p \sqrt{ nm(N + 1)/12 }

where z_p is the pth quantile of a standard normal random variable obtained from Table C-1.
INTERPRETATION OF OUTPUT (DECISION RULE) OF MANN-WHITNEY TEST FOR TWO
INDEPENDENT SAMPLES
For the two-sided test, reject the null hypothesis Ho at level α if T (or T1) is less than its α/2
quantile or greater than its 1 - α/2 quantile from Table C-6 for T (or from the Standard Normal
Table C-1 for T1). Otherwise, accept Ho if T (or T1) is between, or equal to one of, the quantiles,
indicating the means of the two samples are equal.

For the upper-tailed test, large values of T indicate that Ha is true. Reject the null hypothesis Ho
at level α if T (or T1) is greater than its 1 - α quantile from Table C-6 for T (or from the Standard
Normal Table C-1 for T1). It may be easier to find T' = n(N + 1) - T and reject Ho if T' is less than
its α quantile from Table C-6. Otherwise, accept Ho if T (or T1) is less than or equal to its 1 - α
quantile, indicating the mean of population 1 is less than or equal to the mean of population 2.

For the lower-tailed test, small values of T indicate that Ha is true. Reject the null hypothesis Ho
at level α if T (or T1) is less than its α quantile from Table C-6 for T (or from the Standard Normal
Table C-1 for T1). Otherwise, accept Ho if T (or T1) is greater than or equal to its α quantile,
indicating the mean of population 1 is greater than or equal to the mean of population 2. When n
or m is greater than 20 (and there are no ties), the quantiles used in the above decisions are
obtained directly from the normal approximation equation given previously for this condition.
Computational Example: (Adapted from Conover (1999, p. 278-279)) Nine pieces of flint were
collected for a simple experiment, four from region A and five from region B. Hardness was
judged by rubbing two pieces of flint together and observing how each was damaged. The one
having the least damage was judged harder. Using this method all nine pieces of flint were
tested against each other, allowing them to be rank ordered from softest (rank 1) to hardest
(rank 9).
[Table: the nine flint pieces ranked from 1 (softest) to 9 (hardest), by region]

The rank sum of the four pieces from region A is the test statistic, T = 11. From Table C-6 with
n = 4 and m = 5, the lower critical value is 12, and the corresponding upper critical value is

    w_{1-p} = n(n + m + 1) - w_p = 4(4 + 5 + 1) - 12 = 28
Since the test statistic of 11 falls inside the lower critical region, i.e., less than 12, the null
hypothesis Ho is rejected and the alternative hypothesis is accepted: the flints from the two
regions have different hardnesses. Because of the direction of the difference, it is further concluded
that the flint in region A is softer than the flint in region B.
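The rank-sum computation is simple to script. The sketch below (Python/SciPy) uses two small hypothetical samples, since the region assignment of the individual flint ranks is not reproduced here; it computes T as the rank sum of the first sample and shows how it relates to the Mann-Whitney U statistic that scipy.stats.mannwhitneyu reports.

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

# Hypothetical hardness scores for two regions (not the original flint data).
region_a = np.array([3.1, 2.7, 3.4, 2.9])        # n = 4
region_b = np.array([3.8, 4.1, 3.6, 4.4, 3.9])   # m = 5

combined = np.concatenate([region_a, region_b])
ranks = rankdata(combined)            # joint ranking, average ranks for ties
T = ranks[:len(region_a)].sum()       # rank sum of sample 1 (the test statistic T)

U, p = mannwhitneyu(region_a, region_b, alternative="two-sided")
# U relates to the rank-sum form by U = T - n(n + 1)/2.
print(T, U, T - len(region_a) * (len(region_a) + 1) / 2, p)
```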
Safety Example: In an evaluation of marked and unmarked crosswalks (Gibby, Stites, Thurgood,
and Ferrara, Federal Highway Administration FHWACA/TO-94-1, 1994), researchers in California
2.
3. If there is a difference in any of the k population distribution functions F(x1), F(x2), ..., F(xk),
   it is a difference in the location of the distributions. For example, if F(x1) is not identical
   with F(x2), then F(x1) is identical with F(x2 + c), where c is some constant.
4.
sample 1        sample 2        ...        sample k
X1,1            X2,1            ...        Xk,1
X1,2            X2,2            ...        Xk,2
...             ...             ...        ...
X1,n1           X2,n2           ...        Xk,nk

The total number of observations is N = \sum_{i=1}^{k} n_i.
Rank the observations Xij in ascending order and replace each observation by its rank R (Xi j),
with the smallest observation having rank 1 and the largest observation having rank N. Let Ri be
the sum of the ranks assigned to the ith sample. Compute Ri for each sample.
    R_i = \sum_{j=1}^{n_i} R(X_{ij}),    i = 1, 2, ..., k
If several values are tied, assign each the average of the ranks that would have been assigned
to them had there been no ties.
The test statistic, adjusted for ties, is

    T = (1/S^2) [ \sum_{i=1}^{k} R_i^2 / n_i  -  N(N + 1)^2 / 4 ]

where the variance of the ranks is

    S^2 = 1/(N - 1) [ \sum_{all ranks} R(X_{ij})^2  -  N(N + 1)^2 / 4 ]

If there are no ties, the test statistic reduces to

    T = 12/(N(N + 1)) \sum_{i=1}^{k} R_i^2 / n_i  -  3(N + 1)

When the number of ties is moderate, this simpler equation may be used with little difference in
the result when compared to the more complex equation needed for ties.
INTERPRETATION OF OUTPUT (DECISION RULE) OF KRUSKAL-WALLIS ONE-WAY
ANALYSIS OF VARIANCE BY RANKS TEST FOR SEVERAL INDEPENDENT SAMPLES
The tables required for the exact distribution of T would be quite extensive considering that
every combination of sample sizes for a given k would be needed, multiplied by however many
samples k would be included. Fortunately, if the ni are reasonably large, say ni >= 5, then under the
null hypothesis T is distributed approximately as chi-square with k - 1 degrees of freedom, so the
chi-square Table C-2 may be used to find the critical region.
Computational Example: Tensile strengths were measured for five blended fibers with different
percent weights of cotton (15%, 20%, 25%, 30%, and 35%), with five specimens tested per blend.
The observations Xij and their joint ranks R(Xij) are listed below.

             sample 1        sample 2        sample 3        sample 4        sample 5
             (15% cotton)    (20% cotton)    (25% cotton)    (30% cotton)    (35% cotton)
             X1j   R(X1j)    X2j   R(X2j)    X3j   R(X3j)    X4j   R(X4j)    X5j   R(X5j)
              7     2.0       12    9.5       14   11.0       19   20.5        7    2.0
              7     2.0       17   14.0       18   16.5       25   25.0       10    5.0
             15    12.5       12    9.5       18   16.5       22   23.0       11    7.0
             11     7.0       18   16.5       19   20.5       19   20.5       15   12.5
              9     4.0       18   16.5       19   20.5       23   24.0       11    7.0
Ri =               27.5             66.0             85.0            113.0           33.5
ni =                5                5                5                5               5

N = 25
An example of how ties are ranked can be seen from the lowest value, which is 7. Note that
there are three observations that have a value of 7 so these would normally have ranks 1, 2, and
3. Since they are tied, they are averaged and each value of 7 gets the rank of 2.0.
The hypothesis to be tested is
Ho: E(X1) = E(X2) = ... = E(Xk) -- all of the 5 blended fibers, with different percent weights
of cotton, have mean tensile strengths that are equal
Ha: At least one of the 5 blended fiber mean tensile strengths is not equal to at least one of
the other blended fiber mean tensile strengths
Since there are ties, the variance of the ranks S^2 is calculated by

    S^2 = 1/(N - 1) [ \sum_{all ranks} R(X_{ij})^2 - N(N + 1)^2/4 ]
        = (1/24) [ 5510 - 25(26)^2/4 ] = 53.51

and the test statistic is

    T = (1/S^2) [ \sum_{i=1}^{k} R_i^2/n_i - N(N + 1)^2/4 ]
      = (1/53.51) [ 5245.7 - 25(26)^2/4 ] = 19.06
For a critical region of 0.05, the 1 - α quantile (0.95) of the chi-square distribution with 5 - 1 = 4
degrees of freedom from Table C-2 is 9.49. Since T = 19.06 lies in this critical region, i.e., in
the region greater than 9.49, the null hypothesis Ho is rejected and it is concluded that at least
one pair of the blended fiber tensile strength means is not equal to each other.
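SciPy's kruskal function uses the same tie-corrected statistic, so it can serve as a check on the hand computation. The sketch below (Python/SciPy, assuming the five cotton-blend samples from the table) should reproduce T of about 19.06 up to rounding.

```python
from scipy.stats import kruskal

# Tensile strength observations for the five cotton-content blends (from the table above).
cotton_15 = [7, 7, 15, 11, 9]
cotton_20 = [12, 17, 12, 18, 18]
cotton_25 = [14, 18, 18, 19, 19]
cotton_30 = [19, 25, 22, 19, 23]
cotton_35 = [7, 10, 11, 15, 11]

T, p = kruskal(cotton_15, cotton_20, cotton_25, cotton_30, cotton_35)
print(round(T, 2), p)   # roughly 19.06; p-value from the chi-square approximation with 4 df
# Reject Ho at alpha = 0.05 since T exceeds 9.49, the 0.95 chi-square quantile with 4 df.
```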
PAIRWISE COMPARISONS USING THE KRUSKAL-WALLIS ONE-WAY ANALYSIS OF
VARIANCE BY RANKS TEST FOR SEVERAL INDEPENDENT SAMPLES
When the Kruskal-Wallis test rejects the null hypothesis, it indicates that one or more pairs of
samples do not have the same means but it does not tell us which pairs. Various sources
support different methods for finding the specific pairs that are not equal, called pairwise
comparisons. Conover (1999) discusses using the usual parametric procedure, called Fisher's
least significant difference, computed on the ranks instead of the data. If, and only if, the null
hypothesis is rejected, the procedure dictates that the population means μi and μj seem to be
different if the following inequality is satisfied:
    | R_i/n_i - R_j/n_j |  >  t_{1-α/2} \sqrt{ S^2 (N - 1 - T)/(N - k) ( 1/n_i + 1/n_j ) }

R_i and R_j are the rank sums of the two samples being compared. The 1 - α/2 quantile of the t
distribution, t_{1-α/2}, with N - k degrees of freedom is obtained from the t distribution Table C-4.
S^2 and T are as already defined for the Kruskal-Wallis test.
For the fiber tensile strength computational example, the pairwise comparisons between the
15% and the 20% cotton content fibers can be made by the following computation. For a
critical region of 0.05, from Table C-4, the 1 - /2 quantile (0.975) for the t distribution with 25 -5
= 20 degrees of freedom is 2.086.
    | 27.5/5 - 66.0/5 | = 7.7  >  2.086 \sqrt{ 53.51 (25 - 1 - 19.06)/(25 - 5) ( 1/5 + 1/5 ) } = 4.80
Since the inequality is true, it is concluded that the tensile strength means of the 15% and the
20% cotton content fibers are different. Notice that since all the samples are the same size,
the right side of this inequality will remain constant for all comparisons. The following table lists
all the pairwise comparisons.
Comparison        | R_i/n_i - R_j/n_j |     Critical value
15% vs 20%                 7.7                   4.80
15% vs 25%                11.5                   4.80
15% vs 30%                17.2                   4.80
15% vs 35%                 1.2                   4.80
20% vs 25%                 3.8                   4.80
20% vs 30%                 9.4                   4.80
20% vs 35%                 6.5                   4.80
25% vs 30%                 5.6                   4.80
25% vs 35%                10.3                   4.80
30% vs 35%                15.9                   4.80
All of the pairwise inequalities are true except for two: the 15% and 35% pair and the 20% and
25% pair. Based on the engineer's originally stated experience, he suspects that the 35% fiber may
be losing strength, which would account for the 15% and 35% pair having the same tensile
strength. The equal strengths of the 20% and 25% fibers appear to indicate that little benefit is
gained in tensile strength by this increase in cotton content as compared to, say, the increase
from 15% to 20% or from 25% to 30%. Of course, more testing is probably prudent
now that this preliminary information is known.
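The pairwise inequality is easy to automate once the Kruskal-Wallis quantities are in hand. The sketch below (Python/SciPy, using the rank sums, S^2, and T from the cotton example) recomputes the 4.80 critical difference and flags which pairs differ.

```python
from itertools import combinations
from math import sqrt
from scipy.stats import t

# Quantities from the Kruskal-Wallis cotton example above.
labels = ["15%", "20%", "25%", "30%", "35%"]
R = [27.5, 66.0, 85.0, 113.0, 33.5]   # rank sums
n = [5, 5, 5, 5, 5]
N, k = 25, 5
S2, T_stat, alpha = 53.51, 19.06, 0.05

t_crit = t.ppf(1 - alpha / 2, N - k)  # 0.975 quantile of t with 20 df, about 2.086
for (i, j) in combinations(range(k), 2):
    diff = abs(R[i] / n[i] - R[j] / n[j])
    crit = t_crit * sqrt(S2 * (N - 1 - T_stat) / (N - k) * (1 / n[i] + 1 / n[j]))
    print(f"{labels[i]} vs {labels[j]}: |diff| = {diff:.1f}, critical = {crit:.2f}, "
          f"{'different' if diff > crit else 'not different'}")
```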
ASSUMPTIONS OF WILCOXON MATCHED-PAIRS SIGNED-RANKS TEST FOR TWO
DEPENDENT SAMPLES
1. The sample of n subjects is a random sample from the population it represents. Thus, the
   Di's are mutually independent.
2.
3.
4.

A. Two-sided test
   Ho: E(D) = 0,    or    E(Yi) = E(Xi)
   Ha: E(D) ≠ 0,    or    E(Yi) ≠ E(Xi)

B. Upper-sided test
   Ho: E(D) ≤ 0,    or    E(Yi) ≤ E(Xi)
   Ha: E(D) > 0,    or    E(Yi) > E(Xi)

C. Lower-sided test
   Ho: E(D) ≥ 0,    or    E(Yi) ≥ E(Xi)
   Ha: E(D) < 0,    or    E(Yi) < E(Xi)
    D_i = Y_i - X_i,    i = 1, 2, ..., n
All differences of zero are omitted. Let the number of pairs remaining be denoted by n', n' < n.
Ranks from 1 to n' are assigned to the n' differences. The smallest absolute difference |Di | is
ranked 1, the second smallest |Di | is ranked 2, and so forth. The largest absolute difference is
ranked n'. If groups of absolute differences are equal to each other, assign a rank to each equal
to the average of the ranks they would have otherwise been assigned.
Although the absolute difference is used to obtain the rankings, the sign of Di is still used in the
test statistic. Ri, called the signed rank, is defined as follows:
Ri = the rank assigned to |Di | if Di is positive.
Ri = the negative of the rank assigned to |Di | if Di is negative.
The test statistic T+ is the sum of the positive signed ranks when there are no ties and n' < 50.
Lower quantiles w_p of the exact distribution of T+ are given in Table C-8, under the null hypothesis
that the Di's have mean 0.

    T+ = \sum (R_i where D_i is positive)

Based on the relationship that the sum of the ranks of the absolute differences is equal to
n'(n' + 1)/2, the upper quantiles w_{1-p} are found by the relationship

    w_{1-p} = n'(n' + 1)/2 - w_p

If there are many ties, or if n' > 50, the normal approximation test statistic T is used, which uses
all of the signed ranks, with their + and - signs. Quantiles of the approximate distribution of T
are given in a Normal Distribution Table.
    T = \sum_{i=1}^{n'} R_i / \sqrt{ \sum_{i=1}^{n'} R_i^2 }
Computational Example: Aggressiveness scores were recorded for twelve sets of twins (firstborn
Xi, second born Yi). The scores, the differences Di = Yi - Xi, the ranks of |Di |, and the signed
ranks Ri are listed below.

Twin pair   Firstborn Xi   Second born Yi   Difference Di   Rank of |Di |      Ri
    1            86              88              +2              3             +3
    2            71              77              +6              7             +7
    3            77              76              -1              1.5           -1.5
    4            68              64              -4              4             -4
    5            91              96              +5              5.5           +5.5
    6            72              72               0              na            na
    7            77              65             -12             10            -10
    8            91              90              -1              1.5           -1.5
    9            70              65              -5              5.5           -5.5
   10            71              80              +9              9             +9
   11            88              81              -7              8             -8
   12            87              72             -15             11            -11

    T = \sum_{i=1}^{n'} R_i / \sqrt{ \sum_{i=1}^{n'} R_i^2 } = -17 / \sqrt{505} = -0.7565
For a critical region of size 0.05, the 0.05 quantile from the standard normal Table C-1 is -1.6449.
Since T = -0.7565 is not in this critical region, the null hypothesis Ho is accepted and it is
concluded that the firstborn twin does not exhibit more aggressiveness than does the second
born twin.
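The same paired computation can be scripted directly. The sketch below (Python/SciPy) recomputes the signed ranks and the normal-approximation statistic T = -0.7565 from the twin scores in the table, dropping the zero difference for pair 6 as the text describes.

```python
import numpy as np
from scipy.stats import rankdata

first  = np.array([86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87])
second = np.array([88, 77, 76, 64, 96, 72, 65, 90, 65, 80, 81, 72])

d = second - first                                 # Di = Yi - Xi
d = d[d != 0]                                      # omit zero differences (pair 6)
signed_ranks = np.sign(d) * rankdata(np.abs(d))    # Ri, with average ranks for ties

T = signed_ranks.sum() / np.sqrt((signed_ranks**2).sum())
print(round(T, 4))                                 # -0.7565
# Lower-tailed test at alpha = 0.05: reject Ho only if T < -1.6449 (not the case here).
```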
Traffic Example: In a comparative study of modeling approaches for traffic flow forecasting
(Smith and Demetsky, Journal of Transportation Engineering, V.123, N.4, July/August, 1997),
researchers needed to assess the relative merits of four models they developed. These traffic-
forecasting models were developed and tested using data collected at two sites in Northern
evaluation. The models were estimated using four different techniques from the development
data: historic average, time-series, neural network, and nonparametric regression (nearest
neighbor).
One of the comparative measures used was the absolute error of the models. This is simply how
far the predicted volume deviates from the actual observed volume, using the model evaluation
data. The data were paired by using the absolute error experienced by two models at a given
prediction time. The Wilcoxon Matched-Pairs Signed-Ranks Test for dependent samples was
used to assess the statistical difference in the absolute error between any two models. This test
was chosen over the more traditional analysis of variance approach (ANOVA) because the
distribution of the absolute errors is not normal, an assumption required for ANOVA.
One of the models could not be tested because of insufficient data, leaving three models to be
compared. These models were compared using three tests, representing all combinations of
comparison among three models. Two hypotheses were tested for each pair of models:
Ho: μ1 - μ2 = 0
Ha: μ1 - μ2 ≠ 0 (note: the paper states the alternate hypothesis as μ1 - μ2 > 0, which is
incorrect for a two-sided test, but its evaluation is correct, meaning that it actually
evaluated as if it were a two-sided test.)
A 1% or less level of significance was used as being statistically significant.
Data from two sites were used, so the three tests were performed twice. Although not stated
specifically in the paper, it appears that the sample size was greater than 50. This allowed the
researchers to use the normal approximation test statistic. For each of the two sites, the
nonparametric regression (nearest neighbor) model was the preferred model. Using this
evidence, as well as other qualitative and logical evidence, the researchers were able to draw
the conclusion that nonparametric regression (nearest neighbor) holds considerable promise for
application to traffic flow forecasting.
A technique employed in this evaluation has universal application. The test selected only
compares two samples. Therefore if more samples need to be compared, one can perform a
series of tests using all the possible combinations for the number of samples to be evaluated. It
should be noted that more sophisticated methods for testing multiple samples simultaneously
often exist. These tests usually require more detailed planning prior to collecting
the data. Unfortunately many researchers collect data with only a vague notion of how the data
will ultimately be analyzed. This often limits the statistical methods available to them--usually to
their detriment. The next test in this manual, the Friedman Two-Way Analysis of Variance by Ranks
Test for several dependent variables, is an alternative that may have provided a more
sophisticated evaluation for these researchers, if their data had been drawn properly.
The Friedman test applies to experiments designed as randomized complete block designs.
Recall that the Kruskal-Wallis One-Way Analysis of Variance by Ranks test was applied to a
completely randomized design; this design, which relies solely on randomization, was used to
cancel distortions of the results that
might come from nuisance factors. A nuisance factor is a variable that probably has an effect
on the response variable, but is not of interest to the analyst. It can be unknown and therefore
uncontrolled, or it can be known and not controlled. However, when the nuisance factor is
known and controllable, then a design technique called blocking can be used to systematically
eliminate its effects. Blocking means that all the treatments are carried out on a single
experimental unit. If only two treatments were applied, then the experimental unit would contain
the matched-pair treatments, which was the subject of the previous section on the Wilcoxon
Matched-Pairs Signed-Ranks test. This section discusses a test used when more than two
treatments are applied (or more than two variables are measured).
The situation of several related samples arises in an experiment that is designed to detect
differences in several possible treatments. The observations are arranged in blocks, which are
groups of experimental units similar to each other in some important respects. All the
treatments are administered once within each block. In a typical manufacturing experiment, for
example, each block might be a piece of material b i that needs to be treated with several
competing methods xi. Several identical pieces of material are manufactured, each being a
separate block. This would cause a problem if a completely randomized design were used
because, if the pieces of material vary, that variation will contribute to the overall variability of the
testing. This can be overcome by testing each block with each of the treatments. By doing this, the
blocks of pieces of material form a more homogeneous experimental unit on which to compare
the treatments. This design strategy effectively improves accuracy of the comparisons among
treatments by eliminating the variability of the blocks or pieces of materials. This design is
called a randomized complete block design. The word complete indicates that each block
contains all of the treatments.
The randomized complete block design is one of the most widely used experimental designs.
Units of test equipment or machinery are often different in their operating characteristics and
would be a typical blocking factor. Batches of raw materials, people, and time are common
nuisance sources of variability in transportation experiments that can be systematically
controlled through blocking. For example, suppose you want to test the effectiveness of 4
sizes of lettering on signage. You decide you will measure 10 people's reactions and want a
total sample size of 40. This allows 10 replicates of each size of lettering. If you simply assign
the 40 tests (10 for size 1, 10 for size 2, 10 for size 3, and 10 for size 4) on a completely
random basis to the 10 people, the variability of the people will contribute to the variability
observed in the people's reactions. Therefore, each person can be blocked by testing all four
lettering sizes on each person. This will allow us to compare the lettering sizes without the
high variability of the people confusing the results of the experiment.
The parametric test method for this situation (randomized complete block design) is called the
single-factor within-subjects analysis of variance or the two-way analysis of variance. The
nonparametric equivalent tests depend on the ranks of the observations. An extension of the
Wilcoxon Matched-Pairs Signed-Ranks test for two dependent samples to a situation involving
several samples is called the Quade Test. Dana Quade first developed it in 1972. The Quade
test uses the ranks of the observations within each block and the ranks of the block-to-block
sample ranges to develop a test statistic.
An alternate test to the Quade test was developed much earlier (1937) by noted economist
Milton Friedman. The Friedman test is an extension of the sign test and is better known.
Which test to use depends on the number of treatments. Conover (1999) recommends the use
of the Quade test for three treatments, the Friedman test for six or more treatments, and either
test for four or five treatments. These recommendations are based on the power of the tests for
various levels of treatments. Since the Friedman test is useful for a much larger range of
treatments, it is discussed in this section. Discussion of the Quade test is presented in
Conover (1999, p. 373-380).
ASSUMPTIONS OF FRIEDMAN TWO-WAY ANALYSIS OF VARIANCE BY RANKS TEST FOR
SEVERAL DEPENDENT VARIABLES
1.
The multivariate random variables are mutually independent. In other words, the results
within one block do not affect the results within any of the other blocks.
2.
Within each block the observations may be ranked according to some criterion of interest.
The data consist of b mutually independent observations where each observation contains k
random variables (Xi1, Xi2, ..., Xik). The observations are designated as b blocks, i = 1, 2, ...,
b. The random variable Xij is in block i and is subjected to treatment j (or comes from
population j). The data can be arranged in a matrix with i rows for the blocks and j columns for
the treatments (populations).
Table 12: Friedman Two-Way Analysis of Variance Table for Ranks Test
for Several Dependent Variables
             treatment 1    treatment 2    ...    treatment k
block 1        X1,1           X1,2          ...      X1,k
block 2        X2,1           X2,2          ...      X2,k
...            ...            ...           ...      ...
block b        Xb,1           Xb,2          ...      Xb,k
Conover (1999) recommends a modification to the Friedman test statistic, which allows the F
distribution to be used as its approximation with better results.
First rank the treatments (populations) within each block separately. The smallest value within
a block is assigned rank 1, continuing through the largest value, which is assigned rank k. Use
average ranks in case of ties. Calculate the sum of the ranks Rj for each treatment (population):

    R_j = \sum_{i=1}^{b} R(X_{ij}),    j = 1, 2, ..., k
When there are no ties within the blocks, the Friedman test statistic is

    T1 = 12/(bk(k + 1)) \sum_{j=1}^{k} R_j^2 - 3b(k + 1)

When ties are present, compute

    A = \sum_{i=1}^{b} \sum_{j=1}^{k} [R(X_{ij})]^2        and        C = bk(k + 1)^2 / 4

and use the tie-adjusted form

    T1 = (k - 1) \sum_{j=1}^{k} ( R_j - b(k + 1)/2 )^2 / (A - C)
       = (k - 1) [ \sum_{j=1}^{k} R_j^2 - bC ] / (A - C)
Now the final test statistic T2 is calculated by modifying T1 so it can be approximated by the
F distribution:

    T2 = (b - 1) T1 / ( b(k - 1) - T1 )
Reject the null hypothesis Ho at level α if T2 is greater than its 1 - α quantile. This 1 - α quantile
is approximated by the F distribution Table C-5 with df num = k - 1 and df den = (b - 1)(k - 1).
Otherwise, accept Ho if T2 is less than or equal to the 1 - α quantile, indicating the means of all
the samples are equal in value. The p-value can also be approximated from the F
distribution table. As one might suspect, the approximation gets better as the number of blocks
b gets larger.
Computational Example: (Adapted from Montgomery (1997, p. 177-182)) A machine to test
hardness can be used with four different types of tips. The machine operates by pressing the
tip into a metal specimen and the hardness is determined by the depth of the depression. It is
suspected that the four tip types do not produce identical readings so an experiment is devised
to check this. The researcher decides to obtain four observations (replicates) of each tip type.
A completely randomized single-factor design would consist of randomly assigning each of the
16 tests (called runs) to an experimental unit (a metal specimen). This would require 16
different metal specimens. The problem with this design is that if the 16 metal specimens
differed in hardness, their variability would be added to any variability observed in the hardness
data caused by the tips.
To overcome the potential variability in the metal specimens, blocking can be used to develop a
randomized complete block design experiment. The metal specimens will be the blocks. Each
block will be tested with all four tips. Therefore, only four metal specimens will be needed to
conduct the 16 total tests. Within each block, the four tests need to be conducted in random
order. The observed response to the tests is the Rockwell C scale hardness minus 40 shown
in the following table.
Table 13: Randomized Complete Block Design for Metal Specimens Example
                      treatment          treatment          treatment          treatment
                     (tip type) 1       (tip type) 2       (tip type) 3       (tip type) 4
                     value   rank       value   rank       value   rank       value   rank
block (specimen) 1    9.3     2          9.4     3          9.2     1          9.7     4
block (specimen) 2    9.4     2.5        9.3     1          9.4     2.5        9.6     4
block (specimen) 3    9.6     2          9.8     3          9.5     1         10.0     4
block (specimen) 4   10.0     3          9.9     2          9.7     1         10.2     4
Rj (totals)                   9.5                9                  5.5               16
    A = \sum_{i=1}^{b} \sum_{j=1}^{k} [R(X_{ij})]^2 = 119.5

    C = bk(k + 1)^2 / 4 = (4)(4)(4 + 1)^2 / 4 = 100

Next compute the Friedman test statistic T1 using the formula adjusted for ties:

    T1 = (k - 1) [ \sum_{j=1}^{k} R_j^2 - bC ] / (A - C)
       = (4 - 1) [ (9.5)^2 + (9)^2 + (5.5)^2 + (16)^2 - (4)(100) ] / (119.5 - 100) = 8.8462
    T2 = (b - 1) T1 / ( b(k - 1) - T1 ) = (4 - 1)(8.8462) / ( 4(4 - 1) - 8.8462 ) = 8.4148
For a critical region of 0.05, the 1 - α quantile (0.95) of the F distribution with df num = k - 1 = 3
and df den = (b - 1)(k - 1) = (3)(3) = 9 from Table C-5 is 3.86. Since T2 = 8.4148 lies in this
critical region, i.e., in the region greater than 3.86, the null hypothesis Ho is rejected and it is
concluded that at least one tip type results in hardness values that are not equal to at least one
other tip type. From Table C-5, the p-value is less than 0.01. This means the null hypothesis
Ho could have been rejected at a significance level as small as α = 0.01 (and even smaller, but
most tables don't have values any smaller).
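SciPy's friedmanchisquare can serve as a rough check; note that it returns the chi-square approximation of the Friedman statistic rather than the F-based T2 used above, so the numbers will differ somewhat. A minimal sketch using the hardness readings from Table 13:

```python
from scipy.stats import friedmanchisquare

# Hardness readings (Rockwell C minus 40) from Table 13, one list per tip type,
# ordered by specimen (block) 1 through 4.
tip1 = [9.3, 9.4, 9.6, 10.0]
tip2 = [9.4, 9.3, 9.8, 9.9]
tip3 = [9.2, 9.4, 9.5, 9.7]
tip4 = [9.7, 9.6, 10.0, 10.2]

stat, p = friedmanchisquare(tip1, tip2, tip3, tip4)
print(stat, p)   # chi-square form of the Friedman statistic and its approximate p-value
# The text instead converts T1 to T2 = (b-1)T1 / (b(k-1) - T1) and compares it with the
# F distribution (df = 3 and 9), which it describes as the better approximation.
```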
MULTIPLE COMPARISONS USING FRIEDMAN TWO-WAY ANALYSIS OF VARIANCE BY
RANKS TEST FOR SEVERAL DEPENDENT VARIABLES
When the Friedman test rejects the null hypothesis, it indicates that one or more treatments
(populations) do not have the same means but it does not tell us which treatments. One
method to compare individual treatments is presented by Conover (1999, p. 371). This method
concludes that treatments i and j are different if the following inequality is satisfied:

    | R_i - R_j |  >  t_{1-α/2} \sqrt{ 2( bA - \sum_{j=1}^{k} R_j^2 ) / ( (b - 1)(k - 1) ) }

R_i and R_j are the rank sums of the two treatments (samples) being compared. The 1 - α/2
quantile of the t distribution, t_{1-α/2}, with (b - 1)(k - 1) degrees of freedom is obtained from the t
distribution, Table C-4. A is as already defined for the Friedman test.
If there are no ties, A in the above inequality simplifies to
    A = bk(k + 1)(2k + 1) / 6
Multiple comparisons for the hardness testing machine computational example can be made by
first computing the right side of the inequality. For a critical region of 0.05, from Table C-4, the 1
- /2 quantile (0.975) for the t distribution with (4 - 1)(4 - 1) = 9 degrees of freedom is 2.262.
    t_{1-α/2} \sqrt{ 2( bA - \sum_{j=1}^{k} R_j^2 ) / ( (b - 1)(k - 1) ) }
        = 2.262 \sqrt{ 2[ (4)(119.5) - 457.5 ] / ( (4 - 1)(4 - 1) ) } = 4.8280
It can be concluded that any two tip types whose rank sums differ by more than 4.8280 are unequal.
Therefore, the tip types which yield mean hardness values different from each other are types 1 and
4, types 2 and 4, and types 3 and 4. No other differences are significant. This means that
types 1, 2, and 3 yield statistically indistinguishable results (at α = 0.05) while type 4 yields a
significantly higher reading than any of the other three tip types.
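As with the Kruskal-Wallis comparisons, this inequality is easy to automate. The sketch below (Python/SciPy, using the rank sums and A from the hardness example) reproduces the 4.8280 critical difference and flags the differing pairs.

```python
from itertools import combinations
from math import sqrt
from scipy.stats import t

# Quantities from the Friedman hardness example (Table 13).
rank_sums = {"tip 1": 9.5, "tip 2": 9.0, "tip 3": 5.5, "tip 4": 16.0}
b, k, A, alpha = 4, 4, 119.5, 0.05
sum_Rj_sq = sum(r**2 for r in rank_sums.values())            # 457.5

crit = t.ppf(1 - alpha / 2, (b - 1) * (k - 1)) * sqrt(
    2 * (b * A - sum_Rj_sq) / ((b - 1) * (k - 1)))           # about 4.8280
for (name_i, name_j) in combinations(rank_sums, 2):
    diff = abs(rank_sums[name_i] - rank_sums[name_j])
    print(f"{name_i} vs {name_j}: |Ri - Rj| = {diff:.1f}, "
          f"{'different' if diff > crit else 'not different'} (critical {crit:.4f})")
```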
NONPARAMETRIC ANALYSIS OF BALANCED INCOMPLETE BLOCK DESIGNS
Sometimes it is inconvenient or impractical to administer all the treatments to each block.
Perhaps funds are limited, or perhaps the number of treatments is simply too large to
administer to each block. When blocking is used, but each block does not receive every
treatment, it is called a randomized incomplete block design. Furthermore, when certain
simple treatment scheme conditions are met to aid analysis, the design is called a balanced
incomplete block design. The parametric analysis methods for this type of design are
discussed in detail in Montgomery (1997, p. 208-219). When the normality assumptions are
not met, a test developed by Durbin in 1951 may be used. It is a rank test to test the null
hypothesis that there are no differences among treatments in a balanced incomplete block
design. For details about the Durbin test, see Conover (1999, p. 387-394).
Kendall's Tau (τ) (Measure of Association)
Hypothesis Test Using Kendall's Tau (Test for Independence)
Use these tests for Ordinal, Interval, and Ratio Scale Data (1)
A test for independence allows the researcher to estimate a probability that the observed data occur
when the null hypothesis is true (the null hypothesis is usually that the variables are
independent). For example, there is less than a 5% probability that a specific data sample of
two variables would occur given that the two data variables are independent. In this example,
since there is less than a 5% probability of the specific data occurring by chance alone, the
given condition of independence could be rejected (at the 5% significance level). This means
that the alternative hypothesis would be accepted, i.e., the two variables are not independent
and, thus, a relationship between them exists.
Measure of Association and Test for Independence for Nominal Scale Data
Cramer's Contingency Coefficient - Measure of Association for Nominal Scale Data
The Test for Independence using nominal data discussed in the next section uses an r x c
contingency table to explore whether two variables within a sample are independent or not. But
sometimes, instead of testing a hypothesis regarding independence, the analyst simply wants to
express the degree of dependence shown in a particular contingency table. The most widely
used measure of dependence (also called a Measure of Association) for an r x c contingency
table is Cramer's Contingency Coefficient. Sometimes called Cramer's phi coefficient or simply
Cramer's coefficient, this measure of dependence was first suggested by Harald Cramer (1893-
1985), a Swedish mathematician and statistician, in 1946.
ASSUMPTIONS OF CRAMER'S CONTINGENCY COEFFICIENT
The coefficient is based on the test statistic T developed in the Chi-square Test for
Independence. This test is detailed in the following section and all of its underlying
assumptions apply to Cramers Contingency Coefficient.
INPUTS FOR CRAMER'S CONTINGENCY COEFFICIENT
A random sample of size N is obtained by the researcher. The observations may be classified
according to two variables. The first variable has r categories (rows) and the second variable
has c categories (columns). Let Oij be the number of observations associated with row i and
column j simultaneously (a cell). The cell counts Oij are arranged in the following form, which is
called a r x c contingency table.
The total number of observations from all samples is denoted by N. The number of
observations in the jth column is denoted by Cj, which is the number of observations that are in
column j (meaning category j of the second variable). The number of observations in the ith row
is denoted by Ri, which is the number of observations that are in row i (meaning category i of the
first variable).
              Column 1   Column 2   ...   Column j   ...   Column c   Row Totals
Row 1          O11        O12       ...    O1j       ...    O1c          R1
Row 2          O21        O22       ...    O2j       ...    O2c          R2
...            ...        ...       ...    ...       ...    ...          ...
Row i          Oi1        Oi2       ...    Oij       ...    Oic          Ri
...            ...        ...       ...    ...       ...    ...          ...
Row r          Or1        Or2       ...    Orj       ...    Orc          Rr
Column Totals  C1         C2        ...    Cj        ...    Cc           N
    T = estimate of \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (O_{ij} - E_{ij})^2 / E_{ij}

It can be seen that large values of T arise as the differences in cell counts (Oij - Eij) become more
pronounced.
By examining extremely uneven contingency tables, a general rule was developed for the
maximum value of T being N(k - 1), where k is the smaller of the number of categories for the
two variables being considered and N is the total number of observations. Therefore, when T is
divided by the approximation of the maximum value of T, the result is a useful coefficient C
having a common scale. Current convention uses the square root form:

    Cramer's coefficient = C = \sqrt{ T / ( N(k - 1) ) }
    E_{ij} = P(cell ij) \cdot N = (R_i / N)(C_j / N) \cdot N = R_i C_j / N

where E_{ij} represents the expected number of observations in cell (i, j) when Ho is true. The test
statistic T is given by

    T = estimate of \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (O_{ij} - E_{ij})^2 / E_{ij}
      = \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij}^2 / E_{ij} - N
Reject the null hypothesis Ho at level α if the test statistic T exceeds x_{1-α}, the 1 - α quantile of
the chi-square distribution (Table C-2) with (r - 1)(c - 1) degrees of freedom (meaning that an
observation's categorization on the first variable is associated (i.e., not independent) with the
categorization on the second variable). Otherwise the analyst accepts Ho (meaning the two
variables are independent).
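A contingency-table test and Cramer's coefficient can both be obtained in a few lines. The sketch below (Python/SciPy) uses a small made-up 2 x 3 table rather than data from the text, purely to illustrate the computation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 contingency table of observed counts Oij (not from the text).
observed = np.array([[20, 30, 25],
                     [35, 15, 20]])

T, p, dof, expected = chi2_contingency(observed, correction=False)
print(T, p, dof)          # chi-square statistic, p-value, (r-1)(c-1) degrees of freedom
print(expected)           # Eij = Ri * Cj / N

# Cramer's coefficient: C = sqrt(T / (N(k - 1))), where k is the smaller of r and c.
N = observed.sum()
k = min(observed.shape)
print(np.sqrt(T / (N * (k - 1))))
```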
Transit Example: In the design of off-peak transit routes (Ross and Wilson, Transportation
Engineering Journal, V.103, N.TE5, September 1977), researchers in Cedar Rapids, Iowa,
developed a routing technique to identify the trade area of major trip attractors and design the
transit routing to serve areas of high potential within the trade area. To demonstrate the
effectiveness of the technique, the researchers collected data for a non-CBD (central business
district) regional mall and its trade area. The data captured demographic characteristics of the
mall's patrons and characteristics of their trips to the mall during off-peak hours. Screening
methods were used to eliminate from evaluation traffic zones of very low potential transit
ridership to the mall, essentially all areas outside a 10-minute travel time to the mall.
Additionally, low population traffic zones within the 10-minute limit were also dropped, typically
rural zones. The remaining traffic zones were aggregated into four districts conducive to
efficient transit routing to the mall.
Several questions were posed in the form of two hypotheses and tested using the Chi-square
Test for Independence for nominal data:
Ho: The two variables are independent of each other.
Ha: The two variables are not independent of each other.
A 1% or less level of significance was used as being statistically significant and the p-values
were reported when the level of significance was greater than 1%.
One example of a question explored was the preference of district residents for shopping at the
mall when compared to the CBD. The responses to this choice question were that they shopped
(1) more, (2) the same, or (3) less in the CBD than in the mall. Since there were four districts
being evaluated, the researchers set up a 3 by 4 contingency table. The null hypothesis is that
the rows (level of shopping at the mall) are independent of the columns (the district in which the
shopper lived). This was rejected at a significance level of 0.01. This is interpreted as
follows: given that the two variables are independent, there is less than a 1% chance that the large
differences between the expected and observed values occurred by chance alone. Therefore the
alternate hypothesis is accepted: the level of shopping is different among the districts. Several
other questions were explored in a similar manner: (1) are the trip frequencies the same by
district (no), (2) are the planned versus unplanned trips to the mall the same by districts (yes with
a p-value = 0.58), (3) are the previous locations of the mall patrons just before coming to the mall
the same by district (yes with a p-value of 0.70), (4) are the trip making rates per occupied
dwelling unit the same by district (yes with a p-value > 0.70), and then (5) for each zone within a
district, are the trip making rates per occupied dwelling unit the same (results varied by zone).
From these tests, the researchers were able to draw these general conclusions:
1. After dropping the lowest use district, all three remaining districts appear to be producing
trips to the mall as a function of the number of occupied dwelling units in the districts.
2. In two of the districts at least one zone produces trips at a rate different than would be
expected if all trip rates were equal.
In nonparametric authority W. J. Conover's words, nonparametric methods use approximate
solutions to exact problems, while parametric methods use exact solutions to approximate
problems. These approximations tend to become better as the sample size gets larger because
the approximate distribution usually approaches the exact one asymptotically or, as some say,
the law of large numbers comes into play. So when sample sizes are sufficiently large the
approximations are usually quite good.
For both the Chi-square Test for Independence and the Chi-square GOF test, the exact
distribution of the test statistic is very difficult to find and classically is almost never used. The
asymptotic chi-square approximation for the test statistic is satisfactory if the expected values
in the test statistic are not too small. However the exact distribution can be found, and has
been used for some time for small contingency tables, e.g., a 2 by 2 contingency table, where
the computations required were manageable. The computations for larger contingency tables
are much more demanding. With the advent of modern computers, however, such
computations are possible and may provide a more useful method when testing small size
samples. These computations are still a nontrivial task and rely on special algorithms. Cyrus
R. Mehta and Nitin R. Patel are two researchers who have developed such special algorithms
and have published their work in this field. They report that software support for their methods
is available in many standard packages including StatXact-3, LogXact-2, SPSS Exact Tests,
and SAS Version 6.11. More statistical software publishers will probably add this capability as
it becomes more recognized. One should consult these software providers for more
information.
Measure of Association and Test for Independence for Ordinal, Interval, and
Ratio Scale Data
Kendall's Tau (τ) - Measure of Association for Ordinal, Interval, and Ratio Scale Data
Kendall's Tau is one of a class of measures called rank correlation measures. Compared to
measures of association for nominal scale data, these measures use the additional information
contained in ordinal, interval, and ratio scale data in their computation. In general, this
additional information provides tests with greater power without having to make the distributional
assumptions required by parametric tests (e.g., the widely
used Pearson's product moment correlation coefficient). Rank correlation methods use the
ranks (order) attributed to the data values rather than the values themselves, resulting in
nonparametric tests.
Let Nc be the number of concordant pairs of observations out of the nC2 possible pairs. A pair of
observations, such as (10.4, 0.6) and (14.2, 0.3), is called discordant if the two numbers in
one observation differ in opposite directions (one negative and one positive) from the respective
members in the other observation. Let Nd be the total number of discordant pairs of
observations. Pairs with ties between respective members are counted differently, as described
later. Recall that nCr = n! / (r! (n - r)!) is the number of ways in which r objects can be selected
from a set of n distinct objects. Therefore, the n observations may be paired nC2 = n! / (2! (n - 2)!)
= n(n - 1)/2 different ways. Thus, the number of concordant pairs Nc plus the number of
discordant pairs Nd plus the number of pairs with ties will add up to n(n - 1)/2. The measure
of correlation (association) when there are no ties is

    τ = (N_c - N_d) / ( n(n - 1)/2 )
If all pairs are concordant, then Kendall's Tau equals 1.0; when all pairs are discordant, it
equals -1.0.
Ties: In equation form, a pair of observations (X1, Y1) and (X2, Y2) is concordant if (Y2 - Y1) / (X2
- X1) > 0 and discordant if (Y2 - Y1) / (X2 - X1) < 0. If X1 = X2 the denominator is zero so no
comparison can be made. But when Y1 = Y2 (and X1 ≠ X2), then (Y2 - Y1) / (X2 - X1) = 0. In this
case the pair should be counted as one-half (1/2) concordant and one-half (1/2) discordant.
While this makes no difference in the numerator of Kendall's Tau because the one-half terms
cancel when computing Nc - Nd, it does change the way Tau should be computed. In the case
of ties, the measure of association (correlation) is

    τ = (N_c - N_d) / (N_c + N_d),    where

    if (Y_j - Y_i) / (X_j - X_i) > 0, add 1 to N_c (concordant)
    if (Y_j - Y_i) / (X_j - X_i) < 0, add 1 to N_d (discordant)
    if (Y_j - Y_i) / (X_j - X_i) = 0, add 1/2 to N_c and 1/2 to N_d
This version of Kendall's Tau has the advantage of achieving +1 or -1 even if ties are present.
This version is sometimes called the gamma coefficient.
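For reference, SciPy's kendalltau computes a closely related coefficient (tau-b, which corrects for ties differently than the gamma form described above). A small sketch with made-up ordinal data:

```python
from scipy.stats import kendalltau

# Hypothetical ordinal rankings from two judges (not data from the text).
judge_1 = [1, 2, 3, 4, 5, 6, 7, 8]
judge_2 = [2, 1, 4, 3, 6, 5, 8, 7]

tau, p = kendalltau(judge_1, judge_2)
print(tau, p)   # tau near +1 indicates mostly concordant pairs; p tests independence
```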
Computational Example: Twelve paired observations (Xi, Yi) are listed below with their ranks.
Each observation is compared with every observation below it in the list; pairs tied on Xi are not
compared, and pairs tied only on Yi contribute one-half to each tally. The resulting totals are
Nc = 44.5 concordant pairs and Nd = 17.5 discordant pairs.

    (Xi, Yi)          Ranks (rank of Xi, rank of Yi)
    (530, 3.5)        ( 1,    5 )
    (540, 3.3)        ( 2,    3 )
    (545, 3.7)        ( 3,    8 )
    (560, 3.5)        ( 5,    5 )
    (560, 3.6)        ( 5,    7 )
    (560, 3.2)        ( 5,    1.5 )
    (570, 3.2)        ( 7,    1.5 )
    (580, 3.8)        ( 8,    9 )
    (610, 3.5)        ( 9.5,  5 )
    (610, 4.0)        ( 9.5, 11.5 )
    (640, 3.9)        (11,   10 )
    (710, 4.0)        (12,   11.5 )
Safety Example: In an evaluation of vehicular crashes with utility poles (Zegeer and Parker,
Transportation Research Record No. 970, 1984), researchers used data from four states to
investigate the effects of various traffic and roadway variables on the frequency and severity of
the crashes. Several methods were used to assess these effects including correlation analysis,
analysis of variance and covariance, and contingency table analysis.
Correlation analysis was conducted to determine if a relationship existed between the
independent and the dependent variables for purposes of determining the best variables to use in
a predictive model. Similarly, correlation analysis between independent variables was conducted
in order to avoid problems of collinearity in predictive models, which occur when two or more
independent variables in a model are highly correlated. For variables having interval and ratio
scale data, the Pearson correlation coefficient was used. For measuring the association
between the discrete, ordinal independent variables, Kendall Tau correlation was used. For
example, a Tau value of 0.727 was reported for the correlation between area type and speed limit.
There were three area types: urban, urban fringe, and rural. A value of 1.000 would indicate
perfect correlation between these two variables while a value of 0.000 would indicate no
correlation. The researchers made a qualitative decision that this Tau value was sufficiently
close to 1.0 to warrant eliminating one of the variables--area type--as an independent variable in
their predictive models. After deciding on the best variables to include, the researchers used
linear and nonlinear regression analysis to develop predictive models.
It is important to note that when reporting results, lack of specificity in naming and/or describing
the actual statistical tests used can cause confusion. In this example, the researchers reported
they used the Pearson correlation coefficient, except for ordinal data, for which they used
Kendall Tau correlation analysis. In neither case is a statistical reference given, which would
have allowed a reader so inclined to determine the exact statistical tests that were used. Both
Pearson and Kendall formulated multiple tests for correlation and they are reported in literature
and textbooks under various names. In this case, Kendall's Tau is a fairly distinctive name, but
the Pearson correlation coefficient used must simply be guessed. It is probably Pearson's
Product Moment Correlation Coefficient, which is arguably the most widely used correlation
coefficient for interval and ratio scale variables when developing linear regression models. An
alternate method for providing specificity would be to include sample calculations or other
descriptors of the results such that the specific test could be deduced by the reader. Again, this
is lacking in the example reported here.
Other Nonparametric Measures of Association for Ordinal, Interval, and Ratio Scale
Data
Spearman's Rho (ρ) is another measure of association that is historically more commonly
discussed in statistical textbooks. Its computation is a natural extension of the most popular
parametric measure of association, Pearson's product-moment correlation coefficient (r),
mentioned earlier. Spearman's Rho is simply Pearson's product-moment correlation coefficient
computed using the ranks of the two variables instead of their values. An advantage of
Spearman's Rho over Kendall's Tau is that it is easier to compute, but this becomes moot
when using many of today's computer software programs, which will compute both.
The primary advantage of Kendall's Tau is that the sampling distribution of Tau approaches
normality very quickly. Therefore, when the null hypothesis of independence is true, the normal
distribution provides a good approximation of the exact sampling distribution of Tau for small
sample sizes--better than that for Spearman's Rho, which requires a larger sample size for this
approximation. Another commonly cited advantage is that Kendall's Tau is an unbiased
estimate of the population parameter tau, whereas Spearman's Rho is not an unbiased estimate
of the population parameter rho.
While Kendall's Tau arguably provides a better measure of association than does Spearman's
Rho, this does not preclude the use of Spearman's Rho. On the contrary, Spearman's Rho
provides a useful nonparametric measure and should not be avoided, especially if, for example,
it is the only one available in the researcher's statistical software. When hypothesis testing is
used for Tests of Independence, both Kendall's Tau and Spearman's Rho will produce nearly
identical results.
Hypothesis Test Using Kendall's Tau (τ) - Test for Independence for Ordinal, Interval,
and Ratio Scale Data
Ha: The correlation between the two variables equals some value less than zero, i.e., pairs
of observations tend to be discordant.
This is the one-sided test to use when the variables are suspected of being negatively correlated.
INTERPRETATION OF OUTPUT (DECISION RULE) USING KENDALL'S TAU (τ)
For the two-sided test, reject the null hypothesis Ho at the α level of significance (meaning the
two variables are correlated) if the test statistic Tau is less than the α/2 quantile or greater than
the 1 - α/2 quantile of the null distribution tabulated in Table C-9; otherwise accept Ho (meaning
the two variables are independent).
For the upper-sided test, reject the null hypothesis Ho at the α level of significance (meaning the
two variables are positively correlated) if the test statistic Tau is greater than the 1 - α quantile
of the null distribution tabulated in Table C-9; otherwise accept Ho (meaning the two variables
are independent). Similarly, for the lower-sided test, reject the null hypothesis Ho at the α level of
significance (meaning the two variables are negatively correlated) if the test statistic Tau is
less than the α quantile of the null distribution tabulated in Table C-9; otherwise accept Ho
(meaning the two variables are independent).
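Analysts using statistical software rather than Table C-9 typically reach the same decision through a p-value. A minimal Python sketch with hypothetical ordinal data follows; it also illustrates the earlier point that Kendall's Tau and Spearman's Rho lead to nearly identical conclusions in a test of independence.

# Sketch only: two-sided test of independence with Kendall's Tau, hypothetical data.
from scipy.stats import kendalltau, spearmanr

x = [2, 1, 3, 4, 2, 5, 3, 1, 4, 5, 2, 3]   # ordinal variable 1 (hypothetical)
y = [1, 2, 3, 3, 2, 5, 4, 1, 5, 4, 3, 3]   # ordinal variable 2 (hypothetical)

alpha = 0.05
tau, p_tau = kendalltau(x, y)    # two-sided p-value for Tau
rho, p_rho = spearmanr(x, y)     # Spearman's Rho typically yields a nearly identical decision

print(f"tau = {tau:.3f} (p = {p_tau:.4f});  rho = {rho:.3f} (p = {p_rho:.4f})")
if p_tau < alpha:
    print("Reject Ho: the two variables appear to be correlated.")
else:
    print("Accept Ho: no evidence against independence.")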
Goodness-of-Fit Methodology:
1.
Testing one independent sample (or multiple samples, tested one at a time) against a
known distribution that is postulated by the researcher. Usually the researcher has some
clues as to what the postulated distribution might be. This may come from the
researchers prior experience or from that of another, perhaps gained through a literature
search. But regardless of how selected, the knowledge about the postulated distribution
will have to be sufficient to provide the detailed comparison statistics needed by the GOF
test.
2.
Testing two samples to see if they are drawn from the same distribution, which is unknown.
Sometimes the researcher does not want to find out which distribution a sample comes
from but wants to know if two samples both come from the same distribution. This is useful
knowledge in a number of situations. For example, a researcher may have two instruments
for sampling data; one instrument is believed to sample the data more accurately but costs
considerably more in time and money than the other method. The researcher can draw
samples with each instrument and test to see if they are drawn from the same underlying
distribution. If so, then the researcher is beginning to build a case for using the cheaper
instrument as being adequate to sample the data, even though it is less accurate.
3.
Testing multiple samples to see if they are all drawn from the same distribution, which is
unknown and remains so. This is an extension of the two-sample case, but here the
researcher wants to simultaneously test if several samples all come from the same
underlying distribution.
The chi-square test is relatively simple to apply and has long served the
goodness-of-fit needs of many researchers. However, one can argue that in today's
environment of easy-to-use statistical software, the more complex tests for higher scale data
should be used where applicable, because they are generally more powerful than the chi-square
test.
The basic methodology for all chi-square tests is the same, regardless of whether one sample or
multiple samples are involved. A test uses formal hypothesis statements that are described later. But
simply stated, it is a test of how well observed data fit a theoretical distribution. The data are
classified into groups and the number of observations in each group is denoted by O (observed).
The expected number in each group, E, is calculated from the theoretical distribution. The value
of χ² is then worked out using:

\chi^2 = \sum_{\text{all groups}} \frac{(O_i - E_i)^2}{E_i}
A small value of χ² means the data fit the theoretical distribution well; a large value means they
fit the distribution poorly. This interpretation is straightforward if one looks at the equation for
χ². If the sample was in fact taken from the theoretical distribution, the number of observations O
in each group would be quite close to the expected number E in each group. Therefore the
differences (O - E) will be quite small, and the summation shown in the equation
would also be quite small. On the other hand, if the sample came from a distribution different
from the theoretical distribution tested, the number of observations expected from the
theoretical distribution would be quite different from the number of observations actually counted. This larger
difference would lead to a large value of χ².
The data are grouped into c classes (or categories), and the numbers of observations in each class are
arranged in the following manner: Oj is the number of observations in category j, for j = 1, 2, . . . , c.
Table 16: Chi-Square Goodness of Fit Table for Single Independent Variable

Cell/Category/Class    Class 1   Class 2   ...   Class j   ...   Class c   Total number of observations
Observed frequency       O1        O2      ...     Oj      ...     Oc      N
The expected frequencies Ej, for j = 1, 2, . . . , c, are computed from the hypothesized distribution,
where Ej represents the expected number of observations in class j when Ho is true. The test
statistic T is given by:
T = \text{estimate of } \chi^2 = \sum_{j=1}^{c} \frac{(O_j - E_j)^2}{E_j} = \sum_{j=1}^{c} \frac{O_j^2}{E_j} - N
Reject the null hypothesis Ho at the α level of significance if T exceeds x1-α, the 1 - α quantile of the
chi-square distribution with the appropriate degrees of freedom (meaning the two distributions are not alike);
otherwise the analyst accepts Ho (meaning the two distributions are alike).
The degrees of freedom used are (df = c - 1 - w), where c is the number of categories (or cells) and
w is the number of parameters that must be estimated. The number of parameters that must be
estimated, w, depends on which theoretical distribution is being compared to the sample. As an
example, suppose one is using this test to determine whether the sample data are compatible
with the normal distribution. Said another way, are the sample data drawn from a parent
population having a normal distribution? In order to do this, one must estimate the values of the
parent population mean (μ) and standard deviation (σ) by computing the values of the sample
mean (x̄) and standard deviation (s). Since in this case two population parameters
were estimated from the data, the degrees of freedom would be df = c - 1 - 2 (which requires at
least four categories c (cells), since df must be equal to or greater than one).
The data are arranged in the following form, which is called an r x c contingency table.
                 Class 1   Class 2   ...   Class j   ...   Class c   Totals
Population 1      O11       O12      ...    O1j      ...    O1c       n1
Population 2      O21       O22      ...    O2j      ...    O2c       n2
...
Population i      Oi1       Oi2      ...    Oij      ...    Oic       ni
...
Population r      Or1       Or2      ...    Orj      ...    Orc       nr
Totals            C1        C2       ...    Cj       ...    Cc        N

The row totals ni, the grand total N, and the column totals Cj are:

ni = Oi1 + Oi2 + . . . + Oij + . . . + Oic     for all i
N = n1 + n2 + . . . + ni + . . . + nr
Cj = O1j + O2j + . . . + Oij + . . . + Orj     for all j
Let pij denote the probability that a randomly selected observation from the i-th population falls into the j-th class. The hypotheses are:
Ho: All of the probabilities in the same column are equal to each other (i.e., p1j = p2j = . . . =
prj , for all j).
Ha: At least two of the probabilities in the same column are not equal to each other (i.e., pij ≠
pkj for some j, and for some pair i and k).
It is not necessary to stipulate the various probabilities. The null hypothesis merely states that
the probability of being in class j is the same for all populations, no matter what the probabilities
might be (and no matter which category is being considered). This test is sometimes called
the chi-square test for homogeneity because it evaluates whether or not the r samples are
homogeneous with respect to the proportions of observations in each of the c categories.
E_{ij} = \frac{n_i C_j}{N}
Where Ei j represents the expected number of observations in cell (i, j) when Ho is true. The test
statistic T is given by
T = \text{estimate of } \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{O_{ij}^2}{E_{ij}} - N
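As a sketch of how the r x c computation plays out in software, the following Python fragment applies SciPy's chi2_contingency to a small hypothetical 2 x 3 table of counts; the function computes the expected frequencies Eij = ni Cj / N and the statistic T shown above.

# Sketch only: chi-square test of homogeneity for an r x c table of hypothetical counts.
import numpy as np
from scipy.stats import chi2_contingency

# Rows are populations (e.g., two neighborhood districts); columns are classes.
observed = np.array([[30, 45, 25],
                     [22, 38, 40]])

T, p_value, df, expected = chi2_contingency(observed, correction=False)

print(f"T = {T:.2f}, df = {df}, p-value = {p_value:.4f}")
print("Expected counts (ni*Cj/N):")
print(np.round(expected, 1))

alpha = 0.05
if p_value < alpha:
    print("Reject Ho: the populations differ in their class proportions.")
else:
    print("Accept Ho: the populations appear homogeneous.")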
T = \sum_{i=1}^{1{,}536} \frac{(AO_i - AC_i)^2}{AC_i}
where AO is the number of observed accidents and AC is the number of computed accidents at
each of the 1,536 crossings.
Using this method, chi-square test statistics were computed for each of the four models
separately. The purpose of this was to compare how well each model fit the actual data. Or put
another way, how well the distribution of accidents obtained from each model fit the distribution
of the actual observed
accidents. The four test statistics were calculated to be 2176, 3810, 961, and 833. The authors
concluded that the model producing the lowest test statistic was the best fit of the four models.
It should be noted that these authors used the chi-square GOF test to provide information
regarding goodness-of-fit of each model compared to each of the other models. For this purpose,
the chi-square test statistics are used as a measure of goodness of fit. No determination was
made as to the probability that one or more of the four estimated distributions were statistically
likely to be the same as the distribution of the observed data. To do this, one would obtain the (1 ) quantile of a chi-square random variable having (r - 1)(c - 1) degrees of freedom. Since only
two distributions are compared at once, r = 2 while c = 1536, resulting in a df = (2 - 1)(1536 - 1) =
1535. Most statistical table for the chi-square distribution do not list values for df > 100.
However, Conover (1999, p. 510) provides a method for estimating these values by

w_p \approx df \left( 1 - \frac{2}{9\,df} + x_p \sqrt{\frac{2}{9\,df}} \right)^{3}

where x_p is the p quantile of a standard normal random variable.
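The approximation can be checked against the quantile function available in modern statistical software. A minimal Python sketch for the df = 1535 case follows; the 0.95 quantile is used purely for illustration.

# Sketch only: Conover's large-df approximation to a chi-square quantile,
# compared with SciPy's quantile function.
import math
from scipy.stats import norm, chi2

df = 1535          # degrees of freedom from the rail-highway crossing example
p = 0.95           # desired quantile (alpha = 0.05)

x_p = norm.ppf(p)  # p quantile of the standard normal
w_p = df * (1 - 2 / (9 * df) + x_p * math.sqrt(2 / (9 * df))) ** 3

print(f"Approximation:   w_p = {w_p:.1f}")
print(f"SciPy chi2.ppf:  {chi2.ppf(p, df):.1f}")   # the two values should agree closely (about 1627)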
How can I transform ordinal, interval and ratio scale data to nominal scale data so I
can use the chi-square test?
At times, a researcher may want to use the chi-square test for data that have a higher scale
than simply nominal. This means that the researcher is willing to give up the extra power that
usually comes with nonparametric tests using ranking methods, such as the Kolmogorov-Smirnov
type tests for ordinal scale or higher data. To transform ordinal (or interval/ratio) data, one
simply has to devise a meaningful scheme to group the data into categories. There are two
practical considerations when doing this. First, the data must be grouped in such a manner that no
category has fewer than five expected occurrences. The second consideration
applies when comparing to a known distribution (say the normal distribution). In this case, the
categories must be chosen such that the frequencies of the known distribution can be
estimated and there are enough categories to ensure df ≥ 1. As an example, suppose the
following 40 interval scale data points (shown from smallest to largest) are measurements that
are suspected of following a normal distribution.
1.87  2.32  2.40  2.93  3.01  3.11  3.17  3.34  3.45  3.48
3.65  3.86  3.96  4.05  4.18  4.29  4.29  4.38  4.45  4.55
4.56  4.69  4.71  4.73  4.84  4.86  4.92  5.01  5.01  5.11
5.29  5.29  5.60  5.67  5.72  5.78  5.93  6.20  6.69  7.71
If a chi-square test is used, the data are classified into 4 groups. This will provide (c - 1 - w)
degrees of freedom, or 1 degree of freedom (4 - 1 - 2), since there are four categories and two
parameters will be estimated for the hypothesized normal distribution.
The data are first ordered. Then the mean is estimated (4.48), and the standard deviation (1.22)
of the sample is estimated, to approximate the mean and standard deviation of the
hypothesized parent population (now hypothesized to be N (4.48, 1.22)). If groups are created
that represent the four quartiles of the N (4.48, 1.22) distribution, there will be four groups. Each group is
expected to contain one-fourth of any sample drawn from it. In this case, the analyst has 40
data points, and so would expect 10 points to be in each quartile. Each quartile's range of
values can be calculated using the standard normal distribution. For example, the first quartile
will contain values from -∞ to 3.57. This is calculated using the 0.25 quantile of the standard
normal (-0.675) from the standard normal distribution, Table C-1, and transforming it to N
(4.48, 1.22). As the reader will recall, use of the standard normal distribution table requires the
transformation

z = \frac{x - \mu}{\sigma}

where μ is the mean, σ the standard deviation, x the data value, and z the transformed value.
One looks up z (-0.675) and solves for x, so x = (1.22)(-0.675) + 4.48 = 3.57. Therefore,
the values for the first quartile will be -∞ < x ≤ 3.57. In similar fashion, the other quartile cut
points can be calculated as 4.48, 5.30, and +∞. Separating the observations using these
boundaries results in four groups, each containing the measurements that would be in the
appropriate quartile assuming they are distributed N (4.48, 1.22).
Quartile        Observations                                                                    No. of observations   Range
1st quartile    1.87, 2.32, 2.40, 2.93, 3.01, 3.11, 3.17, 3.34, 3.45, 3.48                              10            -∞ < x ≤ 3.57
2nd quartile    3.65, 3.86, 3.96, 4.05, 4.18, 4.29, 4.29, 4.38, 4.45                                     9            3.57 < x ≤ 4.48
3rd quartile    4.55, 4.56, 4.69, 4.71, 4.73, 4.84, 4.86, 4.92, 5.01, 5.01, 5.11, 5.29, 5.29            13            4.48 < x ≤ 5.30
4th quartile    5.60, 5.67, 5.72, 5.78, 5.93, 6.20, 6.69, 7.71                                           8            5.30 < x < +∞
The analyst can now use the chi-square test for a single independent sample, which is
described in the next section. For the 4 categories, the observed frequencies will be 10, 9, 13,
and 8 while the expected frequencies assuming a theorized N (4.48, 1.22) distribution will be
10, 10, 10, and 10. More categories could have been used. Assuming that categories of equal
probability size were used, then the largest number of categories is eight, which still maintains
a minimum of five expected occurrences in each category.
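The chi-square computation for this example can be sketched in a few lines of Python; the ddof argument accounts for the two estimated parameters (mean and standard deviation), giving df = 4 - 1 - 2 = 1.

# Sketch only: chi-square GOF test for the quartile-grouped data above.
from scipy.stats import chisquare

observed = [10, 9, 13, 8]      # counts falling in the four quartile groups
expected = [10, 10, 10, 10]    # each quartile of N(4.48, 1.22) should hold 1/4 of the 40 points

# ddof=2 because two parameters (mean, std dev) were estimated, so df = 4 - 1 - 2 = 1
T, p_value = chisquare(observed, f_exp=expected, ddof=2)
print(f"T = {T:.2f}, p-value = {p_value:.4f}")   # T = (0 + 1 + 9 + 4)/10 = 1.40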
Goodness-of-Fit Tests for Ordinal, Interval, and Ratio Scale Data - Kolmogorov-Smirnov Type Tests
Kolmogorov and Smirnov developed procedures which compare the empirical distribution
functions (EDF) of two samples to see if they are similar--that is, drawn from the same known
or unknown parent population. The empirical distribution function (EDF) is a cumulative
probability distribution function constructed from observed or experimental data (see definition of
EDF for more detailed information). Two cumulative functions can be compared graphically,
which has an innate appeal, but Kolmogorov and Smirnov developed more rigorous statistical
procedures that are discussed in this section. All these procedures use the maximum vertical
distance between these functions as a measure of how well the functions resemble (fit) each other--or,
said another way, their goodness of fit.
Kolmogorov developed statistics that are functions of the maximum vertical distance between
an EDF S (x) of an unknown distribution and the cumulative distribution function CDF F (x) of a
known distribution. These are one-sample tests and are said to be of the Kolmogorov-type.
Smirnov worked with these same maximum distances but between two empirical distribution
functions. These types of statistics are called the Smirnov-type. All Kolmogorov-Smirnov type
tests are for continuous distributions whereas the chi-square is valid for both continuous and
discrete distributions. However, one may still use the Kolmogorov-Smirnov type tests for
discrete distributions realizing the results yield a conservative approximation for the critical
levels. They are often preferred over the chi-square GOF test if the sample size is small. The
chi-square test assumes the number of observations is large enough so that the chi-square distribution
provides a good approximation of the distribution of the test statistic, whereas the Kolmogorov
test is exact even for small samples. Also the Kolmogorov-Smirnov type tests are more
efficient with data and usually more powerful.
Kolmogorov introduced his GOF test for a single sample in 1933. It provides an alternative to
the chi-square GOF test when dealing with data of ordinal scale or higher.
Smirnov introduced his GOF test for two samples in 1939. It is similar to the Kolmogorov single
sample test except it compares two unknown empirical distribution functions rather than an
unknown EDF to a known CDF. It is important to note that much of the literature refers to these
tests by combining the names of the two originators and distinguishing them by the number of
samples; in this fashion, the tests are called the Kolmogorov-Smirnov GOF test for a single
sample and the Kolmogorov-Smirnov GOF test for two samples.
A. Two-sided test: The test statistic T is the maximum vertical distance between F*(x) and S(x):

T = \sup_x \left| F^*(x) - S(x) \right|

B. One-sided test: The test statistic T+ is the maximum vertical distance by F*(x) above S(x):

T^+ = \sup_x \left[ F^*(x) - S(x) \right]

C. One-sided test: The test statistic T- is the maximum vertical distance by S(x) above F*(x):

T^- = \sup_x \left[ S(x) - F^*(x) \right]
(Figure: the hypothesized CDF F*(x) and the sample EDF S(x) plotted on the same 0.0-1.0 probability scale, with T, T+, and T- indicated as the maximum vertical distances between the two functions.)
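As a sketch, the two-sided Kolmogorov statistic T (hypothesis set A) can be obtained from SciPy's kstest, which compares the sample EDF S(x) with a fully specified CDF F*(x). The hypothetical call below tests the 40 measurements from the earlier example against N(4.48, 1.22); strictly speaking, because those parameters were estimated from the same data, the tabulated Kolmogorov quantiles are only approximate here (the Lilliefors test discussed later addresses exactly that situation).

# Sketch only: Kolmogorov one-sample GOF test against a fully specified normal CDF.
from scipy.stats import kstest

data = [1.87, 2.32, 2.40, 2.93, 3.01, 3.11, 3.17, 3.34, 3.45, 3.48,
        3.65, 3.86, 3.96, 4.05, 4.18, 4.29, 4.29, 4.38, 4.45, 4.55,
        4.56, 4.69, 4.71, 4.73, 4.84, 4.86, 4.92, 5.01, 5.01, 5.11,
        5.29, 5.29, 5.60, 5.67, 5.72, 5.78, 5.93, 6.20, 6.69, 7.71]

# Two-sided statistic T = sup_x |F*(x) - S(x)| with F* = N(4.48, 1.22)
T, p_value = kstest(data, 'norm', args=(4.48, 1.22))
print(f"T = {T:.3f}, p-value = {p_value:.4f}")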
It is recommended that the hypothesis test be done at a stated significance level or that a p-value
be reported to allow the reader to adequately evaluate the reported results.
While the six GOF tests do vary in selecting the best fitting distribution, it is interesting to note
that the three statistical GOF tests all yielded the same result. These three methods--chi-square,
Kolmogorov-Smirnov, and Cramer-von Mises-Smirnov--all picked the same distribution
of those offered as the best fitting for the two samples. They did, however, pick different
distributions for each of the two samples. The three other GOF tests employed by the
researchers--absolute deviation, weighted absolute deviation, and log-likelihood value--gave
varying results.
The two samples used by the researchers to support their arguments were actually the same
data. The first sample had 122 data points and the second sample was 120 of these points, the
two removed points being the lowest values, deemed to be outliers by the authors. The best
fitting hypothetical distributions for the two samples differed. This illustrates the importance of
outliers. It is recommended that outliers never be removed from data without good
justification. Furthermore, that justification should be stated in the findings--as the authors did in
this example paper. Obviously, the removal of these two outliers had a significant impact on
the conclusions drawn in this paper and interested readers would want to know the reasons for
removing the two outliers so they can form their own conclusions.
It should be noted once again the importance of specificity when discussing statistical tests used
to reach conclusions--either by citing a reference for the statistical test(s) used or by reporting
sufficient information such that the reader can accurately deduce the specific test
statistic/method used. In this example the researchers list the results of the Kolmogorov-Smirnov GOF test statistic for each of the six distributions tested. The values range from about
5.5 to 9.5. The Kolmogorov GOF test described in this manual uses the maximum vertical
distance between the empirical distribution function (EDF) of the data and the cumulative
distribution function (CDF) of the hypothesized distribution as the test statistic. This means that
the test statistic ranges from 0.0 to 1.0. Obviously the researchers were using some other form
of a Kolmogorov-Smirnov type test statistic and a compatible set tables to interpret the results.
Again, researchers are cautioned not to use the test statistics given in this manual with tables
taken from some other source unless certain that they are compatible. The researchers did cite
a reference for the Cramer-von Mises-Smirnov GOF test they used. This test was devised by the
three statisticians for whom it is named between 1928 and 1936. The citation given by the
researchers was a 1946 book by Cramer. While this documents the test adequately, such an old
reference may be generally inaccessible to most readers. Thought should be given to citing
more current references to assist the interested reader.
Lilliefors developed a GOF test in 1967, which tests the composite hypothesis of normality.
Under this method, the null hypothesis states that the sample is drawn, not from a single
specified distribution, but from the family of normal distributions. This means that neither the
mean nor the variance of the normal distribution needs to be specified. Conover (1999, p.443-447)
provides the detailed method and table for this test. In a similar manner, Lilliefors later (1969)
developed a GOF test for the family of exponential distributions. This test method and tables
are also detailed in Conover (1999, p.447-449). The power of both of these tests is believed to be
greater than that of the chi-square test. A potentially useful application of the Lilliefors GOF test for
exponential distributions arises when a researcher is theorizing that, when events occur randomly,
the times between events follow an exponential distribution. In this situation, the test can be
used as a test of randomness of the data.
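A minimal sketch of the Lilliefors test is shown below, assuming the statsmodels package is available (its lilliefors function implements both the normal and exponential versions); the headway data are hypothetical.

# Sketch only: Lilliefors GOF tests using statsmodels (assumed available), hypothetical data.
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(42)
headways = rng.exponential(scale=8.0, size=50)   # hypothetical times between events (seconds)

# Composite test of normality (mean and variance not specified in advance)
stat_n, p_n = lilliefors(headways, dist='norm')
# Composite test of exponentiality -- usable as a test of randomness of event occurrence
stat_e, p_e = lilliefors(headways, dist='exp')

print(f"Normal:      D = {stat_n:.3f}, p = {p_n:.4f}")
print(f"Exponential: D = {stat_e:.3f}, p = {p_e:.4f}")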
SHAPIRO-WILK GOF TEST FOR NORMAL DISTRIBUTION FOR SINGLE INDEPENDENT
SAMPLE
Another test for normality of an EDF is the Shapiro-Wilk GOF test. Some studies have
concluded that this test has greater power than the Lilliefors test in many situations. This test
is not of the Kolmogorov-type. Conover (1999, p.450-451) provides the detailed method and
tables for this test. A useful feature of this test is highlighted through an example by Conover,
wherein several independent goodness-of-fit tests are combined into one overall test of
normality. This allows several small samples from possibly different populations, which by
themselves are insufficient to reject the hypothesis of normality, to be combined and thereby
provide enough evidence to reject normality.
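SciPy exposes the Shapiro-Wilk statistic directly; a minimal sketch with hypothetical data follows (the device of combining several small samples described by Conover is not part of this call and would have to be coded separately).

# Sketch only: Shapiro-Wilk test of normality for a single sample, hypothetical data.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
sample = rng.normal(loc=55.0, scale=5.0, size=30)   # hypothetical spot-speed data (mph)

W, p_value = shapiro(sample)
print(f"W = {W:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the hypothesis of normality.")
else:
    print("No evidence against normality.")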
Smirnov Goodness-of-Fit Test for Two Independent Samples (Ordinal, Interval, and
Ratio Scale Data)
ASSUMPTIONS OF SMIRNOV GOF TEST FOR TWO INDEPENDENT SAMPLES
1) The samples are random samples.
2) The two samples are mutually independent.
3) The measurement scale is at least ordinal.
4) The underlying distributions (and by implication the sample distributions) are continuous. The test may
still be used for discrete distributions, but doing so leads to a conservative test.
INPUTS FOR SMIRNOV GOF TEST FOR TWO INDEPENDENT SAMPLES
The data consist of two independent random samples, one of size n, X1, X2, . . . , Xn,
associated with some unknown distribution function F (x). The other sample is of size m, Y1,
Y2, . . . , Ym, associated with some unknown distribution function G (x).
HYPOTHESES OF SMIRNOV GOF TEST FOR TWO INDEPENDENT SAMPLES
A. Two-sided test
Ho: F (x) = G (x)    for all x
Ha: F (x) ≠ G (x)    for at least one x
B. One-sided test
Ho: F (x) ≤ G (x)    for all x
Ha: F (x) > G (x)    for at least one x
This is used when the distributions are suspected of being the same except that the
sample distribution F (x) is shifted to the left of the sample distribution G (x). In other
words, the X values of the F (x) tend to be smaller than the Y values of the G (x). This is a
more general test than testing for the distributions only differing by a location parameter
(means or medians).
C. One-sided test
Ho: F (x) ≥ G (x)    for all x
Ha: F (x) < G (x)    for at least one x
This is the one-sided test to use when the distributions are suspected of being the same
except that the sample distribution F (x) (X values) is shifted to the right of (larger than) the
sample distribution G (x) (Y values).
TEST STATISTIC (T, T +, T ) OF SMIRNOV GOF TEST FOR TWO INDEPENDENT SAMPLES
Let S1 (x) be the empirical distribution function (EDF) based on the random sample X1, X2, . . . ,
Xn. Let S2 (x) be the empirical distribution function (EDF) based on the other random sample
Y1, Y2, . . . , Ym. The test statistic T is defined differently for hypothesis sets A, B, and C.
A. Two-sided test: The test statistic T is the maximum difference between the two EDFs, S1(x)
and S2(x):

T = \max_x \left| S_1(x) - S_2(x) \right|

B. One-sided test: The test statistic T+ is the maximum difference by S1(x) above S2(x):

T^+ = \max_x \left[ S_1(x) - S_2(x) \right]

C. One-sided test: The test statistic T- is the maximum difference by S2(x) above S1(x):

T^- = \max_x \left[ S_2(x) - S_1(x) \right]
INTERPRETATION OF OUTPUT (DECISION RULE) OF SMIRNOV GOF TEST FOR TWO
INDEPENDENT SAMPLES
Reject the null hypothesis Ho at the α level of significance (meaning the two distributions are not
alike) if the appropriate test statistic (T, T+ or T-) exceeds the 1 - α quantile (w1-α) as given in
Table C-11 if n = m or Table C-12 if n ≠ m; otherwise accept Ho (meaning the two
distributions are alike). Note that the two-sided test statistic T is always equal to the larger of
the one-sided test statistics T+ and T-.
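In software, the two-sided version of this test is typically available as a "two-sample Kolmogorov-Smirnov" routine. A minimal Python sketch with two hypothetical samples follows; SciPy reports a p-value rather than the w quantiles of Tables C-11 and C-12, and its alternative argument provides one-sided versions.

# Sketch only: Smirnov (two-sample Kolmogorov-Smirnov) GOF test, hypothetical samples.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
sample_x = rng.normal(loc=50.0, scale=6.0, size=25)   # e.g., speeds from instrument 1 (hypothetical)
sample_y = rng.normal(loc=52.0, scale=6.0, size=30)   # e.g., speeds from instrument 2 (hypothetical)

# Two-sided test (hypothesis set A): T = max_x |S1(x) - S2(x)|
T, p_value = ks_2samp(sample_x, sample_y)
print(f"T = {T:.3f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject Ho: the samples appear to come from different distributions.")
else:
    print("Accept Ho: the samples are consistent with a common distribution.")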