Testing for Equal Distributions
Using the Likelihood Ratio Test
Submitted to
Dr. Roger Johnson
Department of Mathematics and Computer Science
South Dakota School of Mines and Technology
By
Anna Paterson
December 2008
Table of Contents
I. Introduction ......................................................................................................................... 1
III. Likelihood Ratio Test ...................................................................................................... 2
General ................................................................................................................................. 2
Exponential Distribution ...................................................................................................... 2
Untruncated ...................................................................................................................... 2
Truncated Fixed and Known ...................................................................................... 3
Truncated Fixed and Unknown .................................................................................. 5
IV. Chi-Square Results for Testing ...................................................................................... 6
V. Inverse CDF Method ......................................................................................................... 7
VI. Simulation ......................................................................................................................... 8
Untruncated Case ................................................................................................................. 8
Truncated Case ................................................................................................................... 11
VII. Works Cited .................................................................................................................. 15
I. Introduction
One method of testing whether two or more samples have equal distributions is known as
the likelihood ratio test. By comparing the maximum value of the likelihood of the null
hypothesis to the maximum value of the likelihood of the alternative hypothesis, we find a
statistic that can be used to make a decision regarding the hypotheses.
There are many times when data collection is limited by the device used. Disdrometers are
instruments used to measure the diameters of raindrops, but they lack the ability to measure
the smallest drops (Brawn and Upton 2007). Brawn and Upton (2007) point to two groups
impacted by this: telecommunications and meteorology, both of whom are interested in
knowing the size and distribution of raindrops during a storm. Telecommunications
engineers worry about the possibility of signal loss, while meteorologists use the data in
their forecast models (Brawn and Upton 2007). Use of disdrometers results in a set of
truncated data.
We will study the theory behind the likelihood ratio test using k populations with
exponential distributions. In the first case, we will look at samples that are untruncated.
Then we will consider the truncated case where no data is available below a fixed
truncation point, which we'll call τ. We'll split the truncated case into two subcases: one
where the precise value of τ is known and the other where τ is unknown. Once we have
found the likelihood ratio Λ for each of these cases, we will use the fact that −2 ln(Λ) has
an approximate chi-square distribution to find the p-value.
Finally, we will test our results using simulation. We will use the inverse cdf method to
generate random variables with an exponential distribution and use these generated
samples to test the performance of the likelihood ratio test.
III. Likelihood Ratio Test
General
A likelihood ratio test can be used to make a decision regarding two hypotheses. The
Neyman-Pearson Lemma says that for a fixed Type I error probability α, the likelihood
ratio test is least likely to accept a false null hypothesis when the hypotheses are simple
(Rice 1995). The likelihood ratio test statistic takes the form

$$\Lambda = \frac{\max_{H_0} L}{\max_{H_0 \cup H_A} L}$$

where L is the likelihood function, generally a product involving the probability density
function, for the distribution of the sample we're studying. When the result of this ratio
is close to one, we can believe that the null hypothesis is true, whereas a ratio that is
close to zero indicates that the alternative hypothesis is true (Shao 1999). This is clear
when you consider that if the null hypothesis is true, the numerator and denominator
will be nearly equal, whereas if it is false then the maximum over the larger parameter
set in the denominator will be much larger than the numerator.
Exponential Distribution
Lets look at three cases involving an exponential distribution. The first case is the
simplest; we assume there are no restrictions in gathering data and consider k
independent samples. In the second case, we study a truncated set of data, where there
is a known point below which we have no observations. Finally we will consider a
truncated set of data where the exact location of is unknown.
Untruncated
In this case, we will look at k independent samples with an exponential distribution.
We can use the likelihood ratio test on the following hypotheses:
$$H_0: \lambda_1 = \lambda_2 = \cdots = \lambda_k \qquad H_A: \text{Not } H_0.$$

The probability density function (p.d.f.) for an exponential distribution is

$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0.$$
If we have k independent samples where x_ij is the jth observation from the ith
population for i = 1, …, k and j = 1, …, n_i, the likelihood function will be the product
of the p.d.f. for each observation:

$$L(\lambda_1, \lambda_2, \ldots, \lambda_k) = \left[\prod_{j=1}^{n_1} \lambda_1 e^{-\lambda_1 x_{1j}}\right] \left[\prod_{j=1}^{n_2} \lambda_2 e^{-\lambda_2 x_{2j}}\right] \cdots \left[\prod_{j=1}^{n_k} \lambda_k e^{-\lambda_k x_{kj}}\right].$$
If we assume the null hypothesis is true, then we have

$$L(\lambda) = \lambda^{n} \exp\!\left(-\lambda \sum_{i=1}^{k} n_i \bar{x}_i\right)$$

where x̄_i is the average of the ith sample and n represents $\sum_{i=1}^{k} n_i$.
By taking the logarithm of L(λ), then setting its derivative equal to zero, we obtain

$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{k} n_i \bar{x}_i}$$

which we use in computing the numerator of the likelihood ratio Λ.

For the denominator, we want to maximize the likelihood function over H₀ ∪ H_A.
Using the same method as above, we find that

$$\hat{\lambda}_i = \frac{1}{\bar{x}_i}, \qquad i = 1, \ldots, k.$$
Combining these results we have

$$\Lambda = \frac{\bar{x}_1^{\,n_1}\, \bar{x}_2^{\,n_2} \cdots \bar{x}_k^{\,n_k}}{\bar{x}^{\,n}}$$

where x̄ is the mean of all n observations combined.
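As an illustrative sketch (not part of the original analysis), the statistic Λ above can be computed directly from the sample means. The function below is hypothetical — names and structure are mine — and works on the log scale to avoid underflow for large samples:

```python
import math

def lr_statistic(samples):
    """Likelihood ratio for k untruncated exponential samples:
    Lambda = (xbar_1^n1 * ... * xbar_k^nk) / xbar^n."""
    ns = [len(s) for s in samples]
    means = [sum(s) / len(s) for s in samples]
    n = sum(ns)
    grand_mean = sum(sum(s) for s in samples) / n
    # compute log(Lambda) first, then exponentiate
    log_lam = (sum(ni * math.log(mi) for ni, mi in zip(ns, means))
               - n * math.log(grand_mean))
    return math.exp(log_lam)

# Identical samples give Lambda = 1: no evidence against the null hypothesis
print(lr_statistic([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]))  # → 1.0
```

Samples with different means drive Λ toward zero, as the text describes.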
Truncated Fixed and Known
In some experiments, data collection is hindered by limitations of the instruments
used. For example, the instrument used to measure the diameters of rain drops
cannot measure diameters smaller than a certain size. The result is a point, which
well call , below which no data is available.
The hypotheses are the same as for an untruncated set of data, but we need to find a
form for the likelihood function that accounts for the point τ. According to Meeker
and Escobar (1998), an alternative form of the likelihood function is attained
through the following steps:
$$\tilde{F}(x) = P(X \le x \mid X > \tau) = \frac{P(\tau < X \le x)}{P(X > \tau)} = \frac{F(x) - F(\tau)}{1 - F(\tau)}$$

$$\tilde{f}(x) = \frac{d}{dx}\tilde{F}(x) = \frac{f(x)}{1 - F(\tau)}$$

$$L(\lambda_1, \ldots, \lambda_k) = \prod_{i=1}^{k} \prod_{j=1}^{n_i} \frac{f(x_{ij})}{1 - F(\tau)}.$$
We know that for the ith population f(x) = λ_i e^{−λ_i x}, and it is easily determined
that 1 − F(τ) = e^{−λ_i τ}. Thus the ratio of the two becomes

$$\frac{f(x)}{1 - F(\tau)} = \lambda_i e^{-\lambda_i (x - \tau)}$$

and the likelihood function for k independent samples becomes

$$L(\lambda_1, \lambda_2, \ldots, \lambda_k) = \left[\lambda_1^{n_1} \exp\!\left(-\lambda_1 \sum_{j=1}^{n_1}(x_{1j} - \tau)\right)\right] \cdots \left[\lambda_k^{n_k} \exp\!\left(-\lambda_k \sum_{j=1}^{n_k}(x_{kj} - \tau)\right)\right].$$
Using the same maximization technique as above, we find that the numerator of the
likelihood ratio is maximized when

$$\hat{\lambda} = \frac{1}{\bar{x} - \tau}$$

and the denominator is maximized when

$$\hat{\lambda}_i = \frac{1}{\bar{x}_i - \tau}, \qquad i = 1, 2, \ldots, k.$$

After inserting these into the likelihood ratio and performing some algebraic steps
we have

$$\Lambda = \frac{\prod_{i=1}^{k} (\bar{x}_i - \tau)^{n_i}}{(\bar{x} - \tau)^{n}}.$$
Notice that when τ is zero, this is the same as our result in the untruncated case.
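The formula adapts easily to code: a hypothetical helper (names are mine, not from the paper) just shifts each mean by the known truncation point τ before forming the same ratio as in the untruncated case.

```python
import math

def lr_statistic_truncated(samples, tau):
    """Likelihood ratio for k exponential samples truncated at a known tau:
    Lambda = prod_i (xbar_i - tau)^{n_i} / (xbar - tau)^n."""
    ns = [len(s) for s in samples]
    means = [sum(s) / len(s) for s in samples]
    n = sum(ns)
    grand_mean = sum(sum(s) for s in samples) / n
    # log scale for numerical stability; requires every mean to exceed tau
    log_lam = (sum(ni * math.log(mi - tau) for ni, mi in zip(ns, means))
               - n * math.log(grand_mean - tau))
    return math.exp(log_lam)

# With tau = 0 this agrees with the untruncated statistic
print(lr_statistic_truncated([[1.0, 2.0], [3.0, 4.0]], 0.0))
```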
Truncated Fixed and Unknown
In some experiments, we may not know the exact location of the cut-off point, τ,
but we can still find the likelihood ratio. As long as every x_ij is greater than or
equal to τ, the likelihood function becomes

$$L(\lambda_1, \lambda_2, \ldots, \lambda_k, \tau) = \left[\lambda_1^{n_1} \exp\!\left(-\lambda_1 \sum_{j=1}^{n_1}(x_{1j} - \tau)\right)\right] \cdots \left[\lambda_k^{n_k} \exp\!\left(-\lambda_k \sum_{j=1}^{n_k}(x_{kj} - \tau)\right)\right].$$

Notice that the likelihood now depends on τ as well as λ₁, λ₂, …, λ_k. If any x_ij is
less than τ, then the likelihood function is zero.
For the numerator of Λ, we want to maximize the likelihood function over
λ₁ = λ₂ = ⋯ = λ_k and τ. The result is that the numerator is maximized when

$$\hat{\lambda} = \frac{1}{\bar{x} - \hat{\tau}} \qquad \text{where} \qquad \hat{\tau} = \min_{i,j}\left(x_{ij}\right).$$

The denominator of Λ is maximized over λ₁, λ₂, …, λ_k and τ when

$$\hat{\lambda}_i = \frac{1}{\bar{x}_i - \hat{\tau}}, \qquad i = 1, 2, \ldots, k.$$

Algebraic manipulation of the resulting likelihood ratio results in

$$\Lambda = \frac{\prod_{i=1}^{k} (\bar{x}_i - \hat{\tau})^{n_i}}{(\bar{x} - \hat{\tau})^{n}}.$$
Notice that when τ̂ = τ, the likelihood ratio is the same as in the previous
truncated case where τ is known. Also, when τ̂ is zero, we have the same result as
in the untruncated case.
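A sketch of this case in code (a hypothetical helper of my own, not from the paper): the only change from the known-τ case is that τ is estimated by the overall minimum observation before forming the ratio.

```python
import math

def lr_statistic_tau_unknown(samples):
    """Unknown truncation point: estimate tau by the smallest observation
    over all samples, then compute
    Lambda = prod_i (xbar_i - tau_hat)^{n_i} / (xbar - tau_hat)^n."""
    tau_hat = min(min(s) for s in samples)
    ns = [len(s) for s in samples]
    means = [sum(s) / len(s) for s in samples]
    n = sum(ns)
    grand_mean = sum(sum(s) for s in samples) / n
    # requires each sample mean to exceed the overall minimum
    log_lam = (sum(ni * math.log(mi - tau_hat) for ni, mi in zip(ns, means))
               - n * math.log(grand_mean - tau_hat))
    return math.exp(log_lam)

print(lr_statistic_tau_unknown([[1.0, 2.0], [3.0, 4.0]]))
```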
IV. Chi-Square Results for Testing
If H₀ is true, then −2 ln Λ has an approximate chi-square distribution, where the degrees
of freedom are the number of free parameters in H₀ ∪ H_A less the number of free
parameters in H₀ (Rice 1995). For example, in the untruncated case above there are k free
parameters in H₀ ∪ H_A and 1 free parameter in H₀. Thus there are k − 1 degrees of
freedom in this case. It turns out that there are k − 1 degrees of freedom in each exponential
case above.
Notice that when Λ is small, which provides evidence for the alternative hypothesis,
−2 ln Λ is large. Conversely, when Λ is large, giving evidence for the null hypothesis,
−2 ln Λ is small. Of course, given that 0 ≤ Λ ≤ 1, referring to Λ as small simply means
that it is close to zero, whereas saying that Λ is large means only that it is close to one.
We can calculate the p-value by finding

$$p = P\left(\chi^2(k-1) \ge -2 \ln \Lambda\right).$$
If the p-value is small, we have reason to believe that the null hypothesis is false. A large p-
value supports the null hypothesis being true.
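For k = 2 samples (one degree of freedom, the case used in the simulations later), the chi-square tail probability has a closed form via the complementary error function, so the p-value can be computed without tables. This is a sketch of mine, not the paper's method; for general k − 1 degrees of freedom one would use a chi-square survival function such as scipy.stats.chi2.sf.

```python
import math

def p_value_1df(neg2loglam):
    """P(chi-square with 1 df >= x) = erfc(sqrt(x / 2)),
    where x is the observed -2 ln(Lambda)."""
    return math.erfc(math.sqrt(neg2loglam / 2.0))

print(round(p_value_1df(3.841), 2))  # 3.841 is the familiar 5% critical value
```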
V. Inverse CDF Method
The inverse cdf method can be used to generate random variables. This is accomplished by
computing the inverse of the cumulative distribution function evaluated at a uniform
random variable.
Integrating the p.d.f. for the exponential distribution, f(x) = λe^{−λx}, results in the
cumulative distribution function:

$$F(x) = \begin{cases} \displaystyle\int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}, & x \ge 0 \\[4pt] 0, & x < 0. \end{cases}$$
The inverse can be found by performing the following steps:

$$u = F(x) = 1 - e^{-\lambda x} \;\Rightarrow\; e^{-\lambda x} = 1 - u \;\Rightarrow\; x = F^{-1}(u) = -\frac{1}{\lambda}\ln(1 - u).$$

Refer to Proposition D in Rice (1995) for the proof that random variables having density f
may be generated by repeatedly calculating

$$F^{-1}(U) = -\frac{1}{\lambda}\ln(1 - U)$$

for values of U with a uniform density between 0 and 1. This function can be used for the
untruncated case; however, we will need a modified version that accounts for τ in order to
generate random variables for the truncated case.
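As a quick sanity check of the untruncated generator (a sketch of mine; the variable names are assumptions), the sample mean of the generated values should be close to 1/λ:

```python
import math
import random

def exp_inverse_cdf(u, lam):
    """Map a Uniform(0,1) value to Exponential(lam) via F^{-1}(u) = -ln(1-u)/lam."""
    return -math.log(1.0 - u) / lam

random.seed(1)
sample = [exp_inverse_cdf(random.random(), 2.0) for _ in range(100_000)]
print(abs(sum(sample) / len(sample) - 0.5) < 0.01)  # mean should be near 1/lam = 0.5
```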
Starting from the truncated c.d.f. derived earlier,

$$\tilde{F}(x) = \frac{F(x) - F(\tau)}{1 - F(\tau)} = \frac{e^{-\lambda\tau} - e^{-\lambda x}}{e^{-\lambda\tau}} = 1 - e^{-\lambda(x - \tau)},$$

setting u = F̃(x) and solving for x gives

$$\tilde{F}^{-1}(u) = \tau - \frac{1}{\lambda}\ln(1 - u).$$
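In code (a hypothetical sketch of mine), the truncated generator is the untruncated one shifted by τ; every draw exceeds τ and the sample mean is near τ + 1/λ:

```python
import math
import random

def trunc_exp_inverse_cdf(u, lam, tau):
    """Truncated-exponential generator: F~^{-1}(u) = tau - ln(1-u)/lam."""
    return tau - math.log(1.0 - u) / lam

random.seed(2)
draws = [trunc_exp_inverse_cdf(random.random(), 1.0, 0.1) for _ in range(50_000)]
# every draw is at least tau, and the mean is near tau + 1/lam = 1.1
print(min(draws) >= 0.1, abs(sum(draws) / len(draws) - 1.1) < 0.02)
```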
VI. Simulation
Using the result from the previous section, we can now generate random variables with
an exponential distribution in order to check the performance of the likelihood ratio test.
We can use Minitab to do this.
Untruncated Case
Let's first look at the simplest case by testing the following hypotheses:

$$H_0: \lambda_1 = \lambda_2 \qquad H_A: \text{Not } H_0.$$
As a first example, let λ₁ = 1, λ₂ = 1.1, and use a sample size of 50 for both. We use
the random variable generating function

$$F^{-1}(U) = -\frac{1}{\lambda}\ln(1 - U)$$

for each λ to generate two samples with exponential distributions. Next we calculate
the likelihood ratio

$$\Lambda = \frac{\bar{x}_1^{\,n_1}\, \bar{x}_2^{\,n_2}}{\bar{x}^{\,n}}$$

where x̄₁ is the mean of sample one, x̄₂ is the mean of sample two, and x̄ is the mean
of the combined samples. Since we set the sample sizes equal to each other,
n₁ = n₂ = 50 and n = 100. Now we can find −2 ln Λ and use the fact that this has an
approximate chi-square distribution with one degree of freedom to calculate the p-value.
We expect that when the value of λ₁ is close to λ₂, a larger sample size will be needed
to distinguish the difference between them. However, as the difference between λ₁ and
λ₂ grows, it should be detected even with a relatively small sample size.
Minitab was used to complete the simulation. After defining the sample size (k10),
λ₁ (k11), and λ₂ (k12), the following macro was executed 1,000 times in order to obtain
1,000 p-values for each combination of values for λ₁, λ₂, and sample size.
Untruncated (Minitab Macro):
noecho
random k10 c1;
uniform 0 1.
let c2 = -loge(1-c1)/k11
random k10 c4;
uniform 0 1.
let c5 = -loge(1-c4)/k12
let k1 = mean(c2)
let k2 = mean(c5)
let k3 = -2*k10*(loge(k1)+loge(k2))+4*k10*(loge((k1+k2)/2))
cdf k3 k4;
Chisquare 1.
let k5 = 1-k4
stack k5 c10 c10
# p-values stored in c10
stack k3 c12 c12
# -2ln(lambda) stored in c12
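For readers without Minitab, here is a rough Python translation of the macro above — a sketch under the same settings, with function and variable names of my own; the one-degree-of-freedom p-value uses erfc rather than a CDF command:

```python
import math
import random

def simulate_pvalues(lam1, lam2, n, reps=1000, seed=0):
    """Run the untruncated two-sample likelihood ratio test 'reps' times
    and collect the p-values (chi-square with 1 degree of freedom)."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(reps):
        # inverse-cdf generation, as in the macro's -loge(1-u)/lambda
        x = [-math.log(1 - rng.random()) / lam1 for _ in range(n)]
        y = [-math.log(1 - rng.random()) / lam2 for _ in range(n)]
        m1, m2 = sum(x) / n, sum(y) / n
        m = (m1 + m2) / 2
        # -2 ln(Lambda); clamp at 0 to guard against tiny negative roundoff
        stat = max(-2 * (n * math.log(m1) + n * math.log(m2)
                         - 2 * n * math.log(m)), 0.0)
        pvals.append(math.erfc(math.sqrt(stat / 2)))  # P(chi2_1 >= stat)
    return pvals

pv = sorted(simulate_pvalues(1.0, 1.5, 50))
print(pv[len(pv) // 2])  # median p-value: should be well below 0.5
```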
The ideal sample p-value was found by assuming the sample means were equal to the
population means, plugging those values into the formula for Λ, finding −2 ln(Λ), and
using a TI-83 calculator to find the p-value:

$$p = P\left(\chi^2(1) \ge -2\ln(\Lambda)\right).$$
We can compare the median p-values to the ideal sample p-values to get an idea of the
performance of the likelihood ratio test.
The results are as expected. When the means of the populations are very close, it takes
a larger sample size to distinguish between them. Conversely, when there is a larger
difference between the means of the populations, a smaller sample size will be
sufficient to discern the difference.
Untruncated:

Mean X   Mean Y   n1 = n2   Median p-value   Ideal Sample p-value
1        1/1.1    50        0.470046         0.633757
1        1/1.1    100       0.433461         0.500442
1        1/1.1    500       0.127699         0.131891
1        1/1.1    1000      0.037015         0.033106
1        1/1.3    50        0.186362         0.190209
1        1/1.3    100       0.077167         0.063948
1        1/1.3    500       0.000032         0.000034
1        1/1.3    1000      0.000000         0.000000
1        1/1.5    50        0.043479         0.043337
1        1/1.5    100       0.005560         0.004272
1        1/1.5    500       0.000000         0.000000
1        1/1.5    1000      0.000000         0.000000
The histograms below show the difference between a small sample size and a large
sample size. With a sample size of 50 and λ₁ = 1, λ₂ = 1.1, the p-values are distributed
almost evenly between 0 and 1. Compare this to a sample size of 1,000, where the
majority of p-values lie between 0.00 and 0.04.
Truncated Case
Now lets examine the truncated case where the value of is fixed and unknown. The
hypotheses remain the same, and we will use the same example where
1
= 1,
2
=
1.1, and 50
2 1
n n . We also need a fixed value for , so we will let = 0.1. Using
the random variable generating function for the truncated case,
) 1 ln(
1
) (
~
1
x x F
,
we can generate two samples with exponential distributions and a minimum
observation . We will use the likelihood ratio
( ) ( )
( )
n
n n
x
x x
2 1
2 1
and then find ln 2 to calculate the p-value.
In Minitab, we defined the sample size (k10), λ₁ (k11), λ₂ (k12), and τ (k20), and then
the following macro was executed 1,000 times in order to obtain 1,000 p-values for
each combination of values for λ₁, λ₂, and sample size.
Truncated fixed and unknown:
noecho
random k10 c1;
uniform 0 1.
let c2 = -loge(1-c1)/k11+k20
random k10 c4;
uniform 0 1.
let c5 = -loge(1-c4)/k12+k20
let k1 = mean(c2)
let k2 = mean(c5)
stack c2 c5 c7
let k30 = min(c7)
let k3 = -2*k10*(loge(k1-k30)+loge(k2-k30))+4*k10*(loge(((k1+k2)/2)-k30))
cdf k3 k4;
Chisquare 1.
let k5 = 1-k4
stack k5 c10 c10
# p-values stored in c10
stack k3 c12 c12
# -2ln(lambda) stored in c12
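A rough Python translation of this truncated macro as well (a sketch of mine; note that, like the macro's k30, the truncation point is estimated by the overall minimum rather than using the true τ):

```python
import math
import random

def simulate_truncated_pvalues(lam1, lam2, n, tau, reps=1000, seed=0):
    """Truncated case: generate shifted exponentials, estimate tau by the
    overall minimum, and run the likelihood ratio test 'reps' times."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(reps):
        x = [tau - math.log(1 - rng.random()) / lam1 for _ in range(n)]
        y = [tau - math.log(1 - rng.random()) / lam2 for _ in range(n)]
        tau_hat = min(min(x), min(y))          # the macro's k30
        m1, m2 = sum(x) / n - tau_hat, sum(y) / n - tau_hat
        m = (m1 + m2) / 2
        stat = max(-2 * (n * math.log(m1) + n * math.log(m2)
                         - 2 * n * math.log(m)), 0.0)
        pvals.append(math.erfc(math.sqrt(stat / 2)))  # P(chi2_1 >= stat)
    return pvals

pv = sorted(simulate_truncated_pvalues(1.0, 1.5, 50, 0.1))
print(pv[len(pv) // 2] < 0.2)  # median p-value should be small, as in the table
```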
Again, the results in this case show that more samples are needed to distinguish
between different means when the difference is small. When these results are
compared to the untruncated case, the median p-values in the truncated case are slightly
smaller in every variation. This is a good result because it means that even if data
points are undetectable below a certain point, it should not affect our accuracy in
determining whether the distributions are equal.
Truncated fixed and unknown:

Mean X   Mean Y   n1 = n2   Median p-value
1        1/1.1    50        0.467280
1        1/1.1    100       0.431102
1        1/1.1    500       0.127384
1        1/1.1    1000      0.036925
1        1/1.3    50        0.181737
1        1/1.3    100       0.075230
1        1/1.3    500       0.000032
1        1/1.3    1000      0.000000
1        1/1.5    50        0.041680
1        1/1.5    100       0.005284
1        1/1.5    500       0.000000
1        1/1.5    1000      0.000000
We still see the same outcome as with the untruncated case in that a larger sample size
results in more p-values being closer to zero.
VII. Works Cited
Brawn, D. and Upton, G. J. G. (2007), "Closed-form Parameter Estimates for a Truncated
Gamma Distribution," Environmetrics, 18, 633-645.
Meeker, W. Q. and Escobar, L. A. (1998), Statistical Methods for Reliability Data, New
York: John Wiley and Sons.
Rice, John A. (1995), Mathematical Statistics and Data Analysis, Belmont: Wadsworth.
Shao, Jun (1999), Mathematical Statistics, New York: Springer.