Hwang Good-Turing Frequency Estimation in A Finite Population 2014
Good–Turing frequency estimation (Good, 1953) is a simple, effective method for predicting detection
probabilities of objects of both observed and unobserved classes based on observed frequencies of
classes in a sample. The method has been used widely in several disciplines, such as information
retrieval, computational linguistics, text recognition, and ecological diversity estimation. Nevertheless,
existing studies assume sampling with replacement or sampling from an infinite population, which
might be inappropriate for many practical applications. In light of this limitation, this article presents
a modification of the Good–Turing estimation method to account for finite population sampling. We
provide three practical extensions of the modified method, and we examine performance of the modified
method and its extensions in simulation experiments.
Additional supporting information may be found in the online version of this article
at the publisher’s web-site
1 Introduction
It is straightforward to present the problem that the Good–Turing method (Good, 1953) was developed
to solve. Consider a population that consists of an unknown number of classes S, and let Ni denote
the number of objects in each class i = 1, . . . , S so that $N = \sum_{i=1}^{S} N_i$ is the total population size. If we
take a simple random sample of size n from this population, we can tabulate the frequency Xi of each
class i present in the sample. If S were known, we could represent these frequencies (some of which
may be zero) as X1 , . . . , XS so that $n = \sum_{i=1}^{S} X_i$ is the total sample size. But because S is unknown,
we can only observe Xi whenever Xi ≥ 1. In this setting, a common research goal is to estimate the
proportion of the total population that each class represents. This research goal can also be framed
as wanting to predict the proportion Cr of the total population that all classes with frequency r in the
sample represent, which can be expressed as
\[ C_r = \sum_{i=1}^{S} \frac{N_i}{N} I(X_i = r), \quad r = 0, 1, \ldots, \tag{1} \]
where I(·) is the usual indicator function. Note that the prediction is equivalent to estimating E (Cr ),
that is, the expected proportion of the total population represented by classes with frequency r in
C 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
322 W.-H. Hwang et al.: Good–Turing frequency estimation in a finite population
the sample. If there is no confusion, we will follow Good (1953) and use estimation terminology in
this work. Primary interest in practice is usually focused on small frequencies (Good, 2000), with
particular interest in the proportion of unobserved classes C0 . The complement of this quantity,
C+ = 1 − C0 , represents the proportion of observed classes and is often called the sample coverage in
the literature. Good (1953) first addressed the issue of estimating E (Cr ) and proposed an empirical
Bayesian approach. As mentioned in Good (1953, 2000), the method was motivated by the founder
of modern computer science and Good’s mentor, Dr. Alan Turing, so that it is generally called the
Good–Turing frequency estimation method.
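To make definition (1) concrete, the following minimal sketch computes C_r for a toy population in which the class sizes Ni are known. The function name, the class labels, and the sample are hypothetical choices for illustration only; in practice the Ni are unknown, which is why C_r has to be predicted.

```python
from collections import Counter

def coverage_by_frequency(population_sizes, sample_labels):
    """Return {r: C_r}, the population share of classes seen exactly r times."""
    total = sum(population_sizes.values())          # N
    counts = Counter(sample_labels)                 # X_i for observed classes
    shares = {}
    for label, size in population_sizes.items():
        r = counts.get(label, 0)                    # X_i = 0 for unseen classes
        shares[r] = shares.get(r, 0.0) + size / total
    return shares

# Hypothetical toy population: N = 10 objects in S = 4 classes.
population = {"a": 5, "b": 3, "c": 1, "d": 1}
sample = ["a", "a", "b", "c"]                       # n = 4 objects drawn

c = coverage_by_frequency(population, sample)
sample_coverage = 1.0 - c.get(0, 0.0)               # C_+ = 1 - C_0
```

Here class "d" is unseen, so C_0 is its population share and the sample coverage is the share of the three observed classes.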
The Good–Turing method has been applied successfully in several disciplines, such as information
retrieval (Song and Croft, 1999), computational linguistics (Church and Hanks, 1990), speech recog-
nition (Jelinek, 1998; Chen and Goodman, 1999), species richness estimation (Esty, 1985; Chao and
Lee, 1992), population size estimation (Chao et al., 1992), Shannon entropy estimation (Chao and
Shen, 2003), and missile coverage estimation (Lo, 1992). On theoretical aspects of the method, Esty
(1983) and Zhang and Zhang (2009) obtained conditions for asymptotic normality of the sample cov-
erage estimator, Orlitsky et al. (2003) addressed an optimal property based on information theory, and
McAllester and Schapire (2000) and Wagner et al. (2006) established several consistency properties.
The research literature on the Good–Turing method is rich; however, existing studies
focus on sampling with replacement, which is equivalent to sampling from an infinite population or
from a finite population when the sampling fraction is negligible. (Hereafter, we refer to this gen-
eral situation as sampling with replacement.) In fact, to the best of our knowledge, implications of
applying the Good–Turing method when sampling without replacement in a finite population have
not previously been studied. One might expect that this method may result in substantial bias when
the sample is taken without replacement from a finite population and the sampling fraction cannot
be ignored. Examples of bias due to failing to account for sampling without replacement in species
richness estimation can be found in Haas and Stokes (1998), Haas et al. (2006), and Chao and Lin
(2012).
In this study, either the total population size N or the sampling fraction p = n/N is assumed to
be given. This information is available in a variety of applications; for instance, the total number of
accounts open at a bank or on a website is usually known by managers; however, the total does not
represent the number of registered persons because some individuals may have multiple accounts.
Quadrat sampling in ecology surveys provides another example. Because plants are sedentary,
quadrat sampling of plants often involves random sampling without replacement from an area divided into
a known number of quadrats. More detailed examples are given in Haas and Stokes (1998).
Section 2 briefly reviews the Good–Turing method under sampling with replacement. Section 3
modifies the method to account for sampling without replacement. Section 4 presents three extensions
of the modified method: estimating the number of classes, estimating the Shannon entropy diversity
index, and predicting the number of new species in a subsequent sample. Section 5 first examines the
performance of proposed frequency and interval estimators by resampling from empirical data on rare
vascular plants then analyzes the entire dataset as a single sample. Section 6 conducts an extensive
simulation study to evaluate the performance of the proposed methods including two of the extensions
in Section 4. Section 7 concludes.
2 Classical method
In the classical Good–Turing method, let pi = Ni /N be the probability of observing an object from
the ith class for i = 1, . . . , S. Moreover, let $f_j = \sum_{i=1}^{S} I(X_i = j)$ be the number of classes in the
population with frequency j in the sample so that $n = \sum_{j \ge 1} j f_j$ is the total sample size. Note that
f0 is unobservable here. Let $D = \sum_{j \ge 1} f_j$ be the number of distinct classes in the sample; without
loss of generality, we arrange the classes with Xi ≥ 1 to be indexed as i = 1, . . . , D. If the sampling is conducted with
C 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.biometrical-journal.com
Biometrical Journal 57 (2015) 2 323
replacement, theoretical tabulated frequencies of the random sample (X1 , . . . , XS ) can be reasonably
assumed to follow a multinomial distribution
\[ \Pr(X_1 = x_1, \ldots, X_S = x_S) = \binom{n}{x_1, \ldots, x_S} p_1^{x_1} \cdots p_S^{x_S}, \]
where xi ≥ 0 for i = 1, . . . , S and $n = \sum_{i=1}^{S} x_i$. It is straightforward to find that the maximum likelihood
estimator (MLE) of pi is Xi /n, and hence the MLE of E (Cr ) is r fr /n. Unfortunately, the estimator is
problematic since it does not reflect the possible existence of classes not observed in the sample. As a
simple example, the MLE always estimates that E (C0 ) is zero and no interval estimator is available.
For these reasons, the MLE is generally avoided in this setting.
By noticing that Xi follows a binomial distribution B(n, pi ) with n trials and success probability pi ,
the expectation of Cr can be decomposed into
\[ E(C_r) = \frac{r+1}{n-r} \sum_{i=1}^{S} \Pr(X_i = r+1) - \frac{(r+2)(r+1)}{(n-r)(n-r-1)} \sum_{i=1}^{S} \Pr(X_i = r+2)(1 - p_i), \]
where the latter is dominated by $O\{E(f_{r+2})/n^2\} = O(1/n)$. Consequently, we have
\[ E(C_r) = \frac{r+1}{n} E(f_{r+1}) + O\left(\frac{1}{n}\right). \]
Then we may estimate E ( p̄r ) by $\hat{\bar{p}}_r = r^*/n$, where $r^* = (r+1) f_{r+1}/f_r$ is an adjusted frequency for an
arbitrary class with frequency r in the sample (Good, 1953). This formulation clearly elucidates the
difference between the Good–Turing method and the maximum likelihood approach.
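As an illustration of the classical method, the sketch below replaces E(f_{r+1}) in the expansion above by the observed f_{r+1}, giving the estimate (r + 1) f_{r+1}/n, and also computes the adjusted frequency r*. The function names and the example counts are hypothetical and are not taken from the article's Supporting Information.

```python
from collections import Counter

def frequency_counts(class_counts):
    """Map j -> f_j, the number of classes observed exactly j times (j >= 1)."""
    return Counter(x for x in class_counts if x >= 1)

def good_turing(class_counts, r):
    """Classical estimate of E(C_r): (r + 1) * f_{r+1} / n."""
    f = frequency_counts(class_counts)
    n = sum(j * fj for j, fj in f.items())
    return (r + 1) * f.get(r + 1, 0) / n

def adjusted_frequency(class_counts, r):
    """Good's adjusted frequency r* = (r + 1) * f_{r+1} / f_r for r >= 1."""
    f = frequency_counts(class_counts)
    return (r + 1) * f[r + 1] / f[r]

counts = [1, 1, 1, 2, 2, 3]     # hypothetical X_i of the observed classes
c0 = good_turing(counts, 0)     # share attributed to unseen classes: f_1 / n
c1 = good_turing(counts, 1)
r_star = adjusted_frequency(counts, 1)
```

With these counts n = 10 and f_1 = 3, so the estimated share of unseen classes is f_1/n, whereas the MLE would assign them zero.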
In the context of sampling without replacement, the random vector (X1 , . . . , XS ) can be reasonably
assumed to follow a multivariate hypergeometric distribution with joint probability mass function
\[ \Pr(X_1 = x_1, \ldots, X_S = x_S) = \frac{\binom{N_1}{x_1} \cdots \binom{N_S}{x_S}}{\binom{N}{n}}, \tag{4} \]
where xi ≥ 0 for all i, $n = \sum_i x_i$, and $\binom{N_i}{r} = 0$ if Ni < r. For each i, the marginal probability mass
function of Xi follows a hypergeometric distribution
\[ \Pr(X_i = k) = \frac{\binom{N_i}{k} \binom{N - N_i}{n - k}}{\binom{N}{n}}. \tag{5} \]
The expected value of Cr gives
\[ \begin{aligned}
E(C_r) &= \frac{r}{N} \sum_{i=1}^{S} \Pr(X_i = r) + \frac{r+1}{N} \sum_{i=1}^{S} \frac{N - N_i - n + r + 1}{n - r} \Pr(X_i = r+1) \\
&= \frac{pr}{n} E(f_r) + \frac{q(r+1)}{n-r} E(f_{r+1}) - \frac{(r+1)(r+2)}{(n-r)(n-r-1)} \sum_{i=1}^{S} \Pr(X_i = r+2)\, \delta_i,
\end{aligned} \]
where q = 1 − p and δi = max{0, (N − Ni − n + r + 2)/N}. Note that 0 ≤ δi < 1, which can be seen from the
probability mass function (5) at k = r + 2 provided that Pr(Xi = r + 2) is positive. Therefore the last term
is dominated by $O\{E(f_{r+2})/n^2\} = O(1/n)$. As a result, we have
\[ E(C_r) = \frac{pr}{n} E(f_r) + \frac{q(r+1)}{n} E(f_{r+1}) + O\left(\frac{1}{n}\right). \]
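Replacing the expectations in the last expansion by the observed counts suggests a plug-in estimate of E(C_r), namely {p r f_r + q(r + 1) f_{r+1}}/n. The following is a minimal sketch under that reading; the function name and the dictionary representation of the f_j are illustrative assumptions rather than the authors' code.

```python
def modified_good_turing(f, r, p):
    """Plug-in estimate of E(C_r) under sampling fraction p = n / N.

    f maps j -> f_j (number of classes seen exactly j times, j >= 1).
    """
    n = sum(j * fj for j, fj in f.items())
    q = 1.0 - p
    return (p * r * f.get(r, 0) + q * (r + 1) * f.get(r + 1, 0)) / n

f = {1: 3, 2: 2, 3: 1}                        # hypothetical counts, n = 10
c0_half = modified_good_turing(f, 0, p=0.5)   # q * f_1 / n
c0_tiny = modified_good_turing(f, 0, p=0.0)   # classical limit f_1 / n
c1_full = modified_good_turing(f, 1, p=1.0)   # census limit r * f_r / n
```

The two limiting calls show the behavior one would expect: as p goes to 0 the estimate matches the classical Good–Turing form, and at p = 1 it matches r f_r/n.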
When estimating the sample coverage, if we do not account for finite population sampling and apply
the classical Good–Turing method without adjustment, the coverage estimator approaches 1 − F1 /N,
representing an inherent bias because F1 ≠ 0 in many populations. Specifically, the asymptotic bias of the
Good–Turing method is pE ( f1 )/n, which depends on the sampling fraction p in a finite population.
The discrepancy between the limiting behavior of the modified and classical Good–Turing estimators
demonstrates the validity of our adjustment in this extreme case and supports consideration of the
modified estimator in the general case when the sampling fraction cannot be ignored.
Next, we develop an interval estimator based on the proposed estimator $\hat{C}_r$. Since $C_r$ is a random variable
rather than a fixed parameter, it is reasonable to consider the mean squared error of $\hat{C}_r$ instead of its
variance. In the Appendix, we derive
\[ E(\hat{C}_r - C_r)^2 \approx \frac{q(r+1)}{n^2} \left\{ (1 + qr) f_{r+1} + q(r+2) f_{r+2} - \frac{1}{n} (r+1) f_{r+1} (f_{r+1} - 1) \right\}. \tag{8} \]
Moreover, given a fixed p and under certain regularity conditions, we have $(\hat{C}_r - C_r)/\sigma_r \sim N(0, 1)$
when N is large, where $\sigma_r^2$ is the right-hand side of (8). As a result, an approximate 95% confidence
(prediction) interval for $C_r$ has the form
\[ \hat{C}_r \pm 1.96\, \sigma_r. \tag{9} \]
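A minimal sketch of how (8) and (9) might be evaluated from observed counts is given below. It assumes that the point estimate is the plug-in form {p r f_r + q(r + 1) f_{r+1}}/n implied by the expansion of E(C_r); the function names are hypothetical, and truncating a negative mean squared error estimate at zero is an added safeguard, not something stated in the text.

```python
import math

def mse_estimate(f, r, p):
    """Right-hand side of (8) evaluated at the observed counts f_j."""
    n = sum(j * fj for j, fj in f.items())
    q = 1.0 - p
    fa, fb = f.get(r + 1, 0), f.get(r + 2, 0)      # f_{r+1} and f_{r+2}
    bracket = (1 + q * r) * fa + q * (r + 2) * fb - (r + 1) * fa * (fa - 1) / n
    return q * (r + 1) / n**2 * bracket

def interval_95(f, r, p):
    """Approximate 95% interval (9): estimate +/- 1.96 * sigma_r."""
    n = sum(j * fj for j, fj in f.items())
    q = 1.0 - p
    estimate = (p * r * f.get(r, 0) + q * (r + 1) * f.get(r + 1, 0)) / n
    # Assumption: a negative MSE estimate is truncated at zero before the root.
    half_width = 1.96 * math.sqrt(max(mse_estimate(f, r, p), 0.0))
    return estimate - half_width, estimate + half_width

f = {1: 3, 2: 2, 3: 1}                 # hypothetical counts, n = 10
mse0 = mse_estimate(f, 0, p=0.5)
low, high = interval_95(f, 0, p=0.5)
```

With so few observations the interval is very wide; the example is meant only to show the arithmetic.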
Next, analogous to p̄r , we define $\bar{N}_r = \sum_{i=1}^{S} N_i I(X_i = r)/f_r$ as the average population frequency of
classes appearing exactly r times in the sample. Rewriting $\bar{N}_r = N C_r/f_r$ by the definition in (1) and
directly using the proposed estimator $\hat{C}_r$ in (6), $E(\bar{N}_r)$ can be estimated by
$\hat{\bar{N}}_r = r + (r+1)(f_{r+1}/f_r)(q/p)$ for all r ≥ 1.
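The algebra behind this expression can be written out as follows, assuming that the estimator in (6) is the plug-in form {p r f_r + q(r + 1) f_{r+1}}/n suggested by the expansion of E(C_r), and using N/n = 1/p:

```latex
\begin{aligned}
\hat{\bar{N}}_r = \frac{N \hat{C}_r}{f_r}
  &= \frac{N}{n} \cdot \frac{p\, r f_r + q (r+1) f_{r+1}}{f_r} \\
  &= \frac{1}{p} \left\{ p\, r + q (r+1) \frac{f_{r+1}}{f_r} \right\} \\
  &= r + (r+1) \frac{f_{r+1}}{f_r} \cdot \frac{q}{p}.
\end{aligned}
```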
Remark 1. Rather than using the exact distribution (5), we may also obtain the estimators according to
a binomial approximation, which is more tractable for mathematical manipulation. When the population
size N is large and the sampling fraction p is fixed, it is easy to find that (Johnson et al., 1992)
\[ \Pr(X_i = k) \approx \binom{N_i}{k} p^k q^{N_i - k}, \quad k = 0, 1, \ldots, N_i. \tag{10} \]
Consequently, the formulation in (6) remains the same even if the derivation is based on the binomial
approximation. For the mean squared error estimator, we may observe that
$\Pr(X_i = k, X_j = \ell) \approx \binom{N_i}{k} p^k q^{N_i - k} \times \binom{N_j}{\ell} p^{\ell} q^{N_j - \ell}$,
where the approximation ignores the dependence of Xi and Xj . By adopting
this approximation, we find $E(\hat{C}_r - C_r)^2 \approx q(r+1)\{(1 + qr) f_{r+1} + q(r+2) f_{r+2}\}/n^2$, which differs
from (8) by the term $q(r+1)^2 f_{r+1}(f_{r+1} - 1)/n^3$. The discrepancy is generally minor in our experience.
We conclude that the binomial approximation is convenient for simplifying the computations because
it avoids the factorial and combinatorial calculations in the exact model.
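The quality of the approximation in (10) can be checked numerically. The sketch below compares the exact hypergeometric probabilities in (5) with the binomial approximation for one hypothetical configuration (N = 1000, Ni = 5, n = 200); the sizes and function names are illustrative assumptions.

```python
from math import comb

def hypergeom_pmf(k, N, Ni, n):
    """Exact marginal (5): Pr(X_i = k) when n of N objects are drawn."""
    if k < 0 or k > Ni or n - k < 0 or n - k > N - Ni:
        return 0.0
    return comb(Ni, k) * comb(N - Ni, n - k) / comb(N, n)

def binomial_approx(k, Ni, p):
    """Approximation (10): each of the N_i objects is sampled with probability p."""
    if k < 0 or k > Ni:
        return 0.0
    return comb(Ni, k) * p**k * (1 - p) ** (Ni - k)

N, Ni, n = 1000, 5, 200          # hypothetical sizes, so p = 0.2
p = n / N
exact = [hypergeom_pmf(k, N, Ni, n) for k in range(Ni + 1)]
approx = [binomial_approx(k, Ni, p) for k in range(Ni + 1)]
max_gap = max(abs(a - b) for a, b in zip(exact, approx))
```

For this configuration the largest pointwise gap between the two distributions is well below 0.01, in line with the remark that the discrepancy is generally minor.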
Remark 2. In the context of sampling without replacement, C0 is no longer the probability of observing
an unseen class in a subsequent sample. Instead, that probability can be represented as
$\sum_{i=1}^{S} N_i I(X_i = 0)/(N - n) = N C_0/(N - n)$, and its expectation can be estimated by $\hat{C}_0/q = f_1/n$. This observation
yields an interesting finding: the estimator f1 /n (i.e., the observed proportion of singletons in a sample)
is valid for estimating the probability of observing an unseen class in a subsequent sample, regardless
of whether the data were sampled with or without replacement.
Remark 3. The asymptotic normality of $\hat{C}_r - C_r$ may be proven following the development in Esty
(1983). Specifically, assume that Yi , i = 1, . . . , S, are independently distributed B(Ni , p) for each i.
Then the conditional distribution of the Yi s given $n = \sum_{i=1}^{S} Y_i$ is identical to the distribution of the Xi s in (4).
Further, define $C_{r,y} = \sum_{i=1}^{S} I(Y_i = r) N_i/N$ and $\hat{C}_{r,y}$, which are similar to the forms of $C_r$ and $\hat{C}_r$. Then
$\hat{C}_{r,y} - C_{r,y}$ is an unconditional version of $\hat{C}_r - C_r$. Recognizing the asymptotic normality of $\hat{C}_{r,y} - C_{r,y}$
is straightforward as it only involves a sum of independent random variables. Then, following a partial
inversion formula for characteristic functions (Esty, 1983, Lemma 2), the asymptotic normality of
$\hat{C}_r - C_r$ can be established as well.
4 Applications
In this section we present three extensions of the proposed frequency estimation method in the context
of finite population sampling.
\[ E(\hat{S}_0 - S) \approx \frac{E(f_1)\, q \log(q)\, \gamma^2}{p\, E(C_+)}, \]
where γ is the coefficient of variation of the Ni s defined by $\gamma^2 = \sum_{i=1}^{S} (N_i - \bar{N})^2/(S \bar{N}^2)$. Consequently,
$\hat{S}_0$ approaches the true S as p increases to 1; nevertheless, this estimator has a negative bias in
general, which can be seen by noting that log(q) < 0 when q < 1. Alternatively, one may simply apply
the method of moments to estimate the bias via $f_1 q \log(q) \hat{\gamma}^2/(p \hat{C}_+)$, where
$\hat{\gamma}^2 = \max\{\hat{S}_0 \sum_k k(k-1) f_k/\{n(n-1)\} - 1, 0\}$. The corresponding estimator of S can be expressed as
\[ \hat{S} = \frac{D}{\hat{C}_+} - \frac{f_1 q \log(q) \hat{\gamma}^2}{p \hat{C}_+}. \tag{11} \]
Similar to $\hat{S}_0$, this estimator also converges to S as p increases to 1. For the other extreme case, since
$\lim_{p \to 0} q \log(q)/p = -1$, the estimator reduces to
\[ \hat{S}_{CL} = \frac{D}{1 - f_1/n} + \frac{f_1 \hat{\gamma}^2}{1 - f_1/n} \]
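A minimal sketch of estimator (11) follows. It assumes that the coverage estimate is 1 − q f1/n and that the first-stage estimate is D divided by that coverage, which is how (11) and the surrounding text read; the function name and the example counts are hypothetical.

```python
import math

def estimate_classes(f, p):
    """Estimator (11); f maps k -> f_k, and 0 < p < 1 is the sampling fraction."""
    n = sum(k * fk for k, fk in f.items())
    d = sum(f.values())                           # D, distinct classes observed
    q = 1.0 - p
    coverage = 1.0 - q * f.get(1, 0) / n          # assumed coverage estimate
    s0 = d / coverage                             # assumed first-stage estimate
    moment = sum(k * (k - 1) * fk for k, fk in f.items())
    gamma2 = max(s0 * moment / (n * (n - 1)) - 1.0, 0.0)
    bias = f.get(1, 0) * q * math.log(q) * gamma2 / (p * coverage)
    return s0 - bias                              # bias <= 0, so the estimate rises

f = {1: 4, 2: 1, 10: 1}            # hypothetical counts: n = 16, D = 6
s_half = estimate_classes(f, 0.5)
s_near_census = estimate_classes(f, 0.999999)
```

Evaluating the function at a sampling fraction close to 1 returns a value close to D, and at a very small sampling fraction it approaches the limiting form shown above.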
\[ H = -\sum_{i=1}^{S} p_i \log p_i, \]
where pi = Ni /N. Applying the MLE of pi , denoted $\hat{p}_i = X_i/n$, suggests a naive estimator of the
Shannon index
\[ \hat{H}_{MLE} = -\sum_{i=1}^{S} I(X_i \ge 1)\, \hat{p}_i \log \hat{p}_i. \]
However, this MLE-based estimator may be inappropriate because it does not account for the pi s of
classes unseen in the sample and so generally exhibits a negative bias. This bias can be confirmed by
expanding $\hat{H}_{MLE}$ with respect to $\hat{p}_i$ in a Taylor series about the point pi , which yields
\[ E(\hat{H}_{MLE}) = H - \frac{q(S-1)}{2n} + O\left(\frac{1}{n^2}\right). \tag{12} \]
As the sampling fraction approaches 0 (or the population increases without bound), this expression
is almost the same as that in Basharin (1959). Thus, the MLE of H can result in a negative bias for
sampling both with and without replacement.
A Horvitz-Thompson type estimator for H can be expressed as
\[ H = E\left\{ -\sum_{i=1}^{S} \frac{I(X_i \ge 1)\, p_i \log p_i}{\Pr(X_i \ge 1)} \right\}; \tag{13} \]
thus there is an unbiased estimator of H provided that the pi s of observed classes (and so Pr(Xi ≥ 1))
are known. Since not all pi s are known in many practical situations, research questions about estimating
H in the form of (13) are focused on how to accurately estimate the pi s belonging to observed classes.
Assuming sampling with replacement, Chao and Shen (2003) suggested using the sample coverage
(C+ ) to correct the estimation of pi s of observed classes in (13) and proposed a widely used estimator
\[ \hat{H}_{CS} = -\sum_{k=1}^{n} f_k \frac{\tilde{\pi}_k \log \tilde{\pi}_k}{1 - (1 - \tilde{\pi}_k)^n}, \]
where π̃k = (1 − f1 /n)k/n. Note that the estimator does not converge to the true H when p = 1 (i.e.,
under a complete census) and F1 ≠ 0. Thus, there is an inherent bias for the estimator when sampling
without replacement.
Similar to our proposed modification of frequency estimation to account for sampling without
replacement, we propose the estimator of the Shannon index
\[ \hat{H}_p = -\sum_{k=1}^{n} f_k \frac{\hat{\pi}_k \log \hat{\pi}_k}{1 - q^{N \hat{\pi}_k}}, \]
where $\hat{\pi}_k = \hat{C}_+ k/n = (1 - f_1 q/n) k/n$. Taking the sampling fraction into account, the proposed
estimator will theoretically converge to the true H as p increases to 1.
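The three Shannon index estimators discussed in this section can be sketched as follows. The function names and example counts are hypothetical, and the code assumes every adjusted proportion is strictly positive (so, for example, the Chao-Shen form is not evaluated on a sample consisting only of singletons).

```python
import math

def shannon_mle(f):
    """Naive plug-in estimate: -sum_k f_k (k/n) log(k/n)."""
    n = sum(k * fk for k, fk in f.items())
    return -sum(fk * (k / n) * math.log(k / n) for k, fk in f.items())

def shannon_chao_shen(f):
    """Chao-Shen form with pi_k = (1 - f_1/n) k/n (sampling with replacement)."""
    n = sum(k * fk for k, fk in f.items())
    cover = 1.0 - f.get(1, 0) / n
    total = 0.0
    for k, fk in f.items():
        pi = cover * k / n
        total += fk * pi * math.log(pi) / (1.0 - (1.0 - pi) ** n)
    return -total

def shannon_finite(f, p):
    """Finite-population form with pi_k = (1 - q f_1/n) k/n and weight 1 - q^(N pi_k)."""
    n = sum(k * fk for k, fk in f.items())
    q = 1.0 - p
    big_n = n / p                                  # population size N
    cover = 1.0 - q * f.get(1, 0) / n
    total = 0.0
    for k, fk in f.items():
        pi = cover * k / n
        total += fk * pi * math.log(pi) / (1.0 - q ** (big_n * pi))
    return -total

f = {1: 3, 2: 2, 3: 1}             # hypothetical counts, n = 10
h_mle = shannon_mle(f)
h_cs = shannon_chao_shen(f)
h_census = shannon_finite(f, 1.0)  # complete census: q = 0
```

At p = 1 the finite-population form collapses to the plug-in value, which is the true H under a census, whereas the Chao-Shen form still applies its coverage correction whenever singletons are present.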
of size m. This quantity is helpful for assessing the value of taking another sample (Shen et al., 2003;
Cowell et al., 2012). Assuming an infinite population, the goal concentrates on estimating
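\[ E(S_m \mid \text{data}) = \sum_{i=1}^{S} I(X_i = 0) \left\{ 1 - (1 - p_i)^m \right\}. \tag{14} \]
to estimate E (Sm | data), where $\hat{f}_0 = \hat{S}_{CL} - D$ and ŜCL is the ACE estimator in Section 4.1. However,
in the context of sampling without replacement, using the approximation in (10), the main concern
in (14) becomes
\[ E(S_m \mid \text{data}) \approx \sum_{i=1}^{S} I(X_i = 0) \left\{ 1 - \left( 1 - \frac{m}{N - n} \right)^{N_i} \right\}. \]
Following the perspective of Shen et al. (2003), we suggest using the estimator
\[ \hat{S}_m = \hat{f}_0 \left\{ 1 - \left( 1 - \frac{m}{N - n} \right)^{\hat{\bar{N}}_0} \right\} \tag{15} \]
to estimate E (Sm | data), where $\hat{f}_0 = \hat{S} - D$ and $\hat{\bar{N}}_0 = N \hat{C}_0/\hat{f}_0 = q f_1/(p \hat{f}_0)$.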
Remark 4. Using the binomial expansion with the estimator S̃m by Shen et al. (2003) and with the
proposed estimator Ŝm (if $\hat{\bar{N}}_0$ is viewed as an integer), we found that both estimators result in the
same leading term, m f1 /n. Numerical experiments (Section 6) show that the leading term dominates
in both methods. As a consequence, it turns out that the estimator by Shen et al. (2003) is still valid
even when data were sampled without replacement.
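Estimator (15) and the leading term noted in Remark 4 can be sketched as below. The estimated number of unseen classes is passed in as an argument because it depends on whichever class-number estimator is used; the function names and numerical values are hypothetical, and the code assumes 0 < p < 1 and a positive estimate of unseen classes.

```python
def predict_new_classes(f0_hat, f1, n, p, m):
    """Estimator (15) for the number of new classes in m further draws."""
    q = 1.0 - p
    big_n = n / p                                  # population size N
    n0_bar = q * f1 / (p * f0_hat)                 # estimated mean size of an unseen class
    return f0_hat * (1.0 - (1.0 - m / (big_n - n)) ** n0_bar)

def leading_term(f1, n, m):
    """First-order term m * f_1 / n noted in Remark 4."""
    return m * f1 / n

# Hypothetical inputs: n = 100 drawn from N = 200, four singletons,
# and an estimate of four unseen classes (so each has estimated size 1).
s_10 = predict_new_classes(4.0, 4, 100, 0.5, 10)
s_all = predict_new_classes(4.0, 4, 100, 0.5, 100)   # draw every remaining object
lead_10 = leading_term(4, 100, 10)
```

In this configuration the estimated mean size of an unseen class is 1, so the estimator equals its leading term, and drawing all remaining objects returns the full estimate of unseen classes.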
Individuals (i) 1 2 3 4 5 6 7 8 9 10 11 12 13
Frequency (Fi ) 61 35 18 12 15 4 8 4 5 5 1 2 1
Individuals (i) 14 15 16 19 20 22 29 32 40 43 48 61
Frequency (Fi ) 2 3 2 1 2 1 1 1 1 1 1 1
p      Estimator   C+               C1               C2
1/2    CGT         0.939 ± 0.022    0.069 ± 0.031    0.054 ± 0.034
1/2    Cadj        0.970 ± 0.013    0.065 ± 0.017    0.062 ± 0.018
1/3    Cadj        0.960 ± 0.016    0.066 ± 0.022    0.059 ± 0.024
1/5    Cadj        0.952 ± 0.019    0.068 ± 0.025    0.057 ± 0.028
1/10   Cadj        0.946 ± 0.020    0.069 ± 0.028    0.055 ± 0.031
\[ I_{\alpha}^{\text{score}}(l, u; \theta) = (u - l) + \frac{2}{\alpha} (l - \theta) I(\theta < l) + \frac{2}{\alpha} (\theta - u) I(\theta > u), \]
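The interval score above can be computed directly. The sketch below is a minimal illustration with a hypothetical function name; it takes the interval endpoints l and u, the realized target θ, and the nominal level α, and adds the 2/α penalty only when the interval misses the target.

```python
def interval_score(lower, upper, theta, alpha=0.05):
    """Width of the interval plus a 2/alpha penalty for missing theta."""
    score = upper - lower
    if theta < lower:
        score += (2.0 / alpha) * (lower - theta)
    elif theta > upper:
        score += (2.0 / alpha) * (theta - upper)
    return score

covered = interval_score(0.90, 1.00, 0.95)    # theta inside: score is the width
missed = interval_score(0.90, 1.00, 0.85)     # theta below l: width plus penalty
```

Smaller scores are better: a narrow interval is rewarded, while an interval that misses the target is penalized in proportion to the size of the miss.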
Because the Good–Turing estimator does not depend on the sampling fraction p, its estimate is
only reported for the case p = 1/2. The estimation results for Ĉ+ and Ĉr , r = 1, 2, are similar to
the simulation study in Section 5.1. Interestingly, as the sampling fraction gets smaller, estimates
using Ĉadj approach those using ĈGT from above for E (C+ ) and E (C2 ) but from below for E (C1 ).
However, standard errors estimated using the proposed method approach those using the Good–
Turing estimators from below for all three target quantities. As one would expect, the effect of our
proposed correction for finite population sampling diminishes as the sampling fraction becomes very
small.
[Figure 1: bar charts for Populations 1 to 5; x-axis: frequency k = 1, . . . , 10; y-axis: Number of Classes]
Figure 1 The first 10 frequencies (Fk , k = 1, 2, . . ., 10) of the five populations are displayed and
arranged with smallest variation to greatest variation regarding Ni , i = 1, . . . , S.
similar as the sampling fraction approaches 0. Figure 2 also confirms that the difference between C¯ adj
and C¯ GT is very slight at small sampling fractions for all populations.
Figure 3 presents the results for assessing the performance of the Shannon index estimators in terms
of bias (the average difference between the estimates and the H’s) and sample root mean squared error
(RMSE) of each estimator from the 5000 resulting estimates for the five populations across sampling
fractions. The figure shows that the proposed estimator outperforms HMLE under both measures for
all populations except Populations 4 and 5 when p ≥ 0.5, where HMLE is better at both measures than
the proposed estimator Hp , though the discrepancies between the two estimators are not of practical
importance. As noted in Section 4.2, the HMLE ignores species not observed in the sample and exhibits
a negative bias as shown in (12), so the underestimation of HMLE in all populations in Fig. 3 is expected.
Nevertheless, the magnitude of the bias of HMLE decreases with the sampling fraction as it eventually
converges to the true H. Since the estimator HCS by Chao and Shen (2003) was developed in the
context of sampling with replacement, the estimator can result in a substantial bias when data are
sampled without replacement and the sampling fraction is non-negligible. In Population 1, note that
the number of singletons is F1 = 9225 out of S = 9595 classes, so the performance of HCS is worse than
[Figure 2 (caption not recovered): panels for C+, C1, and C2 by population (Populations 3 to 5 legible); legend: C+ or Cr, Cadj, CGT; y-axis: Estimate; x-axis ticks: 5, 10, 20, . . . , 90]
[Figure 3: panels of Bias (left) and RMSE (right) for HMLE, Hp, and HCS in Populations 1 to 5, with H̄ = 9.15, 4.54, 4.08, 6.55, and 7.25, respectively; x-axis ticks: 5, 10, 20, . . . , 90]
Figure 3 Comparison of the biases of HMLE , Hp , and HCS (left) and their RMSEs (right), where the
five populations of Ni , i = 1, . . . , S are arranged with smallest variation in the top row to greatest
variation in the bottom row.
the proposed estimator under both measures. HCS gives similar or better results than
the proposed estimator in some situations (e.g., Population 2 when p ≤ 0.30, Populations 3 and
5 when p = 0.05, and Population 4 for p ≤ 0.10). However, when the sampling fraction is large, Hp
outperforms HCS in both measures. Thus, the proposed estimator appears to be more reliable than HCS
in general applications of sampling without replacement. In summary, compared to HCS , the proposed
estimator has similar results for small sampling fractions (p ≤ 0.30) but is superior in both measures
when the sampling fraction is large. Moreover, the proposed estimator is generally better than the
MLE in terms of the RMSE, but the advantage of the proposed estimator compared with the MLE
diminishes as the variation over Ni s increases as in Populations 4 and 5.
Figure 4 displays the average of 5000 observed numbers of new species (denoted $\bar{S}_m$) as a reference
(dashed lines in the figure) versus the averages of 5000 estimates using Ŝm (proposed here, with average
denoted $\bar{\hat{S}}_m$) and S̃m (from Shen et al. (2003), with average denoted $\bar{\tilde{S}}_m$) for selected sizes of subsequent samples m.
When the sampling fraction p is large, most species are observed in the sample and the number of new
species in any subsequent sample is small relative to the species richness. Thus, in contrast with the
previous settings for the Shannon index, we consider three sampling fractions (p = 0.1, 0.2, 0.3) and
six ratios of the subsequent sample size m to the sample size n (0.2 to 1.2 with increments of 0.2) over
the five populations. As remarked in Section 4.3, both estimators Ŝm and S̃m result in the same leading
term in their binomial expansions. Figure 4 shows that $\bar{\hat{S}}_m$ almost coincides with $\bar{\tilde{S}}_m$. Although there is
a slight difference between $\bar{\hat{S}}_m$ and $\bar{\tilde{S}}_m$ and both estimators have positive biases for Population 3, the
magnitudes of these discrepancies are negligible. Consequently, both estimators appear to be
generally satisfactory for all populations.
In sum, since CGT is derived assuming sampling with replacement, there is an inherent bias when CGT
is applied to data that are sampled without replacement when the sampling fraction p is moderate. In
contrast, the proposed estimator Cadj takes the sampling fraction into account, and its performance is
demonstrably more robust than CGT to the sampling fraction. This simulation study clearly shows the
necessity of adjusting CGT when sampling without replacement, and the proposed estimator appears
to successfully generalize CGT to account for finite population sampling.
7 Conclusion
Good–Turing frequency estimation has been used in a wide range of disciplines, but the method
was developed under the assumption that sampling is done with replacement. This paper proposes an
adjustment to the Good–Turing method to account for the common situation in which sampling is done
without replacement from a finite population. The adjusted estimator inherits several characteristics
from the Good–Turing estimator; in particular, it reduces to the original Good–Turing estimator when
the sampling fraction approaches zero.
In this article we also presented three extensions of the proposed estimators. (1) Estimating the
number of classes in a population is recognized as a fundamental problem in various disciplines.
(2) Estimating the Shannon index has applications in monitoring diversity in an ecosystem. And (3)
predicting the number of new species in a subsequent sample can be used to inform the cost effectiveness
of taking an additional sample. Other applications involving frequency estimation may also benefit from
the adjusted Good–Turing estimator. Although this study has pointed out some useful
applications of the modified method proposed here, there are certainly other methods that
address finite population sampling. As an example, for estimating the number of
classes, Haas and Stokes (1998) consider the general jackknife procedure and propose some estimators
that outperform the estimator Ŝ when the squared coefficient of variation $\gamma^2$ is
large ($\gamma^2 > 1$). Additionally, Valiant and Valiant (2011) provided a numerical algorithm for estimating
the support size of a distribution that can be applied to this topic as well. More recently, Chao
[Figure 4: panels by population and sampling fraction (Population 5 panel titles legible: p = 0.1, p = 0.2, p = 0.3); legend: observed, proposed, and Shen et al. (2003); x-axis: m/n from 0.2 to 1.2]
Figure 4 Observed ($\bar{S}_m$) and estimated ($\bar{\hat{S}}_m$, proposed, and $\bar{\tilde{S}}_m$, by Shen et al. (2003)) average numbers
of new species in a subsequent sample of size m, where the four abundance models are arranged with
smallest variation in the top row to greatest variation in the bottom row and three sampling fractions
are located in the left (p = 0.1), middle (p = 0.2), and right (p = 0.3) columns.
and Lin (2012) developed a lower bound estimator that is very robust and, in particular, can be
accurate enough to serve as a point estimator when $\gamma^2$ is small or the sampling fraction is large. We believe
that more promising methods for estimating the number of classes and other extensions of interest are
worth pursuing in future research. A referee has also suggested a simple weighted estimator that
may be applicable to our extensions in practice. As an illustration of such a weighted approach, consider
$\hat{S}_w = p\hat{S} + q\hat{S}_{\infty}$ for estimating the number of classes, where Ŝ∞ can be any valid estimator derived
under the infinite population assumption. Clearly, Ŝw reduces to Ŝ∞ when p is 0. Consequently, the weighted
estimator may outperform Ŝ if Ŝ∞ is well selected. In an additional simulation study (unreported),
we compared the estimator Ŝ with the weighted estimator Ŝw , in which Ŝ∞ is the
estimator of Cecconi et al. (2012). The resulting weighted estimator outperforms the estimator Ŝ
in terms of estimation bias and RMSE when $\gamma^2$ is large; nevertheless, its performance
deteriorates in some test populations. Consequently, a more complete investigation
from theoretical and practical perspectives is needed to characterize the performance of
such weighted estimators. Note that the weighted procedure can also be applied to entropy estimation and
other extensions.
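The weighted approach described above amounts to a convex combination of two estimates. A minimal sketch, with a hypothetical function name and illustrative input values, is:

```python
def weighted_estimate(s_finite, s_infinite, p):
    """Convex combination p * S_hat + q * S_inf with q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("sampling fraction p must lie in [0, 1]")
    return p * s_finite + (1.0 - p) * s_infinite

# Hypothetical estimates from a finite-population and an infinite-population method.
s_w = weighted_estimate(120.0, 150.0, p=0.25)
```

The weight on the finite-population estimate grows with the sampling fraction, so the combination leans on the infinite-population estimator only when the sample is a small part of the population.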
In classical sampling theory, the finite population correction (FPC) factor only features in the
variance estimator if the parameter of interest focuses on the population mean whose estimator
is free of FPC. Nevertheless, we found that FPC emerges not only in the variance estimator but
also in the frequency estimator and the practical extensions in Section 4. Although we also found
that the estimator by Shen et al. (2003) is reasonable in the context of sampling without replace-
ment, the proposed method represents a general framework for reliable inference. An interesting
question for a further study concerns determining conditions under which FPC features in point
estimation.
Note that source codes to reproduce all figures and (Supporting Information) tables are
available as Supporting Information on the journal’s web page
(http://onlinelibrary.wiley.com/doi/10.1002/bimj.201300168/suppinfo).
Acknowledgments The authors thank Professor Anne Chao for inspiring the topic and Roman Gulati for
generous assistance editing the manuscript. This work was supported by the National Science Council of
Taiwan.
Conflict of interest
The authors have declared no conflict of interest.
\[ E\{\sqrt{n}(\hat{C}_r - C_r)\}^2 = E\left\{ \sqrt{n} \sum_i \frac{N_i - r}{N} I(X_i = r) - \sqrt{n} \sum_i \frac{q(r+1)}{n} I(X_i = r+1) \right\}^2 = T_1 + T_2 - 2T_3, \]
where $T_1 = E\{\sqrt{n} \sum_i \frac{N_i - r}{N} I(X_i = r)\}^2$, $T_2 = E\{\sqrt{n} \sum_i \frac{q(r+1)}{n} I(X_i = r+1)\}^2$, and
$T_3 = E\{n \sum_{i \ne j} \frac{N_i - r}{N} \frac{q(r+1)}{n} I(X_i = r) I(X_j = r+1)\}$.
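\[ = \frac{q^2 (r+1)(r+2)}{n} E(f_{r+2}) + \frac{q(r+1)}{N} E(f_{r+1}) + O\left(\frac{\eta_s}{N}\right) + T_4, \]
where $T_4 = n E\{\sum_{i \ne j} \frac{(N_i - r)(N_j - r)}{N^2} I(X_i = r, X_j = r)\}$. Moreover, we have
\[ \begin{aligned}
T_4 &= n \sum_{i \ne j} \frac{(N_i - r)(N_j - r)}{N^2} \cdot \frac{\binom{N_i}{r} \binom{N_j}{r} \binom{N - N_i - N_j}{n - 2r}}{\binom{N}{n}} \\
&= n \sum_{i \ne j} \frac{(r+1)^2}{N^2} \cdot \frac{\binom{N_i}{r+1} \binom{N_j}{r+1} \binom{N - N_i - N_j}{n - 2r - 2}}{\binom{N}{n}} \cdot \frac{(N - N_i - N_j - n + 2r + 2)(N - N_i - N_j - n + 2r + 1)}{(n - 2r - 1)(n - 2r)} \\
&= \frac{q^2 (r+1)^2}{n} \sum_{i \ne j} \Pr(X_i = X_j = r+1) \left\{ \frac{\left(1 - \frac{N_i + N_j - 2r - 2}{Nq}\right) \left(1 - \frac{N_i + N_j - 2r - 1}{Nq}\right)}{\left(1 - \frac{2r+1}{n}\right) \left(1 - \frac{2r}{n}\right)} \right\}.
\end{aligned} \]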
Next, $T_2 = \frac{q^2 (r+1)^2}{n} E(f_{r+1}) + T_5$, where $T_5 = \sum_{i \neq j} \frac{q^2 (r+1)^2}{n} \Pr(X_i = r+1, X_j = r+1)$. Similarly, it can
be shown that
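\[
\begin{aligned}
T_3 &= \sum_{i \neq j} \frac{q (r+1)^2}{N}\,
\frac{\binom{N_i}{r+1}\binom{N_j}{r+1}\binom{N - N_i - N_j}{n - 2r - 2}}{\binom{N}{n}}\,
\frac{N - (N_i + N_j + n - 2r - 2)}{n - 2r - 1} \\
&= \frac{q^2 (r+1)^2}{n} \sum_{i \neq j} \Pr(X_i = X_j = r+1)
\left\{ \frac{1 - \frac{N_i + N_j - 2r - 2}{Nq}}{1 - \frac{2r+1}{n}} \right\}.
\end{aligned}
\]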
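As a result,
\[
\begin{aligned}
T_4 + T_5 - 2T_3 &= T_5 \left( -\frac{1}{Nq} - \frac{1}{n} \right) + O\!\left(\frac{\eta_s^2}{N}\right) \\
&= -E\left\{ \sum_{i \neq j} \frac{q (r+1)^2}{n^2}\, I(X_i = r+1, X_j = r+1) \right\} + O\!\left(\frac{\eta_s^2}{N}\right) \\
&= -E\left\{ \frac{q (r+1)^2 f_{r+1} (f_{r+1} - 1)}{n^2} \right\} + O\!\left(\frac{\eta_s^2}{N}\right).
\end{aligned}
\]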
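The factor multiplying T5 can be recovered by a first-order expansion of the bracketed ratios in T4 and T3. The following is an informal sketch rather than the authors' stated argument; it introduces the shorthand $a_{ij} = (N_i + N_j - 2r - 2)/(Nq)$, assumes $Nq = N - n$, and drops all second-order terms:

```latex
\[
\frac{(1 - a_{ij})\left(1 - a_{ij} - \frac{1}{Nq}\right)}
     {\left(1 - \frac{2r+1}{n}\right)\left(1 - \frac{2r}{n}\right)}
+ 1
- 2\,\frac{1 - a_{ij}}{1 - \frac{2r+1}{n}}
\;\approx\;
\left(1 - 2a_{ij} - \frac{1}{Nq} + \frac{4r+1}{n}\right) + 1
- \left(2 - 2a_{ij} + \frac{4r+2}{n}\right)
= -\frac{1}{Nq} - \frac{1}{n}
= -\frac{Nq + n}{Nqn}
= -\frac{1}{qn}.
\]
```

Multiplying $-1/(qn)$ by the common factor $q^2 (r+1)^2 / n$ gives $-q(r+1)^2 / n^2$, which is the coefficient appearing in the last two lines of the display above.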
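Combining the expressions for $T_1$, $T_2$, and $T_4 + T_5 - 2T_3$, we obtain
\[
E\{\sqrt{n}(\hat{C}_r - C_r)\}^2 \approx
E\left[ \frac{1}{n} \left\{ q(r+1)(1 + qr) f_{r+1} + (r+1)(r+2) q^2 f_{r+2} \right\} \right]
- E\left\{ \frac{q (r+1)^2 f_{r+1} (f_{r+1} - 1)}{n^2} \right\}.
\]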
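As an illustration, the approximation above can be turned into a plug-in estimate by replacing the expectations with the observed frequency counts. The Python sketch below does this under stated assumptions: q = 1 − n/N, the point estimate is q(r + 1)f_{r+1}/n (the second sum in the decomposition at the start of this appendix), and the plug-in step is a simplification made here for illustration. The function name `good_turing_finite` is hypothetical and does not refer to the code in the Supporting Information.

```python
from collections import Counter


def good_turing_finite(sample_labels, N, r):
    """Return (C_hat_r, plug-in approximate mean squared error of C_hat_r).

    sample_labels : class labels of the n sampled objects
    N             : total population size (assumed known)
    r             : sample frequency of interest (r = 0 targets unseen classes)
    """
    n = len(sample_labels)
    if not 0 < n <= N:
        raise ValueError("sample size must satisfy 0 < n <= N")
    q = 1 - n / N                                # assumed definition of q

    class_counts = Counter(sample_labels)        # X_i for each observed class
    f = Counter(class_counts.values())           # f_k = number of classes seen k times
    f_r1, f_r2 = f[r + 1], f[r + 2]              # missing keys count as 0

    c_hat = q * (r + 1) * f_r1 / n

    # Right-hand side of the approximation for E{sqrt(n)(C_hat_r - C_r)}^2,
    # with expectations replaced by the observed f_{r+1} and f_{r+2}.
    scaled = (
        (q * (r + 1) * (1 + q * r) * f_r1 + (r + 1) * (r + 2) * q**2 * f_r2) / n
        - q * (r + 1) ** 2 * f_r1 * (f_r1 - 1) / n**2
    )
    mse = scaled / n                             # undo the sqrt(n) scaling
    return c_hat, mse


if __name__ == "__main__":
    # Hypothetical sample of n = 10 objects from a population of N = 40.
    c_hat, mse = good_turing_finite(list("aabbcdeeef"), N=40, r=0)
    print(c_hat, mse)
```

When n is much smaller than N, q is close to 1 and the point estimate approaches the classical Good–Turing form (r + 1)f_{r+1}/n.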
References
Basharin, G. P. (1959). On a statistical estimate for the entropy of a sequence of independent random variables.
Theory of Probability and Its Applications 4, 333–336.
Chao, A. and Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of the American
Statistical Association 87, 210–217.
Chao, A., Lee, S. M. and Jeng, S. L. (1992). Estimating population size for capture-recapture data when capture
probabilities vary by time and individual animal. Biometrics 48, 201–216.
Chao, A. and Lin, C. W. (2012). Nonparametric lower bounds for species richness and shared species richness
under sampling without replacement. Biometrics 68, 912–921.
Chao, A. and Shen, T. J. (2003). Nonparametric estimation of Shannon’s index of diversity when there are unseen
species. Environmental and Ecological Statistics 10, 429–443.
Chen, S. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer
Speech and Language 13, 310–318.
Condit, R., Hubbell, S. P. and Foster, R. B. (1996). Changes in a tropical forest with a shifting climate: results
from a 50-ha permanent census plot in Panama. Journal of Tropical Ecology 12, 231–256.
Colwell, R., Chao, A., Gotelli, N., Lin, S. and Mao, C. (2012). Models and estimators linking individual-based
and sample-based rarefaction, extrapolation, and comparison of assemblages. Journal of Plant Ecology 5,
3–21.
Cecconi, L., Gandolfi, A. and Sastri, C. C. A. (2012). A new estimator for the number of species in a population.
Sankhya A 74, 80–100.
Church, K. W. and Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational
Linguistics 16, 22–29.
Esty, W. (1983). A normal limit law for a nonparametric estimator of the coverage of a random sample. The
Annals of Statistics 11, 905–912.
Esty, W. (1985). Estimation of the number of classes in a population and the coverage of a population. Mathematical
Scientist 10, 41–50.
Evert, S. and Baroni, M. (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual
Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pages
29–32, Prague, CZ (R package version 0.6-6 of 2012-04-03).
Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943). The relation between the number of species and the
number of individuals in a random sample of an animal population. Journal of Animal Ecology 12, 42–58.
Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the
American Statistical Association 102, 359–378.
Good, I. J. (1953). The population of frequencies of species and the estimation of population parameters.
Biometrika 40, 45–63.
Good, I. J. (2000). Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval
Enigma. Journal of Statistical Computation and Simulation 66, 101–111.
Goodman, L. A. (1949). On the estimation of the number of classes in a population. Annals of Mathematical
Statistics 20, 572–579.
Haas, P. J. and Stokes, L. (1996). Estimating the number of classes in a finite population. IBM Research Report
RJ 10025, IBM Almaden Research Center, San Jose, CA, Revised March 1998.
Haas, P. J. and Stokes, L. (1998). Estimating the number of classes in a finite population. Journal of the American
Statistical Association 93, 1475–1487.
Haas, P. J., Liu, Y. and Stokes, L. (2006). An estimator of number of species from quadrat sampling. Biometrics
62, 135–141.
Hausser, J. and Strimmer, K. (2009). Entropy inference and the James-Stein estimator, with application to
nonlinear gene association networks. Journal of Machine Learning Research 10, 1469–1484.
Jelinek, F. (1998). Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.
Johnson, N. L., Kotz, S. and Kemp, A. W. F. (1992). Univariate Discrete Distributions (2nd edn.). Wiley, New
York, NY.
Kucera, H. and Francis, W. N. (1967). Computational Analysis of Present-day American English. Brown University
Press, Providence, RI.
Lo, S. H. (1992). From the species problem to a general coverage problem via a new interpretation. The Annals of
Statistics 20, 1094–1109.
Magurran, A. E. (1988). Ecological Diversity and Its Measurement. Princeton University Press, Princeton, NJ.
McAllester, D. and Schapire, R. E. (2000). On the convergence rate of Good–Turing estimators. In Proc. 13th
Annu. Conference on Comput. Learning Theory. Morgan Kaufmann, San Francisco, CA, 1–6.
Miller, R. I. and Wiegert, R. G. (1989). Documenting completeness, species-area relations, and the species-
abundance distribution of a regional flora. Ecology 70, 16–22.
Mingoti, S. A. and Meeden, G. (1989). Estimating the total number of distinct species using presence and absence
data. Biometrics 48, 863–875.
Orlitsky, A., Santhanam, N. P. and Zhang, J. (2003). Always Good Turing: Asymptotically optimal probability
estimation. Science 302, 427–431.
Shen, T. J., Chao, A. and Lin, J. F. (2003). Predicting the number of new species in further taxonomic sampling.
Ecology 84, 798–804.
Shlosser, A. (1981). On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering
Cybernetics 19, 97–102.
Song, F. and Croft, W. (1999). Research and Development in Information Retrieval. ACM Press, New York, NY.
Valiant, G. and Valiant, P. (2011). Estimating the unseen: an n/log(n)-sample estimator for entropy and support
size, shown optimal via new CLTs. In Proceedings of the forty-third annual ACM symposium on Theory of
computing (STOC’11), 685–694. ACM, New York, NY, USA.
Wagner, A. B., Viswanath, P. and Kulkarni, S. R. (2006). Strong consistency of the Good–Turing estimator. IEEE
Symposium on Information Theory Proceeding, July 2006, 2526–2530.
Zhang, C.-H. and Zhang, Z. (2009). Asymptotic normality of a nonparametric estimator of sample coverage. The
Annals of Statistics 37, 2582–2595.