Hwang Good-Turing Frequency Estimation in A Finite Population 2014

This document presents a modified Good-Turing frequency estimation method to address the limitations of existing studies that assume sampling with replacement or infinite populations. The authors propose three practical extensions of this modified method and evaluate its performance through simulation experiments. The study emphasizes the importance of accounting for finite population sampling in various applications, including ecology and information retrieval.

Biometrical Journal 57 (2015) 2, 321–339 DOI: 10.1002/bimj.201300168

Good–Turing frequency estimation in a finite population

Wen-Han Hwang¹, Chih-Wei Lin², and Tsung-Jen Shen∗,¹
¹ Institute of Statistics and Department of Applied Mathematics, National Chung Hsing University,
Taichung 40227, Taiwan
² Department of Leisure Services Management, Chaoyang University of Technology, Taichung
41349, Taiwan

Received 20 August 2013; revised 13 March 2014; accepted 5 September 2014

Good–Turing frequency estimation (Good, 1953) is a simple, effective method for predicting detection
probabilities of objects of both observed and unobserved classes based on observed frequencies of
classes in a sample. The method has been used widely in several disciplines, such as information
retrieval, computational linguistics, text recognition, and ecological diversity estimation. Nevertheless,
existing studies assume sampling with replacement or sampling from an infinite population, which
might be inappropriate for many practical applications. In light of this limitation, this article presents
a modification of the Good–Turing estimation method to account for finite population sampling. We
provide three practical extensions of the modified method, and we examine performance of the modified
method and its extensions in simulation experiments.

Keywords: Frequency estimation; Finite population; Good–Turing; Number-of-classes estimation;
Sample coverage; Shannon index.

Additional supporting information may be found in the online version of this article
at the publisher’s web-site

1 Introduction
It is straightforward to present the problem that the Good–Turing method (Good, 1953) was developed
to solve. Consider a population that consists of an unknown number of classes $S$, and let $N_i$ denote
the number of objects in each class $i = 1, \ldots, S$ so that $N = \sum_{i=1}^{S} N_i$ is the total
population size. If we take a simple random sample of size $n$ from this population, we can tabulate
the frequency $X_i$ of each class $i$ present in the sample. If $S$ were known, we could represent these
frequencies (some of which may be zero) as $X_1, \ldots, X_S$ so that $n = \sum_{i=1}^{S} X_i$ is the
total sample size. But because $S$ is unknown, we can only observe $X_i$ whenever $X_i \ge 1$. In this
setting, a common research goal is to estimate the proportion of the total population that each class
represents. This research goal can also be framed as wanting to predict the proportion $C_r$ of the
total population that all classes with frequency $r$ in the sample represent, which can be expressed as

$$C_r = \sum_{i=1}^{S} \frac{N_i}{N}\, I(X_i = r), \quad r = 0, 1, \ldots, \tag{1}$$

where $I(\cdot)$ is the usual indicator function. Note that the prediction is equivalent to estimating
$E(C_r)$, that is, the expected proportion of the total population represented by classes with
frequency $r$ in the sample. If there is no confusion, we will follow Good (1953) and use estimation
terminology in this work. Primary interest in practice is usually focused on small frequencies
(Good, 2000), with particular interest in the proportion of unobserved classes $C_0$. The complement
of this quantity, $C_+ = 1 - C_0$, represents the proportion of observed classes and is often called
the sample coverage in the literature. Good (1953) first addressed the issue of estimating $E(C_r)$
and proposed an empirical Bayesian approach. As mentioned in Good (1953, 2000), the method was
motivated by the founder of modern computer science and Good's mentor, Dr. Alan Turing, so that it
is generally called the Good–Turing frequency estimation method.

∗ Corresponding author: e-mail: [email protected]
© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

The Good–Turing method has been applied successfully in several disciplines, such as information
retrieval (Song and Croft, 1999), computational linguistics (Church and Hanks, 1990), speech
recognition (Jelinek, 1998; Chen and Goodman, 1999), species richness estimation (Esty, 1985; Chao
and Lee, 1992), population size estimation (Chao et al., 1992), Shannon entropy estimation (Chao and
Shen, 2003), and missile coverage estimation (Lo, 1992). On theoretical aspects of the method, Esty
(1983) and Zhang and Zhang (2009) obtained conditions for asymptotic normality of the sample
coverage estimator, Orlitsky et al. (2003) addressed an optimal property based on information theory,
and McAllester and Schapire (2000) and Wagner et al. (2006) established several consistency properties.
The research literature on the Good–Turing method is rich; however, existing studies focus on
sampling with replacement, which is equivalent to sampling from an infinite population or from a
finite population when the sampling fraction is negligible. (Hereafter, we refer to this general
situation as sampling with replacement.) In fact, to the best of our knowledge, implications of
applying the Good–Turing method when sampling without replacement in a finite population have
not previously been studied. One might expect that this method may result in substantial bias when
the sample is taken without replacement from a finite population and the sampling fraction cannot
be ignored. Examples of bias due to failing to account for sampling without replacement in species
richness estimation can be found in Haas and Stokes (1998), Haas et al. (2006), and Chao and Lin
(2012).
In this study, either the total population size $N$ or the sampling fraction $p = n/N$ is assumed to
be given. This information is available in a variety of applications; for instance, the total number of
accounts open at a bank or on a website is usually known by managers; however, the total does not
represent the number of registered persons because some individuals may have multiple accounts.
Quadrat sampling in ecology surveys provides another example. Because plants are sedentary,
quadrat sampling of plants often involves random sampling without replacement from a division with
a known number of quadrats. More detailed examples are given in Haas and Stokes (1998).
Section 2 briefly reviews the Good–Turing method under sampling with replacement. Section 3
modifies the method to account for sampling without replacement. Section 4 presents three extensions
of the modified method: estimating the number of classes, estimating the Shannon entropy diversity
index, and predicting the number of new species in a subsequent sample. Section 5 first examines the
performance of proposed frequency and interval estimators by resampling from empirical data on rare
vascular plants then analyzes the entire dataset as a single sample. Section 6 conducts an extensive
simulation study to evaluate the performance of the proposed methods including two of the extensions
in Section 4. Section 7 concludes.

2 Classical method
In the classical Good–Turing method, let $p_i = N_i/N$ be the probability of observing an object from
the $i$th class for $i = 1, \ldots, S$. Moreover, let $f_j = \sum_{i=1}^{S} I(X_i = j)$ be the number of
classes in the population with frequency $j$ in the sample so that $n = \sum_{j \ge 1} j f_j$ is the
total sample size. Note that $f_0$ is unobservable here. Let $D = \sum_{j \ge 1} f_j$ be the number of
distinct classes in the sample; without loss of generality, we arrange $X_i \ge 1$ to be indexed as
$i = 1, \ldots, D$. If the sampling is conducted with replacement, theoretical tabulated frequencies
of the random sample $(X_1, \ldots, X_S)$ can be reasonably assumed to follow a multinomial distribution

$$\Pr(X_1 = x_1, \ldots, X_S = x_S) = \binom{n}{x_1, \ldots, x_S}\, p_1^{x_1} \cdots p_S^{x_S},$$

where $x_i \ge 0$ for $i = 1, \ldots, S$ and $n = \sum_{i=1}^{S} x_i$. It is straightforward to find that
the maximum likelihood estimator (MLE) of $p_i$ is $X_i/n$, and hence the MLE of $E(C_r)$ is $r f_r/n$.
Unfortunately, the estimator is problematic since it does not reflect the possible existence of classes
not observed in the sample. As a simple example, the MLE always estimates that $E(C_0)$ is zero and no
interval estimator is available. For these reasons, the MLE is generally avoided in this setting.
By noticing that $X_i$ follows a binomial distribution $B(n, p_i)$ with $n$ trials and success
probability $p_i$, the expectation of $C_r$ can be decomposed into

$$E(C_r) = \frac{r+1}{n-r} \sum_{i=1}^{S} \Pr(X_i = r+1) - \frac{(r+2)(r+1)}{(n-r)(n-r-1)} \sum_{i=1}^{S} \Pr(X_i = r+2)(1 - p_i),$$

where the latter is dominated by $O\left(E(f_{r+2})/n^2\right) = O(1/n)$. Consequently, we have

$$E(C_r) = \frac{r+1}{n}\, E(f_{r+1}) + O\left(\frac{1}{n}\right).$$

When $n$ is sufficiently large, this yields a moment estimator

$$\hat{C}_r = \frac{(r+1) f_{r+1}}{n}, \tag{2}$$

for small $r$. In particular, the proportion of unobserved classes can be estimated by
$\hat{C}_0 = f_1/n$ and the sample coverage can be estimated by $\hat{C}_+ = 1 - f_1/n$. Note that here
$C_0$ can be interpreted as the probability of observing in a subsequent sample an object whose class
was not observed in this sample. The estimator (2) is the classical Good–Turing estimator, though it
was originally derived using a rather complicated empirical Bayesian approach (Good, 1953).
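As a minimal sketch of estimator (2) and of the MLE it replaces, the following Python functions take the frequency counts $f_j$ as a dictionary. The function names are our own illustrative choices; the numbers are the leading frequency counts and sample size of the rare plant data analysed in Section 5 ($f_1 = 61$, $f_2 = 35$, $f_3 = 18$, $n = 1008$).

```python
def good_turing_proportion(freq_counts, n, r):
    """Classical Good-Turing estimate of E(C_r): (r + 1) * f_{r+1} / n."""
    return (r + 1) * freq_counts.get(r + 1, 0) / n


def mle_proportion(freq_counts, n, r):
    """Maximum likelihood estimate of E(C_r): r * f_r / n."""
    return r * freq_counts.get(r, 0) / n


# Leading frequency counts f_1..f_3 and sample size of the rare plant data.
freq_counts = {1: 61, 2: 35, 3: 18}
n = 1008

c0_hat = good_turing_proportion(freq_counts, n, 0)  # f_1 / n
coverage_hat = 1.0 - c0_hat                         # 1 - f_1 / n

print(round(coverage_hat, 3))                                  # sample coverage
print(round(good_turing_proportion(freq_counts, n, 1), 3))     # Good-Turing C_1
print(mle_proportion(freq_counts, n, 0))                       # the MLE of E(C_0) is always 0
```

The last line illustrates why the MLE is avoided here: it assigns no mass to unobserved classes, whereas the Good–Turing estimate of $E(C_0)$ is positive whenever singletons are present.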
Next, define $\bar{p}_r$ as the average proportion of classes with frequency $r$ in the sample

$$\bar{p}_r = \frac{\sum_{i=1}^{S} p_i\, I(X_i = r)}{\sum_{i=1}^{S} I(X_i = r)}, \quad r = 1, 2, \ldots. \tag{3}$$

Then we may estimate $E(\bar{p}_r)$ by $\hat{\bar{p}}_r = r^*/n$, where $r^* = (r+1) f_{r+1}/f_r$ is an
adjusted frequency for an arbitrary class with frequency $r$ in the sample (Good, 1953). This
formulation clearly elucidates the difference between the Good–Turing method and the maximum
likelihood approach.
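To make the contrast concrete, the sketch below computes the adjusted frequency $r^*$ and the resulting per-class proportion $r^*/n$ next to the maximum likelihood value $r/n$. The function name and the toy frequency table are hypothetical and serve only as an illustration.

```python
def adjusted_frequency(freq_counts, r):
    """Good-Turing adjusted frequency r* = (r + 1) * f_{r+1} / f_r."""
    f_r = freq_counts.get(r, 0)
    if f_r == 0:
        raise ValueError("r* is undefined when no class has frequency r")
    return (r + 1) * freq_counts.get(r + 1, 0) / f_r


# Hypothetical toy sample: 10 singletons, 4 doubletons, 2 tripletons.
toy_counts = {1: 10, 2: 4, 3: 2}
toy_n = sum(k * f for k, f in toy_counts.items())  # n = sum_j j * f_j = 24

for r in (1, 2):
    r_star = adjusted_frequency(toy_counts, r)
    # Per-class proportion: maximum likelihood r / n versus Good-Turing r* / n.
    print(r, r / toy_n, r_star / toy_n)
```

In this toy table $r^* < r$ for both $r = 1$ and $r = 2$, so the Good–Turing method assigns each observed class a smaller proportion than the MLE and leaves the remainder for unobserved classes.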

3 Good–Turing method in a finite population

Assume that either the total population size $N$ or sampling fraction $p = n/N$ is known, define
$q = 1 - p$, and write $F_j = \sum_{i=1}^{S} I(N_i = j)$ for all $j$. Note that $f_j$ converges to $F_j$
as the sampling fraction $p$ increases to 1 since this implies a census of the population is taken.
Note that $F_0 = 0$ and $\sum_{j \ge 1} j F_j = N$ in this setup.
In the context of sampling without replacement, the random vector $(X_1, \ldots, X_S)$ can be reasonably
assumed to follow a multivariate hypergeometric distribution with joint probability mass function

$$\Pr(X_1 = x_1, \ldots, X_S = x_S) = \frac{\binom{N_1}{x_1} \cdots \binom{N_S}{x_S}}{\binom{N}{n}}, \tag{4}$$

where $x_i \ge 0$ for all $i$, $n = \sum_i x_i$, and $\binom{N_i}{r} = 0$ if $N_i < r$. For each $i$, the
marginal probability mass function of $X_i$ follows a hypergeometric distribution

$$\Pr(X_i = k) = \frac{\binom{N_i}{k} \binom{N - N_i}{n - k}}{\binom{N}{n}}. \tag{5}$$
The expected value of $C_r$ gives

$$\begin{aligned}
E(C_r) &= \frac{r}{N} \sum_{i=1}^{S} \Pr(X_i = r) + \frac{r+1}{N} \sum_{i=1}^{S} \frac{N - N_i - n + r + 1}{n - r}\, \Pr(X_i = r+1) \\
&= \frac{pr}{n}\, E(f_r) + \frac{q(r+1)}{n-r}\, E(f_{r+1}) - \frac{(r+1)(r+2)}{(n-r)(n-r-1)} \sum_{i=1}^{S} \Pr(X_i = r+2)\, \delta_i,
\end{aligned}$$

where $\delta_i = \max\{0, (N - N_i - n + r + 2)/N\}$. Note that $0 \le \delta_i < 1$, which can be revealed
by the probability mass function (5) at $k = r + 2$ provided that $\Pr(X_i = r+2)$ is positive.
Therefore the last term is dominated by $O\left(E(f_{r+2})/n^2\right) = O(1/n)$. As a result, we have

$$E(C_r) = \frac{pr}{n}\, E(f_r) + \frac{q(r+1)}{n}\, E(f_{r+1}) + O\left(\frac{1}{n}\right).$$

Ignoring the remainder term $O(1/n)$, we propose the estimator

$$\hat{C}_r = \frac{pr f_r + q(r+1) f_{r+1}}{n}. \tag{6}$$

Thus the proposed estimator is a sampling-fraction-weighted average of the MLE and the Good–Turing
estimator (2), with greater weight given to the latter for smaller sampling fractions. In fact, as
the sampling fraction $p$ approaches 0, the proposed estimator reduces to the Good–Turing estimator.
Conversely, as the sampling fraction $p$ increases to 1, the proposed estimator $\hat{C}_r$ approaches
$r F_r/N$ for all $r$ and, in particular, the sample coverage estimator

$$\hat{C}_+ = 1 - \frac{q f_1}{n} \tag{7}$$

approaches 1 as desired. We note this estimator (7) was first addressed in the appendix of Haas and
Stokes (1996), but their derivation required an assumption on $N_i \approx N/S$ for all $i$, which is
seldom met in practical applications, and it is also not straightforward to extend similar work to
other $C_r$ for $r \ge 1$. In contrast, the derivation here does not require this assumption and the
result is easily explained as a weighted average of the MLE and the classical Good–Turing estimator.
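Estimators (6) and (7) are simple enough to sketch directly. The Python below is an illustration under our own naming (the function names are assumptions, not part of the method); it uses the leading frequency counts of the rare plant data from Section 5 with the sampling fractions considered there.

```python
def finite_population_proportion(freq_counts, n, p, r):
    """Estimator (6): (p * r * f_r + q * (r + 1) * f_{r+1}) / n with q = 1 - p."""
    q = 1.0 - p
    f_r = freq_counts.get(r, 0)
    f_next = freq_counts.get(r + 1, 0)
    return (p * r * f_r + q * (r + 1) * f_next) / n


def finite_population_coverage(freq_counts, n, p):
    """Estimator (7): 1 - q * f_1 / n, i.e. one minus estimator (6) at r = 0."""
    return 1.0 - finite_population_proportion(freq_counts, n, p, 0)


# Leading frequency counts and sample size of the rare plant data (Section 5).
freq_counts = {1: 61, 2: 35, 3: 18}
n = 1008

for p in (0.5, 1 / 3, 0.2, 0.1, 0.0):
    print(
        round(p, 3),
        round(finite_population_coverage(freq_counts, n, p), 3),
        round(finite_population_proportion(freq_counts, n, p, 1), 3),
        round(finite_population_proportion(freq_counts, n, p, 2), 3),
    )
```

Setting `p = 0.0` recovers the classical Good–Turing values, which matches the limiting behaviour described above; the case `p = 1.0` returns the MLE $r f_r/n$.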


When estimating the sample coverage, if we do not account for finite population sampling and apply
the classical Good–Turing method without adjustment, the coverage estimator approaches $1 - F_1/N$
as $p$ increases to 1, representing an inherent bias due to often having $F_1 \neq 0$. Specifically,
the asymptotic bias of the Good–Turing method is $pE(f_1)/n$, which depends on the sampling fraction
$p$ in a finite population. The discrepancy between the limiting behavior of the modified and
classical Good–Turing estimators demonstrates the validity of our adjustment in this extreme case and
supports consideration of the modified estimator in the general case when the sampling fraction
cannot be ignored.
Next, we develop an interval estimator for the proposed estimator $\hat{C}_r$. Since $C_r$ is a random
variable rather than a fixed parameter, it is reasonable to consider the mean squared error of
$\hat{C}_r$ instead of its variance. In the Appendix, we derive

$$E(\hat{C}_r - C_r)^2 \approx \frac{q(r+1)}{n^2} \left\{ (1 + qr) f_{r+1} + q(r+2) f_{r+2} - \frac{1}{n} (r+1) f_{r+1} (f_{r+1} - 1) \right\}. \tag{8}$$

Moreover, given a fixed $p$ and under certain regularity conditions, we have
$(\hat{C}_r - C_r)/\sigma_r \sim N(0, 1)$ when $N$ is large, where $\sigma_r^2$ is the right-hand side of (8).
As a result, an approximate 95% confidence (prediction) interval for $C_r$ has the form

$$\hat{C}_r \pm 1.96\, \sigma_r. \tag{9}$$

Next, analogous to $\bar{p}_r$, we define $\bar{N}_r = \sum_{i=1}^{S} N_i\, I(X_i = r)/f_r$ as the average
population frequency of classes appearing exactly $r$ times in the sample. Rewriting
$\bar{N}_r = N C_r/f_r$ by the definition in (1) and directly using the proposed estimator $\hat{C}_r$
in (6), $E(\bar{N}_r)$ can be estimated directly by
$\hat{\bar{N}}_r = r + (r+1)(f_{r+1}/f_r)(q/p)$ for all $r \ge 1$.
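The interval (9) can be sketched as follows, with $\sigma_r^2$ taken as the right-hand side of (8). The function names are illustrative assumptions; the data are again the leading frequency counts of the rare plant data from Section 5.

```python
import math


def estimated_mse(freq_counts, n, p, r):
    """Right-hand side of (8) for the estimator (6) of C_r."""
    q = 1.0 - p
    f_a = freq_counts.get(r + 1, 0)  # f_{r+1}
    f_b = freq_counts.get(r + 2, 0)  # f_{r+2}
    bracket = (1 + q * r) * f_a + q * (r + 2) * f_b - (r + 1) * f_a * (f_a - 1) / n
    return q * (r + 1) / n**2 * bracket


def prediction_interval(freq_counts, n, p, r):
    """Point estimate (6) and half-width 1.96 * sigma_r of the interval (9)."""
    q = 1.0 - p
    estimate = (p * r * freq_counts.get(r, 0) + q * (r + 1) * freq_counts.get(r + 1, 0)) / n
    # Guard against a slightly negative plug-in value before taking the root.
    half_width = 1.96 * math.sqrt(max(estimated_mse(freq_counts, n, p, r), 0.0))
    return estimate, half_width


def average_population_frequency(freq_counts, p, r):
    """Estimate of E(N-bar_r): r + (r + 1) * (f_{r+1} / f_r) * (q / p), for r >= 1."""
    q = 1.0 - p
    return r + (r + 1) * (freq_counts.get(r + 1, 0) / freq_counts[r]) * (q / p)


freq_counts = {1: 61, 2: 35, 3: 18, 4: 12}
n = 1008

for r in (0, 1, 2):
    estimate, half_width = prediction_interval(freq_counts, n, 0.5, r)
    print(r, round(estimate, 3), round(half_width, 3))
print(round(average_population_frequency(freq_counts, 0.5, 1), 3))
```

For $r = 0$ the function returns the estimate of $C_0$; the interval for the coverage $C_+ = 1 - C_0$ has the same half-width.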
Remark 1. Rather than using the exact distribution (5), we may also obtain the estimators according to
a binomial approximation, which is more tractable for mathematical manipulation. When the population
size $N$ is large and the sampling fraction $p$ is fixed, it is easy to find that (Johnson et al., 1992)

$$\Pr(X_i = k) \approx \binom{N_i}{k} p^k q^{N_i - k}, \quad k = 0, 1, \ldots, N_i. \tag{10}$$

Consequently, the formulation in (6) remains the same even if the derivation is based on the binomial
approximation. For the mean squared error estimator, we may observe that
$\Pr(X_i = k, X_j = \ell) \approx \binom{N_i}{k} p^k q^{N_i - k} \times \binom{N_j}{\ell} p^{\ell} q^{N_j - \ell}$,
where the approximation ignores the dependence of $X_i$ and $X_j$. By adopting this approximation, we
find $E(\hat{C}_r - C_r)^2 \approx q(r+1)\{(1 + qr) f_{r+1} + q(r+2) f_{r+2}\}/n^2$, which differs
from (8) by the term $q(r+1)^2 f_{r+1}(f_{r+1} - 1)/n^3$. The discrepancy is generally minor in our
experience. We conclude that the binomial approximation is convenient for simplifying the computations
because it avoids the factorial and combinatorial calculations in the exact model.
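The quality of approximation (10) is easy to check numerically. The sketch below compares the exact hypergeometric probability (5) with its binomial approximation for one class; the population size, class size, and function names are hypothetical values chosen only for illustration.

```python
from math import comb


def hypergeometric_pmf(k, N, N_i, n):
    """Exact Pr(X_i = k) from (5); zero outside the support."""
    if k < 0 or k > N_i or n - k < 0 or n - k > N - N_i:
        return 0.0
    return comb(N_i, k) * comb(N - N_i, n - k) / comb(N, n)


def binomial_approximation(k, N_i, p):
    """Approximation (10): C(N_i, k) * p^k * q^(N_i - k)."""
    return comb(N_i, k) * p**k * (1.0 - p) ** (N_i - k)


# Hypothetical setting: N = 1000 objects, a class of size N_i = 5, half sampled.
N, N_i, n = 1000, 5, 500
p = n / N

for k in range(N_i + 1):
    exact = hypergeometric_pmf(k, N, N_i, n)
    approx = binomial_approximation(k, N_i, p)
    print(k, round(exact, 5), round(approx, 5))
```

In this setting the two columns agree to about three decimal places, which is consistent with the remark that the discrepancy is generally minor when $N$ is large.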
Remark 2. In the context of sampling without replacement, $C_0$ is no longer the probability of
observing an unseen class in a subsequent sample. Instead, that probability can be represented as
$\sum_{i=1}^{S} N_i\, I(X_i = 0)/(N - n) = N C_0/(N - n)$, and its expectation can be estimated by
$\hat{C}_0/q = f_1/n$. This observation yields an interesting finding: the estimator $f_1/n$ (i.e., the
observed proportion of singletons in a sample) is valid for estimating the probability of observing an
unseen class in a subsequent sample, regardless of whether the data were sampled with or without
replacement.
Remark 3. The asymptotic normality of $\hat{C}_r - C_r$ may be proven following the development in Esty
(1983). Specifically, assume that $Y_i$, $i = 1, \ldots, S$, are independently distributed $B(N_i, p)$
for each $i$. Then the conditional distribution of the $Y_i$s given $n = \sum_{i=1}^{S} Y_i$ is identical
to the distribution of the $X_i$s in (4). Further, define $C_{r,y} = \sum_{i=1}^{S} I(Y_i = r) N_i/N$ and
$\hat{C}_{r,y}$, which are similar to the forms of $C_r$ and $\hat{C}_r$. Then
$\hat{C}_{r,y} - C_{r,y}$ is an unconditional version of $\hat{C}_r - C_r$. Recognizing the asymptotic
normality of $\hat{C}_{r,y} - C_{r,y}$ is straightforward as it only involves a sum of independent
random variables. Then, following a partial inversion formula for characteristic functions (Esty,
1983, Lemma 2), the asymptotic normality of $\hat{C}_r - C_r$ can be established as well.

4 Applications
In this section we present three extensions of the proposed frequency estimation method in the context
of finite population sampling.

4.1 Estimating the number of classes S

First consider a population where all $N_i$s are equal. Then $S = D/C_+$, so that a simple estimator is
$\hat{S}_0 = D/\hat{C}_+ = D/(1 - q f_1/n)$. When the $N_i$s are unequal, we may approximate the
asymptotic bias of $\hat{S}_0$, using the Taylor expansion at $\bar{N} = N/S$ over all $N_i$s in
$E(D)/E(C_+)$ and using the approximation $S \bar{N} p q^{\bar{N} - 1} \approx E(f_1)$, as

$$E(\hat{S}_0 - S) \approx \frac{E(f_1)\, q \log(q)\, \gamma^2}{p\, E(C_+)},$$

where $\gamma$ is the coefficient of variation of the $N_i$s defined by
$\gamma^2 = \sum_{i=1}^{S} (N_i - \bar{N})^2/(S \bar{N}^2)$. Consequently, $\hat{S}_0$ approaches the true
$S$ as $p$ increases to 1; nevertheless, this estimator has a negative bias in general, which can be
seen by noting that $\log(q) < 0$ when $q < 1$. Alternatively, one may simply apply the method of
moments to estimate the bias via $f_1 q \log(q) \hat{\gamma}^2/(p \hat{C}_+)$, where
$\hat{\gamma}^2 = \max\{\hat{S}_0 \sum_k k(k-1) f_k/\{n(n-1)\} - 1,\ 0\}$. The corresponding estimator
of $S$ can be expressed as

$$\hat{S} = \frac{D}{\hat{C}_+} - \frac{f_1 q \log(q)\, \hat{\gamma}^2}{p\, \hat{C}_+}. \tag{11}$$

Similar to $\hat{S}_0$, this estimator also converges to $S$ as $p$ increases to 1. For the other
extreme case, since $\lim_{p \to 0} q \log(q)/p = -1$, the estimator reduces to

$$\hat{S}_{CL} = \frac{D}{1 - f_1/n} + \frac{f_1 \hat{\gamma}^2}{1 - f_1/n}$$

as $p$ approaches 0. This reduced estimator is equivalent to the popular abundance-based coverage
estimator (ACE) from Chao and Lee (1992). This equivalence is not surprising because the ACE was
derived in the same manner. Seen in this light, the estimator (11) is a generalization of the ACE.
Further, and perhaps more interestingly, the estimator is equivalent to the unsmoothed second-order
estimator by Haas and Stokes (1998), which was constructed from a generalized jackknife procedure.
Through an extensive simulation study, Haas and Stokes (1998) showed the estimator (11) generally
outperforms several other often-used methods when $\gamma^2 < 1$.
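A minimal sketch of estimator (11) follows, written directly from the formulas above. The function name and the toy frequency table are hypothetical; the two boundary cases $p = 0$ and $p = 1$ are handled through the limits of $q \log(q)/p$ stated in the text.

```python
import math


def estimate_number_of_classes(freq_counts, p):
    """Estimator (11): D / C_+ - f_1 * q * log(q) * gamma^2 / (p * C_+)."""
    n = sum(k * f for k, f in freq_counts.items())   # sample size
    D = sum(freq_counts.values())                    # observed classes
    f1 = freq_counts.get(1, 0)
    q = 1.0 - p
    coverage = 1.0 - q * f1 / n                      # estimator (7)
    if coverage <= 0.0:
        raise ValueError("estimated coverage is zero; estimator undefined")
    s0 = D / coverage
    moment = sum(k * (k - 1) * f for k, f in freq_counts.items())
    gamma2 = max(s0 * moment / (n * (n - 1)) - 1.0, 0.0)
    # q * log(q) / p tends to -1 as p -> 0 and equals 0 at p = 1.
    if p == 0.0:
        weight = -1.0
    elif q == 0.0:
        weight = 0.0
    else:
        weight = q * math.log(q) / p
    return s0 - f1 * weight * gamma2 / coverage


# Hypothetical toy sample: 10 singletons, 4 doubletons, 2 tripletons, 1 class seen 10 times.
toy_counts = {1: 10, 2: 4, 3: 2, 10: 1}

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(estimate_number_of_classes(toy_counts, p), 2))
```

At `p = 0.0` the function returns the reduced form $\hat{S}_{CL}$, and at `p = 1.0` it returns $D$, in line with the two limiting cases discussed above.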

4.2 Estimating the Shannon entropy diversity index

The Shannon index is used widely for measuring the diversity of an ecological assemblage (Magurran,
1988). It is defined by

$$H = -\sum_{i=1}^{S} p_i \log p_i,$$

where $p_i = N_i/N$. Applying the MLE of $p_i$, denoted $\hat{p}_i = X_i/n$, suggests a naive estimator
of the Shannon index

$$\hat{H}_{MLE} = -\sum_{i=1}^{S} I(X_i \ge 1)\, \hat{p}_i \log \hat{p}_i.$$

However, this MLE-based estimator may be inappropriate because it does not account for the $p_i$s of
classes unseen in the sample and so generally exhibits a negative bias. This bias can be confirmed by
expanding $\hat{H}_{MLE}$ with respect to $\hat{p}_i$ in a Taylor series about the point $p_i$, which
yields

$$E(\hat{H}_{MLE}) = H - \frac{q(S-1)}{2n} + O\left(\frac{1}{n^2}\right). \tag{12}$$

As the sampling fraction approaches 0 (or the population increases without bound), this expression
is almost the same as that in Basharin (1959). Thus, the MLE of $H$ can result in a negative bias for
sampling both with and without replacement.
A Horvitz-Thompson type estimator for $H$ can be expressed as

$$H = E\left( -\sum_{i=1}^{S} \frac{I(X_i \ge 1)\, p_i \log p_i}{\Pr(X_i \ge 1)} \right); \tag{13}$$

thus there is an unbiased estimator of $H$ provided that the $p_i$s of observed classes (and so
$\Pr(X_i \ge 1)$) are known. Since not all $p_i$s are known in many practical situations, research
questions about estimating $H$ in the form of (13) are focused on how to accurately estimate the
$p_i$s belonging to observed classes. Assuming sampling with replacement, Chao and Shen (2003)
suggested using the sample coverage ($\hat{C}_+$) to correct the estimation of $p_i$s of observed
classes in (13) and proposed a widely used estimator

$$\hat{H}_{CS} = -\sum_{k=1}^{n} f_k\, \frac{\tilde{\pi}_k \log \tilde{\pi}_k}{1 - (1 - \tilde{\pi}_k)^n},$$

where $\tilde{\pi}_k = (1 - f_1/n)k/n$. Note that the estimator does not converge to the true $H$ when
$p = 1$ (i.e., under a complete census) and $F_1 \neq 0$. Thus, there is an inherent bias for the
estimator when sampling without replacement.
Similar to our proposed modification of frequency estimation to account for sampling without
replacement, we propose the estimator of the Shannon index

$$\hat{H}_p = -\sum_{k=1}^{n} f_k\, \frac{\hat{\pi}_k \log \hat{\pi}_k}{1 - q^{N \hat{\pi}_k}},$$

where $\hat{\pi}_k = \hat{C}_+ k/n = (1 - f_1 q/n)k/n$. Taking the sampling fraction into account, the
proposed estimator will theoretically converge to the true $H$ as $p$ increases to 1.
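The three Shannon index estimators of this subsection can be sketched from the frequency counts alone. The function names and the toy frequency table below are hypothetical; $N$ is recovered as $n/p$, so the sketch assumes $0 < p \le 1$.

```python
import math


def shannon_mle(freq_counts):
    """Naive plug-in estimator: -sum_k f_k * (k/n) * log(k/n)."""
    n = sum(k * f for k, f in freq_counts.items())
    return -sum(f * (k / n) * math.log(k / n) for k, f in freq_counts.items())


def shannon_chao_shen(freq_counts):
    """Chao and Shen (2003) estimator for sampling with replacement."""
    n = sum(k * f for k, f in freq_counts.items())
    coverage = 1.0 - freq_counts.get(1, 0) / n
    if coverage <= 0.0:
        raise ValueError("estimated coverage is zero; estimator undefined")
    total = 0.0
    for k, f in freq_counts.items():
        pi = coverage * k / n
        total += f * pi * math.log(pi) / (1.0 - (1.0 - pi) ** n)
    return -total


def shannon_finite_population(freq_counts, p):
    """Proposed estimator H_p with inclusion probability 1 - q^(N * pi_k)."""
    n = sum(k * f for k, f in freq_counts.items())
    q = 1.0 - p
    N = n / p                                   # population size implied by p
    coverage = 1.0 - q * freq_counts.get(1, 0) / n
    total = 0.0
    for k, f in freq_counts.items():
        pi = coverage * k / n
        total += f * pi * math.log(pi) / (1.0 - q ** (N * pi))
    return -total


# Hypothetical toy sample: 10 singletons, 4 doubletons, 2 tripletons, 1 class seen 10 times.
toy_counts = {1: 10, 2: 4, 3: 2, 10: 1}

print(round(shannon_mle(toy_counts), 4))
print(round(shannon_chao_shen(toy_counts), 4))
for p in (0.1, 0.5, 1.0):
    print(p, round(shannon_finite_population(toy_counts, p), 4))
```

With `p = 1.0` the proposed estimator coincides with the plug-in value, which is the true $H$ under a complete census; the Chao and Shen form does not have this property.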

4.3 Predicting the number of new species in a subsequent sample of size m

Given a sample of size $n$ (data), ecologists or biologists are often interested in the number of
species (denoted $S_m$) which are undetected in the sample but that would be discovered in a
subsequent sample of size $m$. This quantity is helpful for assessing the value of taking another
sample (Shen et al., 2003; Cowell et al., 2012). Assuming an infinite population, the goal
concentrates on estimating

$$E(S_m \mid \text{data}) = \sum_{i=1}^{S} I(X_i = 0) \left\{ 1 - (1 - p_i)^m \right\}. \tag{14}$$

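Shen et al. (2003) proposed

$$\tilde{S}_m = \hat{f}_0 \left\{ 1 - \left( 1 - \frac{f_1/n}{\hat{f}_0} \right)^m \right\}$$

to estimate $E(S_m \mid \text{data})$, where $\hat{f}_0 = \hat{S}_{CL} - D$ and $\hat{S}_{CL}$ is the ACE
estimator in Section 4.1. However, in the context of sampling without replacement, using the
approximation in (10), the main concern in (14) becomes

$$E(S_m \mid \text{data}) \approx \sum_{i=1}^{S} I(X_i = 0) \left\{ 1 - \left( 1 - \frac{m}{N - n} \right)^{N_i} \right\}.$$

Following the perspective of Shen et al. (2003), we suggest using the estimator

$$\hat{S}_m = \hat{f}_0 \left\{ 1 - \left( 1 - \frac{m}{N - n} \right)^{\hat{\bar{N}}_0} \right\} \tag{15}$$

to estimate $E(S_m \mid \text{data})$, where $\hat{f}_0 = \hat{S} - D$ and
$\hat{\bar{N}}_0 = N \hat{C}_0/\hat{f}_0 = q f_1/(p \hat{f}_0)$.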

Remark 4. Using the binomial expansion with the estimator $\tilde{S}_m$ by Shen et al. (2003) and with
the proposed estimator $\hat{S}_m$ (if $\hat{\bar{N}}_0$ was viewed as an integer), we found that both
estimators result in the same leading term, $m f_1/n$. Numerical experiments (Section 6) show that the
leading term dominates in both methods. As a consequence, it turns out that the estimator by Shen et
al. (2003) is still valid even when data were sampled without replacement.
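The two predictors and their common leading term can be compared with a short sketch. The function names are our own, and the value of $\hat{f}_0$ below is a hypothetical placeholder rather than an estimate computed from data; $f_1$ and $n$ are those of the rare plant data in Section 5.

```python
def predict_new_classes_infinite(f1, n, f0_hat, m):
    """Shen et al. (2003) form: f0_hat * [1 - (1 - (f1/n) / f0_hat)^m]."""
    return f0_hat * (1.0 - (1.0 - (f1 / n) / f0_hat) ** m)


def predict_new_classes_finite(f1, n, p, f0_hat, m):
    """Estimator (15): f0_hat * [1 - (1 - m / (N - n))^(q * f1 / (p * f0_hat))]."""
    q = 1.0 - p
    N = n / p
    if m > N - n:
        raise ValueError("m cannot exceed the number of unsampled objects N - n")
    n0_bar = q * f1 / (p * f0_hat)   # estimated mean size of an unseen class
    return f0_hat * (1.0 - (1.0 - m / (N - n)) ** n0_bar)


f1, n, p = 61, 1008, 0.5
f0_hat = 40.0   # hypothetical number of unseen classes, for illustration only

for m in (1, 10, 100, 1008):
    print(
        m,
        round(m * f1 / n, 3),                                        # leading term
        round(predict_new_classes_infinite(f1, n, f0_hat, m), 3),
        round(predict_new_classes_finite(f1, n, p, f0_hat, m), 3),
    )
```

When $m = N - n$, so that the remainder of the population is sampled, estimator (15) returns $\hat{f}_0$ exactly, whereas the infinite-population form stays below it.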

5 Empirical study of rare plant data

In this section we validate the proposed frequency estimator and illustrate its use in practice. To
validate the proposed method, we sampled without replacement from a dataset on rare vascular plant
species in the southern Appalachian region (Miller and Wiegert, 1989), which was previously used to
assess species-area relations and species-abundance distributions to reliably estimate species richness.
To illustrate practical application, we also analyzed the entire dataset as a sample.
First considering the dataset as a population, there are S = 188 species comprised of N = 1008
individuals. Tabulated frequencies are summarized in Table 1, where each number of individuals and
corresponding frequency represents a pair (i, Fi ).

Table 1 Frequency data of the rare vascular plant species.

Individuals (i) 1 2 3 4 5 6 7 8 9 10 11 12 13
Frequency (Fi ) 61 35 18 12 15 4 8 4 5 5 1 2 1
Individuals (i) 14 15 16 19 20 22 29 32 40 43 48 61
Frequency (Fi ) 2 3 2 1 2 1 1 1 1 1 1 1


Table 2 A sensitivity analysis of the proposed frequency estimation method.

                    95% confidence interval
p      Method   C+               C1               C2
1/2    ĈGT      0.939 ± 0.022    0.069 ± 0.031    0.054 ± 0.034
1/2    Ĉadj     0.970 ± 0.013    0.065 ± 0.017    0.062 ± 0.018
1/3    Ĉadj     0.960 ± 0.016    0.066 ± 0.022    0.059 ± 0.024
1/5    Ĉadj     0.952 ± 0.019    0.068 ± 0.025    0.057 ± 0.028
1/10   Ĉadj     0.946 ± 0.020    0.069 ± 0.028    0.055 ± 0.031

5.1 Validation of proposed frequency estimation method

Using this dataset as a sampling population, we tested our estimator (denoted $\hat{C}_{adj}$) and the
Good–Turing estimator (denoted $\hat{C}_{GT}$) for the expected sample coverage $E(C_+)$ and expected
population proportions $E(C_r)$ for frequencies $r = 1, 2, 3$. We also examined performance of the
proposed interval estimator (9). Specifically, we sampled without replacement 5000 datasets using
sampling fractions $p = 0.05$ and 0.1 to 0.9 in increments of 0.1. Supporting Information Tables S1–S4
present the results. Using the 5000 estimates, we calculated the average estimate, average bias,
sample RMSE, and estimated RMSE derived from (8). To determine whether the proposed interval
estimator in (9) attains a given nominal level, the coverage percentage of the 95% confidence
interval which covers the parameter of interest is also reported. In addition to the nominal coverage
percentage, the width of the interval estimator is also important in practical applications. To
simultaneously take both interval width and coverage percentage into account, we employ an
extensively investigated scoring rule (Gneiting and Raftery, 2007) to calculate an interval score

$$I_{\alpha}^{score}(l, u; \theta) = (u - l) + \frac{2}{\alpha}(l - \theta)\, I(\theta < l) + \frac{2}{\alpha}(\theta - u)\, I(\theta > u),$$

where $(l, u)$ is a $(1 - \alpha) \times 100\%$ confidence interval of $\theta$.
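The interval score is a one-line computation; the sketch below is a direct transcription of the formula above, with a function name of our own choosing. Lower scores are better: the score equals the interval width when the interval covers $\theta$ and adds a penalty proportional to the miss distance otherwise.

```python
def interval_score(lower, upper, theta, alpha=0.05):
    """Interval score of Gneiting and Raftery (2007) for a (1 - alpha) interval."""
    score = upper - lower
    if theta < lower:
        score += (2.0 / alpha) * (lower - theta)   # penalty for missing from above
    if theta > upper:
        score += (2.0 / alpha) * (theta - upper)   # penalty for missing from below
    return score


# A covered target is scored by width alone; a missed target is penalised.
print(interval_score(0.0, 1.0, 0.5))
print(interval_score(0.0, 1.0, 1.5, alpha=0.5))
```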

In this study, the average of 5000 interval scores was computed by the formula
$I_{0.05}^{score}(\hat{C}_r - 1.96\sigma_r, \hat{C}_r + 1.96\sigma_r; \bar{C}_r)$ and denoted "Interval
Score" in Supporting Information Tables S1–S4. The tables demonstrate that the proposed estimators
are satisfactory in terms of both bias and RMSE. As a result, we find strong support for applying our
proposed adjustment to the Good–Turing estimators. Furthermore, the sample RMSE is close to the
estimated RMSE derived from (8) except for $C_3$ when $p = 0.05$ in Supporting Information Table S4.
In that case, the sample RMSE is 0.0795, while the estimated RMSE is 0.0641; the underestimation of
the RMSE results in the coverage percentage being smaller than the given nominal level. Otherwise,
the proposed confidence interval estimators in (9) appear to be reliable. The scoring rule also
supports adjusting the Good–Turing estimators since the mean interval scores of the proposed
estimators are substantially lower than those of the Good–Turing estimators.

5.2 Illustration of proposed frequency estimation method

To illustrate application of the proposed frequency estimator, the entire dataset is analyzed as a
sample. Applying the proposed estimators Ĉ+ and Ĉr , r = 1, 2, requires information about the total
number of plants N or the sampling fraction p. Since this information is unavailable for this dataset,
we considered four sampling fractions, p = 1/2, 1/3, 1/5, and 1/10, as a sensitivity analysis as in
Mingoti and Meeden (1989); the results are given in Table 2.


Because the Good–Turing estimator does not depend on the sampling fraction p, its estimate is
only reported for the case p = 1/2. The estimation results for Ĉ+ and Ĉr , r = 1, 2, are similar to
the simulation study in Section 5.1. Interestingly, as the sampling fraction gets smaller, estimates
using Ĉadj approach those using ĈGT from above for E (C+ ) and E (C2 ) but from below for E (C1 ).
However, standard errors estimated using the proposed method approach those using the Good–
Turing estimators from below for all three target quantities. As one would expect, the effect of our
proposed correction for finite population sampling diminishes as the sampling fraction becomes very
small.

6 Simulation study of proposed estimators

This section extends our simulation study to analyze performance of the proposed estimators and
two extensions in Section 4. The estimator (11) for the number of classes is not included since its
performance was thoroughly studied in a comprehensive simulation study by Haas and Stokes (1998).
As in Section 5.1, we resample from empirical datasets treated as populations. We consider five
population datasets, which are briefly summarized below in terms of coefficients of variation (γ ) for
Ni , i = 1, . . . , S, numbers of classes (S), and total numbers of objects (N).
Population 1. An artificial example by Goodman (1949) with γ = 0.21, S = 9595, and N = 10,000.
Population 2. Openings in chess games, an example in Good (1953) with γ = 1.72, S = 174, and
N = 385.
Population 3. Lepidoptera in light trap at Rothamsted (Fisher et al., 1943), an example in Good (1953)
with γ = 2.92, S = 240, and N = 15,609.
Population 4. Frequencies of words in the early novel Oliver Twist (written 1837–1839) by English
author Charles Dickens (Evert and Baroni, 2007) with γ = 9.93, S = 10,710, and N = 157,302.
Population 5. Frequencies of words in the Brown Corpus (Kucera and Francis, 1967; Evert and Baroni,
2007) with γ = 21.31, S = 45,215, and N = 1,006,770.
To illustrate concentrations of objects in a few classes, Fig. 1 shows Fk , k = 1, 2, . . . , 10 for each
population.
As commented by Good (2000), most applications are interested in r ≤ 5. Thus, we compared
estimators of the expected sample coverage E (C+ ) = 1 − E (C0 ) and E (Cr ) for r = 1, . . . , 5. Given a
sampling population, we sampled without replacement 5000 datasets using the sampling fractions in
Section 5.1.
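The resampling design can be sketched in a few lines. The code below is an illustration only: the population, the number of replicates, and the function names are hypothetical, and it tracks the sample coverage alone, averaging the true coverage, the adjusted estimate (7), and the classical Good–Turing estimate over repeated samples drawn without replacement.

```python
import random
from collections import Counter


def draw_sample(class_sizes, n, rng):
    """Draw n objects without replacement; return {class index: X_i} for observed classes."""
    population = [i for i, size in enumerate(class_sizes) for _ in range(size)]
    return Counter(rng.sample(population, n))


def run_replicates(class_sizes, n, replicates, seed=0):
    """Average the true coverage and both coverage estimates over repeated samples."""
    rng = random.Random(seed)
    N = sum(class_sizes)
    q = 1.0 - n / N
    totals = {"true": 0.0, "adjusted": 0.0, "classical": 0.0}
    for _ in range(replicates):
        counts = draw_sample(class_sizes, n, rng)
        f1 = sum(1 for x in counts.values() if x == 1)
        # True coverage: share of the population belonging to observed classes.
        totals["true"] += sum(class_sizes[i] for i in counts) / N
        totals["adjusted"] += 1.0 - q * f1 / n   # estimator (7)
        totals["classical"] += 1.0 - f1 / n      # classical Good-Turing
    return {name: total / replicates for name, total in totals.items()}


# Hypothetical population: many rare classes and one abundant class (N = 160).
class_sizes = [1] * 50 + [2] * 20 + [5] * 6 + [40]
print(run_replicates(class_sizes, n=80, replicates=200, seed=1))
```

With half of this small population sampled, the averaged adjusted estimate should sit close to the averaged true coverage, while the classical estimate falls well below it.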
Figure 2 displays the results for sample coverage $C_+$ and $C_r$, $r = 1, 2$, where the proposed
estimators adjusted for finite population sampling (denoted $\hat{C}_{adj}$) were compared with the
classical Good–Turing estimators (denoted $\hat{C}_{GT}$). Here we omitted the results for $C_r$,
$r = 3, 4, 5$, as they were similar to those for $C_+$ and $C_r$, $r = 1, 2$. The averages of the 5000
resulting estimates ($\bar{C}_{adj}$ and $\bar{C}_{GT}$) are displayed in Fig. 2 and the averages for
$C_+$ and $C_r$ (denoted $\bar{C}_+$ and $\bar{C}_r$), $r = 1, 2$, calculated with the generated data
$N_i$s are also shown for reference; see the dotted lines in Fig. 2. As expected, the adjusted
estimator $\hat{C}_{adj}$ consistently outperforms $\hat{C}_{GT}$ for all $r$. Specifically, the
$\bar{C}_{adj}$ nearly coincides with $\bar{C}_+$ and $\bar{C}_r$, $r = 1, 2$, for all populations. In
contrast, $\bar{C}_{GT}$ departs from the reference line (the dotted lines of Fig. 2) when the
sampling fraction is larger than 0.05. Note that $\bar{C}_{GT}$ is very close to $\bar{C}_{adj}$ in
Population 3. As indicated in Section 3, the bias of $\hat{C}_{GT}$, or the difference between its
expectation and $E(\hat{C}_{adj})$, is $pE(f_1)/n$ when estimating $E(C_+)$. Furthermore, the
proportion of singletons ($F_1/N$) in Population 3 is 35/15,609, which is very small and is the limit
of $pE(f_1)/n$ as $n$ increases to $N$. As a consequence, the bias of $\hat{C}_{GT}$ is negligible in
this population. As mentioned in Section 3, the estimates of $\hat{C}_{adj}$ and $\hat{C}_{GT}$ should
be similar as the sampling fraction approaches 0. Figure 2 also confirms that the difference between
$\bar{C}_{adj}$ and $\bar{C}_{GT}$ is very slight at small sampling fractions for all populations.

[Figure 1: five bar-chart panels, Population 1 to Population 5; horizontal axis: Object per Class
(1 to 10); vertical axis: Number of Classes.]
Figure 1 The first 10 frequencies ($F_k$, $k = 1, 2, \ldots, 10$) of the five populations are displayed
and arranged with smallest variation to greatest variation regarding $N_i$, $i = 1, \ldots, S$.

[Figure 2: 5 × 3 grid of line plots (rows: Populations 1 to 5; columns: C+, C1, C2); legend: C+ or Cr,
Cadj, CGT; x-axis: Sample Proportion (%), 5 to 90; y-axis: Estimate.]

Figure 2 The performance of the estimators for the sample coverage C+ (first column), the population
proportion of singletons C1 (second column), and the population proportion of doubletons C2 (third
column). The dotted reference lines denote the sample means of C+ and Cr, r = 1, 2, over 5000
replicates. In each column, the five populations of Ni, i = 1, . . . , S are arranged
with smallest variation in the top row to greatest variation in the bottom row.

[Figure 3: 5 × 2 grid of line plots (rows: Populations 1 to 5; columns: Bias, RMSE); legend: HMLE, Hp,
HCS; x-axis: Sample Proportion (%), 5 to 90; panel annotations: H̄ = 9.15 (Population 1), 4.54
(Population 2), 4.08 (Population 3), 6.55 (Population 4), 7.25 (Population 5).]

Figure 3 Comparison of the biases of HMLE, Hp, and HCS (left) and their RMSEs (right), where the
five populations of Ni, i = 1, . . . , S are arranged with smallest variation in the top row to greatest
variation in the bottom row.

Figure 3 presents the performance of the Shannon index estimators in terms of bias (the average
difference between the estimates and the H's) and sample root mean squared error (RMSE), computed
from the 5000 resulting estimates for the five populations across sampling fractions. The figure shows
that the proposed estimator outperforms HMLE under both measures for all populations except
Populations 4 and 5 when p ≥ 0.5, where HMLE is better on both measures than the proposed
estimator Hp, though the discrepancies between the two estimators are not of practical importance.
As noted in Section 4.2, HMLE ignores species not observed in the sample and exhibits a negative bias
as shown in (12), so the underestimation by HMLE in all populations in Fig. 3 is expected.
Nevertheless, the magnitude of the bias of HMLE decreases with the sampling fraction as it eventually
converges to the true H. Since the estimator HCS by Chao and Shen (2003) was developed in the
context of sampling with replacement, it can have a substantial bias when data are sampled without
replacement and the sampling fraction is non-negligible. In Population 1, the number of singletons is
F1 = 9225 out of S = 9595 classes, so the performance of HCS is worse than that of the proposed
estimator under both measures. In some situations, however, HCS performs similarly to or better than
the proposed estimator (e.g., Population 2 when p ≤ 0.30, Populations 3 and 5 when p = 0.05, and
Population 4 when p ≤ 0.10). When the sampling fraction is large, Hp outperforms HCS in both
measures. Thus, the proposed estimator appears to be more reliable than HCS in general applications
of sampling without replacement. In summary, compared to HCS, the proposed estimator has similar
results for small sampling fractions (p ≤ 0.30) but is superior in both measures when the sampling
fraction is large. Moreover, the proposed estimator is generally better than the MLE in terms of the
RMSE, but its advantage over the MLE diminishes as the variation over the Ni's increases, as in
Populations 4 and 5.
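The two benchmark estimators and the two performance measures discussed above can be sketched as follows. This is an illustrative sketch only: the proposed estimator Hp is defined in Section 4.2 and is not reproduced here, the Chao and Shen (2003) form is written from its standard published definition, and the guard for an all-singleton sample is a common convention assumed here rather than something stated in this paper. Function names are hypothetical.

```python
import math


def shannon_mle(counts):
    """Plug-in (MLE) estimator: -sum (X_i / n) log(X_i / n) over observed classes."""
    n = sum(counts)
    return -sum((x / n) * math.log(x / n) for x in counts if x > 0)


def shannon_chao_shen(counts):
    """Chao-Shen estimator, derived under sampling with replacement."""
    counts = [x for x in counts if x > 0]
    n = sum(counts)
    f1 = sum(1 for x in counts if x == 1)
    if f1 == n:                 # assumed convention: avoid a zero coverage estimate
        f1 = n - 1
    coverage = 1.0 - f1 / n     # classical Good-Turing sample coverage
    total = 0.0
    for x in counts:
        pa = coverage * x / n   # coverage-adjusted relative abundance
        # Horvitz-Thompson weight: probability the class is seen at least once
        total -= pa * math.log(pa) / (1.0 - (1.0 - pa) ** n)
    return total


def bias_and_rmse(estimates, truth):
    """Average error and root mean squared error of replicate estimates."""
    errors = [e - truth for e in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return bias, rmse
```

Applying `bias_and_rmse` to the replicate estimates from each estimator at each sampling fraction gives the kind of summary plotted in Fig. 3.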
Figure 4 displays the average of 5000 observed numbers of new species (denoted S̄m ) as a reference
(dashed lines in the figure) versus the averages of 5000 estimates using Ŝm (proposed here and denoted
Ŝ¯m ) and S̃m (from Shen et al. (2003) and denoted S̃¯m ) for selected sizes of subsequent samples m.
When the sampling fraction p is large, most species are observed in the sample and the number of new
species in any subsequent sample is small relative to the species richness. Thus, in contrast with the
previous settings for the Shannon index, we consider three sampling fractions (p = 0.1, 0.2, 0.3) and
six ratios of the subsequent sample size m to the sample size n (0.2 to 1.2 with increments of 0.2) over
the five populations. As remarked in Section 4.3, both estimators Ŝm and S̃m result in the same leading
term in their binomial expansions. Figure 4 shows that Ŝ¯m almost coincides with S̃¯m. Although Ŝ¯m
and S̃¯m differ slightly and both estimators are positively biased for Population 3, both the difference
and the biases are small in magnitude. Consequently, both estimators appear to be
generally satisfactory for all populations.
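The reference quantity in Fig. 4, the observed number of new species in a subsequent sample, can be simulated directly. The sketch below covers only that reference value; the estimators Ŝm and S̃m are defined in Section 4.3 and are not reproduced. Function names and defaults are hypothetical.

```python
import random


def observed_new_species(population, n, m, rng):
    """Draw n, then m more, without replacement; count classes first seen in the m."""
    # random.sample returns items in selection order, so the first n items and
    # the next m items are themselves valid successive samples.
    drawn = rng.sample(population, n + m)
    first, second = drawn[:n], drawn[n:]
    return len(set(second) - set(first))


def average_new_species(population, n, m, replicates=1000, seed=0):
    """Monte Carlo average of the observed number of new species."""
    rng = random.Random(seed)
    total = sum(observed_new_species(population, n, m, rng) for _ in range(replicates))
    return total / replicates
```

For example, with a population list of class labels, `average_new_species(population, n, int(0.2 * n))` approximates the reference value at m/n = 0.2.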
In sum, since CGT is derived assuming sampling with replacement, there is an inherent bias when CGT
is applied to data that are sampled without replacement and the sampling fraction p is moderate. In
contrast, the proposed estimator Cadj takes the sampling fraction into account, and its performance is
demonstrably more robust than CGT to the sampling fraction. This simulation study clearly shows the
necessity of adjusting CGT when sampling without replacement, and the proposed estimator appears
to successfully generalize CGT to account for finite population sampling.

[Figure 4: 5 × 3 grid of line plots (rows: Populations 1 to 5; columns: p = 0.1, 0.2, 0.3); legend: S̄m,
Ŝ¯m (proposed), S̃¯m; x-axis: m/n (0.2 to 1.2); y-axis: Number of Species.]

Figure 4 Observed (S̄m) and estimated (Ŝ¯m proposed and S̃¯m by Shen et al. (2003)) average numbers
of new species in a subsequent sample of size m, where the five populations are arranged with
smallest variation in the top row to greatest variation in the bottom row and three sampling fractions
are located in the left (p = 0.1), middle (p = 0.2), and right (p = 0.3) columns.

7 Conclusion
Good–Turing frequency estimation has been used in a wide range of disciplines, but the method
was developed under the assumption that sampling is done with replacement. This paper proposes an
adjustment to the Good–Turing method to account for the common situation in which sampling is done
without replacement from a finite population. The adjusted estimator inherits several characteristics
from the Good–Turing estimator; in particular, it reduces to the original Good–Turing estimator when
the sampling fraction approaches zero.
In this article we also presented three extensions of the proposed estimators: (1) estimating the
number of classes in a population, which is recognized as a fundamental problem in various
disciplines; (2) estimating the Shannon index, which has applications in monitoring diversity in an
ecosystem; and (3) predicting the number of new species in a subsequent sample, which can be used
to inform the cost effectiveness of taking an additional sample. Other applications involving frequency
estimation may also benefit from the adjusted Good–Turing estimator. Although this study has pointed
out some useful applications of the modified method proposed here, there are certainly other methods
that improve on it in the context of finite population sampling. As an example, for estimating the
number of classes, Haas and Stokes (1998) consider the general jackknife procedure and propose some
estimators that perform better than the estimator Ŝ when the squared coefficient of variation γ² is
large (γ² > 1). Additionally, Valiant and Valiant (2011) provided a numerical algorithm for estimating
the support size of a distribution that can be applied to this topic as well. More recently, Chao
and Lin (2012) developed a lower bound estimator that is very robust and, in particular, can be
accurate enough to serve as a point estimator when γ² is small or at large sampling proportions. We
believe that more promising methods for estimating the number of classes and other extensions of
interest are worth pursuing in future research. A referee has also suggested a simple weighted
estimator that may be applicable to our extensions in practice. As an illustration of such a weighted
approach, consider Ŝw = pŜ + qŜ∞ for estimating the number of classes, where Ŝ∞ can be any valid
estimator derived for an infinite population. Clearly, Ŝw reduces to Ŝ∞ when p is 0. Consequently, the
weighted estimator may outperform Ŝ if Ŝ∞ is well selected. In an additional simulation study
(unreported), we compared the estimator Ŝ with the weighted estimator Ŝw, in which Ŝ∞ is the
estimator of Cecconi et al. (2012). The resulting weighted estimator outperforms the estimator Ŝ
in terms of estimation bias and RMSE when γ² is large; nevertheless, its performance
deteriorates in some test populations. Consequently, a more complete investigation from theoretical
and practical perspectives is needed to assess the performance of such a weighted estimator
comprehensively. Note that the weighted procedure can also be applied to entropy estimation and
other problems.
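The weighted estimator Ŝw = pŜ + qŜ∞ described above is a convex combination that a short helper makes concrete. This is only a sketch of the arithmetic: the two input estimates are assumed to have been computed elsewhere, and the function and argument names are hypothetical.

```python
def weighted_class_estimate(s_finite, s_infinite, n, N):
    """Return S_w = p * S_hat + q * S_inf with p = n / N and q = 1 - p."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    p = n / N
    return p * s_finite + (1.0 - p) * s_infinite
```

As p approaches 0 the result approaches the infinite-population estimate, and at n = N it equals the finite-population estimate, matching the limiting behavior noted in the text.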
In classical sampling theory, the finite population correction (FPC) factor only features in the
variance estimator if the parameter of interest focuses on the population mean whose estimator
is free of FPC. Nevertheless, we found that FPC emerges not only in the variance estimator but
also in the frequency estimator and the practical extensions in Section 4. Although we also found
that the estimator by Shen et al. (2003) is reasonable in the context of sampling without replace-
ment, the proposed method represents a general framework for reliable inference. An interesting
question for a further study concerns determining conditions under which FPC features in point
estimation.
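For comparison, the classical role of the FPC can be shown with the textbook variance estimator of a sample mean under simple random sampling without replacement, (1 − n/N) s²/n. The sketch below uses that standard result, which is not a formula from this paper; the function name is hypothetical.

```python
import statistics


def srs_mean_and_variance(sample, N):
    """Sample mean and its estimated variance under sampling without replacement."""
    n = len(sample)
    mean = statistics.mean(sample)        # the point estimator carries no FPC
    s2 = statistics.variance(sample)      # sample variance with divisor n - 1
    fpc = 1.0 - n / N                     # finite population correction
    return mean, fpc * s2 / n
```

Here the correction changes only the variance estimate, whereas in the frequency estimators of this paper the sampling fraction enters the point estimate itself.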
Note that source codes to reproduce all figures and (Supporting Information) tables are
available as Supporting Information on the journal’s web page
(http://onlinelibrary.wiley.com/doi/10.1002/bimj.201300168/suppinfo).

Acknowledgments The authors thank Professor Anne Chao for inspiring the topic and Roman Gulati for
generous assistance editing the manuscript. This work was supported by the National Science Council of
Taiwan.

Conflict of interest
The authors have declared no conflict of interest.

Appendix: Derivation of the mean squared error

To obtain the mean squared error of $\hat{C}_r$, write

$$
\begin{aligned}
E\{\sqrt{n}(\hat{C}_r - C_r)\}^2
&= E\left\{ \sqrt{n}\sum_i \frac{N_i - r}{N} I(X_i = r) - \sqrt{n}\sum_i \frac{q(r+1)}{n} I(X_i = r+1) \right\}^2 \\
&= T_1 + T_2 - 2T_3,
\end{aligned}
$$

where $T_1 = E\{\sqrt{n}\sum_i \frac{N_i - r}{N} I(X_i = r)\}^2$, $T_2 = E\{\sqrt{n}\sum_i \frac{q(r+1)}{n} I(X_i = r+1)\}^2$, and
$T_3 = E\{ n \sum_{i \neq j} \frac{N_i - r}{N} \frac{q(r+1)}{n} I(X_i = r) I(X_j = r+1) \}$.

For the term $T_1$, we see

$$
\begin{aligned}
T_1 &= nE\left\{ \sum_i \frac{(N_i - r)(N_i - r - 1) + (N_i - r)}{N^2} I(X_i = r) \right\} + T_4 \\
&= \frac{q^2 (r+1)(r+2)}{n} E(f_{r+2}) + \frac{q(r+1)}{N} E(f_{r+1}) + O\left(\frac{\eta_s}{N}\right) + T_4,
\end{aligned}
$$

where $T_4 = nE\{ \sum_{i \neq j} \frac{(N_i - r)(N_j - r)}{N^2} I(X_i = r, X_j = r) \}$. Moreover, we have

$$
\begin{aligned}
T_4 &= n \sum_{i \neq j} \frac{(N_i - r)(N_j - r)}{N^2}
\frac{\binom{N_i}{r}\binom{N_j}{r}\binom{N - N_i - N_j}{n - 2r}}{\binom{N}{n}} \\
&= n \sum_{i \neq j} \frac{(r+1)^2}{N^2}
\frac{\binom{N_i}{r+1}\binom{N_j}{r+1}\binom{N - N_i - N_j}{n - 2r - 2}}{\binom{N}{n}}
\frac{(N - N_i - N_j - n + 2r + 2)(N - N_i - N_j - n + 2r + 1)}{(n - 2r - 1)(n - 2r)} \\
&= \frac{q^2 (r+1)^2}{n} \sum_{i \neq j} \Pr(X_i = X_j = r+1)
\left\{ \frac{\left(1 - \frac{N_i + N_j - 2r - 2}{Nq}\right)\left(1 - \frac{N_i + N_j - 2r - 1}{Nq}\right)}
{\left(1 - \frac{2r+1}{n}\right)\left(1 - \frac{2r}{n}\right)} \right\}.
\end{aligned}
$$

Next, $T_2 = \frac{q^2 (r+1)^2}{n} E(f_{r+1}) + T_5$, where $T_5 = \sum_{i \neq j} \frac{q^2 (r+1)^2}{n} \Pr(X_i = r+1, X_j = r+1)$. Similarly, it can
be shown that

$$
\begin{aligned}
T_3 &= \sum_{i \neq j} \frac{q(r+1)^2}{N}
\frac{\binom{N_i}{r+1}\binom{N_j}{r+1}\binom{N - N_i - N_j}{n - 2r - 2}}{\binom{N}{n}}
\frac{N - (N_i + N_j + n - 2r - 2)}{n - 2r - 1} \\
&= \frac{q^2 (r+1)^2}{n} \sum_{i \neq j} \Pr(X_i = X_j = r+1)
\left\{ \frac{1 - \frac{N_i + N_j - 2r - 2}{Nq}}{1 - \frac{2r+1}{n}} \right\}.
\end{aligned}
$$

As a result,

$$
\begin{aligned}
T_4 + T_5 - 2T_3 &= T_5 \left( -\frac{1}{Nq} - \frac{1}{n} \right) + O\left(\frac{\eta_s^2}{N}\right) \\
&= -E\left\{ \sum_{i \neq j} \frac{q(r+1)^2}{n^2} I(X_i = r+1, X_j = r+1) \right\} + O\left(\frac{\eta_s^2}{N}\right) \\
&= -E\left\{ \frac{q(r+1)^2 f_{r+1}(f_{r+1} - 1)}{n^2} \right\} + O\left(\frac{\eta_s^2}{N}\right).
\end{aligned}
$$

Therefore, by ignoring the remainder terms, we have

$$
E\{\sqrt{n}(\hat{C}_r - C_r)\}^2 \approx \frac{1}{n} E\left\{ q(r+1)(1 + qr) f_{r+1} + (r+1)(r+2) q^2 f_{r+2} \right\}
- E\left\{ \frac{q(r+1)^2 f_{r+1}(f_{r+1} - 1)}{n^2} \right\}.
$$
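The final approximation can be turned into a rough plug-in calculation by replacing the expectations with the observed frequency counts. That substitution is an assumption made here for illustration and is not a procedure stated in the Appendix. The function name is hypothetical, and f is expected to be a `Counter` so that missing counts read as 0.

```python
from collections import Counter


def approx_mse_coverage(f, n, N, r):
    """Plug-in value of the approximate E(C_r_hat - C_r)^2 from the final display."""
    q = 1.0 - n / N
    f_next, f_next2 = f[r + 1], f[r + 2]
    # Approximation for E{sqrt(n) (C_r_hat - C_r)}^2, expectations replaced by counts.
    scaled = (q * (r + 1) * (1 + q * r) * f_next
              + (r + 1) * (r + 2) * q ** 2 * f_next2) / n
    scaled -= q * (r + 1) ** 2 * f_next * (f_next - 1) / n ** 2
    return scaled / n   # divide by n to undo the sqrt(n) scaling
```

For instance, with f1 = 10, f2 = 5, n = 20, N = 40, and r = 0, the function returns 0.013125; at n = N it returns 0 because q = 0.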

References
Basharin, G. P. (1959). On a statistical estimate for the entropy of a sequence of independent random variables.
Theory of Probability and Its Applications 4, 333–336.
Chao, A. and Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of the American
Statistical Association 87, 210–217.
Chao, A., Lee, S. M. and Jeng, S. L. (1992). Estimating population size for capture-recapture data when capture
probabilities vary by time and individual animal. Biometrics 48, 201–216.
Chao, A. and Lin, C. W. (2012). Nonparametric lower bounds for species richness and shared species richness
under sampling without replacement. Biometrics 68, 912–921.
Chao, A. and Shen, T. J. (2003). Nonparametric estimation of Shannon’s index of diversity when there are unseen
species. Environmental and Ecological Statistics 10, 429–443.
Chen, S. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer
Speech and Language 13, 310–318.
Condit, R., Hubbell, S. P. and Foster, R. B. (1996). Changes in a tropical forest with a shifting climate: results
from a 50-ha permanent census plot in Panama. Journal of Tropical Ecology 12, 231–256.
Colwell, R., Chao, A., Gotelli, N., Lin, S. and Mao, C. (2012). Models and estimators linking individual-based
and sample-based rarefaction, extrapolation, and comparison of assemblages. Journal of Plant Ecology 5,
3–21.
Cecconi, L., Gandolfi, A. and Sastri, C. C. A. (2012). A new estimator for the number of species in a population.
Sankhya A 74, 80–100.
Church, K. W. and Hanks, P. (1990). Word association norms, mutual information, and lexicography.
Computational Linguistics 16, 22–29.
Esty, W. (1983). A normal limit law for a nonparametric estimator of the coverage of a random sample. The
Annals of Statistics 11, 905–912.
Esty, W. (1985). Estimation of the number of classes in a population and the coverage of a population. Mathe-
matical Scientist 10, 41–50.
Evert, S. and Baroni, M. (2007). “zipfR: Word frequency distributions in R.” In Proceedings of the 45th Annual
Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pages
29–32, Prague, CZ (R package version 0.6-6 of 2012-04-03).
Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943). The relation between the number of species and the
number of individuals in a random sample of an animal population. Journal of Animal Ecology 12, 42–58.
Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the
American Statistical Association 102, 359–378.
Good, I. J. (1953). The population of frequencies of species and the estimation of population parameters.
Biometrika 40, 45–63.
Good, I. J. (2000). Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval
Enigma. Journal of Statistical Computation and Simulation 66, 101–111.
Goodman, L. A. (1949). On the estimation of the number of classes in a population. Annals of Mathematical
Statistics 20, 572–579.
Haas, P. J. and Stokes, L. (1996). Estimating the number of classes in a finite population. IBM Research Report
RJ 10025, IBM Almaden Research Center, San Jose, CA, Revised March 1998.
Haas, P. J. and Stokes, L. (1998). Estimating the number of classes in a finite population. Journal of the American
Statistical Association 93, 1475–1487.
Haas, P. J., Liu, Y. and Stokes, L. (2006). An estimator of number of species from quadrat sampling. Biometrics
62, 135–141.
Hausser, J. and Strimmer, K. (2009). Entropy inference and the James-Stein estimator, with application to
nonlinear gene association networks. Journal of Machine Learning Research 10, 1469–1484.


Jelinek, F. (1998). Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.
Johnson, N. L., Kotz, S. and Kemp, A. W. F. (1992). Univariate Discrete Distributions (2nd edn.). Wiley, New
York, NY.
Kucera, H. and Francis, W. N. (1967). Computational Analysis of Present-day American English. Brown University
Press, Providence, RI.
Lo, S. H. (1992). From the species problem to a general coverage problem via a new interpretation. The Annals of
Statistics 20, 1094–1109.
Magurran, A. E. (1988). Ecological Diversity and Its Measurement. Princeton University Press, Princeton, NJ.
McAllester, D. and Schapire, R. E. (2000). On the convergence rate of Good–Turing estimators. In Proc. 13th
Annu. Conference on Comput. Learning Theory. Morgan Kaufmann, San Francisco, CA, 1–6.
Miller, R. I. and Wiegert, R. G. (1989). Documenting completeness, species-area relations, and the species-
abundance distribution of a regional flora. Ecology 70, 16–22.
Mingoti, S. A. and Meeden, G. (1989). Estimating the total number of distinct species using presence and absence
data. Biometrics 48, 863–875.
Orlitsky, A., Santhanam, N. P. and Zhang, J. (2003). Always Good Turing: Asymptotically optimal probability
estimation. Science 302, 427–431.
Shen, T. J., Chao, A. and Lin, J. F. (2003). Predicting the number of new species in further taxonomic sampling.
Ecology 84, 798–804.
Shlosser, A. (1981). On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering
Cybernetics 19, 97–102.
Song, F. and Croft, W. (1999). Research and Development in Information Retrieval. ACM Press, New York, NY.
Valiant, G. and Valiant, P. (2011). Estimating the unseen: an n/log(n)-sample estimator for entropy and support
size, shown optimal via new CLTs. In Proceedings of the forty-third annual ACM symposium on Theory of
computing (STOC’11), 685–694. ACM, New York, NY, USA.
Wagner, A. B., Viswanath, P. and Kulkarni, S. R. (2006). Strong consistency of the Good–Turing estimator. IEEE
Symposium on Information Theory Proceeding, July 2006, 2526–2530.
Zhang, C.-H. and Zhang, Z. (2009). Asymptotic normality of a nonparametric estimator of sample coverage. The
Annals of Statistics 37, 2582–2595.
