
Journal of Statistical Theory and Practice (2021) 15:24

https://doi.org/10.1007/s42519-020-00163-y

ORIGINAL ARTICLE

Estimation of the Minimum Probability of a Multinomial Distribution

Ali Mahzarnia1 · Michael Grabchak1 · Jiancheng Jiang1

Accepted: 23 December 2020 / Published online: 20 January 2021


© Grace Scientific Publishing 2021

Abstract
The estimation of the minimum probability of a multinomial distribution is impor-
tant for a variety of application areas. However, standard estimators such as the
maximum likelihood estimator and the Laplace smoothing estimator fail to function
reasonably in many situations as, for small sample sizes, these estimators are fully
deterministic and completely ignore the data. Inspired by a smooth approximation
of the minimum used in optimization theory, we introduce a new estimator, which
takes advantage of the entire data set. We consider both the cases with a known and
an unknown number of categories. We categorize the asymptotic distributions of the
proposed estimator and conduct a small-scale simulation study to better understand
its finite sample performance.

Keywords  Minimum probability · Multinomial distribution · Smooth minimum

1 Introduction

Consider the multinomial distribution 𝐏 = (p_1 , p_2 , … , p_k), where k ≥ 2 is the number of categories and p_i > 0 is the probability of seeing an observation from category i. We are interested in estimating the minimum probability

p_0 = min{p_i ; i = 1, … , k}

in both the cases where k is known and where it is unknown.


Given an independent and identically distributed random sample X_1 , X_2 , … , X_n of size n from 𝐏, let y_i = ∑_{j=1}^{n} 1(X_j = i) be the number of observations of category i. Here and throughout, we write 1(⋅) to denote the indicator function. The maximum likelihood estimator (MLE) of p_i is p̂_i = y_i∕n and the MLE of p_0 is

p̂_0 = min{p̂_i ; i = 1, … , k}.

The MLE has the obvious drawback that p̂ 0 is zero when we do not have at least
one observation from each category. To deal with this issue, one generally uses a
modification of the MLE. Perhaps the most prominent modification is the so-called
Laplace smoothing estimator (LSE). This estimator was introduced by Pierre-Simon
Laplace in the late 1700s to estimate the probability that the sun will rise tomorrow,
see, e.g., [4]. The LSE of p0 is given by

p̂_0^LS = min{(y_i + 1)∕(n + k) ; i = 1, … , k}.

Note that both p̂_0 and p̂_0^LS are based only on the smallest y_i. Note further that, in situations where we have not seen all of the categories in the sample, we always have p̂_0 = 0 and p̂_0^LS = 1∕(n + k). This holds, in particular, whenever n < k. Thus, in these cases, the estimators are fully deterministic and completely ignore the data.
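To make this concrete, here is a minimal Python sketch (the function names are illustrative, not from any library) of how the two estimators are computed from the category counts; with n < k both reduce to the deterministic values just described.

import numpy as np

def mle_min(counts):
    # MLE of p_0: the smallest relative frequency, min_i y_i / n.
    y = np.asarray(counts, dtype=float)
    return y.min() / y.sum()

def lse_min(counts):
    # Laplace smoothing estimate of p_0: min_i (y_i + 1) / (n + k).
    y = np.asarray(counts, dtype=float)
    n, k = y.sum(), y.size
    return (y.min() + 1.0) / (n + k)

# Example with k = 5 categories and n = 3 observations: some categories are
# necessarily unobserved, so the MLE is 0 and the LSE is 1/(n + k), whatever
# the data look like.
y = [2, 1, 0, 0, 0]
print(mle_min(y))   # 0.0
print(lse_min(y))   # 0.125 = 1/(3 + 5)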
In this article, we introduce a new estimator for p0 , which is based on a smooth
approximation of the minimum. It uses information from all of the categories and
thus avoids becoming deterministic for small sample sizes. We consider both the
cases when the number of categories is known and when it is unknown. We show
consistency of this estimator and characterize its asymptotic distributions. We also
perform a small-scale simulation study to better understand its finite sample perfor-
mance. Our numerical results show that, in certain situations, it outperforms both
the MLE and the LSE.
The rest of the paper is organized as follows: In Sect. 2, we introduce our estima-
tor for the case where the number of categories k is known and derive its asymp-
totic distributions. Then, in Sect. 3 we consider the case where k is unknown, and in
Sect. 4 we consider the related problem of estimating the maximum probability. In
Sect. 5 we give our simulation results, and in Sect. 6 we give some conclusions and
directions for future work. Finally, the proofs are given in “Appendix”. Before pro-
ceeding, we briefly describe a few applications:

1. One often needs to estimate the probability of a category that is not observed in
a random sample. This is often estimated using the LSE, which always gives the
deterministic value of 1∕(n + k) . On the other hand, a data-driven estimate would
be more reasonable. When the sample size is relatively large, it is reasonable to
assume that the unobserved category has the smallest probability and our estima-
tor could be used in this case. This situation comes up in a variety of applications
including language processing, computer vision, and linguistics, see, e.g., [6, 14],
or [15].
2. In the context of ecology, we may be interested in the probability of finding the
rarest species in an ecosystem. Aside from the intrinsic interest in this question,
this probability may be useful as a diversity index. In ecology, diversity indices
are metrics used to measure and compare the diversity of species in different
ecosystems, see, e.g., [7, 8], and the references therein. Generally one works with
several indices at once as they give different information about the ecosystem.
In particular, the probability of the rarest species may be especially useful when

combined with the index of species richness, which is the total number of species
in the ecosystem.
3. Consider the problem of internet ad placement. There are generally multiple ads
that are shown on the same webpage, and at most one of these will be clicked.
Thus, if there are k − 1 ads, then there are k possible outcomes, with the last
outcome being that no ad is clicked. In this context, the probability of a click on
a given ad is called the click through rate or CTR. Assume that there are k − 1
ads that have been displayed together on the same page and that we have data
on these. Now, the ad company wants to replace one of these with a new ad, for
which there are no data. In this case, the minimum probability of the original
k − 1 ads may give a baseline for the CTR of the new ad. This may be useful for
pricing.

2 The Estimator When k Is Known

We begin with the case where the number of categories k is known. Let 𝐩 = (p_1 , … , p_{k−1}) and 𝐩̂ = (p̂_1 , … , p̂_{k−1}), and note that p_k = 1 − ∑_{i=1}^{k−1} p_i and p̂_k = 1 − ∑_{i=1}^{k−1} p̂_i. Since p_0 = g(𝐩), where g(𝐩) = min{p_1 , p_2 , … , p_k}, a natural estimator of p_0 is given by

p̂_0 = g(𝐩̂) = min{p̂_1 , p̂_2 , … , p̂_k},

which is the MLE. However, this estimator takes the value of zero whenever there is
a category that has not been observed. To deal with this issue, we propose approxi-
mating g with a smoother function. Such approximations, which are sometimes
called smooth minimums, are often used in optimization theory, see, e.g., [1, 9, 10],
or [11]. Specifically, we introduce the function


g_n(𝐩) = w^{−1} ∑_{i=1}^{k} p_i e^{−n^𝛼 p_i},     (1)

where w = w(𝐩) = ∑_{j=1}^{k} e^{−n^𝛼 p_j} and 𝛼 > 0 is a tuning parameter. Note that

lim_{n→∞} g_n(𝐩) = g(𝐩) = p_0.     (2)

This leads to the estimator

p̂∗_0 = g_n(𝐩̂) = ŵ^{−1} ∑_{i=1}^{k} p̂_i e^{−n^𝛼 p̂_i},     (3)

where ŵ = w(𝐩̂) = ∑_{j=1}^{k} e^{−n^𝛼 p̂_j}.
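As a quick illustration, the following Python sketch evaluates (3) from a vector of counts when k is known (the function name is ours; shifting the exponent by the smallest p̂_i leaves the ratio in (3) unchanged and only improves numerical stability).

import numpy as np

def smooth_min_estimate(counts, alpha=0.49):
    # Smooth-minimum estimator p̂∗_0 of Eq. (3) for a known number of categories.
    # counts: length-k vector of counts y_1, ..., y_k; tuning parameter 0 < alpha < 1/2.
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    # Weights proportional to e^{-n^alpha * p_hat_i}; the shift by min(p_hat) cancels in the ratio.
    expo = np.exp(-n**alpha * (p_hat - p_hat.min()))
    return float(np.sum(p_hat * expo) / np.sum(expo))

# Even with an unobserved category the estimate is positive and uses every count.
print(smooth_min_estimate([5, 3, 2, 0]))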
We now study the asymptotic distributions of p̂∗_0. Let ∇g_n(𝐩) = (𝜕g_n(𝐩)∕𝜕p_1 , … , 𝜕g_n(𝐩)∕𝜕p_{k−1})^T. It is straightforward to check that, for 1 ≤ i ≤ k − 1,


𝜕g_n(𝐩)∕𝜕p_i = e^{−n^𝛼 p_i} w^{−1} [1 + n^𝛼 (g_n(𝐩) − p_i)] − e^{−n^𝛼 p_k} w^{−1} [1 + n^𝛼 (g_n(𝐩) − p_k)].     (4)

Let r be the cardinality of the set {j ∶ pj = p0 , j = 1, … , k} , i.e., r is the number of


categories that attain the minimum probability. Note that r ≥ 1 and that we have a
uniform distribution if and only if r = k . With this notation, we give the following
result.

Theorem 2.1  Assume that 0 < 𝛼 < 1∕2 and let 𝜎̂_n = {∇g_n(𝐩̂)^T Σ̂ ∇g_n(𝐩̂)}^{1∕2}, where Σ̂ = diag(𝐩̂) − 𝐩̂ 𝐩̂^T.

(i) If r ≠ k, then

√n 𝜎̂_n^{−1} {p̂∗_0 − p_0} →_D N(0, 1).

(ii) If r = k, then

k^2 n^{1−𝛼} {p_0 − p̂∗_0} →_D χ²_{(k−1)}.

Clearly, Theorem  2.1 both proves consistency and characterizes the asymp-
totic distributions. Further, it allows us to construct asymptotic confidence inter-
vals for p0 . If r ≠ k  , then an approximate 100(1 − 𝛾)% confidence interval is

p̂∗_0 ± n^{−1∕2} 𝜎̂_n z_{1−𝛾∕2},

where z_{1−𝛾∕2} is the 100(1 − 𝛾∕2)th percentile of the standard normal distribution. If r = k, then the corresponding confidence interval is

[p̂∗_0 , p̂∗_0 + k^{−2} n^{𝛼−1} χ²_{k−1,1−𝛾}],

where χ²_{k−1,1−𝛾} is the 100(1 − 𝛾)th percentile of a Chi-squared distribution with k − 1 degrees of freedom.
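As a sketch of how the r ≠ k interval can be computed in practice (our own helper, assuming k is known; the gradient follows (4), Σ̂ is built from the first k − 1 coordinates, and the normal quantile comes from scipy):

import numpy as np
from scipy.stats import norm

def min_prob_ci(counts, alpha=0.49, level=0.95):
    # Point estimate p̂∗_0 and the approximate CI of Theorem 2.1(i) (case r != k).
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n

    # Weights e^{-n^alpha p_hat_i} / w_hat and the estimate p̂∗_0 of Eq. (3).
    e = np.exp(-n**alpha * (p_hat - p_hat.min()))
    e = e / e.sum()
    g_hat = float(np.sum(p_hat * e))

    # Gradient of g_n at p_hat over the first k - 1 coordinates, Eq. (4).
    grad = (e[:-1] * (1.0 + n**alpha * (g_hat - p_hat[:-1]))
            - e[-1] * (1.0 + n**alpha * (g_hat - p_hat[-1])))

    # Sigma_hat = diag(p_hat) - p_hat p_hat^T restricted to the first k - 1 coordinates.
    p = p_hat[:-1]
    Sigma = np.diag(p) - np.outer(p, p)
    sigma_hat = float(np.sqrt(grad @ Sigma @ grad))

    half = norm.ppf(1.0 - (1.0 - level) / 2.0) * sigma_hat / np.sqrt(n)
    return g_hat, (g_hat - half, g_hat + half)

print(min_prob_ci([40, 25, 20, 10, 5]))

The one-sided interval for the r = k case can be formed in the same way, using the chi-squared quantile (for example scipy.stats.chi2.ppf) in place of the normal one.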
As far as we know, these are the first confidence intervals for the minimum to
appear in the literature. In fact, to the best of our knowledge, the asymptotic dis-
tributions of the MLE and the LSE have not been established. One might think
that a version of Theorem  2.1 for the MLE could be proved using the asymp-
totic normality of 𝐩̂ and the delta method. However, the delta method cannot be
applied since the minimum function g is not differentiable. Even in the case of
the proposed estimator p̂ ∗0 , where we use a smooth minimum, the delta method
cannot be applied directly since the function gn depends on the sample size n.
Instead, a subtler approach is needed. The detailed proof is given in “Appendix”.


3 The Estimator When k Is Unknown

In this section, we consider the situation where the number of categories k is unknown. In this case, one cannot evaluate the estimator p̂∗_0. The difficulty lies in the need to evaluate ŵ. Let 𝓁 = ∑_{j=1}^{k} 1(y_j = 0) be the number of categories that are not observed in the sample and note that

ŵ = ∑_{j=1}^{k} e^{−n^𝛼 p̂_j} = ∑_{j=1}^{k} e^{−n^𝛼 p̂_j} 1(y_j > 0) + 𝓁.

If we have an estimator 𝓁̂ of 𝓁, then we can take

ŵ♯ = ∑_{j=1}^{k} e^{−n^𝛼 p̂_j} 1(y_j > 0) + 𝓁̂

and define the estimator

p̂♯_0 = (1∕ŵ♯) ∑_{i=1}^{k} p̂_i e^{−n^𝛼 p̂_i}.     (5)

Note that p̂ ♯0 can be evaluated without knowledge of k since p̂ i = 0 for any category i
that does not appear in the sample.
Now, assume that we have observed k♯ categories in our sample and note that k♯ ≤ k. Without loss of generality, assume that these are categories 1, 2, … , k♯. Assume that k♯ ≥ 2, let 𝐩̂♯ = (p̂_1 , p̂_2 , … , p̂_{k♯−1}), and note that p̂_{k♯} = 1 − ∑_{i=1}^{k♯−1} p̂_i. For i = 1, 2, … , k♯ − 1 let

h_i = e^{−n^𝛼 p̂_i} (1∕ŵ♯) [1 − n^𝛼 (p̂_i − p̂♯_0)] − e^{−n^𝛼 p̂_{k♯}} (1∕ŵ♯) [1 − n^𝛼 (p̂_{k♯} − p̂♯_0)]

and let 𝐡 = (h_1 , h_2 , … , h_{k♯−1}). Note that we can evaluate 𝐡 without knowing k.

Theorem 3.1  Assume that 𝓁̂ is such that, with probability 1, we eventually have 𝓁̂ = 0. When k♯ ≥ 2, let 𝜎̂♯_n = {𝐡^T Σ̂♯ 𝐡}^{1∕2}, where Σ̂♯ = diag(𝐩̂♯) − 𝐩̂♯ (𝐩̂♯)^T. When k♯ = 1, let 𝜎̂♯_n = 1. If the assumptions of Theorem 2.1 hold, then the results of Theorem 2.1 hold with p̂♯_0 in place of p̂∗_0 and 𝜎̂♯_n in place of 𝜎̂_n.

Proof  Since k is finite and we eventually have 𝓁̂ = 0, there exists an almost surely finite random variable N such that if the sample size n ≥ N, then 𝓁̂ = 0 and we have observed each category at least once. For such n, we have k♯ = k, ŵ♯ = ŵ, 𝐩̂♯ = 𝐩̂, and ∇g_n(𝐩̂) = 𝐡. It follows that, for such n, 𝜎̂♯_n = 𝜎̂_n and p̂♯_0 = p̂∗_0. Hence 𝜎̂♯_n∕𝜎̂_n →_p 1 and √n 𝜎̂_n^{−1} {p̂∗_0 − p̂♯_0} →_p 0. From here the case r ≠ k follows by Theorem 2.1 and two applications of Slutsky's theorem. The case r = k is similar and is thus omitted.  ◻


There are a number of estimators for 𝓁 available in the literature, see, e.g., [2,
3, 5], or [16] and the references therein. One of the most popular is the so-called
Chao2 estimator [3, 5], which is given by

𝓁̂ = ((n − 1)∕n) f_1^2∕(2 f_2)        if f_2 > 0,
𝓁̂ = ((n − 1)∕n) f_1 (f_1 − 1)∕2      if f_2 = 0,     (6)

where f_i = ∑_{j=1}^{k} 1(y_j = i) is the number of categories that were observed exactly i
times in the sample. Since k is finite, we will, with probability 1, eventually observe
each category at least three times. Thus, we will eventually have f1 = f2 = 0 and
𝓁̂ = 0 . Thus, this estimator satisfies the assumptions of Theorem 3.1. In the rest of
the paper, when we use the notation p̂ ♯0 we will mean the estimator where 𝓁̂ is given
by (6).
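A sketch combining (5) and (6) is given below (the function names are ours); only the observed counts are passed in, since an unobserved category would contribute e^0 = 1 to ŵ♯, which is exactly what 𝓁̂ accounts for.

import numpy as np

def chao2_unseen(counts):
    # Chao2 estimate of the number of unobserved categories, Eq. (6).
    y = np.asarray(counts)
    y = y[y > 0]
    n = y.sum()
    f1 = int(np.sum(y == 1))
    f2 = int(np.sum(y == 2))
    if f2 > 0:
        return (n - 1) / n * f1**2 / (2 * f2)
    return (n - 1) / n * f1 * (f1 - 1) / 2

def smooth_min_unknown_k(counts, alpha=0.49):
    # Estimator p̂♯_0 of Eq. (5), with the weight w♯ using the Chao2 estimate of 𝓁.
    y = np.asarray(counts, dtype=float)
    y = y[y > 0]                    # only observed categories enter the sums
    n = y.sum()
    p_hat = y / n
    expo = np.exp(-n**alpha * p_hat)
    w_sharp = expo.sum() + chao2_unseen(y)
    return float(np.sum(p_hat * expo) / w_sharp)

print(smooth_min_unknown_k([7, 5, 3, 1, 1]))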

4 Estimation of the Maximum

The problem of estimating the maximum probability is generally easier than that
of estimating the minimum. Nevertheless, it may be interesting to note that our
methodology can be modified to estimate the maximum. Let
p∨ = max{pi ∶ i = 1, ⋯ , k}.

We begin with the case where the number of categories k is known. We can approxi-
mate the maximum function with a smooth maximum given by


g∨_n(𝐩) = w_∨^{−1} ∑_{i=1}^{k} p_i e^{n^𝛼 p_i},     (7)

where w_∨ = w_∨(𝐩) = ∑_{i=1}^{k} e^{n^𝛼 p_i}. Note that

g∨_n(𝐩) = −g_n(−𝐩),

where g_n is given by (1). It is not difficult to verify that g∨_n(𝐩) → p_∨ as n → ∞. This suggests that we can estimate p_∨ by

p̂∗_∨ = g∨_n(𝐩̂) = ŵ_∨^{−1} ∑_{i=1}^{k} p̂_i e^{n^𝛼 p̂_i},     (8)

where ŵ_∨ = w_∨(𝐩̂) = ∑_{i=1}^{k} e^{n^𝛼 p̂_i}.
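Relative to (3) only the sign of the exponent changes, so a sketch of (8) is equally short (function name ours; the shift by the largest p̂_i cancels in the ratio).

import numpy as np

def smooth_max_estimate(counts, alpha=0.49):
    # Smooth-maximum estimator p̂∗_∨ of Eq. (8) for a known number of categories.
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    expo = np.exp(n**alpha * (p_hat - p_hat.max()))
    return float(np.sum(p_hat * expo) / np.sum(expo))

print(smooth_max_estimate([40, 25, 20, 10, 5]))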
Let r_∨ be the cardinality of the set {j ∶ p_j = p_∨ , j = 1, … , k} and let ∇g∨_n(𝐩) = (𝜕g∨_n(𝐩)∕𝜕p_1 , … , 𝜕g∨_n(𝐩)∕𝜕p_{k−1})^T. It is easily verified that, for 1 ≤ i ≤ k − 1,


𝜕g∨_n(𝐩)∕𝜕p_i = 𝜕g_n(−𝐩)∕𝜕p_i = e^{n^𝛼 p_i} w_∨^{−1} [1 + n^𝛼 (p_i − g∨_n(𝐩))] − e^{n^𝛼 p_k} w_∨^{−1} [1 + n^𝛼 (p_k − g∨_n(𝐩))].     (9)

We now characterize the asymptotic distributions of p̂∗_∨.

Theorem 4.1  Assume that 0 < 𝛼 < 1∕2 and let 𝜎̂∨_n = {∇g∨_n(𝐩̂)^T Σ̂ ∇g∨_n(𝐩̂)}^{1∕2}, where Σ̂ = diag(𝐩̂) − 𝐩̂ 𝐩̂^T.

(i) If r_∨ ≠ k, then

(√n∕𝜎̂∨_n) {p̂∗_∨ − p_∨} →_D N(0, 1).

(ii) If r_∨ = k, then

k^2 n^{1−𝛼} {p̂∗_∨ − p_∨} →_D χ²_{(k−1)}.

As with the minimum, we can consider the case where the number of categories k
is unknown. In this case, we replace ŵ ∨ with


ŵ♯_∨ = ∑_{i=1}^{k} e^{n^𝛼 p̂_i} 1(y_i > 0) + 𝓁̂,

for some estimator 𝓁̂ of 𝓁 . Under the assumptions of Theorem 3.1 on 𝓁̂ , a version of
that theorem for the maximum can be verified.

5 Simulations

In this section, we perform a small-scale simulation study to better understand the


finite sample performance of the proposed estimator. We consider both the cases
where the number of categories is known and where it is unknown. When the num-
ber of categories is known, we will compare the finite sample performance of our
estimator p̂∗_0 with that of the MLE p̂_0 and the LSE p̂_0^LS. When the number of categories is unknown, we will compare the performance of p̂♯_0 with modifications of the MLE and the LSE that do not require knowledge of k. Specifically, we will compare with

p̂_{0,u} = y♯_0∕n   and   p̂_{0,u}^LS = (y♯_0 + 1)∕(n + k♯),     (10)

where y♯_0 = min{y_i ∶ y_i > 0, i = 1, 2, … , k} and k♯ = ∑_{i=1}^{k} 1(y_i > 0). Clearly, both p̂_{0,u} and p̂_{0,u}^LS can be evaluated without knowledge of k. Throughout this section,


when evaluating p̂∗_0 and p̂♯_0, we set the tuning parameter to be 𝛼 = 0.49. We chose
this value because it tends to work well in practice and it is neither too large nor
too small. If we take 𝛼 to be large, then (2) implies that the estimator will be almost
indistinguishable from the MLE. On the other hand, if we take 𝛼 to be small, then
the estimator will not work well because it will be too far from convergence.
In our simulations, we consider two distributions. These are the uniform distribu-
tion on k categories, denoted by U(k), and the so-called square-root distribution on k
categories, denoted by S(k). The S(k) distribution has a probability mass function (pmf)
given by
p(i) = C∕√i,   i = 1, 2, … , k,

where C is a normalizing constant. For each distribution, we will consider the case
where k = 10 and k = 20 . The true minimums for these distributions are given in
Table 1.
The simulations were performed as follows. For each of the four distributions and
each sample size n ranging from 1 to 200, we simulated R = 10000 random samples of
size n. For each of these random samples, we evaluated our estimator. This gave us the
values p̂ ∗0,1 , p̂ ∗0,2 , … , p̂ ∗0,R . We used these to estimate the relative root-mean-square error
(relative RMSE) as follows:
Relative RMSE = (1∕p_0) √{(1∕R) ∑_{i=1}^{R} (p̂∗_{0,i} − p_0)^2} = √{(1∕R) ∑_{i=1}^{R} (p̂∗_{0,i}∕p_0 − 1)^2},

where p0 is the true minimum. We repeated this procedure with each of the estima-
tors. Plots of the resulting relative RMSEs for the various distributions and estima-
tors are given in Fig. 1 for the case where the number of categories k is known and
in Fig. 2 for the case where k is unknown. We can see that the proposed estimator
works very well for the uniform distributions in all cases. For the square-root distri-
bution, it also beats the other estimators for a wide range of sample sizes.
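For reference, one cell of this experiment can be reproduced with a short Python sketch (our own code; R is reduced from 10000 to keep the run quick, and the estimator is the same sketch given in Sect. 2, repeated here so the snippet is self-contained).

import numpy as np

rng = np.random.default_rng(1)

def smooth_min_estimate(counts, alpha=0.49):
    # Eq. (3); same sketch as in Sect. 2.
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    expo = np.exp(-n**alpha * (p_hat - p_hat.min()))
    return float(np.sum(p_hat * expo) / np.sum(expo))

def relative_rmse(p, n, estimator, R=1000):
    # Monte Carlo estimate of the relative RMSE of an estimator of p_0 = min(p).
    p = np.asarray(p, dtype=float)
    p0 = p.min()
    samples = rng.multinomial(n, p, size=R)     # R simulated count vectors
    est = np.array([estimator(y) for y in samples])
    return float(np.sqrt(np.mean((est / p0 - 1.0) ** 2)))

# Square-root distribution S(10): p(i) proportional to 1/sqrt(i).
k = 10
p = 1.0 / np.sqrt(np.arange(1, k + 1))
p /= p.sum()
for n in (20, 50, 100, 200):
    print(n, relative_rmse(p, n, smooth_min_estimate))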
It may be interesting to note that, in the case where k is known, the relative RMSE
of the MLE p̂ 0 is exactly 1 for smaller sample sizes. This is because, when we have not
seen all of the categories in our sample, the MLE is exactly 0. In particular, this holds
for any sample size n < k. When the MLE is 0, the LSE p̂_0^LS is exactly 1∕(n + k). Thus, when k is known and n < k, both p̂_0 and p̂_0^LS are fully deterministic functions that ignore the data entirely. This is not the case with p̂∗_0, which is always based on the data.
ignore the data entirely. This is not the case with p̂ ∗0 , which is always based on the data.
When k is unknown, we notice an interesting pattern in the errors of the MLE and
the LSE. There is a dip at the beginning, where the errors decrease quickly before
increasing just as quickly. After this, they level off and eventually begin to decrease

Table 1  True minimums for the distributions considered

Distribution   U(10)   U(20)   S(10)   S(20)
Minimum        0.100   0.050   0.063   0.029


[Figure 1: four panels (Uniform, k=10, Known; Sqrt, k=10, Known; Uniform, k=20, Known; Sqrt, k=20, Known), each plotting relative RMSE against sample sizes from 0 to 200.]

Fig. 1  Plots of the relative RMSE in the case where the number of categories k is known. The solid line is for the proposed estimator p̂∗_0, the dashed line is for the MLE p̂_0, and the dotted line is for the LSE p̂_0^LS

[Figure 2: four panels (Uniform, k=10, Unknown; Sqrt, k=10, Unknown; Uniform, k=20, Unknown; Sqrt, k=20, Unknown), each plotting relative RMSE against sample sizes from 0 to 200.]

Fig. 2  Plots of the relative RMSE in the case where the number of categories k is unknown. The solid line is for the proposed estimator p̂♯_0, the dashed line is for the MLE p̂_{0,u}, and the dotted line is for the LSE p̂_{0,u}^LS

slowly. While it is not clear what causes this, an explanation may be as follows. From
(10), we can see that, for relatively small sample sizes, the numerators of both estima-
tors are likely to be small as we would have only seen very few observations from the
rarest category. As n begins to increase, the numerators should stay small, while the


denominators increase. This would make the estimators decrease and thus get closer to
the value of p0 . However, once n becomes relatively large, the numerators should begin
to increase, and thus, the errors would increase as well. It would not be until n gets even
larger that it would be large enough for the errors to begin to come down due to the
statistical properties of the estimators. If this is correct, then the dip is just an artifact of
the deterministic nature of these estimators. For comparison, in most cases the error of
p̂∗_0 just decreases as the sample size increases. The one exception is under the square-
root distribution, when the number of categories is known. It is not clear what causes
the dip in this case, but it may be a similar issue.

6 Conclusions

In this paper, we have introduced a new method for estimating the minimum prob-
ability in a multinomial distribution. The proposed approach is based on a smooth
approximation of the minimum function. We have considered the cases where the
number of categories is known and where it is unknown. The approach is justified
by our theoretical results, which verify consistency and categorize the asymptotic
distributions. Further, a small-scale simulation study has shown that the method
performs better than several baseline estimators for a wide range of sample sizes,
although not for all sample sizes. A potential extension would be to prove asymp-
totic results in the situation where the number of categories increases with the sam-
ple size. This would be useful for studying the problem when there are a very large
number of categories. Other directions for future research include obtaining theo-
retical results about the finite sample performance of the estimator and proposing
modifications of the estimator with the aim of reducing the bias using, for instance,
a jackknife approach.

Acknowledgements  This paper was inspired by the question of Dr. Zhiyi Zhang (UNC Charlotte): How
to estimate the minimum probability of a multinomial distribution? We thank Ann Marie Stewart for her
editorial help. The authors wish to thank two anonymous referees whose comments have improved the
presentation of this paper. The second author’s work was funded, in part, by the Russian Science Founda-
tion (Project No. 17-11-01098).

Compliance with Ethical Standards 

Conflict of interest.  On behalf of all authors, the corresponding author states that there is no conflict of
interest.

Appendix: Proofs

Throughout the section, let Σ = diag(𝐩) − 𝐩𝐩^T, 𝜎_n = {∇g_n(𝐩)^T Σ ∇g_n(𝐩)}^{1∕2}, Λ = lim_{n→∞} ∇g_n(𝐩), and 𝜎 = {Λ^T Σ Λ}^{1∕2}. It is well known that Σ is a positive definite
matrix, see, e.g., [12]. For simplicity, we use the standard notation O(⋅) , o(⋅) , Op (⋅) ,
and op (⋅) , see, e.g., [13] for the definitions. In the case of matrices and vectors, this
notation should be interpreted as component wise.


It may, at first, appear that Theorem 2.1 can be proved using the delta method.
However, the difficulty lies in the fact that the function gn (⋅) depends on n. For this
reason, the proof requires a more subtle approach. We begin with several lemmas.

Lemma A.1

1. There is a constant 𝜖 > 0 such that p_0 ≤ g_n(𝐩) ≤ p_0 + (k − r) e^{−n^𝛼 𝜖}. When r ≠ k, we can take 𝜖 = min_{j∶p_j>p_0} (p_j − p_0).

2. For any constant 𝛽 ∈ ℝ,

n^𝛽 {g_n(𝐩) − p_0} → 0 as n → ∞.

3. For any 1 ≤ j ≤ k and any constant 𝛽 ∈ ℝ,

n^𝛽 e^{−n^𝛼 p_j} w^{−1} {g_n(𝐩) − p_j} → 0 as n → ∞.

Proof  We begin with the first part. First, assume that r = k. In this case, it is immediate that g_n(𝐩) = k^{−1} = p_0 and the result holds with any 𝜖 > 0. Now assume r ≠ k. In this case,

p_0 = ∑_{i=1}^{k} p_0 e^{−n^𝛼 p_i} w^{−1} ≤ ∑_{i=1}^{k} p_i e^{−n^𝛼 p_i} w^{−1} = g_n(𝐩).

To show the other inequality, note that

e^{−n^𝛼 p_0} w^{−1} = {∑_{i=1}^{k} e^{−n^𝛼 (p_i − p_0)}}^{−1} ≤ (r e^0)^{−1} = r^{−1}     (11)

and that, for any p_i > p_0, we have

e^{n^𝛼 p_i} w = ∑_{j=1}^{k} e^{−n^𝛼 (p_j − p_i)} ≥ e^{−n^𝛼 (p_0 − p_i)} = e^{n^𝛼 (p_i − p_0)} ≥ exp{n^𝛼 min_{j∶p_j>p_0} (p_j − p_0)}.

Setting 𝜖 = min_{j∶p_j>p_0} (p_j − p_0) > 0, it follows that, for p_i > p_0,

e^{−n^𝛼 p_i} w^{−1} ≤ e^{−n^𝛼 𝜖}.     (12)

We thus get

g_n(𝐩) = ∑_{i∶p_i=p_0} p_i e^{−n^𝛼 p_i} w^{−1} + ∑_{i∶p_i>p_0} p_i e^{−n^𝛼 p_i} w^{−1} ≤ r p_0 r^{−1} + (k − r) e^{−n^𝛼 𝜖}.

The second part follows immediately from the first. We now turn to the third part. When p_j = p_0, Eq. (11) and Part 1 imply that e^{−n^𝛼 p_j} w^{−1} ≤ r^{−1} and that there is an 𝜖 > 0 such that

0 ≤ g_n(𝐩) − p_j ≤ (k − r) e^{−n^𝛼 𝜖}.

It follows that when p_j = p_0

0 ≤ n^𝛽 e^{−n^𝛼 p_j} w^{−1} {g_n(𝐩) − p_j} ≤ (k − r) r^{−1} n^𝛽 e^{−n^𝛼 𝜖} → 0 as n → ∞.

On the other hand, when p_j > p_0, by Part 1 there is an 𝜖 > 0 such that

0 ≤ |g_n(𝐩) − p_j| ≤ p_j − p_0 + (k − r) e^{−n^𝛼 𝜖}.

Using this and Eq. (12) gives

0 ≤ |n^𝛽 e^{−n^𝛼 p_j} w^{−1} (g_n(𝐩) − p_j)| ≤ (p_j − p_0) n^𝛽 e^{−n^𝛼 𝜖} + (k − r) n^𝛽 e^{−n^𝛼 (2𝜖)} → 0,

as n → ∞.  ◻

We now consider the case when the probabilities are estimated.

Lemma A.2  Let 𝐩∗_n = 𝐩∗ = (p∗_1 , … , p∗_{k−1}) be a sequence of random vectors with p∗_i ≥ 0 and ∑_{i=1}^{k−1} p∗_i ≤ 1. Let p∗_k = 1 − ∑_{i=1}^{k−1} p∗_i, w∗ = ∑_{i=1}^{k} e^{−n^𝛼 p∗_i}, and assume that 𝐩∗_n → 𝐩 a.s. and n^𝛼 (𝐩∗_n − 𝐩) →_p 0. For every j = 1, 2, … , k, we have

n^𝛼 (p∗_j − p_0) e^{−n^𝛼 p∗_j} (1∕w∗) →_p 0

and

n^𝛼 e^{−n^𝛼 p∗_j} (1∕w∗) {g_n(𝐩∗_n) − p∗_j} →_p 0 as n → ∞.
Proof  First note that, from the definition of w∗, we have

0 ≤ e^{−n^𝛼 p∗_j} (1∕w∗) ≤ 1.     (13)

Assume that p_j = p_0. In this case, the first equation follows from (13) and the fact that n^𝛼 (p∗_j − p_0) = n^𝛼 (p∗_j − p_j) →_p 0. In particular, this completes the proof of the first equation in the case where k = r.

Now assume that k ≠ r. Let p∗_0 = min{p∗_i ∶ i = 1, 2, … , k}, 𝜖 = min_{i∶p_i≠p_0} {p_i − p_0}, and 𝜖∗_n = min_{i∶p_i≠p_0} {p∗_i − p∗_0}. Since 𝐩∗_n → 𝐩 a.s., it follows that 𝜖∗_n → 𝜖 a.s. Further, by arguments similar to the proof of Eq. (12), we can show that, if p_j ≠ p_0 then there is a random variable N, which is finite a.s., such that for n ≥ N

e^{−n^𝛼 p∗_j} (1∕w∗) ≤ e^{−n^𝛼 𝜖∗_n} ≤ e^{−n^𝛼 𝜖∕2}.

It follows that for such j and n ≥ N

n^𝛼 |p∗_j − p_0| e^{−n^𝛼 p∗_j} (1∕w∗) ≤ 2 n^𝛼 e^{−n^𝛼 𝜖∕2} → 0.

This completes the proof of the first limit.

Now assume either k = r or k ≠ r. For the second limit, note that

n^𝛼 e^{−n^𝛼 p∗_j} (1∕w∗) (g_n(𝐩∗_n) − p∗_j)
  = n^𝛼 e^{−n^𝛼 p∗_j} (1∕w∗) (g_n(𝐩∗_n) − p_0) + n^𝛼 e^{−n^𝛼 p∗_j} (1∕w∗) (p_0 − p∗_j)
  = n^𝛼 e^{−n^𝛼 p∗_j} (1∕w∗) ∑_{i=1}^{k} (p∗_i − p_0) e^{−n^𝛼 p∗_i} (1∕w∗) + n^𝛼 e^{−n^𝛼 p∗_j} (1∕w∗) (p_0 − p∗_j).

From here the result follows by the first limit and (13).  ◻

Lemma A.3

1. If r = k, then for each i = 1, 2, … , k − 1,

𝜕g_n(𝐩)∕𝜕p_i = 0.

2. If r ≠ k, then for each i = 1, 2, … , k − 1,

lim_{n→∞} 𝜕g_n(𝐩)∕𝜕p_i = { r^{−1}, if p_k ≠ p_0 and p_i = p_0;  −r^{−1}, if p_k = p_0 and p_i ≠ p_0;  0, otherwise. }     (14)

Proof  When r = k, the result is immediate from (4). Now assume that r ≠ k. We can rearrange Eq. (4) as

𝜕g_n(𝐩)∕𝜕p_i = w^{−1} (e^{−n^𝛼 p_i} − e^{−n^𝛼 p_k}) + r_n,     (15)

where r_n = n^𝛼 e^{−n^𝛼 p_i} w^{−1} {g_n(𝐩) − p_i} − n^𝛼 e^{−n^𝛼 p_k} w^{−1} {g_n(𝐩) − p_k}. Note that Lemma A.1 implies that r_n → 0 as n → ∞. It follows that

lim_{n→∞} 𝜕g_n(𝐩)∕𝜕p_i = lim_{n→∞} e^{−n^𝛼 p_i} w^{−1} − lim_{n→∞} e^{−n^𝛼 p_k} w^{−1}
  = lim_{n→∞} {∑_{j=1}^{k} e^{−n^𝛼 (p_j − p_i)}}^{−1} − lim_{n→∞} {∑_{j=1}^{k} e^{−n^𝛼 (p_j − p_k)}}^{−1}.

Consider the case where p_k ≠ p_0 and p_i = p_0. In this case, the first part has r component(s) in the denominator that are equal to one (e^0) and the remaining k − r terms go to zero individually. However, since p_k ≠ p_0, the denominator of the second fraction has r terms of the form e^{−n^𝛼 (p_0 − p_k)}, which go to +∞, while the other terms go to 0, 1, or +∞. Thus, in this case, the limit is r^{−1} − 0 = r^{−1}. The arguments in the other cases are similar and are thus omitted.  ◻


Lemma A.4  Assume that r ≠ k and let 𝐩∗_n be as in Lemma A.2. In this case, 𝜕g_n(𝐩)∕𝜕p_i = O(1), 𝜕g_n(𝐩∗_n)∕𝜕p_i = O_p(1), 𝜕²g_n(𝐩)∕𝜕p_i𝜕p_j = O(n^𝛼), 𝜕²g_n(𝐩∗_n)∕𝜕p_i𝜕p_j = O_p(n^𝛼), 𝜕³g_n(𝐩)∕𝜕p_𝓁𝜕p_i𝜕p_j = O(n^{2𝛼}), and 𝜕³g_n(𝐩∗_n)∕𝜕p_𝓁𝜕p_i𝜕p_j = O_p(n^{2𝛼}).

Proof  The results for the first derivatives follow immediately from (4), (13), Lemma A.2, and Lemma A.3. Now let 𝛿_{ij} be 1 if i = j and zero otherwise. It is straightforward to verify that

𝜕²g_n(𝐩)∕𝜕p_j𝜕p_i = n^𝛼 w^{−1} (e^{−n^𝛼 p_i} − e^{−n^𝛼 p_k}) 𝜕g_n(𝐩)∕𝜕p_j
  + n^𝛼 w^{−1} (e^{−n^𝛼 p_j} − e^{−n^𝛼 p_k}) 𝜕g_n(𝐩)∕𝜕p_i
  − n^𝛼 e^{−n^𝛼 p_k} w^{−1} [n^𝛼 (g_n(𝐩) − p_k) + 2]
  − 𝛿_{ij} n^𝛼 e^{−n^𝛼 p_i} w^{−1} [n^𝛼 (g_n(𝐩) − p_i) + 2],     (16)

that for 𝓁 ≠ i and 𝓁 ≠ j we have

𝜕³g_n(𝐩)∕𝜕p_𝓁𝜕p_j𝜕p_i = n^𝛼 w^{−1} (e^{−n^𝛼 p_𝓁} − e^{−n^𝛼 p_k}) 𝜕²g_n(𝐩)∕𝜕p_j𝜕p_i
  + n^𝛼 w^{−1} (e^{−n^𝛼 p_i} − e^{−n^𝛼 p_k}) 𝜕²g_n(𝐩)∕𝜕p_𝓁𝜕p_j
  + n^𝛼 w^{−1} (e^{−n^𝛼 p_j} − e^{−n^𝛼 p_k}) 𝜕²g_n(𝐩)∕𝜕p_𝓁𝜕p_i
  − n^{2𝛼} e^{−n^𝛼 p_k} w^{−1} (𝜕g_n(𝐩)∕𝜕p_𝓁 + 𝜕g_n(𝐩)∕𝜕p_j + 𝜕g_n(𝐩)∕𝜕p_i + 1)
  − n^{2𝛼} e^{−n^𝛼 p_k} w^{−1} [n^𝛼 (g_n(𝐩) − p_k) + 2]
  − 𝛿_{ij} n^{2𝛼} e^{−n^𝛼 p_i} w^{−1} 𝜕g_n(𝐩)∕𝜕p_𝓁,     (17)

and that for i = j = 𝓁 we have

𝜕³g_n(𝐩)∕𝜕p_i³ = n^𝛼 w^{−1} (e^{−n^𝛼 p_i} − e^{−n^𝛼 p_k}) (3 𝜕²g_n(𝐩)∕𝜕p_i² + 2 n^𝛼)
  + n^{2𝛼} (𝜕g_n(𝐩)∕𝜕p_i) [1 − 3 w^{−1} (e^{−n^𝛼 p_i} + e^{−n^𝛼 p_k})].     (18)

Combining this with Lemma A.2 and the fact that 0 ≤ w^{−1} e^{−n^𝛼 p_s} ≤ 1 for any 1 ≤ s ≤ k gives the result.  ◻


Lemma A.5  Assume r ≠ k and 0 < 𝛼 < 0.5. Then ∇g_n(𝐩̂) − ∇g_n(𝐩) = O_p(n^{𝛼−1∕2}).
̂ − ∇gn (𝐩) = Op (n𝛼− 2 ).

Proof  By the mean value theorem, we have


n^{1∕2−𝛼} ∇g_n(𝐩̂) = n^{1∕2−𝛼} ∇g_n(𝐩) + n^{−𝛼} ∇²g_n(𝐩∗) √n (𝐩̂ − 𝐩),     (19)

where 𝐩∗ = 𝐩 + diag(𝝎)(𝐩̂ − 𝐩) for some 𝝎 ∈ [0, 1]^{k−1}. Note that by the strong law of large numbers 𝐩̂ → 𝐩 a.s., which implies that 𝐩∗ − 𝐩 → 0 a.s. Similarly, by the multivariate central limit theorem and Slutsky's theorem, n^𝛼 (𝐩̂ − 𝐩) →_p 0, which implies that n^𝛼 (𝐩∗ − 𝐩) →_p 0. Thus, the assumptions of Lemma A.4 are satisfied and that lemma gives

n^{−𝛼} ∇²g_n(𝐩∗) √n (𝐩̂ − 𝐩) = n^{−𝛼} O_p(n^𝛼) O_p(1).

From here, the result is immediate.  ◻

Lemma A.6  Assume that r ≠ k. In this case, 𝜎 > 0 and lim_{n→∞} 𝜎_n^{−1} 𝜎 = 1. Further, if 0 < 𝛼 < 0.5, then 𝜎̂_n^{−1} 𝜎_n →_p 1.

Proof  Since Σ is a positive definite matrix and, by Lemma A.3, Λ ≠ 0, it follows that 𝜎 > 0. From here, the fact that lim_{n→∞} 𝜎_n = 𝜎 gives the first result. Now assume that 0 < 𝛼 < 0.5. It is easy to see that p̂_i p̂_j − p_i p_j = p̂_j (p̂_i − p_i) + p_i (p̂_j − p_j) = O_p(n^{−1∕2}) and p̂_i (1 − p̂_i) − p_i (1 − p_i) = (p̂_i − p_i)(1 − p_i − p̂_i) = O_p(n^{−1∕2}). Thus, Σ̂ = Σ + O_p(n^{−1∕2}). This together with Lemma A.3 and Lemma A.5 leads to

𝜎̂_n^2∕𝜎_n^2 = ∇g_n(𝐩̂)^T Σ̂ ∇g_n(𝐩̂) ∕ ∇g_n(𝐩)^T Σ ∇g_n(𝐩)
  = (∇g_n(𝐩) + O_p(n^{𝛼−1∕2}))^T (Σ + O_p(n^{−1∕2})) (∇g_n(𝐩) + O_p(n^{𝛼−1∕2})) ∕ ∇g_n(𝐩)^T Σ ∇g_n(𝐩)
  = 1 + O_p(n^{𝛼−1∕2}) + O_p(n^{𝛼−1}) + O_p(n^{2𝛼−3∕2}) + O_p(n^{−1∕2}) + O_p(n^{2𝛼−1}) →_p 1,

which completes the proof.  ◻


Lemma A.7  If r ≠ k and 0 < 𝛼 < 0.5, then √n 𝜎̂_n^{−1} {g_n(𝐩̂) − g_n(𝐩)} →_D N(0, 1).

Proof  Taylor's theorem implies that

√n (g_n(𝐩̂) − g_n(𝐩)) = √n (𝐩̂ − 𝐩)^T ∇g_n(𝐩) + 0.5 √n (𝐩̂ − 𝐩)^T n^{−𝛼} ∇²g_n(𝐩∗) n^𝛼 (𝐩̂ − 𝐩),

where 𝐩∗ = 𝐩 + diag(𝝎)(𝐩̂ − 𝐩) for some 𝝎 ∈ [0, 1]^{k−1}. Using Lemma A.4 and arguments similar to those used in the proof of Lemma A.5 gives n^{−𝛼} ∇²g_n(𝐩∗) = O_p(1), √n (𝐩̂ − 𝐩) = O_p(1), and n^𝛼 (𝐩̂ − 𝐩) = o_p(1). It follows that the second term on the RHS above is o_p(1) and hence that

√n (g_n(𝐩̂) − g_n(𝐩)) = √n (𝐩̂ − 𝐩)^T ∇g_n(𝐩) + o_p(1).

It is well known that √n (𝐩̂ − 𝐩) →_D N(0, Σ). Hence

√n (𝐩̂ − 𝐩)^T Λ →_D N(0, Λ^T Σ Λ)

and, by Slutsky's theorem,

√n (g_n(𝐩̂) − g_n(𝐩)) →_D N(0, Λ^T Σ Λ).

By Lemma A.6, 𝜎_n^{−1} 𝜎 → 1 and 𝜎̂_n^{−1} 𝜎_n →_p 1. Hence, the result follows by another application of Slutsky's theorem.  ◻

Lemma A.8  Let 𝐀 = −n^{−𝛼} ∇²g_n(𝐩) and let 𝐈_{k−1} be the (k − 1) × (k − 1) identity matrix. If r = k, then Σ^{1∕2} 𝐀 Σ^{1∕2} = 2 k^{−2} 𝐈_{k−1}.

Proof  Let 𝟏 be the column vector in ℝ^{k−1} with all entries equal to 1. By Eq. (16), we have

𝐀 = −n^{−𝛼} ∇²g_n(𝐩) = 2 k^{−1} [𝟏𝟏^T + 𝐈_{k−1}].     (20)

Note that Σ = diag(𝐩) − 𝐩𝐩^T = k^{−2} [k 𝐈_{k−1} − 𝟏𝟏^T]. It follows that

𝐀Σ = 2 k^{−1} [𝟏𝟏^T + 𝐈_{k−1}] k^{−2} [k 𝐈_{k−1} − 𝟏𝟏^T]
  = 2 k^{−3} [k 𝟏𝟏^T − 𝟏𝟏^T 𝟏𝟏^T + k 𝐈_{k−1} − 𝟏𝟏^T]
  = 2 k^{−3} [k 𝟏𝟏^T − (k − 1) 𝟏𝟏^T + k 𝐈_{k−1} − 𝟏𝟏^T] = 2 k^{−2} 𝐈_{k−1}.

Now multiplying both sides by Σ^{1∕2} on the left and Σ^{−1∕2} on the right gives the result.  ◻

Proof of Theorem 2.1  (i) If r ≠ k, then

√n 𝜎̂_n^{−1} {p̂∗_0 − p_0} = √n 𝜎̂_n^{−1} {g_n(𝐩̂) − g_n(𝐩)} + √n 𝜎̂_n^{−1} {g_n(𝐩) − p_0}.     (21)

The first part approaches a N(0, 1) distribution by Lemma A.7, and the second part approaches zero in probability by Lemmas A.6 and A.1. From there, the first part of the theorem follows by Slutsky's theorem.

(ii) Assume that r = k. In this case, g_n(𝐩) = p_0 = k^{−1}, and by Lemma A.3, ∇g_n(𝐩) = 0. Thus, Taylor's theorem gives

n^{1−𝛼} {p_0 − p̂∗_0} = n^{1−𝛼} {g_n(𝐩) − g_n(𝐩̂)}
  = 0.5 √n (𝐩̂ − 𝐩)^T (−n^{−𝛼}) ∇²g_n(𝐩) √n (𝐩̂ − 𝐩) + r_n,     (22)

where r_n = −6^{−1} ∑_{q=1}^{k−1} ∑_{r=1}^{k−1} ∑_{s=1}^{k−1} √n (p̂_q − p_q) √n (p̂_r − p_r) n^𝛼 (p̂_s − p_s) n^{−2𝛼} 𝜕³g_n(𝐩∗)∕𝜕p_q𝜕p_r𝜕p_s, and 𝐩∗ = 𝐩 + diag(𝝎)(𝐩̂ − 𝐩) for some 𝝎 ∈ [0, 1]^{k−1}. Lemma A.4 implies that n^{−2𝛼} 𝜕³g_n(𝐩∗)∕𝜕p_q𝜕p_r𝜕p_s = O_p(1). Combining this with the facts that √n (p̂_q − p_q) and √n (p̂_r − p_r) are O_p(1) and that, for 𝛼 ∈ (0, 0.5), n^𝛼 (p̂_s − p_s) = o_p(1), it follows that r_n →_p 0.

Let 𝐱_n = √n (𝐩̂ − 𝐩), 𝐓_n = Σ^{−1∕2} 𝐱_n, and 𝐀 = −n^{−𝛼} ∇²g_n(𝐩). Lemma A.8 implies that

𝐱_n^T 𝐀 𝐱_n = (Σ^{−1∕2} 𝐱_n)^T Σ^{1∕2} 𝐀 Σ^{1∕2} (Σ^{−1∕2} 𝐱_n) = 𝐓_n^T (2 k^{−2} 𝐈_{k−1}) 𝐓_n.

Since 𝐱_n →_D N(0, Σ), we have 𝐓_n →_D 𝐓, where 𝐓 ∼ N(0, 𝐈_{k−1}). Let T_i be the ith component of the vector 𝐓. Applying the continuous mapping theorem, we obtain

𝐱_n^T 𝐀 𝐱_n →_D 𝐓^T (2 k^{−2} 𝐈_{k−1}) 𝐓 = 2 k^{−2} ∑_{i=1}^{k−1} T_i^2.

Thus, Eq. (22) becomes

n^{1−𝛼} {p_0 − g_n(𝐩̂)} = 0.5 𝐱_n^T 𝐀 𝐱_n + o_p(1) →_D k^{−2} ∑_{i=1}^{k−1} T_i^2.

The result follows from the fact that the T_i^2 are independent and identically distributed random variables, each following the Chi-square distribution with 1 degree of freedom.  ◻

The proof of Theorem 4.1 is very similar to that of Theorem 2.1 and is thus omit-
ted. However, to help the reader to reconstruct the proof, we note that the partial
derivatives of g∨n can be calculated using the facts that

𝜕g∨_n(𝐩)∕𝜕p_j = 𝜕g_n(−𝐩)∕𝜕p_j   and   𝜕²g∨_n(𝐩)∕𝜕p_i𝜕p_j = −𝜕²g_n(−𝐩)∕𝜕p_i𝜕p_j.

Further, we formulate a version of Lemmas A.1 and A.2 for the maximum.

Lemma A.9

1. There is a constant 𝜖 > 0 such that p_∨ − (k − r_∨) e^{−n^𝛼 𝜖} ≤ g∨_n(𝐩) ≤ p_∨. When r_∨ ≠ k, we can take 𝜖 = min_{j∶p_j<p_∨} (p_∨ − p_j).

2. For any constant 𝛽 ∈ ℝ,

n^𝛽 {g∨_n(𝐩) − p_∨} → 0 as n → ∞.

3. For any 1 ≤ j ≤ k and any constant 𝛽 ∈ ℝ,

n^𝛽 (e^{n^𝛼 p_j}∕w_∨) {g∨_n(𝐩) − p_j} → 0 as n → ∞.

4. If 𝐩∗_n is as in Lemma A.2 and w∗_∨ = ∑_{i=1}^{k} e^{n^𝛼 p∗_i}, then for every j = 1, 2, … , k we have

n^𝛼 (p∗_j − p_∨) e^{n^𝛼 p∗_j} (1∕w∗_∨) →_p 0

and

n^𝛼 e^{n^𝛼 p∗_j} (1∕w∗_∨) {g∨_n(𝐩∗_n) − p∗_j} →_p 0 as n → ∞.

Proof  We only prove the first part, as proofs of the rest are similar to those of Lemmas A.1 and A.2. If r_∨ = k, then g∨_n(𝐩) = 1∕k = p_∨ and the result holds with any 𝜖 > 0. Now, assume that k ≠ r_∨ and let 𝜖 be as defined above. First note that

g∨_n(𝐩) = ∑_{j=1}^{k} p_j e^{n^𝛼 p_j} (1∕w_∨) ≤ p_∨ ∑_{j=1}^{k} e^{n^𝛼 p_j} (1∕w_∨) = p_∨.

Note further that for p_j < p_∨

e^{n^𝛼 p_j}∕w_∨ = {∑_{i=1}^{k} e^{n^𝛼 (p_i − p_j)}}^{−1} ≤ {∑_{i∶p_i=p_∨} e^{n^𝛼 (p_∨ − p_j)}}^{−1} = (1∕r_∨) e^{−n^𝛼 (p_∨ − p_j)} ≤ (1∕r_∨) e^{−n^𝛼 𝜖}.

It follows that

g∨_n(𝐩) ≥ ∑_{i∶p_i=p_∨} p_i e^{n^𝛼 p_i}∕w_∨ = p_∨ r_∨ e^{n^𝛼 p_∨}∕w_∨ = p_∨ + p_∨ (r_∨ e^{n^𝛼 p_∨}∕w_∨ − 1)
  = p_∨ + (p_∨∕w_∨) (r_∨ e^{n^𝛼 p_∨} − ∑_{i∶p_i=p_∨} e^{n^𝛼 p_∨} − ∑_{i∶p_i<p_∨} e^{n^𝛼 p_i})
  = p_∨ − (p_∨∕w_∨) ∑_{i∶p_i<p_∨} e^{n^𝛼 p_i} ≥ p_∨ − (p_∨∕r_∨)(k − r_∨) e^{−n^𝛼 𝜖}.

From here the result follows.  ◻

References
1. Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press, Cambridge
2. Chao A (1984) Nonparametric estimation of the number of classes in a population. Scandinavian J
Stat 11:265–270
3. Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability.
Biometrics 43:783–791
4. Chung K, AitSahlia F (2003) Elementary Probability Theory with Stochastic Processes and an
Introduction to Mathematical Finance, 4th edn. Springer, New York
5. Colwell C (1994) Estimating terrestrial biodiversity through extrapolation. Philos Trans Biol Sci
345:101–118
6. Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) ‘Visual categorization with bags of
keypoints’, In: Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22


7. Grabchak M, Marcon E, Lang G, Zhang Z (2017) The generalized Simpson’s entropy is a measure
of biodiversity. PLOS ONE 12:e0173305
8. Grabchak M, Zhang Z (2018) Asymptotic normality for plug-in estimators of diversity indices on
countable alphabets. J Nonparam Stat 30:774–795
9. Gu Z, Shao M, Li L, Fu Y (2012) ‘Discriminative metric: Schatten norm vs. vector norm’, In: Pro-
ceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp. 1213–1216
10. Haykin S (1994) Neural networks: a comprehensive foundation. Pearson Prentice Hall, New York
11. Lange M, Zühlke D, Holz T, Villmann O (2014) ‘Applications of lp-norms and their smooth approx-
imations for gradient based learning vector quantization’, In: ESANN 2014: Proceedings of the
22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine
Learning, pp. 271–276
12. May WL, Johnson WD (1998) On the singularity of the covariance matrix for estimates of multino-
mial proportions. J Biopharmaceut Stat 8:329–336
13. Shao J (2003) Mathematical Statistics, 2nd edn. Springer, New York
14. Turney P, Littman ML (2003) Measuring praise and criticism: inference of semantic orientation
from association. ACM Trans Inf Syst 21:315–346
15. Zhai C, Lafferty J (2017) A study of smoothing methods for language models applied to ad hoc
information retrieval. ACM SIGIR Forum 51:268–276
16. Zhang Z, Chen C, Zhang J (2020) Estimation of population size in entropic perspective. Commun
Stat Theory Methods 49:307–324

Publisher’s Note  Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
