Estimation of the Minimum Probability of a Multinomial Distribution
Abstract
The estimation of the minimum probability of a multinomial distribution is impor-
tant for a variety of application areas. However, standard estimators such as the
maximum likelihood estimator and the Laplace smoothing estimator fail to function
reasonably in many situations as, for small sample sizes, these estimators are fully
deterministic and completely ignore the data. Inspired by a smooth approximation
of the minimum used in optimization theory, we introduce a new estimator, which
takes advantage of the entire data set. We consider both the cases with a known and
an unknown number of categories. We characterize the asymptotic distributions of the
proposed estimator and conduct a small-scale simulation study to better understand
its finite sample performance.
1 Introduction
Consider a multinomial distribution on k categories with probabilities $p_1, \ldots, p_k$, and let $p_0 = \min\{p_i : i = 1, \ldots, k\}$ denote the minimum probability. Given a sample of size n with category counts $y_1, \ldots, y_k$, the maximum likelihood estimator (MLE) of $p_i$ is $\hat{p}_i = y_i/n$, and the MLE of $p_0$ is

$$\hat{p}_0 = \min\{\hat{p}_i : i = 1, \ldots, k\}.$$
The MLE has the obvious drawback that p̂ 0 is zero when we do not have at least
one observation from each category. To deal with this issue, one generally uses a
modification of the MLE. Perhaps the most prominent modification is the so-called
Laplace smoothing estimator (LSE). This estimator was introduced by Pierre-Simon
Laplace in the late 1700s to estimate the probability that the sun will rise tomorrow,
see, e.g., [4]. The LSE of p0 is given by
$$\hat{p}_0^{\mathrm{LS}} = \min\{(y_i + 1)/(n + k) : i = 1, \ldots, k\}.$$
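To make the two baseline estimators concrete, the following is a minimal sketch in Python with NumPy; the function names are ours and not part of any standard library.

```python
import numpy as np

def mle_min(counts):
    """MLE of the minimum probability: min_i y_i / n (zero whenever some category is unobserved)."""
    y = np.asarray(counts, dtype=float)
    return float((y / y.sum()).min())

def laplace_min(counts):
    """Laplace smoothing estimator: min_i (y_i + 1) / (n + k)."""
    y = np.asarray(counts, dtype=float)
    n, k = y.sum(), y.size
    return float(((y + 1.0) / (n + k)).min())

# Example with k = 4 categories, one of which is unobserved:
y = [5, 3, 2, 0]
print(mle_min(y))      # 0.0 -- the MLE collapses to zero
print(laplace_min(y))  # 1/(10 + 4) ~ 0.071 -- the deterministic value 1/(n + k)
```

As the example shows, once some category is unobserved, neither estimate depends on how the remaining counts are distributed. The following situations illustrate where an estimate of the minimum probability is needed.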
1. One often needs to estimate the probability of a category that is not observed in
a random sample. This is often estimated using the LSE, which always gives the
deterministic value of 1∕(n + k) . On the other hand, a data-driven estimate would
be more reasonable. When the sample size is relatively large, it is reasonable to
assume that the unobserved category has the smallest probability and our estima-
tor could be used in this case. This situation comes up in a variety of applications
including language processing, computer vision, and linguistics, see, e.g., [6, 14],
or [15].
2. In the context of ecology, we may be interested in the probability of finding the
rarest species in an ecosystem. Aside from the intrinsic interest in this question,
this probability may be useful as a diversity index. In ecology, diversity indices
are metrics used to measure and compare the diversity of species in different
ecosystems, see, e.g., [7, 8], and the references therein. Generally one works with
several indices at once as they give different information about the ecosystem.
In particular, the probability of the rarest species may be especially useful when
combined with the index of species richness, which is the total number of species
in the ecosystem.
3. Consider the problem of internet ad placement. There are generally multiple ads
that are shown on the same webpage, and at most one of these will be clicked.
Thus, if there are k − 1 ads, then there are k possible outcomes, with the last
outcome being that no ad is clicked. In this context, the probability of a click on
a given ad is called the click through rate or CTR. Assume that there are k − 1
ads that have been displayed together on the same page and that we have data
on these. Now, the ad company wants to replace one of these with a new ad, for
which there are no data. In this case, the minimum probability of the original
k − 1 ads may give a baseline for the CTR of the new ad. This may be useful for
pricing.
We begin with the case where the number of categories k is known. Let $\mathbf{p} = (p_1, \ldots, p_{k-1})$ and $\hat{\mathbf{p}} = (\hat{p}_1, \ldots, \hat{p}_{k-1})$, and note that $p_k = 1 - \sum_{i=1}^{k-1} p_i$ and $\hat{p}_k = 1 - \sum_{i=1}^{k-1} \hat{p}_i$. Since $p_0 = g(\mathbf{p})$, where $g(\mathbf{p}) = \min\{p_1, p_2, \ldots, p_k\}$, a natural estimator of $p_0$ is given by

$$\hat{p}_0 = g(\hat{\mathbf{p}}) = \min\{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_k\},$$
which is the MLE. However, this estimator takes the value of zero whenever there is
a category that has not been observed. To deal with this issue, we propose approxi-
mating g with a smoother function. Such approximations, which are sometimes
called smooth minimums, are often used in optimization theory, see, e.g., [1, 9, 10],
or [11]. Specifically, we introduce the function
$$g_n(\mathbf{p}) = w^{-1} \sum_{i=1}^{k} p_i e^{-n^{\alpha} p_i}, \qquad (1)$$

where $w = w(\mathbf{p}) = \sum_{j=1}^{k} e^{-n^{\alpha} p_j}$ and $\alpha > 0$ is a tuning parameter. Note that

$$\lim_{n \to \infty} g_n(\mathbf{p}) = g(\mathbf{p}) = p_0. \qquad (2)$$
This suggests estimating $p_0$ by

$$\hat{p}_0^{*} = g_n(\hat{\mathbf{p}}) = \hat{w}^{-1} \sum_{i=1}^{k} \hat{p}_i e^{-n^{\alpha} \hat{p}_i}, \qquad (3)$$

where $\hat{w} = w(\hat{\mathbf{p}}) = \sum_{j=1}^{k} e^{-n^{\alpha} \hat{p}_j}$.
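A direct implementation of (3) is short. The sketch below is ours; the only non-obvious detail is that we factor the smallest exponent out of the weights, which leaves the ratio in (3) unchanged but avoids numerical underflow when $n^{\alpha}\hat{p}_i$ is large.

```python
import numpy as np

def smooth_min_est(counts, alpha=0.49):
    """Smooth-minimum estimator of Eq. (3); `counts` lists all k categories (zeros allowed)."""
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    t = (n ** alpha) * p_hat
    wts = np.exp(-(t - t.min()))          # proportional to exp(-n^alpha * p_hat_i)
    return float((p_hat * wts).sum() / wts.sum())

# Same counts as before: the estimate is strictly positive and uses every observation.
print(smooth_min_est([5, 3, 2, 0]))
```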
We now study the asymptotic distributions of $\hat{p}_0^{*}$. Let $\nabla g_n(\mathbf{p}) = \left(\frac{\partial g_n(\mathbf{p})}{\partial p_1}, \ldots, \frac{\partial g_n(\mathbf{p})}{\partial p_{k-1}}\right)^{T}$. It is straightforward to check that, for $1 \le i \le k-1$,

$$\frac{\partial g_n(\mathbf{p})}{\partial p_i} = e^{-n^{\alpha} p_i} w^{-1}\left[1 + n^{\alpha}\left(g_n(\mathbf{p}) - p_i\right)\right] - e^{-n^{\alpha} p_k} w^{-1}\left[1 + n^{\alpha}\left(g_n(\mathbf{p}) - p_k\right)\right]. \qquad (4)$$
Let r denote the cardinality of the set $\{j : p_j = p_0,\ j = 1, \ldots, k\}$.

Theorem 2.1 Assume that $0 < \alpha < 1/2$ and let $\hat{\sigma}_n = \{\nabla g_n(\hat{\mathbf{p}})^{T} \hat{\Sigma}\, \nabla g_n(\hat{\mathbf{p}})\}^{1/2}$, where $\hat{\Sigma} = \mathrm{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^{T}$.
Clearly, Theorem 2.1 both proves consistency and characterizes the asymp-
totic distributions. Further, it allows us to construct asymptotic confidence inter-
vals for $p_0$. If $r \ne k$, then an approximate $100(1-\gamma)\%$ confidence interval is

$$\hat{p}_0^{*} \pm n^{-1/2}\, \hat{\sigma}_n\, z_{1-\gamma/2},$$

where $z_{1-\gamma/2}$ is the $100(1-\gamma/2)$th percentile of the standard normal distribution. If $r = k$, then the corresponding confidence interval is instead expressed in terms of $\chi^2_{k-1,1-\gamma}$, the $100(1-\gamma)$th percentile of a chi-squared distribution with $k-1$ degrees of freedom.
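For the case $r \ne k$, the interval above can be computed directly from the sample. The sketch below is our own code (SciPy is used only for the normal quantile); it evaluates $\hat{p}_0^{*}$, the gradient in (4), $\hat{\sigma}_n$, and the resulting interval.

```python
import numpy as np
from scipy.stats import norm

def smooth_min_ci(counts, alpha=0.49, gamma=0.05):
    """Point estimate (3) and the normal confidence interval for p_0 (valid when r != k)."""
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p = y / n                                   # all k components of p-hat
    e = np.exp(-(n ** alpha) * p)               # exp(-n^alpha * p_i)
    w = e.sum()
    g = float((p * e).sum() / w)                # p*_0 = g_n(p-hat)
    # Gradient of g_n from Eq. (4), for i = 1, ..., k-1.
    grad = (e[:-1] / w) * (1.0 + n ** alpha * (g - p[:-1])) \
         - (e[-1] / w) * (1.0 + n ** alpha * (g - p[-1]))
    Sigma = np.diag(p[:-1]) - np.outer(p[:-1], p[:-1])
    sigma_n = float(np.sqrt(grad @ Sigma @ grad))
    half = norm.ppf(1.0 - gamma / 2.0) * sigma_n / np.sqrt(n)
    return g, (g - half, g + half)

est, (lo, hi) = smooth_min_ci([40, 30, 20, 10])
print(f"p*_0 = {est:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```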
As far as we know, these are the first confidence intervals for the minimum to
appear in the literature. In fact, to the best of our knowledge, the asymptotic dis-
tributions of the MLE and the LSE have not been established. One might think
that a version of Theorem 2.1 for the MLE could be proved using the asymp-
totic normality of 𝐩̂ and the delta method. However, the delta method cannot be
applied since the minimum function g is not differentiable. Even in the case of
the proposed estimator p̂ ∗0 , where we use a smooth minimum, the delta method
cannot be applied directly since the function gn depends on the sample size n.
Instead, a subtler approach is needed. The detailed proof is given in “Appendix”.
When the number of categories k is unknown, we cannot evaluate $\hat{w}$ directly. Writing $\ell$ for the number of categories that do not appear in the sample, and noting that $e^{-n^{\alpha}\hat{p}_j} = 1$ whenever $y_j = 0$, we have

$$\hat{w} = \sum_{j=1}^{k} e^{-n^{\alpha} \hat{p}_j} = \sum_{j=1}^{k} e^{-n^{\alpha} \hat{p}_j} 1\left(y_j > 0\right) + \ell.$$

Replacing $\ell$ with an estimator $\hat{\ell}$ gives

$$\hat{w}^{\sharp} = \sum_{j=1}^{k} e^{-n^{\alpha} \hat{p}_j} 1\left(y_j > 0\right) + \hat{\ell}$$

and the corresponding estimator of $p_0$,

$$\hat{p}_0^{\sharp} = \frac{1}{\hat{w}^{\sharp}} \sum_{i=1}^{k} \hat{p}_i e^{-n^{\alpha} \hat{p}_i}. \qquad (5)$$
Note that $\hat{p}_0^{\sharp}$ can be evaluated without knowledge of k since $\hat{p}_i = 0$ for any category i that does not appear in the sample.
Now, assume that we have observed $k^{\sharp}$ categories in our sample and note that $k^{\sharp} \le k$. Without loss of generality, assume that these are categories $1, 2, \ldots, k^{\sharp}$. Assume that $k^{\sharp} \ge 2$, let $\hat{\mathbf{p}}^{\sharp} = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{k^{\sharp}-1})$, and note that $\hat{p}_{k^{\sharp}} = 1 - \sum_{i=1}^{k^{\sharp}-1} \hat{p}_i$.
Proof Since k is finite and we eventually have $\hat{\ell} = 0$, there exists an almost surely finite random variable N such that if the sample size $n \ge N$, then $\hat{\ell} = 0$ and we have observed each category at least once. For such n, we have $k^{\sharp} = k$, $\hat{w}^{\sharp} = \hat{w}$, $\hat{\mathbf{p}}^{\sharp} = \hat{\mathbf{p}}$, and $\nabla g_n(\hat{\mathbf{p}}^{\sharp}) = \nabla g_n(\hat{\mathbf{p}})$. It follows that, for such n, $\hat{\sigma}_n^{\sharp} = \hat{\sigma}_n$ and $\hat{p}_0^{\sharp} = \hat{p}_0^{*}$. Hence $\hat{\sigma}_n^{\sharp}/\hat{\sigma}_n \xrightarrow{p} 1$ and $\sqrt{n}\,\hat{\sigma}_n^{-1}\{\hat{p}_0^{*} - \hat{p}_0^{\sharp}\} \xrightarrow{p} 0$. From here the case $r \ne k$ follows by Theorem 2.1 and two applications of Slutsky's theorem. The case $r = k$ is similar and is thus omitted. ◻
There are a number of estimators for 𝓁 available in the literature, see, e.g., [2,
3, 5], or [16] and the references therein. One of the most popular is the so-called
Chao2 estimator [3, 5], which is given by
$$\hat{\ell} = \begin{cases} \dfrac{n-1}{n} \dfrac{f_1^2}{2 f_2} & \text{if } f_2 > 0 \\[2mm] \dfrac{n-1}{n} \dfrac{f_1 (f_1 - 1)}{2} & \text{if } f_2 = 0, \end{cases} \qquad (6)$$

where $f_i = \sum_{j=1}^{k} 1\left(y_j = i\right)$ is the number of categories that were observed exactly i times in the sample. Since k is finite, we will, with probability 1, eventually observe
each category at least three times. Thus, we will eventually have $f_1 = f_2 = 0$ and $\hat{\ell} = 0$, so this estimator satisfies the assumptions of Theorem 3.1. In the rest of the paper, when we use the notation $\hat{p}_0^{\sharp}$ we will mean the estimator where $\hat{\ell}$ is given
by (6).
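The following sketch combines the Chao2 estimator (6) with the estimator (5); it is our own illustration and takes as input only the counts of the categories that actually appeared in the sample.

```python
import numpy as np

def chao2(observed_counts):
    """Chao2 estimate (6) of the number of unobserved categories."""
    y = np.asarray(observed_counts, dtype=float)
    n = y.sum()
    f1, f2 = float(np.sum(y == 1)), float(np.sum(y == 2))
    if f2 > 0:
        return (n - 1) / n * f1 ** 2 / (2 * f2)
    return (n - 1) / n * f1 * (f1 - 1) / 2

def smooth_min_unknown_k(observed_counts, alpha=0.49):
    """Estimator (5): the smooth minimum with Chao2 standing in for the unseen categories."""
    y = np.asarray(observed_counts, dtype=float)
    n = y.sum()
    p = y / n
    e = np.exp(-(n ** alpha) * p)
    w_sharp = e.sum() + chao2(y)   # each unseen category would contribute exp(0) = 1 to w
    return float((p * e).sum() / w_sharp)

# Counts of the k-sharp categories observed in the sample (f1 = 2, f2 = 1 here):
print(smooth_min_unknown_k([12, 7, 4, 2, 1, 1]))
```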
4 Estimation of the Maximum
The problem of estimating the maximum probability is generally easier than that
of estimating the minimum. Nevertheless, it may be interesting to note that our
methodology can be modified to estimate the maximum. Let
$$p_{\vee} = \max\{p_i : i = 1, \ldots, k\}.$$
We begin with the case where the number of categories k is known. We can approxi-
mate the maximum function with a smooth maximum given by
$$g_n^{\vee}(\mathbf{p}) = w_{\vee}^{-1} \sum_{i=1}^{k} p_i e^{n^{\alpha} p_i}, \qquad (7)$$

where $w_{\vee} = w_{\vee}(\mathbf{p}) = \sum_{i=1}^{k} e^{n^{\alpha} p_i}$. Note that
$$g_n^{\vee}(\mathbf{p}) = -g_n(-\mathbf{p}),$$

where $g_n$ is given by (1). It is not difficult to verify that $g_n^{\vee}(\mathbf{p}) \to p_{\vee}$ as $n \to \infty$. This suggests
suggests that we can estimate p∨ by
$$\hat{p}_{\vee}^{*} = g_n^{\vee}(\hat{\mathbf{p}}) = \hat{w}_{\vee}^{-1} \sum_{i=1}^{k} \hat{p}_i e^{n^{\alpha} \hat{p}_i}, \qquad (8)$$

where $\hat{w}_{\vee} = w_{\vee}(\hat{\mathbf{p}}) = \sum_{i=1}^{k} e^{n^{\alpha} \hat{p}_i}$.
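A sketch of (8), analogous to the one given for (3); here we factor out the largest exponent to avoid overflow. Equivalently, one could apply the smooth-minimum code to $-\hat{\mathbf{p}}$ and flip the sign, by the relation $g_n^{\vee}(\mathbf{p}) = -g_n(-\mathbf{p})$.

```python
import numpy as np

def smooth_max_est(counts, alpha=0.49):
    """Smooth-maximum estimator of Eq. (8); mirrors Eq. (3) with the sign of the exponent flipped."""
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    t = (n ** alpha) * p_hat
    wts = np.exp(t - t.max())      # proportional to exp(n^alpha * p_hat_i), overflow-safe
    return float((p_hat * wts).sum() / wts.sum())

# A smooth, data-driven approximation of the empirical maximum 0.4:
print(smooth_max_est([40, 30, 20, 10]))
```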
Let $r_{\vee}$ be the cardinality of the set $\{j : p_j = p_{\vee},\ j = 1, \ldots, k\}$ and let $\nabla g_n^{\vee}(\mathbf{p}) = \left(\frac{\partial g_n^{\vee}(\mathbf{p})}{\partial p_1}, \ldots, \frac{\partial g_n^{\vee}(\mathbf{p})}{\partial p_{k-1}}\right)^{T}$. It is easily verified that, for $1 \le i \le k-1$, the partial derivative $\partial g_n^{\vee}(\mathbf{p})/\partial p_i$ has a form analogous to (4).
Theorem 4.1 Assume that $0 < \alpha < 1/2$ and let $\hat{\sigma}_n^{\vee} = \{\nabla g_n^{\vee}(\hat{\mathbf{p}})^{T} \hat{\Sigma}\, \nabla g_n^{\vee}(\hat{\mathbf{p}})\}^{1/2}$, where $\hat{\Sigma} = \mathrm{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^{T}$.
As with the minimum, we can consider the case where the number of categories k
is unknown. In this case, we replace ŵ ∨ with
$$\hat{w}_{\vee}^{\sharp} = \sum_{i=1}^{k} e^{n^{\alpha} \hat{p}_i} 1\left(y_i > 0\right) + \hat{\ell},$$
for some estimator 𝓁̂ of 𝓁 . Under the assumptions of Theorem 3.1 on 𝓁̂ , a version of
that theorem for the maximum can be verified.
5 Simulations
When k is unknown, the MLE and the LSE must be modified; in the simulations we use

$$\hat{p}_{0,u} = \frac{y_0^{\sharp}}{n} \quad \text{and} \quad \hat{p}_{0,u}^{\mathrm{LS}} = \frac{y_0^{\sharp} + 1}{n + k^{\sharp}}, \qquad (10)$$

where $y_0^{\sharp} = \min\{y_i : y_i > 0,\ i = 1, 2, \ldots, k\}$ and $k^{\sharp} = \sum_{i=1}^{k} 1\left(y_i > 0\right)$. Clearly, both $\hat{p}_{0,u}$ and $\hat{p}_{0,u}^{\mathrm{LS}}$ can be evaluated without knowledge of k. Throughout this section,
when evaluating $\hat{p}_0^{*}$ and $\hat{p}_0^{\sharp}$, we set the tuning parameter to be $\alpha = 0.49$. We chose
this value because it tends to work well in practice and it is neither too large nor
too small. If we take 𝛼 to be large, then (2) implies that the estimator will be almost
indistinguishable from the MLE. On the other hand, if we take 𝛼 to be small, then
the estimator will not work well because it will be too far from convergence.
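The effect of the tuning parameter can be seen numerically. In the toy example below (our own illustration, reusing the smooth-minimum function sketched earlier), the empirical minimum is 0.10.

```python
import numpy as np

def smooth_min_est(counts, alpha):
    # Same helper as in the earlier sketch of Eq. (3).
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    t = (n ** alpha) * p_hat
    wts = np.exp(-(t - t.min()))
    return float((p_hat * wts).sum() / wts.sum())

counts = [40, 30, 20, 10]                 # empirical minimum 0.10, n = 100
for a in (0.05, 0.49, 0.95):
    print(a, round(smooth_min_est(counts, a), 4))
# alpha = 0.05: roughly 0.23, pulled toward the average of the p-hats (far from convergence)
# alpha = 0.49: roughly 0.15
# alpha = 0.95: roughly 0.10, nearly indistinguishable from the MLE
```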
In our simulations, we consider two distributions. These are the uniform distribu-
tion on k categories, denoted by U(k), and the so-called square-root distribution on k
categories, denoted by S(k). The S(k) distribution has a probability mass function (pmf)
given by
$$p(i) = \frac{C}{\sqrt{i}}, \qquad i = 1, 2, \ldots, k,$$
where C is a normalizing constant. For each distribution, we will consider the case
where $k = 10$ and $k = 20$. The true minima for these distributions are given in
Table 1.
The simulations were performed as follows. For each of the four distributions and
each sample size n ranging from 1 to 200, we simulated R = 10000 random samples of
size n. For each of these random samples, we evaluated our estimator. This gave us the
values p̂ ∗0,1 , p̂ ∗0,2 , … , p̂ ∗0,R . We used these to estimate the relative root-mean-square error
(relative RMSE) as follows:
$$\text{Relative RMSE} = \frac{1}{p_0} \sqrt{\frac{1}{R} \sum_{i=1}^{R} \left(\hat{p}_{0,i}^{*} - p_0\right)^2} = \sqrt{\frac{1}{R} \sum_{i=1}^{R} \left(\frac{\hat{p}_{0,i}^{*}}{p_0} - 1\right)^2},$$
where p0 is the true minimum. We repeated this procedure with each of the estima-
tors. Plots of the resulting relative RMSEs for the various distributions and estima-
tors are given in Fig. 1 for the case where the number of categories k is known and
in Fig. 2 for the case where k is unknown. We can see that the proposed estimator
works very well for the uniform distributions in all cases. For the square-root distri-
bution, it also beats the other estimators for a wide range of sample sizes.
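The following sketch reproduces the structure of this experiment on a smaller scale (R = 1000 replications instead of 10,000); it is our own code, and the parameters are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrt_dist(k):
    """The S(k) pmf: p(i) proportional to 1/sqrt(i)."""
    p = 1.0 / np.sqrt(np.arange(1, k + 1))
    return p / p.sum()

def smooth_min_est(counts, alpha=0.49):
    # Same helper as in the earlier sketch of Eq. (3).
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    t = (n ** alpha) * p_hat
    wts = np.exp(-(t - t.min()))
    return float((p_hat * wts).sum() / wts.sum())

def relative_rmse(p, n, estimator, R=1000):
    """Relative RMSE of `estimator` over R multinomial samples of size n drawn from pmf p."""
    p0 = p.min()
    ests = np.array([estimator(rng.multinomial(n, p)) for _ in range(R)])
    return float(np.sqrt(np.mean((ests / p0 - 1.0) ** 2)))

p = sqrt_dist(10)
for n in (25, 50, 100, 200):
    print(n, round(relative_rmse(p, n, smooth_min_est), 3))
```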
It may be interesting to note that, in the case where k is known, the relative RMSE
of the MLE p̂ 0 is exactly 1 for smaller sample sizes. This is because, when we have not
seen all of the categories in our sample, the MLE is exactly 0. In particular, this holds
for any sample size $n < k$. When the MLE is 0, the LSE $\hat{p}_0^{\mathrm{LS}}$ is exactly $1/(n+k)$. Thus, when k is known and $n < k$, both $\hat{p}_0$ and $\hat{p}_0^{\mathrm{LS}}$ are fully deterministic and ignore the data entirely. This is not the case with $\hat{p}_0^{*}$, which is always based on the data.
When k is unknown, we notice an interesting pattern in the errors of the MLE and
the LSE. There is a dip at the beginning, where the errors decrease quickly before
increasing just as quickly. After this, they level off and eventually begin to decrease
Fig. 1 Plots of the relative RMSE (vertical axis) against the sample size n from 0 to 200 (horizontal axis) in the case where the number of categories k is known. The solid line is for the proposed estimator $\hat{p}_0^{*}$, the dashed line is for the MLE $\hat{p}_0$, and the dotted line is for the LSE $\hat{p}_0^{\mathrm{LS}}$.
Fig. 2 Plots of the relative RMSE (vertical axis) against the sample size n (horizontal axis) in the case where the number of categories k is unknown. The solid line is for the proposed estimator $\hat{p}_0^{\sharp}$, the dashed line is for the MLE $\hat{p}_{0,u}$, and the dotted line is for the LSE $\hat{p}_{0,u}^{\mathrm{LS}}$.
slowly. While it is not clear what causes this, an explanation may be as follows. From
(10), we can see that, for relatively small sample sizes, the numerators of both estima-
tors are likely to be small as we would have only seen very few observations from the
rarest category. As n begins to increase, the numerators should stay small, while the
denominators increase. This would make the estimators decrease and thus get closer to
the value of p0 . However, once n becomes relatively large, the numerators should begin
to increase, and thus, the errors would increase as well. Only when n becomes considerably larger do the statistical properties of the estimators take over and the errors begin to decrease again. If this is correct, then the dip is just an artifact of the deterministic nature of these estimators. For comparison, in most cases the error of $\hat{p}_0^{*}$ simply decreases as the sample size increases. The one exception is under the square-
root distribution, when the number of categories is known. It is not clear what causes
the dip in this case, but it may be a similar issue.
6 Conclusions
In this paper, we have introduced a new method for estimating the minimum prob-
ability in a multinomial distribution. The proposed approach is based on a smooth
approximation of the minimum function. We have considered the cases where the
number of categories is known and where it is unknown. The approach is justified
by our theoretical results, which verify consistency and characterize the asymptotic
distributions. Further, a small-scale simulation study has shown that the method
performs better than several baseline estimators for a wide range of sample sizes,
although not for all sample sizes. A potential extension would be to prove asymp-
totic results in the situation where the number of categories increases with the sam-
ple size. This would be useful for studying the problem when there are a very large
number of categories. Other directions for future research include obtaining theo-
retical results about the finite sample performance of the estimator and proposing
modifications of the estimator with the aim of reducing the bias using, for instance,
a jackknife approach.
Acknowledgements This paper was inspired by the question of Dr. Zhiyi Zhang (UNC Charlotte): How
to estimate the minimum probability of a multinomial distribution? We thank Ann Marie Stewart for her
editorial help. The authors wish to thank two anonymous referees whose comments have improved the
presentation of this paper. The second author’s work was funded, in part, by the Russian Science Founda-
tion (Project No. 17-11-01098).
Conflict of interest. On behalf of all authors, the corresponding author states that there is no conflict of
interest.
Appendix: Proofs
Throughout this section, let $\Sigma = \mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{T}$, $\sigma_n = \sqrt{\nabla g_n(\mathbf{p})^{T}\, \Sigma\, \nabla g_n(\mathbf{p})}$, $\Lambda = \lim_{n\to\infty} \nabla g_n(\mathbf{p})$, and $\sigma = \sqrt{\Lambda^{T} \Sigma \Lambda}$. It is well known that $\Sigma$ is a positive definite
matrix, see, e.g., [12]. For simplicity, we use the standard notation O(⋅) , o(⋅) , Op (⋅) ,
and op (⋅) , see, e.g., [13] for the definitions. In the case of matrices and vectors, this
notation should be interpreted as component wise.
It may, at first, appear that Theorem 2.1 can be proved using the delta method.
However, the difficulty lies in the fact that the function gn (⋅) depends on n. For this
reason, the proof requires a more subtle approach. We begin with several lemmas.
Lemma A.1
1. There is a constant $\epsilon > 0$ such that $p_0 \le g_n(\mathbf{p}) \le p_0 + (k-r) e^{-n^{\alpha} \epsilon}$.
2. When $r \ne k$, $n^{\beta}\{g_n(\mathbf{p}) - p_0\} \to 0$ as $n \to \infty$ for any $\beta > 0$.
3. For every $j = 1, 2, \ldots, k$, $n^{\alpha} e^{-n^{\alpha} p_j} w^{-1}\{g_n(\mathbf{p}) - p_j\} \to 0$ as $n \to \infty$.
Proof We begin with the first part. First, assume that $r = k$. In this case, it is immediate that $g_n(\mathbf{p}) = k^{-1} = p_0$ and the result holds with any $\epsilon > 0$. Now assume $r \ne k$. In this case,

$$p_0 = \sum_{i=1}^{k} p_0 e^{-n^{\alpha} p_i} w^{-1} \le \sum_{i=1}^{k} p_i e^{-n^{\alpha} p_i} w^{-1} = g_n(\mathbf{p}).$$

Setting $\epsilon = \min_{j : p_j > p_0} (p_j - p_0) > 0$, it follows that, for $p_i > p_0$,

$$e^{-n^{\alpha} p_i} w^{-1} \le e^{-n^{\alpha} \epsilon}. \qquad (12)$$

We thus get

$$g_n(\mathbf{p}) = \sum_{i : p_i = p_0} p_i e^{-n^{\alpha} p_i} w^{-1} + \sum_{i : p_i > p_0} p_i e^{-n^{\alpha} p_i} w^{-1} \le r p_0 r^{-1} + (k-r) e^{-n^{\alpha} \epsilon} = p_0 + (k-r) e^{-n^{\alpha} \epsilon}.$$
The second part follows immediately from the first. We now turn to the third part. When $p_j = p_0$, Eq. (11) and Part 1 imply that $e^{-n^{\alpha} p_j} w^{-1} \le r^{-1}$ and that there is an $\epsilon > 0$ such that

$$0 \le g_n(\mathbf{p}) - p_j \le (k-r) e^{-n^{\alpha} \epsilon}.$$

On the other hand, when $p_j > p_0$, by Part 1 there is an $\epsilon > 0$ such that

$$0 \le |g_n(\mathbf{p}) - p_j| \le p_j - p_0 + (k-r) e^{-n^{\alpha} \epsilon},$$

while, by (12), $e^{-n^{\alpha} p_j} w^{-1} \le e^{-n^{\alpha} \epsilon}$. In either case it follows that $n^{\alpha} e^{-n^{\alpha} p_j} w^{-1}\{g_n(\mathbf{p}) - p_j\} \to 0$ as $n \to \infty$. ◻
Lemma A.2 Let $\mathbf{p}^{*}_n = (p^{*}_1, \ldots, p^{*}_k)$ be random vectors such that $\mathbf{p}^{*}_n \to \mathbf{p}$ a.s. and $n^{\alpha}(\mathbf{p}^{*}_n - \mathbf{p}) \xrightarrow{p} 0$, and let $w^{*} = w(\mathbf{p}^{*}_n)$. For every $j = 1, 2, \ldots, k$, we have

$$n^{\alpha}\left(p^{*}_j - p_0\right) e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}} \xrightarrow{p} 0$$

and

$$n^{\alpha} e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}}\left\{g_n(\mathbf{p}^{*}_n) - p^{*}_j\right\} \xrightarrow{p} 0 \quad \text{as } n \to \infty.$$
Proof First note that, from the definition of $w^{*}$, we have

$$0 \le e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}} \le 1. \qquad (13)$$
Assume that $p_j = p_0$. In this case, the first equation follows from (13) and the fact that $n^{\alpha}\left(p^{*}_j - p_0\right) = n^{\alpha}\left(p^{*}_j - p_j\right) \xrightarrow{p} 0$. In particular, this completes the proof of the first equation in the case where $k = r$.

Now assume that $k \ne r$. Let $p^{*}_0 = \min\{p^{*}_i : i = 1, 2, \ldots, k\}$, $\epsilon = \min_{i : p_i \ne p_0}\{p_i - p_0\}$, and $\epsilon^{*}_n = \min_{i : p_i \ne p_0}\{p^{*}_i - p^{*}_0\}$. Since $\mathbf{p}^{*}_n \to \mathbf{p}$ a.s., it follows that $\epsilon^{*}_n \to \epsilon$ a.s. Further, by arguments similar to the proof of Eq. (12), we can show that, if $p_j \ne p_0$, then there is a random variable N, which is finite a.s., such that for $n \ge N$

$$e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}} \le e^{-n^{\alpha} \epsilon^{*}_n} \le e^{-n^{\alpha} \epsilon/2}.$$

It follows that for such j and $n \ge N$

$$n^{\alpha}\left|p^{*}_j - p_0\right| e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}} \le 2 n^{\alpha} e^{-n^{\alpha} \epsilon/2} \to 0.$$
This completes the proof of the first limit. The second limit holds whether or not $k = r$. Note that

$$\begin{aligned} n^{\alpha} e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}}\left(g_n(\mathbf{p}^{*}_n) - p^{*}_j\right) &= n^{\alpha} e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}}\left(g_n(\mathbf{p}^{*}_n) - p_0\right) + n^{\alpha} e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}}\left(p_0 - p^{*}_j\right) \\ &= n^{\alpha} e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}} \sum_{i=1}^{k} \left(p^{*}_i - p_0\right) e^{-n^{\alpha} p^{*}_i} \frac{1}{w^{*}} + n^{\alpha} e^{-n^{\alpha} p^{*}_j} \frac{1}{w^{*}}\left(p_0 - p^{*}_j\right). \end{aligned}$$

From here the result follows by the first limit and (13). ◻
Lemma A.3 For $1 \le i \le k-1$,

$$\lim_{n \to \infty} \frac{\partial g_n(\mathbf{p})}{\partial p_i} = \begin{cases} r^{-1}, & \text{if } p_k \ne p_0 \text{ and } p_i = p_0 \\ -r^{-1}, & \text{if } p_k = p_0 \text{ and } p_i \ne p_0 \\ 0, & \text{otherwise.} \end{cases} \qquad (14)$$
Proof When $r = k$, the result is immediate from (4). Now assume that $r \ne k$. We can rearrange Eq. (4) as

$$\frac{\partial g_n(\mathbf{p})}{\partial p_i} = w^{-1}\left(e^{-n^{\alpha} p_i} - e^{-n^{\alpha} p_k}\right) + r_n, \qquad (15)$$

where $r_n = n^{\alpha} e^{-n^{\alpha} p_i} w^{-1}\{g_n(\mathbf{p}) - p_i\} - n^{\alpha} e^{-n^{\alpha} p_k} w^{-1}\{g_n(\mathbf{p}) - p_k\}$. Note that Lemma A.1 implies that $r_n \to 0$ as $n \to \infty$.
Consider the case where $p_k \ne p_0$ and $p_i = p_0$. Writing $e^{-n^{\alpha} p_i} w^{-1} = 1/\sum_j e^{-n^{\alpha}(p_j - p_i)}$, the denominator of the first fraction has r terms equal to one ($e^{0}$), while the remaining $k-r$ terms go to zero individually, so the first fraction converges to $r^{-1}$. However, since $p_k \ne p_0$, the denominator of the second fraction has r terms of the form $e^{-n^{\alpha}(p_0 - p_k)}$, which go to $+\infty$, while the other terms go to 0, 1, or $+\infty$; hence the second fraction converges to 0. Thus, in this case, the limit is $r^{-1} - 0 = r^{-1}$. The arguments in the other cases are similar and are thus omitted. ◻
Lemma A.4 Assume that $r \ne k$ and let $\mathbf{p}^{*}_n$ be as in Lemma A.2. In this case, $\frac{\partial g_n(\mathbf{p})}{\partial p_i} = O(1)$, $\frac{\partial g_n(\mathbf{p}^{*}_n)}{\partial p_i} = O_p(1)$, $\frac{\partial^2 g_n(\mathbf{p})}{\partial p_i \partial p_j} = O(n^{\alpha})$, $\frac{\partial^2 g_n(\mathbf{p}^{*}_n)}{\partial p_i \partial p_j} = O_p(n^{\alpha})$, $\frac{\partial^3 g_n(\mathbf{p})}{\partial p_{\ell} \partial p_i \partial p_j} = O(n^{2\alpha})$, and $\frac{\partial^3 g_n(\mathbf{p}^{*}_n)}{\partial p_{\ell} \partial p_i \partial p_j} = O_p(n^{2\alpha})$.
Proof The results for the first derivatives follow immediately from (4), (13), Lemma A.2, and Lemma A.3. Now let $\delta_{ij}$ be 1 if $i = j$ and zero otherwise. It is straightforward to verify that

$$\begin{aligned} \frac{\partial^3 g_n(\mathbf{p})}{\partial p_{\ell} \partial p_j \partial p_i} =\ & n^{\alpha} w^{-1}\left(e^{-n^{\alpha} p_{\ell}} - e^{-n^{\alpha} p_k}\right) \frac{\partial^2 g_n(\mathbf{p})}{\partial p_j \partial p_i} \\ & + n^{\alpha} w^{-1}\left(e^{-n^{\alpha} p_i} - e^{-n^{\alpha} p_k}\right) \frac{\partial^2 g_n(\mathbf{p})}{\partial p_{\ell} \partial p_j} \\ & + n^{\alpha} w^{-1}\left(e^{-n^{\alpha} p_j} - e^{-n^{\alpha} p_k}\right) \frac{\partial^2 g_n(\mathbf{p})}{\partial p_{\ell} \partial p_i} \\ & - n^{2\alpha} e^{-n^{\alpha} p_k} w^{-1}\left(\frac{\partial g_n(\mathbf{p})}{\partial p_{\ell}} + \frac{\partial g_n(\mathbf{p})}{\partial p_j} + \frac{\partial g_n(\mathbf{p})}{\partial p_i} + 1\right) \\ & - n^{2\alpha} e^{-n^{\alpha} p_k} w^{-1}\left[n^{\alpha}\left(g_n(\mathbf{p}) - p_k\right) + 2\right] \\ & - \delta_{ij}\, n^{2\alpha} e^{-n^{\alpha} p_i} w^{-1} \frac{\partial g_n(\mathbf{p})}{\partial p_{\ell}}. \end{aligned} \qquad (17)$$

Combining this with Lemma A.2 and the fact that $0 \le w^{-1} e^{-n^{\alpha} p_s} \le 1$ for any s completes the proof. ◻
By Taylor's theorem,

$$n^{\frac{1}{2}-\alpha}\, \nabla g_n(\hat{\mathbf{p}}) = n^{\frac{1}{2}-\alpha}\, \nabla g_n(\mathbf{p}) + n^{-\alpha}\, \nabla^2 g_n(\mathbf{p}^{*})\, \sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p}), \qquad (19)$$

where $\mathbf{p}^{*} = \mathbf{p} + \mathrm{diag}(\boldsymbol{\omega})(\hat{\mathbf{p}} - \mathbf{p})$ for some $\boldsymbol{\omega} \in [0,1]^{k-1}$. Note that by the strong law of large numbers $\hat{\mathbf{p}} \to \mathbf{p}$ a.s., which implies that $\mathbf{p}^{*} - \mathbf{p} \to 0$ a.s. Similarly, by the multivariate central limit theorem and Slutsky's theorem, $n^{\alpha}(\hat{\mathbf{p}} - \mathbf{p}) \xrightarrow{p} 0$, which implies that $n^{\alpha}(\mathbf{p}^{*} - \mathbf{p}) \xrightarrow{p} 0$. Thus, the assumptions of Lemma A.4 are satisfied and that lemma gives

$$n^{-\alpha}\, \nabla^2 g_n(\mathbf{p}^{*})\, \sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p}) = n^{-\alpha} O_p(n^{\alpha}) O_p(1) = O_p(1).$$
Lemma A.6 Assume that $r \ne k$. In this case, $\sigma > 0$ and $\lim_{n\to\infty} \sigma_n^{-1} \sigma = 1$. Further, if $0 < \alpha < 0.5$, then $\hat{\sigma}_n^{-1} \sigma_n \xrightarrow{p} 1$.

Proof Using Lemma A.5 and the fact that $\hat{\Sigma} = \Sigma + O_P(n^{-1/2})$, we have

$$\frac{\hat{\sigma}_n^2}{\sigma_n^2} = \frac{\nabla g_n(\hat{\mathbf{p}})^{T}\, \hat{\Sigma}\, \nabla g_n(\hat{\mathbf{p}})}{\nabla g_n(\mathbf{p})^{T}\, \Sigma\, \nabla g_n(\mathbf{p})} = \frac{\left(\nabla g_n(\mathbf{p}) + O_p(n^{\alpha - \frac{1}{2}})\right)^{T} \left(\Sigma + O_P(n^{-\frac{1}{2}})\right) \left(\nabla g_n(\mathbf{p}) + O_p(n^{\alpha - \frac{1}{2}})\right)}{\nabla g_n(\mathbf{p})^{T}\, \Sigma\, \nabla g_n(\mathbf{p})} = 1 + O_p(n^{\alpha - \frac{1}{2}}) + O_p(n^{\alpha - 1}) + O_p(n^{2\alpha - \frac{3}{2}}) + O_p(n^{-\frac{1}{2}}) + O_p(n^{2\alpha - 1}) \xrightarrow{p} 1.$$
Expanding $g_n(\hat{\mathbf{p}})$ around $\mathbf{p}$ by Taylor's theorem gives

$$\sqrt{n}\left(g_n(\hat{\mathbf{p}}) - g_n(\mathbf{p})\right) = \sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p})^{T}\, \nabla g_n(\mathbf{p}) + \frac{1}{2}\sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p})^{T}\, \nabla^2 g_n(\mathbf{p}^{*})\,(\hat{\mathbf{p}} - \mathbf{p}),$$

where $\mathbf{p}^{*} = \mathbf{p} + \mathrm{diag}(\boldsymbol{\omega})(\hat{\mathbf{p}} - \mathbf{p})$ for some $\boldsymbol{\omega} \in [0,1]^{k-1}$. Using Lemma A.4 and arguments similar to those used in the proof of Lemma A.5 gives $n^{-\alpha}\, \nabla^2 g_n(\mathbf{p}^{*}) = O_p(1)$, $\sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p}) = O_p(1)$, and $n^{\alpha}(\hat{\mathbf{p}} - \mathbf{p}) = o_p(1)$. It follows that the second term on the RHS above is $o_p(1)$ and hence that

$$\sqrt{n}\left(g_n(\hat{\mathbf{p}}) - g_n(\mathbf{p})\right) = \sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p})^{T}\, \nabla g_n(\mathbf{p}) + o_p(1).$$

It is well known that $\sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p}) \xrightarrow{D} N(0, \Sigma)$. Hence

$$\sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p})^{T} \Lambda \xrightarrow{D} N\left(0, \Lambda^{T} \Sigma \Lambda\right).$$
Lemma A.8 Let $\mathbf{A} = -n^{-\alpha}\, \nabla^2 g_n(\mathbf{p})$ and let $\mathbf{I}_{k-1}$ be the $(k-1)\times(k-1)$ identity matrix. If $r = k$, then $\Sigma^{1/2} \mathbf{A} \Sigma^{1/2} = 2 k^{-2} \mathbf{I}_{k-1}$.

Proof Let $\mathbf{1}$ be the column vector in $\mathbb{R}^{k-1}$ with all entries equal to 1. By Eq. (16), when $r = k$ we have $\mathbf{A} = \frac{2}{k}\left(\mathbf{I}_{k-1} + \mathbf{1}\mathbf{1}^{T}\right)$, while $\Sigma = k^{-1}\mathbf{I}_{k-1} - k^{-2}\mathbf{1}\mathbf{1}^{T}$, so that, using $\mathbf{1}^{T}\mathbf{1} = k-1$,

$$\mathbf{A}\Sigma = 2 k^{-2}\, \mathbf{I}_{k-1}.$$

Now multiplying both sides by $\Sigma^{1/2}$ on the left and $\Sigma^{-1/2}$ on the right gives the result. ◻
Proof of Theorem 2.1 (i) Assume that $r \ne k$ and write $\sqrt{n}\,\hat{\sigma}_n^{-1}\left(\hat{p}_0^{*} - p_0\right) = \sqrt{n}\,\hat{\sigma}_n^{-1}\left(g_n(\hat{\mathbf{p}}) - g_n(\mathbf{p})\right) + \sqrt{n}\,\hat{\sigma}_n^{-1}\left(g_n(\mathbf{p}) - p_0\right)$. The first part approaches a N(0, 1) distribution by Lemma A.7, and the second part
approaches zero in probability by Lemmas A.6 and A.1. From there, the first part of
the theorem follows by Slutsky’s theorem.
(ii) Assume that $r = k$. In this case, $g_n(\mathbf{p}) = p_0 = k^{-1}$ and, by Lemma A.3, $\nabla g_n(\mathbf{p}) = 0$. Thus, Taylor's theorem gives a second-order expansion of $g_n(\hat{\mathbf{p}})$ around $\mathbf{p}$, with a third-order remainder term $r_n$ evaluated at $\mathbf{p}^{*} = \mathbf{p} + \mathrm{diag}(\boldsymbol{\omega})(\hat{\mathbf{p}} - \mathbf{p})$ for some $\boldsymbol{\omega} \in [0,1]^{k-1}$. Lemma A.4 implies that $n^{-2\alpha}\, \frac{\partial^3 g_n(\mathbf{p}^{*})}{\partial p_q \partial p_r \partial p_s} = O_p(1)$. Combining this with the facts that $\sqrt{n}\,(\hat{p}_q - p_q)$ and $\sqrt{n}\,(\hat{p}_r - p_r)$ are $O_p(1)$ and that, for $\alpha \in (0, 0.5)$, $n^{\alpha}(\hat{p}_s - p_s) = o_p(1)$, it follows that $r_n \xrightarrow{p} 0$.
Let $\mathbf{x}_n = \sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p})$, $\mathbf{T}_n = \Sigma^{-\frac{1}{2}} \mathbf{x}_n$, and $\mathbf{A} = -n^{-\alpha}\, \nabla^2 g_n(\mathbf{p})$. Lemma A.8 implies that

$$\mathbf{x}_n^{T} \mathbf{A} \mathbf{x}_n = \left(\Sigma^{-\frac{1}{2}} \mathbf{x}_n\right)^{T} \Sigma^{\frac{1}{2}} \mathbf{A} \Sigma^{\frac{1}{2}} \left(\Sigma^{-\frac{1}{2}} \mathbf{x}_n\right) = \mathbf{T}_n^{T}\left(2 k^{-2} \mathbf{I}_{k-1}\right) \mathbf{T}_n.$$

Since $\mathbf{x}_n \xrightarrow{D} N(0, \Sigma)$, we have $\mathbf{T}_n \xrightarrow{D} \mathbf{T}$, where $\mathbf{T} \sim N(0, \mathbf{I}_{k-1})$. Let $T_i$ be the ith component of the vector $\mathbf{T}$. Applying the continuous mapping theorem, we obtain

$$\mathbf{x}_n^{T} \mathbf{A} \mathbf{x}_n \xrightarrow{D} \mathbf{T}^{T}\left(2 k^{-2} \mathbf{I}_{k-1}\right) \mathbf{T} = 2 k^{-2} \sum_{i=1}^{k-1} T_i^2.$$

Therefore,

$$n^{1-\alpha}\left\{p_0 - g_n(\hat{\mathbf{p}})\right\} = 0.5\, \mathbf{x}_n^{T} \mathbf{A} \mathbf{x}_n + o_p(1) \xrightarrow{D} k^{-2} \sum_{i=1}^{k-1} T_i^2.$$

The result follows from the fact that the $T_i^2$ are independent and identically distributed random variables, each following the chi-square distribution with 1 degree of freedom. ◻
The proof of Theorem 4.1 is very similar to that of Theorem 2.1 and is thus omitted. However, to help the reader to reconstruct the proof, we note that the partial derivatives of $g_n^{\vee}$ can be calculated using the relation $g_n^{\vee}(\mathbf{p}) = -g_n(-\mathbf{p})$ given in Sect. 4. Further, we formulate a version of Lemmas A.1 and A.2 for the maximum.
Lemma A.9
1. There is a constant $\epsilon > 0$ such that $p_{\vee} - (k - r_{\vee}) e^{-n^{\alpha} \epsilon} \le g_n^{\vee}(\mathbf{p}) \le p_{\vee}$. When $r_{\vee} \ne k$, $n^{\beta}\{g_n^{\vee}(\mathbf{p}) - p_{\vee}\} \to 0$ as $n \to \infty$ for any $\beta > 0$.
2. Let $\mathbf{p}^{*}_n$ be as in Lemma A.2 and let $w_{\vee}^{*} = w_{\vee}(\mathbf{p}^{*}_n)$. For every $j = 1, 2, \ldots, k$, we have

$$n^{\alpha}\left(p^{*}_j - p_{\vee}\right) e^{n^{\alpha} p^{*}_j} \frac{1}{w_{\vee}^{*}} \xrightarrow{p} 0$$

and

$$n^{\alpha} e^{n^{\alpha} p^{*}_j} \frac{1}{w_{\vee}^{*}}\left\{g_n^{\vee}(\mathbf{p}^{*}_n) - p^{*}_j\right\} \xrightarrow{p} 0 \quad \text{as } n \to \infty.$$
Proof We only prove the first part, as proofs of the rest are similar to those of Lemmas A.1 and A.2. If $r_{\vee} = k$, then $g_n^{\vee}(\mathbf{p}) = 1/k = p_{\vee}$ and the result holds with any $\epsilon > 0$. Now, assume that $k \ne r_{\vee}$ and let $\epsilon = \min_{i : p_i < p_{\vee}} (p_{\vee} - p_i)$. First note that

$$g_n^{\vee}(\mathbf{p}) = \sum_{j=1}^{k} p_j e^{n^{\alpha} p_j} \frac{1}{w_{\vee}} \le p_{\vee} \sum_{j=1}^{k} e^{n^{\alpha} p_j} \frac{1}{w_{\vee}} = p_{\vee}.$$

It follows that

$$\begin{aligned} g_n^{\vee}(\mathbf{p}) &\ge \sum_{i : p_i = p_{\vee}} p_i e^{n^{\alpha} p_i} \frac{1}{w_{\vee}} = p_{\vee} \frac{r_{\vee} e^{n^{\alpha} p_{\vee}}}{w_{\vee}} = p_{\vee} + p_{\vee}\left(\frac{r_{\vee} e^{n^{\alpha} p_{\vee}}}{w_{\vee}} - 1\right) \\ &= p_{\vee} + \frac{p_{\vee}}{w_{\vee}}\left(r_{\vee} e^{n^{\alpha} p_{\vee}} - \sum_{i : p_i = p_{\vee}} e^{n^{\alpha} p_{\vee}} - \sum_{i : p_i < p_{\vee}} e^{n^{\alpha} p_i}\right) \\ &= p_{\vee} - \frac{p_{\vee}}{w_{\vee}} \sum_{i : p_i < p_{\vee}} e^{n^{\alpha} p_i} \ge p_{\vee} - \frac{p_{\vee}}{r_{\vee}}\left(k - r_{\vee}\right) e^{-n^{\alpha} \epsilon}. \end{aligned}$$
References
1. Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press, Cambridge
2. Chao A (1984) Nonparametric estimation of the number of classes in a population. Scandinavian J
Stat 11:265–270
3. Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability.
Biometrics 43:783–791
4. Chung K, AitSahlia F (2003) Elementary Probability Theory with Stochastic Processes and an
Introduction to Mathematical Finance, 4th edn. Springer, New York
5. Colwell C (1994) Estimating terrestrial biodiversity through extrapolation. Philos Trans Biol Sci
345:101–118
6. Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) ‘Visual categorization with bags of
keypoints’, In: Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22
7. Grabchak M, Marcon E, Lang G, Zhang Z (2017) The generalized Simpson’s entropy is a measure
of biodiversity. PLOS ONE 12:e0173305
8. Grabchak M, Zhang Z (2018) Asymptotic normality for plug-in estimators of diversity indices on
countable alphabets. J Nonparam Stat 30:774–795
9. Gu Z, Shao M, Li L, Fu Y (2012) ‘Discriminative metric: Schatten norm vs. vector norm’, In: Pro-
ceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp. 1213–1216
10. Haykin S (1994) Neural networks: a comprehensive foundation. Pearson Prentice Hall, New York
11. Lange M, Zühlke D, Holz T, Villmann O (2014) ‘Applications of lp-norms and their smooth approx-
imations for gradient based learning vector quantization’, In: ESANN 2014: Proceedings of the
22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine
Learning, pp. 271–276
12. May WL, Johnson WD (1998) On the singularity of the covariance matrix for estimates of multino-
mial proportions. J Biopharmaceut Stat 8:329–336
13. Shao J (2003) Mathematical Statistics, 2nd edn. Springer, New York
14. Turney P, Littman ML (2003) Measuring praise and criticism: inference of semantic orientation
from association. ACM Trans Inf Syst 21:315–346
15. Zhai C, Lafferty J (2017) A study of smoothing methods for language models applied to ad hoc
information retrieval. ACM SIGIR Forum 51:268–276
16. Zhang Z, Chen C, Zhang J (2020) Estimation of population size in entropic perspective. Commun
Stat Theory Methods 49:307–324