
Chapter 10

U-statistics

When one is willing to assume the existence of a simple random sample X1 , . . . , Xn , U-


statistics generalize common notions of unbiased estimation such as the sample mean and
the unbiased sample variance (in fact, the “U” in “U-statistics” stands for “unbiased”). Even
though U-statistics may be considered a bit of a special topic, their study in a large-sample
theory course has side benefits that make them valuable pedagogically. The theory of U-
statistics nicely demonstrates the application of some of the large-sample topics presented
thus far. Furthermore, the study of U-statistics even enables a theoretical discussion of sta-
tistical functionals, which gives insight into the common modern practice of bootstrapping.

10.1 Statistical Functionals and V-Statistics

Let S be a set of cumulative distribution functions and let T denote a mapping from S into
the real numbers R. Then T is called a statistical functional. We may think of statistical
functionals as parameters of interest. If, say, we are given a simple random sample from
a distribution with unknown distribution function F , we may want to learn the value of
θ = T (F ) for a (known) functional T . Some particular instances of statistical functionals
are as follows:

• If T (F ) = F (c) for some constant c, then T is a statistical functional mapping each F


to PF (X ≤ c).

• If T (F ) = F −1 (p) for some constant p, where F −1 (p) is defined in Equation (3.13),


then T maps F to its pth quantile.

• If T (F ) = E F (X), then T maps F to its mean.

Suppose X1 , . . . , Xn is an independent and identically distributed sequence with distribution
function F (x). We define the empirical distribution function F̂n to be the distribution
function for a discrete uniform distribution on {X1 , . . . , Xn }. In other words,
\[
\hat F_n(x) = \frac{1}{n} \#\{i : X_i \le x\} = \frac{1}{n} \sum_{i=1}^n I\{X_i \le x\}.
\]

Since F̂n (x) is a distribution function, a reasonable estimator of T (F ) is the so-called plug-in
estimator T (F̂n ). For example, if T (F ) = E F (X), then the plug-in estimator given a simple
random sample X1 , X2 , . . . from F is
\[
T(\hat F_n) = E_{\hat F_n}(X) = \frac{1}{n} \sum_{i=1}^n X_i = \bar X_n.
\]

As we will see later, a plug-in estimator is also known as a V-statistic or a V-estimator.
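For instance, here is a minimal R sketch of plug-in estimation (the simulated data and the particular functionals are chosen purely for illustration):

# Plug-in estimation via the empirical distribution function F-hat
set.seed(1)
x <- rexp(50, rate = 2)        # simple random sample from F (here an Exp(2) distribution)
Fhat <- ecdf(x)                # empirical distribution function
Fhat(0.5)                      # plug-in estimate of T(F) = F(0.5) = P_F(X <= 0.5)
mean(x)                        # plug-in estimate of T(F) = E_F(X), i.e., the sample mean
quantile(x, 0.5, type = 1)     # plug-in estimate of the median F^{-1}(1/2)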


Suppose that for some real-valued function φ(x), we define T (F ) = E F φ(X). Note in this
case that

T {αF1 + (1 − α)F2 } = α E F1 φ(X) + (1 − α) E F2 φ(X) = αT (F1 ) + (1 − α)T (F2 ).

For this reason, such a functional is sometimes called a linear functional (see Definition 10.1).
To generalize this idea, we consider a real-valued function taking more than one real argu-
ment, say φ(x1 , . . . , xa ) for some a > 1, and define

T (F ) = E F φ(X1 , . . . , Xa ), (10.1)

which we take to mean the expectation of φ(X1 , . . . , Xa ) where X1 , . . . , Xa is a simple random


sample from the distribution function F . We see that

E F φ(X1 , . . . , Xa ) = E F φ(Xπ(1) , . . . , Xπ(a) )

for any permutation π mapping {1, . . . , a} onto itself. Since there are a! such permutations,
consider the function
\[
\phi^*(x_1, \dots, x_a) \stackrel{\rm def}{=} \frac{1}{a!} \sum_{\text{all } \pi} \phi(x_{\pi(1)}, \dots, x_{\pi(a)}).
\]

Since E F φ(X1 , . . . , Xa ) = E F φ∗ (X1 , . . . , Xa ) and φ∗ is symmetric in its arguments (i.e.,


permuting its a arguments does not change its value), we see that in Equation (10.1) we may
assume without loss of generality that φ is symmetric in its arguments. A function defined
as in Equation (10.1) is called an expectation functional, as summarized in the following
definition:

Definition 10.1 For some integer a ≥ 1, let φ: Ra → R be a function symmetric
in its a arguments. The expectation of φ(X1 , . . . , Xa ) under the assumption
that X1 , . . . , Xa are independent and identically distributed from some distri-
bution F will be denoted by E F φ(X1 , . . . , Xa ). Then the functional T (F ) =
E F φ(X1 , . . . , Xa ) is called an expectation functional. If a = 1, then T is also
called a linear functional.

Expectation functionals are important in this chapter because they are precisely the func-
tionals that give rise to V-statistics and U-statistics. The function φ(x1 , . . . , xa ) in Definition
10.1 is used so frequently that we give it a special name:
Definition 10.2 Let T (F ) = E F φ(X1 , . . . , Xa ) be an expectation functional, where
φ: Ra → R is a function that is symmetric in its arguments. In other words,
φ(x1 , . . . , xa ) = φ(xπ(1) , . . . , xπ(a) ) for any permutation π of the integers 1 through
a. Then φ is called the kernel function associated with T (F ).
Suppose T (F ) is an expectation functional defined according to Equation (10.1). If we have
a simple random sample of size n from F , then as noted earlier, a natural way to estimate
T (F ) is by the use of the plug-in estimator T (F̂n ). This estimator is called a V-estimator or a
V-statistic. It is possible to write down a V-statistic explicitly: since F̂n assigns probability 1/n to each Xi , we have
\[
V_n = T(\hat F_n) = E_{\hat F_n} \phi(X_1, \dots, X_a) = \frac{1}{n^a} \sum_{i_1=1}^n \cdots \sum_{i_a=1}^n \phi(X_{i_1}, \dots, X_{i_a}). \tag{10.2}
\]

In the case a = 1, Equation (10.2) becomes
\[
V_n = \frac{1}{n} \sum_{i=1}^n \phi(X_i).
\]

It is clear in this case that E Vn = T (F ), which we denote by θ. Furthermore, if σ² = Var_F φ(X) < ∞, then the central limit theorem implies that
\[
\sqrt{n}(V_n - \theta) \xrightarrow{d} N(0, \sigma^2).
\]

For a > 1, however, the sum in Equation (10.2) contains some terms in which i1 , . . . , ia
are not all distinct. The expectation of such terms is not necessarily equal to θ = T (F )
because in Definition 10.1, θ requires a independent random variables from F . Thus, Vn is
not necessarily unbiased for a > 1.

Example 10.3 Let a = 2 and φ(x1 , x2 ) = |x1 − x2 |. It may be shown (Problem 10.2)
that the functional T (F ) = E F |X1 − X2 | is not linear in F . Furthermore, since
|Xi1 − Xi2 | is identically zero whenever i1 = i2 , it may also be shown that the
V-estimator of T (F ) is biased.
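To see the bias numerically, the following R sketch (sample size and distribution chosen arbitrarily) computes the V-statistic of Equation (10.2) for φ(x1 , x2 ) = |x1 − x2 | and compares it with the average of |Xi − Xj | over distinct pairs only; the zero diagonal terms pull the V-statistic downward.

# Bias of the V-statistic for the kernel phi(x1, x2) = |x1 - x2| (illustrative)
set.seed(2)
n <- 10
x <- runif(n)                          # sample from Uniform(0, 1)
d <- abs(outer(x, x, "-"))             # matrix of |X_i - X_j|; the diagonal is identically zero
Vn <- mean(d)                          # V-statistic: average over all n^2 pairs
Udistinct <- sum(d) / (n * (n - 1))    # average over the n(n - 1) pairs with distinct indices
c(Vn = Vn, distinct = Udistinct)       # Vn is smaller because of the zero diagonal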

Since the bias in Vn is due to the duplication among the subscripts i1 , . . . , ia , one way to
correct this bias is to restrict the summation in Equation (10.2) to sets of subscripts i1 , . . . , ia
that contain no duplication. For example, we might sum instead over all possible subscripts
satisfying i1 < · · · < ia . The result is the U-statistic, which is the topic of Section 10.2.

Exercises for Section 10.1


Exercise 10.1 Let X1 , . . . , Xn be a simple random sample from F . For a fixed x for
which 0 < F (x) < 1, find the asymptotic distribution of F̂n (x).

Exercise 10.2 Let T (F ) = E F |X1 − X2 |.


(a) Show that T (F ) is not a linear functional by exhibiting distributions F1 and
F2 and a constant α ∈ (0, 1) such that
T {αF1 + (1 − α)F2 } ≠ αT (F1 ) + (1 − α)T (F2 ).

(b) For n > 1, demonstrate that the V-statistic Vn is biased in this case by
finding cn ≠ 1 such that E F Vn = cn T (F ).

Exercise 10.3 Let X1 , . . . , Xn be a random sample from a distribution F with finite


third absolute moment.
(a) For a = 2, find φ(x1 , x2 ) such that E F φ(X1 , X2 ) = Var F X. Your φ function
should be symmetric in its arguments.
Hint: The fact that θ = E X₁² − E X₁X₂ leads immediately to a non-symmetric
φ function. Symmetrize it.
(b) For a = 3, find φ(x1 , x2 , x3 ) such that E F φ(X1 , X2 , X3 ) = E F (X − E F X)³.
As in part (a), φ should be symmetric in its arguments.

10.2 Hoeffding’s Decomposition and Asymptotic Normality

Because the V-statistic
\[
V_n = \frac{1}{n^a} \sum_{i_1=1}^n \cdots \sum_{i_a=1}^n \phi(X_{i_1}, \dots, X_{i_a})
\]

is in general a biased estimator of the expectation functional T (F ) = E F φ(X1 , . . . , Xa )
due to the presence of summands in which there are duplicated indices on the Xik , one way to
produce an unbiased estimator is to sum only over those (i1 , . . . , ia ) in which no duplicates
occur. Because φ is assumed to be symmetric in its arguments, we may without loss of
generality restrict attention to the cases in which 1 ≤ i1 < · · · < ia ≤ n. Doing this, we
obtain the U-statistic Un :
Definition 10.4 Let a be a positive integer and let φ(x1 , . . . , xa ) be the kernel func-
tion associated with an expectation functional T (F ) (see Definitions 10.1 and
10.2). Then the U-statistic corresponding to this functional equals
\[
U_n = \binom{n}{a}^{-1} \sum_{1 \le i_1 < \cdots < i_a \le n} \phi(X_{i_1}, \dots, X_{i_a}), \tag{10.3}
\]
where X1 , . . . , Xn is a simple random sample of size n ≥ a.


The “U” in “U-statistic” stands for unbiased (the “V” in “V-statistic” stands for von Mises,
who was one of the originators of this theory in the late 1940s). The unbiasedness of Un
follows since it is the average of $\binom{n}{a}$ terms, each with expectation T (F ) = E F φ(X1 , . . . , Xa ).
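In practice, Un can be computed directly from Equation (10.3) by averaging the kernel over all $\binom{n}{a}$ index subsets. The R sketch below is one way to do this; the helper name u_stat is ours, and the example kernel φ(x1 , x2 ) = (x1 − x2 )²/2, whose U-statistic is the unbiased sample variance (as noted in Exercise 10.12), is used only as a check.

# Illustrative sketch: evaluate a U-statistic by brute force over all index subsets
u_stat <- function(x, phi, a) {
  idx <- combn(length(x), a)                         # all index sets i_1 < ... < i_a
  mean(apply(idx, 2, function(i) phi(x[i])))         # average the kernel over them
}
set.seed(3)
x <- rnorm(20)
phi_var <- function(z) (z[1] - z[2])^2 / 2           # kernel whose expectation is Var_F X
c(Un = u_stat(x, phi_var, a = 2), S2 = var(x))       # the two agree exactly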

Example 10.5 Consider a random sample X1 , . . . , Xn from F , and let
\[
R_n = \sum_{j=1}^n j\, I\{W_j > 0\}
\]

be the Wilcoxon signed rank statistic, where W1 , . . . , Wn are simply X1 , . . . , Xn reordered in increasing absolute value. We showed in Example 9.12 that
\[
R_n = \sum_{i=1}^n \sum_{j=1}^i I\{X_i + X_j > 0\}.
\]

Letting φ(a, b) = I{a + b > 0}, we see that φ is symmetric in its arguments and
thus
\[
\binom{n}{2}^{-1} R_n = U_n + \binom{n}{2}^{-1} \sum_{i=1}^n I\{X_i > 0\} = U_n + O_P\!\left(\tfrac{1}{n}\right),
\]

where Un is the U-statistic corresponding to the expectation functional T (F ) =


PF (X1 + X2 > 0). Therefore, many of the properties of the signed rank test
that we have already derived elsewhere can also be obtained using the theory of
U-statistics developed later in this section.
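The algebraic relation in Example 10.5 is easy to confirm numerically; in the R sketch below (with simulated data used only for illustration), both sides agree exactly.

# Check of Example 10.5: R_n / choose(n, 2) = U_n + (sum of I{X_i > 0}) / choose(n, 2)
set.seed(4)
n <- 15
x <- rnorm(n, mean = 0.5)
w <- x[order(abs(x))]                         # X's reordered in increasing absolute value
Rn <- sum((1:n) * (w > 0))                    # Wilcoxon signed rank statistic
idx <- combn(n, 2)
Un <- mean(x[idx[1, ]] + x[idx[2, ]] > 0)     # U-statistic with kernel I{x1 + x2 > 0}
c(lhs = Rn / choose(n, 2), rhs = Un + sum(x > 0) / choose(n, 2))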

In the special case a = 1, the V-statistic and the U-statistic coincide. In this case, we have
already seen that both Un and Vn are asymptotically normal by the central limit theorem.
However, for a > 1, the two statistics do not coincide in general. Furthermore, we may no
longer use the central limit theorem to obtain asymptotic normality because the summands
are not independent (each Xi appears in more than one summand).
To prove the asymptotic normality of U-statistics, we shall use a method sometimes known as
the H-projection method after its inventor, Wassily Hoeffding. If φ(x1 , . . . , xa ) is the kernel
function of an expectation functional T (F ) = E F φ(X1 , . . . , Xa ), suppose X1 , . . . , Xn is a
simple random sample from the distribution F . Let θ = T (F ) and let Un be the U-statistic
defined in Equation (10.3). For 1 ≤ i ≤ a, suppose that the values of X1 , . . . , Xi are held
constant, say X1 = x1 , . . . , Xi = xi . This may be viewed as projecting the random vector
(X1 , . . . , Xa ) onto the (a − i)-dimensional subspace in Ra given by {(x1 , . . . , xi , ci+1 , . . . , ca ) :
(ci+1 , . . . , ca ) ∈ Ra−i }. If we take the conditional expectation, the result will be a function
of x1 , . . . , xi , which we will denote by φi . That is, we define for i = 1, . . . , a

φi (x1 , . . . , xi ) = E F φ(x1 , . . . , xi , Xi+1 , . . . , Xa ). (10.4)

Equivalently, we may use conditional expectation notation to define

φi (X1 , . . . , Xi ) = E F {φ(X1 , . . . , Xa ) | X1 , . . . , Xi } . (10.5)

From Equation (10.5), we obtain E F φi (X1 , . . . , Xi ) = θ for all i. Furthermore, we define

σi2 = Var F φi (X1 , . . . , Xi ). (10.6)

The variance of Un can be expressed in terms of the σi2 as follows:


Theorem 10.6 The variance of a U-statistic is
\[
\operatorname{Var}_F U_n = \binom{n}{a}^{-1} \sum_{k=1}^a \binom{a}{k} \binom{n-a}{a-k} \sigma_k^2.
\]
If σ₁², . . . , σₐ² are all finite, then
\[
\operatorname{Var}_F U_n = \frac{a^2 \sigma_1^2}{n} + o\!\left(\frac{1}{n}\right).
\]


Theorem 10.6 is proved in Exercise 10.4. This theorem shows that the variance of √n Un tends
to a²σ₁², and indeed we may well wonder whether it is true that √n(Un − θ) is asymptotically
normal with this limiting variance. The idea of Hoeffding's H-projection method is to prove
exactly that fact.
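Before carrying out the projection argument, it may help to see σ₁² in a concrete case. The R sketch below uses the illustrative kernel φ(x1 , x2 ) = max(x1 , x2 ) with Uniform(0, 1) data (our choice, not from the text); a short calculation gives φ1 (x) = E max(x, X2 ) = (1 + x²)/2, so σ₁² = Var{(1 + X²)/2} = 1/45 and the conjectured limiting variance is a²σ₁² = 4/45.

# Illustrative computation of sigma_1^2 for phi(x1, x2) = max(x1, x2), Uniform(0, 1) data
# Here phi_1(x) = E max(x, X_2) = (1 + x^2) / 2, so sigma_1^2 = Var{(1 + X^2) / 2} = 1/45
set.seed(5)
x <- runif(1e5)
sigma1_sq <- var((1 + x^2) / 2)               # Monte Carlo approximation to 1/45
c(sigma1_sq = sigma1_sq, limit_var = 4 * sigma1_sq, exact = 4 / 45)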

We shall derive the asymptotic normality of Un in a sequence of steps. The basic idea will
be to show that Un − θ has the same limiting distribution as the sum
\[
\tilde U_n = \sum_{j=1}^n E_F(U_n - \theta \mid X_j) \tag{10.7}
\]

of projections. The asymptotic distribution of Ũn follows from the central limit theorem
because Ũn is the sum of independent and identically distributed random variables.

Lemma 10.7 For all 1 ≤ j ≤ n,
\[
E_F(U_n - \theta \mid X_j) = \frac{a}{n} \{\phi_1(X_j) - \theta\}.
\]

Proof: Note that
\[
E_F(U_n - \theta \mid X_j) = \binom{n}{a}^{-1} \sum_{1 \le i_1 < \cdots < i_a \le n} E_F\{\phi(X_{i_1}, \dots, X_{i_a}) - \theta \mid X_j\},
\]
where E F {φ(Xi1 , . . . , Xia ) − θ | Xj } equals φ1 (Xj ) − θ whenever j is among {i1 , . . . , ia } and
0 otherwise. The number of ways to choose {i1 , . . . , ia } so that j is among them is $\binom{n-1}{a-1}$, so
we obtain
\[
E_F(U_n - \theta \mid X_j) = \frac{\binom{n-1}{a-1}}{\binom{n}{a}} \{\phi_1(X_j) - \theta\} = \frac{a}{n} \{\phi_1(X_j) - \theta\}.
\]

Lemma 10.8 If σ₁² < ∞ and Ũn is defined as in Equation (10.7), then
\[
\sqrt{n}\,\tilde U_n \xrightarrow{d} N(0, a^2\sigma_1^2).
\]

Proof: Lemma 10.8 follows immediately from Lemma 10.7 and the central limit theorem
since aφ1 (Xj ) has mean aθ and variance a2 σ12 .
Now that we know the asymptotic distribution of Ũn , it remains to show that Un − θ and
Ũn have the same asymptotic behavior.

Lemma 10.9
\[
E_F\{\tilde U_n (U_n - \theta)\} = E_F \tilde U_n^2.
\]

Proof: By Equation (10.7) and Lemma 10.7, E F Ũn² = a²σ₁²/n. Furthermore,
\begin{align*}
E_F\{\tilde U_n (U_n - \theta)\} &= \frac{a}{n} \sum_{j=1}^n E_F\{(\phi_1(X_j) - \theta)(U_n - \theta)\} \\
&= \frac{a}{n} \sum_{j=1}^n E_F\, E_F\{(\phi_1(X_j) - \theta)(U_n - \theta) \mid X_j\} \\
&= \frac{a^2}{n^2} \sum_{j=1}^n E_F\{\phi_1(X_j) - \theta\}^2 \\
&= \frac{a^2 \sigma_1^2}{n}.
\end{align*}
Lemma 10.10 If σᵢ² < ∞ for i = 1, . . . , a, then
\[
\sqrt{n}\left(U_n - \theta - \tilde U_n\right) \xrightarrow{P} 0.
\]
Proof: Since convergence in quadratic mean implies convergence in probability (Theorem 2.15), it suffices to show that
\[
E_F\left\{\sqrt{n}(U_n - \theta - \tilde U_n)\right\}^2 \to 0.
\]
By Lemma 10.9, $n E_F (U_n - \theta - \tilde U_n)^2 = n \operatorname{Var}_F U_n - n E_F \tilde U_n^2$. But $n \operatorname{Var}_F U_n = a^2\sigma_1^2 + o(1)$ by Theorem 10.6, and $n E_F \tilde U_n^2 = a^2\sigma_1^2$, proving the result.
Finally, since √n(Un − θ) = √n Ũn + √n(Un − θ − Ũn ), Lemmas 10.8 and 10.10 along with
Slutsky's theorem yield the theorem we originally set out to prove:
Theorem 10.11 If σᵢ² < ∞ for i = 1, . . . , a, then
\[
\sqrt{n}(U_n - \theta) \xrightarrow{d} N(0, a^2\sigma_1^2). \tag{10.8}
\]
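A quick simulation illustrates Theorem 10.11. Continuing the illustrative max kernel from above (for which θ = E max(X1 , X2 ) = 2/3 under the standard uniform distribution and a²σ₁² = 4/45), the R sketch below compares the empirical distribution of √n(Un − θ) with its normal limit.

# Simulation check of Theorem 10.11 for phi(x1, x2) = max(x1, x2), Uniform(0, 1) data
set.seed(6)
n <- 200
reps <- 2000
u_max <- function(x) {
  idx <- combn(length(x), 2)
  mean(pmax(x[idx[1, ]], x[idx[2, ]]))          # U-statistic for the max kernel
}
z <- replicate(reps, sqrt(n) * (u_max(runif(n)) - 2/3))
c(sim_var = var(z), theory_var = 4/45)          # the variances should be close
qqnorm(z); qqline(z)                            # approximate normality of sqrt(n)(U_n - theta)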

Exercises for Section 10.2


Exercise 10.4 Prove Theorem 10.6, as follows:
(a) Prove that
\[
E\, \phi(X_1, \dots, X_a)\,\phi(X_1, \dots, X_k, X_{a+1}, \dots, X_{a+(a-k)}) = \sigma_k^2 + \theta^2
\]
and thus Cov {φ(X1 , . . . , Xa ), φ(X1 , . . . , Xk , Xa+1 , . . . , Xa+(a−k) )} = σₖ².
(b) Show that
\[
\binom{n}{a}^2 \operatorname{Var} U_n = \binom{n}{a} \sum_{k=1}^a \binom{a}{k} \binom{n-a}{a-k} \operatorname{Cov}\{\phi(X_1, \dots, X_a),\ \phi(X_1, \dots, X_k, X_{a+1}, \dots, X_{a+(a-k)})\}
\]
and then use part (a) to prove the first equation of Theorem 10.6.
(c) Verify the second equation of Theorem 10.6.

Exercise 10.5 Suppose a kernel function φ(x1 , . . . , xa ) satisfies E |φ(Xi1 , . . . , Xia )| < ∞
for any (not necessarily distinct) i1 , . . . , ia . Prove that if Un and Vn are the
corresponding U- and V-statistics, then √n(Vn − Un ) →P 0, so that Vn has the same
asymptotic distribution as Un .
Hint: Verify and use the equation
\[
V_n - U_n = \left[ V_n - \frac{1}{n^a} \sum_{\text{all } i_j \text{ distinct}} \phi(X_{i_1}, \dots, X_{i_a}) \right]
+ \left[ \frac{1}{\binom{n}{a}\, a!} - \frac{1}{n^a} \right] \sum_{\text{all } i_j \text{ distinct}} \phi(X_{i_1}, \dots, X_{i_a}).
\]

Exercise 10.6 For the kernel function of Example 10.3, φ(a, b) = |a − b|, the corre-
sponding U-statistic is called Gini’s mean difference and it is denoted Gn . For a
random sample from uniform(0, τ ), find the asymptotic distribution of Gn .

Exercise 10.7 Let φ(x1 , x2 , x3 ) have the property


φ(a + bx1 , a + bx2 , a + bx3 ) = φ(x1 , x2 , x3 )sgn(b) for all a, b. (10.9)
Let θ = E φ(X1 , X2 , X3 ). The function sgn(x) is defined as I{x > 0} − I{x < 0}.
(a) We define the distribution F to be symmetric if for X ∼ F , there exists
some µ (the center of symmetry) such that X − µ and µ − X have the same
distribution. Prove that if F is symmetric then θ = 0.
(b) Let x̄ and x̃ denote the mean and median of x1 , x2 , x3 . Let φ(x1 , x2 , x3 ) =
sgn(x̄ − x̃). Show that this function satisfies criterion (10.9), then find the asymp-
totic distribution for the corresponding U-statistic if F is the standard uniform
distribution.

Exercise 10.8 If the arguments of the kernel function φ(x1 , . . . , xa ) of a U-statistic
are vectors instead of scalars, note that Theorem 10.11 still applies with no
modification. With this in mind, consider for x, y ∈ R2 the kernel φ(x, y) =
I{(y1 − x1 )(y2 − x2 ) > 0}.
(a) Given a simple random sample X (1) , . . . , X (n) , if Un denotes the U-statistic
corresponding to the kernel above, the statistic 2Un − 1 is called Kendall's tau
statistic. Suppose the marginal distributions of $X_1^{(i)}$ and $X_2^{(i)}$ are both continuous,
with $X_1^{(i)}$ and $X_2^{(i)}$ independent. Find the asymptotic distribution of √n(Un − θ)
for an appropriate value of θ.
(b) To test the null hypothesis that a sample Z1 , . . . , Zn is independent and
identically distributed against the alternative hypothesis that the Zi are stochas-
tically increasing in i, suppose we reject the null hypothesis if the number of pairs
(Zi , Zj ) with Zi < Zj and i < j is greater than cn . This test is called Mann’s
test against trend. Based on your answer to part (a), find cn so that the test has
asymptotic level .05.
(c) Estimate the true level of the test in part (b) for a simple random sample
of size n from a standard normal distribution for each n ∈ {5, 15, 75}. Use 5000
samples in each case.

10.3 U-statistics in the non-iid case

In this section, we generalize the idea of U-statistics in two different directions. First, we
consider single U-statistics for situations in which there is more than one sample. Next, we
consider the joint asymptotic distribution of two (single-sample) U-statistics.
We begin by generalizing the idea of U-statistics to the case in which we have more than
one random sample. Suppose that Xi1 , . . . , Xini is a simple random sample from Fi for all
1 ≤ i ≤ s. In other words, we have s random samples, each potentially from a different
distribution, and ni is the size of the ith sample. We may define a statistical functional

θ = E φ (X11 , . . . , X1a1 ; X21 , . . . , X2a2 ; · · · ; Xs1 , . . . , Xsas ) . (10.10)

Notice that the kernel φ in Equation (10.10) has a1 + a2 + · · · + as arguments; furthermore,


we assume that the first a1 of them may be permuted without changing the value of φ, the
next a2 of them may be permuted without changing the value of φ, etc. In other words,
there are s distinct blocks of arguments of φ, and φ is symmetric in its arguments within
each of these blocks.

The U-statistic corresponding to the expectation functional (10.10) is
\[
U_N = \binom{n_1}{a_1}^{-1} \cdots \binom{n_s}{a_s}^{-1} \sum_{1 \le i_1 < \cdots < i_{a_1} \le n_1} \cdots \sum_{1 \le r_1 < \cdots < r_{a_s} \le n_s} \phi\left(X_{1i_1}, \dots, X_{1i_{a_1}}; \cdots; X_{sr_1}, \dots, X_{sr_{a_s}}\right). \tag{10.11}
\]


As we did in the case of single-sample U-statistics, define for 0 ≤ j1 ≤ a1 , . . . , 0 ≤ js ≤ as
\[
\phi_{j_1 \cdots j_s}(X_{11}, \dots, X_{1j_1}; \cdots; X_{s1}, \dots, X_{sj_s}) =
E\{\phi(X_{11}, \dots, X_{1a_1}; \cdots; X_{s1}, \dots, X_{sa_s}) \mid X_{11}, \dots, X_{1j_1}, \dots, X_{s1}, \dots, X_{sj_s}\} \tag{10.12}
\]
and
\[
\sigma^2_{j_1 \cdots j_s} = \operatorname{Var} \phi_{j_1 \cdots j_s}(X_{11}, \dots, X_{1j_1}; \cdots; X_{s1}, \dots, X_{sj_s}). \tag{10.13}
\]

By an argument similar to the one used in the proof of Theorem 10.6, but much more tedious
notationally, we can show that

\[
\sigma^2_{j_1 \cdots j_s} = \operatorname{Cov}\{\phi(X_{11}, \dots, X_{1a_1}; \cdots; X_{s1}, \dots, X_{sa_s}),\ \phi(X_{11}, \dots, X_{1j_1}, X_{1,a_1+1}, \dots; \cdots; X_{s1}, \dots, X_{sj_s}, X_{s,a_s+1}, \dots)\}. \tag{10.14}
\]

Notice that some of the ji may equal 0. This was not true in the single-sample case, since
φ0 would have merely been the constant θ, so σ02 = 0.
In the special case when s = 2, Equations (10.12), (10.13) and (10.14) become

\[
\phi_{ij}(X_1, \dots, X_i; Y_1, \dots, Y_j) = E\{\phi(X_1, \dots, X_{a_1}; Y_1, \dots, Y_{a_2}) \mid X_1, \dots, X_i, Y_1, \dots, Y_j\},
\]
\[
\sigma_{ij}^2 = \operatorname{Var} \phi_{ij}(X_1, \dots, X_i; Y_1, \dots, Y_j),
\]
and
\[
\sigma_{ij}^2 = \operatorname{Cov}\{\phi(X_1, \dots, X_{a_1}; Y_1, \dots, Y_{a_2}),\ \phi(X_1, \dots, X_i, X_{a_1+1}, \dots, X_{a_1+(a_1-i)}; Y_1, \dots, Y_j, Y_{a_2+1}, \dots, Y_{a_2+(a_2-j)})\},
\]

respectively, for 0 ≤ i ≤ a1 and 0 ≤ j ≤ a2 .


Although we will not derive it here as we did for the single-sample case, there is an analogous
asymptotic normality result for multisample U-statistics, as follows.

Theorem 10.12 Suppose that for i = 1, . . . , s, Xi1 , . . . , Xini is a random sample
from the distribution Fi and that these s samples are independent of each other.
Suppose further that there exist constants ρ1 , . . . , ρs in the interval (0, 1) such
that ni /N → ρi for all i and that σ²_{a1···as} < ∞. Then
\[
\sqrt{N}(U_N - \theta) \xrightarrow{d} N(0, \sigma^2),
\]
where
\[
\sigma^2 = \frac{a_1^2}{\rho_1} \sigma^2_{10\cdots 00} + \cdots + \frac{a_s^2}{\rho_s} \sigma^2_{00\cdots 01}.
\]

Although the notation required for the multisample U-statistic theory is nightmarish, life
becomes considerably simpler in the case s = 2 and a1 = a2 = 1, in which case we obtain
\[
U_N = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \phi(X_{1i}; X_{2j}).
\]
Equivalently, we may assume that X1 , . . . , Xm are a simple random sample from F and
Y1 , . . . , Yn are a simple random sample from G, which gives
\[
U_N = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \phi(X_i; Y_j). \tag{10.15}
\]

In the case of the U-statistic of Equation (10.15), Theorem 10.12 states that
\[
\sqrt{N}(U_N - \theta) \xrightarrow{d} N\left(0,\ \frac{\sigma_{10}^2}{\rho} + \frac{\sigma_{01}^2}{1-\rho}\right),
\]
where ρ = lim m/N , σ₁₀² = Cov {φ(X1 ; Y1 ), φ(X1 ; Y2 )}, and σ₀₁² = Cov {φ(X1 ; Y1 ), φ(X2 ; Y1 )}.

Example 10.13 For independent random samples X1 , . . . , Xm from F and Y1 , . . . , Yn
from G, consider the Wilcoxon rank-sum statistic W , defined to be the sum of
the ranks of the Yi among the combined sample. We may show that
\[
W = \frac{1}{2} n(n+1) + \sum_{i=1}^{m} \sum_{j=1}^{n} I\{X_i < Y_j\}.
\]
Therefore, if we let φ(x; y) = I{x < y}, then the corresponding two-sample U-statistic
UN is related to W by W = ½ n(n + 1) + mn UN . We may therefore use
Theorem 10.12 to obtain the asymptotic normality of UN , and therefore of W .
However, we make no assumption here that F and G are merely shifted versions
of one another. Thus, we may now obtain in principle the asymptotic distribution
of the rank-sum statistic for any two distributions F and G that we wish, so long
as they have finite second moments.
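The identity W = ½ n(n + 1) + mn UN is easy to verify directly; the R sketch below (with simulated, purely illustrative data) computes both sides.

# Check of Example 10.13: W = n(n + 1)/2 + m*n*U_N for the kernel phi(x; y) = I{x < y}
set.seed(7)
m <- 8; n <- 12
x <- rnorm(m)
y <- rnorm(n, mean = 0.5)
W <- sum(rank(c(x, y))[(m + 1):(m + n)])        # sum of the ranks of the Y's in the combined sample
UN <- mean(outer(x, y, "<"))                    # two-sample U-statistic of Equation (10.15)
c(W = W, identity = n * (n + 1) / 2 + m * n * UN)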

The other direction in which we will generalize the development of U-statistics is consid-
eration of the joint distribution of two single-sample U-statistics. Suppose that there are
two kernel functions, φ(x1 , . . . , xa ) and ϕ(x1 , . . . , xb ), and we define the two corresponding
U-statistics
\[
U_n^{(1)} = \binom{n}{a}^{-1} \sum_{1 \le i_1 < \cdots < i_a \le n} \phi(X_{i_1}, \dots, X_{i_a})
\]
and
\[
U_n^{(2)} = \binom{n}{b}^{-1} \sum_{1 \le j_1 < \cdots < j_b \le n} \varphi(X_{j_1}, \dots, X_{j_b})
\]
for a single random sample X1 , . . . , Xn from F . Define θ1 = E $U_n^{(1)}$ and θ2 = E $U_n^{(2)}$.
Furthermore, define γij to be the covariance between φi (X1 , . . . , Xi ) and ϕj (X1 , . . . , Xj ),
where φi and ϕj are defined as in Equation (10.5). Letting k = min{i, j}, it may be proved
that
\[
\gamma_{ij} = \operatorname{Cov}\left\{\phi(X_1, \dots, X_a),\ \varphi(X_1, \dots, X_k, X_{a+1}, \dots, X_{a+(b-k)})\right\}. \tag{10.16}
\]

Note in particular that γij depends only on the value of min{i, j}.
The following theorem, stated without proof, gives the joint asymptotic distribution of $U_n^{(1)}$ and $U_n^{(2)}$.

Theorem 10.14 Suppose X1 , . . . , Xn is a random sample from F and that φ: Ra → R
and ϕ: Rb → R are two kernel functions satisfying Var φ(X1 , . . . , Xa ) < ∞
and Var ϕ(X1 , . . . , Xb ) < ∞. Define τ₁² = Var φ1 (X1 ) and τ₂² = Var ϕ1 (X1 ), and
let γij be defined as in Equation (10.16). Then
\[
\sqrt{n}\left\{ \begin{pmatrix} U_n^{(1)} \\ U_n^{(2)} \end{pmatrix} - \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \right\}
\xrightarrow{d} N\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} a^2\tau_1^2 & ab\gamma_{11} \\ ab\gamma_{11} & b^2\tau_2^2 \end{pmatrix} \right\}.
\]

Exercises for Section 10.3


Exercise 10.9 Suppose X1 , . . . , Xm and Y1 , . . . , Yn are independent random samples
from distributions Unif(0, θ) and Unif(µ, µ + θ), respectively. Assume m/N → ρ
as m, n → ∞ and 0 < µ < θ.
(a) Find the asymptotic distribution of the U-statistic of Equation (10.15), where
φ(x; y) = I{x < y}. In so doing, find a function g(x) such that E (UN ) = g(µ).

(b) Find the asymptotic distribution of g(Ȳ − X̄).
(c) Find the range of values of µ for which the Wilcoxon estimate of g(µ) is
asymptotically more efficient than g(Ȳ − X̄). (The asymptotic relative efficiency
in this case is the ratio of asymptotic variances.)

Exercise 10.10 Solve each part of Problem 10.9, but this time under the assumptions
that the independent random samples X1 , . . . , Xm and Y1 , . . . , Yn satisfy P (X1 ≤
t) = P (Y1 − θ ≤ t) = t² for t ∈ [0, 1] and 0 < θ < 1. As in Problem 10.9, assume
m/N → ρ ∈ (0, 1).

Exercise 10.11 Suppose X1 , . . . , Xm and Y1 , . . . , Yn are independent random samples


from distributions N (0, 1) and N (µ, 1), respectively. Assume m/(m + n) → 1/2
as m, n → ∞. Let UN be the U-statistic of Equation (10.15), where φ(x; y) =
I{x < y}. Suppose that θ(µ) and σ 2 (µ) are such that
\[
\sqrt{N}\,[U_N - \theta(\mu)] \xrightarrow{d} N[0, \sigma^2(\mu)].
\]
Calculate θ(µ) and σ 2 (µ) for µ ∈ {.2, .5, 1, 1.5, 2}.
Hint: This problem requires a bit of numerical integration. There are a couple
of ways you might do this. A symbolic mathematics program like Mathematica
or Maple will do it. There is a function called integrate in R and Splus and
one called quad in MATLAB for integrating a function. If you cannot get any of
these to work for you, let me know.

Exercise 10.12 Suppose X1 , X2 , . . . are independent and identically distributed with


finite variance. Define
\[
S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2
\]
and let Gn be Gini’s mean difference, the U-statistic defined in Problem 10.6.
Note that Sn2 is also a U-statistic, corresponding to the kernel function φ(x1 , x2 ) =
(x1 − x2 )2 /2.
(a) If Xi are distributed as Unif(0, θ), give the joint asymptotic distribution of
Gn and Sn by first finding the joint asymptotic distribution of the U-statistics
Gn and Sn2 . Note that the covariance matrix need not be positive definite; in this
problem, the covariance matrix is singular.
(b) The singular asymptotic covariance matrix in this problem implies that as
n → ∞, the joint distribution of Gn and Sn becomes concentrated on a line. Does
this appear to be the case? For 1000 samples of size n from Uniform(0, 1), plot
scatterplots of Gn against Sn . Take n ∈ {5, 25, 100}.

10.4 Introduction to the Bootstrap

This section does not use very much large-sample theory aside from the weak law of large
numbers, and it is not directly related to the study of U-statistics. However, we include
it here because of its natural relationship with the concepts of statistical functionals and
plug-in estimators seen in Section 10.1, and also because it is an increasingly popular and
often misunderstood method in statistical estimation.
Consider a statistical functional Tn (F ) that depends on n. For instance, Tn (F ) may be some
property, such as bias or variance, of an estimator θ̂n of θ = θ(F ) based on a random sample
of size n from some distribution F .
As an example, let θ(F ) = F −1 (1/2) be the median of F . Take θ̂n to be the mth order statistic
from a random sample of size n = 2m − 1 from F .
Consider the bias TnB (F ) = E F θ̂n − θ(F ) and the variance TnV (F ) = E F θ̂n² − (E F θ̂n )².
Theoretical properties of TnB and TnV are very difficult to obtain. Even asymptotics aren't
very helpful, since √n(θ̂n − θ) converges in distribution to N {0, 1/(4f²(θ))}, which tells us only that
the bias goes to zero; the limiting variance involves the unknown quantity f (θ), which is
itself hard to estimate.
Consider the plug-in estimators TnB (F̂n ) and TnV (F̂n ). (Recall that F̂n denotes the empirical
distribution function, which puts a mass of 1/n on each of the n sample points.) In our median
example,
\[
T_n^B(\hat F_n) = E_{\hat F_n} \hat\theta_n^* - \hat\theta_n
\]
and
\[
T_n^V(\hat F_n) = E_{\hat F_n} (\hat\theta_n^*)^2 - (E_{\hat F_n} \hat\theta_n^*)^2,
\]

where θ̂n∗ is the sample median from a random sample X1∗ , . . . , Xn∗ from F̂n .
To see how difficult it is to calculate TnB (F̂n ) and TnV (F̂n ), consider the simplest nontrivial
case, n = 3: Conditional on the order statistics (X(1) , X(2) , X(3) ), there are 27 equally likely
possibilities for the value of (X1∗ , X2∗ , X3∗ ), the sample of size 3 from F̂n , namely

(X(1) , X(1) , X(1) ), (X(1) , X(1) , X(2) ), . . . , (X(3) , X(3) , X(3) ).

Of these 27 possibilities, exactly 1 + 6 = 7 have the value X(1) occurring 2 or 3 times.


Therefore, we obtain
7 13 7
P (θ̂n∗ = X(1) ) = , P (θ̂n∗ = X(2) ) = , and P (θ̂n∗ = X(3) ) = .
27 27 27

This implies that
\[
E_{\hat F_n} \hat\theta_n^* = \frac{1}{27}\left(7X_{(1)} + 13X_{(2)} + 7X_{(3)}\right) \quad\text{and}\quad E_{\hat F_n} (\hat\theta_n^*)^2 = \frac{1}{27}\left(7X_{(1)}^2 + 13X_{(2)}^2 + 7X_{(3)}^2\right).
\]
Therefore, since θ̂n = X(2) , we obtain
\[
T_n^B(\hat F_n) = \frac{1}{27}\left(7X_{(1)} - 14X_{(2)} + 7X_{(3)}\right)
\]
and
\[
T_n^V(\hat F_n) = \frac{14}{729}\left(10X_{(1)}^2 + 13X_{(2)}^2 + 10X_{(3)}^2 - 13X_{(1)}X_{(2)} - 13X_{(2)}X_{(3)} - 7X_{(1)}X_{(3)}\right).
\]
729
To obtain the sampling distribution of these estimators, of course, we would have to consider
the joint distribution of (X(1) , X(2) , X(3) ). Naturally, the calculations become even more
difficult as n increases.
Alternatively, we could use resampling in order to approximate TnB (F̂n ) and TnV (F̂n ). This is
the bootstrapping idea, and it works like this: For some large number B, simulate B random
samples from F̂n , namely
\[
\begin{aligned}
&X_{11}^*, \dots, X_{1n}^*,\\
&\quad\vdots\\
&X_{B1}^*, \dots, X_{Bn}^*,
\end{aligned}
\]

and approximate a quantity like E F̂n θ̂n∗ by the sample mean
\[
\frac{1}{B} \sum_{i=1}^B \hat\theta_{in}^*,
\]
where θ̂in∗ is the sample median of the ith bootstrap sample Xi1∗ , . . . , Xin∗ . Notice that the
weak law of large numbers asserts that
\[
\frac{1}{B} \sum_{i=1}^B \hat\theta_{in}^* \xrightarrow{P} E_{\hat F_n} \hat\theta_n^*.
\]
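As a concrete sketch of this recipe in R (the sample, the distribution, and B are all chosen arbitrarily), the bias and variance functionals of the sample median can be approximated as follows; with n = 3 the answers can be checked against the exact formulas derived above.

# Bootstrap approximation of T_n^B(F-hat_n) and T_n^V(F-hat_n) for the sample median
set.seed(8)
n <- 3
B <- 10000
x <- rnorm(n)
thetahat <- median(x)
thetastar <- replicate(B, median(sample(x, replace = TRUE)))   # bootstrap sample medians
boot_bias <- mean(thetastar) - thetahat
boot_var <- var(thetastar)
xo <- sort(x)                                                  # exact n = 3 values from the text
exact_bias <- (7 * xo[1] - 14 * xo[2] + 7 * xo[3]) / 27
exact_var <- 14 / 729 * (10 * xo[1]^2 + 13 * xo[2]^2 + 10 * xo[3]^2 -
                         13 * xo[1] * xo[2] - 13 * xo[2] * xo[3] - 7 * xo[1] * xo[3])
c(boot_bias = boot_bias, exact_bias = exact_bias, boot_var = boot_var, exact_var = exact_var)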

To recap, then, we wish to estimate some parameter Tn (F ) for an unknown distribution F
based on a random sample from F . We estimate Tn (F ) by Tn (F̂n ), but it is not easy to
evaluate Tn (F̂n ), so we approximate Tn (F̂n ) by resampling B times from F̂n and obtain a
bootstrap estimator $T^*_{B,n}$. Thus, there are two relevant issues:

1. How good is the approximation of Tn (F̂n ) by $T^*_{B,n}$? (Note that Tn (F̂n ) is NOT an
unknown parameter; it is “known” but hard to evaluate.)
2. How precise is the estimation of Tn (F ) by Tn (F̂n )?

Question 1 is usually addressed by an asymptotic argument, using the weak law of large numbers
or the central limit theorem and letting B → ∞. For example, if we have an expectation functional
Tn (F ) = E F h(X1 , . . . , Xn ), then
\[
T_{B,n}^* = \frac{1}{B} \sum_{i=1}^B h(X_{i1}^*, \dots, X_{in}^*) \xrightarrow{P} T_n(\hat F_n)
\]
as B → ∞.
Question 2, on the other hand, is often tricky; asymptotic results involve letting n → ∞ and
are handled case-by-case. We will not discuss these asymptotics here. On a related note,
however, there is an argument in Lehmann’s book (on pages 432–433) about why a plug-in
estimator may be better than an asymptotic estimator. That is, if it is possible to show
Tn (F ) → T as n → ∞, then as an estimator of Tn (F ), Tn (F̂n ) may be preferable to T .
We conclude this section by considering the so-called parametric bootstrap. If we assume that
the unknown distribution function F comes from a family of distribution functions indexed
by a parameter µ, then Tn (F ) is really Tn (Fµ ). Then, instead of the plug-in estimator Tn (F̂n ),
we might consider the estimator Tn (Fµ̂ ), where µ̂ is an estimator of µ.
Everything proceeds as in the nonparametric version of bootstrapping. Since it may not be
easy to evaluate Tn (Fµ̂ ) explicitly, we first find µ̂ and then take B random samples of size
n, namely X∗11 , . . . , X∗1n through X∗B1 , . . . , X∗Bn , from Fµ̂ . These samples are used to approximate
Tn (Fµ̂ ).

Example 10.15 Suppose X1 , . . . , Xn is a random sample from Poisson(µ). Take µ̂ = X̄.
Suppose Tn (Fµ ) = Var Fµ µ̂. In this case, we happen to know that Tn (Fµ ) =
µ/n, but let's ignore this knowledge and apply a parametric bootstrap. For some
large B, say 500, generate B samples from Poisson(µ̂) and use the sample variance
of the µ̂∗ values as an approximation to Tn (Fµ̂ ). In R, with µ = 1 and n = 20 we obtain
x <- rpois(20,1) # Generate the sample from F
muhat <- mean(x)
muhat
[1] 0.85
muhatstar <- rep(0,500) # Allocate the vector for muhatstar
for(i in 1:500) muhatstar[i] <- mean(rpois(20,muhat))
var(muhatstar)
[1] 0.04139177

Note that the estimate 0.041 is close to the known true value 0.05. This example
is simplistic because we already know that Tn (F ) = µ/n, which makes µ̂/n a more
natural estimator. However, it is not always so simple to obtain a closed-form
expression for Tn (F ).
Incidentally, we could also use a nonparametric bootstrap approach in this exam-
ple:
muhatstar2 <- rep(0,500)  # Allocate the vector for muhatstar2
for (i in 1:500) muhatstar2[i] <- mean(sample(x,replace=T))
var(muhatstar2)
[1] 0.0418454
Of course, 0.042 is an approximation to Tn (F̂n ) rather than Tn (Fµ̂ ). Furthermore,
we can obtain a result arbitrarily close to Tn (F̂n ) by increasing the value of B:
muhatstar2 <- rep(0,100000)
for (i in 1:100000) muhatstar2[i] <- mean(sample(x,replace=T))
var(muhatstar2)
[1] 0.04136046
In fact, it is in principle possible to obtain an approximate variance for our
estimates of Tn (F̂n ) and Tn (Fµ̂ ), and, using the central limit theorem, construct
approximate confidence intervals for these quantities. This would allow us to
specify the quantities to any desired level of accuracy.

Exercises for Section 10.4


Exercise 10.13 (a) Devise a nonparametric bootstrap scheme for setting confidence
intervals for β in the linear regression model Yi = α + βxi + εi . There is more
than one possible answer.
(b) Using B = 1000, implement your scheme on the following dataset to obtain a
95% confidence interval. Compare your answer with the standard 95% confidence
interval.
Y 21 16 20 34 33 43 47
x 460 498 512 559 614 675 719

(In the dataset, Y is the number of manatee deaths due to collisions with power-
boats in Florida and x is the number of powerboat registrations in thousands for
even years from 1978-1990.)

Exercise 10.14 Consider the following dataset that lists the latitude and mean Au-
gust temperature in degrees Fahrenheit for 7 US cities. The residuals are listed
for use in part (b).

City Latitude Temperature Residual


Miami 26 83 -5.696
Phoenix 33 92 10.116
Memphis 35 81 1.062
Baltimore 39 76 -0.046
Pittsburgh 40 71 -4.073
Boston 42 72 -1.127
Portland, OR 46 69 -0.235

Minitab gives the following output for a simple linear regression:


Predictor Coef SE Coef T P
Constant 113.99 13.01 8.76 0.000
latitude -0.9730 0.3443 -2.83 0.037

S = 5.546 R-Sq = 61.5% R-Sq(adj) = 53.8%


Note that this gives an asymptotic estimate of the variance of the slope parameter
as .3443² = .1185.
In (a) through (c) below, use the described method to simulate B = 500 bootstrap
samples (x∗b1 , y∗b1 ), . . . , (x∗b7 , y∗b7 ) for 1 ≤ b ≤ B. For each b, refit the model
to obtain β̂∗b . Report the sample variance of β̂∗1 , . . . , β̂∗B and compare with the
asymptotic estimate of .1185.
(a) Parametric bootstrap. Take x∗bi = xi for all b and i. Let y∗bi = β̂0 + β̂1 xi + εi ,
where εi ∼ N (0, σ̂²). Obtain β̂0 , β̂1 , and σ̂² from the above output.
(b) Nonparametric bootstrap I. Take x∗bi = xi for all b and i. Let y∗bi =
β̂0 + β̂1 xi + r∗bi , where r∗b1 , . . . , r∗b7 is an iid sample from the empirical distribution
of the residuals from the original model (you may want to refit the original model
to find these residuals).
(c) Nonparametric bootstrap II. Let (x∗b1 , y∗b1 ), . . . , (x∗b7 , y∗b7 ) be an iid sample
from the empirical distribution of (x1 , y1 ), . . . , (x7 , y7 ).
Note: In R or Splus, you can obtain the slope coefficient of the linear regression
of the vector y on the vector x using lm(y~x)$coef[2].

Exercise 10.15 The same resampling idea that is exploited in the bootstrap can be

used to approximate the value of difficult integrals by a technique sometimes
called Monte Carlo integration. Suppose we wish to compute
\[
\theta = 2 \int_0^1 e^{-x^2} \cos^3(x)\, dx.
\]

(a) Use numerical integration (e.g., the integrate function in R and Splus) to
verify that θ = 1.070516.
(b) Define $g(t) = 2e^{-t^2}\cos^3(t)$. Let U1 , . . . , Un be an iid uniform(0,1) sample. Let
\[
\hat\theta_1 = \frac{1}{n} \sum_{i=1}^n g(U_i).
\]

Prove that θ̂1 → θ in probability.

(c) Define h(t) = 2 − 2t. Prove that if we take Vi = 1 − √Ui for each i, then Vi
is a random variable with density h(t). Prove that with
\[
\hat\theta_2 = \frac{1}{n} \sum_{i=1}^n \frac{g(V_i)}{h(V_i)},
\]
we have θ̂2 → θ in probability.
(d) For n = 1000, simulate θ̂1 and θ̂2 . Give estimates of the variance for each
estimator by reporting σ̂ 2 /n for each, where σ̂ 2 is the sample variance of the g(Ui )
or the g(Vi )/h(Vi ) as the case may be.
(e) Plot, on the same set of axes, g(t), h(t), and the standard uniform density
for t ∈ [0, 1]. From this plot, explain why the variance of θ̂2 is smaller than the
variance of θ̂1 . [Incidentally, the technique of drawing random variables from a
density h whose shape is close to the function g of interest is a variance-reduction
technique known as importance sampling.]
Note: This was sort of a silly example, since numerical methods yield an ex-
act value for θ. However, with certain high-dimensional integrals, the “curse of
dimensionality” makes exact numerical methods extremely time-consuming com-
putationally; thus, Monte Carlo integration does have a practical use in such
cases.
