
A Modern Gauss-Markov Theorem

Bruce E. Hansen*
University of Wisconsin†

December, 2020
Revised: December 2021

Abstract

This paper presents finite sample efficiency bounds for the core econometric prob-
lem of estimation of linear regression coefficients. We show that the classical Gauss-
Markov Theorem can be restated omitting the unnatural restriction to linear estima-
tors, without adding any extra conditions. Our results are lower bounds on the vari-
ances of unbiased estimators. These lower bounds correspond to the variances of the
least squares estimator and the generalized least squares estimator, depending on
the assumption on the error covariances. These results show that we can drop the label
“linear estimator” from the pedagogy of the Gauss-Markov Theorem. Instead of refer-
ring to these estimators as BLUE, they can legitimately be called BUE (best unbiased
estimators).

* Research support from the NSF and the Phipps Chair is gratefully acknowledged. I posthumously
thank Gary Chamberlain for encouraging me to study finite sample semi-parametric efficiency, Jack
Porter for persuading me to write these results into a paper, and Yuzo Maruyama for catching an error
in the proof. I also thank Roger Koenker, Stephen Portnoy, and three referees for thoughtful comments
and suggestions.

Department of Economics, 1180 Observatory Drive, University of Wisconsin, Madison WI 53706.

1 Introduction
Three central results in core econometric theory are BLUE, Gauss-Markov, and Aitken’s.
The BLUE theorem states that the best (minimum variance) linear unbiased estimator
of a population expectation is the sample mean. The Gauss-Markov theorem states that
in a linear homoskedastic regression model the minimum variance linear unbiased esti-
mator of the regression coefficient is the least squares estimator. Aitken’s generalization
states that in a linear regression model with a general covariance matrix structure the
minimum variance linear unbiased estimator is the generalized least squares estimator.
These results are straightforward to prove and interpret, and thus are taught in intro-
ductory through advanced courses. The theory, however, has a gaping weakness. The
restriction to linear estimators is unnatural. There is no justifiable reason for modern
econometrics to restrict estimation to linear methods. This leaves open the question of
whether nonlinear estimators could do better than least squares.
One possible answer lies in the theory of uniform minimum variance unbiased (UMVU)
estimation (see, e.g., Chapter 2 of Lehmann and Casella (1998)). Lehmann and Casella
(1998, Example 4.2) demonstrate that the sample mean is UMVU for the class of distri-
butions having a density. This restriction is critical for their demonstration: the argument
does not generalize to distributions without densities, and it is unclear whether the
approach applies to regression models.
A second possible answer is provided by the Cramér-Rao theorem. In the normal re-
gression model the minimum variance unbiased estimator of the regression coefficient
is least squares. This result removes the restriction to linearity. But the result is limited
to normal regression and so does not provide a complete answer.
A third possible answer is provided by the local asymptotic minimax theorem (see
Hajek (1972) and van der Vaart (1998, Chapter 8)) which states that in parametric mod-
els, estimation mean squared error cannot be asymptotically smaller than the Cramér-
Rao lower bound. This removes the restriction to linear and unbiased estimators, but is
focused on a parametric asymptotic framework.
A fourth approach to the problem is semi-parametric asymptotic efficiency, which
includes Stein (1956), Levit (1975), Begun, Hall, Huang, and Wellner (1983), Chamber-
lain (1987), Ritov and Bickel (1990), Newey (1990), Bickel, Klaassen, Ritov, and Wellner
(1993), and van der Vaart (1998, Chapter 25). This literature develops asymptotic ef-
ficiency bounds for estimation in semi-parametric models including linear regression.
This theory removes the restriction to linear unbiased estimators and parametric mod-
els, but only provides asymptotic efficiency bounds, not finite sample bounds. This lit-
erature leaves open the possibility that reduced estimation variance might be achieved
in finite samples by alternative estimators.
A fifth approach is adaptive efficiency under an independence or symmetry assump-
tion. If the regression error is independent of the regressors and/or symmetrically dis-
tributed about zero, efficiency improvements may be possible. If the regression er-
ror is fat-tailed, these improvements can be substantial. This literature includes the
quantile regression estimator of Koenker and Bassett (1978), the adaptive regression
estimator of Bickel (1982), and the generalized t estimator of McDonald and Newey
(1988). These improvements are only obtained under the validity of the imposed in-
dependence/symmetry assumptions; otherwise the estimators are inconsistent.
Our paper extends the above literatures by providing finite sample variance lower
bounds for unbiased estimation of linear regression coefficients without the restriction
to linear estimators and without the restriction to parametric models. Our results are
semi-parametric, imposing no restrictions on distributions beyond the existence of the
first two moments and no restriction on estimators beyond unbiasedness. Our lower
bounds generalize the classical BLUE and Gauss-Markov lower bounds, as we show that
the same bounds hold in finite samples without the restriction to linear estimators. Our
lower bounds also update the asymptotic semi-parametric lower bounds of Chamber-
lain (1987), as we show that the same bounds hold in finite samples for unbiased esti-
mators.
The results in this paper are a finite-sample version of the insight by Stein (1956)
that the supremum of Cramér-Rao bounds over all regular parametric submodels is a
lower bound on the asymptotic estimation variance. Our twist turns Stein’s insight into
a finite-sample argument, thereby constructing a lower bound on the finite-sample vari-
ance. Stein’s insight lies at the core of semi-parametric efficiency theory. Thus, our result
provides a bridge between finite-sample and semi-parametric efficiency theory.
Our primary purpose is to generalize the Gauss-Markov Theorem, providing a finite-
sample yet semi-parametric efficiency justification for least squares estimation. A by-
product of our result is the observation that it is impossible to achieve lower variance
than least squares without incurring estimation bias. Consequently, the simultaneous
goals of unbiasedness and low variance are incompatible. If estimators are low variance
(relative to least squares) they must be biased. This is not an argument against non-
parametric, shrinkage, or machine learning estimation, but rather a statement that these
estimation methods should be acknowledged as biased, and that this bias is necessary
to achieve variance reductions.
Our results (similarly to BLUE, Gauss-Markov, Aitken, and Cramér-Rao) focus on un-
biased estimators, and thereby are restricted to the special context where unbiased esti-
mators exist. Indeed, the existence of an unbiased estimator is a necessary condition for
a finite variance bound. Doss and Sethuraman (1989) showed that when no unbiased
estimator exists, then any sequence of estimators with bias tending to zero will have
variance tending to infinity. A related literature (Zyskind and Martin (1969), Harville
(1981)) concerns conditions for linear estimators to be unbiased when allowing for gen-
eral covariance matrices.
A caveat is that the class of nonlinear unbiased estimators is small. As shown by
Koopmann (1982) and discussed in Gnot, Knautz, Trenkler, and Zmyslony (1992), any
unbiased estimator of the regression coefficient can be written as a linear-quadratic
function of the dependent variable Y . Koopmann’s result shows that while nonlinear
unbiased estimators exist, they constitute a narrow class.
The literature contains papers which generalize the Gauss-Markov theorem to allow
nonlinear estimators, but all are restrictive on the class of allowed nonlinearity, and all
are restrictive on the class of allowed error distributions. For example, Kariya (1985) al-
lows for estimators where the nonlinearity can be written in terms of the least squares
residuals. Berk and Hwang (1989) and Kariya and Kurata (2002) allow for nonlinear es-
timators which fall within certain equivariant classes. Each of these papers restricts the
error distributions to satisfy a form of spherical symmetry. In contrast, the results pre-
sented in this paper do not impose any restrictions on the estimators other than unbi-
asedness, and do not impose any restrictions on the error distributions.
The proof of our main result (presented in Section 6) is not inherently difficult, but
is not elementary either. It might be described as nuanced. It is based on a trick used
by Newey (1990, Appendix B) in his development of an asymptotic semi-parametric ef-
ficiency bound for estimation of a population expectation.

2 Gauss-Markov Theorem
Let Y be an n × 1 random vector and X an n × m full-rank regressor matrix with
m < n. We will treat X as fixed, though all the results apply to random regressors by
conditioning on X.
The linear regression model is

Y = Xβ + e                                                              (1)
E[e] = 0                                                                (2)
var[e] = E[ee′] = σ²Σ < ∞                                               (3)

where e is the n × 1 vector of regression errors. It is assumed that the n × n matrix Σ > 0
is known while the scalar σ² > 0 is unknown.
Let F2 be the set of joint distributions F of random vectors Y satisfying (1)-(3). This
is the set of random vectors whose expectation is a linear function of X and whose
covariance matrix is finite. Equivalently, F2 consists of all distributions which satisfy a
linear regression.
The homoskedastic and serially uncorrelated linear regression model adds the as-
sumption
Σ = I n. (4)

Let F02 ⊂ F2 be the set of joint distributions satisfying (1)-(4). The standard estimator of
β in model F02 is least squares

β̂ols = (X′X)⁻¹X′Y.

For all F ∈ F2, β̂ols is unbiased for β, and for all F ∈ F02, β̂ols has variance var[β̂ols] =
σ²(X′X)⁻¹. The question of efficiency is whether there is an alternative unbiased esti-
mator with reduced variance.
The classical Gauss-Markov Theorem applies to linear estimators of β, which are
estimators that can be written as β̂ = A(X)Y, where A(X) is an m × n function of X.
Linearity in this context means “linear in Y ”.

Theorem 1 (Gauss-Markov). If β̂ is a linear estimator, and unbiased for all F ∈ F2, then

var[β̂] ≥ σ²(X′X)⁻¹

for all F ∈ F02.

In words, no unbiased linear estimator has a finite sample covariance matrix smaller
than that of the least squares estimator. As this is the exact variance of the least squares
estimator, it follows that in the homoskedastic linear regression model, least squares is
the minimum variance linear unbiased estimator.
Part of the beauty of the Gauss-Markov Theorem is its simplicity. The only assump-
tions on the distribution concern the first and second moments of Y . The only assump-
tions on the estimator are linearity and unbiasedness. The statement in the theorem
that β̂ “is unbiased for all F ∈ F2” clarifies the context under which the estimator is re-
quired to be unbiased. The requirement that β̂ must be unbiased for any distribution
means that we are excluding estimators such as β̂ = 0, which is “unbiased” when the
true value satisfies β = 0. The estimator β̂ = 0 is not unbiased in the general set of linear
regression models F2 and so is not unbiased in the sense of the theorem.
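To make the theorem concrete, the following Python sketch (an illustration of my own, with an arbitrary design matrix, weight matrix, and sample size rather than anything from the paper) simulates a homoskedastic regression and compares least squares with another linear unbiased estimator of the form A(X)Y, here a weighted least squares estimator with fixed, deliberately misspecified weights. Both are unbiased, but the alternative cannot beat the Gauss-Markov bound σ²(X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed regressors
beta = np.array([1.0, 2.0])
W = np.diag(1.0 + rng.uniform(size=n))                   # arbitrary fixed weights

def ols(Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

def wls(Y):
    # another linear unbiased estimator: A(X)Y with A(X) = (X'WX)^{-1} X'W
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

reps = 20000
b_ols = np.empty((reps, m))
b_wls = np.empty((reps, m))
for r in range(reps):
    Y = X @ beta + rng.normal(size=n)        # homoskedastic errors, sigma^2 = 1
    b_ols[r] = ols(Y)
    b_wls[r] = wls(Y)

bound = np.linalg.inv(X.T @ X)               # sigma^2 (X'X)^{-1} with sigma^2 = 1
print("Gauss-Markov bound, diag:", np.diag(bound))
print("simulated var(OLS), diag:", b_ols.var(axis=0))    # approximately attains it
print("simulated var(WLS), diag:", b_wls.var(axis=0))    # weakly larger
```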
An unsatisfying feature of the Gauss-Markov Theorem is that it restricts attention to
linear estimators. This is unnatural as there is no reason to exclude nonlinear estimators.
Consequently, when the Gauss-Markov Theorem is taught it is typically followed by the
Cramér-Rao Theorem.
Let Fφ2 ⊂ F02 be the set of joint distributions satisfying (1)-(4) plus e ∼ N(0, I n σ²).

Theorem 2 (Cramér-Rao). If β̂ is unbiased for all F ∈ Fφ2, then

var[β̂] ≥ σ²(X′X)⁻¹

for all F ∈ Fφ2.

The Cramér-Rao Theorem shows that the restriction to linear estimators is unnec-
essary in the class of normal regression models. To obtain this result, in addition to
the Gauss-Markov assumptions, the Cramér-Rao Theorem adds the assumption that
the observations are independent and normally distributed. The normality assumption
is restrictive, however, so neither the Gauss-Markov nor Cramér-Rao Theorem is fully
satisfactory. Consequently, the two are typically taught as a pair with the joint goal of
justifying the variance lower bound σ²(X′X)⁻¹ and hence least squares estimation.
Closely related to the Gauss-Markov Theorem is the generalization by Aitken (1935)
to the context of general covariance matrices. In the linear regression model with non-
scalar covariance matrix Σ, Aitken’s generalized least squares (GLS) estimator is

β̂gls = (X′Σ⁻¹X)⁻¹X′Σ⁻¹Y.

For all F ∈ F2, β̂gls is unbiased for β and has variance var[β̂gls] = σ²(X′Σ⁻¹X)⁻¹. The
question of efficiency is whether there is an alternative unbiased estimator with smaller
variance. Aitken’s Theorem follows Gauss-Markov in restricting attention to linear esti-
mators.

Theorem 3 (Aitken). If β̂ is a linear estimator, and unbiased for all F ∈ F2, then

var[β̂] ≥ σ²(X′Σ⁻¹X)⁻¹

for all F ∈ F2.

Aitken’s Theorem is less celebrated than the traditional Gauss-Markov Theorem, but
perhaps is more illuminating. It shows that, in general, the variance lower bound equals
the covariance matrix of the GLS estimator. Thus, in the general linear regression model,
generalized least squares is the minimum variance linear unbiased estimator. Aitken’s
theorem, however, rests on the restriction to linear estimators just as the Gauss-Markov
Theorem. In the context of independent observations, Aitken’s bound corresponds to
the asymptotic semi-parametric efficiency bound established by Chamberlain (1987).
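The following sketch (again my own illustrative construction, not taken from the paper) repeats the simulation exercise with a known non-scalar Σ: both OLS and GLS remain unbiased, but only GLS attains Aitken’s bound σ²(X′Σ⁻¹X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # fixed regressors
beta = np.array([0.5, -1.0])
Sigma = np.diag(rng.uniform(0.2, 5.0, size=n))            # known heteroskedastic Sigma
Sigma_inv = np.linalg.inv(Sigma)
L = np.linalg.cholesky(Sigma)                             # to draw errors with var Sigma

def ols(Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

def gls(Y):
    return np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ Y)

reps = 20000
b_ols = np.empty((reps, 2))
b_gls = np.empty((reps, 2))
for r in range(reps):
    Y = X @ beta + L @ rng.normal(size=n)                 # var[e] = Sigma, sigma^2 = 1
    b_ols[r] = ols(Y)
    b_gls[r] = gls(Y)

bound = np.linalg.inv(X.T @ Sigma_inv @ X)                # Aitken bound
print("Aitken bound, diag:      ", np.diag(bound))
print("simulated var(GLS), diag:", b_gls.var(axis=0))     # approximately attains it
print("simulated var(OLS), diag:", b_ols.var(axis=0))     # strictly larger here
```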
The development of least squares and the Gauss-Markov Theorem involved a series
of contributions from some of the most influential probabilists of the nineteenth through
early twentieth centuries. The method of least squares was introduced by Adrien Marie
Legendre (1805) as essentially an algorithmic solution to the problem of fitting coeffi-
cients when there are more equations than unknowns. This was quickly followed by
Carl Friedrich Gauss (1809), who provided a probabilistic foundation. Gauss proposed
that the equation errors be treated as random variables, and showed that if their den-
sity takes the form we now call “normal” or “Gaussian” then the maximum likelihood
estimator of the coefficient equals the least squares estimator. Shortly afterward, Pierre
Simon Laplace (1811) justified this choice of density function by showing that his central
limit theorem implied that linear estimators are approximately normally distributed in
large samples, and that in this context the lowest variance estimator is the least squares
estimator. Gauss (1823) synthesized these results and showed that the core result only
relies on the first and second moments of the observations and holds in finite samples.
Andreı̆ Andreevich Markov (1912) provided a textbook treatment of the theorem, and
clarified the central role of unbiasedness, which Gauss had only assumed implicitly. Fi-
nally, Alexander Aitken (1935) generalized the theorem to cover the case of arbitrary but
known covariance matrices. This history, and other details, are documented in Plackett
(1949) and Stigler (1986).

3 Modern Gauss-Markov
We now present our main result. We are interested in whether Aitken’s version of the
Gauss-Markov Theorem holds without the restriction to linear estimators.

Theorem 4 If β̂ is unbiased for all F ∈ F2, then

var[β̂] ≥ σ²(X′Σ⁻¹X)⁻¹

for all F ∈ F2.

We provide a sketch of the proof in Section 4 and a full proof in Section 6.


Theorem 4 is identical to Theorem 3, but without the limitation to linear estimators.
Theorem 4 is a strict improvement, as no additional condition is imposed. This shows
that the GLS estimator is the minimum variance unbiased estimator (MVUE) of β.
We can specialize to the context of homoskedastic and serially uncorrelated obser-
vations.

Theorem 5 If β̂ is unbiased for all F ∈ F2, then

var[β̂] ≥ σ²(X′X)⁻¹

for all F ∈ F02.

Theorem 5 is identical to Theorem 1, but without the limitation to linear estimators.


Again, this is a strict improvement. The implication is that in the homoskedastic linear
regression model, ordinary least squares is the MVUE of β.
Theorem 5 is also an improvement on Theorem 2 as it lifts the normality assump-
tion of the normal regression model. It is not a strict improvement, however, as the
Cramér-Rao Theorem only requires the estimator to be unbiased in the class of normal
regression models, while Theorem 5 requires unbiasedness for all regression models.
An important special case of Theorem 5 is estimation of the population expectation.
This is the linear regression model where X only contains a vector of ones.

Assume that the elements of Y have a common expectation µ with covariance matrix
Σσ². Equivalently, assume E[Y] = 1n µ and var[Y] = Σσ², where 1n is a vector of ones.
Let G2 be the set of joint distributions F of random vectors Y satisfying these conditions,
and let G02 be the subset with Σ = I n. G02 is the set of uncorrelated random variables
with a common variance. The standard estimator of µ is the sample mean Ȳ, which is
unbiased and has variance var[Ȳ] = σ²/n for F ∈ G02.

Theorem 6 If µ̂ is unbiased for all F ∈ G2, then var[µ̂] ≥ σ²/n for all F ∈ G02.

As the lower bound σ²/n equals var[Ȳ], we deduce that the sample mean is the
MVUE of µ. Equivalently, the sample mean is the best unbiased estimator (BUE) – there
is no need for the classical “linear” modifier.
Essentially, Theorems 4, 5, and 6 show that we can drop the label “linear estima-
tor” from the pedagogy of the Gauss-Markov Theorem. Instead, GLS, OLS, and sample
means are the best unbiased estimators of their population counterparts.
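A small simulation helps to show what Theorem 6 does and does not rule out (this is my own hedged illustration, with arbitrary distributional choices, not an example from the paper). The sample mean attains the bound σ²/n. A competitor such as the sample median can have smaller variance for particular distributions, for example the Laplace, yet this does not contradict Theorem 6: the median is unbiased only for symmetric members of G2 and is biased under skewed ones, echoing the discussion of symmetry-based adaptive estimators in Section 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 101, 40000

# Symmetric case: Laplace observations with variance sigma^2 = 2 b^2
b = 1.0
sigma2 = 2.0 * b**2
Y = rng.laplace(loc=0.0, scale=b, size=(reps, n))
print("bound sigma^2/n:   ", sigma2 / n)
print("var(sample mean):  ", Y.mean(axis=1).var())        # approximately the bound
print("var(sample median):", np.median(Y, axis=1).var())  # smaller, but see below

# The median escapes Theorem 6 only because it is not unbiased for every F in G2:
# under a skewed member (recentered exponential, mean 0) it is clearly biased.
Z = rng.exponential(scale=1.0, size=(reps, n)) - 1.0
print("median bias, skewed F:", np.median(Z, axis=1).mean())   # about ln(2) - 1
```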

4 A Sketch of the Proof


In this section we give a simplified proof of Theorem 4, deferring a complete argu-
ment to Section 6.
For simplicity, suppose that the joint distribution F(y) of the n × 1 random vector Y
has a density f(y) with bounded support 𝒴. Without loss of generality assume that the
true coefficient equals β0 = 0 and that σ² = 1. We use here the assumption of bounded
support to simplify the proof; it is not used in the complete proof of Section 6.
Because Y has bounded support 𝒴 there is a set B ⊂ Rᵐ such that |y′Σ⁻¹Xβ| < 1 for
all β ∈ B and y ∈ 𝒴. For such values of β, define the auxiliary density function

fβ(y) = f(y)(1 + y′Σ⁻¹Xβ).                                              (5)

Under the assumptions, 0 ≤ fβ(y) ≤ 2f(y), fβ(y) has support 𝒴, and ∫𝒴 fβ(y)dy = 1. To
see the latter, observe that ∫𝒴 y f(y)dy = Xβ0 = 0 under the normalization β0 = 0, and
thus

∫𝒴 fβ(y)dy = ∫𝒴 f(y)dy + ∫𝒴 f(y)y′dy Σ⁻¹Xβ = 1

because ∫𝒴 f(y)dy = 1. Thus fβ is a parametric family of density functions with an as-
sociated distribution function Fβ. Evaluated at β0 we see that f0 = f, which means that
Fβ is a correctly-specified parametric family with true parameter value β0 = 0.
To illustrate, take the case of a single observation with X = 1. Figure 1(a) displays an
example density f(y) = (3/4)(1 − y²) on [−1, 1] with auxiliary density fβ(y) = f(y)(1 + y).
We can see how the auxiliary density is a tilted version of the original density f(y).

[Figure 1: Illustrations. (a) True and Auxiliary Densities: the density f(y) and the tilted density fβ(y). (b) Space of Distribution Functions: the set F2 containing the true distribution F0 and the parametric family Fβ.]

Let Eβ denote expectation with respect to the auxiliary distribution. Because
∫𝒴 y f(y)dy = 0 and ∫𝒴 y y′f(y)dy = Σ, we find

Eβ[Y] = ∫𝒴 y fβ(y)dy = ∫𝒴 y f(y)dy + ∫𝒴 y y′f(y)dy Σ⁻¹Xβ = Xβ.

This shows that Fβ is a regression model with regression coefficient β.


In Figure 1(a), the means of the two densities are indicated by the arrows to the x-
axis. In this example we can see how the auxiliary density has a larger expected value,
because the density has been tilted to the right.
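A quick numerical check of this single-observation example (a sketch under the stated normalizations, not part of the paper): here Σ = E[Y²] = 1/5, so the displayed tilt f(y)(1 + y) corresponds to (5) with Σ⁻¹Xβ = 1, that is β = 1/5. The auxiliary density should integrate to one and have mean Xβ = 1/5.

```python
from scipy.integrate import quad

def f(y):
    return 0.75 * (1.0 - y**2)        # base density on [-1, 1]

def f_beta(y):
    return f(y) * (1.0 + y)           # auxiliary density (5) with Sigma^{-1} X beta = 1

mass, _ = quad(f_beta, -1.0, 1.0)                    # total mass: should be 1
mean, _ = quad(lambda y: y * f_beta(y), -1.0, 1.0)   # mean: should be X beta = 1/5
Sigma, _ = quad(lambda y: y**2 * f(y), -1.0, 1.0)    # Sigma = E[Y^2] = 1/5 under f

print(mass, mean, Sigma)   # approximately 1.0, 0.2, 0.2
```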
The parametric family F β over β ∈ B has the following properties: its expectation
is X β, its variance is finite, the true value β0 lies in the interior of B , and the support
of the distribution does not depend on β. To visualize, Figure 1(b) displays the space of
finite-variance distributions F2 by the large circle. The dot indicates the true distribution
F = F 0 . The curved line represents the distribution family F β . This family F β is a sliver in
the space of distributions F2 but includes the true distribution F .

The likelihood score of the auxiliary density function is

S = (∂/∂β) log fβ(Y)|β=0 = (∂/∂β)(log f(Y) + log(1 + Y′Σ⁻¹Xβ))|β=0 = X′Σ⁻¹Y.        (6)

Therefore the information matrix is

I = E[SS′] = X′Σ⁻¹E[YY′]Σ⁻¹X = X′Σ⁻¹X.

By assumption, β̂ is unbiased for all finite-variance distributions (the large circle in
Figure 1(b)). This means that β̂ is unbiased in the subset Fβ (the curve in Figure 1(b)).
The Cramér-Rao lower bound states that

var[β̂] ≥ I⁻¹ = (X′Σ⁻¹X)⁻¹.

This is the variance lower bound, completing the proof.


Some explanation may help as the argument may appear to have pulled the prover-
bial “rabbit out of the hat”. Somehow we deduced a general variance lower bound, even
though we only examined a rather artificial-looking auxiliary model. A key insight due
to Stein (1956) is that the supremum of Cramér-Rao bounds over all regular parametric
submodels is a lower bound on the variance of any unbiased estimator. Stein’s insight
focused on asymptotic variances, but the same argument applies to finite sample vari-
ances, because the Cramér-Rao bound is a finite sample result. A corollary of Stein’s
insight is that the Cramér-Rao bound of any single regular parametric submodel is a
valid lower bound on the variance of any unbiased estimator. If this submodel is se-
lected judiciously, its Cramér-Rao bound will equal the supremum over all submodels,
and this holds when this Cramér-Rao bound equals the known finite-sample variance of
a candidate efficient estimator, which in our case is the GLS estimator.
Another way of looking at this is as follows. Because Fβ ⊂ F2, estimation over Fβ
cannot be harder than estimation over the full set F2. Thus the attainable variance for
estimation over Fβ cannot be larger than that for estimation over F2. This means that the
Cramér-Rao bound for Fβ is a lower bound for the full set F2.
This raises the question: How was the density (5) constructed? The trick is to con-
struct a density which (i) includes the true density as a special case, (ii) is a regression
model, and (iii) has a Cramér-Rao bound equal to the variance of the GLS estimator. The
key is (6), which shows that the likelihood score of (5) is proportional to the score of the
normal regression model with covariance matrix Σ. This was achieved by constructing
(5) to be proportional to the normal regression score.
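To spell this out (a supplementary derivation, not in the original text): the Gaussian regression submodel Y ∼ N(Xβ, Σ) (with σ² = 1) has log-likelihood

ℓ(β) = const − (1/2)(Y − Xβ)′Σ⁻¹(Y − Xβ),

whose score is ∂ℓ(β)/∂β = X′Σ⁻¹(Y − Xβ). At β0 = 0 this equals X′Σ⁻¹Y, the score (6) of the auxiliary family, and its information matrix is X′Σ⁻¹X, the inverse of the GLS variance, which is exactly property (iii) above.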

5 Conclusion
A core question in econometric methodology is: Why do we use specific estimators?
Why not others? A standard answer is efficiency: the estimators are best (in some sense)
among all estimators (in a class) for all data distributions (in some set). The Gauss-
Markov Theorem is a core efficiency result but restricts attention to linear estimators –
and this is an inherently uninteresting restriction. The present paper lifts this restric-
tion without imposing additional cost. Henceforth, least squares should be described
as the “best unbiased estimator” of the regression coefficient; the “linear” modifier is
unnecessary.

6 Proof of Theorem 4
We provide a proof of Theorem 4. Theorems 5 and 6 are special cases, so follow as
corollaries.

Proof of Theorem 4: Our approach is to calculate the Cramér-Rao bound for a carefully
crafted parametric model. This is based on an insight of Newey (1990, Appendix B) for
the simpler context of a population expectation.
Without loss of generality, assume that the true coefficient equals β0 = 0 and that
σ² = 1. These are merely normalizations which simplify the notation.

Define the truncation function ψc : Rⁿ → Rⁿ

ψc(y) = y 1{‖y‖ ≤ c} − E[Y 1{‖Y‖ ≤ c}].                                 (7)

Notice that it satisfies E[ψc(Y)] = 0,

‖ψc(y)‖ ≤ 2c,                                                           (8)

and

E[Y ψc(Y)′] = E[YY′ 1{‖Y‖ ≤ c}] =: Σc.

As c → ∞, Σc → E[YY′] = Σ. Pick c sufficiently large so that Σc > 0, which is feasible
because Σ > 0.
Define the auxiliary joint distribution function Fβ(y) by the Radon-Nikodym deriva-
tive

dFβ(y)/dF(y) = 1 + ψc(y)′Σc⁻¹Xβ

for parameters β in the set

Bc = {β ∈ Rᵐ : ‖Σc⁻¹Xβ‖ ≤ 1/(4c)}.                                      (9)

The Schwarz inequality and the bounds (8) and (9) imply that for β ∈ Bc and all y

|ψc(y)′Σc⁻¹Xβ| ≤ ‖ψc(y)‖ ‖Σc⁻¹Xβ‖ ≤ 1/2.

This implies that Fβ has the same support as F and satisfies the bounds

1/2 ≤ dFβ(y)/dF(y) ≤ 3/2.                                               (10)

We calculate that

∫ dFβ(y) = ∫ dF(y) + ∫ ψc(y)′Σc⁻¹Xβ dF(y)
         = 1 + E[ψc(Y)]′Σc⁻¹Xβ
         = 1                                                            (11)

the last equality because E[ψc(Y)] = 0. Together, these facts imply that Fβ is a valid
distribution function, and over β ∈ Bc is a parametric family for Y. Evaluated at β0 = 0,
which is in the interior of Bc, we see F0 = F. This means that Fβ is a correctly-specified
parametric family with the true parameter value β0.
Let Eβ denote expectation under the distribution Fβ. The expectation of Y in this
model is

Eβ[Y] = ∫ y dFβ(y)
      = ∫ y dF(y) + ∫ y ψc(y)′Σc⁻¹Xβ dF(y)
      = E[Y] + E[Y ψc(Y)′]Σc⁻¹Xβ
      = Xβ                                                              (12)

because E[Y] = 0 and E[Y ψc(Y)′] = Σc. Thus, distribution Fβ is a linear regression with
regression coefficient β.
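A finite, exact check of this construction (an illustration of my own; the support points, probabilities, truncation level c, and β below are arbitrary choices satisfying the stated conditions): put a discrete base distribution F on a few points of R² with E[Y] = 0, build ψc, Σc, and the tilted probabilities dFβ/dF = 1 + ψc(y)′Σc⁻¹Xβ, and verify (11) and (12).

```python
import numpy as np

# Discrete base distribution F on R^2 with E[Y] = 0 (illustrative, not from the paper)
pts = np.array([[-2.0, 0.5], [1.0, -1.0], [1.0, 1.0], [0.0, -0.5]])
prob = np.full(4, 0.25)
assert np.allclose(prob @ pts, 0.0)                  # normalization beta_0 = 0

X = np.array([[1.0], [1.0]])                         # n = 2, m = 1
c = 1.5                                              # truncation level
keep = np.linalg.norm(pts, axis=1) <= c              # indicator 1{||y|| <= c}
trunc_mean = prob @ (pts * keep[:, None])            # E[Y 1{||Y|| <= c}]
psi = pts * keep[:, None] - trunc_mean               # psi_c(y), equation (7)
Sigma_c = (pts * (prob * keep)[:, None]).T @ pts     # E[Y Y' 1{||Y|| <= c}] > 0

beta = np.array([0.05])                              # small enough to lie in B_c
w = 1.0 + psi @ np.linalg.solve(Sigma_c, X @ beta)   # Radon-Nikodym derivative dF_beta/dF
prob_beta = prob * w                                 # tilted probabilities

print(prob_beta.sum())                               # = 1, as in (11)
print(prob_beta @ pts, "vs", X @ beta)               # E_beta[Y] = X beta, as in (12)
```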
The bound (10) implies

Eβ[‖Y‖²] = ∫ ‖y‖² dFβ(y) ≤ (3/2) ∫ ‖y‖² dF(y) = (3/2) E[‖Y‖²] = (3/2) tr(Σ) < ∞.

This means that Fβ ∈ F2 for all β ∈ Bc.


The likelihood score for Fβ is

S = (∂/∂β) log (dFβ(Y)/dF(Y)) |β=0
  = (∂/∂β) log(1 + ψc(Y)′Σc⁻¹Xβ) |β=0
  = X′Σc⁻¹ψc(Y).

The information matrix is

Ic = E[SS′]
   = X′Σc⁻¹E[ψc(Y)ψc(Y)′]Σc⁻¹X
   ≤ X′Σc⁻¹X,                                                           (13)

where the inequality is

E[ψc(Y)ψc(Y)′] = Σc − E[Y 1{‖Y‖ ≤ c}] E[Y 1{‖Y‖ ≤ c}]′ ≤ Σc.

By assumption, the estimator β̂ is unbiased for β for all F ∈ F2, which implies that it
is unbiased for all Fβ with β ∈ Bc. The model Fβ is regular (it is correctly specified as it
contains the true distribution F, the support of Y does not depend on β, and the true
value β0 = 0 lies in the interior of Bc). Thus by the Cramér-Rao Theorem (see, for example,
Theorem 10.6 of Hansen (2022))

var[β̂] ≥ Ic⁻¹ ≥ (X′Σc⁻¹X)⁻¹

where the second inequality is (13). Because this holds for all c, and Σc → Σ as c → ∞,

var[β̂] ≥ lim sup_{c→∞} (X′Σc⁻¹X)⁻¹ = (X′Σ⁻¹X)⁻¹.

This is the variance lower bound. ■

References
[1] Aitken, Alexander C. (1935): “On least squares and linear combinations of observa-
tions,” Proceedings of the Royal Society of Edinburgh, 55, 42-48.

[2] Begun, Janet M., W. J. Hall, Wei-Min Huang, and Jon A. Wellner (1983): “Informa-
tion and asymptotic efficiency in parametric-nonparametric models,” The Annals
of Statistics, 11, 432-452.

[3] Berk, Robert and Jiunn T. Hwang (1989): “Optimality of the least squares estimator,”
Journal of Multivariate Analysis, 3, 245-254.

[4] Bickel, Peter J. (1982): “On adaptive estimation,” Annals of Statistics, 647-671.

[5] Bickel, Peter J., Chris A. J. Klaassen, Ya’acov Ritov, and Jon A. Wellner (1993): Effi-
cient and Adaptive Estimation for Semiparametric Models, Johns Hopkins Univer-
sity Press.

[6] Chamberlain, Gary (1987): “Asymptotic efficiency in estimation with conditional
moment restrictions,” Journal of Econometrics, 34, 305-334.

[7] Doss, Hani and Jayaram Sethuraman (1989): “The price of bias reduction when
there is no unbiased estimate,” The Annals of Statistics, 17, 440-442.

[8] Gauss, Carl Friedrich (1809): Theoria motus corporum coelestium. Hamburg:
Perthes et Besser.

[9] Gauss, Carl Friedrich (1823): Theoria Combinationis Observationum Erroribus
Minimis Obnoxiae. Göttingen: Dieterich.

[10] Gnot, S., H. Knautz, G. Trenkler, and R. Zmyslony (1992): “Nonlinear unbiased es-
timation in linear models,” Statistics, 23, 5-16.

[11] Hajek, Jaroslav (1972): “Local asymptotic minimax and admissibility in estima-
tion,” Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and
Probability, 1, 175-194.

[12] Hansen, Bruce E. (2022): Probability and Statistics for Economists, Princeton Uni-
versity Press, forthcoming.

[13] Harville, David A. (1981): “Unbiased and minimum-variance unbiased estimation
of estimable functions for fixed linear models with arbitrary covariance structure,”
The Annals of Statistics, 9, 633-637.

[14] Kariya, Takeaki (1985): “A nonlinear version of the Gauss-Markov theorem,” Journal
of the American Statistical Association, 80, 476-477.

[15] Kariya, Takeaki and Hiroshi Kurata (2002): “A maximal extension of the Gauss-
Markov theorem and its nonlinear version,” Journal of Multivariate Analysis, 83,
37-55.

[16] Koenker, Roger, and Gilbert Bassett (1978): “Regression quantiles,” Econometrica,
46, 33-50.

[17] Koopmann, Reinhardt (1982): Parameterschätzung bei a-priori-Information, vol. 12,
Vandenhoeck & Ruprecht.

[18] Laplace, Pierre Simon (1811): “Mémoire sur les integrales définies et leur applica-
tion aux probabilités, et specialement à la recherche du milieu qu’il faut choisir
entre les resultats des observations,” Mémoires de l’Académie des sciences de Paris,
279-347.

[19] Legendre, Adrien Marie (1805): Nouvelles méthodes pour la détermination des or-
bites des comètes. Paris: Courcier.

[20] Lehmann, Erich L. and George Casella (1998): Theory of Point Estimation, Second
Edition, Springer.

[21] Levit, B. Y. (1975): “On the efficiency of a class of nonparametric estimates,” Theory
of Probability and its Applications, 20, 723-740.

[22] Markov, Andreı̆ Andreevich (1912): Wahrscheinlichkeitsrechnung. Leipzig.

[23] McDonald, James B. and Whitney K. Newey (1988): “Partially adaptive estimation
of regression models via the generalized t distribution,” Econometric Theory, 4, 428-
457.

[24] Newey, Whitney K. (1990): “Semiparametric efficiency bounds,” Journal of Applied
Econometrics, 5, 99-135.

[25] Ritov, Ya’acov and Peter J. Bickel (1990): “Achieving information bounds in non and
semiparametric models,” The Annals of Statistics, 18, 925-938.

[26] Stein, Charles (1956): “Efficient nonparametric testing and estimation,” Berkeley
Symposium on Mathematical Statistics and Probability, 187-195.

[27] Stigler, Stephen M. (1986): The History of Statistics: The Measurement of Uncertainty
before 1900. Harvard University Press.

[28] van der Vaart, A.W. (1998): Asymptotic Statistics, Cambridge University Press.

[29] Zyskind, George, and Frank B. Martin (1969): “On best linear estimation and a gen-
eral Gauss-Markov theorem in linear models with arbitrary nonnegative covariance
structure,” SIAM Journal on Applied Mathematics, 17, 1190-1202.
