Wolfgang Härdle
Marlene Müller
Stefan Sperlich
Axel Werwatz

Nonparametric and Semiparametric Models
An Introduction

February 6, 2004

Springer
Berlin Heidelberg New York
Hong Kong London
Milan Paris Tokyo
Preface

Wolfgang Härdle
Marlene Müller
Stefan Sperlich
Axel Werwatz
Contents

Preface
Notation

1 Introduction
  1.1 Density Estimation
  1.2 Regression
    1.2.1 Parametric Regression
    1.2.2 Nonparametric Regression
    1.2.3 Semiparametric Regression
  Summary

2 Histogram
  2.1 Motivation and Derivation
    2.1.1 Construction
    2.1.2 Derivation
    2.1.3 Varying the Binwidth
  2.2 Statistical Properties
    2.2.1 Bias
    2.2.2 Variance
    2.2.3 Mean Squared Error
    2.2.4 Mean Integrated Squared Error

3 Nonparametric Density Estimation

4 Nonparametric Regression
  4.1 Univariate Kernel Regression
    4.1.1 Introduction
    4.1.2 Kernel Regression
    4.1.3 Local Polynomial Regression and Derivative Estimation
  4.2 Other Smoothers
    4.2.1 Nearest-Neighbor Estimator
    4.2.2 Median Smoothing
    4.2.3 Spline Smoothing
    4.2.4 Orthogonal Series
  4.3 Smoothing Parameter Selection
    4.3.1 A Closer Look at the Averaged Squared Error
    4.3.2 Cross-Validation
    4.3.3 Penalizing Functions
  4.4 Confidence Regions and Tests
    4.4.1 Pointwise Confidence Intervals
    4.4.2 Confidence Bands
    4.4.3 Hypothesis Testing
  4.5 Multivariate Kernel Regression
    4.5.1 Statistical Properties
    4.5.2 Practical Aspects
  Bibliographic Notes
  Exercises
  Summary

5 Semiparametric and Generalized Regression Models

6 Single Index Models

7 Generalized Partial Linear Models

8 Additive Models and Marginal Effects

9 Generalized Additive Models

References
Author Index
Subject Index
List of Figures

1.1 Log-normal versus kernel density estimates of net-income (SPMfesdensities)
1.2 SPMcps85lin
1.3 SPMcps85lin
1.4 SPMcps85reg
1.5 Engel curve (SPMengelcurve2)
1.6 SPMcps85add
1.7 SPMcps85add
1.8 Logit fit (SPMlogit)
1.9 SPMtruelogit
1.10 SPMsimulogit
1.11 SPMsim
2.1 SPMhistogram
2.6 SPMashstock
2.7 SPMhiststock
3.1 SPMkernel
3.2 SPMdensity
3.3 SPMdenquauni
3.4 SPMdenepatri
3.5 SPMkdeconstruct
3.6 Bias effects (SPMkdebias)
3.7 SPMkdemse
3.8 SPMdensity2D
4.1 SPMengelcurve1
4.3 SPMlocpolyreg
4.5 Nearest-neighbor regression (SPMknnreg)
4.6 Median smoothing regression (SPMmesmooreg)
4.7 Spline regression (SPMspline)
4.8 Wavelet regression (SPMwavereg)
SPMsimulmase
SPMpenalize
SPMtruenadloc
8.7 SPMmigmv
List of Tables

3.1 Kernel functions
3.2 δ0 for different kernels
3.3 Efficiency of kernels
Notation

Abbreviations

cdf    cumulative distribution function
df     degrees of freedom
iff    if and only if
i.i.d. independent and identically distributed
w.r.t. with respect to
ADE    average derivative estimator
AM     additive model
AMISE  asymptotic MISE
AMSE   asymptotic MSE
APLM   additive partial linear model
ASE    averaged squared error
ASH    average shifted histogram
CHARN  conditional heteroscedastic autoregressive nonlinear (model)
CV     cross-validation
DM     Deutsche Mark
GAM    generalized additive model
GAPLM  generalized additive partial linear model
GLM    generalized linear model
GPLM   generalized partial linear model
ISE    integrated squared error
IRLS   iteratively reweighted least squares
LR     likelihood ratio
LS     least squares
MASE   mean averaged squared error
MISE   mean integrated squared error
ML     maximum likelihood
MLE    maximum likelihood estimator
MSE    mean squared error
PLM    partial linear model
PMLE   pseudo maximum likelihood estimator
RSS    residual sum of squares
S.D.   standard deviation
S.E.   standard error
SIM    single index model
SLS    semiparametric least squares
USD    US Dollar
WADE   weighted average derivative estimator
WSLS   weighted semiparametric least squares

Scalars, vectors and matrices

X, Y             random variables
x, y             scalars (realizations of X, Y)
X1, ..., Xn      random sample of size n
X(1), ..., X(n)  ordered sample (order statistic)
x1, ..., xn      realizations of X1, ..., Xn
X                vector of variables
x                vector (realizations of X)
Xj               j-th component of the vector X
h                binwidth or bandwidth
H                bandwidth matrix
I                identity matrix
Y                vector of observations Y1, ..., Yn
β                parameter vector
ej               j-th unit vector
1n               vector of ones of length n
S, SP            smoother matrices

Matrix algebra

tr(A)    trace of matrix A
diag(A)  diagonal of matrix A
det(A)   determinant of matrix A
rank(A)  rank of matrix A
A⁻¹      inverse of matrix A
‖u‖      Euclidean norm, ‖u‖ = (u⊤u)^{1/2}

Functions

log       logarithm (base e)
φ, Φ      pdf and cdf of the standard normal distribution
Kh        scaled kernel, Kh(u) = K(u/h)/h
KH        multivariate scaled kernel
‖K‖₂²     squared L2 norm of K, ∫ K²(u) du
fX        pdf of X
f(x, y)   joint pdf of X and Y
Hf        Hessian matrix of f
K⋆K       convolution of K with itself
w, w̃      weight functions
m(•)      (unknown) regression function
ℓ, ℓi     log-likelihood, individual log-likelihood contribution
μ₂(K)     second moment of K, ∫ u²K(u) du
μp(K)     p-th moment of K

Moments

EX             mean value of X
σ² = Var(X)    variance of X
E(Y|X)         conditional expectation of Y given X
E(Y|X = x)     conditional expectation of Y given X = x
E(Y|x)         same as E(Y|X = x)
σ²(x)          conditional variance of Y given X = x
EX1 g(X1, X2)  expectation of g(X1, X2) w.r.t. X1 only
med(Y|X)       conditional median of Y given X
V(μ)           variance function (of a GLM)
MSEx           mean squared error at the point x

Distributions

U[0, 1]   uniform distribution on [0, 1]
U[a, b]   uniform distribution on [a, b]
N(0, 1)   standard normal distribution
N(μ, σ²)  normal distribution with mean μ and variance σ²
N(μ, Σ)   multivariate normal distribution with mean μ and covariance Σ
χ²m       χ² distribution with m degrees of freedom
tm        t-distribution with m degrees of freedom

Estimates

β̂      estimated coefficient
f̂h     density estimate with bandwidth h
f̂h,i   leave-one-out density estimate
m̂h     kernel regression estimate with bandwidth h
m̂p,h   local polynomial regression estimate of degree p
m̂p,H   multivariate regression estimate with bandwidth matrix H

Convergence

o(•)   order smaller than
O(•)   order at most
op(•)  order smaller than, in probability
Op(•)  order at most, in probability
a.s.   almost sure(ly)
→L     convergence in distribution
≈      asymptotically equal
∼      asymptotically proportional

Other

ℕ    natural numbers
ℤ    integers
ℝ    real numbers
ℝd   d-dimensional real space
∝    proportional
≡    constantly equal
Bj   bin j of a histogram
mj   center of bin Bj
1 Introduction
Figure 1.1. Log-normal density estimates (upper graph) versus kernel density estimates (lower graph) of net-income, U.K. Family Expenditure Survey 1969-83
SPMfesdensities
[...] density to the left (Figure 1.1, lower graph). This indicates that the net-income distribution has in fact changed during this 15 year period.
1.2 Regression
Let us now consider a typical linear regression problem. We assume that you have already been exposed to the linear regression model, where the mean of a dependent variable Y is related to a set of explanatory variables X1, X2, ..., Xd in the following way:

E(Y|X) = X1 β1 + ... + Xd βd = X⊤β.  (1.1)

Defining the deviation of Y from this conditional mean as

ε = Y − E(Y|X),  (1.2)

we can write

Y = X⊤β + ε.  (1.3)
Example 1.2.
To take a specific example, let Y be log wages and consider the explanatory variables schooling (measured in years), labor market experience (measured as AGE − SCHOOL − 6) and experience squared. If we assume that, on average, log wages are linearly related to these explanatory variables, then the linear regression model applies:

E(Y|SCHOOL, EXP) = β0 + β1 SCHOOL + β2 EXP + β3 EXP².  (1.4)

Note that we have included an intercept (β0) in the model.
The model of equation (1.4) has played an important role in empirical labor economics and is often called the human capital earnings equation (or Mincer earnings equation, to honor Jacob Mincer, a pioneer of this line of research). From the perspective of this course, an important characteristic of equation (1.4) is its parametric form: the shape of the regression function is governed by the unknown parameters βj, j = 1, 2, ..., d. That is, all we have to do in order to determine the linear regression function (1.4) is to estimate the unknown parameters βj. On the other hand, the parametric regression function of equation (1.4) a priori rules out many conceivable nonlinear relationships between Y and X.
Suppose that you were assigned the following task: estimate the regression of log wages on schooling and experience as accurately as possible in one trial. That is, you are not allowed to change your model if you find that the initial specification does not fit the data well. Of course, you could just go ahead and assume, as we have done above, that the regression you are supposed to estimate has the form specified in (1.4). That is, you assume that

m(SCHOOL, EXP) = β1 + β2 SCHOOL + β3 EXP + β4 EXP²,  (1.5)
and estimate the unknown parameters by the method of ordinary least
squares, for example. But maybe you would not fit this parametric model
if we told you that there are ways of estimating the regression function without having to make any prior assumptions about its functional form (except
that it is a smooth function). Remember that you have just one trial and if
the form of m(SCHOOL, EXP) is very different from (1.4) then estimating the
parametric model may give you very inaccurate results.
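As a minimal illustration (ours, not the book's XploRe code), the following Python sketch fits the parametric specification (1.5) by ordinary least squares; the sample size, covariate distributions and "true" coefficient values are invented for this example.

import numpy as np

# Simulate a toy wage data set (all values are assumptions for illustration)
rng = np.random.default_rng(0)
n = 500
school = rng.integers(6, 18, size=n).astype(float)   # years of schooling
exp_ = rng.integers(0, 45, size=n).astype(float)     # labor market experience

# arbitrary "true" coefficients for the simulation
beta = np.array([0.6, 0.08, 0.04, -0.0006])
X = np.column_stack([np.ones(n), school, exp_, exp_**2])
y = X @ beta + rng.normal(0.0, 0.4, size=n)          # log wages

# OLS: beta_hat minimizes ||y - X b||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 4))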
It turns out that there are indeed ways of estimating m(•) that merely assume that m(•) is a smooth function. These methods are called nonparametric regression estimators, and part of this course will be devoted to studying nonparametric regression.
Nonparametric regression estimators are very flexible but their statistical precision decreases greatly if we include several explanatory variables
in the model. The latter caveat has been appropriately termed the curse of
dimensionality. Consequently, researchers have tried to develop models and
estimators which offer more flexibility than standard parametric regression
but overcome the curse of dimensionality by employing some form of dimension reduction. Such methods usually combine features of parametric and
nonparametric techniques. As a consequence, they are usually referred to as
semiparametric methods. Further advantages of semiparametric methods are the possible inclusion of categorical variables (which can often only be included in a parametric way), an easy (economic) interpretation of the results, and the possibility of specifying only parts of the model parametrically.
In the following three sections we use the earnings equation and other examples to illustrate the distinctions between parametric, nonparametric and
semiparametric regression and we certainly hope that this will whet your
appetite for the material covered in this course.
Figure 1.2. Wage-schooling profile (left) and wage-experience profile (right)
SPMcps85lin

Figure 1.3. Parametrically estimated regression surface with the wage-experience profile at 12 years of schooling highlighted
SPMcps85lin
All of the element curves of the surface appear similar to Figure 1.2 (right)
in the direction of experience and like Figure 1.2 (left) in the direction of
schooling. To gain a better understanding of the three-dimensional picture
we have plotted a single wage-experience profile in three dimensions, fixing
schooling at 12 years. Hence, Figure 1.3 highlights the wage-earnings profile
for high school graduates.
1.2.2 Nonparametric Regression
Suppose that we want to estimate
E(Y|SCHOOL, EXP) = m(SCHOOL, EXP)  (1.6)

and we are only willing to assume that m(•) is a smooth function. Nonparametric regression estimators produce an estimate of m(•) at an arbitrary point (SCHOOL = s, EXP = e) by locally weighted averaging over log wages (here s and e denote two arbitrary values that SCHOOL and EXP may take on, such as 12 and 15). Local weighting means that those values of log wages are weighted more heavily for which the corresponding observations of EXP and SCHOOL are close to the point (s, e). Let us illustrate this principle with an example. Let s = 8 and e = 7 and suppose you can use the four observations given in Table 1.2 to estimate m(8, 7):
Table 1.2. Example observations
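Because the values of Table 1.2 are not reproduced here, the following Python sketch illustrates the locally weighted averaging idea with four hypothetical observations; the kernel and the bandwidths are ad hoc choices for illustration, not taken from the book.

import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel, one of the standard kernel functions
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

# hypothetical observations (SCHOOL, EXP, log wage)
school = np.array([8.0, 9.0, 12.0, 8.0])
exp_ = np.array([6.0, 8.0, 25.0, 30.0])
logw = np.array([1.8, 2.0, 2.3, 2.1])

s, e = 8.0, 7.0          # point at which m(s, e) is to be estimated
hs, he = 2.0, 5.0        # bandwidths for schooling and experience

# product kernel weights: large for observations close to (s, e)
w = epanechnikov((school - s) / hs) * epanechnikov((exp_ - e) / he)
m_hat = np.sum(w * logw) / np.sum(w)   # locally weighted average
print(round(m_hat, 3))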
Figure 1.4. Nonparametrically estimated regression surface
SPMcps85reg
Figure 1.5. Engel curve: food expenditure as a function of net-income
SPMengelcurve2
1.2.3 Semiparametric Regression

Consider a model that combines features of both previous approaches:

E(Y|SCHOOL, EXP) = c + g1(SCHOOL) + g2(EXP).  (1.7)

Here g1(•) and g2(•) are two unknown, smooth functions and c is an unknown parameter. Note that this model combines the simple additive structure of the parametric regression model (referred to hereafter as the additive
model) with the flexibility of the nonparametric approach. This is done by not
imposing any strong shape restrictions on the functions that determine how
schooling and experience influence the mean regression of log wages. The
procedure employed to estimate this model will be explained in greater detail later in this course. It should be clear, however, that in order to estimate
the unknown functions g1(•) and g2(•), nonparametric regression estimators have to be employed. That is, when estimating semiparametric models we
usually have to use nonparametric techniques. Hence, we will have to spend
a substantial amount of time studying nonparametric estimation if we want
to understand how to estimate semiparametric models. For now, we want to
focus on the results and compare them with the parametric fit.
Figure 1.6. Additive model fit versus parametric fit, wage-schooling (left) and wage-experience (right)
SPMcps85add
In Figure 1.6 the parametrically estimated wage-schooling and wage-experience profiles are shown as thin lines, whereas the estimates of g1(•) and g2(•) are displayed as thick lines with bullets. The parametrically estimated wage-schooling and wage-experience profiles show a good deal of similarity with the estimates of g1(•) and g2(•), except for the shape of the curves at extreme values. The good agreement between the parametric estimates and the additive model fit is also visible in the plot of the estimated regression surface, which is shown in Figure 1.7.

Hence, we may conclude that in this specific example the parametric model is supported by the more flexible nonparametric and semiparametric methods. This potential usefulness of nonparametric and semiparametric techniques for checking the adequacy of parametric models will be illustrated in several other instances in the latter part of this course.
Figure 1.7. Surface plot of the additive model fit
SPMcps85add
Take a closer look at (1.6) and (1.7). Observe that in (1.6) we have to estimate one unknown function of two variables, whereas in (1.7) we have to estimate two unknown functions, each a function of one variable. It is in this sense that we have reduced the dimensionality of the estimation problem. Whereas all researchers might agree that additive models like the one in (1.7) achieve a dimension reduction over completely nonparametric regression, they may not agree to call (1.7) a semiparametric model, as there are no parameters to estimate (except for the intercept parameter c). In the following example we confront a standard parametric model with a more flexible model that, as you will see, truly deserves to be called semiparametric.
Example 1.4.
In the earnings-function example, the dependent variable log wages can in principle take on any positive value, i.e. the set of values of Y is infinite. This may not always be the case. For example, consider the decision of an East German resident to move to Western Germany and denote the decision variable by Y. In this case, the dependent variable can take on only two values:

Y = 1 if the person can imagine moving to the west, Y = 0 otherwise.

We will refer to this as a binary response later on.
A standard parametric approach to binary responses starts from a latent-variable model

Y* = X⊤β − ε,  (1.8)

where we observe Y = 1 if Y* > 0 and Y = 0 otherwise. If the error term ε has cdf G, this leads to

E(Y|X) = P(Y = 1|X) = G(X⊤β).  (1.9)

The logit model assumes, in particular, that ε has a standard logistic distribution, so that

P(Y = 1|X) = 1 / {1 + exp(−X⊤β)}.  (1.10)
Example 1.5.
In using a logit model, Burda (1993) estimated the effect of various explanatory variables on the migration decision of East German residents. The data for fitting this model were drawn from a panel study of approximately 4,000 East German households in spring 1991. We use a subsample of n = 402 observations from the German state Mecklenburg-Vorpommern here. Due to space constraints, we merely report the estimated coefficients of three components of the index X⊤β, as we will refer to these estimates below:

β̂0 + β̂1 INC + β̂2 AGE = 2.2905 + 0.0004971 INC − 0.45499 AGE.  (1.11)
INC and AGE abbreviate the household income and the age of the individual.

Figure 1.8 gives a graphical presentation of the results. Each observation is represented by a "+". As mentioned above, the characteristics of each person are transformed into an index (to be read off the horizontal axis), while the dependent variable takes on one of two values, Y = 0 or Y = 1 (to be read off the vertical axis). The curve plots estimates of P(Y = 1|X), the probability of Y = 1, as a function of X⊤β̂. Note that the estimates of P(Y = 1|X), by assumption, are simply points on the cdf of a standard logistic distribution.
Figure 1.8. Logit fit
SPMlogit
We shall continue with Example 1.4 below, but let us pause for a moment to consider the following substantial problem: the logit model, like other parametric models, is based on rather strong functional form (linear index) and distributional assumptions, neither of which are usually justified by economic theory.

The first question to ask before developing alternatives to standard models like the logit model is: what are the consequences of estimating a logit model if one or several of these assumptions are violated? Note that this is a crucial question: if our parametric estimates are largely unaffected by model violations, then there is no need to develop and apply semiparametric models and estimators. Why would anyone put time and effort into a project that promises little return?

One can employ the tools of asymptotic statistical theory to show that violating the assumptions of the logit model leads to inconsistent parameter estimates. That is, if the sample size goes to infinity, the logit maximum-likelihood estimator (logit-MLE) does not converge to the true parameter value in probability. While it doesn't converge to the true parameter value, it does, however, converge to some other value. If this false value is close enough to the true parameter value then we may not care very much about this inconsistency.

Consistency is an asymptotic criterion for the performance of an estimator. That is, it looks at the properties of the estimator if the sample size grows without limit. Yet, in practice, we are dealing with finite samples. Unfortunately, the finite-sample properties of the logit maximum-likelihood estimator cannot be derived analytically. Hence, we have to rely on simulations to collect evidence of its small-sample performance in the presence of misspecification. We conducted a small simulation in the context of Example 1.4, to which we now return.
Figure 1.9. Link function of the homoscedastic logit model (thin line) versus the link function of the heteroscedastic model (solid line)
SPMtruelogit
Example 1.6.
Following Horowitz (1993), we generated data according to a heteroscedastic model with two explanatory variables, INC and AGE. Here we considered heteroscedasticity of the form

Var(ε|X = x) = (1/4) {1 + (x⊤β)²}² Var(η),

where η has a (standard) logistic distribution. To give you an impression of how dramatically the true heteroscedastic model differs from the supposed homoscedastic logit model, we plotted the link functions of the two models, as shown in Figure 1.9.
To add a sense of realism to the simulation, we set the coefficients of these variables equal to the estimates reported in (1.11). Note that the standard logit model introduced above does not allow for heteroscedasticity. Hence, if we apply the standard logit maximum-likelihood estimator to the simulated data, we are estimating under misspecification. We performed 250 replications of this estimation experiment, using the full data set with 402 observations each time. As the estimated coefficients are only identified up to scale, we compared the ratio of the true coefficients, βINC/βAGE, to the ratio of their estimated logit-MLE counterparts, β̂INC/β̂AGE. Figure 1.10 shows the sampling distribution of the logit-MLE coefficient ratios, along with the true value (vertical line).

As we have subtracted the true value from each estimated ratio and divided this difference by the true ratio's absolute value, the true ratio is standardized to zero and differences on the horizontal axis can be interpreted as percentage deviations from the truth. In Figure 1.10, the sampling distribution of the estimated ratios is centered around −0.11, which corresponds to a percentage deviation from the truth of −11%. Hence, the logit-MLE underestimates the true value.
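The following Python sketch re-creates the flavor of this experiment under simplifying assumptions: the "true" coefficients and covariate distributions below are arbitrary choices (neither the Horowitz design nor the migration data), but the error follows the heteroscedastic scheme of Example 1.6 while the fitted model is the homoscedastic logit.

import numpy as np

rng = np.random.default_rng(1)
beta = np.array([0.5, 1.0, -1.0])        # arbitrary "true" coefficients

def simulate(n):
    # two standard normal regressors plus a constant (assumption)
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    idx = X @ beta
    # heteroscedastic logistic error: scale grows with the squared index
    eps = 0.5 * (1 + idx**2) * rng.logistic(size=n)
    return X, (idx - eps > 0).astype(float)

def logit_mle(X, y, iters=30):
    # maximum likelihood for the (misspecified) homoscedastic logit model
    b = np.zeros(X.shape[1])
    for _ in range(iters):               # Newton-Raphson steps
        p = 1.0 / (1.0 + np.exp(-X @ b))
        W = p * (1.0 - p) + 1e-10
        b = b + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return b

# 250 replications with n = 402, as in the text; coefficient ratios are
# compared because the scale of beta is not identified
devs = []
for _ in range(250):
    X, y = simulate(402)
    b = logit_mle(X, y)
    devs.append((b[1] / b[2]) / (beta[1] / beta[2]) - 1.0)
print("mean relative deviation of the estimated ratio:", np.mean(devs))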
Now that we have seen how serious the consequences of model misspecification can be, we might want to learn about semiparametric estimators that have desirable properties under more general assumptions than their parametric counterparts. One way to generalize the logit model is the so-called single index model (SIM), which keeps the linear form of the index X⊤β but allows the function G(•) in (1.9) to be an arbitrary smooth function g(•) (not necessarily a distribution function) that has to be estimated from the data:

E(Y|X) = g(X⊤β).  (1.12)
Figure 1.10. Sampling distribution of the ratio of the estimated coefficients (density estimate and mean value indicated as *) and the ratio's true value (vertical line)
SPMsimulogit
Estimating a SIM involves two steps. First, the coefficient vector β has to be estimated. Secondly, we have to estimate the unknown link function g(•) by nonparametrically regressing the dependent variable Y on the fitted index X⊤β̂, where β̂ is the coefficient vector we estimated in the first step. To do this, we again use a nonparametric estimator, the kernel estimator we mentioned briefly above.
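A minimal Python sketch of this second step, assuming the fitted index values X⊤β̂ are already given (the data below are simulated; this is not the book's XploRe quantlet):

import numpy as np

def kernel_link(index, y, grid, h):
    """Nadaraya-Watson regression of y on the fitted index, evaluated on grid."""
    g_hat = np.empty(len(grid))
    for j, v in enumerate(grid):
        u = (index - v) / h
        w = np.exp(-0.5 * u**2)                # Gaussian kernel weights
        g_hat[j] = np.sum(w * y) / np.sum(w)   # locally weighted average
    return g_hat

# toy data: a made-up index and binary responses from a logistic link
rng = np.random.default_rng(2)
index = rng.normal(size=300)
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-index))).astype(float)

grid = np.linspace(-3, 3, 61)
g_hat = kernel_link(index, y, grid, h=0.7)     # h = 0.7 index units, as in the text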
Example 1.7.
Let us consider what happens if we use β̂ from the logit fit and estimate the link function nonparametrically. Figure 1.11 shows this estimated link function. As before, the position of a "+" sign represents at the same time the values of X⊤β̂ and Y of a particular observation, while the curve depicts the estimated link function.
One additional remark should be made here: as you will soon learn, the shape of the estimated link function (the curve) varies with the so-called bandwidth, a parameter central to nonparametric function estimation. Thus, there is no unique estimate of the link function, and it is a crucial (and difficult) problem of nonparametric regression to find the best bandwidth and thus the optimal estimate. Fortunately, there are methods to select an appropriate bandwidth. Here, we have chosen h = 0.7 index units for the bandwidth. For comparison, the shapes of both the single index (solid line) and the logit (dashed line) link functions are shown in Figure 1.11. Even though not identical, they look rather similar.

Figure 1.11. Estimated link function of the single index model versus the logit link
SPMsim
Summary
Parametric models are fully determined up to a parameter (vector). The fitted models can easily be interpreted and estimated
accurately if the underlying assumptions are correct. If, however,
they are violated then parametric estimates may be inconsistent
and give a misleading picture of the regression relationship.
5 Semiparametric and Generalized Regression Models
E(Y|X) = G{X⊤β}  (5.1)
Let us take a closer look at model (5.1). This model is known as the generalized linear model. Its use and estimation are extensively treated in McCullagh & Nelder (1989). Here we give only some selected motivating examples.
What is the reason for introducing this function G, called the link? (Note that other authors call its inverse G⁻¹ the link.) Clearly, if G is the identity we are back in the classical linear model. As a first alternative, let us consider a quite common approach for investigating growth models. Here, the model is often assumed to be multiplicative instead of additive, i.e.

Y = ∏_{j=1}^{d} Xj^{βj} · ε,  E log(ε) = 0,  (5.2)

in contrast to

Y = ∏_{j=1}^{d} Xj^{βj} + ε,  E(ε) = 0.  (5.3)

Model (5.2) can be treated as a classical linear model after taking logarithms:

E{log(Y)|X} = ∑_{j=1}^{d} βj log(Xj).  (5.4)

Alternatively, (5.2) can be written in the form of (5.1) as

E(Y|X) = exp{ ∑_{j=1}^{d} βj log(Xj) },  (5.5)

i.e. with an exponential function G.
Table 5.1. Descriptive statistics for the migration data

                                 Yes     No    (in %)
MIGRATION INTENTION             38.5   61.5
FAMILY/FRIENDS IN WEST          85.6   11.2
UNEMPLOYED/JOB LOSS CERTAIN     19.7   78.9
CITY SIZE 10,000-100,000        29.3   64.2
FEMALE                          51.1   49.8

                        Min    Max     Mean     S.D.
AGE (years)              18     65    39.84    12.61
HOUSEHOLD INCOME (DM)   200   4000  2194.30   752.45
Assume that the net-utility of migrating is given by a latent variable Y* depending on the observable characteristics X and an unobservable error ε:

Y* = v(X) − ε,  and  Y = 1 if Y* > 0, Y = 0 otherwise.  (5.6)

Hence, what we really observe is the binary variable Y that takes on the value 1 if net-utility is positive (the person intends to migrate) and 0 otherwise (the person intends to stay). Then some calculations lead to

P(Y = 1 | X = x) = E(Y | X = x) = G_{ε|x}{v(x)},  (5.7)

where G_{ε|x} denotes the conditional cdf of ε given X = x. The parametric approach specifies the index function as linear:

v(x) = β0 + x⊤β.  (5.8)

The most popular distribution assumptions regarding the error are the normal and the logistic ones, leading to the so-called probit and logit models with G(•) = Φ(•) (Gaussian cdf) and G(•) = exp(•)/{1 + exp(•)}, respectively. We will learn how to estimate the coefficients β0 and β in Section 5.2.
The binary choice model can easily be extended to the multicategorical case, which is usually called a discrete choice model. We will not discuss extensions for multicategorical responses here; some references for these models are mentioned in the bibliographic notes.
where X_r denotes a subvector of X. In practice, the ISE is replaced by its sample analog, the multivariate analog of the cross-validation function (3.38). After the variables have been selected, the conditional expectation of Y on X_r is calculated by some standard nonparametric multivariate regression technique, such as the kernel regression estimator.
5.1.2 Nonparametric Link Function
Index models play an important role in econometrics. An index is a summary
of different variables into one number, e.g. the price index, the growth index,
or the cost-of-living index. It is clear that by summarizing all the information
contained in the explanatory variables into a one-dimensional index, we achieve a considerable dimension reduction. Another popular way to reduce dimensionality is the additive model (AM)

E(Y|X) = c + ∑_{j=1}^{d} g_j(Xj).  (5.10)
Its generalized form, the generalized additive model (GAM), is

E(Y|X) = G{ c + ∑_{j=1}^{d} g_j(Xj) }.  (5.11)

If additionally some explanatory variables U enter the model linearly, we arrive at the generalized additive partial linear model (GAPLM)

E(Y|U, T) = G{ U⊤β + ∑_{j=1}^{d} g_j(Tj) }.  (5.12)

Here, the g_j(•) are univariate nonparametric functions of the variables Tj. In the case of an identity function G we speak of an additive partial linear model (APLM).
[...]

η = X⊤β,  μ = G(η).

This function G is called the link function. (We remark that Nelder & Wedderburn (1972) and McCullagh & Nelder (1989) actually denote G⁻¹ as the link function.)
5.2.1 Exponential Families

In the GLM framework we assume that the distribution of Y is a member of the exponential family. The exponential family covers a broad range of distributions, for example discrete ones such as the Bernoulli or Poisson distribution, and continuous ones such as the Gaussian (normal) or Gamma distribution.

A distribution is said to be a member of the exponential family if its probability function (if Y is discrete) or its density function (if Y is continuous) has the structure
f(y, θ, ψ) = exp{ (yθ − b(θ))/a(ψ) + c(y, ψ) }  (5.13)

with some specific functions a(•), b(•) and c(•). These functions differ for the distinct Y distributions. Generally speaking, we are only interested in estimating the parameter θ. The additional parameter ψ is, like the variance σ² in the linear regression, a nuisance parameter. McCullagh & Nelder (1989) call θ the canonical parameter.
Example 5.2.
Suppose Y is normally distributed, i.e. Y ~ N(μ, σ²). Hence we can write its density as

φ(y) = (1/(√(2π) σ)) exp{ −(y − μ)²/(2σ²) }
     = exp{ yμ/σ² − μ²/(2σ²) − y²/(2σ²) − log(√(2π) σ) },

and we see that the normal distribution is a member of the exponential family with

θ = μ,  a(ψ) = σ²,  b(θ) = θ²/2,  c(y, ψ) = −y²/(2σ²) − log(√(2π) σ),

where we set ψ = σ.
Example 5.3.
Suppose Y is Bernoulli distributed with parameter μ, i.e.

P(Y = y) = μ^y (1 − μ)^{1−y} = μ if y = 1, and 1 − μ if y = 0.

This can be transformed into

P(Y = y) = exp{ y log( μ/(1 − μ) ) } (1 − μ),

using the logit transformation

θ = log{ μ/(1 − μ) },  i.e.  μ = e^θ / (1 + e^θ).

Hence the Bernoulli distribution belongs to the exponential family with b(θ) = log(1 + e^θ), a(ψ) = 1 and c(y, ψ) = 0.
The log-likelihood of a single observation is

ℓ(y, θ, ψ) = log f(y, θ, ψ).  (5.14)

Since f(•, θ, ψ) integrates to one, differentiating under the integral sign gives

0 = (∂/∂θ) ∫ f(y, θ, ψ) dy = ∫ (∂/∂θ) f(y, θ, ψ) dy = E{ ∂ℓ(Y, θ, ψ)/∂θ },

and differentiating once more,

E{ ∂²ℓ(Y, θ, ψ)/∂θ² } = −E[ { ∂ℓ(Y, θ, ψ)/∂θ }² ].

This and taking first and second derivatives of (5.13) now gives

0 = E{ (Y − b'(θ))/a(ψ) }  and  E{ b''(θ)/a(ψ) } = E[ { (Y − b'(θ))/a(ψ) }² ].

We can conclude

E(Y) = μ = b'(θ),  Var(Y) = V(μ) a(ψ) = b''(θ) a(ψ).

We observe that the expectation of Y only depends on θ, whereas the variance of Y depends on the parameter of interest θ and the nuisance parameter ψ. Typically one assumes that the factor a(ψ) is identical over all observations.
5.2.2 Link Functions

Apart from the distribution of Y, the link function G is another important part of the GLM. Recall the notation

η = X⊤β,  μ = G(η).

In the case that

X⊤β = η = θ,

the link function is called the canonical link function. For models with a canonical link, some theoretical and practical problems are easier to solve. Table 5.2 summarizes the characteristics of some exponential families together with their canonical parameters and canonical link functions. Note that for the binomial and the negative binomial distribution we assume the parameter k to be known. The case of binary Y is a special case of the binomial distribution (k = 1).
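The canonical links in Table 5.2 can be verified directly from the exponential family representation; as a short worked check for the Poisson and Bernoulli cases (cf. Exercise 5.4):

\[
P(Y=y)=\frac{\lambda^{y}e^{-\lambda}}{y!}
       =\exp\{\,y\log\lambda-\lambda-\log(y!)\,\},
\]
so for the Poisson distribution $\theta=\log\lambda$, $b(\theta)=e^{\theta}$ and $\mu=b'(\theta)=e^{\theta}$, giving the canonical link $\theta(\mu)=\log\mu$. For the Bernoulli distribution, Example 5.3 already shows $\theta=\log\{\mu/(1-\mu)\}$, i.e. the logit link.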
What link functions can we choose apart from the canonical one? For most of the models a number of special link functions exist. For binomial Y, for example, the logistic or Gaussian link functions are often used. Recall that a binomial model with the canonical logit link is called a logit model. If the binomial distribution is combined with the Gaussian link, it is called a probit model. A further alternative for binomial Y is the complementary log-log link

η = log{ −log(1 − μ) }.

A very flexible class of link functions is the class of power functions, which are also called Box-Cox transformations (Box & Cox, 1964). They can be defined for all models for which we have observations with positive mean. This family is usually specified as

η = μ^λ  if λ ≠ 0,  and  η = log(μ)  if λ = 0.

The GLM is typically estimated by maximum likelihood, based on the sample log-likelihood

ℓ(Y, μ, ψ) = ∑_{i=1}^{n} ℓ(Yi, μi, ψ),  (5.15)

where μi = G(ηi) = G(xi⊤β) and ℓ(•) on the right hand side of (5.15) denotes the individual log-likelihood contribution for each observation i.
Table 5.2. Characteristics of some exponential families

Notation             Range of y      b(θ)            μ(θ) = b'(θ)     Canonical link θ(μ)   V(μ)       a(ψ)
Bernoulli B(μ)       {0, 1}          log(1 + e^θ)    e^θ/(1 + e^θ)    logit                 μ(1 − μ)   1
Binomial B(k, μ)     [0, k] integer  k log(1 + e^θ)  k e^θ/(1 + e^θ)  log{μ/(k − μ)}        μ(1 − μ/k) 1
Poisson P(μ)         [0, ∞) integer  exp(θ)          exp(θ)           log(μ)                μ          1
Negative Binomial    [0, ∞) integer  −k log(1 − e^θ) k e^θ/(1 − e^θ)  log{μ/(μ + k)}        μ + μ²/k   1
  NB(μ, k)
Normal N(μ, σ²)      (−∞, ∞)         θ²/2            θ                identity              1          σ²
Gamma G(μ, ν)        (0, ∞)          −log(−θ)        −1/θ             reciprocal            μ²         1/ν
Inverse Gaussian     (0, ∞)          −(−2θ)^{1/2}    (−2θ)^{−1/2}     squared reciprocal    μ³         σ²
  IG(μ, σ²)
Example 5.4.
For Yi ~ N(μi, σ²) we have

ℓ(Yi, μi, ψ) = log{ 1/(√(2π) σ) } − (Yi − μi)²/(2σ²),

such that

ℓ(Y, μ, ψ) = n log{ 1/(√(2π) σ) } − (1/(2σ²)) ∑_{i=1}^{n} (Yi − μi)².  (5.16)

Maximizing (5.16) with respect to μ is obviously equivalent to minimizing the residual sum of squares.
For binary responses Yi (the Bernoulli case), the sample log-likelihood takes the form

ℓ(Y, μ, ψ) = ∑_{i=1}^{n} { Yi log(μi) + (1 − Yi) log(1 − μi) }.  (5.17)
Let us remark that in the case where the distribution of Y itself is unknown, but its first two moments can be specified, the quasi-likelihood may replace the log-likelihood (5.14). This means we assume that

E(Y) = μ,  Var(Y) = a(ψ) V(μ).

The quasi-likelihood is defined by

ℓ(y, μ, ψ) = (1/a(ψ)) ∫_μ^y { (s − y)/V(s) } ds,  (5.18)

cf. Nelder & Wedderburn (1972). If Y comes from an exponential family, then the derivatives of (5.14) and (5.18) coincide. Thus, (5.18) establishes in fact a generalization of the likelihood approach.
Alternatively to the log-likelihood, the deviance is often used. The deviance function is defined as

D(Y, μ, ψ) = 2 { ℓ(Y, μmax, ψ) − ℓ(Y, μ, ψ) },  (5.19)

where μmax (typically Y itself) maximizes the unrestricted log-likelihood. For the maximization of the log-likelihood with respect to β, all terms that do not depend on θ can be ignored, which leaves the essential part

ℓ̃(Y, μ) = ∑_{i=1}^{n} { Yi θi − b(θi) }.  (5.21)
We will now maximize (5.21) w.r.t. β. For that purpose, take the first derivative of (5.21). This yields the gradient

D(β) = (∂/∂β) ℓ̃(Y, μ) = ∑_{i=1}^{n} { Yi − b'(θi) } (∂/∂β) θi,  (5.22)

which is set to zero to find the maximizer. Since this is in general a nonlinear system of equations in β, an iterative method is needed. The Newton-Raphson algorithm iterates

β_new = β_old − { H(β_old) }⁻¹ D(β_old),

where H denotes the Hessian of the log-likelihood. A variant of the Newton-Raphson algorithm is the Fisher scoring algorithm, which replaces the Hessian by its expectation (w.r.t. the observations Yi):

β_new = β_old − { EH(β_old) }⁻¹ D(β_old).

To present both algorithms in a more detailed way, we again need some additional notation. Recall that we have μi = G(xi⊤β) = b'(θi), ηi = xi⊤β and b'(θi) = μi = G(ηi). For the first and second derivatives of θi with respect to β we obtain (after some calculation)

(∂/∂β) θi = { G'(ηi)/V(μi) } xi,

(∂²/∂β ∂β⊤) θi = [ { G''(ηi) V(μi) − G'(ηi)² V'(μi) } / V(μi)² ] xi xi⊤.

Hence, the gradient equals

D(β) = ∑_{i=1}^{n} { Yi − μi } { G'(ηi)/V(μi) } xi.
For the Hessian we get

H(β) = (∂²/∂β ∂β⊤) ℓ̃(Y, μ)
     = ∑_{i=1}^{n} [ −{ G'(ηi)²/V(μi) } + { Yi − μi } { G''(ηi) V(μi) − G'(ηi)² V'(μi) } / V(μi)² ] xi xi⊤,

where we used b''(θi) = V(μi). Since EYi = μi, it turns out that the Fisher scoring algorithm is easier: we replace H(β) by

EH(β) = −∑_{i=1}^{n} { G'(ηi)²/V(μi) } xi xi⊤.

For the sake of simplicity, let us concentrate on the Fisher scoring for the moment. Define the weight matrix

W = diag{ G'(η1)²/V(μ1), ..., G'(ηn)²/V(μn) }.

Additionally, define

Ỹ = ( (Y1 − μ1)/G'(η1), ..., (Yn − μn)/G'(ηn) )⊤

and the design matrix

X = (x1, ..., xn)⊤.

Then one iteration step of the Fisher scoring algorithm can be written as

β_new = β_old + (X⊤WX)⁻¹ X⊤W Ỹ  (5.23)

or, in terms of the vector of adjusted dependent variables Z = Xβ_old + Ỹ,

β_new = (X⊤WX)⁻¹ X⊤W Z.  (5.24)
The iteration stops when the parameter estimate or the log-likelihood (or both) no longer change significantly. We denote the resulting parameter estimate by β̂.

We see that each iteration step (5.23) is the result of a weighted least squares regression of the adjusted variables Zi on xi. Hence, a GLM can be estimated by iteratively reweighted least squares (IRLS). Note further that in the linear regression model, where we have G' ≡ 1 and μi = ηi = xi⊤β, no iteration is necessary. The Newton-Raphson algorithm can be given in a similar way (with more complicated weights and a different formula for the adjusted variables). There are several remarks on the algorithm:

- If G is the canonical link function, then b''(θi) = G'(ηi) = V(μi), so that the weights simplify and the Newton-Raphson and Fisher scoring algorithms coincide.
- The nuisance parameter a(ψ) is not needed for the iteration; it can be estimated afterwards by

â(ψ) = (1/n) ∑_{i=1}^{n} (Yi − μ̂i)² / V(μ̂i).  (5.25)
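A compact Python sketch of the IRLS iteration (5.23)-(5.24) for the logit model, where the canonical link makes Fisher scoring and Newton-Raphson coincide; the data are simulated and the code is illustrative only, not the book's implementation.

import numpy as np

def irls_logit(X, y, tol=1e-8, max_iter=100):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                       # eta_i = x_i' beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # mu_i = G(eta_i)
        gprime = mu * (1.0 - mu)             # G'(eta) = V(mu) for the logit
        W = gprime                           # weights G'(eta)^2 / V(mu)
        Z = eta + (y - mu) / gprime          # adjusted dependent variable
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ Z)   # (X'WX)^{-1} X'WZ
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    cov = np.linalg.inv(XtW @ X)             # covariance estimate, a(psi) = 1
    return beta, cov

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
p = 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0, -1.0]))))
y = (rng.uniform(size=500) < p).astype(float)
beta_hat, cov = irls_logit(X, y)
t_values = beta_hat / np.sqrt(np.diag(cov))  # t-values as reported in Table 5.3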
Under regularity conditions, the estimated coefficient vector is asymptotically normal:

√n (β̂ − β) →L N(0, Σ).

Denote further by μ̂ the estimator of μ. Then, for the deviance and the log-likelihood it holds approximately:

D(Y, μ̂, ψ) ~ χ²_{n−d}  and  2{ ℓ(Y, μ̂, ψ) − ℓ(Y, μ, ψ) } ~ χ²_d.

The asymptotic covariance Σ of the coefficient β̂ can be estimated by

Σ̂ = â(ψ) [ (1/n) ∑_{i=1}^{n} { G'(η_{i,last})²/V(μ_{i,last}) } xi xi⊤ ]⁻¹ = â(ψ) n (X⊤WX)⁻¹,

with the subscript "last" denoting the values from the last iteration step. Using this estimated covariance we can make inference about the components of β.
Table 5.3. Logit coefficients for the migration data

                               Coefficients    t-value
constant                           0.512         2.39
FAMILY/FRIENDS IN WEST             0.599         5.20
UNEMPLOYED/JOB LOSS CERTAIN        0.221         2.31
CITY SIZE 10,000-100,000           0.311         3.77
FEMALE                            −0.240        −3.15
AGE                            −4.69·10⁻²      −14.56
HOUSEHOLD INCOME                1.42·10⁻⁴        2.73
Now we are interested in estimating the probability of migration as a function of the explanatory variables x. Recall that

P(Y = 1|X) = E(Y|X).

A useful model is a GLM with binary (Bernoulli) Y and the logit link, for example:

P(Y = 1|X = x) = G(x⊤β) = exp(x⊤β) / { 1 + exp(x⊤β) }.
Table 5.3 shows in the middle column the results of this logit fit. The migration intention is definitely determined by age. However, the unemployment, city size and household income variables are also highly significant, which is indicated by their high absolute t-values (tj = β̂j / √(Σ̂jj)).
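As a small numerical illustration of the fitted probability formula, the following snippet evaluates P(Y = 1|X = x) for a hypothetical covariate profile, using only the constant, AGE and HOUSEHOLD INCOME coefficients of Table 5.3 (so the number is illustrative, not a prediction of the full model):

import numpy as np

# constant, AGE, HOUSEHOLD INCOME coefficients (subset of Table 5.3)
beta = np.array([0.512, -4.69e-2, 1.42e-4])
x = np.array([1.0, 35.0, 2200.0])   # hypothetical person: age 35, income 2200 DM
index = x @ beta
p = np.exp(index) / (1.0 + np.exp(index))
print(round(float(p), 3))           # illustrative migration probability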
Bibliographic Notes

For general aspects of semiparametric regression we refer to the textbooks of Pagan & Ullah (1999), Yatchew (2003) and Ruppert, Wand & Carroll (2003). Comprehensive presentations of the generalized linear model can be found in Dobson (2001), McCullagh & Nelder (1989) and Hardin & Hilbe (2001).
[...] parametric selection mechanism. This idea has been extended to general pairwise difference estimators for censored and truncated models in Honoré & Powell (1994). A comprehensive survey of parametric and semiparametric methods for parametric models with non- or semiparametric selection bias can be found in Vella (1998). Even though implementation of and theory on these methods is often quite complicated, some of them have turned out to perform reasonably well.

The second approach, i.e. relaxing the functional forms of the functions of interest, turned out to be much more complicated. To our knowledge, the first articles on the estimation of triangular simultaneous equation systems have been Newey, Powell & Vella (1999) and Rodríguez-Póo, Sperlich & Fernández (1999), of which the former is purely nonparametric, whereas the latter considers nested simultaneous equation systems and needs to specify the error distribution for identifiability reasons. Finally, Lewbel & Linton (2002) found a smart way to identify nonparametric censored and truncated regression functions; however, their estimation procedure is quite technical. Note that so far neither their estimator nor that of Newey, Powell & Vella (1999) has been shown to perform well in practice.
Exercises

Exercise 5.1. Assume model (5.6) and consider X and ε to be independent. Show that

P(Y = 1|X) = E(Y|X) = G{v(X)},

where G denotes the cdf of ε. Explain that (5.7) holds if we do not assume independence of X and ε.

Exercise 5.2. Recall the paragraph about partial linear models. Why may it be sufficient to include β1 X1 in the model when X1 is binary? What would you do if X1 were categorical?

Exercise 5.3. Compute H(β) and EH(β) for the logit and probit models.

Exercise 5.4. Verify the canonical link functions for the logit and Poisson models.

Exercise 5.5. Recall that in Example 5.6 we fitted the model

E(Y|X) = P(Y = 1|X) = G(X⊤β),

where G is the standard logistic cdf. We motivated this model through the latent-variable model Y* = X⊤β − ε with ε having cdf G. How does the logit model change if the latent-variable model is multiplied by a factor c? What does this imply for the identification of the coefficient vector β?
Summary

The basis for many semiparametric regression models is the generalized linear model (GLM), which is given by

E(Y|X) = G{X⊤β}.

Here, β denotes the parameter vector to be estimated and G denotes a known link function. Prominent examples of this type of regression are binary choice models (logit or probit) or count data models (Poisson regression).

The estimation of the GLM is performed through an iterative algorithm. This algorithm, the iteratively reweighted least squares (IRLS) algorithm, applies weighted least squares to the adjusted dependent variable Z in each iteration step:

β_new = (X⊤WX)⁻¹ X⊤W Z.

This numerical approach needs to be appropriately modified for estimating the semiparametric modifications of the GLM.
References
Bossaerts, P., Hafner, C. & Härdle, W. (1996). Foreign exchange rates have surprising volatility, in P. M. Robinson & M. Rosenblatt (eds), Athens Conference on Applied Probability and Time Series Analysis. Volume II: Time Series Analysis. In Memory of E. J. Hannan, Lecture Notes in Statistics, Springer, pp. 55-72.
Boularan, J., Ferré, L. & Vieu, P. (1994). Growth curves: a two-stage nonparametric approach, Journal of Statistical Planning and Inference 38: 327-350.
Bowman, A. & Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis, Oxford University Press, Oxford, UK.
Box, G. & Cox, D. (1964). An analysis of transformations, Journal of the Royal Statistical Society, Series B 26: 211-243.
Breiman, L. & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlations (with discussion), Journal of the American Statistical Association 80(391): 580-619.
Buja, A., Hastie, T. J. & Tibshirani, R. J. (1989). Linear smoothers and additive models (with discussion), Annals of Statistics 17: 453-555.
Burda, M. (1993). The determinants of East-West German migration, European Economic Review 37: 452-461.
Cao, R., Cuevas, A. & González Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation, Computational Statistics & Data Analysis 17(2): 153-176.
Carroll, R. J., Fan, J., Gijbels, I. & Wand, M. P. (1997). Generalized partially linear single-index models, Journal of the American Statistical Association 92: 477-489.
Carroll, R. J., Härdle, W. & Mammen, E. (2002). Estimation in an additive model when the components are linked parametrically, Econometric Theory 18(4): 886-912.
Chaudhuri, P. & Marron, J. S. (1999). SiZer for exploration of structures in curves, Journal of the American Statistical Association 94: 807-823.
Chen, R., Liu, J. S. & Tsay, R. S. (1995). Additivity tests for nonlinear autoregression, Biometrika 82: 369-383.
Cleveland, W. S. (1979). Robust locally-weighted regression and smoothing scatterplots, Journal of the American Statistical Association 74: 829-836.
Collomb, G. (1985). Nonparametric regression: an up-to-date bibliography, Statistics 2: 309-324.
Cosslett, S. (1983). Distribution-free maximum likelihood estimation of the binary choice model, Econometrica 51: 765-782.
Cosslett, S. (1987). Efficiency bounds for distribution-free estimators of the binary choice model, Econometrica 55: 559-586.
Dalelane, C. (1999). Bootstrap confidence bands for the integration estimator in additive models, Diploma thesis, Department of Mathematics, Humboldt-Universität zu Berlin.
Fahrmeir, L. & Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models, Springer.
Fan, J. & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications, Vol. 66 of Monographs on Statistics and Applied Probability, Chapman and Hall, New York.
Fan, J., Härdle, W. & Mammen, E. (1998). Direct estimation of low-dimensional components in additive models, Annals of Statistics 26: 943-971.
Fan, J. & Li, Q. (1996). Consistent model specification test: Omitted variables and semiparametric forms, Econometrica 64: 865-890.
Fan, J. & Marron, J. S. (1992). Best possible constant for bandwidth selection, Annals of Statistics 20: 2057-2070.
Fan, J. & Marron, J. S. (1994). Fast implementations of nonparametric curve estimators, Journal of Computational and Graphical Statistics 3(1): 35-56.
Gozalo, P. L. & Linton, O. (2001). A nonparametric test of additivity in generalized nonparametric regression with estimated parameters, Journal of Econometrics 104: 1-48.
Grasshoff, U., Schwalbach, J. & Sperlich, S. (1999). Executive pay and corporate financial performance: an explorative data analysis, Working Paper 99-84 (33), Universidad Carlos III de Madrid.
Green, P. J. & Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models, Vol. 58 of Monographs on Statistics and Applied Probability, Chapman and Hall, London.
Green, P. J. & Yandell, B. S. (1985). Semi-parametric generalized linear models, Proceedings 2nd International GLIM Conference, Vol. 32 of Lecture Notes in Statistics, Springer, New York, pp. 44-55.
GSOEP (1991). Das Sozio-ökonomische Panel (SOEP) im Jahre 1990/91,
Müller, M. (2001). Estimation and testing in generalized partial linear models: a comparative study, Statistics and Computing 11: 299-309.
Müller, M. (2004). Generalized linear models, in J. Gentle, W. Härdle & Y. Mori (eds), Handbook of Computational Statistics (Volume I): Concepts and Fundamentals, Springer, Heidelberg.
Nadaraya, E. A. (1964). On estimating regression, Theory of Probability and its Applications 10: 186-190.
Nelder, J. A. & Wedderburn, R. W. M. (1972). Generalized linear models, Journal of the Royal Statistical Society, Series A 135(3): 370-384.
Newey, W. K. (1990). Semiparametric efficiency bounds, Journal of Applied Econometrics 5: 99-135.
Yang, L., Sperlich, S. & Härdle, W. (2003). Derivative estimation and testing in generalized additive models, Journal of Statistical Planning and Inference 115(2): 521-542.
Yatchew, A. (2003). Semiparametric Regression for the Applied Econometrician, Cambridge University Press.
Zheng, J. (1996). A consistent test of a functional form via nonparametric estimation techniques, Journal of Econometrics 75: 263-289.
Author Index
Achmus, S. 247
Ahn, H. 162
Amemiya, T. 162
Andrews, D. W. K. 247
Ansley, C. 247
Auestad, B. 212, 247
Azzalini, A. 135
Begun, J. 185
Berndt, E. 5, 63
Bickel, P. 62, 120, 162, 185
Bierens, H. 135
Bonneu, M. 185
Bossaerts, P. 121
Boularan, J. 247
Bowman, A. 135
Box, G. 154, 162
Breiman, L. 212, 247
Buja, A. 190, 197, 206, 212, 214, 247,
261
Burda, M. 12
Cao, R. 79, 135
Carroll, R. J. 162, 247, 256, 274
Chaudhuri, P. 79
Chen, R. 247
Cleveland, W. S. 135
Collomb, G. 135
Cosslett, S. 185
Cox, D. 154, 162
Cuevas, A. 79
Dalelane, C. 247
Daubechies, I. 135
Deaton, A. 211, 247
Delecroix, M. 177, 185
Delgado, M. A. 149
Denby, L. 206, 254
Dette, H. 135, 240
Dobson, A. J. 162
Doksum, K. 162
Donoho, D. L. 135
Duan, N. 162
Duin, R. P. W. 79
Eilers, P. H. C. 135
Epanechnikov, V. 60
Eubank, R. L. 135, 206, 247
Fahrmeir, L. 77
Fan, J. 57, 98, 135, 199, 254-256, 274
Fernández, A. I. 163
Ferré, L. 247
Friedman, J. H. 162, 212, 247
Fuss, M. 247
Gallant, A. 185
Gasser, T. 90, 92, 135
Gijbels, I. 98, 135, 199, 256, 274
Gill, J. 162
Gill, R. D. 175
González Manteiga, W. 79, 135
Gozalo, P. L. 135, 247, 274
Grasshoff, U. 242
Green, P. J. 103, 135, 206, 274
GSOEP 160, 180
Habbema, J. D. F. 79
Hafner, C. 121
Hall, P. 56, 57, 79
Hall, W. 185
Han, A. 185
Hansen, M. H. 247
Hardin, J. 162
Härdle, W. 110, 121, 122, 126, 127, 135, 177, 185, 202, 204, 206, 247, 257, 274
Harrison, D. 217, 219
Hart, J. D. 135, 247
Hastie, T. J. 190, 197, 198, 202, 206, 212-214, 220, 235, 247, 261, 264, 274
Heckman, J. 162
Hengartner, N. 247
Hermans, J. 79
Hilbe, J. 162
Honoré, B. E. 163
Horowitz, J. L. 15, 162, 182-185, 274
Hristache, M. 177
Hsing, T. 162
Huang, W. 185
Huet, S. 265, 266, 269
Ichimura, H. 167, 172-174, 185
Ingster, Y. I. 135
Johnstone, I. M. 135
Müller, H. G. 90, 92, 135
Müller, M. 98, 162, 181, 182, 202, 206
Mundlak, Y. 247
Nadaraya, E. A. 89
Nelder, J. A. 146, 151, 152, 156, 159,
162, 206
Newey, W. K. 163, 174, 185, 247
Nielsen, J. P. 212, 216, 221, 222, 234,
236, 240, 247, 274
Nolan, D. 58
Nychka, D. 185
Opsomer, J. 216, 247
Pagan, A. 22, 135, 162, 185
Park, B. U. 51, 56, 79
Picard, D. 107, 135
Ploberger, W. 135
Powell, J. L. 162, 163, 172, 179, 180
Presedo-Quindimi, M. 135
Proença, I. 180, 270
Reese, C. S. 206
Ripley, B. 162
Ritov, Y. 185
Robinson, P. M. 190, 206, 274
Rodríguez-Póo, J. M. 163, 274
Rosenblatt, M. 62, 120
Rubinfeld, D. L. 217, 219
Ruppert, D. 71, 130, 132, 135, 162,
216, 224, 225, 247
Simpson, D. G. 247
Spady, R. 176
Speckman, P. E. 190, 206, 254, 274,
275
Sperlich, S. 163, 222, 224, 226, 228, 230, 232-236, 240, 242, 247, 257, 263, 265, 266, 269, 274
Spokoiny, V. 136, 247, 257
Staniswalis, J. G. 193, 201, 206
Stefanski, L. A. 247
Stock, J. H. 172, 179, 180
Stoker, T. M. 172, 179, 180, 185
Stone, C. J. 55, 135, 211, 247, 253, 259
Stuetzle, W. 162, 247
Stute, W. 135
Terrell, G. R. 79
Thall, P. F. 206
Tibshirani, R. J. 190, 197, 198, 202, 206, 212-214, 220, 235, 247, 261, 264, 274
Tjøstheim, D. 212, 228, 233, 247
Treiman, D. J. 257
Truong, Y. 247
Tsay, R. S. 247
Tsybakov, A. B. 107, 122
Turlach, B. A. 51, 79
Tutz, G. 77
Ullah, A. 135, 162, 185
Schimek, M. G. 206
Schwalbach, J. 242
Schwert, W. 22
Scott, D. W. 30, 35, 69, 71-73, 79, 135
Severance-Lossin, E. 224, 226, 230,
235, 236
Severini, T. A. 191, 193, 201, 206
Sheather, S. J. 57, 79
Silverman, B. W. 69, 72, 73, 79, 103,
104, 135, 206
Simonoff, J. 135
Watson, G. S. 89
Wecker, W. 247
Wedderburn, R. W. M. 151, 156, 206
Weisberg, S. 185
Wellner, J. 185
Welsh, A. H. 185
Werwatz, A. 180, 270
Whang, Y.-J. 247
Subject Index
backfitting, 212
classical, 212
GAM, 260
GAPLM, 264
GPLM, 197
local scoring, 260
modified, 219
smoothed, 221
bandwidth
canonical, 57
kernel density estimation, 42
rule of thumb, 52
bandwidth choice
additive model, 236
kernel density estimation, 51, 56
kernel regression, 107
Silverman's rule of thumb, 51
bias
histogram, 25
kernel density estimation, 46
kernel regression, 93
multivariate density estimation, 71
multivariate regression, 130
bin, 21
binary response, 11, 146
binwidth, 21, 23
optimal choice, 29
rule of thumb, 30
canonical bandwidth, 57
canonical kernel, 57
canonical link function, 154
Fourier coefficients, 107
Fourier series, 104
frequency polygon, 35
Gasser-Müller estimator, 91
Gauss-Seidel algorithm, 212
generalized additive model, see GAM
generalized additive partial linear
model, see GAPLM
generalized cross-validation, 116
generalized linear model, see GLM
generalized partial linear model, see
GPLM
approximate LR test, 202
modified LR test, 203
GLM, 151
estimation, 154
exponential family, 151
Fisher scoring, 157
hypotheses testing, 160
IRLS, 154
link function, 153
Newton-Raphson, 157
GPLM, 150, 189
backfitting, 197
hypotheses testing, 202
profile likelihood, 191
Speckman estimator, 195
gradient, 71
Hessian matrix, 71
histogram, 21
ASH, 32
asymptotic properties, 24
bias, 25
binwidth choice, 29
construction, 21
dependence on binwidth, 23
dependence on origin, 30
derivation, 23
MSE, 27
variance, 26
hypotheses testing
GPLM, 202
regression, 118
i.i.d., 21
identification, 162
AM, 213
SIM, 167
independent and identically distributed,
see i.i.d.
index, 12, 149, 167
semiparametric, 149
integrated squared error, see ISE
interaction terms, 227
IRLS, 154
ISE
kernel density estimation, 53
multivariate density estimation, 74
regression, 109
iteratively reweighted least squares, see
IRLS
kernel density estimation, 39
as a sum of bumps, 45
asymptotic properties, 46
bandwidth choice, 56
bias, 46
confidence bands, 61
confidence intervals, 61
dependence on bandwidth, 43
dependence on kernel, 43
derivation, 40
multivariate, 66
multivariate rule-of-thumb bandwidth, 73
optimal bandwidth, 50
rule-of-thumb bandwidth, 51
variance, 48
kernel function, 42
canonical, 57, 59
efficiency, 60
equivalent, 57
kernel regression, 88
bandwidth choice, 107
bias, 93
confidence bands, 120
confidence intervals, 119
cross-validation, 113
fixed design, 90
Nadaraya-Watson estimator, 89
penalizing functions, 114
random design, 89
statistical properties, 92
univariate, 88
variance, 93
k-NN, see k-nearest-neighbor
least squares, see LS
likelihood ratio, see LR
linear regression, 3
link function, 12, 146, 151, 153
canonical, 154
nonparametric, 148
power function, 154
local constant, 95
local linear, 95
local polynomial
derivative estimation, 98
regression, 94, 95
local scoring, 260, 261
log-likelihood
GLM, 153
pseudo likelihood, 175
quasi-likelihood, 156
marginal effect, 224
derivative, 225
marginal integration, 222
GAM, 262
GAPLM, 264
MASE
regression, 109, 110
maximum likelihood, see ML
maximum likelihood estimator, see MLE
mean averaged squared error, see MASE
mean integrated squared error, see MISE
mean squared error, see MSE
median smoothing, 101
MISE
histogram, 29
kernel density estimation, 50
regression, 109
ML, 152, 154
MLE, 152, 154
MSE
histogram, 27
kernel density estimation, 49
univariate regression, 108
multivariate density estimation, 66, 69
bias, 71
computation, 75
graphical representation, 75
variance, 71
multivariate regression, 128
asymptotics, 130
bias, 130
computation, 132
curse of dimensionality, 133
Nadaraya-Watson estimator, 129
variance, 130
Nadaraya-Watson estimator, 89
multivariate, 129
k-nearest-neighbor, 98-100
Newton-Raphson algorithm
GLM, 157
nonparametric regression, 85
multivariate, 128
univariate, 85
origin, 21
orthogonal series
Fourier series, 104
orthogonal series regression, 104
orthonormality, 106
partial linear model, see PLM
pdf, 1, 21, 39
multivariate, 66
penalizing functions, 114
Akaike's information criterion, 117
finite prediction error, 117
generalized cross-validation, 116
Rice's T, 117
Shibata's model selector, 116
penalty term
bandwidth choice, 110
spline, 103
spline smoothing, 102
PLM, 149, 189
estimation, 191
plug-in method, 55
refined, 79
Silverman's rule of thumb, 51
PMLE, 171
probability density function, see pdf
profile likelihood, 191
pseudo likelihood, 174
pseudo maximum likelihood estimator,
see PMLE
quasi-likelihood, 156
random design, 88, 89, 92
regression, 3
conditional expectation, 87
confidence bands, 118
confidence intervals, 118
fixed design, 88, 93
generalized, 145
hypotheses testing, 118
kernel regression, 88
linear, 3, 5
local polynomial, 94
median smoothing, 101
k-nearest-neighbor, 98
nonparametric, 7, 85
nonparametric univariate, 85
orthogonal series, 104
parametric, 5
random design, 88, 92
semiparametric, 9, 145
spline smoothing, 101
residual sum of squares, see RSS
resubstitution estimate, 112
Rice's T, 117
RSS, 101
rule of thumb
histogram, 30
kernel density estimation, 52
multivariate density estimation, 73
semiparametric least squares, see SLS
Shibata's model selector, 116
Silverman's rule of thumb, 51
SIM, 167
estimation, 170
hypotheses testing, 183
identification, 168
PMLE, 174
SLS, 172
WADE, 178
single index model, see SIM
SLS, 171
smoothing spline, 104
Speckman estimator, 190, 195
spline kernel, 104
spline smoothing, 101
subset selection, 148
Taylor expansion
first order, 26
multivariate, 71
test
AM, GAM, GAPLM, 268
approximate LR test, 202
LR test, 160
modified LR test, 203
SIM, 183
time series
nonparametric, 121
variable selection, 148
variance
histogram, 26
kernel density estimation, 48
kernel regression, 93
multivariate density estimation, 71
multivariate regression, 130
WADE, 171, 178
wage equation, 3
WARPing, 35
wavelets, 107
weighted average derivative estimator,
see WADE
weighted semiparametric least squares,
see WSLS
XploRe, V