
Wolfgang Härdle, Marlene Müller,
Stefan Sperlich, Axel Werwatz

Nonparametric and
Semiparametric Models

An Introduction

February 6, 2004

Springer
Berlin Heidelberg New York
Hong Kong London
Milan Paris Tokyo

Please note: this is only a sample of the full book. The complete book can be downloaded on the e-book page of XploRe:

http://www.xplorestat.de/ebooks/ebooks.html

For further information please contact MD*Tech at [email protected]

Preface

The concept of smoothing is a central idea in statistics. Its role is to extract


structural elements of variable complexity from patterns of random variation. The nonparametric smoothing concept is designed to simultaneously
estimate and model the underlying structure. This involves high dimensional objects, like density functions, regression surfaces or conditional quantiles. Such objects are difficult to estimate for data sets with mixed, high dimensional and partially unobservable variables. The semiparametric modeling technique compromises the two aims, flexibility and simplicity of statistical procedures, by introducing partial parametric components. These (low
dimensional) components allow one to match structural conditions like for
example linearity in some variables and may be used to model the influence
of discrete variables. The flexibility of semiparametric modeling has made it
a widely accepted statistical technique.
The aim of this monograph is to present the statistical and mathematical principles of smoothing with a focus on applicable techniques. The necessary mathematical treatment is easily understandable and a wide variety of interactive smoothing examples is given. This text is an e-book; it is a downloadable entity (http://www.i-xplore.de) which allows the reader to recalculate all arguments and applications without reference to a specific software platform. This new technique for the proliferation of methods and ideas is specifically designed for the beginner in nonparametric and semiparametric statistics. It is based on the XploRe quantlet technology, developed at Humboldt-Universität zu Berlin.
The text has evolved out of the courses "Nonparametric Modeling" and "Semiparametric Modeling", which the authors taught at Humboldt-Universität zu Berlin, ENSAE Paris, Charles University Prague, and Universidad de Cantabria, Santander. The book divides itself naturally into two parts:

Part I: Nonparametric Models
histogram, kernel density estimation, nonparametric regression

Part II: Semiparametric Models
generalized regression, single index models, generalized partial linear models, additive and generalized additive models.
The first part (Chapters 2-4) covers the methodological aspects of nonparametric function estimation for cross-sectional data, in particular kernel
smoothing methods. Although our primary focus will be on flexible regression models, a closely related topic to consider is nonparametric density estimation. Since many techniques and concepts for the estimation of probability
density functions are also relevant for regression function estimation, we first
consider histograms (Chapter 2) and kernel density estimates (Chapter 3) in
more detail. Finally, in Chapter 4 we introduce several methods of nonparametrically estimating regression functions. The main part of this chapter is
devoted to kernel regression, but other approaches such as splines, orthogonal series and nearest neighbor methods are also covered.
The first part is intended for undergraduate students majoring in mathematics, statistics, econometrics or biometrics. It is assumed that the audience has a basic knowledge of mathematics (linear algebra and analysis) and
statistics (inference and regression analysis). The material is easy to utilize
since the e-book character of the text allows maximum flexibility in learning
(and teaching) intensity.
The second part (Chapters 5-9) is devoted to semiparametric regression
models, in particular extensions of the parametric generalized linear model.
In Chapter 5 we summarize the main ideas of the generalized linear model
(GLM). Typical concepts are the logit and probit models. Nonparametric extensions of the GLM consider either the link function (single index models,
Chapter 6) or the index argument (generalized partial linear models, additive and generalized additive models, Chapters 7-9). Single index models
focus on the nonparametric error distribution in an underlying latent variable model. Partial linear models take the pragmatic point of fixing the error
distribution but let the index be of non- or semiparametric structure. Generalized additive models concentrate on a (lower dimensional) additive structure
of the index with fixed link function. This model class balances the difficulty
of high-dimensional smoothing with the flexibility of nonparametrics.
In addition to the methodological aspects, the second part also covers
computational algorithms for the considered models. As in the first part we
focus on cross-sectional data. It is intended to be used by Master's and PhD
students or researchers.
This book would not have been possible without substantial support
from many colleagues and students. It has benefited at several stages from

useful remarks and suggestions of our students at Humboldt-Universität zu Berlin, ENSAE Paris and Charles University Prague. We are grateful to Lorens Helmchen, Stephanie Freese, Danilo Mercurio, Thomas Kühn, Ying Chen and Michal Benko for their support in text processing and programming, Caroline Condron for language checking, and Pavel Čížek, Zdeněk Hlávka and Rainer Schulz for their assistance in teaching. We are indebted to Joel Horowitz (Northwestern University), Enno Mammen (Universität Heidelberg) and Helmut Rieder (Universität Bayreuth) for their valuable comments on earlier versions of the manuscript. Thanks go also to Clemens Heine, Springer Verlag, for being a very supportive and helpful editor.

Berlin/Kaiserslautern/Madrid, February 2004


Wolfgang Härdle
Marlene Müller
Stefan Sperlich
Axel Werwatz

Contents

Preface .......................................................... V

Notation ........................................................ XXI

1 Introduction .................................................... 1
  1.1 Density Estimation .......................................... 1
  1.2 Regression .................................................. 3
      1.2.1 Parametric Regression ................................. 5
      1.2.2 Nonparametric Regression .............................. 7
      1.2.3 Semiparametric Regression ............................. 9
  Summary ....................................................... 18

Part I Nonparametric Models

2 Histogram ..................................................... 21
  2.1 Motivation and Derivation .................................. 21
      2.1.1 Construction ......................................... 21
      2.1.2 Derivation ........................................... 23
      2.1.3 Varying the Binwidth ................................. 23
  2.2 Statistical Properties ..................................... 24
      2.2.1 Bias ................................................. 25
      2.2.2 Variance ............................................. 26
      2.2.3 Mean Squared Error ................................... 27
      2.2.4 Mean Integrated Squared Error ........................ 29
      2.2.5 Optimal Binwidth ..................................... 29
  2.3 Dependence of the Histogram on the Origin ................. 30
  2.4 Averaged Shifted Histogram ................................. 32
  Bibliographic Notes ........................................... 35
  Exercises ..................................................... 36
  Summary ....................................................... 38

3 Nonparametric Density Estimation .............................. 39
  3.1 Motivation and Derivation .................................. 39
      3.1.1 Introduction ......................................... 39
      3.1.2 Derivation ........................................... 40
      3.1.3 Varying the Bandwidth ................................ 43
      3.1.4 Varying the Kernel Function .......................... 43
      3.1.5 Kernel Density Estimation as a Sum of Bumps .......... 45
  3.2 Statistical Properties ..................................... 46
      3.2.1 Bias ................................................. 46
      3.2.2 Variance ............................................. 48
      3.2.3 Mean Squared Error ................................... 49
      3.2.4 Mean Integrated Squared Error ........................ 50
  3.3 Smoothing Parameter Selection .............................. 51
      3.3.1 Silverman's Rule of Thumb ............................ 51
      3.3.2 Cross-Validation ..................................... 53
      3.3.3 Refined Plug-in Methods .............................. 55
      3.3.4 An Optimal Bandwidth Selector?! ...................... 56
  3.4 Choosing the Kernel ........................................ 57
      3.4.1 Canonical Kernels and Bandwidths ..................... 57
      3.4.2 Adjusting Bandwidths across Kernels .................. 59
      3.4.3 Optimizing the Kernel ................................ 60
  3.5 Confidence Intervals and Confidence Bands .................. 61
  3.6 Multivariate Kernel Density Estimation ..................... 66
      3.6.1 Bias, Variance and Asymptotics ....................... 70
      3.6.2 Bandwidth Selection .................................. 72
      3.6.3 Computation and Graphical Representation ............. 75
  Bibliographic Notes ........................................... 79
  Exercises ..................................................... 80
  Summary ....................................................... 82

4 Nonparametric Regression ...................................... 85
  4.1 Univariate Kernel Regression ............................... 85
      4.1.1 Introduction ......................................... 85
      4.1.2 Kernel Regression .................................... 88
      4.1.3 Local Polynomial Regression and Derivative Estimation 94
  4.2 Other Smoothers ............................................ 98
      4.2.1 Nearest-Neighbor Estimator ........................... 98
      4.2.2 Median Smoothing .................................... 101
      4.2.3 Spline Smoothing .................................... 101
      4.2.4 Orthogonal Series ................................... 104
  4.3 Smoothing Parameter Selection ............................. 107
      4.3.1 A Closer Look at the Averaged Squared Error ......... 110
      4.3.2 Cross-Validation .................................... 113
      4.3.3 Penalizing Functions ................................ 114
  4.4 Confidence Regions and Tests .............................. 118
      4.4.1 Pointwise Confidence Intervals ...................... 119
      4.4.2 Confidence Bands .................................... 120
      4.4.3 Hypothesis Testing .................................. 124
  4.5 Multivariate Kernel Regression ............................ 128
      4.5.1 Statistical Properties .............................. 130
      4.5.2 Practical Aspects ................................... 132
  Bibliographic Notes .......................................... 135
  Exercises .................................................... 137
  Summary ...................................................... 139

Part II Semiparametric Models

5 Semiparametric and Generalized Regression Models ............. 145
  5.1 Dimension Reduction ....................................... 145
      5.1.1 Variable Selection in Nonparametric Regression ...... 148
      5.1.2 Nonparametric Link Function ......................... 148
      5.1.3 Semi- or Nonparametric Index ........................ 149
  5.2 Generalized Linear Models ................................. 151
      5.2.1 Exponential Families ................................ 151
      5.2.2 Link Functions ...................................... 153
      5.2.3 Iteratively Reweighted Least Squares Algorithm ...... 154
  Bibliographic Notes .......................................... 162
  Exercises .................................................... 164
  Summary ...................................................... 165

6 Single Index Models .......................................... 167
  6.1 Identification ............................................ 168
  6.2 Estimation ................................................ 170
      6.2.1 Semiparametric Least Squares ........................ 172
      6.2.2 Pseudo Likelihood Estimation ........................ 174
      6.2.3 Weighted Average Derivative Estimation .............. 178
  6.3 Testing the SIM ........................................... 183
  Bibliographic Notes .......................................... 185
  Exercises .................................................... 186
  Summary ...................................................... 187

7 Generalized Partial Linear Models ............................ 189
  7.1 Partial Linear Models ..................................... 189
  7.2 Estimation Algorithms for PLM and GPLM .................... 191
      7.2.1 Profile Likelihood .................................. 191
      7.2.2 Generalized Speckman Estimator ...................... 195
      7.2.3 Backfitting ......................................... 197
      7.2.4 Computational Issues ................................ 199
  7.3 Testing the GPLM .......................................... 202
      7.3.1 Likelihood Ratio Test with Approximate Degrees of Freedom 202
      7.3.2 Modified Likelihood Ratio Test ...................... 203
  Bibliographic Notes .......................................... 206
  Exercises .................................................... 207
  Summary ...................................................... 208

8 Additive Models and Marginal Effects ......................... 211
  8.1 Backfitting ............................................... 212
      8.1.1 Classical Backfitting ............................... 212
      8.1.2 Modified Backfitting ................................ 219
      8.1.3 Smoothed Backfitting ................................ 221
  8.2 Marginal Integration Estimator ............................ 222
      8.2.1 Estimation of Marginal Effects ...................... 224
      8.2.2 Derivative Estimation for the Marginal Effects ...... 225
      8.2.3 Interaction Terms ................................... 227
  8.3 Finite Sample Behavior .................................... 234
      8.3.1 Bandwidth Choice .................................... 236
      8.3.2 MASE in Finite Samples .............................. 239
      8.3.3 Equivalent Kernel Weights ........................... 240
  Bibliographic Notes .......................................... 247
  Exercises .................................................... 248
  Summary ...................................................... 250

9 Generalized Additive Models .................................. 253
  9.1 Additive Partial Linear Models ............................ 254
  9.2 Additive Models with Known Link ........................... 259
      9.2.1 GAM using Backfitting ............................... 260
      9.2.2 GAM using Marginal Integration ...................... 262
  9.3 Generalized Additive Partial Linear Models ................ 264
      9.3.1 GAPLM using Backfitting ............................. 264
      9.3.2 GAPLM using Marginal Integration .................... 264
  9.4 Testing in Additive Models, GAM, and GAPLM ................ 268
  Bibliographic Notes .......................................... 274
  Exercises .................................................... 275
  Summary ...................................................... 276

References ..................................................... 279

Author Index ................................................... 291

Subject Index .................................................. 295

List of Figures

1.1  Log-normal versus kernel density estimates  SPMfesdensities
1.2  Wage-schooling and wage-experience profile  SPMcps85lin
1.3  Parametrically estimated regression function  SPMcps85lin
1.4  Nonparametrically estimated regression function  SPMcps85reg
1.5  Engel curve  SPMengelcurve2
1.6  Additive model fit versus parametric fit  SPMcps85add ........ 10
1.7  Surface plot for the additive model  SPMcps85add ............. 11
1.8  Logit fit  SPMlogit .......................................... 13
1.9  Link function of the homoscedastic versus the heteroscedastic model  SPMtruelogit ... 14
1.10 Sampling distribution of the ratio of the estimated coefficients and the ratio's true value  SPMsimulogit ... 16
1.11 Single index versus logit model  SPMsim ...................... 17
2.1  Histogram for stock returns data  SPMhistogram ............... 22
2.2  Approximation of the area under the pdf ...................... 24
2.3  Histograms for stock returns, different binwidths  SPMhisdiffbin ... 25
2.4  Squared bias, variance and MSE for the histogram  SPMhistmse ... 28
2.5  Histograms for stock returns, different origins  SPMhisdiffori ... 31
2.6  Averaged shifted histogram for stock returns  SPMashstock .... 32
2.7  Ordinary histogram for stock returns  SPMhiststock ........... 33
3.1  Some kernel functions  SPMkernel ............................. 42
3.2  Density estimates for the stock returns  SPMdensity .......... 43
3.3  Different kernels for estimation  SPMdenquauni ............... 44
3.4  Different continuous kernels for estimation  SPMdenepatri .... 45
3.5  Kernel density estimate as sum of bumps  SPMkdeconstruct ..... 46
3.6  Bias effects  SPMkdebias ..................................... 47
3.7  Squared bias, variance and MSE  SPMkdemse .................... 49
3.8  Parametric versus nonparametric density estimate for average hourly earnings  SPMcps85dist ... 63
3.9  Confidence intervals versus density estimates for average hourly earnings  SPMcps85dist ... 64
3.10 Confidence bands versus intervals for average hourly earnings  SPMcps85dist ... 65
3.11 Bivariate kernel contours for equal bandwidths  SPMkernelcontours ... 68
3.12 Bivariate kernel contours for different bandwidths  SPMkernelcontours ... 69
3.13 Bivariate kernel contours for bandwidth matrix  SPMkernelcontours ... 70
3.14 Two-dimensional density estimate  SPMdensity2D ............... 75
3.15 Two-dimensional contour plot of a density estimate  SPMcontour2D ... 76
3.16 Two-dimensional intersections for three-dimensional density estimate  SPMslices3D ... 77
3.17 Three-dimensional contour plots of a density estimate  SPMcontour3D ... 78
4.1  Nadaraya-Watson kernel regression  SPMengelcurve1 ............ 87
4.2  Kernel regression estimates using different bandwidths  SPMregress ... 91
4.3  Local polynomial regression  SPMlocpolyreg ................... 97
4.4  Local linear regression and derivative estimation  SPMderivest ... 99
4.5  Nearest-neighbor regression  SPMknnreg ...................... 100
4.6  Median smoothing regression  SPMmesmooreg ................... 102
4.7  Spline regression  SPMspline ................................ 105
4.8  Orthogonal series regression using Legendre polynomials  SPMorthogon ... 106
4.9  Wavelet regression  SPMwavereg .............................. 108
4.10 Squared bias, variance and MASE  SPMsimulmase ............... 111
4.11 Simulated data for MASE  SPMsimulmase ....................... 112
4.12 Nadaraya-Watson kernel regression with cross-validated bandwidth  SPMnadwaest ... 115
4.13 Local linear regression with cross-validated bandwidth  SPMlocpolyest ... 116
4.14 Penalizing functions  SPMpenalize ........................... 117
4.15 Confidence intervals and Nadaraya-Watson kernel regression  SPMengelconf ... 120
4.16 Estimated mean function for DM/USD exchange rates  SPMfxmean ... 122
4.17 Estimated variance function for DM/USD exchange rates  SPMfxvolatility ... 123
4.18 Two-dimensional local linear estimate  SPMtruenadloc ........ 133
6.1  Two link functions ......................................... 170
6.2  The horizontal and the integral approach ................... 182
7.1  GPLM logit fit for migration data  SPMmigmv ................ 201
8.1  Estimated versus true additive functions ................... 217
8.2  Additive component estimates and partial residuals for Boston housing data ... 218
8.3  Additive component estimates and partial residuals for Boston housing data ... 219
8.4  Estimated local linear versus true additive functions ...... 227
8.5  Additive component and derivative estimates for Wisconsin farm data ... 231
8.6  Additive component and derivative estimates for Wisconsin farm data ... 232
8.7  Estimates for interaction terms for Wisconsin farm data .... 233
8.8  Estimates for interaction terms for Wisconsin farm data .... 234
8.9  Performance of MASE ........................................ 237
8.10 Performance of MASE ........................................ 238
8.11 Equivalent kernels for the bivariate Nadaraya-Watson estimator ... 240
8.12 Equivalent kernels for backfitting using univariate Nadaraya-Watson estimators ... 241
8.13 Equivalent kernels for marginal integration based on bivariate Nadaraya-Watson smoothers ... 241
8.14 3D surface estimates for manager data ...................... 243
8.15 Backfitting additive and linear function estimates for manager data ... 244
8.16 Marginal integration estimates and 2σ bands for manager data ... 245
9.1  Estimates of additive components for female labor supply data ... 258
9.2  Estimates of additive components for female labor supply data ... 259
9.3  Density plots for migration data (Sachsen) ................. 266
9.4  Additive curve estimates for age and income ................ 268
9.5  Density plots for unemployment data ........................ 272
9.6  Estimates of additive components for unemployment data ..... 273

List of Tables

1.1  OLS estimation results ........................................ 5
1.2  Example observations ........................................... 7
3.1  Kernel functions .............................................. 41
3.2  δ0 for different kernels ...................................... 59
3.3  Efficiency of kernels ......................................... 61
5.1  Descriptive statistics for migration data ................... 147
5.2  Characteristics of some GLM distributions ................... 155
5.3  Logit coefficients for migration data ....................... 160
6.1  WADE fit of unemployment data ............................... 181
7.1  Descriptive statistics for migration data (Mecklenburg-Vorpommern) ... 200
7.2  Logit and GPLM coefficients for migration data .............. 201
7.3  Observed significance levels for testing GLM versus GPLM .... 205
8.1  MASE for backfitting and marginal integration ............... 239
8.2  Parameter estimates for manager data ........................ 243
9.1  OLS coefficients for female labor supply data ............... 260
9.2  Descriptive statistics for migration data (Sachsen) ......... 267
9.3  Logit and GAPLM coefficients for migration data ............. 267
9.4  Logit and GAPLM coefficients for unemployment data .......... 270

Notation

Abbreviations

cdf      cumulative distribution function
df       degrees of freedom
iff      if and only if
i.i.d.   independent and identically distributed
w.r.t.   with respect to
pdf      probability density function
ADE      average derivative estimator
AM       additive model
AMISE    asymptotic MISE
AMSE     asymptotic MSE
APLM     additive partial linear model
ASE      averaged squared error
ASH      average shifted histogram
CHARN    conditional heteroscedastic autoregressive nonlinear
CV       cross-validation
DM       Deutsche Mark
GAM      generalized additive model
GAPLM    generalized additive partial linear model
GLM      generalized linear model
GPLM     generalized partial linear model
ISE      integrated squared error
IRLS     iteratively reweighted least squares
LR       likelihood ratio
LS       least squares
MASE     mean averaged squared error
MISE     mean integrated squared error
ML       maximum likelihood
MLE      maximum likelihood estimator
MSE      mean squared error
PLM      partial linear model
PMLE     pseudo maximum likelihood estimator
RSS      residual sum of squares
S.D.     standard deviation
S.E.     standard error
SIM      single index model
SLS      semiparametric least squares
USD      US Dollar
WADE     weighted average derivative estimator
WSLS     weighted semiparametric least squares

Scalars, Vectors and Matrices

X, Y              random variables
x, y              scalars (realizations of X, Y)
X1, ..., Xn       random sample of size n
X(1), ..., X(n)   ordered random sample of size n
x1, ..., xn       realizations of X1, ..., Xn
X                 vector of variables
x                 vector (realizations of X)
x0                origin (of histogram)
h                 binwidth or bandwidth
h̃                 auxiliary bandwidth in marginal integration
H                 bandwidth matrix
I                 identity matrix
X                 data or design matrix
Y                 vector of observations Y1, ..., Yn
β                 parameter
β                 parameter vector
e0                first unit vector, i.e. e0 = (1, 0, ..., 0)⊤
ej                (j + 1)th unit vector, i.e. ej = (0, ..., 0, 1, 0, ..., 0)⊤ with the 1 in the (j + 1)th position
1n                vector of ones of length n
μ                 vector of expectations of Y1, ..., Yn in generalized models
η                 vector of index values X1⊤β, ..., Xn⊤β in generalized models
LR                likelihood ratio test statistic
U                 vector of variables (linear part of the model)
T                 vector of continuous variables (nonparametric part of the model)
X−α               random vector of all but the αth component
X−αj              random vector of all but the αth and jth components
S, SP, Sα         smoother matrices
m                 vector of regression values m(X1), ..., m(Xn)
gα                vector of additive component function values gα(X1), ..., gα(Xn)

Matrix algebra

tr(A)      trace of matrix A
diag(A)    diagonal of matrix A
det(A)     determinant of matrix A
rank(A)    rank of matrix A
A⁻¹        inverse of matrix A
‖u‖        norm of vector u, i.e. √(u⊤u)

Functions

log        logarithm (base e)
φ          pdf of the standard normal distribution
Φ          cdf of the standard normal distribution
I          indicator function, i.e. I(A) = 1 if A holds, 0 otherwise
K          kernel function (univariate)
Kh         scaled kernel function, i.e. Kh(u) = K(u/h)/h
K          kernel function (multivariate)
KH         scaled kernel function, i.e. KH(u) = K(H⁻¹u)/det(H)
μ2(K)      second moment of K, i.e. ∫ u² K(u) du
μp(K)      pth moment of K, i.e. ∫ u^p K(u) du
‖K‖²₂      squared L2 norm of K, i.e. ∫ {K(u)}² du
f          probability density function (pdf)
fX         pdf of X
f(x, y)    joint density of X and Y
∇f         gradient vector of f (partial first derivatives)
Hf         Hessian matrix of f (partial second derivatives)
K⋆K        convolution of K, i.e. (K⋆K)(u) = ∫ K(u − v)K(v) dv
w, w̃       weight functions
m          unknown function (to be estimated)
m^(ν)      νth derivative of m (to be estimated)
ℓ, ℓi      log-likelihood, individual log-likelihood
G          known link function
g          unknown link function (to be estimated)
a, b, c    exponential family characteristics in generalized models
V          variance function of Y in generalized models
gα         additive component (to be estimated)
gα^(ν)     νth derivative of gα (to be estimated)
fα         pdf of Xα

Moments

EX               mean value of X
σ² = Var(X)      variance of X, i.e. Var(X) = E(X − EX)²
E(Y|X)           conditional mean of Y given X (random variable)
E(Y|X = x)       conditional mean of Y given X = x (realization of E(Y|X))
E(Y|x)           same as E(Y|X = x)
σ²(x)            conditional variance of Y given X = x (realization of Var(Y|X))
EX1 g(X1, X2)    mean of g(X1, X2) w.r.t. X1 only
med(Y|X)         conditional median of Y given X (random variable)
μ                same as E(Y|X) in generalized models
V(μ)             variance function of Y in generalized models
ψ                nuisance (dispersion) parameter in generalized models
MSEx             MSE at the point x
E(·|Xα)          conditional expectation function given Xα

Distributions

U[0, 1]     uniform distribution on [0, 1]
U[a, b]     uniform distribution on [a, b]
N(0, 1)     standard normal or Gaussian distribution
N(μ, σ²)    normal distribution with mean μ and variance σ²
N(μ, Σ)     multi-dimensional normal distribution with mean μ and covariance matrix Σ
χ²m         χ² distribution with m degrees of freedom
tm          t-distribution with m degrees of freedom

Estimates

β̂          estimated coefficient
β̂          estimated coefficient vector
f̂h         estimated density function
f̂h,−i      estimated density function when leaving out observation i
m̂h         estimated regression function
m̂p,h       estimated regression function using local polynomials of degree p and bandwidth h
m̂p,H       estimated multivariate regression function using local polynomials of degree p and bandwidth matrix H

Convergence

o(·)     a = o(b) iff a/b → 0 as n → ∞ or h → 0
O(·)     a = O(b) iff a/b → constant as n → ∞ or h → 0
op(·)    U = op(V) iff for all ε > 0: P(|U/V| > ε) → 0
Op(·)    U = Op(V) iff for all ε > 0 there exists c > 0 such that P(|U/V| > c) < ε as n is sufficiently large or h is sufficiently small
→a.s.    almost sure convergence
→P       convergence in probability
→L       convergence in distribution
≈        asymptotically equal
∼        asymptotically proportional

Other

ℕ        natural numbers
ℤ        integers
ℝ        real numbers
ℝ^d      d-dimensional real space
∝        proportional
≡        constantly equal
#        number of elements of a set
Bj       jth bin, i.e. [x0 + (j − 1)h, x0 + jh)
mj       bin center of Bj, i.e. mj = x0 + (j − 1/2)h

1
Introduction

1.1 Density Estimation


Consider a continuous random variable and its probability density function
(pdf). The pdf tells you how the random variable is distributed. From the
pdf you can not only calculate statistical characteristics such as mean and variance, but also the probability that this variable will take on values in a certain
interval.
The pdf is, thus, very useful as it characterizes completely the behavior of a random variable. This fact might provide enough motivation to
study nonparametric density estimation. Moreover, nonparametric density
estimates can serve as a building block in nonparametric regression estimation, as regression functions are fully characterized through the distribution
of two (or more) variables.
The following example, which uses data from the Family Expenditure
Survey of each year from 1969 to 1983, gives some illustration of the fact that
density estimation has a substantial application in its own right.
Example 1.1.
Imagine that we have to answer the following question: Is there a change
in the structure of the income distribution during the period from 1969 to
1983? (You may recall that many people argued that the neo-liberal policies
of former Prime Minister Margaret Thatcher promoted income inequality in
the early 1980s.)
To answer this question, we have estimated the distribution of net-income
for each year from 1969 to 1983 both parametrically and nonparametrically.
In parametric estimation of the distribution of income we have followed
standard practice by fitting a log-normal distribution to the data. We employed the method of kernel density estimation (a generalization of the familiar histogram, as we will soon see) to estimate the income distribution


nonparametrically. In the upper graph in Figure 1.1 we have plotted the estimated log-normal densities for each of the 15 years: Note that they are all
very similar. On the other hand the analogous plot of the kernel density estimates show a movement of the net-income mode (the maximum of the den-

Lognormal Density Estimates

81

79

77

75

73

71

Kernel Density Estimates

81

79

77

75

73

71

Figure 1.1. Log-normal density estimates (upper graph) versus kernel density estimates (lower graph) of net-income, U.K. Family Expenditure Survey 196983
SPMfesdensities

1.2 Regression

sity) to the left (Figure 1.1, lower graph). This indicates that the net-income
distribution has in fact changed during this 15 year period.
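The kernel density estimates in Figure 1.1 were computed with the XploRe quantlet SPMfesdensities, which is not reproduced in this sample. As a rough illustration of the two estimation strategies just contrasted, the following Python sketch fits both a parametric log-normal density and a Gaussian-kernel density estimate; the data are simulated stand-ins for one year of net-income observations, and the kernel and bandwidth h = 0.2 are our own illustrative choices.

```python
import numpy as np

def kde(grid, data, h):
    # kernel density estimate fhat(x) = 1/(n*h) * sum_i K((x - X_i)/h),
    # here with a Gaussian kernel K
    u = (grid[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=1) / (len(data) * h)

def lognormal_fit(grid, data):
    # parametric competitor: log-normal density with ML parameter estimates
    mu, sigma = np.log(data).mean(), np.log(data).std()
    return np.exp(-0.5 * ((np.log(grid) - mu) / sigma) ** 2) \
        / (grid * sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
income = rng.lognormal(mean=0.0, sigma=0.5, size=500)  # hypothetical sample
grid = np.linspace(0.05, 4.0, 200)

f_parametric = lognormal_fit(grid, income)  # upper-graph analogue
f_kernel = kde(grid, income, h=0.2)         # lower-graph analogue
```

Repeating such fits year by year and overlaying the resulting curves produces the kind of comparison shown in Figure 1.1.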


1.2 Regression
Let us now consider a typical linear regression problem. We assume that
all of you have been exposed to the linear regression model, where the mean of a dependent variable Y is related to a set of explanatory variables X1, X2, . . . , Xd in the following way:
E(Y|X) = X1 β1 + . . . + Xd βd = X⊤β.    (1.1)

Here E(Y|X) denotes the expectation conditional on the vector X = (X1, X2, . . . , Xd)⊤ and βj, j = 1, 2, . . . , d are unknown coefficients. Defining ε as the deviation of Y from the conditional mean E(Y|X):

ε = Y − E(Y|X)    (1.2)

we can write

Y = X⊤β + ε.    (1.3)

Example 1.2.
To take a specific example, let Y be log wages and consider the explanatory
variables schooling (measured in years), labor market experience (measured as
AGE − SCHOOL − 6) and experience squared. If we assume that, on average,
log wages are linearly related to these explanatory variables then the linear
regression model applies:
E(Y|SCHOOL, EXP) = β0 + β1 SCHOOL + β2 EXP + β3 EXP².    (1.4)

Note that we have included an intercept (β0) in the model.

The model of equation (1.4) has played an important role in empirical labor economics and is often called the human capital earnings equation (or Mincer earnings equation, to honor Jacob Mincer, a pioneer of this line of research). From the perspective of this course, an important characteristic of equation (1.4) is its parametric form: the shape of the regression function is governed by the unknown parameters βj, j = 1, 2, . . . , d. That is, all we have to do in order to determine the linear regression function (1.4) is to estimate the unknown parameters βj. On the other hand, the parametric regression function
of equation (1.4) a priori rules out many conceivable nonlinear relationships
between Y and X.

Let m(SCHOOL, EXP) be the true, unknown regression function of log wages on schooling and experience. That is,
E(Y|SCHOOL, EXP) = m(SCHOOL, EXP).    (1.5)

Suppose that you were assigned the following task: estimate the regression of
log wages on schooling and experience as accurately as possible in one trial.
That is, you are not allowed to change your model if you find that the initial
specification does not fit the data well. Of course, you could just go ahead
and assume, as we have done above, that the regression you are supposed to
estimate has the form specified in (1.4). That is, you assume that
m(SCHOOL, EXP) = β1 + β2 SCHOOL + β3 EXP + β4 EXP²,
and estimate the unknown parameters by the method of ordinary least
squares, for example. But maybe you would not fit this parametric model
if we told you that there are ways of estimating the regression function without having to make any prior assumptions about its functional form (except
that it is a smooth function). Remember that you have just one trial and if
the form of m(SCHOOL, EXP) is very different from (1.4) then estimating the
parametric model may give you very inaccurate results.
It turns out that there are indeed ways of estimating m(·) that merely assume that m(·) is a smooth function. These methods are called nonparametric
regression estimators and part of this course will be devoted to studying
nonparametric regression.
Nonparametric regression estimators are very flexible but their statistical precision decreases greatly if we include several explanatory variables
in the model. The latter caveat has been appropriately termed the curse of
dimensionality. Consequently, researchers have tried to develop models and
estimators which offer more flexibility than standard parametric regression
but overcome the curse of dimensionality by employing some form of dimension reduction. Such methods usually combine features of parametric and
nonparametric techniques. As a consequence, they are usually referred to as
semiparametric methods. Further advantages of semiparametric methods are
the possible inclusion of categorical variables (which can often only be included in a parametric way), an easy (economic) interpretation of the results,
and the possibility of a partial specification of a model.
In the following three sections we use the earnings equation and other examples to illustrate the distinctions between parametric, nonparametric and
semiparametric regression and we certainly hope that this will whet your
appetite for the material covered in this course.


1.2.1 Parametric Regression


Versions of the human capital earnings equation of (1.4) have probably been
estimated by more researchers than any other model of empirical economics.
For a detailed nontechnical and well-written discussion see Berndt (1991,
Chapter 5). Here, we want to point out that:
• Under certain simplifying assumptions, β2 accurately measures the rate of return to schooling.
• Human capital theory suggests a concave wage-experience profile: rapid human capital accumulation in the early stage of one's labor market career, with rising wages that peak somewhere during midlife and decline thereafter as hours worked and the incentive to invest in human capital decrease. This is the reason for including both EXP and EXP² in the model. In order to get a profile like the one envisaged by theory, the estimated value of β3 should be positive and that of β4 should be negative.

Table 1.1. Results from OLS estimation for Example 1.2

Dependent Variable: Log Wages
Variable    Coefficients    S.E.      t-values
SCHOOL          0.0898      0.0083     10.788
EXP             0.0349      0.0056      6.185
EXP²           −0.0005      0.0001     −4.307
constant        0.5202      0.1236      4.209
R² = 0.24, sample size n = 534

We have estimated the coefficients of (1.4) by ordinary least squares (OLS), using a subsample of the 1985 Current Population Survey (CPS) provided by Berndt (1991). The results are given in Table 1.1.
The estimated rate of return to schooling is roughly 9%. Note that the estimated coefficients of EXP and EXP² have the signs predicted by human capital theory. The wage-schooling profile (a plot of SCHOOL vs. 0.0898 SCHOOL) and the wage-experience profile (a plot of EXP vs. 0.0349 EXP − 0.0005 EXP²) are given in the left and right graphs of Figure 1.2, respectively.
The estimated wage-schooling relation is linear by default since we did not include SCHOOL², say, to allow for some kind of curvature within the parametric framework. By looking at Figure 1.2 it is clear that the estimated coefficients of EXP and EXP² imply the kind of concave wage-earnings profile predicted by human capital theory.
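Table 1.1 was produced with the quantlet SPMcps85lin; the underlying computation is a standard OLS fit. The following Python sketch shows the same computation on simulated stand-ins for the CPS variables (the actual subsample is not contained in this e-book sample; all names and the simulated coefficient values are illustrative only).

```python
import numpy as np

def ols(y, X):
    # beta_hat = (X'X)^{-1} X'y with conventional standard errors
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    n, d = X.shape
    sigma2 = resid @ resid / (n - d)        # estimated error variance
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    return beta, se

rng = np.random.default_rng(1)
n = 534
school = rng.integers(6, 18, n).astype(float)
exper = rng.integers(0, 45, n).astype(float)
log_wage = (0.5202 + 0.0898 * school + 0.0349 * exper
            - 0.0005 * exper**2 + rng.normal(0.0, 0.4, n))

# regressors of equation (1.4): SCHOOL, EXP, EXP^2 and a constant
X = np.column_stack([school, exper, exper**2, np.ones(n)])
beta_hat, se = ols(log_wage, X)
t_values = beta_hat / se   # same layout as Table 1.1
```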

We have also plotted a graph (Figure 1.3) of the estimated regression surface, i.e. a plot that has the values of the estimated regression function (obtained by evaluating 0.0898 SCHOOL + 0.0349 EXP − 0.0005 EXP² at the observed combinations of schooling and experience) on the vertical axis and schooling and experience on the horizontal axes.

Figure 1.2. Wage-schooling (left) and wage-experience (right) profile. SPMcps85lin

Figure 1.3. Parametrically estimated regression function. SPMcps85lin


All of the element curves of the surface appear similar to Figure 1.2 (right)
in the direction of experience and like Figure 1.2 (left) in the direction of
schooling. To gain a better understanding of the three-dimensional picture
we have plotted a single wage-experience profile in three dimensions, fixing
schooling at 12 years. Hence, Figure 1.3 highlights the wage-earnings profile
for high school graduates.
1.2.2 Nonparametric Regression
Suppose that we want to estimate
E(Y|SCHOOL, EXP) = m(SCHOOL, EXP)    (1.6)

and we are only willing to assume that m(·) is a smooth function. Nonparametric regression estimators produce an estimate of m(·) at an arbitrary
point (SCHOOL = s, EXP = e) by locally weighted averaging over log wages
(here s and e denote two arbitrary values that SCHOOL and EXP may take
on, such as 12 and 15). Locally weighting means that those values of log
wages will be higher weighted for which the corresponding observations of
EXP and SCHOOL are close to the point (s, e). Let us illustrate this principle
with an example. Let s = 8 and e = 7 and suppose you can use the four
observations given in Table 1.2 to estimate m(8, 7):
Table 1.2. Example observations

Observation    log(WAGES)    SCHOOL    EXP
1                 7.31           8       8
2                 7.6           16       1
3                 7.4            8       6
4                 7.8           12       2

In nonparametric regression m(8, 7) is estimated by averaging over the


observed values of the dependent variable log wage. But not all values will
be given the same weight. In our example, observation 1 will get the most
weight since it has values of schooling and experience that are very close to
the point where we want to estimate. This makes a lot of sense: if we want
to estimate mean log wages for individuals with 8 years of schooling and 7
years of experience then the observed log wage of a person with 8 years of
schooling and 8 years of experience seems to be much more informative than
the observed log wage of a person with 12 years of schooling and 2 years of
experience.
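This kind of locally weighted averaging is exactly what a kernel regression estimator does. As a minimal sketch, the following Python code evaluates a Nadaraya-Watson type estimate of m(8, 7) from the four observations of Table 1.2; the product Gaussian kernel and the bandwidths are our own choices, made only to reproduce the weighting pattern discussed above.

```python
import numpy as np

def nadaraya_watson(point, X, y, h):
    # locally weighted average: kernel weights decay with the scaled
    # distance of each observation X_i from the evaluation point
    u = (X - point) / h                    # componentwise bandwidth scaling
    w = np.exp(-0.5 * (u**2).sum(axis=1))  # product Gaussian kernel weights
    return (w @ y) / w.sum()

# the four observations of Table 1.2 (columns: SCHOOL, EXP)
X = np.array([[8.0, 8.0], [16.0, 1.0], [8.0, 6.0], [12.0, 2.0]])
y = np.array([7.31, 7.6, 7.4, 7.8])        # log(WAGES)

# estimate m(8, 7); observations 1 and 3 receive the largest weights,
# while the weight of observation 2 is practically zero
m_hat = nadaraya_watson(np.array([8.0, 7.0]), X, y,
                        h=np.array([2.0, 2.0]))
```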

Figure 1.4. Nonparametrically estimated regression function. SPMcps85reg

Consequently, any reasonable weighting scheme will give more weight to


7.31 than to 7.8 when we average over observed log wages. The exact method
of weighting is determined by a weight function that makes precise the idea
of weighting nearby observations more heavily. In fact, the weight function
might be such that observations that are too far away get zero weight. In our
example, observation 2 has values of experience and schooling that are so
far away from 8 years of schooling and 7 years of experience that a weight
function might assign zero value to the corresponding value of log wages
(7.6). It is in this sense that the averaging is local. In Figure 1.4, the surface of nonparametrically estimated values of m(·) is shown. Here, a so-called kernel estimator has been used.
As long as we are dealing with only one regressor, the results of estimating a regression function nonparametrically can easily be displayed in a
graph. The following example illustrates this. It relates net-income data, as
we considered in Example 1.1, to a second variable that measures household
expenditure.
Example 1.3.
Consider for instance the dependence of food expenditure on net-income.
Figure 1.5 shows the so-called Engel curve (after the German economist Engel) of net-income and food share estimated using data from the 1973 Family


Expenditure Survey of roughly 7000 British households. The figure supports


the theory of Engel who postulated in 1857:

"... je ärmer eine Familie ist, einen desto größeren Antheil von der Gesammtausgabe muß zur Beschaffung der Nahrung aufgewendet werden ..." (The poorer a family, the bigger the share of total expenditure that has to be used for food.)


Figure 1.5. Engel curve, U.K. Family Expenditure Survey 1973. SPMengelcurve2

1.2.3 Semiparametric Regression


To illustrate semiparametric regression let us return to the human capital
earnings function of Example 1.2. Suppose the regression function of log
wages on schooling and experience has the following shape:
E(Y|SCHOOL, EXP) = c + g1(SCHOOL) + g2(EXP).    (1.7)

Here g1(·) and g2(·) are two unknown, smooth functions and c is an unknown parameter. Note that this model combines the simple additive structure of the parametric regression model (referred to hereafter as the additive model) with the flexibility of the nonparametric approach. This is done by not
imposing any strong shape restrictions on the functions that determine how
schooling and experience influence the mean regression of log wages. The
procedure employed to estimate this model will be explained in greater detail later in this course. It should be clear, however, that in order to estimate
the unknown functions g1(·) and g2(·), nonparametric regression estimators
have to be employed. That is, when estimating semiparametric models we
usually have to use nonparametric techniques. Hence, we will have to spend
a substantial amount of time studying nonparametric estimation if we want
to understand how to estimate semiparametric models. For now, we want to
focus on the results and compare them with the parametric fit.
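One classical way to estimate (1.7), treated in detail in Chapter 8, is backfitting: cycle through the components and smooth the partial residuals against one regressor at a time. The sketch below illustrates the idea under our own simplifying assumptions (a univariate Nadaraya-Watson smoother, fixed bandwidths, simulated data); it is not necessarily the algorithm behind SPMcps85add.

```python
import numpy as np

def smooth(x, r, h):
    # Nadaraya-Watson smooth of responses r on x, at the sample points
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(y, x1, x2, h1, h2, iters=10):
    # fit y = c + g1(x1) + g2(x2) by iteratively smoothing partial residuals
    c, g1, g2 = y.mean(), np.zeros_like(y), np.zeros_like(y)
    for _ in range(iters):
        g1 = smooth(x1, y - c - g2, h1)
        g1 -= g1.mean()          # centering identifies the intercept c
        g2 = smooth(x2, y - c - g1, h2)
        g2 -= g2.mean()
    return c, g1, g2

rng = np.random.default_rng(2)
school = rng.uniform(6, 18, 300)
exper = rng.uniform(0, 45, 300)
y = (1.0 + 0.08 * school + 0.04 * exper - 0.0006 * exper**2
     + rng.normal(0.0, 0.3, 300))
c, g1_hat, g2_hat = backfit(y, school, exper, h1=1.5, h2=4.0)
```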

Figure 1.6. Additive model fit vs. parametric fit, wage-schooling (left) and wage-experience (right). SPMcps85add

In Figure 1.6 the parametrically estimated wage-schooling and wage-experience profiles are shown as thin lines, whereas the estimates of g1(·) and g2(·) are displayed as thick lines with bullets. The parametrically estimated wage-schooling and wage-experience profiles show a good deal of similarity with the estimates of g1(·) and g2(·), except for the shape of the curves at extremal values. The good agreement between parametric estimates and additive model fit is also visible from the plot of the estimated regression surface,
which is shown in Figure 1.7.
Hence, we may conclude that in this specific example the parametric
model is supported by the more flexible nonparametric and semiparametric methods. This potential usefulness of nonparametric and semiparametric techniques for checking the adequacy of parametric models will be illustrated in several other instances in the latter part of this course.

Figure 1.7. Surface plot for the additive model. SPMcps85add

Take a closer look at (1.6) and (1.7). Observe that in (1.6) we have to estimate one unknown function of two variables whereas in (1.7) we have to
estimate two unknown functions, each a function of one variable. It is in this
sense that we have reduced the dimensionality of the estimation problem.
Whereas all researchers might agree that additive models like the one in (1.7)
are achieving a dimension reduction over completely nonparametric regression, they may not agree to call (1.7) a semiparametric model, as there are no
parameters to estimate (except for the intercept parameter c). In the following example we confront a standard parametric model with a more flexible
model that, as you will see, truly deserves to be called semiparametric.
Example 1.4.
In the earnings-function example, the dependent variable log wages can
principally take on any positive value, i.e. the set of values Y is infinite. This
may not always be the case. For example, consider the decision of an East German resident to move to Western Germany and denote the decision variable by Y. In this case, the dependent variable can take on only two values:

    Y = 1  if the person can imagine moving to the west,
    Y = 0  otherwise.

We will refer to this as a binary response later on.


In Example 1.2 we tried to estimate the effect of a person's education and work experience on the log wage earned. Now, say we want to find out how these two variables affect the decision of an East German resident to move west, i.e. we want to know E(Y|x) where x is a (d × 1) vector containing all d variables considered to be influential to the migration decision. Since Y is a binary variable (i.e. a Bernoulli distributed variable), we have that

E(Y|X) = P(Y = 1|X).    (1.8)

Thus, the regression of Y on X can be expressed as the probability that a


randomly sampled person from the East will migrate to the West, given this
person's characteristics collected in the vector X. Standard models for P(Y = 1|X) assume that this probability depends on X as follows:

P(Y = 1|X) = G(X⊤β),    (1.9)

where X⊤β is a linear combination of all components of X. It aggregates the multiple characteristics of a person into one number (therefore called the index function or simply the index), where β is an unknown vector of coefficients. G(·) denotes any continuous function that maps the real line to the range [0, 1]. G(·) is also called the link function, since it links the index X⊤β to the conditional expectation E(Y|X).
In the context of this lecture, the crucial question is precisely what parametric form these two functions take or, more generally, whether they will
take any parametric form at all. For now we want to compare two models: one that assumes that G(·) is of a known parametric form and one that allows G(·) to be an unknown smooth function.
One of the most widely used fully parametric models applied to the case of binary dependent variables is the logit model. The logit model assumes that G(X⊤β) is the (standard) logistic cumulative distribution function (cdf) for all X. Hence, in this case

E(Y|X) = P(Y = 1|X) = 1 / {1 + exp(−X⊤β)}.    (1.10)
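In practice the coefficient vector β of the logit model is estimated by maximum likelihood. As a minimal sketch (not the estimation routine of SPMlogit), the following Python code runs Newton-Raphson iterations on the logit log-likelihood for simulated data; all variable names and coefficient values are illustrative and unrelated to Example 1.5 below.

```python
import numpy as np

def logit_mle(X, y, iters=25):
    # Newton-Raphson for l(b) = sum_i y_i*log(p_i) + (1-y_i)*log(1-p_i)
    # with p_i = G(x_i'b) and G the standard logistic cdf, cf. (1.10)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                       # score vector
        hess = -(X * (p * (1 - p))[:, None]).T @ X  # Hessian
        beta = beta - np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(3)
n = 402
inc = rng.normal(2.0, 0.5, n)                  # hypothetical income (thousands)
age = rng.uniform(18.0, 65.0, n)               # hypothetical age
X = np.column_stack([np.ones(n), inc, age])
beta_true = np.array([1.0, 0.5, -0.05])        # illustrative values only
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=n) < p).astype(float)

beta_hat = logit_mle(X, y)                     # approximates beta_true
```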

Example 1.5.
Using a logit model, Burda (1993) estimated the effect of various explanatory variables on the migration decision of East German residents. The data for fitting this model were drawn from a panel study of approximately 4,000 East German households in spring 1991. We use a subsample of n = 402 observations from the German state Mecklenburg-Vorpommern here. Due to space constraints, we merely report the estimated coefficients of three components of the index X^⊤β, as we will refer to these estimates below:

$$ \widehat{\beta}_0 + \widehat{\beta}_1\,\mathrm{INC} + \widehat{\beta}_2\,\mathrm{AGE} = 2.2905 + 0.0004971\,\mathrm{INC} - 0.45499\,\mathrm{AGE} \qquad (1.11) $$


INC and AGE are used to abbreviate the household income and age of the
individual.
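To make the mechanics of (1.10) and (1.11) concrete, here is a minimal Python sketch (not part of the original XploRe material) that evaluates the logit probability for hypothetical values of INC and AGE. Since (1.11) reports only three components of the full index, the resulting numbers are purely illustrative.

import numpy as np

# Coefficients reported in (1.11); INC in DM, AGE in years.
b0, b_inc, b_age = 2.2905, 0.0004971, -0.45499

def logit_prob(inc, age):
    """P(Y = 1 | INC, AGE) under the logit model (1.10)."""
    index = b0 + b_inc * inc + b_age * age
    return 1.0 / (1.0 + np.exp(-index))

print(logit_prob(2000.0, 30.0))
print(logit_prob(2000.0, 50.0))  # a higher age lowers the index and hence the probability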

Figure 1.8 gives a graphical presentation of the results. Each observation is represented by a +. As mentioned above, the characteristics of each person are transformed into an index (to be read off the horizontal axis), while the dependent variable takes on one of two values, Y = 0 or Y = 1 (to be read off the vertical axis). The curve plots estimates of P(Y = 1|X), the probability of Y = 1, as a function of X^⊤β. Note that the estimates of P(Y = 1|X), by assumption, are simply points on the cdf of a standard logistic distribution.

Figure 1.8. Logit fit: responses and estimated link function plotted against the index.    SPMlogit

We shall continue with Example 1.4 below, but let us pause for a moment to consider the following substantial problem: the logit model, like other parametric models, is based on rather strong functional form (linear index) and distributional assumptions, neither of which is usually justified by economic theory.
The first question to ask before developing alternatives to standard models like the logit model is: what are the consequences of estimating a logit model if one or several of these assumptions are violated? Note that this is a crucial question: if our parametric estimates are largely unaffected by model


violations, then there is no need to develop and apply semiparametric models and estimators. Why would anyone put time and effort into a project that promises little return?
One can employ the tools of asymptotic statistical theory to show that violating the assumptions of the logit model leads to inconsistent parameter estimates. That is, if the sample size goes to infinity, the logit maximum-likelihood estimator (logit-MLE) does not converge to the true parameter value in probability. While it doesn't converge to the true parameter value, it does, however, converge to some other value. If this false value is close enough to the true parameter value, then we may not care very much about this inconsistency.
Consistency is an asymptotic criterion for the performance of an estimator. That is, it looks at the properties of the estimator if the sample size grows without limit. Yet, in practice, we are dealing with finite samples. Unfortunately, the finite-sample properties of the logit maximum-likelihood estimator cannot be derived analytically. Hence, we have to rely on simulations to collect evidence of its small-sample performance in the presence of misspecification. We conducted a small simulation in the context of Example 1.4, to which we now return.

Figure 1.9. Link function of the homoscedastic logit model (thin line) versus the link function of the heteroscedastic model (solid line), plotted against the index.    SPMtruelogit


Example 1.6.
Following Horowitz (1993) we generated data according to a heteroscedastic model with two explanatory variables, INC and AGE. Here we considered heteroscedasticity of the form

$$ \mathrm{Var}(\varepsilon \mid X = x) = \frac{1}{4}\left\{ 1 + (x^\top \beta)^2 \right\}^2 \mathrm{Var}(\eta), $$

where η has a (standard) logistic distribution. To give you an impression of how dramatically the true heteroscedastic model differs from the supposed homoscedastic logit model, we plotted the link functions of the two models as shown in Figure 1.9.

To add a sense of realism to the simulation, we set the coefficients of these variables equal to the estimates reported in (1.11). Note that the standard logit model introduced above does not allow for heteroscedasticity. Hence, if we apply the standard logit maximum-likelihood estimator to the simulated data, we are estimating under misspecification. We performed 250 replications of this estimation experiment, using the full data set with 402 observations each time. As the estimated coefficients are only identified up to scale, we compared the ratio of the true coefficients, β_INC/β_AGE, to the ratio of their estimated logit-MLE counterparts, β̂_INC/β̂_AGE. Figure 1.10 shows the sampling distribution of the logit-MLE coefficient ratios, along with the true value (vertical line).
As we have subtracted the true value from each estimated ratio and divided this difference by the true ratio's absolute value, the true ratio is standardized to zero and differences on the horizontal axis can be interpreted as percentage deviations from the truth. In Figure 1.10, the sampling distribution of the estimated ratios is centered around −0.11, which corresponds to a percentage deviation from the truth of −11%. Hence, the logit-MLE underestimates the true value.
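The logic of this Monte Carlo experiment can be replicated in a few lines of Python. The sketch below uses simplified, illustrative coefficients and a standard-normal covariate design rather than the original data (both are assumptions, not the book's exact setup), but it reproduces the mechanics: generate heteroscedastic binary data, fit the misspecified homoscedastic logit by maximum likelihood, and record the standardized deviation of the coefficient ratio.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_rep = 402, 250
beta = np.array([0.5, 1.0, -1.0])      # illustrative values, not the estimates in (1.11)
true_ratio = beta[1] / beta[2]

deviations = []
for _ in range(n_rep):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    idx = X @ beta
    # heteroscedastic logistic errors, scale proportional to (1 + idx^2)/2
    eps = 0.5 * (1 + idx**2) * rng.logistic(size=n)
    y = (idx - eps > 0).astype(float)
    b = sm.Logit(y, X).fit(disp=0).params   # misspecified homoscedastic logit
    deviations.append((b[1] / b[2] - true_ratio) / abs(true_ratio))

print(np.mean(deviations))   # a systematic deviation from 0 reveals the bias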
Now that we have seen how serious the consequences of model misspecification can be, we might want to learn about semiparametric estimators that have desirable properties under more general assumptions than their parametric counterparts. One way to generalize the logit model is the so-called single index model (SIM), which keeps the linear form of the index X^⊤β but allows the function G(•) in (1.9) to be an arbitrary smooth function g(•) (not necessarily a distribution function) that has to be estimated from the data:

$$ E(Y|X) = g(X^\top \beta). \qquad (1.12) $$

Estimation of the single index model (1.12) proceeds in two steps:

• Firstly, the coefficient vector β has to be estimated. Methods to calculate the coefficients for discrete and continuous variables will be covered in depth later.


Figure 1.10. Sampling distribution of the ratio of the estimated coefficients (density estimate and mean value indicated as *) and the ratio's true value (vertical line).    SPMsimulogit

• Secondly, we have to estimate the unknown link function g(•) by nonparametrically regressing the dependent variable Y on the fitted index X^⊤β̂, where β̂ is the coefficient vector we estimated in the first step. To do this, we use again a nonparametric estimator, the kernel estimator we mentioned briefly above, as sketched below.
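The second step can be sketched in a few lines of Python: a Nadaraya-Watson estimator (one possible kernel estimator; the original SPM quantlets may differ in detail) regressing Y on the fitted index.

import numpy as np

def nw_link(index_hat, y, grid, h):
    """Nadaraya-Watson estimate of the link g, evaluated on `grid`,
    from a regression of y on the fitted index (step 2 of the SIM)."""
    u = (grid[:, None] - index_hat[None, :]) / h
    K = np.exp(-0.5 * u**2)          # Gaussian kernel (unnormalized; constants cancel)
    return (K @ y) / K.sum(axis=1)

# Illustrative usage with simulated data standing in for X @ beta_hat:
rng = np.random.default_rng(1)
index_hat = rng.normal(size=200)
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-index_hat))).astype(float)
grid = np.linspace(-3, 3, 61)
g_hat = nw_link(index_hat, y, grid, h=0.7)   # h = 0.7 index units, cf. Example 1.7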
Example 1.7.
Let us consider what happens if we use β̂ from the logit fit and estimate the link function nonparametrically. Figure 1.11 shows this estimated link function. As before, the position of a + sign represents at the same time the values of X^⊤β̂ and Y of a particular observation, while the curve depicts the estimated link function.

One additional remark should be made here: as you will soon learn, the shape of the estimated link function (the curve) varies with the so-called bandwidth, a parameter central in nonparametric function estimation. Thus, there is no unique estimate of the link function, and it is a crucial (and difficult) problem of nonparametric regression to find the best bandwidth and thus the optimal estimate. Fortunately, there are methods to select an appropriate bandwidth.

Figure 1.11. Single index versus logit model: responses and estimated link function plotted against the index.    SPMsim

Here, we have chosen h = 0.7 index units for the bandwidth. For comparison, the shapes of both the single index (solid line) and the logit (dashed line) link functions are shown in Figure 1.11. Even though they are not identical, they look rather similar.


Summary

• Parametric models are fully determined up to a parameter (vector). The fitted models can easily be interpreted and estimated accurately if the underlying assumptions are correct. If, however, they are violated, then parametric estimates may be inconsistent and give a misleading picture of the regression relationship.

• Nonparametric models avoid restrictive assumptions on the functional form of the regression function m. However, they may be difficult to interpret and yield inaccurate estimates if the number of regressors is large.

• Semiparametric models combine components of parametric and nonparametric models, keeping the easy interpretability of the former and retaining some of the flexibility of the latter.

5 Semiparametric and Generalized Regression Models

In the previous part of this book we found the curse of dimensionality to be one of the major problems that arises when using nonparametric multivariate regression techniques. For the practitioner, a further problem is that for more than two regressors, graphical illustration or interpretation of the results is hardly ever possible. Truly multivariate regression models are often far too flexible and general for making detailed inference.

5.1 Dimension Reduction

Researchers have looked for possible remedies, and a lot of effort has been allocated to developing methods which reduce the complexity of high dimensional regression problems. This refers to the reduction of dimensionality as well as the allowance for partly parametric modeling. Not surprisingly, one follows the other. The resulting models can be grouped together as so-called semiparametric models.
All models that we will study in the following chapters can be motivated as generalizations of well-known parametric models, mainly of the linear model

$$ E(Y|X) = m(X) = X^\top \beta $$

or its generalized version

$$ E(Y|X) = m(X) = G\{X^\top \beta\}. \qquad (5.1) $$

Here G denotes a known function, X is the d-dimensional vector of regressors, and β is a coefficient vector that is to be estimated from observations for Y and X.


Let us take a closer look at model (5.1). This model is known as the generalized linear model. Its use and estimation are extensively treated in McCullagh & Nelder (1989). Here we give only some selected motivating examples.
What is the reason for introducing this function G, called the link? (Note that other authors call its inverse G^{-1} the link.) Clearly, if G is the identity we are back in the classical linear model. As a first alternative, let us consider a quite common approach for investigating growth models. Here, the model is often assumed to be multiplicative instead of additive, i.e.
$$ Y = \prod_{j=1}^{d} X_j^{\beta_j} \cdot \varepsilon, \qquad E\{\log(\varepsilon)\} = 0, \qquad (5.2) $$

in contrast to

$$ Y = \prod_{j=1}^{d} X_j^{\beta_j} + \xi, \qquad E(\xi) = 0. \qquad (5.3) $$

Depending on whether we have multiplicative errors ε or additive errors ξ, we can transform model (5.2) to

$$ E\{\log(Y) \mid X\} = \sum_{j=1}^{d} \beta_j \log(X_j) \qquad (5.4) $$

and model (5.3) to

$$ E(Y \mid X) = \exp\left\{ \sum_{j=1}^{d} \beta_j \log(X_j) \right\}. \qquad (5.5) $$

Considering now log(X_j) as the regressors instead of X_j, equation (5.5) is equivalent to (5.1) with G(•) = exp(•). Equation (5.4), however, is a transformed model; see the bibliographic notes for references on this model family.
The most common cases in which link functions are used are binary responses (Y ∈ {0, 1}), multicategorical responses (Y ∈ {0, 1, . . . , J}), and count data (Y ∼ Poisson). For the binary case, let us introduce an example that we will study in more detail in Chapters 7 and 9.
Example 5.1.
Imagine we are interested in possible determinants of the migration decision of East Germans to leave the East for West Germany. Think of Y* as being the net utility from migrating from the eastern part of Germany to the western part. Utility itself is not observable, but we can observe characteristics of the decision makers and the alternatives that affect utility. As Y* is not observable, it is called a latent variable. Let the observable characteristics be summarized in a vector X.


Table 5.1. Descriptive statistics for migration data, n = 3235

                                         Yes     No    (in %)
Y   MIGRATION INTENTION                 38.5   61.5
X1  FAMILY/FRIENDS IN WEST              85.6   11.2
X2  UNEMPLOYED/JOB LOSS CERTAIN         19.7   78.9
X3  CITY SIZE 10,000-100,000            29.3   64.2
X4  FEMALE                              51.1   49.8

                                         Min    Max     Mean     S.D.
X5  AGE (in years)                        18     65    39.84    12.61
X6  HOUSEHOLD INCOME (in DM)             200   4000  2194.30   752.45

This vector X may contain variables such as education, age, sex and other individual characteristics. A selection of such characteristics is shown in Table 5.1.

In Example 5.1, we hope that the vector of regressors X captures the variables that systematically affect each person's utility, whereas unobserved or random influences are absorbed by the term ε. Suppose further that the components of X influence net utility through a multivariate function v(•) and that the error term is additive. Then the latent-variable model is given by

$$ Y^* = v(X) - \varepsilon \quad \text{and} \quad Y = \begin{cases} 1 & \text{if } Y^* > 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (5.6) $$

Hence, what we really observe is the binary variable Y that takes on the value 1 if net utility is positive (the person intends to migrate) and 0 otherwise (the person intends to stay). Then some calculations lead to

$$ P(Y = 1 \mid X = x) = E(Y \mid X = x) = G_{\varepsilon|x}\{v(x)\} \qquad (5.7) $$

with G_{ε|x} being the cdf of ε conditional on x.
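A short simulation makes (5.6) and (5.7) tangible. The sketch below (an illustration under assumptions, with a hypothetical linear index) draws logistic errors independent of X and checks that the empirical choice probability matches G_ε{v(x)}.

import numpy as np

rng = np.random.default_rng(3)

def v(x):                              # hypothetical index function, cf. (5.8)
    return 0.5 + 1.2 * x

x0 = 0.8
eps = rng.logistic(size=100_000)       # epsilon independent of X, with logistic cdf
y = (v(x0) - eps > 0).astype(float)    # the observed binary response, cf. (5.6)

G = lambda u: 1 / (1 + np.exp(-u))     # logistic cdf
print(y.mean(), G(v(x0)))              # empirical P(Y=1|x0) vs. G{v(x0)}, cf. (5.7)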


Recall that standard parametric models assume that ε is distributed independently of X with known distribution function G_{ε|x} = G_ε, and that the index v(•) has the following simple form:

$$ v(x) = \beta_0 + x^\top \beta. \qquad (5.8) $$

The most popular distributional assumptions regarding the error are the normal and the logistic ones, leading to the so-called probit or logit models with G(u) = Φ(u) (Gaussian cdf), respectively G(u) = exp(u)/{1 + exp(u)}. We will learn how to estimate the coefficients β₀ and β in Section 5.2.
The binary choice model can be easily extended to the multicategorical case, which is usually called the discrete choice model. We will not discuss extensions for multicategorical responses here. Some references for these models are mentioned in the bibliographic notes.


Several approaches have been proposed to reduce dimensionality or to generalize parametric regression models in order to allow for nonparametric relationships. Here, we state three different approaches:

• variable selection in nonparametric regression,
• generalization of (5.1) to a nonparametric link function,
• generalization of (5.1) to a semi- or nonparametric index,

which are discussed in more detail below.
5.1.1 Variable Selection in Nonparametric Regression

The intention of variable selection is to choose an appropriate subset of variables, X_r = (X_{j1}, . . . , X_{jr})^⊤ ⊆ X = (X_1, . . . , X_d)^⊤, from the set of all variables that could potentially enter the regression. Of course, the selection of the variables could be determined by the particular problem at hand, i.e. we choose the variables according to insights provided by some underlying economic theory. This approach, however, does not really solve the statistical side of our modeling process. The curse of dimensionality could lead us to keep the number of variables as low as possible. On the other hand, fewer variables could in turn reduce the explanatory power of the model. Thus, after having chosen a set of variables on theoretical grounds in a first step, we still do not know how many and, more importantly, which of these variables will lead to optimal regression results. Therefore, a variable selection method is needed that uses a statistical selection criterion.
Vieu (1994) has proposed to use the integrated squared error (ISE) to measure the quality of a given subset of variables. In theory, a subset of variables X_r^{opt} is defined to be an optimal subset if it minimizes the integrated squared error:

$$ ISE(X_r^{opt}) = \min_{X_r \subseteq X} ISE(X_r). $$

In practice, the ISE is replaced by its sample analog, the multivariate analog of the cross-validation function (3.38). After the variables have been selected, the conditional expectation of Y given X_r is calculated by some standard nonparametric multivariate regression technique such as the kernel regression estimator. A sketch of this selection strategy is given below.
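As an illustration of how such a criterion can be computed, here is a hedged Python sketch: a leave-one-out cross-validation score for a multivariate Nadaraya-Watson estimator, evaluated over all candidate subsets. The common bandwidth h is a deliberate simplification; in practice it would be selected per subset.

import numpy as np
from itertools import combinations

def loo_cv(X, y, h=0.5):
    """Leave-one-out CV score for multivariate Nadaraya-Watson regression,
    the sample analog of the ISE criterion."""
    u = (X[:, None, :] - X[None, :, :]) / h
    K = np.exp(-0.5 * (u**2).sum(axis=2))
    np.fill_diagonal(K, 0.0)                 # leave the i-th observation out
    m_hat = (K @ y) / K.sum(axis=1)
    return np.mean((y - m_hat) ** 2)

rng = np.random.default_rng(2)
n, d = 150, 3
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)    # only the first variable matters

scores = {S: loo_cv(X[:, list(S)], y)
          for r in (1, 2, 3) for S in combinations(range(d), r)}
print(min(scores, key=scores.get))                # ideally selects the subset (0,)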
5.1.2 Nonparametric Link Function

Index models play an important role in econometrics. An index is a summary of different variables in one number, e.g. the price index, the growth index, or the cost-of-living index. It is clear that by summarizing all the information contained in the variables X_1, . . . , X_d in one single index term we greatly reduce the dimensionality of a problem. Models based on such an index are known as single index models (SIM). In particular we will discuss single index models of the following form:

$$ E(Y|X) = m(X) = g\{v_\beta(X)\}, \qquad (5.9) $$

where g(•) is an unknown link function and v_β(•) an index function which is known up to the parameter β. The estimation can be carried out in two steps. First, we estimate β. Then, using the index values for our observations, we can estimate g by nonparametric regression. Note that estimating g(•) by regressing Y on v_β̂(X) is only a one-dimensional regression problem.
Obviously, (5.9) generalizes (5.7) in that we do not assume the link function G to be known. For that purpose we replaced G by g to emphasize that the link function needs to be estimated. Notice that often the general index function v_β(X) is replaced by the linear index X^⊤β. Equations (5.5) and (5.6) together with (5.8) give examples of such linear index functions.
5.1.3 Semi- or Nonparametric Index

In many applications a canonical partitioning of the explanatory variables exists. In particular, if there are categorical or discrete explanatory variables we may want to keep them separate from the other design variables. Note that only the continuous variables in the nonparametric part of the model cause the curse of dimensionality (Delgado & Mora, 1995). In the following chapters we will study the following models:
Additive Model (AM)
The standard additive model is a generalization of the multiple linear regression model, introducing one-dimensional nonparametric functions in place of the linear components. Here, the conditional expectation of Y given X = (X_1, . . . , X_d)^⊤ is assumed to be the sum of unknown functions of the explanatory variables plus an intercept term:

$$ E(Y|X) = c + \sum_{j=1}^{d} g_j(X_j). \qquad (5.10) $$

Observe how reduction is achieved in this model: instead of estimating one function of several variables, as we do in completely nonparametric regression, we merely have to estimate d functions of one-dimensional variables X_j.
Partial Linear Model (PLM)
Suppose we only want to model parts of the index linearly. This could be for analytical reasons or for reasons going back to economic theory. For instance, the impact of a dummy variable X_1 ∈ {0, 1} might be sufficiently explained by estimating the coefficient β_1.
For the sake of clarity, let us now separate the d-dimensional vector of explanatory variables into U = (U_1, . . . , U_p)^⊤ and T = (T_1, . . . , T_q)^⊤. The regression of Y on X = (U, T) is assumed to have the form:

$$ E(Y|U, T) = U^\top \beta + m(T), \qquad (5.11) $$

where m(•) is an unknown multivariate function of the vector T. Thus, a partial linear model can be interpreted as a sum of a purely parametric part, U^⊤β, and a purely nonparametric part, m(T). Not surprisingly, estimating β and m(•) involves a combination of both parametric and nonparametric regression techniques.
Generalized Additive Model (GAM)
Just like the (standard) additive model, generalized additive models are based on the sum of d nonparametric functions of the d variables X_j (plus an intercept term). In addition, they allow for a known parametric link function, G(•), that relates the sum of functions to the dependent variable:

$$ E(Y|X) = G\left\{ c + \sum_{j=1}^{d} g_j(X_j) \right\}. \qquad (5.12) $$
Generalized Partial Linear Model (GPLM)
Introducing a link G(•) for a partial linear model U^⊤β + m(T) yields the generalized partial linear model (GPLM):

$$ E(Y|U, T) = G\left\{ U^\top \beta + m(T) \right\}. $$

G denotes a known link function as in the GAM. In contrast to the GAM, m(•) is possibly a multivariate nonparametric function of the variable T.

Generalized Partial Linear Partial Additive Model (GAPLM)
For high dimensional T, the estimate of the nonparametric function m(•) in the GPLM faces the same problems as fully nonparametric multidimensional regression function estimates: the curse of dimensionality and the practical problem of interpretability. Hence, it is useful to think about a lower dimensional modeling of the nonparametric part. This leads to the GAPLM with an additive structure in the nonparametric component:

$$ E(Y|U, T) = G\left\{ U^\top \beta + \sum_{j=1}^{q} g_j(T_j) \right\}. $$

Here, the g_j(•) are univariate nonparametric functions of the variables T_j. In the case of an identity function G we speak of an additive partial linear model (APLM).


More discussion and motivation is given in the following chapters, where the different models are discussed in detail and the specific estimation procedures are presented. Before proceeding with this task, however, we will first introduce some facts about the parametric generalized linear model (GLM). The following section is intended to give more insight into this model, since its concept and the technical details of its estimation will be necessary for its semiparametric modifications in Chapters 6 to 9.

5.2 Generalized Linear Models

Generalized linear models (GLM) extend the concept of the widely used linear regression model. The linear model assumes that the response Y (the dependent variable) is equal to a linear combination X^⊤β plus a normally distributed error term:

$$ Y = X^\top \beta + \varepsilon. $$

The least squares estimator β̂ is adapted to these assumptions. However, the restriction of linearity is far too strict for a variety of practical situations. For example, a continuous distribution of the error term implies that the response Y has a continuous distribution as well. Hence, this standard linear regression model fails, for example, when dealing with binary data (Bernoulli Y) or with count data (Poisson Y).
Nelder & Wedderburn (1972) introduced the term generalized linear models (GLM). A good resource of material on this model is the monograph of McCullagh & Nelder (1989). The essential feature of the GLM is that the regression function, i.e. the expectation μ = E(Y|X) of Y, is a monotone function of the index η = X^⊤β. We denote the function which relates μ and η by G:

$$ E(Y|X) = G(X^\top \beta) \quad \Longleftrightarrow \quad \mu = G(\eta). $$

This function G is called the link function. (We remark that Nelder & Wedderburn (1972) and McCullagh & Nelder (1989) actually denote G^{-1} as the link function.)
5.2.1 Exponential Families

In the GLM framework we assume that the distribution of Y is a member of the exponential family. The exponential family covers a broad range of distributions, for example discrete distributions such as the Bernoulli or the Poisson, and continuous distributions such as the Gaussian (normal) or the Gamma.
A distribution is said to be a member of the exponential family if its probability function (if Y is discrete) or its density function (if Y is continuous) has the structure



$$ f(y, \theta, \psi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi) \right\} \qquad (5.13) $$

with some specific functions a(•), b(•) and c(•). These functions differ for the distinct Y distributions. Generally speaking, we are only interested in estimating the parameter θ. The additional parameter ψ is, like the variance σ² in linear regression, a nuisance parameter. McCullagh & Nelder (1989) call θ the canonical parameter.
Example 5.2.
Suppose Y is normally distributed, i.e. Y ∼ N(μ, σ²). Hence we can write its density as

$$ \varphi(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(y - \mu)^2 \right\} = \exp\left\{ y\,\frac{\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{y^2}{2\sigma^2} - \log(\sqrt{2\pi}\,\sigma) \right\} $$

and we see that the normal distribution is a member of the exponential family with

$$ a(\psi) = \sigma^2, \qquad b(\theta) = \frac{\theta^2}{2}, \qquad c(y, \psi) = -\frac{y^2}{2\sigma^2} - \log(\sqrt{2\pi}\,\sigma), $$

where we set θ = μ and ψ = σ.


Example 5.3.
Suppose now Y is Bernoulli distributed, i.e. its probability function is

$$ P(Y = y) = \mu^y (1 - \mu)^{1-y} = \begin{cases} \mu & \text{if } y = 1, \\ 1 - \mu & \text{if } y = 0. \end{cases} $$

This can be transformed into

$$ P(Y = y) = \left( \frac{\mu}{1 - \mu} \right)^y (1 - \mu) = \exp\left\{ y \log\left( \frac{\mu}{1 - \mu} \right) \right\} (1 - \mu) $$

using the logit

$$ \theta = \log\left( \frac{\mu}{1 - \mu} \right) \quad \Longleftrightarrow \quad \mu = \frac{e^\theta}{1 + e^\theta}. $$

Thus we have an exponential family with

$$ a(\psi) = 1, \qquad b(\theta) = -\log(1 - \mu) = \log(1 + e^\theta), \qquad c(y, \psi) \equiv 0. $$

This is a distribution without an additional nuisance parameter ψ.

It is known that the least squares estimator β̂ in the classical linear model is also the maximum-likelihood estimator for normally distributed errors. By imposing that the distribution of Y belongs to the exponential family it


is possible to stay within the framework of maximum likelihood for the GLM. Moreover, the use of the general concept of exponential families has the advantage that we can derive properties of different distributions at the same time.
To derive the maximum-likelihood algorithm in detail, we need to present some more properties of the probability function or density function f(•). First of all, f is a density (w.r.t. the Lebesgue measure in the continuous case and w.r.t. the counting measure in the discrete case). This allows us to write

$$ \int f(y, \theta, \psi)\, dy = 1. $$
Under some suitable regularity conditions (it is possible to exchange differentiation and integration) this yields

$$ 0 = \frac{\partial}{\partial \theta} \int f(y, \theta, \psi)\, dy = \int \frac{\partial}{\partial \theta} f(y, \theta, \psi)\, dy = \int \left\{ \frac{\partial}{\partial \theta} \log f(y, \theta, \psi) \right\} f(y, \theta, \psi)\, dy = E\left\{ \frac{\partial}{\partial \theta} \ell(y, \theta, \psi) \right\}, $$

where ℓ(y, θ, ψ) denotes the log-likelihood, i.e.

$$ \ell(y, \theta, \psi) = \log f(y, \theta, \psi). \qquad (5.14) $$

The function ∂ℓ(y, θ, ψ)/∂θ is typically called the score, and it is known that

$$ E\left\{ \frac{\partial^2}{\partial \theta^2}\, \ell(y, \theta, \psi) \right\} = -E\left\{ \left( \frac{\partial}{\partial \theta}\, \ell(y, \theta, \psi) \right)^{\!2} \right\}
$$

This and taking first and second derivatives of (5.13) now gives

$$ 0 = E\left\{ \frac{Y - b'(\theta)}{a(\psi)} \right\} \quad \text{and} \quad -E\left\{ \frac{b''(\theta)}{a(\psi)} \right\} = -E\left[ \left\{ \frac{Y - b'(\theta)}{a(\psi)} \right\}^{2} \right]. $$

We can conclude

$$ E(Y) = \mu = b'(\theta), \qquad \mathrm{Var}(Y) = V(\mu)\, a(\psi) = b''(\theta)\, a(\psi). $$

We observe that the expectation of Y only depends on θ, whereas the variance of Y depends on the parameter of interest θ and the nuisance parameter ψ. Typically one assumes that the factor a(ψ) is identical over all observations. A quick numerical check of these moment identities is sketched below.
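The following Python fragment (a toy check, not part of the original text) verifies E(Y) = b'(θ) and Var(Y) = b''(θ)a(ψ) by finite differences for the Bernoulli case of Example 5.3.

import numpy as np

# Bernoulli case: b(theta) = log(1 + e^theta), a(psi) = 1
theta, h = 0.4, 1e-5
b = lambda t: np.log(1 + np.exp(t))
b1 = (b(theta + h) - b(theta - h)) / (2 * h)                # numerical b'(theta)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2    # numerical b''(theta)

mu = np.exp(theta) / (1 + np.exp(theta))
print(b1, mu)                 # E(Y) = b'(theta) = mu
print(b2, mu * (1 - mu))      # Var(Y) = b''(theta) a(psi) = mu(1 - mu)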
5.2.2 Link Functions

Apart from the distribution of Y, the link function G is another important part of the GLM. Recall the notation

$$ \eta = X^\top \beta, \qquad \mu = G(\eta). $$

In the case that

$$ X^\top \beta = \eta = \theta, $$

the link function is called the canonical link function. For models with a canonical link, some theoretical and practical problems are easier to solve. Table 5.2 summarizes the characteristics of some exponential families together with their canonical parameters and canonical link functions. Note that for the binomial and the negative binomial distribution we assume the parameter k to be known. The case of binary Y is a special case of the binomial distribution (k = 1).
What link functions can we choose apart from the canonical one? For most of the models a number of special link functions exist. For binomial Y, for example, the logistic or Gaussian link functions are often used. Recall that a binomial model with the canonical logit link is called a logit model. If the binomial distribution is combined with the Gaussian link, it is called a probit model. A further alternative for binomial Y is the complementary log-log link

$$ \eta = \log\{ -\log(1 - \mu) \}. $$

A very flexible class of link functions is the class of power functions, which are also called Box-Cox transformations (Box & Cox, 1964). They can be defined for all models for which we have observations with positive mean. This family is usually specified as

$$ \eta = \begin{cases} \mu^{\lambda} & \text{if } \lambda \neq 0, \\ \log \mu & \text{if } \lambda = 0. \end{cases} $$
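For reference, these links can be written down directly in Python; the small sketch below uses scipy's normal quantile function as Φ^{-1} for the probit link.

import numpy as np
from scipy.stats import norm

mu = np.array([0.1, 0.5, 0.9])

logit_link   = np.log(mu / (1 - mu))        # canonical link for Bernoulli
probit_link  = norm.ppf(mu)                 # Gaussian (probit) link
cloglog_link = np.log(-np.log(1 - mu))      # complementary log-log link

def power_link(mu, lam):
    """Box-Cox type power family of links (defined for positive means)."""
    return np.log(mu) if lam == 0 else mu**lam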

5.2.3 Iteratively Reweighted Least Squares Algorithm

As already pointed out, the estimation method of choice for a GLM is maximizing the likelihood function with respect to β. Suppose that we have the vector of observations Y = (Y_1, . . . , Y_n)^⊤ and denote their expectations (given X_i = x_i) by the vector μ = (μ_1, . . . , μ_n)^⊤. More precisely, we have μ_i = G(x_i^⊤β). The log-likelihood of the vector Y is then

$$ \ell(Y, \mu, \psi) = \sum_{i=1}^{n} \ell(Y_i, \theta_i, \psi), \qquad (5.15) $$

where θ_i = θ(μ_i) = θ{G(x_i^⊤β)} and ℓ(•) on the right hand side of (5.15) denotes the individual log-likelihood contribution of observation i.


Table 5.2. Characteristics of some GLM distributions

Notation              Range of y       b(θ)             μ(θ) = b'(θ)       Canonical link θ(μ)      Variance V(μ)   a(ψ)
Bernoulli B(μ)        {0, 1}           log(1 + e^θ)     e^θ/(1 + e^θ)      logit: log{μ/(1 − μ)}    μ(1 − μ)        1
Binomial B(k, μ)      [0, k] integer   k log(1 + e^θ)   ke^θ/(1 + e^θ)     log{μ/(k − μ)}           μ(1 − μ/k)      1
Poisson P(μ)          [0, ∞) integer   exp(θ)           exp(θ)             log(μ)                   μ               1
Negative binomial     [0, ∞) integer   −k log(1 − e^θ)  ke^θ/(1 − e^θ)     log{μ/(μ + k)}           μ + μ²/k        1
  NB(μ, k)
Normal N(μ, σ²)       (−∞, ∞)          θ²/2             θ                  identity: μ              1               σ²
Gamma G(μ, ν)         (0, ∞)           −log(−θ)         −1/θ               reciprocal: 1/μ          μ²              1/ν
Inverse Gaussian      (0, ∞)           −(−2θ)^{1/2}     (−2θ)^{−1/2}       squared reciprocal:      μ³              σ²
  IG(μ, σ²)                                                                1/μ²

Example 5.4.
For Y_i ∼ N(μ_i, σ²) we have

$$ \ell(Y_i, \theta_i, \psi) = \log\left( \frac{1}{\sqrt{2\pi}\,\sigma} \right) - \frac{1}{2\sigma^2}(Y_i - \mu_i)^2. $$

This gives the sample log-likelihood

$$ \ell(Y, \mu, \psi) = n \log\left( \frac{1}{\sqrt{2\pi}\,\sigma} \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \mu_i)^2. \qquad (5.16) $$


Obviously, maximizing the log-likelihood for β under normal Y is equivalent to minimizing the least squares criterion as the objective function.

Example 5.5.
The calculation in Example 5.3 shows that the individual log-likelihood for the binary responses Y_i equals ℓ(Y_i, μ_i, ψ) = Y_i log(μ_i) + (1 − Y_i) log(1 − μ_i). This leads to the sample version

$$ \ell(Y, \mu, \psi) = \sum_{i=1}^{n} \left\{ Y_i \log(\mu_i) + (1 - Y_i) \log(1 - \mu_i) \right\}. \qquad (5.17) $$

Note that one typically defines 0 · log(0) = 0.
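A direct transcription of (5.17) into Python (assuming fitted values strictly between 0 and 1, so that the 0 · log(0) convention is never triggered):

import numpy as np

def bernoulli_loglik(y, mu):
    """Sample log-likelihood (5.17); assumes 0 < mu_i < 1."""
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))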

Let us remark that in the case where the distribution of Y itself is unknown, but its first two moments can be specified, the quasi-likelihood may replace the log-likelihood (5.14). This means we assume that

$$ E(Y) = \mu, \qquad \mathrm{Var}(Y) = a(\psi)\, V(\mu). $$

The quasi-likelihood is defined by

$$ \ell(y, \theta, \psi) = \frac{1}{a(\psi)} \int_{\mu(\theta)}^{y} \frac{(s - y)}{V(s)}\, ds, \qquad (5.18) $$

cf. Nelder & Wedderburn (1972). If Y comes from an exponential family, then the derivatives of (5.14) and (5.18) coincide. Thus, (5.18) in fact establishes a generalization of the likelihood approach.
Alternatively to the log-likelihood, the deviance is often used. The deviance function is defined as

$$ D(Y, \mu, \psi) = 2\left\{ \ell(Y, \mu^{max}, \psi) - \ell(Y, \mu, \psi) \right\}, \qquad (5.19) $$

where μ^{max} (typically Y itself) is the non-restricted vector maximizing ℓ(Y, μ, ψ). The deviance (up to the factor a(ψ)) is the GLM analog of the residual sum of squares (RSS) in linear regression and compares the log-likelihood ℓ for the model μ with the maximal achievable value of ℓ. Since the first term in (5.19) does not depend on the model, and therefore not on β, minimization of the deviance corresponds exactly to maximization of the log-likelihood.
Before deriving the algorithm to determine β, let us have another look at (5.15). From ℓ(Y_i, θ_i, ψ) = log f(Y_i, θ_i, ψ) and (5.13) we see

$$ \ell(Y, \mu, \psi) = \sum_{i=1}^{n} \left\{ \frac{Y_i \theta_i - b(\theta_i)}{a(\psi)} + c(Y_i, \psi) \right\}. \qquad (5.20) $$


Obviously, neither a(ψ) nor c(Y_i, ψ) has an influence on the maximization, hence it is sufficient to consider

$$ \widetilde{\ell}(Y, \mu) = \sum_{i=1}^{n} \left\{ Y_i \theta_i - b(\theta_i) \right\}. \qquad (5.21) $$

We will now maximize (5.21) w.r.t. β. For that purpose, take the first derivative of (5.21). This yields the gradient

$$ D(\beta) = \frac{\partial}{\partial \beta}\, \widetilde{\ell}(Y, \mu) = \sum_{i=1}^{n} \left\{ Y_i - b'(\theta_i) \right\} \frac{\partial \theta_i}{\partial \beta}, \qquad (5.22) $$

and our optimization problem is to solve

$$ D(\beta) = 0, $$

a (in general) nonlinear system of equations in β. For that reason, an iterative
method is needed. One possible solution is the Newton-Raphson algorithm, a generalization of the Newton algorithm for a multidimensional parameter. Denote by H(β) the Hessian of the log-likelihood, i.e. the matrix of second derivatives with respect to all components of β. Then, one Newton-Raphson iteration step for β is

$$ \widehat{\beta}^{\,new} = \widehat{\beta}^{\,old} - \left\{ H(\widehat{\beta}^{\,old}) \right\}^{-1} D(\widehat{\beta}^{\,old}). $$

A variant of the Newton-Raphson method is the Fisher scoring algorithm, which replaces the Hessian by its expectation (w.r.t. the observations Y_i):

$$ \widehat{\beta}^{\,new} = \widehat{\beta}^{\,old} - \left\{ EH(\widehat{\beta}^{\,old}) \right\}^{-1} D(\widehat{\beta}^{\,old}). $$
To present both algorithms in a more detailed way, we need again some additional notation. Recall that we have μ_i = G(x_i^⊤β) = b'(θ_i), η_i = x_i^⊤β and b'(θ_i) = μ_i = G(η_i). For the first and second derivatives of θ_i we obtain (after some calculation)

$$ \frac{\partial \theta_i}{\partial \beta} = \frac{G'(\eta_i)}{V(\mu_i)}\, x_i, $$

$$ \frac{\partial^2 \theta_i}{\partial \beta\, \partial \beta^\top} = \frac{G''(\eta_i) V(\mu_i) - G'(\eta_i)^2 V'(\mu_i)}{V(\mu_i)^2}\; x_i x_i^\top. $$

Using this, we can express the gradient of the log-likelihood as

$$ D(\beta) = \sum_{i=1}^{n} \left\{ Y_i - \mu_i \right\} \frac{G'(\eta_i)}{V(\mu_i)}\, x_i. $$


For the Hessian we get

$$ H(\beta) = \sum_{i=1}^{n} \left[ \{ Y_i - b'(\theta_i) \} \frac{\partial^2 \theta_i}{\partial \beta\, \partial \beta^\top} - b''(\theta_i)\, \frac{\partial \theta_i}{\partial \beta} \frac{\partial \theta_i}{\partial \beta^\top} \right] = \sum_{i=1}^{n} \left[ \{ Y_i - \mu_i \} \frac{G''(\eta_i) V(\mu_i) - G'(\eta_i)^2 V'(\mu_i)}{V(\mu_i)^2} - \frac{G'(\eta_i)^2}{V(\mu_i)} \right] x_i x_i^\top. $$

Since EY_i = μ_i, it turns out that the Fisher scoring algorithm is easier: we replace H(β) by

$$ EH(\beta) = -\sum_{i=1}^{n} \frac{G'(\eta_i)^2}{V(\mu_i)}\, x_i x_i^\top. $$
For the sake of simplicity, let us concentrate on Fisher scoring for the moment. Define the weight matrix

$$ W = \mathrm{diag}\left( \frac{G'(\eta_1)^2}{V(\mu_1)}, \ldots, \frac{G'(\eta_n)^2}{V(\mu_n)} \right). $$

Additionally, define

$$ \widetilde{Y} = \left( \frac{Y_1 - \mu_1}{G'(\eta_1)}, \ldots, \frac{Y_n - \mu_n}{G'(\eta_n)} \right)^{\!\top} $$

and the design matrix

$$ \mathbf{X} = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}. $$

Then one iteration step for β can be rewritten as

$$ \beta^{new} = \beta^{old} + (\mathbf{X}^\top W \mathbf{X})^{-1} \mathbf{X}^\top W \widetilde{Y} = (\mathbf{X}^\top W \mathbf{X})^{-1} \mathbf{X}^\top W Z, \qquad (5.23) $$

where Z = (Z_1, . . . , Z_n)^⊤ is the vector of adjusted dependent variables

$$ Z_i = x_i^\top \beta^{old} + (Y_i - \mu_i)\left\{ G'(\eta_i) \right\}^{-1}. \qquad (5.24) $$

The iteration stops when the parameter estimate or the log-likelihood (or both) do not change significantly any more. We denote the resulting parameter estimate by β̂.
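A compact numpy sketch of Fisher scoring for the logit model is given below (illustrative only; it omits safeguards such as step halving or handling of fitted probabilities at the boundary). For the canonical logit link, G'(η) = V(μ) = μ(1 − μ), so the weights and the adjusted dependent variable simplify.

import numpy as np

def irls_logit(X, y, tol=1e-8, max_iter=25):
    """IRLS / Fisher scoring for the logit model, cf. (5.23)-(5.24)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
        w = mu * (1 - mu)            # G'(eta)^2 / V(mu) = mu(1-mu) for the logit
        z = eta + (y - mu) / w       # adjusted dependent variable (5.24)
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# usage: beta_hat = irls_logit(np.column_stack([np.ones(len(y)), x]), y)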
We see that each iteration step (5.23) is the result of a weighted least squares regression of the adjusted variables Z_i on x_i. Hence, a GLM can be estimated by iteratively reweighted least squares (IRLS). Note further that in the linear regression model, where we have G' ≡ 1 and μ_i = η_i = x_i^⊤β, no iteration is necessary. The Newton-Raphson algorithm can be given in a similar way (with more complicated weights and a different formula for the adjusted variables). There are several remarks on the algorithm:


• In the case of a canonical link function, the Newton-Raphson and the Fisher scoring algorithm coincide, since here the second derivative of θ_i is zero. Additionally we have

$$ b'(\theta_i) = G(\eta_i), \qquad b''(\theta_i) = G'(\eta_i) = V(\mu_i). $$

This also simplifies the weight matrix W.

• We still have to address the problem of starting values. A naive way would be just to start with some arbitrary β₀, e.g. β₀ = 0. It turns out that we do not in fact need a starting value for β, since the adjusted dependent variable can be equivalently initialized by appropriate μ_{i,0} and η_{i,0}. Typically the following choices are made (we refer here to McCullagh & Nelder, 1989): for all but binomial models, μ_{i,0} = Y_i and η_{i,0} = G^{-1}(μ_{i,0}); for binomial models, μ_{i,0} = (Y_i + 1/2)/(k + 1) and η_{i,0} = G^{-1}(μ_{i,0}), where k denotes the binomial weights, i.e. k = 1 in the Bernoulli case.

• An estimate ψ̂ for the dispersion parameter can be obtained from

$$ \widehat{a(\psi)} = \frac{1}{n} \sum_{i=1}^{n} \frac{(Y_i - \widehat{\mu}_i)^2}{V(\widehat{\mu}_i)}, \qquad (5.25) $$

where μ̂_i denotes the estimated regression function for the ith observation.

The resulting estimator β̂ has an asymptotic normal distribution, except of course for the standard linear regression case with normal errors, where β̂ has an exact normal distribution.
Theorem 5.1.
Under regularity conditions and as n → ∞ we have for the estimated coefficient vector

$$ \sqrt{n}\,(\widehat{\beta} - \beta) \xrightarrow{\;L\;} N(0, \Sigma). $$

Denote further by μ̂ the estimator of μ. Then, for deviance and log-likelihood it holds approximately: D(Y, μ̂, ψ) ∼ χ²_{n−d} and 2{ℓ(Y, μ̂, ψ) − ℓ(Y, μ, ψ)} ∼ χ²_d.

The asymptotic covariance Σ of the coefficient estimator β̂ can be estimated by

$$ \widehat{\Sigma} = \widehat{a(\psi)} \left\{ \frac{1}{n} \sum_{i=1}^{n} \frac{G'(\eta_{i,last})^2}{V(\mu_{i,last})}\; X_i X_i^\top \right\}^{-1} = \widehat{a(\psi)} \left\{ \frac{1}{n}\, \mathbf{X}^\top W \mathbf{X} \right\}^{-1}, $$

with the subscript "last" denoting the values from the last iteration step. Using this estimated covariance we can make inference about the components


of β, such as tests of significance. For selection between two nested models, typically a likelihood ratio test (LR test) is used.
Example 5.6.
Let us illustrate the GLM using the data on East-West German migration from Table 5.1. This is a sample of East Germans who were surveyed in 1991 in the German Socio-Economic Panel, see GSOEP (1991). Among other questions, the participants were asked if they can imagine moving to the Western part of Germany or West Berlin. We give the value 1 for those who responded positively and 0 if not.
Recall that the economic model is based on the idea that a person will migrate if the utility (wage differential) exceeds the costs of migration. Of course, neither of the variables, wage differential and costs, is directly available. It is obvious that age has an important influence on migration intention. Younger people will have a higher wage differential. A currently low household income and unemployment will also increase the possible gain in wage after migration. On the other hand, the presence of friends or family members in the Western part of Germany will reduce the costs of migration. We also consider a city size indicator and gender as interesting variables (Table 5.1).
Table 5.3. Logit coefficients for migration data

                     Coefficients    t-value
constant                   0.512       2.39
FAMILY/FRIENDS             0.599       5.20
UNEMPLOYED                 0.221       2.31
CITY SIZE                  0.311       3.77
FEMALE                    -0.240      -3.15
AGE                  -4.69·10^-2     -14.56
INCOME                1.42·10^-4       2.73

Now we are interested in estimating the probability of migration in dependence of the explanatory variables x. Recall that

$$ P(Y = 1|X) = E(Y|X). $$

A useful model is a GLM with a binary (Bernoulli) Y and the logit link, for example:

$$ P(Y = 1|X = x) = G(x^\top \beta) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}. $$


The middle column of Table 5.3 shows the results of this logit fit. The migration intention is clearly determined by age. However, the unemployment, city size and household income variables are also highly significant, as indicated by their high absolute t-values (β̂_j / (Σ̂_jj)^{1/2}).
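A fit of this kind can be imitated on artificial data. The sketch below simulates a migration-style sample with hypothetical coefficients (loosely in the spirit of Table 5.3, not the actual GSOEP data) and fits the logit GLM with statsmodels; the reported t-values play the role of the right column of Table 5.3.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 3235
age = rng.uniform(18, 65, n)
income = rng.uniform(200, 4000, n)
X = sm.add_constant(np.column_stack([age, income]))

# illustrative coefficients (constant, AGE, INCOME), not the fitted values
beta = np.array([0.5, -4.7e-2, 1.4e-4])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)    # coefficient estimates
print(fit.tvalues)   # t-values, cf. the right column of Table 5.3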


Bibliographic Notes

For general aspects of semiparametric regression we refer to the textbooks of Pagan & Ullah (1999), Yatchew (2003) and Ruppert, Wand & Carroll (1990). Comprehensive presentations of the generalized linear model can be found in Dobson (2001), McCullagh & Nelder (1989) and Hardin & Hilbe (2001). For a more compact introduction see Müller (2004), Venables & Ripley (2002, Chapter 7) and Gill (2000).
In the following notes, we give some references for topics related to the models considered here. References for specific models are listed in the relevant chapters later on.
The transformation model in (5.4) was first introduced in an econometric context by Box & Cox (1964). The discussion was revised many years later by Bickel & Doksum (1981). In a more recent paper, Horowitz (1996) estimates this model by considering a nonparametric transformation.
As further references for dimension reduction in nonparametric estimation we mention projection pursuit and sliced inverse regression. The projection pursuit algorithm is introduced and investigated in detail in Friedman & Stuetzle (1981) and Friedman (1987). Sliced inverse regression means the estimation of Y = m(X^⊤β₁, X^⊤β₂, . . . , X^⊤β_k, ε), where ε is the disturbance term and k is the unknown dimension of the model. Introduction and theory can be found e.g. in Duan & Li (1991), Li (1991) or Hsing & Carroll (1992).
More sophisticated models like censored or truncated dependent variables, models with endogenous variables or simultaneous equation systems (Maddala, 1983) will not be dealt with in this book. There are two reasons: on the one hand, the non- or semiparametric estimation of those models is much more complicated and technical than most of what we aim to introduce in this book; here we only prepare the basics, enabling the reader to consider more special problems. On the other hand, most of these estimation problems are rather particular, and their treatment presupposes good knowledge of the considered problem and its solution in the parametric world. Instead of extending the book considerably by setting out this topic, we limit ourselves here to some more detailed bibliographic notes.
The non- and semiparametric literature on this subject is mainly separated into two directions: parametric modeling with unknown error distribution, or modeling the functional forms non- or semiparametrically. In the second case a principal question is the identifiability of the model.
For an introduction to the problem of truncation, sample selection and limited dependent data, see Heckman (1976) and Heckman (1979). See also the survey of Amemiya (1984). An interesting approach was presented by Ahn & Powell (1993) for parametric censored selection models with a nonparametric selection mechanism.


This idea has been extended to general pairwise difference estimators for censored and truncated models in Honoré & Powell (1994). A fairly comprehensive survey of parametric and semiparametric methods for parametric models with non- or semiparametric selection bias can be found in Vella (1998). Even though implementation of and theory on these methods is often quite complicated, some of them have turned out to perform reasonably well.
The second approach, i.e. relaxing the functional forms of the functions of interest, turned out to be much more complicated. To our knowledge, the first articles on the estimation of triangular simultaneous equation systems have been Newey, Powell & Vella (1999) and Rodríguez-Poo, Sperlich & Fernández (1999), of which the former is purely nonparametric, whereas the latter considers nested simultaneous equation systems and needs to specify the error distribution for identifiability reasons. Finally, Lewbel & Linton (2002) found a smart way to identify nonparametric censored and truncated regression functions; however, their estimation procedure is quite technical. Note that so far neither their estimator nor the one of Newey, Powell & Vella (1999) has been shown to perform well in practice.


Exercises

Exercise 5.1. Assume model (5.6) and consider X and ε to be independent. Show that

P(Y = 1|X) = E(Y|X) = G_ε{v(X)},

where G_ε denotes the cdf of ε. Explain why only (5.7) holds if we do not assume independence of X and ε.

Exercise 5.2. Recall the paragraph about partial linear models. Why may it be sufficient to include β₁X₁ in the model when X₁ is binary? What would you do if X₁ were categorical?

Exercise 5.3. Compute H(β) and EH(β) for the logit and probit models.

Exercise 5.4. Verify the canonical link functions for the logit and Poisson models.

Exercise 5.5. Recall that in Example 5.6 we have fitted the model

E(Y|X) = P(Y = 1|X) = G(X^⊤β),

where G is the standard logistic cdf. We motivated this model through the latent-variable model Y* = X^⊤β − ε, with ε having cdf G. How does the logit model change if the latent-variable model is multiplied by a factor c? What does this imply for the identification of the coefficient vector β?

Summary

• The basis for many semiparametric regression models is the generalized linear model (GLM), which is given by

E(Y|X) = G{X^⊤β}.

Here, β denotes the parameter vector to be estimated and G denotes a known link function. Prominent examples of this type of regression are binary choice models (logit or probit) or count data models (Poisson regression).

• The GLM can be generalized in several ways: considering an unknown smooth link function (instead of G) leads to the single index model (SIM). Assuming a nonparametric additive argument of G leads to the generalized additive model (GAM), whereas a combination of additive linear and nonparametric components in the argument of G gives a generalized partial linear model (GPLM) or a generalized partial linear partial additive model (GAPLM). If there is no link function (or G is the identity function), then we speak of additive models (AM), partial linear models (PLM) or additive partial linear models (APLM).

• The estimation of the GLM is performed through an iterative algorithm. This algorithm, the iteratively reweighted least squares (IRLS) algorithm, applies weighted least squares to the adjusted dependent variable Z in each iteration step:

β^{new} = (X^⊤WX)^{-1} X^⊤WZ.

This numerical approach needs to be appropriately modified for estimating the semiparametric modifications of the GLM.


References

Achmus, S. (2000). Nichtparametrische additive Modelle, Doctoral Thesis, Technical University of Braunschweig, Germany.
Ahn, H. & Powell, J. L. (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism, Journal of Econometrics 58: 3–29.
Amemiya, T. (1984). Tobit models: A survey, Journal of Econometrics 24: 3–61.
Andrews, D. W. K. & Whang, Y.-J. (1990). Additive interactive regression models: circumvention of the curse of dimensionality, Econometric Theory 6: 466–479.
Begun, J., Hall, W., Huang, W. & Wellner, J. (1983). Information and asymptotic efficiency in parametric–nonparametric models, Annals of Statistics 11: 432–452.
Berndt, E. (1991). The Practice of Econometrics, Addison-Wesley.
Bickel, P. & Doksum, K. (1981). An analysis of transformations revisited, Journal of the American Statistical Association 76: 296–311.
Bickel, P., Klaassen, C., Ritov, Y. & Wellner, J. (1993). Efficient and Adaptive Estimation for Semiparametric Models, The Johns Hopkins University Press.
Bickel, P. & Rosenblatt, M. (1973). On some global measures of the deviations of density function estimators, Annals of Statistics 1: 1071–1095.
Bierens, H. (1990). A consistent conditional moment test of functional form, Econometrica 58: 1443–1458.
Bierens, H. & Ploberger, W. (1997). Asymptotic theory of integrated conditional moment tests, Econometrica 65: 1129–1151.
Bonneu, M. & Delecroix, M. (1992). Estimation semiparamétrique dans les modèles explicatifs conditionnels à indice simple, Cahier de gremaq, 92.09.256, GREMAQ, Université Toulouse I.
Bonneu, M., Delecroix, M. & Malin, E. (1993). Semiparametric versus nonparametric estimation in single index regression model: A computational approach, Computational Statistics 8: 207–222.

Bossaerts, P., Hafner, C. & Härdle, W. (1996). Foreign exchange rates have surprising volatility, in P. M. Robinson & M. Rosenblatt (eds), Athens Conference on Applied Probability and Time Series Analysis. Volume II: Time Series Analysis. In Memory of E. J. Hannan, Lecture Notes in Statistics, Springer, pp. 55–72.
Boularan, J., Ferré, L. & Vieu, P. (1994). Growth curves: a two-stage nonparametric approach, Journal of Statistical Planning and Inference 38: 327–350.
Bowman, A. & Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis, Oxford University Press, Oxford, UK.
Box, G. & Cox, D. (1964). An analysis of transformations, Journal of the Royal Statistical Society, Series B 26: 211–243.
Breiman, L. & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlations (with discussion), Journal of the American Statistical Association 80(391): 580–619.
Buja, A., Hastie, T. J. & Tibshirani, R. J. (1989). Linear smoothers and additive models (with discussion), Annals of Statistics 17: 453–555.
Burda, M. (1993). The determinants of East–West German migration, European Economic Review 37: 452–461.
Cao, R., Cuevas, A. & González Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation, Computational Statistics & Data Analysis 17(2): 153–176.
Carroll, R. J., Fan, J., Gijbels, I. & Wand, M. P. (1997). Generalized partially linear single-index models, Journal of the American Statistical Association 92: 477–489.
Carroll, R. J., Härdle, W. & Mammen, E. (2002). Estimation in an additive model when the components are linked parametrically, Econometric Theory 18(4): 886–912.
Chaudhuri, P. & Marron, J. S. (1999). SiZer for exploration of structures in curves, Journal of the American Statistical Association 94: 807–823.
Chen, R., Liu, J. S. & Tsay, R. S. (1995). Additivity tests for nonlinear autoregression, Biometrika 82: 369–383.
Cleveland, W. S. (1979). Robust locally-weighted regression and smoothing scatterplots, Journal of the American Statistical Association 74: 829–836.
Collomb, G. (1985). Nonparametric regression: an up-to-date bibliography, Statistics 2: 309–324.
Cosslett, S. (1983). Distribution-free maximum likelihood estimation of the binary choice model, Econometrica 51: 765–782.
Cosslett, S. (1987). Efficiency bounds for distribution-free estimators of the binary choice model, Econometrica 55: 559–586.
Dalelane, C. (1999). Bootstrap confidence bands for the integration estimator in additive models, Diploma thesis, Department of Mathematics, Humboldt-Universität zu Berlin.

Daubechies, I. (1992). Ten Lectures on Wavelets, SIAM, Philadelphia, Pennsylvania.
Deaton, A. & Muellbauer, J. (1980). Economics and Consumer Behavior, Cambridge University Press, Cambridge.
Delecroix, M., Härdle, W. & Hristache, M. (2003). Efficient estimation in conditional single-index regression, Journal of Multivariate Analysis 86(2): 213–226.
Delgado, M. A. & Mora, J. (1995). Nonparametric and semiparametric estimation with discrete regressors, Econometrica 63(6): 1477–1484.
Denby, L. (1986). Smooth regression functions, Statistical report 26, AT&T Bell Laboratories.
Dette, H. (1999). A consistent test for the functional form of a regression based on a difference of variance estimators, Annals of Statistics 27: 1012–1040.
Dette, H., von Lieres und Wilkau, C. & Sperlich, S. (2004). A comparison of different nonparametric methods for inference on additive models, Nonparametric Statistics 16, forthcoming.
Dobson, A. J. (2001). An Introduction to Generalized Linear Models, second edn, Chapman and Hall, London.
Donoho, D. L. & Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage, Biometrika 81: 425–455.
Donoho, D. L. & Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage, Journal of the American Statistical Association 90: 1200–1224.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. & Picard, D. (1995). Wavelet shrinkage: Asymptopia? (with discussion), Journal of the Royal Statistical Society, Series B 57: 301–369.
Duan, N. & Li, K.-C. (1991). Slicing regression: A link-free regression method, Annals of Statistics 19(2): 505–530.
Duin, R. P. W. (1976). On the choice of smoothing parameters of Parzen estimators of probability density functions, IEEE Transactions on Computers 25: 1175–1179.
Eilers, P. H. C. & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties (with discussion), Statistical Science 11: 89–121.
Epanechnikov, V. (1969). Nonparametric estimation of a multidimensional probability density, Teoriya Veroyatnostej i Ee Primeneniya 14: 156–162.
Eubank, R. L. (1999). Nonparametric Regression and Spline Smoothing, Marcel Dekker, New York.
Eubank, R. L., Hart, J. D., Simpson, D. G. & Stefanski, L. A. (1995). Testing for additivity in nonparametric regression, Annals of Statistics 23: 1896–1920.
Eubank, R. L., Kambour, E. L., Kim, J. T., Klipple, K., Reese, C. S. & Schimek, M. G. (1998). Estimation in partially linear models, Computational Statistics & Data Analysis 29: 27–34.

Fahrmeir, L. & Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models, Springer.
Fan, J. & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications, Vol. 66 of Monographs on Statistics and Applied Probability, Chapman and Hall, New York.
Fan, J., Härdle, W. & Mammen, E. (1998). Direct estimation of low-dimensional components in additive models, Annals of Statistics 26: 943–971.
Fan, J. & Li, Q. (1996). Consistent model specification test: Omitted variables and semiparametric forms, Econometrica 64: 865–890.
Fan, J. & Marron, J. S. (1992). Best possible constant for bandwidth selection, Annals of Statistics 20: 2057–2070.
Fan, J. & Marron, J. S. (1994). Fast implementations of nonparametric curve estimators, Journal of Computational and Graphical Statistics 3(1): 35–56.
Fan, J. & Müller, M. (1995). Density and regression smoothing, in W. Härdle, S. Klinke & B. A. Turlach (eds), XploRe: an interactive statistical computing environment, Springer, pp. 77–99.
Friedman, J. H. (1987). Exploratory projection pursuit, Journal of the American Statistical Association 82: 249–266.
Friedman, J. H. & Stuetzle, W. (1981). Projection pursuit regression, Journal of the American Statistical Association 76(376): 817–823.
Friedman, J. H. & Stuetzle, W. (1982). Smoothing of scatterplots, Technical report, Department of Statistics, Stanford.
Fuss, M., McFadden, D. & Mundlak, Y. (1978). A survey of functional forms in the economic analysis of production, in M. Fuss & D. McFadden (eds), Production Economics: A Dual Approach to Theory and Applications, North-Holland, Amsterdam, pp. 219–268.
Gallant, A. & Nychka, D. (1987). Semi-nonparametric maximum likelihood estimation, Econometrica 55(2): 363–390.
Gasser, T. & Müller, H. G. (1984). Estimating regression functions and their derivatives by the kernel method, Scandinavian Journal of Statistics 11: 171–185.
Gill, J. (2000). Generalized Linear Models: A Unified Approach, Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-134, Thousand Oaks, CA.
Gill, R. D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (Part I), Scandinavian Journal of Statistics 16: 97–128.
Gill, R. D. & van der Vaart, A. W. (1993). Non- and semi-parametric maximum likelihood estimators and the von Mises method (Part II), Scandinavian Journal of Statistics 20: 271–288.
González Manteiga, W. & Cao, R. (1993). Testing hypothesis of general linear model using nonparametric regression estimation, Test 2: 161–189.

Gozalo, P. L. & Linton, O. (2001). A nonparametric test of additivity in generalized nonparametric regression with estimated parameters, Journal of Econometrics 104: 1–48.
Grasshoff, U., Schwalbach, J. & Sperlich, S. (1999). Executive pay and corporate financial performance. An explorative data analysis, Working paper 99-84 (33), Universidad Carlos III de Madrid.
Green, P. J. & Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models, Vol. 58 of Monographs on Statistics and Applied Probability, Chapman and Hall, London.
Green, P. J. & Yandell, B. S. (1985). Semi-parametric generalized linear models, Proceedings 2nd International GLIM Conference, Vol. 32 of Lecture Notes in Statistics, Springer, New York, pp. 44–55.
GSOEP (1991). Das Sozio-ökonomische Panel (SOEP) im Jahre 1990/91, Projektgruppe Das Sozio-ökonomische Panel, Deutsches Institut für Wirtschaftsforschung. Vierteljahreshefte zur Wirtschaftsforschung, pp. 146–155.
Habbema, J. D. F., Hermans, J. & van den Broek, K. (1974). A stepwise discrimination analysis program using density estimation, COMPSTAT 74. Proceedings in Computational Statistics, Physica, Vienna.
Hall, P. & Marron, J. S. (1991). Local minima in cross-validation functions, Journal of the Royal Statistical Society, Series B 53: 245–252.
Hall, P., Marron, J. S. & Park, B. U. (1992). Smoothed cross-validation, Probability Theory and Related Fields 92: 1–20.
Hall, P., Sheather, S. J., Jones, M. C. & Marron, J. S. (1991). On optimal data-based bandwidth selection in kernel density estimation, Biometrika 78: 263–269.
Han, A. (1987). Nonparametric analysis of a generalized regression model, Journal of Econometrics 35: 303–316.
Hardin, J. & Hilbe, J. (2001). Generalized Linear Models and Extensions, Stata Press.
Härdle, W. (1990). Applied Nonparametric Regression, Econometric Society Monographs No. 19, Cambridge University Press.
Härdle, W. (1991). Smoothing Techniques, With Implementations in S, Springer, New York.
Härdle, W., Huet, S., Mammen, E. & Sperlich, S. (2004). Bootstrap inference in semiparametric generalized additive models, Econometric Theory 20: to appear.
Härdle, W., Kerkyacharian, G., Picard, D. & Tsybakov, A. B. (1998). Wavelets, Approximation, and Statistical Applications, Springer, New York.
Härdle, W. & Mammen, E. (1993). Testing parametric versus nonparametric regression, Annals of Statistics 21: 1926–1947.

284

References

Hardle, W., Mammen, E. & Muller,


M. (1998). Testing parametric versus semiparametric modelling in generalized linear models, Journal of the
American Statistical Association 93: 14611474.

Hardle, W. & Muller,


M. (2000). Multivariate and semiparametric kernel regression, in M. Schimek (ed.), Smoothing and Regression, Wiley, New York,
pp. 357391.
Hardle, W. & Scott, D. W. (1992). Smoothing in by weighted averaging using
rounded points, Computational Statistics 7: 97128.
Hardle, W., Sperlich, S. & Spokoiny, V. (2001). Structural tests in additive
regression, Journal of the American Statistical Association 96(456): 13331347.
Hardle, W. & Stoker, T. M. (1989). Investigating smooth multiple regression
by the method of average derivatives, Journal of the American Statistical Association 84: 986995.
Hardle, W. & Tsybakov, A. B. (1997). Local polynomial estimators of the
volatility function in nonparametric autoregression, Journal of Econometrics
81(1): 223242.
Harrison, D. & Rubinfeld, D. L. (1978). Hedonic prices and the demand for
clean air, J. Environ. Economics and Management 5: 81102.
Hart, J. D. (1997). onparametric Smoothing and Lack-of-Fit Tests, Springer, New
York.
Hastie, T. J. & Tibshirani, R. J. (1986). Generalized additive models (with
discussion), Statistical Science 1(2): 297318.
Hastie, T. J. & Tibshirani, R. J. (1990). Generalized Additive Models, Vol. 43 of
Monographs on Statistics and Applied Probability, Chapman and Hall, London.
Heckman, J. (1976). The common structure of statistical models of truncation,
sample selection and limited dependent variables and a simple estimator
for such models, Annals of Economic and Social Measurement 5: 475–492.
Heckman, J. (1979). Sample selection bias as a specification error, Econometrica 47: 153–161.
Hengartner, N., Kim, W. & Linton, O. (1999). A computationally efficient oracle estimator for additive nonparametric regression with bootstrap confidence intervals, Journal of Computational and Graphical Statistics 8: 1–20.
Honoré, B. E. & Powell, J. L. (1994). Pairwise difference estimators of censored and truncated regression models, Journal of Econometrics 64: 241–278.
Horowitz, J. L. (1993). Semiparametric and nonparametric estimation of
quantal response models, in G. S. Maddala, C. R. Rao & H. D. Vinod (eds),
Handbook of Statistics, Elsevier Science Publishers, pp. 45–72.
Horowitz, J. L. (1996). Semiparametric estimation of a regression model
with an unknown transformation of the dependent variable, Econometrica
64: 103–137.
Horowitz, J. L. (1998a). Nonparametric estimation of a generalized additive model with an unknown link function, Technical report, University of Iowa.
Horowitz, J. L. (1998b). Semiparametric Methods in Econometrics, Springer.
Horowitz, J. L. & Härdle, W. (1994). Testing a parametric model against a semiparametric alternative, Econometric Theory 10: 821–848.
Horowitz, J. L. & Härdle, W. (1996). Direct semiparametric estimation of single-index models with discrete covariates, Journal of the American Statistical Association 91(436): 1632–1640.
Hsing, T. & Carroll, R. J. (1992). An asymptotic theory for sliced inverse
regression, Annals of Statistics 20(2): 1040–1061.
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS
estimation of single-index models, Journal of Econometrics 58: 71–120.
Ingster, Y. I. (1993). Asymptotically minimax hypothesis testing for nonparametric alternatives. I–III, Math. Methods of Statist. 2: 85–114, 171–189, 249–268.
Jones, M. C., Marron, J. S. & Sheather, S. J. (1996). Progress in data-based
bandwidth selection for kernel density estimation, Computational Statistics
11(3): 337–381.
Kallenberg, W. C. M. & Ledwina, T. (1995). Consistency and Monte-Carlo
simulations of a data driven version of smooth goodness-of-fit tests, Annals
of Statistics 23: 1594–1608.
Klein, R. & Spady, R. (1993). An efficient semiparametric estimator for binary response models, Econometrica 61: 387–421.
Korostelev, A. & Müller, M. (1995). Single index models with mixed discrete-continuous explanatory variables, Discussion Paper 26, Sonderforschungsbereich 373, Humboldt-Universität zu Berlin.
Ledwina, T. (1994). Data-driven version of Neyman's smooth test of fit, Journal of the American Statistical Association 89: 1000–1005.
Lejeune, M. (1985). Estimation non-paramétrique par noyaux: régression polynomiale mobile, Revue de Statistique Appliquée 33: 43–67.
Leontief, W. (1947a). Introduction to a theory of the internal structure of
functional relationships, Econometrica 15: 361–373.
Leontief, W. (1947b). A note on the interrelation of subsets of independent variables of a continuous function with continuous first derivatives, Bulletin of the American Mathematical Society 53: 343–350.
Lewbel, A. & Linton, O. (2002). Nonparametric censored and truncated regression, Econometrica 70: 765–780.
Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion), Journal of the American Statistical Association 86(414): 316–342.
Linton, O. (1997). Efficient estimation of additive nonparametric regression
models, Biometrika 84: 469–473.
Linton, O. (2000). Efficient estimation of generalized additive nonparametric
regression models, Econometric Theory 16(4): 502–523.
Linton, O. & Härdle, W. (1996). Estimation of additive regression models with known links, Biometrika 83(3): 529–540.
Linton, O. & Nielsen, J. P. (1995). A kernel method of estimating structured nonparametric regression based on marginal integration, Biometrika 82: 93–101.
Loader, C. (1999). Local Regression and Likelihood, Springer, New York.
Mack, Y. P. (1981). Local properties of k-NN regression estimates, SIAM J. Alg. Disc. Math. 2: 311–323.
Maddala, G. S. (1983). Limited-Dependent and Qualitative Variables in Econometrics, Econometric Society Monographs No. 4, Cambridge University Press.
Mammen, E., Linton, O. & Nielsen, J. P. (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions, Annals of Statistics 27: 1443–1490.
Mammen, E. & Nielsen, J. P. (2003). Generalised structured models,
Biometrika 90: 551–566.
Manski, C. (1985). Semiparametric analysis of discrete response: Asymptotic
properties of the maximum score estimator, Journal of Econometrics 3: 205–228.
Marron, J. S. (1989). Comments on a data-based bandwidth selector, Computational Statistics & Data Analysis 8: 155–170.
Marron, J. S. & Härdle, W. (1986). Random approximations to some measures of accuracy in nonparametric curve estimation, J. Multivariate Anal. 20: 91–113.
Marron, J. S. & Nolan, D. (1988). Canonical kernels for density estimation,
Statistics & Probability Letters 7(3): 195–199.
Masry, E. & Tjøstheim, D. (1995). Non-parametric estimation and identification of nonlinear ARCH time series: strong convergence properties and asymptotic normality, Econometric Theory 11: 258–289.
Masry, E. & Tjøstheim, D. (1997). Additive nonlinear ARX time series and projection estimates, Econometric Theory 13: 214–252.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, Vol. 37 of Monographs on Statistics and Applied Probability, 2 edn, Chapman and Hall, London.
Müller, M. (2001). Estimation and testing in generalized partial linear models – a comparative study, Statistics and Computing 11: 299–309.
Müller, M. (2004). Generalized linear models, in J. Gentle, W. Härdle & Y. Mori (eds), Handbook of Computational Statistics (Volume I). Concepts and Fundamentals, Springer, Heidelberg.
Nadaraya, E. A. (1964). On estimating regression, Theory of Probability and its
Applications 10: 186–190.
Nelder, J. A. & Wedderburn, R. W. M. (1972). Generalized linear models,
Journal of the Royal Statistical Society, Series A 135(3): 370–384.
Newey, W. K. (1990). Semiparametric efficiency bounds, Journal of Applied
Econometrics 5: 99–135.
Newey, W. K. (1994). The asymptotic variance of semiparametric estimation, Econometrica 62: 1349–1382.
Newey, W. K. (1995). Convergence rates for series estimators, in G. Maddala, P. Phillips & T. Srinivasan (eds), Statistical Methods of Economics and Quantitative Economics: Essays in Honor of C.R. Rao, Blackwell, Cambridge, pp. 254–275.
Newey, W. K., Powell, J. L. & Vella, F. (1999). Nonparametric estimation of
triangular simultaneous equation models, Econometrica 67: 565–603.
Nielsen, J. P. & Linton, O. (1998). An optimization interpretation of integration and backfitting estimators for separable nonparametric models, Journal of the Royal Statistical Society, Series B 60: 217–222.
Nielsen, J. P. & Sperlich, S. (2002). Smooth backfitting in practice, Working
paper 02-59, Universidad Carlos III de Madrid.
Opsomer, J. & Ruppert, D. (1997). Fitting a bivariate additive model by local
polynomial regression, Annals of Statistics 25: 186–211.
Pagan, A. & Schwert, W. (1990). Alternative models for conditional stock
volatility, Journal of Econometrics 45: 267–290.
Pagan, A. & Ullah, A. (1999). Nonparametric Econometrics, Cambridge University Press.
Park, B. U. & Marron, J. S. (1990). Comparison of data-driven bandwidth selectors, Journal of the American Statistical Association 85: 66–72.
Park, B. U. & Turlach, B. A. (1992). Practical performance of several data-driven bandwidth selectors, Computational Statistics 7: 251–270.
Powell, J. L., Stock, J. H. & Stoker, T. M. (1989). Semiparametric estimation of
index coefficients, Econometrica 57(6): 1403–1430.
Proença, I. & Werwatz, A. (1995). Comparing parametric and semiparametric binary response models, in W. Härdle, S. Klinke & B. Turlach (eds), XploRe: An Interactive Statistical Computing Environment, Springer, pp. 251–274.
Robinson, P. M. (1988a). Root-n-consistent semiparametric regression, Econometrica 56: 931–954.
Robinson, P. M. (1988b). Semiparametric econometrics: A survey, Journal of
Applied Econometrics 3: 35–51.
Rodríguez-Poo, J. M., Sperlich, S. & Fernández, A. I. (1999). Semiparametric three step estimation methods for simultaneous equation systems, Working paper 99-83 (32), Universidad Carlos III de Madrid.
Rodríguez-Poo, J. M., Sperlich, S. & Vieu, P. (2003). Semiparametric estimation of weak and strong separable models, Econometric Theory 19: 1008–1039.
Ruppert, D. & Wand, M. P. (1994). Multivariate locally weighted least squares
regression, Annals of Statistics 22(3): 1346–1370.
Ruppert, D., Wand, M. P. & Carroll, R. J. (2003). Semiparametric Regression, Cambridge University Press.
Schimek, M. G. (2000a). Estimation and inference in partially linear models with smoothing splines, Journal of Statistical Planning and Inference 91: 525–540.
Schimek, M. G. (ed.) (2000b). Smoothing and Regression, Wiley, New York.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, New York, Chichester.
Scott, D. W. & Terrell, G. R. (1987). Biased and unbiased cross-validation in density estimation, Journal of the American Statistical Association 82(400): 1131–1146.
Scott, D. W. & Wand, M. P. (1991). Feasibility of multivariate density estimates, Biometrika 78: 197–205.
Severance-Lossin, E. & Sperlich, S. (1999). Estimation of derivatives for additive separable models, Statistics 33: 241–265.
Severini, T. A. & Staniswalis, J. G. (1994). Quasi-likelihood estimation
in semiparametric models, Journal of the American Statistical Association
89: 501–511.
Severini, T. A. & Wong, W. H. (1992). Generalized profile likelihood and
conditionally parametric models, Annals of Statistics 20: 1768–1802.
Sheather, S. J. & Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation, Journal of the Royal Statistical Society, Series B 53: 683–690.
Silverman, B. W. (1984). Spline smoothing: the equivalent variable kernel
method, Annals of Statistics 12: 898–916.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis,
Vol. 26 of Monographs on Statistics and Applied Probability, Chapman and
Hall, London.
Simonoff, J. (1996). Smoothing Methods in Statistics, Springer, New York.
Speckman, P. E. (1988). Regression analysis for partially linear models, Journal of the Royal Statistical Society, Series B 50: 413–436.
Sperlich, S. (1998). Additive Modelling and Testing Model Specification, Shaker
Verlag.
Sperlich, S., Linton, O. & Härdle, W. (1999). Integration and backfitting methods in additive models: Finite sample properties and comparison, Test 8: 419–458.
Sperlich, S., Tjøstheim, D. & Yang, L. (2002). Nonparametric estimation and testing of interaction in additive models, Econometric Theory 18(2): 197–251.
Spokoiny, V. (1996). Adaptive hypothesis testing using wavelets, Annals of
Statistics 24: 2477–2498.
Spokoiny, V. (1998). Adaptive and spatially adaptive testing of a nonparametric hypothesis, Math. Methods of Statist. 7: 245–273.
Staniswalis, J. G. & Thall, P. F. (2001). An explanation of generalized profile
likelihoods, Statistics and Computing 11: 293–298.
Stone, C. J. (1977). Consistent nonparametric regression, Annals of Statistics 5: 595–635.
Stone, C. J. (1984). An asymptotically optimal window selection rule for kernel density estimates, Annals of Statistics 12(4): 1285–1297.
Stone, C. J. (1985). Additive regression and other nonparametric models,
Annals of Statistics 13(2): 689–705.
Stone, C. J. (1986). The dimensionality reduction principle for generalized
additive models, Annals of Statistics 14(2): 590–606.
Stone, C. J., Hansen, M. H., Kooperberg, C. & Truong, Y. (1997). Polynomial
splines and their tensor products in extended linear modeling (with discussion), Annals of Statistics 25: 1371–1470.
Stute, W. (1997). Nonparametric model checks for regression, Annals of Statistics 25: 613–641.
Stute, W., González Manteiga, W. & Presedo-Quindimil, M. (1998). Bootstrap approximations in model checks for regression, Journal of the American Statistical Association 93: 141–149.
Tjøstheim, D. & Auestad, B. (1994a). Nonparametric identification of nonlinear time series: Projections, Journal of the American Statistical Association 89: 1398–1409.
Tjøstheim, D. & Auestad, B. (1994b). Nonparametric identification of nonlinear time series: Selecting significant lags, Journal of the American Statistical Association 89: 1410–1430.
Treiman, D. J. (1975). Problems of concept and measurement in the comparative study of occupational mobility, Social Science Research 4: 183–230.
Vella, F. (1998). Estimating models with sample selection bias: A survey, The
Journal of Human Resources 33: 127–169.
Venables, W. N. & Ripley, B. (2002). Modern Applied Statistics with S, fourth
edn, Springer, New York.
Vieu, P. (1994). Choice of regressors in nonparametric estimation, Computational Statistics & Data Analysis 17: 575–594.
Wahba, G. (1990). Spline Models for Observational Data, 2 edn, SIAM, Philadelphia, Pennsylvania.
Wand, M. P. & Jones, M. C. (1995). Kernel Smoothing, Vol. 60 of Monographs on
Statistics and Applied Probability, Chapman and Hall, London.
Watson, G. S. (1964). Smooth regression analysis, Sankhyā, Series A 26: 359–372.
Wecker, W. & Ansley, C. (1983). The signal extraction approach to nonlinear
regression and spline smoothing, Journal of the American Statistical Association 78: 351–365.
Weisberg, S. & Welsh, A. H. (1994). Adapting for the missing link, Annals of
Statistics 22: 1674–1700.
Wu, C. (1986). Jackknife, bootstrap and other resampling methods in regression analysis (with discussion), Annals of Statistics 14: 1261–1350.
Yang, L., Sperlich, S. & Härdle, W. (2003). Derivative estimation and testing in generalized additive models, Journal of Statistical Planning and Inference 115(2): 521–542.
Yatchew, A. (2003). Semiparametric Regression for the Applied Econometrician,
Cambridge University Press.
Zheng, J. (1996). A consistent test of a functional form via nonparametric
estimation techniques, Journal of Econometrics 75: 263–289.

Author Index

Achmus, S. 247
Ahn, H. 162
Amemiya, T. 162
Andrews, D. W. K. 247
Ansley, C. 247
Auestad, B. 212, 247
Azzalini, A. 135
Begun, J. 185
Berndt, E. 5, 63
Bickel, P. 62, 120, 162, 185
Bierens, H. 135
Bonneu, M. 185
Bossaerts, P. 121
Boularan, J. 247
Bowman, A. 135
Box, G. 154, 162
Breiman, L. 212, 247
Buja, A. 190, 197, 206, 212, 214, 247,
261
Burda, M. 12
Cao, R. 79, 135
Carroll, R. J. 162, 247, 256, 274
Chaudhuri, P. 79
Chen, R. 247
Cleveland, W. S. 135

Collomb, G. 135
Cosslett, S. 185
Cox, D. 154, 162
Cuevas, A. 79
Dalelane, C. 247
Daubechies, I. 135
Deaton, A. 211, 247
Delecroix, M. 177, 185
Delgado, M. A. 149
Denby, L. 206, 254
Dette, H. 135, 240
Dobson, A. J. 162
Doksum, K. 162
Donoho, D. L. 135
Duan, N. 162
Duin, R. P. W. 79
Eilers, P. H. C. 135
Epanechnikov, V. 60
Eubank, R. L. 135, 206, 247
Fahrmeir, L. 77
Fan, J. 57, 98, 135, 199, 254–256, 274
Fernández, A. I. 163
Ferré, L. 247
Friedman, J. H. 162, 212, 247
Fuss, M. 247

Gallant, A. 185
Gasser, T. 90, 92, 135
Gijbels, I. 98, 135, 199, 256, 274
Gill, J. 162
Gill, R. D. 175
González Manteiga, W. 79, 135
Gozalo, P. L. 135, 247, 274
Grasshoff, U. 242
Green, P. J. 103, 135, 206, 274
GSOEP 160, 180
Habbema, J. D. F. 79
Hafner, C. 121
Hall, P. 56, 57, 79
Hall, W. 185
Han, A. 185
Hansen, M. H. 247
Hardin, J. 162
Härdle, W. 110, 121, 122, 126, 127,
135, 177, 185, 202, 204, 206, 247,
257, 274
Harrison, D. 217, 219
Hart, J. D. 135, 247
Hastie, T. J. 190, 197, 198, 202, 206,
212–214, 220, 235, 247, 261, 264,
274
Heckman, J. 162
Hengartner, N. 247
Hermans, J. 79
Hilbe, J. 162
Honoré, B. E. 163
Horowitz, J. L. 15, 162, 182–185, 274
Hristache, M. 177
Hsing, T. 162
Huang, W. 185
Huet, S. 265, 266, 269
Ichimura, H. 167, 172–174, 185
Ingster, Y. I. 135
Johnstone, I. M. 135

Jones, M. C. 57, 71, 73, 79, 110, 135


Kallenberg, W. C. M. 136
Kambour, E. L. 206
Kerkyacharian, G. 107, 135
Kim, J. T. 206
Kim, W. 247
Klaassen, C. 185
Klein, R. 176
Klipple, K. 206
Kooperberg, C. 247
Korostelev, A. 181, 182
Ledwina, T. 136
Lejeune, M. 135
Leontief, W. 212, 247
Lewbel, A. 163
Li, K.-C. 162
Li, Q. 135
Linton, O. 135, 163, 212, 216, 221,
232, 234, 236, 240, 247, 263, 274
Liu, J. S. 247
Loader, C. 135
Mack, Y. P. 100
Maddala, G. S. 162
Malin, E. 185
Mammen, E. 126, 127, 135, 202, 204,
206, 216, 221, 240, 247, 254–256,
265, 266, 269, 274
Manski, C. 185
Marron, J. S. 56–58, 79, 110, 135
Marx, B. D. 135
Masry, E. 247
McCullagh, P. 146, 151, 152, 159, 162,
206
McFadden, D. 247
Mora, J. 149
Muellbauer, J. 211, 247

Müller, H. G. 90, 92, 135
Müller, M. 98, 162, 181, 182, 202, 206

Mundlak, Y. 247
Nadaraya, E. A. 89
Nelder, J. A. 146, 151, 152, 156, 159,
162, 206
Newey, W. K. 163, 174, 185, 247
Nielsen, J. P. 212, 216, 221, 222, 234,
236, 240, 247, 274
Nolan, D. 58
Nychka, D. 185
Opsomer, J. 216, 247
Pagan, A. 22, 135, 162, 185
Park, B. U. 51, 56, 79
Picard, D. 107, 135
Ploberger, W. 135
Powell, J. L. 162, 163, 172, 179, 180
Presedo-Quindimil, M. 135
Proença, I. 180, 270
Reese, C. S. 206
Ripley, B. 162
Ritov, Y. 185
Robinson, P. M. 190, 206, 274
Rodríguez-Poo, J. M. 163, 274
Rosenblatt, M. 62, 120
Rubinfeld, D. L. 217, 219
Ruppert, D. 71, 130, 132, 135, 162,
216, 224, 225, 247

Schimek, M. G. 206
Schwalbach, J. 242
Schwert, W. 22
Scott, D. W. 30, 35, 69, 71–73, 79, 135
Severance-Lossin, E. 224, 226, 230, 235, 236
Severini, T. A. 191, 193, 201, 206
Sheather, S. J. 57, 79
Silverman, B. W. 69, 72, 73, 79, 103, 104, 135, 206
Simonoff, J. 135
Simpson, D. G. 247
Spady, R. 176
Speckman, P. E. 190, 206, 254, 274, 275
Sperlich, S. 163, 222, 224, 226, 228, 230, 232–236, 240, 242, 247, 257, 263, 265, 266, 269, 274
Spokoiny, V. 136, 247, 257
Staniswalis, J. G. 193, 201, 206
Stefanski, L. A. 247
Stock, J. H. 172, 179, 180
Stoker, T. M. 172, 179, 180, 185
Stone, C. J. 55, 135, 211, 247, 253, 259
Stuetzle, W. 162, 247
Stute, W. 135
Terrell, G. R. 79
Thall, P. F. 206
Tibshirani, R. J. 190, 197, 198, 202, 206, 212–214, 220, 235, 247, 261, 264, 274
Tjøstheim, D. 212, 228, 233, 247
Treiman, D. J. 257
Truong, Y. 247
Tsay, R. S. 247
Tsybakov, A. B. 107, 122
Turlach, B. A. 51, 79
Tutz, G. 77
Ullah, A. 135, 162, 185

van den Broek, K. 79


van der Vaart, A. W. 175
Vella, F. 163
Venables, W. N. 162
Vieu, P. 148, 247, 274
von Lieres und Wilkau, C. 240
Wahba, G. 135, 247
Wand, M. P. 71–73, 79, 110, 130, 132,
135, 162, 224, 225, 247, 256, 274

Watson, G. S. 89
Wecker, W. 247
Wedderburn, R. W. M. 151, 156, 206
Weisberg, S. 185
Wellner, J. 185
Welsh, A. H. 185
Werwatz, A. 180, 270
Whang, Y.-J. 247

Wong, W. H. 191, 206


Wu, C. 127
Yandell, B. S. 206, 274
Yang, L. 228, 233, 263, 274
Yatchew, A. 135, 162
Zheng, J. 135

Subject Index

additive model, see AM, 149, 211
  backfitting, 212
  bandwidth choice, 236
  derivative, 225
  equivalent kernel weights, 240
  finite sample behavior, 234
  hypotheses testing, 268
  interaction terms, 227
  marginal effect, 224
  marginal integration, 222
  MASE, 239
additive partial linear model, see APLM
ADE, 171, 178
AMISE
  histogram, 29
  kernel density estimation, 51
  kernel regression, 93
AMSE
  local polynomial regression, 96
APLM, 254
ASE
  regression, 109, 110
ASH, 32
asymptotic MISE, see AMISE
asymptotic MSE, see AMSE
asymptotic properties
  histogram, 24
  kernel density estimation, 46
average derivative estimator, see ADE
average shifted histogram, see ASH
averaged squared error, see ASE

backfitting, 212
  classical, 212
  GAM, 260
  GAPLM, 264
  GPLM, 197
  local scoring, 260
  modified, 219
  smoothed, 221
bandwidth
  canonical, 57
  kernel density estimation, 42
  rule of thumb, 52
bandwidth choice
  additive model, 236
  kernel density estimation, 51, 56
  kernel regression, 107
  Silverman's rule of thumb, 51
bias
  histogram, 25
  kernel density estimation, 46
  kernel regression, 93
  multivariate density estimation, 71
  multivariate regression, 130
bin, 21
binary response, 11, 146
binwidth, 21, 23
  optimal choice, 29
  rule of thumb, 30
canonical bandwidth, 57
canonical kernel, 57
canonical link function, 154

CHARN model, 121
conditional expectation, 12, 86
conditional heteroscedastic autoregressive nonlinear, see CHARN
confidence bands
  kernel density estimation, 61
  kernel regression, 120
confidence intervals
  kernel density estimation, 61
  kernel regression, 119
cross-validation, see CV
  bandwidth choice, 110
  biased, 79
  kernel density estimation, 53
  kernel regression, 113
  multivariate density estimation, 74
  pseudo-likelihood, 79
  smoothed, 79
curse of dimensionality, 4, 133, 145, 167, 216, 240
density estimation, 1
  histogram, 21
  kernel estimation, 42
  nonparametric, 1, 39
derivative estimation
  additive function, 225
  regression, 98
design
  fixed, 88
  random, 88
deviance, 156
dimension reduction, 145
Engel curve, 8
equivalent kernel, 57
equivalent kernel weights, 240
explanatory variable, 3
exponential family, 151
finite prediction error, 117
Fisher scoring algorithm, 157
fixed design, 88, 90
  Gasser-Müller estimator, 91
Fourier coefficients, 107
Fourier series, 104
frequency polygon, 35

GAM, 150, 253, 259
  backfitting, 260
  hypotheses testing, 268
  marginal integration, 262
GAPLM, 150, 264
  backfitting, 264
  hypotheses testing, 268
  marginal integration, 264
Gasser-Müller estimator, 91
Gauss-Seidel algorithm, 212
generalized additive model, see GAM
generalized additive partial linear model, see GAPLM
generalized cross-validation, 116
generalized linear model, see GLM
generalized partial linear model, see GPLM
  approximate LR test, 202
  modified LR test, 203
GLM, 151
  estimation, 154
  exponential family, 151
  Fisher scoring, 157
  hypotheses testing, 160
  IRLS, 154
  link function, 153
  Newton-Raphson, 157
GPLM, 150, 189
  backfitting, 197
  hypotheses testing, 202
  profile likelihood, 191
  Speckman estimator, 195
gradient, 71
Hessian matrix, 71
histogram, 21
  ASH, 32
  asymptotic properties, 24
  bias, 25
  binwidth choice, 29
  construction, 21
  dependence on binwidth, 23
  dependence on origin, 30
  derivation, 23
  MSE, 27
  variance, 26

hypotheses testing
  GPLM, 202
  regression, 118
i.i.d., 21
identification, 162
  AM, 213
  SIM, 167
independent and identically distributed, see i.i.d.
index, 12, 149, 167
  semiparametric, 149
integrated squared error, see ISE
interaction terms, 227
IRLS, 154
ISE
  kernel density estimation, 53
  multivariate density estimation, 74
  regression, 109
iteratively reweighted least squares, see IRLS
kernel density estimation, 39
  as a sum of bumps, 45
  asymptotic properties, 46
  bandwidth choice, 56
  bias, 46
  confidence bands, 61
  confidence intervals, 61
  dependence on bandwidth, 43
  dependence on kernel, 43
  derivation, 40
  multivariate, 66
  multivariate rule-of-thumb bandwidth, 73
  optimal bandwidth, 50
  rule-of-thumb bandwidth, 51
  variance, 48
kernel function, 42
  canonical, 57, 59
  efficiency, 60
  equivalent, 57
kernel regression, 88
  bandwidth choice, 107
  bias, 93
  confidence bands, 120
  confidence intervals, 119

  cross-validation, 113
  fixed design, 90
  Nadaraya-Watson estimator, 89
  penalizing functions, 114
  random design, 89
  statistical properties, 92
  univariate, 88
  variance, 93
k-NN, see k-nearest-neighbor
least squares, see LS
likelihood ratio, see LR
linear regression, 3
link function, 12, 146, 151, 153
  canonical, 154
  nonparametric, 148
  power function, 154
local constant, 95
local linear, 95
local polynomial
  derivative estimation, 98
  regression, 94, 95
local scoring, 260, 261
log-likelihood
  GLM, 153
  pseudo likelihood, 175
  quasi-likelihood, 156
marginal effect, 224
  derivative, 225
marginal integration, 222
  GAM, 262
  GAPLM, 264
MASE
  regression, 109, 110
maximum likelihood, see ML
maximum likelihood estimator, see MLE
mean averaged squared error, see MASE
mean integrated squared error, see MISE
mean squared error, see MSE
median smoothing, 101
MISE
  histogram, 29
  kernel density estimation, 50
  regression, 109
ML, 152, 154
MLE, 152, 154

MSE
  histogram, 27
  kernel density estimation, 49
  univariate regression, 108
multivariate density estimation, 66, 69
  bias, 71
  computation, 75
  graphical representation, 75
  variance, 71
multivariate regression, 128
  asymptotics, 130
  bias, 130
  computation, 132
  curse of dimensionality, 133
  Nadaraya-Watson estimator, 129
  variance, 130
Nadaraya-Watson estimator, 89
  multivariate, 129
k-nearest-neighbor, 98–100
Newton-Raphson algorithm
  GLM, 157
nonparametric regression, 85
  multivariate, 128
  univariate, 85
origin, 21
orthogonal series
  Fourier series, 104
orthogonal series regression, 104
orthonormality, 106
partial linear model, see PLM
pdf, 1, 21, 39
  multivariate, 66
penalizing functions, 114
  Akaike's information criterion, 117
  finite prediction error, 117
  generalized cross-validation, 116
  Rice's T, 117
  Shibata's model selector, 116
penalty term
  bandwidth choice, 110
  spline, 103
  spline smoothing, 102
PLM, 149, 189
  estimation, 191

plug-in method, 55
  refined, 79
  Silverman's rule of thumb, 51
PMLE, 171
probability density function, see pdf
profile likelihood, 191
pseudo likelihood, 174
pseudo maximum likelihood estimator, see PMLE
quasi-likelihood, 156
random design, 88, 89, 92
regression, 3
  conditional expectation, 87
  confidence bands, 118
  confidence intervals, 118
  fixed design, 88, 93
  generalized, 145
  hypotheses testing, 118
  kernel regression, 88
  linear, 3, 5
  local polynomial, 94
  median smoothing, 101
  k-nearest-neighbor, 98
  nonparametric, 7, 85
  nonparametric univariate, 85
  orthogonal series, 104
  parametric, 5
  random design, 88, 92
  semiparametric, 9, 145
  spline smoothing, 101
residual sum of squares, see RSS
resubstitution estimate, 112
Rice's T, 117
RSS, 101
rule of thumb
  histogram, 30
  kernel density estimation, 52
  multivariate density estimation, 73
semiparametric least squares, see SLS
Shibata's model selector, 116
Silverman's rule of thumb, 51
SIM, 167
  estimation, 170
  hypotheses testing, 183

  identification, 168
  PMLE, 174
  SLS, 172
  WADE, 178
single index model, see SIM
SLS, 171
smoothing spline, 104
Speckman estimator, 190, 195
spline kernel, 104
spline smoothing, 101
subset selection, 148
Taylor expansion
  first order, 26
  multivariate, 71
test
  AM, GAM, GAPLM, 268
  approximate LR test, 202
  LR test, 160
  modified LR test, 203
  SIM, 183

time series
  nonparametric, 121
variable selection, 148
variance
  histogram, 26
  kernel density estimation, 48
  kernel regression, 93
  multivariate density estimation, 71
  multivariate regression, 130
WADE, 171, 178
wage equation, 3
WARPing, 35
wavelets, 107
weighted average derivative estimator, see WADE
weighted semiparametric least squares, see WSLS
XploRe, V
