Nonlinear Least Squares Theory
CHUNG-MING KUAN
Department of Finance & CRETA
March 9, 2010
Lecture Outline
1 Nonlinear Specifications
2 The NLS Method
The NLS Estimator
Nonlinear Optimization Algorithms
3 Asymptotic Properties of the NLS Estimator
Digression: Uniform Law of Large Numbers
Consistency
Asymptotic Normality
Wald Tests
Nonlinear Specifications
Given the dependent variable y, consider the nonlinear specification:

y = f (x; β) + e(β),

where x is ℓ × 1, β is k × 1, and f is a given function. There are many
choices of f. A flexible approach is to transform one (or several) of the
regressors by the Box-Cox transform of x:

(x^γ − 1)/γ,

which yields x − 1 when γ = 1, 1 − 1/x when γ = −1, and a value close
to ln x when γ → 0.
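For concreteness, a minimal sketch of the Box-Cox transform in Python; the function name and the tolerance used for the γ → 0 limiting case are our choices:

```python
import numpy as np

def box_cox(x, gamma, eps=1e-8):
    """Box-Cox transform (x**gamma - 1)/gamma; returns ln(x) as gamma -> 0."""
    x = np.asarray(x, dtype=float)
    if abs(gamma) < eps:
        return np.log(x)              # limiting case gamma -> 0
    return (x**gamma - 1.0) / gamma

x = np.array([0.5, 1.0, 2.0])
print(box_cox(x, 1.0))    # x - 1
print(box_cox(x, -1.0))   # 1 - 1/x
print(box_cox(x, 1e-12))  # ln(x) branch, the gamma -> 0 limit
```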
The CES (constant elasticity of substitution) production function:

y = α [δ L^{−γ} + (1 − δ) K^{−γ}]^{−λ/γ},

where α > 0, 0 < δ < 1 and γ ≥ −1, which yields:

ln y = ln α − (λ/γ) ln[δ L^{−γ} + (1 − δ) K^{−γ}].
The translog (transcendental logarithmic) production function:
ln y = β1 + β2 ln L + β3 ln K + β4 (ln L)(ln K) + β5 (ln L)² + β6 (ln K)²,
which is linear in parameters; in this case, the OLS method suffices.
Nonlinear Time Series Models
An exponential autoregressive (EXPAR) model:

yt = Σ_{j=1}^{p} [αj + βj exp(−γ yt−1²)] yt−j + et .
A self-exciting threshold autoregressive (SETAR) model:

yt = a0 + a1 yt−1 + · · · + ap yt−p + et ,  if yt−d ∈ (−∞, c],
yt = b0 + b1 yt−1 + · · · + bp yt−p + et ,  if yt−d ∈ (c, ∞),

where 1 ≤ d ≤ p is the delay parameter, and c is the threshold
parameter. Alternatively,

yt = a0 + Σ_{j=1}^{p} aj yt−j + (δ0 + Σ_{j=1}^{p} δj yt−j) 1{yt−d > c} + et ,

with aj + δj = bj .
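A small simulation sketch of the SETAR model may help fix ideas; the function name, coefficient values, and the zero initial values are illustrative:

```python
import numpy as np

def simulate_setar(T, a, b, c, d, sigma=1.0, seed=0):
    """Simulate a SETAR(p) path: coefficients a = (a0, ..., ap) apply
    when y[t-d] <= c, and b = (b0, ..., bp) when y[t-d] > c."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    rng = np.random.default_rng(seed)
    p = len(a) - 1
    y = np.zeros(T + p)                      # zero initial values
    for t in range(p, T + p):
        coef = a if y[t - d] <= c else b     # pick the active regime
        lags = y[t - p:t][::-1]              # y[t-1], ..., y[t-p]
        y[t] = coef[0] + coef[1:] @ lags + sigma * rng.standard_normal()
    return y[p:]

y = simulate_setar(200, a=[0.0, 0.6], b=[1.0, -0.4], c=0.0, d=1)
```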
Replacing the indicator function in the SETAR model with a “smooth”
function h, we obtain the smooth threshold autoregressive (STAR)
model:

yt = a0 + Σ_{j=1}^{p} aj yt−j + (δ0 + Σ_{j=1}^{p} δj yt−j) h(yt−d ; c, s) + et ,

where h is a distribution function, e.g.,

h(yt−d ; c, s) = 1 / (1 + exp[−(yt−d − c)/s]),

with c the threshold value and s a scale parameter. The STAR model
admits smooth transitions between different regimes, and it behaves
like a SETAR model when |(yt−d − c)/s| is large.
Artificial Neural Networks
A 3-layer neural network can be expressed as

f (x1 , . . . , xp ; β) = g(α0 + Σ_{i=1}^{q} αi h(γi0 + Σ_{j=1}^{p} γij xj)),
which contains p input units, q hidden units, and one output unit. The
functions h and g are known as activation functions, and the parameters
in these functions are connection weights.
h is typically an S-shaped function; two leading choices are the logistic
function h(x) = 1/(1 + e^{−x}) and the hyperbolic tangent function

h(x) = (e^x − e^{−x}) / (e^x + e^{−x}).
The function g may be the identity function or the same as h.
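For concreteness, a direct transcription of this network into Python; the names and array shapes are our conventions, with g defaulting to the identity and h to the hyperbolic tangent:

```python
import numpy as np

def nn_3layer(x, alpha, gamma, g=lambda u: u, h=np.tanh):
    """f(x; beta) = g(alpha0 + sum_i alpha_i * h(gamma_i0 + gamma_i' x)).
    alpha: (q+1,) vector; gamma: (q, p+1) matrix; x: (p,) input."""
    hidden = h(gamma[:, 0] + gamma[:, 1:] @ x)   # q hidden-unit activations
    return g(alpha[0] + alpha[1:] @ hidden)      # single output unit

p, q = 3, 5
rng = np.random.default_rng(0)
out = nn_3layer(rng.standard_normal(p),
                alpha=rng.standard_normal(q + 1),
                gamma=rng.standard_normal((q, p + 1)))
```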
Artificial neural networks are designed to mimic the behavior of biological
neural systems and have the following properties.
Universal approximation: A neural network is capable of approximating
any Borel-measurable function to any degree of accuracy, provided
that q is sufficiently large. In this sense, a neural network can be
understood as a series expansion, with the hidden-unit functions
serving as the basis functions.
Parsimonious model: To achieve a given degree of approximation
accuracy, neural networks are simpler than the polynomial and
trigonometric expansions, in the sense that the number of hidden
units q can grow at a much slower rate.
The NLS Estimator
The NLS criterion function:

QT (β) = (1/T) [y − f(x1 , . . . , xT ; β)]′ [y − f(x1 , . . . , xT ; β)]
       = (1/T) Σ_{t=1}^{T} [yt − f (xt ; β)]².
The first order condition sets the gradient to zero, giving k nonlinear
equations in k unknowns:

∇β QT (β) = −(2/T) ∇β f(x1 , . . . , xT ; β) [y − f(x1 , . . . , xT ; β)] = 0,

where ∇β f(x1 , . . . , xT ; β) is a k × T matrix. A solution to the first
order condition is the NLS estimator β̂T .
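In practice the first order condition is solved numerically. A minimal sketch using scipy.optimize.least_squares; the exponential specification f (x; β) = β1 exp(β2 x), the simulated data, and all parameter values are purely illustrative:

```python
import numpy as np
from scipy.optimize import least_squares

# Illustrative specification: f(x; beta) = beta1 * exp(beta2 * x).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 1.5 * np.exp(0.8 * x) + 0.1 * rng.standard_normal(100)

def residuals(beta):
    return y - beta[0] * np.exp(beta[1] * x)   # y_t - f(x_t; beta)

fit = least_squares(residuals, x0=[1.0, 0.5])  # minimizes the NLS criterion
print(fit.x)  # NLS estimate of (beta1, beta2)
```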
[ID-2] f (x; ·) is twice continuously differentiable in the second argument
on Θ1 , such that for given data (yt , xt ), t = 1, . . . , T , ∇²β QT (β̂T ) is
positive definite.

While [ID-2] ensures that β̂T is a minimum of QT (β), it does not
guarantee uniqueness: for a given data set, there may exist multiple
local minima of QT (β).

For linear regressions, f(β) = Xβ, so that ∇β f(β) = X′ and
∇²β f(β) = 0. It follows that ∇²β QT (β) = 2(X′X)/T , which is
positive definite if, and only if, X has full column rank. Note that in
linear regression, the identification condition does not depend on β.
Nonlinear Optimization Algorithms
An NLS estimate is usually computed using a numerical method. In
particular, an iterative algorithm starts from some initial value of the
parameter and repeatedly computes the next value according to a
particular updating rule until an optimum is (approximately) reached.

A generic iterative algorithm is

β(i+1) = β(i) + s(i) d(i) .

That is, the (i + 1)th iterated value β(i+1) is obtained from β(i) with an
adjustment term s(i) d(i) , where d(i) characterizes the direction of change
in the parameter space and s(i) controls the amount of change. Note that
an iterative algorithm can only locate a local optimum.
Gradient Method
The first-order Taylor expansion of QT (β) about β† is

QT (β) ≈ QT (β†) + [∇β QT (β†)]′ (β − β†).

Replacing β with β(i+1) and β† with β(i),

QT (β(i+1)) ≈ QT (β(i)) + [∇β QT (β(i))]′ s(i) d(i) .

Setting d(i) = −g(i) , where g(i) is ∇β QT (β) evaluated at β(i) , we have

QT (β(i+1)) ≈ QT (β(i)) − s(i) g(i)′ g(i) ,

where g(i)′ g(i) ≥ 0. This leads to the gradient-descent update:

β(i+1) = β(i) − s(i) g(i) .
Steepest Descent Algorithm
To determine the optimal step length, set the derivative of QT (β(i+1))
with respect to s(i) to zero:

∂QT (β(i+1))/∂s(i) = [∇β QT (β(i+1))]′ ∂β(i+1)/∂s(i) = −g(i+1)′ g(i) = 0.

Let H(i) denote ∇²β QT (β) evaluated at β(i) . By Taylor’s expansion of g,

g(i+1) ≈ g(i) + H(i) (β(i+1) − β(i)) = g(i) − s(i) H(i) g(i) .

Thus, 0 = g(i+1)′ g(i) ≈ g(i)′ g(i) − s(i) g(i)′ H(i) g(i) , or equivalently,

s(i) = g(i)′ g(i) / (g(i)′ H(i) g(i)) ≥ 0,

when H(i) is p.d. We obtain the steepest descent algorithm:

β(i+1) = β(i) − [ g(i)′ g(i) / (g(i)′ H(i) g(i)) ] g(i) .
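A minimal sketch of one steepest-descent step with this optimal step length; the quadratic objective, its coefficient matrix, and the iteration count are illustrative:

```python
import numpy as np

def steepest_descent_step(beta, grad, hess):
    """One update beta - s*g with the optimal step length
    s = g'g / (g'Hg), valid when H is positive definite."""
    g, H = grad(beta), hess(beta)
    s = (g @ g) / (g @ H @ g)
    return beta - s * g

# Illustrative quadratic objective Q(beta) = 0.5 * beta' A beta,
# with gradient A @ beta and Hessian A; the minimizer is the origin.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
beta = np.array([1.0, -1.0])
for _ in range(50):
    beta = steepest_descent_step(beta, lambda b: A @ b, lambda b: A)
print(beta)  # converges toward (0, 0)
```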
Newton Method
The Newton method takes into account the second order derivatives.
Consider the second-order Taylor expansion of QT (β) around some β†:

QT (β) ≈ QT (β†) + g†′ (β − β†) + (1/2) (β − β†)′ H† (β − β†).

The first order condition of QT (β) is g† + H† (β − β†) ≈ 0, so that

β ≈ β† − (H†)^{−1} g† .

This suggests the following Newton-Raphson algorithm:

β(i+1) = β(i) − (H(i))^{−1} g(i) ,

with the step length 1 and the direction vector −(H(i))^{−1} g(i) .
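As a quick illustration, a minimal Newton-Raphson step in Python; the quadratic objective is our choice, and on it the second-order expansion is exact, so the minimum is reached in a single step, as noted below:

```python
import numpy as np

def newton_step(beta, grad, hess):
    """One Newton-Raphson update: beta - H(beta)^{-1} g(beta)."""
    return beta - np.linalg.solve(hess(beta), grad(beta))

# On the quadratic Q(beta) = 0.5 * beta' A beta one step lands on the
# minimizer (here, the origin).
A = np.array([[2.0, 0.3], [0.3, 1.0]])
beta = newton_step(np.array([1.0, -1.0]), lambda b: A @ b, lambda b: A)
print(beta)  # (0, 0) up to rounding
```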
From Taylor’s expansion it is easy to see that

QT (β(i+1)) − QT (β(i)) ≈ −(1/2) g(i)′ (H(i))^{−1} g(i) ≤ 0,

provided that H(i) is p.s.d. Thus, the Newton-Raphson algorithm usually
results in a decrease of QT .
When QT is (locally) quadratic, the second-order expansion is exact, so
that β = β† − (H†)^{−1} g† must be a minimum of QT (β). This immediately
suggests that the Newton-Raphson algorithm can reach the minimum in a
single step. Yet, there are two drawbacks:
The Hessian matrix need not be positive definite.
The Hessian matrix must be inverted at each iteration step.
Gauss-Newton Algorithm
Letting Ξ(β) = ∇β f(β)′ (a T × k matrix), we have

H(β) = −(2/T) ∇²β f(β)[y − f(β)] + (2/T) Ξ(β)′ Ξ(β).

Ignoring the first term, an approximation to H(β) is 2 Ξ(β)′ Ξ(β)/T ,
which requires only the first order derivatives and is guaranteed to be
p.s.d. The Gauss-Newton algorithm utilizes this approximation as

β(i+1) = β(i) + [Ξ(β(i))′ Ξ(β(i))]^{−1} Ξ(β(i))′ [y − f(β(i))].

Note that the adjustment term can be obtained as the OLS estimate from
regressing y − f(β(i)) on Ξ(β(i)); this is known as the Gauss-Newton
regression.
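A minimal sketch of the Gauss-Newton iteration under these definitions; the model, data, and helper names (gauss_newton, jac) are our choices, and each update is computed exactly as the OLS coefficient of the Gauss-Newton regression:

```python
import numpy as np

def gauss_newton(y, f, jac, beta0, max_iter=50, tol=1e-8):
    """Gauss-Newton iterations: the adjustment term is the OLS estimate
    from regressing y - f(beta) on Xi(beta)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        u = y - f(beta)                        # current residuals
        Xi = jac(beta)                         # T x k derivative matrix
        step, *_ = np.linalg.lstsq(Xi, u, rcond=None)
        beta = beta + step
        if np.linalg.norm(step) < tol:         # a convergence criterion
            break
    return beta

# Illustrative model: f(x; beta) = beta1 * exp(beta2 * x).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 1.5 * np.exp(0.8 * x) + 0.1 * rng.standard_normal(100)
f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
print(gauss_newton(y, f, jac, beta0=[1.0, 0.5]))
```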
Other Modifications
To maintain a correct search direction, H(i) needs to be p.d.

Correcting H(i) by Hc(i) = H(i) + c(i) I, where c(i) > 0 is chosen to
“force” Hc(i) to be p.d.

For H̃(i) = (H(i))^{−1}, one may compute H̃c(i) = H̃(i) + cI. Such a correction
is used in the Marquardt-Levenberg algorithm.

The quasi-Newton method corrects H̃(i) by a symmetric matrix:

H̃(i+1) = H̃(i) + C(i) .

This is used by the Davidon-Fletcher-Powell (DFP) algorithm and the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
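For illustration, a sketch of the first correction above, Hc(i) = H(i) + c(i) I, with c increased until positive definiteness holds; the starting value of c and the multiplication factor are arbitrary choices in the spirit of the Marquardt-Levenberg safeguard:

```python
import numpy as np

def corrected_newton_step(beta, grad, hess, c=1e-2):
    """Newton step with the correction Hc = H + c*I: increase c until
    Hc is positive definite, then move along -Hc^{-1} g."""
    g, H = grad(beta), hess(beta)
    I = np.eye(len(beta))
    while True:
        try:
            np.linalg.cholesky(H + c * I)   # succeeds iff Hc is p.d.
            break
        except np.linalg.LinAlgError:
            c *= 10.0                       # "force" positive definiteness
    return beta - np.linalg.solve(H + c * I, g)
```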
Initial Values and Convergence Criteria
Initial values: Specified by the researcher or obtained using a random
number generator. Prior information, if available, should also be
taken into account.
Convergence criteria:
‖β(i+1) − β(i)‖ < c, where ‖ · ‖ denotes the Euclidean norm,
‖g(β(i))‖ < c, or
|QT (β(i+1)) − QT (β(i))| < c.
For the Gauss-Newton algorithm, one may stop the algorithm when
TR² is “close” to zero, where R² is the coefficient of determination of
the Gauss-Newton regression.
Digression: Uniform Law of Large Numbers
Consider the function q(zt (ω); θ). It is a random variable for a given θ
and a function of θ for a given ω. Suppose {q(zt ; θ)} obeys a SLLN for
each θ ∈ Θ:

QT (ω; θ) = (1/T) Σ_{t=1}^{T} q(zt (ω); θ) →a.s. Q(θ),

where Q(θ) is non-stochastic. Note that the non-convergence set
Ω0^c(θ) = {ω : QT (ω; θ) ↛ Q(θ)} varies with θ.

Although IP(Ω0^c(θ)) = 0, ∪θ∈Θ Ω0^c(θ) is an uncountable union of
non-convergence sets and may not have probability zero.

Equivalently, ∩θ∈Θ Ω0(θ) may occur with probability less than one.
When θ also depends on T (e.g., when θ is replaced by an estimator θ̃T ),
there may not exist a finite T∗ such that QT (ω; θ̃T ) is arbitrarily close to
Q(θ̃T ) for all T > T∗ . Thus, we need a notion of convergence that is
uniform on the parameter space Θ.

We say that QT (ω; θ) converges to Q(θ) uniformly in θ almost surely (in
probability) if

sup_{θ∈Θ} |QT (θ) − Q(θ)| → 0, a.s. (in probability).

We then say that {q(zt (ω); θ)} obeys a strong (or weak) uniform law of
large numbers (SULLN or WULLN).
Example: Let zt be i.i.d. with zero mean and

qT (zt (ω); θ) = zt (ω) +
  Tθ,       0 ≤ θ ≤ 1/(2T),
  1 − Tθ,   1/(2T) < θ ≤ 1/T,
  0,        1/T < θ < ∞.

Observe that for θ ≥ 1/T and θ = 0,

QT (ω; θ) = (1/T) Σ_{t=1}^{T} qT (zt ; θ) = (1/T) Σ_{t=1}^{T} zt →a.s. 0,

by Kolmogorov’s SLLN. For a given θ, we can choose T large enough such
that QT (ω; θ) →a.s. 0, where 0 is the pointwise limit. Yet for Θ = [0, ∞),

sup_{θ∈Θ} |QT (ω; θ)| = |z̄T + 1/2| →a.s. 1/2,

so that the uniform limit is different from the pointwise limit.
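A small simulation of this example (seed and grid size are arbitrary) shows the supremum staying near 1/2 while each pointwise value tends to 0:

```python
import numpy as np

def Q_T(z_bar, theta, T):
    """Q_T(omega; theta) = z_bar + tent(theta); the tent peaks at 1/2
    when theta = 1/(2T) and vanishes for theta >= 1/T."""
    if theta <= 1.0 / (2 * T):
        return z_bar + T * theta
    if theta <= 1.0 / T:
        return z_bar + 1.0 - T * theta
    return z_bar

rng = np.random.default_rng(0)
for T in (10, 1000, 100000):
    z_bar = rng.standard_normal(T).mean()    # mean of i.i.d. z_t
    grid = np.linspace(0.0, 2.0 / T, 2001)   # covers the tent region
    sup_q = max(abs(Q_T(z_bar, th, T)) for th in grid)
    print(T, round(sup_q, 3))  # tends to 1/2, not the pointwise limit 0
```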
What is the extra condition needed to ensure a SULLN if we already have,
for each θ ∈ Θ,

QT (θ) = (1/T) Σ_{t=1}^{T} [qTt (zt ; θ) − IE(qTt (zt ; θ))] →a.s. 0?

Suppose QT (θ) satisfies a Lipschitz-type condition: for θ and θ† in Θ,

|QT (θ) − QT (θ†)| ≤ CT ‖θ − θ†‖ a.s.,

where |CT | ≤ ∆ a.s. and ∆ does not depend on θ. Then,

sup_{θ∈Θ} |QT (θ)| ≤ sup_{θ∈Θ} |QT (θ) − QT (θ†)| + |QT (θ†)|.
Given ε > 0, we can choose θ† such that ‖θ − θ†‖ < ε/(2∆). Then,

sup_{θ∈Θ} |QT (θ) − QT (θ†)| ≤ CT ε/(2∆) ≤ ε/2,

uniformly in T. (For compact Θ, finitely many such θ† cover Θ, and the
same argument applies to each piece.) Also, by pointwise convergence of
QT , |QT (θ†)| < ε/2 for large T. Consequently, for all T sufficiently large,

sup_{θ∈Θ} |QT (θ)| ≤ ε.

This shows that pointwise convergence and a Lipschitz condition on QT
together suffice for a SULLN or WULLN.
Consistency
The NLS criterion function is QT (β) = (1/T) Σ_{t=1}^{T} [yt − f (xt ; β)]², and its
minimizer is the NLS estimator β̂T . Suppose IE[QT (β)] is continuous on
Θ1 such that βo is its unique, global minimum. If QT (β) is close to
IE[QT (β)], we would expect β̂T to be close to βo .

To see this, assume that QT obeys a SULLN:

sup_{β∈Θ1} |QT (β) − IE[QT (β)]| → 0,

for all ω ∈ Ω0 with IP(Ω0 ) = 1. Set

ε = inf_{β∈B^c∩Θ1} ( IE[QT (β)] − IE[QT (βo )] ),

for an open neighborhood B of βo .
For ω ∈ Ω0 , we have for large T that IE[QT (β̂T )] − QT (β̂T ) < ε/2, and

QT (β̂T ) − IE[QT (βo )] ≤ QT (βo ) − IE[QT (βo )] < ε/2,

because the NLS estimator β̂T minimizes QT (β). It follows that

IE[QT (β̂T )] − IE[QT (βo )]
  ≤ {IE[QT (β̂T )] − QT (β̂T )} + {QT (β̂T ) − IE[QT (βo )]} < ε,

for all T sufficiently large. As IE[QT (β̂T )] is within ε of IE[QT (βo )] with
probability one, β̂T cannot lie outside the neighborhood B of βo (by the
definition of ε). As B is arbitrary, β̂T must converge to βo almost surely.
Q: How do we ensure a SULLN or WULLN?
If Θ1 is compact and convex, we have from the mean-value theorem and
the Cauchy-Schwarz inequality that

|QT (β) − QT (β†)| ≤ ‖∇β QT (β‡)‖ ‖β − β†‖ a.s.,

where β‡ is the mean value of β and β†, in the sense that
‖β‡ − β†‖ ≤ ‖β − β†‖. Hence, the Lipschitz-type condition would hold for

CT = sup_{β∈Θ1} ‖∇β QT (β)‖,

with ∇β QT (β) = −(2/T) Σ_{t=1}^{T} ∇β f (xt ; β)[yt − f (xt ; β)]. Note that
∇β QT (β) may be bounded in probability, but it may not be bounded in
an almost sure sense. (Why?)
We impose the following conditions.

[C1] {(yt , w′t )′} is a sequence of random vectors, and xt is a vector
containing some elements of Y^{t−1} and W^t .
(i) The sequences {yt²}, {yt f (xt ; β)} and {f (xt ; β)²} all obey a WLLN for
each β in Θ1 , where Θ1 is compact and convex.
(ii) yt , f (xt ; β) and ∇β f (xt ; β) all have bounded second moments
uniformly in β.

[C2] There exists a unique parameter vector βo such that
IE(yt | Y^{t−1}, W^t ) = f (xt ; βo ).

Theorem 8.1
Given the nonlinear specification y = f (x; β) + e(β), suppose that [C1]
and [C2] hold. Then β̂T →IP βo .
Remark: Theorem 8.1 is not fully satisfactory because it deals only with
convergence to the global minimum, yet an iterative algorithm is not
guaranteed to find a global minimum of the NLS objective function.
Hence, it is more reasonable to expect the NLS estimator to converge to
some local minimum of IE[QT (β)]. Therefore, we shall, in what follows,
assert only that the NLS estimator converges in probability to a local
minimum β∗ of IE[QT (β)]. In this case, f (x; β∗) is, at most, an
approximation to the conditional mean function.
Asymptotic Normality
By the mean-value expansion of ∇β QT (β̂T ) about β∗,

0 = ∇β QT (β̂T ) = ∇β QT (β∗) + ∇²β QT (β†T )(β̂T − β∗),

where β†T is a mean value of β̂T and β∗. Thus, when ∇²β QT (β†T ) is
invertible, we have

√T (β̂T − β∗) = −[∇²β QT (β†T )]^{−1} √T ∇β QT (β∗)
             = −HT (β∗)^{−1} √T ∇β QT (β∗) + oIP (1),

where HT (β) = IE[∇²β QT (β)]. That is, √T (β̂T − β∗) and
−HT (β∗)^{−1} √T ∇β QT (β∗) are asymptotically equivalent.
Under suitable conditions,

√T ∇β QT (β∗) = −(2/√T) Σ_{t=1}^{T} ∇β f (xt ; β∗)[yt − f (xt ; β∗)]

obeys a CLT, i.e., (V∗T )^{−1/2} √T ∇β QT (β∗) →D N(0, Ik ), where

V∗T = var( (2/√T) Σ_{t=1}^{T} ∇β f (xt ; β∗)[yt − f (xt ; β∗)] ).

Then for D∗T = HT (β∗)^{−1} V∗T HT (β∗)^{−1},

(D∗T )^{−1/2} HT (β∗)^{−1} √T ∇β QT (β∗) →D N(0, Ik ).

By asymptotic equivalence,

(D∗T )^{−1/2} √T (β̂T − β∗) →D N(0, Ik ).
When D∗T is replaced by a consistent estimator D̂T ,

D̂T^{−1/2} √T (β̂T − β∗) →D N(0, Ik ).

Note that

HT (β∗) = (2/T) Σ_{t=1}^{T} IE[∇β f (xt ; β∗) ∇β f (xt ; β∗)′]
        − (2/T) Σ_{t=1}^{T} IE[∇²β f (xt ; β∗)(yt − f (xt ; β∗))],

which can be consistently estimated by its sample counterpart:

ĤT = (2/T) Σ_{t=1}^{T} ∇β f (xt ; β̂T ) ∇β f (xt ; β̂T )′
   − (2/T) Σ_{t=1}^{T} ∇²β f (xt ; β̂T ) êt .
When εt = yt − f (xt ; β∗) are uncorrelated with ∇²β f (xt ; β∗), HT (β∗)
depends only on the expectation of the outer product of ∇β f (xt ; β∗), so
that ĤT may be simplified as

ĤT = (2/T) Σ_{t=1}^{T} ∇β f (xt ; β̂T ) ∇β f (xt ; β̂T )′ .

This is analogous to estimating Mxx by Σ_{t=1}^{T} xt x′t /T in linear regressions.

If {εt } is not a martingale difference sequence with respect to Y^{t−1} and
W^t , V∗T can be consistently estimated using a Newey-West type
estimator. This is more likely in practice, as the NLS estimator typically
converges to a local optimum β∗.
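A sketch of these estimators for the martingale-difference case: the simplified ĤT and an outer-product estimate of V∗T , combined into the sandwich D̂T . The function and variable names are our choices, and a Newey-West version would add weighted autocovariances of the scores:

```python
import numpy as np

def nls_sandwich_cov(grad_f, e_hat):
    """Sandwich estimator D_hat = H_hat^{-1} V_hat H_hat^{-1}, with the
    simplified H_hat = (2/T) sum g_t g_t' and the outer-product estimate
    V_hat = (1/T) sum s_t s_t', where s_t = 2 * g_t * e_t are the scores.
    grad_f: T x k matrix whose t-th row is grad f(x_t; beta_hat)'."""
    T = len(e_hat)
    H_hat = (2.0 / T) * grad_f.T @ grad_f
    scores = 2.0 * grad_f * e_hat[:, None]   # summands of sqrt(T)*grad Q_T
    V_hat = scores.T @ scores / T
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ V_hat @ H_inv             # estimates D*_T
```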
Wald Tests
Hypothesis: Rβ∗ = r, where R is a q × k selection matrix and r is a
q × 1 vector of pre-specified constants.

By the asymptotic normality result, we have under the null that

Γ̂T^{−1/2} √T R(β̂T − β∗) = Γ̂T^{−1/2} √T (Rβ̂T − r) →D N(0, Iq ),

where Γ̂T = R D̂T R′, and D̂T is a consistent estimator for D∗T .

The Wald statistic is

WT = T (Rβ̂T − r)′ Γ̂T^{−1} (Rβ̂T − r) →D χ²(q).

For nonlinear restrictions r(β∗) = 0, the Wald test is not invariant
with respect to the form of r(β) = 0.
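A minimal implementation of the Wald statistic under these definitions; the function name is ours, and D_hat is assumed to estimate D∗T consistently:

```python
import numpy as np
from scipy import stats

def wald_test(beta_hat, D_hat, R, r, T):
    """W_T = T (R b - r)' [R D_hat R']^{-1} (R b - r), chi-square(q)
    under the null; D_hat estimates the covariance of sqrt(T)(b - beta*)."""
    diff = R @ beta_hat - r
    Gamma_hat = R @ D_hat @ R.T
    W = T * diff @ np.linalg.solve(Gamma_hat, diff)
    return W, stats.chi2.sf(W, df=len(r))   # statistic and p-value
```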