
Lecture 10 ECE 275A

Vector Derivatives and the Gradient


ECE 275AB Lecture 10, Fall 2008, V1.1, © K. Kreutz-Delgado, UC San Diego
Objective Functions $\ell(x)$

Let $\ell(x)$ be a real-valued function (aka functional) of an $n$-dimensional real vector $x \in \mathcal{X} = \mathbb{R}^n$,

$$\ell(\cdot) : \mathcal{X} = \mathbb{R}^n \to \mathcal{Y} = \mathbb{R}$$

where $\mathcal{X}$ is an $n$-dimensional real Hilbert space with metric matrix $\Omega = \Omega^T > 0$. Note that henceforth vectors in $\mathcal{X}$ are represented as column vectors in $\mathbb{R}^n$. Because here the value space $\mathcal{Y} = \mathbb{R}$ is one-dimensional, wlog we can take $\mathcal{Y}$ to be Cartesian (because all (necessarily positive scalar) metric weightings yield inner products and norms that are equivalent up to an overall positive scaling).

We will call $\ell(x)$ an objective function. If $\ell(x)$ is a cost, loss, or penalty function, then we assume that it is bounded from below and we attempt to minimize it wrt $x$. If $\ell(x)$ is a profit, gain, or reward function, then we assume that it is bounded from above and we attempt to maximize it wrt $x$.

For example, suppose we wish to match a model pdf $p_x(y)$ to a true, but unknown, density $p_{x_0}(y)$ for an observed random vector, where we assume that $p_x(y)$ and $p_{x_0}(y)$ have common support for all $x$. We can then take as a penalty function of $x$ a measure of (non-averaged or instantaneous) divergence or discrepancy $D_I(x_0 \| x)$ of the model pdf $p_x(y)$ from the true pdf $p_{x_0}(y)$, defined by

$$D_I(x_0 \| x) \triangleq \log \frac{p_{x_0}(y)}{p_x(y)} = \log p_{x_0}(y) - \log p_x(y)$$
Instantaneous Divergence & Negative Log-likelihood

Note that minimizing the instantaneous divergence $D_I(x_0 \| x)$ is equivalent to maximizing the log-likelihood $\log p_x(y)$ or minimizing the negative log-likelihood

$$\ell(x) = -\log p_x(y)$$

A special case arises from use of the nonlinear Gaussian additive noise model with known noise covariance $C$ and known mean function $h(\cdot)$,

$$y = h(x) + n, \quad n \sim N(0, C) \quad \Longrightarrow \quad y \sim N(h(x), C)$$

which yields the nonlinear weighted least-squares problem

$$\ell(x) = -\log p_x(y) \doteq \|y - h(x)\|^2_W, \quad W = C^{-1}$$

Further setting $h(x) = Ax$ yields the linear weighted least-squares problem we have already discussed,

$$\ell(x) = -\log p_x(y) \doteq \|y - Ax\|^2_W, \quad W = C^{-1}$$

The symbol $\doteq$ denotes the fact that we are ignoring additive terms and multiplicative factors which are irrelevant for the purposes of obtaining an extremum (here, a minimum) of the loss function. Of course we cannot ignore these terms if we are interested in the optimal value of the loss function itself.
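To make the $\doteq$ relationship concrete, here is a minimal numerical sketch (not from the lecture; it assumes NumPy/SciPy and an arbitrary random example model) checking that the Gaussian negative log-likelihood of the linear model differs from $\tfrac{1}{2}\|y - Ax\|^2_W$ only by an $x$-independent constant.

```python
# Minimal sketch (assumed example, not from the lecture): verify that the Gaussian
# negative log-likelihood -log p_x(y) of the linear model y = A x + n, n ~ N(0, C),
# equals (1/2)||y - A x||^2_W plus an x-independent constant, with W = C^{-1}.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))
L = rng.standard_normal((m, m))
C = L @ L.T + m * np.eye(m)          # a valid (positive definite) noise covariance
W = np.linalg.inv(C)                 # weighting matrix W = C^{-1}
y = rng.standard_normal(m)

def nll(x):
    return -multivariate_normal(mean=A @ x, cov=C).logpdf(y)

def half_wls(x):
    e = y - A @ x
    return 0.5 * e @ W @ e

# The difference should be the same constant (1/2 log det(2*pi*C)) for every x.
for _ in range(3):
    x = rng.standard_normal(n)
    print(nll(x) - half_wls(x))      # identical values, independent of x
```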
Stationary Points and the Vector Partial Derivative

Henceforth let the real scalar function $\ell(x)$ be twice partially differentiable with respect to all components of $x \in \mathbb{R}^n$. A necessary condition for $x$ to be a local extremum (maximum or minimum) of $\ell(x)$ is that

$$\frac{\partial}{\partial x}\ell(x) \triangleq \begin{pmatrix} \frac{\partial \ell(x)}{\partial x_1} & \cdots & \frac{\partial \ell(x)}{\partial x_n} \end{pmatrix}_{1 \times n} = 0$$

where the vector partial derivative operator

$$\frac{\partial}{\partial x} \triangleq \begin{pmatrix} \frac{\partial}{\partial x_1} & \cdots & \frac{\partial}{\partial x_n} \end{pmatrix}$$

is defined as a row operator. (See the extensive discussion in the Lecture Supplement on the Real Vector Derivative.)

A vector that satisfies $\frac{\partial}{\partial x}\ell(x) = 0$ is known as a stationary point of $\ell(x)$. Stationary points are points at which $\ell(x)$ has a local maximum, minimum, or inflection.

Sufficient conditions for a stationary point to be a local extremum require that we develop a theory of vector differentiation that will allow us to clearly and succinctly discuss second-order derivative properties of objective functions.
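As a quick numerical illustration (my own sketch, assuming NumPy; the quadratic test function is arbitrary), the $1 \times n$ row derivative can be approximated by central differences and checked to vanish at a known minimizer.

```python
# Sketch (assumed example): approximate the 1 x n row derivative of a scalar
# function by central differences and confirm it is ~0 at a stationary point.
import numpy as np

def row_derivative(f, x, h=1e-6):
    """Finite-difference approximation of the 1 x n row vector d f / d x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g                      # stored as a 1-D array, read as a row

# Example objective: l(x) = (x - b)^T Q (x - b) with Q > 0, minimized at x = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -2.0])
l = lambda x: (x - b) @ Q @ (x - b)

print(row_derivative(l, np.array([0.0, 0.0])))   # nonzero away from the minimum
print(row_derivative(l, b))                      # ~[0, 0] at the stationary point
```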
Derivative of a Vector-Valued Function: The Jacobian

Let $f(x) \in \mathbb{R}^m$ have elements $f_i(x)$, $i = 1, \ldots, m$, which are all differentiable with respect to the components of $x \in \mathbb{R}^n$.

We define the vector partial derivative of the vector function $f(x)$ as

$$J_f(x) \triangleq \frac{\partial}{\partial x} f(x) = \begin{pmatrix} \frac{\partial}{\partial x} f_1(x) \\ \vdots \\ \frac{\partial}{\partial x} f_m(x) \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{pmatrix}_{m \times n}$$

The matrix $J_f(x) = \frac{\partial}{\partial x} f(x)$ is known as the Jacobian matrix or operator of the mapping $f(x)$. It is the linearization of the nonlinear mapping $f(x)$ at the point $x$. Often we write $y = f(x)$ and write the corresponding Jacobian as $J_y(x)$.

If $m = n$, then $y = f(x)$ can be viewed as a change of variables, in which case $\det J_y(x)$ is known as the Jacobian of the transformation. $\det J_y(x)$ plays a fundamental role in the change-of-variables formula for pdfs.
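The following sketch (my own, assuming NumPy; the 2-D polar-to-Cartesian map is just an example) builds the Jacobian column by column with finite differences and checks that it linearizes the map near a point.

```python
# Sketch (assumed example): finite-difference Jacobian of f: R^n -> R^m and a
# check that f(x + dx) ~ f(x) + J_f(x) dx for small dx (the linearization).
import numpy as np

def jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

# Example map (polar -> Cartesian coordinates), m = n = 2.
f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])

x0 = np.array([2.0, 0.5])
J = jacobian(f, x0)
dx = 1e-3 * np.array([1.0, -1.0])
print(f(x0 + dx) - (f(x0) + J @ dx))   # ~0: the Jacobian linearizes f at x0
print(np.linalg.det(J))                # the "Jacobian of the transformation" (= r here)
```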
The Jacobian Cont.

- A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is locally one-to-one in an open neighborhood of $x$ if and only if its Jacobian (linearization) $J_f(\xi) = \frac{\partial}{\partial x} f(\xi)$ is a one-to-one matrix for all points $\xi \in \mathbb{R}^n$ in the open neighborhood of $x$.

- A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is locally onto an open neighborhood of $y = f(x)$ if and only if its Jacobian (linearization) $J_f(\xi) = \frac{\partial}{\partial x} f(\xi)$ is an onto matrix for all points $\xi \in \mathbb{R}^n$ in the corresponding neighborhood of $x$.

- A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is locally invertible in an open neighborhood of $x$ if and only if it is locally one-to-one and onto in the open neighborhood of $x$, which is true if and only if its Jacobian (linearization) $J_f(\xi) = \frac{\partial}{\partial x} f(\xi)$ is a one-to-one and onto (and hence invertible) matrix for all points $\xi \in \mathbb{R}^n$ in the open neighborhood of $x$. This is known as the Inverse Function Theorem.
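Here is a small illustration (my own sketch, assuming NumPy; the example map is the same hypothetical polar-to-Cartesian map as above): checking the rank of the Jacobian at a point as a practical test of the one-to-one / onto / invertible conditions.

```python
# Sketch (assumed example): use the numerical rank of the Jacobian at a point as a
# proxy for the local one-to-one (full column rank), onto (full row rank), and
# invertible (full rank, m = n) conditions of the Inverse Function Theorem.
import numpy as np

def jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])  # polar -> Cartesian

for x0 in (np.array([2.0, 0.5]), np.array([0.0, 0.5])):   # r = 0 is the degenerate case
    J = jacobian(f, x0)
    r = np.linalg.matrix_rank(J)
    verdict = "locally invertible" if r == J.shape[0] == J.shape[1] else "not invertible here"
    print(x0, "rank", r, verdict)
```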
Vector Derivative Identities

$$\frac{\partial\, c^T x}{\partial x} = c^T \quad \text{for an arbitrary vector } c$$

$$\frac{\partial\, Ax}{\partial x} = A \quad \text{for an arbitrary matrix } A$$

$$\frac{\partial\, g^T(x) h(x)}{\partial x} = g^T(x)\,\frac{\partial h(x)}{\partial x} + h^T(x)\,\frac{\partial g(x)}{\partial x}, \quad g(x)^T h(x) \text{ scalar}$$

$$\frac{\partial\, x^T A x}{\partial x} = x^T A + x^T A^T \quad \text{for an arbitrary matrix } A$$

$$\frac{\partial\, x^T \Omega x}{\partial x} = 2 x^T \Omega \quad \text{when } \Omega = \Omega^T$$

$$\frac{\partial\, h(g(x))}{\partial x} = \frac{\partial h}{\partial g}\,\frac{\partial g}{\partial x} \quad \text{(Chain Rule)}$$

Note that the last identity is a statement about Jacobians and can be restated in an illuminating manner as

$$J_{h \circ g} = J_h J_g \qquad (1)$$

which says that the linearization of a composition is the composition of the linearizations.
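A short numerical check (my own sketch, assuming NumPy; the matrices and functions are arbitrary examples) of two of these identities: the quadratic-form rule and the chain rule stated as $J_{h\circ g} = J_h J_g$.

```python
# Sketch (assumed example): numerically verify d(x^T A x)/dx = x^T A + x^T A^T and
# the chain rule J_{h o g}(x) = J_h(g(x)) J_g(x) using finite differences.
import numpy as np

def jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * h)
    return J

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

# Quadratic-form identity: the 1 x n row derivative of x^T A x.
lhs = jacobian(lambda z: z @ A @ z, x)           # 1 x 3 finite-difference row
rhs = (x @ A + x @ A.T)[None, :]                 # x^T A + x^T A^T
print(np.allclose(lhs, rhs, atol=1e-4))          # True

# Chain rule: h(g(x)) with g: R^3 -> R^2, h: R^2 -> R^2.
g = lambda z: np.array([z[0] * z[1], np.sin(z[2])])
hfun = lambda u: np.array([np.exp(u[0]), u[0] * u[1]])
J_comp = jacobian(lambda z: hfun(g(z)), x)
print(np.allclose(J_comp, jacobian(hfun, g(x)) @ jacobian(g, x), atol=1e-4))  # True
```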
Application to Linear Gaussian Model

Stationary points of

$$\ell(x) = -\log p_x(y) \doteq \|e(x)\|^2_W, \quad \text{where } e(x) = y - Ax \text{ and } W = C^{-1},$$

satisfy

$$0 = \frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial e}\,\frac{\partial e}{\partial x} = (2 e^T W)(-A) = -2\,(y - Ax)^T W A$$

or

$$A^T W (y - Ax) = 0 \iff e(x) = y - Ax \in \mathcal{N}(A^*) \quad \text{with } A^* = A^T W$$

Therefore stationary points satisfy the Normal Equation

$$A^T W A\, x = A^T W y \iff A^* A\, x = A^* y$$
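A minimal numerical sketch (mine, assuming NumPy; $A$, $C$, $y$ are random examples) of the Normal Equation: solve $A^T W A\,x = A^T W y$ and confirm that the residual is annihilated by $A^* = A^T W$ and that the row derivative vanishes there.

```python
# Sketch (assumed example): solve the weighted normal equation A^T W A x = A^T W y
# and check the stationarity conditions derived above.
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))                     # full column rank with probability 1
Lc = rng.standard_normal((m, m))
C = Lc @ Lc.T + m * np.eye(m)
W = np.linalg.inv(C)
y = rng.standard_normal(m)

x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # weighted least-squares solution

e = y - A @ x_hat                                   # residual e(x_hat)
print(A.T @ W @ e)                                  # ~0: e(x_hat) is in N(A*), A* = A^T W
print(-2 * (y - A @ x_hat) @ W @ A)                 # ~0: the row derivative dl/dx at x_hat
```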
The Hessian of an Objective Function

The Hessian, or matrix of second partial derivatives of $\ell(x)$, is defined by

$$H(x) \triangleq \frac{\partial}{\partial x}\left(\frac{\partial}{\partial x}\ell(x)\right)^T = \begin{pmatrix} \frac{\partial^2 \ell(x)}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 \ell(x)}{\partial x_n \partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \ell(x)}{\partial x_1 \partial x_n} & \cdots & \frac{\partial^2 \ell(x)}{\partial x_n \partial x_n} \end{pmatrix}_{n \times n}$$

As a consequence of the fact that

$$\frac{\partial^2 \ell(x)}{\partial x_i \partial x_j} = \frac{\partial^2 \ell(x)}{\partial x_j \partial x_i}$$

the Hessian is obviously symmetric, $H(x) = H^T(x)$.
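A quick sketch (mine, assuming NumPy; the test function is arbitrary) that builds the Hessian by differencing the finite-difference row derivative and confirms its symmetry.

```python
# Sketch (assumed example): finite-difference Hessian H(x) of a scalar function,
# built as the derivative of the (transposed) row derivative, plus a symmetry check.
import numpy as np

def row_derivative(f, x, h=1e-5):
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    Hm = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        Hm[:, j] = (row_derivative(f, x + e) - row_derivative(f, x - e)) / (2 * h)
    return Hm

l = lambda x: x[0]**2 * x[1] + np.sin(x[1]) * x[2] + x[2]**4
x0 = np.array([1.0, 2.0, -0.5])
H = hessian(l, x0)
print(np.allclose(H, H.T, atol=1e-4))   # True: mixed partials commute
```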
Vector Taylor Series Expansion

Taylor series expansion of a scalar-valued function $\ell(x)$ about a point $x_0$ to second order in $\Delta x = x - x_0$:

$$\ell(x_0 + \Delta x) = \ell(x_0) + \frac{\partial \ell(x_0)}{\partial x}\,\Delta x + \frac{1}{2}\,\Delta x^T H(x_0)\, \Delta x + \text{h.o.t.}$$

where $H$ is the Hessian of $\ell(x)$.

Taylor series expansion of a vector-valued function $h(x)$ about a point $x_0$ to first order in $\Delta x = x - x_0$:

$$h(x) = h(x_0 + \Delta x) = h(x_0) + \frac{\partial h(x_0)}{\partial x}\,\Delta x + \text{h.o.t.}$$

To obtain notationally uncluttered expressions for higher-order expansions, one switches to the use of tensor notation.
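As an illustration (my own sketch, assuming NumPy; the smooth test function and its hand-computed derivatives are an arbitrary example), the error of the second-order expansion shrinks roughly like $\|\Delta x\|^3$.

```python
# Sketch (assumed example): check that the second-order Taylor model of a scalar
# function has error that shrinks roughly like ||dx||^3 as dx -> 0.
import numpy as np

l = lambda x: np.exp(x[0]) * np.sin(x[1])                  # arbitrary smooth test function
x0 = np.array([0.3, 1.1])

# Analytic row derivative and Hessian of this particular l(x) at x0.
row = np.array([np.exp(x0[0]) * np.sin(x0[1]), np.exp(x0[0]) * np.cos(x0[1])])
H = np.array([[np.exp(x0[0]) * np.sin(x0[1]),  np.exp(x0[0]) * np.cos(x0[1])],
              [np.exp(x0[0]) * np.cos(x0[1]), -np.exp(x0[0]) * np.sin(x0[1])]])

d = np.array([1.0, -2.0])
for t in (1e-1, 1e-2, 1e-3):
    dx = t * d
    model = l(x0) + row @ dx + 0.5 * dx @ H @ dx
    print(t, abs(l(x0 + dx) - model))   # error drops ~1000x per factor-of-10 shrink in dx
```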
Sufficient Condition for an Extremum

Let $x_0$ be a stationary point of $\ell(x)$, $\frac{\partial \ell(x_0)}{\partial x} = 0$. Then from the second-order expansion of $\ell(x)$ about $x_0$ we have

$$\Delta\ell(x) = \ell(x) - \ell(x_0) \approx \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0) = \frac{1}{2}\,\Delta x^T H(x_0)\,\Delta x$$

assuming that $\Delta x = x - x_0$ is small enough in norm so that higher-order terms in the expansion can be neglected. (That is, we consider only local excursions away from $x_0$.)

We see that if the Hessian is positive definite, then all local excursions of $x$ away from $x_0$ increase the value of $\ell(x)$ and thus

Suff. Cond. for Stationary Point $x_0$ to be a Unique Local Min: $H(x_0) > 0$

Contrariwise, if the Hessian is negative definite, then all local excursions of $x$ away from $x_0$ decrease the value of $\ell(x)$ and thus

Suff. Cond. for Stationary Point $x_0$ to be a Unique Local Max: $H(x_0) < 0$

If the Hessian $H(x_0)$ is full rank and indefinite at a stationary point $x_0$, then $x_0$ is a saddle point. If $H(x_0) \geq 0$ (and singular) then $x_0$ is a non-unique local minimum; if $H(x_0) \leq 0$ (and singular) then $x_0$ is a non-unique local maximum.
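A compact sketch (mine, assuming NumPy; the quadratic examples are arbitrary) classifying a stationary point from the eigenvalues of the Hessian.

```python
# Sketch (assumed example): classify a stationary point x0 of l(x) = (1/2) x^T H x
# (so x0 = 0 is stationary and H is the Hessian) from the eigenvalues of H.
import numpy as np

def classify(H, tol=1e-10):
    lam = np.linalg.eigvalsh(H)                 # H is symmetric
    if np.all(lam > tol):
        return "unique local (here global) minimum"
    if np.all(lam < -tol):
        return "unique local (here global) maximum"
    if np.any(lam > tol) and np.any(lam < -tol):
        return "saddle point"
    return "semidefinite: non-unique extremum (flat directions exist)"

print(classify(np.diag([2.0, 1.0])))        # positive definite  -> minimum
print(classify(np.diag([-2.0, -1.0])))      # negative definite  -> maximum
print(classify(np.diag([2.0, -1.0])))       # indefinite         -> saddle
print(classify(np.diag([2.0, 0.0])))        # positive semidef.  -> non-unique minimum
```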
Application to Linear Gaussian Model Cont.

Note that the scalar loss function

$$\ell(x) \doteq \|y - Ax\|^2_W = (y - Ax)^T W (y - Ax) = y^T W y - 2\, y^T W A x + x^T A^T W A x$$

has an exact quadratic expansion. Thus the arguments given in the previous slide hold for arbitrarily large (global) excursions away from a stationary point $x_0$. In particular, if $H$ is positive definite, the stationary point $x_0$ must be a unique global minimum.

Having shown that

$$\frac{\partial \ell}{\partial x} = 2\, x^T A^T W A - 2\, y^T W A,$$

we determine the Hessian to be

$$H(x) = \frac{\partial}{\partial x}\left(\frac{\partial \ell}{\partial x}\right)^T = 2\, A^T W A \geq 0$$

Therefore stationary points (which, as we have seen, necessarily satisfy the normal equation) are global minima of $\ell(x)$.

Furthermore, if $A$ is one-to-one (has full column rank), then $H(x) = 2\, A^T W A > 0$ and there is a unique stationary point (i.e., the weighted least-squares solution) which minimizes $\ell(x)$. Of course this could not be otherwise, as we know from our previous analysis of the weighted least-squares problem using Hilbert space theory.
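A small check (my own sketch, assuming NumPy; the random $A$, $y$ and the identity weighting are arbitrary choices) that the finite-difference Hessian of the weighted least-squares loss matches $2A^TWA$ at every point, confirming the exact quadratic structure.

```python
# Sketch (assumed example): the Hessian of l(x) = (y - Ax)^T W (y - Ax) is 2 A^T W A,
# independent of x; verify by finite differences at a couple of random points.
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3
A = rng.standard_normal((m, n))
W = np.eye(m)                         # any W = W^T > 0 works; identity for brevity
y = rng.standard_normal(m)
l = lambda x: (y - A @ x) @ W @ (y - A @ x)

def hessian(f, x, h=1e-4):
    Hm = np.zeros((x.size, x.size))
    for i in range(x.size):
        for j in range(x.size):
            ei = np.zeros(x.size); ei[i] = h
            ej = np.zeros(x.size); ej[j] = h
            Hm[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                        - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return Hm

for _ in range(2):
    x = rng.standard_normal(n)
    print(np.allclose(hessian(l, x), 2 * A.T @ W @ A, atol=1e-5))   # True at every x
```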
The Gradient of an Objective Function $\ell(x)$

Note that $d\ell(x) = \frac{\partial \ell(x)}{\partial x}\, dx$ or, setting $\dot{\ell} = d\ell/dt$ and $v = dx/dt$,

$$\dot{\ell}(v) = \frac{\partial \ell(x)}{\partial x}\, v = \frac{\partial \ell(x)}{\partial x}\,\Omega_x^{-1}\,\Omega_x\, v = \left(\Omega_x^{-1}\left(\frac{\partial \ell(x)}{\partial x}\right)^T\right)^T \Omega_x\, v = \langle \nabla_x \ell(x),\, v\rangle$$

with the Gradient of $\ell(x)$ defined by

$$\nabla_x \ell(x) \triangleq \Omega_x^{-1}\left(\frac{\partial \ell(x)}{\partial x}\right)^T = \Omega_x^{-1}\begin{pmatrix} \frac{\partial \ell(x)}{\partial x_1} \\ \vdots \\ \frac{\partial \ell(x)}{\partial x_n} \end{pmatrix}$$

with $\Omega_x$ a local Riemannian metric. Note that $\dot{\ell}(v) = \langle \nabla_x \ell(x),\, v\rangle$ is a linear functional of the velocity vector $v$.

In the vector space structure we have seen to date, $\Omega_x$ is independent of $x$, $\Omega_x = \Omega$, and represents the metric of the (globally) defined metric space containing $x$. Furthermore, if $\Omega = I$, then the space is a (globally) Cartesian vector space.

When considering spaces of smoothly parameterized regular-family probability density functions, a natural Riemannian (local, non-Cartesian) metric is provided by the Fisher Information Matrix, to be discussed later in this course.
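A minimal sketch (mine, assuming NumPy; the metric and objective are arbitrary examples) computing the metric-dependent gradient $\nabla_x\ell(x) = \Omega^{-1}(\partial\ell(x)/\partial x)^T$ and checking that $\langle\nabla_x\ell(x), v\rangle_\Omega$ reproduces the directional derivative $(\partial\ell(x)/\partial x)\,v$.

```python
# Sketch (assumed example): metric-dependent gradient grad = Omega^{-1} (dl/dx)^T and
# the identity <grad, v>_Omega = (dl/dx) v for any velocity v.
import numpy as np

def row_derivative(f, x, h=1e-6):
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(4)
n = 3
M = rng.standard_normal((n, n))
Omega = M @ M.T + n * np.eye(n)                  # metric matrix, Omega = Omega^T > 0

l = lambda x: np.sum(np.sin(x)) + x @ x
x0 = rng.standard_normal(n)

row = row_derivative(l, x0)                      # 1 x n row derivative
grad = np.linalg.solve(Omega, row)               # gradient = Omega^{-1} (dl/dx)^T

v = rng.standard_normal(n)                       # an arbitrary velocity vector
print(grad @ Omega @ v)                          # <grad, v>_Omega ...
print(row @ v)                                   # ... equals the directional derivative (dl/dx) v
```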
The Gradient as the Direction of Steepest Ascent

What unit velocities, $\|v\| = 1$, in the domain space $\mathcal{X}$ result in the fastest rate of change of $\ell(x)$ as measured by $|\dot{\ell}(v)|$? Equivalently, what unit velocity directions $v$ in the domain space $\mathcal{X}$ result in the fastest rate of change of $\ell(x)$ as measured by $|\dot{\ell}(v)|$?

From the Cauchy-Schwarz inequality, we have

$$|\dot{\ell}(v)| = |\langle \nabla_x \ell(x),\, v\rangle| \leq \|\nabla_x \ell(x)\|\,\|v\| = \|\nabla_x \ell(x)\|$$

or

$$-\|\nabla_x \ell(x)\| \leq \dot{\ell}(v) \leq \|\nabla_x \ell(x)\|$$

Note that

$$v = c\,\nabla_x \ell(x) \text{ with } c = \|\nabla_x \ell(x)\|^{-1} \;\Longrightarrow\; \dot{\ell}(v) = \|\nabla_x \ell(x)\| \;\Longrightarrow\; \nabla_x \ell(x) = \text{direction of steepest ascent}$$

$$v = -c\,\nabla_x \ell(x) \text{ with } c = \|\nabla_x \ell(x)\|^{-1} \;\Longrightarrow\; \dot{\ell}(v) = -\|\nabla_x \ell(x)\| \;\Longrightarrow\; -\nabla_x \ell(x) = \text{direction of steepest descent}$$
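An empirical check (my own sketch, assuming NumPy; the quadratic objective and random metric are arbitrary) that among unit-norm velocities, measured in the $\Omega$-norm, the normalized gradient maximizes $\dot{\ell}(v)$.

```python
# Sketch (assumed example): among unit-Omega-norm directions v, the normalized
# gradient attains the largest rate of change l_dot(v) = (dl/dx) v.
import numpy as np

rng = np.random.default_rng(5)
n = 3
M = rng.standard_normal((n, n))
Omega = M @ M.T + n * np.eye(n)               # metric, Omega = Omega^T > 0

# Objective l(x) = b^T x + 0.5 x^T Q x, so the row derivative is b^T + x^T Q exactly.
Q = np.diag([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
x0 = rng.standard_normal(n)
row = b + Q @ x0                              # (dl/dx) at x0, stored as a 1-D array

grad = np.linalg.solve(Omega, row)            # metric gradient Omega^{-1} (dl/dx)^T
v_star = grad / np.sqrt(grad @ Omega @ grad)  # unit Omega-norm steepest-ascent direction

best_random = -np.inf
for _ in range(2000):                         # compare against random unit-norm directions
    v = rng.standard_normal(n)
    v = v / np.sqrt(v @ Omega @ v)
    best_random = max(best_random, row @ v)

print(row @ v_star)     # = ||grad||_Omega, the maximum possible rate of increase
print(best_random)      # always <= the value above; approaches it with more samples
```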
The Cartesian or Naive Gradient

In a Cartesian vector space the gradient of a cost function $\ell(x)$ corresponds to taking $\Omega = I$, in which case we have the Cartesian gradient

$$\nabla^c_x \ell(x) = \left(\frac{\partial \ell(x)}{\partial x}\right)^T = \begin{pmatrix} \frac{\partial \ell(x)}{\partial x_1} \\ \vdots \\ \frac{\partial \ell(x)}{\partial x_n} \end{pmatrix}$$

Often one naively assumes that the gradient takes this form even if it is not evident that the space is, in fact, Cartesian. In this case one might more accurately refer to the gradient shown as the standard gradient or, even, the naive gradient.

Because the Cartesian gradient is the standard form assumed in many applications, it is common to just refer to it as the gradient, even if it is not the correct, true gradient. (Assuming that we agree that the true gradient must give the direction of steepest ascent and therefore depends on the metric $\Omega_x$.)

This is the terminology adhered to by Amari and his colleagues, who refer to the true Riemannian metric-dependent gradient as the natural gradient. As mentioned earlier, when considering spaces of smoothly parameterized regular-family probability density functions, a natural Riemannian metric is provided by the Fisher Information Matrix. Amari was one of the first researchers to consider parametric estimation from this Information Geometry perspective. He has argued that the use of the natural (true) gradient can significantly improve the performance of statistical learning algorithms.
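To illustrate the difference (my own sketch, assuming NumPy; the quadratic problem, step sizes, and the choice of $A^TWA$ as the Fisher-information metric for this Gaussian example are assumptions made for illustration), compare naive gradient descent with natural gradient descent on the half weighted least-squares loss: with metric $\Omega = A^TWA$, a single unit natural-gradient step lands exactly on the minimizer.

```python
# Sketch (assumed example): naive (Cartesian) vs. natural gradient descent on the
# loss l(x) = 0.5 * (y - Ax)^T W (y - Ax). With the Fisher-information metric
# Omega = A^T W A, one unit natural-gradient step reaches the minimizer exactly.
import numpy as np

rng = np.random.default_rng(6)
m, n = 8, 3
A = rng.standard_normal((m, n))
W = np.eye(m)                                        # identity weighting for brevity
y = rng.standard_normal(m)

x_star = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # weighted least-squares solution

cart_grad = lambda x: A.T @ W @ (A @ x - y)          # Cartesian gradient of the half-loss
Omega = A.T @ W @ A                                  # Fisher information metric for this model

# Natural gradient: one step with unit step size.
x = np.zeros(n)
x_nat = x - np.linalg.solve(Omega, cart_grad(x))
print(np.linalg.norm(x_nat - x_star))                # ~0 after a single step

# Naive gradient descent: many small steps are needed.
x = np.zeros(n)
eta = 1.0 / np.linalg.eigvalsh(Omega).max()          # a safe step size
for _ in range(200):
    x = x - eta * cart_grad(x)
print(np.linalg.norm(x - x_star))                    # small, but only after many steps
```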
