
Lecture 10 ECE 275A

Vector Derivatives and the Gradient


ECE 275AB Lecture 10, Fall 2008, V1.1, © K. Kreutz-Delgado, UC San Diego
Objective Functions $\ell(x)$

Let $\ell(x)$ be a real-valued function (aka functional) of an $n$-dimensional real vector $x \in \mathcal{X} = \mathbb{R}^n$,

$$\ell(\cdot) : \mathcal{X} = \mathbb{R}^n \to \mathcal{Y} = \mathbb{R}$$

where $\mathcal{X}$ is an $n$-dimensional real Hilbert space with metric matrix $\Omega = \Omega^T > 0$. Note that henceforth vectors in $\mathcal{X}$ are represented as column vectors in $\mathbb{R}^n$. Because here the value space $\mathcal{Y} = \mathbb{R}$ is one-dimensional, wlog we can take $\mathcal{Y}$ to be Cartesian (because all (necessarily positive scalar) metric weightings yield inner products and norms that are equivalent up to an overall positive scaling).

We will call $\ell(x)$ an objective function. If $\ell(x)$ is a cost, loss, or penalty function, then we assume that it is bounded from below and we attempt to minimize it wrt $x$. If $\ell(x)$ is a profit, gain, or reward function, then we assume that it is bounded from above and we attempt to maximize it wrt $x$.

For example, suppose we wish to match a model pdf $p_x(y)$ to a true, but unknown, density $p_{x_0}(y)$ for an observed random vector, where we assume that $p_x(y)$ and $p_{x_0}(y)$ have common support for all $x$. We can then take as a penalty function of $x$ a measure of (non-averaged or instantaneous) divergence or discrepancy $D_I(x_0 \| x)$ of the model pdf $p_x(y)$ from the true pdf $p_{x_0}(y)$, defined by

$$D_I(x_0 \| x) \triangleq \log \frac{p_{x_0}(y)}{p_x(y)} = \log p_{x_0}(y) - \log p_x(y)$$
Instantaneous Divergence & Negative Log-likelihood

Note that minimizing the instantaneous divergence $D_I(x_0 \| x)$ is equivalent to maximizing the log-likelihood $\log p_x(y)$ or minimizing the negative log-likelihood

$$\ell(x) = -\log p_x(y)$$

A special case arises from use of the nonlinear Gaussian additive noise model with known noise covariance $C$ and known mean function $h(\cdot)$,

$$y = h(x) + n, \quad n \sim N(0, C) \quad \Longrightarrow \quad y \sim N(h(x), C)$$

which yields the nonlinear weighted least-squares problem

$$\ell(x) = -\log p_x(y) \doteq \|y - h(x)\|^2_W, \quad W = C^{-1}$$

Further setting $h(x) = Ax$ yields the linear weighted least-squares problem we have already discussed,

$$\ell(x) = -\log p_x(y) \doteq \|y - Ax\|^2_W, \quad W = C^{-1}$$

The symbol $\doteq$ denotes the fact that we are ignoring additive terms and multiplicative factors which are irrelevant for the purposes of obtaining an extremum (here, a minimum) of the loss function. Of course we cannot ignore these terms if we are interested in the optimal value of the loss function itself.
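To make the $\doteq$ relationship concrete, here is a minimal numerical sketch (not from the lecture; it assumes NumPy/SciPy and an arbitrary random example model) checking that the Gaussian negative log-likelihood of the linear model differs from $\tfrac{1}{2}\|y - Ax\|^2_W$ only by an $x$-independent constant.

```python
# Minimal sketch (assumed example, not from the lecture): verify that the Gaussian
# negative log-likelihood -log p_x(y) of the linear model y = A x + n, n ~ N(0, C),
# equals (1/2)||y - A x||^2_W plus an x-independent constant, with W = C^{-1}.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))
L = rng.standard_normal((m, m))
C = L @ L.T + m * np.eye(m)          # a valid (positive definite) noise covariance
W = np.linalg.inv(C)                 # weighting matrix W = C^{-1}
y = rng.standard_normal(m)

def nll(x):
    return -multivariate_normal(mean=A @ x, cov=C).logpdf(y)

def half_wls(x):
    e = y - A @ x
    return 0.5 * e @ W @ e

# The difference should be the same constant (1/2 log det(2*pi*C)) for every x.
for _ in range(3):
    x = rng.standard_normal(n)
    print(nll(x) - half_wls(x))      # identical values, independent of x
```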
Stationary Points and the Vector Partial Derivative

Henceforth let the real scalar function $\ell(x)$ be twice partially differentiable with respect to all components of $x \in \mathbb{R}^n$. A necessary condition for $x$ to be a local extremum (maximum or minimum) of $\ell(x)$ is that

$$\frac{\partial}{\partial x}\ell(x) \triangleq \begin{pmatrix} \frac{\partial \ell(x)}{\partial x_1} & \cdots & \frac{\partial \ell(x)}{\partial x_n} \end{pmatrix}_{1 \times n} = 0$$

where the vector partial derivative operator

$$\frac{\partial}{\partial x} \triangleq \begin{pmatrix} \frac{\partial}{\partial x_1} & \cdots & \frac{\partial}{\partial x_n} \end{pmatrix}$$

is defined as a row operator. (See the extensive discussion in the Lecture Supplement on the Real Vector Derivative.)

A vector that satisfies $\frac{\partial}{\partial x}\ell(x) = 0$ is known as a stationary point of $\ell(x)$. Stationary points are points at which $\ell(x)$ has a local maximum, minimum, or inflection.

Sufficient conditions for a stationary point to be a local extremum require that we develop a theory of vector differentiation that will allow us to clearly and succinctly discuss second-order derivative properties of objective functions.
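As a quick numerical illustration (my own sketch, assuming NumPy; the quadratic test function is arbitrary), the $1 \times n$ row derivative can be approximated by central differences and checked to vanish at a known minimizer.

```python
# Sketch (assumed example): approximate the 1 x n row derivative of a scalar
# function by central differences and confirm it is ~0 at a stationary point.
import numpy as np

def row_derivative(f, x, h=1e-6):
    """Finite-difference approximation of the 1 x n row vector d f / d x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g                      # stored as a 1-D array, read as a row

# Example objective: l(x) = (x - b)^T Q (x - b) with Q > 0, minimized at x = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -2.0])
l = lambda x: (x - b) @ Q @ (x - b)

print(row_derivative(l, np.array([0.0, 0.0])))   # nonzero away from the minimum
print(row_derivative(l, b))                      # ~[0, 0] at the stationary point
```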
Derivative of a Vector-Valued Function: The Jacobian

Let $f(x) \in \mathbb{R}^m$ have elements $f_i(x)$, $i = 1, \ldots, m$, which are all differentiable with respect to the components of $x \in \mathbb{R}^n$.

We define the vector partial derivative of the vector function $f(x)$ as

$$J_f(x) \triangleq \frac{\partial}{\partial x} f(x) = \begin{pmatrix} \frac{\partial}{\partial x} f_1(x) \\ \vdots \\ \frac{\partial}{\partial x} f_m(x) \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{pmatrix}_{m \times n}$$

The matrix $J_f(x) = \frac{\partial}{\partial x} f(x)$ is known as the Jacobian matrix or operator of the mapping $f(x)$. It is the linearization of the nonlinear mapping $f(x)$ at the point $x$. Often we write $y = f(x)$ and write the corresponding Jacobian as $J_y(x)$.

If $m = n$, then $y = f(x)$ can be viewed as a change of variables, in which case $\det J_y(x)$ is known as the Jacobian of the transformation. $\det J_y(x)$ plays a fundamental role in the change-of-variables formula for pdfs.
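The following sketch (my own, assuming NumPy; the 2-D polar-to-Cartesian map is just an example) builds the Jacobian column by column with finite differences and checks that it linearizes the map near a point.

```python
# Sketch (assumed example): finite-difference Jacobian of f: R^n -> R^m and a
# check that f(x + dx) ~ f(x) + J_f(x) dx for small dx (the linearization).
import numpy as np

def jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

# Example map (polar -> Cartesian coordinates), m = n = 2.
f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])

x0 = np.array([2.0, 0.5])
J = jacobian(f, x0)
dx = 1e-3 * np.array([1.0, -1.0])
print(f(x0 + dx) - (f(x0) + J @ dx))   # ~0: the Jacobian linearizes f at x0
print(np.linalg.det(J))                # the "Jacobian of the transformation" (= r here)
```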
The Jacobian Cont.

- A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is locally one-to-one in an open neighborhood of $x$ if and only if its Jacobian (linearization) $J_f(\xi) = \frac{\partial}{\partial x} f(\xi)$ is a one-to-one matrix for all points $\xi \in \mathbb{R}^n$ in the open neighborhood of $x$.

- A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is locally onto an open neighborhood of $y = f(x)$ if and only if its Jacobian (linearization) $J_f(\xi) = \frac{\partial}{\partial x} f(\xi)$ is an onto matrix for all points $\xi \in \mathbb{R}^n$ in the corresponding neighborhood of $x$.

- A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is locally invertible in an open neighborhood of $x$ if and only if it is locally one-to-one and onto in the open neighborhood of $x$, which is true if and only if its Jacobian (linearization) $J_f(\xi) = \frac{\partial}{\partial x} f(\xi)$ is a one-to-one and onto (and hence invertible) matrix for all points $\xi \in \mathbb{R}^n$ in the open neighborhood of $x$. This is known as the Inverse Function Theorem.
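Here is a small illustration (my own sketch, assuming NumPy; the example map is the same hypothetical polar-to-Cartesian map as above): checking the rank of the Jacobian at a point as a practical test of the one-to-one / onto / invertible conditions.

```python
# Sketch (assumed example): use the numerical rank of the Jacobian at a point as a
# proxy for the local one-to-one (full column rank), onto (full row rank), and
# invertible (full rank, m = n) conditions of the Inverse Function Theorem.
import numpy as np

def jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])  # polar -> Cartesian

for x0 in (np.array([2.0, 0.5]), np.array([0.0, 0.5])):   # r = 0 is the degenerate case
    J = jacobian(f, x0)
    r = np.linalg.matrix_rank(J)
    verdict = "locally invertible" if r == J.shape[0] == J.shape[1] else "not invertible here"
    print(x0, "rank", r, verdict)
```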
Vector Derivative Identities

$$\frac{\partial\, c^T x}{\partial x} = c^T \quad \text{for an arbitrary vector } c$$

$$\frac{\partial\, Ax}{\partial x} = A \quad \text{for an arbitrary matrix } A$$

$$\frac{\partial\, g^T(x) h(x)}{\partial x} = g^T(x)\,\frac{\partial h(x)}{\partial x} + h^T(x)\,\frac{\partial g(x)}{\partial x}, \quad g(x)^T h(x) \text{ scalar}$$

$$\frac{\partial\, x^T A x}{\partial x} = x^T A + x^T A^T \quad \text{for an arbitrary matrix } A$$

$$\frac{\partial\, x^T \Omega x}{\partial x} = 2 x^T \Omega \quad \text{when } \Omega = \Omega^T$$

$$\frac{\partial\, h(g(x))}{\partial x} = \frac{\partial h}{\partial g}\,\frac{\partial g}{\partial x} \quad \text{(Chain Rule)}$$

Note that the last identity is a statement about Jacobians and can be restated in an illuminating manner as

$$J_{h \circ g} = J_h J_g \qquad (1)$$

which says that the linearization of a composition is the composition of the linearizations.
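A short numerical check (my own sketch, assuming NumPy; the matrices and functions are arbitrary examples) of two of these identities: the quadratic-form rule and the chain rule stated as $J_{h\circ g} = J_h J_g$.

```python
# Sketch (assumed example): numerically verify d(x^T A x)/dx = x^T A + x^T A^T and
# the chain rule J_{h o g}(x) = J_h(g(x)) J_g(x) using finite differences.
import numpy as np

def jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * h)
    return J

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

# Quadratic-form identity: the 1 x n row derivative of x^T A x.
lhs = jacobian(lambda z: z @ A @ z, x)           # 1 x 3 finite-difference row
rhs = (x @ A + x @ A.T)[None, :]                 # x^T A + x^T A^T
print(np.allclose(lhs, rhs, atol=1e-4))          # True

# Chain rule: h(g(x)) with g: R^3 -> R^2, h: R^2 -> R^2.
g = lambda z: np.array([z[0] * z[1], np.sin(z[2])])
hfun = lambda u: np.array([np.exp(u[0]), u[0] * u[1]])
J_comp = jacobian(lambda z: hfun(g(z)), x)
print(np.allclose(J_comp, jacobian(hfun, g(x)) @ jacobian(g, x), atol=1e-4))  # True
```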
Application to Linear Gaussian Model

Stationary points of

$$\ell(x) = -\log p_x(y) \doteq \|e(x)\|^2_W, \quad \text{where } e(x) = y - Ax \text{ and } W = C^{-1},$$

satisfy

$$0 = \frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial e}\,\frac{\partial e}{\partial x} = (2 e^T W)(-A) = -2\,(y - Ax)^T W A$$

or

$$A^T W (y - Ax) = 0 \iff e(x) = y - Ax \in \mathcal{N}(A^*) \quad \text{with } A^* = A^T W$$

Therefore stationary points satisfy the Normal Equation

$$A^T W A\, x = A^T W y \iff A^* A\, x = A^* y$$
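A minimal numerical sketch (mine, assuming NumPy; $A$, $C$, $y$ are random examples) of the Normal Equation: solve $A^T W A\,x = A^T W y$ and confirm that the residual is annihilated by $A^* = A^T W$ and that the row derivative vanishes there.

```python
# Sketch (assumed example): solve the weighted normal equation A^T W A x = A^T W y
# and check the stationarity conditions derived above.
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))                     # full column rank with probability 1
Lc = rng.standard_normal((m, m))
C = Lc @ Lc.T + m * np.eye(m)
W = np.linalg.inv(C)
y = rng.standard_normal(m)

x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # weighted least-squares solution

e = y - A @ x_hat                                   # residual e(x_hat)
print(A.T @ W @ e)                                  # ~0: e(x_hat) is in N(A*), A* = A^T W
print(-2 * (y - A @ x_hat) @ W @ A)                 # ~0: the row derivative dl/dx at x_hat
```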
The Hessian of an Objective Function

The Hessian, or matrix of second partial derivatives of $\ell(x)$, is defined by

$$H(x) \triangleq \frac{\partial}{\partial x}\left(\frac{\partial}{\partial x}\ell(x)\right)^T = \begin{pmatrix} \frac{\partial^2 \ell(x)}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 \ell(x)}{\partial x_n \partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \ell(x)}{\partial x_1 \partial x_n} & \cdots & \frac{\partial^2 \ell(x)}{\partial x_n \partial x_n} \end{pmatrix}_{n \times n}$$

As a consequence of the fact that

$$\frac{\partial^2 \ell(x)}{\partial x_i \partial x_j} = \frac{\partial^2 \ell(x)}{\partial x_j \partial x_i}$$

the Hessian is obviously symmetric, $H(x) = H^T(x)$.
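A quick sketch (mine, assuming NumPy; the test function is arbitrary) that builds the Hessian by differencing the finite-difference row derivative and confirms its symmetry.

```python
# Sketch (assumed example): finite-difference Hessian H(x) of a scalar function,
# built as the derivative of the (transposed) row derivative, plus a symmetry check.
import numpy as np

def row_derivative(f, x, h=1e-5):
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    Hm = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        Hm[:, j] = (row_derivative(f, x + e) - row_derivative(f, x - e)) / (2 * h)
    return Hm

l = lambda x: x[0]**2 * x[1] + np.sin(x[1]) * x[2] + x[2]**4
x0 = np.array([1.0, 2.0, -0.5])
H = hessian(l, x0)
print(np.allclose(H, H.T, atol=1e-4))   # True: mixed partials commute
```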
Vector Taylor Series Expansion

Taylor series expansion of a scalar-valued function $\ell(x)$ about a point $x_0$ to second order in $\Delta x = x - x_0$:

$$\ell(x_0 + \Delta x) = \ell(x_0) + \frac{\partial \ell(x_0)}{\partial x}\,\Delta x + \frac{1}{2}\,\Delta x^T H(x_0)\, \Delta x + \text{h.o.t.}$$

where $H$ is the Hessian of $\ell(x)$.

Taylor series expansion of a vector-valued function $h(x)$ about a point $x_0$ to first order in $\Delta x = x - x_0$:

$$h(x) = h(x_0 + \Delta x) = h(x_0) + \frac{\partial h(x_0)}{\partial x}\,\Delta x + \text{h.o.t.}$$

To obtain notationally uncluttered expressions for higher-order expansions, one switches to the use of tensor notation.
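As an illustration (my own sketch, assuming NumPy; the smooth test function and its hand-computed derivatives are an arbitrary example), the error of the second-order expansion shrinks roughly like $\|\Delta x\|^3$.

```python
# Sketch (assumed example): check that the second-order Taylor model of a scalar
# function has error that shrinks roughly like ||dx||^3 as dx -> 0.
import numpy as np

l = lambda x: np.exp(x[0]) * np.sin(x[1])                  # arbitrary smooth test function
x0 = np.array([0.3, 1.1])

# Analytic row derivative and Hessian of this particular l(x) at x0.
row = np.array([np.exp(x0[0]) * np.sin(x0[1]), np.exp(x0[0]) * np.cos(x0[1])])
H = np.array([[np.exp(x0[0]) * np.sin(x0[1]),  np.exp(x0[0]) * np.cos(x0[1])],
              [np.exp(x0[0]) * np.cos(x0[1]), -np.exp(x0[0]) * np.sin(x0[1])]])

d = np.array([1.0, -2.0])
for t in (1e-1, 1e-2, 1e-3):
    dx = t * d
    model = l(x0) + row @ dx + 0.5 * dx @ H @ dx
    print(t, abs(l(x0 + dx) - model))   # error drops ~1000x per factor-of-10 shrink in dx
```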
Sufficient Condition for an Extremum

Let $x_0$ be a stationary point of $\ell(x)$, $\frac{\partial \ell(x_0)}{\partial x} = 0$. Then from the second-order expansion of $\ell(x)$ about $x_0$ we have

$$\Delta\ell(x) = \ell(x) - \ell(x_0) \approx \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0) = \frac{1}{2}\,\Delta x^T H(x_0)\,\Delta x$$

assuming that $\Delta x = x - x_0$ is small enough in norm so that higher-order terms in the expansion can be neglected. (That is, we consider only local excursions away from $x_0$.)

We see that if the Hessian is positive definite, then all local excursions of $x$ away from $x_0$ increase the value of $\ell(x)$ and thus

Suff. Cond. for Stationary Point $x_0$ to be a Unique Local Min: $H(x_0) > 0$

Contrariwise, if the Hessian is negative definite, then all local excursions of $x$ away from $x_0$ decrease the value of $\ell(x)$ and thus

Suff. Cond. for Stationary Point $x_0$ to be a Unique Local Max: $H(x_0) < 0$

If the Hessian $H(x_0)$ is full rank and indefinite at a stationary point $x_0$, then $x_0$ is a saddle point. If $H(x_0) \geq 0$ (and singular) then $x_0$ is a non-unique local minimum; if $H(x_0) \leq 0$ (and singular) then $x_0$ is a non-unique local maximum.
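A compact sketch (mine, assuming NumPy; the quadratic examples are arbitrary) classifying a stationary point from the eigenvalues of the Hessian.

```python
# Sketch (assumed example): classify a stationary point x0 of l(x) = (1/2) x^T H x
# (so x0 = 0 is stationary and H is the Hessian) from the eigenvalues of H.
import numpy as np

def classify(H, tol=1e-10):
    lam = np.linalg.eigvalsh(H)                 # H is symmetric
    if np.all(lam > tol):
        return "unique local (here global) minimum"
    if np.all(lam < -tol):
        return "unique local (here global) maximum"
    if np.any(lam > tol) and np.any(lam < -tol):
        return "saddle point"
    return "semidefinite: non-unique extremum (flat directions exist)"

print(classify(np.diag([2.0, 1.0])))        # positive definite  -> minimum
print(classify(np.diag([-2.0, -1.0])))      # negative definite  -> maximum
print(classify(np.diag([2.0, -1.0])))       # indefinite         -> saddle
print(classify(np.diag([2.0, 0.0])))        # positive semidef.  -> non-unique minimum
```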
Application to Linear Gaussian Model Cont.

Note that the scalar loss function

$$\ell(x) \doteq \|y - Ax\|^2_W = (y - Ax)^T W (y - Ax) = y^T W y - 2\, y^T W A x + x^T A^T W A x$$

has an exact quadratic expansion. Thus the arguments given in the previous slide hold for arbitrarily large (global) excursions away from a stationary point $x_0$. In particular, if $H$ is positive definite, the stationary point $x_0$ must be a unique global minimum.

Having shown that

$$\frac{\partial \ell}{\partial x} = 2\, x^T A^T W A - 2\, y^T W A,$$

we determine the Hessian to be

$$H(x) = \frac{\partial}{\partial x}\left(\frac{\partial \ell}{\partial x}\right)^T = 2\, A^T W A \geq 0$$

Therefore stationary points (which, as we have seen, necessarily satisfy the normal equation) are global minima of $\ell(x)$.

Furthermore, if $A$ is one-to-one (has full column rank), then $H(x) = 2\, A^T W A > 0$ and there is a unique stationary point (i.e., the weighted least-squares solution) which minimizes $\ell(x)$. Of course this could not be otherwise, as we know from our previous analysis of the weighted least-squares problem using Hilbert space theory.
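A small check (my own sketch, assuming NumPy; the random $A$, $y$ and the identity weighting are arbitrary choices) that the finite-difference Hessian of the weighted least-squares loss matches $2A^TWA$ at every point, confirming the exact quadratic structure.

```python
# Sketch (assumed example): the Hessian of l(x) = (y - Ax)^T W (y - Ax) is 2 A^T W A,
# independent of x; verify by finite differences at a couple of random points.
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3
A = rng.standard_normal((m, n))
W = np.eye(m)                         # any W = W^T > 0 works; identity for brevity
y = rng.standard_normal(m)
l = lambda x: (y - A @ x) @ W @ (y - A @ x)

def hessian(f, x, h=1e-4):
    Hm = np.zeros((x.size, x.size))
    for i in range(x.size):
        for j in range(x.size):
            ei = np.zeros(x.size); ei[i] = h
            ej = np.zeros(x.size); ej[j] = h
            Hm[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                        - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return Hm

for _ in range(2):
    x = rng.standard_normal(n)
    print(np.allclose(hessian(l, x), 2 * A.T @ W @ A, atol=1e-5))   # True at every x
```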
The Gradient of an Objective Function $\ell(x)$

Note that $d\ell(x) = \frac{\partial \ell(x)}{\partial x}\, dx$ or, setting $\dot{\ell} = d\ell/dt$ and $v = dx/dt$,

$$\dot{\ell}(v) = \frac{\partial \ell(x)}{\partial x}\, v = \frac{\partial \ell(x)}{\partial x}\,\Omega_x^{-1}\,\Omega_x\, v = \left(\Omega_x^{-1}\left(\frac{\partial \ell(x)}{\partial x}\right)^T\right)^T \Omega_x\, v = \langle \nabla_x \ell(x),\, v\rangle$$

with the Gradient of $\ell(x)$ defined by

$$\nabla_x \ell(x) \triangleq \Omega_x^{-1}\left(\frac{\partial \ell(x)}{\partial x}\right)^T = \Omega_x^{-1}\begin{pmatrix} \frac{\partial \ell(x)}{\partial x_1} \\ \vdots \\ \frac{\partial \ell(x)}{\partial x_n} \end{pmatrix}$$

with $\Omega_x$ a local Riemannian metric. Note that $\dot{\ell}(v) = \langle \nabla_x \ell(x),\, v\rangle$ is a linear functional of the velocity vector $v$.

In the vector space structure we have seen to date, $\Omega_x$ is independent of $x$, $\Omega_x = \Omega$, and represents the metric of the (globally) defined metric space containing $x$. Furthermore, if $\Omega = I$, then the space is a (globally) Cartesian vector space.

When considering spaces of smoothly parameterized regular-family probability density functions, a natural Riemannian (local, non-Cartesian) metric is provided by the Fisher Information Matrix, to be discussed later in this course.
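A minimal sketch (mine, assuming NumPy; the metric and objective are arbitrary examples) computing the metric-dependent gradient $\nabla_x\ell(x) = \Omega^{-1}(\partial\ell(x)/\partial x)^T$ and checking that $\langle\nabla_x\ell(x), v\rangle_\Omega$ reproduces the directional derivative $(\partial\ell(x)/\partial x)\,v$.

```python
# Sketch (assumed example): metric-dependent gradient grad = Omega^{-1} (dl/dx)^T and
# the identity <grad, v>_Omega = (dl/dx) v for any velocity v.
import numpy as np

def row_derivative(f, x, h=1e-6):
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(4)
n = 3
M = rng.standard_normal((n, n))
Omega = M @ M.T + n * np.eye(n)                  # metric matrix, Omega = Omega^T > 0

l = lambda x: np.sum(np.sin(x)) + x @ x
x0 = rng.standard_normal(n)

row = row_derivative(l, x0)                      # 1 x n row derivative
grad = np.linalg.solve(Omega, row)               # gradient = Omega^{-1} (dl/dx)^T

v = rng.standard_normal(n)                       # an arbitrary velocity vector
print(grad @ Omega @ v)                          # <grad, v>_Omega ...
print(row @ v)                                   # ... equals the directional derivative (dl/dx) v
```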
The Gradient as the Direction of Steepest Ascent

What unit velocities, $\|v\| = 1$, in the domain space $\mathcal{X}$ result in the fastest rate of change of $\ell(x)$ as measured by $|\dot{\ell}(v)|$? Equivalently, what unit velocity directions $v$ in the domain space $\mathcal{X}$ result in the fastest rate of change of $\ell(x)$ as measured by $|\dot{\ell}(v)|$?

From the Cauchy-Schwarz inequality, we have

$$|\dot{\ell}(v)| = |\langle \nabla_x \ell(x),\, v\rangle| \leq \|\nabla_x \ell(x)\|\,\|v\| = \|\nabla_x \ell(x)\|$$

or

$$-\|\nabla_x \ell(x)\| \leq \dot{\ell}(v) \leq \|\nabla_x \ell(x)\|$$

Note that

$$v = c\,\nabla_x \ell(x) \text{ with } c = \|\nabla_x \ell(x)\|^{-1} \;\Longrightarrow\; \dot{\ell}(v) = \|\nabla_x \ell(x)\| \;\Longrightarrow\; \nabla_x \ell(x) = \text{direction of steepest ascent}$$

$$v = -c\,\nabla_x \ell(x) \text{ with } c = \|\nabla_x \ell(x)\|^{-1} \;\Longrightarrow\; \dot{\ell}(v) = -\|\nabla_x \ell(x)\| \;\Longrightarrow\; -\nabla_x \ell(x) = \text{direction of steepest descent}$$
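An empirical check (my own sketch, assuming NumPy; the quadratic objective and random metric are arbitrary) that among unit-norm velocities, measured in the $\Omega$-norm, the normalized gradient maximizes $\dot{\ell}(v)$.

```python
# Sketch (assumed example): among unit-Omega-norm directions v, the normalized
# gradient attains the largest rate of change l_dot(v) = (dl/dx) v.
import numpy as np

rng = np.random.default_rng(5)
n = 3
M = rng.standard_normal((n, n))
Omega = M @ M.T + n * np.eye(n)               # metric, Omega = Omega^T > 0

# Objective l(x) = b^T x + 0.5 x^T Q x, so the row derivative is b^T + x^T Q exactly.
Q = np.diag([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
x0 = rng.standard_normal(n)
row = b + Q @ x0                              # (dl/dx) at x0, stored as a 1-D array

grad = np.linalg.solve(Omega, row)            # metric gradient Omega^{-1} (dl/dx)^T
v_star = grad / np.sqrt(grad @ Omega @ grad)  # unit Omega-norm steepest-ascent direction

best_random = -np.inf
for _ in range(2000):                         # compare against random unit-norm directions
    v = rng.standard_normal(n)
    v = v / np.sqrt(v @ Omega @ v)
    best_random = max(best_random, row @ v)

print(row @ v_star)     # = ||grad||_Omega, the maximum possible rate of increase
print(best_random)      # always <= the value above; approaches it with more samples
```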
The Cartesian or Naive Gradient

In a Cartesian vector space the gradient of a cost function $\ell(x)$ corresponds to taking $\Omega = I$, in which case we have the Cartesian gradient

$$\nabla^c_x \ell(x) = \left(\frac{\partial \ell(x)}{\partial x}\right)^T = \begin{pmatrix} \frac{\partial \ell(x)}{\partial x_1} \\ \vdots \\ \frac{\partial \ell(x)}{\partial x_n} \end{pmatrix}$$

Often one naively assumes that the gradient takes this form even if it is not evident that the space is, in fact, Cartesian. In this case one might more accurately refer to the gradient shown as the standard gradient or, even, the naive gradient.

Because the Cartesian gradient is the standard form assumed in many applications, it is common to just refer to it as the gradient, even if it is not the correct, true gradient. (Assuming that we agree that the true gradient must give the direction of steepest ascent and therefore depends on the metric $\Omega_x$.)

This is the terminology adhered to by Amari and his colleagues, who refer to the true Riemannian metric-dependent gradient as the natural gradient. As mentioned earlier, when considering spaces of smoothly parameterized regular-family probability density functions, a natural Riemannian metric is provided by the Fisher Information Matrix. Amari was one of the first researchers to consider parametric estimation from this Information Geometry perspective. He has argued that the use of the natural (true) gradient can significantly improve the performance of statistical learning algorithms.
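To illustrate the difference (my own sketch, assuming NumPy; the quadratic problem, step sizes, and the choice of $A^TWA$ as the Fisher-information metric for this Gaussian example are assumptions made for illustration), compare naive gradient descent with natural gradient descent on the half weighted least-squares loss: with metric $\Omega = A^TWA$, a single unit natural-gradient step lands exactly on the minimizer.

```python
# Sketch (assumed example): naive (Cartesian) vs. natural gradient descent on the
# loss l(x) = 0.5 * (y - Ax)^T W (y - Ax). With the Fisher-information metric
# Omega = A^T W A, one unit natural-gradient step reaches the minimizer exactly.
import numpy as np

rng = np.random.default_rng(6)
m, n = 8, 3
A = rng.standard_normal((m, n))
W = np.eye(m)                                        # identity weighting for brevity
y = rng.standard_normal(m)

x_star = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # weighted least-squares solution

cart_grad = lambda x: A.T @ W @ (A @ x - y)          # Cartesian gradient of the half-loss
Omega = A.T @ W @ A                                  # Fisher information metric for this model

# Natural gradient: one step with unit step size.
x = np.zeros(n)
x_nat = x - np.linalg.solve(Omega, cart_grad(x))
print(np.linalg.norm(x_nat - x_star))                # ~0 after a single step

# Naive gradient descent: many small steps are needed.
x = np.zeros(n)
eta = 1.0 / np.linalg.eigvalsh(Omega).max()          # a safe step size
for _ in range(200):
    x = x - eta * cart_grad(x)
print(np.linalg.norm(x - x_star))                    # small, but only after many steps
```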
