
10-725: Optimization Fall 2012

Lecture 11: October 2


Lecturer: Geoff Gordon/Ryan Tibshirani Scribes: Tongbo Huang, Shoou-I Yu

Note: LaTeX template courtesy of UC Berkeley EECS dept.


Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.


11.1 Matrix Differential Calculus

11.1.1 Review of Previous Class

The matrix differential is a remedy for the pain of matrix calculus. It can be understood in either of the following ways:

• A compact way of writing a first-order Taylor expansion.


• Definition: df = a(x; dx) + r(dx), where r is the residual term,
  a(x, ·) is linear in its second argument, and
  r(dx)/||dx|| → 0 as dx → 0.

The derivative is linear, so it passes through addition and scalar multiplication.


It also generalizes the Jacobian, Hessian, gradient, and velocity.
Other topics covered: the chain rule, product rule, bilinear functions, identities, and identification theorems. Please refer to the previous scribed notes for details.
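
As a quick numeric sanity check of the definition (not from the lecture; the function f(X) = tr(X^T A X) and all names below are illustrative), the following NumPy sketch compares the predicted differential df = tr(X^T (A + A^T) dX) against the actual change in f:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    A = rng.standard_normal((d, d))
    X = rng.standard_normal((d, d))
    dX = 1e-6 * rng.standard_normal((d, d))         # a small perturbation

    f = lambda X: np.trace(X.T @ A @ X)
    df_predicted = np.trace(X.T @ (A + A.T) @ dX)   # df = tr(X^T (A + A^T) dX)
    df_actual = f(X + dX) - f(X)
    print(df_predicted, df_actual)                  # agree up to O(||dX||^2)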

11.1.2 Finding a maximum, minimum or saddle points

The principle: set the coefficient of dX to 0 to find a min, max, or saddle point (a small worked example follows this list):

• if df = c(A; dX) + r(dX), then setting dX = tA gives df ≈ c(A; tA) = t c(A; A)

• the function is at a min/max/saddle point iff c(A; A) = 0

• if c is any of the usual products, c(A; A) = 0 forces A = 0
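
As a small worked example (illustrative, not from the lecture): take f(x) = (1/2) x^T A x − b^T x with A symmetric positive definite, so df = (Ax − b)^T dx; setting the coefficient of dx to zero gives Ax = b. A minimal NumPy sketch:

    import numpy as np

    rng = np.random.default_rng(1)
    d = 5
    M = rng.standard_normal((d, d))
    A = M @ M.T + d * np.eye(d)             # symmetric positive definite
    b = rng.standard_normal(d)

    x_star = np.linalg.solve(A, b)          # coefficient of dx set to zero: A x = b
    print(np.linalg.norm(A @ x_star - b))   # ~0, so x_star is the stationary point (here, the minimum)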


11.1.3 Infomax ICA

Suppose we have n training examples xi ∈ R^d and a scalar-valued, componentwise function g. We would like to find the d × d matrix W that maximizes the entropy of yi = g(W xi).
Detour: volume rule:

vol(AS) = |det(A)|vol(S)

Interpretation: a small determinant means A has a small eigenvalue, so it squashes the volume flat; conversely, a large determinant expands it.
Back to infomax ICA. We have yi = g(W xi), where dyi = J(xi; W) dxi = Ji dxi. We want to maximize the entropy of the distribution of y:

maxW Σi (−ln P(yi)),    P(yi) = P(xi) / |det J(xi; W)|

And since

H(P(y)) = −∫ P(y) ln P(y) dy = −E[ln P(y)],

this is equivalent to maximizing

maxW Σi ln(|det J(xi; W)|),

since P(x) does not depend on W.
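
A minimal sketch of this objective in code, assuming the componentwise nonlinearity g = tanh (so g′(a) = 1 − tanh²(a)); since Ji = diag(g′(W xi)) W, the objective Σi ln|det Ji| splits into n ln|det W| plus elementwise terms. Function and variable names are illustrative:

    import numpy as np

    def infomax_objective(W, X):
        """Sum_i ln|det J(x_i; W)| with J_i = diag(g'(W x_i)) W and g = tanh (assumed).
        X holds one training example x_i per column."""
        n = X.shape[1]
        gprime = 1.0 - np.tanh(W @ X) ** 2           # u_i = g'(W x_i), columnwise
        return n * np.linalg.slogdet(W)[1] + np.log(gprime).sum()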

11.1.4 Solving ICA Gradient

Define ui = g′(W xi), vi = g″(W xi).
For the differential of yi = g(W xi):

dyi = g′(W xi) ◦ d(W xi)
    = ui ◦ (W dxi)
    = diag(ui) W dxi

For the differential of Ji = diag(ui) W with respect to W (here xi is fixed, so d(W xi) = dW xi):

dJi = d(diag(ui)) W + diag(ui) dW
    = diag(vi ◦ dW xi) W + diag(ui) dW
    = diag(ui) dW + diag(vi) diag(dW xi) W

Finally, define diag(αi) = diag(ui)^{-1} diag(vi), i.e. αi = vi/ui elementwise. The differential of L(W) = Σi ln(|det J(xi; W)|) is:

dL = Σi d(ln |det J(xi; W)|)
   = Σi tr(Ji^{-1} dJi)
   = Σi tr(W^{-1} dW + W^{-1} diag(ui)^{-1} diag(vi) diag(dW xi) W)
   = Σi [tr(W^{-1} dW) + tr(diag(αi) diag(dW xi))]
   = n tr(W^{-1} dW) + Σi αi^T dW xi
   = n tr(W^{-1} dW) + tr((Σi xi αi^T) dW)

Reading off the coefficient of dW (writing dL = tr(G^T dW)) gives the gradient

G = n W^{-T} + Σi αi xi^T = n W^{-T} + C
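
A finite-difference check of this gradient (an illustrative sketch, not from the lecture), again assuming g = tanh, for which ui = 1 − tanh²(W xi), vi = −2 tanh(W xi)(1 − tanh²(W xi)), and hence αi = vi/ui = −2 tanh(W xi):

    import numpy as np

    rng = np.random.default_rng(2)
    d, n = 3, 50
    X = rng.standard_normal((d, n))                   # columns are the x_i
    W = np.eye(d) + 0.1 * rng.standard_normal((d, d))

    def L(W):                                         # Sum_i ln|det J(x_i; W)|, g = tanh (assumed)
        u = 1.0 - np.tanh(W @ X) ** 2
        return n * np.linalg.slogdet(W)[1] + np.log(u).sum()

    alpha = -2.0 * np.tanh(W @ X)                     # alpha_i = v_i / u_i for g = tanh
    G = n * np.linalg.inv(W).T + alpha @ X.T          # n W^{-T} + Sum_i alpha_i x_i^T

    E = np.zeros((d, d)); E[0, 1] = 1.0               # check one entry by central differences
    h = 1e-6
    print(G[0, 1], (L(W + h * E) - L(W - h * E)) / (2 * h))   # the two numbers should agree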

11.1.5 Natural Gradient

Define L(W) as a function from R^{d×d} to R, so that dL = tr(G^T dW). The step S is chosen as

S = argmaxS M(S),    M(S) = tr(G^T S) − ||S W^{-1}||_F^2 / 2,

which in the scalar case reads

M = gS − S^2 / (2W^2)
So, to find the max/min/saddle point:

M = tr(G^T S) − (1/2) tr(S W^{-1} W^{-T} S^T)

dM = tr(G^T dS) − tr(dS W^{-1} W^{-T} S^T)

(the two terms produced by the product rule are equal, which cancels the 1/2). Setting the coefficient of dS to zero gives G = S W^{-1} W^{-T}, so the natural-gradient step is S = G W^T W. Using the gradient previously derived (dropping the factor n, or absorbing it into the step size),

S = [W^{-T} + C] W^T W = W + C W^T W.
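
A minimal sketch of one natural-gradient step under the same tanh assumption as before (illustrative names and step size); multiplying the gradient on the right by W^T W turns the n W^{-T} term into n W, so no matrix inverse is needed in the update:

    import numpy as np

    rng = np.random.default_rng(3)
    d, n = 3, 50
    X = rng.standard_normal((d, n))                   # columns are the x_i
    W = np.eye(d) + 0.1 * rng.standard_normal((d, d))

    alpha = -2.0 * np.tanh(W @ X)                     # alpha_i for g = tanh, as before
    C = alpha @ X.T                                   # C = Sum_i alpha_i x_i^T
    G = n * np.linalg.inv(W).T + C                    # ordinary gradient
    S = G @ W.T @ W                                   # natural-gradient step S = G W^T W
    assert np.allclose(S, n * W + C @ W.T @ W)        # equals n W + C W^T W: no inverse needed
    W_next = W + 1e-3 * S                             # one ascent step (illustrative step size)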

11.1.6 More Info


• Minka's cheat sheet:
http://research.microsoft.com/en-us/um/people/minka/papers/matrix/

• Magnus & Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. 2nd ed., Wiley, 1999.
http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X

• Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution.
Neural Computation, v7, 1995.

11.2 Newton’s Method

Newton's method has two main applications: solving nonlinear equations and finding minima/maxima/saddle points.

11.2.1 Solving Nonlinear Equations

For x ∈ R^d and a differentiable f : R^d → R^d, we want to solve f(x) = 0. We take a first-order Taylor approximation of f around x,

f(y) ≈ f(x) + J(x)(y − x) = f̂(y)

where J(x) is the Jacobian. We now try to solve f̂(y) = 0:

f(x) + J(x)(y − x) = 0

dx = y − x = −J(x)^{-1} f(x)        (11.1)

dx represents the update step for Newton’s method.
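
In code, one Newton step is a single linear solve; a minimal sketch (not from the notes, with illustrative names), where f returns the residual vector and jac its Jacobian matrix:

    import numpy as np

    def newton_step(f, jac, x):
        """One Newton update: x+ = x + dx with dx = -J(x)^{-1} f(x)."""
        dx = -np.linalg.solve(jac(x), f(x))
        return x + dx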


We now work through the example of approximating the reciprocal of the golden ratio φ. The function and its derivative are as follows:

f(x) = 1/x − φ,    f′(x) = −1/x^2

And dx becomes the following:

dx = −J(x)^{-1} f(x) = x^2 (1/x − φ) = x − x^2 φ

The update rule is x+ = x + dx. Figure 11.1 shows an iteration of Newton’s method when x = 1.

Figure 11.1: Example of one iteration of Newton’s Method in solving non-linear equations.

We now perform an error analysis of Newton's method. For the value x, define the error ε = xφ − 1. For the value x+, the error ε+ is the following:

ε+ = x+ φ − 1
   = (x + x − x^2 φ) φ − 1
   = (x + x(1 − xφ)) φ − 1
   = (x − xε) φ − 1
   = xφ − 1 − xεφ
   = ε − εxφ
   = ε(1 − xφ)
   = −ε^2

This shows that if |ε| < 1, then Newton's method converges quadratically. However, if |ε| > 1, Newton's method diverges.
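
The following sketch (not from the notes) runs the update x+ = x + (x − x²φ) starting from x = 1, where |ε| = φ − 1 < 1, and prints the error at each iteration; the error is squared (with a sign flip) every step:

    import numpy as np

    phi = (1 + np.sqrt(5)) / 2              # the golden ratio
    x = 1.0                                 # starting point: |eps| = phi - 1 < 1
    for k in range(6):
        eps = x * phi - 1                   # error as defined above
        print(k, x, eps)
        x = x + (x - phi * x * x)           # x+ = x + dx with dx = x - x^2 phi
    # eps is squared (with a sign flip) at every iteration: quadratic convergence to 1/phi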

11.2.2 Finding Minima/Maxima/Saddles

For x ∈ R^d and a twice-differentiable f : R^d → R, we want to find minx f(x). Here we focus only on minimizing f, but finding maxima and saddle points works the same way. We first define g = f′. Minimizing f is the same as finding x such that g = f′ = 0. From Equation 11.1, the Newton update is the following:
d = −J^{-1} g
  = −(g′)^{-1} g
  = −(f″)^{-1} f′
  = −H^{-1} g
where H is the Hessian. We now show that Newton's method is a descent method if H ≻ 0. Set dx = td for t > 0, and let r(dx) be the residual. Using a first-order Taylor expansion, we get the following:

df = g^T dx + r(dx)
   = g^T (t(−(f″)^{-1} f′)) + r(dx)
   = −t f′^T (f″)^{-1} f′ + r(dx)
   = −t f′^T H^{-1} f′ + r(dx)

If H ≻ 0, then H^{-1} ≻ 0, so the first term is negative whenever f′ ≠ 0, which makes Newton's method a descent method.

11.2.3 Newton’s Method and Steepest Descent

Newton’s method is a special case of steepest descent when the norm used is the Hessian norm. To find the
step for steepest descent, we minimize the following.
min_d  g^T d + (1/2) ||d||_H^2,    where ||d||_H = √(d^T H d)        (11.2)
Setting the gradient of the objective to zero, g + Hd = 0, gives the solution d = −H^{-1} g. Steepest descent with a norm constraint and steepest descent with a norm penalty in the objective are equivalent; the equivalence will be covered when duality is discussed in class. Figure 11.2 shows the difference between the step directions of gradient descent and Newton's method.

Figure 11.2: Direction of step for gradient descent and Newton’s method.
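
A quick numeric check (illustrative, not from the notes) that the Newton direction d = −H^{-1} g minimizes the model in Equation 11.2 when H ≻ 0:

    import numpy as np

    rng = np.random.default_rng(4)
    d = 6
    M = rng.standard_normal((d, d))
    H = M @ M.T + np.eye(d)                        # H is positive definite
    g = rng.standard_normal(d)

    model = lambda s: g @ s + 0.5 * s @ H @ s      # g^T d + (1/2) ||d||_H^2
    d_newton = -np.linalg.solve(H, g)              # Newton direction
    trials = [d_newton + rng.standard_normal(d) for _ in range(1000)]
    print(all(model(d_newton) <= model(s) for s in trials))   # True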

11.2.4 Damped Newton

Damped Newton combines Newton's method with a backtracking line search to make sure that the objective value does not increase.

Initialize x1
for k = 1, 2, . . .
    gk = f′(xk);                gradient
    Hk = f″(xk);                Hessian
    dk = −Hk \ gk;              Newton direction
    tk = 1;                     backtracking line search
    while f(xk + tk dk) > f(xk) + tk gk^T dk / 3        (divide by 3 to make sure the constant is < 1/2 for future proofs)
        tk = β tk               (β < 1)
    xk+1 = xk + tk dk           step
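
A minimal NumPy translation of the pseudocode above (the stopping test on the gradient norm and all names are additions for illustration; the lecture's pseudocode leaves the termination criterion implicit):

    import numpy as np

    def damped_newton(f, grad, hess, x, beta=0.5, tol=1e-8, max_iter=100):
        """Newton direction plus backtracking line search (damped Newton)."""
        for _ in range(max_iter):
            g = grad(x)                         # gradient
            if np.linalg.norm(g) < tol:         # illustrative stopping test
                break
            H = hess(x)                         # Hessian
            d = -np.linalg.solve(H, g)          # Newton direction (H \ g)
            t = 1.0                             # backtracking line search
            while f(x + t * d) > f(x) + t * (g @ d) / 3:
                t = beta * t                    # beta < 1
            x = x + t * d                       # step
        return x

    # example: minimize a quadratic, f(x) = 0.5 x^T A x - b^T x (one Newton step suffices)
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    x_star = damped_newton(lambda x: 0.5 * x @ A @ x - b @ x,
                           lambda x: A @ x - b,
                           lambda x: A,
                           x=np.zeros(2))
    print(np.allclose(x_star, np.linalg.solve(A, b)))   # True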

Damped Newton is affine invariant: suppose g(x) = f(Ax + b), and we obtain Newton iterates x1, x2, . . . from g and y1, y2, . . . from f; if y1 = Ax1 + b, then yi = Axi + b for all i.
For damped Newton, if f is bounded below, then f(xk) converges. If f is strictly convex with bounded level sets, then xk converges. Finally, damped Newton typically converges at a quadratic rate in a neighborhood of x∗.
