11 Matrix Newton

This document discusses matrix differential calculus and Newton's method for optimization. Some key points:
- Matrix differentials provide a compact way to write Taylor expansions and define derivatives for matrices, vectors, and scalars.
- Newton's method finds the root of a function by iteratively computing the Jacobian and updating based on its inverse.
- For minimization, Newton's method finds the step that minimizes a quadratic approximation of the function based on its Hessian.
- Initialization and ensuring the Hessian is invertible are important for Newton's method to converge quickly.


Matrix differential calculus

10-725 Optimization
Geoff Gordon
Ryan Tibshirani
Review
• Matrix differentials: sol’n to matrix calculus pain
‣ compact way of writing Taylor expansions, or …
‣ definition:
‣ df = a(x; dx) [+ r(dx)]
‣ a(x; .) linear in 2nd arg
‣ r(dx)/||dx|| → 0 as dx → 0

• d(…) is linear: passes thru +, scalar *


• Generalizes Jacobian, Hessian, gradient, velocity
Review
• Chain rule
• Product rule
• Bilinear functions: cross product, Kronecker,
Frobenius, Hadamard, Khatri-Rao, …
• Identities
‣ rules for working with ○, tr()
‣ trace rotation

• Identification theorems
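A quick numeric check of the trace-rotation identity tr(ABC) = tr(BCA) = tr(CAB); the matrix shapes below are arbitrary, chosen only so that all three products are square:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conformable but non-square factors: A is 3x4, B is 4x5, C is 5x3,
# so ABC, BCA, and CAB are all square and their traces are defined.
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)

print(t1, t2, t3)                       # all three agree
assert np.allclose([t1, t2], [t3, t3])  # trace rotation: tr(ABC) = tr(BCA) = tr(CAB)
```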
Finding a maximum
or minimum, or saddle point

ID for df(x)    scalar x         vector x         matrix X

scalar f        df = a dx        df = aᵀ dx       df = tr(Aᵀ dX)
vector f        df = a dx        df = A dx
matrix F        dF = A dx

[figure: 1-D plot illustrating minima, maxima, and saddle points]
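To make the scalar-f / matrix-X entry concrete: for f(X) = tr(XᵀX) we have df = tr((2X)ᵀ dX), so the identification theorem reads off the gradient as A = 2X. A small finite-difference sketch (the test function and sizes are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(X):
    # f(X) = tr(X^T X) = sum of squared entries
    return np.trace(X.T @ X)

X = rng.standard_normal((4, 3))
dX = 1e-6 * rng.standard_normal((4, 3))   # small perturbation

A = 2 * X                                  # candidate gradient from df = tr(A^T dX)
lhs = f(X + dX) - f(X)                     # actual change in f
rhs = np.trace(A.T @ dX)                   # linear term predicted by the differential

print(lhs, rhs)                            # agree up to O(||dX||^2)
assert abs(lhs - rhs) < 1e-9
```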


And so forth…
• Can’t draw it for X a matrix, tensor, …
• But same principle holds: set coefficient of dX
to 0 to find min, max, or saddle point:
‣ if df = c(A; dX) [+ r(dX)] then

‣ so: max/min/sp iff A = 0


‣ for c(.; .) any “product”,



Ex: Infomax ICA

• Training examples xi ∈ ℝd, i = 1:n

• Transformation yi = g(Wxi)
‣ W ∈ ℝd×d
‣ g(z) = an elementwise squashing nonlinearity (the logistic sigmoid in Bell & Sejnowski’s infomax ICA)

• Want: components of yi independent (samples spread roughly uniformly over the unit cube)

[figure: scatter plots of the samples xi, the linearly transformed samples Wxi, and the squashed outputs yi]
Volume rule
‣ if y = g(x) with Jacobian J = dy/dx, then P(y) = P(x) / |det J|
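A sketch checking the volume rule in one dimension, with an illustrative choice of density and transformation (a standard normal pushed through the logistic sigmoid, the same kind of squashing map as in the ICA example): the transformed density p_x(x)/|g′(x)| should integrate to 1.

```python
import numpy as np

# Volume rule in 1-D: if y = g(x) and x has density p_x, then p_y(y) = p_x(x) / |g'(x)|.
# Illustrative choice: x ~ N(0, 1) pushed through the logistic sigmoid g.
p_x   = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
sigma = lambda x: 1 / (1 + np.exp(-x))
g_inv = lambda y: np.log(y / (1 - y))          # x as a function of y
dg    = lambda x: sigma(x) * (1 - sigma(x))    # g'(x)

y = np.linspace(1e-6, 1 - 1e-6, 200001)
x = g_inv(y)
p_y = p_x(x) / dg(x)                           # volume rule

print(np.sum(p_y) * (y[1] - y[0]))             # ≈ 1: p_y is a valid density on (0, 1)
```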


Ex: Infomax ICA

• yi = g(Wxi)
‣ dyi = Ji dxi, with Ji = diag(g′(Wxi)) W

• Method: maxW ∑i −ln P(yi)   (maximize the empirical entropy of the outputs)
‣ where P(yi) = P(xi) / |det Ji|   (volume rule)

[figure: the same three scatter plots: xi, Wxi, and yi]
Gradient
• L = ∑i ln |det Ji|
‣ yi = g(Wxi),  dyi = Ji dxi


Gradient
Ji = diag(ui) W
dJi = diag(ui) dW + diag(vi) diag(dW xi) W     (ui = g′(Wxi), vi = g″(Wxi))

dL = ∑i tr(Ji⁻¹ dJi)
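The step filled in above uses the identity d ln|det J| = tr(J⁻¹ dJ); a quick finite-difference check (the test matrix is an arbitrary well-conditioned choice):

```python
import numpy as np

rng = np.random.default_rng(6)
J = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned test matrix
dJ = 1e-6 * rng.standard_normal((4, 4))           # small perturbation

lhs = np.log(abs(np.linalg.det(J + dJ))) - np.log(abs(np.linalg.det(J)))
rhs = np.trace(np.linalg.solve(J, dJ))            # tr(J^{-1} dJ)

print(lhs, rhs)                                   # agree up to O(||dJ||^2)
assert abs(lhs - rhs) < 1e-10
```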


Natural gradient
• L(W): ℝd×d → ℝ,  dL = tr(GᵀdW)
• step S = argmaxS M(S) = tr(GᵀS) − ‖SW⁻¹‖F² / 2
‣ scalar case: M = gs − s² / 2w²

• M = tr(GᵀS) − tr(W⁻ᵀ Sᵀ S W⁻¹) / 2
• dM = tr(GᵀdS) − tr(W⁻ᵀ Sᵀ dS W⁻¹) = tr((G − S W⁻¹ W⁻ᵀ)ᵀ dS)
‣ setting the coefficient of dS to zero: S = G WᵀW
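A numeric check that S = G WᵀW maximizes the local model M(S) above (G and W are random, illustrative choices; M is concave in S, so it is enough that no small perturbation increases it):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W = rng.standard_normal((d, d)) + 3 * np.eye(d)   # well-conditioned, invertible
G = rng.standard_normal((d, d))
Winv = np.linalg.inv(W)

def M(S):
    # M(S) = tr(G^T S) - ||S W^{-1}||_F^2 / 2
    return np.trace(G.T @ S) - 0.5 * np.linalg.norm(S @ Winv, 'fro')**2

S_star = G @ W.T @ W          # claimed maximizer (the natural gradient step)

# M should not increase under small random perturbations of S_star.
for _ in range(5):
    E = 1e-3 * rng.standard_normal((d, d))
    assert M(S_star + E) <= M(S_star) + 1e-12
print("S = G W^T W maximizes the local model M(S)")
```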


ICA natural gradient
• [W⁻ᵀ + C] WᵀW = W + C WᵀW = (I + C Wᵀ) W   (no inverse of W needed)

start with W0 = I

[figure: scatter plots of Wxi and yi]
ICA on natural image patches



More info
• Minka’s cheat sheet:
‣ http://research.microsoft.com/en-us/um/people/minka/papers/matrix/

• Magnus & Neudecker. Matrix Differential Calculus. 2nd ed., Wiley, 1999.
‣ http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X

• Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.
Newton’s method
10-725 Optimization
Geoff Gordon
Ryan Tibshirani
Nonlinear equations
• x ∈ ℝd, f: ℝd → ℝd, diff’ble
‣ solve: f(x) = 0

• Taylor: f(x + dx) ≈ f(x) + J dx
‣ J: the d×d Jacobian of f at x

• Newton: solve J dx = −f(x), set x ← x + dx, repeat

[figure: 1-D illustration of a Newton step]


Error analysis



dx = x*(1-x*phi)
0: 0.7500000000000000
1: 0.5898558813281841
2: 0.6167492604787597
3: 0.6180313181415453
4: 0.6180339887383547
5: 0.6180339887498948
6: 0.6180339887498949
7: 0.6180339887498948
8: 0.6180339887498949
*: 0.6180339887498948
Bad initialization
1.3000000000000000
-0.1344774409873226
-0.2982157033270080
-0.7403273854022190
-2.3674743431148597
-13.8039236412225819
-335.9214859516196157
-183256.0483360671496484
-54338444778.1145248413085938

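The two tables above are consistent with Newton's method applied to f(x) = φ − 1/x, whose root is 1/φ ≈ 0.6180339887: with f′(x) = 1/x², the Newton step is dx = −f(x)/f′(x) = x(1 − xφ), matching the header of the first table. A short sketch reproducing both runs (the choice of underlying f is my inference from the printed step; the iteration counts are illustrative):

```python
phi = (1 + 5 ** 0.5) / 2          # golden ratio

def newton_recip(x0, iters=8):
    """Newton's method for f(x) = phi - 1/x; the root is 1/phi."""
    x = x0
    print(f"0: {x:.16f}")
    for k in range(1, iters + 1):
        dx = x * (1 - x * phi)    # -f(x)/f'(x), since f'(x) = 1/x^2
        x = x + dx
        print(f"{k}: {x:.16f}")
    return x

newton_recip(0.75)   # converges quadratically to 0.6180339887498949
newton_recip(1.3)    # diverges: this iteration only converges for starts in (0, 2/phi)
```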


Minimization
• x ∈ ℝd, f: ℝd → ℝ, twice diff’ble
‣ find: minx f(x), i.e. x with f′(x) = 0

• Newton: x ← x − (f′′(x))⁻¹ f′(x)   (root-finding Newton applied to f′)


Descent
• Newton step: d = –(f’’(x))-1 f’(x)
• Gradient step: –g = –f’(x)
• Taylor: df = gᵀdx [+ r(dx)]
• Let t > 0, set dx = t d = −t (f′′(x))⁻¹ g
‣ df = −t gᵀ (f′′(x))⁻¹ g

• So: if f′′(x) ≻ 0 and g ≠ 0, then df < 0: the Newton direction is a descent direction


Steepest descent vs. Newton’s method

g = f′(x)
H = f′′(x)
‖d‖H = √(dᵀ H d)

‣ the Newton step ∆xnt is the steepest descent step when lengths are measured in the local Hessian norm ‖·‖H

[figure: comparison of the steepest descent step x + ∆xnsd and the Newton step x + ∆xnt]


Newton w/ line search
• Pick x1

• For k = 1, 2, …
‣ gk = f′(xk); Hk = f′′(xk)      gradient & Hessian
‣ dk = −Hk \ gk                  Newton direction
‣ tk = 1                         backtracking line search
‣ while f(xk + tk dk) > f(xk) + tk gkᵀdk / 2
‣   tk = β tk                    (β < 1)
‣ xk+1 = xk + tk dk              step
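A minimal NumPy sketch of the loop above; the test function, starting point, and β = 0.5 are illustrative choices, and the sufficient-decrease test uses the ½·t·gᵀd factor from the slide:

```python
import numpy as np

def newton_ls(f, grad, hess, x, beta=0.5, tol=1e-10, max_iter=50):
    """Damped Newton: Newton direction plus backtracking line search."""
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(H, g)                      # Newton direction: H d = -g
        t = 1.0
        while f(x + t * d) > f(x) + 0.5 * t * g @ d:    # backtracking
            t *= beta
        x = x + t * d
    return x

# Example: a smooth, strictly convex function with its minimizer at the origin.
f    = lambda x: np.sum(np.log(np.cosh(x))) + 0.5 * np.sum(x**2)
grad = lambda x: np.tanh(x) + x
hess = lambda x: np.diag(1.0 / np.cosh(x)**2 + 1.0)

print(newton_ls(f, grad, hess, np.array([3.0, -2.0])))  # ~ [0, 0]
```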


Properties of damped Newton
• Affine invariant: suppose g(x) = f(Ax+b)
‣ x1, x2, … from Newton on g()
‣ y1, y2, … from Newton on f()
‣ If y1 = Ax1 + b, then yk = Axk + b for all k (the iterates correspond)

• Convergent:
‣ if f bounded below, f(xk) converges
‣ if f strictly convex, bounded level sets, xk converges
‣ typically quadratic rate in neighborhood of x*

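A small check of affine invariance for the pure (undamped) Newton iteration, with an illustrative f, A, and b: running Newton on g(x) = f(Ax + b) and on f from matching starts keeps yk = Axk + b at every step.

```python
import numpy as np

rng = np.random.default_rng(7)

# f(y) = sum(log cosh(y_i)) + ||y||^2 / 2  (strictly convex); g(x) = f(Ax + b).
f_grad = lambda y: np.tanh(y) + y
f_hess = lambda y: np.diag(1 / np.cosh(y)**2 + 1)

A = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # invertible affine map
b = rng.standard_normal(3)
g_grad = lambda x: A.T @ f_grad(A @ x + b)        # chain rule
g_hess = lambda x: A.T @ f_hess(A @ x + b) @ A

x = np.array([1.0, -2.0, 0.5])        # start for Newton on g
y = A @ x + b                         # matching start for Newton on f
for _ in range(5):
    x = x - np.linalg.solve(g_hess(x), g_grad(x))   # pure Newton step on g
    y = y - np.linalg.solve(f_hess(y), f_grad(y))   # pure Newton step on f
    print(np.max(np.abs(A @ x + b - y)))            # stays ~0: iterates correspond
```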


Equality constraints
• min f(x) s.t. h(x) = 0
Optimality w/ equality
• min f(x) s.t. h(x) = 0
‣ f: Rd → R, h: Rd → Rk (k ≤ d)
‣ g: Rd → Rd (gradient of f)

• Useful special case: min f(x) s.t. Ax = 0



Picture
 
max cᵀ[x y z]ᵀ  s.t.
  x² + y² + z² = 1
  aᵀ[x y z]ᵀ = b


Optimality w/ equality
• min f(x) s.t. h(x) = 0
‣ f: Rd → R, h: Rd → Rk (k ≤ d)
‣ g: Rd → Rd (gradient of f)

• Now suppose:
‣ dg = H dx,  dh = A dx   (H = Hessian of f, A = Jacobian of h)

• Optimality: h(x) = 0 and g(x) + Aᵀλ = 0 for some λ ∈ ℝk   (g lies in the row space of A)
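One common way to use the linearization above (a sketch, not necessarily how the lecture filled in the blank): combine stationarity g + Aᵀλ = 0 with the constraint into the KKT system [[H, Aᵀ], [A, 0]] [dx; λ] = [−g; −h] and solve it for the constrained Newton step. The problem data below are illustrative; for a convex quadratic with linear constraints, one such step lands on the exact solution.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 5, 2

# Convex quadratic objective f(x) = 0.5 x^T Q x + c^T x, constraint A x = 0.
B = rng.standard_normal((d, d))
Q = B @ B.T + np.eye(d)                 # symmetric positive definite Hessian
c = rng.standard_normal(d)
A = rng.standard_normal((k, d))

x = np.zeros(d)                         # feasible start (A x = 0)
g = Q @ x + c                           # gradient at x
h = A @ x                               # constraint value at x

# KKT system for the constrained Newton step:
#   [ Q  A^T ] [ dx     ]   [ -g ]
#   [ A  0   ] [ lambda ] = [ -h ]
K = np.block([[Q, A.T], [A, np.zeros((k, k))]])
sol = np.linalg.solve(K, np.concatenate([-g, -h]))
dx, lam = sol[:d], sol[d:]

x_new = x + dx
print("A x_new:", A @ x_new)                         # stays (numerically) feasible
print("stationarity:", Q @ x_new + c + A.T @ lam)    # g + A^T lambda ≈ 0
```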


Example: bundle adjustment
• Latent:
‣ Robot positions xt, θt
‣ Landmark positions yk

• Observed: odometry, landmark vectors
‣ vt = Rθt[xt+1 − xt] + noise
‣ wt = [θt+1 − θt + noise]π
‣ dkt = Rθt[yk − xt] + noise
‣ O = {observed (k, t) pairs}

[figure: example 2-D robot trajectory with landmarks]


Bundle adjustment

min over xt, ut, yk of
  ∑t ‖vt − R(ut)[xt+1 − xt]‖² + ∑t ‖R(wt) ut − ut+1‖² + ∑(t,k)∈O ‖dk,t − R(ut)[yk − xt]‖²

s.t.  utᵀ ut = 1


Ex: MLE in exponential family

L = −∑k ln P(xk | θ)

P(xk | θ) = h(xk) exp(θᵀ xk − A(θ))

g(θ) = ∇θ L = n (∇A(θ) − x̄) = n (E[x | θ] − x̄)
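A concrete instance (my choice of family, not from the slides): the Bernoulli distribution in its natural parameterization, P(x | θ) = exp(θx − A(θ)) with A(θ) = ln(1 + e^θ), so E[x | θ] = σ(θ) and the Hessian of L is n·Var[x | θ] = n·σ(θ)(1 − σ(θ)). Newton's method then converges to the closed-form MLE θ = logit(x̄):

```python
import numpy as np

# Newton's method for the MLE of a Bernoulli natural parameter theta:
#   P(x | theta) = exp(theta*x - A(theta)),  A(theta) = log(1 + e^theta),
#   E[x | theta] = sigma(theta),  A''(theta) = Var[x | theta] = sigma(1 - sigma).
# The Newton update solves  Var[x | theta] * dtheta = xbar - E[x | theta].

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=1000)        # synthetic data with true mean 0.3
xbar = x.mean()

theta = 0.0
for k in range(6):
    mean = sigma(theta)
    var = mean * (1 - mean)
    theta += (xbar - mean) / var           # Newton step
    print(k, theta)

print("closed form:", np.log(xbar / (1 - xbar)))   # logit(xbar); Newton converges to this
```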


MLE Newton interpretation



Comparison
of methods for minimizing a convex function

              Newton                 FISTA                  (sub)grad            stoch. (sub)grad

convergence   locally quadratic      O(1/k²)                O(1/√k)              O(1/√k) in expectation

cost/iter     Hessian + d×d solve    one gradient (+ prox)  one subgradient      one sample’s subgradient

smoothness    twice differentiable   Lipschitz gradient     not required         not required


Variations
• Trust region
‣ [H(x) + tI]dx = –g(x)
‣ [H(x) + tD]dx = –g(x)

• Quasi-Newton
‣ use only gradients, but build an estimate of the Hessian
‣ in ℝd, d gradient estimates at “nearby” points determine an approx. Hessian (think finite differences)
‣ can often get a “good enough” estimate with fewer; can even forget old info to save memory (L-BFGS)
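A sketch of the first trust-region variant above, [H(x) + tI]dx = −g(x): adding tI makes the system solvable (and the step useful) even where H is indefinite, interpolating between a Newton step (small t) and a short gradient step (large t). The test function and the crude rule for increasing t are illustrative choices:

```python
import numpy as np

def regularized_newton_step(g, H, t):
    """Solve [H + t I] dx = -g  (trust-region / Levenberg-style step)."""
    return np.linalg.solve(H + t * np.eye(len(g)), -g)

# Example: a nonconvex function whose Hessian is indefinite at the start point,
# so the plain Newton step would head toward the saddle point at the origin.
f    = lambda x: x[0]**4 - x[0]**2 + x[1]**2
grad = lambda x: np.array([4*x[0]**3 - 2*x[0], 2*x[1]])
hess = lambda x: np.array([[12*x[0]**2 - 2, 0.0], [0.0, 2.0]])

x = np.array([0.1, 1.0])      # Hessian here has a negative eigenvalue (12*0.01 - 2 < 0)
for _ in range(20):
    g, H = grad(x), hess(x)
    t = 1.0
    dx = regularized_newton_step(g, H, t)
    while f(x + dx) > f(x):   # crude damping: increase t until the step decreases f
        t *= 10
        dx = regularized_newton_step(g, H, t)
    x = x + dx

print(x, f(x))                # ends near a local min (±1/sqrt(2), 0), f = -1/4
```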


Variations: Gauss-Newton

L = minθ (1/2) ∑k ‖yk − f(xk, θ)‖²
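For this nonlinear least-squares objective, Gauss-Newton uses the residual Jacobian J in place of the full Hessian: with rk(θ) = yk − f(xk, θ), the step solves (JᵀJ) dθ = −Jᵀr. A sketch on a small exponential-curve fit (the model, data, and iteration count are illustrative):

```python
import numpy as np

# Fit y ≈ f(x, theta) = theta0 * exp(theta1 * x) by Gauss-Newton.
rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
theta_true = np.array([2.0, -1.5])
y = theta_true[0] * np.exp(theta_true[1] * x) + 0.01 * rng.standard_normal(x.size)

def residuals(theta):
    return y - theta[0] * np.exp(theta[1] * x)

def jacobian(theta):
    # J[k, :] = d residual_k / d theta  (note the minus sign from y - f)
    e = np.exp(theta[1] * x)
    return -np.column_stack([e, theta[0] * x * e])

theta = np.array([1.0, 0.0])                      # rough initial guess
for _ in range(10):
    r, J = residuals(theta), jacobian(theta)
    dtheta = np.linalg.solve(J.T @ J, -J.T @ r)   # Gauss-Newton step
    theta = theta + dtheta

print(theta)        # close to theta_true = [2.0, -1.5]
```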


Variations: Fisher scoring
• Recall Newton in exponential family
E[xxᵀ | θ] dθ = x̄ − E[x | θ]

• Can use this formula in place of Newton, even if not an exponential family
‣ descent direction, even w/ no regularization
‣ “Hessian” is independent of data
‣ often a wider radius of convergence than Newton
‣ can be superlinearly convergent

