11 Matrix Newton
10-725 Optimization
Geoff Gordon
Ryan Tibshirani
Review
• Matrix differentials: sol’n to matrix calculus pain
‣ compact way of writing Taylor expansions, or …
‣ definition:
‣ df = a(x; dx) [+ r(dx)]
‣ a(x; .) linear in 2nd arg
‣ r(dx)/||dx|| → 0 as dx → 0
• Identification theorems
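For concreteness, a small worked example of the differential definition together with the identification theorem (added here for reference; not on the slide):

\[
f(X) = \operatorname{tr}(X^\top X):\qquad
df = \operatorname{tr}(dX^\top X) + \operatorname{tr}(X^\top dX)
   = \operatorname{tr}\!\big((2X)^\top dX\big)
\quad\Rightarrow\quad \nabla_X f = 2X.
\]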
Finding a maximum
or minimum, or saddle point
‣ vector f:  df = aᵀ dx,   d²f = dxᵀ A dx
‣ matrix F:  dF = A dx
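As an illustration of reading stationary points off the first and second differentials (an added example, not from the slides), take a quadratic with symmetric A:

\[
f(x) = \tfrac{1}{2} x^\top A x - b^\top x:\qquad
df = (Ax - b)^\top dx, \qquad d^2 f = dx^\top A\, dx,
\]

so stationary points satisfy Ax = b; they are minima if A ≻ 0, maxima if A ≺ 0, and saddle points if A is indefinite.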
• Transformation yi = g(Wxi)
‣ W ∈ ℝ^{d×d}
‣ g(z) =
• Want:
[figure: scatter plots of xi, Wxi, and yi = g(Wxi)]
‣ dyi =
‣ where P(yi) =
• dL =
• M =
• dM =
• start with W0 = I
ICA natural gradient
• [W⁻ᵀ + C] WᵀW =
[figure: scatter plots of yi and Wxi]
start with W0 = I
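A minimal numerical sketch of the natural-gradient ICA update this slide builds toward, assuming the logistic sigmoid g(z) = 1/(1 + e⁻ᶻ) for the nonlinearity the slide leaves blank; the function name, learning rate, and iteration count below are illustrative, not from the slides:

import numpy as np

def ica_natural_gradient(X, n_iters=200, lr=0.01):
    # X is (n, d): one observation xi per row.
    # Assumes g(z) = logistic sigmoid; other choices change the (1 - 2y) term.
    n, d = X.shape
    W = np.eye(d)                        # start with W0 = I, as on the slide
    for _ in range(n_iters):
        U = X @ W.T                      # row i is W xi
        Y = 1.0 / (1.0 + np.exp(-U))     # yi = g(W xi)
        # Ordinary gradient of the log-likelihood is W^{-T} + C with
        # C = mean_i (1 - 2 yi) xi^T; right-multiplying by W^T W gives the
        # natural gradient (I + mean_i (1 - 2 yi)(W xi)^T) W.
        W = W + lr * (np.eye(d) + (1.0 - 2.0 * Y).T @ U / n) @ W
    return W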
ICA on natural image patches
‣ solve:
• Taylor:
‣ J:
• Newton:
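For reference, the standard way these blanks get filled for root finding (my paraphrase of the usual derivation, not copied from the slide): to solve g(x) = 0, linearize and step to the zero of the linearization:

\[
g(x + \Delta) \approx g(x) + J(x)\,\Delta, \qquad J(x) = \frac{\partial g}{\partial x},
\qquad \Delta_{\text{Newton}} = -J(x)^{-1} g(x), \qquad x \leftarrow x + \Delta_{\text{Newton}}.
\]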
• Newton:
• So:
‣ g = f′(x), H = f″(x)
‣ ||d||H =
[figure: steps x + ∆xnsd and x + ∆xnt from the point x]
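For reference, the standard definitions (written out here; the slide leaves them blank to fill in): with g = f′(x) and H = f″(x),

\[
\Delta x_{\text{nt}} = -H^{-1} g, \qquad
\|d\|_H = \sqrt{d^\top H\, d}, \qquad
\|\Delta x_{\text{nt}}\|_H = \sqrt{g^\top H^{-1} g}.
\]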
• For k = 1, 2, …
‣ gk = f’(xk); Hk = f’’(xk) gradient & Hessian
‣ dk = –Hk \ gk Newton direction
‣ tk = 1 backtracking line search
‣ while f(xk + tk dk) > f(xk) + tk gkᵀdk / 2
‣ tk ← β tk   (β < 1)
‣ xk+1 = xk + tk dk step
• Convergent:
‣ if f bounded below, f(xk) converges
‣ if f strictly convex, bounded level sets, xk converges
‣ typically quadratic rate in neighborhood of x*
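A runnable sketch of this damped Newton loop (function and variable names are mine; the stopping rule and default parameters are illustrative choices, not from the slide):

import numpy as np

def damped_newton(f, grad, hess, x0, beta=0.5, tol=1e-8, max_iter=100):
    # f, grad, hess: callables returning f(x), f'(x), f''(x).
    # Assumes the Hessian is positive definite so d is a descent direction.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)                     # gk, Hk: gradient & Hessian
        if np.linalg.norm(g) < tol:                 # illustrative stopping rule
            break
        d = np.linalg.solve(H, -g)                  # dk = -Hk \ gk, Newton direction
        t = 1.0                                     # backtracking line search
        while f(x + t * d) > f(x) + t * (g @ d) / 2:
            t *= beta                               # shrink: tk <- beta * tk, beta < 1
        x = x + t * d                               # xk+1 = xk + tk dk
    return x

With t fixed at 1 (no backtracking) this reduces to the pure Newton iteration; the while-test is the sufficient-decrease condition from the slide with parameter 1/2.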
Optimality w/ equality
• min f(x) s.t. h(x) = 0
‣ f: ℝ^d → ℝ, h: ℝ^d → ℝ^k (k ≤ d)
‣ g: ℝ^d → ℝ^d (gradient of f)
[figure: constraint surfaces x² + y² + z² = 1 and aᵀx = b]
• Now suppose:
‣ dg =    dh =
• Optimality:
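For reference, the first-order condition this slide is heading toward is the standard Lagrange-multiplier statement (stated here; the slide leaves it blank to derive in lecture):

\[
\nabla f(x^\star) + \nabla h(x^\star)^\top \nu = 0 \quad\text{for some } \nu \in \mathbb{R}^k,
\qquad h(x^\star) = 0,
\]

i.e., the gradient of f lies in the row space of the constraint Jacobian at the optimum.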
Example: bundle adjustment
• Latent:
‣ Robot positions xt, θt
‣ Landmark positions yk
• Observed: odometry, landmark vectors
‣ vt = Rθt[xt+1–xt] + noise
‣ wt = [θt+1–θt + noise]π
‣ dkt = Rθt[yk – xt] + noise
‣ O = {observed (k, t) pairs}
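One common way to turn this model into a problem that Newton-type methods can attack (my sketch, assuming i.i.d. Gaussian noise and ignoring the angle wrap-around in wt) is the nonlinear least-squares objective

\[
\min_{\{x_t,\theta_t\},\{y_k\}} \;
\sum_t \big\| v_t - R_{\theta_t}(x_{t+1}-x_t) \big\|^2
+ \sum_t \big( w_t - (\theta_{t+1}-\theta_t) \big)^2
+ \sum_{(k,t)\in O} \big\| d_{kt} - R_{\theta_t}(y_k - x_t) \big\|^2 .
\]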
• convergence, cost/iter, smoothness
• Quasi-Newton
‣ use only gradients, but build an estimate of the Hessian
‣ in ℝ^d: d gradient estimates at “nearby” points
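The slides don't name a particular quasi-Newton scheme; as one concrete instance of building a Hessian estimate from gradients, here is the standard BFGS update of an estimate B (function and variable names are illustrative):

import numpy as np

def bfgs_update(B, s, y):
    # One BFGS update of the Hessian estimate B, where
    #   s = x_{k+1} - x_k   (step taken)
    #   y = g_{k+1} - g_k   (change in gradient)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)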