11 Matrix Newton

This document discusses matrix differential calculus and Newton's method for optimization. Some key points:
- Matrix differentials provide a compact way to write Taylor expansions and define derivatives for matrices, vectors, and scalars.
- Newton's method finds the root of a function by iteratively computing the Jacobian and updating based on its inverse.
- For minimization, Newton's method finds the step that minimizes a quadratic approximation of the function based on its Hessian.
- Initialization and ensuring the Hessian is invertible are important for Newton's method to converge quickly.


Matrix differential calculus

10-725 Optimization
Geoff Gordon
Ryan Tibshirani
Review
• Matrix differentials: sol’n to matrix calculus pain
‣ compact way of writing Taylor expansions, or …
‣ definition:
‣ df = a(x; dx) [+ r(dx)]
‣ a(x; .) linear in 2nd arg
‣ r(dx)/||dx|| → 0 as dx → 0

• d(…) is linear: passes thru +, scalar *


• Generalizes Jacobian, Hessian, gradient, velocity
Review
• Chain rule
• Product rule
• Bilinear functions: cross product, Kronecker,
Frobenius, Hadamard, Khatri-Rao, …
• Identities
‣ rules for working with ○, tr()
‣ trace rotation

• Identification theorems
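A quick numeric check of the trace-rotation identity tr(ABC) = tr(BCA) = tr(CAB); the matrix shapes below are arbitrary, chosen only so that all three products are square:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conformable but non-square factors: A is 3x4, B is 4x5, C is 5x3,
# so ABC, BCA, and CAB are all square and their traces are defined.
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)

print(t1, t2, t3)                       # all three agree
assert np.allclose([t1, t2], [t3, t3])  # trace rotation: tr(ABC) = tr(BCA) = tr(CAB)
```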
Finding a maximum
or minimum, or saddle point

ID for df(x)    scalar x         vector x         matrix X

scalar f        df = a dx        df = aᵀ dx       df = tr(Aᵀ dX)
vector f        df = a dx        df = A dx
matrix F        dF = A dx

[figure: 1-D plot illustrating minima, maxima, and saddle points]
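To make the scalar-f / matrix-X entry concrete: for f(X) = tr(XᵀX) we have df = tr((2X)ᵀ dX), so the identification theorem reads off the gradient as A = 2X. A small finite-difference sketch (the test function and sizes are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(X):
    # f(X) = tr(X^T X) = sum of squared entries
    return np.trace(X.T @ X)

X = rng.standard_normal((4, 3))
dX = 1e-6 * rng.standard_normal((4, 3))   # small perturbation

A = 2 * X                                  # candidate gradient from df = tr(A^T dX)
lhs = f(X + dX) - f(X)                     # actual change in f
rhs = np.trace(A.T @ dX)                   # linear term predicted by the differential

print(lhs, rhs)                            # agree up to O(||dX||^2)
assert abs(lhs - rhs) < 1e-9
```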


And so forth…
• Can’t draw it for X a matrix, tensor, …
• But same principle holds: set coefficient of dX
to 0 to find min, max, or saddle point:
‣ if df = c(A; dX) [+ r(dX)] then

‣ so: max/min/sp iff A = 0


‣ for c(.; .) any “product”,



Ex: Infomax ICA

• Training examples xi ∈ ℝd, i = 1:n

• Transformation yi = g(Wxi)
‣ W ∈ ℝd×d
‣ g(z) = an elementwise squashing nonlinearity (the logistic sigmoid in Bell & Sejnowski’s infomax ICA)

• Want: components of yi independent (samples spread roughly uniformly over the unit cube)

[figure: scatter plots of the samples xi, the linearly transformed samples Wxi, and the squashed outputs yi]
Volume rule
‣ if y = g(x) with Jacobian J = dy/dx, then P(y) = P(x) / |det J|
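A sketch checking the volume rule in one dimension, with an illustrative choice of density and transformation (a standard normal pushed through the logistic sigmoid, the same kind of squashing map as in the ICA example): the transformed density p_x(x)/|g′(x)| should integrate to 1.

```python
import numpy as np

# Volume rule in 1-D: if y = g(x) and x has density p_x, then p_y(y) = p_x(x) / |g'(x)|.
# Illustrative choice: x ~ N(0, 1) pushed through the logistic sigmoid g.
p_x   = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
sigma = lambda x: 1 / (1 + np.exp(-x))
g_inv = lambda y: np.log(y / (1 - y))          # x as a function of y
dg    = lambda x: sigma(x) * (1 - sigma(x))    # g'(x)

y = np.linspace(1e-6, 1 - 1e-6, 200001)
x = g_inv(y)
p_y = p_x(x) / dg(x)                           # volume rule

print(np.sum(p_y) * (y[1] - y[0]))             # ≈ 1: p_y is a valid density on (0, 1)
```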


Ex: Infomax ICA

• yi = g(Wxi)
‣ dyi = Ji dxi, with Ji = diag(g′(Wxi)) W

• Method: maxW ∑i −ln P(yi)   (maximize the empirical entropy of the outputs)
‣ where P(yi) = P(xi) / |det Ji|   (volume rule)

[figure: the same three scatter plots: xi, Wxi, and yi]
Gradient
• L = ∑i ln |det Ji|
‣ yi = g(Wxi),  dyi = Ji dxi


Gradient
Ji = diag(ui) W
dJi = diag(ui) dW + diag(vi) diag(dW xi) W     (ui = g′(Wxi), vi = g″(Wxi))

dL = ∑i tr(Ji⁻¹ dJi)
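The step filled in above uses the identity d ln|det J| = tr(J⁻¹ dJ); a quick finite-difference check (the test matrix is an arbitrary well-conditioned choice):

```python
import numpy as np

rng = np.random.default_rng(6)
J = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned test matrix
dJ = 1e-6 * rng.standard_normal((4, 4))           # small perturbation

lhs = np.log(abs(np.linalg.det(J + dJ))) - np.log(abs(np.linalg.det(J)))
rhs = np.trace(np.linalg.solve(J, dJ))            # tr(J^{-1} dJ)

print(lhs, rhs)                                   # agree up to O(||dJ||^2)
assert abs(lhs - rhs) < 1e-10
```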


Natural gradient
• L(W): ℝd×d → ℝ,  dL = tr(GᵀdW)
• step S = argmaxS M(S) = tr(GᵀS) − ‖SW⁻¹‖F² / 2
‣ scalar case: M = gs − s² / 2w²

• M = tr(GᵀS) − tr(W⁻ᵀ Sᵀ S W⁻¹) / 2
• dM = tr(GᵀdS) − tr(W⁻ᵀ Sᵀ dS W⁻¹) = tr((G − S W⁻¹ W⁻ᵀ)ᵀ dS)
‣ setting the coefficient of dS to zero: S = G WᵀW
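A numeric check that S = G WᵀW maximizes the local model M(S) above (G and W are random, illustrative choices; M is concave in S, so it is enough that no small perturbation increases it):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W = rng.standard_normal((d, d)) + 3 * np.eye(d)   # well-conditioned, invertible
G = rng.standard_normal((d, d))
Winv = np.linalg.inv(W)

def M(S):
    # M(S) = tr(G^T S) - ||S W^{-1}||_F^2 / 2
    return np.trace(G.T @ S) - 0.5 * np.linalg.norm(S @ Winv, 'fro')**2

S_star = G @ W.T @ W          # claimed maximizer (the natural gradient step)

# M should not increase under small random perturbations of S_star.
for _ in range(5):
    E = 1e-3 * rng.standard_normal((d, d))
    assert M(S_star + E) <= M(S_star) + 1e-12
print("S = G W^T W maximizes the local model M(S)")
```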


ICA natural gradient
• [W⁻ᵀ + C] WᵀW = W + C WᵀW = (I + C Wᵀ) W   (no inverse of W needed)

start with W0 = I

[figure: scatter plots of Wxi and yi]
ICA on natural image patches



More info
• Minka’s cheat sheet:
‣ http://research.microsoft.com/en-us/um/people/minka/papers/matrix/

• Magnus & Neudecker. Matrix Differential Calculus. 2nd ed., Wiley, 1999.
‣ http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X

• Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.
Newton’s method
10-725 Optimization
Geoff Gordon
Ryan Tibshirani
Nonlinear equations
• x ∈ ℝd, f: ℝd → ℝd, diff’ble
‣ solve: f(x) = 0

• Taylor: f(x + dx) ≈ f(x) + J dx
‣ J: the d×d Jacobian of f at x

• Newton: solve J dx = −f(x), set x ← x + dx, repeat

[figure: 1-D illustration of a Newton step]


Error analysis



dx = x*(1-x*phi)
0: 0.7500000000000000
1: 0.5898558813281841
2: 0.6167492604787597
3: 0.6180313181415453
4: 0.6180339887383547
5: 0.6180339887498948
6: 0.6180339887498949
7: 0.6180339887498948
8: 0.6180339887498949
*: 0.6180339887498948
Bad initialization
1.3000000000000000
-0.1344774409873226
-0.2982157033270080
-0.7403273854022190
-2.3674743431148597
-13.8039236412225819
-335.9214859516196157
-183256.0483360671496484
-54338444778.1145248413085938

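The two tables above are consistent with Newton's method applied to f(x) = φ − 1/x, whose root is 1/φ ≈ 0.6180339887: with f′(x) = 1/x², the Newton step is dx = −f(x)/f′(x) = x(1 − xφ), matching the header of the first table. A short sketch reproducing both runs (the choice of underlying f is my inference from the printed step; the iteration counts are illustrative):

```python
phi = (1 + 5 ** 0.5) / 2          # golden ratio

def newton_recip(x0, iters=8):
    """Newton's method for f(x) = phi - 1/x; the root is 1/phi."""
    x = x0
    print(f"0: {x:.16f}")
    for k in range(1, iters + 1):
        dx = x * (1 - x * phi)    # -f(x)/f'(x), since f'(x) = 1/x^2
        x = x + dx
        print(f"{k}: {x:.16f}")
    return x

newton_recip(0.75)   # converges quadratically to 0.6180339887498949
newton_recip(1.3)    # diverges: this iteration only converges for starts in (0, 2/phi)
```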


Minimization
• x ∈ ℝd, f: ℝd → ℝ, twice diff’ble
‣ find: minx f(x), i.e. x with f′(x) = 0

• Newton: x ← x − (f′′(x))⁻¹ f′(x)   (root-finding Newton applied to f′)


Descent
• Newton step: d = –(f’’(x))-1 f’(x)
• Gradient step: –g = –f’(x)
• Taylor: df = gᵀdx [+ r(dx)]
• Let t > 0, set dx = t d = −t (f′′(x))⁻¹ g
‣ df = −t gᵀ (f′′(x))⁻¹ g

• So: if f′′(x) ≻ 0 and g ≠ 0, then df < 0: the Newton direction is a descent direction


Steepest descent vs. Newton’s method

g = f′(x)
H = f′′(x)
‖d‖H = √(dᵀ H d)

‣ the Newton step ∆xnt is the steepest descent step when lengths are measured in the local Hessian norm ‖·‖H

[figure: comparison of the steepest descent step x + ∆xnsd and the Newton step x + ∆xnt]


Newton w/ line search
• Pick x1

• For k = 1, 2, …
‣ gk = f′(xk); Hk = f′′(xk)      gradient & Hessian
‣ dk = −Hk \ gk                  Newton direction
‣ tk = 1                         backtracking line search
‣ while f(xk + tk dk) > f(xk) + tk gkᵀdk / 2
‣   tk = β tk                    (β < 1)
‣ xk+1 = xk + tk dk              step
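A minimal NumPy sketch of the loop above; the test function, starting point, and β = 0.5 are illustrative choices, and the sufficient-decrease test uses the ½·t·gᵀd factor from the slide:

```python
import numpy as np

def newton_ls(f, grad, hess, x, beta=0.5, tol=1e-10, max_iter=50):
    """Damped Newton: Newton direction plus backtracking line search."""
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(H, g)                      # Newton direction: H d = -g
        t = 1.0
        while f(x + t * d) > f(x) + 0.5 * t * g @ d:    # backtracking
            t *= beta
        x = x + t * d
    return x

# Example: a smooth, strictly convex function with its minimizer at the origin.
f    = lambda x: np.sum(np.log(np.cosh(x))) + 0.5 * np.sum(x**2)
grad = lambda x: np.tanh(x) + x
hess = lambda x: np.diag(1.0 / np.cosh(x)**2 + 1.0)

print(newton_ls(f, grad, hess, np.array([3.0, -2.0])))  # ~ [0, 0]
```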


Properties of damped Newton
• Affine invariant: suppose g(x) = f(Ax+b)
‣ x1, x2, … from Newton on g()
‣ y1, y2, … from Newton on f()
‣ If y1 = Ax1 + b, then yk = Axk + b for all k (the iterates correspond)

• Convergent:
‣ if f bounded below, f(xk) converges
‣ if f strictly convex, bounded level sets, xk converges
‣ typically quadratic rate in neighborhood of x*

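A small check of affine invariance for the pure (undamped) Newton iteration, with an illustrative f, A, and b: running Newton on g(x) = f(Ax + b) and on f from matching starts keeps yk = Axk + b at every step.

```python
import numpy as np

rng = np.random.default_rng(7)

# f(y) = sum(log cosh(y_i)) + ||y||^2 / 2  (strictly convex); g(x) = f(Ax + b).
f_grad = lambda y: np.tanh(y) + y
f_hess = lambda y: np.diag(1 / np.cosh(y)**2 + 1)

A = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # invertible affine map
b = rng.standard_normal(3)
g_grad = lambda x: A.T @ f_grad(A @ x + b)        # chain rule
g_hess = lambda x: A.T @ f_hess(A @ x + b) @ A

x = np.array([1.0, -2.0, 0.5])        # start for Newton on g
y = A @ x + b                         # matching start for Newton on f
for _ in range(5):
    x = x - np.linalg.solve(g_hess(x), g_grad(x))   # pure Newton step on g
    y = y - np.linalg.solve(f_hess(y), f_grad(y))   # pure Newton step on f
    print(np.max(np.abs(A @ x + b - y)))            # stays ~0: iterates correspond
```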


Equality constraints
• min f(x) s.t. h(x) = 0
Optimality w/ equality
• min f(x) s.t. h(x) = 0
‣ f: Rd → R, h: Rd → Rk (k ≤ d)
‣ g: Rd → Rd (gradient of f)

• Useful special case: min f(x) s.t. Ax = 0



Picture
 
max cᵀ[x y z]ᵀ  s.t.
  x² + y² + z² = 1
  aᵀ[x y z]ᵀ = b


Optimality w/ equality
• min f(x) s.t. h(x) = 0
‣ f: Rd → R, h: Rd → Rk (k ≤ d)
‣ g: Rd → Rd (gradient of f)

• Now suppose:
‣ dg = H dx,  dh = A dx   (H = Hessian of f, A = Jacobian of h)

• Optimality: h(x) = 0 and g(x) + Aᵀλ = 0 for some λ ∈ ℝk   (g lies in the row space of A)
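One common way to use the linearization above (a sketch, not necessarily how the lecture filled in the blank): combine stationarity g + Aᵀλ = 0 with the constraint into the KKT system [[H, Aᵀ], [A, 0]] [dx; λ] = [−g; −h] and solve it for the constrained Newton step. The problem data below are illustrative; for a convex quadratic with linear constraints, one such step lands on the exact solution.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 5, 2

# Convex quadratic objective f(x) = 0.5 x^T Q x + c^T x, constraint A x = 0.
B = rng.standard_normal((d, d))
Q = B @ B.T + np.eye(d)                 # symmetric positive definite Hessian
c = rng.standard_normal(d)
A = rng.standard_normal((k, d))

x = np.zeros(d)                         # feasible start (A x = 0)
g = Q @ x + c                           # gradient at x
h = A @ x                               # constraint value at x

# KKT system for the constrained Newton step:
#   [ Q  A^T ] [ dx     ]   [ -g ]
#   [ A  0   ] [ lambda ] = [ -h ]
K = np.block([[Q, A.T], [A, np.zeros((k, k))]])
sol = np.linalg.solve(K, np.concatenate([-g, -h]))
dx, lam = sol[:d], sol[d:]

x_new = x + dx
print("A x_new:", A @ x_new)                         # stays (numerically) feasible
print("stationarity:", Q @ x_new + c + A.T @ lam)    # g + A^T lambda ≈ 0
```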


Example: bundle adjustment
• Latent:
‣ Robot positions xt, θt
‣ Landmark positions yk

• Observed: odometry, landmark vectors
‣ vt = Rθt[xt+1 − xt] + noise
‣ wt = [θt+1 − θt + noise]π
‣ dkt = Rθt[yk − xt] + noise
‣ O = {observed (k, t) pairs}

[figure: example 2-D robot trajectory with landmarks]


Bundle adjustment

min over xt, ut, yk of
  ∑t ‖vt − R(ut)[xt+1 − xt]‖² + ∑t ‖R(wt) ut − ut+1‖² + ∑(t,k)∈O ‖dk,t − R(ut)[yk − xt]‖²

s.t.  utᵀ ut = 1


Ex: MLE in exponential family

L = −∑k ln P(xk | θ)

P(xk | θ) = h(xk) exp(θᵀ xk − A(θ))

g(θ) = ∇θ L = n (∇A(θ) − x̄) = n (E[x | θ] − x̄)
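A concrete instance (my choice of family, not from the slides): the Bernoulli distribution in its natural parameterization, P(x | θ) = exp(θx − A(θ)) with A(θ) = ln(1 + e^θ), so E[x | θ] = σ(θ) and the Hessian of L is n·Var[x | θ] = n·σ(θ)(1 − σ(θ)). Newton's method then converges to the closed-form MLE θ = logit(x̄):

```python
import numpy as np

# Newton's method for the MLE of a Bernoulli natural parameter theta:
#   P(x | theta) = exp(theta*x - A(theta)),  A(theta) = log(1 + e^theta),
#   E[x | theta] = sigma(theta),  A''(theta) = Var[x | theta] = sigma(1 - sigma).
# The Newton update solves  Var[x | theta] * dtheta = xbar - E[x | theta].

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=1000)        # synthetic data with true mean 0.3
xbar = x.mean()

theta = 0.0
for k in range(6):
    mean = sigma(theta)
    var = mean * (1 - mean)
    theta += (xbar - mean) / var           # Newton step
    print(k, theta)

print("closed form:", np.log(xbar / (1 - xbar)))   # logit(xbar); Newton converges to this
```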


MLE Newton interpretation



Comparison
of methods for minimizing a convex function

              Newton                 FISTA                  (sub)grad            stoch. (sub)grad

convergence   locally quadratic      O(1/k²)                O(1/√k)              O(1/√k) in expectation

cost/iter     Hessian + d×d solve    one gradient (+ prox)  one subgradient      one sample’s subgradient

smoothness    twice differentiable   Lipschitz gradient     not required         not required


Variations
• Trust region
‣ [H(x) + tI]dx = –g(x)
‣ [H(x) + tD]dx = –g(x)

• Quasi-Newton
‣ use only gradients, but build an estimate of the Hessian
‣ in ℝd, d gradient estimates at “nearby” points determine an approx. Hessian (think finite differences)
‣ can often get a “good enough” estimate with fewer; can even forget old info to save memory (L-BFGS)
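A sketch of the first trust-region variant above, [H(x) + tI]dx = −g(x): adding tI makes the system solvable (and the step useful) even where H is indefinite, interpolating between a Newton step (small t) and a short gradient step (large t). The test function and the crude rule for increasing t are illustrative choices:

```python
import numpy as np

def regularized_newton_step(g, H, t):
    """Solve [H + t I] dx = -g  (trust-region / Levenberg-style step)."""
    return np.linalg.solve(H + t * np.eye(len(g)), -g)

# Example: a nonconvex function whose Hessian is indefinite at the start point,
# so the plain Newton step would head toward the saddle point at the origin.
f    = lambda x: x[0]**4 - x[0]**2 + x[1]**2
grad = lambda x: np.array([4*x[0]**3 - 2*x[0], 2*x[1]])
hess = lambda x: np.array([[12*x[0]**2 - 2, 0.0], [0.0, 2.0]])

x = np.array([0.1, 1.0])      # Hessian here has a negative eigenvalue (12*0.01 - 2 < 0)
for _ in range(20):
    g, H = grad(x), hess(x)
    t = 1.0
    dx = regularized_newton_step(g, H, t)
    while f(x + dx) > f(x):   # crude damping: increase t until the step decreases f
        t *= 10
        dx = regularized_newton_step(g, H, t)
    x = x + dx

print(x, f(x))                # ends near a local min (±1/sqrt(2), 0), f = -1/4
```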


Variations: Gauss-Newton

L = minθ (1/2) ∑k ‖yk − f(xk, θ)‖²
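For this nonlinear least-squares objective, Gauss-Newton uses the residual Jacobian J in place of the full Hessian: with rk(θ) = yk − f(xk, θ), the step solves (JᵀJ) dθ = −Jᵀr. A sketch on a small exponential-curve fit (the model, data, and iteration count are illustrative):

```python
import numpy as np

# Fit y ≈ f(x, theta) = theta0 * exp(theta1 * x) by Gauss-Newton.
rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
theta_true = np.array([2.0, -1.5])
y = theta_true[0] * np.exp(theta_true[1] * x) + 0.01 * rng.standard_normal(x.size)

def residuals(theta):
    return y - theta[0] * np.exp(theta[1] * x)

def jacobian(theta):
    # J[k, :] = d residual_k / d theta  (note the minus sign from y - f)
    e = np.exp(theta[1] * x)
    return -np.column_stack([e, theta[0] * x * e])

theta = np.array([1.0, 0.0])                      # rough initial guess
for _ in range(10):
    r, J = residuals(theta), jacobian(theta)
    dtheta = np.linalg.solve(J.T @ J, -J.T @ r)   # Gauss-Newton step
    theta = theta + dtheta

print(theta)        # close to theta_true = [2.0, -1.5]
```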


Variations: Fisher scoring
• Recall Newton in exponential family
E[xxᵀ | θ] dθ = x̄ − E[x | θ]

• Can use this formula in place of Newton, even if not an exponential family
‣ descent direction, even w/ no regularization
‣ “Hessian” is independent of data
‣ often a wider radius of convergence than Newton
‣ can be superlinearly convergent

