Lecture 11: October 2
The matrix differential is a remedy for the pain of matrix calculus; it can be understood in either of the following ways:
\[
\mathrm{vol}(AS) = |\det(A)|\,\mathrm{vol}(S)
\]
Interpretation: a small determinant implies the existence of a small eigenvalue, which squashes the volume flat, and vice versa.
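As a quick numerical illustration (not part of the original notes), the identity can be checked by measuring the area of a region before and after applying $A$; the matrix $A$ and the choice of $S$ as the unit disk below are arbitrary:

```python
# Numerical check of vol(A S) = |det(A)| vol(S) in R^2, with S the unit disk.
import numpy as np
from scipy.spatial import ConvexHull

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])          # arbitrary example matrix

# Discretize the boundary of the unit disk and map it through A.
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
S_boundary = np.stack([np.cos(theta), np.sin(theta)], axis=1)
AS_boundary = S_boundary @ A.T

vol_S = ConvexHull(S_boundary).volume    # area of S, approximately pi
vol_AS = ConvexHull(AS_boundary).volume  # area of A S (an ellipse)

print(vol_AS / vol_S)           # ratio of the two volumes
print(abs(np.linalg.det(A)))    # |det(A)|, should match the ratio
```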
Back to infomax ICA. We have $y_i = g(Wx_i)$, where $dy_i = J(x_i; W)\,dx_i = J_i\,dx_i$. We want to maximize the entropy of the distribution of $y$:
\[
\max_W \sum_i \bigl(-\ln P(y_i)\bigr), \qquad P(y_i) = \frac{P(x_i)}{|\det J(x_i; W)|}
\]
And from
\[
\max H(P(y)) = -\int P(y)\ln P(y)\,dy = -E[\ln P(y)],
\]
it is equivalent to maximizing $\sum_i \ln|\det J(x_i; W)|$, since the $\ln P(x_i)$ terms do not depend on $W$.
Define $u_i = g'(Wx_i)$ and $v_i = g''(Wx_i)$.
For the differential of $y_i = g(Wx_i)$ (with $\circ$ denoting the elementwise product):
\begin{align*}
dy_i &= g'(Wx_i) \circ d(Wx_i) \\
     &= u_i \circ (W\,dx_i) \\
     &= \operatorname{diag}(u_i)\,W\,dx_i
\end{align*}
Finally, define $\operatorname{diag}(\alpha_i) = \operatorname{diag}(u_i)^{-1}\operatorname{diag}(v_i)$ and solve for the gradient of $L = \sum_i \ln|\det J(x_i; W)|$:
\begin{align*}
dL &= \sum_i d\bigl(\ln|\det J(x_i; W)|\bigr) \\
   &= \sum_i \operatorname{tr}(J_i^{-1}\,dJ_i) \\
   &= \sum_i \operatorname{tr}\bigl(W^{-1}\,dW + W^{-1}\operatorname{diag}(u_i)^{-1}\operatorname{diag}(v_i)\operatorname{diag}(d(Wx_i))\,W\bigr) \\
   &= \sum_i \Bigl[\operatorname{tr}(W^{-1}\,dW) + \operatorname{tr}\bigl(\operatorname{diag}(\alpha_i)\operatorname{diag}(d(Wx_i))\bigr)\Bigr] \\
   &= n\operatorname{tr}(W^{-1}\,dW) + \sum_i \operatorname{tr}\bigl(\alpha_i^T\,dW\,x_i\bigr) \\
   &= n\operatorname{tr}(W^{-1}\,dW) + \operatorname{tr}\Bigl(\bigl(\textstyle\sum_i x_i\alpha_i^T\bigr)\,dW\Bigr).
\end{align*}
Since $dL = \operatorname{tr}(A\,dW)$ implies $\nabla_W L = A^T$, the gradient is
\[
\nabla_W L = nW^{-T} + \sum_i \alpha_i x_i^T = nW^{-T} + C, \qquad C := \sum_i \alpha_i x_i^T.
\]
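A minimal sketch (not from the notes) that checks this gradient against finite differences, assuming $g$ is the logistic sigmoid, the usual infomax choice, for which $\alpha_i = 1 - 2\,g(Wx_i)$; the dimensions, data, and random seed are arbitrary:

```python
# Finite-difference check of the gradient n W^{-T} + sum_i alpha_i x_i^T,
# assuming g is the logistic sigmoid (so alpha_i = 1 - 2 g(W x_i)).
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                       # dimension and number of samples (arbitrary)
W = rng.standard_normal((d, d))
X = rng.standard_normal((d, n))   # columns stand in for the data points x_i

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def L(W):
    """L(W) = sum_i ln |det J(x_i; W)| with J_i = diag(g'(W x_i)) W."""
    total = 0.0
    for i in range(n):
        y = sigmoid(W @ X[:, i])
        J = np.diag(y * (1.0 - y)) @ W
        total += np.log(abs(np.linalg.det(J)))
    return total

# Closed-form gradient from the derivation above.
Y = sigmoid(W @ X)
alpha = 1.0 - 2.0 * Y                         # alpha_i as columns
grad = n * np.linalg.inv(W).T + alpha @ X.T   # n W^{-T} + sum_i alpha_i x_i^T

# Numerical gradient by central differences.
eps = 1e-6
num = np.zeros_like(W)
for r in range(d):
    for c in range(d):
        E = np.zeros_like(W)
        E[r, c] = eps
        num[r, c] = (L(W + E) - L(W - E)) / (2.0 * eps)

print(np.max(np.abs(grad - num)))   # should be small (finite-difference accuracy)
```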
The natural-gradient step $S$ (with $G$ the gradient just derived) is chosen as
\[
S = \arg\max_S M(S), \qquad M(S) = \operatorname{tr}(G^T S) - \frac{\|SW^{-1}\|_F^2}{2},
\]
which, in the scalar case, is
\[
M = gS - \frac{S^2}{2W^2}.
\]
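As a worked step (not spelled out in the notes), the scalar case can be solved directly and previews the matrix result below:
\[
\frac{dM}{dS} = g - \frac{S}{W^2} = 0 \quad\Rightarrow\quad S = gW^2,
\]
the scalar analogue of $S = GW^TW$ derived next.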
So, to find the max/min/saddle point:
\begin{align*}
M &= \operatorname{tr}(G^T S) - \frac{1}{2}\operatorname{tr}(SW^{-1}W^{-T}S^T) \\
dM &= \operatorname{tr}(G^T\,dS) - \operatorname{tr}(dS\,W^{-1}W^{-T}S^T)
\end{align*}
Setting $dM = 0$ gives $G = SW^{-1}W^{-T}$, and thus $S = GW^TW$. Using the gradient previously derived, $G = nW^{-T} + C$, the natural-gradient step is $[nW^{-T} + C]\,W^TW = nW + CW^TW$.
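A minimal sketch (not from the notes) of the resulting update loop in NumPy, again assuming the logistic sigmoid for $g$; the learning rate, data, and iteration count are arbitrary placeholders:

```python
# One-loop sketch of the natural-gradient infomax ICA update S = G W^T W.
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 200
X = rng.standard_normal((d, n))    # stands in for observed (mixed) data
W = np.eye(d)                      # initial unmixing matrix
lr = 0.01                          # arbitrary step size

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for _ in range(100):
    Y = sigmoid(W @ X)
    alpha = 1.0 - 2.0 * Y
    C = alpha @ X.T                   # C = sum_i alpha_i x_i^T
    G = n * np.linalg.inv(W).T + C    # ordinary gradient n W^{-T} + C
    S = G @ W.T @ W                   # natural-gradient step: n W + C W^T W
    W = W + (lr / n) * S              # ascent step on the entropy objective
```

Each iteration forms the ordinary gradient $nW^{-T} + C$ and then right-multiplies by $W^TW$ to obtain the natural-gradient step, which avoids the explicit inverse in the direction actually taken.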
• Magnus & Neudecker. Matrix Differential Calculus. 2nd ed., Wiley, 1999.
http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X
• Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution.
Neural Computation, v7, 1995.
Newton’s method has two main applications: solving nonlinear equations and finding minima/maxima/saddle points.
For $x \in \mathbb{R}^d$ and differentiable $f : \mathbb{R}^d \to \mathbb{R}^d$, we want to solve $f(x) = 0$. We take the first-order Taylor approximation of $f$ around $x$, set $f(x) + J(x)(y - x) = 0$, and solve for the step
\[
dx = y - x = -J(x)^{-1}f(x). \tag{11.1}
\]
For the scalar example $f(x) = \frac{1}{x} - \varphi$ (whose root is $1/\varphi$), this gives
\[
dx = -J(x)^{-1}f(x) = x^2\Bigl(\frac{1}{x} - \varphi\Bigr) = x - x^2\varphi.
\]
The update rule is x+ = x + dx. Figure 11.1 shows an iteration of Newton’s method when x = 1.
Figure 11.1: Example of one iteration of Newton’s Method in solving non-linear equations.
We now perform an error analysis of Newton’s method. For the value $x$, the error is $\epsilon = x\varphi - 1$. For the value $x^+$, the error $\epsilon^+$ is the following.
\begin{align*}
\epsilon^+ &= x^+\varphi - 1 \\
&= (x + x - x^2\varphi)\varphi - 1 \\
&= \bigl(x + x(1 - x\varphi)\bigr)\varphi - 1 \\
&= (x - x\epsilon)\varphi - 1 \\
&= x\varphi - 1 - \epsilon x\varphi \\
&= \epsilon - \epsilon x\varphi \\
&= \epsilon(1 - x\varphi) \\
&= -\epsilon^2
\end{align*}
This shows that if $|\epsilon| < 1$, then Newton’s method has quadratic convergence. However, if $|\epsilon| > 1$, then Newton’s method will diverge.
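As a quick numerical illustration (not part of the notes), running the iteration $x^+ = 2x - x^2\varphi$ for a few steps shows the error squaring at every step; $\varphi$ and the starting point are arbitrary choices with $|\epsilon| < 1$:

```python
# Division-free Newton iteration for f(x) = 1/x - phi, printing the error
# eps = x*phi - 1 to show that it squares (with a sign flip) each step.
phi = 3.7
x = 0.2                        # initial guess for 1/phi (true value ~0.27027)
for k in range(6):
    eps = x * phi - 1.0
    print(f"k={k}  x={x:.15f}  eps={eps:+.3e}")
    x = 2.0 * x - x * x * phi  # Newton step x+ = x + (x - x^2 * phi)
print("1/phi =", 1.0 / phi)
```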
For $x \in \mathbb{R}^d$ and twice-differentiable $f : \mathbb{R}^d \to \mathbb{R}$, we want to find $\min_x f(x)$. In this example we focus only on minimizing $f$, but finding maxima and saddle points works the same way. We first define $g = f'$. Minimizing $f$ is the same as finding $x$ such that $g = f' = 0$. From Equation 11.1, the Newton update is the following:
\begin{align*}
d &= -J^{-1}g \\
  &= -(g')^{-1}g \\
  &= -(f'')^{-1}f' \\
  &= -H^{-1}g
\end{align*}
where $H$ is the Hessian. We now show that Newton’s method is a descent method if $H \succ 0$. We set $dx = td$ for $t > 0$, and let $r(dx)$ be the residual. Using a first-order Taylor expansion, we get the following.
\begin{align*}
df &= g^T\,dx + r(dx) \\
   &= g^T t\bigl(-(f'')^{-1}f'\bigr) + r(dx) \\
   &= -t\,f'^T (f'')^{-1} f' + r(dx) \\
   &= -t\,f'^T H^{-1} f' + r(dx)
\end{align*}
If $H \succ 0$, then $H^{-1} \succ 0$, which makes the first term always negative, thus making Newton’s method a descent method.
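A minimal NumPy sketch (not from the notes) of the pure Newton step $d = -H^{-1}g$ on a simple smooth convex function; the test function, data, and starting point are arbitrary choices:

```python
# Pure Newton minimization (no line search) of
#   f(x) = sum_i exp(a_i^T x) + 0.5 ||x||^2,
# whose gradient and Hessian are available in closed form.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))              # rows are the a_i

def f(x):
    return np.sum(np.exp(A @ x)) + 0.5 * x @ x

def grad(x):
    return A.T @ np.exp(A @ x) + x

def hess(x):
    return (A * np.exp(A @ x)[:, None]).T @ A + np.eye(2)

x = np.zeros(2)
for k in range(8):
    g, H = grad(x), hess(x)
    d = -np.linalg.solve(H, g)               # Newton direction d = -H^{-1} g
    x = x + d                                # full Newton step
    print(k, f(x), np.linalg.norm(g))        # gradient norm shrinks very fast
```

The damped variant described below adds a line search so the same step remains safe far from the minimum.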
Newton’s method is a special case of steepest descent when the norm used is the Hessian norm. To find the
step for steepest descent, we minimize the following.
\[
\min_d\; g^T d + \frac{1}{2}\|d\|_H^2, \qquad \|d\|_H = \sqrt{d^T H d} \tag{11.2}
\]
The solution to this minimization is $d = -H^{-1}g$, obtained by setting the gradient $g + Hd$ of (11.2) to zero. Steepest descent with a norm constraint and steepest descent with a norm penalty in the objective are equivalent; the equivalence will be covered when duality is covered in class. Figure 11.2 shows the difference between the step directions of gradient descent and Newton’s method.
Figure 11.2: Direction of step for gradient descent and Newton’s method.
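Returning to (11.2), a quick numerical check (not from the notes) that its minimizer coincides with the Newton direction; $H$ and $g$ are arbitrary, and scipy.optimize.minimize is used only as a generic unconstrained solver:

```python
# Verify that argmin_d g^T d + 0.5 d^T H d equals the Newton direction -H^{-1} g.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
H = M @ M.T + 3 * np.eye(3)          # an arbitrary positive-definite "Hessian"
g = rng.standard_normal(3)

obj = lambda d: g @ d + 0.5 * d @ H @ d
d_numeric = minimize(obj, np.zeros(3), jac=lambda d: g + H @ d).x
d_newton = -np.linalg.solve(H, g)
print(np.max(np.abs(d_numeric - d_newton)))   # should be essentially zero
```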
Damped Newton combines Newton’s method with a backtracking line search to make sure that the objective value does not increase.
Initialize $x_1$
for $k = 1, 2, \ldots$
    $g_k = f'(x_k)$ ; gradient
    $H_k = f''(x_k)$ ; Hessian
    $d_k = -H_k \backslash g_k$ ; Newton direction
    $t_k = 1$ ; backtracking line search
    while $f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 3$ ; divide by 3 to make sure the constant is $< 1/2$ for future proofs
        $t_k = \beta t_k$ ; with $\beta < 1$
    $x_{k+1} = x_k + t_k d_k$ ; take the step
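A NumPy sketch of the damped Newton loop above (an illustration, not the course’s reference implementation); the quadratic test problem at the end is an arbitrary choice:

```python
# Damped Newton with backtracking line search, following the pseudocode above.
# f, grad, and hess are supplied by the caller.
import numpy as np

def damped_newton(f, grad, hess, x, beta=0.5, iters=50, tol=1e-10):
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess(x), g)               # Newton direction
        t = 1.0
        # Backtrack until the objective decreases enough (factor 1/3 < 1/2).
        while f(x + t * d) > f(x) + t * (g @ d) / 3.0:
            t = beta * t
        x = x + t * d
    return x

# Tiny usage example: a positive-definite quadratic with minimum at [1, -2].
Q = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -2.0])
f    = lambda x: 0.5 * (x - b) @ Q @ (x - b)
grad = lambda x: Q @ (x - b)
hess = lambda x: Q
print(damped_newton(f, grad, hess, np.zeros(2)))       # ~ [1, -2]
```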
Damped Newton is affine invariant: suppose $g(x) = f(Ax + b)$, and let $x_1, x_2, \ldots$ be the Newton iterates for $g$ and $y_1, y_2, \ldots$ the Newton iterates for $f$. If $y_1 = Ax_1 + b$, then $y_i = Ax_i + b$ for all $i$.
For damped Newton, if $f$ is bounded below, then $f(x_k)$ converges. If $f$ is strictly convex with bounded level sets, then $x_k$ converges. Finally, damped Newton typically converges at a quadratic rate in a neighborhood of $x^*$.