Lecture 11: October 2
The matrix differential is a remedy for the pain of matrix calculus; it can be understood in either of the following ways:
\[
\mathrm{vol}(AS) = |\det(A)|\,\mathrm{vol}(S)
\]
Interpretation: a small determinant implies the existence of a small eigenvalue, which squashes the volume flat, and vice versa.
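As a quick numerical illustration (not part of the original notes), the identity can be checked by measuring the area of a region before and after applying $A$; the matrix $A$ and the choice of $S$ as the unit disk below are arbitrary:

```python
# Numerical check of vol(A S) = |det(A)| vol(S) in R^2, with S the unit disk.
import numpy as np
from scipy.spatial import ConvexHull

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])          # arbitrary example matrix

# Discretize the boundary of the unit disk and map it through A.
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
S_boundary = np.stack([np.cos(theta), np.sin(theta)], axis=1)
AS_boundary = S_boundary @ A.T

vol_S = ConvexHull(S_boundary).volume    # area of S, approximately pi
vol_AS = ConvexHull(AS_boundary).volume  # area of A S (an ellipse)

print(vol_AS / vol_S)           # ratio of the two volumes
print(abs(np.linalg.det(A)))    # |det(A)|, should match the ratio
```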
Back to infomax ICA. We have $y_i = g(Wx_i)$, where $dy_i = J(x_i; W)\,dx_i = J_i\,dx_i$. We want to maximize the entropy of the distribution of $y$:
\[
\max_W \sum_i \bigl(-\ln P(y_i)\bigr), \qquad P(y_i) = \frac{P(x_i)}{|\det J(x_i; W)|}
\]
And from
\[
\max H(P(y)) = -\int P(y)\ln P(y)\,dy = -E[\ln P(y)],
\]
it is equivalent to maximizing $\sum_i \ln|\det J(x_i; W)|$, since the $\ln P(x_i)$ terms do not depend on $W$.
Define $u_i = g'(Wx_i)$ and $v_i = g''(Wx_i)$.
For the differential of $y_i = g(Wx_i)$ (with $\circ$ denoting the elementwise product):
\begin{align*}
dy_i &= g'(Wx_i) \circ d(Wx_i) \\
     &= u_i \circ (W\,dx_i) \\
     &= \operatorname{diag}(u_i)\,W\,dx_i
\end{align*}
Finally, define $\operatorname{diag}(\alpha_i) = \operatorname{diag}(u_i)^{-1}\operatorname{diag}(v_i)$ and solve for the gradient of $L = \sum_i \ln|\det J(x_i; W)|$:
\begin{align*}
dL &= \sum_i d\bigl(\ln|\det J(x_i; W)|\bigr) \\
   &= \sum_i \operatorname{tr}(J_i^{-1}\,dJ_i) \\
   &= \sum_i \operatorname{tr}\bigl(W^{-1}\,dW + W^{-1}\operatorname{diag}(u_i)^{-1}\operatorname{diag}(v_i)\operatorname{diag}(d(Wx_i))\,W\bigr) \\
   &= \sum_i \Bigl[\operatorname{tr}(W^{-1}\,dW) + \operatorname{tr}\bigl(\operatorname{diag}(\alpha_i)\operatorname{diag}(d(Wx_i))\bigr)\Bigr] \\
   &= n\operatorname{tr}(W^{-1}\,dW) + \sum_i \operatorname{tr}\bigl(\alpha_i^T\,dW\,x_i\bigr) \\
   &= n\operatorname{tr}(W^{-1}\,dW) + \operatorname{tr}\Bigl(\bigl(\textstyle\sum_i x_i\alpha_i^T\bigr)\,dW\Bigr).
\end{align*}
Since $dL = \operatorname{tr}(A\,dW)$ implies $\nabla_W L = A^T$, the gradient is
\[
\nabla_W L = nW^{-T} + \sum_i \alpha_i x_i^T = nW^{-T} + C, \qquad C := \sum_i \alpha_i x_i^T.
\]
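A minimal sketch (not from the notes) that checks this gradient against finite differences, assuming $g$ is the logistic sigmoid, the usual infomax choice, for which $\alpha_i = 1 - 2\,g(Wx_i)$; the dimensions, data, and random seed are arbitrary:

```python
# Finite-difference check of the gradient n W^{-T} + sum_i alpha_i x_i^T,
# assuming g is the logistic sigmoid (so alpha_i = 1 - 2 g(W x_i)).
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                       # dimension and number of samples (arbitrary)
W = rng.standard_normal((d, d))
X = rng.standard_normal((d, n))   # columns stand in for the data points x_i

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def L(W):
    """L(W) = sum_i ln |det J(x_i; W)| with J_i = diag(g'(W x_i)) W."""
    total = 0.0
    for i in range(n):
        y = sigmoid(W @ X[:, i])
        J = np.diag(y * (1.0 - y)) @ W
        total += np.log(abs(np.linalg.det(J)))
    return total

# Closed-form gradient from the derivation above.
Y = sigmoid(W @ X)
alpha = 1.0 - 2.0 * Y                         # alpha_i as columns
grad = n * np.linalg.inv(W).T + alpha @ X.T   # n W^{-T} + sum_i alpha_i x_i^T

# Numerical gradient by central differences.
eps = 1e-6
num = np.zeros_like(W)
for r in range(d):
    for c in range(d):
        E = np.zeros_like(W)
        E[r, c] = eps
        num[r, c] = (L(W + E) - L(W - E)) / (2.0 * eps)

print(np.max(np.abs(grad - num)))   # should be small (finite-difference accuracy)
```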
The natural-gradient step $S$ (with $G$ the gradient just derived) is chosen as
\[
S = \arg\max_S M(S), \qquad M(S) = \operatorname{tr}(G^T S) - \frac{\|SW^{-1}\|_F^2}{2},
\]
which, in the scalar case, is
\[
M = gS - \frac{S^2}{2W^2}.
\]
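As a worked step (not spelled out in the notes), the scalar case can be solved directly and previews the matrix result below:
\[
\frac{dM}{dS} = g - \frac{S}{W^2} = 0 \quad\Rightarrow\quad S = gW^2,
\]
the scalar analogue of $S = GW^TW$ derived next.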
So, to find the max/min/saddle point:
\begin{align*}
M &= \operatorname{tr}(G^T S) - \frac{1}{2}\operatorname{tr}(SW^{-1}W^{-T}S^T) \\
dM &= \operatorname{tr}(G^T\,dS) - \operatorname{tr}(dS\,W^{-1}W^{-T}S^T)
\end{align*}
Setting $dM = 0$ gives $G = SW^{-1}W^{-T}$, and thus $S = GW^TW$. Using the gradient previously derived, $G = nW^{-T} + C$, the natural-gradient step is $[nW^{-T} + C]\,W^TW = nW + CW^TW$.
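A minimal sketch (not from the notes) of the resulting update loop in NumPy, again assuming the logistic sigmoid for $g$; the learning rate, data, and iteration count are arbitrary placeholders:

```python
# One-loop sketch of the natural-gradient infomax ICA update S = G W^T W.
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 200
X = rng.standard_normal((d, n))    # stands in for observed (mixed) data
W = np.eye(d)                      # initial unmixing matrix
lr = 0.01                          # arbitrary step size

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for _ in range(100):
    Y = sigmoid(W @ X)
    alpha = 1.0 - 2.0 * Y
    C = alpha @ X.T                   # C = sum_i alpha_i x_i^T
    G = n * np.linalg.inv(W).T + C    # ordinary gradient n W^{-T} + C
    S = G @ W.T @ W                   # natural-gradient step: n W + C W^T W
    W = W + (lr / n) * S              # ascent step on the entropy objective
```

Each iteration forms the ordinary gradient $nW^{-T} + C$ and then right-multiplies by $W^TW$ to obtain the natural-gradient step, which avoids the explicit inverse in the direction actually taken.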
• Magnus & Neudecker. Matrix Differential Calculus. 2nd ed., Wiley, 1999.
http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X
• Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution.
Neural Computation, v7, 1995.
Newton’s method has two main applications: solving nonlinear equations and finding minima/maxima/saddle points.
For $x \in \mathbb{R}^d$ and differentiable $f : \mathbb{R}^d \to \mathbb{R}^d$, we want to solve $f(x) = 0$. We take the first-order Taylor approximation of $f$ around $x$, set $f(x) + J(x)(y - x) = 0$, and solve for the step
\[
dx = y - x = -J(x)^{-1}f(x). \tag{11.1}
\]
For the scalar example $f(x) = \frac{1}{x} - \varphi$ (whose root is $1/\varphi$), this gives
\[
dx = -J(x)^{-1}f(x) = x^2\Bigl(\frac{1}{x} - \varphi\Bigr) = x - x^2\varphi.
\]
The update rule is x+ = x + dx. Figure 11.1 shows an iteration of Newton’s method when x = 1.
Figure 11.1: Example of one iteration of Newton’s Method in solving non-linear equations.
We now perform an error analysis of Newton’s method. For the value $x$, the error is $\epsilon = x\varphi - 1$. For the value $x^+$, the error $\epsilon^+$ is the following.
\begin{align*}
\epsilon^+ &= x^+\varphi - 1 \\
&= (x + x - x^2\varphi)\varphi - 1 \\
&= \bigl(x + x(1 - x\varphi)\bigr)\varphi - 1 \\
&= (x - x\epsilon)\varphi - 1 \\
&= x\varphi - 1 - \epsilon x\varphi \\
&= \epsilon - \epsilon x\varphi \\
&= \epsilon(1 - x\varphi) \\
&= -\epsilon^2
\end{align*}
This shows that if $|\epsilon| < 1$, then Newton’s method has quadratic convergence. However, if $|\epsilon| > 1$, then Newton’s method will diverge.
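As a quick numerical illustration (not part of the notes), running the iteration $x^+ = 2x - x^2\varphi$ for a few steps shows the error squaring at every step; $\varphi$ and the starting point are arbitrary choices with $|\epsilon| < 1$:

```python
# Division-free Newton iteration for f(x) = 1/x - phi, printing the error
# eps = x*phi - 1 to show that it squares (with a sign flip) each step.
phi = 3.7
x = 0.2                        # initial guess for 1/phi (true value ~0.27027)
for k in range(6):
    eps = x * phi - 1.0
    print(f"k={k}  x={x:.15f}  eps={eps:+.3e}")
    x = 2.0 * x - x * x * phi  # Newton step x+ = x + (x - x^2 * phi)
print("1/phi =", 1.0 / phi)
```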
For $x \in \mathbb{R}^d$ and twice-differentiable $f : \mathbb{R}^d \to \mathbb{R}$, we want to find $\min_x f(x)$. In this example we focus only on minimizing $f$, but finding maxima and saddle points works the same way. We first define $g = f'$. Minimizing $f$ is the same as finding $x$ such that $g = f' = 0$. From Equation 11.1, the Newton update is the following:
\begin{align*}
d &= -J^{-1}g \\
  &= -(g')^{-1}g \\
  &= -(f'')^{-1}f' \\
  &= -H^{-1}g
\end{align*}
where $H$ is the Hessian. We now show that Newton’s method is a descent method if $H \succ 0$. We set $dx = td$ for $t > 0$, and let $r(dx)$ be the residual. Using a first-order Taylor expansion, we get the following.
\begin{align*}
df &= g^T\,dx + r(dx) \\
   &= g^T t\bigl(-(f'')^{-1}f'\bigr) + r(dx) \\
   &= -t\,f'^T (f'')^{-1} f' + r(dx) \\
   &= -t\,f'^T H^{-1} f' + r(dx)
\end{align*}
If $H \succ 0$, then $H^{-1} \succ 0$, which makes the first term always negative, thus making Newton’s method a descent method.
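A minimal NumPy sketch (not from the notes) of the pure Newton step $d = -H^{-1}g$ on a simple smooth convex function; the test function, data, and starting point are arbitrary choices:

```python
# Pure Newton minimization (no line search) of
#   f(x) = sum_i exp(a_i^T x) + 0.5 ||x||^2,
# whose gradient and Hessian are available in closed form.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))              # rows are the a_i

def f(x):
    return np.sum(np.exp(A @ x)) + 0.5 * x @ x

def grad(x):
    return A.T @ np.exp(A @ x) + x

def hess(x):
    return (A * np.exp(A @ x)[:, None]).T @ A + np.eye(2)

x = np.zeros(2)
for k in range(8):
    g, H = grad(x), hess(x)
    d = -np.linalg.solve(H, g)               # Newton direction d = -H^{-1} g
    x = x + d                                # full Newton step
    print(k, f(x), np.linalg.norm(g))        # gradient norm shrinks very fast
```

The damped variant described below adds a line search so the same step remains safe far from the minimum.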
Newton’s method is a special case of steepest descent when the norm used is the Hessian norm. To find the
step for steepest descent, we minimize the following.
\[
\min_d\; g^T d + \frac{1}{2}\|d\|_H^2, \qquad \|d\|_H = \sqrt{d^T H d} \tag{11.2}
\]
The solution to this minimization is $d = -H^{-1}g$, obtained by setting the gradient $g + Hd$ of (11.2) to zero. Steepest descent with a norm constraint and steepest descent with a norm penalty in the objective are equivalent; the equivalence will be covered when duality is covered in class. Figure 11.2 shows the difference between the step directions of gradient descent and Newton’s method.
Figure 11.2: Direction of step for gradient descent and Newton’s method.
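Returning to (11.2), a quick numerical check (not from the notes) that its minimizer coincides with the Newton direction; $H$ and $g$ are arbitrary, and scipy.optimize.minimize is used only as a generic unconstrained solver:

```python
# Verify that argmin_d g^T d + 0.5 d^T H d equals the Newton direction -H^{-1} g.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
H = M @ M.T + 3 * np.eye(3)          # an arbitrary positive-definite "Hessian"
g = rng.standard_normal(3)

obj = lambda d: g @ d + 0.5 * d @ H @ d
d_numeric = minimize(obj, np.zeros(3), jac=lambda d: g + H @ d).x
d_newton = -np.linalg.solve(H, g)
print(np.max(np.abs(d_numeric - d_newton)))   # should be essentially zero
```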
Damped Newton combines Newton’s method with a backtracking line search to make sure that the objective value does not increase.
Initialize $x_1$
for $k = 1, 2, \ldots$
    $g_k = f'(x_k)$ ; gradient
    $H_k = f''(x_k)$ ; Hessian
    $d_k = -H_k \backslash g_k$ ; Newton direction
    $t_k = 1$ ; backtracking line search
    while $f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 3$ ; divide by 3 to make sure the constant is $< 1/2$ for future proofs
        $t_k = \beta t_k$ ; with $\beta < 1$
    $x_{k+1} = x_k + t_k d_k$ ; take the step
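A NumPy sketch of the damped Newton loop above (an illustration, not the course’s reference implementation); the quadratic test problem at the end is an arbitrary choice:

```python
# Damped Newton with backtracking line search, following the pseudocode above.
# f, grad, and hess are supplied by the caller.
import numpy as np

def damped_newton(f, grad, hess, x, beta=0.5, iters=50, tol=1e-10):
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess(x), g)               # Newton direction
        t = 1.0
        # Backtrack until the objective decreases enough (factor 1/3 < 1/2).
        while f(x + t * d) > f(x) + t * (g @ d) / 3.0:
            t = beta * t
        x = x + t * d
    return x

# Tiny usage example: a positive-definite quadratic with minimum at [1, -2].
Q = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -2.0])
f    = lambda x: 0.5 * (x - b) @ Q @ (x - b)
grad = lambda x: Q @ (x - b)
hess = lambda x: Q
print(damped_newton(f, grad, hess, np.zeros(2)))       # ~ [1, -2]
```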
Damped Newton is affine invariant: suppose $g(x) = f(Ax + b)$, and let $x_1, x_2, \ldots$ be the Newton iterates for $g$ and $y_1, y_2, \ldots$ the Newton iterates for $f$. If $y_1 = Ax_1 + b$, then $y_i = Ax_i + b$ for all $i$.
For damped Newton, if $f$ is bounded below, then $f(x_k)$ converges. If $f$ is strictly convex with bounded level sets, then $x_k$ converges. Finally, damped Newton typically converges at a quadratic rate in a neighborhood of $x^*$.