Lecture 7 (with notes)
Optimization
Nan-Jung Hsu
Optimization
Given an objective function g(x) (e.g., least squares, negative log-likelihood, any loss function), find

\min_x g(x).
Examples of Optimization in Statistical Inference
Least Squares (LS) Problem
Ordinary LS (OLS):
- loss function:

  Q(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 = \|Y - X\beta\|^2 = (Y - X\beta)'(Y - X\beta),

- gradient:

  Q'(\beta) = -2X'(Y - X\beta).

Generalized LS (GLS): minimize the Mahalanobis distance,

\hat\beta = \arg\min_\beta (Y - X\beta)'\Sigma^{-1}(Y - X\beta) = (X'\Sigma^{-1}X)^{-1} X'\Sigma^{-1} Y.
Handwritten note: assume \epsilon_i \sim N(0, \sigma^2) and x_i = (1, x_{i1}, x_{i2}, \ldots, x_{ip})'. Goal: find \beta = (\beta_0, \beta_1, \ldots, \beta_p)' such that \sum_i (y_i - x_i'\beta)^2 is minimized. Let Y = (y_1, \ldots, y_n)' and X = (x_{ij})_{n \times (p+1)}; then L(\beta) = \|Y - X\beta\|^2, or equivalently (Y - X\beta)'(Y - X\beta).
Nonlinear LS Problem (cont.)
Loss function:

Q_1(\beta) = \sum_{i=1}^n (y_i - m_\beta(x_i))^2 = \|Y - m_\beta\|^2,

Q_2(\beta, \theta) = (Y - m_\beta)'\Sigma_\theta^{-1}(Y - m_\beta).

Examples:
- Fit a function: y_t = e^{\beta_0 + \beta_1 t} + \epsilon_t
- Estimate parameters for Y_i \sim \mathrm{Poi}(m(x_i)), where m(x) = \exp(\beta_0 + \beta_1 x).
Newton's Method

Linearize f around a current guess a, f(x) \approx f(a) + f'(a)(x - a), and set the approximation to zero, resulting in a closed-form solution: \tilde{x} = a - f(a)/f'(a).

Algorithm:
- set an initial x^{(0)}
- iteratively approximate the solution by x^{(t+1)} = x^{(t)} - f(x^{(t)})/f'(x^{(t)}), t = 0, 1, ...
- check convergence: \|x^{(t)} - x^{(t-1)}\| \to 0 as t increases
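A minimal R sketch of this iteration (the function f, its derivative fprime, the starting value, and the tolerance below are illustrative placeholders, not part of the lecture):

```r
# Newton's method for solving f(x) = 0 in one dimension
newton_root <- function(f, fprime, x0, tol = 1e-8, max_iter = 100) {
  x <- x0
  for (t in seq_len(max_iter)) {
    x_new <- x - f(x) / fprime(x)      # x^(t+1) = x^(t) - f(x^(t)) / f'(x^(t))
    if (abs(x_new - x) < tol) return(x_new)
    x <- x_new
  }
  warning("Newton iteration did not converge")
  x
}

# Example: the positive root of f(x) = x^2 - 2, i.e., sqrt(2)
newton_root(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)
```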
Newton’s Method (cont.)
Near the solution x^*, the convergence is quadratic:

\|x^{(t)} - x^*\| \le K \|x^{(t-1)} - x^*\|^2, for some constant K.
Minimization Problem
Goal: minimize g(x): \mathbb{R}^p \to \mathbb{R}

Equivalent to finding a root of g'(x) [solve p equations simultaneously].

Taylor approximation of g'(x) around x_0:

g'(x) \approx g'(x_0) + \left[ \frac{\partial^2 g(x)}{\partial x \partial x'} \Big|_{x = x_0} \right] (x - x_0),   (*)

where g'(x) and g'(x_0) are p x 1 gradients and the p x p matrix H is the Hessian. Setting (*) to zero gives the Newton update

x = x_0 - [H(x_0)]^{-1} g'(x_0).   (**)
Maximum Likelihood Problem
(Y_1, Y_2, \ldots, Y_n) are random variables with joint density f(y_1, y_2, \ldots, y_n; \theta).

Goal: estimate the parameter \theta.

Likelihood function and log-likelihood function: if Y_i \stackrel{iid}{\sim} f(y; \theta), then

L(\theta) = \prod_{i=1}^n f(Y_i; \theta); \qquad \ell(\theta) = \sum_{i=1}^n \log f(Y_i; \theta) = \sum_{i=1}^n \ell_i(\theta).

If \ell(\cdot) is differentiable, then \left[ \frac{\partial}{\partial\theta}\ell(\theta) \right]_{\theta = \hat\theta} = 0 [root-finding problem].
Solve MLE by Newton Method
The MLE \hat\theta can be solved numerically by applying (**) to g(\theta) \equiv \ell(\theta), which implies

\theta^{(t+1)} = \theta^{(t)} - \left[ \frac{\partial^2}{\partial\theta\,\partial\theta'} \ell(\theta^{(t)}) \right]^{-1} \frac{\partial}{\partial\theta} \ell(\theta^{(t)})
            = \theta^{(t)} - \left[ H(\theta^{(t)}) \right]^{-1} s(\theta^{(t)}), \qquad t = 0, 1, 2, \ldots

where
- the gradient s(\theta) \equiv \frac{\partial}{\partial\theta}\ell(\theta) is also called the score function,
- the Hessian matrix H(\theta) \equiv \frac{\partial^2}{\partial\theta\,\partial\theta'}\ell(\theta) is also called the negative empirical information matrix.
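As a concrete illustration, here is a minimal R sketch of this Newton iteration for a simple model chosen for this example (iid Y_i ~ Poisson(e^\theta), so the score is \sum y_i - n e^\theta and the Hessian is -n e^\theta; the simulated data and starting value are illustrative):

```r
# Newton-Raphson for the MLE of theta in Y_i ~ Poisson(exp(theta)), iid
# score: s(theta) = sum(y) - n * exp(theta); Hessian: H(theta) = -n * exp(theta)
newton_mle <- function(y, theta0 = 0, tol = 1e-10, max_iter = 50) {
  n <- length(y)
  theta <- theta0
  for (t in seq_len(max_iter)) {
    s <- sum(y) - n * exp(theta)       # score function
    H <- -n * exp(theta)               # Hessian of the log-likelihood
    theta_new <- theta - s / H         # theta^(t+1) = theta^(t) - H^{-1} s
    if (abs(theta_new - theta) < tol) break
    theta <- theta_new
  }
  theta
}

set.seed(1)
y <- rpois(200, lambda = 3)
c(newton = newton_mle(y), closed_form = log(mean(y)))   # the MLE is log(ybar)
```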
More Convex Optimization Methods
Gradient descent
Stochastic gradient descent
Coordinate descent method:
to optimize a multi-variable convex (concave) function by optimizing along
one coordinate at a time
[simplify a high-dim optimization problem into a low-dim problem]
Lagrange multiplier method: to optimize a function subject to equality
constraints.
Quadratic programming: to optimize a function subject to linear inequality
constraints.
Alternating direction method of multipliers (ADMM): handles various types of constraints.
Gradient Descent
Review: the Newton method updates

x^{(t+1)} = x^{(t)} - \left[ \frac{\partial^2}{\partial x \partial x'} g(x^{(t)}) \right]^{-1} g'(x^{(t)})   [ (Hessian)^{-1} x (gradient) ].

Gradient descent replaces the inverse Hessian by a scalar step size \alpha_t:

x^{(t+1)} = x^{(t)} - \alpha_t\, g'(x^{(t)}).

If the step size is too large, GD can overshoot the minimum and may fail to converge. If the step size is too small, GD can be slow. (See the R sketch below.)
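A minimal R sketch of gradient descent with a fixed step size (the quadratic test objective, step size, and tolerance are illustrative choices, not from the lecture):

```r
# Plain gradient descent with a fixed step size alpha
grad_descent <- function(grad, x0, alpha = 0.1, max_iter = 1000, tol = 1e-8) {
  x <- x0
  for (t in seq_len(max_iter)) {
    x_new <- x - alpha * grad(x)             # x^(t+1) = x^(t) - alpha * g'(x^(t))
    if (sqrt(sum((x_new - x)^2)) < tol) return(x_new)
    x <- x_new
  }
  x
}

# Example: minimize g(x) = (x1 - 1)^2 + 2 * (x2 + 3)^2
grad_descent(function(x) c(2 * (x[1] - 1), 4 * (x[2] + 3)), x0 = c(0, 0))   # converges to (1, -3)
```

Making alpha much larger (say 0.6) makes the second coordinate diverge, while a very small alpha needs many more iterations, which is the trade-off described above.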
Example 1: Penalized Regression
\min_\beta \; \frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|^2, \quad \text{for some } \lambda \ge 0.

Setting the gradient to zero gives the normal equations

X'y = (X'X + 2\lambda I)\,\beta_\lambda \;\Longrightarrow\; \beta_\lambda = \left[ X'X + 2\lambda \cdot I \right]^{-1} X'y.

R Lab7 demo
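The Lab7 demo itself is not reproduced here; the following is a minimal R sketch of the closed-form solution above (the simulated design, coefficients, and lambda value are illustrative):

```r
# Penalized (ridge) regression for (1/2)*||y - X beta||^2 + lambda*||beta||^2:
# beta_lambda = (X'X + 2*lambda*I)^{-1} X'y
ridge <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + 2 * lambda * diag(p), crossprod(X, y))
}

set.seed(2)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0, 0, 1) + rnorm(n)
cbind(ols = ridge(X, y, 0), ridge = ridge(X, y, lambda = 10))   # shrinkage toward zero
```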
Stochastic Gradient Descent (SGD)
Goal: \min_x \frac{1}{n}\sum_{i=1}^n g_i(x), where g_i is the individual loss on the i-th data point.
SGD (cont.)
Goal: \min_x \frac{1}{n}\sum_{i=1}^n g_i(x), where g_i is the individual loss on the i-th data point.

SGD update: at each step, use the gradient of a single individual loss,

x^{(i)} \leftarrow x^{(i-1)} - \alpha_i\, g_i'(x^{(i-1)}), \quad i = 1, 2, \ldots

SGD+Momentum (SGDM):
- set initials: t = 0, x^{(0)} and v_0 (momentum)
- at the t-th iteration,
  1. compute the gradient g'(x^{(t-1)})
  2. update the momentum: v_t = \gamma v_{t-1} + \alpha_t\, g'(x^{(t-1)}) (standard form, with momentum weight \gamma)
  3. update: x^{(t)} = x^{(t-1)} - v_t

(A plain-SGD sketch in R follows.)
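A minimal R sketch of plain SGD (without momentum) for a least-squares loss; the sampling scheme, step size, number of epochs, and simulated data are illustrative choices:

```r
# Stochastic gradient descent for min (1/n) * sum_i g_i(x);
# grad_i(x, i) returns the gradient of the i-th individual loss at x
sgd <- function(grad_i, x0, n, alpha = 0.01, n_epochs = 50) {
  x <- x0
  for (epoch in seq_len(n_epochs)) {
    for (i in sample.int(n)) {            # one random pass over the data
      x <- x - alpha * grad_i(x, i)       # x <- x - alpha_i * g_i'(x)
    }
  }
  x
}

# Example: g_i(beta) = (1/2)*(y_i - x_i' beta)^2, so g_i'(beta) = -x_i (y_i - x_i' beta)
set.seed(3)
n <- 500; X <- cbind(1, rnorm(n)); y <- X %*% c(1, 2) + rnorm(n)
grad_i <- function(beta, i) -X[i, ] * drop(y[i] - X[i, ] %*% beta)
sgd(grad_i, x0 = c(0, 0), n = n)          # roughly (1, 2), up to SGD noise
```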
AdaGrad:
- define the gradient with respect to x_j: g'_{t,j} \equiv \frac{\partial}{\partial x_j} g(x^{(t)})
- update

  x_j^{(t+1)} \leftarrow x_j^{(t)} - \frac{\eta}{\sqrt{G_{t,j} + \epsilon}} \cdot g'_{t,j}   (adaptive \alpha_t),

  where G_{t,j} = \sum_{i \le t} (g'_{i,j})^2 is the cumulative squared gradient of the j-th element.
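A minimal R sketch of AdaGrad on the same toy quadratic used above, with a full (deterministic) gradient; eta, the iteration count, and the objective are illustrative:

```r
# AdaGrad: per-coordinate step size eta / sqrt(G + eps),
# where G accumulates the squared gradients coordinate-wise
adagrad <- function(grad, x0, eta = 0.5, eps = 1e-8, max_iter = 2000) {
  x <- x0
  G <- rep(0, length(x0))
  for (t in seq_len(max_iter)) {
    g <- grad(x)
    G <- G + g^2                         # cumulative squared gradient G_{t,j}
    x <- x - eta / sqrt(G + eps) * g     # adaptive per-coordinate update
  }
  x
}

# Toy objective g(x) = (x1 - 1)^2 + 2 * (x2 + 3)^2
adagrad(function(x) c(2 * (x[1] - 1), 4 * (x[2] + 3)), x0 = c(0, 0))   # approaches (1, -3)
```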
Extensions of SGD (cont.)
Adadelta: at each iteration,
- replace the cumulative squared gradient by an exponentially weighted moving average (EWMA scheme):

  E[(g'_j)^2]_t = \gamma E[(g'_j)^2]_{t-1} + (1 - \gamma)(g'_{t,j})^2   (a popular \gamma: 0.9)

- define (intermediate step):

  \Delta x_j^{(t)} \equiv -\frac{\eta}{\sqrt{E[(g'_j)^2]_t + \epsilon}} \cdot g'_{t,j}

- define:

  E[\Delta x_j^2]_t = \gamma E[\Delta x_j^2]_{t-1} + (1 - \gamma)(\Delta x_j^{(t)})^2

- refine \Delta x_j^{(t)} as

  \Delta x_j^{(t)} = -\frac{\sqrt{E[\Delta x_j^2]_{t-1} + \epsilon}}{\sqrt{E[(g'_j)^2]_t + \epsilon}} \, g'_{t,j}

- update x_j^{(t+1)} \leftarrow x_j^{(t)} + \Delta x_j^{(t)}

(Note: the intermediate step is not actually run, and \eta is eliminated from the algorithm.)
Extensions of SGD (cont.)
RMSProp: a special case of Adadelta,
- compute the EWMA: E[(g'_j)^2]_t = 0.9\, E[(g'_j)^2]_{t-1} + 0.1\, (g'_{t,j})^2
- update

  x_j^{(t+1)} \leftarrow x_j^{(t)} - \frac{\eta}{\sqrt{E[(g'_j)^2]_t + \epsilon}} \cdot g'_{t,j}   (a popular \eta: 0.001)

Adam:
- EWMA momentum: m_t = \beta_1 m_{t-1} + (1 - \beta_1) g'_t
- EWMA squared gradient: v_t = \beta_2 v_{t-1} + (1 - \beta_2)(g'_t)^2 (defined elementwise)
- bias correction:

  \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

- update

  x^{(t+1)} \leftarrow x^{(t)} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t.

- popular choice of tuning parameters: \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}
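A minimal R sketch of the Adam update on the same toy quadratic (the step size eta = 0.1 and iteration count are illustrative; the popular defaults for beta1, beta2, and eps above are used):

```r
# Adam: EWMAs of the gradient (m) and squared gradient (v), with bias correction
adam <- function(grad, x0, eta = 0.1, beta1 = 0.9, beta2 = 0.999,
                 eps = 1e-8, max_iter = 2000) {
  x <- x0
  m <- v <- rep(0, length(x0))
  for (t in seq_len(max_iter)) {
    g <- grad(x)
    m <- beta1 * m + (1 - beta1) * g         # EWMA momentum
    v <- beta2 * v + (1 - beta2) * g^2       # EWMA squared gradient (elementwise)
    m_hat <- m / (1 - beta1^t)               # bias correction
    v_hat <- v / (1 - beta2^t)
    x <- x - eta * m_hat / (sqrt(v_hat) + eps)
  }
  x
}

# Near (1, -3); with a constant eta the iterates settle in a small neighborhood of the minimizer
adam(function(x) c(2 * (x[1] - 1), 4 * (x[2] + 3)), x0 = c(0, 0))
```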
Coordinate Descent Algorithm
Goal: \min_x g(x), where g(\cdot) is smooth (component-wise convex) and x = (x_1, x_2, \ldots, x_k)'.

- Initialize x^{(0)}.
- For iteration t = 1, 2, \ldots, update coordinate j \in \{1, 2, \ldots, k\} via

  x_j^{(t+1)} = \arg\min_{x_j} g(x_j, x_{-j}^{(t)}),

  where x_{-j}^{(t)} represents x^{(t)} excluding the j-th coordinate.

Possible forms of x_{-j}^{(t)}:
- x_{-j}^{(t)} = (x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_k^{(t)})' [converges faster]
- x_{-j}^{(t)} = (x_1^{(t)}, x_2^{(t)}, \ldots, x_{j-1}^{(t)}, x_{j+1}^{(t)}, \ldots, x_k^{(t)})' [can be implemented in parallel]

Equivalently, x_j^{(t+1)} = x_j^{(t)} + \delta^*, where \delta^* = \arg\min_\delta g(x^{(t)} + \delta e_j).

For the LS problem (assuming \Sigma(\theta) = I), \beta can be solved via the CD algorithm (see the R sketch below):

\beta_j^{(t)} = \frac{X_j'(Y - X_{-j}\beta_{-j}^{(t-1)})}{X_j' X_j} = \beta_j^{(t-1)} + \frac{X_j' e^{(t-1)}}{X_j' X_j}, \qquad j = 1, 2, \ldots,

where X_j is the j-th column of X and e^{(t-1)} = Y - X\beta^{(t-1)} is the current residual.
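A minimal R sketch of this coordinate-wise update for OLS, keeping a running residual (the simulated data and number of sweeps are illustrative):

```r
# Coordinate descent for OLS: cycle through coordinates, each time
# regressing the current residual on the j-th column of X
cd_ols <- function(X, y, n_sweeps = 50) {
  p <- ncol(X)
  beta <- rep(0, p)
  r <- y - X %*% beta                          # current residual e
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_len(p)) {
      delta <- sum(X[, j] * r) / sum(X[, j]^2) # X_j' e / (X_j' X_j)
      beta[j] <- beta[j] + delta
      r <- r - X[, j] * delta                  # update the residual
    }
  }
  beta
}

set.seed(4)
X <- matrix(rnorm(200 * 3), 200, 3)
y <- X %*% c(1, -2, 0.5) + rnorm(200)
cbind(cd = cd_ols(X, y), lm = coef(lm(y ~ X - 1)))   # the two columns agree
```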
Example 3: Matrix-variate Time Series
Y_t = A\, Y_{t-1}\, B + \epsilon_t, \qquad t = 1, 2, \ldots,

where Y_t is n x m, A is n x n, and B is m x m.

(\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_t \|Y_t - A Y_{t-1} B\|_F^2.

Block coordinate descent alternates between the two LS problems:
- \hat{A}^{(t)} = \arg\min_A \sum_t \|Y_t - A Y_{t-1} \hat{B}^{(t-1)}\|_F^2
- \hat{B}^{(t)} = \arg\min_B \sum_t \|Y_t - \hat{A}^{(t)} Y_{t-1} B\|_F^2
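A minimal R sketch of these alternating least-squares updates, assuming the series is stored as a list Y of n x m matrices; given B, the A-update regresses Y_t on Z_t = Y_{t-1} B, and given A, the B-update regresses Y_t on W_t = A Y_{t-1}. The function name, identity initializations, and sweep count are illustrative:

```r
# Alternating LS for Y_t = A Y_{t-1} B + e_t, with Y a list of n x m matrices
als_mar <- function(Y, n_iter = 20) {
  n <- nrow(Y[[1]]); m <- ncol(Y[[1]]); TT <- length(Y)
  A <- diag(n); B <- diag(m)
  for (it in seq_len(n_iter)) {
    # A-update (B fixed): regress Y_t on Z_t = Y_{t-1} B
    S1 <- matrix(0, n, n); S2 <- matrix(0, n, n)
    for (t in 2:TT) {
      Z <- Y[[t - 1]] %*% B
      S1 <- S1 + Y[[t]] %*% t(Z)
      S2 <- S2 + Z %*% t(Z)
    }
    A <- S1 %*% solve(S2)
    # B-update (A fixed): regress Y_t on W_t = A Y_{t-1}
    S3 <- matrix(0, m, m); S4 <- matrix(0, m, m)
    for (t in 2:TT) {
      W <- A %*% Y[[t - 1]]
      S3 <- S3 + t(W) %*% W
      S4 <- S4 + t(W) %*% Y[[t]]
    }
    B <- solve(S3) %*% S4
  }
  list(A = A, B = B)
}
```

Note that (A, B) are only identified up to a scale factor, since (cA, B/c) gives the same fit for any c != 0.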
Example 4: Coordinate Descent for Lasso
Under an orthogonal design matrix X (i.e., X'X = I), the OLS estimator satisfies \hat\beta^{ols} = (X'X)^{-1}X'y = X'y.

Solving the lasso, i.e., minimizing

\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 = \frac{1}{2} y'y - y'X\beta + \frac{1}{2}\beta'\beta + \lambda \sum_j |\beta_j|,

is equivalent to minimizing

\sum_j \left\{ -\hat\beta_j^{ols}\beta_j + \frac{1}{2}\beta_j^2 + \lambda|\beta_j| \right\} \quad \text{or} \quad \sum_j \left\{ \frac{1}{2}(\beta_j - \hat\beta_j^{ols})^2 + \lambda|\beta_j| \right\},

which separates into one univariate problem per coordinate, each solved by the soft-thresholding rule below.
Soft-thresholding Rule
\min_\beta \underbrace{\frac{1}{2}(y - \beta)^2 + \lambda|\beta|}_{Q(\beta)} \qquad (y \text{ and } \lambda \text{ are given})

\Rightarrow \frac{\partial Q}{\partial \beta} = (\beta - y) + \lambda \cdot \mathrm{sign}(\beta) =
\begin{cases}
\beta - y + \lambda, & \text{if } \beta > 0,\\
\beta - y, & \text{if } \beta = 0,\\
\beta - y - \lambda, & \text{if } \beta < 0.
\end{cases}

Solve \frac{\partial Q}{\partial \beta} = 0 \;\Rightarrow\; \hat\beta = \mathrm{sign}(y)(|y| - \lambda)_+ \equiv S_\lambda(y).
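A minimal R sketch of the soft-thresholding operator and the resulting lasso solution under an orthonormal design (the simulated design and lambda value are illustrative):

```r
# Soft-thresholding operator S_lambda(y) = sign(y) * (|y| - lambda)_+
soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)

# Lasso under an orthonormal design (X'X = I): threshold the OLS coefficients X'y
lasso_orthogonal <- function(X, y, lambda) soft(drop(crossprod(X, y)), lambda)

# Example with orthonormal columns
set.seed(6)
Q <- qr.Q(qr(matrix(rnorm(50 * 4), 50, 4)))        # Q'Q = I
y <- Q %*% c(3, -2, 0.2, 0) + rnorm(50, sd = 0.5)
cbind(ols = drop(crossprod(Q, y)), lasso = lasso_orthogonal(Q, y, lambda = 0.5))
```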
Optimization with Linear Constraints
Primal problem:

\min_x f(x) \quad \text{subject to } Ax = b.

Lagrange multiplier (LM) method: form the Lagrangian

L(x, \lambda) \equiv f(x) + \lambda'(Ax - b),

and the Lagrange dual function

g(\lambda) \equiv \inf_x L(x, \lambda).

Then

\lambda^* = \arg\max_\lambda g(\lambda), \qquad x^* = \arg\min_x L(x, \lambda^*).
LM Method (cont.)
\lambda^* = \arg\max_\lambda g(\lambda) \qquad \text{(Lagrange dual problem)}
Illustration: A Toy Example
\min_{x \in \mathbb{R}} x^2, \quad \text{subject to } x + 1 = 0.
Illustration: A Toy Example (cont.)
\min_{x \in \mathbb{R}} x^2, \quad \text{subject to } x + 1 = 0.

[Figure: plot of x^2 over x \in (-3, 3).]
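A worked version of this toy example, filling in the steps of the LM method above (the algebra below is a sketch added for these notes):

\begin{align*}
L(x,\lambda) &= x^2 + \lambda(x+1),\\
\frac{\partial L}{\partial x} &= 2x + \lambda = 0
  \;\Rightarrow\; x(\lambda) = -\tfrac{\lambda}{2},\\
g(\lambda) &= \inf_x L(x,\lambda)
  = \tfrac{\lambda^2}{4} - \tfrac{\lambda^2}{2} + \lambda
  = -\tfrac{\lambda^2}{4} + \lambda,\\
g'(\lambda) &= -\tfrac{\lambda}{2} + 1 = 0
  \;\Rightarrow\; \lambda^* = 2, \quad x^* = -\tfrac{\lambda^*}{2} = -1,
\end{align*}

so x^* = -1 satisfies the constraint, and g(\lambda^*) = 1 equals the primal optimal value (x^*)^2.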
Same Problem via Gradient Ascent (Descent)
Original problem (Lagrange dual):

\lambda^* = \arg\max_\lambda g(\lambda), \qquad x^* = \arg\min_x L(x, \lambda^*).

Gradient ascent: see the sketch below.
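The slide leaves the iteration implicit; a standard dual-ascent sketch uses the fact that, at x(\lambda) = \arg\min_x L(x, \lambda), the dual gradient is \nabla g(\lambda) = A\,x(\lambda) - b, giving

x^{(t+1)} = \arg\min_x L(x, \lambda^{(t)}), \qquad
\lambda^{(t+1)} = \lambda^{(t)} + \alpha_t \left( A x^{(t+1)} - b \right).

A minimal R demo on the toy example (the step size and iteration count are illustrative):

```r
# Dual (gradient) ascent for: min x^2 subject to x + 1 = 0
# L(x, lambda) = x^2 + lambda * (x + 1); argmin_x L = -lambda / 2,
# and the dual gradient is the constraint residual x + 1
lambda <- 0; alpha <- 0.5
for (t in 1:100) {
  x <- -lambda / 2                      # x^(t+1) = arg min_x L(x, lambda^(t))
  lambda <- lambda + alpha * (x + 1)    # lambda^(t+1) = lambda^(t) + alpha * (x + 1)
}
c(x = x, lambda = lambda)               # converges to x* = -1, lambda* = 2
```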
Augmented Lagrangian

Goal:

\min_{x, \lambda} L_\rho(x, \lambda),

where

L_\rho(x, \lambda) \equiv f(x) + \lambda'(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2.

Equivalent problem:

\min_x \left\{ f(x) + \frac{\rho}{2}\|Ax - b\|_2^2 \right\} \quad \text{subject to } Ax = b.
Augmented Lagrangian (cont.)

Gradient ascent (method of multipliers): see the updates sketched below.
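The updates are not spelled out on the slide; the standard method-of-multipliers iteration (a sketch; \rho itself is commonly used as the dual step size) is

x^{(t+1)} = \arg\min_x L_\rho(x, \lambda^{(t)}), \qquad
\lambda^{(t+1)} = \lambda^{(t)} + \rho \left( A x^{(t+1)} - b \right), \qquad t = 0, 1, 2, \ldots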
Alternating Direction Method of Multipliers (ADMM)

Goal:

\min_{x, z} \{ f(x) + g(z) \} \quad \text{subject to } Ax + Bz = c.

Augmented Lagrangian:

L_\rho(x, z, \lambda) \equiv f(x) + g(z) + \lambda'(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2.

ADMM updates (t = 0, 1, 2, \ldots):

x^{(t+1)} = \arg\min_x L_\rho(x, z^{(t)}, \lambda^{(t)}),
z^{(t+1)} = \arg\min_z L_\rho(x^{(t+1)}, z, \lambda^{(t)}),
\lambda^{(t+1)} = \lambda^{(t)} + \rho \left( A x^{(t+1)} + B z^{(t+1)} - c \right).
Example 5: Fused Lasso
\min_\beta \; \frac{1}{2}\sum_{i=1}^n (y_i - \beta_i)^2 + \tau \sum_{i=2}^n |\beta_i - \beta_{i-1}|.

Re-write:

\min_{\beta, \theta} \; \frac{1}{2}\sum_{i=1}^n (y_i - \beta_i)^2 + \tau \sum_{i=2}^n |\theta_i|, \quad \text{subject to } \theta_i = \beta_i - \beta_{i-1}, \; i = 2, \ldots, n.

Augmented Lagrangian:

L_\rho(\beta, \theta, \lambda) = \frac{1}{2}\|y - \beta\|^2 + \tau\|\theta\|_1 + \lambda'(D\beta - \theta) + \frac{\rho}{2}\|D\beta - \theta\|^2,

where

\theta = (\theta_2, \theta_3, \ldots, \theta_n)', \qquad
D = \begin{bmatrix}
-1 & 1 & & \cdots & 0\\
0 & -1 & 1 & & 0\\
 & & \ddots & \ddots & \\
0 & \cdots & & -1 & 1
\end{bmatrix}_{(n-1)\times n}.
Example 5 (cont.)
L_\rho(\beta, \theta, \lambda) = \frac{1}{2}\|y - \beta\|^2 + \tau\|\theta\|_1 + \lambda'(D\beta - \theta) + \frac{\rho}{2}\|D\beta - \theta\|^2.

For \beta:

\min_\beta \left\{ \frac{1}{2}\|y - \beta\|^2 + \lambda'(D\beta - \theta) + \frac{\rho}{2}\|D\beta - \theta\|^2 \right\}.

Taking the derivative w.r.t. \beta leads to solving

(\beta - y) + D'\lambda + \rho D'(D\beta - \theta) = 0 \;\Rightarrow\; \beta = (I + \rho D'D)^{-1}\left[ y + \rho D'(\theta - \lambda/\rho) \right].

For \theta:

\min_\theta \left\{ \tau\|\theta\|_1 + \lambda'(D\beta - \theta) + \frac{\rho}{2}\|D\beta - \theta\|^2 \right\}.

Re-arranging the quadratic part leads to solving the equivalent (elementwise) form

\min_\theta \; \frac{1}{2}\left( \theta - (D\beta + \lambda/\rho) \right)^2 + \frac{\tau}{\rho}|\theta| \;\Rightarrow\; \theta = S_{\tau/\rho}\!\left( D\beta + \lambda/\rho \right).

For \lambda:

\lambda \leftarrow \lambda + \rho\,(D\beta - \theta).
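A minimal R sketch putting the three updates together (the step parameter rho, iteration count, and simulated signal are illustrative; beta, theta, and lambda follow the notation above):

```r
# ADMM for the fused-lasso signal approximator:
# min_beta (1/2)*||y - beta||^2 + tau * sum_i |beta_i - beta_{i-1}|
fused_lasso_admm <- function(y, tau, rho = 1, n_iter = 200) {
  n <- length(y)
  D <- diff(diag(n))                         # (n-1) x n first-difference matrix
  soft <- function(z, a) sign(z) * pmax(abs(z) - a, 0)
  beta <- y
  theta <- D %*% beta
  lambda <- rep(0, n - 1)
  M <- solve(diag(n) + rho * crossprod(D))   # (I + rho D'D)^{-1}, fixed over iterations
  for (t in seq_len(n_iter)) {
    beta   <- M %*% (y + rho * t(D) %*% (theta - lambda / rho))  # beta-update
    theta  <- soft(D %*% beta + lambda / rho, tau / rho)         # theta-update (soft-threshold)
    lambda <- lambda + rho * drop(D %*% beta - theta)            # dual update
  }
  drop(beta)
}

# Example: noisy piecewise-constant signal; beta_hat is approximately piecewise constant
set.seed(5)
y <- c(rep(0, 30), rep(3, 30), rep(1, 30)) + rnorm(90, sd = 0.5)
beta_hat <- fused_lasso_admm(y, tau = 2)
```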
Other Optimization Methods
References