
Lecture 7

Optimization

Nan-Jung Hsu

Optimization

Given an objective function g(x) (e.g., least squares, negative log-likelihood, any
loss function), find

      min_x g(x).

- Solve g'(x) = 0 analytically (impossible most of the time!).
- Find the solution numerically (an approximate solution).
- We emphasize convex optimization, but most of the methods have also been
  applied to general (non-convex) problems.
Examples of Optimization in Statistical Inference

- Least squares problem: linear or nonlinear fitting (minimization)
- Maximum likelihood estimation (maximization)
- Solving for the Bayes estimate (posterior mode) in Bayesian analysis
- Estimation with constraints, e.g., regularized (penalized) estimation
- Planning inference, e.g., allocation problems with cost constraints
Least Squares (LS) Problem

Ordinary LS (OLS):
- loss function:

      Q(β) = Σ_{i=1}^n (y_i − x_i'β)² = ||Y − Xβ||² = (Y − Xβ)'(Y − Xβ),

      Q'(β) = −2X'(Y − Xβ).

- goal: find β̂ to minimize Q(β), i.e., solve Q'(β) = 0
- solution: β̂ = (X'X)^{-1} X'Y

Generalized LS (GLS):

      β̂ = arg min_β (Y − Xβ)'Σ^{-1}(Y − Xβ) = (X'Σ^{-1}X)^{-1} X'Σ^{-1} Y,

where the objective is a squared Mahalanobis distance and the given Σ could be a
very general positive-definite matrix (covariance matrix).
[Handwritten note: Model y_i = β_0 + β_1 x_{i1} + ... + β_p x_{ip} + ε_i, with
ε_i iid ~ N(0, σ²) and x_i = (1, x_{i1}, x_{i2}, ..., x_{ip})'. Goal: find
β = (β_0, β_1, ..., β_p)' such that L(β) = Σ_i (y_i − x_i'β)² is minimized.
Writing Y = (y_1, ..., y_n)' and X = (x_{ij}), we get L(β) = ||Y − Xβ||², and
∂L(β)/∂β_k = −2 Σ_i (y_i − x_i'β) x_{ik} for k = 0, 1, ..., p, or equivalently
∂L(β)/∂β = −2X'(Y − Xβ).]
Nonlinear LS Problem (cont.)

Loss function:

      Q_1(β) = Σ_{i=1}^n (y_i − m_β(x_i))² = ||Y − m_β||²,

      Q_2(β, θ) = (Y − m_β)' Σ_θ^{-1} (Y − m_β).

Examples:
- Fit a function: y_t = e^{β_0 + β_1 t} + ε_t
- Estimate parameters for Y_i ~ Poi(m(x_i)), where m(x) = exp(β_0 + β_1 x)
  (m(x_i) = EY_i = var(Y_i))
- Estimate the variance-related parameter for Σ ≡ [Σ_ij], where

      Σ_ij = exp(−θ (x_i − x_j)²), 1 ≤ i, j ≤ n.

In general, there is no closed-form solution for the nonlinear LS problem!
Newton’s Method (illustrated in the 1-dim case)

Also called the Newton-Raphson algorithm; it is a root-finding algorithm.
Let f(x): R → R be a differentiable function.
Goal: find x* such that f(x*) = 0.

Taylor expansion:

      f(x) ≈ f(a) + f'(a)(x − a), for x ≈ a.

Solve f(x*) = 0 approximately by solving

      0 = f(a) + f'(a)(x̃ − a), for given a,

resulting in a closed-form solution: x̃ = a − f(a)/f'(a).

Algorithm:
- set an initial x(0)
- iteratively approximate the solution by x(t+1) = x(t) − f(x(t))/f'(x(t)), t = 0, 1, ...
- check convergence: ||x(t) − x(t−1)|| → 0 as t increases
Newton’s Method (cont.)

Algorithm (see wiki for a demo): as above.

The algorithm converges quadratically when x(t) is close enough to x*, i.e.,

      ||x(t) − x*|| ≤ K ||x(t−1) − x*||², for some K,

and the error |f(x(t)) − f(x*)| decays quadratically, e.g., ε, ε², ε⁴, ...

Related R functions: optim, optimize, nlm, nlminb, uniroot, ...
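As a quick illustration (not from the slides), a minimal Newton-Raphson root
finder in R might look as follows; the test function f(x) = x³ − 2 and the
starting value are hypothetical, and uniroot is the built-in alternative
mentioned above.

```r
# Minimal Newton-Raphson sketch; f, fprime, and x0 are supplied by the user.
newton <- function(f, fprime, x0, tol = 1e-10, maxit = 100) {
  x <- x0
  for (t in 1:maxit) {
    x_new <- x - f(x) / fprime(x)             # x(t+1) = x(t) - f(x(t)) / f'(x(t))
    if (abs(x_new - x) < tol) return(x_new)   # convergence check
    x <- x_new
  }
  x
}
f <- function(x) x^3 - 2
newton(f, function(x) 3 * x^2, x0 = 1)   # 1.259921... = 2^(1/3)
uniroot(f, c(0, 2))$root                 # built-in alternative
```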
Minimization Problem

Goal: minimize g(x): R^p → R.
Equivalent to finding a root of g'(x) [solve p equations simultaneously].
Taylor approximation for g'(x) around x_0:

      g'(x) ≈ g'(x_0) + {∂²g(x)/∂x∂x' |_{x=x_0}} (x − x_0),   (*)

where g'(x) and g'(x_0) are p×1 gradients and H ≡ ∂²g(x)/∂x∂x' is the p×p
Hessian.

Solve (*) = 0, leading to

      x = x_0 − [H(x_0)]^{-1} g'(x_0).

Newton's method iterates

      x(t+1) = x(t) − [H(x(t))]^{-1} g'(x(t)), t = 0, 1, 2, ...   (**)
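A sketch of the p-dimensional update (**) in R, assuming the user supplies the
gradient and Hessian analytically; the convex test function g(x) = Σ_j (x_j⁴ + x_j²)
is a hypothetical example, not from the slides.

```r
# Newton iteration (**): x <- x - [H(x)]^{-1} g'(x), via a linear solve.
newton_min <- function(grad, hess, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  for (t in 1:maxit) {
    x_new <- x - solve(hess(x), grad(x))   # solve H(x) step = g'(x)
    if (sqrt(sum((x_new - x)^2)) < tol) return(drop(x_new))
    x <- x_new
  }
  drop(x)
}
grad <- function(x) 4 * x^3 + 2 * x               # gradient of sum(x^4 + x^2)
hess <- function(x) diag(12 * x^2 + 2, length(x)) # diagonal Hessian
newton_min(grad, hess, x0 = c(2, -1))             # converges to (0, 0)
```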
Maximum Likelihood Problem

(Y_1, Y_2, ..., Y_n) are random variables with joint density f(y_1, y_2, ..., y_n; θ).
Goal: estimate the parameter θ.
Likelihood function and log-likelihood function:

      L(θ) = f(Y_1, Y_2, ..., Y_n; θ);  ℓ(θ) = log L(θ).

If Y_i iid ~ f(y; θ), then

      L(θ) = Π_{i=1}^n f(Y_i; θ);  ℓ(θ) = Σ_{i=1}^n log f(Y_i; θ) = Σ_{i=1}^n ℓ_i(θ).

Maximum (log-)likelihood estimate:

      θ̂ = arg max_θ L(θ) = arg min_θ {−ℓ(θ)}

If ℓ(·) is differentiable, then [∂ℓ(θ)/∂θ]_{θ=θ̂} = 0  [a root-finding problem].
Solve MLE by Newton's Method

The MLE θ̂ can be numerically solved by applying (**) to g(θ) ≡ −ℓ(θ), which
implies

      θ(t+1) = θ(t) − [∂²(−ℓ(θ(t)))/∂θ²]^{-1} [∂(−ℓ(θ(t)))/∂θ]
             = θ(t) − [∂²ℓ(θ(t))/∂θ²]^{-1} [∂ℓ(θ(t))/∂θ]
             = θ(t) − [H(θ(t))]^{-1} s(θ(t)), t = 0, 1, 2, ...

where
- the gradient s(θ) ≡ ∂ℓ(θ)/∂θ is also called the score function,
- the Hessian matrix H(θ) ≡ ∂²ℓ(θ)/∂θ² is also called the negative empirical
  information matrix.
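As a concrete illustration (not from the slides), the logistic-regression MLE can
be computed with this update, since the score and Hessian have closed forms:
s(β) = X'(y − p) and H(β) = −X'WX with W = diag{p_i(1 − p_i)}. The simulated
data below are hypothetical.

```r
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))   # true beta = (-0.5, 1)
b <- c(0, 0)
for (t in 1:25) {
  p <- drop(plogis(X %*% b))
  score <- crossprod(X, y - p)               # s(b) = X'(y - p)
  H <- -crossprod(X * (p * (1 - p)), X)      # H(b) = -X'WX
  b <- drop(b - solve(H, score))             # Newton step: b - H^{-1} s(b)
}
b   # agrees with glm(y ~ X[, 2], family = binomial)$coefficients
```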
More Convex Optimization Methods

- Gradient descent
- Stochastic gradient descent
- Coordinate descent: optimize a multi-variable convex (concave) function by
  optimizing along one coordinate at a time
  [simplifies a high-dim optimization problem into low-dim problems]
- Lagrange multiplier method: optimize a function subject to equality constraints.
- Quadratic programming: optimize a function subject to linear inequality
  constraints.
- Alternating direction method of multipliers (ADMM): handles various types of
  constraints.
Gradient Descent

Review: Newton's method updates

      x(t+1) = x(t) − [∂²g(x(t))/∂x∂x']^{-1} g'(x(t)),
                         (Hessian)          (gradient)

but the Hessian might not be feasible to compute for a complex g.

Gradient descent finds a local minimum by repeatedly applying the update

      x(t+1) = x(t) − α_t g'(x(t)), where α_t > 0 (step size / learning rate).

Convergence requires certain conditions on g(·) and α_t; the latter controls the
convergence rate. A popular choice: α_t = α_1 t^{−γ}, γ ∈ (0.5, 1] (which
satisfies Σ α_t = ∞, Σ α_t² < ∞).

R toolbox: package gradDescent for regression tasks.
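A bare-bones sketch of this update in R, using the OLS objective
g(β) = (1/2)||y − Xβ||² and the decaying step size α_t = α_1 t^{−γ}; the data
and the values of α_1 and γ are hypothetical choices.

```r
set.seed(1)
X <- cbind(1, rnorm(100))
y <- X %*% c(1, 2) + rnorm(100)
b <- c(0, 0)
a1 <- 0.01; gam <- 0.6                 # alpha_t = a1 * t^(-gam), gam in (0.5, 1]
for (t in 1:5000) {
  g <- -crossprod(X, y - X %*% b)      # g'(b) = -X'(y - Xb)
  b <- b - (a1 * t^(-gam)) * g
}
drop(b)   # close to solve(crossprod(X), crossprod(X, y))
```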
Step Size / Learning Rate

A too-large learning rate may cause an unstable process: GD can overshoot the
minimum and may fail to converge (left panel). A too-small learning rate may
result in very slow updates (right panel).

[Figure: two panels contrasting a large step size (overshooting) with a small
step size (slow progress).]
Example 1: Penalized Regression

      min_β (1/2)||y − Xβ||² + λ||β||₂^γ, for some λ ≥ 0 and a given power γ.

- compute the gradient: g'(β) = −X'(y − Xβ) + λγ(β'β)^{γ/2−1} β
- gradient descent: update β ← β − α_t g'(β) until convergence
- self-consistency: the solution of g'(β) = 0 satisfies

      X'y = [X'X + λγ(β'β)^{γ/2−1} I] β,

      β = [X'X + λγ(β'β)^{γ/2−1} I]^{-1} X'y,

  which suggests another updating scheme:

      β(t+1) ← [X'X + λγ((β(t))'β(t))^{γ/2−1} I]^{-1} X'y.

R Lab7 demo
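A sketch of this fixed-point scheme in R, under the reading that the penalty is
λ||β||₂^γ (so γ = 2 reduces to ridge with weight 2λ); the data, λ = 1, and γ = 3
are hypothetical, and convergence of the iteration is assumed rather than
guaranteed.

```r
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
y <- X %*% c(1, 0, -1) + rnorm(100)
lambda <- 1; gam <- 3
b <- rep(0.1, 3)
for (t in 1:100) {
  w <- lambda * gam * sum(b^2)^(gam / 2 - 1)   # penalty weight at current b(t)
  b <- drop(solve(crossprod(X) + w * diag(3), crossprod(X, y)))
}
b   # fixed point of the self-consistency equation
```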
Stochastic Gradient Descent (SGD)

Goal: min_x (1/n) Σ_{i=1}^n g_i(x), where g_i is the individual loss on the i-th
data point.

Standard gradient descent:

      x(t+1) = x(t) − α_t {(1/n) Σ_{i=1}^n g_i'(x(t))}

Stochastic gradient descent:

      x(i) ← x(i−1) − α_i g_i'(x(i−1)), i = 1, 2, ...

Mini-batch gradient descent:

      x(t+1) = x(t) − α_t {(1/|D_t|) Σ_{i∈D_t} g_i'(x(t))},

where D_t is a subset of the data indices.
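A mini-batch SGD sketch for the OLS loss, with the step size α_t = α_1 t^{−γ}
from the gradient-descent slide; the data, batch size, and tuning values are
hypothetical.

```r
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n)); y <- X %*% c(2, -1) + rnorm(n)
b <- c(0, 0); a1 <- 0.5; gam <- 0.6; batch <- 20; t <- 0
for (epoch in 1:50) {
  for (idx in split(sample(n), ceiling(seq_len(n) / batch))) {  # random batches D_t
    t <- t + 1
    Xb <- X[idx, , drop = FALSE]
    g <- -crossprod(Xb, y[idx] - Xb %*% b) / length(idx)  # mini-batch gradient
    b <- b - (a1 * t^(-gam)) * g
  }
}
drop(b)   # close to lm.fit(X, y)$coefficients
```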
SGD (cont.)

The SGD update is commonly used in two scenarios:
- Data are collected on-line (in batches): update the parameter x whenever new
  data come in; g_i(x) corresponds to the loss function contributed by the i-th
  batch of data (run through each batch once, not repeatedly).
- Big (off-line) data are partitioned into n smaller portions; g_i(x) corresponds
  to the loss function contributed by the i-th data portion.

Stochastic gradient descent is also feasible for handling non-convex problems in
a big-data setting.
Extensions of SGD

SGD+Momentum (SGDM):
- set initials: t = 0, x(0), and v_0 (momentum)
- at the t-th iteration,
  1. compute the gradient: g'(x(t−1))
  2. compute the momentum: v_t = γ v_{t−1} − α_t g'(x(t−1)) (a popular γ is 0.9)
  3. update the parameter: x(t) = x(t−1) + v_t

AdaGrad:
- define the gradient with respect to x_j: g'_{t,j} ≡ ∂g(x(t))/∂x_j
- update

      x_j(t+1) ← x_j(t) − {η / √(G_{t,j} + ε)} · g'_{t,j},

  where η / √(G_{t,j} + ε) acts as an adaptive α_t and G_{t,j} = Σ_{i≤t} (g'_{i,j})²
  is the cumulative squared gradient of the j-th element.
Extensions of SGD (cont.)

Adadelta: at each iteration,
- replace the cumulative squared gradient by an exponentially weighted moving
  average (EWMA scheme):

      E[(g'_j)²]_t ← ρ E[(g'_j)²]_{t−1} + (1 − ρ)(g'_{t,j})² (a popular ρ: 0.9)

- define (intermediate step):

      Δx_j(t) ≡ −{η / √(E[(g'_j)²]_t + ε)} · g'_{t,j}

- define:

      E[Δx_j²]_t = ρ E[Δx_j²]_{t−1} + (1 − ρ)(Δx_j(t))²

- refine Δx_j(t) as

      Δx_j(t) ← −{√(E[Δx_j²]_{t−1} + ε) / √(E[(g'_j)²]_t + ε)} · g'_{t,j}

- update x_j(t+1) ← x_j(t) + Δx_j(t)

(note: the intermediate step is not actually run, and η is eliminated from the
algorithm)
Extensions of SGD (cont.)

RMSProp: a special case of Adadelta,
- compute the EWMA: E[(g'_j)²]_t ← 0.9 E[(g'_j)²]_{t−1} + 0.1 (g'_{t,j})²
- update x_j(t+1) ← x_j(t) − {η / √(E[(g'_j)²]_t + ε)} · g'_{t,j} (a popular η: 0.001)

Adam (adaptive moment estimation):
- set initials m_0 = 0, v_0 = 0, and β_1, β_2 ∈ (0, 1)
- EWMA momentum: m_t = β_1 m_{t−1} + (1 − β_1) g'_t
- EWMA squared gradient: v_t = β_2 v_{t−1} + (1 − β_2)(g'_t)² (defined elementwise)
- bias correction:

      m̂_t = m_t / (1 − β_1^t),  v̂_t = v_t / (1 − β_2^t)

- update

      x(t+1) ← x(t) − {η / (√v̂_t + ε)} m̂_t.

- popular choice of tuning parameters: β_1 = 0.9, β_2 = 0.999, ε = 10^{-8}
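A compact Adam sketch in R following the steps above, applied to the same kind of
hypothetical OLS loss used earlier; the tuning values are the popular defaults
listed on the slide, and the step size η = 0.1 is an assumption.

```r
set.seed(1)
X <- cbind(1, rnorm(200)); y <- X %*% c(1, -2) + rnorm(200)
grad <- function(b) drop(-crossprod(X, y - X %*% b)) / nrow(X)
b <- c(0, 0); m <- c(0, 0); v <- c(0, 0)
eta <- 0.1; b1 <- 0.9; b2 <- 0.999; eps <- 1e-8
for (t in 1:5000) {
  g <- grad(b)
  m <- b1 * m + (1 - b1) * g          # EWMA momentum
  v <- b2 * v + (1 - b2) * g^2        # EWMA squared gradient (elementwise)
  m_hat <- m / (1 - b1^t)             # bias corrections
  v_hat <- v / (1 - b2^t)
  b <- b - eta * m_hat / (sqrt(v_hat) + eps)
}
b   # approaches the OLS solution
```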
Coordinate Descent Algorithm

Goal: min_x g(x), where g(·) is smooth (component-wise convex) and
x = (x_1, x_2, ..., x_k)'.

- Initialize x(0).
- For iteration t = 1, 2, ..., update coordinate j ∈ {1, 2, ..., k} via

      x_j(t+1) = arg min_{x_j} g(x_j, x_{−j}(t)),

  where x_{−j}(t) represents x(t) excluding the j-th coordinate.
- Possible forms of x_{−j}(t):
  - x_{−j}(t) = (x_1(t+1), x_2(t+1), ..., x_{j−1}(t+1), x_{j+1}(t), ..., x_k(t))'
    [converges faster]
  - x_{−j}(t) = (x_1(t), x_2(t), ..., x_{j−1}(t), x_{j+1}(t), ..., x_k(t))'
    [can be implemented in parallel]
- Simple and computationally efficient for high-dim settings of x.
- Updating can be done in random order or in blocks (block coordinate descent).
Illustration of Coordinate Descent

      x_j(t+1) = x_j(t) + δ*, where δ* = arg min_δ f(x(t) + δ e_j)

[Figure from wiki]


Example 2: Coordinate Descent for LS

Assume the data follow Y ~ (Xβ, Σ(θ)). The goal is to estimate (β, θ).

For the OLS problem, assuming Σ(θ) = σ²I, β can be solved via the CD algorithm:

      β_j(t) = X_j'(Y − X_{−j} β_{−j}(t−1)) / (X_j'X_j)
             = β_j(t−1) + X_j'e(t−1) / (X_j'X_j), j = 1, 2, ...,

where β_{−j} is β excluding the j-th entry, X_j is the j-th column of X, X_{−j}
is X excluding the j-th column, and e(t−1) = Y − Xβ(t−1) is the current residual
vector.

For the GLS problem: min_{β,θ} (Y − Xβ)'[Σ(θ)]^{-1}(Y − Xβ), a block CD
algorithm alternates the following updates:
- β(t) = (X'[Σ(θ(t−1))]^{-1}X)^{-1} (X'[Σ(θ(t−1))]^{-1}Y)
- θ(t) = arg min_θ tr{[Σ(θ)]^{-1}(Y − Xβ(t))(Y − Xβ(t))'}
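A sketch of the OLS coordinate-descent update in R, using the residual form
β_j ← β_j + X_j'e/(X_j'X_j) and keeping the residual in sync after each move;
the data are simulated for illustration.

```r
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p) + rnorm(n))
b <- rep(0, p)
e <- y - drop(X %*% b)                         # current residual e = y - Xb
for (t in 1:50) {
  for (j in 1:p) {
    delta <- sum(X[, j] * e) / sum(X[, j]^2)   # X_j'e / (X_j'X_j)
    b[j] <- b[j] + delta
    e <- e - delta * X[, j]                    # update the residual after the move
  }
}
b   # matches lm.fit(X, y)$coefficients
```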
Example 3: Matrix-variate Time Series

      Y_t = A Y_{t−1} B + ε_t, t = 1, 2, ...,

where Y_t is n×m, A is n×n, and B is m×m.

      (Â, B̂) = arg min_{A,B} Σ_t ||Y_t − A Y_{t−1} B||²_F.

Block CD alternates:
- Â(t) = arg min_A Σ_t ||Y_t − A Y_{t−1} B̂(t−1)||²_F
- B̂(t) = arg min_B Σ_t ||Y_t − Â(t) Y_{t−1} B||²_F

Both steps are LS problems with a closed-form solution!
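A sketch of this alternating scheme in R under simulated data (dimensions and
parameter values are hypothetical). Each step uses the closed-form multivariate
LS solution: with Z_t = Y_{t−1}B̂, Â = (Σ_t Y_t Z_t')(Σ_t Z_t Z_t')^{-1}, and
with W_t = ÂY_{t−1}, B̂ = (Σ_t W_t'W_t)^{-1}(Σ_t W_t'Y_t). Note (A, B) are only
identified up to the rescaling A → cA, B → B/c.

```r
set.seed(1)
n <- 3; m <- 2; TT <- 200
A <- 0.5 * diag(n); B <- matrix(c(0.6, 0.1, 0.2, 0.5), m, m)
Y <- array(0, c(n, m, TT))
for (t in 2:TT)
  Y[, , t] <- A %*% Y[, , t - 1] %*% B + matrix(rnorm(n * m, sd = 0.1), n, m)
Ah <- diag(n); Bh <- diag(m)
for (it in 1:50) {
  S1 <- matrix(0, n, n); S2 <- matrix(0, n, n)
  S3 <- matrix(0, m, m); S4 <- matrix(0, m, m)
  for (t in 2:TT) {                    # A-step sums, Z_t = Y_{t-1} Bh
    Z <- Y[, , t - 1] %*% Bh
    S1 <- S1 + Y[, , t] %*% t(Z); S2 <- S2 + Z %*% t(Z)
  }
  Ah <- S1 %*% solve(S2)
  for (t in 2:TT) {                    # B-step sums, W_t = Ah Y_{t-1}
    W <- Ah %*% Y[, , t - 1]
    S3 <- S3 + crossprod(W); S4 <- S4 + crossprod(W, Y[, , t])
  }
  Bh <- solve(S3, S4)
}
```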
Example 4: Coordinate Descent for Lasso

Under an orthogonal design matrix X (i.e., X'X = I), the OLS estimator satisfies
β̂_ols = (X'X)^{-1}X'y = X'y.

Solving the lasso, i.e., minimizing

      (1/2)||y − Xβ||²₂ + λ||β||₁ = (1/2)y'y − y'Xβ + (1/2)β'β + λ Σ_j |β_j|,

is equivalent to minimizing

      Σ_j {−β̂_j^ols β_j + (1/2)β_j² + λ|β_j|}  or  Σ_j {(1/2)(β_j − β̂_j^ols)² + λ|β_j|},

which can be solved separately for each j, leading to the solution

      β̂_j^lasso = sign(β̂_j^ols)(|β̂_j^ols| − λ)_+, a_+ = max(a, 0).
      (soft-thresholding rule)
Soft-thresholding Rule

      min_β Q(β), Q(β) = (1/2)(y − β)² + λ|β| (y and λ are given)

      ∂Q/∂β = −(y − β) + λ · sign(β) = { β − y + λ, if β > 0,
                                         −y,        if β = 0,
                                         β − y − λ, if β < 0.

Solving ∂Q/∂β = 0 gives β̂ = sign(y)(|y| − λ)_+ ≡ S_λ(y).
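A short R check of the soft-thresholding rule and the orthogonal-design lasso of
Example 4; the design (orthonormalized via QR), the true coefficients, and λ are
hypothetical.

```r
soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)   # S_lambda(y)
set.seed(1)
X <- qr.Q(qr(matrix(rnorm(100 * 4), 100, 4)))   # orthonormal columns: X'X = I
y <- X %*% c(3, -1.5, 0.2, 0) + rnorm(100, sd = 0.5)
b_ols <- drop(crossprod(X, y))                  # OLS under X'X = I is X'y
soft(b_ols, lambda = 0.5)                       # lasso solution, coordinatewise
```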
Optimization with Linear Constraints

      min_x f(x) subject to Ax = b, x ∈ R^p, A: k×p, b ∈ R^k.   (primal problem)

Lagrange Multiplier (LM) method:

      min_{x,λ} L(x, λ), where L(x, λ) ≡ f(x) + λ'(Ax − b).

Equivalent to the following:

      g(λ) ≡ inf_x L(x, λ),

      λ* = arg max_λ g(λ),  x* = arg min_x L(x, λ*).
LM Method (cont.)

      g(λ) = inf_x L(x, λ) ≤ inf_x {L(x, λ): Ax = b} = inf_x {f(x): Ax = b}

- to achieve the goal (on the RHS), solve

      λ* = arg max_λ g(λ),   (Lagrange dual problem)

- to obtain the exact solution of x for the RHS:

      x* = arg min_x L(x, λ*)
Illustration: A Toy Example

      min_{x∈R} x², subject to x + 1 = 0.

Obviously, the solution is x = −1.

- L(x, λ) = x² + λ(x + 1) (gray lines)
- ∂L(x, λ)/∂x = 2x + λ, which implies that, for a fixed λ, L(x, λ) attains its
  minimum at x = −λ/2 (red dot x-coord)
- g(λ) = L(−λ/2, λ) = (−λ/2)² + λ(−λ/2) + λ = λ − λ²/4 (red dot y-coord)
- g'(λ) = 1 − λ/2 implies λ* = 2
- g(λ) reaches its maximum when the corresponding x-coordinate is at the correct
  solution x = −1.
Illustration: A Toy Example (cont.)

      min_{x∈R} x², subject to x + 1 = 0.

[Figure: the curve x² over x ∈ [−3, 3], with the Lagrangian lines L(x, λ) and
the red dots described on the previous slide.]
Same Problem via Gradient Ascent (Descent)

Original problem:

      λ* = arg max_λ g(λ),
      x* = arg min_x L(x, λ*).

Gradient ascent:

      x(t+1) = arg min_x L(x, λ(t)),
      λ(t+1) = λ(t) + α(t) (Ax(t+1) − b),

where Ax(t+1) − b is ∂g(λ)/∂λ evaluated at the current iterate.
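For the toy problem, this dual (gradient) ascent iteration takes a few lines of
R; the fixed step size α = 0.1 is a hypothetical choice.

```r
alpha <- 0.1; lambda <- 0
for (t in 1:200) {
  x <- -lambda / 2                     # x(t+1) = arg min_x L(x, lambda)
  lambda <- lambda + alpha * (x + 1)   # ascent step: dg/dlambda = Ax - b = x + 1
}
c(x = x, lambda = lambda)              # converges to x = -1, lambda* = 2
```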
Augmented Lagrangian

Goal:

      min_x f(x) subject to Ax = b, x ∈ R^p, A: k×p, b ∈ R^k.

Augmented LM method (ρ > 0 is given):

      min_{x,λ} L_ρ(x, λ),

where

      L_ρ(x, λ) ≡ f(x) + λ'(Ax − b) + (ρ/2)||Ax − b||²₂.

Equivalent problem:

      min_x {f(x) + (ρ/2)||Ax − b||²₂} subject to Ax = b.
Augmented Lagrangian (cont.)

Gradient ascent:

      x(t+1) = arg min_x L_ρ(x, λ(t)),
      λ(t+1) = λ(t) + ρ(Ax(t+1) − b).
Alternating Direction Method of Multipliers (ADMM)

Goal:

      min_{x,z} {f(x) + g(z)} subject to Ax + Bz = c,

where f(·) and g(·) are both convex.

- lasso-type problem:

      min (1/2)||Ax − b||²₂ + τ||z||₁, subject to x = z,
          [f(x)]             [g(z)]

- fusion-type problem:

      min (1/2) Σ_{i=1}^n (y_i − β_i)² + τ Σ_{i=2}^n |β_i − β_{i−1}|,
      subject to δ = Dβ,
          [f(β) = (1/2)||y − β||²₂, g(δ) = τ||δ||₁]

  with δ_i = β_i − β_{i−1}, i = 2, ..., n; δ = (δ_2, δ_3, ..., δ_n)'; and D the
  (n−1)×n differencing matrix

      D = [ −1  1          ··· 0
             0 −1  1           0
             ..     ..  ..    ..
             0          ··· −1 1 ].
ADMM (cont.)

Define L_ρ(x, z, λ) ≡ f(x) + g(z) + λ'(Ax + Bz − c) + (ρ/2)||Ax + Bz − c||²₂.

Augmented Lagrangian:

      (x(t+1), z(t+1)) = arg min_{x,z} L_ρ(x, z, λ(t)),
      λ(t+1) = λ(t) + ρ(Ax(t+1) + Bz(t+1) − c).

ADMM:

      x(t+1) = arg min_x L_ρ(x, z(t), λ(t)),
      z(t+1) = arg min_z L_ρ(x(t+1), z, λ(t)),
      λ(t+1) = λ(t) + ρ(Ax(t+1) + Bz(t+1) − c).

(scaled form) Replace λ by w ≡ λ/ρ, leading to

      w(t+1) = w(t) + (Ax(t+1) + Bz(t+1) − c).
Example 5: Fused Lasso

      min_β (1/2) Σ_{i=1}^n (y_i − β_i)² + τ Σ_{i=2}^n |β_i − β_{i−1}|

Re-write as

      min_{β,δ} (1/2) Σ_{i=1}^n (y_i − β_i)² + τ Σ_{i=2}^n |δ_i|,
      subject to δ_i = β_i − β_{i−1}, i = 2, ..., n.

Solve via ADMM:

      L_ρ(β, δ, λ) = (1/2)||y − β||² + τ||δ||₁ + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²,

with δ = (δ_2, δ_3, ..., δ_n)' and D the (n−1)×n differencing matrix from the
previous slide.
Example 5 (cont.)

      L_ρ(β, δ, λ) = (1/2)||y − β||² + τ||δ||₁ + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²

For β:

      min_β {(1/2)||y − β||² + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²}.

Taking the derivative w.r.t. β leads to solving

      (β − y) + D'λ + ρD'(Dβ − δ) = 0  ⇒  β ← (I + ρD'D)^{-1}[y + ρD'(δ − λ/ρ)]

For δ:

      min_δ {τ||δ||₁ + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²}.

Re-arranging the quadratic part leads to the equivalent form

      min_δ (1/2)||δ − (Dβ + λ/ρ)||² + (τ/ρ)||δ||₁  ⇒  δ ← S_{τ/ρ}(Dβ + λ/ρ)

(applied elementwise). For λ:

      λ ← λ + ρ(Dβ − δ)
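Putting the three updates together, a sketch of this fused-lasso ADMM in R (the
piecewise-constant signal, τ, and ρ are hypothetical choices):

```r
soft <- function(z, k) sign(z) * pmax(abs(z) - k, 0)   # S_k(z), elementwise
set.seed(1)
n <- 100
y <- c(rep(0, 40), rep(2, 30), rep(-1, 30)) + rnorm(n, sd = 0.4)
D <- diff(diag(n))                     # (n-1) x n differencing matrix
tau <- 2; rho <- 1
beta <- rep(0, n); delta <- rep(0, n - 1); lam <- rep(0, n - 1)
for (t in 1:200) {
  beta  <- drop(solve(diag(n) + rho * crossprod(D),
                      y + t(D) %*% (rho * delta - lam)))   # beta-step
  delta <- soft(drop(D %*% beta) + lam / rho, tau / rho)   # delta-step
  lam   <- lam + rho * (drop(D %*% beta) - delta)          # dual update
}
plot(y); lines(beta, col = 2, lwd = 2)   # piecewise-constant fit
```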
Other Optimization Methods

Goal: find an approximate global solution to a very complex (non-convex)
objective function with many local optima.
- Simulated annealing (SA)
- Genetic algorithm (GA)
- Particle swarm optimization (PSO)

Bayesian (sequential) optimization: based on Gaussian processes (Lecture 9)
References

- Convex Optimization, Boyd and Vandenberghe.
  https://web.stanford.edu/~boyd/cvxbook/
- An overview of gradient descent optimization algorithms, Sebastian Ruder.
  https://arxiv.org/abs/1609.04747
- Distributed optimization and statistical learning via the alternating direction
  method of multipliers, Boyd et al.
  https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
