
Lecture 7

Optimization

Nan-Jung Hsu

Optimization

Given an objective function g(x) (e.g., least squares, negative log-likelihood, any
loss function), find

      min_x g(x).

- Solve g'(x) = 0 analytically (impossible most of the time!).
- Find the solution numerically (an approximate solution).
- We emphasize convex optimization, but most of the methods have also been
  applied to general (non-convex) problems.
Examples of Optimization in Statistical Inference

- Least squares problem: linear or nonlinear fitting (minimization)
- Maximum likelihood estimation (maximization)
- Solving for the Bayes estimate (posterior mode) in Bayesian analysis
- Estimation with constraints, e.g., regularized (penalized) estimation
- Planning inference, e.g., allocation problems with cost constraints
Least Squares (LS) Problem

Ordinary LS (OLS):
- loss function:

      Q(β) = Σ_{i=1}^n (y_i − x_i'β)² = ||Y − Xβ||² = (Y − Xβ)'(Y − Xβ),

      Q'(β) = −2X'(Y − Xβ).

- goal: find β̂ to minimize Q(β), i.e., solve Q'(β) = 0
- solution: β̂ = (X'X)^{-1} X'Y

Generalized LS (GLS):

      β̂ = arg min_β (Y − Xβ)'Σ^{-1}(Y − Xβ) = (X'Σ^{-1}X)^{-1} X'Σ^{-1} Y,

where the objective is a squared Mahalanobis distance and the given Σ could be a
very general positive-definite matrix (covariance matrix).
[Handwritten note: Model y_i = β_0 + β_1 x_{i1} + ... + β_p x_{ip} + ε_i, with
ε_i iid ~ N(0, σ²) and x_i = (1, x_{i1}, x_{i2}, ..., x_{ip})'. Goal: find
β = (β_0, β_1, ..., β_p)' such that L(β) = Σ_i (y_i − x_i'β)² is minimized.
Writing Y = (y_1, ..., y_n)' and X = (x_{ij}), we get L(β) = ||Y − Xβ||², and
∂L(β)/∂β_k = −2 Σ_i (y_i − x_i'β) x_{ik} for k = 0, 1, ..., p, or equivalently
∂L(β)/∂β = −2X'(Y − Xβ).]
Nonlinear LS Problem (cont.)

Loss function:

      Q_1(β) = Σ_{i=1}^n (y_i − m_β(x_i))² = ||Y − m_β||²,

      Q_2(β, θ) = (Y − m_β)' Σ_θ^{-1} (Y − m_β).

Examples:
- Fit a function: y_t = e^{β_0 + β_1 t} + ε_t
- Estimate parameters for Y_i ~ Poi(m(x_i)), where m(x) = exp(β_0 + β_1 x)
  (m(x_i) = EY_i = var(Y_i))
- Estimate the variance-related parameter for Σ ≡ [Σ_ij], where

      Σ_ij = exp(−θ (x_i − x_j)²), 1 ≤ i, j ≤ n.

In general, there is no closed-form solution for the nonlinear LS problem!
Newton’s Method (illustrated in the 1-dim case)

Also called the Newton-Raphson algorithm; it is a root-finding algorithm.
Let f(x): R → R be a differentiable function.
Goal: find x* such that f(x*) = 0.

Taylor expansion:

      f(x) ≈ f(a) + f'(a)(x − a), for x ≈ a.

Solve f(x*) = 0 approximately by solving

      0 = f(a) + f'(a)(x̃ − a), for given a,

resulting in a closed-form solution: x̃ = a − f(a)/f'(a).

Algorithm:
- set an initial x(0)
- iteratively approximate the solution by x(t+1) = x(t) − f(x(t))/f'(x(t)), t = 0, 1, ...
- check convergence: ||x(t) − x(t−1)|| → 0 as t increases
Newton’s Method (cont.)

Algorithm (see wiki for a demo): as above.

The algorithm converges quadratically when x(t) is close enough to x*, i.e.,

      ||x(t) − x*|| ≤ K ||x(t−1) − x*||², for some K,

and the error |f(x(t)) − f(x*)| decays quadratically, e.g., ε, ε², ε⁴, ...

Related R functions: optim, optimize, nlm, nlminb, uniroot, ...
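As a quick illustration (not from the slides), a minimal Newton-Raphson root
finder in R might look as follows; the test function f(x) = x³ − 2 and the
starting value are hypothetical, and uniroot is the built-in alternative
mentioned above.

```r
# Minimal Newton-Raphson sketch; f, fprime, and x0 are supplied by the user.
newton <- function(f, fprime, x0, tol = 1e-10, maxit = 100) {
  x <- x0
  for (t in 1:maxit) {
    x_new <- x - f(x) / fprime(x)             # x(t+1) = x(t) - f(x(t)) / f'(x(t))
    if (abs(x_new - x) < tol) return(x_new)   # convergence check
    x <- x_new
  }
  x
}
f <- function(x) x^3 - 2
newton(f, function(x) 3 * x^2, x0 = 1)   # 1.259921... = 2^(1/3)
uniroot(f, c(0, 2))$root                 # built-in alternative
```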
Minimization Problem

Goal: minimize g(x): R^p → R.
Equivalent to finding a root of g'(x) [solve p equations simultaneously].
Taylor approximation for g'(x) around x_0:

      g'(x) ≈ g'(x_0) + {∂²g(x)/∂x∂x' |_{x=x_0}} (x − x_0),   (*)

where g'(x) and g'(x_0) are p×1 gradients and H ≡ ∂²g(x)/∂x∂x' is the p×p
Hessian.

Solve (*) = 0, leading to

      x = x_0 − [H(x_0)]^{-1} g'(x_0).

Newton's method iterates

      x(t+1) = x(t) − [H(x(t))]^{-1} g'(x(t)), t = 0, 1, 2, ...   (**)
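A sketch of the p-dimensional update (**) in R, assuming the user supplies the
gradient and Hessian analytically; the convex test function g(x) = Σ_j (x_j⁴ + x_j²)
is a hypothetical example, not from the slides.

```r
# Newton iteration (**): x <- x - [H(x)]^{-1} g'(x), via a linear solve.
newton_min <- function(grad, hess, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  for (t in 1:maxit) {
    x_new <- x - solve(hess(x), grad(x))   # solve H(x) step = g'(x)
    if (sqrt(sum((x_new - x)^2)) < tol) return(drop(x_new))
    x <- x_new
  }
  drop(x)
}
grad <- function(x) 4 * x^3 + 2 * x               # gradient of sum(x^4 + x^2)
hess <- function(x) diag(12 * x^2 + 2, length(x)) # diagonal Hessian
newton_min(grad, hess, x0 = c(2, -1))             # converges to (0, 0)
```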
Maximum Likelihood Problem

(Y_1, Y_2, ..., Y_n) are random variables with joint density f(y_1, y_2, ..., y_n; θ).
Goal: estimate the parameter θ.
Likelihood function and log-likelihood function:

      L(θ) = f(Y_1, Y_2, ..., Y_n; θ);  ℓ(θ) = log L(θ).

If Y_i iid ~ f(y; θ), then

      L(θ) = Π_{i=1}^n f(Y_i; θ);  ℓ(θ) = Σ_{i=1}^n log f(Y_i; θ) = Σ_{i=1}^n ℓ_i(θ).

Maximum (log-)likelihood estimate:

      θ̂ = arg max_θ L(θ) = arg min_θ {−ℓ(θ)}

If ℓ(·) is differentiable, then [∂ℓ(θ)/∂θ]_{θ=θ̂} = 0  [a root-finding problem].
Solve MLE by Newton's Method

The MLE θ̂ can be numerically solved by applying (**) to g(θ) ≡ −ℓ(θ), which
implies

      θ(t+1) = θ(t) − [∂²(−ℓ(θ(t)))/∂θ²]^{-1} [∂(−ℓ(θ(t)))/∂θ]
             = θ(t) − [∂²ℓ(θ(t))/∂θ²]^{-1} [∂ℓ(θ(t))/∂θ]
             = θ(t) − [H(θ(t))]^{-1} s(θ(t)), t = 0, 1, 2, ...

where
- the gradient s(θ) ≡ ∂ℓ(θ)/∂θ is also called the score function,
- the Hessian matrix H(θ) ≡ ∂²ℓ(θ)/∂θ² is also called the negative empirical
  information matrix.
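As a concrete illustration (not from the slides), the logistic-regression MLE can
be computed with this update, since the score and Hessian have closed forms:
s(β) = X'(y − p) and H(β) = −X'WX with W = diag{p_i(1 − p_i)}. The simulated
data below are hypothetical.

```r
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))   # true beta = (-0.5, 1)
b <- c(0, 0)
for (t in 1:25) {
  p <- drop(plogis(X %*% b))
  score <- crossprod(X, y - p)               # s(b) = X'(y - p)
  H <- -crossprod(X * (p * (1 - p)), X)      # H(b) = -X'WX
  b <- drop(b - solve(H, score))             # Newton step: b - H^{-1} s(b)
}
b   # agrees with glm(y ~ X[, 2], family = binomial)$coefficients
```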
More Convex Optimization Methods

- Gradient descent
- Stochastic gradient descent
- Coordinate descent: optimize a multi-variable convex (concave) function by
  optimizing along one coordinate at a time
  [simplifies a high-dim optimization problem into low-dim problems]
- Lagrange multiplier method: optimize a function subject to equality constraints.
- Quadratic programming: optimize a function subject to linear inequality
  constraints.
- Alternating direction method of multipliers (ADMM): handles various types of
  constraints.
Gradient Descent

Review: Newton's method updates

      x(t+1) = x(t) − [∂²g(x(t))/∂x∂x']^{-1} g'(x(t)),
                         (Hessian)          (gradient)

but the Hessian might not be feasible to compute for a complex g.

Gradient descent finds a local minimum by repeatedly applying the update

      x(t+1) = x(t) − α_t g'(x(t)), where α_t > 0 (step size / learning rate).

Convergence requires certain conditions on g(·) and α_t; the latter controls the
convergence rate. A popular choice: α_t = α_1 t^{−γ}, γ ∈ (0.5, 1] (which
satisfies Σ α_t = ∞, Σ α_t² < ∞).

R toolbox: package gradDescent for regression tasks.
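A bare-bones sketch of this update in R, using the OLS objective
g(β) = (1/2)||y − Xβ||² and the decaying step size α_t = α_1 t^{−γ}; the data
and the values of α_1 and γ are hypothetical choices.

```r
set.seed(1)
X <- cbind(1, rnorm(100))
y <- X %*% c(1, 2) + rnorm(100)
b <- c(0, 0)
a1 <- 0.01; gam <- 0.6                 # alpha_t = a1 * t^(-gam), gam in (0.5, 1]
for (t in 1:5000) {
  g <- -crossprod(X, y - X %*% b)      # g'(b) = -X'(y - Xb)
  b <- b - (a1 * t^(-gam)) * g
}
drop(b)   # close to solve(crossprod(X), crossprod(X, y))
```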
Step Size / Learning Rate

A too-large learning rate may cause an unstable process: GD can overshoot the
minimum and may fail to converge (left panel). A too-small learning rate may
result in very slow updates (right panel).

[Figure: two panels contrasting a large step size (overshooting) with a small
step size (slow progress).]
Example 1: Penalized Regression

      min_β (1/2)||y − Xβ||² + λ||β||₂^γ, for some λ ≥ 0 and a given power γ.

- compute the gradient: g'(β) = −X'(y − Xβ) + λγ(β'β)^{γ/2−1} β
- gradient descent: update β ← β − α_t g'(β) until convergence
- self-consistency: the solution of g'(β) = 0 satisfies

      X'y = [X'X + λγ(β'β)^{γ/2−1} I] β,

      β = [X'X + λγ(β'β)^{γ/2−1} I]^{-1} X'y,

  which suggests another updating scheme:

      β(t+1) ← [X'X + λγ((β(t))'β(t))^{γ/2−1} I]^{-1} X'y.

R Lab7 demo
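A sketch of this fixed-point scheme in R, under the reading that the penalty is
λ||β||₂^γ (so γ = 2 reduces to ridge with weight 2λ); the data, λ = 1, and γ = 3
are hypothetical, and convergence of the iteration is assumed rather than
guaranteed.

```r
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
y <- X %*% c(1, 0, -1) + rnorm(100)
lambda <- 1; gam <- 3
b <- rep(0.1, 3)
for (t in 1:100) {
  w <- lambda * gam * sum(b^2)^(gam / 2 - 1)   # penalty weight at current b(t)
  b <- drop(solve(crossprod(X) + w * diag(3), crossprod(X, y)))
}
b   # fixed point of the self-consistency equation
```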
Stochastic Gradient Descent (SGD)

Goal: min_x (1/n) Σ_{i=1}^n g_i(x), where g_i is the individual loss on the i-th
data point.

Standard gradient descent:

      x(t+1) = x(t) − α_t {(1/n) Σ_{i=1}^n g_i'(x(t))}

Stochastic gradient descent:

      x(i) ← x(i−1) − α_i g_i'(x(i−1)), i = 1, 2, ...

Mini-batch gradient descent:

      x(t+1) = x(t) − α_t {(1/|D_t|) Σ_{i∈D_t} g_i'(x(t))},

where D_t is a subset of the data indices.
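A mini-batch SGD sketch for the OLS loss, with the step size α_t = α_1 t^{−γ}
from the gradient-descent slide; the data, batch size, and tuning values are
hypothetical.

```r
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n)); y <- X %*% c(2, -1) + rnorm(n)
b <- c(0, 0); a1 <- 0.5; gam <- 0.6; batch <- 20; t <- 0
for (epoch in 1:50) {
  for (idx in split(sample(n), ceiling(seq_len(n) / batch))) {  # random batches D_t
    t <- t + 1
    Xb <- X[idx, , drop = FALSE]
    g <- -crossprod(Xb, y[idx] - Xb %*% b) / length(idx)  # mini-batch gradient
    b <- b - (a1 * t^(-gam)) * g
  }
}
drop(b)   # close to lm.fit(X, y)$coefficients
```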
SGD (cont.)

The SGD update is commonly used in two scenarios:
- Data are collected on-line (in batches): update the parameter x whenever new
  data come in; g_i(x) corresponds to the loss function contributed by the i-th
  batch of data (run through each batch once, not repeatedly).
- Big (off-line) data are partitioned into n smaller portions; g_i(x) corresponds
  to the loss function contributed by the i-th data portion.

Stochastic gradient descent is also feasible for handling non-convex problems in
a big-data setting.
Extensions of SGD

SGD+Momentum (SGDM):
- set initials: t = 0, x(0), and v_0 (momentum)
- at the t-th iteration,
  1. compute the gradient: g'(x(t−1))
  2. compute the momentum: v_t = γ v_{t−1} − α_t g'(x(t−1)) (a popular γ is 0.9)
  3. update the parameter: x(t) = x(t−1) + v_t

AdaGrad:
- define the gradient with respect to x_j: g'_{t,j} ≡ ∂g(x(t))/∂x_j
- update

      x_j(t+1) ← x_j(t) − {η / √(G_{t,j} + ε)} · g'_{t,j},

  where η / √(G_{t,j} + ε) acts as an adaptive α_t and G_{t,j} = Σ_{i≤t} (g'_{i,j})²
  is the cumulative squared gradient of the j-th element.
Extensions of SGD (cont.)

Adadelta: at each iteration,
- replace the cumulative squared gradient by an exponentially weighted moving
  average (EWMA scheme):

      E[(g'_j)²]_t ← ρ E[(g'_j)²]_{t−1} + (1 − ρ)(g'_{t,j})² (a popular ρ: 0.9)

- define (intermediate step):

      Δx_j(t) ≡ −{η / √(E[(g'_j)²]_t + ε)} · g'_{t,j}

- define:

      E[Δx_j²]_t = ρ E[Δx_j²]_{t−1} + (1 − ρ)(Δx_j(t))²

- refine Δx_j(t) as

      Δx_j(t) ← −{√(E[Δx_j²]_{t−1} + ε) / √(E[(g'_j)²]_t + ε)} · g'_{t,j}

- update x_j(t+1) ← x_j(t) + Δx_j(t)

(note: the intermediate step is not actually run, and η is eliminated from the
algorithm)
Extensions of SGD (cont.)

RMSProp: a special case of Adadelta,
- compute the EWMA: E[(g'_j)²]_t ← 0.9 E[(g'_j)²]_{t−1} + 0.1 (g'_{t,j})²
- update x_j(t+1) ← x_j(t) − {η / √(E[(g'_j)²]_t + ε)} · g'_{t,j} (a popular η: 0.001)

Adam (adaptive moment estimation):
- set initials m_0 = 0, v_0 = 0, and β_1, β_2 ∈ (0, 1)
- EWMA momentum: m_t = β_1 m_{t−1} + (1 − β_1) g'_t
- EWMA squared gradient: v_t = β_2 v_{t−1} + (1 − β_2)(g'_t)² (defined elementwise)
- bias correction:

      m̂_t = m_t / (1 − β_1^t),  v̂_t = v_t / (1 − β_2^t)

- update

      x(t+1) ← x(t) − {η / (√v̂_t + ε)} m̂_t.

- popular choice of tuning parameters: β_1 = 0.9, β_2 = 0.999, ε = 10^{-8}
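A compact Adam sketch in R following the steps above, applied to the same kind of
hypothetical OLS loss used earlier; the tuning values are the popular defaults
listed on the slide, and the step size η = 0.1 is an assumption.

```r
set.seed(1)
X <- cbind(1, rnorm(200)); y <- X %*% c(1, -2) + rnorm(200)
grad <- function(b) drop(-crossprod(X, y - X %*% b)) / nrow(X)
b <- c(0, 0); m <- c(0, 0); v <- c(0, 0)
eta <- 0.1; b1 <- 0.9; b2 <- 0.999; eps <- 1e-8
for (t in 1:5000) {
  g <- grad(b)
  m <- b1 * m + (1 - b1) * g          # EWMA momentum
  v <- b2 * v + (1 - b2) * g^2        # EWMA squared gradient (elementwise)
  m_hat <- m / (1 - b1^t)             # bias corrections
  v_hat <- v / (1 - b2^t)
  b <- b - eta * m_hat / (sqrt(v_hat) + eps)
}
b   # approaches the OLS solution
```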
Coordinate Descent Algorithm

Goal: min_x g(x), where g(·) is smooth (component-wise convex) and
x = (x_1, x_2, ..., x_k)'.

- Initialize x(0).
- For iteration t = 1, 2, ..., update coordinate j ∈ {1, 2, ..., k} via

      x_j(t+1) = arg min_{x_j} g(x_j, x_{−j}(t)),

  where x_{−j}(t) represents x(t) excluding the j-th coordinate.
- Possible forms of x_{−j}(t):
  - x_{−j}(t) = (x_1(t+1), x_2(t+1), ..., x_{j−1}(t+1), x_{j+1}(t), ..., x_k(t))'
    [converges faster]
  - x_{−j}(t) = (x_1(t), x_2(t), ..., x_{j−1}(t), x_{j+1}(t), ..., x_k(t))'
    [can be implemented in parallel]
- Simple and computationally efficient for high-dim settings of x.
- Updating can be done in random order or in blocks (block coordinate descent).
Illustration of Coordinate Descent

      x_j(t+1) = x_j(t) + δ*, where δ* = arg min_δ f(x(t) + δ e_j)

[Figure from wiki]


Example 2: Coordinate Descent for LS

Assume the data follow Y ~ (Xβ, Σ(θ)). The goal is to estimate (β, θ).

For the OLS problem, assuming Σ(θ) = σ²I, β can be solved via the CD algorithm:

      β_j(t) = X_j'(Y − X_{−j} β_{−j}(t−1)) / (X_j'X_j)
             = β_j(t−1) + X_j'e(t−1) / (X_j'X_j), j = 1, 2, ...,

where β_{−j} is β excluding the j-th entry, X_j is the j-th column of X, X_{−j}
is X excluding the j-th column, and e(t−1) = Y − Xβ(t−1) is the current residual
vector.

For the GLS problem: min_{β,θ} (Y − Xβ)'[Σ(θ)]^{-1}(Y − Xβ), a block CD
algorithm alternates the following updates:
- β(t) = (X'[Σ(θ(t−1))]^{-1}X)^{-1} (X'[Σ(θ(t−1))]^{-1}Y)
- θ(t) = arg min_θ tr{[Σ(θ)]^{-1}(Y − Xβ(t))(Y − Xβ(t))'}
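A sketch of the OLS coordinate-descent update in R, using the residual form
β_j ← β_j + X_j'e/(X_j'X_j) and keeping the residual in sync after each move;
the data are simulated for illustration.

```r
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p) + rnorm(n))
b <- rep(0, p)
e <- y - drop(X %*% b)                         # current residual e = y - Xb
for (t in 1:50) {
  for (j in 1:p) {
    delta <- sum(X[, j] * e) / sum(X[, j]^2)   # X_j'e / (X_j'X_j)
    b[j] <- b[j] + delta
    e <- e - delta * X[, j]                    # update the residual after the move
  }
}
b   # matches lm.fit(X, y)$coefficients
```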
Example 3: Matrix-variate Time Series

      Y_t = A Y_{t−1} B + ε_t, t = 1, 2, ...,

where Y_t is n×m, A is n×n, and B is m×m.

      (Â, B̂) = arg min_{A,B} Σ_t ||Y_t − A Y_{t−1} B||²_F.

Block CD alternates:
- Â(t) = arg min_A Σ_t ||Y_t − A Y_{t−1} B̂(t−1)||²_F
- B̂(t) = arg min_B Σ_t ||Y_t − Â(t) Y_{t−1} B||²_F

Both steps are LS problems with a closed-form solution!
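A sketch of this alternating scheme in R under simulated data (dimensions and
parameter values are hypothetical). Each step uses the closed-form multivariate
LS solution: with Z_t = Y_{t−1}B̂, Â = (Σ_t Y_t Z_t')(Σ_t Z_t Z_t')^{-1}, and
with W_t = ÂY_{t−1}, B̂ = (Σ_t W_t'W_t)^{-1}(Σ_t W_t'Y_t). Note (A, B) are only
identified up to the rescaling A → cA, B → B/c.

```r
set.seed(1)
n <- 3; m <- 2; TT <- 200
A <- 0.5 * diag(n); B <- matrix(c(0.6, 0.1, 0.2, 0.5), m, m)
Y <- array(0, c(n, m, TT))
for (t in 2:TT)
  Y[, , t] <- A %*% Y[, , t - 1] %*% B + matrix(rnorm(n * m, sd = 0.1), n, m)
Ah <- diag(n); Bh <- diag(m)
for (it in 1:50) {
  S1 <- matrix(0, n, n); S2 <- matrix(0, n, n)
  S3 <- matrix(0, m, m); S4 <- matrix(0, m, m)
  for (t in 2:TT) {                    # A-step sums, Z_t = Y_{t-1} Bh
    Z <- Y[, , t - 1] %*% Bh
    S1 <- S1 + Y[, , t] %*% t(Z); S2 <- S2 + Z %*% t(Z)
  }
  Ah <- S1 %*% solve(S2)
  for (t in 2:TT) {                    # B-step sums, W_t = Ah Y_{t-1}
    W <- Ah %*% Y[, , t - 1]
    S3 <- S3 + crossprod(W); S4 <- S4 + crossprod(W, Y[, , t])
  }
  Bh <- solve(S3, S4)
}
```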
Example 4: Coordinate Descent for Lasso

Under an orthogonal design matrix X (i.e., X'X = I), the OLS estimator satisfies
β̂_ols = (X'X)^{-1}X'y = X'y.

Solving the lasso, i.e., minimizing

      (1/2)||y − Xβ||²₂ + λ||β||₁ = (1/2)y'y − y'Xβ + (1/2)β'β + λ Σ_j |β_j|,

is equivalent to minimizing

      Σ_j {−β̂_j^ols β_j + (1/2)β_j² + λ|β_j|}  or  Σ_j {(1/2)(β_j − β̂_j^ols)² + λ|β_j|},

which can be solved separately for each j, leading to the solution

      β̂_j^lasso = sign(β̂_j^ols)(|β̂_j^ols| − λ)_+, a_+ = max(a, 0).
      (soft-thresholding rule)
Soft-thresholding Rule

      min_β Q(β), Q(β) = (1/2)(y − β)² + λ|β| (y and λ are given)

      ∂Q/∂β = −(y − β) + λ · sign(β) = { β − y + λ, if β > 0,
                                         −y,        if β = 0,
                                         β − y − λ, if β < 0.

Solving ∂Q/∂β = 0 gives β̂ = sign(y)(|y| − λ)_+ ≡ S_λ(y).
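A short R check of the soft-thresholding rule and the orthogonal-design lasso of
Example 4; the design (orthonormalized via QR), the true coefficients, and λ are
hypothetical.

```r
soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)   # S_lambda(y)
set.seed(1)
X <- qr.Q(qr(matrix(rnorm(100 * 4), 100, 4)))   # orthonormal columns: X'X = I
y <- X %*% c(3, -1.5, 0.2, 0) + rnorm(100, sd = 0.5)
b_ols <- drop(crossprod(X, y))                  # OLS under X'X = I is X'y
soft(b_ols, lambda = 0.5)                       # lasso solution, coordinatewise
```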
Optimization with Linear Constraints

      min_x f(x) subject to Ax = b, x ∈ R^p, A: k×p, b ∈ R^k.   (primal problem)

Lagrange Multiplier (LM) method:

      min_{x,λ} L(x, λ), where L(x, λ) ≡ f(x) + λ'(Ax − b).

Equivalent to the following:

      g(λ) ≡ inf_x L(x, λ),

      λ* = arg max_λ g(λ),  x* = arg min_x L(x, λ*).
LM Method (cont.)

      g(λ) = inf_x L(x, λ) ≤ inf_x {L(x, λ): Ax = b} = inf_x {f(x): Ax = b}

- to achieve the goal (on the RHS), solve

      λ* = arg max_λ g(λ),   (Lagrange dual problem)

- to obtain the exact solution of x for the RHS:

      x* = arg min_x L(x, λ*)
Illustration: A Toy Example

      min_{x∈R} x², subject to x + 1 = 0.

Obviously, the solution is x = −1.

- L(x, λ) = x² + λ(x + 1) (gray lines)
- ∂L(x, λ)/∂x = 2x + λ, which implies that, for a fixed λ, L(x, λ) attains its
  minimum at x = −λ/2 (red dot x-coord)
- g(λ) = L(−λ/2, λ) = (−λ/2)² + λ(−λ/2) + λ = λ − λ²/4 (red dot y-coord)
- g'(λ) = 1 − λ/2 implies λ* = 2
- g(λ) reaches its maximum when the corresponding x-coordinate is at the correct
  solution x = −1.
Illustration: A Toy Example (cont.)

      min_{x∈R} x², subject to x + 1 = 0.

[Figure: the curve x² over x ∈ [−3, 3], with the Lagrangian lines L(x, λ) and
the red dots described on the previous slide.]
Same Problem via Gradient Ascent (Descent)

Original problem:

      λ* = arg max_λ g(λ),
      x* = arg min_x L(x, λ*).

Gradient ascent:

      x(t+1) = arg min_x L(x, λ(t)),
      λ(t+1) = λ(t) + α(t) (Ax(t+1) − b),

where Ax(t+1) − b is ∂g(λ)/∂λ evaluated at the current iterate.
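For the toy problem, this dual (gradient) ascent iteration takes a few lines of
R; the fixed step size α = 0.1 is a hypothetical choice.

```r
alpha <- 0.1; lambda <- 0
for (t in 1:200) {
  x <- -lambda / 2                     # x(t+1) = arg min_x L(x, lambda)
  lambda <- lambda + alpha * (x + 1)   # ascent step: dg/dlambda = Ax - b = x + 1
}
c(x = x, lambda = lambda)              # converges to x = -1, lambda* = 2
```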
Augmented Lagrangian

Goal:

      min_x f(x) subject to Ax = b, x ∈ R^p, A: k×p, b ∈ R^k.

Augmented LM method (ρ > 0 is given):

      min_{x,λ} L_ρ(x, λ),

where

      L_ρ(x, λ) ≡ f(x) + λ'(Ax − b) + (ρ/2)||Ax − b||²₂.

Equivalent problem:

      min_x {f(x) + (ρ/2)||Ax − b||²₂} subject to Ax = b.
Augmented Lagrangian (cont.)

Gradient ascent:

      x(t+1) = arg min_x L_ρ(x, λ(t)),
      λ(t+1) = λ(t) + ρ(Ax(t+1) − b).
Alternating Direction Method of Multipliers (ADMM)

Goal:

      min_{x,z} {f(x) + g(z)} subject to Ax + Bz = c,

where f(·) and g(·) are both convex.

- lasso-type problem:

      min (1/2)||Ax − b||²₂ + τ||z||₁, subject to x = z,
          [f(x)]             [g(z)]

- fusion-type problem:

      min (1/2) Σ_{i=1}^n (y_i − β_i)² + τ Σ_{i=2}^n |β_i − β_{i−1}|,
      subject to δ = Dβ,
          [f(β) = (1/2)||y − β||²₂, g(δ) = τ||δ||₁]

  with δ_i = β_i − β_{i−1}, i = 2, ..., n; δ = (δ_2, δ_3, ..., δ_n)'; and D the
  (n−1)×n differencing matrix

      D = [ −1  1          ··· 0
             0 −1  1           0
             ..     ..  ..    ..
             0          ··· −1 1 ].
ADMM (cont.)

Define L_ρ(x, z, λ) ≡ f(x) + g(z) + λ'(Ax + Bz − c) + (ρ/2)||Ax + Bz − c||²₂.

Augmented Lagrangian:

      (x(t+1), z(t+1)) = arg min_{x,z} L_ρ(x, z, λ(t)),
      λ(t+1) = λ(t) + ρ(Ax(t+1) + Bz(t+1) − c).

ADMM:

      x(t+1) = arg min_x L_ρ(x, z(t), λ(t)),
      z(t+1) = arg min_z L_ρ(x(t+1), z, λ(t)),
      λ(t+1) = λ(t) + ρ(Ax(t+1) + Bz(t+1) − c).

(scaled form) Replace λ by w ≡ λ/ρ, leading to

      w(t+1) = w(t) + (Ax(t+1) + Bz(t+1) − c).
Example 5: Fused Lasso

      min_β (1/2) Σ_{i=1}^n (y_i − β_i)² + τ Σ_{i=2}^n |β_i − β_{i−1}|

Re-write as

      min_{β,δ} (1/2) Σ_{i=1}^n (y_i − β_i)² + τ Σ_{i=2}^n |δ_i|,
      subject to δ_i = β_i − β_{i−1}, i = 2, ..., n.

Solve via ADMM:

      L_ρ(β, δ, λ) = (1/2)||y − β||² + τ||δ||₁ + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²,

with δ = (δ_2, δ_3, ..., δ_n)' and D the (n−1)×n differencing matrix from the
previous slide.
Example 5 (cont.)

      L_ρ(β, δ, λ) = (1/2)||y − β||² + τ||δ||₁ + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²

For β:

      min_β {(1/2)||y − β||² + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²}.

Taking the derivative w.r.t. β leads to solving

      (β − y) + D'λ + ρD'(Dβ − δ) = 0  ⇒  β ← (I + ρD'D)^{-1}[y + ρD'(δ − λ/ρ)]

For δ:

      min_δ {τ||δ||₁ + λ'(Dβ − δ) + (ρ/2)||Dβ − δ||²}.

Re-arranging the quadratic part leads to the equivalent form

      min_δ (1/2)||δ − (Dβ + λ/ρ)||² + (τ/ρ)||δ||₁  ⇒  δ ← S_{τ/ρ}(Dβ + λ/ρ)

(applied elementwise). For λ:

      λ ← λ + ρ(Dβ − δ)
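Putting the three updates together, a sketch of this fused-lasso ADMM in R (the
piecewise-constant signal, τ, and ρ are hypothetical choices):

```r
soft <- function(z, k) sign(z) * pmax(abs(z) - k, 0)   # S_k(z), elementwise
set.seed(1)
n <- 100
y <- c(rep(0, 40), rep(2, 30), rep(-1, 30)) + rnorm(n, sd = 0.4)
D <- diff(diag(n))                     # (n-1) x n differencing matrix
tau <- 2; rho <- 1
beta <- rep(0, n); delta <- rep(0, n - 1); lam <- rep(0, n - 1)
for (t in 1:200) {
  beta  <- drop(solve(diag(n) + rho * crossprod(D),
                      y + t(D) %*% (rho * delta - lam)))   # beta-step
  delta <- soft(drop(D %*% beta) + lam / rho, tau / rho)   # delta-step
  lam   <- lam + rho * (drop(D %*% beta) - delta)          # dual update
}
plot(y); lines(beta, col = 2, lwd = 2)   # piecewise-constant fit
```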
Other Optimization Methods

Goal: find an approximate global solution to a very complex (non-convex)
objective function with many local optima.
- Simulated annealing (SA)
- Genetic algorithm (GA)
- Particle swarm optimization (PSO)

Bayesian (sequential) optimization: based on Gaussian processes (Lecture 9)
References

- Convex Optimization, Boyd and Vandenberghe.
  https://web.stanford.edu/~boyd/cvxbook/
- An overview of gradient descent optimization algorithms, Sebastian Ruder.
  https://arxiv.org/abs/1609.04747
- Distributed optimization and statistical learning via the alternating direction
  method of multipliers, Boyd et al.
  https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
