
Machine Learning & Data Mining Caltech CS/CNS/EE 155

Advanced Optimization Notes Last Updated: February 22nd, 2016

1 Introduction
In this course note, we will be discussing advanced optimization techniques for machine learning. We start
with a simple review of gradient descent and stochastic gradient descent.
Let L(w) denote the objective function to be minimized during training. For instance, in machine learning, L(w) often includes a loss function summed over the training examples S as well as the regularization
penalty:
L(w) = \frac{\lambda}{2} \|w\|_2^2 + \sum_{(x_i, y_i) \in S} \ell(x_i, y_i, w),

where λ denotes the regularization strength of the (in this case) L2 regularization penalty, and ℓ(x, y, w)
denotes the training error of w on training example (x, y) (e.g., squared error on a linear model: (y − wᵀx)²).
The goal then is to solve the following unconstrained optimization problem:

\operatorname{argmin}_w L(w). \qquad (1)

We will mostly assume that L is convex.


The simplest way to solve (1) is via gradient descent:
Algorithm 1 Gradient Descent
1: Initialize w0 , typically as 0
2: t ← 1
3: Compute gradient: gt ← ∇w L(w = wt−1 )
4: Update: wt ← wt−1 − ηt gt
5: t ← t + 1
6: Repeat from Step 3 until some termination condition is met

The main design decision is the choice of step size ηt . In general, each ηt needs to be sufficiently small, or else
gradient descent will not converge; however, larger choices of ηt lead to faster convergence. The simplest
approach is to use a fixed step size.
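As a concrete illustration, the following is a minimal sketch of Algorithm 1 in Python/NumPy for the regularized squared-loss objective above. The names (gradient_descent, make_grad), the fixed step size, and the fixed iteration budget as the termination condition are our own simplifying choices, not part of the original notes.

import numpy as np

def gradient_descent(grad_L, dim, eta=0.01, num_iters=1000):
    # Algorithm 1 with a fixed step size eta and a fixed iteration budget.
    w = np.zeros(dim)            # Step 1: initialize w0 = 0
    for _ in range(num_iters):   # Steps 3-6
        g = grad_L(w)            # Step 3: gradient at the current iterate
        w = w - eta * g          # Step 4: gradient step
    return w

def make_grad(X, y, lam):
    # Gradient of L(w) = (lam/2)*||w||^2 + sum_i (y_i - w^T x_i)^2,
    # where X is an N x d input matrix and y an N-vector of targets.
    def grad_L(w):
        return lam * w + 2 * X.T @ (X @ w - y)
    return grad_L

For example, gradient_descent(make_grad(X, y, lam=1.0), dim=X.shape[1]) runs 1000 fixed-step iterations on a regularized least-squares problem.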
Another way, which is often more efficient in practice, is stochastic gradient descent:
Algorithm 2 Stochastic Gradient Descent
1: Initialize w0 , typically as 0
2: t ← 1
3: Compute stochastic gradient: gt s.t. E[gt ] = ∇w L(w = wt−1 )
4: Update: wt ← wt−1 − ηt gt
5: t ← t + 1
6: Repeat from Step 3 until some termination condition is met

The main difference between stochastic gradient descent and gradient descent is in Step 3, where the gradient used
gt is now a stochastic gradient. In practice, virtually all objective functions can be decomposed additively:
L(w) = \sum_{i=1}^{N} L_i(w).

For instance, the most straightforward decomposition is:

L_i(w) = \frac{\lambda}{2N} \|w\|_2^2 + \ell(x_i, y_i, w).

1
Machine Learning & Data Mining Caltech CS/CNS/EE 155
Advanced Optimization Notes Last Updated: Feburary 22nd, 2016

For any given decomposition, stochastic gradient descent chooses an Li randomly at Step 3. It is easy to
verify that
\mathbb{E}[g_t] = \mathbb{E}_i\left[\nabla_w L_i(w = w_{t-1})\right] = \frac{1}{N} \nabla_w L(w = w_{t-1}),
where N is the number of components in the decomposition of L. In practice, one often loops through all the
Li ’s in some order. The key benefit of stochastic gradient descent over gradient descent is that it requires
much less computation time (e.g., only processing a few training examples rather than the entire training
set). However, the gradient is noisier, and so typically the step size ηt is much smaller than that for standard
gradient descent. The term mini-batch SGD is often used to refer to the setting where each Li contains a
small batch of training data (e.g., 10-500 samples).
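Below is a corresponding sketch of mini-batch SGD for the same squared-loss objective, again in Python/NumPy with names of our own choosing (minibatch_sgd, batch_size, num_epochs). Each mini-batch component Li carries a (λb/2N)‖w‖₂² share of the regularizer, where b is the batch size, matching the decomposition above.

import numpy as np

def minibatch_sgd(X, y, lam, eta=0.001, batch_size=100, num_epochs=10):
    # Algorithm 2 where each L_i holds one mini-batch of training examples.
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        order = np.random.permutation(N)  # loop through the L_i in random order
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Step 3: stochastic gradient of one mini-batch component L_i(w)
            g = (lam * len(idx) / N) * w + 2 * Xb.T @ (Xb @ w - yb)
            w = w - eta * g               # Step 4: update
    return w

Note the smaller default step size than in the gradient descent sketch above, reflecting the noisier gradient estimates.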

2 Accelerated Gradient Descent


Nesterov’s accelerated gradient descent essentially adds a momentum term to the gradient descent procedure. The general form will look something like this:
1. Compute gradient gt
2. Update momentum term mt
3. Update model wt using both gt and mt
We first present Nesterov’s original method, which was designed for gradient descent.
Algorithm 3 Nesterov’s Accelerated Gradient Descent
1: Initialize w0 and v0 , typically as 0
2: Initialize α0 ← 1
3: t ← 1
4: Compute gradient: gt ← ∇w L(w = wt−1 )
5: Update to intermediate value: vt ← wt−1 − ηt gt
6: αt ← (1 + √(1 + 4(αt−1)²)) / 2
7: γt ← (1 − αt−1) / αt
8: Update with momentum: wt ← (1 − γt )vt + γt vt−1
9: t←t+1
10: Repeat from Step 4 until some termination condition is met

In other words, Nesterov’s method keeps track of an intermediate solution vt that is the direct result of
gradient descent (see Line 5). However, the model update rule combines both the immediate gradient
and a momentum term (Line 8). One can rewrite Line 8 as:

wt ← vt + γt (vt−1 − vt ) = vt − γt (vt − vt−1 ), (2)

which can be interpreted as moving further along the direction of vt − vt−1 by a magnitude of −γt . Note
that γ1 = 0, and that γt very quickly converges to −1. So in the limit, (2) behaves as:

wt ← vt + (vt − vt−1 ). (3)

So to summarize:
• vt ← wt−1 − ηt ∇w L(w = wt−1 ) is computed via the standard gradient descent update rule.


• wt ← vt − γt (vt − vt−1 ) is computed via updating vt with the momentum term (vt − vt−1 ) scaled by
−γt (where γt ≤ 0).
Of course there are many different ways to choose a momentum term, so why this particular choice? It was
proven in [1] that
L(v_t) - L(w^*) \le \frac{2 \|w_0 - w^*\|^2}{\eta t^2},

where w∗ is the minimizer of L, and η lower bounds all ηt.¹ In other words, for any error tolerance ε > 0,
it takes O(1/√ε) time steps to achieve L(vt ) − L(w∗ ) ≤ ε. This can be much faster than the general O(1/ε)
convergence rate of ordinary gradient descent for differentiable convex functions.
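A minimal sketch of Algorithm 3 in Python/NumPy follows; the names (nesterov_agd, grad_L), the fixed step size, and the iteration budget are our own simplifying assumptions.

import numpy as np

def nesterov_agd(grad_L, dim, eta=0.01, num_iters=1000):
    # Algorithm 3: gradient descent with Nesterov momentum.
    w = np.zeros(dim)        # Step 1: w0
    v_prev = np.zeros(dim)   # Step 1: v0
    alpha = 1.0              # Step 2: alpha_0
    for _ in range(num_iters):
        g = grad_L(w)                                     # Step 4: gradient
        v = w - eta * g                                   # Step 5: intermediate value
        alpha_next = (1 + np.sqrt(1 + 4 * alpha**2)) / 2  # Step 6
        gamma = (1 - alpha) / alpha_next                  # Step 7
        w = (1 - gamma) * v + gamma * v_prev              # Step 8: momentum update
        alpha, v_prev = alpha_next, v
    return w

Since gamma quickly approaches −1, the update soon behaves like w ← v + (v − v_prev), as in (3).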

2.1 Accelerated Stochastic Gradient Descent


There is no universally accepted way to incorporate Nesterov’s method into stochastic gradient descent.
One straightforward approach is to do the momentum update periodically, e.g., once every 10-500 rounds
of mini-batch SGD. One version is shown below, with B denoting the number of rounds to iterate per
momentum update.
Algorithm 4 Nesterov’s Accelerated Stochastic Gradient Descent
1: Initialize w0 and v0 , typically as 0
2: Initialize α0 ← 1
3: t ← 1
4: Compute stochastic gradient: gt s.t. E[gt ] = ∇w L(w = wt−1 )
5: if mod(t, B) = 0 then
6: Update to intermediate value: vt ← wt−1 − ηt gt
7: αt ← (1 + √(1 + 4(αt−1)²)) / 2
8: γt ← (1 − αt−1) / αt
9: Update with momentum: wt ← (1 − γt )vt + γt vt−1
10: else
11: Standard SGD update: wt ← wt−1 − ηt gt
12: vt ← vt−1
13: end if
14: t←t+1
15: Repeat from Step 4 until some termination condition is met

If B = 1, then the momentum is applied every round. However, this can often lead to unstable estimates
since the individual stochastic gradients can be quite noisy. By averaging over, say, 100 rounds of stochastic
gradients, the momentum estimate becomes more stable.
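One possible rendering of Algorithm 4 in Python/NumPy is sketched below. The stoch_grad argument is assumed to return an unbiased estimate of ∇w L at the current iterate; the names, fixed step size, and iteration budget are ours.

import numpy as np

def accelerated_sgd(stoch_grad, dim, eta=0.001, B=100, num_iters=10000):
    # Algorithm 4: mini-batch SGD with a momentum update once every B rounds.
    w = np.zeros(dim)
    v_prev = np.zeros(dim)
    alpha = 1.0
    for t in range(1, num_iters + 1):
        g = stoch_grad(w)                     # Step 4: E[g] = gradient of L at w
        if t % B == 0:                        # Steps 5-9: momentum round
            v = w - eta * g
            alpha_next = (1 + np.sqrt(1 + 4 * alpha**2)) / 2
            gamma = (1 - alpha) / alpha_next
            w = (1 - gamma) * v + gamma * v_prev
            alpha, v_prev = alpha_next, v
        else:                                 # Steps 11-12: standard SGD round
            w = w - eta * g
    return w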

3 Relationship Between Gradient Descent and Proximal Updates


Let us consider an alternative form of iterative optimization:

w_t = \operatorname{argmin}_{w'} \, L(w') + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. \qquad (4)
¹ In the case where each ηt is constant, then η = ηt .


Here, the idea is that we want wt to be the best solution to L that is not too far from the previous wt−1 . So
long as 1/ηt is sufficiently large, the so-called “proximal” term will dominate whenever w′ is far from wt−1 .
This form of the optimization problem might seem a bit circular, since if we could solve (4) directly we could
just as well solve the original problem. However, we shall see later that this form has some very nice properties.
We can show that a variant of (4) is equivalent to gradient descent. By properties of convex functions,
we can lower bound (4) as:
L(w') + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2 \;\ge\; L(w_{t-1}) + \nabla_w L(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. \qquad (5)

The closer w′ is to wt−1 , the smaller the gap, and at w′ = wt−1 , (5) becomes an equality.
One can thus consider a variant of (4), which can be thought of as the linear approximation to (4) at
wt−1 :
\operatorname{argmin}_{w'} \, \nabla_w L(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. \qquad (6)

Taking the derivative of (6) and setting it to 0 yields:

0 = \nabla_w L(w = w_{t-1}) + \frac{1}{\eta_t} (w' - w_{t-1}) \;\Rightarrow\; w_t = w_{t-1} - \eta_t \nabla_w L(w = w_{t-1}),

which is exactly the gradient descent update rule. In other words, the gradient descent update rule is the
closed-form solution to (6), which is in turn the linear approximation to (4) at wt−1 . So one can think of
(4) and (6) as a generalization of gradient descent. Specifically, gradient descent is the closed-form solution
to (6) whenever L is differentiable.
The more general setting can be thought of as:
Algorithm 5 Proximal Updates
1: Initialize w0 , typically as 0
2: t ← 1
3: Update: wt ← argminw′ (4) or argminw′ (6)
4: t ← t + 1
5: Repeat from Step 3 until some termination condition is met
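As a quick sanity check on (4), consider a worked example of our own (not from the original notes): take L(w) = \frac{1}{2}\|w - a\|_2^2 for a fixed vector a. Setting the derivative of the objective in (4) to 0 gives

(w_t - a) + \frac{1}{\eta_t}(w_t - w_{t-1}) = 0 \;\Rightarrow\; w_t = \frac{\eta_t a + w_{t-1}}{1 + \eta_t},

so the proximal update moves wt−1 a fraction ηt /(1 + ηt ) of the way toward the minimizer a: a small ηt keeps wt close to wt−1 , while a large ηt lets it jump nearly all the way to a.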

4 Iterative Soft-Thresholding for Non-Differentiable Functions


We now consider the case where we can decompose L into differentiable and non-differentiable components:
L(w) = G(w) + H(w),
where G is differentiable and H is not differentiable. For example, G can be a differentiable loss function
summed over all the training examples:
G(w) = \sum_{(x_i, y_i) \in S} \ell(x_i, y_i, w),

and H can be the L1 penalty:


H(w) = \lambda \|w\|_1.


We will show how to solve such an L using the Iterative Soft-Thresholding Algorithm (ISTA). We do so by
alternating between solving G and H. For the differentiable part G, we solve for the next step using (6),
which yields a closed form solution of:
v_t \leftarrow \operatorname{argmin}_{w'} \, \nabla_w G(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2 \;\equiv\; w_{t-1} - \eta_t \nabla_w G(w = w_{t-1}). \qquad (7)
Afterwards, for the non-differentiable part H, we instead solve using (4), which yields:
w_t \leftarrow \operatorname{argmin}_{w'} \, H(w') + \frac{1}{2\eta_t} \|w' - v_t\|_2^2, \qquad (8)
where vt was the intermediate solution to the gradient update of G (7).
It turns out that for many commonly used non-differentiable functions H, (8) has a closed-form solution.
Differentiating (8) and setting it to 0 yields:
0 = \nabla_w H(w = w') + \frac{1}{\eta_t} (w' - v_t) \;\Rightarrow\; \nabla_w \left[ \eta_t H(w = w') \right] + w' = v_t. \qquad (9)
Consider L1 regularization H(w) = λ\|w\|_1. We can write the sub-differential of αH (for any positive
constant α) component-wise as:

\nabla_w \left[ \alpha\lambda \|w\|_1 \right] =
\begin{cases}
-\alpha\lambda & \text{if } w < 0 \\
[-\alpha\lambda, \alpha\lambda] & \text{if } w = 0 \\
\alpha\lambda & \text{if } w > 0
\end{cases}

In other words, at w = 0 there is a continuous range of sub-differentials, and any choice that satisfies (9)
gives an optimal solution. We can thus write the closed-form solution to (9) component-wise as:

w_t =
\begin{cases}
v_t + \eta_t\lambda & \text{if } v_t \le -\eta_t\lambda \\
0 & \text{if } -\eta_t\lambda < v_t < \eta_t\lambda \\
v_t - \eta_t\lambda & \text{if } v_t \ge \eta_t\lambda
\end{cases}
\qquad (10)

The above update rule is known as “soft-thresholding”. The entire algorithm is thus:
Algorithm 6 Iterative Soft-Thresholding Algorithm (ISTA) for optimizing L1 regularized training objectives
1: Initialize w0 , typically as 0
2: Decompose L = G + H, where H(w) = λ‖w‖1
3: t ← 1
4: Update: vt ← argminw′ (6) applied to G; the solution is given in (7)
5: Update: wt ← argminw′ (8); the solution is given in (10)
6: t ← t + 1
7: Repeat from Step 4 until some termination condition is met
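Here is a minimal sketch of Algorithm 6 in Python/NumPy. The soft_threshold helper implements the closed-form solution (10); grad_G is assumed to supply ∇w G, and all names and default values are our own.

import numpy as np

def soft_threshold(v, tau):
    # Closed-form solution (10): shrink each component of v toward 0 by tau,
    # zeroing out any component with |v_j| < tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(grad_G, dim, lam, eta=0.01, num_iters=1000):
    # Algorithm 6 for L(w) = G(w) + lam * ||w||_1.
    w = np.zeros(dim)
    for _ in range(num_iters):
        v = w - eta * grad_G(w)            # Step 4: gradient step on G, as in (7)
        w = soft_threshold(v, eta * lam)   # Step 5: proximal step on H, as in (10)
    return w

Each iteration costs one gradient of G plus an elementwise shrinkage, and the shrinkage is what produces exact zeros in w.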

Of course, one can also use standard sub-gradient descent to optimize L1-regularized training objectives.
However, the goal of using L1 regularization is to obtain a sparse solution w (few non-zeros), and standard
(sub-)gradient descent will not actually produce a sparse solution (you can try it yourself). ISTA is
guaranteed to produce a sparse solution because it zeros out small components of w.
The above algorithm is a form of gradient descent, and requires differentiating the entire G each iteration. In
practice, one often optimizes G using stochastic gradient descent. In that case, one straightforward approach
is to do some number of rounds of mini-batch SGD on G, followed by a soft-thresholding step. Just keep
in mind that the step sizes ηt need to be adjusted. If you do B rounds of mini-batch SGD on G for each
soft-thresholding step, then you need to set the soft-thresholding step-size to Bηt .


4.1 Fast Iterative Soft-Thresholding Algorithm


The new approach presented in [1] is actually the Fast Iterative Soft-Thresholding Algorithm (FISTA), which
combines ISTA with Nesterov’s accelerated gradient descent.
Algorithm 7 Fast Iterative Soft-Thresholding Algorithm (FISTA) for optimizing L1 regularized objectives
1: Initialize w0 and u0 , typically as 0
2: Decompose L = G + H, where H(w) = λ‖w‖1
3: Initialize α0 ← 1
4: t ← 1
5: Update: vt ← argminw′ (6) applied to G; the solution is given in (7)
6: Update: ut ← argminw′ (8); the solution is given in (10)
7: αt ← (1 + √(1 + 4(αt−1)²)) / 2
8: γt ← (1 − αt−1) / αt
9: Update: wt ← (1 − γt )ut + γt ut−1
10: t←t+1
11: Repeat from Step 5 until some termination condition is met
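A sketch of Algorithm 7 in Python/NumPy, under the same assumptions as the ISTA sketch above (grad_G supplies ∇w G; names and defaults are ours):

import numpy as np

def fista(grad_G, dim, lam, eta=0.01, num_iters=1000):
    # Algorithm 7: ISTA with Nesterov momentum on the thresholded iterates.
    w = np.zeros(dim)
    u_prev = np.zeros(dim)   # u0
    alpha = 1.0              # alpha_0
    for _ in range(num_iters):
        v = w - eta * grad_G(w)                                  # Step 5, eq. (7)
        u = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)  # Step 6, eq. (10)
        alpha_next = (1 + np.sqrt(1 + 4 * alpha**2)) / 2         # Step 7
        gamma = (1 - alpha) / alpha_next                         # Step 8
        w = (1 - gamma) * u + gamma * u_prev                     # Step 9: momentum
        alpha, u_prev = alpha_next, u
    return w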

References
[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
