Optimization
1 Introduction
In this course note, we will be discussing advanced optimization techniques for machine learning. We start
with a simple review of gradient descent and stochastic gradient descent.
Let L(w) denote the objective function to be minimized during training. For instance, in machine learn-
ing, L(w) often includes a loss function summed over the training examples S as well as the regularization
penalty:
L(w) = \frac{\lambda}{2} \|w\|_2^2 + \sum_{(x_i, y_i) \in S} \ell(x_i, y_i, w),

where λ denotes the regularization strength of the (in this case) L2 regularization penalty, and ℓ(x, y, w)
denotes the training error of w on training example (x, y) (e.g., squared error for a linear model: (y − w^T x)^2).
The goal then is to solve the following unconstrained optimization problem:

\min_w L(w).

Gradient descent solves this problem by iteratively updating wt ← wt−1 − ηt ∇w L(w = wt−1), starting from some initialization w0 (typically 0).
The main design decision is the choice of step size ηt. In general, each ηt needs to be sufficiently small, or else
gradient descent will not converge. However, larger choices of ηt lead to faster convergence. The simplest
approach is to use a fixed step size.
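To make this concrete, here is a minimal sketch of gradient descent with a fixed step size (in Python/NumPy; the squared-error objective, synthetic data, and step size are illustrative assumptions, not prescriptions from these notes):

import numpy as np

def grad_L(w, X, y, lam):
    # Gradient of L(w) = (lam/2)*||w||^2 + sum_i (y_i - w^T x_i)^2
    return lam * w - 2.0 * X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # 100 training examples, 5 features
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0])    # synthetic targets

w = np.zeros(5)                                 # initialize w_0 = 0
eta = 1e-3                                      # fixed step size
for t in range(1000):
    w = w - eta * grad_L(w, X, y, lam=0.1)      # w_t <- w_{t-1} - eta * grad L(w_{t-1})

With this data, a much larger fixed step size (say 0.1) makes the iterates diverge, while smaller values converge more slowly; this is the trade-off described above.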
Another approach, which is often more efficient in practice, is stochastic gradient descent, shown in Algorithm 2.
Algorithm 2 Stochastic Gradient Descent
1: Initialize w0 , typically as 0
2: t ← 1
3: Compute stochastic gradient: gt s.t. E[gt ] = ∇w L(w = wt−1 )
4: Update: wt ← wt−1 − ηt gt
5: t ← t + 1
6: Repeat from Step 3 until some termination condition is met
The main difference between stochastic gradient descent and gradient descent is in Step 3, where the gradient
gt is now a stochastic gradient. In practice, virtually all objective functions can be decomposed additively:

L(w) = \sum_{i=1}^{N} L_i(w).

For instance, for the regularized objective above, each component can be written as:

L_i(w) = \frac{\lambda}{2N} \|w\|_2^2 + \ell(x_i, y_i, w).
For any given decomposition, stochastic gradient descent chooses an Li randomly at Step 3. It is easy to
verify that
E[g_t] = E_i[\nabla_w L_i(w = w_{t-1})] = \frac{1}{N} \nabla_w L(w = w_{t-1}),
where N is the number of components in the decomposition of L. In practice, one often loops through all the
Li's in some order. The key benefit of stochastic gradient descent over gradient descent is that each update
requires much less computation (e.g., processing only a few training examples rather than the entire training
set). However, the gradient estimate is noisier, so the step size ηt is typically much smaller than for standard
gradient descent. The term mini-batch SGD is often used to refer to the setting where each Li contains a
small batch of training data (e.g., 10-500 samples).
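As a sketch of mini-batch SGD on the decomposition above (Python/NumPy; the batch size, data, and step size are illustrative assumptions), each step sums the gradients of a randomly chosen batch of Li's:

import numpy as np

def batch_gradient(w, Xb, yb, lam, N):
    # Gradient of sum_{i in batch} L_i(w), where L_i(w) = (lam/(2N))*||w||^2 + (y_i - w^T x_i)^2
    b = Xb.shape[0]
    return (lam * b / N) * w - 2.0 * Xb.T @ (yb - Xb @ w)

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d)

w = np.zeros(d)
eta = 1e-3                                      # typically smaller than for full gradient descent
for t in range(2000):
    idx = rng.choice(N, size=50, replace=False)   # Step 3: a random mini-batch defines the stochastic gradient
    w = w - eta * batch_gradient(w, X[idx], y[idx], lam=0.1, N=N)   # Step 4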
Nesterov's method keeps track of an intermediate solution vt that is the direct result of
gradient descent (see Line 5). However, the model update rule combines both the immediate
gradient and a momentum term (Line 8). One can rewrite Line 8 as:

w_t \leftarrow v_t - \gamma_t (v_t - v_{t-1}), (2)

which can be interpreted as moving further along the direction of vt − vt−1 by a magnitude of −γt. Note
that γ1 = 0, and that γt very quickly converges to −1. So in the limit, (2) behaves as:

w_t \leftarrow v_t + (v_t - v_{t-1}) = 2 v_t - v_{t-1}.
So to summarize:
• vt ← wt−1 − ηt ∇w L(w = wt−1) is computed via the standard gradient descent update rule.
• wt ← vt − γt (vt − vt−1 ) is computed via updating vt with the momentum term (vt − vt−1 ) scaled by
−γt (where γt ≤ 0).
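Putting the two steps together, a sketch of the update loop might look as follows (Python/NumPy). The particular schedule for γt below is the one used by FISTA [1] and satisfies γ1 = 0 and γt → −1; it is an illustrative choice, as are the data and step size:

import numpy as np

def grad_L(w, X, y, lam):
    # Gradient of L(w) = (lam/2)*||w||^2 + sum_i (y_i - w^T x_i)^2
    return lam * w - 2.0 * X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

eta = 1e-3
w = np.zeros(5)
v_prev = w.copy()
theta = 1.0                                        # momentum schedule parameter (an assumption here)
for t in range(1, 500):
    v = w - eta * grad_L(w, X, y, lam=0.1)         # standard gradient descent step: v_t
    theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
    gamma = (1.0 - theta) / theta_next             # gamma_1 = 0, and gamma_t -> -1
    w = v - gamma * (v - v_prev)                   # momentum step: w_t = v_t - gamma_t (v_t - v_{t-1})
    v_prev, theta = v, theta_next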
Of course there are many different ways to choose a momentum term, so why this particular choice? It was
proven in [1] that
L(v_t) - L(w^*) \le \frac{2 \|w_0 - w^*\|^2}{\eta t^2},

where w∗ is the minimizer of L, and η lower bounds all ηt.¹ In other words, for any error tolerance ε > 0,
it takes O(1/√ε) time steps to achieve L(vt) − L(w∗) ≤ ε. This can be much faster than the general O(1/ε)
convergence rate of ordinary gradient descent for differentiable convex functions.
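The O(1/√ε) claim follows directly by setting the right-hand side of the bound equal to ε and solving for t:

\frac{2 \|w_0 - w^*\|^2}{\eta t^2} \le \epsilon
\quad\Longleftrightarrow\quad
t \ge \sqrt{\frac{2 \|w_0 - w^*\|^2}{\eta \epsilon}} = O\!\left(\frac{1}{\sqrt{\epsilon}}\right).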
If B = 1, then the momentum is applied every round. However, this can often lead to unstable estimates,
since the individual stochastic gradients can be quite noisy. By averaging over, say, 100 rounds of stochastic
gradients, the momentum estimate becomes more stable.
Another way to view each update is as the solution to a proximal optimization problem:

w_t = \operatorname{argmin}_{w'} L(w') + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. (4)
1 In the case where each ηt is constant, then η = ηt .
Here, the idea is that we want wt to be the best solution to L that is not too far from the previous wt−1. So
long as 1/ηt is sufficiently large, the so-called "proximal" term will dominate whenever w′ is far from wt−1.
This form of the optimization problem might seem a bit circular, since if we could solve (4) directly, then we
could presumably solve the original problem directly as well. However, we shall see later that this form has some very nice properties.
We can show that a variant of (4) is equivalent to gradient descent. By properties of convex functions,
we can lower bound the objective in (4) as:

L(w') + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2 \ge L(w_{t-1}) + \nabla_w L(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. (5)

The closer w′ is to wt−1, the smaller the gap, and at w′ = wt−1, (5) becomes an equality.
One can thus consider a variant of (4), which can be thought of as the linear approximation to (4) at
wt−1:

\operatorname{argmin}_{w'} \nabla_w L(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. (6)
Differentiating the objective in (6) with respect to w′ and setting it to 0 yields:

0 = \nabla_w L(w = w_{t-1}) + \frac{1}{\eta_t} (w' - w_{t-1}) \;\Rightarrow\; w_t = w_{t-1} - \eta_t \nabla_w L(w = w_{t-1}),
which is exactly the gradient descent update rule. In other words, the gradient descent update rule is the
closed-form solution to (6), which is in turn the linear approximation to (4) at wt−1. So one can think of
(4) and (6) as generalizations of gradient descent. Specifically, gradient descent is the closed-form solution
to (6) whenever L is differentiable.
The more general setting can be thought of as:
Algorithm 5 Proximal Updates
1: Initialize w0 , typically as 0
2: t ← 1
3: Update: wt ← argmin_{w′} (4) or argmin_{w′} (6)
4: t ← t + 1
5: Repeat from Step 3 until some termination condition is met
Now consider objectives that decompose as L = G + H, where G is differentiable but H is not (e.g., L1
regularization: H(w) = λ‖w‖1). We will show how to solve such an L using the Iterative Soft-Thresholding
Algorithm (ISTA). We do so by alternating between an update based on G and an update based on H. For the
differentiable part G, we solve for the next step using (6), which yields a closed-form solution of:
v_t \leftarrow \operatorname{argmin}_{w'} \nabla_w G(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2 = w_{t-1} - \eta_t \nabla_w G(w = w_{t-1}). (7)
Afterwards, for the non-differentiable part H, we instead solve using (4), which yields:
w_t \leftarrow \operatorname{argmin}_{w'} H(w') + \frac{1}{2\eta_t} \|w' - v_t\|_2^2, (8)
where vt was the intermediate solution to the gradient update of G (7).
It turns out that for many commonly used non-differentiable functions, H, (8) has a closed-form solu-
tion. Differentiating (8) and setting it to 0 yields:
0 = \nabla_w H(w = w') + \frac{1}{\eta_t} (w' - v_t) \;\Rightarrow\; \nabla_w [\eta_t H(w = w')] + w' = v_t. (9)
Consider L1 regularization H(w) = λ‖w‖1. We can write the sub-differential of αH (for any positive
constant α) component-wise as:

\nabla_w \, \alpha\lambda \|w\|_1 = \begin{cases} -\alpha\lambda & \text{if } w < 0 \\ [-\alpha\lambda, \alpha\lambda] & \text{if } w = 0 \\ \alpha\lambda & \text{if } w > 0 \end{cases}

where w here denotes a single component of the weight vector.
In other words, when a component of w is exactly 0, there is a continuous range of sub-differentials, and any
one of them that satisfies (9) gives an optimal solution. For example, if vt > ηtλ, then w′ = vt − ηtλ > 0 satisfies (9);
and if |vt| ≤ ηtλ, then w′ = 0 satisfies (9) because the sub-differential at 0 contains vt. We can thus write the
closed-form solution to (9) component-wise as:
w_t = \begin{cases} v_t + \eta_t \lambda & \text{if } v_t \le -\eta_t \lambda \\ 0 & \text{if } -\eta_t \lambda < v_t < \eta_t \lambda \\ v_t - \eta_t \lambda & \text{if } v_t \ge \eta_t \lambda \end{cases} (10)
The above update rule is known as “soft-thresholding”. The entire algorithm is thus:
Algorithm 6 Iterative Soft-Thresholding Algorithm (ISTA) for optimizing L1 regularized training objectives
1: Initialize w0 , typically as 0
2: Decompose L = G + H, where H(w) = λ‖w‖1 and G is the differentiable part
3: t ← 1
4: Update: vt ← argmin_{w′} (6) applied to G; the solution is given in (7)
5: Update: wt ← argmin_{w′} (4) applied to H (i.e., (8)); the solution is given in (10)
6: t ← t + 1
7: Repeat from Step 4 until some termination condition is met
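For concreteness, here is a sketch of Algorithm 6 for an L1-regularized least-squares objective (Python/NumPy; the data, λ, and step size are illustrative assumptions, and the step size is held constant rather than tuned):

import numpy as np

def soft_threshold(v, thresh):
    # Component-wise soft-thresholding: the closed-form solution (10)
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def ista(X, y, lam, eta, iters=500):
    w = np.zeros(X.shape[1])                       # Step 1: initialize w_0 = 0
    for t in range(iters):
        grad_G = -2.0 * X.T @ (y - X @ w)          # gradient of G(w) = sum_i (y_i - w^T x_i)^2
        v = w - eta * grad_G                       # Step 4: gradient step on G, as in (7)
        w = soft_threshold(v, eta * lam)           # Step 5: proximal step on H(w) = lam*||w||_1, as in (10)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                      # sparse ground truth
y = X @ w_true + 0.05 * rng.normal(size=200)
w_hat = ista(X, y, lam=5.0, eta=1e-3)              # most entries of w_hat come out exactly zero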
Of course, one can also use standard sub-gradient descent to optimize L1-regularized training ob-
jectives. However, the goal of using L1 regularization is to obtain a sparse solution w (few non-zeros),
and standard sub-gradient descent will not actually produce a sparse solution (you can try it yourself).
ISTA is guaranteed to produce a sparse solution because it zeros out small components of w.
The algorithm above takes full gradient steps on G, and thus requires differentiating the entire G at each iteration. In prac-
tice, one often optimizes G using stochastic gradient descent instead. In that case, one straightforward approach
is to do some number of rounds of mini-batch SGD on G, followed by a soft-thresholding step. Just keep
in mind that the step sizes ηt need to be adjusted: if you do B rounds of mini-batch SGD on G for each
soft-thresholding step, then you need to set the soft-thresholding step size to Bηt.
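A sketch of that stochastic variant (Python/NumPy; B, the batch size, the data, and the step size are illustrative assumptions): run B mini-batch SGD steps on G, then soft-threshold once with the scaled step size Bηt:

import numpy as np

def soft_threshold(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

rng = np.random.default_rng(0)
N, d = 1000, 20
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.05 * rng.normal(size=N)

w = np.zeros(d)
eta, lam, B, batch = 2e-4, 5.0, 10, 50
for outer in range(200):
    for _ in range(B):                                       # B rounds of mini-batch SGD on G
        idx = rng.choice(N, size=batch, replace=False)
        grad_G = -2.0 * (N / batch) * X[idx].T @ (y[idx] - X[idx] @ w)   # unbiased estimate of grad G
        w = w - eta * grad_G
    w = soft_threshold(w, B * eta * lam)                     # one soft-thresholding step with step size B*eta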
References
[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.