Optimization
1 Introduction
In this course note, we will be discussing advanced optimization techniques for machine learning. We start
with a simple review of gradient descent and stochastic gradient descent.
Let L(w) denote the objective function to be minimized during training. For instance, in machine learn-
ing, L(w) often includes a loss function summed over the training examples S as well as the regularization
penalty:
L(w) = \frac{\lambda}{2} \|w\|_2^2 + \sum_{(x_i, y_i) \in S} \ell(x_i, y_i, w),

where λ denotes the regularization strength of the (in this case) L2 regularization penalty, and ℓ(x, y, w)
denotes the training error of w on training example (x, y) (e.g., squared error for a linear model: (y − w^T x)^2).
The goal then is to solve the following unconstrained optimization problem:

\min_w L(w).

Gradient descent solves this problem by iteratively updating wt ← wt−1 − ηt ∇w L(w = wt−1), starting from some initialization w0 (typically 0).
The main design decision is the choice of step size ηt. In general, each ηt needs to be sufficiently small, or else
gradient descent will not converge. However, larger choices of ηt lead to faster convergence. The simplest
approach is to use a fixed step size.
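To make this concrete, here is a minimal sketch of gradient descent with a fixed step size (in Python/NumPy; the squared-error objective, synthetic data, and step size are illustrative assumptions, not prescriptions from these notes):

import numpy as np

def grad_L(w, X, y, lam):
    # Gradient of L(w) = (lam/2)*||w||^2 + sum_i (y_i - w^T x_i)^2
    return lam * w - 2.0 * X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # 100 training examples, 5 features
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0])    # synthetic targets

w = np.zeros(5)                                 # initialize w_0 = 0
eta = 1e-3                                      # fixed step size
for t in range(1000):
    w = w - eta * grad_L(w, X, y, lam=0.1)      # w_t <- w_{t-1} - eta * grad L(w_{t-1})

With this data, a much larger fixed step size (say 0.1) makes the iterates diverge, while smaller values converge more slowly; this is the trade-off described above.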
Another approach, which is often more efficient in practice, is stochastic gradient descent, shown in Algorithm 2.
Algorithm 2 Stochastic Gradient Descent
1: Initialize w0 , typically as 0
2: t ← 1
3: Compute stochastic gradient: gt s.t. E[gt ] = ∇w L(w = wt−1 )
4: Update: wt ← wt−1 − ηt gt
5: t ← t + 1
6: Repeat from Step 3 until some termination condition is met
The main difference between stochastic gradient descent and gradient descent is in Step 3, where the gradient
gt is now a stochastic gradient. In practice, virtually all objective functions can be decomposed additively:

L(w) = \sum_{i=1}^{N} L_i(w).

For instance, for the regularized objective above, each component can be written as:

L_i(w) = \frac{\lambda}{2N} \|w\|_2^2 + \ell(x_i, y_i, w).
For any given decomposition, stochastic gradient descent chooses an Li randomly at Step 3. It is easy to
verify that
E[g_t] = E_i[\nabla_w L_i(w = w_{t-1})] = \frac{1}{N} \nabla_w L(w = w_{t-1}),
where N is the number of components in the decomposition of L. In practice, one often loops through all the
Li's in some order. The key benefit of stochastic gradient descent over gradient descent is that each update
requires much less computation (e.g., processing only a few training examples rather than the entire training
set). However, the gradient estimate is noisier, so the step size ηt is typically much smaller than for standard
gradient descent. The term mini-batch SGD is often used to refer to the setting where each Li contains a
small batch of training data (e.g., 10-500 samples).
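As a sketch of mini-batch SGD on the decomposition above (Python/NumPy; the batch size, data, and step size are illustrative assumptions), each step sums the gradients of a randomly chosen batch of Li's:

import numpy as np

def batch_gradient(w, Xb, yb, lam, N):
    # Gradient of sum_{i in batch} L_i(w), where L_i(w) = (lam/(2N))*||w||^2 + (y_i - w^T x_i)^2
    b = Xb.shape[0]
    return (lam * b / N) * w - 2.0 * Xb.T @ (yb - Xb @ w)

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d)

w = np.zeros(d)
eta = 1e-3                                      # typically smaller than for full gradient descent
for t in range(2000):
    idx = rng.choice(N, size=50, replace=False)   # Step 3: a random mini-batch defines the stochastic gradient
    w = w - eta * batch_gradient(w, X[idx], y[idx], lam=0.1, N=N)   # Step 4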
Nesterov's method keeps track of an intermediate solution vt that is the direct result of
gradient descent (see Line 5). However, the model update rule combines both the immediate
gradient and a momentum term (Line 8). One can rewrite Line 8 as:

w_t \leftarrow v_t - \gamma_t (v_t - v_{t-1}), (2)

which can be interpreted as moving further along the direction of vt − vt−1 by a magnitude of −γt. Note
that γ1 = 0, and that γt very quickly converges to −1. So in the limit, (2) behaves as:

w_t \leftarrow v_t + (v_t - v_{t-1}) = 2 v_t - v_{t-1}.
So to summarize:
• vt ← wt−1 − ηt ∇w L(w = wt−1) is computed via the standard gradient descent update rule.
• wt ← vt − γt (vt − vt−1 ) is computed via updating vt with the momentum term (vt − vt−1 ) scaled by
−γt (where γt ≤ 0).
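Putting the two steps together, a sketch of the update loop might look as follows (Python/NumPy). The particular schedule for γt below is the one used by FISTA [1] and satisfies γ1 = 0 and γt → −1; it is an illustrative choice, as are the data and step size:

import numpy as np

def grad_L(w, X, y, lam):
    # Gradient of L(w) = (lam/2)*||w||^2 + sum_i (y_i - w^T x_i)^2
    return lam * w - 2.0 * X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

eta = 1e-3
w = np.zeros(5)
v_prev = w.copy()
theta = 1.0                                        # momentum schedule parameter (an assumption here)
for t in range(1, 500):
    v = w - eta * grad_L(w, X, y, lam=0.1)         # standard gradient descent step: v_t
    theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
    gamma = (1.0 - theta) / theta_next             # gamma_1 = 0, and gamma_t -> -1
    w = v - gamma * (v - v_prev)                   # momentum step: w_t = v_t - gamma_t (v_t - v_{t-1})
    v_prev, theta = v, theta_next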
Of course there are many different ways to choose a momentum term, so why this particular choice? It was
proven in [1] that
L(v_t) - L(w^*) \le \frac{2 \|w_0 - w^*\|^2}{\eta t^2},

where w∗ is the minimizer of L, and η lower bounds all ηt.¹ In other words, for any error tolerance ε > 0,
it takes O(1/√ε) time steps to achieve L(vt) − L(w∗) ≤ ε. This can be much faster than the general O(1/ε)
convergence rate of ordinary gradient descent for differentiable convex functions.
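The O(1/√ε) claim follows directly by setting the right-hand side of the bound equal to ε and solving for t:

\frac{2 \|w_0 - w^*\|^2}{\eta t^2} \le \epsilon
\quad\Longleftrightarrow\quad
t \ge \sqrt{\frac{2 \|w_0 - w^*\|^2}{\eta \epsilon}} = O\!\left(\frac{1}{\sqrt{\epsilon}}\right).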
If B = 1, then the momentum is applied every round. However, this can often lead to unstable estimates,
since the individual stochastic gradients can be quite noisy. By averaging over, say, 100 rounds of stochastic
gradients, the momentum estimate becomes more stable.
Another way to view each update is as the solution to a proximal optimization problem:

w_t = \operatorname{argmin}_{w'} L(w') + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. (4)
1 In the case where each ηt is constant, then η = ηt .
Here, the idea is that we want wt to be the best solution to L that is not too far from the previous wt−1. So
long as 1/ηt is sufficiently large, the so-called "proximal" term will dominate whenever w′ is far from wt−1.
This form of the optimization problem might seem a bit circular, since if we could solve (4) directly, then we
could presumably solve the original problem directly as well. However, we shall see later that this form has some very nice properties.
We can show that a variant of (4) is equivalent to gradient descent. By properties of convex functions,
we can lower bound the objective in (4) as:

L(w') + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2 \ge L(w_{t-1}) + \nabla_w L(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. (5)

The closer w′ is to wt−1, the smaller the gap, and at w′ = wt−1, (5) becomes an equality.
One can thus consider a variant of (4), which can be thought of as the linear approximation to (4) at
wt−1:

\operatorname{argmin}_{w'} \nabla_w L(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2. (6)
Differentiating the objective in (6) with respect to w′ and setting it to 0 yields:

0 = \nabla_w L(w = w_{t-1}) + \frac{1}{\eta_t} (w' - w_{t-1}) \;\Rightarrow\; w_t = w_{t-1} - \eta_t \nabla_w L(w = w_{t-1}),
which is exactly the gradient descent update rule. In other words, the gradient descent update rule is the
closed-form solution to (6), which is in turn the linear approximation to (4) at wt−1. So one can think of
(4) and (6) as generalizations of gradient descent. Specifically, gradient descent is the closed-form solution
to (6) whenever L is differentiable.
The more general setting can be thought of as:
Algorithm 5 Proximal Updates
1: Initialize w0 , typically as 0
2: t ← 1
3: Update: wt ← argmin_{w′} (4) or argmin_{w′} (6)
4: t ← t + 1
5: Repeat from Step 3 until some termination condition is met
Now consider objectives that decompose as L = G + H, where G is differentiable but H is not (e.g., L1
regularization: H(w) = λ‖w‖1). We will show how to solve such an L using the Iterative Soft-Thresholding
Algorithm (ISTA). We do so by alternating between an update based on G and an update based on H. For the
differentiable part G, we solve for the next step using (6), which yields a closed-form solution of:
v_t \leftarrow \operatorname{argmin}_{w'} \nabla_w G(w = w_{t-1})^T (w' - w_{t-1}) + \frac{1}{2\eta_t} \|w' - w_{t-1}\|_2^2 = w_{t-1} - \eta_t \nabla_w G(w = w_{t-1}). (7)
Afterwards, for the non-differentiable part H, we instead solve using (4), which yields:
w_t \leftarrow \operatorname{argmin}_{w'} H(w') + \frac{1}{2\eta_t} \|w' - v_t\|_2^2, (8)
where vt was the intermediate solution to the gradient update of G (7).
It turns out that for many commonly used non-differentiable functions, H, (8) has a closed-form solu-
tion. Differentiating (8) and setting it to 0 yields:
0 = \nabla_w H(w = w') + \frac{1}{\eta_t} (w' - v_t) \;\Rightarrow\; \nabla_w [\eta_t H(w = w')] + w' = v_t. (9)
Consider L1 regularization H(w) = λ‖w‖1. We can write the sub-differential of αH (for any positive
constant α) component-wise as:

\nabla_w \, \alpha\lambda \|w\|_1 = \begin{cases} -\alpha\lambda & \text{if } w < 0 \\ [-\alpha\lambda, \alpha\lambda] & \text{if } w = 0 \\ \alpha\lambda & \text{if } w > 0 \end{cases}

where w here denotes a single component of the weight vector.
In other words, when a component of w is exactly 0, there is a continuous range of sub-differentials, and any
one of them that satisfies (9) gives an optimal solution. For example, if vt > ηtλ, then w′ = vt − ηtλ > 0 satisfies (9);
and if |vt| ≤ ηtλ, then w′ = 0 satisfies (9) because the sub-differential at 0 contains vt. We can thus write the
closed-form solution to (9) component-wise as:
w_t = \begin{cases} v_t + \eta_t \lambda & \text{if } v_t \le -\eta_t \lambda \\ 0 & \text{if } -\eta_t \lambda < v_t < \eta_t \lambda \\ v_t - \eta_t \lambda & \text{if } v_t \ge \eta_t \lambda \end{cases} (10)
The above update rule is known as “soft-thresholding”. The entire algorithm is thus:
Algorithm 6 Iterative Soft-Thresholding Algorithm (ISTA) for optimizing L1 regularized training objectives
1: Initialize w0 , typically as 0
2: Decompose L = G + H, where H(w) = λ‖w‖1 and G is the differentiable part
3: t ← 1
4: Update: vt ← argmin_{w′} (6) applied to G; the solution is given in (7)
5: Update: wt ← argmin_{w′} (4) applied to H (i.e., (8)); the solution is given in (10)
6: t ← t + 1
7: Repeat from Step 4 until some termination condition is met
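For concreteness, here is a sketch of Algorithm 6 for an L1-regularized least-squares objective (Python/NumPy; the data, λ, and step size are illustrative assumptions, and the step size is held constant rather than tuned):

import numpy as np

def soft_threshold(v, thresh):
    # Component-wise soft-thresholding: the closed-form solution (10)
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def ista(X, y, lam, eta, iters=500):
    w = np.zeros(X.shape[1])                       # Step 1: initialize w_0 = 0
    for t in range(iters):
        grad_G = -2.0 * X.T @ (y - X @ w)          # gradient of G(w) = sum_i (y_i - w^T x_i)^2
        v = w - eta * grad_G                       # Step 4: gradient step on G, as in (7)
        w = soft_threshold(v, eta * lam)           # Step 5: proximal step on H(w) = lam*||w||_1, as in (10)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                      # sparse ground truth
y = X @ w_true + 0.05 * rng.normal(size=200)
w_hat = ista(X, y, lam=5.0, eta=1e-3)              # most entries of w_hat come out exactly zero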
Of course, one can also use standard sub-gradient descent to optimize L1-regularized training ob-
jectives. However, the goal of using L1 regularization is to obtain a sparse solution w (few non-zeros),
and standard sub-gradient descent will not actually produce a sparse solution (you can try it yourself).
ISTA is guaranteed to produce a sparse solution because it zeros out small components of w.
The algorithm above takes full gradient steps on G, and thus requires differentiating the entire G at each iteration. In prac-
tice, one often optimizes G using stochastic gradient descent instead. In that case, one straightforward approach
is to do some number of rounds of mini-batch SGD on G, followed by a soft-thresholding step. Just keep
in mind that the step sizes ηt need to be adjusted: if you do B rounds of mini-batch SGD on G for each
soft-thresholding step, then you need to set the soft-thresholding step size to Bηt.
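A sketch of that stochastic variant (Python/NumPy; B, the batch size, the data, and the step size are illustrative assumptions): run B mini-batch SGD steps on G, then soft-threshold once with the scaled step size Bηt:

import numpy as np

def soft_threshold(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

rng = np.random.default_rng(0)
N, d = 1000, 20
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.05 * rng.normal(size=N)

w = np.zeros(d)
eta, lam, B, batch = 2e-4, 5.0, 10, 50
for outer in range(200):
    for _ in range(B):                                       # B rounds of mini-batch SGD on G
        idx = rng.choice(N, size=batch, replace=False)
        grad_G = -2.0 * (N / batch) * X[idx].T @ (y[idx] - X[idx] @ w)   # unbiased estimate of grad G
        w = w - eta * grad_G
    w = soft_threshold(w, B * eta * lam)                     # one soft-thresholding step with step size B*eta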
References
[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.