Optimization Methods For Machine Learning: Stephen Wright
Stephen Wright
University of Wisconsin-Madison
Outline:
- Optimization in data analysis: problems
- Relevant algorithms
astronomical,...)
Affects everyone directly!
Powerful computers and new specialized architectures make it
possible to handle larger data sets and analyze them more thoroughly.
Methodological innovations in some areas, e.g. Deep Learning.
- Speech recognition in smart phones
- Image recognition
Outputs y_j depend on inputs a_j.
Prediction: Given a new data vector a_k, predict the output y_k ← φ(a_k).
\[
\min_X \; \frac{1}{2m} \|\mathcal{A}(X) - y\|_2^2 + \lambda \|X\|_* , \quad \text{for some } \lambda > 0.
\]
[Recht et al., 2010]
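As an illustration of how such a problem might be attacked, here is a proximal-gradient sketch in which the prox of λ‖X‖_* is singular-value soft-thresholding. Taking A to be an entry-sampling (matrix-completion) operator, and all function and variable names below, are assumptions for illustration, not the slide's own algorithm.

```python
import numpy as np

def svd_soft_threshold(X, tau):
    """Prox operator of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_gradient_matrix_recovery(y, mask, shape, lam, step=1.0, iters=200):
    """Minimize (1/(2m)) ||A(X) - y||_2^2 + lam * ||X||_* by proximal gradient,
    where A(X) samples the entries of X indicated by the boolean array `mask`
    (a matrix-completion-style operator, assumed here for illustration)."""
    m = y.size
    X = np.zeros(shape)
    for _ in range(iters):
        grad = np.zeros(shape)
        grad[mask] = (X[mask] - y) / m   # gradient of smooth term: A^*(A(X) - y) / m
        X = svd_soft_threshold(X - step * grad, step * lam)
    return X

# e.g. X_hat = prox_gradient_matrix_recovery(y, mask, (50, 40), lam=0.1)
```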
Foreground-background separation in video: each column of the data matrix Y is one frame, and each row is one pixel evolving in time.
- Low-rank part M represents background, sparse part S represents foreground.
Convex formulation:
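The formulation referred to here is presumably the standard robust-PCA decomposition of the observed matrix Y into low-rank plus sparse parts; in its usual form (the weighting λ > 0 is the standard tuning parameter, an assumption here rather than a quote from the slide):
\[
\min_{M, S} \; \|M\|_* + \lambda \|S\|_1 \quad \text{subject to } M + S = Y,
\]
where ‖·‖_* is the nuclear norm and ‖·‖_1 is the elementwise ℓ_1 norm.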
Can avoid defining ψ explicitly by using instead the dual of this QP.
\[
f(x) = -\frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} \, \bigl(a_j^T x_{[\ell]}\bigr) - \log\!\left( \sum_{\ell=1}^{M} \exp\bigl(a_j^T x_{[\ell]}\bigr) \right) \right].
\]
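A small NumPy check of this multiclass logistic (softmax) loss; the array names A (rows a_j^T), Y (one-hot labels y_{jℓ}), and X (columns x_[ℓ]) are assumptions for illustration.

```python
import numpy as np

def multiclass_logistic_loss(X, A, Y):
    """f(x) = -(1/m) sum_j [ sum_l Y[j,l]*(a_j^T x_[l]) - log sum_l exp(a_j^T x_[l]) ].
    A: (m, n) data matrix with rows a_j^T.
    Y: (m, M) one-hot label matrix.
    X: (n, M) parameter matrix whose columns are the x_[l]."""
    Z = A @ X                                   # entries a_j^T x_[l], shape (m, M)
    Zmax = Z.max(axis=1, keepdims=True)         # stabilize the log-sum-exp
    lse = Zmax.squeeze(1) + np.log(np.exp(Z - Zmax).sum(axis=1))
    return -np.mean((Y * Z).sum(axis=1) - lse)  # average over the m data items
```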
[Figure: layered network diagram with input nodes]
\[
a^{\ell+1} = \sigma\bigl(W^\ell a^\ell + g^\ell\bigr),
\]
where a^ℓ is the vector of node values at layer ℓ, (W^ℓ, g^ℓ) are parameters in the network, and σ is an element-wise activation function.
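A minimal NumPy sketch of this recursion as a forward pass; the ReLU activation, layer sizes, and random parameters below are illustrative assumptions.

```python
import numpy as np

def forward(a0, params, sigma=lambda z: np.maximum(z, 0.0)):
    """Apply a^{l+1} = sigma(W^l a^l + g^l) through all layers.
    params: list of (W, g) pairs, one per layer."""
    a = a0
    for W, g in params:
        a = sigma(W @ a + g)
    return a

# Example: random 3-layer network mapping R^10 -> R^4
rng = np.random.default_rng(0)
sizes = [10, 200, 200, 4]
params = [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print(forward(rng.standard_normal(10), params).shape)   # (4,)
```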
Can think of the neural network as transforming the raw data in a way
that makes the ultimate task (regression, classification) easier.
We consider a multiclass classification application in power systems. The raw data consists of PMU (phasor measurement unit) measurements at different points in a power grid, under different operating conditions. The goal is to use this data to detect line outages. Each class corresponds to the outage of a particular line.
The data is high-dimensional. We can illustrate it by computing a singular value decomposition of the data matrix and plotting pairs of principal components on a 2-d graph.
We do this before and after the network transformation, using one hidden layer with 200 nodes.
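A sketch of this before/after visualization; the arrays raw (PMU features), transformed (hidden-layer outputs), and labels are placeholders for data not shown here.

```python
import numpy as np
import matplotlib.pyplot as plt

def top2_components(X):
    """Project the rows of X onto the first two right singular vectors."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                      # (num_samples, 2)

def plot_pair(raw, transformed, labels):
    """Scatter-plot two principal components before and after the network."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, X, title in zip(axes, [raw, transformed],
                            ["raw PMU data", "after hidden layer"]):
        P = top2_components(X)
        ax.scatter(P[:, 0], P[:, 1], c=labels, s=5)
        ax.set_title(title)
    plt.show()
```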
The regularized formulation is
\[
\min_x \; H(x) + \lambda \, \Omega(x), \qquad H(x) := \frac{1}{m} \sum_{j=1}^{m} h_j(x),
\]
where
- h_j depends on the parameters x of the mapping φ and on the data items (a_j, y_j);
- Ω is the regularization term, often nonsmooth, convex, and separable in the components of x (but not always!);
- λ ≥ 0 is the regularization parameter.
(Ω could also be the indicator of a simple set, e.g. x ≥ 0.)
Alternative formulation:
\[
\min_x \; \frac{1}{m} \sum_{j=1}^{m} h_j(x) \quad \text{s.t. } \Omega(x) \le \tau.
\]
- Accelerated gradient
- Stochastic gradient
  - and hybrids with full-gradient
- Coordinate descent
- Conditional gradient
- Newton's method, and approximate Newton
- Augmented Lagrangian / ADMM
Many extensions, variants, and adaptations of these methods have been proposed, along with extensions of their convergence analysis.
When Ω is the indicator function of a convex set, x^{k+1} is the projection of x^k − α_k ∇H(x^k) onto this set: gradient projection.
For many Ω of interest, this subproblem can be solved quickly (e.g. in O(n) operations).
Algorithms and convergence theory for steepest descent on smooth H usually extend to this setting (for convex Ω).
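A minimal sketch of one such step (the general prox-gradient update, of which gradient projection is the special case above); the grad_H callable and the λ in the ℓ1 example are placeholders.

```python
import numpy as np

def prox_grad_step(x, grad_H, alpha, prox):
    """One step x^{k+1} = prox_{alpha*Omega}(x^k - alpha * grad H(x^k))."""
    return prox(x - alpha * grad_H(x), alpha)

# Omega = indicator of {x >= 0}: the prox is projection (gradient projection)
project_nonneg = lambda z, alpha: np.maximum(z, 0.0)

# Omega = lam * ||x||_1: the prox is componentwise soft-thresholding, O(n)
def soft_threshold(lam):
    return lambda z, alpha: np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
```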
They are less appealing when the objective is the sum of m terms, with m large. To calculate
\[
\nabla H(x) = \frac{1}{m} \sum_{j=1}^{m} \nabla h_j(x),
\]
we must evaluate all m component gradients, i.e. a full pass through the data at every iteration.
The stochastic gradient ∇h_{j_k}(x^k) is a proxy for ∇H(x^k): it depends on just one data item a_{j_k} and is much cheaper to evaluate.
It is unbiased, E_j ∇h_j(x) = ∇H(x), but its variance may be very large.
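A minimal stochastic-gradient loop for H(x) = (1/m) Σ_j h_j(x); grad_h(x, j), returning ∇h_j(x), is a placeholder, and the diminishing step schedule is one common choice rather than a prescription from these slides.

```python
import numpy as np

def sgd(grad_h, x0, m, iters=10_000, alpha0=1.0, seed=0):
    """x^{k+1} = x^k - alpha_k * grad h_{j_k}(x^k), with j_k uniform in {0,...,m-1}."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for k in range(iters):
        j = rng.integers(m)                # pick one data item at random
        alpha = alpha0 / (1 + k)           # a common diminishing step size
        x -= alpha * grad_h(x, j)
    return x
```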
SG fits the summation form of H well (with large m), so it has widespread applications:
- SVM (primal formulation).
- Logistic regression: binary and multiclass.
- Deep Learning: the killer app! (Nonconvex.) [LeCun et al., 1998]
- Subspace identification (GROUSE): project stochastic gradient searches onto a subspace [Balzano and Wright, 2014].
Choose x^1 ∈ R^n;
for ℓ = 0, 1, 2, ... (epochs) do
    for j = 1, 2, ..., n (inner iterations) do
        Set k ← ℓn + j;
        Choose index i = i(ℓ, j) ∈ {1, 2, ..., n};
        Choose α_k > 0;
        x^{k+1} ← x^k − α_k ∇_i H(x^k) e_i;
    end for
end for

where
- e_i = (0, ..., 0, 1, 0, ..., 0)^T is the ith coordinate vector;
- ∇_i H(x) is the ith component of the gradient ∇H(x);
- α_k > 0 is the step length.
CD Variants
- CCD (Cyclic CD): i(ℓ, j) = j.
- RCD (Randomized CD, a.k.a. Stochastic CD): i(ℓ, j) is chosen uniformly at random from {1, 2, ..., n}.
- RPCD (Randomized Permutations CD): at the start of epoch ℓ, we choose a random permutation of {1, 2, ..., n}, and i(ℓ, j) is its jth entry.
Example (for analysis): the convex quadratic
\[
H(x) = \tfrac{1}{2} x^T A x.
\]
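A sketch of these three index-selection rules on the convex quadratic example above; the exact coordinate step x_i ← x_i − (Ax)_i / A_ii used here is the natural choice for this objective (an assumption, since the step-length rule α_k is left open on the slide).

```python
import numpy as np

def coordinate_descent(A, x0, epochs=50, variant="rpcd", seed=0):
    """CCD / RCD / RPCD on H(x) = 0.5 * x^T A x (A symmetric positive definite).
    Exact coordinate minimization: x_i <- x_i - (Ax)_i / A_ii."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = x.size
    for _ in range(epochs):
        if variant == "ccd":
            order = range(n)                    # cyclic: i(l, j) = j
        elif variant == "rcd":
            order = rng.integers(n, size=n)     # uniform with replacement
        else:                                   # rpcd
            order = rng.permutation(n)          # fresh permutation each epoch
        for i in order:
            x[i] -= (A[i] @ x) / A[i, i]
    return x

# Example: converges to the minimizer x = 0
A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(coordinate_descent(A, [1.0, -2.0]))
```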
Consider
\[
\min_{x \in \Omega} f(x),
\]
near a (local) solution x^* satisfying
\[
\nabla f(x^*) = 0, \qquad \nabla^2 f(x^*) \ \text{positive definite}.
\]
Trust-region Newton step: solve
\[
\min_p \; f(x^k) + \nabla f(x^k)^T p + \tfrac{1}{2}\, p^T \nabla^2 f(x^k)\, p \quad \text{subject to } \|p\|_2 \le \Delta_k .
\]
Cubic-regularized Newton step: solve
\[
\min_p \; f(x^k) + \nabla f(x^k)^T p + \tfrac{1}{2}\, p^T \nabla^2 f(x^k)\, p + \tfrac{L}{6} \|p\|^3 .
\]
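For reference, a global solution p^k of the trust-region subproblem above is characterized by the existence of a multiplier μ ≥ 0 such that
\[
\bigl(\nabla^2 f(x^k) + \mu I\bigr)\, p^k = -\nabla f(x^k), \qquad
\nabla^2 f(x^k) + \mu I \succeq 0, \qquad
\mu \bigl(\Delta_k - \|p^k\|_2\bigr) = 0.
\]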
Consider the linearly constrained problem min_x f(x) subject to Ax = b, where f : R^n → R is convex.
Define the Lagrangian function:
\[
L(x, \lambda) = f(x) + \lambda^T (Ax - b).
\]
Optimality conditions:
\[
-A^T \lambda^* \in \partial f(x^*), \qquad Ax^* = b,
\]
or equivalently:
\[
0 \in \partial_x L(x^*, \lambda^*), \qquad \nabla_\lambda L(x^*, \lambda^*) = 0.
\]
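These conditions are what the augmented Lagrangian approach (listed among the algorithms above) targets; in its standard form it adds a quadratic penalty to L and alternates an x-minimization with a multiplier update:
\[
L_\rho(x, \lambda) = f(x) + \lambda^T (Ax - b) + \tfrac{\rho}{2} \|Ax - b\|_2^2,
\]
\[
x^{k+1} = \arg\min_x \; L_\rho(x, \lambda^k), \qquad
\lambda^{k+1} = \lambda^k + \rho \, (Ax^{k+1} - b).
\]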
FIN