Optimization Methods For Machine Learning: Stephen Wright
Stephen Wright
University of Wisconsin-Madison
Outline:
- Optimization in data analysis: problems
- Relevant algorithms
astronomical,...)
Affects everyone directly!
Powerful computers and new specialized architectures make it
possible to handle larger data sets and analyze them more thoroughly.
Methodological innovations in some areas, e.g. Deep Learning.
- Speech recognition in smart phones
- Image recognition
Outputs y_j depend on inputs a_j.
Prediction: Given a new data vector a_k, predict the output y_k ← φ(a_k).
\[
\min_X \; \frac{1}{2m} \|\mathcal{A}(X) - y\|_2^2 + \lambda \|X\|_* , \quad \text{for some } \lambda > 0.
\]
[Recht et al., 2010]
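As an illustration of how such a problem might be attacked, here is a proximal-gradient sketch in which the prox of λ‖X‖_* is singular-value soft-thresholding. Taking A to be an entry-sampling (matrix-completion) operator, and all function and variable names below, are assumptions for illustration, not the slide's own algorithm.

```python
import numpy as np

def svd_soft_threshold(X, tau):
    """Prox operator of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_gradient_matrix_recovery(y, mask, shape, lam, step=1.0, iters=200):
    """Minimize (1/(2m)) ||A(X) - y||_2^2 + lam * ||X||_* by proximal gradient,
    where A(X) samples the entries of X indicated by the boolean array `mask`
    (a matrix-completion-style operator, assumed here for illustration)."""
    m = y.size
    X = np.zeros(shape)
    for _ in range(iters):
        grad = np.zeros(shape)
        grad[mask] = (X[mask] - y) / m   # gradient of smooth term: A^*(A(X) - y) / m
        X = svd_soft_threshold(X - step * grad, step * lam)
    return X

# e.g. X_hat = prox_gradient_matrix_recovery(y, mask, (50, 40), lam=0.1)
```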
Foreground-background separation in video: each column of the data matrix Y is one frame, and each row is one pixel evolving in time.
- Low-rank part M represents background, sparse part S represents foreground.
Convex formulation:
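The formulation referred to here is presumably the standard robust-PCA decomposition of the observed matrix Y into low-rank plus sparse parts; in its usual form (the weighting λ > 0 is the standard tuning parameter, an assumption here rather than a quote from the slide):
\[
\min_{M, S} \; \|M\|_* + \lambda \|S\|_1 \quad \text{subject to } M + S = Y,
\]
where ‖·‖_* is the nuclear norm and ‖·‖_1 is the elementwise ℓ_1 norm.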
Can avoid defining ψ explicitly by using instead the dual of this QP.
\[
f(x) = -\frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} \, \bigl(a_j^T x_{[\ell]}\bigr) - \log\!\left( \sum_{\ell=1}^{M} \exp\bigl(a_j^T x_{[\ell]}\bigr) \right) \right].
\]
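A small NumPy check of this multiclass logistic (softmax) loss; the array names A (rows a_j^T), Y (one-hot labels y_{jℓ}), and X (columns x_[ℓ]) are assumptions for illustration.

```python
import numpy as np

def multiclass_logistic_loss(X, A, Y):
    """f(x) = -(1/m) sum_j [ sum_l Y[j,l]*(a_j^T x_[l]) - log sum_l exp(a_j^T x_[l]) ].
    A: (m, n) data matrix with rows a_j^T.
    Y: (m, M) one-hot label matrix.
    X: (n, M) parameter matrix whose columns are the x_[l]."""
    Z = A @ X                                   # entries a_j^T x_[l], shape (m, M)
    Zmax = Z.max(axis=1, keepdims=True)         # stabilize the log-sum-exp
    lse = Zmax.squeeze(1) + np.log(np.exp(Z - Zmax).sum(axis=1))
    return -np.mean((Y * Z).sum(axis=1) - lse)  # average over the m data items
```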
[Figure: layered network diagram with input nodes]
\[
a^{\ell+1} = \sigma\bigl(W^\ell a^\ell + g^\ell\bigr),
\]
where a^ℓ is the vector of node values at layer ℓ, (W^ℓ, g^ℓ) are parameters in the network, and σ is an element-wise activation function.
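A minimal NumPy sketch of this recursion as a forward pass; the ReLU activation, layer sizes, and random parameters below are illustrative assumptions.

```python
import numpy as np

def forward(a0, params, sigma=lambda z: np.maximum(z, 0.0)):
    """Apply a^{l+1} = sigma(W^l a^l + g^l) through all layers.
    params: list of (W, g) pairs, one per layer."""
    a = a0
    for W, g in params:
        a = sigma(W @ a + g)
    return a

# Example: random 3-layer network mapping R^10 -> R^4
rng = np.random.default_rng(0)
sizes = [10, 200, 200, 4]
params = [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print(forward(rng.standard_normal(10), params).shape)   # (4,)
```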
Can think of the neural network as transforming the raw data in a way
that makes the ultimate task (regression, classification) easier.
We consider a multiclass classification application in power systems. The raw data consists of PMU (phasor measurement unit) measurements at different points in a power grid, under different operating conditions. The goal is to use this data to detect line outages. Each class corresponds to the outage of a particular line.
The data is high-dimensional. We can illustrate it by computing a singular value decomposition of the data matrix and plotting pairs of principal components on a 2-d graph.
We do this before and after the network transformation, using one hidden layer with 200 nodes.
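A sketch of this before/after visualization; the arrays raw (PMU features), transformed (hidden-layer outputs), and labels are placeholders for data not shown here.

```python
import numpy as np
import matplotlib.pyplot as plt

def top2_components(X):
    """Project the rows of X onto the first two right singular vectors."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                      # (num_samples, 2)

def plot_pair(raw, transformed, labels):
    """Scatter-plot two principal components before and after the network."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, X, title in zip(axes, [raw, transformed],
                            ["raw PMU data", "after hidden layer"]):
        P = top2_components(X)
        ax.scatter(P[:, 0], P[:, 1], c=labels, s=5)
        ax.set_title(title)
    plt.show()
```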
The regularized formulation is
\[
\min_x \; H(x) + \lambda \, \Omega(x), \qquad H(x) := \frac{1}{m} \sum_{j=1}^{m} h_j(x),
\]
where
- h_j depends on the parameters x of the mapping φ and on the data items (a_j, y_j);
- Ω is the regularization term, often nonsmooth, convex, and separable in the components of x (but not always!);
- λ ≥ 0 is the regularization parameter.
(Ω could also be the indicator of a simple set, e.g. x ≥ 0.)
Alternative formulation:
\[
\min_x \; \frac{1}{m} \sum_{j=1}^{m} h_j(x) \quad \text{s.t. } \Omega(x) \le \tau.
\]
- Accelerated gradient
- Stochastic gradient
  - and hybrids with full-gradient
- Coordinate descent
- Conditional gradient
- Newton's method, and approximate Newton
- Augmented Lagrangian / ADMM
Many extensions, variants, and adaptations of these methods have been proposed, along with extensions of their convergence analysis.
When Ω is the indicator function of a convex set, x^{k+1} is the projection of x^k − α_k ∇H(x^k) onto this set: gradient projection.
For many Ω of interest, this subproblem can be solved quickly (e.g. in O(n) operations).
Algorithms and convergence theory for steepest descent on smooth H usually extend to this setting (for convex Ω).
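A minimal sketch of one such step (the general prox-gradient update, of which gradient projection is the special case above); the grad_H callable and the λ in the ℓ1 example are placeholders.

```python
import numpy as np

def prox_grad_step(x, grad_H, alpha, prox):
    """One step x^{k+1} = prox_{alpha*Omega}(x^k - alpha * grad H(x^k))."""
    return prox(x - alpha * grad_H(x), alpha)

# Omega = indicator of {x >= 0}: the prox is projection (gradient projection)
project_nonneg = lambda z, alpha: np.maximum(z, 0.0)

# Omega = lam * ||x||_1: the prox is componentwise soft-thresholding, O(n)
def soft_threshold(lam):
    return lambda z, alpha: np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
```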
They are less appealing when the objective is the sum of m terms, with m large. To calculate
\[
\nabla H(x) = \frac{1}{m} \sum_{j=1}^{m} \nabla h_j(x),
\]
we must evaluate all m component gradients, i.e. a full pass through the data at every iteration.
The stochastic gradient ∇h_{j_k}(x^k) is a proxy for ∇H(x^k): it depends on just one data item a_{j_k} and is much cheaper to evaluate.
It is unbiased, E_j ∇h_j(x) = ∇H(x), but its variance may be very large.
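A minimal stochastic-gradient loop for H(x) = (1/m) Σ_j h_j(x); grad_h(x, j), returning ∇h_j(x), is a placeholder, and the diminishing step schedule is one common choice rather than a prescription from these slides.

```python
import numpy as np

def sgd(grad_h, x0, m, iters=10_000, alpha0=1.0, seed=0):
    """x^{k+1} = x^k - alpha_k * grad h_{j_k}(x^k), with j_k uniform in {0,...,m-1}."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for k in range(iters):
        j = rng.integers(m)                # pick one data item at random
        alpha = alpha0 / (1 + k)           # a common diminishing step size
        x -= alpha * grad_h(x, j)
    return x
```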
SG fits the summation form of H well (with large m), so it has widespread applications:
- SVM (primal formulation).
- Logistic regression: binary and multiclass.
- Deep Learning: the killer app! (Nonconvex.) [LeCun et al., 1998]
- Subspace identification (GROUSE): project stochastic gradient searches onto a subspace [Balzano and Wright, 2014].
Choose x^1 ∈ R^n;
for ℓ = 0, 1, 2, ... (epochs) do
    for j = 1, 2, ..., n (inner iterations) do
        Set k ← ℓn + j;
        Choose index i = i(ℓ, j) ∈ {1, 2, ..., n};
        Choose α_k > 0;
        x^{k+1} ← x^k − α_k ∇_i H(x^k) e_i;
    end for
end for

where
- e_i = (0, ..., 0, 1, 0, ..., 0)^T is the ith coordinate vector;
- ∇_i H(x) is the ith component of the gradient ∇H(x);
- α_k > 0 is the step length.
CD Variants
- CCD (Cyclic CD): i(ℓ, j) = j.
- RCD (Randomized CD, a.k.a. Stochastic CD): i(ℓ, j) is chosen uniformly at random from {1, 2, ..., n}.
- RPCD (Randomized Permutations CD): at the start of epoch ℓ, we choose a random permutation of {1, 2, ..., n}, and i(ℓ, j) is its jth entry.
Example (for analysis): the convex quadratic
\[
H(x) = \tfrac{1}{2} x^T A x.
\]
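A sketch of these three index-selection rules on the convex quadratic example above; the exact coordinate step x_i ← x_i − (Ax)_i / A_ii used here is the natural choice for this objective (an assumption, since the step-length rule α_k is left open on the slide).

```python
import numpy as np

def coordinate_descent(A, x0, epochs=50, variant="rpcd", seed=0):
    """CCD / RCD / RPCD on H(x) = 0.5 * x^T A x (A symmetric positive definite).
    Exact coordinate minimization: x_i <- x_i - (Ax)_i / A_ii."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = x.size
    for _ in range(epochs):
        if variant == "ccd":
            order = range(n)                    # cyclic: i(l, j) = j
        elif variant == "rcd":
            order = rng.integers(n, size=n)     # uniform with replacement
        else:                                   # rpcd
            order = rng.permutation(n)          # fresh permutation each epoch
        for i in order:
            x[i] -= (A[i] @ x) / A[i, i]
    return x

# Example: converges to the minimizer x = 0
A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(coordinate_descent(A, [1.0, -2.0]))
```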
Consider
\[
\min_{x \in \Omega} f(x),
\]
near a (local) solution x^* satisfying
\[
\nabla f(x^*) = 0, \qquad \nabla^2 f(x^*) \ \text{positive definite}.
\]
Trust-region Newton step: solve
\[
\min_p \; f(x^k) + \nabla f(x^k)^T p + \tfrac{1}{2}\, p^T \nabla^2 f(x^k)\, p \quad \text{subject to } \|p\|_2 \le \Delta_k .
\]
Cubic-regularized Newton step: solve
\[
\min_p \; f(x^k) + \nabla f(x^k)^T p + \tfrac{1}{2}\, p^T \nabla^2 f(x^k)\, p + \tfrac{L}{6} \|p\|^3 .
\]
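For reference, a global solution p^k of the trust-region subproblem above is characterized by the existence of a multiplier μ ≥ 0 such that
\[
\bigl(\nabla^2 f(x^k) + \mu I\bigr)\, p^k = -\nabla f(x^k), \qquad
\nabla^2 f(x^k) + \mu I \succeq 0, \qquad
\mu \bigl(\Delta_k - \|p^k\|_2\bigr) = 0.
\]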
Consider the linearly constrained problem min_x f(x) subject to Ax = b, where f : R^n → R is convex.
Define the Lagrangian function:
\[
L(x, \lambda) = f(x) + \lambda^T (Ax - b).
\]
Optimality conditions:
\[
-A^T \lambda^* \in \partial f(x^*), \qquad Ax^* = b,
\]
or equivalently:
\[
0 \in \partial_x L(x^*, \lambda^*), \qquad \nabla_\lambda L(x^*, \lambda^*) = 0.
\]
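These conditions are what the augmented Lagrangian approach (listed among the algorithms above) targets; in its standard form it adds a quadratic penalty to L and alternates an x-minimization with a multiplier update:
\[
L_\rho(x, \lambda) = f(x) + \lambda^T (Ax - b) + \tfrac{\rho}{2} \|Ax - b\|_2^2,
\]
\[
x^{k+1} = \arg\min_x \; L_\rho(x, \lambda^k), \qquad
\lambda^{k+1} = \lambda^k + \rho \, (Ax^{k+1} - b).
\]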
FIN