Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

Foundations and Trends in Machine Learning, Vol. 3, No. 1 (2010) 1–122
© 2011 S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein
DOI: 10.1561/2200000016

S. Boyd¹, N. Parikh², E. Chu³, B. Peleato⁴, and J. Eckstein⁵

1. Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, [email protected]
2. Computer Science Department, Stanford University, Stanford, CA 94305, USA, [email protected]
3. Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, [email protected]
4. Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, [email protected]
5. Management Science and Information Systems Department and RUTCOR, Rutgers University, Piscataway, NJ 08854, USA, [email protected]
Contents
1 Introduction
2 Precursors
2.1 Dual Ascent
2.2 Dual Decomposition
2.3 Augmented Lagrangians and the Method of Multipliers
4 General Patterns
8.1 Examples
8.2 Splitting across Examples
8.3 Splitting across Features
9 Nonconvex Problems
9.1 Nonconvex Constraints
9.2 Bi-convex Problems
10 Implementation
10.1 Abstract Implementation
10.2 MPI
10.3 Graph Computing Frameworks
10.4 MapReduce
11 Numerical Examples
Acknowledgments
References
Abstract
Many problems of recent interest in statistics and machine learning
can be posed in the framework of convex optimization. Due to the
explosion in size and complexity of modern datasets, it is increasingly
important to be able to solve problems with a very large number of fea-
tures or training examples. As a result, both the decentralized collection
or storage of these datasets as well as accompanying distributed solu-
tion methods are either necessary or at least highly desirable. In this
review, we argue that the alternating direction method of multipliers
is well suited to distributed convex optimization, and in particular to
large-scale problems arising in statistics, machine learning, and related
areas. The method was developed in the 1970s, with roots in the 1950s,
and is equivalent or closely related to many other algorithms, such
as dual decomposition, the method of multipliers, Douglas–Rachford
splitting, Spingarn’s method of partial inverses, Dykstra’s alternating
projections, Bregman iterative algorithms for $\ell_1$ problems, proximal
methods, and others. After briefly surveying the theory and history of
the algorithm, we discuss applications to a wide variety of statistical
and machine learning problems of recent interest, including the lasso,
sparse logistic regression, basis pursuit, covariance selection, support
vector machines, and many others. We also discuss general distributed
optimization, extensions to the nonconvex setting, and efficient imple-
mentation, including some details on distributed MPI and Hadoop
MapReduce implementations.
1 Introduction
Outline
We begin in §2 with a brief review of dual decomposition and the
method of multipliers, two important precursors to ADMM. This sec-
tion is intended mainly for background and can be skimmed. In §3,
we present ADMM, including a basic convergence theorem, some vari-
ations on the basic version that are useful in practice, and a survey of
some of the key literature. A complete convergence proof is given in
appendix A.
In §4, we describe some general patterns that arise in applications
of the algorithm, such as cases when one of the steps in ADMM can be carried out particularly efficiently.

2 Precursors
Consider the equality-constrained convex problem

minimize $f(x)$
subject to $Ax = b$,

with variable $x \in \mathbf{R}^n$, where $A \in \mathbf{R}^{m \times n}$. The dual problem is

maximize $g(y)$,

with variable $y \in \mathbf{R}^m$, where $g(y) = -f^*(-A^Ty) - b^Ty$ is the dual function. In dual decomposition, we suppose that $f$ is separable,

$$f(x) = \sum_{i=1}^N f_i(x_i),$$

where $x = (x_1, \ldots, x_N)$ and the $x_i$ are subvectors of $x$. Partitioning the matrix $A$ conformably as

$$A = [A_1 \ \cdots \ A_N],$$

so that $Ax = \sum_{i=1}^N A_i x_i$, the Lagrangian can be written as

$$L(x, y) = \sum_{i=1}^N L_i(x_i, y) = \sum_{i=1}^N \left( f_i(x_i) + y^T A_i x_i - (1/N) y^T b \right),$$

which is separable in $x$. This means that the $x$-minimization step in dual ascent splits into $N$ separate problems that can be solved in parallel:

$$x_i^{k+1} := \operatorname*{argmin}_{x_i} L_i(x_i, y^k). \qquad (2.4)$$
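To make the gather/scatter structure of dual decomposition concrete, here is a minimal sketch in Python/NumPy for the illustrative separable choice $f_i(x_i) = (1/2)\|x_i\|_2^2$; the random problem data, the fixed dual step size `alpha`, and the iteration count are assumptions of the sketch, not part of the text above.

```python
import numpy as np

def dual_decomposition(A_blocks, b, alpha=0.01, iters=200):
    """Dual decomposition for: minimize sum_i (1/2)||x_i||^2 s.t. sum_i A_i x_i = b."""
    y = np.zeros(b.shape[0])
    for _ in range(iters):
        # x_i-updates: each argmin_{x_i} f_i(x_i) + y^T A_i x_i can run in parallel.
        # For f_i(x_i) = (1/2)||x_i||^2 the minimizer is x_i = -A_i^T y.
        xs = [-Ai.T @ y for Ai in A_blocks]
        # gather step: dual ascent on the residual of the coupling constraint
        r = sum(Ai @ xi for Ai, xi in zip(A_blocks, xs)) - b
        y = y + alpha * r
    return xs, y

# usage with two blocks of random data
rng = np.random.default_rng(0)
A_blocks = [rng.standard_normal((5, 3)), rng.standard_normal((5, 4))]
b = rng.standard_normal(5)
xs, y = dual_decomposition(A_blocks, b)
```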
We see that by using ρ as the step size in the dual update, the iterate
(xk+1 , y k+1 ) is dual feasible. As the method of multipliers proceeds, the
primal residual Axk+1 − b converges to zero, yielding optimality.
The greatly improved convergence properties of the method of multipliers over dual ascent come at a cost. When f is separable, the aug-
mented Lagrangian Lρ is not separable, so the x-minimization step (2.7)
cannot be carried out separately in parallel for each xi . This means that
the basic method of multipliers cannot be used for decomposition. We
will see how to address this issue next.
Augmented Lagrangians and the method of multipliers for con-
strained optimization were first proposed in the late 1960s by Hestenes
[97, 98] and Powell [138]. Many of the early numerical experiments on
the method of multipliers are due to Miele et al. [124, 125, 126]. Much
of the early work is consolidated in a monograph by Bertsekas [15],
who also discusses similarities to older approaches using Lagrangians
and penalty functions [6, 5, 71], as well as a number of generalizations.
3 Alternating Direction Method of Multipliers
3.1 Algorithm
ADMM is an algorithm that is intended to blend the decomposability
of dual ascent with the superior convergence properties of the method
of multipliers. The algorithm solves problems in the form
minimize $f(x) + g(z)$
subject to $Ax + Bz = c$,      (3.1)

with variables $x \in \mathbf{R}^n$ and $z \in \mathbf{R}^m$, where $A \in \mathbf{R}^{p \times n}$, $B \in \mathbf{R}^{p \times m}$, and $c \in \mathbf{R}^p$. As in the method of multipliers, we form the augmented Lagrangian

$$L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + (\rho/2)\|Ax + Bz - c\|_2^2.$$

ADMM consists of the iterations

$$x^{k+1} := \operatorname*{argmin}_x L_\rho(x, z^k, y^k) \qquad (3.2)$$
$$z^{k+1} := \operatorname*{argmin}_z L_\rho(x^{k+1}, z, y^k) \qquad (3.3)$$
$$y^{k+1} := y^k + \rho(Ax^{k+1} + Bz^{k+1} - c), \qquad (3.4)$$
where ρ > 0. The algorithm is very similar to dual ascent and the
method of multipliers: it consists of an x-minimization step (3.2), a
z-minimization step (3.3), and a dual variable update (3.4). As in the
method of multipliers, the dual variable update uses a step size equal
to the augmented Lagrangian parameter ρ.
The method of multipliers for (3.1) has the form

$$(x^{k+1}, z^{k+1}) := \operatorname*{argmin}_{x,z} L_\rho(x, z, y^k)$$
$$y^{k+1} := y^k + \rho(Ax^{k+1} + Bz^{k+1} - c).$$

Here the augmented Lagrangian is minimized jointly with respect to the two primal variables, whereas in ADMM $x$ and $z$ are updated in an alternating or sequential fashion, which accounts for the term alternating direction.
where u = (1/ρ)y is the scaled dual variable. Using the scaled dual vari-
able, we can express ADMM as
$$x^{k+1} := \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|Ax + Bz^k - c + u^k\|_2^2 \right) \qquad (3.5)$$
$$z^{k+1} := \operatorname*{argmin}_z \left( g(z) + (\rho/2)\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right) \qquad (3.6)$$
$$u^{k+1} := u^k + Ax^{k+1} + Bz^{k+1} - c. \qquad (3.7)$$
Defining the residual at iteration $k$ as $r^k = Ax^k + Bz^k - c$, the scaled dual variable is the running sum of the residuals:

$$u^k = u^0 + \sum_{j=1}^k r^j.$$
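The scaled iteration (3.5)–(3.7) maps directly onto code. Below is a minimal, hedged sketch of a generic scaled-form ADMM loop; the caller-supplied `x_update` and `z_update` functions (which must return the minimizers in (3.5) and (3.6)) and the toy quadratic example are assumptions of the sketch.

```python
import numpy as np

def admm_scaled(x_update, z_update, A, B, c, iters=100):
    """Generic scaled-form ADMM loop implementing (3.5)-(3.7)."""
    x, z, u = np.zeros(A.shape[1]), np.zeros(B.shape[1]), np.zeros(len(c))
    for _ in range(iters):
        x = x_update(z, u)        # (3.5)
        z = z_update(x, u)        # (3.6)
        u = u + A @ x + B @ z - c # (3.7)
    return x, z, u

# Toy usage: f(x) = (1/2)||x - a||^2, g(z) = (1/2)||z - b||^2, constraint x - z = 0,
# so A = I, B = -I, c = 0; both minimizations have simple closed forms.
a, b, rho = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
x, z, u = admm_scaled(lambda z, u: (a + rho * (z - u)) / (1 + rho),
                      lambda x, u: (b + rho * (x + u)) / (1 + rho),
                      np.eye(2), -np.eye(2), np.zeros(2))
```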
3.2 Convergence
There are many convergence results for ADMM discussed in the liter-
ature; here, we limit ourselves to a basic but still very general result
that applies to all of the examples we will consider. We will make one assumption about the functions $f$ and $g$ and one about problem (3.1).
3.2.1 Convergence
Under assumptions 1 and 2, the ADMM iterates satisfy the following:
This shows that when the residuals $r^k$ and $s^k$ are small, the objective suboptimality also must be small. We cannot use this inequality directly in a stopping criterion, however, since we do not know $x^\star$. But if we guess or estimate that $\|x^k - x^\star\|_2 \leq d$, we have that
where $\epsilon^{\mathrm{pri}} > 0$ and $\epsilon^{\mathrm{dual}} > 0$ are feasibility tolerances for the primal and dual feasibility conditions (3.8) and (3.9), respectively. These tolerances can be chosen using an absolute and relative criterion, such as

$$\epsilon^{\mathrm{pri}} = \sqrt{p}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}} \max\{\|Ax^k\|_2, \|Bz^k\|_2, \|c\|_2\},$$
$$\epsilon^{\mathrm{dual}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}} \|A^Ty^k\|_2,$$

where $\epsilon^{\mathrm{abs}} > 0$ is an absolute tolerance and $\epsilon^{\mathrm{rel}} > 0$ is a relative tolerance. (The factors $\sqrt{p}$ and $\sqrt{n}$ account for the fact that the $\ell_2$ norms are in $\mathbf{R}^p$ and $\mathbf{R}^n$, respectively.) A reasonable value for the relative stopping criterion might be $\epsilon^{\mathrm{rel}} = 10^{-3}$ or $10^{-4}$, depending on the application.
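As a small illustration, the tolerances above can be computed directly from the current iterates; the default values of `eps_abs` and `eps_rel` below are illustrative assumptions.

```python
import numpy as np

def stopping_tolerances(A, B, c, x, z, y, eps_abs=1e-4, eps_rel=1e-3):
    """Absolute-plus-relative primal/dual tolerances described above."""
    p, n = A.shape
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return eps_pri, eps_dual

# terminate when ||r^k||_2 <= eps_pri and ||s^k||_2 <= eps_dual
```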
where $\mu > 1$, $\tau^{\mathrm{incr}} > 1$, and $\tau^{\mathrm{decr}} > 1$ are parameters. Typical choices might be $\mu = 10$ and $\tau^{\mathrm{incr}} = \tau^{\mathrm{decr}} = 2$. The idea behind this penalty parameter update is to try to keep the primal and dual residual norms within a factor of $\mu$ of one another as they both converge to zero. The ADMM update equations suggest that large values of $\rho$ place a large penalty on violations of primal feasibility and so tend to produce small primal residuals, while small values of $\rho$ tend to reduce the dual residual at the cost of a larger primal residual.
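A minimal sketch of this residual-balancing scheme, using the typical parameter choices above, follows; note (as a consequence of $u = (1/\rho)y$) that in the scaled form the dual variable must be rescaled whenever $\rho$ changes.

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Residual-balancing penalty update: grow rho when the primal residual
    dominates, shrink it when the dual residual dominates."""
    if r_norm > mu * s_norm:
        return rho * tau_incr
    if s_norm > mu * r_norm:
        return rho / tau_decr
    return rho

# if scaled ADMM is used, also rescale the scaled dual variable:
# u = u * (rho_old / rho_new)
```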
3.4.3 Over-relaxation
In the $z$- and $y$-updates, the quantity $Ax^{k+1}$ can be replaced with $\alpha^k Ax^{k+1} - (1 - \alpha^k)(Bz^k - c)$, where $\alpha^k \in (0, 2)$ is a relaxation parameter; values of $\alpha^k$ between 1.5 and 1.8 have been reported to improve convergence.
4 General Patterns

In the language of convex analysis, the function

$$\tilde f(v) = \inf_x \left( f(x) + (\rho/2)\|x - v\|_2^2 \right)$$

is known as the Moreau envelope or Moreau–Yosida regularization of $f$.
By the matrix inversion lemma, $(P + \rho A^TA)^{-1} = P^{-1} - \rho P^{-1}A^T(I + \rho AP^{-1}A^T)^{-1}AP^{-1}$, which holds when all the inverses exist. This means that if linear systems
with coefficient matrix P can be solved efficiently, and p is small, or
at least no larger than n in order, then the x-update can be computed
efficiently as well. The same trick as above can also be used to obtain
an efficient method for computing multiple updates: The factorization
of I + ρAP −1 AT can be cached and cheaper back-solves can be used
in computing the updates.
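As a hedged illustration of this caching idea, the sketch below handles the frequently occurring $x$-update coefficient $A^TA + \rho I$ (the lasso case used later in this review) rather than the general $P + \rho A^TA$ form: the factorization is computed once and reused, and for fat $A$ the matrix inversion lemma reduces the factored matrix to the smaller $I + (1/\rho)AA^T$.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_x_update(A, rho):
    """Return a solver for (A^T A + rho I) x = rhs with a cached factorization."""
    m, n = A.shape
    if m >= n:  # skinny A: factor A^T A + rho I directly
        L = cho_factor(A.T @ A + rho * np.eye(n))
        return lambda rhs: cho_solve(L, rhs)
    # fat A: matrix inversion lemma; factor the smaller I + (1/rho) A A^T
    L = cho_factor(np.eye(m) + (1.0 / rho) * A @ A.T)
    return lambda rhs: rhs / rho - A.T @ cho_solve(L, A @ rhs) / rho**2

solve = make_x_update(np.random.randn(30, 100), rho=1.0)  # factor once
x = solve(np.random.randn(100))                           # cheap back-solves thereafter
```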
$$\begin{bmatrix} P + \rho I & F^T \\ F & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ \nu \end{bmatrix} + \begin{bmatrix} q - \rho(z^k - u^k) \\ -g \end{bmatrix} = 0.$$
When an iterative method is used for the $x$-update, it can be warm started at the previous iterate $x^k$, which typically results in far fewer iterations (of the iterative method used to compute the update $x^{k+1}$) than if the iterative method were started at zero or some other default initialization. This is especially the case when ADMM has almost converged, in which case the updates will not change significantly from their previous values.
4.4 Decomposition
4.4.1 Block Separability
Suppose x = (x1 , . . . , xN ) is a partition of the variable x into subvectors
and that f is separable with respect to this partition, i.e.,
f (x) = f1 (x1 ) + · · · + fN (xN ),
where $x_i \in \mathbf{R}^{n_i}$ and $\sum_{i=1}^N n_i = n$. If the quadratic term $\|Ax\|_2^2$ is also separable with respect to the partition, i.e., $A^TA$ is block diagonal conformably with the partition, then the augmented Lagrangian $L_\rho$ is separable, and the $x$-update splits into $N$ parallel minimizations over the $x_i$.
Even though the first term is not differentiable, we can easily compute
a simple closed-form solution to this problem by using subdifferential
calculus; see [140, §23] for background. Explicitly, the solution is
$$x_i^+ := S_{\lambda/\rho}(v_i),$$

where the soft thresholding operator $S$ is defined as

$$S_\kappa(a) = \begin{cases} a - \kappa & a > \kappa \\ 0 & |a| \leq \kappa \\ a + \kappa & a < -\kappa, \end{cases}$$

or equivalently, $S_\kappa(a) = (a - \kappa)_+ - (-a - \kappa)_+$. Yet another formula, which shows that the soft thresholding operator is a shrinkage operator (i.e., it moves a point toward zero), is $S_\kappa(a) = (1 - \kappa/|a|)_+\,a$ (for $a \neq 0$).
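The second formula is convenient in code because it is vectorized; a one-line sketch:

```python
import numpy as np

def soft_threshold(kappa, a):
    """Elementwise soft thresholding: S_kappa(a) = (a - kappa)_+ - (-a - kappa)_+."""
    return np.maximum(a - kappa, 0.0) - np.maximum(-a - kappa, 0.0)
```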
5 Constrained Convex Optimization

The generic constrained convex optimization problem is

minimize $f(x)$
subject to $x \in C$,      (5.1)

with variable $x \in \mathbf{R}^n$, where $f$ and $C$ are convex.
For the problem of finding a point in the intersection of two convex sets $C$ and $D$, ADMM coincides with Dykstra's alternating projections method [56, 9], which is far more efficient than the classical method that does not use the dual variable $u$.
Here, the norm of the primal residual $\|x^k - z^k\|_2$ has a nice interpretation. Since $x^k \in C$ and $z^k \in D$, $\|x^k - z^k\|_2$ is an upper bound on $\mathbf{dist}(C, D)$, the Euclidean distance between $C$ and $D$. If we terminate with $\|r^k\|_2 \leq \epsilon^{\mathrm{pri}}$, then we have found a pair of points, one in $C$ and one in $D$, that are no more than $\epsilon^{\mathrm{pri}}$ far apart. Alternatively, the point $(1/2)(x^k + z^k)$ is no more than a distance $\epsilon^{\mathrm{pri}}/2$ from both $C$ and $D$.
$$\begin{bmatrix} P + \rho I & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ \nu \end{bmatrix} + \begin{bmatrix} q - \rho(z^k - u^k) \\ -b \end{bmatrix} = 0.$$
6 $\ell_1$-Norm Problems

The problems addressed in this section will help illustrate why ADMM
is a natural fit for machine learning and statistical problems in particu-
lar. The reason is that, unlike dual ascent or the method of multipliers,
ADMM explicitly targets problems that split into two distinct parts, f
and g, that can then be handled separately. Problems of this form are
pervasive in machine learning, because a significant number of learning
problems involve minimizing a loss function together with a regulariza-
tion term or side constraints. In other cases, these side constraints are
introduced through problem transformations like putting the problem
in consensus form, as will be discussed in §7.1.
This section contains a variety of simple but important problems
involving $\ell_1$ norms. There is widespread current interest in many of these
problems across statistics, machine learning, and signal processing, and
applying ADMM yields interesting algorithms that are state-of-the-art,
or closely related to state-of-the-art methods. We will see that ADMM
naturally lets us decouple the nonsmooth $\ell_1$ term from the smooth loss
term, which is computationally advantageous. In this section, we focus on
the non-distributed versions of these problems for simplicity; the problem
of distributed model fitting will be treated in the sequel.
where the Huber penalty function $g^{\mathrm{hub}}$ is quadratic for small arguments and transitions to an absolute value for larger values. For scalar $a$, it is given by

$$g^{\mathrm{hub}}(a) = \begin{cases} a^2/2 & |a| \leq 1 \\ |a| - 1/2 & |a| > 1. \end{cases}$$

For Huber fitting, the ADMM $z$-update becomes

$$z^{k+1} := \frac{\rho}{1+\rho}\left(Ax^{k+1} - b + u^k\right) + \frac{1}{1+\rho}\,S_{1+1/\rho}\left(Ax^{k+1} - b + u^k\right).$$
6.4 Lasso

An important special case of (6.1) is $\ell_1$-regularized linear regression, also called the lasso [156]. This involves solving

minimize $(1/2)\|Ax - b\|_2^2 + \lambda\|x\|_1$,

where $\lambda > 0$ is a scalar regularization parameter. In ADMM form, the lasso can be written as minimize $f(x) + g(z)$ subject to $x - z = 0$, where $f(x) = (1/2)\|Ax - b\|_2^2$ and $g(z) = \lambda\|z\|_1$. By §4.2 and §4.4, ADMM becomes

$$x^{k+1} := (A^TA + \rho I)^{-1}\left(A^Tb + \rho(z^k - u^k)\right)$$
$$z^{k+1} := S_{\lambda/\rho}\left(x^{k+1} + u^k\right)$$
$$u^{k+1} := u^k + x^{k+1} - z^{k+1}.$$
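A compact, hedged sketch of this lasso iteration for a dense, skinny $A$ follows; the fixed iteration count stands in for the stopping criterion of §3.3, and the default $\rho$ is an illustrative assumption.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lasso_admm(A, b, lam, rho=1.0, iters=500):
    """ADMM for the lasso, following the updates above (dense, skinny A)."""
    n = A.shape[1]
    L = cho_factor(A.T @ A + rho * np.eye(n))   # factor once, reuse every iteration
    Atb = A.T @ b
    x = z = u = np.zeros(n)
    for _ in range(iters):
        x = cho_solve(L, Atb + rho * (z - u))                     # x-update
        v = x + u
        z = np.maximum(v - lam / rho, 0) - np.maximum(-v - lam / rho, 0)  # soft threshold
        u = u + x - z                                             # u-update
    return z
```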
The second term is the total variation of x. This problem is often called
total variation denoising [145], and has applications in signal process-
ing. When A = I and F is a second difference matrix, the problem (6.3)
is called $\ell_1$ trend filtering [101].
In ADMM form, the problem (6.3) can be written as
ADMM for this problem is the same as above with the z-update
replaced with block soft thresholding
$$S_\kappa(a) = (1 - \kappa/\|a\|_2)_+\,a,$$
Given samples $a_i \sim \mathcal{N}(0, \Sigma)$, $i = 1, \ldots, N$, of a zero-mean Gaussian random variable in $\mathbf{R}^n$, we
consider the task of estimating the covariance matrix Σ under the prior
assumption that Σ−1 is sparse. Since (Σ−1 )ij is zero if and only if
the ith and jth components of the random variable are conditionally
independent, given the other variables, this problem is equivalent to the
structure learning problem of estimating the topology of the undirected
graphical model representation of the Gaussian [104]. Determining the
sparsity pattern of the inverse covariance matrix Σ−1 is also called the
covariance selection problem.
For n very small, it is feasible to search over all sparsity patterns
in Σ−1 since for a fixed sparsity pattern, determining the maximum
likelihood estimate of Σ is a tractable (convex optimization) problem.
A good heuristic that scales to much larger values of $n$ is to minimize the negative log-likelihood (with respect to the parameter $X = \Sigma^{-1}$) with an $\ell_1$ regularization term to promote sparsity of the estimated inverse covariance matrix [7]. If $S$ is the empirical covariance matrix $(1/N)\sum_{i=1}^N a_ia_i^T$, then the estimation problem can be written as

minimize $\operatorname{tr}(SX) - \log\det X + \lambda\|X\|_1$,

with variable $X$, where $\|X\|_1$ denotes the elementwise $\ell_1$ norm.
where $\|\cdot\|_F$ is the Frobenius norm, i.e., the square root of the sum of
the squares of the entries.
This algorithm can be simplified much further. The Z-minimization
step is elementwise soft thresholding,
$$Z_{ij}^{k+1} := S_{\lambda/\rho}\left(X_{ij}^{k+1} + U_{ij}^k\right),$$
and the $X$-minimization has the first-order optimality condition

$$S - X^{-1} + \rho(X - Z^k + U^k) = 0,$$

which can be rewritten as

$$\rho X - X^{-1} = \rho(Z^k - U^k) - S. \qquad (6.5)$$
We will construct a matrix X that satisfies this condition and thus min-
imizes the X-minimization objective. First, take the orthogonal eigen-
value decomposition of the righthand side,
$$\rho(Z^k - U^k) - S = Q\Lambda Q^T,$$

where $\Lambda = \mathbf{diag}(\lambda_1, \ldots, \lambda_n)$ and $Q^TQ = I$. A diagonal matrix $\tilde X$ satisfying $\rho\tilde X - \tilde X^{-1} = \Lambda$ can then be found by solving $\rho\tilde X_{ii} - 1/\tilde X_{ii} = \lambda_i$ for each $i$, which gives

$$\tilde X_{ii} = \frac{\lambda_i + \sqrt{\lambda_i^2 + 4\rho}}{2\rho},$$

which are always positive since $\rho > 0$. It follows that $X = Q\tilde XQ^T$ sat-
isfies the optimality condition (6.5), so this is the solution to the X-
minimization. The computational effort of the X-update is that of an
eigenvalue decomposition of a symmetric matrix.
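A hedged sketch of this $X$-update, written exactly as described above:

```python
import numpy as np

def cov_sel_x_update(S, Z, U, rho):
    """Solve rho*X - X^{-1} = rho*(Z - U) - S via an eigendecomposition."""
    lam, Q = np.linalg.eigh(rho * (Z - U) - S)        # orthogonal eigendecomposition
    x_tilde = (lam + np.sqrt(lam**2 + 4.0 * rho)) / (2.0 * rho)  # positive roots
    return (Q * x_tilde) @ Q.T                         # X = Q diag(x_tilde) Q^T
```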
7 Consensus and Sharing
7.1 Global Variable Consensus Optimization
We first consider the case with a single global variable and an objective split into $N$ parts: the goal is to solve the problem

minimize $\sum_{i=1}^N f_i(x)$,

with variable $x \in \mathbf{R}^n$, in such a way that each term can be handled by its own processing element, such as a thread or processor.
This problem arises in many contexts. In model fitting, for exam-
ple, x represents the parameters in a model and fi represents the loss
function associated with the ith block of data or measurements. In this
case, we would say that x is found by collaborative filtering, since the
data sources are ‘collaborating’ to develop a global model.
This problem can be rewritten with local variables xi ∈ Rn and a
common global variable z:
minimize $\sum_{i=1}^N f_i(x_i)$
subject to $x_i - z = 0, \quad i = 1, \ldots, N.$      (7.1)
This is called the global consensus problem, since the constraint is that
all the local variables should agree, i.e., be equal. Consensus can be
viewed as a simple technique for turning additive objectives $\sum_{i=1}^N f_i(x)$, which show up frequently but do not split due to the variable being shared across terms, into separable objectives $\sum_{i=1}^N f_i(x_i)$, which split
easily. For a useful recent discussion of consensus algorithms, see [131]
and the references therein.
ADMM for the problem (7.1) can be derived either directly from
the augmented Lagrangian
$$L_\rho(x_1, \ldots, x_N, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T(x_i - z) + (\rho/2)\|x_i - z\|_2^2 \right),$$
or by viewing (7.1) as minimizing $\sum_i f_i(x_i)$ subject to $(x_1, \ldots, x_N) \in C$, where $C = \{(x_1, \ldots, x_N) \mid x_1 = x_2 = \cdots = x_N\}$.
The resulting ADMM algorithm is

$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - z^k) + (\rho/2)\|x_i - z^k\|_2^2 \right)$$
$$z^{k+1} := \frac{1}{N}\sum_{i=1}^N \left( x_i^{k+1} + (1/\rho)y_i^k \right)$$
$$y_i^{k+1} := y_i^k + \rho\left(x_i^{k+1} - z^{k+1}\right).$$
Substituting the first equation into the second shows that $\bar y^{k+1} = 0$, i.e., the dual variables have average value zero after the first iteration. Using $z^k = \bar x^k$, ADMM can be written as

$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - \bar x^k) + (\rho/2)\|x_i - \bar x^k\|_2^2 \right)$$
$$y_i^{k+1} := y_i^k + \rho\left(x_i^{k+1} - \bar x^{k+1}\right).$$
In this form, the primal and dual residual norms are

$$\|r^k\|_2^2 = \sum_{i=1}^N \|x_i^k - \bar x^k\|_2^2, \qquad \|s^k\|_2^2 = N\rho^2\|\bar x^k - \bar x^{k-1}\|_2^2.$$
The scaled form of ADMM for this problem also has an appealing
form, which we record here for convenience:
$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( f_i(x_i) + (\rho/2)\|x_i - z^k + u_i^k\|_2^2 \right) \qquad (7.6)$$
$$z^{k+1} := \operatorname*{argmin}_z \left( g(z) + (N\rho/2)\|z - \bar x^{k+1} - \bar u^k\|_2^2 \right) \qquad (7.7)$$
$$u_i^{k+1} := u_i^k + x_i^{k+1} - z^{k+1}. \qquad (7.8)$$
In many cases, this version is simpler and easier to work with than the
unscaled form.
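A hedged sketch of the scaled consensus iteration (7.6)–(7.8), simulated serially; the caller-supplied `x_updates[i]` and `prox_g` functions (the minimizers in (7.6) and (7.7)) are assumptions of the sketch, and in practice the $x_i$-updates would run in parallel.

```python
import numpy as np

def consensus_admm(x_updates, prox_g, n, rho=1.0, iters=100):
    """Scaled-form consensus ADMM (7.6)-(7.8).

    x_updates[i](v) returns argmin_x f_i(x) + (rho/2)||x - v||^2;
    prox_g(v, t) returns argmin_z g(z) + (t/2)||z - v||^2."""
    N = len(x_updates)
    xs = [np.zeros(n) for _ in range(N)]
    us = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    for _ in range(iters):
        xs = [x_updates[i](z - us[i]) for i in range(N)]   # (7.6), parallel over i
        z = prox_g(sum(xs) / N + sum(us) / N, N * rho)     # (7.7)
        us = [us[i] + xs[i] - z for i in range(N)]         # (7.8)
    return z
```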
7.2 General Form Consensus Optimization
$$(x_i)_j = z_{G(i,j)}, \quad i = 1, \ldots, N, \quad j = 1, \ldots, n_i.$$
Fig. 7.1. General form consensus optimization. Local objective terms are on the left; global
variable components are on the right. Each edge in the bipartite graph is a consistency
constraint, linking a local variable and a global variable component.
$$L_\rho(x, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T(x_i - \tilde z_i) + (\rho/2)\|x_i - \tilde z_i\|_2^2 \right),$$
i.e., the sum of the dual variable entries that correspond to any given
global index g is zero. The z-update step can thus be written in the
simpler form
$$z_g^{k+1} := (1/k_g)\sum_{G(i,j)=g} (x_i^{k+1})_j,$$

where $k_g$ is the number of local variable entries that correspond to global index $g$.
7.3 Sharing
Another canonical problem that will prove useful in the sequel is the
sharing problem
minimize $\sum_{i=1}^N f_i(x_i) + g\left(\sum_{i=1}^N x_i\right)$      (7.11)

with variables $x_i \in \mathbf{R}^n$, $i = 1, \ldots, N$. Following the same approach as for consensus, the sharing problem can be rewritten with local copies $z_i$:

minimize $\sum_{i=1}^N f_i(x_i) + g\left(\sum_{i=1}^N z_i\right)$      (7.12)
subject to $x_i - z_i = 0, \quad i = 1, \ldots, N,$

and scaled-form ADMM gives

$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( f_i(x_i) + (\rho/2)\|x_i - z_i^k + u_i^k\|_2^2 \right)$$
$$z^{k+1} := \operatorname*{argmin}_{z_1, \ldots, z_N} \left( g\Big(\textstyle\sum_{i=1}^N z_i\Big) + (\rho/2)\sum_{i=1}^N \|z_i - u_i^k - x_i^{k+1}\|_2^2 \right)$$
$$u_i^{k+1} := u_i^k + x_i^{k+1} - z_i^{k+1}.$$
The first and last steps can be carried out independently in parallel for
each i = 1, . . . , N . As written, the z-update requires solving a problem
in $Nn$ variables, but it simplifies considerably. With $\bar z$ fixed, the minimizing $z_i$ are

$$z_i = a_i + \bar z - \bar a, \qquad (7.13)$$

where $a_i = u_i^k + x_i^{k+1}$, so the $z$-update can be computed by solving the unconstrained problem

minimize $g(N\bar z) + (\rho/2)\sum_{i=1}^N\|\bar z - \bar a\|_2^2$,

with variable $\bar z \in \mathbf{R}^n$.
7.3.1 Duality
Attaching Lagrange multipliers νi to the constraints xi − zi = 0, the
dual function Γ of the ADMM sharing problem (7.12) is given by
$$\Gamma(\nu_1, \ldots, \nu_N) = \begin{cases} -g^*(\nu_1) - \sum_i f_i^*(-\nu_i) & \text{if } \nu_1 = \nu_2 = \cdots = \nu_N \\ -\infty & \text{otherwise.} \end{cases}$$
Letting ψ = g ∗ and hi (ν) = fi∗ (−ν), the dual sharing problem can be
written as
minimize $\sum_{i=1}^N h_i(\nu_i) + \psi(\nu)$      (7.15)
subject to $\nu_i - \nu = 0, \quad i = 1, \ldots, N,$
acts via price adjustment, i.e., increasing or decreasing the price of each
good depending on whether there is an excess demand or excess supply
of the good, respectively.
Dual decomposition is the simplest algorithmic expression of
tâtonnement. In this setting, each agent adjusts his consumption xi
to minimize his individual cost fi (xi ) adjusted by the cost y T xi , where
y is the price vector. The central collector (called the ‘secretary of mar-
ket’ in [163]) works toward equilibrium by adjusting the prices y up or
down depending on whether each commodity or good is overproduced
or underproduced. ADMM differs only in the inclusion of the proximal
regularization term in the updates for each agent. As $y^k$ converges to an optimal price vector $y^\star$, the effect of the proximal regularization
term vanishes. The proximal regularization term can be interpreted as
each agent’s commitment to help clear the market.
8 Distributed Model Fitting
8.1 Examples
8.1.1 Regression
Consider a linear modeling problem with measurements of the form
$$b_i = a_i^Tx + v_i,$$
8.1.2 Classification
Many classification problems can also be put in the form of the general
model fitting problem (8.1), with A, b, l, and r appropriately chosen. We
follow the standard setup from statistical learning theory, as described
in, e.g., [8]. Let $p_i \in \mathbf{R}^{n-1}$ denote the feature vector of the ith example and let $q_i \in \{-1, 1\}$ denote the binary outcome or class label, for $i = 1, \ldots, m$. The goal is to find a weight vector $w \in \mathbf{R}^{n-1}$ and offset $v \in \mathbf{R}$ such that $\operatorname{sign}(p_i^Tw + v) = q_i$ holds for as many examples as possible; this is typically done by solving a problem of the form

$$\text{minimize} \quad \frac{1}{m}\sum_{i=1}^m l_i\big(q_i(p_i^Tw + v)\big) + r_{\mathrm{wt}}(w), \qquad (8.2)$$

where $l_i$ is a loss function and $r_{\mathrm{wt}}$ is a regularization function on the weights.
This has the generic model fitting form (8.1), with $x = (w, v)$, $a_i = (q_ip_i, -q_i)$, $b_i = 0$, and regularizer $r(x) = r_{\mathrm{wt}}(w)$. (We also need to scale $l_i$ by $1/m$.) In the sequel, we will address such problems using the form
(8.1) without comment, assuming that this transformation has been
carried out.
In statistical learning theory, the problem (8.2) is referred to as
penalized empirical risk minimization or structural risk minimization.
When the loss function is convex, this is sometimes termed convex
risk minimization. In general, fitting a classifier by minimizing a sur-
rogate loss function, i.e., a convex upper bound to 0-1 loss, is a
well studied and widely used approach in machine learning; see, e.g.,
[165, 180, 8].
Many classification models in machine learning correspond to dif-
ferent choices of loss function $l_i$ and regularization or penalty $r_{\mathrm{wt}}$.
8.2.1 Lasso
For the lasso, this yields the distributed algorithm
$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( (1/2)\|A_ix_i - b_i\|_2^2 + (\rho/2)\|x_i - z^k + u_i^k\|_2^2 \right)$$
$$z^{k+1} := S_{\lambda/\rho N}\left(\bar x^{k+1} + \bar u^k\right)$$
$$u_i^{k+1} := u_i^k + x_i^{k+1} - z^{k+1}.$$
Each $x_i$-update takes the form of a Tikhonov-regularized least squares (i.e., ridge regression) problem, with analytical solution

$$x_i^{k+1} := (A_i^TA_i + \rho I)^{-1}\left(A_i^Tb_i + \rho(z^k - u_i^k)\right).$$
The techniques from §4.2 apply: If a direct method is used, then the factorization of $A_i^TA_i + \rho I$ can be cached to speed up subsequent updates, and if $m_i < n$, then the matrix inversion lemma can be applied to let us factor the smaller matrix $A_iA_i^T + \rho I$ instead.
Comparing this distributed-data lasso algorithm with the serial ver-
sion in §6.4, we see that the only difference is the collection and aver-
aging steps, which couple the computations for the data blocks.
An ADMM-based distributed lasso algorithm is described in [121],
with applications in signal processing and wireless communications.
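A hedged sketch of this splitting-across-examples lasso, with the $N$ data blocks processed serially for illustration; in a real deployment each $x_i$-update would run on its own processor and the averaging step would be a collective communication (e.g., an MPI Allreduce, as in §10.2).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def distributed_lasso(A_blocks, b_blocks, lam, rho=1.0, iters=300):
    """Consensus-form lasso with splitting across examples (serial simulation)."""
    N = len(A_blocks)
    n = A_blocks[0].shape[1]
    factors = [cho_factor(Ai.T @ Ai + rho * np.eye(n)) for Ai in A_blocks]  # cached
    Atb = [Ai.T @ bi for Ai, bi in zip(A_blocks, b_blocks)]
    us = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    for _ in range(iters):
        xs = [cho_solve(factors[i], Atb[i] + rho * (z - us[i])) for i in range(N)]
        v = sum(xs) / N + sum(us) / N                  # gather/averaging step
        kappa = lam / (rho * N)
        z = np.maximum(v - kappa, 0) - np.maximum(-v - kappa, 0)  # S_{lam/(rho N)}
        us = [us[i] + xs[i] - z for i in range(N)]
    return z
```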
8.3 Splitting across Features

Next we consider the case in which the number of features $n$ is very large, so there may be many candidate explanatory variables for any given outcome. For example, the obser-
vations may be a corpus of documents, and the features could include
all words and pairs of adjacent words (bigrams) that appear in each
document. In bioinformatics, there are usually relatively few people
in a given association study, but there can be a very large number of
potential features relating to factors like observed DNA mutations in
each individual. There are many examples in other areas as well, and
the goal is to solve such problems in a distributed fashion with each
processor handling a subset of the features. In this section, we show
how this can be done by formulating it as a sharing problem from §7.3.
We partition the parameter vector $x$ as $x = (x_1, \ldots, x_N)$, with $x_i \in \mathbf{R}^{n_i}$, where $\sum_{i=1}^N n_i = n$. Conformably partition the data matrix $A$ as $A = [A_1 \ \cdots \ A_N]$, with $A_i \in \mathbf{R}^{m \times n_i}$, and the regularization function as $r(x) = \sum_{i=1}^N r_i(x_i)$. This implies that $Ax = \sum_{i=1}^N A_ix_i$, i.e., $A_ix_i$ can be thought of as a 'partial' prediction of $b$ using only the features referenced in $x_i$. The model fitting problem (8.1) becomes

minimize $l\left(\sum_{i=1}^N A_ix_i - b\right) + \sum_{i=1}^N r_i(x_i).$
Following the approach used for the sharing problem (7.12), we express
the problem as
minimize $l\left(\sum_{i=1}^N z_i - b\right) + \sum_{i=1}^N r_i(x_i)$
subject to $A_ix_i - z_i = 0, \quad i = 1, \ldots, N,$
As in the discussion for the sharing problem, we carry out the $z$-update by first solving for the average $\bar z^{k+1}$:

$$\bar z^{k+1} := \operatorname*{argmin}_{\bar z} \left( l(N\bar z - b) + (N\rho/2)\|\bar z - \overline{Ax}^{k+1} - \bar u^k\|_2^2 \right)$$
$$z_i^{k+1} := \bar z^{k+1} + A_ix_i^{k+1} + u_i^k - \overline{Ax}^{k+1} - \bar u^k,$$

where $\overline{Ax}^{k+1} = (1/N)\sum_{i=1}^N A_ix_i^{k+1}$. Substituting the last expression into the update for $u_i$, we find that

$$u_i^{k+1} = \overline{Ax}^{k+1} + \bar u^k - \bar z^{k+1},$$

which shows that, as in the sharing problem, all the dual variables are equal. Using a single dual variable $u^k \in \mathbf{R}^m$, and eliminating $z_i$, we arrive at the algorithm

$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( r_i(x_i) + (\rho/2)\|A_ix_i - A_ix_i^k - \bar z^k + \overline{Ax}^k + u^k\|_2^2 \right)$$
$$\bar z^{k+1} := \operatorname*{argmin}_{\bar z} \left( l(N\bar z - b) + (N\rho/2)\|\bar z - \overline{Ax}^{k+1} - u^k\|_2^2 \right)$$
$$u^{k+1} := u^k + \overline{Ax}^{k+1} - \bar z^{k+1}.$$
8.3.1 Lasso
In this case, the algorithm above becomes
$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( (\rho/2)\|A_ix_i - A_ix_i^k - \bar z^k + \overline{Ax}^k + u^k\|_2^2 + \lambda\|x_i\|_1 \right)$$
$$\bar z^{k+1} := \frac{1}{N + \rho}\left( b + \rho\,\overline{Ax}^{k+1} + \rho u^k \right)$$
$$u^{k+1} := u^k + \overline{Ax}^{k+1} - \bar z^{k+1}.$$
Each $x_i$-update is a lasso problem with $n_i$ variables, which can be solved
using any single processor lasso method.
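A hedged sketch of this feature-splitting lasso follows. Here each $x_i$-subproblem is solved approximately by proximal gradient steps, an illustrative stand-in for "any single processor lasso method"; the step size, iteration counts, and default $\rho$ are assumptions of the sketch.

```python
import numpy as np

def lasso_feature_split(A_blocks, b, lam, rho=1.0, iters=200, inner_iters=50):
    """Lasso with splitting across features, following the updates above."""
    N, m = len(A_blocks), b.shape[0]
    xs = [np.zeros(Ai.shape[1]) for Ai in A_blocks]
    zbar, u = np.zeros(m), np.zeros(m)

    def soft(a, kappa):
        return np.maximum(a - kappa, 0) - np.maximum(-a - kappa, 0)

    for _ in range(iters):
        Axs = [Ai @ xi for Ai, xi in zip(A_blocks, xs)]
        Axbar = sum(Axs) / N
        for i, Ai in enumerate(A_blocks):
            v = Ai @ xs[i] + zbar - Axbar - u        # target for A_i x_i
            t = 1.0 / np.linalg.norm(Ai, 2) ** 2     # prox-gradient step size
            xi = xs[i]
            for _ in range(inner_iters):             # solve (rho/2)||A_i x - v||^2 + lam||x||_1
                xi = soft(xi - t * Ai.T @ (Ai @ xi - v), lam * t / rho)
            xs[i] = xi
        Axbar = sum(Ai @ xi for Ai, xi in zip(A_blocks, xs)) / N
        zbar = (b + rho * Axbar + rho * u) / (N + rho)
        u = u + Axbar - zbar
    return np.concatenate(xs)
```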
In the $x_i$-updates, we have $x_i^{k+1} = 0$ (meaning that none of the features in the ith block are used) if and only if

$$\left\|A_i^T\left(A_ix_i^k + \bar z^k - \overline{Ax}^k - u^k\right)\right\|_2 \leq \lambda/\rho.$$

When this occurs, the $x_i$-update is fast (compared to the case when $x_i^{k+1} \neq 0$). In a parallel implementation, there is no benefit to speeding
up only some of the tasks being executed in parallel, but in a serial
setting we do benefit.
The $z$-update and $u$-update are the same as for the lasso, but the $x_i$
update becomes
$$x_i^{k+1} := \operatorname*{argmin}_{x_i} \left( (\rho/2)\|A_ix_i - A_ix_i^k - \bar z^k + \overline{Ax}^k + u^k\|_2^2 + \lambda\|x_i\|_2 \right).$$
(Only the subscript on the last norm differs from the lasso case.) This
involves minimizing a function of the form
$$(\rho/2)\|A_ix_i - v\|_2^2 + \lambda\|x_i\|_2,$$

which can be carried out as follows. The solution is $x_i = 0$ if and only if $\|A_i^Tv\|_2 \leq \lambda/\rho$. Otherwise, the solution has the form

$$x_i = (A_i^TA_i + \nu I)^{-1}A_i^Tv,$$
for the value of $\nu > 0$ that gives $\nu\|x_i\|_2 = \lambda/\rho$. This value can be found using a one-parameter search (e.g., via bisection) over $\nu$. We can speed up the computation of $x_i$ for several values of $\nu$ (as needed for the parameter search) by computing and caching an eigendecomposition of $A_i^TA_i$. Assuming $A_i$ is tall, i.e., $m \geq n_i$ (a similar method works when $m < n_i$), we compute an orthogonal $Q$ for which $A_i^TA_i = Q\,\mathbf{diag}(\lambda)\,Q^T$, where $\lambda$ is the vector of eigenvalues of $A_i^TA_i$ (i.e., the squares of the singular values of $A_i$). The cost is $O(mn_i^2)$ flops, dominated (in order) by forming $A_i^TA_i$. We subsequently compute $\|x_i\|_2$ using

$$\|x_i\|_2 = \left\|\mathbf{diag}(\lambda + \nu\mathbf{1})^{-1}Q^TA_i^Tv\right\|_2.$$

This can be computed in $O(n_i)$ flops, once $Q^TA_i^Tv$ is computed, so the search over $\nu$ is costless (in order). The cost per iteration is thus $O(mn_i)$ (to compute $Q^TA_i^Tv$), a factor of $n_i$ better than carrying out the $x_i$-update without caching.
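A hedged sketch of this subproblem solver follows. It uses SciPy's Brent root finder in place of plain bisection, recomputes the eigendecomposition inside the function for simplicity (in an ADMM loop it would be cached as described above), and the bracketing interval is an assumption.

```python
import numpy as np
from scipy.optimize import brentq

def group_update(Ai, v, lam, rho):
    """Solve minimize (rho/2)||A_i x - v||_2^2 + lam ||x||_2 for tall A_i."""
    Atv = Ai.T @ v
    if np.linalg.norm(Atv) <= lam / rho:              # zero-block test
        return np.zeros(Ai.shape[1])
    lams, Q = np.linalg.eigh(Ai.T @ Ai)               # cacheable across iterations
    w = Q.T @ Atv
    # find nu > 0 with nu * ||x(nu)||_2 = lam/rho, where x(nu) = (A^T A + nu I)^{-1} A^T v
    h = lambda nu: nu * np.linalg.norm(w / (lams + nu)) - lam / rho
    nu = brentq(h, 1e-12, 1e12)
    return Q @ (w / (lams + nu))
```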
where $a_{ij}$ is the jth component of the feature vector of the ith example, and $b_i$ is the associated outcome. Here the optimization variables are the functions $f_j \in \mathcal{F}_j$, where $\mathcal{F}_j$ is a subspace of functions; $r_j$ is now a regularization functional. Usually $f_j$ is linearly parametrized by a finite number of coefficients, which are the underlying optimization variables, but this formulation can also handle the case when $\mathcal{F}_j$ is infinite-dimensional. In either case, it is clearer to think of the feature functions $f_j$ as the variables to be determined.
In the feature-splitting form, the ADMM updates are

$$f_j^{k+1} := \operatorname*{argmin}_{f_j \in \mathcal{F}_j} \left( r_j(f_j) + (\rho/2)\sum_{i=1}^m \left( f_j(a_{ij}) - f_j^k(a_{ij}) - \bar z_i^k + \bar f_i^k + u_i^k \right)^2 \right)$$
$$\bar z^{k+1} := \operatorname*{argmin}_{\bar z} \left( \sum_{i=1}^m l_i(n\bar z_i - b_i) + (\rho/2)\|\bar z - \bar f^{k+1} - u^k\|_2^2 \right)$$
$$u^{k+1} := u^k + \bar f^{k+1} - \bar z^{k+1},$$

where $\bar f_i^k = (1/n)\sum_{j=1}^n f_j^k(a_{ij})$, the average value of the predicted response $\sum_{j=1}^n f_j^k(a_{ij})$ for the ith example.

The $f_j$-update is an $\ell_2$ (squared) regularized function fit. The $\bar z$-update can be carried out componentwise.
9 Nonconvex Problems

9.1 Nonconvex Constraints

We first consider the problem

minimize $f(x)$
subject to $x \in S$,

where $f$ is convex but $S$ is a nonconvex set.
The last two steps of the resulting ADMM iteration are

$$W^{k+1} := \operatorname*{argmin}_{W \geq 0} \left\| X^{k+1} - V^{k+1}W + U^k \right\|_F^2$$
$$U^{k+1} := U^k + X^{k+1} - V^{k+1}W^{k+1}.$$
The first step splits across the rows of X and V , so can be performed
by solving a set of quadratic programs, in parallel, to find each row of
X and V separately; the second splits in the columns of W , so can be
performed by solving parallel quadratic programs to find each column.
10 Implementation
10.2 MPI
Message Passing Interface (MPI) [77] is a language-independent
message-passing specification used for parallel algorithms, and is the
most widely used model for high-performance parallel computing today.
There are numerous implementations of MPI on a variety of distributed
platforms, and interfaces to MPI are available from a wide variety of
languages, including C, C++, and Python.
There are multiple ways to implement consensus ADMM in MPI,
but perhaps the simplest is given in Algorithm 1. This pseudocode
uses a single program, multiple data (SPMD) programming style, in
which each processor or subsystem runs the same program code but
has its own set of local variables and can read in a separate subset
of the data. We assume there are N processors, with each processor
$i$ storing local variables $x_i$ and $u_i$, a (redundant) copy of the global variable $z$, and handling only the local data implicit in the objective component $f_i$.
In step 4, Allreduce denotes using the MPI Allreduce operation to
compute the global sum over all processors of the contents of the vector
w, and store the result in w on every processor; the same applies to
the scalar $t$. After step 4, then, $w = \sum_{i=1}^N(x_i + u_i) = N(\bar x + \bar u)$ and $t = \|r\|_2^2 = \sum_{i=1}^N\|r_i\|_2^2$ on all processors. We use Allreduce because every processor needs these results to carry out its own $z$-update and termination test, rather than having them available only at a single root process.
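The sketch below is not the Algorithm 1 pseudocode referenced above; it is a hedged mpi4py illustration of the same SPMD pattern. The `local_x_update` function is a placeholder the user must supply (it should return $\operatorname{argmin}_x f_i(x) + (\rho/2)\|x - v\|_2^2$ for the local objective component), and the termination check is omitted for brevity.

```python
import numpy as np
from mpi4py import MPI

def consensus_admm_mpi(local_x_update, n, rho=1.0, iters=100):
    """SPMD consensus ADMM: each MPI process holds x_i, u_i, and a copy of z."""
    comm = MPI.COMM_WORLD
    N = comm.Get_size()
    x, u, z = np.zeros(n), np.zeros(n), np.zeros(n)
    w = np.empty(n)
    for _ in range(iters):
        x = local_x_update(z - u, rho)           # local x_i-update
        comm.Allreduce(x + u, w, op=MPI.SUM)     # w = sum_i (x_i + u_i) on every process
        z = w / N                                 # z-update (plain averaging; no g term)
        u = u + x - z                             # local u_i-update
    return x, z
```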
10.4 MapReduce
MapReduce [46] is a popular programming model for distributed batch
processing of very large datasets. It has been widely used in indus-
try and academia, and its adoption has been bolstered by the open
source project Hadoop, inexpensive cloud computing services avail-
able through Amazon, and enterprise products and services offered
by Cloudera. MapReduce libraries are available in many languages,
including Java, C++, and Python, among many others, though Java
is the primary language for Hadoop. Though it is awkward to express
ADMM in MapReduce, the amount of cloud infrastructure available
for MapReduce computing can make it convenient to use in practice,
especially for large problems. We briefly review some key features of
Hadoop below; see [170] for general background.
A MapReduce computation consists of a set of Map tasks, which
process subsets of the input data in parallel, followed by a Reduce task,
which combines the results of the Map tasks. Both the Map and Reduce
functions are specified by the user and operate on key-value pairs. The
Map function performs the transformation $(k, v) \mapsto [(k_1, v_1), (k_2, v_2), \ldots]$, i.e., it maps a key-value pair to a list of intermediate key-value pairs, and the Reduce function combines the list of values associated with a given intermediate key into a single output value.

In our setting, the central collector works with the unnormalized sum $\hat z = \sum_{i=1}^N(x_i + u_i)$ rather than $z$ or $\tilde z$ because summation is associative while averaging is not. We assume $N$ is known (or, alternatively, the Reducer can compute the sum $\sum_{i=1}^N 1$). We have $N$ Mappers, one for each subsystem, and each Mapper updates $u_i$ and $x_i$ using the $\hat z$ from the previous iteration.
the proximal step to compute z, but this is usually a cheap opera-
tion like soft thresholding. It emits an intermediate key-value pair that
essentially serves as a message to the central collector. There is a sin-
gle Reducer, playing the role of a central collector, and its incoming
values are the messages from the Mappers. The updated records are
MPI case. (The wrapper checks the termination criteria instead of the
Reducer because they are not associative to check.)
The main difficulty is that MapReduce tasks are not designed to
be iterative and do not preserve state in the Mappers across iter-
ations, so implementing an iterative algorithm like ADMM requires
some understanding of the underlying infrastructure. Hadoop con-
tains a number of components supporting large-scale, fault-tolerant
distributed computing applications. The relevant components here are
HDFS, a distributed file system based on Google’s GFS [85], and
HBase, a distributed database based on Google’s BigTable [32].
HDFS is a distributed filesystem, meaning that it manages the
storage of data across an entire cluster of machines. It is designed for
situations where a typical file may be gigabytes or terabytes in size and
high-speed streaming read access is required. The base units of storage
in HDFS are blocks, which are 64 MB to 128 MB in size in a typi-
cal configuration. Files stored on HDFS are composed of blocks; each
block is stored on a particular machine (though for redundancy, there
are replicas of each block on multiple machines), but different blocks in
the same file need not be stored on the same machine or even nearby.
For this reason, any task that processes data stored on HDFS (e.g., the
local datasets $D_i$) should process a single block of data at a time, since
a block is guaranteed to reside wholly on one machine; otherwise, one
may cause unnecessary network transfer of data.
In general, the input to each Map task is data stored on HDFS, and
Mappers cannot access local disk directly or perform any stateful com-
putation. The scheduler runs each Mapper as close to its input data
as possible, ideally on the same node, in order to minimize network
transfer of data. To help preserve data locality, each Map task should
also be assigned around a block’s worth of data. Note that this is very
different from the implementation presented for MPI, where each pro-
cess can be told to pick up the local data on whatever machine it is
running on.
Since each Mapper only handles a single block of data, there will
usually be a number of Mappers running on the same machine. To
reduce the amount of data transferred over the network, Hadoop sup-
ports the use of combiners, which essentially Reduce the results of all
the Map tasks on a given node so only one set of intermediate key-
value pairs need to be transferred across machines for the final Reduce
task. In other words, the Reduce step should be viewed as a two-step
process: First, the results of all the Mappers on each individual node
are reduced with Combiners, and then the records across each machine
are Reduced. This is a major reason why the Reduce function must be
commutative and associative.
Since the input value to a Mapper is a block of data, we also need a
mechanism for a Mapper to read in local variables, and for the Reducer
to store the updated variables for the next iteration. Here, we use
HBase, a distributed database built on top of HDFS that provides
fast random read-write access. HBase, like BigTable, provides a dis-
tributed multi-dimensional sorted map. The map is indexed by a row
key, a column key, and a timestamp. Each cell in an HBase table can
contain multiple versions of the same data indexed by timestamp; in
our case, we can use the iteration counts as the timestamps to store
and access data from previous iterations; this is useful for checking ter-
mination criteria, for example. The row keys in a table are strings, and
HBase maintains data in lexicographic order by row key. This means
that rows with lexicographically adjacent keys will be stored on the
same machine or nearby. In our case, variables should be stored with
the subsystem identifier at the beginning of the row key, so information for
the same subsystem is stored together and is efficient to access. For
more details, see [32, 170].
The discussion and pseudocode above omits and glosses over many
details for simplicity of exposition. MapReduce frameworks like Hadoop
also support much more sophisticated implementations, which may
be necessary for very large scale problems. For example, if there
are too many values for a single Reducer to handle, we can use an
approach analogous to the one suggested for MPI: Mappers emit pairs
to ‘regional’ reduce jobs, and then an additional MapReduce step is car-
ried out that uses an identity mapper and aggregates regional results
86 Implementation
into a global result. In this section, our goal is merely to give a gen-
eral flavor of some of the issues involved in implementing ADMM in
a MapReduce framework, and we refer to [46, 170, 111] for further
details. There has also been some recent work on alternative MapRe-
duce systems that are specifically designed for iterative computation,
which are likely better suited for ADMM [25, 179], though the imple-
mentations are less mature and less widely available. See [37, 93] for
examples of recent papers discussing machine learning and optimiza-
tion in MapReduce frameworks.
11 Numerical Examples
Fig. 11.1. Norms of primal residual (top) and dual residual (bottom) versus iteration, for a lasso problem. The dashed lines show $\epsilon^{\mathrm{pri}}$ (top) and $\epsilon^{\mathrm{dual}}$ (bottom).
Fig. 11.2. Objective suboptimality versus iteration for a lasso problem. The stopping crite-
rion is satisfied at iteration 15, indicated by the vertical dashed line.
Since $A$ is fat (i.e., $m < n$), we apply the matrix inversion lemma to $(A^TA + \rho I)^{-1}$ and instead compute the factorization of the smaller matrix $I + (1/\rho)AA^T$, which is then cached for subsequent $x$-updates. The factor step itself takes about $nm^2 + (1/3)m^3$ flops, which is the cost of forming $AA^T$ and computing the Cholesky factorization. Subsequent updates require two matrix-vector multiplications and forward-backward solves, which require approximately $4mn + 2m^2$ flops. (The
cost of the soft thresholding step in the z-update is negligible.) For these
problem dimensions, the flop count analysis suggests a factor/solve
ratio of around 350, which means that 350 subsequent ADMM itera-
tions can be carried out for the cost of the initial factorization.
In our basic implementation, the factorization step takes about 1
second, and subsequent x-updates take around 30 ms. (This gives a fac-
tor/solve ratio of only 33, less than predicted, due to a particularly effi-
cient matrix-matrix multiplication routine used in Matlab.) Thus the
total cost of solving an entire lasso problem is around 1.5 seconds—only
50% more than the initial factorization. In terms of parameter estima-
tion, we can say that computing the lasso estimate requires only about 50% more effort than computing the ridge regression ($\ell_2$-regularized least squares) estimate.
Fig. 11.3. Iterations needed versus λ for warm start (solid line) and cold start (dashed line).
Fig. 11.4. Progress of primal and dual residual norm for the distributed $\ell_1$-regularized logistic regression problem. The dashed lines show $\epsilon^{\mathrm{pri}}$ (top) and $\epsilon^{\mathrm{dual}}$ (bottom).
Figure 11.5 shows the suboptimality $\tilde p^k - p^\star$ for the consensus variable, where

$$\tilde p^k = \sum_{i=1}^m \log\left(1 + \exp\left(-b_i(a_i^Tw^k + v^k)\right)\right) + \lambda\|w^k\|_1.$$
11.3 Group Lasso with Feature Splitting
Fig. 11.6. Norms of primal residual (top) and dual residual (bottom) versus iteration, for the distributed group lasso problem. The dashed lines show $\epsilon^{\mathrm{pri}}$ (top) and $\epsilon^{\mathrm{dual}}$ (bottom).
Fig. 11.7. Suboptimality of distributed group lasso versus iteration. The stopping criterion
is satisfied at iteration 47, indicated by the vertical dashed line.
11.4 Distributed Large-Scale Lasso with MPI

Note that the overall problem has a skinny coefficient matrix, while each of the subproblems has a fat coefficient matrix. We emphasize that the
coefficient matrix is dense, so the full dataset requires over 30 GB to
store and has 3.2 billion nonzero entries in the total coefficient matrix
A. This is far too large to be solved efficiently, or at all, using standard
serial methods on commonly available hardware.
We solved the problem using a cluster of 10 machines. We used
Cluster Compute instances, which have 23 GB of RAM, two quad-core
Intel Xeon X5570 ‘Nehalem’ chips, and are connected to each other
with 10 Gigabit Ethernet. We used hardware virtual machine images
running CentOS 5.4. Since each node had 8 cores, we ran the code
with 80 processes, so each subsystem ran on its own core. In MPI,
communication between processes on the same machine is performed
locally via the shared-memory Byte Transfer Layer (BTL), which pro-
vides low latency and high bandwidth communication, while commu-
nication across machines goes over the network. The data was sized
so all the processes on a single machine could work entirely in RAM.
Each node had its own attached Elastic Block Storage (EBS) volume
that contained only the local data relevant to that machine, so disk
throughput was shared among processes on the same machine but not
across machines. This is to emulate a scenario where each machine is
only processing the data on its local disk, and none of the dataset is
transferred over the network. We emphasize that usage of a cluster set
up in this fashion costs under $20 per hour.
We solved the problem with a deliberately naive implementation of
the algorithm, based directly on the discussion of §6.4, §8.2, and §10.2.
The implementation consists of a single file of C code, under 400 lines
despite extensive comments. The linear algebra (BLAS operations and
the Cholesky factorization) were performed using a stock installation
of the GNU Scientific Library.
We now report the breakdown of the wall-clock runtime. It took
roughly 30 seconds to load all the data into memory. It then took
4-5 minutes to form and then compute the Cholesky factorizations of
$I + (1/\rho)A_iA_i^T$. After caching these factorizations, it then took 0.5-2
seconds for each subsequent ADMM iteration. This includes the back-
solves in the xi -updates and all the message passing. For this problem,
Fig. 11.8. Fit versus cardinality for the lasso (dotted line), lasso with posterior least squares
fit (dashed line), and regressor selection (solid line).
The x-update step has exactly the same expression as in the lasso
example, so we use the same method, based on the matrix inversion
lemma and caching, described in that example. The z-update step con-
sists of keeping the c largest magnitude components of x + u and zero-
ing the rest. For the sake of clarity, we performed an intermediate
sorting of the components, but more efficient schemes are possible. In
any case, the cost of the z-update is negligible compared with that of
the x-update.
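For concreteness, a minimal sketch of this keep-the-$c$-largest $z$-update (using a full sort, as in the simple approach mentioned above):

```python
import numpy as np

def keep_largest(v, c):
    """z-update for regressor selection: keep the c largest-magnitude entries of v."""
    z = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-c:]   # indices of the c largest |v_i|
    z[idx] = v[idx]
    return z
```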
Convergence of ADMM for a nonconvex problem such as this one
is not guaranteed; and even when it does converge, the final result can
depend on the choice of ρ and the initial values for z and u. To explore
this, we ran 100 ADMM simulations with randomly chosen initial val-
ues and ρ ranging between 0.1 and 100. Indeed, some of them did not
converge, or at least, were converging slowly. But most of them con-
verged, though not to exactly the same points. However, the objective
values obtained by those that converged were reasonably close to each
other, typically within 5%. The different values of x found had small
Acknowledgments
We are very grateful to Rob Tibshirani and Trevor Hastie for encour-
aging us to write this review. Thanks also to Alexis Battle, Dimitri
Bertsekas, Danny Bickson, Tom Goldstein, Dimitri Gorinevsky, Daphne
Koller, Vicente Malave, Stephen Oakley, and Alex Teichman for help-
ful comments and discussions. Yang Wang and Matt Kraning helped in
developing ADMM for the sharing and exchange problems, and Arezou
Keshavarz helped work out ADMM for generalized additive models. We
thank Georgios Giannakis and Alejandro Ribeiro for pointing out some
very relevant references that we had missed in an earlier version. We
thank John Duchi for a very careful reading of the manuscript and for
suggestions that greatly improved it.
Support for this work was provided in part by AFOSR grant
FA9550-09-0130 and NASA grant NNX07AEIIA. Neal Parikh was sup-
ported by the Cortlandt and Jean E. Van Rensselaer Engineering
Fellowship from Stanford University and by the National Science Foun-
dation Graduate Research Fellowship under Grant No. DGE-0645962.
Eric Chu was supported by the Pan Wen-Yuan Foundation Scholarship.
A Convergence Proof
The basic convergence result given in §3.2 can be found in several ref-
erences, such as [81, 63]. Many of these give more sophisticated results,
with more general penalties or inexact minimization. For completeness,
we give a proof here.
We will show that if $f$ and $g$ are closed, proper, and convex, and the Lagrangian $L_0$ has a saddle point, then we have primal residual convergence, meaning that $r^k \to 0$, and objective convergence, meaning that $p^k \to p^\star$, where $p^k = f(x^k) + g(z^k)$. We will also see that the dual residual $s^k = \rho A^TB(z^k - z^{k-1})$ converges to zero.

Let $(x^\star, z^\star, y^\star)$ be a saddle point for $L_0$, and define
$$p^\star \leq p^{k+1} + y^{\star T}r^{k+1},$$
(Here we use the basic fact that the subdifferential of the sum of a
subdifferentiable function and a differentiable function with domain
$\mathbf{R}^n$ is the sum of the subdifferential and the gradient; see, e.g., [140,
§23].)
Since $y^{k+1} = y^k + \rho r^{k+1}$, we can plug in $y^k = y^{k+1} - \rho r^{k+1}$ and rearrange to obtain
and that
and substituting $r^{k+1} = (1/\rho)(y^{k+1} - y^k)$ in the first two terms gives

$$\rho\|r^{k+1}\|_2^2 - 2\rho\big(B(z^{k+1} - z^k)\big)^Tr^{k+1} + 2\rho\big(B(z^{k+1} - z^k)\big)^T\big(B(z^{k+1} - z^\star)\big).$$

Using

$$z^{k+1} - z^\star = (z^{k+1} - z^k) + (z^k - z^\star), \qquad z^{k+1} - z^k = (z^{k+1} - z^\star) - (z^k - z^\star),$$
With the previous step, this implies that (A.4) can be written as
and
to get that
References
[26] R. H. Byrd, P. Lu, and J. Nocedal, “A Limited Memory Algorithm for Bound
Constrained Optimization,” SIAM Journal on Scientific and Statistical Com-
puting, vol. 16, no. 5, pp. 1190–1208, 1995.
[27] E. J. Candès and Y. Plan, “Near-ideal model selection by $\ell_1$ minimization,”
Annals of Statistics, vol. 37, no. 5A, pp. 2145–2177, 2009.
[28] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact
signal reconstruction from highly incomplete frequency information,” IEEE
Transactions on Information Theory, vol. 52, no. 2, p. 489, 2006.
[29] E. J. Candès and T. Tao, “Near-optimal signal recovery from random pro-
jections: Universal encoding strategies?,” IEEE Transactions on Information
Theory, vol. 52, no. 12, pp. 5406–5425, 2006.
[30] Y. Censor and S. A. Zenios, “Proximal minimization algorithm with D-
functions,” Journal of Optimization Theory and Applications, vol. 73, no. 3,
pp. 451–464, 1992.
[31] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms, and
Applications. Oxford University Press, 1997.
[32] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows,
T. Chandra, A. Fikes, and R. E. Gruber, “BigTable: A distributed storage
system for structured data,” ACM Transactions on Computer Systems, vol. 26,
no. 2, pp. 1–26, 2008.
[33] G. Chen and M. Teboulle, “A proximal-based decomposition method for con-
vex minimization problems,” Mathematical Programming, vol. 64, pp. 81–101,
1994.
[34] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by
basis pursuit,” SIAM Review, vol. 43, pp. 129–159, 2001.
[35] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, “Algo-
rithm 887: CHOLMOD, supernodal sparse Cholesky factorization and
update/downdate,” ACM Transactions on Mathematical Software, vol. 35,
no. 3, p. 22, 2008.
[36] W. Cheney and A. A. Goldstein, “Proximity maps for convex sets,” Proceed-
ings of the American Mathematical Society, vol. 10, no. 3, pp. 448–450, 1959.
[37] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Y. Ng, and
K. Olukotun, “MapReduce for machine learning on multicore,” in Advances
in Neural Information Processing Systems, 2007.
[38] J. F. Claerbout and F. Muir, “Robust modeling with erratic data,” Geophysics,
vol. 38, p. 826, 1973.
[39] P. L. Combettes, “The convex feasibility problem in image recovery,” Advances
in Imaging and Electron Physics, vol. 95, pp. 155–270, 1996.
[40] P. L. Combettes and J. C. Pesquet, “A Douglas-Rachford splitting approach
to nonsmooth convex variational signal recovery,” IEEE Journal on Selected
Topics in Signal Processing, vol. 1, no. 4, pp. 564–574, 2007.
[41] P. L. Combettes and J. C. Pesquet, “Proximal Splitting Methods in Signal
Processing,” arXiv:0912.3522, 2009.
[42] P. L. Combettes and V. R. Wajs, “Signal recovery by proximal forward-
backward splitting,” Multiscale Modeling and Simulation, vol. 4, no. 4,
pp. 1168–1200, 2006.
[96] B. S. He, H. Yang, and S. L. Wang, “Alternating direction method with self-
adaptive penalty parameters for monotone variational inequalities,” Journal
of Optimization Theory and Applications, vol. 106, no. 2, pp. 337–356, 2000.
[97] M. R. Hestenes, “Multiplier and gradient methods,” Journal of Optimization
Theory and Applications, vol. 4, pp. 302–320, 1969.
[98] M. R. Hestenes, “Multiplier and gradient methods,” in Computing Methods
in Optimization Problems, (L. A. Zadeh, L. W. Neustadt, and A. V. Balakr-
ishnan, eds.), Academic Press, 1969.
[99] J.-B. Hiriart-Urruty and C. Lemaréchal, Fundamentals of Convex Analysis.
Springer, 2001.
[100] P. J. Huber, “Robust estimation of a location parameter,” Annals of Mathe-
matical Statistics, vol. 35, pp. 73–101, 1964.
[101] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky, “$\ell_1$ trend filtering,” SIAM
Review, vol. 51, no. 2, pp. 339–360, 2009.
[102] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, “An interior-point
method for large-scale $\ell_1$-regularized least squares,” IEEE Journal of Selected
Topics in Signal Processing, vol. 1, no. 4, pp. 606–617, 2007.
[103] K. Koh, S.-J. Kim, and S. Boyd, “An interior-point method for large-scale $\ell_1$-
regularized logistic regression,” Journal of Machine Learning Research, vol. 1,
no. 8, pp. 1519–1555, 2007.
[104] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and
Techniques. MIT Press, 2009.
[105] S. A. Kontogiorgis, Alternating directions methods for the parallel solution of
large-scale block-structured optimization problems. PhD thesis, University of
Wisconsin-Madison, 1994.
[106] S. A. Kontogiorgis and R. R. Meyer, “A variable-penalty alternating direc-
tions method for convex optimization,” Mathematical Programming, vol. 83,
pp. 29–53, 1998.
[107] L. S. Lasdon, Optimization Theory for Large Systems. MacMillan, 1970.
[108] J. Lawrence and J. E. Spingarn, “On fixed points of non-expansive piecewise
isometric mappings,” Proceedings of the London Mathematical Society, vol. 3,
no. 3, p. 605, 1987.
[109] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic linear
algebra subprograms for Fortran usage,” ACM Transactions on Mathematical
Software, vol. 5, no. 3, pp. 308–323, 1979.
[110] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factoriza-
tion,” Advances in Neural Information Processing Systems, vol. 13, 2001.
[111] J. Lin and M. Schatz, “Design Patterns for Efficient Graph Algorithms in
MapReduce,” in Proceedings of the Eighth Workshop on Mining and Learning
with Graphs, pp. 78–85, 2010.
[112] P. L. Lions and B. Mercier, “Splitting algorithms for the sum of two nonlinear
operators,” SIAM Journal on Numerical Analysis, vol. 16, pp. 964–979, 1979.
[113] D. C. Liu and J. Nocedal, “On the Limited Memory Method for Large Scale
Optimization,” Mathematical Programming B, vol. 45, no. 3, pp. 503–528,
1989.