
Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions

John Duchi  JDUCHI@CS.STANFORD.EDU
Google, Mountain View, CA 94043

Shai Shalev-Shwartz  SHAI@TTI-C.ORG
Toyota Technological Institute, Chicago, IL 60637

Yoram Singer  SINGER@GOOGLE.COM
Tushar Chandra  TUSHAR@GOOGLE.COM
Google, Mountain View, CA 94043

Abstract

We describe efficient algorithms for projecting a vector onto the ℓ1-ball. We present two methods for projection. The first performs exact projection in O(n) expected time, where n is the dimension of the space. The second works on vectors k of whose elements are perturbed outside the ℓ1-ball, projecting in O(k log(n)) time. This setting is especially useful for online learning in sparse feature spaces such as text categorization applications. We demonstrate the merits and effectiveness of our algorithms in numerous batch and online learning tasks. We show that variants of stochastic gradient projection methods augmented with our efficient projection procedures outperform interior point methods, which are considered state-of-the-art optimization techniques. We also show that in online settings gradient updates with ℓ1 projections outperform the exponentiated gradient algorithm while obtaining models with high degrees of sparsity.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

A prevalent machine learning approach for decision and prediction problems is to cast the learning task as penalized convex optimization. In penalized convex optimization we seek a set of parameters, gathered together in a vector w, which minimizes a convex objective function in w with an additional penalty term that assesses the complexity of w. Two commonly used penalties are the 1-norm and the square of the 2-norm of w. An alternative but mathematically equivalent approach is to cast the problem as a constrained optimization problem. In this setting we seek a minimizer of the objective function while constraining the solution to have a bounded norm. Many recent advances in statistical machine learning and related fields can be explained as convex optimization subject to a 1-norm constraint on the vector of parameters w. Imposing an ℓ1 constraint leads to notable benefits. First, it encourages sparse solutions, i.e., a solution for which many components of w are zero. When the original dimension of w is very high, a sparse solution enables easier interpretation of the problem in a lower dimensional space. For the usage of ℓ1-based approaches in statistical machine learning see for example (Tibshirani, 1996) and the references therein. Donoho (2006b) provided sufficient conditions for obtaining an optimal ℓ1-norm solution which is sparse. Recent work on compressed sensing (Candes, 2006; Donoho, 2006a) further explores how ℓ1 constraints can be used for recovering a sparse signal sampled below the Nyquist rate. The second motivation for using ℓ1 constraints in machine learning problems is that in some cases they lead to improved generalization bounds. For example, Ng (2004) examined the task of PAC learning a sparse predictor and analyzed cases in which an ℓ1 constraint results in better solutions than an ℓ2 constraint.

In this paper we re-examine the task of minimizing a convex function subject to an ℓ1 constraint on the norm of the solution. We are particularly interested in cases where the convex function is the average loss over a training set of m examples, where each example is represented as a vector of high dimension. Thus, the solution itself is a high-dimensional vector as well. Recent work on ℓ2-constrained optimization for machine learning indicates that gradient-related projection algorithms are more efficient in approaching a solution of good generalization than second-order algorithms when the number of examples and the dimension are large.
For instance, Shalev-Shwartz et al. (2007) give recent state-of-the-art methods for solving large scale support vector machines. Adapting these recent results to projection methods onto the ℓ1 ball poses algorithmic challenges. While projections onto ℓ2 balls are straightforward to implement in linear time with the appropriate data structures, projection onto an ℓ1 ball is a more involved task. The main contribution of this paper is the derivation of gradient projections with ℓ1 domain constraints that can be performed almost as fast as gradient projection with ℓ2 constraints.

Our starting point is an efficient method for projection onto the probabilistic simplex. The basic idea is to show that, after sorting the vector we need to project, it is possible to calculate the projection exactly in linear time. This idea was rediscovered multiple times. It was first described in an abstract and somewhat opaque form in the work of Gafni and Bertsekas (1984) and Bertsekas (1999). Crammer and Singer (2002) rediscovered a similar projection algorithm as a tool for solving the dual of multiclass SVM. Hazan (2006) essentially reuses the same algorithm in the context of online convex programming. Our starting point is another derivation of Euclidean projection onto the simplex that paves the way to a few generalizations. First we show that the same technique can also be used for projecting onto the ℓ1-ball. This algorithm is based on sorting the components of the vector to be projected and thus requires O(n log(n)) time. We next present an improvement of the algorithm that replaces sorting with a procedure resembling median-search whose expected time complexity is O(n). In many applications, however, the dimension of the feature space is very high yet the number of features which attain non-zero values for an example may be very small. For instance, in our experiments with text classification in Sec. 7, the dimension is two million (the bigram dictionary size) while each example has on average one thousand non-zero features (the number of unique tokens in a document). Applications where the dimensionality is high yet the number of "on" features in each example is small render our second algorithm useless in some cases. We therefore shift gears and describe a more complex algorithm that employs red-black trees to obtain a linear dependence on the number of non-zero features in an example and only logarithmic dependence on the full dimension. The key to our construction lies in the fact that we project vectors that are the sum of a vector in the ℓ1-ball and a sparse vector; they are "almost" in the ℓ1-ball.

We conclude the paper with experimental results that demonstrate the merits of our algorithms. We compare our algorithms with several specialized interior point (IP) methods as well as general methods from the literature for solving ℓ1-penalized problems on both synthetic and real data (the MNIST handwritten digit dataset and the Reuters RCV1 corpus) for batch and online learning. Our projection-based methods outperform competing algorithms in terms of sparsity, and they exhibit faster convergence and lower regret than previous methods.

2. Notation and Problem Setting

We start by establishing the notation used throughout the paper. The set of integers 1 through n is denoted by [n]. Scalars are denoted by lower case letters and vectors by lower case bold face letters. We use the notation w ≻ b to designate that all of the components of w are greater than b. We use ‖·‖ as a shorthand for the Euclidean norm ‖·‖₂. The other norm we use throughout the paper is the 1-norm of the vector, ‖v‖₁ = Σᵢ₌₁ⁿ |vᵢ|. Lastly, we consider order statistics and sorting vectors frequently throughout this paper. To that end, we let v(i) denote the ith order statistic of v, that is, v(1) ≥ v(2) ≥ ... ≥ v(n) for v ∈ Rⁿ.

In the setting considered in this paper we are provided with a convex function L : Rⁿ → R. Our goal is to find the minimum of L(w) subject to an ℓ1-norm constraint on w. Formally, the problem we need to solve is

  \min_{w} L(w) \quad \text{s.t.} \quad \|w\|_1 \le z . \qquad (1)

Our focus is on variants of the projected subgradient method for convex optimization (Bertsekas, 1999). Projected subgradient methods minimize a function L(w) subject to the constraint that w ∈ X, for X convex, by generating the sequence {w(t)} via

  w^{(t+1)} = \Pi_X\bigl( w^{(t)} - \eta_t \nabla^{(t)} \bigr) \qquad (2)

where ∇(t) is (an unbiased estimate of) the (sub)gradient of L at w(t) and Π_X(x) = argmin_y { ‖x − y‖ : y ∈ X } is the Euclidean projection of x onto X. In the rest of the paper, the main algorithmic focus is on the projection step (computing an unbiased estimate of the gradient of L(w) is straightforward in the applications considered in this paper, as is the modification of w(t) by ∇(t)).
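As a concrete reading of Eq. (2), the following Python/NumPy sketch (ours, not code from the paper) iterates the generic projected subgradient method; grad_L and project are placeholders for the (sub)gradient oracle and the Euclidean projection onto X, and making that projection fast when X = {w : ‖w‖₁ ≤ z} is the subject of the remaining sections.

import numpy as np

def projected_subgradient(grad_L, project, w0, eta0, num_iters):
    """Generic projected (sub)gradient method, cf. Eq. (2).

    grad_L(w)  -- a (sub)gradient, or an unbiased estimate of it, of L at w
    project(x) -- Euclidean projection of x onto the feasible set X
    """
    w = w0.copy()
    for t in range(1, num_iters + 1):
        g = grad_L(w)
        eta = eta0 / np.sqrt(t)      # the eta_t = eta_0 / sqrt(t) schedule used later in Sec. 7
        w = project(w - eta * g)     # w(t+1) = Pi_X(w(t) - eta_t * grad(t))
    return w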
3. Euclidean Projection onto the Simplex

For clarity, we begin with the task of performing Euclidean projection onto the positive simplex; our derivation naturally builds to the more efficient algorithms. As such, the most basic projection task we consider can be formally described as the following optimization problem,

  \min_{w} \ \tfrac{1}{2}\|w - v\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} w_i = z, \ w_i \ge 0 . \qquad (3)

When z = 1 the above is projection onto the probabilistic simplex. The Lagrangian of the problem in Eq. (3) is

  \mathcal{L}(w, \theta, \zeta) = \tfrac{1}{2}\|w - v\|^2 + \theta\Bigl(\sum_{i=1}^{n} w_i - z\Bigr) - \zeta \cdot w ,

where θ ∈ R is a Lagrange multiplier and ζ ∈ Rⁿ₊ is a vector of non-negative Lagrange multipliers. Differentiating with respect to wᵢ and comparing to zero gives the optimality condition dL/dwᵢ = wᵢ − vᵢ + θ − ζᵢ = 0.

The complementary slackness KKT condition implies that whenever wᵢ > 0 we must have that ζᵢ = 0. Thus, if wᵢ > 0 we get that

  w_i = v_i - \theta + \zeta_i = v_i - \theta . \qquad (4)

All the non-negative elements of the vector w are tied via a single variable, so knowing the indices of these elements gives a much simpler problem. Upon first inspection, finding these indices seems difficult, but the following lemma (Shalev-Shwartz & Singer, 2006) provides a key tool in deriving our procedure for identifying non-zero elements.

Lemma 1. Let w be the optimal solution to the minimization problem in Eq. (3). Let s and j be two indices such that v_s > v_j. If w_s = 0 then w_j must be zero as well.

Denoting by I the set of indices of the non-zero components of the sorted optimal solution, I = {i ∈ [n] : v(i) > 0}, we see that Lemma 1 implies that I = [ρ] for some 1 ≤ ρ ≤ n. Had we known ρ we could have simply used Eq. (4) to obtain that

  \sum_{i=1}^{n} w_i = \sum_{i=1}^{n} w_{(i)} = \sum_{i=1}^{\rho} w_{(i)} = \sum_{i=1}^{\rho} \bigl(v_{(i)} - \theta\bigr) = z

and therefore

  \theta = \frac{1}{\rho}\Bigl(\sum_{i=1}^{\rho} v_{(i)} - z\Bigr) . \qquad (5)

Given θ we can characterize the optimal solution for w as

  w_i = \max\{ v_i - \theta , 0 \} . \qquad (6)

We are left with the problem of finding the optimal ρ, and the following lemma (Shalev-Shwartz & Singer, 2006) provides a simple solution once we sort v in descending order.

Lemma 2. Let w be the optimal solution to the minimization problem given in Eq. (3). Let µ denote the vector obtained by sorting v in a descending order. Then, the number of strictly positive elements in w is

  \rho(z, \mu) = \max\Bigl\{ j \in [n] : \mu_j - \frac{1}{j}\Bigl(\sum_{r=1}^{j} \mu_r - z\Bigr) > 0 \Bigr\} .

The pseudo-code describing the O(n log n) procedure for solving Eq. (3) is given in Fig. 1.

INPUT: A vector v ∈ Rⁿ and a scalar z > 0
  Sort v into µ: µ1 ≥ µ2 ≥ ... ≥ µn
  Find ρ = max{ j ∈ [n] : µj − (1/j)(Σ_{r=1}^{j} µr − z) > 0 }
  Define θ = (1/ρ)(Σ_{i=1}^{ρ} µi − z)
OUTPUT: w s.t. wi = max{vi − θ, 0}

Figure 1. Algorithm for projection onto the simplex.

4. Euclidean Projection onto the ℓ1-Ball

We next modify the algorithm to handle the more general ℓ1-norm constraint, which gives the minimization problem

  \min_{w \in \mathbb{R}^n} \|w - v\|_2^2 \quad \text{s.t.} \quad \|w\|_1 \le z . \qquad (7)

We do so by presenting a reduction to the problem of projecting onto the simplex given in Eq. (3). First, we note that if ‖v‖₁ ≤ z then the solution of Eq. (7) is w = v. Therefore, from now on we assume that ‖v‖₁ > z. In this case, the optimal solution must be on the boundary of the constraint set and thus we can replace the inequality constraint ‖w‖₁ ≤ z with an equality constraint ‖w‖₁ = z. Having done so, the sole difference between the problem in Eq. (7) and the one in Eq. (3) is that in the latter we have an additional set of constraints, w ≥ 0. The following lemma indicates that each non-zero component of the optimal solution w shares the sign of its counterpart in v.

Lemma 3. Let w be an optimal solution of Eq. (7). Then, for all i, wᵢvᵢ ≥ 0.

Proof. Assume by contradiction that the claim does not hold. Thus, there exists i for which wᵢvᵢ < 0. Let ŵ be a vector such that ŵᵢ = 0 and for all j ≠ i we have ŵⱼ = wⱼ. Therefore, ‖ŵ‖₁ = ‖w‖₁ − |wᵢ| ≤ z and hence ŵ is a feasible solution. In addition,

  \|w - v\|_2^2 - \|\hat w - v\|_2^2 = (w_i - v_i)^2 - (0 - v_i)^2 = w_i^2 - 2 w_i v_i > w_i^2 > 0 .

We thus constructed a feasible solution ŵ which attains an objective value smaller than that of w. This leads us to the desired contradiction.

Based on the above lemma and the symmetry of the objective, we are ready to present our reduction. Let u be a vector obtained by taking the absolute value of each component of v, uᵢ = |vᵢ|. We now replace Eq. (7) with

  \min_{\beta \in \mathbb{R}^n} \|\beta - u\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le z \ \text{ and } \ \beta \ge 0 . \qquad (8)

Once we obtain the solution for the problem above we construct the optimal solution of Eq. (7) by setting wᵢ = sign(vᵢ) βᵢ.
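Fig. 1 and the sign reduction of Sec. 4 translate almost line-for-line into NumPy. The sketch below is ours (including the function names); it assumes z > 0 and, for the ℓ1-ball case, simply returns v whenever ‖v‖₁ ≤ z.

import numpy as np

def project_simplex(v, z=1.0):
    """Euclidean projection onto {w : sum(w) = z, w >= 0}, following Fig. 1."""
    n = v.shape[0]
    mu = np.sort(v)[::-1]                            # mu_1 >= mu_2 >= ... >= mu_n
    cssv = np.cumsum(mu) - z                         # running sums minus z
    j = np.arange(1, n + 1)
    rho = np.nonzero(mu - cssv / j > 0)[0][-1] + 1   # largest j with mu_j - (1/j)(sum_{r<=j} mu_r - z) > 0
    theta = cssv[rho - 1] / rho                      # Eq. (5)
    return np.maximum(v - theta, 0.0)                # Eq. (6)

def project_l1_ball(v, z=1.0):
    """Euclidean projection onto {w : ||w||_1 <= z} via the reduction of Sec. 4."""
    if np.abs(v).sum() <= z:
        return v.copy()                              # v is already feasible for Eq. (7)
    beta = project_simplex(np.abs(v), z)             # solve Eq. (8) with u = |v|
    return np.sign(v) * beta                         # w_i = sign(v_i) * beta_i

For example, project_simplex(np.array([0.5, 0.2, 0.9])) returns [0.3, 0.0, 0.7]: here ρ = 2 and θ = 0.2, and the projected vector sums to one with the smallest coordinate zeroed out.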
5. A Linear Time Projection Algorithm

In this section we describe a more efficient algorithm for performing projections. To keep our presentation simple and easy to follow, we describe the projection algorithm onto the simplex. The generalization to the ℓ1 ball can straightforwardly be incorporated into the efficient algorithm by the results from the previous section (we simply work in the algorithm with a vector of the absolute values of v, replacing the solution's components wᵢ with sign(vᵢ) · wᵢ).

For correctness of the following discussion, we add another component to v (the vector to be projected), which we set to 0, thus vₙ₊₁ = 0 and v(n+1) = 0. Let us start by examining again Lemma 2. The lemma implies that the index ρ is the largest integer that still satisfies v(ρ) − (1/ρ)(Σ_{r=1}^{ρ} v(r) − z) > 0. After routine algebraic manipulations the above can be rewritten in the following somewhat simpler form:

  \sum_{i=1}^{\rho} \bigl(v_{(i)} - v_{(\rho)}\bigr) < z \quad \text{and} \quad \sum_{i=1}^{\rho+1} \bigl(v_{(i)} - v_{(\rho+1)}\bigr) \ge z . \qquad (9)

Given ρ and v(ρ) we slightly rewrite the value θ as follows,

  \theta = \frac{1}{\rho}\Bigl(\sum_{j : v_j \ge v_{(\rho)}} v_j - z\Bigr) . \qquad (10)

The task of projection can thus be distilled to the task of finding θ, which in turn reduces to the task of finding ρ and the pivot element v(ρ). Our problem thus resembles the task of finding an order statistic with an additional complicating factor stemming from the need to compute summations (while searching) of the form given by Eq. (9). Our efficient projection algorithm is based on a modification of the randomized median finding algorithm (Cormen et al., 2001). The algorithm computes partial sums just-in-time and has expected linear time complexity.

The algorithm identifies ρ and the pivot value v(ρ) without sorting the vector v by using a divide and conquer procedure. The procedure works in rounds and on each round either eliminates elements shown to be strictly smaller than v(ρ) or updates the partial sum leading to Eq. (9). To do so the algorithm maintains a set of unprocessed elements of v. This set contains the components of v whose relationship to v(ρ) we do not know. We thus initially set U = [n]. On each round of the algorithm we pick at random an index k from the set U. Next, we partition the set U into two subsets G and L. G contains all the indices j ∈ U whose components vⱼ ≥ vₖ; L contains those j ∈ U such that vⱼ is smaller. We now face two cases related to the current summation of entries in v greater than the hypothesized v(ρ) (i.e. vₖ). If Σ_{j : vⱼ ≥ vₖ}(vⱼ − vₖ) < z then by Eq. (9), vₖ ≥ v(ρ). In this case we know that all the elements in G participate in the sum defining θ as given by Eq. (9). We can discard G and set U to be L as we still need to further identify the remaining elements in L. If Σ_{j : vⱼ ≥ vₖ}(vⱼ − vₖ) ≥ z then the same rationale implies that vₖ < v(ρ). Thus, all the elements in L are smaller than v(ρ) and can be discarded. In this case we can remove the set L and vₖ and set U to be G \ {k}. The entire process ends when U is empty.

Along the process we also keep track of the sum and the number of elements in v that we have found thus far to be no smaller than v(ρ), which is required in order not to recalculate partial sums. The pseudo-code describing the efficient projection algorithm is provided in Fig. 2. We keep the set of elements found to be greater than v(ρ) only implicitly. Formally, at each iteration of the algorithm we maintain a variable s, which is the sum of the elements in the set {vⱼ : j ∉ U, vⱼ ≥ v(ρ)}, and overload ρ to designate the cardinality of this set throughout the algorithm. Thus, when the algorithm exits its main while loop, ρ is the maximizer defined in Lemma 2. Once the while loop terminates, we are left with the task of calculating θ using Eq. (10) and performing the actual projection. Since Σ_{j : vⱼ ≥ µρ} vⱼ is readily available to us as the variable s, we simply set θ to be (s − z)/ρ and perform the projection as prescribed by Eq. (6).

INPUT: A vector v ∈ Rⁿ and a scalar z > 0
INITIALIZE: U = [n], s = 0, ρ = 0
WHILE U ≠ ∅
  PICK k ∈ U at random
  PARTITION U:
    G = {j ∈ U | vⱼ ≥ vₖ}
    L = {j ∈ U | vⱼ < vₖ}
  CALCULATE Δρ = |G| ; Δs = Σ_{j∈G} vⱼ
  IF (s + Δs) − (ρ + Δρ)vₖ < z
    s = s + Δs ; ρ = ρ + Δρ ; U ← L
  ELSE
    U ← G \ {k}
  ENDIF
SET θ = (s − z)/ρ
OUTPUT: w s.t. wᵢ = max{vᵢ − θ, 0}

Figure 2. Linear time projection onto the simplex.

Though omitted here for lack of space, we can also extend the algorithms to handle the more general constraint that Σᵢ aᵢ|wᵢ| ≤ z for aᵢ ≥ 0.
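The divide-and-conquer procedure of Fig. 2 is equally short in code. The sketch below is a direct transcription of the figure (the function name and the NumPy usage are ours); for the ℓ1-ball one would, exactly as in Sec. 4, run it on |v| and restore the signs afterwards.

import numpy as np

def project_simplex_linear(v, z=1.0, rng=np.random.default_rng()):
    """Expected linear-time projection onto the simplex, following Fig. 2."""
    U = list(range(v.shape[0]))    # indices whose relation to the pivot is still unknown
    s = 0.0                        # sum of elements already known to be >= the pivot
    rho = 0                        # number of such elements
    while U:
        k = U[int(rng.integers(len(U)))]              # pick a pivot candidate at random
        G = [j for j in U if v[j] >= v[k]]            # candidates no smaller than v_k
        L = [j for j in U if v[j] < v[k]]             # candidates smaller than v_k
        delta_rho = len(G)
        delta_s = float(np.sum(v[G]))
        if (s + delta_s) - (rho + delta_rho) * v[k] < z:
            s, rho, U = s + delta_s, rho + delta_rho, L   # keep G in the running sum, recurse on L
        else:
            U = [j for j in G if j != k]                  # discard L and the candidate k
    theta = (s - z) / rho
    return np.maximum(v - theta, 0.0)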
6. Efficient Projection for Sparse Gradients

Before we dive into developing a new algorithm, we remind the reader of the iterations the minimization algorithm takes from Eq. (2): we generate a sequence {w(t)} by iterating

  w^{(t+1)} = \Pi_W\bigl( w^{(t)} + g^{(t)} \bigr)

where g(t) = −ηₜ∇(t), W = {w : ‖w‖₁ ≤ z}, and Π_W is projection onto this set.

In many applications the dimension of the feature space is very high yet the number of features which attain a non-zero value for each example is very small (see for instance our experiments on text documents in Sec. 7). It is straightforward to implement the gradient-related updates in time which is proportional to the number of non-zero features, but the time complexity of the projection algorithm described in the previous section is linear in the dimension. Therefore, using the algorithm verbatim could be prohibitively expensive in applications where the dimension is high yet the number of features which are "on" in each example is small. In this section we describe a projection algorithm that updates the vector w(t) with g(t) and scales linearly in the number of non-zero entries of g(t) and only logarithmically in the total number of features (i.e. non-zeros in w(t)).

The first step in facilitating an efficient projection for sparse feature spaces is to represent the projected vector as a "raw" vector v by incorporating a global shift that is applied to each non-zero component. Specifically, each projection step amounts to deducting θ from each component of v and thresholding the result at zero. Let us denote by θₜ the shift value used on the tth iteration of the algorithm and by Θₜ the cumulative sum of the shift values, Θₜ = Σ_{s≤t} θₛ. The representation we employ enables us to perform the step in which we deduct θₜ from all the elements of the vector implicitly, adhering to the goal of performing a sublinear number of operations. As before, we assume that the goal is to project onto the simplex. Equipped with these variables, the jth component of the projected vector after t projected gradient steps can be written as max{vⱼ − Θₜ, 0}.

The second substantial modification to the core algorithm is to keep only the non-zero components of the weight vector in a red-black tree (Cormen et al., 2001). The red-black tree facilitates an efficient search for the pivot element (v(ρ)) in time which is logarithmic in the dimension, as we describe in the sequel. Once the pivot element is found we implicitly deduct θₜ from all the non-zero elements in our weight vector by updating Θₜ. We then remove all the components that are less than v(ρ) (i.e. less than Θₜ); this removal is efficient and requires only logarithmic time (Tarjan, 1983).

The course of the algorithm is as follows. After t projected gradient iterations we have a vector v(t) whose non-zero elements are stored in a red-black tree T and a global deduction value Θₜ which is applied to each non-zero component just-in-time, i.e. when needed. Therefore, each non-zero weight is accessed as vⱼ − Θₜ while T does not contain the zero elements of the vector. When updating v with a gradient, we modify the vector v(t) by adding to it the gradient-based vector g(t) with k non-zero components. This update is done using k deletions (removing vᵢ from T such that gᵢ(t) ≠ 0) followed by k re-insertions of vᵢ′ = vᵢ + gᵢ(t) into T, which takes O(k log(n)) time. Next we find in O(log(n)) time the value of θₜ. Fig. 3 contains the algorithm for this step; it is explained in the sequel. The last step removes all elements of the new raw vector v(t) + g(t) which become zero due to the projection. This step is discussed at the end of this section.

INPUT: A balanced tree T and a scalar z > 0
INITIALIZE: v⋆ = ∞, ρ⋆ = n + 1, s⋆ = z
CALL PIVOTSEARCH(root(T), 0, 0)
PROCEDURE PIVOTSEARCH(v, ρ, s)
  COMPUTE ρ̂ = ρ + r(v) ; ŝ = s + σ(v)
  IF ŝ < vρ̂ + z            // v ≥ pivot
    IF v⋆ > v
      v⋆ = v ; ρ⋆ = ρ̂ ; s⋆ = ŝ
    ENDIF
    IF leaf_T(v)
      RETURN θ = (s⋆ − z)/ρ⋆
    ENDIF
    CALL PIVOTSEARCH(left_T(v), ρ̂, ŝ)
  ELSE                     // v < pivot
    IF leaf_T(v)
      RETURN θ = (s⋆ − z)/ρ⋆
    ENDIF
    CALL PIVOTSEARCH(right_T(v), ρ, s)
  ENDIF
ENDPROCEDURE

Figure 3. Efficient search of pivot value for sparse feature spaces.
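To make the bookkeeping of this section concrete, here is a deliberately simplified Python stand-in (ours): it keeps the non-zero raw values in a dictionary and applies the cumulative shift Θ lazily, exactly as described above, but it finds each new shift by sorting the current non-zeros rather than by maintaining an augmented red-black tree. It therefore reproduces the semantics, not the O(k log(n)) running time, and it assumes, as in the online setting considered here, that the computed shift is non-negative.

class LazySparseProjector:
    """Simplified sketch of the lazily shifted representation of Sec. 6 (simplex case)."""

    def __init__(self, z):
        self.z = z          # radius of the simplex
        self.raw = {}       # index -> un-shifted value v_j
        self.Theta = 0.0    # cumulative shift, applied just-in-time

    def weight(self, j):
        """Current weight w_j = max(v_j - Theta, 0); absent indices are zero."""
        if j not in self.raw:
            return 0.0
        return max(self.raw[j] - self.Theta, 0.0)

    def update(self, sparse_step):
        """Fold a sparse step g = -eta * grad (dict index -> value) into the raw
        vector and re-project by choosing a new shift theta_t."""
        # k deletions / re-insertions of v_i' = v_i + g_i
        for j, g in sparse_step.items():
            self.raw[j] = self.raw.get(j, self.Theta) + g
        # pivot search over the shifted non-zeros (cf. Lemma 2 / Eq. (5))
        mu = sorted((x - self.Theta for x in self.raw.values()), reverse=True)
        s, theta = 0.0, 0.0
        for i, m in enumerate(mu, start=1):
            s += m
            if m - (s - self.z) / i > 0:
                theta = (s - self.z) / i
        if theta > 0.0:
            self.Theta += theta     # Theta_{t+1} = Theta_t + theta_t
            # discard entries whose weight has dropped to zero
            self.raw = {j: v for j, v in self.raw.items() if v > self.Theta}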
In contrast to a standard tree-based search procedure, to find θₜ we need to find a pair of consecutive values in v that correspond to v(ρ) and v(ρ+1). We do so by keeping track of the smallest element that satisfies the left hand side of Eq. (9) while searching based on the condition given on the right hand side of the same equation. T is keyed on the values of the un-shifted vector vₜ. Thus, all the children in the left (right) sub-tree of a node v represent values in vₜ which are smaller (larger) than v. In order to efficiently find θₜ we keep at each node the following information: (a) the value of the component, simply denoted as v; (b) the number of elements in the right sub-tree rooted at v, denoted r(v), including the node v; (c) the sum of the elements in the right sub-tree rooted at v, denoted σ(v), including the value v itself. Our goal is to identify the pivot element v(ρ) and its index ρ. In the previous section we described a simple condition for checking whether an element in v is greater or smaller than the pivot value. We now rewrite this expression yet one more time. A component with value v is not smaller than the pivot iff the following holds:

  \sum_{j : v_j \ge v} v_j \;<\; \bigl|\{\, j : v_j \ge v \,\}\bigr| \cdot v + z . \qquad (11)

The variables in the red-black tree form the infrastructure for performing efficient recursive computation of Eq. (11). Note also that the condition expressed in Eq. (11) still holds when we do not deduct Θₜ from all the elements in v.

The search algorithm maintains recursively the number ρ and the sum s of the elements that have been shown to be greater or equal to the pivot. We start the search with the root node of T, and thus initially ρ = 0 and s = 0. Upon entering a new node v, the algorithm checks whether the condition given by Eq. (11) holds for v. Since ρ and s were computed for the parent of v, we need to incorporate the number and the sum of the elements that are larger than v itself. By construction, these variables are r(v) and σ(v), which we store at the node v itself. We let ρ̂ = ρ + r(v) and ŝ = s + σ(v), and with these variables handy, Eq. (11) distills to the expression ŝ < vρ̂ + z. If the inequality holds, we know that v is either larger than the pivot or it may be the pivot itself. We thus update our current hypothesis for µρ and ρ (designated as v⋆ and ρ⋆ in Fig. 3). We continue searching the left sub-tree (left_T(v)), which includes all elements smaller than v. If the inequality ŝ < vρ̂ + z does not hold, we know that v < µρ, and we thus search the right subtree (right_T(v)) and keep ρ and s intact. The process naturally terminates once we reach a leaf, where we can also calculate the correct value of θ using Eq. (10).

Once we find θₜ (if θₜ ≥ 0) we update the global shift, Θₜ₊₁ = Θₜ + θₜ. We need to discard all the elements in T smaller than Θₜ₊₁, which we do using Tarjan's (1983) algorithm for splitting a red-black tree. This step is logarithmic in the total number of non-zero elements of vₜ. Thus, as the additional variables in the tree can be updated in constant time as a function of a node's child nodes in T, each of the operations previously described can be performed in logarithmic time (Cormen et al., 2001), giving us a total update time of O(k log(n)).

7. Experiments

We now present experimental results demonstrating the effectiveness of the projection algorithms. We first report results for experiments with synthetic data and then move to experiments with high dimensional natural datasets.

In our experiment with synthetic data, we compared variants of the projected subgradient algorithm (Eq. (2)) for ℓ1-regularized least squares and ℓ1-regularized logistic regression. We compared our methods to a specialized coordinate-descent solver for the least squares problem due to Friedman et al. (2007) and to very fast interior point methods for both least squares and logistic regression (Koh et al., 2007; Kim et al., 2007). The algorithms we use are batch projected gradient, stochastic projected subgradient, and batch projected gradient augmented with a backtracking line search (Koh et al., 2007). The IP and coordinate-wise methods both solve regularized loss functions of the form f(w) = L(w) + λ‖w‖₁ rather than having an ℓ1-domain constraint, so our objectives are not directly comparable. To surmount this difficulty, we first minimize L(w) + λ‖w‖₁ and use the 1-norm of the resulting solution w* as the constraint for our methods.

Figure 4. Comparison of methods on ℓ1-regularized least squares. The left has dimension n = 800, the right n = 4000.

To generate the data for the least squares problem setting, we chose a w with entries distributed normally with 0 mean and unit variance and randomly zeroed 50% of the vector. The data matrix X ∈ R^{m×n} was random with entries also normally distributed. To generate target values for the least squares problem, we set y = Xw + ν, where the components of ν were also distributed normally at random. In the case of logistic regression, we generated data X and the vector w identically, but the targets yᵢ were set to be sign(w · xᵢ) with probability 90% and to −sign(w · xᵢ) otherwise. We ran two sets of experiments, one each for n = 800 and n = 4000. We also set the number of examples m to be equal to n. For the subgradient methods in these experiments and throughout the remainder, we set ηₜ = η₀/√t, choosing η₀ to give reasonable performance. (η₀ too large will mean that the initial steps of the gradient method are not descent directions; the noise will quickly disappear because the step sizes are proportional to 1/√t.)
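As a usage illustration, the sketch below (ours) reproduces the synthetic least-squares setup just described and runs the stochastic projected subgradient variant with the sort-based projection. The radius z, the step size η₀, the seed, and the iteration count are placeholder choices rather than values from the paper, which instead fixes z from the 1-norm of a preliminary regularized solution.

import numpy as np

def project_l1_ball(v, z):
    """Compact restatement of the sort-based projection (Fig. 1 plus Sec. 4)."""
    u = np.abs(v)
    if u.sum() <= z:
        return v.copy()
    mu = np.sort(u)[::-1]
    cssv = np.cumsum(mu) - z
    j = np.arange(1, u.size + 1)
    rho = np.nonzero(mu - cssv / j > 0)[0][-1] + 1
    theta = cssv[rho - 1] / rho
    return np.sign(v) * np.maximum(u - theta, 0.0)

rng = np.random.default_rng(0)
n = m = 800                                  # the smaller of the two problem sizes in the text
w_true = rng.standard_normal(n)
w_true[rng.random(n) < 0.5] = 0.0            # randomly zero 50% of the entries
X = rng.standard_normal((m, n))
y = X @ w_true + rng.standard_normal(m)      # y = Xw + nu

z = np.abs(w_true).sum()                     # placeholder 1-norm bound (see caveat above)
w, eta0 = np.zeros(n), 1e-3                  # eta_0 is only required to give reasonable performance
for t in range(1, 2001):
    i = rng.integers(m)                      # stochastic (sub)gradient from a single example
    g = (X[i] @ w - y[i]) * X[i]
    w = project_l1_ball(w - eta0 / np.sqrt(t) * g, z)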
Fig. 4 and Fig. 5 contain the results of these experiments and plot f(w) − f(w*) as a function of the number of floating point operations. From the figures, we see that the projected subgradient methods are generally very fast at the outset, getting us to an accuracy of f(w) − f(w*) ≤ 10⁻² quickly, but their rate of convergence slows over time. The fast projection algorithms we have developed, however, allow projected-subgradient methods to be very competitive with specialized methods, even on these relatively small problem sizes. On higher-dimension data sets interior point methods are infeasible or very slow. The rightmost graphs in Fig. 4 and Fig. 5 plot f(w) − f(w*) as functions of floating point operations for least squares and logistic regression with dimension n = 4000. These results indicate that in high dimensional feature spaces, the asymptotically faster convergence of IP methods is counteracted by their quadratic dependence on the dimension of the space.

Figure 5. Comparison of methods on ℓ1-regularized logistic regression. The left has dimension n = 800, the right n = 4000.

We also ran a series of experiments on two real datasets with high dimensionality: the Reuters RCV1 Corpus (Lewis et al., 2004) and the MNIST handwritten digits database. The Reuters Corpus has 804,414 examples; with simple stemming and stop-wording, there are 112,919 unigram features and 1,946,684 bigram features. With our preprocessing, the unigrams have a sparsity of 1.2% and the bigrams have a sparsity of 0.26%. We performed ℓ1-constrained binary logistic regression on the CCAT category from RCV1 (classifying a document as corporate/industrial) using unigrams in a batch setting and bigrams in an online setting. The MNIST dataset consists of 60,000 training examples and a 10,000 example test set and has 10 classes; each image is a gray-scale 28 × 28 image, which we represent as xᵢ ∈ R⁷⁸⁴. Rather than directly use the input xᵢ, however, we learned weights wⱼ using the following kernel-based "similarity" function for each class j ∈ {1, . . . , 10}:

  k(x, j) = \sum_{i \in S} w_{ji}\, \sigma_{ji}\, K(x_i, x), \qquad \sigma_{ji} = \begin{cases} 1 & \text{if } y_i = j \\ -1 & \text{otherwise.} \end{cases}

In the above, K is a Gaussian kernel function, so that K(x, y) = exp(−‖x − y‖²/25), and S is a 2766 element support set. We put an ℓ1 constraint on each wⱼ, giving us the following multiclass objective with dimension 27,660:

  \min_{w} \ \frac{1}{m}\sum_{i=1}^{m} \log\Bigl(1 + \sum_{r \ne y_i} e^{\,k(x_i, r) - k(x_i, y_i)}\Bigr) \quad \text{s.t.} \quad \|w_j\|_1 \le z, \ w_j \succeq 0 . \qquad (12)

As a comparison to our projected subgradient methods on real data, we used a method known in the literature as either entropic descent, a special case of mirror descent (Beck & Teboulle, 2003), or exponentiated gradient (EG) (Kivinen & Warmuth, 1997). EG maintains a weight vector w subject to the constraint that Σᵢ wᵢ = z and w ⪰ 0; it can easily be extended to work with negative weights under a 1-norm constraint by maintaining two vectors w⁺ and w⁻. We compare against EG since it works well in very high dimensional spaces, and it very quickly identifies and shrinks weights for irrelevant features (Kivinen & Warmuth, 1997). At every step of EG we update

  w_i^{(t+1)} = \frac{w_i^{(t)} \exp\bigl(-\eta_t \nabla_i f(w^{(t)})\bigr)}{Z_t} \qquad (13)

where Zₜ normalizes so that Σᵢ wᵢ(t+1) = z and ∇ᵢf denotes the ith entry of the gradient of f, the function to be minimized. EG can actually be viewed as a projected subgradient method using generalized relative entropy, D(x‖y) = Σᵢ xᵢ log(xᵢ/yᵢ) − xᵢ + yᵢ, as the distance function for projections (Beck & Teboulle, 2003). We can replace ∇ᵢf with an unbiased estimator of the gradient of f in Eq. (13) to get stochastic EG. A step size ηₜ ∝ 1/√t guarantees a convergence rate of O(√(log n / T)). For each experiment with EG, however, we experimented with learning rates proportional to 1/t, 1/√t, and constant, as well as with different initial step sizes; to make EG as competitive as possible, we chose the step size and rate for which EG performed best on each individual test.
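Eq. (13) amounts to a single multiplicative update; a NumPy rendering (ours, for the non-negative case without the w⁺/w⁻ extension) is:

import numpy as np

def eg_step(w, grad, eta, z):
    """One exponentiated gradient update, cf. Eq. (13).

    w    -- current weights with w >= 0 and sum(w) = z
    grad -- gradient (or an unbiased estimate) of f at w
    """
    u = w * np.exp(-eta * grad)     # multiplicative update
    return z * u / u.sum()          # Z_t renormalizes so the weights again sum to z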
Results for our batch experiments learning a logistic classifier for CCAT on the Reuters corpus can be seen in Fig. 6. The figure plots the binary logistic loss of the different algorithms minus the optimal log loss as a function of CPU time. On the left side of Fig. 6, we used projected gradient descent and stochastic gradient descent using 25% of the training data to estimate the gradient, and we used the algorithm of Fig. 2 for the projection steps. We see that ℓ1-projections outperform EG both in terms of convergence speed and empirical log-loss. On the right side of the figure, we performed stochastic descent using only 1 training example or 100 training examples to estimate the gradient, using Fig. 3 to project. When the gradient is sparse, updates for EG are O(k) (where k is the number of non-zeros in the gradient), so EG has a run-time advantage over ℓ1-projections when the gradient is very sparse. This advantage can be seen in the right side of Fig. 6.

Figure 6. EG and projected subgradient methods on RCV1.

For MNIST, with dense features, we ran a similar series of tests to those we ran on the Reuters Corpus. We plot the multiclass logistic loss from Eq. (12) over time (as a function of the number of gradient evaluations) in Fig. 7. The left side of Fig. 7 compares EG and gradient descent using the true gradient while the right figure compares stochastic EG and stochastic gradient descent using only 1% of the training set to estimate the gradient. On top of outperforming EG in terms of convergence rate and loss, the ℓ1-projection methods also gave sparsity, zeroing out between 10 and 50% of the components of each class vector wⱼ in the MNIST experiments, while EG gives no sparsity.

Figure 7. MNIST multiclass logistic loss as a function of the number of gradient steps. The left uses true gradients, the right stochastic subgradients.

As a last experiment, we ran an online learning test on the RCV1 dataset using bigram features, comparing ℓ1-projections, using decreasing step sizes given by Zinkevich (2003), to exponentiated gradient updates. The ℓ1-projections are computationally feasible because of the algorithm of Fig. 3, as the dimension of our feature space is nearly 2 million (using the expected linear-time algorithm of Fig. 2 takes 15 times as long to compute the projection for the sparse updates in online learning). We selected the bound on the 1-norm of the weights to give the best online regret of all our experiments (in our case, the bound was 100). The results of this experiment are in Fig. 8. The left figure plots the cumulative log-loss for the CCAT and ECAT binary prediction problems as a function of the number of training examples, while the right-hand figure plots the sparsity of the ℓ1-constrained weight vector both as a function of the dimension and as a function of the number of features actually seen. The ℓ1-projecting learner maintained an active set with only about 5% non-zero components; the EG updates have no sparsity whatsoever. Our online ℓ1-projections outperform EG updates in terms of the online regret (cumulative log-loss), and the ℓ1-projection updates also achieve a classification error rate of 11.9% over all the examples on the CCAT task and 14.9% on ECAT (versus more than 15% and 20% respectively for EG).

Figure 8. Online learning of bigram classifier on RCV1. Left is the cumulative loss, right shows sparsity over time.

Acknowledgments

We thank the anonymous reviewers for their helpful and insightful comments.

References

Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 167–175.
Bertsekas, D. (1999). Nonlinear programming. Athena Scientific.
Candes, E. J. (2006). Compressive sampling. Proc. of the Int. Congress of Math., Madrid, Spain.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms. MIT Press.
Crammer, K., & Singer, Y. (2002). On the learnability and design of output codes for multiclass problems. Machine Learning, 47.
Donoho, D. (2006a). Compressed sensing. Technical report, Stanford University.
Donoho, D. (2006b). For most large underdetermined systems of linear equations, the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math., 59.
Friedman, J., Hastie, T., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1, 302–332.
Gafni, E., & Bertsekas, D. P. (1984). Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22, 936–964.
Hazan, E. (2006). Approximate convex optimization by online game playing. Unpublished manuscript.
Kim, S.-J., Koh, K., Lustig, M., Boyd, S., & Gorinevsky, D. (2007). An interior-point method for large-scale ℓ1-regularized least squares. IEEE Journal on Selected Topics in Signal Processing, 4, 606–617.
Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–64.
Koh, K., Kim, S.-J., & Boyd, S. (2007). An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8, 1519–1555.
Lewis, D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Ng, A. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. Proceedings of the Twenty-First International Conference on Machine Learning.
Shalev-Shwartz, S., & Singer, Y. (2006). Efficient learning of label ranking by soft projections onto polyhedra. Journal of Machine Learning Research, 7, 1567–1599.
Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. Proceedings of the 24th International Conference on Machine Learning.
Tarjan, R. E. (1983). Data structures and network algorithms. Society for Industrial and Applied Mathematics.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58, 267–288.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the Twentieth International Conference on Machine Learning.