SVM GC
May, 2021
Abstract
The linear Support Vector Machine (SVM) is a classic classification technique in machine
learning. Motivated by applications in modern high dimensional statistics, we consider penal-
ized SVM problems involving the minimization of a hinge-loss function with a convex sparsity-
inducing regularizer such as: the L1-norm on the coefficients, its grouped generalization and
the sorted L1-penalty (aka Slope). Each problem can be expressed as a Linear Program (LP)
and is computationally challenging when the number of features and/or samples is large – the
current state of algorithms for these problems is rather nascent when compared to the usual
L2-regularized linear SVM. To this end, we propose new computational algorithms for these LPs
by bringing together techniques from (a) classical column (and constraint) generation methods
and (b) first order methods for non-smooth convex optimization — techniques that are rarely
used together for solving large scale LPs. These components have their respective strengths; and
while they are found to be useful as separate entities, they have not been used together in the
context of solving large scale LPs such as the ones studied herein. Our approach complements
the strengths of (a) and (b) — leading to a scheme that seems to significantly outperform com-
mercial solvers as well as specialized implementations for these problems. We present numerical
results on a series of real and synthetic datasets demonstrating the surprising effectiveness of
classic column/constraint generation methods in the context of challenging LP-based machine
learning tasks.
1 Introduction
The linear Support Vector Machine (SVM) [42, 22] is a fundamental tool for binary classification.
Given training data (x_i, y_i), i = 1, . . . , n, with feature vector x_i ∈ R^p and label y_i ∈ {−1, 1}, the task is
to learn a linear classifier of the form sign(x^T β + β0), where β0 ∈ R is the offset. The popular
L2-regularized linear SVM (L2-SVM) considers the minimization problem
min_{β∈R^p, β0∈R}   Σ_{i=1}^n ( 1 − y_i(x_i^T β + β0) )_+ + (λ/2) ‖β‖₂²        (1)
where (a)_+ := max{a, 0}, so that the first term above is the hinge-loss, and λ ≥ 0 regularizes the L2-
norm of the coefficients β. Several algorithms have been proposed to efficiently solve Problem (1).
∗ [[email protected]]. Operations Research Center, MIT. Now at Vicarious AI.
† [[email protected]]. MIT Sloan School of Management and Operations Research Center, MIT.
‡ [[email protected]]. Operations Research Center, MIT.
Popular approaches include stochastic subgradient methods on the primal form [9, 37], coordinate
descent methods on a dual [23] and cutting plane algorithms [25, 17].
The L1-SVM estimator: The L2-SVM estimator generally leads to a dense estimate for β—
towards this end, the L1 penalty [12, 22] is often used as a convex surrogate to encourage sparsity
(i.e., few nonzeros) in the coefficients. This leads to one of the problems we consider in this paper,
namely, the L1-SVM problem:
min_{β∈R^p, β0∈R}   Σ_{i=1}^n ( 1 − y_i(x_i^T β + β0) )_+ + λ ‖β‖₁,        (2)
which can be written as a Linear Program (LP), as shown in Section 2.2. The regularization
parameter λ ≥ 0 controls the L1-norm of β. Off-the-shelf solvers, including commercial LP solvers
(e.g., Gurobi, Cplex) work very well for small/moderate sized problems, but become expensive in
solving Problem (2) when n and/or p is large (e.g., when n ≈ p ≈ 10^4; or p ≈ 10^6 and n is a
few hundreds). Some high-quality specialized solvers for Problem (2) include: a homotopy based
method to compute the entire (piecewise linear) regularization path in β [21]; methods based on
Alternating Direction Method of Multipliers (ADMM) or operator splitting [2, 33]. [34] proposes
a parametric simplex (PSM) approach to solve Problem (2)—this leads to a pair of primal/dual
solutions at optimality, and the authors demonstrate that their method achieves state-of-the-art
performance for some LP-based sparse learning tasks1 compared to a benchmark ADMM-based
implementation “flare” [27]. Our experiments suggest that for problems with small n ≈ 100 and
large p ≈ 50, 000, PSM works well. However, this becomes inefficient as soon as n is large (e.g.,
for n ≈ 10^4 and p ≈ 100, PSM can take hours and require large memory, while the methods we propose
here take a few seconds with minimal memory). [28] propose a perturbation approach where they
reformulate the L1-SVM problem as an unconstrained smooth minimization problem by including
an additional regularization term. They apply Newton type methods to solve the resulting problem.
Such Newton-type methods as discussed in [28] may not scale to large problem instances, due to
expensive matrix inversions. In this paper, our goal is to propose new computational algorithms
for the L1-SVM LP by revisiting classical operations research tools such as column and constraint
generation with origins in 1950s [16]—these methods appear to have been somewhat underutilized
in the context of the L1-SVM problem (as we discuss below) and relatives of the L1-penalty, that
we consider here. To improve the performance of column/constraint generation-based methods we
use (relatively newer) first order optimization techniques.
We note that there are several appealing L1-regularized classifiers and efficient algorithms that
consider a smooth loss function (e.g., logistic, squared hinge loss, etc.)—see [18, 44] (for example).
Different loss functions have different operating characteristics: in particular, smooth loss functions
lead to estimators that are different from the hinge loss as in the L1-SVM problem [22]. Our goal
in this paper is not to pursue an empirical analysis of the relative merits/de-merits of different loss
1 The largest example studied in this work is the Dantzig Selector problem with n = 200 samples and p = 5,000 features.
functions (which are well-known); rather, we focus on algorithms for the hinge loss function, with
additional penalty functions that are representable as linear programs.
The Group-SVM estimator: In several applications, sparsity is structured — the coefficient
indices are naturally found to occur in groups that are known a-priori and it is desirable to select
(or set to zero) a whole group together as a “unit”. In this context, a group version of the usual
L1 norm is often used to improve the performance and interpretability of the model [45, 24]. We
consider the popular L1/L∞ penalty [1] leading to the Group-SVM Problem:
min_{β∈R^p, β0∈R}   Σ_{i=1}^n ( 1 − y_i(x_i^T β + β0) )_+ + λ Σ_{g=1}^G ‖β_g‖_∞        (3)
where, g = 1, . . . , G denotes a group index (the groups are disjoint), β g denotes the subvector of
coefficients belonging to group g and β = (β 1 , . . . , β G ). Problem (3) can be expressed as an LP
and our approach applies to this problem as well (with suitable modifications).
The Slope-SVM estimator: The third problem we study in this paper is of a different flavor
and is inspired by the sorted L1-penalty aka the Slope norm [8, 5], popularly used in the context
of penalized least squares problems for its useful statistical properties. For a (a-priori specified)
sequence λ1 ≥ . . . ≥ λp ≥ 0, the Slope-SVM problem is given by:
min_{β∈R^p, β0∈R}   Σ_{i=1}^n ( 1 − y_i(x_i^T β + β0) )_+ + Σ_{j=1}^p λ_j |β_(j)|,        (4)
where |β(1) | ≥ . . . ≥ |β(p) | are the ordered values of |βi |, i = 1, . . . , p. Unlike Problems (2) and (3)
where the penalty function is separable (respectively, block separable), the penalty function in (4)
is not separable in the coefficients. We show in Section 3 that Problem (4) can be expressed as an
LP with O(n + p) variables and an exponential number (in p) of constraints, consequently posing
challenges in optimization. Despite the large number of constraints, we show that the LP can
be solved using a non-standard application of column/constraint generation. We note that using
standard reformulation methods [10] (see Section A.1), Problem (4) can be modeled (e.g., using
CVXPY) and solved (e.g., using a commercial solver like Gurobi) for small-sized problems. However,
the computations become expensive when λi s are distinct (e.g., as in [8]) — for these cases, CVXPY
can handle problems up to n ≈ 100, p ≈ 200 whereas, our approach can solve problems with
p ≈ 50, 000 within a few seconds.
First order methods: First order methods [30] have enjoyed great success in solving large scale
structured convex optimization problems arising in machine learning applications. Many of these
methods (such as proximal gradient and its accelerated variants) are appealing candidates for the
minimization of smooth functions and also problems of the composite form [32], wherein accelerated
gradient methods enjoy a convergence rate of O(1/√ε) to obtain an ε-accurate solution. For the
nonsmooth SVM problems ((2),(3),(4)) discussed above, Nesterov’s smoothing method [31] (which
replaces the hinge-loss with a smooth approximation) can be used to obtain algorithms with a
convergence rate of O(1/ε)—a method that we explore in Section 4. While this procedure (with
additional heuristics based on [38]) leads to low-accuracy solutions relatively fast, in our experience,
the basic version of this algorithm takes a long time to obtain a solution with higher accuracy
when n and/or p are large. Similarly, first order methods based on [4] and [33, 2] also experience
increased run times as the problem sizes become large.
What this paper is about: In this paper, we propose an efficient algorithmic framework for L1-
SVM, Group-SVM and Slope-SVM using tools in column/constraint generation (and, leveraging the
capabilities of excellent LP solvers) that make use of some basic structural properties of solutions
to these problems, as we discuss below.
Specifically, large values of λ will encourage an optimal solution to Problem (2), β̂ (say), to
be sparse. This sparsity will be critical to solving Problem (2) when p ≫ n—we anticipate solving
Problem (2) without having to create an LP model with all p variables. To this end, we use column
generation, a classical method in mathematical optimization/operations research originating in the
context of solving integer programs during the late 1950s [16, 13] (see also [14] for a nice review). We
also make use of another structural aspect of a solution to Problem (2) when n is large (and p is
small). Suppose most of the samples can be classified correctly via a linear classifier—then, at an
optimal solution, 1 ≤ yi (xTi β +β0 ) and hence α̃i := (1−yi (xTi β +β0 ))+ will be zero for many indices
i = 1, . . . , n. We leverage this sparsity in α̃i ’s to develop efficient algorithms for Problem (2), using
constraint generation [16, 14] methods. This allows us to solve (2) without explicitly creating an
LP model with n samples.
To summarize, there are two characteristics special to an optimal solution of Problem (2):
(a) sparsity in the SVM coefficients, i.e., β and/or (b) sparsity in α̃i ’s. Column generation can be
used to handle (a); constraint generation can be used to address (b) — in problems where both
n, p are large, we propose to combine both column and constraint generation. To our knowledge,
while column generation and constraint generation are used separately in the context of solving
large scale LPs, using them together, in the context of the L1-SVM problem is novel. For solving
these (usually small) subproblems, we rely on powerful LP solvers (e.g., simplex based algorithms of
Gurobi) which lead to a pair of primal-dual solutions and also possess excellent warm-starting capa-
bilities. Our approach applies to the Group-SVM Problem (3) with suitable modifications. We also
extend our approach to handle the Slope-SVM problem (4), which requires a fairly involved use of
column/constraint generation. Numerical evidence presented here suggests that column/constraint
generation methods are modular, simple and powerful tools—they should perhaps be considered
more frequently to solve machine learning tasks based on LPs, even beyond the ones studied here.
The column/constraint generation methods mentioned above, are found to benefit from a good
initialization. To this end, we use first order optimization methods to get approximate solutions
with low computational cost. These solutions serve as decent initializations and are subsequently
improved to deliver optimal solutions as a part of our column and/or constraint generation frame-
work. This approach is found to be useful in all the three problems studied here.
To our knowledge, this is the first paper that brings together first order methods in convex
optimization and column/constraint generation algorithms for solving large scale LPs, in the context
of solving a problem of key importance in machine learning. Implementation of our methods can
be found at:
https://github.com/wanghaoyue123/Column-and-constraint-generation-for-L1-SVM-and-cousins.
Organization of paper: The rest of this paper is organized as follows. Section 2 presents an
overview of column/constraint generation methods; and then discusses their instantiation for the
L1-SVM and Group-SVM problems. Section 3 discusses the Slope-SVM problem. Section 4 dis-
cusses how first order methods can be used to get approximate solutions for these problems. Section
5 presents numerical results.
Notation: For an integer a we use [a] to denote {1, 2, . . . , a}. The ith entry of a vector u is denoted
by ui . For a set A, we use the notation |A| to denote its size. For a positive semidefinite matrix
A, we denote its largest eigenvalue by σmax (A). For a vector x ∈ Rn and a subset B ⊆ [n], we let
xB denote the sub-vector of x corresponding to the indices in B.
Consider a generic LP of the form (P): min_{θ∈R^p̄} c^T θ s.t. Aθ ≥ b, θ ≥ 0, with A ∈ R^{n̄×p̄}, and its dual (D): max_{q∈R^n̄} q^T b s.t. A^T q ≤ c, q ≥ 0.
Let us consider the case where p̄ is large compared to n̄ and we anticipate an optimal solution
of (P) to have few nonzeros. Let A_j denote the jth column of A. Consider a subset of columns
B ⊂ [p̄] and the corresponding reduced primal/dual problems:
(P_B):   min_{θ_B∈R^{|B|}}   Σ_{j∈B} c_j θ_j      s.t.   Σ_{j∈B} A_j θ_j ≥ b,   θ_B ≥ 0

(D_B):   max_{q∈R^n̄}   q^T b      s.t.   q^T A_j ≤ c_j,  j ∈ B,   q ≥ 0.
Let θ̃ B and q̃ be a pair of primal/dual solutions to the restricted problems (PB ) and (DB ). Let
θ̂ be an extension of θ̃ B to Rp̄ i.e., θ̂ B = θ̃ B and θ̂ Bc = 0. While θ̂ is a feasible solution for (P),
it may not be optimal for (P). An optimality certificate can be obtained via q̃, by checking if q̃ is
feasible for (D). Specifically, let the reduced cost for variable j be defined as c̄j := cj − q̃T Aj . If
c̄_j ≥ 0 for all j ∉ B, then q̃ is optimal for (D); and θ̂ is an optimal solution for (P).
Column generation: Column generation makes use of the scheme outlined above. It is useful
when p̄ ≫ n̄, and an optimal solution to (P) has few nonzeros. We start with a subset of columns
say, B—i.e., a guess for the support of a minimizer of (P) for which (PB ) is feasible. Given B, we
solve the restricted problem (PB ); and have a pair of primal/dual solutions for (PB )/(DB ). If all
the reduced costs are nonnegative, we declare convergence and stop. Otherwise, we find a column
(or a subset of columns) outside B with the most negative reduced cost(s), update B, and re-solve
the updated problem (PB ) by making use of the warm-start capabilities of a simplex-based LP
solver. If B is only allowed to increase (i.e., we do not drop variables) then this process converges
after finitely many iterations. Convergence guarantees of this procedure are formally discussed in
Section 2.3.1. Upon termination, column generation leads to a pair of primal/dual optimal solutions
to (P)/(D).
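To fix ideas, the following schematic Python/numpy sketch illustrates the column generation loop for a generic LP of the form (P); the helper solve_restricted_lp, which is assumed to return a primal/dual pair for the restricted problem, is a placeholder for any LP solver with dual support and is not part of our implementation.

```python
import numpy as np

def column_generation(A, b, c, B_init, solve_restricted_lp, tol=1e-3, max_iter=100):
    """Generic column generation for min c^T theta s.t. A theta >= b, theta >= 0.

    `solve_restricted_lp(A[:, B], b, c[B])` is assumed (placeholder) to return a pair
    (theta_B, q) of primal/dual solutions of the restricted problems (P_B)/(D_B).
    """
    B = list(B_init)                        # current working set of columns
    for _ in range(max_iter):
        theta_B, q = solve_restricted_lp(A[:, B], b, c[B])
        reduced = c - A.T @ q               # reduced costs c_j - q^T A_j, all columns
        candidates = np.setdiff1d(np.where(reduced < -tol)[0], B)
        if candidates.size == 0:            # all reduced costs outside B are >= -tol: stop
            break
        # append a batch of the most negative reduced-cost columns
        B.extend(candidates[np.argsort(reduced[candidates])][:100].tolist())
    theta = np.zeros(A.shape[1])
    theta[B] = theta_B                      # extend the restricted solution by zeros
    return theta, B
```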
Constraint generation: We now consider the case when n̄ ≫ p̄. Suppose at an optimal solution
to (P), only a small fraction of the n̄ constraints aTi θ ≥ bi for i ∈ [n̄] are active or binding. Then,
an optimal solution can be obtained by considering only a small subset of the n̄ constraints. This
inspires the use of a constraint generation algorithm, which can also be interpreted as column
generation [6] on the dual Problem (D).
The L1-SVM Problem (2) can be written as the following LP:

P_λ([n], [p]):   min_{ξ∈R^n, β0∈R, β^+, β^−∈R^p}   Σ_{i=1}^n ξ_i + λ Σ_{j=1}^p β_j^+ + λ Σ_{j=1}^p β_j^−        (5a)

                 s.t.   ξ_i + Σ_{j=1}^p y_i x_ij β_j^+ − Σ_{j=1}^p y_i x_ij β_j^− + y_i β0 ≥ 1,   i ∈ [n]        (5b)

                        ξ ≥ 0,   β^+ ≥ 0,   β^− ≥ 0.
Above, the positive and negative parts of βi are denoted as βi+ = max{βi , 0} and βi− =
max{−βi , 0} respectively, and ξi ’s are auxiliary continuous variables corresponding to the hinge-loss
function. The feasible set of Problem (5) is nonempty. A dual of (5) is the following LP:
D_λ([n], [p]):   max_{π∈R^n}   Σ_{i=1}^n π_i

                 s.t.   −λ ≤ Σ_{i=1}^n y_i x_ij π_i ≤ λ,   j ∈ [p]        (6)

                        y^T π = 0

                        0 ≤ π_i ≤ 1,   i ∈ [n].
For Problems (5) and (6), standard complementary slackness conditions lead to:

π_i ( ξ_i + y_i x_i^T β + y_i β0 − 1 ) = 0,     (1 − π_i) ξ_i = 0,   i ∈ [n].        (7)
Let (β ∗ (λ), β0∗ (λ)) and π ∗ (λ) denote optimal solutions for Problems (5) and (6). In what follows,
for notational convenience, we will drop the dependence (of an optimal solution) on λ when there is
no confusion. We make a few observations regarding the geometry of an L1-SVM solution following
standard SVM terminology [22]. For easier notation, we denote αi = yi xTi (β + − β − ) + yi β0 for all
i. Note that ξ_i = max{1 − α_i, 0} for all i. If a point i is correctly classified and lies on or beyond the margin (α_i ≥ 1), we have ξ_i = 0;
if it lies strictly beyond the margin (α_i > 1), then π_i = 0 (from (7)). Note
that if point i is misclassified, then ξ_i > 0 and π_i = 1. Furthermore, based on the value of π_i, we
have the following cases: (i) if π_i = 0, then α_i ≥ 1; (ii) if π_i = 1, then α_i ≤ 1; and (iii) if π_i ∈ (0, 1),
then αi = 1. The SVM coefficients can be estimated from the samples lying on the margin i.e., for
all i such that αi = 1. In particular, if an optimal solution to the L1-SVM problem has κ-many
nonzeros in β, then (β, β0 ) can be computed based on (κ + 1)-many samples lying on the margin
(assuming that the corresponding feature columns form a full rank matrix).
Given a subset of columns J ⊆ [p], the restricted columns version of the L1-SVM problem, P_λ([n], J) (Problem (8)), is obtained from (5) by retaining only the variables β_j^+, β_j^−, j ∈ J; and the corresponding dual problem is
D_λ([n], J):   max_{π∈R^n}   Σ_{i=1}^n π_i

               s.t.   −λ ≤ Σ_{i=1}^n y_i x_ij π_i ≤ λ,   j ∈ J        (9)

                      y^T π = 0

                      0 ≤ π_i ≤ 1,   i ∈ [n].
If (β̂j+ , β̂j− ), j ∈ J is a solution to (8), it can be extended to a feasible solution to (5) by padding
coordinates outside J with zeros—we let β̂j = β̂j+ −β̂j− (for j ∈ [p]) denote the corresponding feature
weights. Hence, the minimum of (8) is an upper bound for that of (5). Let π̂ ∈ Rn be an optimal
solution of (9). If π̂ is feasible for (6) then it is an optimal solution for (6); and β̂ is an optimal
solution to (5). Otherwise, π̂ is not an optimal solution; and some inequality constraints in (6)
are violated. For βj+ , we denote its corresponding reduced cost (cf. Section 2.1) by β̄j+ . A similar
notation is used for βj− . For every pair βj+ , βj− , j ∈ / J , the minimum of their reduced costs is
n o X
min β̄j+ , β̄j− = λ − yi xij π̂i . (10)
i∈[n]
We select a subset of indices with negative reduced costs2 and append it to J to form the new
restricted L1-SVM problem, which is re-solved via LP warm-starting. The process repeats till
convergence. For a tolerance level ε > 0 (e.g., ε = 10^−2), we update J by adding all columns j for
which the corresponding reduced cost (10) is lower than −ε. We summarize the algorithm below.
Algorithm 1: Column generation for L1-SVM

Input: X, y, regularization parameter λ, a tolerance threshold ε ≥ 0, an initial set of columns J.
Output: A near-optimal solution to the L1-SVM Problem (2).

1. Repeat Steps 2 and 3 until no column is added in Step 3.
2. Solve the restricted problem P_λ([n], J), i.e., Problem (8), with warm-starting (if available); let π̂ denote an optimal solution of its dual (9).
3. Form the set J_ε of columns in {1, . . . , p} \ J with reduced cost (10) lower than −ε. Update
J ← J ∪ J_ε; and go to Step 2.
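For illustration, the reduced costs (10) can be computed for all columns at once with a single matrix-vector product; a schematic numpy sketch of Step 3 (an illustration, independent of our released implementation) is given below.

```python
import numpy as np

def l1svm_columns_to_add(X, y, pi_hat, J, lam, eps=1e-2):
    """Step 3 of Algorithm 1: reduced costs (10) for all columns, then the violating ones.

    X: (n, p) feature matrix, y: (n,) labels in {-1, +1},
    pi_hat: (n,) dual solution of the restricted problem D_lambda([n], J).
    """
    # min{rc(beta_j^+), rc(beta_j^-)} = lambda - |sum_i y_i x_ij pi_i|
    reduced = lam - np.abs(X.T @ (y * pi_hat))
    violating = np.where(reduced < -eps)[0]
    return np.setdiff1d(violating, np.asarray(list(J), dtype=int))
```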
Algorithm 1 expands (with no deletion) the set of columns in the restricted problem—this con-
verges in a finite number of iterations, bounded above by p. In the worst case, one may need
to add all the columns in (5). The cost of Algorithm 1 also depends upon the size of J —if this
becomes comparable to p, then the cost of solving (8) will be large. However, assuming that a
2 For the column generation procedure to converge, it is not required to choose indices with the smallest reduced costs—any subset of indices with negative reduced costs can be selected.
solution to the L1-SVM problem corresponds to a sparse β, then using a good initialization for
J and/or regularization path continuation (discussed in Section 2.3.2), the worst case behavior is
not observed in practice (see our results in Section 5). In addition, thanks to simplex warm-start
capabilities, one can compute a solution to the updated version of the restricted L1-SVM problem
quite efficiently.
An appealing aspect of the column generation framework is that it provides a bound on the opti-
mality gap, even if one were to terminate early. We present the following result.
Theorem 1. Let z ∗ and ẑ denote the optimal objective values of the full L1-SVM Problem (5) and
the restricted L1-SVM Problem (8), respectively. At an iteration of Algorithm 1, define
ε̃ := − min_{j∈[p]\J} min{ β̄_j^+, β̄_j^−, 0 } ≥ 0,        (11)

where β̄_j^+ and β̄_j^− are defined in (10). If β* denotes an optimal solution of Problem (5), then it
holds that:

z* ≤ ẑ ≤ z* + ε̃ ‖β*‖₁.
Proof. The first inequality z* ≤ ẑ is trivial (as column generation leads to a feasible solution for
the full LP (5)). Below we prove the second inequality. From (10) and our definition of ε̃, we have

−λ − ε̃ ≤ Σ_{i∈[n]} y_i x_ij π̂_i ≤ λ + ε̃    ∀j ∈ [p] \ J,

while for j ∈ J the dual constraints in (9) give −λ ≤ Σ_{i∈[n]} y_i x_ij π̂_i ≤ λ. Moreover,
y^T π̂ = 0 and 0 ≤ π̂_i ≤ 1 for all i ∈ [n]—hence, π̂ is a feasible solution of D_{λ+ε̃}([n], [p]). On the
other hand, let (ξ*, β*^+, β*^−, β0*) be an optimal solution of P_λ([n], [p]). Then it is also a feasible
solution of P_{λ+ε̃}([n], [p]). By weak duality (and LP strong duality for the restricted pair (8)/(9)), we have

ẑ = Σ_{i=1}^n π̂_i ≤ Σ_{i=1}^n ξ_i* + (λ + ε̃) Σ_{j=1}^p (β_j*^+ + β_j*^−) = z* + ε̃ ‖β*‖₁,

which completes the proof.
Theorem 1 is an adaptation of a result stated in [14] without proof. Theorem 1 states that the
optimality gap is of the order of the smallest nonpositive reduced cost corresponding to the SVM
coefficients. Note that the value ε̃ satisfies ε̃ ≤ ε, where ε is the tolerance we set in
Algorithm 1. When ε̃ = 0, we are at an optimal solution.
There are variants of the column generation procedure where one can also drop variables (instead
of continually expanding the set of columns in the restricted problem)—see [14]. This is useful if
the size of J becomes so large that the restricted problem becomes difficult to solve. If we only
expand the set of columns in the restricted problem, as in Algorithm 1, we obtain a sequence
of decreasing objective values across the column generation iterations. If one were to both add
and delete columns, one may not obtain a monotone sequence of objective values. In this case,
additional care is needed to ensure convergence of the resulting procedure [14] (though Theorem 1
provides an optimality gap for a given reduced cost tolerance).
Initializing column generation with a candidate set of columns: In practice, Algorithm 1
is found to benefit from a good initial choice for J . To obtain a reasonable estimate of J with low
computational cost, we list a couple of options that we found to be useful.
(i) First order methods: Section 4 discusses first order methods to obtain an approximate solution
to the L1-SVM problem, which can be used to initialize J .
(ii) Regularization path: We compute a path (or grid) of solutions to L1-SVM (with column gen-
eration) for a decreasing sequence of λ values (e.g., λ ∈ {λ0 , . . . , λM }) with the smallest one set to
the current value of interest. This method is discussed in Section 2.3.2.
2.3.2 Computing a regularization path

Note that the subgradient optimality condition for the L1-SVM Problem (2) is given by: λ sign(β_j*) = Σ_{i=1}^n y_i x_ij π_i*, where, for a scalar u, sign(u) denotes a subgradient of u ↦ |u|. When λ is larger than
λ_max := max_{j∈[p]} Σ_{i=1}^n |x_ij|, an optimal solution to Problem (2) is zero: β*(λ) = 0.
Let I+ , I− denote the sample indices corresponding to the classes with labels +1 and −1
(respectively); and let N+ , N− denote their respective sizes. If N+ ≥ N− , then for λ ≥ λmax
a solution to Problem (6) is πi (λ) = N− /N+ , ∀i ∈ I+ and πi (λ) = 1, ∀i ∈ I− . For λ = λmax
using (10), the minimum of the reduced costs of the variables β_j^+ and β_j^− is

min{ β̄_j^+(λ_max), β̄_j^−(λ_max) } = λ_max − | (N_−/N_+) Σ_{i∈I_+} y_i x_ij + Σ_{i∈I_−} y_i x_ij |.        (12)
Algorithm 2: Regularization path for L1-SVM via column generation

1. Set λ_0 = λ_max and choose a decreasing sequence λ_0 > λ_1 > . . . > λ_M, with λ_M equal to the value of interest. Initialize J(λ_0) with a small set of columns with the smallest reduced costs in (12).
2. For ℓ ∈ {1, . . . , M} initialize J(λ_ℓ) ← J(λ_{ℓ−1}), β*(λ_ℓ) ← β*(λ_{ℓ−1}). Run column generation
to obtain the new estimate β*(λ_ℓ), with J(λ_ℓ) denoting the corresponding columns.
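The continuation scheme is easy to express in code. In the schematic sketch below, column_generation_l1svm is a placeholder standing in for Algorithm 1 (it is not part of our released code); the sketch computes λ_max as above and sweeps a decreasing grid of λ values, passing the working set of columns forward as a warm start.

```python
import numpy as np

def regularization_path(X, y, lam_target, column_generation_l1svm, n_grid=7):
    """Warm-started path from (a fraction of) lambda_max down to lam_target (Algorithm 2 sketch)."""
    lam_max = np.abs(X).sum(axis=0).max()        # beta*(lambda) = 0 for lambda >= lam_max
    grid = np.geomspace(0.5 * lam_max, lam_target, n_grid)
    J, beta, path = [], None, []
    for lam in grid:
        # Algorithm 1, warm-started with the columns/solution from the previous lambda
        beta, J = column_generation_l1svm(X, y, lam, J_init=J, beta_init=beta)
        path.append((lam, beta))
    return path
```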
2.4 Column and constraint generation for L1-SVM
Moving beyond the large p and small n setting, we first consider the case where n is large but p
is small (Section 2.4.1). For the L1-SVM Problem (2), if the linear classifier is able to correctly
classify most of the samples, the corresponding terms in the hinge loss i.e., ξi ’s will be zero at an
optimal solution, and a constraint generation approach will be useful3 . Section 2.4.2 discusses the
case where both n, p are large where we use both column and constraint generation.
2.4.1 Constraint generation when n is large

Given λ > 0 and I ⊂ {1, . . . , n}, we define the restricted constraints version of the L1-SVM problem
by using a subset (indexed by I) of the constraints4 in Problem (5):
P_λ(I, [p]):   min_{ξ∈R^{|I|}, β0∈R, β^+, β^−∈R^p}   Σ_{i∈I} ξ_i + λ Σ_{j=1}^p β_j^+ + λ Σ_{j=1}^p β_j^−

               s.t.   ξ_i + Σ_{j=1}^p y_i x_ij β_j^+ − Σ_{j=1}^p y_i x_ij β_j^− + y_i β0 ≥ 1,   i ∈ I        (13)

                      ξ ≥ 0,   β^+ ≥ 0,   β^− ≥ 0.
Let (β†, β0†) ∈ R^{p+1} and {π_i†}_{i∈I} denote optimal solutions of Problems (13) and (14) (respectively), where (14) denotes the dual of (13). Note that a constraint in (5b) corresponding to i ∉ I is violated if

π̄_i := 1 − y_i ( x_i^T β† + β0† )        (15)

is (strictly) positive. We add to I all (or a subset of) indices corresponding to the dual variables
having reduced cost higher than a threshold ε ≥ 0 (specified a-priori). We solve the new LP formed
with the expanded set I. The constraint generation algorithm for L1-SVM is summarized below:
3 If the underlying dataset cannot be well classified via a linear SVM at the corresponding sparsity level of β, then the constraint generation procedure may not be effective.
4 Note that in the objective in (13), the first term involves a summation of ξ_i over i ∈ I only—this is because, if a constraint is not present, the corresponding value of ξ_i will be zero.
Algorithm 3: Constraint generation for L1-SVM

Input: X, y, regularization parameter λ, a tolerance threshold ε ≥ 0, an initial set of constraints
indexed by I.
Output: A near-optimal solution β† for the L1-SVM Problem (2).

1. Repeat Steps 2 and 3 until no constraint is added in Step 3.
2. Solve the restricted problem P_λ(I, [p]), i.e., Problem (13), with warm-starting (if available).
3. Form the set I_ε of indices in {1, . . . , n} \ I with π̄_i (cf. (15)) higher than ε. Update I ← I ∪ I_ε; and go to Step 2.
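The violation check (15) used in Step 3 is again a single matrix-vector product; a schematic numpy sketch (an illustration, not our implementation):

```python
import numpy as np

def l1svm_constraints_to_add(X, y, beta, beta0, I, eps=1e-2):
    """Indices i outside I whose constraint in (5b) is violated, i.e. pi_bar_i > eps (cf. (15))."""
    pi_bar = 1.0 - y * (X @ beta + beta0)     # pi_bar_i = 1 - y_i (x_i^T beta + beta_0)
    violating = np.where(pi_bar > eps)[0]
    return np.setdiff1d(violating, np.asarray(list(I), dtype=int))
```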
The following theorem gives an upper bound on the resulting objective value upon termination
of Algorithm 3.
Theorem 2. Let z* denote the optimal objective value of P_λ([n], [p]). Let (β†, β0†) ∈ R^{p+1} denote
an optimal solution of Problem (13); and

z† := Σ_{i=1}^n ( 1 − y_i(x_i^T β† + β0†) )_+ + λ ‖β†‖₁.        (16)

At an iteration of Algorithm 3, let ε̃ := max_{i∈[n]\I} max{π̄_i, 0}, with π̄_i, i ∈ [n]\I, defined in (15).
Then it holds that

z* ≤ z† ≤ z* + (n − |I|) ε̃.        (17)
Proof. The first inequality is trivial. For the second inequality, note that from (16) we have

z† = Σ_{i∈[n]\I} max{π̄_i, 0} + Σ_{i∈I} ( 1 − y_i(x_i^T β† + β0†) )_+ + λ ‖β†‖₁.        (18)

Since Σ_{i∈I} (1 − y_i(x_i^T β† + β0†))_+ + λ‖β†‖₁ is the optimal value of P_λ(I, [p]), and because P_λ(I, [p])
is obtained from P_λ([n], [p]) by dropping constraints (and the associated ξ_i's), this optimal value is at most z*.
Combining this with max{π̄_i, 0} ≤ ε̃ for every i ∈ [n]\I, (18) gives z† ≤ (n − |I|) ε̃ + z*, which is (17).
Note that the value ε̃ ≥ 0 satisfies ε̃ ≤ ε, where ε is the tolerance we set in Algorithm 3.
When ε̃ = 0, we are at an optimal solution.
Initialization: Similar to the case in column generation, the constraint generation procedure ben-
efits from a good initialization scheme. To this end, we use first order methods as described in
Section 4—specifically, the method we use for large n (and small p), is discussed in Section 4.4.2.
2.4.2 Column and constraint generation when both n and p are large
When both n and p are large, we will use a combination of column and constraint generation to
solve the L1-SVM problem. For a given λ, let I and J denote subsets of columns and constraints
(respectively). This leads to the following restricted version of the L1-SVM problem:
P_λ(I, J):   min_{ξ∈R^{|I|}, β0∈R, β^+, β^−∈R^{|J|}}   Σ_{i∈I} ξ_i + λ Σ_{j∈J} β_j^+ + λ Σ_{j∈J} β_j^−

             s.t.   ξ_i + Σ_{j∈J} y_i x_ij β_j^+ − Σ_{j∈J} y_i x_ij β_j^− + y_i β0 ≥ 1,   i ∈ I        (21)

                    ξ ≥ 0,   β^+ ≥ 0,   β^− ≥ 0.
Let ({β̂_j†}_{j∈J}, β̂_0†), {π̂_i†}_{i∈I} be a pair of optimal primal and dual solutions for the above problem.
Let β̄_j^+, β̄_j^− denote the reduced costs for primal variables β_j^+, β_j^−; and π̄_i denote the reduced cost
for dual variable π_i. The reduced costs are given by:

min{ β̄_j^+, β̄_j^− } = λ − | Σ_{i∈I} y_i x_ij π̂_i† |   for j ∉ J;        π̄_i = 1 − y_i ( Σ_{j∈J} x_ij β̂_j† + β̂_0† )   for i ∉ I.        (23)
We expand the sets I and J by using Steps 3 and 4 of Algorithm 4 (below). We then solve Problem
(21) (with warm-starting enabled) and continue till I and J stabilize. Section 4.4.3 discusses the
use of first order optimization methods to initialize I and J . Our hybrid column and constraint
generation approach to solve the L1-SVM Problem (2) is summarized below.
Algorithm 4: Column and constraint generation for L1-SVM

Input: X, y, regularization parameter λ, tolerance thresholds ε_1, ε_2 ≥ 0, initial sets I and J.
Output: A near-optimal solution for the L1-SVM Problem (2).

1. Repeat Steps 2–4 until no constraint or column is added.
2. Solve P_λ(I, J), that is, Problem (21).
3. Let I_{ε_1} ⊂ {1, . . . , n} \ I denote constraints with reduced cost higher than ε_1. Update I ←
I ∪ I_{ε_1}.
4. Let J_{ε_2} ⊂ {1, . . . , p} \ J denote columns with reduced cost lower than −ε_2. Update J ←
J ∪ J_{ε_2}; and go to Step 2.
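Putting the two checks together gives the skeleton of Algorithm 4. In the schematic sketch below, solve_restricted_l1svm is a placeholder for an LP solver applied to P_λ(I, J) that returns the primal solution together with the dual variables π̂† on the constraints in I; it is not part of our code.

```python
import numpy as np

def column_and_constraint_generation(X, y, lam, I, J, solve_restricted_l1svm,
                                     eps1=1e-2, eps2=1e-2, max_iter=50):
    """Sketch of Algorithm 4: alternately grow I (constraints) and J (columns)."""
    I, J = list(I), list(J)
    for _ in range(max_iter):
        beta_J, beta0, pi_I = solve_restricted_l1svm(X[np.ix_(I, J)], y[I], lam)
        beta = np.zeros(X.shape[1])
        beta[J] = beta_J
        # reduced costs (23)
        pi_bar = 1.0 - y * (X[:, J] @ beta_J + beta0)        # constraints i not in I
        col_rc = lam - np.abs(X[I, :].T @ (y[I] * pi_I))     # columns j not in J
        new_I = np.setdiff1d(np.where(pi_bar > eps1)[0], I)
        new_J = np.setdiff1d(np.where(col_rc < -eps2)[0], J)
        if new_I.size == 0 and new_J.size == 0:
            break
        I.extend(new_I.tolist())
        J.extend(new_J.tolist())
    return beta, beta0, I, J
```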
The following theorem gives an optimality certificate of a solution obtained from Algorithm 4.
Theorem 3. Let z* denote the optimal objective value of P_λ([n], [p]). Recall that ({β̂_j†}_{j∈J}, β̂_0†) is
an optimal solution of Problem P_λ(I, J). Extend {β̂_j†}_{j∈J} into a vector β̂† ∈ R^p with β̂_j† = 0 for
all j ∈ [p]\J. Denote

ẑ† := Σ_{i=1}^n ( 1 − y_i(x_i^T β̂† + β̂_0†) )_+ + λ ‖β̂†‖₁.        (24)

At an iteration of Algorithm 4, let ε̃_1 := max_{i∈[n]\I} max{π̄_i, 0} and ε̃_2 := − min_{j∈[p]\J} min{β̄_j^+, β̄_j^−, 0},
where min{β̄_j^+, β̄_j^−} and π̄_i are defined in (23). If β* is an optimal solution to P_λ([n], [p]), it holds that

z* ≤ ẑ† ≤ z* + (n − |I|) ε̃_1 + ε̃_2 ‖β*‖₁.        (25)
Proof. The first inequality is trivial. Below we prove the second inequality. For any i ∈ [n]\I let
π̂_i† = 0. Then from the definition of ε̃_2 we have

λ − | Σ_{i=1}^n y_i x_ij π̂_i† | = λ − | Σ_{i∈I} y_i x_ij π̂_i† | ≥ −ε̃_2    ∀j ∈ [p] \ J.        (26)

Moreover, since {π̂_i†}_{i∈I} is feasible for the dual of P_λ(I, J),

| Σ_{i∈I} y_i x_ij π̂_i† | ≤ λ    ∀j ∈ J.        (27)

Combining (26) and (27), we have

−λ − ε̃_2 ≤ Σ_{i=1}^n y_i x_ij π̂_i† ≤ λ + ε̃_2    ∀j ∈ [p].        (28)

In addition, we have Σ_{i=1}^n y_i π̂_i† = Σ_{i∈I} y_i π̂_i† = 0 and 0 ≤ π̂_i† ≤ 1 for all i ∈ [n]. So {π̂_i†}_{i=1}^n is a feasible
solution of D_{λ+ε̃_2}([n], [p]). Let (ξ*, β*^+, β*^−, β0*) be an optimal solution of P_λ([n], [p]); then it is
also a feasible solution of P_{λ+ε̃_2}([n], [p]). Hence by weak duality we have

Σ_{i=1}^n π̂_i† ≤ Σ_{i=1}^n ξ_i* + (λ + ε̃_2) Σ_{j=1}^p (β_j*^+ + β_j*^−) = z* + ε̃_2 ‖β*‖₁.        (29)

Finally, by LP strong duality for P_λ(I, J) and its dual, Σ_{i=1}^n π̂_i† = Σ_{i∈I} π̂_i† equals the optimal value of
P_λ(I, J), i.e., Σ_{i∈I} (1 − y_i(x_i^T β̂† + β̂_0†))_+ + λ‖β̂†‖₁. Since (1 − y_i(x_i^T β̂† + β̂_0†))_+ = max{π̄_i, 0} ≤ ε̃_1 for every
i ∈ [n]\I, (24) and (29) give ẑ† ≤ (n − |I|) ε̃_1 + z* + ε̃_2 ‖β*‖₁, which is (25).
Note that the nonnegative values ε̃_1 and ε̃_2 satisfy ε̃_1 ≤ ε_1 and ε̃_2 ≤ ε_2, where ε_1 and ε_2 are the
tolerances we set in Algorithm 4. When ε̃_1 = ε̃_2 = 0, we are at an optimal solution.
2.5 Column and constraint generation for Group-SVM

Let I_g ⊆ [p] denote the indices of the features in group g, for g ∈ [G]. Introducing auxiliary variables v_g to model ‖β_g‖_∞, the Group-SVM Problem (3) can be written as the LP:

(Group-SVM)   min_{ξ∈R^n, β0∈R, β^+, β^−∈R^p, v∈R^G}   Σ_{i=1}^n ξ_i + λ Σ_{g=1}^G v_g

              s.t.   ξ_i + Σ_{j=1}^p y_i x_ij (β_j^+ − β_j^−) + y_i β0 ≥ 1,   i ∈ [n]        (32)

                     v_g ≥ β_j^+ + β_j^−,   j ∈ I_g,  g ∈ [G]

                     ξ ≥ 0,   β^+ ≥ 0,   β^− ≥ 0.
A dual of Problem (32) is given by:

(Dual-Group-SVM)   max_{π∈R^n}   Σ_{i=1}^n π_i

                   s.t.   Σ_{j∈I_g} | Σ_{i=1}^n y_i x_ij π_i | ≤ λ,   g ∈ [G]        (33)

                          y^T π = 0

                          0 ≤ π_i ≤ 1,   i ∈ [n].
Following the description in Section 2.1, we apply column generation on the groups. Here, the
reduced cost of group g is given as:

β̄_g = λ − Σ_{j∈I_g} | Σ_{i=1}^n y_i x_ij π_i |        (34)

and we include into the model (a subset of) the groups g for which β̄_g < 0 (up to a small negative
tolerance).
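For illustration, the group reduced costs (34) can be computed as follows (schematic numpy sketch; groups is assumed to be a list of index arrays I_1, . . . , I_G):

```python
import numpy as np

def group_reduced_costs(X, y, pi, groups, lam):
    """Reduced cost (34) of each group: lam - sum_{j in I_g} |sum_i y_i x_ij pi_i|."""
    col_scores = np.abs(X.T @ (y * pi))                 # |sum_i y_i x_ij pi_i| for every column j
    return np.array([lam - col_scores[Ig].sum() for Ig in groups])
```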
Computing a regularization path: The regularization path algorithm presented in Section 2.3.2
can be adapted to the Group-SVM problem. First, note that:

β*(λ) = 0,   ∀λ ≥ λ_max := max_{g∈[G]} Σ_{j∈I_g} Σ_{i=1}^n |x_ij|.        (35)
For λ = λ_max, the reduced cost of the variables corresponding to group g is given by the “group”
analogue of (12):

β̄_g = λ_max − Σ_{j∈I_g} | (N_−/N_+) Σ_{i∈I_+} y_i x_ij + Σ_{i∈I_−} y_i x_ij |.        (36)
As in Section 2.3.2, we can obtain a small set of groups with the largest absolute sums in (36) (equivalently, the smallest reduced costs). We use
these groups to initialize the LP solver to solve Problem (32) for the next smaller value of λ, using
column generation—this results in computational savings when the number of active groups is small
compared to G. We repeat this process for smaller values of λ using warm-start continuation.
Constraint generation and column generation: When n is large (but the number of groups
is small) constraint generation can be used for the Group-SVM problem in a manner similar to
that used for the L1-SVM problem. Similarly, column and constraint generation can be applied
together to obtain computational savings when both n and the number of groups are large.
Initialization: As discussed for the L1-SVM problem, we can use first order methods to obtain
a low-accuracy solution for the Group-SVM problem—these are discussed in Section 4. This can
be used to initialize the set of nonzero groups (for column generation), relevant constraints (for
constraint generation), and both groups and constraints (for the combined column-and-constraint
generation procedure).
3 Column and constraint generation for Slope-SVM
Here we discuss the Slope-SVM estimator, i.e., Problem (4). For a p-dimensional regularization
parameter λ with coordinates sorted as λ_1 ≥ . . . ≥ λ_p ≥ 0, we let

‖β‖_S := Σ_{j=1}^p λ_j |β_(j)|        (37)

denote the Sorted L1-norm or the Slope-norm—we borrow this term inspired by the “Slope esti-
mator” [7] and acknowledge a slight abuse in terminology. Note that for convenience, we drop the
dependence on λ in the notation ‖·‖_S.
We note that the epigraph of ‖β‖_S, i.e., {(β, η) | ‖β‖_S ≤ η}, admits an LP formulation using
O(p²) many variables and O(p²) many constraints (cf. Section A.1)—given the problem-sizes we seek
to address, we do not pursue this route. Rather, we make use of the observation that the epigraph
can be expressed with exponentially many linear inequalities (Section 3.1) when using O(p)-many
variables. The large number of constraints makes the column/constraint generation methodology for
the Slope penalty significantly different than the L1-SVM example. However, as we will discuss,
our procedure does not require us to explicitly enumerate the exponentially many inequalities. To
our knowledge, the procedure we present here for the Slope-SVM problem is novel. Section 3.1
discusses a constraint generation method that greatly reduces the number of constraints needed to
model the epigraph. Section 3.2 discusses the use of column generation to exploit sparsity in β
when p is large. Finally, Section 3.3 combines these two features to address the Slope SVM problem.
We note that even for large p and small n settings, both column and constraint generation methods
are needed for the Slope penalty, making it different from the L1-penalty, where column generation
(alone) suffices. In what follows, we concentrate on the case where n is small but p is large — if n
is also large, a further layer of constraint generation might be needed to efficiently handle sparsity
arising from the hinge-loss.
3.1 Modeling the Slope penalty with constraint generation

Using the split β = β^+ − β^−, Problem (4) can be written as:

M_S(C, [p]):   min_{ξ∈R^n, β0∈R, η∈R, β^+, β^−∈R^p}   Σ_{i=1}^n ξ_i + η

               s.t.   ξ_i + Σ_{j=1}^p y_i x_ij (β_j^+ − β_j^−) + y_i β0 ≥ 1,   ξ_i ≥ 0,   i ∈ [n]        (38a)

                      (β^+, β^−, η) ∈ C,        (38b)

where β = β^+ − β^−; and β^+, β^− ∈ R^p_+ denote the positive and negative parts of β (respectively).
In line (38b) we express the Slope penalty in the epigraph form with C defined as:

C := { (β^+, β^−, η) | β^+, β^− ∈ R^p_+,   η ≥ Σ_{j=1}^p λ_j ( β^+_(j) + β^−_(j) ) },

where we use the notation β^+_(1) + β^−_(1) ≥ . . . ≥ β^+_(p) + β^−_(p) for the decreasing rearrangement of the values β_j^+ + β_j^−, and remind ourselves that |β_i| = β_i^+ + β_i^−
for all i. Below we show that (38b) can be expressed via linear inequalities involving (β^+, β^−).
We first introduce some notation. Let Sp denote the set of all permutations of {1, . . . , p}, with
|Sp | = p!. For a permutation φ ∈ Sp , we let (φ(1), . . . , φ(p)) denote the corresponding rearrangement
of (1, . . . , p). Using this notation, the Slope norm can be expressed as:
‖β‖_S = Σ_{j=1}^p λ_j |β_(j)| = max_{φ∈S_p} Σ_{j=1}^p λ_j |β_{φ(j)}| = max_{ψ∈S_p} Σ_{j=1}^p λ_{ψ(j)} |β_j|.        (39)
Lemma 1. For any β ∈ R^p, ‖β‖_S = max_{w∈W_0^{[p]}} Σ_{j=1}^p w_j |β_j|, where W_0^{[p]} := Conv(W^{[p]}) is the convex hull of W^{[p]}, and

W^{[p]} := { w ∈ R^p | ∃ψ ∈ S_p  s.t.  w_j = λ_{ψ(j)},  j ∈ [p] }.        (40)

Proof. Note that a linear function maximized over a bounded polyhedron reaches its maximum at
one of the extreme points of the polyhedron — this leads to:

max_{w∈W_0^{[p]}} Σ_{j=1}^p w_j |β_j| = max_{w∈W^{[p]}} Σ_{j=1}^p w_j |β_j|.        (41)

Using the definition of W^{[p]}, we get that the rhs of (41) is max_{ψ∈S_p} Σ_{j=1}^p λ_{ψ(j)} |β_j|, which by (39) is in fact
the Slope norm ‖β‖_S.
Remark 1. (a) If all the coefficients are equal, i.e., λ_1 = . . . = λ_p =: λ and ‖β‖_S = λ‖β‖₁, then W^{[p]}
is a singleton. (b) If all the coefficients are distinct, i.e., λ_1 > . . . > λ_p, then each permutation
ψ ∈ S_p is associated with a unique vector in W^{[p]} and W^{[p]} contains p! elements.
Using Lemma 1, we can derive an LP formulation of Problem (38) by modeling C in (38b) as:

C = { (β^+, β^−, η) | β^+, β^− ∈ R^p_+,   η ≥ max_{w∈W^{[p]}} w^T (β^+ + β^−) },        (42)
18
where, W [p] is defined in (40). The resulting LP formulation (38) has at most n constraints
from (38a) and at most p! constraints associated with (38b) (by virtue of (42)). We note that
many constraints in (42) are redundant: for example, the maximum is attained corresponding to
the inverse of permutation φ (denoted by φ−1 ), where |βφ(1) | ≥ . . . ≥ |βφ(p) |. This motivates the
use of constraint generation techniques.
Constraint generation: We proceed by replacing W [p] with a smaller subset and solve the
resulting LP. We subsequently refine this approximation if (38b) is violated. Formally, let us
consider a collection of vectors/cuts w(1) , . . . , w(t) ∈ W [p] leading to a superset Ct of C:
C ⊆ C_t := { (β^+, β^−, η) | β^+, β^− ∈ R^p_+,   η ≥ Σ_{j=1}^p w_j^{(ℓ)} β_j^+ + Σ_{j=1}^p w_j^{(ℓ)} β_j^−,   ∀ℓ ≤ t }.        (43)
Given a solution (β*, β0*, η*, ξ*) of the LP obtained from (38) by replacing C with C_t, we check whether (38b) is (approximately) satisfied, i.e., whether η* ≥ ‖β*‖_S (up to a tolerance); if not, we add the most violated inequality of (42).
To this end, consider a permutation ψ_{t+1} ∈ S_p such that |β*_{ψ_{t+1}(1)}| ≥ . . . ≥ |β*_{ψ_{t+1}(p)}|. If ψ_{t+1}^{-1}
denotes the inverse of ψ_{t+1}, we obtain w^{(t+1)} ∈ W^{[p]} such that:

w_j^{(t+1)} = λ_{ψ_{t+1}^{-1}(j)}    ∀j ∈ [p]        (44)

and solve the resulting LP. We continue adding cuts till no further cuts need to be added — this
leads to a (near-)optimal solution to M_S(C, [p]). We note that the first cut w^{(1)} can be obtained
by applying (44) to an estimator obtained from the first order optimization schemes (cf. Section 4).
Our algorithm is summarized below for convenience.

1. Initialize C_t with cuts w^{(1)}, . . . , w^{(t)} ∈ W^{[p]} (e.g., t = 1, with w^{(1)} obtained from a first order estimate via (44)).
2. Solve the LP (38) with C replaced by C_t, with warm-start (if available); let (β*, β0*, η*) denote a solution.
3. Let ψ_{t+1} ∈ S_p be such that |β*_{ψ_{t+1}(1)}| ≥ . . . ≥ |β*_{ψ_{t+1}(p)}|. If the condition η* + ε ≥ Σ_{j=1}^p λ_j |β*_{ψ_{t+1}(j)}|
is not satisfied, we add a new cut w^{(t+1)} ∈ W^{[p]} as per (44); update C_{t+1} and go to Step 2.
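Separating a violated cut only requires sorting the current solution; the schematic numpy sketch below (an illustration, not our implementation) checks whether (38b) holds at (β*, η*) up to a tolerance and, if not, returns the cut w^{(t+1)} of (44).

```python
import numpy as np

def separate_slope_cut(beta_plus, beta_minus, eta, lam, eps=1e-6):
    """Return a violated cut w in W^[p] for the Slope epigraph, or None if none exists.

    lam: nonincreasing vector (lambda_1 >= ... >= lambda_p >= 0).
    """
    lam = np.asarray(lam, dtype=float)
    abs_beta = np.asarray(beta_plus) + np.asarray(beta_minus)   # |beta_j| = beta_j^+ + beta_j^-
    order = np.argsort(-abs_beta)               # psi_{t+1}: sorts |beta| in decreasing order
    slope_value = lam @ abs_beta[order]         # ||beta||_S = sum_j lambda_j |beta_(j)|
    if eta + eps >= slope_value:                # epigraph constraint satisfied (up to eps)
        return None
    w = np.empty_like(lam)
    w[order] = lam                              # w_j = lambda_{psi^{-1}(j)}  (cf. (44))
    return w
```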
3.2 Dual formulation and column generation for Slope-SVM
When the amount of regularization is high, the Slope penalty (with λi > 0 for all i) may lead
to many zeros in an optimal solution to Problem (4) — computational savings are possible if we
can leverage this sparsity when p is large. To this end, we use column generation along with the
constraint generation algorithm described in Section 3.1. In particular, given a set of columns
J = {J(1), . . . , J(|J|)} ⊂ [p], we consider a restricted version of Problem (4) with β_j = 0, j ∉ J:

min_{β∈R^p, β0∈R}   Σ_{i=1}^n ( 1 − y_i(x_i^T β + β0) )_+ + ‖β‖_S     s.t.   β_j = 0,  j ∉ J.        (45)
The above can be expressed as an LP similar to Problem (38) but with fewer columns:

M_S(C^J, J):   min_{ξ∈R^n, β0∈R, η∈R, β^+_J, β^−_J∈R^{|J|}}   Σ_{i=1}^n ξ_i + η

               s.t.   ξ_i + Σ_{j∈J} y_i x_ij (β_j^+ − β_j^−) + y_i β0 ≥ 1,   ξ_i ≥ 0,   i ∈ [n]        (46)

                      (β^+_J, β^−_J, η) ∈ C^J,

where C^J denotes the analogue of C in (38b) restricted to the coordinates in J.
Since column generation is equivalent to constraint generation on the dual problem, to determine
the set of columns to add to J in Problem (45), we need the dual formulation of Slope-SVM.
Dual formulation for Slope-SVM: We first present a dual [46] representation of the Slope norm:

max{ β^T z | β ∈ R^p, ‖β‖_S ≤ 1 } = max_{k≤p} ( Σ_{j=1}^k λ_j )^{-1} Σ_{j=1}^k |z_(j)|.        (47)
The identity (47) follows from the observation that the maximum will be attained at an extreme
point of the polyhedron P_S = {β | ‖β‖_S ≤ 1} ⊂ R^p. We describe these extreme points. We fix
k ∈ [p] and a subset A ⊂ {1, . . . , p} of size k — the extreme points of P_S having support A have
their nonzero coefficients equal, with absolute value ( Σ_{j=1}^k λ_j )^{-1}. Finally, (47) follows by
taking a maximum over all k ∈ [p].
A dual of Problem (45) is given by:

max_{π∈R^n, q∈R^p}   Σ_{i=1}^n π_i

s.t.   max_{k=1,...,|J|} ( Σ_{j=1}^k λ_j )^{-1} Σ_{j=1}^k |q_(j)| ≤ 1

       q_j = Σ_{i=1}^n y_i x_ij π_i,   j ∈ [p]        (48)

       y^T π = 0

       0 ≤ π_i ≤ 1,   i ∈ [n],

where |q_(1)| ≥ . . . ≥ |q_(|J|)| denote the sorted absolute values of the entries {q_j}_{j∈J}.
We now discuss how additional columns can be appended to J in Problem (45) to perform column
generation. Let π* ∈ R^n be an optimal solution of Problem (48). We compute the associated q*
and sort its entries (over j ∈ J) such that |q*_(1)| ≥ . . . ≥ |q*_(|J|)|. The first constraint in (48) leads to:

max_{k=1,...,|J|} { Σ_{j=1}^k |q*_(j)| − Σ_{j=1}^k λ_j } ≤ 0.        (49)
Now, for each column j ∉ J, we compute its corresponding q_j* and insert it into the sorted
sequence |q*_(1)| ≥ . . . ≥ |q*_(|J|)|. This insertion costs at most O(|J|) flops: we update J ← J ∪ {j}
and denote the new sorted entries by |q̄*_(1)| ≥ . . . ≥ |q̄*_(|J|+1)|. We add a column j ∉ J to the current
model if:

max_{k=1,...,|J|+1} { Σ_{j=1}^k |q̄*_(j)| − Σ_{j=1}^k λ_j } > ε,        (50)
and this costs O(|J| + 1) flops. Therefore, the total cost of sorting the vector q* and scan-
ning through all columns (not in the current model) for negative reduced costs is of the order
O(|J| log|J| + 2(p − |J|)|J|). This approach can be computationally expensive. To this end, we
propose an alternative method with a smaller cost. Indeed, by combining
Equations (49) and (50), a column j ∉ J will be added to the model if it satisfies:

|q_j*| = | Σ_{i=1}^n y_i x_ij π_i* | > λ_{|J|+1} + ε.        (51)
This shows that the cost of adding a new column for Slope-SVM is the same as that in L1-SVM.
The column generation algorithm is summarized below.
1. Initialize J (e.g., using the output of a first order method, cf. Section 4) and a collection of cuts defining C^J_t.
2. Solve the model M_S(C^J, J) in Problem (46) (using the constraint generation scheme of Section 3.1, i.e., replacing C^J by the cut approximation C^J_t of (43)), with warm-start (if available).
3. Identify the columns J̄ ⊂ {1, . . . , p} \ J that need to be added by using criterion (51).
Update J ← J ∪ J̄, and go to Step 2.
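With criterion (51), the column check looks just like the one for L1-SVM, with λ_{|J|+1} playing the role of λ; a schematic numpy sketch (written under the form of (51) given above):

```python
import numpy as np

def slope_columns_to_add(X, y, pi, J, lam, eps=1e-2):
    """Columns j outside J with |q_j| = |sum_i y_i x_ij pi_i| > lambda_{|J|+1} + eps (cf. (51)).

    Assumes len(J) < p so that lam[len(J)] (0-indexed) equals lambda_{|J|+1}.
    """
    q = np.abs(X.T @ (y * pi))
    threshold = lam[len(J)] + eps
    return np.setdiff1d(np.where(q > threshold)[0], np.asarray(list(J), dtype=int))
```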
3.3 Column and constraint generation for Slope-SVM

We use the method in Section 3.1 to refine C^J_t and the method of Section 3.2 to add a set of
columns to J. We use criterion (51) to select the columns to add. Let J̄ denote these additional
columns, with coordinates J̄(k) for k = 1, . . . , |J̄| — we will also assume that the elements have
been sorted by increasing reduced costs. For notational purposes, we will need to map5 the existing
cuts of W^J onto W^{J∪J̄}. To this end, we make the following definition:

w^{(ℓ)}_{J̄(k)} = λ_{|J|+k},   k = 1, . . . , |J̄|,   ∀ℓ ≤ t.        (53)
5 In other words, the existing vectors w^{(ℓ)}_J are in R^{|J|} and we need to extend them to R^{|J|+|J̄|}. Therefore, we need to define the coordinates corresponding to the new indices J̄.
1. Initialize J and the cuts w^{(1)}_J, . . . , w^{(t)}_J defining C^J_t (e.g., via a first order method, cf. Section 4).
2. Solve the model M_S(C^J_t, J) with warm-start (if available); let (β*, β0*, η*) denote a solution.
3. If η* < Σ_{j=1}^{|J|} λ_j |β*_(j)| − ε, add a new cut w^{(t+1)}_J ∈ W^J as in Equation (44) and define C^J_{t+1}.
4. Identify columns J̄ ⊂ {1, . . . , p} \ J that need to be added (based on criterion (51)). Map
the cuts w^{(1)}_J, . . . , w^{(t+1)}_J to W^{J∪J̄} via (53). Update J ← J ∪ J̄ and go to Step 2.
4 First order methods

We now discuss the first order methods that we use to obtain approximate solutions to the SVM problems studied here6—these serve as initializations for the column/constraint generation algorithms. Following Nesterov's smoothing technique [31], for z ∈ R^n we replace the hinge loss H^0(z) := Σ_{i=1}^n (z_i)_+ by the smooth approximation

H^τ(z) := max_{‖w‖_∞≤1} { Σ_{i=1}^n (1/2)(1 + w_i) z_i − (τ/2) ‖w‖₂² },

where τ > 0 is a parameter that controls the amount of smoothness in H^τ(z) and how well it
approximates H^0(z) = Σ_{i=1}^n (z_i)_+. This is formalized in the following lemma adapted from [31]:

Lemma 2. The function z ↦ H^τ(z) is an O(τ)-approximation of the hinge loss H^0(z), i.e.,
H^0(z) ∈ [H^τ(z), H^τ(z) + nτ/2] for all z. Furthermore, H^τ(z) has Lipschitz continuous gradient
with parameter 1/(4τ), i.e., ‖∇H^τ(z) − ∇H^τ(z′)‖₂ ≤ (1/(4τ)) ‖z − z′‖₂ for all z, z′.
6 As our experiments demonstrate, obtaining high accuracy solutions via first order methods can become prohibitively expensive, especially when compared to the column/constraint generation algorithms presented here.
7 For the Group-SVM problem, we use proximal block coordinate methods instead of proximal gradient methods as they lead to better numerical performance.
Let us define:

F^τ(β, β0) := max_{‖w‖_∞≤1} { Σ_{i=1}^n [ (1/2)( 1 − y_i(x_i^T β + β0) ) + (w_i/2)( 1 − y_i(x_i^T β + β0) ) ] − (τ/2) ‖w‖₂² }.
By Lemma 2, it follows that F^τ(β, β0) is a uniform O(τ)-approximation to the hinge-loss function.
Let w^τ denote the maximizer in the definition of F^τ, i.e., w_i^τ = min{1, |z_i|/(2τ)} sign(z_i) with z_i = 1 − y_i(x_i^T β + β0).
The gradient of F^τ is given by:

∇F^τ(β, β0) = −(1/2) Σ_{i=1}^n (1 + w_i^τ) y_i x̃_i ∈ R^{p+1},        (56)

and (β, β0) ↦ ∇F^τ(β, β0) is Lipschitz continuous with parameter C^τ = σ_max(X̃^T X̃)/(4τ), where
X̃ ∈ R^{n×(p+1)} is the matrix with ith row (x_i, 1).
We use a proximal gradient method [3] applied to the following composite form of the smoothed hinge-
loss SVM problem with regularizer Ω(β):

min_{β∈R^p, β0∈R}   F^τ(β, β0) + λ Ω(β).        (57)

Writing γ = (β, β0) and α for the current iterate, the smooth part is majorized (for L ≥ C^τ) as

F^τ(γ) ≤ Q_L(γ; α) := F^τ(α) + ∇F^τ(α)^T (γ − α) + (L/2) ‖γ − α‖₂²,        (58)

and the next iterate is obtained as γ̂ ∈ arg min_γ { Q_L(γ; α) + λ Ω(β) }.
We denote γ̂ = (β̂, β̂_0). Note that β̂_0 is simple to compute and β̂ can be computed via the
following thresholding operator (with µ > 0):

S_µ^Ω(η) := arg min_{β∈R^p}  (1/2) ‖β − η‖₂² + µ Ω(β).        (60)
Thresholding operator when Ω(β) = ‖β‖₁: In this case, S_µ^Ω(η) is available via componentwise
soft-thresholding, where the scalar soft-thresholding operator is given by:

arg min_{u∈R}  (1/2)(u − c)² + µ|u| = sign(c)( |c| − µ )_+.
Thresholding operator when Ω(β) = Σ_{g∈[G]} ‖β_g‖_∞: We first consider the projection operator
that projects onto an L1-ball of radius µ > 0:

S̃_{(1/µ)‖·‖₁}(η) := arg min_β  (1/2) ‖β − η‖₂²    s.t.   (1/µ) ‖β‖₁ ≤ 1.        (61)

From standard results pertaining to the Moreau decomposition [29] (see also [1]) we have:

S_µ^{‖·‖_∞}(η) = η − S̃_{(1/µ)‖·‖₁}(η)        (62)

for any η. Note that S̃_{(1/µ)‖·‖₁}(η) can be computed via a simple sorting operation [40, 41], leading
to a solution for S_µ^{‖·‖_∞}(η). This observation can be used to solve Problem (60) with the L1/L∞
Group regularizer by noticing that the problem separates across the G groups.
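A schematic sketch of the resulting L∞ prox for a single group, combining a sorting-based projection onto the L1-ball [40, 41] with the Moreau identity (62); this is an illustration, not our released code.

```python
import numpy as np

def project_l1_ball(eta, radius):
    """Euclidean projection of eta onto {beta : ||beta||_1 <= radius} via sorting."""
    eta = np.asarray(eta, dtype=float)
    if np.abs(eta).sum() <= radius:
        return eta.copy()
    u = np.sort(np.abs(eta))[::-1]                  # sorted magnitudes, decreasing
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(eta) * np.maximum(np.abs(eta) - theta, 0.0)

def prox_linf(eta, mu):
    """Prox of mu * ||.||_inf via the Moreau decomposition (62): eta - proj_{||.||_1 <= mu}(eta)."""
    return np.asarray(eta, dtype=float) - project_l1_ball(eta, mu)
```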
Thresholding operator when Ω(β) = Σ_{i∈[p]} λ_i |β_(i)|: For the Slope regularizer, Problem (60)
reduces to the following optimization problem:

min_β  (1/2) ‖β − η‖₂² + µ Σ_i λ_i |β_(i)|.        (63)

As noted by [8], at an optimal solution to Problem (63), the signs of β_j and η_j are the same. In
addition, since the λ_i's are decreasing, a solution to Problem (63) can be found by solving the following
close relative of the isotonic regression problem [36]:

min_u  (1/2) ‖u − η̃‖₂² + µ Σ_{j=1}^p λ_j u_j    s.t.   u_1 ≥ . . . ≥ u_p ≥ 0,        (64)

where η̃ is a decreasing rearrangement of the absolute values of η, with η̃_i ≥ η̃_{i+1} for all i. If û is
a solution to Problem (64), then its ith coordinate û_i corresponds to |β̂_(i)|, where β̂ is an optimal
solution of Problem (63).
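One compact way to realize this in Python is to reduce (64) to a decreasing isotonic regression. The schematic sketch below uses scikit-learn's isotonic_regression as the PAVA solver (any PAVA routine would do); it is an illustration, not our implementation.

```python
import numpy as np
from sklearn.isotonic import isotonic_regression

def prox_slope(eta, lam, mu=1.0):
    """Prox of mu * sum_j lam_j |beta_(j)| at eta, with lam_1 >= ... >= lam_p >= 0.

    Solves (63) via (64): project (|eta|_sorted - mu*lam) onto the nonincreasing,
    nonnegative cone, then restore signs and the original order.
    """
    eta = np.asarray(eta, dtype=float)
    lam = np.asarray(lam, dtype=float)
    order = np.argsort(-np.abs(eta))                 # sort |eta| in decreasing order
    eta_sorted = np.abs(eta)[order]
    # (64): Euclidean projection of (eta_sorted - mu*lam) onto {u_1 >= ... >= u_p >= 0}
    u = isotonic_regression(eta_sorted - mu * lam, y_min=0.0, increasing=False)
    beta = np.zeros_like(eta)
    beta[order] = u                                  # undo the sort
    return np.sign(eta) * beta
```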
Accelerated variant: Following [3], we apply the proximal gradient update (the map α ↦ γ̂ described above) at a momentum point α_{T+1} rather than at the latest iterate α̃_T, where

α_{T+1} = α̃_T + ((q_T − 1)/q_{T+1}) (α̃_T − α̃_{T−1})    and    q_{T+1} = ( 1 + √(1 + 4 q_T²) ) / 2,

and α̃_T denotes the T-th proximal gradient iterate. This algorithm requires
O(1/ε) iterations to reach an ε-optimal solution for the original problem (with the hinge-loss). We
perform these updates till some tolerance criterion is satisfied, for example, ‖α_{T+1} − α_T‖ ≤ η for
some tolerance level η > 0. In most of our examples (cf. Section 5), we set a generous (or loose)
tolerance of η = 10^{−3} or run the algorithm with a limit on the total number of iterations (usually
a couple of hundred)8.
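A schematic sketch of the accelerated scheme for the L1 regularizer, combining the gradient (56) of the smoothed hinge loss with soft-thresholding; the function names, step sizes and stopping rule are illustrative only and do not reproduce our implementation.

```python
import numpy as np

def smoothed_hinge_grad(X, y, beta, beta0, tau):
    """Gradient (56) of the smoothed hinge loss F^tau at (beta, beta0)."""
    z = 1.0 - y * (X @ beta + beta0)
    w = np.clip(z / (2.0 * tau), -1.0, 1.0)          # w_i^tau = min{1, |z_i|/(2 tau)} sign(z_i)
    coef = -0.5 * (1.0 + w) * y
    return X.T @ coef, coef.sum()                    # gradients w.r.t. beta and beta0

def accelerated_l1svm(X, y, lam, tau=0.2, n_iter=200, eta_tol=1e-3):
    """Accelerated proximal gradient for smoothed hinge + lambda * ||beta||_1 (Section 4.3 sketch)."""
    n, p = X.shape
    L = np.linalg.norm(np.column_stack([X, np.ones(n)]), 2) ** 2 / (4.0 * tau)   # C^tau
    beta = beta_prev = np.zeros(p)
    beta0 = beta0_prev = 0.0
    q = 1.0
    for _ in range(n_iter):
        q_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q ** 2))
        # momentum point alpha_{T+1}
        a_beta = beta + (q - 1.0) / q_next * (beta - beta_prev)
        a_beta0 = beta0 + (q - 1.0) / q_next * (beta0 - beta0_prev)
        g_beta, g_beta0 = smoothed_hinge_grad(X, y, a_beta, a_beta0, tau)
        beta_prev, beta0_prev, q = beta, beta0, q_next
        # proximal gradient step: soft-thresholding for the L1 regularizer
        step = a_beta - g_beta / L
        beta = np.sign(step) * np.maximum(np.abs(step) - lam / L, 0.0)
        beta0 = a_beta0 - g_beta0 / L
        if np.sqrt(np.sum((beta - beta_prev) ** 2) + (beta0 - beta0_prev) ** 2) <= eta_tol:
            break
    return beta, beta0
```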
Block Coordinate Descent (CD) for the Group-SVM problem: We describe a cyclical
proximal block coordinate descent (CD) algorithm [43] for the smoothed hinge-loss function with
the group regularizer. For the group-SVM experiments considered in Section 5, the block CD
approach was found to exhibit superior numerical performance compared to a full gradient-based
procedure. We note that [35] explore block CD like algorithms for a different class of group Lasso
type problems9 and our approaches differ.
We perform a proximal gradient step on the gth group of coefficients (with all other blocks and
β0 held fixed) via:

β_g^{t+1} ∈ arg min_{β_g}  (1/2) ‖ β_g − ( β_g^t − (1/C_g^τ) {∇F^τ(β_1^{t+1}, . . . , β_{g−1}^{t+1}, β_g^t, . . . , β_G^t, β_0^t)}_{I_g} ) ‖₂² + (λ/C_g^τ) ‖β_g‖_∞,        (65)

where {∇F^τ(·)}_{I_g} denotes the gradient restricted to the coordinates I_g and C_g^τ is its associated
Lipschitz constant: C_g^τ = σ_max(X_{I_g}^T X_{I_g})/(4τ). We cyclically update the coefficients across each
group g ∈ [G] and then update β_0. This continues till some convergence criterion is met.
Computational savings are possible for this block CD algorithm by a careful accounting of
floating point operations (flops). As one moves from one group to the next, the whole gradient can
be updated easily. To this end, note that the gradient ∇F^τ(β, β0) restricted to block g is given by:

{∇F^τ(β, β0)}_{I_g} = −(1/2) X_{I_g}^T { y ◦ (1 + w^τ) },

where ‘◦’ denotes element-wise multiplication. If w^τ is known, the above computation requires n|I_g|
flops. Recall that w^τ depends upon β via w_i^τ = min{1, |z_i|/(2τ)} sign(z_i), where z_i = 1 − y_i(x_i^T β +
β0) for all i. If β changes from β^old to β^new, then w^τ changes via an update in Xβ — this change
can be efficiently computed by noting that Xβ^new = Σ_{g∈[G]} X_{I_g} β_g^new = Xβ^old + X_{I_g} ∆β_g, where
∆β_g = β_g^new − β_g^old is a change that is only restricted to block g. Hence updating w^τ also requires
n|I_g| operations. The above suggests that one sweep of block CD across all the coordinates has a
cost similar to that of computing a full gradient. In addition, techniques like active set updates and
warm-start continuation [19] can lead to improved computational performance for CD, in practice.
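The bookkeeping above can be summarized in a few lines. The schematic sketch below (an illustration, not our implementation) maintains Xβ across block updates so that the gradient of each block costs O(n|I_g|) flops; prox_linf is assumed to implement the L∞ prox of Section 4.3, and the β0 update is omitted for brevity.

```python
import numpy as np

def block_cd_sweep(X, y, beta, beta0, Xbeta, groups, lam, tau, prox_linf):
    """One cycle of proximal block CD (65), updating X @ beta incrementally.

    `prox_linf(v, mu)` is assumed to return the prox of mu * ||.||_inf at v.
    """
    for Ig in groups:
        z = 1.0 - y * (Xbeta + beta0)
        w = np.clip(z / (2.0 * tau), -1.0, 1.0)
        grad_g = -0.5 * X[:, Ig].T @ (y * (1.0 + w))           # {grad F^tau}_{I_g}
        Cg = np.linalg.norm(X[:, Ig], 2) ** 2 / (4.0 * tau)    # block Lipschitz constant C_g^tau
        new_bg = prox_linf(beta[Ig] - grad_g / Cg, lam / Cg)   # proximal step in (65)
        Xbeta = Xbeta + X[:, Ig] @ (new_bg - beta[Ig])         # incremental update of X @ beta
        beta[Ig] = new_bg
    return beta, Xbeta
```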
8 This choice is user-dependent—there is a tradeoff between computation time and the quality of the solution. We recommend using a low accuracy solution as its purpose is to serve as an initialization for the column/constraint generation method.
9 The authors in [35] consider a different class of problems than those studied here; and use exact minimization for every block (they use a squared error loss function).
4.4 Scalability heuristics for large problem instances
When n and/or p becomes large, the first order algorithms discussed above become expensive.
Recall that the goal of the first order methods is to get a low-accuracy solution for the SVM problem
and in particular, an estimate of the initial columns and/or constraints for the column/constraint
generation algorithms. Hence, for scalability purposes, we use principled heuristics as a wrapper
around the first order methods, discussed above.
4.4.1 Correlation screening when p is large

When p ≫ n, we use a feature screening method inspired by correlation screening [38] to restrict
the number of features (or groups in the case of Group-SVM). We apply the first order methods on
this reduced set of features. Usually, for L1-SVM and Slope-SVM, we select, for example, the top
10n columns with highest absolute inner product (note that the features are standardized to have
unit L2-norm) with the output. For the Group-SVM problem: for each group, we compute the
inner products between every feature within this group and the response, and take their L1-norm.
We then sort these numbers and take the top n groups.
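A schematic numpy sketch of this screening step (assuming the columns of X are standardized to unit L2-norm):

```python
import numpy as np

def correlation_screen(X, y, n_keep):
    """Keep the n_keep features with largest |<x_j, y>| (features assumed unit-norm)."""
    scores = np.abs(X.T @ y)
    return np.argsort(-scores)[:n_keep]

def group_screen(X, y, groups, n_keep):
    """Keep the n_keep groups with largest sum_{j in I_g} |<x_j, y>|."""
    scores = np.abs(X.T @ y)
    group_scores = np.array([scores[Ig].sum() for Ig in groups])
    return np.argsort(-group_scores)[:n_keep]
```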
4.4.2 Subsampling when n is large

The methods described in Section 4.3 become expensive due to gradient computations when n
becomes large. When n is large but p is small, we use a subsampling method inspired by [26]. To
get an approximate solution to Problem (2), we apply the algorithm in Section 4.3 on a subsample
(y_i, x_i), i ∈ A, with sample-indices A ⊂ [n]. We (approximately) solve Problem (2) with λ ← (|A|/n) λ
(to adjust the dependence of λ on the sample size) by using the algorithms in Section 4.3. Let
the solution obtained be given by β̂(A). We obtain β̂(A_j) for different subsamples A_j, j ∈ [Q],
and average the estimators10: β̄_Q = (1/Q) Σ_{j∈[Q]} β̂(A_j). We maintain a counter for Q, and stop as
soon as the average stabilizes11, i.e., ‖β̄_Q − β̄_{Q−1}‖ ≤ µ_Tol for some tolerance threshold µ_Tol. The
estimate β̄_Q is used to obtain the violated constraints for the SVM problem and serves to initialize
the constraint generation method.
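A schematic sketch of the subsample-and-average heuristic; solve_first_order is a placeholder for an approximate solver of Problem (2) (e.g., the accelerated method of Section 4.3) and is not part of our code.

```python
import numpy as np

def subsample_average(X, y, lam, solve_first_order, frac=0.1, max_rounds=20, tol=1e-3, seed=0):
    """Average first order solutions over random subsamples until the average stabilizes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(frac * n)
    running_sum = np.zeros(p)
    prev_avg = np.zeros(p)
    for q in range(1, max_rounds + 1):
        A = rng.choice(n, size=m, replace=False)
        lam_A = lam * m / n                      # rescale lambda to the subsample size
        running_sum += solve_first_order(X[A], y[A], lam_A)
        avg = running_sum / q
        if q > 1 and np.linalg.norm(avg - prev_avg) <= tol:
            break
        prev_avg = avg
    return avg
```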
4.4.3 When both n and p are large

When both p and n are large, we apply a combination of the ideas
described above in Sections 4.4.1 and 4.4.2. More specifically, we choose a subsample A_j and, for this
subsample, we use correlation screening to reduce the number of features and obtain an estimator
β̂(A_j). We then average these estimators (across the A_j's) to obtain β̄_Q. If the support of β̄_Q is too
large, we sort the absolute values of the coefficients and retain the top few hundred coefficients (in
absolute value) to initialize the column generation method. The estimator β̄_Q is used to identify
10 We note that the estimates β̂(A_j) can all be computed in parallel.
11 We note that when n is large (but p is small), basic principles of statistical inference [39] suggest that the estimator β̂(A_j) will serve as a proxy of a minimizer of Problem (2) — we average the estimators β̂(A_j)'s to reduce variance.
the samples for which the hinge-loss is nonzero — these indices are used to initialize the constraint
generation method.
5 Experiments
We demonstrate the performance of our different methods on synthetic and real datasets. All
computations are performed in Python 3.6 on the MIT Engaging Cluster with 1 CPU and 16GB of
RAM. We use Gurobi 9.0.2 [20] in our experiments involving Gurobi’s LP solver. Sections 5.1, 5.2
and 5.3 present results for the L1-SVM, Group-SVM and Slope-SVM problems (respectively).
We measure the quality of the solution returned by an algorithm ‘Alg’ via its relative accuracy—the relative difference between the objective value attained by ‘Alg’ and the optimal objective value—which depends upon λ and ‘Alg’.
This value is averaged across R-many replications (hence the shorthand averaged relative accuracy: ARA).
Different initializations for column generation: We first study the role of different initializa-
tion schemes in column generation (denoted by the shorthand CLG below) for solving the L1-SVM
problem: we consider a fixed value of the regularization parameter, set to λ = 0.01λmax ; and
compare the following three schemes:
(a) RP-CLG: We compute a solution to Problem (2) at the desired value of λ, using a regularization
path (RP) (aka continuation) approach. We compute solutions on a grid of 7 regularization
parameter values between (1/2)λ_max and λ, using column generation (CLG) for every value of
the regularization parameter.
(b) FO-CLG: This is the column generation method initialized with a first order (FO) method (cf
Section 4.3) with smoothing parameter τ = 0.2. We use a termination criterion of η = 10−3
28
or a maximum number of Tmax = 200 iterations for the first order method. We use correlation
screening to retain the top 10n features before applying the first order method. The time
displayed includes the time taken to run the first order method. For reference, we report the
time taken to run column generation excluding the time of the first order method: “CLG wo
FO”.
(c) Cor. screening: This initializes the column generation method by using correlation screen-
ing to retain the top 50 features.
Figure 1: Comparison of different initialization schemes for the column generation method for the L1-SVM
LP. Left panel shows runtime (s) versus p. Right panel shows the corresponding optimization accuracy
(ARA) versus p up to 500, 000 (here, n = 100).
The comparative timings between FO-CLG and Cor. screening show the effectiveness of using
a first order method to initialize the column generation method. Method (a) computes a regular-
ization path (using column generation) to arrive at the desired value of λ — it does not use any
first order method like (b) — thus any timing difference between (a) and (b) is due to the role
played by the first order methods for warm-starting.
Figure 1 shows the results for synthetic datasets with n = 100, k0 = 10, ρ = 0.1 and different
values of p in the range 1000 to 500, 000 (results averaged across R = 10 replications). In this figure,
for the column generation methods, we consider a reduced cost threshold of ε = 0.001, and set the
maximum number of columns to be added in each iteration as 1000. The left panel in Figure 1
presents the run times and the right figure presents the ARA of different methods. As p increases,
the run time for column generation, when initialized with correlation screening, increases. Column
generation is found to benefit the most when initialized with the first order method (denoted by
FO-CLG). The runtime of the first order method is negligible compared to the time taken by column
generation, as seen from the nearly overlapping profiles of FO-CLG and CLG wo FO. The accuracies
of the different procedures (a)–(c) are all quite high ARA∼ 10−6 .
Comparison with benchmarks: We compare the performance of two column-generation meth-
ods RP-CLG and FO-CLG with the following benchmarks:
(d) PSM: This is a state-of-the-art algorithm [34] which is a parametric simplex based solver. We
use the software made available by Pang et al. [34] with default parameter-settings.
(e) FOM: This runs our first order method, denoted by FOM (we use accelerated gradient descent)
with τ = 0.02 for a maximum of T = 100, 000 iterations. We terminate the algorithm if the
maximum iteration limit is reached or the change in β (in `2 norm) in the past two iterations
is less than 10−3 .
(f ) SGD: This runs a stochastic sub-gradient algorithm on the L1-SVM Problem (2) using Python
scikit-learn package implementation (SGDClassifier) with fixed number of 10, 000 itera-
tions. The learning rate is set to the “optimal” parameter.
(g) SCS: This is the Splitting Conic Solver [33] version 2.1.2 with default parameter setting. The
solver is called through CVXPY [15]. This solver is a variant of the ADMM method [11].
(h) Gurobi: This is the LP solver of Gurobi [20] with default setting for solving the full L1-SVM
LP. The solver is called through CVXPY [15].
All the benchmarks used above optimize for the L1-SVM objective function (note that FOM
considers a smooth approximation of the hinge-loss).
Table 1 presents the results for different methods for λ = 0.05λmax [top panel] and λ = 0.2λmax
[bottom panel]. Here we run RP-CLG and FO-CLG with reduced cost tolerance ε = 0.01. For each
combination (n, p) = (100, 10K), (n, p) = (300, 10K) and (n, p) = (100, 50K), and for each method
considered, we show the runtime and associated ARA (results are averaged over 5 replications, and
numbers within parenthesis denote standard errors). For the instance in Table 1 with n = 100
and p = 50K, we run FOM for 2000 iterations—in this instance, stopping the algorithm with a
tolerance 10−3 leads to a low-quality solution.
In addition, to give an idea of the sparsity level of the solution (for the λ-values chosen), we present the support size of an optimal solution β̂ (computed by Gurobi) and the number of columns |J| in the restricted problem upon termination of FO-CLG. From Table 1, it can be seen that our proposed methods (RP-CLG and FO-CLG) outperform all the benchmark methods in runtime by a factor of 30X–500X. In terms of solution accuracy, our column generation methods (reaching an ARA of 10⁻⁵ or smaller) are comparable to Gurobi and PSM. The operator splitting method SCS leads to solutions of low accuracy: its ARA is around 10⁻³ for (n, p) = (100, 10K) and (300, 10K), and noticeably larger (around 10⁻¹–10⁻²) for (n, p) = (100, 50K). SGD and FOM also lead to low-accuracy solutions, with FOM showing somewhat better performance than SCS and SGD.
We note that the poor performance of SGD in Table 1 should not come as a surprise, as stochastic subgradient methods are not designed for small-n, large-p settings. In addition, given our earlier discussion, deterministic subgradient methods for nonsmooth problems have a slower convergence rate than Nesterov's smoothing technique — the FOM presented here is an instance of Nesterov's smoothing technique (see Section 4).
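As a point of reference for (e), the sketch below gives one plausible instantiation of a Nesterov-smoothed hinge combined with FISTA-type acceleration and soft-thresholding for the ℓ1 term; the particular smoothing function, step size and stopping rule are assumptions and may differ from the implementation used in our experiments.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fom_l1svm(X, y, lam, tau=0.02, max_iter=100_000, tol=1e-3):
    """Accelerated proximal gradient on the smoothed L1-SVM objective
    sum_i h_tau(1 - y_i(x_i'beta + beta0)) + lam * ||beta||_1, where
    h_tau(u) = max_{0 <= w <= 1} (w*u - tau/2 * w^2) is a smoothed hinge."""
    n, p = X.shape
    A = X * y[:, None]                                   # rows y_i * x_i
    # Lipschitz constant of the gradient of the smooth part (intercept included);
    # the spectral norm below could be replaced by a cheaper power iteration.
    L = np.linalg.norm(np.hstack([A, y[:, None]]), 2) ** 2 / tau
    step = 1.0 / L
    beta, b0 = np.zeros(p), 0.0
    zb, z0, t_k = beta.copy(), b0, 1.0                    # momentum variables
    for _ in range(max_iter):
        u = 1.0 - (A @ zb + y * z0)                       # margins at the momentum point
        w = np.clip(u / tau, 0.0, 1.0)                    # derivative of h_tau w.r.t. u
        beta_new = soft_threshold(zb + step * (A.T @ w), step * lam)
        b0_new = z0 + step * (y @ w)                      # intercept is not penalized
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_k ** 2))
        zb = beta_new + ((t_k - 1.0) / t_next) * (beta_new - beta)
        z0 = b0_new + ((t_k - 1.0) / t_next) * (b0_new - b0)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new, b0_new
        beta, b0, t_k = beta_new, b0_new, t_next
    return beta, b0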
λ = 0.05λmax
n = 100, p = 10K n = 300, p = 10K n = 100, p = 50K
Method Time (s) ARA Time (s) ARA Time (s) ARA
PSM 15.3(1.23) 6.9e-12(1.43e-12) 193.8(24.94) 1.8e-11(5.46e-12) 178.1(10.05) 2.0e-11(8.45e-12)
SGD 34.4(0.36) 9.5e-02(9.98e-03) 108.8(0.45) 6.1e-02(5.22e-03) 172.0(0.87) 1.6e-01(3.01e-02)
SCS 76.3(3.68) 1.4e-03(4.47e-04) 186.9(2.88) 1.2e-03(1.02e-04) 456.5(7.19) 7.3e-01(6.40e-01)
FOM 12.4(0.90) 2.0e-02(4.31e-04) 21.4(1.20) 9.1e-03(2.53e-04) 92.0(4.65) 3.0e-02(3.77e-04)
Gurobi 34.5(1.68) 0.0e+00(0.00e+00) 93.1(8.66) 0.0e+00(0.00e+00) 250.0(8.89) 0.0e+00(0.00e+00)
RP-CLG 0.4(0.03) 5.5e-06(3.81e-06) 2.0(0.14) 1.0e-06(6.43e-07) 3.3(0.56) 8.5e-06(5.79e-06)
FO-CLG 0.6(0.07) 6.7e-06(3.21e-06) 2.0(0.14) 7.6e-06(4.84e-06) 1.4(0.12) 2.0e-05(8.16e-06)
‖β̂‖₀ & |J|    55.0 & 265.8    92.8 & 218.2    65.6 & 257.8
λ = 0.2λmax
n = 100, p = 10K n = 300, p = 10K n = 100, p = 50K
Method Time (s) ARA Time (s) ARA Time (s) ARA
PSM 9.5(0.61) 3.9e-13(1.33e-13) 87.9(13.15) 8.9e-13(3.69e-13) 153.0(10.03) 2.9e-12(1.41e-12)
SGD 51.9(0.47) 1.1e-02(1.89e-03) 162.6(0.44) 5.3e-03(6.88e-04) 260.4(1.64) 1.1e-02(2.29e-03)
SCS 84.4(2.63) 3.0e-03(7.26e-04) 201.6(22.27) 2.3e-03(9.70e-04) 420.7(10.80) 4.2e-02(6.05e-03)
FOM 8.2(0.35) 3.7e-03(3.01e-04) 15.5(0.79) 1.0e-03(7.59e-05) 125.6(0.13) 6.4e-03(5.38e-04)
Gurobi 26.2(0.89) 1.6e-16(6.61e-17) 53.3(8.16) 1.2e-16(4.83e-17) 264.9(15.96) 0.0e+00(0.00e+00)
RP-CLG 0.5(0.10) 1.4e-07(9.93e-08) 0.4(0.08) 1.8e-08(1.65e-08) 4.8(0.20) 9.1e-07(8.18e-07)
FO-CLG 0.3(0.00) 2.2e-08(1.92e-08) 0.9(0.03) 1.9e-16(7.34e-17) 0.9(0.04) 6.5e-07(4.57e-07)
‖β̂‖₀ & |J|    36.8 & 73.6    30.8 & 42.8    53.0 & 144.8
Table 1: L1-SVM, Synthetic dataset (p ≫ n): Training time (s) for L1-SVM: our proposed column generation method versus various benchmarks on synthetic datasets. We show two different values of λ: λ = 0.05λmax [top table] and λ = 0.2λmax [bottom table]. Each table presents results for different values of (n, p) with p ≫ n. For each table, the last row presents the number of nonzeros in β̂ (an optimal solution to the L1-SVM problem) and the number of columns, |J|, in the restricted problem upon termination of FO-CLG. Results are averaged over 5 replications; numbers within parentheses denote standard errors.
Performance of CLG for different tolerance levels ε: Table 1 above presents the results for FO-CLG with ε = 0.01. In Figure 2 we study sensitivity to the choice of ε—we present the runtime and ARA of FO-CLG under different tolerance values ε ∈ {0.01, 0.03, 0.1, 0.3, 1} for λ = 0.05λmax. It can be seen that the ARA of FO-CLG changes with ε, with ε = 0.01 generally leading to a solution of high accuracy. As ε decreases, the runtime increases slightly.
Computing a path of solutions: The results above discuss obtaining a solution to the L1-SVM problem for a fixed value of λ. Here we present results for computing L1-SVM solutions on a grid of λ-values—we focus on comparing the performances of our column generation methods RP-CLG and FO-CLG. We fix n = 1000, p = 100K, k0 = 10, ρ = 0.1, and use a sequence of 50 values of λ. For convenience, let us denote the different values of λ by λ1 > λ2 > · · · > λm (here, m = 50). For RP-CLG, we solve the problems in the order λ1, λ2, . . . , λm, and use the solution at λi as a warm start to compute the solution at λi+1 (with column generation enabled). For FO-CLG, we solve the problems for the different λi's independently—we use our first order method to obtain the initial set of columns for column generation.
[Figure 2 here: two panels plotting the runtime (s) and the ARA of FO-CLG against the tolerance ε ∈ {0.01, 0.03, 0.1, 0.3, 1}, for (n, p) = (100, 10K), (300, 10K) and (100, 50K).]
Figure 2: Runtime (s) and ARA of FO-CLG under different tolerance levels ε for the L1-SVM LP. We consider three different problem sizes as indicated in the figure legends.
In the left panel of Figure 3, we present the runtime of both methods at each λi. We also present CLG wo FO, which denotes the runtime of column generation (excluding the runtime of the first order method) in the implementation of FO-CLG. In the right panel of Figure 3, we present the support size of the solution β̂ (denoted by “beta supp” in the figure). As shown in Figure 3, when λ decreases, the support size of the solution increases, and the runtimes of both RP-CLG and FO-CLG increase. The performance of RP-CLG, which performs warm-starts along the sequence of λ-values, appears to be superior to that of FO-CLG. Since FO-CLG does not use regularization-path continuation, one can compute solutions for the different λi-values in parallel, which is not possible with the sequential approach RP-CLG. Note, however, that in our experiments the L1-SVM solutions for the different λi-values are computed sequentially (and not in parallel). Based on this experiment, we recommend RP-CLG when one wishes to compute a path of solutions to the L1-SVM problem in a sequential fashion.
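The continuation scheme behind RP-CLG can be summarized by the following sketch. It is a simplified illustration: it carries the columns, but not the LP basis, across consecutive λ-values, and the crude correlation-screening start (top 50 columns), the use of CVXPY for the restricted solves, and the abs() guard on the duals are assumptions made for brevity.

import numpy as np
import cvxpy as cp

def solve_restricted(XJ, y, lam):
    # Restricted L1-SVM over the current columns; returns the coefficients and
    # the duals pi of the margin constraints (abs() guards against the solver's
    # sign convention for ">=" constraints).
    n, k = XJ.shape
    beta, b0, xi = cp.Variable(k), cp.Variable(), cp.Variable(n, nonneg=True)
    cons = [xi >= 1 - cp.multiply(y, XJ @ beta + b0)]
    cp.Problem(cp.Minimize(cp.sum(xi) + lam * cp.norm1(beta)), cons).solve()
    return beta.value, b0.value, np.abs(cons[0].dual_value)

def rp_clg_path(X, y, lam_grid, eps=0.01):
    """Continuation over a decreasing grid lam_1 > ... > lam_m: the set of columns
    (though not the LP basis) is carried over from one value of lambda to the next."""
    cols = list(np.argsort(-np.abs(X.T @ y))[:50])     # crude correlation-screening start
    path = {}
    for lam in lam_grid:
        while True:
            beta, b0, pi = solve_restricted(X[:, cols], y, lam)
            scores = np.abs(X.T @ (pi * y))            # pricing over all p columns
            new = [j for j in np.where(scores > lam + eps)[0] if j not in cols]
            if not new:                                # no column prices out: done
                break
            cols = cols + new
        path[lam] = (list(cols), beta, b0)             # columns warm-start the next lam
    return path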
[Figure 3 here: two panels plotted against κ = λ/λmax. Left: runtime (s) of RP-CLG, FO-CLG and CLG wo FO. Right: support size of β̂ (“beta supp”).]
Figure 3: L1-SVM solutions (n = 1K, p = 100K) on a regularization path. [Left] Runtime (s) versus κ, defined in (66). [Right] Support size of β, denoted by “beta supp(ort)”, i.e., the number of nonzero SVM coefficients β for different values of κ, as available from FO-CLG and RP-CLG. (The two profiles for “beta supp” are identical.)
5.1.2 Synthetic datasets for large n and small p
(i) FO-CNG: This is our constraint generation (CNG) method when initialized with a subsampling
based first order (FO) heuristic (cf Section 4.4).
For FO-CNG, we use a reduced cost convergence threshold of ε = 0.01 (we limit the number of constraints to be added to at most 400). We compare FO-CNG with several benchmarks: SGD, SCS, FOM and Gurobi, as discussed in Section 5.1.1. We do not present results for PSM as it was found to be much slower on these instances with large n (sometimes PSM would not run on these instances).
Table 2 presents the results for λ = 0.001λmax and λ = 0.01λmax. The sub-tables of Table 2 consider (n, p) = (10K, 100), (10K, 300) and (50K, 100), and we present results averaged over 5 replications. As constraint generation leverages sparsity in ξ, to give an idea of the sparsity of the problem we are dealing with, we present (i) the support size of the solution ξ̂ (computed by Gurobi); and (ii) the number of constraints |I| in the restricted problem upon termination of FO-CNG. From Table 2, it can be seen that FO-CNG outperforms the other methods by a factor of 4X–30X. In particular, FO-CNG performs better when λ is small (recall that we are considering the setting where n ≫ p). This is because a small value of λ imparts less shrinkage on the SVM coefficients β, hence the support size of ξ̂ is small (i.e., the number of misclassified samples is small). As a result, constraint generation speeds up the overall runtime, making FO-CNG computationally attractive. Both FO-CNG and Gurobi (solving the full LP) reach high-accuracy solutions. The accuracy of the solutions obtained by FO-CNG is considerably higher than that of SGD, SCS and FOM. In this example (n ≫ p), we observe that SGD performs better than in the examples of Section 5.1.1, where p ≫ n. Finally, we note that in examples like those of Table 2, when λ is very large and the support size of ξ̂ is close to n, the runtime of FO-CNG will likely increase.
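The constraint-generation check underlying FO-CNG can be sketched as follows; this is a minimal illustration under the convention that samples outside the restricted set I have their hinge terms dropped (ξi = 0), and the cap of 400 added constraints matches the setting above.

import numpy as np

def violated_constraints(X, y, beta, b0, in_I, eps=0.01, max_add=400):
    """One pricing step of constraint generation for L1-SVM when n >> p.

    in_I is a boolean mask marking the samples whose margin constraints are
    currently in the restricted LP. For samples outside I, the constraint is
    violated whenever 1 - y_i(x_i'beta + b0) exceeds eps; the most violated
    ones (at most max_add) are returned for addition."""
    slack = 1.0 - y * (X @ beta + b0)          # positive values signal violations
    cand = np.where(~in_I & (slack > eps))[0]
    order = np.argsort(-slack[cand])           # most violated first
    return cand[order[:max_add]]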
We study the performance of the algorithms when both n and p are comparable and moderately
large. In this case, we use the following method with both column and constraint generation:
(j) FO-CLCNG: This is the combined column-and-constraint generation method, denoted by the shorthand CLCNG (i.e., Algorithm 4), initialized with the first order method discussed in Section 4.4.3 (see footnote 12). The column/constraint generation reduced cost thresholds are set to be equal: ε := ε1 = ε2.
For FO-CLCNG, we use ε = 0.01 and limit the maximum number of columns/constraints added at an iteration to 400. We compare FO-CLCNG with the benchmarks SGD, SCS, FOM and Gurobi under the same setting as Section 5.1.1.
Footnote 12: For the subsampling heuristic, once the average estimate was obtained, we took the top 200 coefficients (in terms of absolute value) to initialize the set of columns for column generation.
λ = 0.001λmax
n = 10K, p = 100 n = 10K, p = 300 n = 50K, p = 100
Method Time (s) ARA Time (s) ARA Time (s) ARA
SGD 54.2(4.05) 1.8e-02(8.78e-04) 117.4(2.89) 4.5e-02(2.19e-03) 313.1(3.19) 7.4e-03(3.77e-04)
SCS 51.4(4.03) 1.7e-04(4.31e-05) 117.6(1.74) 5.2e-04(1.90e-04) 241.9(11.19) 3.9e-05(7.35e-06)
FOM 170.1(3.40) 2.8e-03(1.39e-04) 207.0(9.03) 5.9e-03(1.12e-04) 1147.4(35.85) 5.9e-04(1.73e-05)
Gurobi 66.3(3.52) 0.0e+00(0.00e+00) 133.8(5.48) 0.0e+00(0.00e+00) 626.6(42.49) 3.4e-16(1.95e-16)
FO-CNG 3.0(0.06) 1.3e-05(1.12e-05) 6.3(0.24) 2.1e-15(1.99e-16) 20.8(0.22) 0.0e+00(0.00e+00)
‖ξ̂‖₀ & |I|    87.8 & 443.8    88.8 & 362.2    538.6 & 2277.8
λ = 0.01λmax
n = 10K, p = 100 n = 10K, p = 300 n = 50K, p = 100
Method Time (s) ARA Time (s) ARA Time (s) ARA
SGD 55.6(2.88) 4.0e-03(3.51e-04) 130.1(4.50) 8.2e-03(2.38e-04) 326.7(5.36) 3.6e-03(1.28e-04)
SCS 28.1(3.99) 2.3e-05(3.62e-06) 79.6(7.68) 3.2e-05(6.57e-06) 112.9(7.21) 1.2e-05(3.27e-06)
FOM 47.8(0.90) 5.8e-04(1.05e-05) 68.8(4.00) 1.0e-03(3.21e-05) 297.4(0.82) 1.3e-04(5.87e-06)
Gurobi 49.3(1.84) 9.1e-17(5.92e-17) 104.1(5.43) 7.3e-17(6.56e-17) 377.8(15.05) 2.1e-16(1.89e-16)
FO-CNG 9.0(0.26) 3.5e-06(3.16e-06) 11.4(0.32) 8.3e-06(3.14e-06) 32.4(0.83) 1.3e-06(4.80e-07)
‖ξ̂‖₀ & |I|    501.4 & 1090.2    472.2 & 813.8    2559.4 & 3472.8
Table 2: L1-SVM, Synthetic dataset (n ≫ p): Training time for L1-SVM versus state-of-the-art methods on synthetic datasets. We show two different values of λ: λ = 0.001λmax [top table] and λ = 0.01λmax [bottom table]. Each table presents results for different values of (n, p) with n ≫ p. For each table, the last row presents the number of nonzeros in ξ in an optimal solution to the problem and the number of active constraints |I| upon termination of FO-CNG.
Once again, PSM is found to be significantly slow when n is large, hence we do not include it in these results. Table 3 presents the results for λ = 0.01λmax and λ = 0.1λmax, with (n, p) = (3K, 3K), (2K, 5K) and (5K, 2K). Note that in these examples X is dense, so we do not consider larger problem sizes—larger n, p values with a sparse X are considered in Section 5.1.4.
Table 3 presents runtimes (s) and ARA values (means and standard errors across 5 independent experiments). To give an idea of the sparsity level of the problem, we also present the support sizes of the solutions β̂ and ξ̂ (computed by Gurobi), as well as the number of columns and constraints (|J| and |I|) in the restricted problem upon termination of FO-CLCNG. As shown in the sub-tables of Table 3, for λ = 0.01λmax, FO-CLCNG has a 7X–30X speedup over the other methods; for λ = 0.1λmax, FO-CLCNG has a 4X–50X speedup. At the same time, it is important to note that the ARA of FO-CLCNG is around 10⁻⁵–10⁻⁶ — this is notably better than that of SCS, FOM and SGD.
Performance under different tolerance levels ε: Table 3 presents results of FO-CLCNG for a fixed value of ε = 0.01. In Figure 4, to understand sensitivity to the choice of ε, we present the runtime and ARA of FO-CLCNG under different choices of ε ∈ {0.01, 0.03, 0.1, 0.3, 1} for λ = 0.01λmax. The presented results are the mean of 5 independent replications. Similar to Figure 2, it can be seen in Figure 4 that as ε decreases from 1 to 0.01, the ARA of FO-CLCNG decreases from 10⁻¹ to 10⁻⁵–10⁻⁶, while the runtime only increases slightly. Therefore a tolerance of ε = 0.01 is sufficiently small and leads to fairly high accuracy in the numerical experiments considered.
λ = 0.01λmax
n = 3K, p = 3K n = 2K, p = 5K n = 5K, p = 2K
Method Time (s) ARA Time (s) ARA Time (s) ARA
SGD 393.3(32.06) 1.4e-01(7.79e-03) 387.3(25.89) 4.2e-01(1.20e-02) 486.9(46.32) 4.5e-02(1.28e-03)
SCS 281.8(34.91) 1.7e-04(1.73e-05) 261.7(37.13) 3.0e-04(7.05e-05) 426.3(37.83) 1.3e-04(2.44e-05)
FOM 100.6(9.74) 5.2e-03(1.27e-04) 93.4(5.33) 7.9e-03(1.83e-04) 142.3(12.46) 3.0e-03(5.39e-05)
Gurobi 103.9(27.61) 0.0e+00(0.00e+00) 109.8(31.84) 0.0e+00(0.00e+00) 585.9(57.41) 0.0e+00(0.00e+00)
FO-CLCNG 14.1(0.40) 3.5e-07(2.59e-07) 9.9(0.11) 8.2e-06(2.11e-06) 20.2(0.44) 3.6e-06(2.82e-06)
‖β̂‖₀ & |J|    170.2 & 646.4    162.0 & 692.0    188.0 & 634.2
‖ξ̂‖₀ & |I|    132.8 & 565.2    87.6 & 533.4    221.4 & 668.8
λ = 0.1λmax
n = 3K, p = 3K n = 2K, p = 5K n = 5K, p = 2K
Method Time (s) ARA Time (s) ARA Time (s) ARA
SGD 521.6(40.17) 3.5e-03(1.80e-04) 527.9(26.79) 5.8e-03(6.02e-04) 539.5(24.21) 2.4e-03(2.41e-04)
SCS 150.5(22.98) 3.4e-04(4.09e-05) 178.7(9.48) 3.8e-04(1.12e-04) 259.2(26.38) 8.2e-05(1.40e-05)
FOM 62.2(4.98) 2.6e-04(1.36e-05) 60.5(2.72) 4.4e-04(4.86e-05) 71.0(5.52) 1.3e-04(1.47e-05)
Gurobi 47.6(6.74) 0.0e+00(0.00e+00) 44.4(5.67) 0.0e+00(0.00e+00) 354.9(116.66) 0.0e+00(0.00e+00)
FO-CLCNG 11.7(0.14) 3.1e-06(1.65e-06) 8.3(0.13) 8.2e-07(6.71e-07) 18.7(0.16) 1.2e-06(6.69e-07)
‖β̂‖₀ & |J|    56.8 & 255.8    68.4 & 322.2    48.0 & 235.8
‖ξ̂‖₀ & |I|    750.8 & 1053.2    504.8 & 751.2    1287.0 & 1657.4
Table 3: L1-SVM, Synthetic dataset (n ≈ p): Training time for L1-SVM versus state-of-the-art methods on synthetic datasets. We show two different values of λ: λ = 0.01λmax [top table] and λ = 0.1λmax [bottom table]. Each table presents results for different values of (n, p) with n ≈ p. For each table, the last two rows present the number of nonzeros in β and ξ in an optimal solution to the problem and the number of active variables and constraints (|J| and |I|) upon termination of FO-CLCNG.
[Figure 4 here: two panels plotting the runtime (s) and the ARA of FO-CLCNG against the tolerance ε ∈ {0.01, 0.03, 0.1, 0.3, 1}, for (n, p) = (3K, 3K), (2K, 5K) and (5K, 2K).]
Figure 4: Runtime (s) and ARA of FO-CLCNG under different choices of the tolerance threshold ε. We consider 3 different (n, p)-values. (This figure mirrors Figure 2, which shows results for column generation alone: FO-CLG.)
Finally, we assess the quality of our hybrid column-and-constraint generation method (FO-CLCNG) on large real datasets. For a fair comparison, we compare different methods in terms of their ability to optimize the same L1-SVM optimization problem. We consider three popular open-source datasets, rcv1, news20 and real-sim, available at https://www.csie.ntu.edu.tw/~cjlin/liblinear/. We also consider a larger dataset derived from rcv1: We augment the original
features of rcv1 with noisy features obtained by randomly selecting a collection of features from the original dataset and randomly permuting the rows of the selected features. We call this augmented dataset rcv1-aug. Similarly, we form an augmented version of the dataset real-sim that we denote by real-sim-aug. Note that all these datasets are sparse—the number of nonzero entries in X, denoted by nnz(X), is quite small compared to np; we use sparse matrices (scipy implementation) for the sparse matrix/vector multiplications.
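A minimal sketch of the augmentation step is given below; whether the columns are sampled with or without replacement, and whether one shared or per-column row permutation is used, are implementation details that we leave as assumptions here.

import numpy as np
import scipy.sparse as sp

def augment_with_noisy_features(X, n_extra, seed=0):
    """rcv1-aug / real-sim-aug style augmentation: append n_extra noisy columns
    obtained by sampling columns of the sparse matrix X and permuting their rows
    (a single shared permutation here), so the new features carry no signal."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    cols = rng.integers(0, p, size=n_extra)     # columns sampled with replacement
    perm = rng.permutation(n)
    noisy = X[:, cols][perm, :]                 # scramble rows of the selected columns
    return sp.hstack([X, noisy], format="csr")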
We compare FO-CLCNG (we use a combination of column and constraint generation as n and p are both large) with the benchmark methods Gurobi, SGD and SCS. All methods are run under the settings explained in Section 5.1.1, except that here we run SGD for 20,000 epochs to arrive at an ARA of about 10⁻²–10⁻³.
Table 4 presents the sizes of the datasets considered and the runtime (s) and ARA of the different algorithms. We consider a sequence of 11 values of λ = κλmax with κ ∈ [0.003, 0.3]. To make the comparisons fair, all algorithms are run independently for the different λ-values, and we report the runtime and ARA averaged across the λ-values. In the last row of each sub-table, we present the minimum, mean and maximum of the value ‖β̂‖₀ + ‖ξ̂‖₀ over the path of λ-values. The numbers are presented in the form of the triplet “(min, mean, max)”, where (β̂, ξ̂) is the solution obtained from FO-CLCNG.
On the datasets rcv1 and rcv1-aug, our proposed method FO-CLCNG outperforms SCS and Gurobi in runtime by a factor of 3X–8X and delivers solutions of higher accuracy (ARA). For the other two datasets, news20 and real-sim-aug, Gurobi and SCS run out of memory for some values of λ (we set a 16GB memory limit). As our column-and-constraint generation procedure operates on a smaller reduced problem, it consumes less memory. The solutions of FO-CLCNG have high accuracy, with ARA around 10⁻⁶. Comparing FO-CLCNG and SGD, we see that the runtime of 20,000 epochs of SGD is comparable to that of FO-CLCNG on the rcv1 dataset, while SGD is slower than FO-CLCNG on the other datasets. Note, however, that the optimization accuracy of the SGD solutions is significantly worse than that of FO-CLCNG.
Dataset: rcv1 Dataset: rcv1-aug
(n=16194, p=47237, nnz=1.20e+06) (n=16194, p=236185, nnz=6.00e+06)
method Time (s) ARA method Time (s) ARA
Gurobi 442.0 0.0e+00 Gurobi 1590.5 0.0e+00
SCS 1185.9 2.2e-04 SCS 3318.8 9.2e-04
SGD 256.3 1.2e-02 SGD 1419.3 1.4e-02
FO-CLCNG 142.9 1.0e-06 FO-CLCNG 442.2 1.1e-06
‖ξ̂‖₀ + ‖β̂‖₀    (3212, 6538, 10580)    ‖ξ̂‖₀ + ‖β̂‖₀    (3997, 6765, 10512)
Dataset: news20 Dataset: real-sim-aug
(n=15997, p=1355191, nnz=7.31e+06) (n=57847, p=104795, nnz=1.47e+07)
method Time (s) ARA method Time (s) ARA
Gurobi - - Gurobi - -
SCS - - SCS - -
SGD 2014.4 4.3e-03 SGD 2908.9 2.5e-03
FO-CLCNG 112.0 0.0e+00 FO-CLCNG 955.3 0.0e+00
‖ξ̂‖₀ + ‖β̂‖₀    (7106, 10010, 13853)    ‖ξ̂‖₀ + ‖β̂‖₀    (11975, 18931, 26632)
Table 4: L1-SVM on real datasets with both n, p large. We compare our method FO-CLCNG versus other benchmarks (in terms of runtime and ARA) on a range of λ-values, as discussed in the text. A “-” means that the method would not run due to memory limitations and/or numerical problems. The last row of every sub-table provides the (minimum, average, maximum)-tuple of the support size ‖ξ̂‖₀ + ‖β̂‖₀, where the minimum, average and maximum values are taken across the sequence of λ-values considered.
Group-SVM (p ≫ n)
n = 100, p = 10K n = 300, p = 10K n = 100, p = 30K
Method Time (s) ARA Time (s) ARA Time (s) ARA
SCS 78.1(16.05) 8.7e-04(1.10e-04) 82.6(14.65) 7.0e-04(2.89e-04) 287.4(21.38) 1.1e-02(5.55e-03)
Gurobi 120.8(5.16) 3.4e-17(5.91e-17) 321.7(14.19) 3.4e-17(6.85e-17) 503.8(26.53) 6.8e-17(6.80e-17)
FO-CLG 2.1(0.17) 1.0e-16(1.14e-16) 3.8(0.13) 1.8e-16(1.91e-16) 2.3(0.14) 6.7e-17(1.15e-16)
Table 5: Training time (s) and ARA for Group-SVM versus various benchmarks on synthetic datasets,
λ = 0.1λmax .
coordinate descent procedure (cf Section 4.3). We use a smoothing parameter τ = 0.2 (for the hinge loss) and use a CD method restricted to the top n groups obtained via correlation screening (cf Section 4.4.1). We use a reduced cost tolerance of ε = 0.01 in the column generation method. Table 5 shows the results on synthetic datasets with (n, p) = (100, 10K), (300, 10K) and (100, 30K), with λ = 0.1λmax. The reported values are based on 5 independent replications. We can see that on these examples, FO-CLG outperforms the other two methods by a large factor (≥ 30X) in runtime, and also delivers a solution of high accuracy (as seen from the ARA values).
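A simple way to carry out the group-level correlation screening mentioned above is sketched below; the ℓ2 aggregation of the feature-level scores within each group is an assumption (other aggregations, e.g. ℓ∞, would serve the same screening purpose).

import numpy as np

def screen_top_groups(X, y, groups, n_keep):
    """Group-level correlation screening: score each group by the l2 norm of the
    absolute correlations of its features with the labels and keep the n_keep
    highest-scoring groups; the CD method is then run on these groups only.
    `groups` is a list of integer index arrays, one per group."""
    corr = np.abs(X.T @ y)                                   # feature-level scores
    scores = np.array([np.linalg.norm(corr[g]) for g in groups])
    keep = np.argsort(-scores)[:n_keep]
    return [groups[i] for i in keep]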
Comparison when the λi's are not all distinct: We are not aware of any publicly available specialized implementation for the Slope-SVM problem. We use the CVXPY modeling framework to model the Slope-SVM problem (see Section A.1) and solve it using state-of-the-art solvers such as Ecos and Gurobi. We first consider a special instance of the Slope penalty (37) corresponding to the coefficients λi = 2λ̃ for i ≤ k0 and λi = λ̃ for i > k0, where λ̃ = 0.01λmax. We solve the resulting problem with both the Ecos and Gurobi solvers, denoted by “CVXPY Ecos” and “CVXPY Gurobi” (respectively). We compare them with our proposed column-and-constraint generation algorithm, referred to as “FO-CLCNG”. For our method, we first run the first order algorithm presented in Section 4.3 (for τ = 0.2) restricted to the 10n columns with the highest absolute correlations with the response (the remaining coefficients are all set to zero). The column-and-constraint generation algorithm (cf Section 3.3) uses a tolerance level of ε = 0.001. We limit the number of columns added at each iteration to 10. For reference, we also report the run time of our algorithm excluding the time taken by the initialization step—this is denoted by “CLCNG wo FO”. The results, averaged over 5 replications, are presented in Table 6. The results in Table 6 indicate that our proposed method FO-CLCNG exhibits a 50X–110X improvement over the competing solvers. In some cases, when p ≥ 50K, CVXPY Ecos encounters numerical problems and hence does not run. On the other hand, we note that FO-CLCNG has the best ARA for p ≥ 20K, even when compared with CVXPY Gurobi. This may be because the model size is large and CVXPY Gurobi appears to use a low-accuracy termination criterion.
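For this two-level weight sequence, the Slope penalty reduces to λ̃‖β‖₁ plus λ̃ times the sum of the k0 largest |βj|, which CVXPY can express directly via sum_largest; a minimal sketch (with toy data, not our Section A.1 model) is given below.

import numpy as np
import cvxpy as cp

# Toy data standing in for the synthetic generator; lam_tilde plays the role of
# 0.01 * lambda_max from the text.
rng = np.random.default_rng(0)
n, p, k0, lam_tilde = 100, 2000, 10, 0.05
X = rng.standard_normal((n, p))
y = np.where(rng.standard_normal(n) > 0, 1.0, -1.0)

beta, b0 = cp.Variable(p), cp.Variable()
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ beta + b0)))
# Slope penalty with lambda_i = 2*lam_tilde (i <= k0) and lam_tilde (i > k0):
# lam_tilde * ||beta||_1 + lam_tilde * (sum of the k0 largest |beta_j|).
slope_pen = lam_tilde * cp.norm1(beta) + lam_tilde * cp.sum_largest(cp.abs(beta), k0)
cp.Problem(cp.Minimize(hinge + slope_pen)).solve(solver=cp.ECOS)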
Slope-SVM (p ≫ n)
FO-CLCNG CLCNG wo FO CVXPY Ecos CVXPY Gurobi
p Time (s) ARA Time (s) Time (s) ARA Time (s) ARA
10k 1.7(0.07) 1.3e-06 1.2(0.05) 64.3(6.81) 7.0e-12 84.0(2.86) 0.0e+00
20k 3.0(0.01) 0.0e+00 2.5(0.00) 130.5(3.17) 1.7e-05 221.9(5.64) 1.7e-05
50k 7.1(0.35) 0.0e+00 6.6(0.34) - - 842.3(20.06) 9.1e-06
100k 16.5(0.02) 0.0e+00 15.9(0.02) - - 1837.3(70.44) 4.1e-06
Table 6: Training times and ARA of our column-and-constraint generation method for Slope-SVM versus CVXPY—we took λi/λj = 2 for all i ∈ [k0] and j > k0 (as described in the text). When the number of features is on the order of tens of thousands (here, n = 100), our proposed method (FO-CLCNG) enjoys nearly a 100X speedup in run time. A “–” symbol denotes that the corresponding algorithm encountered numerical problems.
Comparison when the λi's are distinct: Here we consider a different sequence of λ-values: Following [5], we set λj = √(log(2p/j)) λ̃ with λ̃ = 0.01λmax. We observed that CVXPY could not handle even small instances of this problem (as the λj's are distinct) — in particular, the Ecos solver crashed for n = 100, p = 200. We compare our proposed method FO-CLCNG with the first order method (cf. Section 4.3) applied to the full smoothed Slope-SVM problem with τ = 0.2. Due to the high per-iteration cost of the first order method (FOM), we terminate the method after a few iterations (the associated ARA is reported in Table 7). Table 7 compares our methods—we use the same synthetic dataset as in the previous Slope-SVM example and average the results over 5 replications (the first order method was run for one replication due to its long run time).
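For reference, this weight sequence can be generated in a couple of lines (lam_tilde below is a placeholder for 0.01λmax):

import numpy as np

p, lam_tilde = 10_000, 0.01                      # lam_tilde stands for 0.01 * lambda_max
j = np.arange(1, p + 1)
lam = lam_tilde * np.sqrt(np.log(2 * p / j))     # non-increasing in j, as Slope requires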
Slope-SVM (p ≫ n)
FO-CLCNG CLCNG wo FO FOM
p Time (s) ARA Time (s) Time (s) ARA
10k 1.4(0.11) 0.0e+00(0.00e+00) 0.5(0.05) 164.6 3.0e-01
20k 1.6(0.09) 0.0e+00(0.00e+00) 0.7(0.04) 427.3 3.2e-01
50k 3.3(0.20) 0.0e+00(0.00e+00) 2.4(0.14) 1633.8 3.2e-01
Table 7: Training times and ARA of our column-and-constraint generation method for Slope-SVM with coefficients λj = √(log(2p/j)) λ̃ (as in the text) on synthetic datasets with a large number of features (here, n = 100). Our proposed approach (FO-CLCNG) outperforms the first order method in runtime (by at least 100X). In addition, FO-CLCNG delivers solutions of higher accuracy compared to FOM.
6 Acknowledgements
The authors would like to thank the Action Editor and three anonymous reviewers for their helpful
comments and constructive feedback that helped improve the paper. This research was supported, in part, by grants from the Office of Naval Research: ONR-N000141812298 (YIP), the National Science Foundation: NSF-IIS-1718258, and IBM.
A Appendix
with variables α, vm ∈ Rp, θm ∈ R, and 1 ∈ Rp being a vector of all ones. Note that the rhs formulation in (67) has O(p) variables and O(p) constraints. Further, we can write:
    Σ_{j=1}^p λj α(j)  =  Σ_{m=1}^p λ̃m ( α(1) + · · · + α(m) ),
where λ̃m = λm − λm−1 for all m ∈ {1, . . . , p}. Therefore, representing Σ_{j=1}^p λj α(j) ≤ η will require a representation (67) for every m = 1, . . . , p — this leads to a formulation with O(p²) variables and O(p²) constraints, which can be quite large as soon as p becomes a few hundred.
References
[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms.
Optimization for Machine Learning, 5:19–53, 2011.
[2] P. Balamurugan, A. Posinasetty, and S. Shevade. ADMM for training sparse structural SVMs with
augmented l1 regularizers. In Proceedings of the 2016 SIAM International Conference on Data Mining,
pages 684–692. SIAM, 2016.
[3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM journal on imaging sciences, 2(1):183–202, 2009.
[4] S. R. Becker, E. J. Candès, and M. C. Grant. Templates for convex cone problems with applications to
sparse signal recovery. Mathematical programming computation, 3(3):165, 2011.
[5] P. C. Bellec, G. Lecué, A. B. Tsybakov, et al. Slope meets Lasso: improved oracle bounds and optimality.
The Annals of Statistics, 46(6B):3603–3642, 2018.
[6] D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific
Belmont, MA, 1997.
[7] M. Bogdan, E. v. d. Berg, W. Su, and E. Candes. Statistical estimation and testing via the sorted l1
norm. arXiv preprint arXiv:1310.1969, 2013.
[8] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. Slope adaptive variable selection
via convex optimization. The annals of applied statistics, 9(3):1103, 2015.
[9] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMP-
STAT’2010, pages 177–186. Springer, 2010.
[10] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
[11] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends® in Machine
learning, 3(1):1–122, 2011.
[12] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector
machines. In ICML, volume 98, pages 82–90, 1998.
[13] G. B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations research, 8(1):
101–111, 1960.
[14] J. Desrosiers and M. E. Lübbecke. A primer in column generation. In Column generation, pages 1–32.
Springer, 2005.
[15] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization.
Journal of Machine Learning Research, 17(83):1–5, 2016.
[16] L. R. Ford Jr and D. R. Fulkerson. A suggested computation for maximal multi-commodity network
flows. Management Science, 5(1):97–101, 1958.
[17] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In
Proceedings of the 25th international conference on Machine learning, pages 320–327. ACM, 2008.
[18] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via
coordinate descent. Journal of statistical software, 33(1):1, 2010.
[19] J. Friedman, T. Hastie, and R. Tibshirani. Regularized paths for generalized linear models via coordinate
descent. Journal of Statistical Software, 33(1), 2010.
[20] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2021. URL http://www.gurobi.com.
[21] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector
machine. Journal of Machine Learning Research, 5(Oct):1391–1415, 2004.
[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Infer-
ence, and Prediction. Springer New York, 2 edition, 2009.
[23] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent
method for large-scale linear SVM. In Proceedings of the 25th international conference on Machine
learning, pages 408–415. ACM, 2008.
[24] J. Huang and T. Zhang. The benefit of group sparsity. The Annals of Statistics, 38(4):1978–2004, 2010.
[25] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD interna-
tional conference on Knowledge discovery and data mining, pages 217–226. ACM, 2006.
[26] J. D. Lee, Q. Liu, Y. Sun, and J. E. Taylor. Communication-efficient sparse regression. Journal of
Machine Learning Research, 18(5):1–30, 2017.
[27] X. Li, T. Zhao, X. Yuan, and H. Liu. The flare package for high dimensional linear regression and precision matrix estimation in R. J. Mach. Learn. Res., 16:553–557, 2015.
[28] O. L. Mangasarian. Exact 1-norm support vector machines via unconstrained convex differentiable
minimization. Journal of Machine Learning Research, 7(Jul):1517–1530, 2006.
[29] J.-J. Moreau. Dual convex functions and proximal points in a Hilbert space. CR Acad. Sci. Paris Ser.
At Math., 255:2897–2899, 1962.
[30] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Norwell, 2004.
[31] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127–
152, 2005.
[32] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140
(1):125–161, 2013.
[33] B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 2.1.2. https://github.com/cvxgrp/scs, Nov. 2019.
[34] H. Pang, H. Liu, R. J. Vanderbei, and T. Zhao. Parametric simplex method for sparse learning. In
Advances in Neural Information Processing Systems, pages 188–197, 2017.
[35] Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group
Lasso. Mathematical Programming Computation, 5(2):143–169, 2013.
[36] T. Robertson, F. T. Wright, and R. L. Dykstra. Order restricted statistical inference. Wiley, New York, 1988.
[37] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM.
In Proceedings of the 24th international conference on Machine learning, pages 807–814. ACM, 2007.
[38] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules
for discarding predictors in Lasso-type problems. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 74(2):245–266, 2012.
[39] S. A. van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.
[40] E. van den Berg and M. P. Friedlander. SPGL1: A solver for large-scale sparse reconstruction, June
2007. http://www.cs.ubc.ca/labs/scl/spgl1.
[41] E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM
Journal on Scientific Computing, 31(2):890–912, 2008. doi: 10.1137/080714488. URL http://link.aip.org/link/?SCE/31/890.
[42] V. Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
[43] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[44] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and
software for large-scale l1-regularized linear classification. The Journal of Machine Learning Research,
11:3183–3234, 2010.
[45] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[46] X. Zeng and M. Figueiredo. The ordered weighted l1 norm: Atomic formulation, dual norm, and projections. arXiv preprint arXiv:1409.4271, 2014.