Recent Advances of Large-Scale Linear Classification
ABSTRACT | Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and applied them to some large-scale applications. In this paper, we give a comprehensive survey on the recent development of this active research area.

KEYWORDS | Large linear classification; logistic regression; multiclass classification; support vector machines (SVMs)

Manuscript received June 16, 2011; revised November 24, 2011; accepted February 3, 2012. Date of publication March 30, 2012; date of current version August 16, 2012. This work was supported in part by the National Science Council of Taiwan under Grant 98-2221-E-002-136-MY3.
G.-X. Yuan was with the Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan. He is currently with the University of California Davis, Davis, CA 95616 USA (e-mail: [email protected]).
C.-H. Ho and C.-J. Lin are with the Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan (e-mail: [email protected]; [email protected]).
Digital Object Identifier: 10.1109/JPROC.2012.2188013

I. INTRODUCTION

Linear classification is a useful tool in machine learning and data mining. In contrast to nonlinear classifiers such as kernel methods, which map data to a higher dimensional space, linear classifiers directly work on data in the original input space. While linear classifiers fail to handle some inseparable data, they may be sufficient for data in a rich dimensional space. For example, linear classifiers have shown to give competitive performances on document data with nonlinear classifiers. An important advantage of linear classification is that training and testing procedures are much more efficient. Therefore, linear classification can be very useful for some large-scale applications. Recently, the research on linear classification has been a very active topic. In this paper, we give a comprehensive survey on the recent advances.

We begin with explaining in Section II why linear classification is useful. The differences between linear and nonlinear classifiers are described. Through experiments, we demonstrate that for some data, a linear classifier achieves comparable accuracy to a nonlinear one, but both training and testing times are much shorter. Linear classifiers cover popular methods such as support vector machines (SVMs) [1], [2], logistic regression (LR),¹ and others. In Section III, we show optimization problems of these methods and discuss their differences.

¹ It is difficult to trace the origin of logistic regression, which can be dated back to the 19th century. Interested readers may check the investigation in [3].

An important goal of the recent research on linear classification is to develop fast optimization algorithms for training (e.g., [4]–[6]). In Section IV, we discuss issues in finding a suitable algorithm and give details of some representative algorithms. Methods such as SVM and LR were originally proposed for two-class problems. Although past works have studied their extensions to multiclass problems, the focus was on nonlinear classification. In Section V, we systematically compare methods for multiclass linear classification.

Linear classification can be further applied to many other scenarios. We investigate some examples in Section VI. In particular, we show that linear classifiers can be effectively employed to either directly or indirectly approximate nonlinear classifiers. In Section VII, we
discuss an ongoing research topic for data larger than memory or disk capacity. Existing algorithms often fail to handle such data because they assume that data can be stored in a single computer's memory. We present some methods which try to reduce data reading or communication time. In Section VIII, we briefly discuss related topics such as structured learning and large-scale linear regression. Finally, Section IX concludes this survey paper.

II. WHY IS LINEAR CLASSIFICATION USEFUL?

Given training data (y_i, x_i) ∈ {−1, +1} × R^n, i = 1, ..., l, where y_i is the label and x_i is the feature vector, some classification methods construct the following decision function:

    d(x) \equiv w^T \phi(x) + b    (1)

Kernel methods evaluate the decision value through kernel functions between training and testing instances,

    d(x) \equiv \sum_{i=1}^{l} \alpha_i K(x_i, x) + b    (3)

regardless of the dimensionality of φ(x). For example,

    K(x_i, x_j) \equiv (x_i^T x_j + 1)^2    (4)

is the degree-2 polynomial kernel with

    \phi(x) = \left[1,\; \sqrt{2}x_1, \ldots, \sqrt{2}x_n,\; x_1^2, \ldots, x_n^2,\; \sqrt{2}x_1 x_2,\; \sqrt{2}x_1 x_3, \ldots, \sqrt{2}x_{n-1}x_n\right] \in R^{(n+2)(n+1)/2}.    (5)

This kernel trick makes methods such as SVM or kernel LR practical and popular; however, for large data, the training and testing processes are still time consuming. For a kernel like (4), the cost of predicting a testing instance x via (3) can be up to O(ln). In contrast, without using kernels, w is available in an explicit form, so we can predict an instance by (1). With φ(x) = x, predicting a testing instance via (1) costs only O(n).

² In this experiment, we scaled each feature of cod-RNA to the interval [−1, 1].

Table 1. Comparison of linear and nonlinear classifiers. For linear, we use the software LIBLINEAR [7], while for nonlinear we use LIBSVM [8] (RBF kernel). The last column shows the accuracy difference between linear and nonlinear classifiers. Training and testing time is in seconds. The experimental setting follows exactly from [9, Sec. 4].
For document data, each feature typically corresponds to the occurrence or frequency of a word in a document. Because the number of features is the same as the number of possible words, the dimensionality is huge, and the data set is often sparse. For this type of large sparse data, linear classifiers are very useful because of competitive accuracy and very fast training and testing.

III. OPTIMIZATION PROBLEMS

To obtain the model w, most linear classification methods solve an optimization problem of the following form:

    \min_{w} \; f(w) \equiv r(w) + C \sum_{i=1}^{l} \xi(w; x_i, y_i)    (7)

where r(w) is a regularization term, ξ(w; x_i, y_i) is a loss function, and C > 0 is a user-specified parameter. The following are three common loss functions considered in the literature of linear classification:

    \xi_{L1}(w; x, y) \equiv \max(0,\; 1 - y w^T x)    (8)
    \xi_{L2}(w; x, y) \equiv \max(0,\; 1 - y w^T x)^2    (9)
    \xi_{LR}(w; x, y) \equiv \log\left(1 + e^{-y w^T x}\right).    (10)

Minimizing only the training loss may not imply that the classifier gives the best testing accuracy. The concept of regularization is introduced to prevent overfitting the observations. The following L2 and L1 regularization terms are commonly used:

    r_{L2}(w) \equiv \frac{1}{2}\|w\|_2^2 = \frac{1}{2}\sum_{j=1}^{n} w_j^2    (11)

and

    r_{L1}(w) \equiv \|w\|_1 = \sum_{j=1}^{n} |w_j|.    (12)
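To make the roles of the loss and regularization terms concrete, the short Python sketch below evaluates the three losses (8)–(10) and the L2-regularized objective (7) on a toy dense data set. It is only an illustration of the formulas above, not code from the paper or from LIBLINEAR; the function and variable names are ours.

```python
import numpy as np

def l1_loss(w, X, y):
    """L1 (hinge) loss (8): max(0, 1 - y * w^T x) per instance."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def l2_loss(w, X, y):
    """L2 (squared hinge) loss (9)."""
    return np.maximum(0.0, 1.0 - y * (X @ w)) ** 2

def lr_loss(w, X, y):
    """Logistic loss (10): log(1 + exp(-y * w^T x))."""
    return np.log1p(np.exp(-y * (X @ w)))

def objective(w, X, y, C, loss=l2_loss):
    """Objective (7) with L2 regularization (11)."""
    return 0.5 * np.dot(w, w) + C * loss(w, X, y).sum()

# Toy example: 4 instances, 3 features, labels in {-1, +1}.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
print(objective(np.zeros(3), X, y, C=1.0))  # at w = 0 this equals C * l
```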
Problem (7) with L2 regularization and L1 loss is the standard SVM proposed in [1]. Both (11) and (12) are convex and separable functions. The effect of regularization on a variable is to push it toward zero. Then, the search space of w is more confined and overfitting may be avoided. It is known that an L1-regularized problem can generate a sparse model with few nonzero elements in w. Note that w²/2 becomes more and more flat toward zero, but |w| is uniformly steep. Therefore, an L1-regularized variable is more easily pushed to zero, but a caveat is that (12) is not differentiable. Because nonzero elements in w may correspond to useful features [15], L1 regularization can be applied for feature selection. In addition, less memory is needed to store the w obtained by L1 regularization. Regarding testing accuracy, comparisons such as [13, Suppl. Mater. Sec. D] show that L1 and L2 regularizations generally give comparable performance.

In the statistics literature, a model related to L1 regularization is LASSO [16]:

    \min_{w} \; \sum_{i=1}^{l} \xi(w; x_i, y_i) \quad \text{subject to} \quad \|w\|_1 \le K    (13)

where K > 0 is a parameter. This optimization problem is equivalent to (7) with L1 regularization. That is, for a given C in (7), there exists K such that (13) gives the same solution as (7). The explanation for this relationship can be found in, for example, [17].

A convex combination of L1 and L2 regularizations forms the elastic net [18]:

    r_e(w) \equiv \lambda \|w\|_2^2 + (1 - \lambda)\|w\|_1    (14)

where λ ∈ [0, 1). The elastic net is used to overcome the following limitations of L1 regularization. First, the L1 regularization term is not strictly convex, so the solution may not be unique. Second, for two highly correlated features, the solution obtained by L1 regularization may select only one of these features. Consequently, L1 regularization may discard the group effect of variables with high correlation [18].

Any combination of the above-mentioned two regularizations and three loss functions has been well studied in linear classification. Of them, the L2-regularized L1-/L2-loss SVM can be geometrically interpreted as a maximum margin classifier. L1-/L2-regularized LR can be interpreted in a Bayesian view as maximizing the posterior probability with a Laplacian/Gaussian prior on w.

IV. TRAINING TECHNIQUES

To obtain the model w, in the training phase we need to solve the convex optimization problem (7). Although many convex optimization methods are available, for large linear classification, we must carefully consider some factors in designing a suitable algorithm. In this section, we first discuss these design issues and then show details of some representative algorithms.

A. Issues in Finding Suitable Algorithms

• Data property. Algorithms that are efficient for some data sets may be slow for others. We must take data properties into account in selecting algorithms. For example, we can check if the number of instances is much larger than the number of features, or vice versa. Other useful properties include the number of nonzero feature values, feature distribution, feature correlation, etc.

• Optimization formulation. Algorithm design is strongly related to the problem formulation. For example, most unconstrained optimization techniques can be applied to L2-regularized logistic regression, while specialized algorithms may be needed for the nondifferentiable L1-regularized problems.
  In some situations, by reformulation, we are able to transform a nondifferentiable problem into a differentiable one. For example, by letting w = w⁺ − w⁻ (w⁺, w⁻ ≥ 0), L1-regularized classifiers can be written as

    \min_{w^+, w^-} \; \sum_{j=1}^{n} w_j^+ + \sum_{j=1}^{n} w_j^- + C \sum_{i=1}^{l} \xi(w^+ - w^-; x_i, y_i)
    \text{subject to} \quad w_j^+, w_j^- \ge 0, \; j = 1, \ldots, n.    (15)
  However, there is no guarantee that solving a differentiable form is faster. Recent comparisons [13] show that for L1-regularized classifiers, methods directly minimizing the nondifferentiable form are often more efficient than those solving (15).

• Solving primal or dual problems. Problem (7) has n variables. In some applications, the number of instances l is much smaller than the number of features n. By Lagrangian duality, a dual problem of (7) has l variables. If l ≪ n, solving the dual form may be easier due to the smaller number of variables. Further, in some situations, the dual problem possesses nice properties not in the primal form. For example, the dual problem of the standard SVM (L2-regularized L1-loss SVM) is the following quadratic program:³

    \min_{A} \; f^D(A) \equiv \frac{1}{2} A^T Q A - e^T A
    \text{subject to} \quad 0 \le \alpha_i \le C, \; \forall i = 1, \ldots, l    (16)

where Q_ij ≡ y_i y_j x_i^T x_j. Although the primal objective function is nondifferentiable because of the L1 loss, the dual objective function in (16) is smooth (i.e., derivatives of all orders are available). Hence, solving the dual problem may be easier than the primal because we can apply differentiable optimization techniques. Note that the primal optimal w and the dual optimal A satisfy the relationship (2),⁴ so solving the primal and dual problems leads to the same decision function.
  Dual problems come with another nice property that each variable α_i corresponds to a training instance (y_i, x_i). In contrast, for primal problems, each variable w_i corresponds to a feature. Optimization methods which update some variables at a time often need to access the corresponding instances (if solving the dual) or the corresponding features (if solving the primal). In practical applications, instance-wise data storage is more common than feature-wise storage. Therefore, a dual-based algorithm can directly work on the input data without any transformation.
  Unfortunately, the dual form may not always be easier to solve. For example, the dual form of L1-regularized problems involves general linear constraints rather than the bound constraints in (16), so solving the primal may be easier.

• Using low-order or high-order information. Low-order methods, such as gradient or subgradient methods, have been widely considered in large-scale training. They are characterized by low-cost updates, low memory requirements, and slow convergence. In classification tasks, slow convergence may not be a serious concern because a loose solution of (7) may already give testing performances similar to those of an accurate solution.
  High-order methods such as Newton methods often require the smoothness of the optimization problems. Further, the cost per step is more expensive; sometimes a linear system must be solved. However, their convergence rate is superior. These high-order methods are useful for applications needing an accurate solution of problem (7). Some (e.g., [20]) have tried a hybrid setting by using low-order methods in the beginning and switching to higher order methods in the end.

• Cost of different types of operations. In a real-world computer, not all types of operations cost equally. For example, exponential and logarithmic operations are much more expensive than multiplication and division. For training large-scale LR, because exp/log operations are required, the cost of this type of operations may accumulate faster than that of other types. An optimization method which can avoid intensive exp/log evaluations is potentially efficient; see more discussion in, for example, [12], [21], and [22].

• Parallelization. Most existing training algorithms are inherently sequential, but a parallel algorithm can make good use of the computational power in a multicore machine or a distributed system. However, the communication cost between different cores or nodes may become a new bottleneck. See more discussion in Section VII.

³ Because the bias term b is not considered, different from the dual problem considered in the SVM literature, an equality constraint Σ_i y_i α_i = 0 is absent from (16).
⁴ However, we do not necessarily need the dual problem to get (2). For example, the reduced SVM [19] directly assumes that w is the linear combination of a subset of data.

Earlier developments of optimization methods for linear classification tend to focus on data with few features. By taking advantage of this property, they are able to easily train millions of instances [23]. However, these algorithms may not be suitable for sparse data with both large numbers of instances and features, for which we showed in Section II that linear classifiers often give accuracy competitive with nonlinear classifiers. Many recent studies have proposed algorithms for such data. We list some of them (and their software name if any) according to the regularization and loss functions used.
• L2-regularized L1-loss SVM: Available approaches include, for example, cutting plane methods for the primal form (SVMperf [4], OCAS [24], and BMRM [25]), a stochastic (sub)gradient descent method for the primal form (Pegasos [5] and SGD [26]), and a coordinate descent method for the dual form (LIBLINEAR [6]).

• L2-regularized L2-loss SVM: Existing methods for the primal form include a coordinate descent method [21], a Newton method [27], and a trust region Newton method (LIBLINEAR [28]). For the dual problem, a coordinate descent method is in the software LIBLINEAR [6].

• L2-regularized LR: Most unconstrained optimization methods can be applied to solve the primal problem. An early comparison on small-scale data is [29]. Existing studies for large sparse data include iterative scaling methods [12], [30], [31], a truncated Newton method [32], and a trust region Newton method (LIBLINEAR [28]). Few works solve the dual problem. One example is a coordinate descent method (LIBLINEAR [33]).

• L1-regularized L1-loss SVM: It seems no studies have applied L1-regularized L1-loss SVM on large sparse data, although some early works for data with either few features or few instances are available [34]–[36].

• L1-regularized L2-loss SVM: Some proposed methods include a coordinate descent method (LIBLINEAR [13]) and a Newton-type method [22].

• L1-regularized LR: Most methods solve the primal form, for example, an interior-point method (l1_logreg [37]), (block) coordinate descent methods (BBR [38] and CGD [39]), a quasi-Newton method (OWL-QN [40]), Newton-type methods (GLMNET [41] and LIBLINEAR [22]), and a Nesterov's method (SLEP [42]). Recently, an augmented Lagrangian method (DAL [43]) was proposed for solving the dual problem. Comparisons of methods for L1-regularized LR include [13] and [44].

In the rest of this section, we show details of some optimization algorithms. We select them not only because they are popular but also because many design issues discussed earlier can be covered.

B. Pegasos

At each iteration, Pegasos computes a subgradient ∇_S f(w; B) of the objective on a selected set B by (17), where B⁺ ≡ {i | i ∈ B, 1 − y_i w^T x_i > 0} denotes the instances in B with nonzero loss, and updates w by

    w \leftarrow w - \eta \, \nabla_S f(w; B)    (18)

where η = (Cl)/k is the learning rate and k is the iteration index. Different from earlier subgradient descent methods, after the update by (18), Pegasos further projects w onto the ball set {w | ‖w‖₂ ≤ √(Cl)}.⁵ That is,

    w \leftarrow \min\!\left(1, \frac{\sqrt{Cl}}{\|w\|_2}\right) w.    (19)

We show the overall procedure of Pegasos in Algorithm 1.

Algorithm 1: Pegasos for L2-regularized L1-loss SVM (deterministic setting for batch learning) [5]
1) Given w such that ‖w‖₂ ≤ √(Cl).
2) For k = 1, 2, 3, ...
   a) Let B = {(y_i, x_i)}_{i=1}^{l}.
   b) Compute the learning rate η = (Cl)/k.
   c) Compute ∇_S f(w; B) by (17).
   d) w ← w − η ∇_S f(w; B).
   e) Project w by (19) to ensure ‖w‖₂ ≤ √(Cl).

For convergence, it is proved that in O(1/ε) iterations, Pegasos achieves an average ε-accurate solution.
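Because the exact definition of ∇_S f(w; B) in (17) is not reproduced above, the following Python sketch follows the standard Pegasos formulation [5] with λ = 1/(Cl), which is consistent with the learning rate η = (Cl)/k and the projection radius √(Cl) in (18)–(19). It is our illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def pegasos_batch(X, y, C, max_iter=100):
    """Deterministic (batch) Pegasos for L2-regularized L1-loss SVM,
    following Algorithm 1 with lambda = 1 / (C * l)."""
    l, n = X.shape
    lam = 1.0 / (C * l)
    radius = np.sqrt(C * l)          # projection ball radius in (19)
    w = np.zeros(n)                  # ||w|| <= radius holds initially
    for k in range(1, max_iter + 1):
        eta = 1.0 / (lam * k)        # = C * l / k, as in step (b)
        margin = y * (X @ w)
        B_plus = margin < 1.0        # instances with nonzero hinge loss
        # Subgradient of (lam/2)||w||^2 + (1/l) * sum of hinge losses.
        grad = lam * w - (y[B_plus] @ X[B_plus]) / l
        w -= eta * grad              # step (d)
        norm = np.linalg.norm(w)
        if norm > radius:            # projection (19), step (e)
            w *= radius / norm
    return w
```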
C. TRON: A Trust Region Newton Method

At each iteration, given an iterate w, a trust region interval Δ, and a quadratic model

    q(d) \equiv \nabla f(w)^T d + \frac{1}{2} d^T \nabla^2 f(w) d    (20)

as an approximation of f(w + d) − f(w), TRON finds a truncated Newton step confined in the trust region by approximately solving the following subproblem:

    \min_{d} \; q(d) \quad \text{subject to} \quad \|d\|_2 \le \Delta.    (21)

If the loss function is not twice differentiable (e.g., L2 loss), we can use the generalized Hessian [14] as ∇²f(w) in (20).

Some difficulties of applying Newton methods to linear classification include that ∇²f(w) may be a huge n by n matrix and solving (21) is expensive. Fortunately, ∇²f(w) of linear classification problems takes the following special form:

    \nabla^2 f(w) = I + C X^T D_w X    (22)

where I is an identity matrix, X ≡ [x_1, ..., x_l]^T, and D_w is a diagonal matrix. In [28], a conjugate gradient method is applied to solve (21), where the main operation is the product between ∇²f(w) and a vector v. By

    \nabla^2 f(w) v = v + C \cdot X^T \big(D_w (X v)\big)    (23)

the product can be computed through a sequence of matrix–vector operations involving only X, so ∇²f(w) need not be explicitly formed or stored.
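The Hessian–vector product (23) is what makes the conjugate gradient steps inside TRON affordable. Below is a minimal sketch of (23); the diagonal D_w depends on the loss, and here we assume the L2-regularized logistic regression case, where D_ii is the usual σ(1 − σ) term. This is our illustration, not the LIBLINEAR code.

```python
import numpy as np

def hessian_vector_product(v, w, X, y, C):
    """Compute (I + C * X^T D_w X) v as in (23) without forming the Hessian.
    D_w here is the diagonal for L2-regularized logistic regression:
    D_ii = sigma(y_i w^T x_i) * (1 - sigma(y_i w^T x_i))."""
    z = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))
    D = sigma * (1.0 - sigma)        # length-l vector of diagonal entries
    Xv = X @ v                       # also works when X is scipy.sparse
    return v + C * (X.T @ (D * Xv))
```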
D. Dual Coordinate Descent (Dual-CD)

A coordinate descent method for the dual problem (16) updates one variable α_i at a time by

    \alpha_i \leftarrow \min\!\left(\max\!\left(\alpha_i - \frac{\nabla_i f^D(A)}{Q_{ii}},\; 0\right),\; C\right).    (24)

From (24), Q_ii and ∇_i f^D(A) are what we need. The diagonal entries of Q, Q_ii ∀i, are computed only once and cached, while ∇_i f^D(A) is obtained through a vector u ≡ Σ_{j=1}^{l} y_j α_j x_j maintained throughout the procedure: from (27), ∇_i f^D(A) = y_i u^T x_i − 1. The overall procedure is in Algorithm 3.

Algorithm 3: A coordinate descent method for L2-regularized L1-loss SVM [6]
1) Given A and the corresponding u = Σ_{i=1}^{l} y_i α_i x_i.
2) Compute Q_ii, ∀i = 1, ..., l.
3) For k = 1, 2, 3, ...
   • For i = 1, ..., l
     a) Compute G = y_i u^T x_i − 1 as in (27).
     b) ᾱ_i ← α_i.
     c) α_i ← min(max(α_i − G/Q_ii, 0), C).
     d) u ← u + y_i (α_i − ᾱ_i) x_i.

The vector u defined in (26) is in the same form as w in (2). In fact, as A approaches a dual optimal solution, u will converge to the primal optimal w following the primal–dual relationship.
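A compact rendering of Algorithm 3 follows. It is our own sketch: the random permutation and shrinking heuristics of the LIBLINEAR implementation [6] are omitted, and X is assumed to be a dense array for simplicity.

```python
import numpy as np

def dual_cd_l1_svm(X, y, C, max_iter=10):
    """Dual coordinate descent for L2-regularized L1-loss SVM (Algorithm 3)."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                        # u = sum_i y_i * alpha_i * x_i
    Qii = np.einsum('ij,ij->i', X, X)      # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(max_iter):
        for i in range(l):
            if Qii[i] == 0.0:
                continue
            G = y[i] * (u @ X[i]) - 1.0    # gradient, step (a)
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)  # update (24)
            u += y[i] * (alpha[i] - alpha_old) * X[i]           # step (d)
    return u                               # u converges to the primal w
```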
E. newGLMNET

At each iteration, newGLMNET considers a second-order approximation of the smooth loss term L(w) of the L1-regularized problem and solves the following problem:

    \min_{d} \; q(d) \equiv \|w + d\|_1 - \|w\|_1 + \nabla L(w)^T d + \frac{1}{2} d^T H d    (29)

where H ≡ ∇²L(w) + νI and ν is a small number to ensure that H is positive definite. Although (29) is similar to (21), its optimization is more difficult because of the 1-norm term. Thus, newGLMNET further breaks (29) into subproblems by a coordinate descent procedure. In a setting similar to the method in Section IV-D, each time a one-variable function

    q(d + z e_j) - q(d) = |w_j + d_j + z| - |w_j + d_j| + G_j z + \frac{1}{2} H_{jj} z^2    (30)

is minimized, where G ≡ ∇L(w) + Hd. This one-variable function (30) has a simple closed-form minimizer (see [48], [49], and [13, App. B]):

    z = \begin{cases} -\frac{G_j + 1}{H_{jj}}, & \text{if } G_j + 1 \le H_{jj}(w_j + d_j) \\ -\frac{G_j - 1}{H_{jj}}, & \text{if } G_j - 1 \ge H_{jj}(w_j + d_j) \\ -(w_j + d_j), & \text{otherwise.} \end{cases}

At each iteration of newGLMNET, the coordinate descent method does not solve problem (29) exactly. Instead, newGLMNET designs an adaptive stopping condition so that initially problem (29) is solved loosely and in the final iterations (29) is solved more accurately. After an approximate solution d of (29) is obtained, we need a line search procedure to ensure sufficient function decrease. It finds λ ∈ (0, 1] such that

    f(w + \lambda d) - f(w) \le \sigma \lambda \left(\|w + d\|_1 - \|w\|_1 + \nabla L(w)^T d\right)    (31)

where σ ∈ (0, 1). The overall procedure of newGLMNET is in Algorithm 4.

Algorithm 4: newGLMNET for L1-regularized minimization [22]
1) Given w. Given 0 < β, σ < 1.
2) For k = 1, 2, 3, ...
   a) Find an approximate solution d of (29) by a coordinate descent method.
   b) Find λ = max{1, β, β², ...} such that (31) holds.
   c) w ← w + λd.

Due to the adaptive setting, in the beginning newGLMNET behaves like a coordinate descent method, which is able to quickly obtain an approximate w; however, in the final stage, the iterate w converges quickly because a Newton step is taken. Recall that in Section IV-A we mentioned that exp/log operations are more expensive than basic operations such as multiplication/division. Because (30) does not involve any exp/log operation, the time spent on exp/log operations is only a small portion of the whole procedure. In addition, newGLMNET is an example of accessing data feature-wise; see details in [22] about how G_j in (30) is updated.
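The closed-form minimizer of the one-variable function (30) is a shifted soft-thresholding step. A direct transcription in Python (our sketch) is:

```python
def cd_step(G_j, H_jj, w_j_plus_d_j):
    """Closed-form minimizer z of the one-variable function (30)."""
    a = H_jj * w_j_plus_d_j
    if G_j + 1.0 <= a:
        return -(G_j + 1.0) / H_jj
    if G_j - 1.0 >= a:
        return -(G_j - 1.0) / H_jj
    return -w_j_plus_d_j
```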
F. A Comparison of the Four Examples

The four methods discussed in Sections IV-B–E differ in various aspects. By considering the design issues mentioned in Section IV-A, we compare these methods in Table 2. We point out that three methods are primal based, but one is dual based. Next, both Pegasos and Dual-CD use only low-order information (subgradient and gradient), but TRON and newGLMNET employ high-order information through Newton directions. Also, we check how data instances are accessed. Clearly, Pegasos and Dual-CD access data instance-wise, but we have mentioned in Section IV-E that newGLMNET must employ a feature-wise setting. Interestingly, TRON can use both because in (23), matrix–vector products can be conducted by accessing data instance-wise or feature-wise.

We analyze the complexity of the four methods by showing the cost at the kth iteration:
• Pegasos: O(|B⁺| n);
• TRON: #CG iterations × O(ln);
• Dual-CD: O(ln);
• newGLMNET: #CD iterations × O(ln).
The cost of Pegasos and TRON easily follows from (17) and (23), respectively. For Dual-CD, both (27) and (28) cost O(n), so one iteration of going through all variables is O(nl). For newGLMNET, see details in [22]. We can clearly see that each iteration of Pegasos and Dual-CD is cheaper because of using low-order information. However, they need more iterations than high-order methods in order to accurately solve the optimization problem.

V. MULTICLASS LINEAR CLASSIFICATION

Most classification methods are originally proposed to solve a two-class problem; however, extensions of these methods to multiclass classification have been studied. For nonlinear SVM, some works (e.g., [50] and [51]) have
comprehensively compared different multiclass solutions. In contrast, few studies have focused on multiclass linear classification. This section introduces and compares some commonly used methods.

A. Solving Several Binary Problems

Multiclass classification can be decomposed into several binary classification problems. One-against-rest and one-against-one methods are two of the most common decomposition approaches. Studies that broadly discussed various approaches of decomposition include, for example, [52] and [53].

• One-against-rest method. If there are k classes in the training data, the one-against-rest method [54] constructs k binary classification models. To obtain the mth model, instances from the mth class of the training set are treated as positive, and all other instances are negative. Then, the weight vector w_m for the mth model can be generated by any linear classifier. After obtaining all k models, we say an instance x is in the mth class if the decision value (1) of the mth model is the largest, i.e., the predicted class is

    \arg\max_{m = 1, \ldots, k} \; w_m^T x.    (32)

• DAGSVM. This method [58] uses the same pairwise binary models as the one-against-one method but attempts to reduce the testing cost. Starting with a candidate set of all classes, this method sequentially selects a pair of classes for prediction and removes one of the two. That is, if a binary classifier of class i and j predicts i, then j is removed from the candidate set. Alternatively, a prediction of class j will cause i to be removed. Finally, the only remaining class is the predicted result. For any pair (i, j) considered, the true class may be neither i nor j. However, it does not matter which one is removed because all we need is that if the true class is involved in a binary prediction, it is the winner. Because classes are sequentially removed, only k − 1 models are used. The testing time complexity of DAGSVM is thus O(nk).
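As a concrete illustration of the one-against-rest scheme, the sketch below builds k binary problems and predicts with the largest decision value as in (32). It assumes scikit-learn's LinearSVC, which wraps the LIBLINEAR software [7] discussed above; any binary linear classifier could be substituted, and the helper names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, C=1.0):
    """Train one binary linear SVM per class: class m vs. the rest."""
    classes = np.unique(y)
    W = []
    for m in classes:
        binary_y = np.where(y == m, 1, -1)   # class m positive, others negative
        clf = LinearSVC(C=C, fit_intercept=False)  # no bias term, as in the text
        clf.fit(X, binary_y)
        W.append(clf.coef_.ravel())
    return classes, np.vstack(W)             # one weight vector w_m per row

def predict_one_vs_rest(classes, W, X):
    """Predicted class = arg max_m w_m^T x, as in (32)."""
    return classes[np.argmax(X @ W.T, axis=1)]
```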
B. Considering All Data at Once

In contrast to using many binary models, some have proposed solving a single optimization problem for multiclass classification [59]–[61]. Here we discuss details of Crammer and Singer's approach [60]. Assume class labels are 1, ..., k. They consider an optimization problem, (33), that involves all k weight vectors w_1, ..., w_k at once.

In the nonlinear case, the longer training time than one-against-rest and one-against-one methods has made the approach of solving one single optimization problem less practical [50]. A careful implementation of the approach in [63] is given in [7, App. E].

C. Maximum Entropy

Maximum entropy (ME) [64] is a generalization of logistic regression for multiclass problems⁶ and a special case of conditional random fields [65] (see Section VIII-A). It is widely applied in NLP applications. We still assume class labels 1, ..., k for an easy comparison to (33) in our subsequent discussion. ME models the following conditional probability function of label y given data x:

    P(y|x) \equiv \frac{\exp(w_y^T x)}{\sum_{m=1}^{k} \exp(w_m^T x)}    (35)

where w_m, ∀m, are weight vectors like those in (32) and (33). This model is also called multinomial logistic regression.

ME minimizes the following regularized negative log-likelihood:

    \min_{w_1, \ldots, w_k} \; \frac{1}{2} \sum_{m=1}^{k} \|w_m\|^2 + C \sum_{i=1}^{l} \xi_{ME}\big(\{w_m\}_{m=1}^{k}; x_i, y_i\big)    (36)

where

    \xi_{ME}\big(\{w_m\}_{m=1}^{k}; x, y\big) \equiv -\log P(y|x).

Clearly, (36) is similar to (33) and ξ_ME(·) can be considered as a loss function. If w_{y_i}^T x_i ≫ w_m^T x_i, ∀m ≠ y_i, then ξ_ME({w_m}_{m=1}^{k}; x_i, y_i) is close to zero (i.e., no loss). On the other hand, if w_{y_i}^T x_i is smaller than the other w_m^T x_i, m ≠ y_i, then P(y_i|x_i) ≪ 1 and the loss is large. For prediction, the decision function is also (32).

NLP applications often consider a more general ME model by using a function f(x, y) to generate the feature vector:

    P(y|x) \equiv \frac{\exp\big(w^T f(x, y)\big)}{\sum_{y'} \exp\big(w^T f(x, y')\big)}.    (37)

Equation (35) is a special case of (37) by

    f(x_i, y) \equiv \big[\,\underbrace{0 \cdots 0}_{(y-1)n} \;\; x_i^T \;\; 0 \cdots 0\,\big]^T \in R^{nk} \quad \text{and} \quad w \equiv \big[\,w_1^T, \ldots, w_k^T\,\big]^T.    (38)

⁶ Details of the connection between logistic regression and maximum entropy can be found in, for example, [12, Sec. 5.2].

Many studies have investigated optimization methods for L2-regularized ME. For example, Malouf [66] compares iterative scaling methods [67], gradient descent, nonlinear conjugate gradient, and the L-BFGS (quasi-Newton) method [68] to solve (36). Experiments show that quasi-Newton performs better. In [12], a framework is proposed to explain variants of iterative scaling methods [30], [67], [69] and make a connection to coordinate descent methods. For L1-regularized ME, Andrew and Gao [40] propose an extension of L-BFGS.

Recently, instead of solving the primal problem (36), some works solve the dual problem. A detailed derivation of the dual ME is in [33, App. A.7]. Memisevic [70] proposed a two-level decomposition method. Similar to the coordinate descent method [63] for (33) in Section V-B, in [70], a subproblem of k variables is considered at a time. However, the subproblem does not have a closed-form solution, so a second-level coordinate descent method is applied. Collins et al. [71] proposed an exponential gradient method to solve the ME dual. They also decompose the problem into k-variable subproblems, but only approximately solve each subproblem. The work in [33] follows [70] to apply a two-level coordinate descent method, but uses a different method in the second level to decide the variables for update.
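A numerically stable way to evaluate the ME probability (35) and the loss ξ_ME used in (36) is sketched below; it is our illustration, with W stacking the k weight vectors as rows.

```python
import numpy as np

def me_probability(W, x):
    """P(y|x) of (35) for all classes; W has shape (k, n)."""
    scores = W @ x                 # w_m^T x for m = 1, ..., k
    scores = scores - scores.max() # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def me_loss(W, x, y_index):
    """xi_ME = -log P(y|x), the loss term in (36)."""
    return -np.log(me_probability(W, x)[y_index])
```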
D. Comparison

Table 3. Comparison of methods for multiclass linear classification in storage (model size) and testing time. n is the number of features and k is the number of classes.

We summarize the storage (model size) and testing time of each method in Table 3. Clearly, the one-against-one and DAGSVM methods are less practical because of the much higher storage, although the comparison in [57] indicates that the one-against-one method gives slightly better testing accuracy. Note that the situation is very different for the kernel case [50], where one-against-one and DAGSVM are very useful methods.

VI. LINEAR-CLASSIFICATION TECHNIQUES FOR NONLINEAR CLASSIFICATION

Many recent developments of linear classification can be extended to handle nonstandard scenarios. Interestingly, most of them are related to training nonlinear classifiers.

Table 4. Results of training/testing degree-2 polynomial mappings by the coordinate descent method in Section IV-D. The degree-2 polynomial mapping is dynamically computed during training, instead of expanded beforehand. The last column shows the accuracy difference between degree-2 polynomial mappings and RBF SVM.

Some works (e.g., [19]) consider approximations other than (39), but these also lead to linear classification problems. A recent study [78] addresses more on training and testing linear SVM after obtaining the low-rank approximation. In particular, details of the testing procedures can be found in [78, Sec. 2.4]. Note that linear SVM problems obtained after kernel approximations are often dense and have more instances than features. Thus, training algorithms suitable for such problems may be different from those for sparse document data.

• Feature mapping approximation. This type of approach finds a mapping function φ̄ : R^n → R^d such that

    \bar{\phi}(x)^T \bar{\phi}(t) \approx K(x, t).

Then, linear classifiers can be applied to the new data φ̄(x_1), ..., φ̄(x_l). The testing phase is straightforward because the mapping φ̄(·) is available. Many mappings have been proposed. Examples include random Fourier projection [83], random projections [84], [85], polynomial approximation [86], and hashing [87]–[90]. They differ in various aspects, which are beyond the scope of this paper. An issue related to the subsequent linear classification is that some methods (e.g., [83]) generate dense φ̄(x) vectors, while others give sparse vectors (e.g., [85]). A recent study focusing on the linear classification after obtaining φ̄(x_i), ∀i, is in [91].
in [91]. where rS is a subgradient operator and is the learning
rate. Specifically, (40) becomes the following update rule:
sequential selection of variables with a random selection. pensive disk input/output (I/O), they design algorithms by
Notice that the update rule (28) is similar to (41), but has reading a continuous chunk of data at a time and mini-
the advantage of not needing to decide the learning rate . mizing the number of disk accesses. The method in [92]
This online setting falls into the general framework of extends the coordinate descent method in Section IV-D for
randomized coordinate descent methods in [101] and linear SVM. The major change is to update more variables
[102]. Using the proof in [101], the linear convergence in at a time so that a block of data is used together.
expectation is obtained in [6, App. 7.5]. Specifically, in the beginning, the training set is randomly
To improve the convergence of SGD, some [103], [104] partitioned to m files B1 ; . . . ; Bm . The available memory
have proposed using higher order information. The rule in space needs to be able to accommodate one block of data
(40) is replaced by and the working space of a training algorithm. To solve
(16), sequentially one block of data B is read and the
following function of d is minimized under the condition
w w HrS ð Þ (42) 0 i þ di C; 8i 2 B and di ¼ 0; 8i 62 B
1
f D ðA þ dÞ f D ðAÞ ¼ d TB QBB d B þ d TB ðQA eÞB
where H is an approximation of the inverse Hessian 2
1 X
r2 f ðwÞ1 . To save the cost at each update, practically H is ¼ d TB QBB d B þ yi di ðuT xi Þ d TB eB
a diagonal scaling matrix. Experiments [103] and [104] 2 i2B
show that using (42) is faster than (40). (43)
The update rule in (40) assumes L2 regularization.
While SGD is applicable for other regularization, it may where QBB is a submatrix of Q and u is defined in (26). By
not perform as well because of not taking special pro- maintaining u in a way similar to (28), equation (43) in-
perties of the regularization term into consideration. For volves only data in the block B, which can be stored in
example, if L1 regularization is used, a standard SGD may memory. Equation (43) can be minimized by any tradi-
face difficulties to generate a sparse w. To address this tional algorithm. Experiments in [92] demonstrate that
problem, recently several approaches have been proposed they can train data 20 times larger than the memory capa-
[105]–[110]. The stochastic coordinate descent method in city. This method is extended in [115] to cache informative
[106] has been extended to a parallel version [111]. data points in the computer memory. That is, at each
Unfortunately, most existing studies of online algo- iteration, not only the selected block but also the cached
rithms conduct experiments by assuming enough memory points are used for updating corresponding variables. Their
and reporting the number of times to access data. To apply way to select informative points is inspired by the shrink-
them in a real scenario without sufficient memory, many ing techniques used in training nonlinear SVM [8], [47].
practical issues must be checked. Vowpal-Wabbit [112] is For distributed batch learning, all existing parallel
one of the very few implementations which can handle optimization methods [116] can possibly be applied. How-
data larger than memory. Because the same data may be ever, we have not seen many practical deployments for
accessed several times and the disk reading time is expen- training large-scale data. Recently, Boyd et al. [117] have
sive, at the first pass, Vowpal-Wabbit stores data to a considered the alternating direction method of multiplier
compressed cache file. This is similar to the compression (ADMM) [118] for distributed learning. Take SVM as an
strategy in [92], which will be discussed in Section VII-B. example and assume data points are partitioned to m dis-
Currently, Vowpal-Wabbit supports unregularized linear tributively stored sets B1 ; . . . ; Bm . This method solves the
classification and regression. It is extended to solve L1- following approximation of the original optimization
regularized problems in [105]. problem:
Recently, Vowpal-Wabbit (after version 6.0) has sup-
ported distributed online learning using the Hadoop [95]
framework. We are aware that other Internet companies 1 T Xm X
have constructed online linear classifiers on distributed min z zþC L1 ðwj ; xi ; yi Þ
w1 ;...;wm ;z 2 j¼1 i2Bj
environments, although details have not been fully avail-
able. One example is the system SETI at Google [113]. X
m
þ kwj zk2
2 j¼1
B. Batch Methods subject to wj z ¼ 0; 8j
In some situations, we still would like to consider the
whole training set and solve a corresponding optimization
problem. While this task is very challenging, some (e.g., where is a prespecific parameter. It then employs an
[92] and [114]) have checked the situation that data are optimization method of multipliers by alternatively
larger than memory but smaller than disk. Because of ex- minimizing the Lagrangian function over w1 ; . . . ; wm ,
minimizing the Lagrangian over z, and updating the dual multipliers. The minimization of the Lagrangian over w_1, ..., w_m can be decomposed into m independent problems. The other steps do not involve data at all. Therefore, data points are locally accessed and the communication cost is kept to a minimum. Examples of using ADMM for distributed training include [119]. Some known problems of this approach are, first, that the convergence rate is not very fast and, second, that it is unclear how to choose the parameter ρ.

Some works solve an optimization problem using parallel SGD. The data are stored in a distributed system, and each node only computes the subgradient corresponding to the data instances in the node. In [120], a delayed SGD is proposed. Instead of computing the subgradient of the current iterate w^k, in delayed SGD, each node computes the subgradient of a previous iterate w^{τ(k)}, where τ(k) ≤ k. Delayed SGD is useful to reduce the synchronization delay caused by communication overheads or uneven computational time at various nodes. Recent works [121], [122] show that delayed SGD is efficient when the number of nodes is large, and the delay is asymptotically negligible.

C. Other Approaches

We briefly discuss some other approaches which cannot be clearly categorized as batch or online methods. The most straightforward method to handle large data is probably to randomly select a subset that can fit in memory. This approach works well if the data quality is good; however, sometimes using more data gives higher accuracy. To improve the performance of using only a subset, some have proposed techniques to include important data points into the subset. For example, the approach in [123] selects a subset by reading data from disk only once. For data in a distributed environment, subsampling can be a complicated operation. Moreover, a subset fitting the memory of one single computer may be too small to give good accuracy.

Bagging [124] is a popular classification method that splits a learning task into several easier ones. It selects several random subsets, trains each of them, and ensembles (e.g., by averaging) the results during testing. This method may be particularly useful for distributively stored data because we can directly consider the data in each node as a subset. However, if the data quality in each node is not good (e.g., all instances have the same class label), the model generated by each node may be poor. Thus, ensuring the data quality of each subset is a concern. Some studies have applied the bagging approach on distributed systems [125], [126]. An advantage of the bagging-like approach is the easy implementation using distributed computing techniques such as MapReduce [128].⁸

VIII. RELATED TOPICS

In this section, we discuss some other linear models. They are related to the linear classification models discussed in earlier sections.

A. Structured Learning

In the discussion so far, we assumed that the label y_i is a single value. For binary classification, it is +1 or −1, while for multiclass classification, it is one of the k class labels. However, in some applications, the label may be a more sophisticated object. For example, in part-of-speech (POS) tagging applications, the training instances are sentences and the labels are sequences of POS tags of words. If there are l sentences, we can write the training instances as (y_i, x_i) ∈ Y^{n_i} × X^{n_i}, ∀i = 1, ..., l, where x_i is the ith sentence, y_i is a sequence of tags, X is a set of unique words in the context, Y is a set of candidate tags for each word, and n_i is the number of words in the ith sentence. Note that we may not be able to split the problem into several independent ones by treating each value y_{ij} of y_i as the label, because y_{ij} depends not only on the sentence x_i but also on the other tags (y_{i1}, ..., y_{i(j−1)}, y_{i(j+1)}, ..., y_{in_i}). To handle these problems, we could use structured learning models like conditional random fields [65] and structured SVM [129], [130].

• Conditional random fields (CRFs). The CRF [65] is a linear structured model commonly used in NLP. Using the notation mentioned above and a feature function f(x, y) like that of ME, CRF solves the following problem:

    \min_{w} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{l} \xi_{CRF}(w; x_i, y_i)    (44)

where

    \xi_{CRF}(w; x_i, y_i) \equiv -\log P(y_i|x_i) \quad \text{and} \quad P(y|x) \equiv \frac{\exp\big(w^T f(x, y)\big)}{\sum_{y'} \exp\big(w^T f(x, y')\big)}.    (45)
The optimization of (44) is challenging because in the probability model (45), the number of possible y's is exponentially large. An important property making CRF practical is that the gradient of the objective function in (44) can be efficiently evaluated by dynamic programming [65]. Some available optimization methods include L-BFGS (quasi-Newton) and conjugate gradient [131], SGD [132], stochastic quasi-Newton [103], [133], and a trust region Newton method [134]. It is shown in [134] that the Hessian-vector product (23) of the Newton method can also be evaluated by dynamic programming.

• Structured SVM. Structured SVM solves the following optimization problem, a generalized form of the multiclass SVM in [59] and [60]:

    \min_{w} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{l} \xi_{SS}(w; x_i, y_i)    (46)

where

    \xi_{SS}(w; x_i, y_i) \equiv \max_{y \ne y_i} \; \max\Big(0, \; \Delta(y_i, y) - w^T\big(f(x_i, y_i) - f(x_i, y)\big)\Big)

and Δ(·, ·) is a distance function with Δ(y_i, y_i) = 0 and Δ(y_i, y_j) = Δ(y_j, y_i). Similar to the relation between conditional random fields and maximum entropy, if

    \Delta(y_i, y_j) = \begin{cases} 0, & \text{if } y_i = y_j \\ 1, & \text{otherwise} \end{cases}

and y_i ∈ {1, ..., k}, ∀i, then structured SVM becomes Crammer and Singer's problem in (33) following the definition of f(x, y) and w in (38).
  Like CRF, the main difficulty in solving (46) is handling an exponential number of y values. Some works (e.g., [25], [129], and [135]) use a cutting plane method [136] to solve (46). In [137], a stochastic subgradient descent method is applied for both online and batch settings.

B. Regression

Given training data {(z_i, x_i)}_{i=1}^{l} ⊂ R × R^n, a regression problem finds a weight vector w such that w^T x_i ≈ z_i, ∀i. Like classification, a regression task solves a risk minimization problem involving regularization and loss terms. While L1 and L2 regularization is still used, the loss functions are different, where two popular ones are

    \xi_{LS}(w; x, z) \equiv \frac{1}{2}\big(z - w^T x\big)^2    (47)
    \xi_{\epsilon}(w; x, z) \equiv \max\big(0, \; |z - w^T x| - \epsilon\big).    (48)

The least square loss in (47) is widely used in many places, while the ε-insensitive loss in (48) is extended from the L1 loss in (8), where there is a user-specified parameter ε as the error tolerance. Problem (7) with L2 regularization and ε-insensitive loss is called support vector regression (SVR) [138]. Contrary to the success of linear classification, so far not many applications of linear regression on large sparse data have been reported. We believe that this topic has not been fully explored yet.

Regarding the minimization of (7), if L2 regularization is used, many optimization methods mentioned in Section IV can be easily modified for linear regression. We then particularly discuss L1-regularized least square regression, which has recently drawn much attention for signal processing and image applications. This research area is so active that many optimization methods (e.g., [49] and [139]–[143]) have been proposed. However, as pointed out in [13], the optimization methods most suitable for signal/image applications via L1-regularized regression may be very different from those in Section IV for classifying large sparse data. One reason is that data from signal/image problems tend to be dense. Another is that x_i, ∀i, may not be directly available in some signal/image problems. Instead, we can only evaluate the product between the data matrix and a vector through certain operators. Thus, optimization methods that can take this property into their design may be more efficient.

IX. CONCLUSION

In this paper, we have comprehensively reviewed recent advances of large linear classification. For some applications, linear classifiers can give comparable accuracy to nonlinear classifiers, but enjoy much faster training and testing speed. However, these results do not imply that nonlinear classifiers should no longer be considered. Both linear and nonlinear classifiers are useful under different circumstances.

Without mapping data to another space, for linear classification we can easily prepare, select, and manipulate features. We have clearly shown that linear classification is not limited to standard scenarios like document classification. It can be applied in many other places such as efficiently approximating nonlinear classifiers. We are confident that future research works will make linear classification a useful technique for more large-scale applications.
[51] R. Rifkin and A. Klautau, BIn defense of Proc. 6th Conf. Natural Lang. Learn., 2002, Cambridge, MA: MIT Press, 2008,
one-vs-all classification,[ J. Mach. Learn. Res., DOI: 10.3115/1118853.1118871. pp. 1177–1184.
vol. 5, pp. 101–141, 2004. [67] J. N. Darroch and D. Ratcliff, BGeneralized [84] D. Achlioptas, BDatabase-friendly random
[52] E. L. Allwein, R. E. Schapire, and Y. Singer, iterative scaling for log-linear models,[ Ann. projections: Johnson-Lindenstrauss with
BReducing multiclass to binary: A unifying Math. Stat., vol. 43, no. 5, pp. 1470–1480, binary coins,[ J. Comput. Syst. Sci., vol. 66,
approach for margin classifiers,[ J. Mach. 1972. pp. 671–687, 2003.
Learn. Res., vol. 1, pp. 113–141, 2001. [68] D. C. Liu and J. Nocedal, BOn the limited [85] P. Li, T. J. Hastie, and K. W. Church,
[53] T.-K. Huang, R. C. Weng, and C.-J. Lin. memory BFGS method for large scale BVery sparse random projections,[ in Proc.
(2006). Generalized Bradley-Terry models optimization,[ Math. Programm., vol. 45, 12th ACM SIGKDD Int. Conf. Knowl. Disc.
and multi-class probability estimates. J. no. 1, pp. 503–528, 1989. Data Mining, 2006, pp. 287–296.
Mach. Learn. Res. [Online]. 7, pp. 85–115. [69] S. Della Pietra, V. Della Pietra, and [86] K.-P. Lin and M.-S. Chen, BEfficient kernel
Available: http://www.csie.ntu.edu.tw/ J. Lafferty, BInducing features of random approximation for large-scale support vector
~cjlin/papers/generalBT.pdf fields,[ IEEE Trans. Pattern Anal. Mach. machine classification,[ in Proc. 11th SIAM
[54] L. Bottou, C. Cortes, J. Denker, H. Drucker, Intell., vol. 19, no. 4, pp. 380–393, Int. Conf. Data Mining, 2011, pp. 211–222.
I. Guyon, L. Jackel, Y. LeCun, U. Muller, Apr. 1997. [87] Q. Shi, J. Petterson, G. Dror, J. Langford,
E. Sackinger, P. Simard, and V. Vapnik, [70] R. Memisevic, BDual optimization of A. Smola, A. Strehl, and S. Vishwanathan,
BComparison of classifier methods: A case conditional probability models,[ Dept. BHash kernels,[ in Proc. 12th Int. Conf.
study in handwriting digit recognition,[ in Comput. Sci., Univ. Toronto, Toronto, Artif. Intell. Stat., 2009, vol. 5, pp. 496–503.
Proc. Int. Conf. Pattern Recognit., 1994, ON, Canada, Tech. Rep., 2006. [88] K. Weinberger, A. Dasgupta, J. Langford,
pp. 77–87.
[71] M. Collins, A. Globerson, T. Koo, A. Smola, and J. Attenberg, BFeature
[55] S. Knerr, L. Personnaz, and G. Dreyfus, X. Carreras, and P. Bartlett, BExponentiated hashing for large scale multitask learning,[
BSingle-layer learning revisited: A stepwise gradient algorithms for conditional random in Proc. 26th Int. Conf. Mach. Learn., 2009,
procedure for building and training a fields and max-margin Markov networks,[ pp. 1113–1120.
neural network,[ in Neurocomputing: J. Mach. Learn. Res., vol. 9, pp. 1775–1822, [89] P. Li and A. C. König, Bb-bit minwise
Algorithms, Architectures and Applications, 2008. hashing,[ in Proc. 19th Int. Conf. World
J. Fogelman, Ed. New York:
[72] E. M. Gertz and J. D. Griffin, BSupport vector Wide Web, 2010, pp. 671–680.
Springer-Verlag, 1990.
machine classifiers for large data sets,[ [90] P. Li and A. C. König, BTheory and
[56] J. H. Friedman, BAnother approach to Argonne Nat. Lab., Argonne, IL, Tech. Rep. applications of b-bit minwise hashing,[
polychotomous classification,[ Dept. Stat., ANL/MCS-TM-289, 2005. Commun. ACM, vol. 54, no. 8, pp. 101–109,
Stanford Univ., Stanford, CA, Tech. Rep.
[73] J. H. Jung, D. P. O’Leary, and A. L. Tits, 2011.
[Online]. Available: http://www-stat.
BAdaptive constraint reduction for training [91] P. Li, A. Shrivastava, J. Moore, and
stanford.edu/~jhf/ftp/poly.pdf
support vector machines,[ Electron. Trans. A. C. König, BHashing algorithms for
[57] T.-L. Huang, BComparison of L2-regularized Numer. Anal., vol. 31, pp. 156–177, 2008. large-scale learning,[ Cornell Univ.,
multi-class linear classifiers,[ M.S. thesis,
[74] Y. Moh and J. M. Buhmann, BKernel Ithaca, NY, Tech. Rep. [Online]. Available:
Dept. Comput. Sci. Inf. Eng., Nat. Taiwan
expansion for online preference tracking,[ http://www.stat.cornell.edu/~li/reports/
Univ., Taipei, Taiwan, 2010.
in Proc. Int. Soc. Music Inf. Retrieval, 2008, HashLearning.pdf
[58] J. C. Platt, N. Cristianini, and pp. 167–172. [92] H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and
J. Shawe-Taylor, BLarge margin DAGs
[75] S. Sonnenburg and V. Franc, BCOFFIN: C.-J. Lin, BLarge linear classification
for multiclass classification,[ in Advances
A computational framework for linear when data cannot fit in memory,[ in Proc.
in Neural Information Processing Systems,
SVMs,[ in Proc. 27th Int. Conf. Mach. 16th ACM SIGKDD Int. Conf. Knowl. Disc.
vol. 12. Cambridge, MA: MIT Press,
Learn., 2010, pp. 999–1006. Data Mining, 2010, pp. 833–842. [Online].
2000, pp. 547–553.
[76] G. Ifrim, G. BakNr, and G. Weikum, BFast Available: http://www.csie.ntu.edu.tw/
[59] J. Weston and C. Watkins, BMulti-class ~cjlin/papers/kdd_disk_decomposition.pdf.
logistic regression for text categorization
support vector machines,[ in Proc. Eur.
with variable-length n-grams,[ in Proc. [93] E. Chang, K. Zhu, H. Wang, H. Bai, J. Li,
Symp. Artif. Neural Netw., M. Verleysen, Ed.,
14th ACM SIGKDD Int. Conf. Knowl. Disc. Z. Qiu, and H. Cui, BParallelizing support
Brussels, 1999, pp. 219–224.
Data Mining, 2008, pp. 354–362. vector machines on distributed computers,[
[60] K. Crammer and Y. Singer, BOn the in Advances in Neural Information Processing
[77] G. Ifrim and C. Wiuf, BBounded
algorithmic implementation of multiclass Systems 20, J. Platt, D. Koller, Y. Singer, and
coordinate-descent for biological sequence
kernel-based vector machines,[ J. S. Roweis, Eds. Cambridge, MA: MIT
classification in high dimensional predictor
Mach. Learn. Res., vol. 2, pp. 265–292, Press, 2008, pp. 257–264.
space,[ in Proc. 17th ACM SIGKDD Int.
2001.
Conf. Knowl. Disc. Data Mining, 2011, [94] Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and
[61] Y. Lee, Y. Lin, and G. Wahba, BMulticategory DOI: 10.1145/2020408.2020519. Z. Chen, BP-packSVM: Parallel primal
support vector machines,[ J. Amer. Stat. gradient descent kernel SVM,[ in Proc. IEEE
[78] S. Lee and S. J. Wright, BASSET:
Assoc., vol. 99, no. 465, pp. 67–81, 2004. Int. Conf. Data Mining, 2009, pp. 677–686.
Approximate stochastic subgradient
[62] C.-J. Lin. (2002, Sep.). A formal analysis of estimation training for support vector [95] T. White, Hadoop: The Definitive Guide,
stopping criteria of decomposition methods machines,[ IEEE Trans. Pattern Anal. 2nd ed. New York: O’Reilly Media, 2010.
for support vector machines. IEEE Trans. Mach. Intell., 2012. [96] H. Robbins and S. Monro, BA stochastic
Neural Netw. [Online]. 13(5), pp. 1045–1052.
[79] C. K. I. Williams and M. Seeger, BUsing approximation method,[ Ann. Math. Stat.,
Available: http://www.csie.ntu.edu.tw/
the Nyström method to speed up kernel vol. 22, no. 3, pp. 400–407, 1951.
~cjlin/papers/stop.ps.gz
machines,[ in Advances in Neural Information [97] J. Kiefer and J. Wolfowitz, BStochastic
[63] S. S. Keerthi, S. Sundararajan, K.-W. Chang, Processing Systems 13, T. Leen, T. Dietterich, estimation of the maximum of a regression
C.-J. Hsieh, and C.-J. Lin, BA sequential dual and V. Tresp, Eds. Cambridge, MA: MIT function,[ Ann. Math. Stat., vol. 23, no. 3,
method for large scale multi-class linear Press, 2001, pp. 682–688. pp. 462–466, 1952.
SVMs,[ in Proc. 14th ACM SIGKDD Int.
[80] P. Drineas and M. W. Mahoney, BOn the [98] T. Zhang, BSolving large scale linear
Conf. Knowl. Disc. Data Mining, 2008,
Nyström method for approximating a gram prediction problems using stochastic
pp. 408–416. [Online]. Available: http://
matrix for improved kernel-based learning,[ gradient descent algorithms,[ in Proc.
www.csie.ntu.edu.tw/~cjlin/papers/
J. Mach. Learn. Res., vol. 6, pp. 2153–2175, 21st Int. Conf. Mach. Learn., 2004,
sdm_kdd.pdf.
2005. DOI: 10.1145/1015330.1015332.
[64] A. L. Berger, V. J. Della Pietra, and
[81] S. Fine and K. Scheinberg, BEfficient [99] L. Bottou and Y. LeCun, BLarge scale online
S. A. Della Pietra, BA maximum entropy
SVM training using low-rank kernel learning,[ Advances in Neural Information
approach to natural language processing,[
representations,[ J. Mach. Learn. Res., Processing Systems 16. Cambridge, MA:
Comput. Linguist., vol. 22, no. 1, pp. 39–71,
vol. 2, pp. 243–264, 2001. MIT Press, 2004, pp. 217–224.
1996.
[82] F. R. Bach and M. I. Jordan, BPredictive [100] A. Bordes, S. Ertekin, J. Weston, and
[65] J. Lafferty, A. McCallum, and F. Pereira,
low-rank decomposition for kernel L. Bottou, BFast kernel classifiers with online
BConditional random fields: Probabilistic
methods,[ in Proc. 22nd Int. Conf. Mach. and active learning,[ J. Mach. Learn. Res.,
models for segmenting and labeling
Learn., 2005, pp. 33–40. vol. 6, pp. 1579–1619, 2005.
sequence data,[ in Proc. 18th Int. Conf.
Mach. Learn., 2001, pp. 282–289. [83] A. Rahimi and B. Recht, BRandom features [101] Y. E. Nesterov, BEfficiency of coordinate
for large-scale kernel machines Advances descent methods on huge-scale optimization
[66] R. Malouf, BA comparison of algorithms for
in Neural Information Processing Systems. problems,[ Université Catholique de
maximum entropy parameter estimation,[ in
Louvain, Louvain-la-Neuve, Louvain, Disc. Data Mining, 2011, DOI: 10.1145/ for structured and interdependent output
Belgium, CORE Discussion Paper, Tech. 2020408.2020517. variables,[ J. Mach. Learn. Res., vol. 6,
Rep. [Online]. Available: http://www.ucl.be/ [116] Y. Censor and S. A. Zenios, Parallel pp. 1453–1484, 2005.
cps/ucl/doc/core/documents/coredp2010_ Optimization: Theory, Algorithms, and [130] B. Taskar, C. Guestrin, and D. Koller,
2web.pdf Applications. Oxford, U.K.: Oxford Univ. BMax-margin markov networks,[ in Advances
[102] P. Richtárik and M. Takáč, BIteration Press, 1998. in Neural Information Processing Systems 16.
complexity of randomized block-coordinate [117] S. Boyd, N. Parikh, E. Chu, B. Peleato, and Cambridge, MA: MIT Press, 2004.
descent methods for minimizing a composite J. Eckstein, BDistributed optimization and [131] F. Sha and F. C. N. Pereira, BShallow parsing
function,[ Schl. Math., Univ. Edinburgh, statistical learning via the alternating with conditional random fields,[ in Proc.
Edinburgh, U.K., Tech. Rep., 2011. direction method of multipliers,[ Found. HLT-NAACL, 2003, pp. 134–141.
[103] A. Bordes, L. Bottou, and P. Gallinari, Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, [132] S. Vishwanathan, N. N. Schraudolph,
BSGD-QN: Careful quasi-Newton stochastic 2011. M. W. Schmidt, and K. Murphy,
gradient descent,[ J. Mach. Learn. Res., [118] D. Gabay and B. Mercier, BA dual algorithm BAccelerated training of conditional
vol. 10, pp. 1737–1754, 2009. for the solution of nonlinear variational random fields with stochastic gradient
[104] A. Bordes, L. Bottou, P. Gallinari, J. Chang, problems via finite element approximation,[ methods,[ in Proc. 23rd Int. Conf. Mach.
and S. A. Smith, BErratum: SGD-QN is Comput. Math. Appl., vol. 2, pp. 17–40, 1976. Learn., 2006, pp. 969–976.
less careful than expected,[ J. Mach. Learn. [119] P. A. Forero, A. Cano, and G. B. Giannakis, [133] N. N. Schraudolph, J. Yu, and S. Gunter,
Res., vol. 11, pp. 2229–2240, 2010. BConsensus-based distributed support BA stochastic quasi-Newton method for
[105] J. Langford, L. Li, and T. Zhang, BSparse vector machines,[ J. Mach. Learn., vol. 11, online convex optimization,[ in Proc.
online learning via truncated gradient,[ pp. 1663–1707, 2010. 11th Int. Conf. Artif. Intell. Stat., 2007,
J. Mach. Learn. Res., vol. 10, pp. 771–801, [120] A. Nedić, D. P. Bertsekas, and V. S. Borkar, pp. 433–440.
2009. BDistributed asynchronous incremental [134] P.-J. Chen, BNewton methods for conditional
[106] S. Shalev-Shwartz and A. Tewari, subgradient methods,[ Studies Comput. random fields,[ M.S. thesis, Dept. Comput.
BStochastic methods for L1 -regularized Math., vol. 8, pp. 381–407, 2001. Sci. Inf. Eng., National Taiwan University,
loss minimization,[ J. Mach. Learn. Res., [121] J. Langford, A. Smola, and M. Zinkevich, Taipei, Taiwan, 2009.
vol. 12, pp. 1865–1892, 2011. BSlow learners are fast,[ in Advances in [135] T. Joachims, T. Finley, and C.-N. J. Yu,
[107] Y. E. Nesterov, BPrimal-dual subgradient Neural Information Processing Systems 22, BCutting-plane training of structural SVMs,[
methods for convex problems,[ Math. Y. Bengio, D. Schuurmans, J. Lafferty, J. Mach. Learn., vol. 77, no. 1, 2008,
Programm., vol. 120, no. 1, pp. 221–259, C. K. I. Williams, and A. Culotta, Eds. DOI: 10.1007/s10994-009-5108-8.
2009. Cambridge, MA: MIT Press, 2009, [136] J. E. Kelley, BThe cutting-plane method for
[108] J. Duchi and Y. Singer, BEfficient online pp. 2331–2339. solving convex programs,[ J. Soc. Ind. Appl.
and batch learning using forward backward [122] A. Agarwal and J. Duchi, BDistributed Math., vol. 8, no. 4, pp. 703–712, 1960.
splitting,[ J. Mach. Learn. Res., vol. 10, delayed stochastic optimization,[ in [137] N. D. Ratliff, J. A. Bagnell, and
pp. 2899–2934, 2009. Advances in Neural Information Processing M. A. Zinkevich, B(Online) subgradient
[109] J. Duchi, E. Hazan, and Y. Singer, BAdaptive Systems 24. Cambridge, MA: MIT Press, methods for structured prediction,[ in
subgradient methods for online learning 2011. Proc. 11th Int. Conf. Artif. Intell. Stat., 2007,
and stochastic optimization,[ J. Mach. Learn. [123] H. Yu, J. Yang, and J. Han, BClassifying large pp. 380–387.
Res., vol. 12, pp. 2121–2159, 2011. data sets using SVMs with hierarchical [138] V. Vapnik, Statistical Learning Theory.
[110] L. Xiao, BDual averaging methods for clusters,[ in Proc. 9th ACM SIGKDD Int. Conf. New York: Wiley, 1998.
regularized stochastic learning and online Knowl. Disc. Data Mining, 2003, pp. 306–315.
[139] I. Daubechies, M. Defrise, and C. De Mol,
optimization,[ J. Mach. Learn. Res., vol. 11, [124] L. Breiman, BBagging predictors,[ Mach. BAn iterative thresholding algorithm for
pp. 2543–2596, 2010. Learn., vol. 24, no. 2, pp. 123–140, linear inverse problems with a sparsity
[111] J. K. Bradley, A. Kyrola, D. Bickson, and Aug. 1996. constraint,[ Commun. Pure Appl. Math.,
C. Guestrin, BParallel coordinate descent [125] D. Chakrabarti, D. Agarwal, and vol. 57, pp. 1413–1457, 2004.
for L1 -regularized loss minimization,[ in V. Josifovski, BContextual advertising by [140] M. A. T. Figueiredo, R. Nowak, and
Proc. 28th Int. Conf. Mach. Learn., 2011, combining relevance with click feedback,[ in S. Wright, BGradient projection for sparse
pp. 321–328. Proc. 17th Int. Conf. World Wide Web, 2008, reconstruction: Applications to compressed
[112] J. Langford, L. Li, and A. Strehl, Vowpal pp. 417–426. sensing and other inverse problems,[ IEEE J.
Wabbit, 2007. [Online]. Available: [126] M. Zinkevich, M. Weimer, A. Smola, and Sel. Top. Signal Process., vol. 1, no. 4,
https://github.com/JohnLangford/vowpal_ L. Li, BParallelized stochastic gradient pp. 586–598, Dec. 2007.
wabbit/wiki. descent,[ in Advances in Neural Information [141] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and
[113] S. Tong, Lessons Learned Developing a Processing Systems 23, J. Lafferty, D. Gorinevsky, BAn interior point method
Practical Large Scale Machine Learning C. K. I. Williams, J. Shawe-Taylor, R. Zemel, for large-scale L1 -regularized least squares,[
System, Google Research Blog, 2010. and A. Culotta, Eds. Cambridge, MA: IEEE J. Sel. Top. Signal Process., vol. 1, no. 4,
[Online]. Available: http://googleresearch. MIT Press, 2010, pp. 2595–2603. pp. 606–617, Dec. 2007.
blogspot.com/2010/04/lessons-learned- [127] R. McDonald, K. Hall, and G. Mann, [142] J. Duchi, S. Shalev-Shwartz, Y. Singer, and
developing-practical.html. BDistributed training strategies for the T. Chandra, BEfficient projections onto
[114] M. Ferris and T. Munson, BInterior structured perceptron,[ in Proc. 48th the L1 -ball for learning in high dimensions,[
point methods for massive support vector Annu. Meeting Assoc. Comput. Linguist., in Proc. 25th Int. Conf. Mach. Learn., 2008,
machines,[ SIAM J. Optim., vol. 13, no. 3, 2010, pp. 456–464. DOI: 10.1145/1390156.1390191.
pp. 783–804, 2003. [128] J. Dean and S. Ghemawat, BMapReduce: [143] A. Beck and M. Teboulle, BA fast iterative
[115] K.-W. Chang and D. Roth, BSelective Simplified data processing on large clusters,[ shrinkage-thresholding algorithm for linear
block minimization for faster convergence of Commun. ACM, vol. 51, no. 1, pp. 107–113, inverse problems,[ SIAM J. Imag. Sci., vol. 2,
limited memory large-scale linear models,[ 2008. no. 1, pp. 183–202, 2009.
in Proc. 17th ACM SIGKDD Int. Conf. Knowl. [129] I. Tsochantaridis, T. Joachims, T. Hofmann,
and Y. Altun, BLarge margin methods