0% found this document useful (0 votes)

55 views20 pages

1、Recent Advances of Large-Scale Linear Classification

This paper surveys recent advances in optimization methods for large-scale linear classification. Linear classifiers can achieve accuracy close to nonlinear classifiers on some data while being much faster to train and test due to avoiding the computationally expensive process of mapping data to higher dimensions. The paper reviews linear classification methods like support vector machines and logistic regression, discusses algorithms for efficient training, and examines applications of linear classifiers to multi-class problems and large-scale data that exceeds memory capacity.

Uploaded by

da da

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

55 views20 pages

1、Recent Advances of Large-Scale Linear Classification

Uploaded by

da da

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

INVITED

PAPER

Recent Advances of Large-Scale

Linear Classification
This paper is a survey on development of optimization methods to construct linear
classifiers suitable for large-scale applications; for some data, accuracy is
close to that of nonlinear classifiers.
By Guo-Xun Yuan, Chia-Hua Ho, and Chih-Jen Lin, Fellow IEEE

ABSTRACT | Linear classification is a useful tool in machine have shown to give competitive performances on docu-
learning and data mining. For some data in a rich dimensional ment data with nonlinear classifiers. An important ad-
space, the performance (i.e., testing accuracy) of linear classi- vantage of linear classification is that training and testing
fiers has shown to be close to that of nonlinear classifiers such procedures are much more efficient. Therefore, linear
as kernel methods, but training and testing speed is much classification can be very useful for some large-scale appli-
faster. Recently, many research works have developed efficient cations. Recently, the research on linear classification has
optimization methods to construct linear classifiers and ap- been a very active topic. In this paper, we give a compre-
plied them to some large-scale applications. In this paper, we hensive survey on the recent advances.
give a comprehensive survey on the recent development of this We begin with explaining in Section II why linear
active research area. classification is useful. The differences between linear and
nonlinear classifiers are described. Through experiments,
KEYWORDS | Large linear classification; logistic regression; we demonstrate that for some data, a linear classifier
multiclass classification; support vector machines (SVMs) achieves comparable accuracy to a nonlinear one, but both
training and testing times are much shorter. Linear clas-
sifiers cover popular methods such as support vector
I. INTRODUCTION machines (SVMs) [1], [2], logistic regression (LR),1 and
Linear classification is a useful tool in machine learning others. In Section III, we show optimization problems of
and data mining. In contrast to nonlinear classifiers such these methods and discuss their differences.
as kernel methods, which map data to a higher dimen- An important goal of the recent research on linear
sional space, linear classifiers directly work on data in the classification is to develop fast optimization algorithms for
original input space. While linear classifiers fail to handle training (e.g., [4]–[6]). In Section IV, we discuss issues in
some inseparable data, they may be sufficient for data in a finding a suitable algorithm and give details of some
rich dimensional space. For example, linear classifiers representative algorithms. Methods such as SVM and LR
were originally proposed for two-class problems. Although
past works have studied their extensions to multiclass
problems, the focus was on nonlinear classification. In
Section V, we systematically compare methods for multi-
class linear classification.
Linear classification can be further applied to
Manuscript received June 16, 2011; revised November 24, 2011; accepted February 3,
2012. Date of publication March 30, 2012; date of current version August 16, 2012.
many other scenarios. We investigate some examples in
This work was supported in part by the National Science Council of Taiwan under Section VI. In particular, we show that linear classifiers
Grant 98-2221-E-002-136-MY3.
G.-X. Yuan was with the Department of Computer Science, National Taiwan University,
can be effectively employed to either directly or indirectly
Taipei 10617, Taiwan. He is currently with the University of California Davis, Davis, approximate nonlinear classifiers. In Section VII, we
CA 95616 USA (e-mail: [email protected]).
C.-H. Ho and C.-J. Lin are with the Department of Computer Science, National
Taiwan University, Taipei 10617, Taiwan (e-mail: [email protected]; 1
It is difficult to trace the origin of logistic regression, which can be
[email protected]).
dated back to the 19th century. Interested readers may check the
Digital Object Identifier: 10.1109/JPROC.2012.2188013 investigation in [3].

2584 Proceedings of the IEEE | Vol. 100, No. 9, September 2012 0018-9219/$31.00 2012 IEEE
Yuan et al.: Recent Advances of Large-Scale Linear Classification

discuss an ongoing research topic for data larger than regardless of the dimensionality of ðxÞ. For example
memory or disk capacity. Existing algorithms often fail to
handle such data because of assuming that data can be
stored in a single computer’s memory. We present some 2
Kðxi ; xj Þ xTi xj þ 1 (4)
methods which try to reduce data reading or communi-
cation time. In Section VIII, we briefly discuss related
topics such as structured learning and large-scale linear
is the degree-2 polynomial kernel with
regression.
Finally, Section IX concludes this survey paper.
h pffiffiffi pffiffiffi pffiffiffi
ðxÞ ¼ 1; 2x1 ; . . . ; 2xn ; . . . x21 ; . . . x2n ; 2x1 x2 ;
II . WHY I S L INEAR pffiffiffi pffiffiffi i
CLASSIFICATION USEFUL? 2x1 x3 ; . . . ; 2xn1 xn 2 Rðnþ2Þðnþ1Þ=2 : (5)
Given training data ðyi ; xi Þ 2 f1; þ1g Rn , i ¼ 1; . . . ; l,
where yi is the label and xi is the feature vector, some
classification methods construct the following decision This kernel trick makes methods such as SVM or kernel LR
function: practical and popular; however, for large data, the training
and testing processes are still time consuming. For a kernel
like (4), the cost of predicting a testing instance x via (3)
can be up to OðlnÞ. In contrast, without using kernels, w is
dðxÞ wT ðxÞ þ b (1)
available in an explicit form, so we can predict an instance
by (1). With ðxÞ ¼ x

where w is the weight vector and b is an intercept, or

called the bias. A nonlinear classifier maps each instance x
wT ðxÞ ¼ wT x
to a higher dimensional vector ðxÞ if data are not lin-
early separable. If ðxÞ ¼ x (i.e., data points are not
mapped), we say (1) is a linear classifier. Because nonlin-
costs only OðnÞ. It is also known that training a linear
ear classifiers use more features, generally they perform
classifier is more efficient. Therefore, while a linear clas-
better than linear classifiers in terms of prediction
sifier may give inferior accuracy, it often enjoys faster
accuracy.
training and testing.
For nonlinear classification, evaluating wT ðxÞ can be
We conduct an experiment to compare linear SVM and
expensive because ðxÞ may be very high dimensional.
nonlinear SVM [with the radial basis function (RBF) ker-
Kernel methods (e.g., [2]) were introduced to handle
nel]. Table 1 shows the accuracy and training/testing time.
such a difficulty. If w is a linear combination of training
Generally, nonlinear SVM has better accuracy, especially
data, i.e.,
for problems cod-RNA,2 ijcnn1, covtype, webspam,
and MNIST38. This result is consistent with the theoret-
ical proof that SVM with RBF kernel and suitable
X
l
parameters gives at least as good accuracy as linear kernel
w i ðxi Þ; for some A 2 Rl (2)
i¼1
[10]. However, for problems with large numbers of fea-
tures, i.e., real-sim, rcv1, astro-physic, yahoo-japan,
and news20, the accuracy values of linear and nonlinear
SVMs are similar. Regarding training and testing time,
and the following kernel function can be easily calculated:
Table 1 clearly indicates that linear classifiers are at least an
order of magnitude faster.
In Table 1, problems for which linear classifiers yield
Kðxi ; xj Þ ðxi ÞT ðxj Þ comparable accuracy to nonlinear classifiers are all docu-
ment sets. In the area of document classification and
natural language processing (NLP), a bag-of-word model
then the decision function can be calculated by is commonly used to generate feature vectors [11]. Each
feature, corresponding to a word, indicates the existence

X
l
dðxÞ i Kðxi ; xÞ þ b (3) 2
In this experiment, we scaled cod-RNA feature wisely to ½1; 1
i¼1 interval.

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2585

Yuan et al.: Recent Advances of Large-Scale Linear Classification

Table 1 Comparison of Linear and Nonlinear Classifiers. For Linear, We Use the Software LIBLINEAR [7], While for Nonlinear We Use LIBSVM [8]
(RBF Kernel). The Last Column Shows the Accuracy Difference Between Linear and Nonlinear Classifiers. Training and Testing Time Is in Seconds.
The Experimental Setting Follows Exactly From [9, Sec. 4]

of the word in a document. Because the number of fea- loss functions considered in the literature of linear
tures is the same as the number of possible words, the classification
dimensionality is huge, and the data set is often sparse. For
this type of large sparse data, linear classifiers are very
L1 ðw; x; yÞ maxð0; 1 ywT xÞ (8)
useful because of competitive accuracy and very fast train- 2
T
ing and testing. L2 ðw; x; yÞ max ð0; 1 yw xÞ (9)
T

LR ðw; x; yÞ log 1 þ eyw x : (10)

III . BINARY LINEAR

CL ASSIFICATION METHODS Equations (8) and (9) are referred to as L1 and L2 losses,
respectively. Problem (7) using (8) and (9) as the loss
To generate a decision function (1), linear classification
function is often called L1-loss and L2-loss SVM, while
involves the following risk minimization problem:
problem (7) using (10) is referred to as logistic regression
(LR). Both SVM and LR are popular classification meth-
X
l ods. The three loss functions in (8)–(10) are all convex and
min f ðw; bÞ rðwÞ þ C ðw; b; xi ; yi Þ (6) nonnegative. L1 loss is not differentiable at the point
w;b
i¼1 ywT x ¼ 1, while L2 loss is differentiable, but not twice
differentiable [14]. For logistic loss, it is twice differentia-
ble. Fig. 1 shows that these three losses are increasing
where rðwÞ is the regularization term and ðw; b; x; yÞ is
functions of ywT x. They slightly differ in the amount of
the loss function associated with the observation ðy; xÞ.
penalty imposed.
Parameter C > 0 is user specified for balancing rðwÞ and
the sum of losses.
Following the discussion in Section II, linear classifi-
B. L1 and L2 Regularization
A classifier is used to predict the label y for a hidden
cation is often applied to data with many features, so the
(testing) instance x. Overfitting training data to minimize
bias term b may not be needed in practice. Experiments in
[12] and [13] on document data sets showed similar
performances with/without the bias term. In the rest of
this paper, we omit the bias term b, so (6) is simplified to

X
l
min f ðwÞ rðwÞ þ C ðw; xi ; yi Þ (7)
w
i¼1

and the decision function becomes dðxÞ wT x.

A. Support Vector Machines and Logistic Regression

In (7), the loss function is used to penalize a wrongly
classified observation ðx; yÞ. There are three common Fig. 1. Three loss functions: L1 , L2 , and LR . The x-axis is ywT x.

2586 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

the training loss may not imply that the classifier gives the A convex combination of L1 and L2 regularizations
best testing accuracy. The concept of regularization is forms the elastic net [18]
introduced to prevent from overfitting observations. The
following L2 and L1 regularization terms are commonly
used:
re ðwÞ kwk22 þ ð1 Þkwk1 (14)

1 1X n
rL2 ðwÞ kwk22 ¼ w2 (11) where 2 ½0; 1Þ. The elastic net is used to break the
2 2 j¼1 j
following limitations of L1 regularization. First, L1 regu-
larization term is not strictly convex, so the solution may
and not be unique. Second, for two highly correlated features,
the solution obtained by L1 regularization may select only
one of these features. Consequently, L1 regularization
X
n
may discard the group effect of variables with high
rL1 ðwÞ kwk1 ¼ jwj j: (12)
j¼1
correlation [18].

Problem (7) with L2 regularization and L1 loss is the IV. TRAINING TECHNI QUES
standard SVM proposed in [1]. Both (11) and (12) are To obtain the model w, in the training phase we need to
convex and separable functions. The effect of regulariza- solve the convex optimization problem (7). Although many
tion on a variable is to push it toward zero. Then, the convex optimization methods are available, for large linear
search space of w is more confined and overfitting may be classification, we must carefully consider some factors in
avoided. It is known that an L1-regularized problem can designing a suitable algorithm. In this section, we first
generate a sparse model with few nonzero elements in w. discuss these design issues and follow by showing details of
Note that w2 =2 becomes more and more flat toward zero, some representative algorithms.
but jwj is uniformly steep. Therefore, an L1-regularized
variable is easier to be pushed to zero, but a caveat is that
(12) is not differentiable. Because nonzero elements in w A. Issues in Finding Suitable Algorithms
may correspond to useful features [15], L1 regularization
can be applied for feature selection. In addition, less • Data property. Algorithms that are efficient for
memory is needed to store w obtained by L1 regularization. some data sets may be slow for others. We must
Regarding testing accuracy, comparisons such as [13, take data properties into account in selecting algo-
Suppl. Mater. Sec. D] show that L1 and L2 regularizations rithms. For example, we can check if the number
generally give comparable performance. of instances is much larger than features, or
In statistics literature, a model related to L1 regular- vice versa. Other useful properties include the
ization is LASSO [16] number of nonzero feature values, feature distri-
bution, feature correlation, etc.
• Optimization formulation. Algorithm design is
X
l strongly related to the problem formulation. For
min ðw; xi ; yi Þ example, most unconstrained optimization tech-
w
i¼1
niques can be applied to L2-regularized logistic
subject to kwk1 K (13) regression, while specialized algorithms may be
needed for the nondifferentiable L1-regularized
problems.
where K > 0 is a parameter. This optimization problem is In some situations, by reformulation, we are
equivalent to (7) with L1 regularization. That is, for a given able to transform a nondifferentiable problem to
C in (7), there exists K such that (13) gives the same be differentiable. For example, by letting w ¼
solution as (7). The explanation for this relationship can be wþ w ðwþ ; w 0Þ, L1-regularized classifiers
found in, for example, [17]. can be written as
Any combination of the above-mentioned two regular-
izations and three loss functions has been well studied in
linear classification. Of them, L2-regularized L1/L2-loss
SVM can be geometrically interpreted as maximum margin X
n X
n X
l
min wþ
j þ w
j þ ðwþ w ; xi ; yi Þ
classifiers. L1/L2-regularized LR can be interpreted in a þ
w ;w
j¼1 j¼1 i¼1
Bayesian view by maximizing the posterior probability
subject to wþ
j ; wj 0; j ¼ 1; . . . ; n: (15)
with Laplacian/Gaussian prior of w.

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2587

Yuan et al.: Recent Advances of Large-Scale Linear Classification

However, there is no guarantee that solving a dif- methods, have been widely considered in large-
ferentiable form is faster. Recent comparisons [13] scale training. They characterize low-cost update,
show that for L1-regularized classifiers, methods low-memory requirement, and slow convergence.
directly minimizing the nondifferentiable form are In classification tasks, slow convergence may not be
often more efficient than those solving (15). a serious concern because a loose solution of (7)
• Solving primal or dual problems. Problem (7) has n may already give similar testing performances to
variables. In some applications, the number of in- that by an accurate solution.
stances l is much smaller than the number of fea- High-order methods such as Newton methods
tures n. By Lagrangian duality, a dual problem of often require the smoothness of the optimization
(7) has l variables. If l n, solving the dual form problems. Further, the cost per step is more
may be easier due to the smaller number of varia- expensive; sometimes a linear system must be
bles. Further, in some situations, the dual problem solved. However, their convergence rate is superi-
possesses nice properties not in the primal form. or. These high-order methods are useful for
For example, the dual problem of the standard applications needing an accurate solution of problem
SVM (L2-regularized L1-loss SVM) is the following (7). Some (e.g., [20]) have tried a hybrid setting by
quadratic program3: using low-order methods in the beginning and
switching to higher order methods in the end.
1 • Cost of different types of operations. In a real-world
min f D ðAÞ AT QA eT A computer, not all types of operations cost equally.
A 2
subject to 0 i C 8i ¼ 1; . . . ; l (16) For example, exponential and logarithmic opera-
tions are much more expensive than multiplication
and division. For training large-scale LR, because
where Qij yi yj xTi xj . Although the primal objective exp = log operations are required, the cost of this
function is nondifferentiable because of the L1 type of operations may accumulate faster than that
loss, in (16), the dual objective function is smooth of other types. An optimization method which can
(i.e., derivatives of all orders are available). Hence, avoid intensive exp = log evaluations is potentially
solving the dual problem may be easier than primal efficient; see more discussion in, for example, [12],
because we can apply differentiable optimization [21], and [22].
techniques. Note that the primal optimal w and the • Parallelization. Most existing training algorithms
dual optimal A satisfy the relationship (2),4 so are inherently sequential, but a parallel algorithm
solving primal and dual problems leads to the same can make good use of the computational power in a
decision function. multicore machine or a distributed system. How-
Dual problems come with another nice property ever, the communication cost between different
that each variable i corresponds to a training cores or nodes may become a new bottleneck. See
instance ðyi ; xi Þ. In contrast, for primal problems, more discussion in Section VII.
each variable wi corresponds to a feature. Optimi- Earlier developments of optimization methods for lin-
zation methods which update some variables at a ear classification tend to focus on data with few features.
time often need to access the corresponding By taking this property, they are able to easily train mil-
instances (if solving dual) or the corresponding lions of instances [23]. However, these algorithms may not
features (if solving primal). In practical applica- be suitable for sparse data with both large numbers of
tions, instance-wise data storage is more common instances and features, for which we show in Section II
than feature-wise storage. Therefore, a dual-based that linear classifiers often give competitive accuracy with
algorithm can directly work on the input data nonlinear classifiers. Many recent studies have proposed
without any transformation. algorithms for such data. We list some of them (and their
Unfortunately, the dual form may not be always software name if any) according to regularization and loss
easier to solve. For example, the dual form of L1- functions used.
regularized problems involves general linear con- • L2-regularized L1-loss SVM: Available approaches
straints rather than bound constraints in (16), so include, for example, cutting plane methods for
solving primal may be easier. the primal form (SVMperf [4], OCAS [24], and
• Using low-order or high-order information. Low- BMRM [25]), a stochastic (sub)gradient descent
order methods, such as gradient or subgradient method for the primal form (Pegasos [5], and
3
Because the bias term b is not considered, therefore, different from SGD [26]), and a coordinate descent method for
the
P dual problem considered in SVM literature, an inequality constraint the dual form (LIBLINEAR [6]).
yi i ¼ 0 is absent from (16). • L2-regularized L2-loss SVM: Existing methods for
4
However, we do not necessarily need the dual problem to get (2).
For example, the reduced SVM [19] directly assumes that w is the linear the primal form include a coordinate descent
combination of a subset of data. method [21], a Newton method [27], and a trust

2588 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

region Newton method (LIBLINEAR [28]). For where Bþ fiji 2 B; 1 yi wT xi > 0g, and updates w by
the dual problem, a coordinate descent method is
in the software LIBLINEAR [6].
• L2-regularized LR: Most unconstrained optimiza- w w rS f ðw; BÞ (18)
tion methods can be applied to solve the primal
problem. An early comparison on small-scale data
is [29]. Existing studies for large sparse data in-
where ¼ ðClÞ=k is the learning rate and k is the iteration
clude iterative scaling methods [12], [30], [31], a
index. Different from earlier subgradient descent methods,
truncated Newton method [32], and a trust region
Newton method (LIBLINEAR [28]). Few works Pegasos
after the update by (18),p ffiffiffiffi further projects w onto
solve the dual problem. One example is a coordi- the ball set fwjkwk2 Clg.5 That is
nate descent method (LIBLINEAR [33]).
• L1-regularized L1-loss SVM: It seems no studies pffiffiffiffi
have applied L1-regularized L1-loss SVM on large Cl
w min 1; w: (19)
sparse data although some early works for data kwk2
with either few features or few instances are
available [34]–[36].
• L1-regularized L2-loss SVM: Some proposed We show the overall procedure of Pegasos in
methods include a coordinate descent method Algorithm 1.
(LIBLINEAR [13]) and a Newton-type method [22].
• L1-regularized LR: Most methods solve the primal
form, for example, an interior-point method
Algorithm 1: Pegasos for L2-regularized L1-loss SVM
(l1 logreg [37]), (block) coordinate descent meth-
(deterministic setting for batch learning) [5]
ods (BBR [38] and CGD [39]), a quasi-Newton pffiffiffiffi
method (OWL-QN [40]), Newton-type methods 1) Given w such that kwk2 Cl.
(GLMNET [41] and LIBLINEAR [22]), and a 2) For k ¼ 1; 2; 3; . . .
Nesterov’s method (SLEP [42]). Recently, an aug- a) Let B ¼ fðyi ; xi Þgli¼1 .
mented Lagrangian method (DAL [43]) was b) Compute the learning rate ¼ ðClÞ=k.
proposed for solving the dual problem. Compar- c) Compute rS f ðw; BÞ by (17).
isons of methods for L1-regularized LR include [13] d) w w rS f ðw; BÞ. pffiffiffiffi
and [44]. e) Project w by (19) to ensure kwk2 Cl.
In the rest of this section, we show details of some
optimization algorithms. We select them not only because For convergence, it is proved that in Oð1=Þ itera-
they are popular but also because many design issues tions, Pegasos achieves an average -accurate solution.
discussed earlier can be covered. That is

B. Example: A Subgradient Method (Pegasos With

!, !
Deterministic Settings) X
T
Shalev-Shwartz et al. [5] proposed a method Pegasos f k
w T f ðw Þ
for solving the primal form of L2-regularized L1-loss SVM. k¼1

It can be used for batch and online learning. Here we

discuss only the deterministic setting and leave the sto-
chastic setting in Section VII-A. where wk is the kth iterate and w is the optimal solution.
Given a training subset B, at each iteration, Pegasos Pegasos has been applied in many studies. One
approximately solves the following problem: implementation issue is that information obtained in the
algorithm cannot be directly used for designing a suitable
1 X stopping condition.
min f ðw; BÞ kwk22 þ C maxð0; 1 yi wT xÞ:
w 2 i2B
C. Example: Trust Region Newton Method ðTRONÞ
Here, for the deterministic setting, B is the whole training Trust region Newton method ðTRONÞ is an effective
set. Because L1 loss is not differentiable, Pegasos takes approach for unconstrained and bound-constrained opti-
the following subgradient direction of f ðw; BÞ: mization. In [28], it applies the setting in [45] to solve (7)
with L2 regularization and differentiable losses.
X
rS f ðw; BÞ w C y i xi (17) 5
pffiffiffiffi solution of f ðwÞ is proven to be in the ball set
The optimal
i2Bþ fwjkwk2 Clg; see [5, Th. 1].

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2589

Yuan et al.: Recent Advances of Large-Scale Linear Classification

At each iteration, given an iterate w, a trust region of linear classification problems takes the following special
interval , and a quadratic model form:

1 r2 f ðwÞ ¼ I þ CX T Dw X
qðdÞ rf ðwÞT d þ d T r2 f ðwÞd (20)
2
where I is an identity matrix, X ½x1 ; . . . ; xl T , and Dw is
as an approximation of f ðw þ dÞ f ðwÞ, TRON finds a a diagonal matrix. In [28], a conjugate gradient method is
truncated Newton step confined in the trust region by applied to solve (21), where the main operation is the
approximately solving the following subproblem: product between r2 f ðwÞ and a vector v. By

min qðdÞ subject to kdk2 : (21) r2 f ðwÞv ¼ v þ C X T ðDw ðXvÞÞ (23)
d

the Hessian matrix r2 f ðwÞ need not be stored.

Then, by checking the ratio Because of using high-order information (Newton
directions), TRON gives fast quadratic local convergence.
It has been extended to solve L1-regularized LR and L2-loss
f ðw þ dÞ f ðwÞ SVM in [13] by reformulating (7) to a bound-constrained
(22) optimization problem in (15).
qðdÞ

D. Example: Solving Dual SVM by Coordinate

of actual function reduction to estimated function re- Descent Methods ðDual-CDÞ
duction, TRON decides if w should be updated and then Hsieh et al. [6] proposed a coordinate descent method
adjusts . A large enough indicates that the quadratic for the dual L2-regularized linear SVM in (16). We call this
model qðdÞ is close to f ðw þ dÞ f ðwÞ, so TRON updates algorithm Dual-CD. Here, we focus on L1-loss SVM,
w to be w þ d and slightly enlarges the trust region in- although the same method has been applied to L2-loss
terval for the next iteration. Otherwise, the current SVM in [6].
iterate w is unchanged and the trust region interval A coordinate descent method sequentially selects one
shrinks by multiplying a factor less than one. The overall variable for update and fixes others. To update the ith
procedure of TRON is presented in Algorithm 2. variable, the following one-variable problem is solved:

min f D ðA þ dei Þ f D ðAÞ

Algorithm 2: TRON for L2-regularized LR and L2-loss d
SVM [28] subject to 0 i þ d C
1) Given w, , and 0 .
2) For k ¼ 1; 2; 3; . . .
where f ðAÞ is defined in (16), ei ¼ ½0; . . . ; 0 ; 1; 0; . . . ; 0T ,
a) Find an approximate solution d of (21) by the |fflfflfflffl{zfflfflfflffl}
and i1
conjugate gradient method.
b) Check the ratio in (22).
c) If > 0 1
f D ðA þ dei Þ f D ðAÞ ¼ Qii d2 þ ri f D ðAÞd:
2
w w þ d:
This simple quadratic function can be easily minimized.
After considering the constraint, a simple update rule for
d) Adjust according to . i is

If the loss function is not twice differentiable (e.g., L2 ri f D ðAÞ
loss), we can use generalized Hessian [14] as r2 f ðwÞ i min max i ;0 ;C : (24)
Qii
in (20).
Some difficulties of applying Newton methods to linear
classification include that r2 f ðwÞ may be a huge n by n From (24), Qii and ri f D ðAÞ are our needs. The
matrix and solving (21) is expensive. Fortunately, r2 f ðwÞ diagonal entries of Q, Qii ; 8i, are computed only once

2590 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

initially, but The linear convergence of Algorithm 3 is established in

[6] using techniques in [46]. The authors propose two
implementation tricks to speed up the convergence. First,
X
l instead of a sequential update, they repeatedly permute
ri f D ðAÞ ¼ ðQAÞi 1 ¼ yi yt xTi xt t 1 (25) f1; . . . ; lg to decide the order. Second, similar to the
t¼1 shrinking technique used in training nonlinear SVM [47],
they identify some bounded variables which may already
be optimal and remove them during the optimization
requires OðnlÞ cost for l inner products xTi xt ; 8t ¼ 1; . . . ; l. procedure. Experiments in [6] show that for large sparse
To make coordinate descent methods viable for large linear data, Algorithm 3 is much faster than TRON in the early
classification, a crucial step is to maintain stage. However, it is less competitive if the parameter C is
large.
Algorithm 3 is very related to popular decomposition
X
l methods used in training nonlinear SVM (e.g., [8] and
u yt t xt (26) [47]). These decomposition methods also update very few
t¼1
variables at each step, but use more sophisticated schemes
for selecting variables. The main difference is that for
linear SVM, we can define u in (26) because xi ; 8i are
so that (25) becomes available. For nonlinear SVM, ri f D ðwÞ in (25) needs OðnlÞ
cost for calculating l kernel elements. This difference
between OðnÞ and OðnlÞ is similar to that in the testing
ri f D ðAÞ ¼ ðQAÞi 1 ¼ yi uT xi 1: (27) phase discussed in Section II.

E. Example: Solving L1-Regularized Problems by

If u is available through the training process, then the cost
Combining Newton and Coordinate Descent
OðnlÞ in (25) is significantly reduced to OðnÞ. The re-
Methods ðnewGLMNETÞ
maining task is to maintain u. Following (26), if i and i
GLMNET proposed by Friedman et al. [41] is a
are values before and after the update (24), respectively, Newton method for L1-regularized minimization. An im-
then we can easily maintain u by the following OðnÞ proved version newGLMNET [22] is proposed for large-
operation: scale training.
Because the 1-norm term is not differentiable, we
represents f ðwÞ as the sum of two terms kwk1 þ LðwÞ,
u u þ yi ði
i Þxi : (28) where

Therefore, the total cost for updating an i is OðnÞ. The X

l
overall procedure of the coordinate descent method is in LðwÞ C ðw; xi ; yi Þ:
Algorithm 3. i¼1

Algorithm 3: A coordinate descent method for L2- At each iteration, newGLMNET considers the second-
regularized L1-loss SVM [6] order approximation of LðwÞ and solves the following
P problem:
1) Given A and the corresponding u ¼ li¼1 yi i xi .
2) Compute Qii ; 8i ¼ 1; . . . ; l.
3) For k ¼ 1; 2; 3; . . . 1
• For i ¼ 1; . . . ; l min qðdÞ kw þ dk1 kwk1 þ rLðwÞT d þ d T Hd
d 2
a) Compute G ¼ yi uT xi 1 in (27). (29)
b) i i .
c) i minðmaxði G=Qii ; 0Þ; CÞ.
d) u u þ yi ði i Þxi . where H r2 LðwÞ þ I and is a small number to
ensure H to be positive definite.
The vector u defined in (26) is in the same form as w Although (29) is similar to (21), its optimization is
in (2). In fact, as A approaches a dual optimal solution, more difficult because of the 1-norm term. Thus,
u will converge to the primal optimal w following the newGLMNET further breaks (29) to subproblems by a
primal–dual relationship. coordinate descent procedure. In a setting similar to the

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2591

Yuan et al.: Recent Advances of Large-Scale Linear Classification

Table 2 A Comparison of the Four Methods in Section IV-B–E

method in Section IV-D, each time a one-variable function which is able to quickly obtain an approximate w; however,
is minimized in the final stage, the iterate w converges quickly because a
Newton step is taken. Recall in Section IV-A, we men-
tioned that exp = log operations are more expensive than
1 basic operations such as multiplication/division. Because
qðd þ zej Þ qðdÞ ¼ jwj þ dj þ zj jwj þ dj j þ Gj z þ Hjj z2
2 (30) does not involve any exp = log operation, we suc-
(30) cessfully achieve that time spent on exp = log operations is
only a small portion of the whole procedure. In addition,
newGLMNET is an example of accessing data feature
where G rLðwÞ þ Hd. This one-variable function (30)
wisely; see details in [22] about how Gj in (30) is updated.
has a simple closed-form minimizer (see [48], [49], and
[13, App. B])
F. A Comparison of the Four Examples
The four methods discussed in Sections IV-B–E differ in
8 Gj þ1 various aspects. By considering design issues mentioned in
< Hjj ;
> if Gj þ 1 Hjj ðwj þ dj Þ
Section IV-A, we compare these methods in Table 2. We
Gj 1
z¼ if Gj 1 Hjj ðwj þ dj Þ
> Hjj ;
: point out that three methods are primal based, but one is
ðwj þ dj Þ; otherwise: dual based. Next, both Pegasos and Dual-CD use only
low-order information (subgradient and gradient), but
TRON and newGLMNET employ high-order information
At each iteration of newGLMNET, the coordinate descent by Newton directions. Also, we check how data instances
method does not solve problem (29) exactly. Instead, are accessed. Clearly, Pegasos and Dual-CD instance
newGLMNET designs an adaptive stopping condition so wisely access data, but we have mentioned in Section IV-E
that initially problem (29) is solved loosely and in the final that newGLMNET must employ a feature wisely setting.
iterations, (29) is more accurately solved. Interestingly, TRON can use both because in (23), matrix–
After an approximate solution d of (29) is obtained, we vector products can be conducted by accessing data in-
need a line search procedure to ensure the sufficient stance wisely or feature wisely.
function decrease. It finds 2 ð0; 1 such that We analyze the complexity of the four methods by
showing the cost at the kth iteration:
• Pegasos: OðjBþ jnÞ;
f ðw þ dÞ f ðwÞ kw þ dk1 kwk1 þ rLðwÞT d • TRON: #CG iter OðlnÞ;
(31) • Dual-CD: OðlnÞ;
• newGLMNET: #CD iter OðlnÞ.
The cost of Pegasos and TRON easily follows from (17)
where 2 ð0; 1Þ. The overall procedure of newGLMNET and (23), respectively. For Dual-CD, both (27) and (28)
is in Algorithm 4. cost OðnÞ, so one iteration of going through all variables is
OðnlÞ. For newGLMNET, see details in [22]. We can
clearly see that each iteration of Pegasos and Dual-CD is
Algorithm 4: newGLMNET for L1-regularized minimiza- cheaper because of using low-order information. However,
tion [22] they need more iterations than high-order methods in
1) Given w. Given 0 G ; G 1. order to accurately solve the optimization problem.
2) For k ¼ 1; 2; 3; . . .
a) Find an approximate solution d of (29) by a
coordinate descent method. V. MULT ICLASS LINEAR
b) Find ¼ maxf1; ; 2 ; . . .g such that (31) holds. CLASSIFICATION
c) w w þ d. Most classification methods are originally proposed to
solve a two-class problem; however, extensions of these
Due to the adaptive setting, in the beginning methods to multiclass classification have been studied. For
newGLMNET behaves like a coordinate descent method, nonlinear SVM, some works (e.g., [50] and [51]) have

2592 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

comprehensively compared different multiclass solutions. method but it attempts to reduce the testing cost.
In contrast, few studies have focused on multiclass linear Starting with a candidate set of all classes, this
classification. This section introduces and compares some method sequentially selects a pair of classes for
commonly used methods. prediction and removes one of the two. That is, if a
binary classifier of class i and j predicts i, then j is
A. Solving Several Binary Problems removed from the candidate set. Alternatively, a
Multiclass classification can be decomposed to several prediction of class j will cause i to be removed.
binary classification problems. One-against-rest and one- Finally, the only remained class is the predicted
against-one methods are two of the most common de- result. For any pair ði; jÞ considered, the true class
composition approaches. Studies that broadly discussed may be neither i nor j. However, it does not matter
various approaches of decomposition include, for example, which one is removed because all we need is that if
[52] and [53]. the true class is involved in a binary prediction, it is
• One-against-rest method. If there are k classes in the the winner. Because classes are sequentially
training data, the one-against-rest method [54] removed, only k 1 models are used. The testing
constructs k binary classification models. To obtain time complexity of DAGSVM is thus OðnkÞ.
the mth model, instances from the mth class of the
training set are treated as positive, and all other B. Considering All Data at Once
instances are negative. Then, the weight vector wm In contrast to using many binary models, some have
for the mth model can be generated by any linear proposed solving a single optimization problem for multi-
classifier. class classification [59]–[61]. Here we discuss details of
After obtaining all k models, we say an instance Crammer and Singer’s approach [60]. Assume class labels
x is in the mth class if the decision value (1) of the are 1; . . . ; k. They consider the following optimization
mth model is the largest, i.e., problem:

class of x arg max wTm x: (32) 1X k Xl

m¼1;...;k
min kwm k22 þ C CS fwm gkm¼1 ; xi ; yi (33)
w1 ;...;wk 2 m¼1 i¼1

The cost for testing an instance is OðnkÞ.

• One-against-one method. One-against-one method where
[55] solves kðk 1Þ=2 binary problems. Each bi-
nary classifier constructs a model with data from

one class as positive and another class as negative. CS fwm gkm¼1 ; x; y max max 0; 1 ðwy wm ÞT x :
Since there is kðk 1Þ=2 combination of two m6¼y
classes, kðk 1Þ=2 weight vectors are constructed: (34)
w1;2 ; w1;3 ; . . . ; w1;k ; w2;3 ; . . . ; wðk1Þ;k .
There are different methods for testing. One
approach is by voting [56]. For a testing instance The setting is like combining all binary models of the one-
x, if model ði; jÞ predicts x as in the ith class, then a against-rest method. There are k weight vectors w1 ; . . . ; wk
counter for the ith class is added by one; for k classes. In the loss function (34), for each m,
otherwise, the counter for the jth class is added. maxð0; 1 ðwyi wm ÞT xi Þ is similar to the L1 loss in (8)
Then, we say x is in the ith class if the ith counter for binary classification. Overall, we hope that the decision
has the largest value. Other prediction methods value of xi by the model wyi is at least one larger than the
are similar though they differ in how to use the values by other models. For testing, the decision function
kðk 1Þ=2 decision values; see some examples in is also (32).
[52] and [53]. Early works of this method focus on the nonlinear (i.e.,
For linear classifiers, one-against-one method is kernel) case [50], [60], [62]. A study for linear classifica-
shown to give better testing accuracy than one- tion is in [63], which applies a coordinate descent method
against-rest method [57]. However, it requires to solve the dual problem of (33). The idea is similar to the
Oðk2 nÞ spaces for storing models and Oðk2 nÞ cost method in Section IV-D; however, at each step, a larger
for testing an instance; both are more expensive subproblem of k variables is solved. A nice property of this
than the one-against-rest method. Interestingly, k-variable subproblem is that it has a closed-form solution.
for nonlinear classifiers via kernels, one-against- Experiments in [63] show that solving (33) gives slightly
one method does not have such disadvantages [50]. better accuracy than one-against-rest method, but the
DAGSVM [58] is the same as one-against-one training time is competitive. This result is different from

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2593

Yuan et al.: Recent Advances of Large-Scale Linear Classification

the nonlinear case, where the longer training time than Table 3 Comparison of Methods for Multiclass Linear Classification in
Storage (Model Size) and Testing Time. n Is the Number of Features and
one-against-rest and one-against-one methods has made
k Is the Number of Classes
the approach of solving one single optimization problem
less practical [50]. A careful implementation of the ap-
proach in [63] is given in [7, App. E].

C. Maximum Entropy
Maximum entropy (ME) [64] is a generalization of
logistic regression for multiclass problems6 and a special
case of conditional random fields [65] (see Section VIII-A).
It is widely applied by NLP applications. We still assume Equation (35) is a special case of (37) by
class labels 1; . . . ; k for an easy comparison to (33) in our
subsequent discussion. ME models the following condi- 2 3
tional probability function of label y given data x: 0 )
6 .. 7 y1
6.7 2 3
6 7 w1
607
6 7 6 . 7
exp wTy x f ðxi ; yÞ ¼ 6 xi 7 2 Rnk and w ¼ 4 .. 5: (38)
6 7
PðyjxÞ Pk (35) 607 wk
T 6.7
m¼1 exp wm x 4 .. 5
0
where wm ; 8m are weight vectors like those in (32) and
(33). This model is also called multinomial logistic Many studies have investigated optimization methods
regression. for L2-regularized ME. For example, Malouf [66] com-
ME minimizes the following regularized negative log- pares iterative scaling methods [67], gradient descent,
likelihood: nonlinear conjugate gradient, and L-BFGS (quasi-Newton)
method [68] to solve (36). Experiments show that quasi-
Newton performs better. In [12], a framework is proposed
1X k Xl to explain variants of iterative scaling methods [30], [67],
min kwk k2 þ C ME fwm gkm¼1 ; xi ; yi (36) [69] and make a connection to coordinate descent meth-
w1 ;...;wm 2 m¼1 i¼1
ods. For L1-regularized ME, Andrew and Gao [40] propose
an extension of L-BFGS.
Recently, instead of solving the primal problem (36),
where some works solve the dual problem. A detailed derivation
of the dual ME is in [33, App. A.7]. Memisevic [70] pro-
posed a two-level decomposition method. Similar to the
ME fwm gkm¼1 ; x; y log PðyjxÞ: coordinate descent method [63] for (33) in Section V-B, in
[70], a subproblem of k variables is considered at a time.
However, the subproblem does not have a closed-form
solution, so a second-level coordinate descent method is
Clearly, (36) is similar to (33) and ME ð Þ can be consid-
applied. Collin et al. [71] proposed an exponential gradient
ered as a loss function. If wTyi xi wTm xi ; 8m 6¼ yi , then
k method to solve ME dual. They also decompose the prob-
ME ðfwm gm¼1 ; xi ; yi Þ is close to zero (i.e., no loss). On the
lem into k-variable subproblems, but only approximately
other hand, if wTyi xi is smaller than other wTm xi , m 6¼ yi ,
solve each subproblem. The work in [33] follows [70] to
then Pðyi jxi Þ 1 and the loss is large. For prediction, the
apply a two-level coordinate descent method, but uses a
decision function is also (32).
different method in the second level to decide variables for
NLP applications often consider a more general ME
update.
model by using a function f ðx; yÞ to generate the feature
vector
D. Comparison
We summarize storage (model size) and testing time of
expðwT f ðx; yÞÞ each method in Table 3. Clearly, one-against-one and
PðyjxÞ P T 0
: (37) DAGSVM methods are less practical because of the much
y0 expðw f ðx; y ÞÞ
higher storage, although the comparison in [57] indicates
6
Details of the connection between logistic regression and maximum that one-against-one method gives slightly better testing
entropy can be found in, for example, [12, Sec. 5.2]. accuracy. Note that the situation is very different for the

2594 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

kernel case [50], where one-against-one and DAGSVM are Table 4 Results of Training/Testing Degree-2 Polynomial Mappings by
the Coordinate Descent Method in Section IV-D. The Degree-2 Polynomial
very useful methods.
Mapping Is Dynamically Computed During Training, Instead of Expanded
Beforehand. The Last Column Shows the Accuracy Difference Between
Degree-2 Polynomial Mappings and RBF SVM
VI. LI NEAR-CLASSIFICATI ON
TECHNIQUES FOR NONLINEAR
CLASSIFICATION
Many recent developments of linear classification can be
extended to handle nonstandard scenarios. Interestingly,
most of them are related to training nonlinear classifiers.

A. Training and Testing Explicit Data Mappings via

Linear Classifiers explicitly represented, as long as these two operations can
In some problems, training a linear classifier in the be performed, Algorithm 3 is applicable.
original feature space may not lead to competitive perfor- Studies in [76] and [77] designed linear classifiers to
mances. For example, on ijcnn1 in Table 1, the testing train explicit mappings of sequence data, where features
accuracy (92.21%) of a linear classifier is inferior to 98. correspond to subsequences. Using the relation between
69% of a nonlinear one with the RBF kernel. However, the subsequences, they are able to design efficient training
higher accuracy comes with longer training and testing methods for very high-dimensional mappings.
time. Taking the advantage of linear classifiers’ fast train-
ing, some studies have proposed using the explicit nonlin- B. Approximation of Kernel Methods via
ear data mappings. That is, we consider ðxi Þ, i ¼ 1; . . . ; l, Linear Classification
as the new training set and employ a linear classifier. In Methods in Section VI-A train ðxi Þ; 8i explicitly, so
some problems, this type of approaches may still enjoy fast they obtain the same model as a kernel method using
training/testing, but achieve accuracy close to that of using Kðxi ; xj Þ ¼ ðxi ÞT ðxj Þ. However, they have limitations
highly nonlinear kernels. when the dimensionality of ðxÞ is very high. To resolve
Some early works, e.g., [72]–[74], have directly trained the slow training/testing of kernel methods, approxima-
nonlinearly mapped data in their experiments. Chang et al. tion is sometimes unavoidable. Among the many available
[9] analyze when this approach leads to faster training and methods to approximate the kernel, some of them lead to
testing. Assume that the coordinate descent method in training a linear classifier. Following [78], we categorize
Section IV-D is used for training linear/kernelized classi- these methods to the following two types.
fiers7 and ðxÞ 2 Rd . From Section IV-D, each coordinate • Kernel matrix approximation. This type of ap-
descent step takes OðdÞ and OðnlÞ operations for linear and proaches finds a low-rank matrix 2 Rdl with
kernelized settings, respectively. Thus, if d nl, the ap- d l such that T
can approximate the kernel
proach of training explicit mappings may be faster than matrix Q
using kernels. In [9], the authors particularly study
degree-2 polynomial mappings such as (5). The dimen-
sionality is d ¼ Oðn2 Þ, but for sparse data, the Oðn2 Þ versus Q ¼
T Q: (39)
OðnlÞ comparison is changed to Oð n2 Þ versus Oð
nlÞ, where
n is the average number of nonzero values per instance.
For large sparse data sets, n l, so their approach can be Assume ½x1 ; . . . ; xl . If we replace Q in (16)
very efficient. Table 4 shows results of training/testing
with Q, then (16) becomes the dual problem of
degree-2 polynomial mappings using three data sets in training a linear SVM on the new set ðyi ; xi Þ,
Table 1 with significant lower linear-SVM accuracy than i ¼ 1; . . . ; l. Thus, optimization methods discussed
RBF. We apply the same setting as [9, Sec. 4]. From in Section IV can be directly applied. An advantage
Tables 1 and 4, we observed that training ðxi Þ; 8i by a of this approach is that we do not need to know an
linear classifier may give accuracy close to RBF kernel, but explicit mapping function corresponding to a
is faster in training/testing. kernel of our interest (see the other type of ap-
A general framework was proposed in [75] for various proaches discussed below). However, this property
nonlinear mappings of data. They noticed that to perform causes a complicated testing procedure. That is,
the coordinate descent method in Section IV-D, one only the approximation in (39) does not directly reveal
needs that uT ðxÞ in (27) and u u þ yði i ÞðxÞ in how to adjust the decision function (3).
(28) can be performed. Thus, even if ðxÞ cannot be Early developments focused on finding a good
7 approximation matrix . Some examples include
See the discussion in the end of Section IV-D about the connection
between Algorithm 3 and the popular decomposition methods for nonlinear Nyström method [79], [80] and incomplete
SVMs. Cholesky factorization [81], [82]. Some works

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2595

Yuan et al.: Recent Advances of Large-Scale Linear Classification

(e.g., [19]) consider approximations other than because of expensive kernel computation. For distributed
(39), but also lead to linear classification problems. linear classification, the research is still in its infancy. The
A recent study [78] addresses more on training current trend is to design algorithms so that computing
and testing linear SVM after obtaining the low- nodes access data locally and the communication between
rank approximation. In particular, details of the nodes is minimized. The implementation is often con-
testing procedures can be found in [78, Sec. 2.4]. ducted using distributed computing environments such as
Note that linear SVM problems obtained after Hadoop [95]. In this section, we will discuss some ongoing
kernel approximations are often dense and have research results.
more instances than features. Thus, training Among the existing developments, some can be easily
algorithms suitable for such problems may be categorized as online methods. We describe them in Sec-
different from those for sparse document data. tion VII-A. Batch methods are discussed in Section VII-B,
• Feature mapping approximation. This type of ap- while other approaches are in Section VII-C.
proaches finds a mapping function : Rn ! Rd
such that
A. Online Methods
An online method updates the model w via using some
T ðtÞ
ðxÞ Kðx; tÞ: instances at a time rather than considering the whole
training data. Therefore, not only can online methods
handle data larger than memory, but also they are suitable
Then, linear classifiers can be applied to new data for streaming data where each training instance is used
ðx l Þ. The testing phase is straightfor-
1 Þ; . . . ; ðx only once. One popular online algorithm is the stochastic
ward because the mapping ð Þ is available. gradient descent (SGD) method, which can be traced back
Many mappings have been proposed. Examples to stochastic approximation method [96], [97]. Take the
include random Fourier projection [83], random primal L2-regularized L1-loss SVM in (7) as an example.
projections [84], [85], polynomial approximation At each step, a training instance xi is chosen and w is
[86], and hashing [87]–[90]. They differ in various updated by
aspects, which are beyond the scope of this paper.
An issue related to the subsequent linear classifi-
cation is that some methods (e.g., [83]) generate 1
w w rS kwk22 þ C maxð0; 1 yi wT xi Þ (40)
dense ðxÞ vectors, while others give sparse 2
vectors (e.g., [85]). A recent study focusing on
the linear classification after obtaining ðx i Þ; 8i is
in [91]. where rS is a subgradient operator and is the learning
rate. Specifically, (40) becomes the following update rule:

VI I. TRAINING LARGE DATA B EYOND If 1 yi wT xi > 0;

THE MEMORY OR T HE DISK CAPACITY then w ð1 Þw þ Cyi xi :
(41)
Recall that we described some binary linear classification
algorithms in Section IV. Those algorithms can work well
under the assumption that the training set is stored in the The learning rate is gradually reduced along iterations.
computer memory. However, as the training size goes It is well known that SGD methods have slow con-
beyond the memory capacity, traditional algorithms may vergence. However, they are suitable for large data because
become very slow because of frequent disk access. Indeed, of accessing only one instance at a time. Early studies
even if the memory is enough, loading data to memory may which have applied SGD to linear classification include,
take more time than subsequent computation [92]. There- for example, [98] and [99]. For data with many features,
fore, the design of algorithms for data larger than memory recent studies [5], [26] show that SGD is effective. They
is very different from that of traditional algorithms. allow more flexible settings such as using more than one
If the data set is beyond the disk capacity of a single training instance at a time. We briefly discuss the online
computer, then it must be stored distributively. Internet setting of Pegasos [5]. In Algorithm 1, at each step a), a
companies now routinely handle such large data sets in small random subset B is used instead of the full set.
data centers. In such a situation, linear classification faces Similar convergence properties to that described in
even more challenges because of expensive communica- Section IV-B still hold but in expectation (see [5, Th. 2]).
tion cost between different computing nodes. In some Instead of solving the primal problem, we can design an
recent works [93], [94], parallel SVM on distributed envi- online algorithm to solve the dual problem [6], [100]. For
ronments has been studied but they investigated only example, the coordinate descent method in Algorithm 3
kernel SVM. The communication overhead is less serious can be easily extended to an online setting by replacing the

2596 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

sequential selection of variables with a random selection. pensive disk input/output (I/O), they design algorithms by
Notice that the update rule (28) is similar to (41), but has reading a continuous chunk of data at a time and mini-
the advantage of not needing to decide the learning rate . mizing the number of disk accesses. The method in [92]
This online setting falls into the general framework of extends the coordinate descent method in Section IV-D for
randomized coordinate descent methods in [101] and linear SVM. The major change is to update more variables
[102]. Using the proof in [101], the linear convergence in at a time so that a block of data is used together.
expectation is obtained in [6, App. 7.5]. Specifically, in the beginning, the training set is randomly
To improve the convergence of SGD, some [103], [104] partitioned to m files B1 ; . . . ; Bm . The available memory
have proposed using higher order information. The rule in space needs to be able to accommodate one block of data
(40) is replaced by and the working space of a training algorithm. To solve
(16), sequentially one block of data B is read and the
following function of d is minimized under the condition
w w HrS ð Þ (42) 0 i þ di C; 8i 2 B and di ¼ 0; 8i 62 B

1
f D ðA þ dÞ f D ðAÞ ¼ d TB QBB d B þ d TB ðQA eÞB
where H is an approximation of the inverse Hessian 2
1 X
r2 f ðwÞ1 . To save the cost at each update, practically H is ¼ d TB QBB d B þ yi di ðuT xi Þ d TB eB
a diagonal scaling matrix. Experiments [103] and [104] 2 i2B
show that using (42) is faster than (40). (43)
The update rule in (40) assumes L2 regularization.
While SGD is applicable for other regularization, it may where QBB is a submatrix of Q and u is defined in (26). By
not perform as well because of not taking special pro- maintaining u in a way similar to (28), equation (43) in-
perties of the regularization term into consideration. For volves only data in the block B, which can be stored in
example, if L1 regularization is used, a standard SGD may memory. Equation (43) can be minimized by any tradi-
face difficulties to generate a sparse w. To address this tional algorithm. Experiments in [92] demonstrate that
problem, recently several approaches have been proposed they can train data 20 times larger than the memory capa-
[105]–[110]. The stochastic coordinate descent method in city. This method is extended in [115] to cache informative
[106] has been extended to a parallel version [111]. data points in the computer memory. That is, at each
Unfortunately, most existing studies of online algo- iteration, not only the selected block but also the cached
rithms conduct experiments by assuming enough memory points are used for updating corresponding variables. Their
and reporting the number of times to access data. To apply way to select informative points is inspired by the shrink-
them in a real scenario without sufficient memory, many ing techniques used in training nonlinear SVM [8], [47].
practical issues must be checked. Vowpal-Wabbit [112] is For distributed batch learning, all existing parallel
one of the very few implementations which can handle optimization methods [116] can possibly be applied. How-
data larger than memory. Because the same data may be ever, we have not seen many practical deployments for
accessed several times and the disk reading time is expen- training large-scale data. Recently, Boyd et al. [117] have
sive, at the first pass, Vowpal-Wabbit stores data to a considered the alternating direction method of multiplier
compressed cache file. This is similar to the compression (ADMM) [118] for distributed learning. Take SVM as an
strategy in [92], which will be discussed in Section VII-B. example and assume data points are partitioned to m dis-
Currently, Vowpal-Wabbit supports unregularized linear tributively stored sets B1 ; . . . ; Bm . This method solves the
classification and regression. It is extended to solve L1- following approximation of the original optimization
regularized problems in [105]. problem:
Recently, Vowpal-Wabbit (after version 6.0) has sup-
ported distributed online learning using the Hadoop [95]
framework. We are aware that other Internet companies 1 T Xm X

have constructed online linear classifiers on distributed min z zþC L1 ðwj ; xi ; yi Þ
w1 ;...;wm ;z 2 j¼1 i2Bj
environments, although details have not been fully avail-
able. One example is the system SETI at Google [113]. X
m
þ kwj zk2
2 j¼1
B. Batch Methods subject to wj z ¼ 0; 8j
In some situations, we still would like to consider the
whole training set and solve a corresponding optimization
problem. While this task is very challenging, some (e.g., where is a prespecific parameter. It then employs an
[92] and [114]) have checked the situation that data are optimization method of multipliers by alternatively
larger than memory but smaller than disk. Because of ex- minimizing the Lagrangian function over w1 ; . . . ; wm ,

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2597

Yuan et al.: Recent Advances of Large-Scale Linear Classification

minimizing the Lagrangian over z, and updating dual mul- mance. An advantage of the bagging-like approach is the
tipliers. The minimization of Lagrangian over w1 ; . . . ; wm easy implementation using distributed computing techni-
can be decomposed to m independent problems. Other ques such as MapReduce [128].8
steps do not involve data at all. Therefore, data points are
locally accessed and the communication cost is kept
minimum. Examples of using ADMM for distributed VIII. RELATED TOPICS
training include [119]. Some known problems of this
In this section, we discuss some other linear models. They
approaches are first that the convergence rate is not very
are related to linear classification models discussed in
fast, and second that it is unclear how to choose param-
earlier sections.
eter .
Some works solve an optimization problem using
parallel SGD. The data are stored in a distributed system, A. Structured Learning
and each node only computes the subgradient correspond- In the discussion so far, we assumed that the label yi is a
ing to the data instances in the node. In [120], a delayed single value. For binary classification, it is þ1 or 1, while
SGD is proposed. Instead of computing the subgradient of for multiclass classification, it is one of the k class labels.
the current iterate wk , in delayed SGD, each node com- However, in some applications, the label may be a more
putes the subgradient of a previous iterator w ðkÞ , where sophisticated object. For example, in part-of-speech (POS)
ðkÞ k. Delayed SGD is useful to reduce the synchro- tagging applications, the training instances are sentences
nization delay because of communication overheads or and the labels are sequences of POS tags of words. If there
uneven computational time at various nodes. Recent works are l sentences, we can write the training instances as
[121], [122] show that delayed SGD is efficient when the ðyi ; xi Þ 2 Y ni X ni ; 8i ¼ 1; . . . ; l, where xi is the ith sen-
number of nodes is large, and the delay is asymptotically tence, yi is a sequence of tags, X is a set of unique words in
negligible. the context, Y is a set of candidate tags for each word, and
ni is the number of words in the ith sentence. Note that
C. Other Approaches we may not be able to split the problem to several
We briefly discuss some other approaches which can- independent ones by treating each value yij of yi as the
not be clearly categorized as batch or online methods. label, because yij not only depends on the sentence xi but
The most straightforward method to handle large data also other tags ðyi1 ; . . . ; yiðj1Þ ; yiðjþ1Þ ; . . . yini Þ. To handle
is probably to randomly select a subset that can fit in these problems, we could use structured learning models
memory. This approach works well if the data quality is like conditional random fields [65] and structured SVM
good; however, sometimes using more data gives higher [129], [130].
accuracy. To improve the performance of using only a • Conditional random fields (CRFs). The CRF [65] is a
subset, some have proposed techniques to include impor- linear structured model commonly used in NLP.
tant data points into the subset. For example, the approach Using notation mentioned above and a feature
in [123] selects a subset by reading data from disk only function f ðx; yÞ like ME, CRF solves the following
once. For data in a distributed environment, subsampling problem:
can be a complicated operation. Moreover, a subset fitting
the memory of one single computer may be too small to
give good accuracy. 1 Xl
Bagging [124] is a popular classification method to split min kwk22 þ C CRF ðw; xi ; yi Þ (44)
w 2 i¼1
a learning task to several easier ones. It selects several
random subsets, trains each of them, and ensembles (e.g.,
averaging) the results during testing. This method may be
where
particularly useful for distributively stored data because we
can directly consider data in each node as a subset. How-
ever, if data quality in each node is not good (e.g., all
instances with the same class label), the model generated CRF ðw; xi ; yi Þ log Pðyi jxi Þ
by each node may be poor. Thus, ensuring data quality of
expðwT f ðx; yÞÞ
each subset is a concern. Some studies have applied the PðyjxÞ P T 0
: (45)
bagging approach on a distributed system [125], [126]. y0 expðw f ðx; y ÞÞ

For example, in the application of web advertising,

Chakrabarti et al. [125] train a set of individual classifiers
in a distributed way. Then, a final model is obtained by If elements in yi are independent of each other,
averaging the separate classifiers. In the linguistic appli- then CRF reduces to ME.
cations, McDonald et al. [127] extend the simple model 8
We mentioned earlier the Hadoop system, which includes a
average to the weighted average and achieve better perfor- MapReduce implementation.

2598 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

The optimization of (44) is challenging because terms. While L1 and L2 regularization is still used, loss
in the probability model (45), the number of functions are different, where two popular ones are
possible y’s is exponentially large. An important
property to make CRF practical is that the gradient
of the objective function in (44) can be efficiently 1 2
LS ðw; x; zÞ ðz wT xÞ (47)
evaluated by dynamic programming [65]. Some 2
available optimization methods include L-BFGS ðw; x; zÞ max 0; jz wT xj : (48)
(quasi-Newton) and conjugate gradient [131], SGD
[132], stochastic quasi-Newton [103], [133], and
trust region Newton method [134]. It is shown in The least square loss in (47) is widely used in many places,
[134] that the Hessian-vector product (23) of the while the -insensitive loss in (48) is extended from the L1
Newton method can also be evaluated by dynamic loss in (8), where there is a user-specified parameter as
programming. the error tolerance. Problem (7) with L2 regularization
• Structured SVM. Structured SVM solves the follow- and -insensitive loss is called support vector regression
ing optimization problem generalized form multi- (SVR) [138]. Contrary to the success of linear classifica-
class SVM in [59], [60]: tion, so far not many applications of linear regression on
large sparse data have been reported. We believe that this
topic has not been fully explored yet.
1 Xl
Regarding the minimization of (7), if L2 regulariza-
min kwk22 þ C SS ðw; xi ; yi Þ (46)
w 2 i¼1
tion is used, many optimization methods mentioned in
Section IV can be easily modified for linear regression.
We then particularly discuss L1-regularized least
where square regression, which has recently drawn much atten-
tion for signal processing and image applications. This
research area is so active that many optimization methods
SS ðw; xi ; yi Þ max max 0; ðyi ; yÞ (e.g., [49] and [139]–[143]) have been proposed. However,
y6¼yi
as pointed out in [13], optimization methods most suitable
wT ðf ðxi ; yi Þ f ðxi ; yÞÞ for signal/image applications via L1-regularized regression
may be very different from those in Section IV for classi-
fying large sparse data. One reason is that data from signal/
and ð Þ is a distance function with ðyi ; yi Þ ¼ 0 image problems tend to be dense. Another is that xi ; 8i
and ðyi ; yj Þ ¼ ðyj ; yi Þ. Similar to the relation may be not directly available in some signal/image prob-
between conditional random fields and maximum lems. Instead, we can only evaluate the product between
entropy, if the data matrix and a vector through certain operators.
Thus, optimization methods that can take this property
( into their design may be more efficient.
0; if yi ¼ yj
ðyi ; yj Þ ¼
1; otherwise IX. CONCLUSION
In this paper, we have comprehensively reviewed recent
advances of large linear classification. For some applica-
and yi 2 f1; . . . ; kg; 8i, then structured SVM be- tions, linear classifiers can give comparable accuracy to
comes Crammer and Singer’s problem in (33) fol- nonlinear classifiers, but enjoy much faster training and
lowing the definition of f ðx; yÞ and w in (38). testing speed. However, these results do not imply that
Like CRF, the main difficulty to solve (46) is on nonlinear classifiers should no longer be considered. Both
handling an exponential number of y values. Some linear and nonlinear classifiers are useful under different
works (e.g., [25], [129], and [135]) use a cutting circumstances.
plane method [136] to solve (46). In [137], a sto- Without mapping data to another space, for linear
chastic subgradient descent method is applied for classification we can easily prepare, select, and manipulate
both online and batch settings. features. We have clearly shown that linear classification is
not limited to standard scenarios like document classifi-
B. Regression cation. It can be applied in many other places such as
Given training data fðzi ; xi Þgli¼1 R Rn , a regres- efficiently approximating nonlinear classifiers. We are
sion problem finds a weight vector w such that wT xi confident that future research works will make linear
zi ; 8i. Like classification, a regression task solves a risk classification a useful technique for more large-scale
minimization problem involving regularization and loss applications. h

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2599

Yuan et al.: Recent Advances of Large-Scale Linear Classification

REFERENCES solution may be sparse,[ IEEE Trans. Inf. http://www.csie.ntu.edu.tw/~cjlin/papers/

Theory, vol. 54, no. 11, pp. 4789–4812, maxent_dual.pdf.
[1] B. E. Boser, I. Guyon, and V. Vapnik, Nov. 2008.
BA training algorithm for optimal margin [34] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani,
classifiers,[ in Proc. 5th Annu. Workshop [18] H. Zou and T. Hastie, BRegularization B1-norm support vector machines,[ in
Comput. Learn. Theory, 1992, pp. 144–152. and variable selection via the elastic net,[ Advances in Neural Information Processing
J. Roy. Stat. Soc. B (Stat. Methodol.), vol. 67, Systems 16, S. Thrun, L. Saul, and
[2] C. Cortes and V. Vapnik, BSupport-vector no. 2, pp. 301–320, 2005. B. Schölkopf, Eds. Cambridge, MA:
network,[ Mach. Learn., vol. 20, MIT Press, 2004.
pp. 273–297, 1995. [19] Y.-J. Lee and O. L. Mangasarian, BRSVM:
Reduced support vector machines,[ in Proc. [35] G. M. Fung and O. L. Mangasarian, BA
[3] J. S. Cramer, BThe origins of logistic 1st SIAM Int. Conf. Data Mining, 2001. feature selection Newton method for support
regression,[ Tinbergen Inst., Amsterdam, [Online]. Available: http://www.siam.org/ vector machine classification,[ Comput.
The Netherlands, Tech. Rep. [Online]. proceedings/datamining/2001/dm01.php. Optim. Appl., vol. 28, pp. 185–202, 2004.
Available: http://ideas.repec.org/p/dgr/
uvatin/20020119.html [20] J. Shi, W. Yin, S. Osher, and P. Sajda, [36] O. L. Mangasarian, BExact 1-norm support
BA fast hybrid algorithm for large scale vector machines via unconstrained convex
[4] T. Joachims, BTraining linear SVMs in ‘1 -regularized logistic regression,[ differentiable minimization,[ J. Mach.
linear time,[ in Proc. 12th ACM SIGKDD J. Mach. Learn. Res., vol. 11, pp. 713–741, Learn. Res., vol. 7, pp. 1517–1530, 2006.
Int. Conf. Knowl. Disc. Data Mining, 2006, 2010.
DOI: 10.1145/1150402.1150429. [37] K. Koh, S.-J. Kim, and S. Boyd. (2007).
[21] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. An interior-point method for large-scale
[5] S. Shalev-Shwartz, Y. Singer, and N. Srebro, (2008). Coordinate descent method L1 -regularized logistic regression. J. Mach.
BPegasos: Primal estimated sub-gradient for large-scale L2-loss linear SVM. J. Mach. Learn. Res. [Online]. 8, pp. 1519–1555.
solver for SVM,[ in Proc. 24th Int. Conf. Learn. Res. [Online]. 9, pp. 1369–1398. Available: http://www.stanford.edu/~boyd/
Mach. Learn., 2007, DOI: 10.1145/1273496. Available: http://www.csie.ntu.edu.tw/ l1_logistic_reg.html
1273598. ~cjlin/papers/cdl2.pdf [38] A. Genkin, D. D. Lewis, and D. Madigan,
[6] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, [22] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. (2011). BLarge-scale Bayesian logistic regression
S. S. Keerthi, and S. Sundararajan, BA dual An improved GLMNET for ‘1 -regularized for text categorization,[ Technometrics,
coordinate descent method for large-scale logistic regression and support vector vol. 49, no. 3, pp. 291–304, 2007.
linear SVM,[ in Proc. 25th Int. Conf. Mach. machines, Nat. Taiwan Univ., Taipei,
Learn., 2008, DOI: 10.1145/1390156. [39] S. Yun and K.-C. Toh, BA coordinate gradient
Taiwan, Tech. Rep. [Online]. Available: descent method for L1 -regularized convex
1390208. [Online]. Available: http://www. http://www.csie.ntu.edu.tw/~cjlin/papers/
csie.ntu.edu.tw/~cjlin/papers/cddual.pdf. minimization,[ Comput. Optim. Appl., vol. 48,
long-glmnet.pdf no. 2, pp. 273–307, 2011.
[7] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, [23] P. S. Bradley and O. L. Mangasarian,
X.-R. Wang, and C.-J. Lin. (2008). [40] G. Andrew and J. Gao, BScalable training
BMassive data discrimination via linear of L1 -regularized log-linear models,[ in
LIBLINEAR: A library for large linear support vector machines,[ Optim. Methods
classification. J. Mach. Learn. Res. [Online]. Proc. 24th Int. Conf. Mach. Learn., 2007,
Softw., vol. 13, no. 1, pp. 1–10, 2000. DOI: 10.1145/1273496.1273501.
9, pp. 1871–1874. Available: http://www.
csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf [24] V. Franc and S. Sonnenburg, BOptimized [41] J. H. Friedman, T. Hastie, and R. Tibshirani,
cutting plane algorithm for support vector BRegularization paths for generalized
[8] C.-C. Chang and C.-J. Lin. (2011). LIBSVM: machines,[ in Proc. 25th Int. Conf. Mach.
A library for support vector machines. linear models via coordinate descent,[ J.
Learn., 2008, pp. 320–327. Stat. Softw., vol. 33, no. 1, pp. 1–22, 2010.
ACM Trans. Intell. Syst. Technol. [Online].
2, pp. 27:1–27:27. Available: http://www. [25] C. H. Teo, S. Vishwanathan, A. Smola, and [42] J. Liu, J. Chen, and J. Ye, BLarge-scale sparse
csie.ntu.edu.tw/~cjlin/libsvm Q. V. Le, BBundle methods for regularized logistic regression,[ in Proc. 15th ACM
risk minimization,[ J. Mach. Learn. Res., SIGKDD Int. Conf. Knowl. Disc. Data Mining,
[9] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, vol. 11, pp. 311–365, 2010.
M. Ringgaard, and C.-J. Lin. (2010). Training 2009, pp. 547–556.
and testing low-degree polynomial data [26] L. Bottou, Stochastic Gradient Descent [43] R. Tomioka, T. Suzuki, and M. Sugiyama,
mappings via linear SVM. J. Mach. Learn. Res. Examples, 2007. [Online]. Available: BSuper-linear convergence of dual
[Online]. 11, pp. 1471–1490. Available: http://leon.bottou.org/projects/sgd. augmented Lagrangian algorithm for
http://www.csie.ntu.edu.tw/~cjlin/papers/ [27] S. S. Keerthi and D. DeCoste, BA modified sparse learning,[ J. Mach. Learn. Res.,
lowpoly_journal.pdf finite Newton method for fast solution of vol. 12, pp. 1537–1586, 2011.
[10] S. S. Keerthi and C.-J. Lin, BAsymptotic large scale linear SVMs,[ J. Mach. Learn. [44] M. Schmidt, G. Fung, and R. Rosales,
behaviors of support vector machines with Res., vol. 6, pp. 341–361, 2005. Optimization methods for L1 -regularization,
Gaussian kernel,[ Neural Comput., vol. 15, [28] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Univ. British Columbia, Vancouver, BC,
no. 7, pp. 1667–1689, 2003. (2008). Trust region Newton method for Canada, Tech. Rep. TR-2009-19, 2009.
[11] Z. S. Harris, BDistributional structure,[ large-scale logistic regression. J. Mach. Learn. [45] C.-J. Lin and J. J. Moré, BNewton’s
Word, vol. 10, pp. 146–162, 1954. Res. [Online]. 9, pp. 627–650. Available: method for large-scale bound constrained
http://www.csie.ntu.edu.tw/~cjlin/papers/ problems,[ SIAM J. Optim., vol. 9,
[12] F.-L. Huang, C.-J. Hsieh, K.-W. Chang, and logistic.pdf
C.-J. Lin. (2010). Iterative scaling and pp. 1100–1127, 1999.
coordinate descent methods for maximum [29] T. P. Minka, A Comparison of Numerical [46] Z.-Q. Luo and P. Tseng, BOn the convergence
entropy. J. Mach. Learn. Res. [Online]. Optimizers for Logistic Regression, 2003. of coordinate descent method for convex
11, pp. 815–848. Available: http://www. [Online]. Available: http://research. differentiable minimization,[ J. Optim.
csie.ntu.edu.tw/~cjlin/papers/maxent_ microsoft.com/~minka/papers/logreg/. Theory Appl., vol. 72, no. 1, pp. 7–35, 1992.
journal.pdf [30] J. Goodman, BSequential conditional [47] T. Joachims, BMaking large-scale SVM
[13] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, generalized iterative scaling,[ in Proc. 40th learning practical,[ in Advances in Kernel
and C.-J. Lin. (2010). A comparison Annu. Meeting Assoc. Comput. Linguist., 2002, MethodsVSupport Vector Learning,
of optimization methods and software pp. 9–16. B. Schölkopf, C. J. C. Burges, and
for large-scale L1 -regularized linear [31] R. Jin, R. Yan, J. Zhang, and A. J. Smola, Eds. Cambridge, MA:
classification. J. Mach. Learn. Res. [Online]. A. G. Hauptmann, BA faster iterative MIT Press, 1998, pp. 169–184.
11, pp. 3183–3234. Available: http://www. scaling algorithm for conditional exponential [48] J. H. Friedman, T. Hastie, H. Höfling, and
csie.ntu.edu.tw/~cjlin/papers/l1.pdf model,[ in Proc. 20th Int. Conf. Mach. Learn., R. Tibshirani, BPathwise coordinate
[14] O. L. Mangasarian, BA finite Newton 2003, pp. 282–289. optimization,[ Ann. Appl. Stat., vol. 1, no. 2,
method for classification,[ Optim. Methods [32] P. Komarek and A. W. Moore, BMaking pp. 302–332, 2007.
Softw., vol. 17, no. 5, pp. 913–929, 2002. logistic regression a core data mining tool: [49] S. J. Wright, R. D. Nowak, and
[15] A. Y. Ng, BFeature selection, L1 vs. L2 A practical investigation of accuracy, speed, M. A. Figueiredo, BSparse reconstruction
regularization, and rotational invariance,[ in and simplicity,[ Robotics Inst., Carnegie by separable approximation,[ IEEE
Proc. 21st Int. Conf. Mach. Learn., 2004, DOI: Mellon Univ., Pittsburgh, PA, Tech. Rep. Trans. Signal Process., vol. 57, no. 7,
10.1145/1015330.1015435. TR-05-27, 2005. pp. 2479–2493, Jul. 2009.
[16] R. Tibshirani, BRegression shrinkage [33] H.-F. Yu, F.-L. Huang, and C.-J. Lin, BDual [50] C.-W. Hsu and C.-J. Lin, BA comparison
and selection via the lasso,[ J. Roy. Stat. coordinate descent methods for logistic of methods for multi-class support vector
Soc. B, vol. 58, pp. 267–288, 1996. regression and maximum entropy machines,[ IEEE Trans. Neural Netw.,
models,[ Mach. Learn., vol. 85, no. 1–2, vol. 13, no. 2, pp. 415–425, Mar. 2002.
[17] D. L. Donoho and Y. Tsaig, BFast solution pp. 41–75, Oct. 2011. [Online]. Available:
of ‘1 -norm minimization problems when the

2600 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

[51] R. Rifkin and A. Klautau, BIn defense of Proc. 6th Conf. Natural Lang. Learn., 2002, Cambridge, MA: MIT Press, 2008,
one-vs-all classification,[ J. Mach. Learn. Res., DOI: 10.3115/1118853.1118871. pp. 1177–1184.
vol. 5, pp. 101–141, 2004. [67] J. N. Darroch and D. Ratcliff, BGeneralized [84] D. Achlioptas, BDatabase-friendly random
[52] E. L. Allwein, R. E. Schapire, and Y. Singer, iterative scaling for log-linear models,[ Ann. projections: Johnson-Lindenstrauss with
BReducing multiclass to binary: A unifying Math. Stat., vol. 43, no. 5, pp. 1470–1480, binary coins,[ J. Comput. Syst. Sci., vol. 66,
approach for margin classifiers,[ J. Mach. 1972. pp. 671–687, 2003.
Learn. Res., vol. 1, pp. 113–141, 2001. [68] D. C. Liu and J. Nocedal, BOn the limited [85] P. Li, T. J. Hastie, and K. W. Church,
[53] T.-K. Huang, R. C. Weng, and C.-J. Lin. memory BFGS method for large scale BVery sparse random projections,[ in Proc.
(2006). Generalized Bradley-Terry models optimization,[ Math. Programm., vol. 45, 12th ACM SIGKDD Int. Conf. Knowl. Disc.
and multi-class probability estimates. J. no. 1, pp. 503–528, 1989. Data Mining, 2006, pp. 287–296.
Mach. Learn. Res. [Online]. 7, pp. 85–115. [69] S. Della Pietra, V. Della Pietra, and [86] K.-P. Lin and M.-S. Chen, BEfficient kernel
Available: http://www.csie.ntu.edu.tw/ J. Lafferty, BInducing features of random approximation for large-scale support vector
~cjlin/papers/generalBT.pdf fields,[ IEEE Trans. Pattern Anal. Mach. machine classification,[ in Proc. 11th SIAM
[54] L. Bottou, C. Cortes, J. Denker, H. Drucker, Intell., vol. 19, no. 4, pp. 380–393, Int. Conf. Data Mining, 2011, pp. 211–222.
I. Guyon, L. Jackel, Y. LeCun, U. Muller, Apr. 1997. [87] Q. Shi, J. Petterson, G. Dror, J. Langford,
E. Sackinger, P. Simard, and V. Vapnik, [70] R. Memisevic, BDual optimization of A. Smola, A. Strehl, and S. Vishwanathan,
BComparison of classifier methods: A case conditional probability models,[ Dept. BHash kernels,[ in Proc. 12th Int. Conf.
study in handwriting digit recognition,[ in Comput. Sci., Univ. Toronto, Toronto, Artif. Intell. Stat., 2009, vol. 5, pp. 496–503.
Proc. Int. Conf. Pattern Recognit., 1994, ON, Canada, Tech. Rep., 2006. [88] K. Weinberger, A. Dasgupta, J. Langford,
pp. 77–87.
[71] M. Collins, A. Globerson, T. Koo, A. Smola, and J. Attenberg, BFeature
[55] S. Knerr, L. Personnaz, and G. Dreyfus, X. Carreras, and P. Bartlett, BExponentiated hashing for large scale multitask learning,[
BSingle-layer learning revisited: A stepwise gradient algorithms for conditional random in Proc. 26th Int. Conf. Mach. Learn., 2009,
procedure for building and training a fields and max-margin Markov networks,[ pp. 1113–1120.
neural network,[ in Neurocomputing: J. Mach. Learn. Res., vol. 9, pp. 1775–1822, [89] P. Li and A. C. König, Bb-bit minwise
Algorithms, Architectures and Applications, 2008. hashing,[ in Proc. 19th Int. Conf. World
J. Fogelman, Ed. New York:
[72] E. M. Gertz and J. D. Griffin, BSupport vector Wide Web, 2010, pp. 671–680.
Springer-Verlag, 1990.
machine classifiers for large data sets,[ [90] P. Li and A. C. König, BTheory and
[56] J. H. Friedman, BAnother approach to Argonne Nat. Lab., Argonne, IL, Tech. Rep. applications of b-bit minwise hashing,[
polychotomous classification,[ Dept. Stat., ANL/MCS-TM-289, 2005. Commun. ACM, vol. 54, no. 8, pp. 101–109,
Stanford Univ., Stanford, CA, Tech. Rep.
[73] J. H. Jung, D. P. O’Leary, and A. L. Tits, 2011.
[Online]. Available: http://www-stat.
BAdaptive constraint reduction for training [91] P. Li, A. Shrivastava, J. Moore, and
stanford.edu/~jhf/ftp/poly.pdf
support vector machines,[ Electron. Trans. A. C. König, BHashing algorithms for
[57] T.-L. Huang, BComparison of L2-regularized Numer. Anal., vol. 31, pp. 156–177, 2008. large-scale learning,[ Cornell Univ.,
multi-class linear classifiers,[ M.S. thesis,
[74] Y. Moh and J. M. Buhmann, BKernel Ithaca, NY, Tech. Rep. [Online]. Available:
Dept. Comput. Sci. Inf. Eng., Nat. Taiwan
expansion for online preference tracking,[ http://www.stat.cornell.edu/~li/reports/
Univ., Taipei, Taiwan, 2010.
in Proc. Int. Soc. Music Inf. Retrieval, 2008, HashLearning.pdf
[58] J. C. Platt, N. Cristianini, and pp. 167–172. [92] H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and
J. Shawe-Taylor, BLarge margin DAGs
[75] S. Sonnenburg and V. Franc, BCOFFIN: C.-J. Lin, BLarge linear classification
for multiclass classification,[ in Advances
A computational framework for linear when data cannot fit in memory,[ in Proc.
in Neural Information Processing Systems,
SVMs,[ in Proc. 27th Int. Conf. Mach. 16th ACM SIGKDD Int. Conf. Knowl. Disc.
vol. 12. Cambridge, MA: MIT Press,
Learn., 2010, pp. 999–1006. Data Mining, 2010, pp. 833–842. [Online].
2000, pp. 547–553.
[76] G. Ifrim, G. BakNr, and G. Weikum, BFast Available: http://www.csie.ntu.edu.tw/
[59] J. Weston and C. Watkins, BMulti-class ~cjlin/papers/kdd_disk_decomposition.pdf.
logistic regression for text categorization
support vector machines,[ in Proc. Eur.
with variable-length n-grams,[ in Proc. [93] E. Chang, K. Zhu, H. Wang, H. Bai, J. Li,
Symp. Artif. Neural Netw., M. Verleysen, Ed.,
14th ACM SIGKDD Int. Conf. Knowl. Disc. Z. Qiu, and H. Cui, BParallelizing support
Brussels, 1999, pp. 219–224.
Data Mining, 2008, pp. 354–362. vector machines on distributed computers,[
[60] K. Crammer and Y. Singer, BOn the in Advances in Neural Information Processing
[77] G. Ifrim and C. Wiuf, BBounded
algorithmic implementation of multiclass Systems 20, J. Platt, D. Koller, Y. Singer, and
coordinate-descent for biological sequence
kernel-based vector machines,[ J. S. Roweis, Eds. Cambridge, MA: MIT
classification in high dimensional predictor
Mach. Learn. Res., vol. 2, pp. 265–292, Press, 2008, pp. 257–264.
space,[ in Proc. 17th ACM SIGKDD Int.
2001.
Conf. Knowl. Disc. Data Mining, 2011, [94] Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and
[61] Y. Lee, Y. Lin, and G. Wahba, BMulticategory DOI: 10.1145/2020408.2020519. Z. Chen, BP-packSVM: Parallel primal
support vector machines,[ J. Amer. Stat. gradient descent kernel SVM,[ in Proc. IEEE
[78] S. Lee and S. J. Wright, BASSET:
Assoc., vol. 99, no. 465, pp. 67–81, 2004. Int. Conf. Data Mining, 2009, pp. 677–686.
Approximate stochastic subgradient
[62] C.-J. Lin. (2002, Sep.). A formal analysis of estimation training for support vector [95] T. White, Hadoop: The Definitive Guide,
stopping criteria of decomposition methods machines,[ IEEE Trans. Pattern Anal. 2nd ed. New York: O’Reilly Media, 2010.
for support vector machines. IEEE Trans. Mach. Intell., 2012. [96] H. Robbins and S. Monro, BA stochastic
Neural Netw. [Online]. 13(5), pp. 1045–1052.
[79] C. K. I. Williams and M. Seeger, BUsing approximation method,[ Ann. Math. Stat.,
Available: http://www.csie.ntu.edu.tw/
the Nyström method to speed up kernel vol. 22, no. 3, pp. 400–407, 1951.
~cjlin/papers/stop.ps.gz
machines,[ in Advances in Neural Information [97] J. Kiefer and J. Wolfowitz, BStochastic
[63] S. S. Keerthi, S. Sundararajan, K.-W. Chang, Processing Systems 13, T. Leen, T. Dietterich, estimation of the maximum of a regression
C.-J. Hsieh, and C.-J. Lin, BA sequential dual and V. Tresp, Eds. Cambridge, MA: MIT function,[ Ann. Math. Stat., vol. 23, no. 3,
method for large scale multi-class linear Press, 2001, pp. 682–688. pp. 462–466, 1952.
SVMs,[ in Proc. 14th ACM SIGKDD Int.
[80] P. Drineas and M. W. Mahoney, BOn the [98] T. Zhang, BSolving large scale linear
Conf. Knowl. Disc. Data Mining, 2008,
Nyström method for approximating a gram prediction problems using stochastic
pp. 408–416. [Online]. Available: http://
matrix for improved kernel-based learning,[ gradient descent algorithms,[ in Proc.
www.csie.ntu.edu.tw/~cjlin/papers/
J. Mach. Learn. Res., vol. 6, pp. 2153–2175, 21st Int. Conf. Mach. Learn., 2004,
sdm_kdd.pdf.
2005. DOI: 10.1145/1015330.1015332.
[64] A. L. Berger, V. J. Della Pietra, and
[81] S. Fine and K. Scheinberg, BEfficient [99] L. Bottou and Y. LeCun, BLarge scale online
S. A. Della Pietra, BA maximum entropy
SVM training using low-rank kernel learning,[ Advances in Neural Information
approach to natural language processing,[
representations,[ J. Mach. Learn. Res., Processing Systems 16. Cambridge, MA:
Comput. Linguist., vol. 22, no. 1, pp. 39–71,
vol. 2, pp. 243–264, 2001. MIT Press, 2004, pp. 217–224.
1996.
[82] F. R. Bach and M. I. Jordan, BPredictive [100] A. Bordes, S. Ertekin, J. Weston, and
[65] J. Lafferty, A. McCallum, and F. Pereira,
low-rank decomposition for kernel L. Bottou, BFast kernel classifiers with online
BConditional random fields: Probabilistic
methods,[ in Proc. 22nd Int. Conf. Mach. and active learning,[ J. Mach. Learn. Res.,
models for segmenting and labeling
Learn., 2005, pp. 33–40. vol. 6, pp. 1579–1619, 2005.
sequence data,[ in Proc. 18th Int. Conf.
Mach. Learn., 2001, pp. 282–289. [83] A. Rahimi and B. Recht, BRandom features [101] Y. E. Nesterov, BEfficiency of coordinate
for large-scale kernel machines Advances descent methods on huge-scale optimization
[66] R. Malouf, BA comparison of algorithms for
in Neural Information Processing Systems. problems,[ Université Catholique de
maximum entropy parameter estimation,[ in

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2601

Yuan et al.: Recent Advances of Large-Scale Linear Classification

Louvain, Louvain-la-Neuve, Louvain, Disc. Data Mining, 2011, DOI: 10.1145/ for structured and interdependent output
Belgium, CORE Discussion Paper, Tech. 2020408.2020517. variables,[ J. Mach. Learn. Res., vol. 6,
Rep. [Online]. Available: http://www.ucl.be/ [116] Y. Censor and S. A. Zenios, Parallel pp. 1453–1484, 2005.
cps/ucl/doc/core/documents/coredp2010_ Optimization: Theory, Algorithms, and [130] B. Taskar, C. Guestrin, and D. Koller,
2web.pdf Applications. Oxford, U.K.: Oxford Univ. BMax-margin markov networks,[ in Advances
[102] P. Richtárik and M. Takáč, BIteration Press, 1998. in Neural Information Processing Systems 16.
complexity of randomized block-coordinate [117] S. Boyd, N. Parikh, E. Chu, B. Peleato, and Cambridge, MA: MIT Press, 2004.
descent methods for minimizing a composite J. Eckstein, BDistributed optimization and [131] F. Sha and F. C. N. Pereira, BShallow parsing
function,[ Schl. Math., Univ. Edinburgh, statistical learning via the alternating with conditional random fields,[ in Proc.
Edinburgh, U.K., Tech. Rep., 2011. direction method of multipliers,[ Found. HLT-NAACL, 2003, pp. 134–141.
[103] A. Bordes, L. Bottou, and P. Gallinari, Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, [132] S. Vishwanathan, N. N. Schraudolph,
BSGD-QN: Careful quasi-Newton stochastic 2011. M. W. Schmidt, and K. Murphy,
gradient descent,[ J. Mach. Learn. Res., [118] D. Gabay and B. Mercier, BA dual algorithm BAccelerated training of conditional
vol. 10, pp. 1737–1754, 2009. for the solution of nonlinear variational random fields with stochastic gradient
[104] A. Bordes, L. Bottou, P. Gallinari, J. Chang, problems via finite element approximation,[ methods,[ in Proc. 23rd Int. Conf. Mach.
and S. A. Smith, BErratum: SGD-QN is Comput. Math. Appl., vol. 2, pp. 17–40, 1976. Learn., 2006, pp. 969–976.
less careful than expected,[ J. Mach. Learn. [119] P. A. Forero, A. Cano, and G. B. Giannakis, [133] N. N. Schraudolph, J. Yu, and S. Gunter,
Res., vol. 11, pp. 2229–2240, 2010. BConsensus-based distributed support BA stochastic quasi-Newton method for
[105] J. Langford, L. Li, and T. Zhang, BSparse vector machines,[ J. Mach. Learn., vol. 11, online convex optimization,[ in Proc.
online learning via truncated gradient,[ pp. 1663–1707, 2010. 11th Int. Conf. Artif. Intell. Stat., 2007,
J. Mach. Learn. Res., vol. 10, pp. 771–801, [120] A. Nedić, D. P. Bertsekas, and V. S. Borkar, pp. 433–440.
2009. BDistributed asynchronous incremental [134] P.-J. Chen, BNewton methods for conditional
[106] S. Shalev-Shwartz and A. Tewari, subgradient methods,[ Studies Comput. random fields,[ M.S. thesis, Dept. Comput.
BStochastic methods for L1 -regularized Math., vol. 8, pp. 381–407, 2001. Sci. Inf. Eng., National Taiwan University,
loss minimization,[ J. Mach. Learn. Res., [121] J. Langford, A. Smola, and M. Zinkevich, Taipei, Taiwan, 2009.
vol. 12, pp. 1865–1892, 2011. BSlow learners are fast,[ in Advances in [135] T. Joachims, T. Finley, and C.-N. J. Yu,
[107] Y. E. Nesterov, BPrimal-dual subgradient Neural Information Processing Systems 22, BCutting-plane training of structural SVMs,[
methods for convex problems,[ Math. Y. Bengio, D. Schuurmans, J. Lafferty, J. Mach. Learn., vol. 77, no. 1, 2008,
Programm., vol. 120, no. 1, pp. 221–259, C. K. I. Williams, and A. Culotta, Eds. DOI: 10.1007/s10994-009-5108-8.
2009. Cambridge, MA: MIT Press, 2009, [136] J. E. Kelley, BThe cutting-plane method for
[108] J. Duchi and Y. Singer, BEfficient online pp. 2331–2339. solving convex programs,[ J. Soc. Ind. Appl.
and batch learning using forward backward [122] A. Agarwal and J. Duchi, BDistributed Math., vol. 8, no. 4, pp. 703–712, 1960.
splitting,[ J. Mach. Learn. Res., vol. 10, delayed stochastic optimization,[ in [137] N. D. Ratliff, J. A. Bagnell, and
pp. 2899–2934, 2009. Advances in Neural Information Processing M. A. Zinkevich, B(Online) subgradient
[109] J. Duchi, E. Hazan, and Y. Singer, BAdaptive Systems 24. Cambridge, MA: MIT Press, methods for structured prediction,[ in
subgradient methods for online learning 2011. Proc. 11th Int. Conf. Artif. Intell. Stat., 2007,
and stochastic optimization,[ J. Mach. Learn. [123] H. Yu, J. Yang, and J. Han, BClassifying large pp. 380–387.
Res., vol. 12, pp. 2121–2159, 2011. data sets using SVMs with hierarchical [138] V. Vapnik, Statistical Learning Theory.
[110] L. Xiao, BDual averaging methods for clusters,[ in Proc. 9th ACM SIGKDD Int. Conf. New York: Wiley, 1998.
regularized stochastic learning and online Knowl. Disc. Data Mining, 2003, pp. 306–315.
[139] I. Daubechies, M. Defrise, and C. De Mol,
optimization,[ J. Mach. Learn. Res., vol. 11, [124] L. Breiman, BBagging predictors,[ Mach. BAn iterative thresholding algorithm for
pp. 2543–2596, 2010. Learn., vol. 24, no. 2, pp. 123–140, linear inverse problems with a sparsity
[111] J. K. Bradley, A. Kyrola, D. Bickson, and Aug. 1996. constraint,[ Commun. Pure Appl. Math.,
C. Guestrin, BParallel coordinate descent [125] D. Chakrabarti, D. Agarwal, and vol. 57, pp. 1413–1457, 2004.
for L1 -regularized loss minimization,[ in V. Josifovski, BContextual advertising by [140] M. A. T. Figueiredo, R. Nowak, and
Proc. 28th Int. Conf. Mach. Learn., 2011, combining relevance with click feedback,[ in S. Wright, BGradient projection for sparse
pp. 321–328. Proc. 17th Int. Conf. World Wide Web, 2008, reconstruction: Applications to compressed
[112] J. Langford, L. Li, and A. Strehl, Vowpal pp. 417–426. sensing and other inverse problems,[ IEEE J.
Wabbit, 2007. [Online]. Available: [126] M. Zinkevich, M. Weimer, A. Smola, and Sel. Top. Signal Process., vol. 1, no. 4,
https://github.com/JohnLangford/vowpal_ L. Li, BParallelized stochastic gradient pp. 586–598, Dec. 2007.
wabbit/wiki. descent,[ in Advances in Neural Information [141] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and
[113] S. Tong, Lessons Learned Developing a Processing Systems 23, J. Lafferty, D. Gorinevsky, BAn interior point method
Practical Large Scale Machine Learning C. K. I. Williams, J. Shawe-Taylor, R. Zemel, for large-scale L1 -regularized least squares,[
System, Google Research Blog, 2010. and A. Culotta, Eds. Cambridge, MA: IEEE J. Sel. Top. Signal Process., vol. 1, no. 4,
[Online]. Available: http://googleresearch. MIT Press, 2010, pp. 2595–2603. pp. 606–617, Dec. 2007.
blogspot.com/2010/04/lessons-learned- [127] R. McDonald, K. Hall, and G. Mann, [142] J. Duchi, S. Shalev-Shwartz, Y. Singer, and
developing-practical.html. BDistributed training strategies for the T. Chandra, BEfficient projections onto
[114] M. Ferris and T. Munson, BInterior structured perceptron,[ in Proc. 48th the L1 -ball for learning in high dimensions,[
point methods for massive support vector Annu. Meeting Assoc. Comput. Linguist., in Proc. 25th Int. Conf. Mach. Learn., 2008,
machines,[ SIAM J. Optim., vol. 13, no. 3, 2010, pp. 456–464. DOI: 10.1145/1390156.1390191.
pp. 783–804, 2003. [128] J. Dean and S. Ghemawat, BMapReduce: [143] A. Beck and M. Teboulle, BA fast iterative
[115] K.-W. Chang and D. Roth, BSelective Simplified data processing on large clusters,[ shrinkage-thresholding algorithm for linear
block minimization for faster convergence of Commun. ACM, vol. 51, no. 1, pp. 107–113, inverse problems,[ SIAM J. Imag. Sci., vol. 2,
limited memory large-scale linear models,[ 2008. no. 1, pp. 183–202, 2009.
in Proc. 17th ACM SIGKDD Int. Conf. Knowl. [129] I. Tsochantaridis, T. Joachims, T. Hofmann,
and Y. Altun, BLarge margin methods

2602 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Yuan et al.: Recent Advances of Large-Scale Linear Classification

ABOUT THE AUTHORS

Guo-Xun Yuan received the B.S. degree in Chih-Jen Lin (Fellow, IEEE) received the B.S. de-
computer science from the National Tsinghua gree in mathematics from the National Taiwan
University, Hsinchu, Taiwan, and the M.S. degree University, Taipei, Taiwan, in 1993 and the Ph.D.
in computer science from the National Taiwan degree in industrial and operations engineering
University, Taipei, Taiwan. He is currently working from the University of Michigan, Ann Arbor, in
towards the Ph.D. degree at the University of 1998.
California Davis, Davis. He is currently a Distinguished Professor at the
His research interest is large-scale data Department of Computer Science, National Taiwan
classification. University. His major research areas include
machine learning, data mining, and numerical
optimization. He is best known for his work on support vector machines
Chia-Hua Ho received the B.S. degree in computer (SVMs) for data classification. His software LIBSVM is one of the most
science from the National Taiwan University, widely used and cited SVM packages. Nearly all major companies apply
Taipei, Taiwan, in 2010, where he is currently his software for classification and regression applications.
working towards the M.S. degree at the Depart- Prof. Lin received many awards for his research work. A recent one is
ment of Computer Science. the ACM KDD 2010 best paper award. He is an Association for Computing
His research interests are machine learning Machinery (ACM) distinguished scientist for his contribution to machine
and data mining. learning algorithms and software design. More information about him
and his software tools can be found at http://www.csie.ntu.edu.tw/~cjlin.

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2603

03 - Non Linear Classifiers PDF
No ratings yet
03 - Non Linear Classifiers PDF
38 pages
Support Vector Machines For Classification and Regression: Steve R. Gunn
No ratings yet
Support Vector Machines For Classification and Regression: Steve R. Gunn
66 pages
Fast Kernel Classifiers
No ratings yet
Fast Kernel Classifiers
41 pages
Pattern Recognition - Theodoridis Koutroumbas
No ratings yet
Pattern Recognition - Theodoridis Koutroumbas
641 pages
Max A. Little Machine Learning For Signal Processing Data Science Algorithms and Computational Statistics Oxford University Press USA 2019
100% (1)
Max A. Little Machine Learning For Signal Processing Data Science Algorithms and Computational Statistics Oxford University Press USA 2019
378 pages
Supervised Learning
No ratings yet
Supervised Learning
187 pages
Numerical Methods For Ordinary Differential Equations
100% (2)
Numerical Methods For Ordinary Differential Equations
134 pages
Ijcsea 2
No ratings yet
Ijcsea 2
13 pages
Lecture 14
No ratings yet
Lecture 14
20 pages
An Introduction To Statistical Learning PDF
No ratings yet
An Introduction To Statistical Learning PDF
35 pages
U21amg05 Aif and ML Unit 04 Notes
No ratings yet
U21amg05 Aif and ML Unit 04 Notes
42 pages
SVM
No ratings yet
SVM
59 pages
Question Bank
No ratings yet
Question Bank
5 pages
Support Vector Machines and Perceptrons Learning, Optimization, Classification, and Application To Social Networks
No ratings yet
Support Vector Machines and Perceptrons Learning, Optimization, Classification, and Application To Social Networks
103 pages
ML Unit-4
No ratings yet
ML Unit-4
28 pages
ML - Module 3
No ratings yet
ML - Module 3
58 pages
CMPE 442 Introduction To Machine Learning: Support Vector Machines
No ratings yet
CMPE 442 Introduction To Machine Learning: Support Vector Machines
64 pages
C. Cifarelli Et Al - Incremental Classification With Generalized Eigenvalues
No ratings yet
C. Cifarelli Et Al - Incremental Classification With Generalized Eigenvalues
25 pages
08classification I
No ratings yet
08classification I
52 pages
Chapter 4. Classification Algorithms-Stud
No ratings yet
Chapter 4. Classification Algorithms-Stud
43 pages
SVM Tutorial
No ratings yet
SVM Tutorial
34 pages
Chapter Classification
No ratings yet
Chapter Classification
12 pages
3.unit 3 ML Part-2 Q&A
No ratings yet
3.unit 3 ML Part-2 Q&A
23 pages
MLT Unit-2
No ratings yet
MLT Unit-2
30 pages
IML-IITKGP - Assignment 5 Solution
No ratings yet
IML-IITKGP - Assignment 5 Solution
7 pages
Chapter 2: Artificial Intelligence (Deep Learning and Machine Learning)
No ratings yet
Chapter 2: Artificial Intelligence (Deep Learning and Machine Learning)
9 pages
ML 41
No ratings yet
ML 41
49 pages
Support Vector Machines For Classification and Regression: Steve R. Gunn
No ratings yet
Support Vector Machines For Classification and Regression: Steve R. Gunn
66 pages
Deblurring Images, Matrices, Spectra, and Filtering (Fundamentals of Algorithms)
No ratings yet
Deblurring Images, Matrices, Spectra, and Filtering (Fundamentals of Algorithms)
145 pages
IOML Ch-5
No ratings yet
IOML Ch-5
11 pages
Modern Machine Learning in Python
No ratings yet
Modern Machine Learning in Python
50 pages
MLT Essentials
No ratings yet
MLT Essentials
32 pages
Statistical Inference For Engineers and Data Scientists - Pierre Moulin - Venugopal v. Veeravalli (2019)
100% (1)
Statistical Inference For Engineers and Data Scientists - Pierre Moulin - Venugopal v. Veeravalli (2019)
421 pages
Comparative Study of Four Supervised Machine Learning Techniques For Classification
No ratings yet
Comparative Study of Four Supervised Machine Learning Techniques For Classification
15 pages
Slide 10 Chapter9 Classification Advanced Methods
No ratings yet
Slide 10 Chapter9 Classification Advanced Methods
46 pages
Unit - 4-NNDL - Notes
No ratings yet
Unit - 4-NNDL - Notes
14 pages
Matt Carrick - Foundations of Digital Signal Processing - Complex Numbers
No ratings yet
Matt Carrick - Foundations of Digital Signal Processing - Complex Numbers
83 pages
An Introduction Of: Support Vector Machine
No ratings yet
An Introduction Of: Support Vector Machine
36 pages
SVM Tutorial
100% (1)
SVM Tutorial
34 pages
Support Vector Machines: Vibhav Gogate The University of Texas at Dallas
No ratings yet
Support Vector Machines: Vibhav Gogate The University of Texas at Dallas
36 pages
20MEMECH Part 3 - Classification
No ratings yet
20MEMECH Part 3 - Classification
49 pages
ML Qa
No ratings yet
ML Qa
10 pages
Machine Learning
No ratings yet
Machine Learning
11 pages
Machine Learning Section4 Ebook v03
No ratings yet
Machine Learning Section4 Ebook v03
20 pages
Unit 3 in Machine Intelligence
No ratings yet
Unit 3 in Machine Intelligence
62 pages
Ai GFG
No ratings yet
Ai GFG
24 pages
Backpropagation - Theory, Architectures, and Applications - Yves Chauvin (Ed.), David E. Rumelhart (Ed.) - Lawrence Erlbaum Associates (1995)
No ratings yet
Backpropagation - Theory, Architectures, and Applications - Yves Chauvin (Ed.), David E. Rumelhart (Ed.) - Lawrence Erlbaum Associates (1995)
575 pages
22-Kernel Tricks Shit
No ratings yet
22-Kernel Tricks Shit
43 pages
Support Vector Machine: Abinas Panda
No ratings yet
Support Vector Machine: Abinas Panda
52 pages
CH 5 SVM
No ratings yet
CH 5 SVM
25 pages
Numerical Analysis - Historical Developments in The 20th Century - C. Brezinski and L. Wuytack (Auth.) - Elsevier Science (2001)
No ratings yet
Numerical Analysis - Historical Developments in The 20th Century - C. Brezinski and L. Wuytack (Auth.) - Elsevier Science (2001)
497 pages
Learning From Delayed Rewards（1989）-可选定
No ratings yet
Learning From Delayed Rewards（1989）-可选定
241 pages
Optimization For Data Analysis Stephen J Wright Benjamin Recht Instant Download
No ratings yet
Optimization For Data Analysis Stephen J Wright Benjamin Recht Instant Download
85 pages
ML Unit-4
No ratings yet
ML Unit-4
20 pages
Machine Learning
No ratings yet
Machine Learning
32 pages
Support Vector Machine
No ratings yet
Support Vector Machine
45 pages
Unit 4 ML
No ratings yet
Unit 4 ML
11 pages
Taz TFG 2016 2057
No ratings yet
Taz TFG 2016 2057
52 pages
Ch-2 Supervised Machine Learning
No ratings yet
Ch-2 Supervised Machine Learning
48 pages
Automatic Design of Decision-Tree Induction Algorithms - Rodrigo C. Barros, André C.P.L.F de Carvalho, Alex A. Freitas (Auth.)
No ratings yet
Automatic Design of Decision-Tree Induction Algorithms - Rodrigo C. Barros, André C.P.L.F de Carvalho, Alex A. Freitas (Auth.)
184 pages
Lecture 18 - SVM
No ratings yet
Lecture 18 - SVM
54 pages
Support Vector Machine: Prof. Subodh Kumar Mohanty
No ratings yet
Support Vector Machine: Prof. Subodh Kumar Mohanty
52 pages
11597-Article Text-7395-2-10-20220813
No ratings yet
11597-Article Text-7395-2-10-20220813
5 pages
NMF Tutorial
No ratings yet
NMF Tutorial
189 pages
Ain3001 - 04 - Support - Vector.machines
No ratings yet
Ain3001 - 04 - Support - Vector.machines
50 pages
SVM
No ratings yet
SVM
36 pages
MLRS Assignment 1 24070146008 Sreemanth Mannem
No ratings yet
MLRS Assignment 1 24070146008 Sreemanth Mannem
12 pages
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
No ratings yet
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
14 pages
Patterns of Scalable Bayesian Inference
No ratings yet
Patterns of Scalable Bayesian Inference
133 pages
Gradnorm: Gradient Normalization For Adaptive Loss Balancing in Deep Multitask Networks
No ratings yet
Gradnorm: Gradient Normalization For Adaptive Loss Balancing in Deep Multitask Networks
12 pages
An Introduction Of: Support Vector Machine
No ratings yet
An Introduction Of: Support Vector Machine
36 pages
Multi Class Classification
No ratings yet
Multi Class Classification
79 pages
Lecture 9
No ratings yet
Lecture 9
27 pages
On-Line Q-Learning Using Connectionist Systems
No ratings yet
On-Line Q-Learning Using Connectionist Systems
21 pages
A Framework For Robust Subspace Learning
No ratings yet
A Framework For Robust Subspace Learning
47 pages
Probability With Applications in Engineering, Science, and Technology, 2nd (Instructor's Solution Manual) - Matthew A. Carlton
100% (1)
Probability With Applications in Engineering, Science, and Technology, 2nd (Instructor's Solution Manual) - Matthew A. Carlton
400 pages
PEST Manual
No ratings yet
PEST Manual
336 pages
SVM PRESENTATION
No ratings yet
SVM PRESENTATION
34 pages
Mt2Dinvmatlab-A Program in Matlab and Fortran For Two-Dimensional Magnetotelluric Inversion
No ratings yet
Mt2Dinvmatlab-A Program in Matlab and Fortran For Two-Dimensional Magnetotelluric Inversion
14 pages
SVM Tutorial
No ratings yet
SVM Tutorial
31 pages
A Unified Approach To Interpreting Model Predictions
No ratings yet
A Unified Approach To Interpreting Model Predictions
9 pages
CS-13410 Introduction To Machine Learning
No ratings yet
CS-13410 Introduction To Machine Learning
33 pages
Deep Learning - Summary - Deep - Learning
No ratings yet
Deep Learning - Summary - Deep - Learning
17 pages
子空间学习机（SLM）：方法论和性能
No ratings yet
子空间学习机（SLM）：方法论和性能
12 pages
A Farewell To The Bias-Variance Tradeoff? An Overview of The Theory of Overparameterized Machine Learning
No ratings yet
A Farewell To The Bias-Variance Tradeoff? An Overview of The Theory of Overparameterized Machine Learning
48 pages
Machine Learning in Medicine - A Practical Introduction
No ratings yet
Machine Learning in Medicine - A Practical Introduction
18 pages
Regression Analysis Lasso and Ridge Regression 1678810035
No ratings yet
Regression Analysis Lasso and Ridge Regression 1678810035
18 pages
ML1 Skript 2023
No ratings yet
ML1 Skript 2023
97 pages
DIP Solved Questions
No ratings yet
DIP Solved Questions
13 pages
Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances
No ratings yet
Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances
13 pages
Report 102 Intro Start
No ratings yet
Report 102 Intro Start
73 pages
LSTD A Low-Shot Transfer Detector For Object Detection
No ratings yet
LSTD A Low-Shot Transfer Detector For Object Detection
8 pages
Week-1 ML Slides
No ratings yet
Week-1 ML Slides
16 pages
Unit 2
No ratings yet
Unit 2
125 pages
Zhang Learning Fast Sample Re-Weighting Without Reward Data ICCV 2021 Paper
No ratings yet
Zhang Learning Fast Sample Re-Weighting Without Reward Data ICCV 2021 Paper
10 pages
Videomage:: Multi-Subject and Motion Customization of Text-To-Video Diffusion Models
No ratings yet
Videomage:: Multi-Subject and Motion Customization of Text-To-Video Diffusion Models
14 pages
DL QB 24-25
No ratings yet
DL QB 24-25
3 pages
Data Mining and Machine Learning
No ratings yet
Data Mining and Machine Learning
48 pages
Ml-Unit 2-QB
No ratings yet
Ml-Unit 2-QB
6 pages
Machine Learning - AL3451 - Important Questions With Answer
No ratings yet
Machine Learning - AL3451 - Important Questions With Answer
27 pages
A Seismic Sensor Based Human Activity Recognition Framework Using Deep Learning
No ratings yet
A Seismic Sensor Based Human Activity Recognition Framework Using Deep Learning
8 pages
SVM Tutorial
No ratings yet
SVM Tutorial
34 pages
Differential Equations
From Everand
Differential Equations
Harry Hochstadt
3.5/5 (2)
Pathways to Machine Learning and Soft Computing: 邁向機器學習與軟計算之路（國際英文版）
From Everand
Pathways to Machine Learning and Soft Computing: 邁向機器學習與軟計算之路（國際英文版）
Jyh-Horng Jeng
No ratings yet

1、Recent Advances of Large-Scale Linear Classification

Uploaded by

1、Recent Advances of Large-Scale Linear Classification

Uploaded by

INVITED

Recent Advances of Large-Scale

where w is the weight vector and b is an intercept, or

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2585

III . BINARY LINEAR

and the decision function becomes dðxÞ wT x.

A. Support Vector Machines and Logistic Regression

2586 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2587

2588 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

B. Example: A Subgradient Method (Pegasos With

It can be used for batch and online learning. Here we

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2589

the Hessian matrix r2 f ðwÞ need not be stored.

D. Example: Solving Dual SVM by Coordinate

min f D ðA þ dei Þ f D ðAÞ

2590 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

initially, but The linear convergence of Algorithm 3 is established in

E. Example: Solving L1-Regularized Problems by

Therefore, the total cost for updating an i is OðnÞ. The X

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2591

Table 2 A Comparison of the Four Methods in Section IV-B–E

2592 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

class of x arg max wTm x: (32) 1X k Xl 

The cost for testing an instance is OðnkÞ.

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2593

2594 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

A. Training and Testing Explicit Data Mappings via

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2595

VI I. TRAINING LARGE DATA B EYOND If 1 yi wT xi > 0;

2596 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2597

For example, in the application of web advertising,

2598 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2599

REFERENCES solution may be sparse,[ IEEE Trans. Inf. http://www.csie.ntu.edu.tw/~cjlin/papers/

2600 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2601

2602 Proceedings of the IEEE | Vol. 100, No. 9, September 2012

ABOUT THE AUTHORS

Vol. 100, No. 9, September 2012 | Proceedings of the IEEE 2603

You might also like

class of x arg max wTm x: (32) 1X k Xl