Journal of Statistical Software
Regularization Paths for Generalized Linear Models via Coordinate Descent
Abstract
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems, while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
Keywords: lasso, elastic net, logistic regression, ℓ1 penalty, regularization path, coordinate descent.
1. Introduction
The lasso (Tibshirani 1996) is a popular method for regression that uses an ℓ1 penalty to
achieve a sparse solution. In the signal processing literature, the lasso is also known as basis
pursuit (Chen et al. 1998). This idea has been broadly applied, for example to general-
ized linear models (Tibshirani 1996) and Cox’s proportional hazard models for survival data
(Tibshirani 1997). In recent years, there has been an enormous amount of research activity
devoted to related regularization methods:
1. The grouped lasso (Yuan and Lin 2007; Meier et al. 2008), where variables are included
or excluded in groups.
2. The Dantzig selector (Candes and Tao 2007, and discussion), a slightly modified version
of the lasso.
3. The elastic net (Zou and Hastie 2005) for correlated variables, which uses a penalty that
is part ℓ1, part ℓ2.
4. ℓ1 regularization paths for generalized linear models (Park and Hastie 2007a).
5. Methods using non-concave penalties, such as SCAD (Fan and Li 2005) and Friedman's
generalized elastic net (Friedman 2008), enforce more severe variable selection than the
lasso.
6. The regularization path for the support vector machine (Hastie et al. 2004).
7. The graphical lasso (Friedman et al. 2008) for sparse covariance estimation and undi-
rected graphs.
Efron et al. (2004) developed an efficient algorithm for computing the entire regularization
path for the lasso for linear regression models. Their algorithm exploits the fact that the coef-
ficient profiles are piecewise linear, which leads to an algorithm with the same computational
cost as the full least-squares fit on the data (see also Osborne et al. 2000).
In some of the extensions above (items 2,3, and 6), piecewise-linearity can be exploited as in
Efron et al. (2004) to yield efficient algorithms. Rosset and Zhu (2007) characterize the class
of problems where piecewise-linearity exists—both the loss function and the penalty have to
be quadratic or piecewise linear.
Here we instead focus on cyclical coordinate descent methods. These methods have been
proposed for the lasso a number of times, but only recently was their power fully appreciated.
Early references include Fu (1998), Shevade and Keerthi (2003) and Daubechies et al. (2004).
Van der Kooij (2007) independently used coordinate descent for solving elastic-net penalized
regression models. Recent rediscoveries include Friedman et al. (2007) and Wu and Lange
(2008). The first paper recognized the value of solving the problem along an entire path of
values for the regularization parameters, using the current estimates as warm starts. This
strategy turns out to be remarkably efficient for this problem. Several other researchers have
also re-discovered coordinate descent, many for solving the same problems we address in this
paper—notably Shevade and Keerthi (2003), Krishnapuram and Hartemink (2005), Genkin
et al. (2007) and Wu et al. (2009).
In this paper we extend the work of Friedman et al. (2007) and develop fast algorithms
for fitting generalized linear models with elastic-net penalties. In particular, our models
include regression, two-class logistic regression, and multinomial regression problems. Our
algorithms can work on very large datasets, and can take advantage of sparsity in the feature
set. We provide a publicly available package glmnet (Friedman et al. 2009) implemented in
the R programming system (R Development Core Team 2009). We do not revisit the well-
established convergence properties of coordinate descent in convex problems (Tseng 2001) in
this article.
Lasso procedures are frequently used in domains with very large datasets, such as genomics
and web analysis. Consequently a focus of our research has been algorithmic efficiency and
speed. We demonstrate through simulations that our procedures outperform all competitors
— even those based on coordinate descent.
In Section 2 we present the algorithm for the elastic net, which includes the lasso and ridge
regression as special cases. Sections 3 and 4 discuss (two-class) logistic regression and multi-
nomial logistic regression. Comparative timings are presented in Section 5.
Although the title of this paper advertises regularization paths for GLMs, we only cover three
important members of this family. However, exactly the same technology extends trivially to
other members of the exponential family, such as the Poisson model. We plan to extend our
software to cover these important other cases, as well as the Cox model for survival data.
Note that this article is about algorithms for fitting particular families of models, and not
about the statistical properties of these models themselves. Such discussions have taken place
elsewhere.
where

    P_α(β) = (1 − α)·(1/2)‖β‖²_{ℓ2} + α‖β‖_{ℓ1}    (2)
           = Σ_{j=1}^p [ (1/2)(1 − α)β_j² + α|β_j| ].    (3)
Pα is the elastic-net penalty (Zou and Hastie 2005), and is a compromise between the ridge-
regression penalty (α = 0) and the lasso penalty (α = 1). This penalty is particularly useful
in the p ≫ N situation, or any situation where there are many correlated predictor variables.¹
Ridge regression is known to shrink the coefficients of correlated predictors towards each
other, allowing them to borrow strength from each other. In the extreme case of k identical
predictors, they each get identical coefficients with 1/kth the size that any single one would
get if fit alone. From a Bayesian point of view, the ridge penalty is ideal if there are many
predictors, and all have non-zero coefficients (drawn from a Gaussian distribution).
Lasso, on the other hand, is somewhat indifferent to very correlated predictors, and will tend
to pick one and ignore the rest. In the extreme case above, the lasso problem breaks down.
The lasso penalty corresponds to a Laplace prior, which expects many coefficients to be close
to zero, and a small subset to be larger and nonzero.
The elastic net with α = 1 − ε for some small ε > 0 performs much like the lasso, but removes
any degeneracies and wild behavior caused by extreme correlations. More generally, the entire
family Pα creates a useful compromise between ridge and lasso. As α increases from 0 to 1,
for a given λ the sparsity of the solution to (1) (i.e., the number of coefficients equal to zero)
increases monotonically from 0 to the sparsity of the lasso solution.
Figure 1 shows an example that demonstrates the effect of varying α. The dataset is from
(Golub et al. 1999), consisting of 72 observations on 3571 genes measured with DNA microar-
rays. The observations fall in two classes, so we use the penalties in conjunction with the
¹ Zou and Hastie (2005) called this penalty the naive elastic net, and preferred a rescaled version which they called elastic net. We drop this distinction here.
Figure 1: Leukemia data: profiles of estimated coefficients for three methods, showing only
first 10 steps (values for λ) in each case. For the elastic net, α = 0.2.
logistic regression models of Section 3. The coefficient profiles from the first 10 steps (grid
values for λ) for each of the three regularization methods are shown. The lasso penalty admits
at most N = 72 genes into the model, while ridge regression gives all 3571 genes non-zero
coefficients. The elastic-net penalty provides a compromise between these two, and has the
effect of averaging genes that are highly correlated and then entering the averaged gene into
the model. Using the algorithm described below, computation of the entire path of solutions
for each method, at 100 values of the regularization parameter evenly spaced on the log-scale,
took under a second in total. Because of the large number of non-zero coefficients for the
ridge penalty, they are individually much smaller than the coefficients for the other methods.
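The following minimal R sketch shows how coefficient paths of this kind can be computed with the glmnet package; simulated data stand in for the leukemia microarray data (an assumption for illustration only), and glmnet's default grid of 100 λ values is used.

library(glmnet)
set.seed(1)
N <- 72; p <- 500                       # simulated stand-in, smaller than the 72 x 3571 real data
x <- matrix(rnorm(N * p), N, p)
y <- factor(sample(c("ALL", "AML"), N, replace = TRUE))
fit.lasso <- glmnet(x, y, family = "binomial", alpha = 1)     # lasso
fit.ridge <- glmnet(x, y, family = "binomial", alpha = 0)     # ridge
fit.enet  <- glmnet(x, y, family = "binomial", alpha = 0.2)   # elastic net, as in Figure 1
plot(fit.lasso, xvar = "lambda")        # coefficient profiles along the regularization path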
Consider a coordinate descent step for solving (1). That is, suppose we have estimates β̃0 and
β̃_ℓ for ℓ ≠ j, and we wish to partially optimize with respect to β_j. We would like to compute
the gradient at β_j = β̃_j, which only exists if β̃_j ≠ 0. If β̃_j > 0, then
    ∂R_λ/∂β_j |_{β=β̃} = −(1/N) Σ_{i=1}^N x_{ij}(y_i − β̃_0 − x_i^⊤β̃) + λ(1 − α)β̃_j + λα.    (4)
A similar expression exists if β̃j < 0, and β̃j = 0 is treated separately. Simple calculus shows
(Donoho and Johnstone 1994) that the coordinate-wise update has the form

    β̃_j ← S( (1/N) Σ_{i=1}^N x_{ij}(y_i − ỹ_i^{(j)}), λα ) / ( 1 + λ(1 − α) ),    (5)

where S(z, γ) = sign(z)(|z| − γ)_+ is the soft-thresholding operator, ỹ_i^{(j)} = β̃_0 + Σ_{ℓ≠j} x_{iℓ}β̃_ℓ is the fitted value excluding the contribution from x_{ij}, and hence y_i − ỹ_i^{(j)} is the partial residual for fitting β_j. Because of the standardization, (1/N) Σ_{i=1}^N x_{ij}(y_i − ỹ_i^{(j)}) is the simple least-squares coefficient when fitting this partial residual to x_{ij}.
The details of this derivation are spelled out in Friedman et al. (2007).
Thus we compute the simple least-squares coefficient on the partial residual, apply soft-
thresholding to take care of the lasso contribution to the penalty, and then apply a pro-
portional shrinkage for the ridge penalty. This algorithm was suggested by Van der Kooij
(2007).
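As a concrete illustration, here is a self-contained R sketch of this naive update, assuming the columns of x are standardized so that (1/N) Σ_i x_{ij}² = 1 and y is centered (so no intercept is needed). It runs a fixed number of sweeps for simplicity and is not the Fortran implementation used in glmnet.

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)    # soft-thresholding S(z, gamma)

cd.naive <- function(x, y, lambda, alpha, niter = 100) {
  N <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)
  r <- y                                    # current residuals (all coefficients start at 0)
  for (it in 1:niter) {
    for (j in 1:p) {
      zj <- sum(x[, j] * r) / N + beta[j]   # least-squares coefficient on the partial residual, via (8)
      bj <- soft(zj, lambda * alpha) / (1 + lambda * (1 - alpha))   # update (5)
      if (bj != beta[j]) {                  # only touch the residuals if beta_j actually moves
        r <- r - x[, j] * (bj - beta[j])
        beta[j] <- bj
      }
    }
  }
  beta
}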
where ŷi is the current fit of the model for observation i, and hence ri the current residual.
Thus

    (1/N) Σ_{i=1}^N x_{ij}(y_i − ỹ_i^{(j)}) = (1/N) Σ_{i=1}^N x_{ij} r_i + β̃_j,    (8)
because the xj are standardized. The first term on the right-hand side is the gradient of
the loss with respect to βj . It is clear from (8) why coordinate descent is computationally
efficient. Many coefficients are zero, remain zero after the thresholding, and so nothing needs
to be changed. Such a step costs O(N) operations: the sum needed to compute the gradient. On the other hand, if a coefficient does change after the thresholding, r_i is updated in O(N) operations and the step costs O(2N). Thus a complete cycle through all p variables costs O(pN) operations.
We refer to this as the naive algorithm, since it is generally less efficient than the covariance
updating algorithm to follow. Later we use these algorithms in the context of iteratively
reweighted least squares (IRLS), where the observation weights change frequently; there the
naive algorithm dominates.
    Σ_{i=1}^N x_{ij} r_i = ⟨x_j, y⟩ − Σ_{k:|β̃_k|>0} ⟨x_j, x_k⟩ β̃_k,    (9)

where ⟨x_j, y⟩ = Σ_{i=1}^N x_{ij} y_i. Hence we need to compute inner products of each feature with y
initially, and then each time a new feature xk enters the model (for the first time), we need
to compute and store its inner product with all the rest of the features (O(N p) operations).
We also store the p gradient components (9). If one of the coefficients currently in the model
changes, we can update each gradient in O(p) operations. Hence with m non-zero terms in
the model, a complete cycle costs O(pm) operations if no new variables become non-zero, and
costs O(N p) for each new variable entered. Importantly, O(N ) calculations do not have to
be made at every step. This is the case for all penalized procedures with squared error loss.
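A corresponding R sketch of the covariance-updating bookkeeping is given below, under the same assumptions (standardized x, centered y). For simplicity it allocates a full p × p cache, whereas the actual algorithm stores inner products only for variables that have entered the model; it is an illustration, not the glmnet implementation.

cd.cov <- function(x, y, lambda, alpha, niter = 100) {
  soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)
  N <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)
  g  <- drop(crossprod(x, y)) / N          # (1/N) <x_j, y>: gradients at beta = 0
  XX <- matrix(NA_real_, p, p)             # lazy cache for (1/N) <x_j, x_k>
  for (it in 1:niter) {
    for (j in 1:p) {
      zj <- g[j] + beta[j]                 # same quantity as in the naive update, via (8) and (9)
      bj <- soft(zj, lambda * alpha) / (1 + lambda * (1 - alpha))
      d  <- bj - beta[j]
      if (d != 0) {
        if (anyNA(XX[, j]))                # first time x_j becomes active: O(Np) once
          XX[, j] <- drop(crossprod(x, x[, j])) / N
        g <- g - XX[, j] * d               # update all p gradient components in O(p)
        beta[j] <- bj
      }
    }
  }
  beta
}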
not alter the sparsity, but centering will. So scaling is performed up front, but the centering
is incorporated in the algorithm in an efficient and obvious manner.
    β̃_j ← S( Σ_{i=1}^N w_i x_{ij}(y_i − ỹ_i^{(j)}), λα ) / ( Σ_{i=1}^N w_i x_{ij}² + λ(1 − α) ).    (10)
If the xj are not standardized, there is a similar sum-of-squares term in the denominator
(even without weights). The presence of weights does not change the computational costs of
either algorithm much, as long as the weights remain fixed.
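For illustration, the weighted update (10) can be written as a small stand-alone R helper; the argument names xj (the jth column of x), pr (the partial residual y − ỹ^{(j)}) and w (the observation weights) are ours.

wls.update <- function(xj, pr, w, lambda, alpha) {
  soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)
  # weighted soft-thresholded update for one coordinate, as in (10)
  soft(sum(w * xj * pr), lambda * alpha) / (sum(w * xj^2) + lambda * (1 - alpha))
}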
Pr(G = 2|x) = 1 − Pr(G = 1|x).

Alternatively, this implies that

    log[ Pr(G = 1|x) / Pr(G = 2|x) ] = β_0 + x^⊤β.    (12)
Here we fit this model by regularized maximum (binomial) likelihood. Let p(xi ) = Pr(G =
1|xi ) be the probability (11) for observation i at a particular value for the parameters (β0 , β),
then we maximize the penalized log likelihood
    max_{(β_0,β) ∈ R^{p+1}}  { (1/N) Σ_{i=1}^N [ I(g_i = 1) log p(x_i) + I(g_i = 2) log(1 − p(x_i)) ] − λ P_α(β) }.    (13)
Denoting yi = I(gi = 1), the log-likelihood part of (13) can be written in the more explicit
form
    ℓ(β_0, β) = (1/N) Σ_{i=1}^N [ y_i·(β_0 + x_i^⊤β) − log(1 + e^{β_0 + x_i^⊤β}) ],    (14)
a concave function of the parameters. The Newton algorithm for maximizing the (unpe-
nalized) log-likelihood (14) amounts to iteratively reweighted least squares. Hence if the
current estimates of the parameters are (β̃0 , β̃), we form a quadratic approximation to the
log-likelihood (Taylor expansion about current estimates), which is
    ℓ_Q(β_0, β) = −(1/(2N)) Σ_{i=1}^N w_i (z_i − β_0 − x_i^⊤β)² + C(β̃_0, β̃),    (15)
where

    z_i = β̃_0 + x_i^⊤β̃ + (y_i − p̃(x_i)) / ( p̃(x_i)(1 − p̃(x_i)) ),    (working response)    (16)
    w_i = p̃(x_i)(1 − p̃(x_i)),    (weights)    (17)
and p̃(xi ) is evaluated at the current parameters. The last term is constant. The Newton
update is obtained by minimizing `Q .
Our approach is similar. For each value of λ, we create an outer loop which computes the
quadratic approximation ℓ_Q about the current parameters (β̃_0, β̃). Then we use coordinate
descent to solve the penalized weighted least-squares problem
    min_{(β_0,β) ∈ R^{p+1}} { −ℓ_Q(β_0, β) + λ P_α(β) }.    (18)
This amounts to a sequence of nested loops:
outer loop: Decrement λ.
middle loop: Update the quadratic approximation ℓ_Q using the current parameters (β̃_0, β̃).
inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares
problem (18).
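The nested loops can be sketched in R as follows. This is a plain, self-contained illustration written for clarity rather than speed, not the glmnet implementation; it uses fixed numbers of passes in place of convergence checks, and the small floor on the weights is our own guard against fitted probabilities of exactly 0 or 1.

logistic.path <- function(x, y, lambdas, alpha, n.middle = 5, n.inner = 50) {
  N <- nrow(x); p <- ncol(x)
  soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)
  b0 <- 0; beta <- rep(0, p)
  path <- vector("list", length(lambdas))
  for (s in seq_along(lambdas)) {               # outer loop: decrement lambda
    lam <- lambdas[s]
    for (m in 1:n.middle) {                     # middle loop: re-form the quadratic approximation
      eta <- b0 + drop(x %*% beta)
      pr  <- 1 / (1 + exp(-eta))
      w   <- pmax(pr * (1 - pr), 1e-5)          # weights (17), floored
      z   <- eta + (y - pr) / w                 # working response (16)
      for (it in 1:n.inner) {                   # inner loop: coordinate descent on (18)
        b0 <- sum(w * (z - drop(x %*% beta))) / sum(w)        # unpenalized intercept
        for (j in 1:p) {
          rj  <- z - b0 - drop(x %*% beta) + x[, j] * beta[j] # partial residual
          num <- sum(w * x[, j] * rj) / N
          den <- sum(w * x[, j]^2) / N + lam * (1 - alpha)
          beta[j] <- soft(num, lam * alpha) / den
        }
      }
    }
    path[[s]] <- c(intercept = b0, beta)        # warm start carried to the next lambda
  }
  path
}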
When p ≫ N, one cannot run λ all the way to zero, because the saturated logistic regression fit is undefined (parameters wander off to ±∞ in order to achieve probabilities of 0 or 1). Hence the default λ sequence runs down to λ_min = ελ_max > 0.
Our code has an option to approximate the Hessian terms by an exact upper-bound.
This is obtained by setting the wi in (17) all equal to 0.25 (Krishnapuram and Hartemink
2005).
We allow the response data to be supplied in the form of a two-column matrix of counts,
sometimes referred to as grouped data. We discuss this in more detail in Section 4.2.
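A minimal sketch of that interface, using simulated counts; here the two columns are (failures, successes), and we take the second column to be the target class (an assumption about the column convention).

library(glmnet)
set.seed(2)
N <- 50; p <- 20
x <- matrix(rnorm(N * p), N, p)
trials    <- sample(5:20, N, replace = TRUE)
successes <- rbinom(N, trials, 0.4)
y <- cbind(failures = trials - successes, successes = successes)  # grouped (count) data
fit <- glmnet(x, y, family = "binomial")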
This parametrization is not estimable without constraints, because for any values for the parameters {β_{0ℓ}, β_ℓ}_1^K, {β_{0ℓ} − c_0, β_ℓ − c}_1^K give identical probabilities (20). Regularization deals with this ambiguity in a natural way; see Section 4.1 below.
We fit the model (20) by regularized maximum (multinomial) likelihood. Using a similar notation as before, let p_ℓ(x_i) = Pr(G = ℓ|x_i), and let g_i ∈ {1, 2, . . . , K} be the ith response.
We maximize the penalized log-likelihood
    max_{ {β_{0ℓ}, β_ℓ}_1^K ∈ R^{K(p+1)} }  { (1/N) Σ_{i=1}^N log p_{g_i}(x_i) − λ Σ_{ℓ=1}^K P_α(β_ℓ) }.    (21)
Denote by Y the N × K indicator response matrix, with elements y_{iℓ} = I(g_i = ℓ). Then we can write the log-likelihood part of (21) in the more explicit form

    ℓ({β_{0ℓ}, β_ℓ}_1^K) = (1/N) Σ_{i=1}^N [ Σ_{ℓ=1}^K y_{iℓ}(β_{0ℓ} + x_i^⊤β_ℓ) − log( Σ_{ℓ=1}^K e^{β_{0ℓ} + x_i^⊤β_ℓ} ) ].    (22)
The Newton algorithm for multinomial regression can be tedious, because of the vector nature
of the response observations. Instead of weights wi as in (17), we get weight matrices, for
example. However, in the spirit of coordinate descent, we can avoid these complexities.
We perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22), allowing only (β_{0ℓ}, β_ℓ) to vary for a single class at a time. It is not hard to show that this is

    ℓ_{Qℓ}(β_{0ℓ}, β_ℓ) = −(1/(2N)) Σ_{i=1}^N w_{iℓ} (z_{iℓ} − β_{0ℓ} − x_i^⊤β_ℓ)² + C({β̃_{0k}, β̃_k}_1^K),    (23)
where as before
    z_{iℓ} = β̃_{0ℓ} + x_i^⊤β̃_ℓ + (y_{iℓ} − p̃_ℓ(x_i)) / ( p̃_ℓ(x_i)(1 − p̃_ℓ(x_i)) ),    (24)
    w_{iℓ} = p̃_ℓ(x_i)(1 − p̃_ℓ(x_i)).    (25)
Our approach is similar to the two-class case, except now we have to cycle over the classes
as well in the outer loop. For each value of λ, we create an outer loop which cycles over ℓ and computes the partial quadratic approximation ℓ_{Qℓ} about the current parameters (β̃_0, β̃).
Then we use coordinate descent to solve the penalized weighted least-squares problem
outer loop: Decrement λ.
middle loop: Cycle over ℓ ∈ {1, 2, . . . , K, 1, 2, . . .}.
middle loop (inner): Update the quadratic approximation ℓ_{Qℓ} using the current parameters {β̃_{0k}, β̃_k}_1^K.
inner loop: Run the co-ordinate descent algorithm on the penalized weighted-least-squares
problem (26).
    min_{c ∈ R^p} Σ_{ℓ=1}^K P_α(β_ℓ − c).    (27)
Theorem 1 Consider problem (28) for values α ∈ [0, 1]. Let β̄_j be the mean of the β_{jℓ}, and β_j^M a median of the β_{jℓ} (and for simplicity assume β̄_j ≤ β_j^M). Then we have β̄_j ≤ c_j ≤ β_j^M, with the left endpoint attained at α = 0 and the right at α = 1.
The two endpoints are obvious. The proof of Theorem 1 is given in Appendix A. A con-
sequence of the theorem is that a very simple search algorithm can be used to solve (28).
The objective is piecewise quadratic, with knots defined by the β_{jℓ}. We need only evaluate
solutions in the intervals including the mean and median, and those in between. We recenter
the parameters in each index set j after each inner middle loop step, using the solution
cj for each j.
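A sketch of this one-dimensional recentering step in R, for a single index j, is given below; it simply exploits convexity and the bracketing interval provided by Theorem 1 rather than enumerating the knots explicitly, and the function name center.j is ours.

center.j <- function(b, alpha) {
  # b is the vector (beta_j1, ..., beta_jK); objective is the per-j term of (27)
  obj <- function(t) sum(0.5 * (1 - alpha) * (b - t)^2 + alpha * abs(b - t))
  lo <- min(mean(b), median(b)); hi <- max(mean(b), median(b))
  if (lo == hi) return(lo)                       # mean and median coincide
  optimize(obj, interval = c(lo, hi))$minimum    # convex, so a bracketed search suffices
}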
Not all the parameters in our model are regularized. The intercepts β0` are not, and with our
penalty modifiers γj (Section 2.6) others need not be as well. For these parameters we use
mean centering.
Equivalently, the data can be presented directly as a matrix of class proportions, along with
a weight vector. From the point of view of the algorithm, any matrix of positive numbers and
any non-negative weight vector will be treated in the same way.
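A minimal sketch of the corresponding glmnet calls on simulated data, supplying the response either as a factor or as a matrix of class counts as described above:

library(glmnet)
set.seed(3)
N <- 200; p <- 30; K <- 4
x <- matrix(rnorm(N * p), N, p)
g <- factor(sample(1:K, N, replace = TRUE))
fit1 <- glmnet(x, g, family = "multinomial", alpha = 0.5)   # response as a factor

Y <- t(rmultinom(N, size = 10, prob = rep(1 / K, K)))       # N x K matrix of class counts
fit2 <- glmnet(x, Y, family = "multinomial", alpha = 0.5)   # response as grouped counts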
5. Timings
In this section we compare the run times of the coordinate-wise algorithm to some competing
algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression
settings. All timings were carried out on an Intel Xeon 2.80 GHz processor.
We do not perform comparisons on the elastic net versions of the penalties, since there is not
much software available for elastic net. Comparisons of our glmnet code with the R package
elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since
elasticnet (Zou and Hastie 2004) is built on the lars package.
where β_j = (−1)^j exp(−2(j − 1)/20), Z ∼ N(0, 1), and k is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
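A sketch of this simulation setup in R is given below; the equicorrelated Gaussian design (pairwise correlation ρ, matching the column headings of Table 1) and the definition of the signal-to-noise ratio as a ratio of standard deviations are our assumptions, and the function name gen.data is ours.

gen.data <- function(N, p, rho, snr = 3) {
  z <- rnorm(N)                                   # shared factor inducing pairwise correlation rho
  x <- sqrt(rho) * z + sqrt(1 - rho) * matrix(rnorm(N * p), N, p)
  beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20) # alternating, exponentially decaying coefficients
  signal <- drop(x %*% beta)
  k <- sd(signal) / snr                           # noise scale giving the stated signal-to-noise ratio
  y <- signal + k * rnorm(N)
  list(x = x, y = y, beta = beta)
}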
Table 1 shows the average CPU timings for the coordinate-wise algorithm, and the lars proce-
dure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise
algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does
much of its work in R, calling Fortran routines for some matrix operations. However com-
parisons in Friedman et al. (2007) showed that lars was actually faster than a version coded
entirely in Fortran. Comparisons between different programs are always tricky: in particular
the lars procedure computes the entire path of solutions, while the coordinate-wise procedure
solves the problem for a set of pre-defined points along the solution path. In the orthogonal
case, lars takes min(N, p) steps: hence to make things roughly comparable, we called the
latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows
timings in seconds averaged over three runs. We see that glmnet is considerably faster than
lars; the covariance-updating version of the algorithm is a little faster than the naive version
when N > p and a little slower when p > N . We had expected that high correlation between
the features would increase the run time of glmnet, but this does not seem to be the case.
Correlation
0 0.1 0.2 0.5 0.9 0.95
N = 1000, p = 100
glmnet (type = "naive") 0.05 0.06 0.06 0.09 0.08 0.07
glmnet (type = "cov") 0.02 0.02 0.02 0.02 0.02 0.02
lars 0.11 0.11 0.11 0.11 0.11 0.11
N = 5000, p = 100
glmnet (type = "naive") 0.24 0.25 0.26 0.34 0.32 0.31
glmnet (type = "cov") 0.05 0.05 0.05 0.05 0.05 0.05
lars 0.29 0.29 0.29 0.30 0.29 0.29
N = 100, p = 1000
glmnet (type = "naive") 0.04 0.05 0.04 0.05 0.04 0.03
glmnet (type = "cov") 0.07 0.08 0.07 0.08 0.04 0.03
lars 0.73 0.72 0.68 0.71 0.71 0.67
N = 100, p = 5000
glmnet (type = "naive") 0.20 0.18 0.21 0.23 0.21 0.14
glmnet (type = "cov") 0.46 0.42 0.51 0.48 0.25 0.10
lars 3.73 3.53 3.59 3.47 3.90 3.52
N = 100, p = 20000
glmnet (type = "naive") 1.00 0.99 1.06 1.29 1.17 0.97
glmnet (type = "cov") 1.86 2.26 2.34 2.59 1.24 0.79
lars 18.30 17.90 16.90 18.03 17.91 16.39
N = 100, p = 50000
glmnet (type = "naive") 2.66 2.46 2.84 3.53 3.39 2.43
glmnet (type = "cov") 5.50 4.92 6.13 7.35 4.52 2.53
lars 58.68 64.00 64.79 58.20 66.39 79.79
Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso
penalty. The first line is glmnet using naive updating while the second uses covariance
updating. Total time for 100 λ values, averaged over 3 runs.
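For reference, a single timing of this kind could be reproduced along the following lines; the argument name type.gaussian reflects current versions of the glmnet package and is our assumption about how the naive and covariance variants are selected, and gen.data() is the simulation sketch given earlier.

library(glmnet)
d <- gen.data(N = 1000, p = 100, rho = 0.5)
system.time(glmnet(d$x, d$y, type.gaussian = "naive"))       # naive updating
system.time(glmnet(d$x, d$y, type.gaussian = "covariance"))  # covariance updating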
same 100 λ values for all. Table 2 shows the results; in some cases, we omitted a method
when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is
the clear winner: it slows down a little under high correlation. The computation seems to
be roughly linear in N , but grows faster than linear in p. Table 3 shows some results when
the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again, the
glmnet procedure is significantly faster than l1logreg.
Correlation
0 0.1 0.2 0.5 0.9 0.95
N = 1000, p = 100
glmnet 1.65 1.81 2.31 3.87 5.99 8.48
l1logreg 31.475 31.86 34.35 32.21 31.85 31.81
BBR 40.70 47.57 54.18 70.06 106.72 121.41
LPL 24.68 31.64 47.99 170.77 741.00 1448.25
N = 5000, p = 100
glmnet 7.89 8.48 9.01 13.39 26.68 26.36
l1logreg 239.88 232.00 229.62 229.49 22.19 223.09
N = 100, p = 1000
glmnet 1.06 1.07 1.09 1.45 1.72 1.37
l1logreg 25.99 26.40 25.67 26.49 24.34 20.16
BBR 70.19 71.19 78.40 103.77 149.05 113.87
LPL 11.02 10.87 10.76 16.34 41.84 70.50
N = 100, p = 5000
glmnet 5.24 4.43 5.12 7.05 7.87 6.05
l1logreg 165.02 161.90 163.25 166.50 151.91 135.28
Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for tenfold
cross-validation over a grid of 100 λ values.
Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we
compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.
Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating
type of leukemia—AML vs ALL. We used the preprocessed data of Dettling (2004).
Correlation
0 0.1 0.2 0.5 0.9 0.95
N = 1000, p = 100
glmnet 0.77 0.74 0.72 0.73 0.84 0.88
l1logreg 5.19 5.21 5.14 5.40 6.14 6.26
BBR 2.01 1.95 1.98 2.06 2.73 2.88
N = 100, p = 1000
glmnet 1.81 1.73 1.55 1.70 1.63 1.55
l1logreg 7.67 7.72 7.64 9.04 9.81 9.40
BBR 4.66 4.58 4.68 5.15 5.78 5.53
Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95%
zero). Total time for ten-fold cross-validation over a grid of 100 λ values.
NewsGroup (Lang 1995): document classification problem. We used the training set
cultured from these data by Koh et al. (2007a). The response is binary, and indicates
a subclass of topics; the predictors are binary, and indicate the presence of particular
tri-gram sequences. The predictor matrix has 0.05% nonzero values.
All four datasets are available online with this publication as saved R data objects (the latter
two in sparse format using the Matrix package, Bates and Maechler 2009).
For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values
so we estimated the total time by scaling up the time for smaller number of values. The
InternetAd and NewsGroup datasets are both sparse: 1% nonzero values for the former,
0.05% for the latter. Again glmnet is considerably faster than the competing methods.
Name          Type       N        p         glmnet     l1logreg    BBR/BMR
Dense
Cancer        14 class   144      16,063    2.5 mins               2.1 hrs
Leukemia      2 class    72       3571      2.50       55.0        450
Sparse
InternetAd    2 class    2359     1430      5.0        20.9        34.7
NewsGroup     2 class    11,314   777,811   2 mins     3.5 hrs
Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer,
Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values
of λ; for NewsGroup we performed a single run with 100 values of λ, with λmin = 0.05λmax .
Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings
were performed on two different platforms, which were different again from those used in the
earlier timings in this paper.
Referees of an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these using the Leukemia data, fitting models at 100 values of λ in each case.
The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementa-
tion of gradient ascent.
Table 5 shows these comparisons (on two different machines); glmnet is considerably faster
in both cases.
[Figure 2 appears about here: the first row shows mean squared error versus log(Lambda) for the Gaussian family; the second row shows binomial deviance (left) and misclassification error (right) versus log(Lambda).]
Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a
Gaussian response, the second row logistic regression with a binomial response. In both cases
we have 1000 observations and 100 predictors, but the response depends on only 10 predictors.
For regression we use mean-squared prediction error as the measure of risk. For logistic
regression, the left panel shows the mean deviance (minus twice the log-likelihood on the
left-out data), while the right panel shows misclassification error, which is a rougher measure.
In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation
band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line indicates the largest value of λ such that the error is within one standard error of the minimum (the so-called "one-standard-error" rule). The top of each plot is annotated with the size of the models.
Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where
the training data is used both for training and testing in an unbiased way.
Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the former is smoother.
We often use the “one-standard-error” rule when selecting the best model; this acknowledges
the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie
et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a
higher-level parameter and chosen on more subjective grounds.
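A minimal sketch of this workflow with cv.glmnet, on simulated Gaussian data of the same shape as in Figure 2:

library(glmnet)
set.seed(4)
N <- 1000; p <- 100
x <- matrix(rnorm(N * p), N, p)
y <- drop(x[, 1:10] %*% rep(1, 10)) + rnorm(N)     # response depends on only 10 predictors
cvfit <- cv.glmnet(x, y, nfolds = 10)
plot(cvfit)                                        # error curve with one-standard-deviation band
c(cvfit$lambda.min, cvfit$lambda.1se)              # minimum-error and one-standard-error choices
coef(cvfit, s = "lambda.1se")                      # coefficients under the one-standard-error rule
# For a binary response one would use, e.g., cv.glmnet(x, ybin, family = "binomial",
# type.measure = "deviance") or type.measure = "class" (argument names per current glmnet).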
7. Discussion
Cyclical coordinate descent methods are a natural approach for solving convex problems
with `1 or `2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step
is fast, with an explicit formula for each coordinate-wise minimization. The method also
exploits the sparsity of the model, spending much of its time evaluating only inner products
for variables with non-zero coefficients. Its computational speed for both large N and large p is quite remarkable.
An R-language package glmnet is available under the General Public License (GPL-2) from the
Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse
data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available
from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.
Acknowledgments
We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the
MATLAB interface to our Fortran routines. We thank the associate editor, production editor
and two referees who gave useful comments on an earlier draft of this article.
Friedman was partially supported by grant DMS-97-64431 from the National Science Foun-
dation. Hastie was partially supported by grant DMS-0505676 from the National Science
Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani
was partially supported by National Science Foundation Grant DMS-9971405 and National
Institutes of Health Contract N01-HV-28183.
References
Candes E, Tao T (2007). “The Dantzig Selector: Statistical Estimation When p is much
Larger than n.” The Annals of Statistics, 35(6), 2313–2351.
Chen SS, Donoho D, Saunders M (1998). “Atomic Decomposition by Basis Pursuit.” SIAM
Journal on Scientific Computing, 20(1), 33–61.
Daubechies I, Defrise M, De Mol C (2004). “An Iterative Thresholding Algorithm for Lin-
ear Inverse Problems with a Sparsity Constraint.” Communications on Pure and Applied
Mathematics, 57, 1413–1457.
Dettling M (2004). “BagBoosting for Tumor Classification with Gene Expression Data.”
Bioinformatics, 20, 3583–3593.
Donoho DL, Johnstone IM (1994). “Ideal Spatial Adaptation by Wavelet Shrinkage.”
Biometrika, 81, 425–455.
Efron B, Hastie T, Johnstone I, Tibshirani R (2004). “Least Angle Regression.” The Annals
of Statistics, 32(2), 407–499.
Fan J, Li R (2005). “Variable Selection via Nonconcave Penalized Likelihood and its Oracle
Properties.” Journal of the American Statistical Association, 96, 1348–1360.
Friedman J (2008). “Fast Sparse Regression and Classification.” Technical report, Depart-
ment of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/
GPSpub.pdf.
Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). “Pathwise Coordinate Optimization.”
The Annals of Applied Statistics, 2(1), 302–332.
Friedman J, Hastie T, Tibshirani R (2008). “Sparse Inverse Covariance Estimation with the
Graphical Lasso.” Biostatistics, 9, 432–441.
Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized
Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/
package=glmnet.
Fu W (1998). “Penalized Regressions: The Bridge vs. the Lasso.” Journal of Computational
and Graphical Statistics, 7(3), 397–416.
Genkin A, Lewis D, Madigan D (2007). “Large-scale Bayesian Logistic Regression for Text
Categorization.” Technometrics, 49(3), 291–304.
Goeman J (2009a). “L1 Penalized Estimation in the Cox Proportional Hazards Model.”
Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.
Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs
and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/
package=penalized.
Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML,
Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). “Molecular Classification of
Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.” Science,
286, 531–536.
Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise.
R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.
Hastie T, Rosset S, Tibshirani R, Zhu J (2004). “The Entire Regularization Path for the
Support Vector Machine.” Journal of Machine Learning Research, 5, 1391–1415.
Koh K, Kim SJ, Boyd S (2007a). “An Interior-Point Method for Large-Scale L1-Regularized
Logistic Regression.” Journal of Machine Learning Research, 8, 1519–1555.
Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression.
R package version 0.1-1. Available from Kwangmoo Koh ([email protected]).
Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-
alone software, URL http://www.bayesianregression.org/.
Meier L, van de Geer S, Bühlmann P (2008). “The Group Lasso for Logistic Regression.”
Journal of the Royal Statistical Society B, 70(1), 53–71.
Park MY, Hastie T (2007a). “L1 -Regularization Path Algorithm for Generalized Linear Mod-
els.” Journal of the Royal Statistical Society B, 69, 659–677.
Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Lin-
ear Models and Cox Proportional Hazards Model. R package version 0.94, URL http:
//CRAN.R-project.org/package=glmpath.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http:
//www.R-project.org/.
Rosset S, Zhu J (2007). “Piecewise Linear Regularized Solution Paths.” The Annals of
Statistics, 35(3), 1012–1030.
Shevade K, Keerthi S (2003). “A Simple and Efficient Algorithm for Gene Selection Using
Sparse Logistic Regression.” Bioinformatics, 19, 2246–2253.
Tibshirani R (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal
Statistical Society B, 58, 267–288.
Tibshirani R (1997). “The Lasso Method for Variable Selection in the Cox Model.” Statistics
in Medicine, 16, 385–395.
Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling
Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. URL
https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.
Yuan M, Lin Y (2007). “Model Selection and Estimation in Regression with Grouped Vari-
ables.” Journal of the Royal Statistical Society B, 68(1), 49–67.
Zou H (2006). “The Adaptive Lasso and its Oracle Properties.” Journal of the American
Statistical Association, 101, 1418–1429.
Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection.
R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.
Zou H, Hastie T (2005). “Regularization and Variable Selection via the Elastic Net.” Journal
of the Royal Statistical Society B, 67(2), 301–320.
A. Proof of Theorem 1
We have

    c_j = arg min_t Σ_{ℓ=1}^K [ (1/2)(1 − α)(β_{jℓ} − t)² + α|β_{jℓ} − t| ].    (32)
where s_{jℓ} = sign(β_{jℓ} − t) if β_{jℓ} ≠ t and s_{jℓ} ∈ [−1, 1] otherwise. This gives

    t = β̄_j + (1/K)·(α/(1 − α)) Σ_{ℓ=1}^K s_{jℓ}.    (34)
It follows that t cannot be larger than βjM since then the second term above would be negative
and this would imply that t is less than β̄j . Similarly t cannot be less than β̄j , since then the
second term above would have to be negative, implying that t is larger than βjM .
Affiliation:
Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: [email protected]
URL: http://www-stat.stanford.edu/~hastie/