A Convenient Approach For Penalty Parameter Selection in Robust Lasso Regression
2017, Vol. 24, No. 6, 651–662 Print ISSN 2287-7843 / Online ISSN 2383-4757
Jongyoung Kim, Seokho Lee
Department of Statistics, Hankuk University of Foreign Studies, Korea
Corresponding author: Department of Statistics, Hankuk University of Foreign Studies, 50 Oedae-ro 54beon-gil, Mohyeon-myeon, Cheoin-gu, Yongin 17035, Korea. E-mail: [email protected]
Abstract
We propose an alternative procedure for selecting the penalty parameter in L1-penalized robust regression. The procedure is based on marginalizing a prior distribution over the penalty parameter, so the resulting objective function does not include the penalty parameter. In addition, its estimating algorithm automatically chooses a penalty parameter using the previous estimate of the regression coefficients. The proposed approach bypasses cross validation and saves computing time. Variable-wise penalization also performs best in terms of prediction and variable selection. Numerical studies using simulation data demonstrate the performance of our proposals, and the proposed methods are applied to the Boston housing data. Through the simulation study and the real data application we demonstrate that our proposals are competitive with, or much better than, cross-validation in terms of prediction, variable selection, and computing time.
Keywords: adaptive lasso, cross validation, lasso, robust regression, variable selection
1. Introduction
Regularized regression is widely used to incorporate various assumptions that the model should satisfy (Bishop, 2006; Hastie et al., 2001; Murphy, 2012). The type of regularization depends on the specific model assumption. For example, in functional regression the coefficient parameter is assumed to be a smooth function, while in sparse regression some coefficient parameters corresponding to insignificant variables are expected to be zero. To impose smoothness or sparsity on the parameter estimate, a regularization technique is often used, typically within a penalized loss minimization framework. Consider a regression model yi = β0 + xTi β + ϵi with training data D = {(xi, yi) | xi ∈ Rp, yi ∈ R, i = 1, 2, . . . , n} and error variance var(ϵi) = σ². The objective function to be minimized is the penalized loss:
ℓ(β0, β) = ∑_{i=1}^n ρ(ri) + P(β; λ),    (1.1)
where ri = (yi − β0 − xTi β)/σ. The squared loss ρ(u) = u² is used in ordinary regression. In the presence of outliers in the training data, robust loss functions, for example the Huber loss or the bisquare loss, are often used to circumvent their harmful effects. The penalty function P(· ; ·) is chosen according to the purpose of regularization: a roughness penalty for a smooth function estimate and a sparsity-inducing penalty for a sparse estimate. Regularization and goodness-of-fit are balanced at an optimal
penalty parameter λ. The optimal λ is chosen for the best prediction. In the absence of test data, cross validation (CV) is commonly used to estimate the test mean squared error (MSE) in regression. CV is a generic model selection criterion that can be applied to many predictive models, even when the response variable has no concrete probabilistic grounding. Despite its versatility, however, CV suffers from a heavy computational burden.
In this research, we propose an alternative approach to CV in robust regression with L1 penalization. The approach is based on marginalizing the prior distribution over the penalty parameter, so the resulting objective function no longer contains the penalty parameter. Its estimating equation, however, automatically sets a tentative penalty parameter, which we can use as the penalty parameter in the penalized regression estimating procedure. This idea generalizes straightforwardly to variable-wise penalization, which is often prohibitive through CV.
This paper is organized as follows. In Section 2, we introduce robust lasso regression and provide a Bayesian approach to marginalizing the penalty parameter together with its estimating algorithm. Variable-wise penalization is introduced in the same section. The performance of our proposals is demonstrated through simulation studies and the Boston housing data in Section 3. The paper ends with some remarks in Section 4.
2. Methodology
2.1. Robust lasso regression
Consider a linear model yi = β0 + xTi β + ϵi for i = 1, . . . , n. In the presence of outlying observations in the training dataset, a robust loss is used to reduce the outlier effect in regression (Maronna et al., 2006). When we impose some regularization on the coefficient β ∈ Rp, we add a penalty function on β to the empirical loss and minimize the penalized empirical loss (1.1) to find a robust coefficient estimate. Well-known robust loss functions are the Huber loss,
ρH(u|k) = { u²,           if |u| ≤ k,
            2k|u| − k²,   if |u| > k,
and Tukey's bisquare loss,
ρB(u|k) = { 1 − {1 − (u/k)²}³,   if |u| ≤ k,
            1,                   if |u| > k.
Both loss functions include an additional parameter k, which regulates the robustness and efficiency of the resulting coefficient estimate. Throughout this study, we use k = 1.345 for the Huber loss and k = 4.685 for the bisquare loss, giving 95% efficiency under normality (Maronna et al., 2006).
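For concreteness, the following is a minimal R sketch (our code, not the authors') of these two losses, their derivatives ψ = ρ′, and the weights wi = ψ(ri)/ri that appear in the estimating equations later in this section.

# Huber and bisquare (Tukey biweight) losses as defined above, with psi = rho'
rho_huber    <- function(u, k = 1.345) ifelse(abs(u) <= k, u^2, 2 * k * abs(u) - k^2)
psi_huber    <- function(u, k = 1.345) ifelse(abs(u) <= k, 2 * u, 2 * k * sign(u))
rho_bisquare <- function(u, k = 4.685) ifelse(abs(u) <= k, 1 - (1 - (u / k)^2)^3, 1)
psi_bisquare <- function(u, k = 4.685) ifelse(abs(u) <= k, 6 * u * (1 - (u / k)^2)^2 / k^2, 0)

# IRLS weight w = psi(r)/r; at r = 0 we plug in the finite limit
# (2 for the Huber psi, 6/k^2 for the bisquare psi).
weight_from_psi <- function(r, psi, ...) {
  ifelse(r == 0, psi(1e-8, ...) / 1e-8, psi(r, ...) / r)
}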
As in traditional linear regression, we can apply regularization techniques in robust regression (El
Ghaoui and Lebret, 1997; Owen, 2006). For variable selection as well as prediction enhancement,
the L1 penalty P(β; λ) = λ∥β∥1 = λ ∑_{j=1}^p |βj| can be imposed; with the square loss ρ(u) = u², this becomes the lasso. Thus, robust lasso regression is performed by minimizing the penalized objective function (1.1). A usual way to minimize (1.1) is the Newton-Raphson-type algorithm used for M-estimates in robust statistics. To look at this more closely, we take derivatives of (1.1) with respect to β0 and β. Then, the normal equations become
∂ℓ/∂β0 = −∑_{i=1}^n wi ri/σ = 0,    ∂ℓ/∂β = −∑_{i=1}^n wi ri xi/σ + λ ∂∥β∥1/∂β = 0    (2.1)
with wi = ψ(ri)/ri and ψ = ρ′. The above normal equations are exactly the same as those of the weighted lasso problem below:
Q(β0, β) = ∑_{i=1}^n wi ri² + λ ∑_{j=1}^p |βj|.    (2.2)
Thus, robust lasso regression (1.1) with a robust loss and L1 penalty can be solved by iteratively optimizing the weighted lasso (2.2), where the weight wi is updated at every iteration step. In robust linear regression, the scale parameter σ should be estimated in a robust fashion. A common choice of robust scale is the normalized median absolute deviation (MADN) of the residuals from a robust regression (Maronna et al., 2006), defined as MADN(x) = median(|x − median(x)|)/0.675. In this study, we initially fit the L1 regression estimate (least absolute deviation (LAD) estimate; Koenker, 2005) and obtain the σ estimate as the MADN of the residuals from the LAD fit.
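A small R sketch of this initialization (our code), assuming the quantreg package for the LAD fit; the data and object names are illustrative.

library(quantreg)  # rq() at tau = 0.5 gives the least absolute deviation (LAD) fit

madn <- function(x) median(abs(x - median(x))) / 0.675

# Illustrative data; X is an n x p design matrix and y the response vector.
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- as.vector(X %*% rep(1, 5)) + rnorm(100)

lad_fit   <- rq(y ~ X, tau = 0.5)       # robust initial fit
sigma_hat <- madn(residuals(lad_fit))   # robust scale used in the weighted lasso steps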
To choose an optimal regularization parameter λ, CV is a popular choice. In the CV procedure, the training data are split into several exclusive pieces. Each piece works as a validation set and the remaining pieces are used to fit the model; the fitted model is evaluated on the validation set by MSE. This step is cycled until all pieces have served as the validation set. After finishing the whole cycle, the CV score is defined as the averaged MSE, and the regularization parameter λ is chosen to minimize the CV score. In the presence of outliers in the training data, the MSE-based CV score is not reliable because outliers may also appear in the validation set. Thus, a robust loss on the errors is used in the CV score computation instead of the squared errors of the traditional MSE. Generalized cross validation (GCV) is a convenient selection criterion that does not require the onerous splitting-and-fitting procedure of CV: after fitting the model to the whole training data, GCV is obtained using the hat matrix Hλ, where ŷ = Hλ y. However, GCV is not available in robust regression because the regression coefficient estimate is not a linear combination of the response variable in most robust regressions. Information criteria, including the Bayesian information criterion (BIC), are also frequently used for model selection. BIC, like most other information criteria, depends on a data model because the data distribution specifies the likelihood as a part of the criterion; in robust regression it is difficult to assume a data distribution due to the existence of outliers.
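As an illustration of the robust CV score described above, the following sketch replaces the squared validation error by a robust loss; fit_fun, and its returned coef and sigma components, are hypothetical placeholders for a robust penalized fitter, and rho is a robust loss such as rho_huber() above.

# Robust K-fold CV score: a bounded robust loss on standardized validation
# residuals, so that outliers in the validation fold do not dominate the score.
robust_cv_score <- function(X, y, lambda, fit_fun, rho, K = 5) {
  folds <- sample(rep(1:K, length.out = length(y)))
  scores <- sapply(1:K, function(k) {
    fit <- fit_fun(X[folds != k, , drop = FALSE], y[folds != k], lambda)
    res <- y[folds == k] - cbind(1, X[folds == k, , drop = FALSE]) %*% fit$coef
    mean(rho(res / fit$sigma))           # robust loss instead of squared error
  })
  mean(scores)                           # lambda is chosen to minimize this over a grid
}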
In a Bayesian view, the L1 penalty λ∥β∥1 corresponds to the negative logarithm of a Laplace prior f(β|λ) = (λ/2)^p exp(−λ∥β∥1), so the prior of β depends on λ as a scale parameter. To remove the dependency on λ, Buntine and Weigend (1991) impose an improper Jeffrey prior on the hyperparameter λ and marginalize it out. Using the Jeffrey prior f(λ) ∝ 1/λ, the marginal prior on β becomes
f(β) = ∫_0^∞ f(β|λ) f(λ) dλ = ∫_0^∞ (λ^{p−1}/2^p) exp{−λ∥β∥1} dλ = Γ(p)/(2^p ∥β∥1^p).
By replacing the original penalty induced from − log f (β|λ) in (1.1) by a new penalty from − log f (β),
we can obtain a penalized loss ℓ̃:
ℓ̃(β0, β) = ∑_{i=1}^n ρ(ri) + p log ∥β∥1.    (2.3)
Equation (2.3) is a new objective function without λ. For β estimation, normal equations from (2.3)
become
∂ℓ̃ ∑ wi ri
n
∂ℓ̃ ∑ wi ri xi
n
p ∂∥β∥1
=− = 0, =− + = 0. (2.4)
∂β0 i=1
σ ∂β i=1
σ ∥β∥ 1 ∂β
These normal equations cannot be solved analytically. Note that, comparing (2.4) with (2.1) from the original objective function, the penalty parameter λ in (2.1) is replaced by the term p/∥β∥1 in (2.4). Thus, Buntine and Weigend (1991) suggest an iterative fitting procedure for penalized least squares problems, where (β̂0, β̂) is obtained as a weighted penalized least squares solution with penalty parameter λ = p/∥βo∥1 = p/∑_{j=1}^p |βoj|, using the previous solution βo = (βo1, . . . , βop)T. We can use the same idea in the robust lasso problem by solving (2.2) iteratively. The updating formulas for (β0, β) in (2.2) are those of a weighted lasso problem with weights wi = ψ(rio)/rio and penalty λ = p/∑_{j=1}^p |βoj|, where rio = (yi − βo0 − xTi βo)/σ.
Through some simulation studies, we observed that this approach produces reliable performance
in prediction but is not satisfactory in variable selection. To enhance variable selection performance,
we consider a different penalty function for β. Instead of the common penalty parameter λ for
all βj, we use a separate penalty parameter for each variable, similar to the adaptive lasso (Zou, 2006). Thus, the L1 penalty λ ∑_{j=1}^p |βj| in (1.1) is replaced by P(β; λ) = ∑_{j=1}^p λj |βj| with
λ = (λ1 , . . . , λ p )T . We call the former “robust lasso” and the latter “robust adaptive lasso.” This
change causes a computational challenge in CV because p penalty parameters would have to be chosen by a p-dimensional grid search, which is infeasible in practice even with a moderate number of variables. However, the same Bayesian approach can be easily implemented. With variable-wise penalty parameters,
the objective function becomes
ℓ(β0, β) = ∑_{i=1}^n ρ(ri) + ∑_{j=1}^p λj |βj|.    (2.5)

Placing an independent Jeffrey prior on each λj and marginalizing it out, as before, gives

ℓ̃(β0, β) = ∑_{i=1}^n ρ(ri) + ∑_{j=1}^p log |βj|.    (2.6)
The normal equations from (2.6) are

∂ℓ̃/∂β0 = −∑_{i=1}^n wi ri/σ = 0,    ∂ℓ̃/∂βj = −∑_{i=1}^n wi ri xij/σ + (1/|βj|) ∂|βj|/∂βj = 0,    (2.7)
for j = 1, 2, . . . , p. These normal equations are exactly the same as those from ℓ in (2.5) if λj is set to 1/|βj|. We can still employ an iterative estimation scheme here. We summarize the whole procedure in the conceptual algorithm below, where a coordinate descent algorithm is used to estimate β.
Algorithm: Robust lasso and robust adaptive lasso by marginalizing penalty parameters

1. (Initialization) Obtain initial estimates β̂o0 and β̂o from LAD or LAD-ridge. Set σ̂ = MADN(ri), where ri are the residuals from the LAD fit.

2. Repeat until convergence is met.

   (a) Set ri = (yi − β̂o0 − xTi β̂o)/σ̂ and wi = ψ(ri)/ri for i = 1, . . . , n. For robust lasso, set λ = |I| / ∑_{j∈I} |β̂oj|, where I is the index set of nonzero elements of β̂o and |I| is the number of nonzeros in β̂o. For robust adaptive lasso, set λj = 1/|β̂oj|; if β̂oj = 0, then λj = ∞.

   (b) Compute ȳw = ∑_{i=1}^n wi yi / ∑_{i=1}^n wi and x̄wj = ∑_{i=1}^n wi xij / ∑_{i=1}^n wi for all j = 1, 2, . . . , p. Then set ỹi = yi − ȳw and x̃ij = xij − x̄wj.

   (c) For j = 1, 2, . . . , p, compute yi(−j) and update β̂j by

       yi(−j) = ỹi − ∑_{l≠j} x̃il β̂ol,
       β̂j = soft( ∑_{i=1}^n wi x̃ij yi(−j) / ∑_{i=1}^n wi x̃ij², σ̂²λ*j / ∑_{i=1}^n wi x̃ij² ),

       where soft(x, t) = sign(x)(|x| − t)+ and u+ = max(u, 0). Here, we set λ*j = λ for robust lasso and λ*j = λj for robust adaptive lasso.

   (d) Update the intercept β̂0 by β̂0 = ȳw − ∑_{j=1}^p x̄wj β̂j.

   (e) If convergence is not achieved, update β̂o0 ← β̂0 and β̂o ← β̂.
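The following R sketch implements the algorithm above under our own naming; it reuses psi_huber(), weight_from_psi(), and madn() from the earlier sketches and the quantreg package for the LAD start, and is meant as an illustration rather than the authors' implementation.

robust_marg_lasso <- function(X, y, psi = psi_huber, adaptive = FALSE,
                              max_iter = 100, tol = 1e-6) {
  n <- nrow(X); p <- ncol(X)
  soft <- function(x, t) sign(x) * pmax(abs(x) - t, 0)   # soft-thresholding operator

  # Step 1: initialization from LAD, with MADN of its residuals as the scale.
  lad   <- quantreg::rq(y ~ X, tau = 0.5)
  sigma <- madn(residuals(lad))
  beta0 <- coef(lad)[1]; beta <- coef(lad)[-1]

  for (it in 1:max_iter) {
    # Step 2(a): standardized residuals, weights, and the penalty from the
    # previous iterate (lambda = |I| / sum_{j in I} |beta_j| for robust lasso,
    # lambda_j = 1 / |beta_j| for robust adaptive lasso).
    r <- as.vector(y - beta0 - X %*% beta) / sigma
    w <- weight_from_psi(r, psi)
    if (adaptive) {
      lam <- ifelse(beta == 0, Inf, 1 / abs(beta))
    } else {
      nz  <- which(beta != 0)
      lam <- if (length(nz) > 0) rep(length(nz) / sum(abs(beta[nz])), p) else rep(Inf, p)
    }

    # Step 2(b): weighted centering of the response and the covariates.
    ybar <- sum(w * y) / sum(w)
    xbar <- colSums(w * X) / sum(w)
    yt   <- y - ybar
    Xt   <- sweep(X, 2, xbar)

    # Step 2(c): coordinate-wise soft-thresholding updates.
    beta_old <- beta
    for (j in 1:p) {
      partial <- yt - as.vector(Xt[, -j, drop = FALSE] %*% beta[-j])
      denom   <- sum(w * Xt[, j]^2)
      beta[j] <- soft(sum(w * Xt[, j] * partial) / denom,
                      sigma^2 * lam[j] / denom)
    }

    # Step 2(d): intercept update; Step 2(e): stop when coefficients settle.
    beta0 <- ybar - sum(xbar * beta)
    if (sum(abs(beta - beta_old)) < tol) break
  }
  list(intercept = beta0, beta = beta, sigma = sigma)
}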
The proposed scheme is in the spirit of automatic relevance determination (ARD; MacKay, 1995; Tipping, 2001), which adaptively drives down the prior variance for negligible variables and whose estimation can also be cast into a penalized loss framework. In ARD, however, the updating formula for λj depends on all coefficient parameter estimates from the previous iteration and thus is not separable variable-wise. This causes relatively slow convergence and cannot exploit the simple thresholding scheme.
Our approach is closely related to a traditional Bayesian approach. If ρ(u) = u², i.e., non-robust regression, the lasso can be formulated in a Bayesian framework using a Gaussian scale mixture, βj | τj², λ ∼ N(0, τj²) and τj² | λ ∼ gamma(1, λ²/2) (Park and Casella, 2008; West, 1987). Assuming normality of the response yi, a full Bayesian approach allows Bayesian inference for the lasso because the full posterior f(β|D) is available. However, a typical Markov chain Monte Carlo (MCMC) approximation does not produce an exact zero solution. In contrast to MCMC, the expectation-maximization (EM) algorithm allows sparse estimation under the Bayesian formulation (Figueiredo, 2003). For either MCMC or EM, the hyperparameter λ should be chosen by an empirical Bayes or evidence procedure, where λ is the maximizer of f(D|λ), called the marginal likelihood or evidence. In robust lasso, however, f(D|λ) is not available because the Huber or biweight loss functions are not derived from any distribution. The penalty function in robust adaptive lasso can also be formulated under a Bayesian structure by assuming βj | τj², λj ∼ N(0, τj²), τj² | λj ∼ gamma(1, λj²/2), and λj | a, b ∼ inverse-gamma(a, b). This is called the hierarchical adaptive lasso (HAL; Lee et al., 2010). In the non-robust situation, its EM algorithm is similar to the proposed approach in the sense that the M-step produces the same β estimate and the E-step gives E(λj) = (a + 1)/(b + |β̂oj|); if a = b = 0, the expected value of λj is exactly the same as in the proposed approach. However, HAL is not appropriate for robust adaptive lasso either, because HAL also assumes normality of the response. Another interesting connection is the log-penalized regression of Zou and Li (2008), who derived the logarithm penalty as a limiting version of the bridge penalty with q → 0 using a local linear approximation. The difference is that our approach resorts to full iteration while Zou and Li (2008) suggest a one-step iteration only.
3. Numerical Studies
3.1. Synthetic data
Artificial data are used to test our proposals in this section. Input variables xij (i = 1, 2, . . . , n; j = 1, 2, . . . , p) are independently generated from N(0, 1). The intercept parameter β0 is set to zero, the first 10 slope parameters are generated uniformly on the interval from 1 to 2, and the remaining slopes are set to zero; i.e., βj ∼ uniform(1, 2) for j = 1, 2, . . . , 10 and βj = 0 for j = 11, . . . , p. Response variables yi are constructed by yi = xTi β + ϵi for i = 1, . . . , n. Thus, the first 10 variables are important in prediction and the remaining p − 10 variables are not. To mimic contamination in the training data, the random errors ϵi are generated from the contaminated normal distribution

ϵi ∼ (1 − π)N(0, σ²) + πN(mσ², σ²).

Here, π is the contamination rate; if π = 0, there are no outliers. Outliers come from a normal distribution whose mean is shifted by an m-factor of the error variance. The error variance is set to give a five-to-one signal-to-noise ratio on the standard deviation scale, i.e., σ = sd(xTi β) × 0.2. We used the glmnet() and cv.glmnet() functions in the glmnet package for CV. For the non-robust case (π = 0), the direct use of cv.glmnet() is sufficient and efficient. For a robust case, the weights wi should be updated at every iteration, so we devise a loop for CV and use the glmnet() function for model fitting inside the loop. We conducted 5-fold CV over 100 grid points, as suggested in the glmnet package. BIC is also considered for comparison (Chang et al., 2017; Lambert-Lacroix and Zwald, 2011), and the same grid used in CV was used for BIC.
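An R sketch of this data-generating mechanism follows (function and object names are ours; it assumes p ≥ 10).

make_sim_data <- function(n, p, pi = 0.1, m = 5) {
  X     <- matrix(rnorm(n * p), n, p)
  beta  <- c(runif(10, 1, 2), rep(0, p - 10))  # 10 relevant slopes, p - 10 irrelevant
  mu    <- as.vector(X %*% beta)               # intercept beta0 = 0
  sigma <- sd(mu) * 0.2                        # five-to-one signal-to-noise in sd scale
  out   <- rbinom(n, 1, pi)                    # contamination indicators
  eps   <- rnorm(n, mean = out * m * sigma^2, sd = sigma)
  list(X = X, y = mu + eps, beta = beta, sigma = sigma)
}

set.seed(123)
dat <- make_sim_data(n = 100, p = 20, pi = 0.2)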
Table 1: Simulation 1 - average of 100 test RMSEs and their standard errors (in parentheses)
n       π     S.CV      S.BIC     S.L       H.CV      H.BIC     H.L       H.A       B.CV      B.BIC     B.L       B.A
100     0.0   2.7167    2.7373    2.7444    2.7413    2.7457    2.7589    2.7352    2.7748    2.7696    2.8025    2.7723
              (0.0250)  (0.0280)  (0.0230)  (0.0260)  (0.0280)  (0.0240)  (0.0230)  (0.0270)  (0.0280)  (0.0250)  (0.0250)
        0.1   4.5879    5.3437    4.8752    2.8645    5.1600    2.9041    2.8670    2.7688    5.2952    2.7862    2.7486
              (0.0540)  (0.0580)  (0.0720)  (0.0290)  (0.0560)  (0.0270)  (0.0270)  (0.0280)  (0.0450)  (0.0260)  (0.0260)
        0.2   6.4616    6.8108    6.9528    3.4884    5.2903    3.5763    3.5501    2.7830    5.0319    2.7922    2.7468
              (0.0820)  (0.0660)  (0.1150)  (0.0490)  (0.0600)  (0.0470)  (0.0500)  (0.0300)  (0.0680)  (0.0280)  (0.0290)
        0.3   8.3657    8.5412    8.9019    6.8079    6.9848    7.5472    7.6094    6.1575    6.3042    6.9350    6.9094
              (0.1160)  (0.0970)  (0.1570)  (0.1810)  (0.1890)  (0.1690)  (0.1700)  (0.2590)  (0.2500)  (0.2670)  (0.2650)
1,000   0.0   2.5476    2.5470    2.5384    2.5349    2.5389    2.5388    2.5352    2.5344    2.5384    2.5387    2.5348
              (0.0070)  (0.0070)  (0.0060)  (0.0060)  (0.0060)  (0.0060)  (0.0060)  (0.0060)  (0.0060)  (0.0060)  (0.0060)
        0.1   3.5918    3.6337    3.6147    2.5858    2.6246    2.5908    2.5861    2.5369    2.5718    2.5400    2.5349
              (0.0080)  (0.0080)  (0.0080)  (0.0060)  (0.0070)  (0.0060)  (0.0060)  (0.0060)  (0.0070)  (0.0060)  (0.0060)
        0.2   5.4718    5.5318    5.4964    2.8059    2.8917    2.8146    2.8054    2.5401    2.5818    2.5424    2.5352
              (0.0160)  (0.0160)  (0.0170)  (0.0060)  (0.0070)  (0.0050)  (0.0050)  (0.0060)  (0.0070)  (0.0060)  (0.0060)
        0.3   7.5890    7.6413    7.6104    3.6360    4.1826    3.6681    3.6514    2.5429    2.7074    2.5447    2.5343
              (0.0260)  (0.0250)  (0.0260)  (0.0060)  (0.0300)  (0.0060)  (0.0060)  (0.0060)  (0.0190)  (0.0060)  (0.0060)
10,000  0.0   3.7873    3.6779    2.5022    2.5022    2.5033    2.5024    2.5021    2.5022    2.5032    2.5024    2.5020
              (0.0050)  (0.0050)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)
        0.1   4.0398    4.0137    3.4280    2.5387    2.5433    2.5396    2.5390    2.5024    2.5054    2.5026    2.5021
              (0.0030)  (0.0030)  (0.0030)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)
        0.2   5.4909    5.4722    5.2731    2.7283    2.7396    2.7315    2.7305    2.5030    2.5070    2.5033    2.5026
              (0.0040)  (0.0040)  (0.0050)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)
        0.3   7.4267    7.4242    7.3777    3.4506    3.4673    3.4560    3.4542    2.5029    2.5089    2.5032    2.5021
              (0.0070)  (0.0070)  (0.0080)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)  (0.0020)
Best performer for each case is highlighted.
RMSE = root mean squared error; S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust
lasso; A = robust adaptive lasso.
The first simulation considers the n > p situation. We increase the sample size n = 100, 1,000, 10,000 with a fixed p = 20. The proportion of outliers is π = 0, 0.1, 0.2, 0.3, and m = 5 is used for outlier generation. Test data of size 10,000 are additionally generated without outliers and used to measure prediction error, calculated as the root mean squared error (RMSE). The simulation is repeated 100 times and we report the average and standard error of the RMSEs in Table 1. We denote the square, Huber, and bisquare losses by "S", "H", and "B", respectively, and write "BIC" for BIC, "L" for robust lasso, and "A" for robust adaptive lasso; thus, "B.A" means robust adaptive lasso with bisquare loss. From Table 1, in the absence of outliers (π = 0), the traditional lasso using square loss under CV is best in prediction. However, as π increases, square loss is outperformed by the robust losses. Bisquare loss is generally better than Huber loss, which demonstrates the well-known fact that nonconvex loss is better than convex loss in robust regression. Overall, robust adaptive lasso performs best among the competitors. We also observe that CV and our proposals become hardly distinguishable on the standard error scale as the sample size increases. We compute the false negative rate (FN) and false positive rate (FP) and report their averages in Table 2. FN is the proportion of truly nonzero coefficients falsely claimed to be zero, while FP is the proportion of truly zero coefficients falsely claimed to be nonzero. FN is almost zero for all methods, except for the small sample size, but FP appears quite high. Note that low FN together with high FP indicates that almost all variables are selected, implying a low selection ability. Considering this, the BIC selection seems to perform very well in the large sample case. Table 3 presents the average computing time for each method. Robust lasso and robust adaptive lasso are enormously faster than CV, while their prediction and variable selection performance is comparable to or better than that of CV. Our proposals are much faster than CV and BIC in penalized robust regression.
We conduct a second simulation in which the variable size is larger than the sample size. The sample size is fixed at n = 100 and the variable size is set to p = 50, 100, 200, 400. In this simulation, outliers are generated with mean mσ² using the factor m = √p, so they lie further from the typical data as the dimension increases. Prediction performance, variable selection performance, and computing time are summarized in Tables 4, 5, and 6, respectively. Table 4 shows that bisquare loss produces reliable results in the presence of outliers, while Huber loss quickly deteriorates as the contamination gets larger, demonstrating that the convex Huber loss suffers from severe outliers.
Table 4: Simulation 2 - average of 100 test RMSEs and their standard errors (in parentheses)
p       π     S.CV      S.BIC     S.L       H.CV      H.BIC     H.L       H.A       B.CV      B.BIC     B.L       B.A
50      0.0   2.8572    2.9005    3.2608    2.8977    2.9604    3.2688    3.1108    3.0635    3.0740    3.5044    3.1815
              (0.0250)  (0.0280)  (0.0290)  (0.0280)  (0.0310)  (0.0300)  (0.0280)  (0.0330)  (0.0340)  (0.0370)  (0.0320)
        0.1   6.1176    6.1751    11.2281   3.1535    5.3629    4.5647    3.8234    3.0421    5.3962    3.4482    3.1631
              (0.0720)  (0.0540)  (0.2730)  (0.0340)  (0.0310)  (0.0950)  (0.0560)  (0.0350)  (0.0380)  (0.0400)  (0.0320)
        0.2   8.6024    8.4683    15.3106   4.1725    5.4820    13.4685   8.6459    3.0188    5.1744    3.3201    3.0847
              (0.1410)  (0.1230)  (0.3640)  (0.0670)  (0.0400)  (0.3700)  (0.2630)  (0.0290)  (0.0570)  (0.0340)  (0.0300)
        0.3   11.3820   11.2803   17.9280   10.3680   10.9391   18.7964   16.3489   6.3431    6.5475    9.3799    7.5900
              (0.2100)  (0.1990)  (0.4310)  (0.4600)  (0.4930)  (0.5100)  (0.3870)  (0.5120)  (0.5030)  (0.7030)  (0.5120)
100     0.0   5.3336    5.3388    5.3360    3.3972    3.5069    5.2086    3.6239    4.2170    4.3317    4.2089    3.4831
              (0.0330)  (0.0330)  (0.0660)  (0.0730)  (0.0760)  (0.0630)  (0.0310)  (0.1030)  (0.1040)  (0.0520)  (0.0300)
        0.1   7.0232    7.0159    41.7878   3.6306    5.3966    25.7004   7.7918    3.7066    5.4368    4.1659    3.5490
              (0.0680)  (0.0630)  (0.9690)  (0.0620)  (0.0330)  (0.5200)  (0.2380)  (0.0840)  (0.0350)  (0.0480)  (0.0350)
        0.2   10.7519   10.5992   55.5286   4.7298    10.4626   39.3374   21.2339   3.3009    5.3223    3.7344    3.3720
              (0.1520)  (0.1400)  (1.3470)  (0.0660)  (1.5880)  (0.7500)  (0.4130)  (0.0540)  (0.0460)  (0.0420)  (0.0420)
        0.3   15.0825   14.7932   58.5546   16.8676   82.2173   49.5735   28.1948   5.6270    5.6168    5.5368    5.4661
              (0.2360)  (0.2210)  (1.6150)  (1.0060)  (5.0710)  (1.0190)  (0.5150)  (0.7570)  (0.5940)  (0.6270)  (0.5240)
200     0.0   5.2792    5.2960    3.7408    3.3144    3.4933    3.7530    3.7157    3.9389    4.2246    3.7315    3.5408
              (0.0340)  (0.0330)  (0.0330)  (0.0430)  (0.0510)  (0.0330)  (0.0350)  (0.0840)  (0.0970)  (0.0340)  (0.0330)
        0.1   8.5121    8.4652    23.9495   3.9219    15.3873   22.8119   22.2143   3.5161    5.4437    3.6966    3.5413
              (0.1320)  (0.1270)  (0.5490)  (0.0540)  (1.0690)  (0.5020)  (0.4800)  (0.0560)  (0.0320)  (0.0340)  (0.0430)
        0.2   14.4405   14.1266   32.4718   5.4247    31.7276   31.4729   34.1979   3.5092    5.0911    3.6295    3.3470
              (0.2890)  (0.2620)  (0.7400)  (0.2220)  (0.9290)  (0.6900)  (0.7320)  (0.0460)  (0.0570)  (0.0420)  (0.0500)
        0.3   20.7929   20.6055   38.8953   30.3725   39.6322   38.5896   42.9984   3.7537    3.9634    3.6696    3.6053
              (0.4330)  (0.4680)  (0.8680)  (1.5450)  (0.9350)  (0.8480)  (0.9140)  (0.0530)  (0.0620)  (0.0460)  (0.0650)
400     0.0   4.0374    4.1494    3.5140    3.3626    3.5749    3.5171    3.3047    3.4149    3.6326    3.5210    3.2603
              (0.0590)  (0.0630)  (0.0310)  (0.0310)  (0.0300)  (0.0310)  (0.0350)  (0.0360)  (0.0400)  (0.0320)  (0.0350)
        0.1   11.6193   10.7873   24.3507   4.5725    25.0613   23.0480   28.2852   3.6127    4.9803    3.6144    3.1086
              (0.2530)  (0.1560)  (0.4800)  (0.0480)  (0.4560)  (0.3950)  (0.5340)  (0.0380)  (0.0580)  (0.0350)  (0.0360)
        0.2   20.4784   19.7779   34.8443   17.6653   35.4930   33.3502   38.8368   3.7503    4.2447    3.7797    3.0991
              (0.4600)  (0.4310)  (0.6920)  (1.4230)  (0.7130)  (0.6010)  (0.6690)  (0.0390)  (0.0630)  (0.0410)  (0.0450)
        0.3   29.3303   35.9054   43.7305   45.2124   44.3634   43.2399   49.8013   3.6527    4.0667    4.4179    3.5856
              (0.5730)  (1.1870)  (0.8180)  (0.8840)  (0.8480)  (0.7940)  (0.8940)  (0.0440)  (0.0410)  (0.0680)  (0.0560)
Best performer for each case is highlighted.
RMSE = root mean squared error; S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust
lasso; A = robust adaptive lasso.
In this simulation CV often produces the best results, but the adaptive lasso version of the Bayesian approach is still comparable. Most of all, the computing times in Table 6 demonstrate that the latter is more attractive than the former. In Table 5 we provide the false rate of the slope estimate, which combines false negatives and false positives instead of reporting them separately as in Table 2. The false rate is thus the proportion of elements of β̂ for which a true zero is falsely claimed to be nonzero or a true nonzero is falsely claimed to be zero. Table 5 demonstrates that the robust adaptive lasso using bisquare loss outperforms CV from the variable selection perspective, while they are comparable in the prediction sense. Note that CV usually finds the model giving the best prediction, while the robust adaptive lasso focuses on variable selection as well as a good fit to the data in the high-dimensional cases.
3.2. Boston housing data
We apply the methods to the Boston housing data, in which the response is the median value of owner-occupied homes and the explanatory variables are crim (per capita crime rate by town), zn (proportion of residential land zoned for large lots), indus (proportion of non-retail business acres per town), chas (Charles River dummy variable), nox (nitric oxides concentration), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted distances to five Boston employment centers), rad (index of accessibility to radial highways), tax (full-valued property-tax rate per 10,000 dollars), ptratio (pupil-teacher ratio by town), black (1000 × (Bk − 0.63)² where Bk is the proportion of blacks by town), and lstat (percent of lower status of the population). There are 506 observations in the dataset. We split the dataset into two parts: the first 300 observations were regarded as a training dataset and the remaining 206 observations were treated as a test dataset. To see the effect of outliers on the methods we considered, we randomly selected 30 observations from the training dataset and changed their response variables to values having large residuals (five times larger than the residual standard deviation).
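An R sketch of the split and outlier injection (our code), using the Boston data from the MASS package; the exact contamination scheme used by the authors may differ, and here the selected responses are simply shifted by five residual standard deviations of a least squares fit.

library(MASS)

data(Boston)
train <- Boston[1:300, ]
test  <- Boston[301:506, ]

set.seed(1)
s   <- summary(lm(medv ~ ., data = train))$sigma    # residual standard deviation
idx <- sample(nrow(train), 30)                      # 30 randomly chosen training cases
train_out <- train
train_out$medv[idx] <- train_out$medv[idx] + 5 * s  # contaminated training responses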
Table 7: Boston housing data - estimated coefficients and test RMSEs without outliers
Variables S.CV S.BIC S.L H.CV H.BIC H.L H.A B.CV B.BIC B.L B.A
(Intercept) −13.914 −18.325 −13.625 −16.707 −18.326 −16.273 −15.874 −15.764 −18.325 −15.014 −16.377
crim 0.955 0.000 1.073 0.000 0.101 0.350 0.000 0.000 0.000 0.119 0.000
zn 0.011 0.000 0.015 0.002 0.000 0.014 0.012 0.011 0.000 0.019 0.018
indus 0.006 0.000 0.019 0.000 0.000 −0.015 0.000 −0.023 0.000 −0.037 0.000
chas 0.607 0.413 0.579 0.500 0.506 0.525 0.000 0.468 0.413 0.362 0.000
nox −6.453 0.000 −6.933 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
rm 9.160 9.253 9.137 9.045 9.278 8.991 9.078 8.908 9.253 8.841 9.060
age −0.046 −0.035 −0.048 −0.042 −0.039 −0.053 −0.053 −0.050 −0.035 −0.057 −0.059
dis −0.919 −0.529 −0.977 −0.562 −0.584 −0.822 −0.814 −0.700 −0.529 −0.870 −0.826
rad 0.117 0.000 0.162 0.049 0.000 0.176 0.181 0.158 0.000 0.240 0.257
tax −0.013 −0.010 −0.015 −0.011 −0.011 −0.014 −0.014 −0.013 −0.010 −0.015 −0.015
ptratio −0.632 −0.617 −0.628 −0.615 −0.620 −0.575 −0.616 −0.577 −0.617 −0.557 −0.576
black 0.016 0.011 0.017 0.010 0.012 0.013 0.012 0.011 0.011 0.012 0.012
lstat −0.110 −0.114 −0.113 −0.107 −0.114 −0.101 −0.084 −0.070 −0.114 −0.060 −0.041
Test RMSE 15.824 7.984 17.472 7.849 8.113 9.761 7.863 7.808 7.984 8.394 8.042
RMSE = root mean squared error; S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information
criterion; L = robust lasso; A = robust adaptive lasso.
Table 8: Boston housing data - estimated coefficients and test RMSEs after outlier inclusion
Variables S.CV S.BIC S.L H.CV H.BIC H.L H.A B.CV B.BIC B.L B.A
(Intercept) 13.198 46.611 61.568 −15.959 12.883 −14.087 −12.600 −15.960 23.007 −14.195 −14.963
crim 0.000 0.000 −9.169 0.000 −0.603 0.070 0.000 0.000 −3.529 0.000 0.000
zn 0.000 0.000 0.027 0.007 0.000 0.020 0.020 0.015 0.000 0.024 0.024
indus 0.000 0.000 −0.481 −0.020 −0.278 −0.035 0.000 −0.020 −0.360 −0.045 0.000
chas 0.000 0.000 −4.543 0.312 0.000 0.327 0.000 0.357 −0.324 0.147 0.000
nox 0.000 0.000 −15.481 0.000 0.000 −0.714 −3.601 0.000 0.000 0.000 −1.502
rm 7.873 0.000 8.700 8.975 9.339 8.877 8.960 8.678 9.825 8.502 8.728
age 0.000 0.000 0.195 −0.042 0.000 −0.051 −0.050 −0.053 0.008 −0.062 −0.063
dis 0.000 0.000 0.398 −0.610 0.000 −0.892 −0.921 −0.746 0.000 −0.985 −0.960
rad 0.000 0.000 1.882 0.112 1.214 0.260 0.284 0.189 1.446 0.301 0.321
tax 0.000 0.000 0.084 −0.007 0.036 −0.010 −0.010 −0.010 0.055 −0.013 −0.013
ptratio −0.675 0.000 −3.067 −0.652 −1.700 −0.645 −0.692 −0.558 −2.091 −0.536 −0.563
black 0.000 0.000 −0.101 0.010 −0.007 0.012 0.012 0.013 −0.038 0.015 0.014
lstat −0.492 0.000 −1.028 −0.129 −0.722 −0.117 −0.102 −0.085 −0.699 −0.062 −0.044
Test RMSE 23.183 29.441 98.664 7.885 38.727 8.608 8.407 7.798 42.804 8.038 8.249
RMSE = root mean squared error; S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information
criterion; L = robust lasso; A = robust adaptive lasso.
Tables 7 and 8 present the coefficient estimates and test RMSEs from fitting the training dataset without and with outliers, respectively. Table 7, for the case without outliers, shows that all methods produce similar coefficient estimates and test RMSEs. However, after outlier inclusion, the square loss deteriorates severely, while the Huber and biweight losses are robust against the outliers. Interestingly, we observe that BIC is not stable in the presence of outliers in this example.
4. Concluding remarks
In this study we limit this idea to robust regression; however, it can be further applied to various penalized predictive models. Our approach becomes especially valuable when CV is not appropriate. For example, in classification problems, the training data may suffer from mislabeled class labels, called label noise (Lee et al., 2016). In such a case, a hold-out procedure like CV is not successful because the validation set also contains label noise, and a cross-validation score computed from a validation set having label noise is not reliable for model selection. One can extend the idea in this study to such complicated problems involving label noise, where CV is not available. We leave this direction for future research.
References
Bishop CM (2006). Pattern Recognition and Machine Learning, Springer, New York.
Buntine WL and Weigend AS (1991). Bayesian back-propagation, Complex Systems, 5, 603–643.
Chang L, Roberts S, and Welsh A (2017). Robust lasso regression using Tukey’s biweight criterion,
Technometrics, from: https://dx.doi.org/10.1080/00401706.2017.1305299
El Ghaoui L and Lebret H (1997). Robust solutions to least-squares problems with uncertain data,
SIAM Journal on Matrix Analysis and Applications, 18, 1035–1064.
Figueiredo MAT (2003). Adaptive sparseness for supervised learning, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 25, 1150-1159.
Hastie T, Tibshirani R, and Friedman JH (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York.
Koenker R (2005). Quantile Regression, Cambridge University Press, Cambridge.
Lambert-Lacroix S and Zwald L (2011). Robust regression through the Huber’s criterion and adaptive
lasso penalty, Electronic Journal of Statistics, 5, 1015–1053.
Lee A, Caron F, Doucet A, and Holmes C (2010). A hierarchical Bayesian framework for constructing
sparsity-inducing priors (Technical report), University of Oxford, Oxford.
Lee S, Shin H, and Lee SH (2016). Label-noise resistant logistic regression for functional data classi-
fication with an application to Alzheimer’s disease study, Biometrics, 72, 1325–1335.
MacKay DJC (1995). Probable networks and plausible predictions: a review of practical Bayesian
methods for supervised neural networks, Network: Computation in Neural Systems, 6, 469–505.
Maronna RA, Martin RD, and Yohai VJ (2006). Robust Statistics: Theory and Methods, Wiley,
Chichester.
Murphy KP (2012). Machine Learning: A Probabilistic Perspective, The MIT Press, Cambridge.
Owen AB (2006). A robust hybrid of lasso and ridge regression (Technical report), Stanford Univer-
sity, Stanford.
Park T and Casella G (2008). The Bayesian lasso, Journal of the American Statistical Association,
103, 681–686.
Tipping ME (2001). Sparse Bayesian learning and the relevance vector machine, Journal of Machine
Learning Research, 1, 211–244.
West M (1987). On scale mixtures of normal distributions, Biometrika, 74, 646–648.
Zou H (2006). The adaptive lasso and its oracle properties, Journal of the American Statistical Asso-
ciation, 101, 1418–1429.
Zou H and Li R (2008). One-step sparse estimates in nonconcave penalized likelihood models, The
Annals of Statistics, 36, 1509–1533.
Received July 21, 2017; Revised September 20, 2017; Accepted October 14, 2017