Regularization After Marginal Learning for Ultra-High Dimensional Regression Models

Y. Feng and M. Yu
Department of Statistics, Columbia University, New York, NY 10027, USA
e-mail: [email protected]
1 Introduction
With the explosion of information and the vast improvement in computation speed, we are able to collect large amounts of data in the form of a large collection of $n$ observations and $p$ predictors, where $p \gg n$. Recently, model selection has gained increasing attention, especially for ultra-high dimensional regression problems. Theoretically, the accuracy and interpretability of the selected model are crucial in variable selection. Practically, algorithmic feasibility and efficiency are vital in applications.
A great variety of penalized methods have been proposed in recent years. Regularization techniques for simultaneous variable selection and estimation are particularly useful for obtaining sparse models compared to simply applying traditional criteria such as Akaike's information criterion [1] and the Bayesian information criterion [18]. The least absolute shrinkage and selection operator (Lasso) [19] has been widely used, as the $l_1$ penalty shrinks most coefficients to 0 and thus fulfills the task of variable selection. Many other regularization methods have been developed, including bridge regression [13], the smoothly clipped absolute deviation method [5], the elastic net [26], the adaptive Lasso [25], and LAMP [11], among others. Asymptotic analysis of sign consistency in model selection [20, 24] has been developed to provide theoretical support for various methods. Other results, such as parameter estimation [17], prediction [15], and oracle properties [5], have been established under different model contexts.
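To make the role of the $l_1$ penalty concrete, here is a minimal sketch, not taken from the chapter, of Lasso-based variable selection on simulated data; the dimensions, the five nonzero coefficients, and the penalty level are illustrative assumptions only.

```python
# Illustrative sketch: the l1 penalty sets most estimated coefficients exactly
# to zero, so the fitted model performs variable selection. All numbers are
# made-up assumptions for demonstration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200                                # p > n
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]         # only five true signals
y = X @ beta + rng.standard_normal(n)

fit = Lasso(alpha=0.2).fit(X, y)               # alpha plays the role of the tuning parameter
print("number of nonzero coefficients:", np.sum(fit.coef_ != 0))
print("selected indices:", np.flatnonzero(fit.coef_))
```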
However, in ultra-high dimensional space where the dimension $p = \exp(n^a)$ (where $a > 0$), the conditions for sign consistency are easily violated as a consequence of large correlations among variables. To deal with such challenges, Fan
and Lv [6] proposed the sure independence screening (SIS) method which is based
on correlation learning to screen out irrelevant variables efficiently. Further analysis
and generalization can be found in Fan and Song [7] and Fan et al. [8]. From the
idea of retaining important variables rather than screening out irrelevant variables,
Weng et al. [21] proposed the regularization after retention (RAR) method. The
major differences between SIS and RAR can be summarized as follows. SIS makes
use of the marginal correlations between the variables and the response to screen out noise variables, while RAR tries to retain signals based on these marginal coefficients. Both of them
relax the irrepresentable-type conditions [20] and achieve sign consistency.
In this paper, we introduce a general multi-step estimation framework that integrates the ideas of screening and retention: in the first step, the importance of the features is learned from marginal information, and regularization is then imposed using the corresponding weights. The main contribution
of the paper is two-fold. First, the new framework is able to utilize the marginal
information adaptively in two different directions, which will relax the conditions
for sign consistency. Second, the idea of the framework is very general and covers
the one-step regularization methods, the regularization after screening method, and
the regularization after retention method as special cases.
The rest of this paper is organized as follows. In Sect. 2, we introduce the model
setup and the relevant techniques. The new variable selection framework is elabo-
rated in Sect. 3 with connections to existing methods explained. Section 4 develops
the sign consistency result for the proposed estimators. Extensive simulations are
conducted in Sect. 5 to compare the performance of the new method with the
existing approaches. We conclude with a short discussion in Sect. 6. All the technical
proofs are relegated to the appendix.
Let $(X_i, Y_i)$ be i.i.d. random pairs following the linear regression model:
\[
Y_i = X_i^T \beta + \varepsilon_i, \quad i = 1, \ldots, n,
\]
where $\beta = (\beta_1, \ldots, \beta_p)^T$ is the true coefficient vector, $\varepsilon_1, \ldots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, and $\{X_i\}_{i=1}^n$ are independent of $\{\varepsilon_i\}_{i=1}^n$. Note that we sometimes use $p_n$ to emphasize that the dimension $p$ is diverging with the sample size $n$. Denote the support index set of $\beta$ by $S = \{ j : \beta_j \neq 0 \}$ and the cardinality of $S$ by $s_n$, and let $\Sigma_{S^c|S} = \Sigma_{S^cS^c} - \Sigma_{S^cS} (\Sigma_{SS})^{-1} \Sigma_{SS^c}$. Both $p_n$ and $s_n$ are allowed to increase as $n$ increases. For conciseness, we sometimes use signals and noises to refer to the relevant predictors $S$ and the irrelevant predictors $S^c$ (or their corresponding coefficients), respectively.
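The quantity $\Sigma_{S^c|S}$ is the Schur complement of $\Sigma_{SS}$ in $\Sigma$; a small numerical sketch, not from the chapter and using a made-up AR(1) covariance and support set, shows how it can be computed.

```python
# Sketch: compute Sigma_{S^c|S} = Sigma_{S^cS^c} - Sigma_{S^cS} (Sigma_{SS})^{-1} Sigma_{SS^c}
# for an illustrative AR(1) covariance matrix and support set S.
import numpy as np

p, rho = 6, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) covariance
S = np.array([0, 1])                        # indices of the signals (illustrative)
Sc = np.setdiff1d(np.arange(p), S)          # indices of the noises

Sigma_ScS = Sigma[np.ix_(Sc, S)]
Sigma_Sc_given_S = (Sigma[np.ix_(Sc, Sc)]
                    - Sigma_ScS @ np.linalg.solve(Sigma[np.ix_(S, S)], Sigma_ScS.T))
print(Sigma_Sc_given_S)
```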
For any set $A$, let $A^c$ be its complement. For any $k$-dimensional vector $w$ and any subset $K \subseteq \{1, \ldots, k\}$, $w_K$ denotes the subvector of $w$ indexed by $K$, and let $\|w\|_1 = \sum_{i=1}^{k} |w_i|$, $\|w\|_2 = (\sum_{i=1}^{k} w_i^2)^{1/2}$, and $\|w\|_\infty = \max_{i=1,\ldots,k} |w_i|$. For any $k_1 \times k_2$ matrix $M$ and any subsets $K_1 \subseteq \{1, \ldots, k_1\}$, $K_2 \subseteq \{1, \ldots, k_2\}$, $M_{K_1 K_2}$ represents the submatrix of $M$ consisting of the entries indexed by the Cartesian product $K_1 \times K_2$. Let $M_{K_2}$ be the columns of $M$ indexed by $K_2$ and $M_j$ be the $j$-th column of $M$. Denote $\|M\|_2 = \{\Lambda_{\max}(M^T M)\}^{1/2}$ and $\|M\|_\infty = \max_{i=1,\ldots,k_1} \sum_{j=1}^{k_2} |M_{ij}|$. When $k_1 = k_2 = k$, write $\max_{i=1,\ldots,k} M_{ii}$ for the largest diagonal entry of $M$, and let $\Lambda_{\min}(M)$ and $\Lambda_{\max}(M)$ be the minimum and maximum eigenvalues of $M$, respectively.
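As a quick illustration, not from the chapter and using an arbitrary symmetric matrix, these matrix quantities can be computed directly.

```python
# Illustrative computation of the matrix quantities defined above:
# the spectral norm ||M||_2, the maximum absolute row sum ||M||_inf,
# the largest diagonal entry, and the extreme eigenvalues of a symmetric M.
import numpy as np

M = np.array([[2.0, 0.5, 0.1],
              [0.5, 1.5, 0.3],
              [0.1, 0.3, 1.0]])

spectral_norm = np.sqrt(np.max(np.linalg.eigvalsh(M.T @ M)))   # ||M||_2
inf_norm = np.max(np.abs(M).sum(axis=1))                       # ||M||_inf
max_diag = np.max(np.diag(M))                                  # largest diagonal entry
lam_min, lam_max = np.linalg.eigvalsh(M)[[0, -1]]              # Lambda_min(M), Lambda_max(M)
print(spectral_norm, inf_norm, max_diag, lam_min, lam_max)
```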
The work in [20] identified precise conditions for achieving sparsity recovery with a family of regularization parameters $\lambda_n$ under a deterministic design.
Another effective approach to the penalization problem is the adaptive Lasso (AdaLasso) [25], which uses an adaptively weighted $l_1$-penalty term and is defined as
\[
\hat\beta = \arg\min_{\beta} \left\{ (2n)^{-1} \sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j=1}^{p_n} \omega_j |\beta_j| \right\}, \quad \lambda_n \geq 0. \tag{2}
\]
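A common way to solve (2) in practice is to reduce it to an ordinary Lasso by rescaling the columns; the sketch below is not from the chapter and assumes the weights are built from some pilot estimate, with the penalty level supplied by the user.

```python
# Hedged sketch of the adaptive Lasso in (2): divide column j by omega_j,
# solve an ordinary Lasso, and scale the coefficients back. The weight
# construction (inverse absolute pilot coefficients) is an illustrative choice.
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, pilot_coef, eps=1e-6):
    omega = 1.0 / (np.abs(pilot_coef) + eps)    # adaptive weights omega_j
    X_scaled = X / omega                        # column j of X divided by omega_j
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X_scaled, y)
    return fit.coef_ / omega                    # map back to the original parameterization
```

In the $p \gg n$ setting, the pilot coefficients are typically taken from a marginal or Lasso fit rather than from ordinary least squares.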
To reduce the dimension from ultra-high to a moderate level, Fan and Lv [6] proposed the sure independence screening (SIS) method, which makes use of marginal correlations as a measure of importance in the first step and then utilizes other operators, such as the Lasso, to fulfill the task of variable selection. In particular, we first calculate the component-wise regression coefficient for each variable, i.e., $\hat\beta_j^M = (\tilde{X}^j)^T \tilde{Y}$, $j = 1, \ldots, p_n$, where $\tilde{X}^j$ is the standardized $j$-th column of the data $X$ and $\tilde{Y}$ is the standardized response. Second, we define a sub-model $\mathcal{M}$ consisting of the variables whose coefficients $|\hat\beta_j^M|$ are among the $d_n$ largest, for a prespecified sub-model size $d_n$.
Predictors that are not in $\mathcal{M}$ are regarded as noise and are therefore discarded from further analysis. SIS reduces the number of candidate covariates to a moderate level for the subsequent analysis. Combining SIS and the Lasso, Fan and Lv [6] introduced the SIS-Lasso estimator,
\[
\begin{aligned}
\hat\beta &= \arg\min_{\beta:\, \beta_{\mathcal{M}^c} = 0} \left\{ (2n)^{-1} \sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j \in \mathcal{M}} |\beta_j| \right\} \\
&= \arg\min_{\beta} \left\{ (2n)^{-1} \sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j \in \mathcal{M}} |\beta_j| + \infty \cdot \sum_{j \in \mathcal{M}^c} |\beta_j| \right\}. \tag{3}
\end{aligned}
\]
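A hedged end-to-end sketch of this two-stage procedure, not taken from the chapter, is given below: marginal screening to obtain $\mathcal{M}$, followed by a Lasso fit restricted to the screened columns. The function names, the sub-model size d, and the penalty level are illustrative assumptions.

```python
# Sketch of SIS followed by Lasso as in (3): rank the variables by the absolute
# standardized marginal coefficients, keep the top d, fit the Lasso on those
# columns only, and leave the coefficients outside M at exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

def sis_screen(X, y, d):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each predictor
    ys = (y - y.mean()) / y.std()                 # standardize the response
    beta_marginal = Xs.T @ ys                     # component-wise regression coefficients
    return np.sort(np.argsort(-np.abs(beta_marginal))[:d])   # sub-model M

def sis_lasso(X, y, d, lam):
    M = sis_screen(X, y, d)                       # screening step
    fit = Lasso(alpha=lam).fit(X[:, M], y)        # Lasso restricted to the screened set
    beta = np.zeros(X.shape[1])
    beta[M] = fit.coef_                           # variables outside M remain zero
    return beta
```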
Clearly, the sub-model size $d_n$ should be chosen carefully to avoid screening out signals. To deal with
the issue that signals may be marginally uncorrelated with the response in some
cases, iterative-SIS was introduced [6] as a practical procedure but without rigorous
theoretical support for the sign consistency. As a result, solely relying on marginal
information is sometimes a bit too risky, or too greedy, for model selection purposes.
3.1 Algorithm
As seen in Sect. 2, one potential drawback shared by AdaLasso and SIS-Lasso is that they may miss important covariates that are marginally only weakly correlated with the response.
Now, we introduce a new algorithm, regularization after marginal (RAM) learning, to address this issue. It utilizes marginal correlations to divide all variables into three candidate sets: a retention set, a noise set, and an undetermined set. Regularization is then imposed to find signals in the undetermined set as well as to identify falsely retained signals and falsely screened noises.
A detailed description of the algorithm is as follows:
Step 0 (Marginal Learning) Calculate the marginal regression coefficients after standardizing each predictor, i.e.,
\[
\hat\beta_j^M = \sum_{i=1}^{n} \frac{X_i^j - \bar{X}^j}{\hat\sigma_j} \, Y_i, \quad 1 \leq j \leq p_n, \tag{4}
\]
where $\bar{X}^j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$ and $\hat\sigma_j = \sqrt{\frac{\sum_{i=1}^{n} (X_i^j - \bar{X}^j)^2}{n-1}}$.
Define a retention set by $\hat{R} = \{ 1 \leq j \leq p : |\hat\beta_j^M| \geq \gamma_n \}$, for a positive constant $\gamma_n$; a noise set by $\hat{N} = \{ 1 \leq j \leq p : |\hat\beta_j^M| \leq \tilde\gamma_n \}$, for a positive constant $\tilde\gamma_n < \gamma_n$; and an undetermined set by $\hat{U} = (\hat{R} \cup \hat{N})^c$.
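The partition in Step 0 can be written down directly; in the sketch below, not from the chapter, the two thresholds are placeholder tuning parameters.

```python
# Sketch of Step 0 of RAM learning: split the variables into a retention set
# (large |beta_j^M|), a noise set (small |beta_j^M|), and an undetermined set
# (everything in between). Threshold values are illustrative placeholders.
import numpy as np

def ram_partition(beta_marginal, gamma_n, gamma_n_tilde):
    abs_b = np.abs(beta_marginal)
    R_hat = np.flatnonzero(abs_b >= gamma_n)                             # retained signals
    N_hat = np.flatnonzero(abs_b <= gamma_n_tilde)                       # screened-out noise
    U_hat = np.flatnonzero((abs_b > gamma_n_tilde) & (abs_b < gamma_n))  # undetermined
    return R_hat, N_hat, U_hat
```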
where $\hat{U}_1$ denotes the set of variables in $\hat{U}$ that are estimated as signals, namely $\hat{U}_1 = \{ j \in \hat{U} : (\hat\beta_{\hat{R},\hat{U}})_j \neq 0 \}$. After Step 1, the selected variable set is $\hat{R} \cup \hat{U}_1$.