
Big and Complex Data Analysis: Methodologies and Applications


Preface

This book comprises a collection of research contributions toward high-dimensional
data analysis. In this data-centric world, we are often challenged with data sets
containing many predictors in the model at hand. In a host of situations, the number
of predictors may very well exceed the sample size. Truly, many modern scientific
investigations require the analysis of such data. There are a host of buzzwords
in today’s data-centric world, especially in digital and print media. We encounter
data in every walk of life, and for analytically and objectively minded people,
data is everything. However, making sense of the data and extracting meaningful
information from it may not be an easy task. Sometimes, we come across buzzwords
such as big data, high-dimensional data, data visualization, data science, and open
data without a proper definition of such words. The rapid growth in the size and
scope of data sets in a host of disciplines has created a need for innovative statistical
and computational strategies for analyzing such data. A variety of statistical and
computational tools are needed to deal with such data and to reveal the data
story.
This book focuses on variable selection, parameter estimation, and prediction
based on high-dimensional data (HDD). In the classical regression context, HDD
refers to data where the number of predictors (d) is larger than the sample size (n).
There are situations when the number of predictors is in the millions and the sample
size may be in the hundreds. The modeling of HDD, where the sample size is much smaller than
the size of the data element associated with each observation, is an important
feature in a host of research fields such as social media, bioinformatics, medical,
environmental, engineering, and financial studies, among others. A number of the
classical techniques are available when d < n to tell the data story. However, the
existing classical strategies are not capable of yielding solutions for HDD. On the
other hand, the term “big data” is not very well defined, but its problems are real
and statisticians need to play a vital role in this data world. Generally speaking,
the term applies when the data are very large and may not even be stored in one
place. However, the relationship between n and d may not be as crucial as it is for
HDD. Further, in some cases, users cannot distinguish between population and
sampled data when dealing with big data. In any event, big data
or data science is an emerging field stemming equally from research enterprise
and public and private sectors. Undoubtedly, big data is the future of research in
a host of research fields, and transdisciplinary programs are required to develop the
skills for data scientists. For example, many private and public agencies are using
sophisticated number-crunching, data mining, or big data analytics to reveal patterns
based on collected information. Clearly, there is an increasing demand for efficient
prediction strategies for analyzing such data. Some examples of big data that have
prompted demand are gene expression arrays; social network modeling; clinical,
genetics, and phenotypic spatiotemporal data; and many others.
In the context of regression models, due to the trade-off between model prediction
and model complexity, model selection is an extremely important and
challenging problem in the big data arena. Over the past two decades, many penal-
ized regularization approaches have been developed to perform variable selection
and estimation simultaneously. This book makes a seminal contribution in the arena
of big data analysis, including HDD. For smooth reading and understanding of the
contributions made in this book, it is divided into three parts as follows:
General High-dimensional theory and methods (chapters “Regularization
After Marginal Learning for Ultra-High Dimensional Regression Models”–
“Bias-Reduced Moment Estimators of Population Spectral Distribution and Their
Applications”)
Network analysis and big data (chapters “Statistical Process Control Charts
as a Tool for Analyzing Big Data”–“Nonparametric Testing for Heterogeneous
Correlation”)
Statistics learning and applications (chapters “Optimal Shrinkage Estimation
in Heteroscedastic Hierarchical Linear Models”–“A Mixture of Variance-Gamma
Factor Analyzers”)
We anticipate that the chapters published in this book will represent a meaningful
contribution to the development of new ideas in big data analysis and will
showcase interesting applications. In a sense, each chapter is self-contained. A brief
description of the contents of each of the eighteen chapters in this book is provided.
Chapter “Regularization After Marginal Learning for Ultra-High Dimensional
Regression Models” (Feng) introduces a general framework for variable selection
in ultrahigh-dimensional regression models. By combining the idea of marginal
screening and retention, the framework can achieve sign consistency and is
extremely fast to implement.
In chapter “Empirical Likelihood Test for High Dimensional Generalized Lin-
ear Models” (Zang et al.), the estimation and model selection aspects of high-
dimensional data analysis are considered. It focuses on the inference aspect, which
can provide complementary insights to the estimation studies, and has at least two
notable contributions. The first is the investigation of both full and partial tests,
and the second is the utilization of the empirical likelihood technique under high-
dimensional settings.
Random projections are frequently used for dimension reduction in
many areas of machine learning, as they enable us to do computations on a
more succinct representation of the data. Random projections can be applied row-
and column-wise to the data, compressing samples and compressing features,
respectively. Chapter “Random Projections For Large-Scale Regression” (Thanei
et al.) discusses the properties of the latter column-wise compression, which turn
out to be very similar to the properties of ridge regression. It is pointed out that
further improvements in accuracy can be achieved by averaging over least squares
estimates generated by independent random projections.
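To make the column-wise compression concrete, here is a minimal numerical sketch (our own toy example; all names, dimensions, and data are illustrative assumptions, not material from the chapter):

```python
import numpy as np

def compressed_lstsq(X, y, q, rng):
    """Fit least squares after compressing the p features of X down to q
    random linear combinations (column-wise compression), then map the
    coefficients back to the original p-dimensional space."""
    P = rng.standard_normal((X.shape[1], q)) / np.sqrt(q)  # random projection
    g, *_ = np.linalg.lstsq(X @ P, y, rcond=None)          # fit in R^q
    return P @ g                                           # coefficients in R^p

# Toy data: n = 100 samples, p = 50 features, 3 of them true signals.
rng = np.random.default_rng(1)
n, p, q = 100, 50, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 1.5]
y = X @ beta + 0.1 * rng.standard_normal(n)

# Averaging the mapped-back estimates over independent projections reduces
# the variance of the estimator, as noted above.
beta_avg = np.mean([compressed_lstsq(X, y, q, rng) for _ in range(50)], axis=0)
```

Each single compressed fit behaves like a roughly ridge-type estimator, and the averaged estimator sketches the accuracy improvement from combining independent projections.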
Testing a hypothesis subsequent to model selection leads to test problems
in which nuisance parameters are present. Chapter “Testing in the Presence of
Nuisance Parameters: Some Comments on Tests Post-Model-Selection and Random
Critical Values” (Leeb and Pötscher) reviews and critically evaluates proposals that
have been suggested in the literature to deal with such problems. In particular,
the chapter reviews a procedure based on the worst-case critical value, a more
sophisticated proposal based on earlier work, and recent proposals from the econo-
metrics literature. It is furthermore discussed why intuitively appealing proposals,
for example, a parametric bootstrap procedure, as well as another recently suggested
procedure, do not lead to valid tests, not even asymptotically.
In contrast to the extensive research on covariate measurement error, error in
the response has received much less attention. In particular, systematic studies on
general clustered/longitudinal data with response error do not seem to be available.
Chapter “Analysis of Correlated Data with Error-Prone Response Under Gener-
alized Linear Mixed Models” (Yi et al.) considers this important problem and
investigates the asymptotic bias induced by the error in response. Valid inference
procedures are developed to account for response error effects under different
situations, and asymptotic results are appropriately established.
Statistical inference on large covariance matrices has become a fast growing
research area due to the wide availability of high-dimensional data, and spec-
tral distributions of large covariance matrices play an important role. Chapter
“Bias-Reduced Moment Estimators of Population Spectral Distribution and Their
Applications” (Qin and Li) derives bias-reduced moment estimators for the popula-
tion spectral distribution of large covariance matrices and presents consistency and
asymptotic normality of these estimators.
Big data often take the form of data streams with observations of a related
process being collected sequentially over time. Statistical process control (SPC)
charts provide a major statistical tool for monitoring the longitudinal performance of
the process through online detection of any distributional changes in the sequential
process observations. As such, SPC charts can be a useful tool for analyzing big data.
Chapter “Statistical Process Control Charts as a Tool for Analyzing Big Data” (Qiu)
introduces some basic SPC concepts and methods and demonstrates the use of SPC
charts for analyzing certain real big data sets. This chapter also describes some
recent SPC methodologies that have a great potential for handling different big data
applications. These methods include the disease dynamic screening system and some
recent profile monitoring methods for the online monitoring of profile/image data,
which is commonly used in modern manufacturing industries.
Chapter “Fast Community Detection in Complex Networks with a K-Depths
Classifier” (Tian and Gel) introduces a notion of data depth for recovery of
community structures in large complex networks. The authors propose a new
data-driven algorithm, K-depths, for community detection using the L1 depth
in an unsupervised setting. Further, they evaluate finite sample properties of
the K-depths method using synthetic networks and illustrate its performance for
tracking communities in online social media platform Flickr. The new method
significantly outperforms the classical K-means and yields comparable results to
the regularized K-means. Being robust to low-degree vertices, the new K-depths
method is computationally efficient, requiring up to 400 times less CPU time than
the currently adopted regularization procedures based on optimizing the Davis-
Kahan bound.
Chapter “How Different are Estimated Genetic Networks of Cancer Subtypes?”
(Shojaie and Sedaghat) presents a comprehensive comparison of estimated networks
of cancer subtypes. Specifically, the networks estimated using six estimation
methods were compared based on various network descriptors characterizing both
local network structures, that is, edges, and global properties, such as energy
and symmetry. This investigation revealed two particularly interesting properties
of estimated gene networks across different cancer subtypes. First, the estimates
from the six network reconstruction methods can be grouped into two seemingly
unrelated clusters, with clusters that include methods based on linear and nonlinear
associations, as well as methods based on marginal and conditional associations.
Further, while the local structures of estimated networks are significantly different
across cancer subtypes, global properties of estimated networks are less distinct.
These findings can guide future research in computational and statistical methods
for differential network analysis.
Statistical analysis of big clustered time-to-event data presents daunting sta-
tistical challenges as well as exciting opportunities. One of the challenges in
working with big biomedical data is detecting the associations between disease
outcomes and risk factors that involve complex functional forms. Many existing
statistical methods fail in large-scale settings because of lack of computational
power, as, for example, the computation and inversion of the Hessian matrix of
the log-partial likelihood is very expensive and may exceed computation memory.
Chapter “A Computationally Efficient Approach for Modeling Complex and Big
Survival Data” (He et al.) handles problems with a large number of parameters
and proposes a novel algorithm, which combines the strengths of quasi-Newton,
the MM algorithm, and coordinate descent. The proposed algorithm improves upon
the traditional semiparametric frailty models in several aspects. For instance, the
proposed algorithm avoids calculation of high-dimensional second derivatives of the
log-partial likelihood and, hence, is competitive in terms of computation speed and
memory usage. Simplicity is obtained by separating the variables of the optimization
problem. The proposed methods also provide a useful tool for modeling complex
data structures such as time-varying effects.
Asymptotic inference for the concentration of directional data has attracted
much attention in the past decades. Most of the asymptotic results related to
concentration parameters have been obtained in the traditional large sample size
and fixed dimension case. Chapter “Tests of Concentration for Low-Dimensional
and High-Dimensional Directional Data” (Cutting et al.) considers the extension of
existing testing procedures for concentration to the large n and large d case. In this
high-dimensional setup, the authors provide tests that remain valid in the sense that
they reach the correct asymptotic level within the class of rotationally symmetric
distributions.
“Nonparametric Testing for Heterogeneous Correlation” covers the big data
problem of determining whether a weak overall monotone association between two
variables persists throughout the population or is driven by a strong association
that is limited to a subpopulation. The idea of homogeneous association rests
on the underlying copula of the distribution. In chapter “Nonparametric Testing
for Heterogeneous Correlation” (Bamattre et al.), two copulas are considered,
the Gaussian and the Frank, under which components of two respective ranking
measures, Spearman’s footrule and Kendall’s tau, are shown to have tractable
distributions that lead to practical tests.
Shrinkage estimators have profound impacts in statistics and in scientific and
engineering applications. Chapter “Optimal Shrinkage Estimation in Heteroscedas-
tic Hierarchical Linear Models” (Kou and Yang) considers shrinkage estimation
in the presence of linear predictors. Two heteroscedastic hierarchical regression
models are formulated, and the study of optimal shrinkage estimators in each
model is thoroughly presented. A class of shrinkage estimators, both parametric
and semiparametric, based on unbiased risk estimate is proposed and is shown
to be (asymptotically) optimal under mean squared error loss in each model. A
simulation study is conducted to compare the performance of the proposed methods
with existing shrinkage estimators. The authors also apply the method to real data
and obtain encouraging and interesting results.
Chapter “High Dimensional Data Analysis: Integrating Submodels” (Ahmed
and Yuzbasi) considers efficient prediction strategies in sparse high-dimensional
model. In high-dimensional data settings, many penalized regularization strategies
are suggested for simultaneous variable selection and estimation. However, different
strategies yield a different submodel with different predictors and number of
predictors. Some procedures may select a submodel with a relatively larger number
of predictors than others. Due to the trade-off between model complexity and
model prediction accuracy, the statistical inference of model selection is extremely
important and challenging problem in high-dimensional data analysis. For this
reason, the authors suggest shrinkage and pretest post-estimation strategies to
improve the prediction performance of two selected submodels. Such a pretest and
shrinkage strategy is constructed by shrinking an overfitted model estimator in the
direction of an underfitted model estimator. The numerical studies indicate that
post-selection pretest and shrinkage strategies improve the prediction performance of selected
submodels. This chapter reveals many interesting results and opens doors for further
research in a host of research investigations.
Chapter “High-Dimensional Classification for Brain Decoding” (Croteau et al.)
discusses high-dimensional classification within the context of brain decoding
where spatiotemporal neuroimaging data are used to decode latent cognitive states.
The authors discuss several approaches for feature selection including persistent
homology, robust functional principal components analysis, and mutual information
networks. These features are incorporated into a multinomial logistic classifier, and
model estimation is based on penalized likelihood using the elastic net penalty.
The approaches are illustrated in an application where the task is to infer, from
brain activity measured with magnetoencephalography (MEG), the type of video
stimulus shown to a subject.
Principal components analysis is a widely used technique for dimension reduc-
tion and characterization of variability in multivariate populations. In chapter
“Unsupervised Bump Hunting Using Principal Components” (Díaz-Pachón et
al.), the authors’ interest lies in studying when and why the rotation to principal
components can be used effectively within a response-predictor set relationship in
the context of mode hunting. Specifically focusing on the Patient Rule Induction
Method (PRIM), the authors first develop a fast version of this algorithm (fastPRIM)
under normality which facilitates the theoretical studies to follow. Using basic geo-
metrical arguments, they then demonstrate how the principal components rotation of
the predictor space alone can in fact generate improved mode estimators. Simulation
results are used to illustrate findings.
The analysis of high-dimensional data is challenging in multiple aspects. One
aspect is interaction analysis, which is critical in biomedical and other studies.
Chapter “Identifying Gene-Environment Interactions Associated with Prognosis
Using Penalized Quantile Regression” (Wang et al.) studies high-dimensional
interactions using a robust approach. The effectiveness demonstrated in this study
opens doors for other robust methods under high-dimensional settings. This study
will also be practically useful by introducing a new way of analyzing genetic data.
In chapter “A Mixture of Variance-Gamma Factor Analyzers” (McNicholas et
al.), a mixture modeling approach for clustering high-dimensional data is developed.
This approach is based on a mixture of variance-gamma distributions, which
is interesting because the variance-gamma distribution has been underutilized in
multivariate statistics—certainly, it has received far less attention than the skew-t
distribution, which also parameterizes location, scale, concentration, and skewness.
Clustering is carried out using a mixture of variance-gamma factor analyzers
(MVGFA) model, which is an extension of the well-known mixture of factor
analyzers model that can accommodate clusters that are asymmetric and/or heavy
tailed. The formulation of the variance-gamma distribution used can be represented
as a normal mean variance mixture, a fact that is exploited in the development of
the associated factor analyzers.
In summary, several directions for innovative research in big data analysis were
highlighted in this book. I remain confident that this book conveys some of the
surprises, puzzles, and success stories in the arena of big data analysis. Research
in this arena will continue for the foreseeable future.
As an ending thought, I would like to thank all the authors who submitted their
papers for possible publication in this book as well as all the reviewers for their
valuable input and constructive comments on all submitted manuscripts. I would like
to express my special thanks to Veronika Rosteck at Springer for the encouragement
and generous support on this project and helping me to arrive at the finishing line.
My special thanks go to Ulrike Stricker-Komba at Springer for outstanding technical
support for the production of this book. Last but not least, I am thankful to my family
for their support for the completion of this book.

Niagara-On-The-Lake, Ontario, Canada S. Ejaz Ahmed


August 2016
Contents

Part I General High-Dimensional Theory and Methods

Regularization After Marginal Learning for Ultra-High
Dimensional Regression Models ..... 3
Yang Feng and Mengjia Yu

Empirical Likelihood Test for High Dimensional Generalized
Linear Models ..... 29
Yangguang Zang, Qingzhao Zhang, Sanguo Zhang, Qizhai Li,
and Shuangge Ma

Random Projections for Large-Scale Regression ..... 51
Gian-Andrea Thanei, Christina Heinze, and Nicolai Meinshausen

Testing in the Presence of Nuisance Parameters: Some
Comments on Tests Post-Model-Selection and Random Critical Values ..... 69
Hannes Leeb and Benedikt M. Pötscher

Analysis of Correlated Data with Error-Prone Response Under
Generalized Linear Mixed Models ..... 83
Grace Y. Yi, Zhijian Chen, and Changbao Wu

Bias-Reduced Moment Estimators of Population Spectral
Distribution and Their Applications ..... 103
Yingli Qin and Weiming Li

Part II Network Analysis and Big Data

Statistical Process Control Charts as a Tool for Analyzing Big Data ..... 123
Peihua Qiu

Fast Community Detection in Complex Networks
with a K-Depths Classifier ..... 139
Yahui Tian and Yulia R. Gel

How Different Are Estimated Genetic Networks of Cancer Subtypes? ..... 159
Ali Shojaie and Nafiseh Sedaghat

A Computationally Efficient Approach for Modeling Complex
and Big Survival Data ..... 193
Kevin He, Yanming Li, Qingyi Wei, and Yi Li

Tests of Concentration for Low-Dimensional
and High-Dimensional Directional Data ..... 209
Christine Cutting, Davy Paindaveine, and Thomas Verdebout

Nonparametric Testing for Heterogeneous Correlation ..... 229
Stephen Bamattre, Rex Hu, and Joseph S. Verducci

Part III Statistics Learning and Applications

Optimal Shrinkage Estimation in Heteroscedastic Hierarchical
Linear Models ..... 249
S.C. Kou and Justin J. Yang

High Dimensional Data Analysis: Integrating Submodels ..... 285
Syed Ejaz Ahmed and Bahadır Yüzbaşı

High-Dimensional Classification for Brain Decoding ..... 305
Nicole Croteau, Farouk S. Nathoo, Jiguo Cao, and Ryan Budney

Unsupervised Bump Hunting Using Principal Components ..... 325
Daniel A. Díaz-Pachón, Jean-Eudes Dazard, and J. Sunil Rao

Identifying Gene–Environment Interactions Associated with
Prognosis Using Penalized Quantile Regression ..... 347
Guohua Wang, Yinjun Zhao, Qingzhao Zhang, Yangguang Zang,
Sanguo Zhang, and Shuangge Ma

A Mixture of Variance-Gamma Factor Analyzers ..... 369
Sharon M. McNicholas, Paul D. McNicholas, and Ryan P. Browne
Part I
General High-Dimensional Theory and Methods
Regularization After Marginal Learning
for Ultra-High Dimensional Regression Models

Yang Feng and Mengjia Yu

Abstract Regularization is a popular variable selection technique for
high-dimensional regression models. However, under the ultra-high dimensional setting, a
direct application of the regularization methods tends to fail in terms of model
selection consistency due to the possible spurious correlations among predictors.
Motivated by the ideas of screening (Fan and Lv, J R Stat Soc Ser B Stat Methodol
70:849–911, 2008) and retention (Weng et al, Manuscript, 2013), we propose a
new two-step framework for variable selection, where in the first step, marginal
learning techniques are utilized to partition variables into different categories, and
the regularization methods can be applied afterwards. The technical conditions of
model selection consistency for this broad framework relax those for the one-step
regularization methods. Extensive simulations show the competitive performance of
the new method.

Keywords Independence screening • Lasso • Marginal learning • Retention •
Selection • Sign consistency

1 Introduction

With the booming of information and the vast improvement in computation speed,
we are able to collect large amounts of data in the form of large collections of n
observations and p predictors, where p ≫ n. Recently, model selection has gained
increasing attention, especially for ultra-high dimensional regression problems.
Theoretically, the accuracy and interpretability of the selected model are crucial in
variable selection. Practically, algorithm feasibility and efficiency are vital in
applications.
A great variety of penalized methods have been proposed in recent years. The
regularization techniques for simultaneous variable selection and estimation are
particularly useful for obtaining sparse models, compared to simply applying traditional
criteria such as Akaike’s information criterion [1] and the Bayesian information

Y. Feng • M. Yu
Department of Statistics, Columbia University, New York, NY 10027, USA
e-mail: [email protected]

© Springer International Publishing AG 2017
S.E. Ahmed (ed.), Big and Complex Data Analysis, Contributions to Statistics,
DOI 10.1007/978-3-319-41573-4_1

criterion [18]. The least absolute shrinkage and selection operator (Lasso) [19] has
been widely used, as the l1 penalty shrinks most coefficients to 0 and fulfills the
task of variable selection. Many other regularization methods have been developed;
including bridge regression [13], the smoothly clipped absolute deviation method
[5], the elastic net [26], adaptive Lasso [25], LAMP [11], among others. Asymptotic
analysis of sign consistency in model selection [20, 24] has been developed
to provide theoretical support for various methods. Other results, such as
parameter estimation [17], prediction [15], and oracle properties [5], have been
established under different model contexts.
However, in the ultra-high dimensional setting where the dimension p = exp(n^a)
(where a > 0), the conditions for sign consistency are easily violated as a
consequence of large correlations among variables. To deal with such challenges, Fan
and Lv [6] proposed the sure independence screening (SIS) method which is based
on correlation learning to screen out irrelevant variables efficiently. Further analysis
and generalization can be found in Fan and Song [7] and Fan et al. [8]. From the
idea of retaining important variables rather than screening out irrelevant variables,
Weng et al. [21] proposed the regularization after retention (RAR) method. The
major differences between SIS and RAR can be summarized as follows. SIS makes
use of marginal correlations between variables and response to screen noises out,
while RAR tries to retain signals after acquiring these coefficients. Both of them
relax the irrepresentable-type conditions [20] and achieve sign consistency.
In this paper, we introduce a general multi-step estimation framework that
integrates the ideas of screening and retention: marginal information is used to
learn the importance of the features in the first step, and regularization is then
imposed with corresponding weights. The main contribution
of the paper is two-fold. First, the new framework is able to utilize the marginal
information adaptively in two different directions, which will relax the conditions
for sign consistency. Second, the idea of the framework is very general and covers
the one-step regularization methods, the regularization after screening method, and
the regularization after retention method as special cases.
The rest of this paper is organized as follows. In Sect. 2, we introduce the model
setup and the relevant techniques. The new variable selection framework is elabo-
rated in Sect. 3 with connections to existing methods explained. Section 4 develops
the sign consistency result for the proposed estimators. Extensive simulations are
conducted in Sect. 5 to compare the performance of the new method with the
existing approaches. We conclude with a short discussion in Sect. 6. All the technical
proofs are relegated to the appendix.
Regularization After Marginal Learning for Ultra-High Dimensional Regression Models 5

2 Model Setup and Several Methods in Variable Selection

2.1 Model Setup and Notations

Let $(X_i, Y_i)$ be i.i.d. random pairs following the linear regression model:

$$Y_i = X_i^T \beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $X_i = (X_{i1}, \ldots, X_{ip})^T$ is a $p_n$-dimensional vector distributed as $N(0, \Sigma)$,
$\beta = (\beta_1, \ldots, \beta_p)^T$ is the true coefficient vector, $\varepsilon_1, \ldots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, and
$\{X_i\}_{i=1}^n$ are independent of $\{\varepsilon_i\}_{i=1}^n$. Note that we sometimes write $p_n$ to emphasize
that the dimension $p$ diverges with the sample size $n$. Denote the support index set of
$\beta$ by $S = \{\, j : \beta_j \neq 0 \,\}$, the cardinality of $S$ by $s_n$, and
$\Sigma_{S^c \mid S} = \Sigma_{S^c S^c} - \Sigma_{S^c S} (\Sigma_{SS})^{-1} \Sigma_{S S^c}$. Both $p_n$ and $s_n$ are allowed to increase
as $n$ increases. For conciseness, we sometimes use signals and noises to refer to the
relevant predictors $S$ and the irrelevant predictors $S^c$ (or their corresponding
coefficients), respectively.

For any set $A$, let $A^c$ be its complement. For any $k$-dimensional vector $w$
and any subset $K \subseteq \{1, \ldots, k\}$, $w_K$ denotes the subvector of $w$ indexed by $K$, and let
$\|w\|_1 = \sum_{i=1}^k |w_i|$, $\|w\|_2 = (\sum_{i=1}^k w_i^2)^{1/2}$, and $\|w\|_\infty = \max_{i=1,\ldots,k} |w_i|$. For any
$k_1 \times k_2$ matrix $M$ and any subsets $K_1 \subseteq \{1, \ldots, k_1\}$, $K_2 \subseteq \{1, \ldots, k_2\}$, $M_{K_1 K_2}$ represents
the submatrix of $M$ consisting of entries indexed by the Cartesian product $K_1 \times K_2$.
Let $M_{K_2}$ be the columns of $M$ indexed by $K_2$ and $M^j$ be the $j$-th column of $M$.
Denote $\|M\|_2 = \{\Lambda_{\max}(M^T M)\}^{1/2}$ and $\|M\|_\infty = \max_{i=1,\ldots,k_1} \sum_{j=1}^{k_2} |M_{ij}|$. When
$k_1 = k_2 = k$, let $\rho(M) = \max_{i=1,\ldots,k} M_{ii}$, and let $\Lambda_{\min}(M)$ and $\Lambda_{\max}(M)$ be the minimum
and maximum eigenvalues of $M$, respectively.
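The norm and eigenvalue notation above can be checked numerically; the toy values below are ours, not the chapter's:

```python
import numpy as np

# Toy check of the vector and matrix notation defined above.
w = np.array([3.0, -4.0, 1.0, 0.0])
K = [0, 1]                               # an index set K (0-based here)
w_K = w[K]                               # subvector w_K

norm1 = np.abs(w_K).sum()                # ||w_K||_1 = 3 + 4 = 7
norm2 = np.sqrt((w_K ** 2).sum())        # ||w_K||_2 = sqrt(9 + 16) = 5
norm_inf = np.abs(w_K).max()             # ||w_K||_inf = 4

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric 2 x 2 matrix
op_norm = np.sqrt(np.linalg.eigvalsh(M.T @ M).max())  # ||M||_2 = 3
row_sum_norm = np.abs(M).sum(axis=1).max()            # ||M||_inf = 3
lam_min = np.linalg.eigvalsh(M).min()                 # Lambda_min(M) = 1
lam_max = np.linalg.eigvalsh(M).max()                 # Lambda_max(M) = 3
```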

2.2 Regularization Techniques

The Lasso [19], defined as

$$\hat{\beta} = \arg\min_{\beta} \left\{ (2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j=1}^{p_n} |\beta_j| \right\}, \qquad \lambda_n \geq 0, \tag{1}$$

is a popular variable selection method. Thanks to the invention of efficient
algorithms, including LARS [4] and the coordinate descent algorithm [14], the Lasso and its
variants have been applied to a wide range of scenarios in this big data era. There
is a large body of research on the theoretical properties of the Lasso. Zhao and
Yu [24] gave almost necessary and sufficient conditions for the sign consistency of
the Lasso in selecting the true model in the large $p_n$ setting as $n$ increases. Considering the
sensitivity of the tuning parameter $\lambda_n$ and consistency for model selection, Wainwright

[20] identified precise conditions for achieving sparsity recovery with a family of regularization parameters $\lambda_n$ under deterministic design.
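As an illustration (not part of the original text), the Lasso objective in (1) can be minimized with the coordinate descent algorithm of [14]. The sketch below is a bare-bones numpy version with a fixed iteration budget in place of a proper convergence check.

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0), the proximal map of t * |.|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    """Coordinate descent for (2n)^{-1} ||Y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # x_j^T x_j / n
    resid = Y - X @ beta                   # current residual
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]     # drop the j-th contribution
            rho = X[:, j] @ resid / n      # marginal fit of the partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]     # restore the updated contribution
    return beta
```

Each coordinate update has a closed form because the loss is quadratic in $\beta_j$ with the other coordinates held fixed, which is what makes coordinate descent so effective for (1).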
Another effective approach to the penalization problem is the adaptive Lasso (AdaLasso) [25], which uses an adaptively weighted $l_1$-penalty term, defined as

$$\hat{\beta} = \arg\min_{\beta} \left\{ (2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j=1}^{p_n} \omega_j |\beta_j| \right\}, \quad \lambda_n \geq 0, \qquad (2)$$

where $\omega_j = 1/|\hat{\beta}_j^{init}|^{\gamma}$ for some $\gamma \geq 0$, in which $\hat{\beta}^{init}$ is some initial estimator. When signals are weakly correlated with noises, Huang et al. [16] proved that AdaLasso is sign consistent with $\omega_j = 1/|\hat{\beta}_j^M| = 1/|(\tilde{X}^j)^T \tilde{Y}|$, where $\tilde{X}$ is the centered and scaled data matrix. One potential issue with this weighting choice is that when the correlation between a signal and the response is too small, that signal is severely penalized and may be estimated as noise. We will use numerical examples to demonstrate this point in the simulation section.
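To make the weighting concrete (an illustrative sketch, not the authors' code), the weighted penalty in (2) reduces to a plain Lasso after rescaling column $j$ by $1/\omega_j$. Here the inner Lasso is solved by proximal gradient descent (ISTA), and the weights follow the marginal-coefficient choice of Huang et al. [16].

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=2000):
    """Proximal gradient (ISTA) for (2n)^{-1} ||Y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n            # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def adaptive_lasso(X, Y, lam):
    """AdaLasso with w_j = 1 / |marginal coefficient|, via the rescaling trick."""
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)              # centered and scaled columns
    Ys = (Y - Y.mean()) / Y.std()
    w = 1.0 / np.maximum(np.abs(Xs.T @ Ys / n), 1e-8)
    theta = lasso_ista(X / w, Y, lam)            # plain Lasso on rescaled design
    return theta / w                             # map back: beta_j = theta_j / w_j
```

Note how a signal that is weakly marginally correlated with the response receives a huge $\omega_j$ and is crushed toward zero, which is exactly the failure mode discussed above.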

2.3 Sure Independence Screening

To reduce the dimension from ultra-high to a moderate level, Fan and Lv [6] proposed the sure independence screening (SIS) method, which uses marginal correlations as a measure of importance in the first step and then applies other operators, such as the Lasso, to carry out variable selection. In particular, we first calculate the componentwise regression coefficient for each variable, i.e., $\hat{\beta}_j^M = (\tilde{X}^j)^T \tilde{Y}$, $j = 1, \ldots, p_n$, where $\tilde{X}^j$ is the standardized $j$-th column of the data matrix $X$ and $\tilde{Y}$ is the standardized response. Second, we define a sub-model with respect to the largest coefficients:

$$M_{\gamma} = \{1 \leq j \leq p_n : |\hat{\beta}_j^M| \text{ is among the first } \lfloor \gamma n \rfloor \text{ of all}\}.$$

Predictors that are not in $M_{\gamma}$ are regarded as noise and therefore discarded from further analysis. SIS reduces the number of candidate covariates to a moderate level for the subsequent analysis. Combining SIS and the Lasso, Fan and Lv [6] introduced the SIS-Lasso estimator

$$\hat{\beta} = \arg\min_{\beta_{M_{\gamma}^c} = 0} \left\{ (2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j \in M_{\gamma}} |\beta_j| \right\}$$
$$= \arg\min_{\beta} \left\{ (2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j \in M_{\gamma}} |\beta_j| + \infty \sum_{j \in M_{\gamma}^c} |\beta_j| \right\}. \qquad (3)$$

Clearly, $\gamma$ should be chosen carefully to avoid screening out signals. To deal with the issue that signals may be marginally uncorrelated with the response in some cases, iterative SIS was introduced [6] as a practical procedure, but without rigorous theoretical support for sign consistency. As a result, relying solely on marginal information is sometimes too risky, or greedy, for model selection purposes.
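A minimal sketch of the screening step (illustrative only; the `gamma` argument plays the role of the fraction $\gamma$ above):

```python
import numpy as np

def sis_screen(X, Y, gamma):
    """Keep the floor(gamma * n) predictors with the largest |marginal coefficient|."""
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)          # standardized columns, as in beta_j^M
    Ys = (Y - Y.mean()) / Y.std()
    score = np.abs(Xs.T @ Ys)                # |beta_j^M| up to a common factor
    d = min(int(gamma * n), p)
    keep = np.argsort(score)[::-1][:d]       # indices of the d largest scores
    return np.sort(keep)
```

A Lasso restricted to the returned index set then yields the SIS-Lasso estimator in (3).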

3 Regularization After Marginal Learning

3.1 Algorithm

From Sect. 2, one potential drawback shared by AdaLasso and SIS-Lasso is that they may miss important covariates that are only weakly marginally correlated with the response.
Now, we introduce a new algorithm, regularization after marginal (RAM) learning, to address this issue. It uses marginal correlations to divide all variables into three candidate sets: a retention set, a noise set, and an undetermined set. Regularization is then applied to find signals in the undetermined set, as well as to identify falsely retained signals and falsely screened noises.
A detailed description of the algorithm is as follows:
Step 0 (Marginal Learning) Calculate the marginal regression coefficients after standardizing each predictor, i.e.,

$$\hat{\beta}_j^M = \sum_{i=1}^n \frac{X_i^j - \bar{X}^j}{\hat{\sigma}_j} Y_i, \quad 1 \leq j \leq p_n, \qquad (4)$$

where $\bar{X}^j = n^{-1} \sum_{i=1}^n X_i^j$ and $\hat{\sigma}_j = \sqrt{\frac{\sum_{i=1}^n (X_i^j - \bar{X}^j)^2}{n-1}}$.
Define a retention set by $\hat{R} = \{1 \leq j \leq p : |\hat{\beta}_j^M| \geq \gamma_n\}$, for a positive constant $\gamma_n$; a noise set by $\hat{N} = \{1 \leq j \leq p : |\hat{\beta}_j^M| \leq \tilde{\gamma}_n\}$, for a positive constant $\tilde{\gamma}_n < \gamma_n$; and an undetermined set by $\hat{U} = (\hat{R} \cup \hat{N})^c$.

Step 1 (Regularization After Screening Noises Out) Search for signals in $\hat{U}$ by solving

$$\hat{\beta}_{\hat{R}, \hat{U}_1} = \arg\min_{\beta_{\hat{N}} = 0} \left\{ (2n)^{-1} \sum_{i=1}^n \Big( Y_i - \sum_{j \in \hat{U}} X_{ij} \beta_j - \sum_{k \in \hat{R}} X_{ik} \beta_k \Big)^2 + \lambda_n \sum_{j \in \hat{U}} |\beta_j| \right\}, \qquad (5)$$

where the index set $\hat{U}_1$ denotes the variables in $\hat{U}$ that are estimated as signals, namely $\hat{U}_1 = \{j \in \hat{U} : (\hat{\beta}_{\hat{R}, \hat{U}_1})_j \neq 0\}$. After Step 1, the selected variable set is $\hat{R} \cup \hat{U}_1$.
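The two steps above can be sketched as follows (an illustrative reading of the algorithm, not the authors' implementation; the thresholds `gamma_n`, `gamma_tilde_n` and the tuning parameter `lam` are taken as given):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ram_step0(X, Y, gamma_n, gamma_tilde_n):
    """Step 0: split predictors into retention (R), noise (N), undetermined (U)."""
    xbar, sd = X.mean(0), X.std(0, ddof=1)
    beta_marg = ((X - xbar) / sd).T @ Y      # marginal coefficients, Eq. (4)
    m = np.abs(beta_marg)
    R = np.where(m >= gamma_n)[0]
    N = np.where(m <= gamma_tilde_n)[0]
    U = np.setdiff1d(np.arange(X.shape[1]), np.union1d(R, N))
    return R, N, U

def ram_step1(X, Y, R, U, lam, n_iter=200):
    """Step 1: coordinate descent for Eq. (5); beta_N is fixed at zero,
    variables in R are unpenalized, variables in U carry the l1 penalty."""
    n, p = X.shape
    active = np.concatenate([R, U]).astype(int)
    pen = np.concatenate([np.zeros(len(R)), np.full(len(U), lam)])
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = Y.copy()                          # residual when beta = 0
    for _ in range(n_iter):
        for j, t in zip(active, pen):
            resid += X[:, j] * beta[j]
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, t) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta
```

After Step 1, the selected set is `R` together with the nonzero coordinates of `beta` on `U`.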
