
Local Learning Based Feature Selection for High Dimensional Data Analysis

Yijun Sun, Sinisa Todorovic, and Steve Goodison

• Y. Sun is with the Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL 32610. E-mail: [email protected]
• S. Goodison is with the Department of Surgery, University of Florida, Gainesville, FL 32610. E-mail: [email protected]
• S. Todorovic is with the School of EECS at Oregon State University, Corvallis, OR 97331. E-mail: [email protected]

Manuscript received 4 Oct. 2008; revised 19 Feb. 2009 and 7 July 2009; accepted 29 July 2009; published online xx. Recommended for acceptance by O. Chapelle.

Abstract—This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We
propose a new feature selection algorithm that addresses several major issues with prior work, including problems with algorithm
implementation, computational complexity and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear
problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin
framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making
any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on
a personal computer, while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features.
Theoretical analyses of the algorithm’s sample complexity suggest that the algorithm has a logarithmical sample complexity with
respect to the number of features. Experiments on eleven synthetic and real-world data sets demonstrate the viability of our formulation
of the feature selection problem for supervised learning and the effectiveness of our algorithm.

Index Terms—Feature selection, local learning, logistic regression, ℓ1 regularization, sample complexity.

1 INTRODUCTION

HIGH-throughput technologies now routinely produce large data sets characterized by unprecedented numbers of features. Accordingly, feature selection has become increasingly important in a wide range of scientific disciplines. In this paper, we consider feature selection for the purposes of data classification. An example of data classification tasks, where feature selection plays a critical role, is the use of oligonucleotide microarrays for the identification of cancer-associated gene expression profiles of diagnostic or prognostic value [1], [2], [50]. Typically, the number of samples is less than one hundred, while the number of features associated with the raw data is on the order of thousands or even tens of thousands. Amongst this enormous number of genes, only a small fraction is likely to be relevant for cancerous tumor growth and/or spread. The abundance of irrelevant features poses serious problems for existing machine learning algorithms, and represents one of the most recalcitrant problems for their applications in oncology and other scientific disciplines dealing with copious features. The performance of most classification algorithms suffers as the number of features becomes excessively large. It has recently been observed, for example, that even the support vector machine (SVM) [3] – one of the most advanced classifiers, believed to scale well with the increasing number of features – experiences a notable drop in accuracy when this number becomes sufficiently large [4], [14]. It has been proved by [5] that SVM has a worst-case sample complexity that grows at least linearly in the number of irrelevant features. In addition to defying the curse of dimensionality, eliminating irrelevant features can also reduce system complexity, the processing time of data analysis, and the cost of collecting irrelevant features. In some cases, feature selection can also provide significant insights into the nature of the problem under investigation.

Feature selection for high-dimensional data is considered one of the current challenges in statistical machine learning [7]. In this paper, we propose a new feature selection algorithm that addresses several major issues with existing methods, including their problems with algorithm implementation, computational complexity, and solution accuracy for high-dimensional data. The formulation of the proposed algorithm is based on a simple concept: a given complex problem can be more easily, yet accurately enough, analyzed by parsing it into a set of locally linear problems. Local learning allows one to capture the local structure of the data, while the parameter estimation is performed globally within the large margin framework to avoid possible overfitting. The new algorithm performs remarkably well in the presence of copious irrelevant features. A large-scale experiment conducted on eleven synthetic and real-world data sets demonstrates that the algorithm is capable of processing many thousands of features within minutes on a personal computer, while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. In one simulation study where we consider Fermat's spiral problem, our algorithm achieves a close-to-optimal solution even when the data contains one million irrelevant features.
We study the algorithm's properties from several different angles to explain why the proposed algorithm performs well in a high-dimensional space. We show that the algorithm can be regarded as finding a feature weight vector so that the upper bound of the leave-one-out cross-validation error of a nearest-neighbor classifier in the induced feature space is minimized. By using fixed point theory, we prove that under a mild condition the proposed algorithm converges to a solution as if it had perfect prior knowledge as to which features are relevant. We also conduct a theoretical analysis of the algorithm's sample complexity, which suggests that the algorithm has a logarithmical sample complexity with respect to the input data dimensionality. That is, the number of samples needed to maintain the same level of learning accuracy grows only logarithmically with respect to the feature dimensionality. This dependence is very weak, and matches the best known bounds proved in various feature selection contexts [5], [6]. Although logarithmical sample complexity is not new in the literature, it holds only for linear models, whereas in our algorithm no assumptions are made about the underlying data distribution.

In this paper, we also show that the aforementioned theoretical analysis of our feature selection algorithm may have an important implication in learning theory. Based on the proposed feature selection algorithm, we show that it is possible to derive a new classification algorithm with a generalization error bound that grows only logarithmically in the input data dimensionality for an arbitrary data distribution. This is a very encouraging result, considering that a large part of machine learning research is focused on developing learning algorithms that behave gracefully when faced with the curse of dimensionality.

This paper is organized as follows. Sec. 2 reviews prior work, focusing on the main problems with existing methods that are addressed by our algorithm. The newly proposed feature selection algorithm is described in Sec. 3. This section also presents the convergence analysis of our algorithm in Sec. 3.1, its computational complexity in Sec. 3.2, and its extension to multiclass problems in Sec. 3.3. Experimental evaluation is presented in Sec. 4. The algorithm's sample complexity is discussed in Sec. 5, after which we conclude the paper by tracing back the origins of our work and pointing out major differences and improvements made here as compared to other well-known algorithms.

2 LITERATURE REVIEW

Research on feature selection has been very active in the past decade [8], [10], [11], [12], [14], [18], [19]. This section gives a brief review of existing algorithms and discusses some major issues with prior work that are addressed by our algorithm. The interested reader may refer to [8] and [9] for more details.

Existing algorithms are traditionally categorized as wrapper or filter methods, with respect to the criteria used to search for relevant features [10]. In wrapper methods, a classification algorithm is employed to evaluate the goodness of a selected feature subset, whereas in filter methods criterion functions evaluate feature subsets by their information content, typically interclass distance (e.g., Fisher score) or statistical measures (e.g., the p-value of a t-test), instead of optimizing the performance of any specific learning algorithm directly. Hence, filter methods are computationally much more efficient, but usually do not perform as well as wrapper methods.

One major issue with wrapper methods is their high computational complexity due to the need to train a large number of classifiers. Many heuristic algorithms (e.g., forward and backward selection [11]) have been proposed to alleviate this issue. However, due to their heuristic nature, none of them can provide any guarantee of optimality. With tens of thousands of features, which is the case in gene expression microarray data analysis, a hybrid approach is usually adopted, wherein the number of features is first reduced by using a filter method and then a wrapper method is applied to the reduced feature set. Nevertheless, it still may take several hours to perform the search, depending on the classifier used in the wrapper method. To reduce complexity, in practice, a simple classifier (e.g., a linear classifier) is often used to evaluate the goodness of feature subsets, and the selected features are then fed into a more complicated classifier in the subsequent data analysis. This gives rise to the issue of feature exportability – in some cases, a feature subset that is optimal for one classifier may not work well for others [8]. Another issue associated with a wrapper method is its capability to perform feature selection for multiclass problems. To a large extent, this property depends on the capability of the classifier used in a wrapper method to handle multiclass problems. In many cases, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method [13], [34], and then feature selection is performed for each binary problem. This strategy further increases the computational burden of a wrapper method. One issue that is rarely addressed in the literature is algorithmic implementation. Many wrapper methods require the training of a large number of classifiers and manual specification of many parameters. This makes their implementation and use rather complicated, demanding expertise in machine learning. This is probably one of the main reasons why filter methods are more popular in the biomedical community [1], [2].

It is difficult to address the aforementioned issues directly in the wrapper-method framework. To overcome this difficulty, embedded methods have recently received an increased interest (see, e.g., [4], [5], [14], [16], [17]). The interested reader may refer to [15] for an excellent review. Embedded methods incorporate feature selection into the learning process of a classifier. A feature weighting strategy is usually adopted that uses
real-valued numbers, instead of binary ones, to indicate the relevance of features in a learning process. This strategy has many advantages. For example, there is no need to pre-specify the number of relevant features. Also, standard optimization techniques (e.g., gradient descent) can be used to avoid a combinatorial search. Hence, embedded methods are usually computationally more tractable than wrapper methods. Still, computational complexity is a major issue when the number of features becomes excessively large. Other issues, including algorithm implementation, feature exportability, and extension to multiclass problems, also remain.

Some recently developed embedded algorithms can be used for large-scale feature selection problems under certain assumptions. For example, [4], [14] propose to perform feature selection directly in the SVM formulation, where the scaling factors are adjusted using the gradient of a theoretical upper bound on the error rate. RFE [16] is a well-known feature selection method specifically designed for microarray data analysis. It works by iteratively training an SVM classifier with the current set of features, and then heuristically removing the features with small feature weights. As with wrapper methods, the structural parameters of SVM may need to be re-estimated by using, for example, cross-validation during iterations. Also, a linear kernel is usually used for computational reasons. ℓ1-SVM with a linear kernel [17], with proper parameter tuning, can lead to a sparse solution, where only relevant features receive non-zero weights. A similar algorithm is logistic regression with ℓ1 regularization. It has been proved by [5] that ℓ1-regularized logistic regression has a logarithmical sample complexity with respect to the number of features. However, the linearity assumption of the data models in these approaches limits their applicability to general problems.

3 OUR ALGORITHM

In this section, we present a new feature selection algorithm that addresses many issues with prior work discussed in Sec. 2. Let D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1} be a training data set, where x_n is the n-th data sample containing J features, y_n is its corresponding class label, and J ≫ N. For clarity, we here consider only binary problems, while in Sec. 3.3 our algorithm is generalized to address multiclass problems. We first define the margin. Given a distance function, we find two nearest neighbors of each sample x_n, one from the same class (called the nearest hit or NH), and the other from the different class (called the nearest miss or NM). The margin of x_n is then computed as

ρ_n = d(x_n, NM(x_n)) − d(x_n, NH(x_n)),   (1)

where d(·) is a distance function. For the purposes of this paper, we use the Manhattan distance to define a sample's margin and nearest neighbors, while other standard definitions may also be used. This margin definition is implicitly used in the well-known RELIEF algorithm [18], and was first mathematically defined in [19] (using the Euclidean distance) for the feature-selection purpose. An intuitive interpretation of this margin is a measure as to how much the features of x_n can be corrupted by noise (or how much x_n can "move" in the feature space) before being misclassified. By the large margin theory [3], [20], a classifier that minimizes a margin-based error function usually generalizes well on unseen test data. One natural idea then is to scale each feature, and thus obtain a weighted feature space, parameterized by a nonnegative vector w, so that a margin-based error function in the induced feature space is minimized. The margin of x_n, computed with respect to w, is given by:

ρ_n(w) = d(x_n, NM(x_n)|w) − d(x_n, NH(x_n)|w) = w^T z_n,   (2)

where z_n = |x_n − NM(x_n)| − |x_n − NH(x_n)|, and |·| is an element-wise absolute-value operator. Note that ρ_n(w) is a linear function of w, and has the same form as the sample margin of SVM, given by ρ_SVM(x_n) = w^T φ(x_n), using a mapping function φ(·). An important difference, however, is that by construction the magnitude of each element of w in the above margin definition reflects the relevance of the corresponding feature in a learning process. This is not the case in SVM except when a linear kernel is used, which however can capture only linear discriminant information. Note that the margin thus defined requires only information about the neighborhood of x_n, while no assumption is made about the underlying data distribution. This means that by local learning we can transform an arbitrary nonlinear problem into a set of locally linear ones.
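As a concrete illustration of (1)–(2), the following minimal NumPy sketch (ours, not the authors' Matlab implementation; the function names are hypothetical) finds the nearest hit and nearest miss of one sample under the weighted Manhattan distance and returns its margin ρ_n(w) = w^T z_n.

```python
import numpy as np

def weighted_manhattan(a, b, w):
    # d(a, b | w) = sum_j w_j |a_j - b_j| : Manhattan distance scaled by feature weights
    return np.sum(w * np.abs(a - b))

def margin(n, X, y, w):
    """Return (rho_n(w), z_n) for sample n, as in Eqs. (1)-(2)."""
    N = X.shape[0]
    hits   = [i for i in range(N) if y[i] == y[n] and i != n]   # candidates for NH(x_n)
    misses = [i for i in range(N) if y[i] != y[n]]              # candidates for NM(x_n)
    nh = min(hits,   key=lambda i: weighted_manhattan(X[n], X[i], w))  # nearest hit
    nm = min(misses, key=lambda i: weighted_manhattan(X[n], X[i], w))  # nearest miss
    z_n = np.abs(X[n] - X[nm]) - np.abs(X[n] - X[nh])
    return float(w @ z_n), z_n
```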
The local linearization of a nonlinear problem enables us to estimate the feature weights by using a linear model that has been extensively studied in the literature. It also facilitates the mathematical analysis of the algorithm. The main problem with the above margin definition, however, is that the nearest neighbors of a given sample are unknown before learning. In the presence of many thousands of irrelevant features, the nearest neighbors defined in the original space can be completely different from those in the induced space (see Fig. 2). To account for the uncertainty in defining local information, we develop a probabilistic model, where the nearest neighbors of a given sample are treated as hidden variables. Following the principles of the expectation-maximization algorithm [21], we estimate the margin by computing the expectation of ρ_n(w), averaging out the hidden variables:

ρ̄_n(w) = w^T ( E_{i∼M_n}[|x_n − x_i|] − E_{i∼H_n}[|x_n − x_i|] )
        = w^T ( Σ_{i∈M_n} P(x_i = NM(x_n)|w) |x_n − x_i| − Σ_{i∈H_n} P(x_i = NH(x_n)|w) |x_n − x_i| ) = w^T z̄_n,   (3)

where M_n = {i : 1 ≤ i ≤ N, y_i ≠ y_n}, H_n = {i : 1 ≤ i ≤ N, y_i = y_n, i ≠ n}, E_{i∼M_n} denotes the expectation computed with respect to M_n, and P(x_i = NM(x_n)|w) and P(x_i = NH(x_n)|w) are the probabilities of sample x_i being the nearest miss or hit of x_n, respectively. These probabilities are estimated via standard kernel density estimation:

P(x_i = NM(x_n)|w) = k(‖x_n − x_i‖_w) / Σ_{j∈M_n} k(‖x_n − x_j‖_w),  ∀ i ∈ M_n,   (4)

and

P(x_i = NH(x_n)|w) = k(‖x_n − x_i‖_w) / Σ_{j∈H_n} k(‖x_n − x_j‖_w),  ∀ i ∈ H_n,   (5)

where k(·) is a kernel function. Specifically, we use the exponential kernel k(d) = exp(−d/σ), where the kernel width σ is an input parameter that determines the resolution at which the data is locally analyzed. Other kernel functions can also be used, and descriptions of their properties can be found in [22].

Fig. 1. Fermat's spiral problem. (a) Samples belonging to two classes are distributed in a two-dimensional space, forming a spiral shape. A possible decision boundary is also plotted. If one walks from point A to B along the decision boundary, at any given point (say, point C), one would obtain a linear problem locally. (b) By projecting the transformed data z̄_n onto the direction specified by w, most samples have positive margins.

To motivate the above formulation, we consider the well-known Fermat's spiral problem, in which two-class samples are distributed in a two-dimensional space, forming a spiral shape, as illustrated in Fig. 1(a). A possible decision boundary is also plotted. If one walks from point A to B along the decision boundary, at any given point (say, point C), one would obtain a linear problem locally. One possible linear formulation is given by Eq. (3). Clearly, in this spiral problem, both features are equally important. By projecting the transformed data z̄_n onto the feature weight vector w = [1, 1]^T, we note that most samples have positive margins (Fig. 1(b)). The above arguments generally hold for arbitrary nonlinear problems for a wide range of kernel widths, as long as the local linearity condition is preserved. We will demonstrate in the experiments that the algorithm's performance is indeed robust against a specific choice of kernel width.

After the margins are defined, the problem of learning feature weights can be solved within the large margin framework. The two most popular margin formulations are SVM [3] and logistic regression [23]. Due to the nonnegativity constraint on w, the SVM formulation represents a large-scale optimization problem, while the problem size cannot be reduced by transforming it into the dual domain. For computational convenience, we therefore perform the estimation in the logistic regression formulation, which leads to the following optimization problem:

min_w Σ_{n=1}^N log( 1 + exp(−w^T z̄_n) ),  subject to w ≥ 0,   (6)

where w ≥ 0 means that each element of w is nonnegative.

In applications with a huge number of features, we expect that most features are irrelevant. For example, in cancer prognosis, most genes are not involved in tumor growth and/or spread [1], [2]. To encourage sparseness, one commonly used strategy is to add an ℓ1 penalty on w to the objective function [5], [17], [24], [25], [26], [27], [28]. Accomplishing sparse solutions by introducing the ℓ1 penalty has been theoretically justified (see, for example, [29] and the references therein). With the ℓ1 penalty, we obtain the following optimization problem:

min_w Σ_{n=1}^N log( 1 + exp(−w^T z̄_n) ) + λ‖w‖_1,  subject to w ≥ 0,   (7)

where λ is a parameter that controls the penalty strength and, consequently, the sparseness of the solution.

The optimization formulation (7) can also be written as:

min_w Σ_{n=1}^N log( 1 + exp(−w^T z̄_n) ),  subject to ‖w‖_1 ≤ β, w ≥ 0.   (8)

In statistics, the above formulation is called the nonnegative garrote [30]. For every solution to (7), obtained for a given value of λ, there is a corresponding value of β in (8) that gives the same solution. The optimization problem (8) has an interesting interpretation: if we adopt a classification rule where x_n is correctly classified if and only if margin ρ̄_n(w) ≥ 0 (i.e., on average, x_n is closer to the patterns from the same class in the training data, excluding x_n, than to those from the opposite class), then Σ_{n=1}^N I(ρ̄_n(w) < 0) is the leave-one-out (LOO) classification error induced by w, where I(·) is the indicator function. Since the logistic loss function is an upper bound of the misclassification loss function, up to a difference of a constant factor, the physical meaning of our algorithm is to find a feature weight vector so that the upper bound of the LOO classification error in the induced feature space is minimized. Hence, the algorithm has two levels of regularization, i.e., the implicit LOO regularization and the explicit ℓ1 regularization. We will shortly see that this property, together with the convergence property, leads to superior performance of our algorithm in the presence of many thousands of irrelevant features. We will also see that the performance of our algorithm is largely insensitive to a specific choice of λ, due to the LOO regularization.
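To make Eqs. (3)–(7) concrete, the sketch below is our own NumPy illustration (the function names are ours, and the explicit loop over samples is written for clarity rather than speed): it computes the NM/NH probabilities with the exponential kernel, assembles the expected margin vectors z̄_n, and evaluates the ℓ1-penalized logistic objective.

```python
import numpy as np

def zbar_vectors(X, y, w, sigma):
    """Expected margin vectors z̄_n of Eq. (3), using the probabilities of Eqs. (4)-(5)."""
    N, J = X.shape
    D = np.abs(X[:, None, :] - X[None, :, :])      # |x_n - x_i| for all pairs, shape (N, N, J)
    K = np.exp(-(D @ w) / sigma)                   # exponential kernel of the weighted Manhattan distances
    Z = np.zeros((N, J))
    for n in range(N):
        miss = y != y[n]                           # index set M_n
        hit = y == y[n]
        hit[n] = False                             # index set H_n excludes x_n itself
        p_m = K[n, miss] / K[n, miss].sum()        # P(x_i = NM(x_n) | w), Eq. (4)
        p_h = K[n, hit] / K[n, hit].sum()          # P(x_i = NH(x_n) | w), Eq. (5)
        Z[n] = p_m @ D[n, miss] - p_h @ D[n, hit]  # z̄_n
    return Z

def objective(w, Z, lam):
    """ℓ1-penalized logistic loss of Eq. (7); w is assumed nonnegative, so ||w||_1 = sum(w)."""
    rho = Z @ w                                    # expected margins ρ̄_n(w) = wᵀ z̄_n
    return np.sum(np.logaddexp(0.0, -rho)) + lam * np.sum(w)
```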
Since z̄_n implicitly depends on w through the probabilities P(x_i = NH(x_n)|w) and P(x_i = NM(x_n)|w), we use a fixed-point recursion method to solve for w. In each iteration, z̄_n is first computed by using the previous estimate of w, which is then updated by solving the optimization problem (7). The iterations are carried out until convergence. It is interesting to note that, though local learning is a highly nonlinear process, in each iteration we only deal with a linear problem.

For fixed z̄_n, (7) is a constrained convex optimization problem. Due to the nonnegativity constraint on w, it cannot be solved directly by using gradient descent. To overcome this difficulty, we reformulate the problem slightly as:

min_v Σ_{n=1}^N log( 1 + exp(−Σ_j v_j^2 z̄_n(j)) ) + λ‖v‖_2^2,   (9)

thus obtaining an unconstrained optimization problem. It is easy to show that at the optimum solution we have w_j = v_j^2, 1 ≤ j ≤ J. The solution of v can thus be readily found through gradient descent with a simple update rule:

v ← v − η ( λ1 − Σ_{n=1}^N [ exp(−Σ_j v_j^2 z̄_n(j)) / (1 + exp(−Σ_j v_j^2 z̄_n(j))) ] z̄_n ) ⊗ v,   (10)

where ⊗ is the Hadamard (element-wise) operator, and η is the learning rate determined by the standard line search. Note that the objective function of (9) is no longer a convex function, and thus a gradient descent method may find a local minimizer or a saddle point. The following theorem shows that if the initial point is properly selected, the solution obtained when the gradient vanishes is a global minimizer.
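A hedged sketch of this inner solver follows: it applies the update rule (10) with a fixed step size in place of the line search, and returns w_j = v_j^2. The function name and default step size are our own choices, not the paper's.

```python
import numpy as np

def solve_v(Z, lam, v0, eta=0.1, n_steps=1000, tol=1e-6):
    """Gradient descent for the reparameterized problem (9) using update rule (10).

    Z  : (N, J) matrix whose rows are the z̄_n (held fixed within one outer iteration).
    v0 : initial point; all entries nonzero, as required by Theorem 1 below.
    A fixed step size eta replaces the line search used in the paper.
    Returns the nonnegative feature weights w_j = v_j^2.
    """
    v = np.asarray(v0, dtype=float).copy()
    for _ in range(n_steps):
        s = Z @ (v * v)                       # s_n = sum_j v_j^2 z̄_n(j)
        sig = 1.0 / (1.0 + np.exp(s))         # exp(-s_n) / (1 + exp(-s_n))
        step = (lam - Z.T @ sig) * v          # (λ1 − Σ_n [...] z̄_n) ⊗ v, cf. (10)
        v_new = v - eta * step
        if np.linalg.norm(v_new - v) < tol:
            return v_new * v_new
        v = v_new
    return v * v
```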
Theorem 1. Let f(x) be a strictly convex function of x ∈ R^J and g(x) = f(y), where y = [y_1, …, y_J]^T = [x_1^2, …, x_J^2]^T. If ∂g/∂x|_{x=x^+} = 0, then x^+ is not a local minimizer, but a saddle point or a global minimizer of g(x). Moreover, if x^+ is found through gradient descent with an initial point x_j^(0) ≠ 0, 1 ≤ j ≤ J, then x^+ is a global minimizer of g(x).

Proof: For notational simplicity, we use ∂f/∂x^* to denote ∂f/∂x|_{x=x^*}. Also, we use A ≻ 0 and A ⪰ 0 to denote that A is a positive definite or positive semidefinite matrix, respectively. We first prove that if ∂g/∂x^+ = 0, then x^+ is either a saddle point or a global minimizer of g(x). To this end, we examine the properties of the Hessian matrix of g(x), denoted as H. Let x^+ be a stationary point of g(x) satisfying:

∂g/∂x^+ = [ 2x_1^+ ∂f/∂y_1^+, ⋯, 2x_J^+ ∂f/∂y_J^+ ]^T = 0.   (11)

The entries of H(x^+) are given by

h_ij = ( 4 x_i x_j ∂^2 f/(∂y_i ∂y_j) + 2 (∂f/∂y_i) δ(i, j) ) |_{x_i^+, x_j^+},  1 ≤ i ≤ J, 1 ≤ j ≤ J,   (12)

where δ(i, j) is the Kronecker delta function that equals 1 if i = j, and 0 otherwise. Note that some elements of x^+ may be equal to zero. Thus, the elements of x^+ can be grouped into two sets S_0 = {x_j^+ : x_j^+ = 0, 1 ≤ j ≤ J} and S_≠0 = {x_j^+ : x_j^+ ≠ 0, 1 ≤ j ≤ J}. From (11), we have ∂f/∂y_j^+ = 0 for x_j ∈ S_≠0. For simplicity, assume without loss of generality that the first M elements of x^+ belong to S_0, while the remaining J − M elements belong to S_≠0. Then, from (12), the Hessian matrix of g(x), evaluated at x^+, is given by

H(x^+) = [ A  0 ; 0  B ⊗ C ] |_{x=x^+},   (13)

where we have used the fact that ∂f/∂y_j^+ = 0 for j ∈ S_≠0,

A = diag( 2 ∂f/∂y_1, …, 2 ∂f/∂y_M ),   (14)

B = [ ∂^2 f/(∂y_i ∂y_j) ]_{i,j = M+1,…,J},   (15)

and

C = [ 4 x_i x_j ]_{i,j = M+1,…,J} ≻ 0.   (16)

Since f(x) is a strictly convex function of x, we have B ≻ 0. Therefore, by the Schur product theorem [51], B ⊗ C is a positive definite matrix. It follows that H(x^+) ⪰ 0 if and only if A ⪰ 0.

If H(x^+) is not positive semidefinite, then x^+ is a saddle point. In the following, we prove that if H(x^+) ⪰ 0, then x^+ must be a global minimizer of g(x). Suppose, to the contrary, that instead of x^+ some x^* is a global minimizer of g(x) and y^* ≠ y^+. Then, by Taylor's theorem, there exists α ∈ (0, 1) such that

f(y^*) = f(y^+) + (∂f/∂y^+)^T (y^* − y^+) + (1/2) (y^* − y^+)^T ( ∂^2 f/∂y^2 |_{y = y^+ + α(y^* − y^+)} ) (y^* − y^+)
       = f(y^+) + [ ∂f/∂y_1^+, …, ∂f/∂y_M^+, 0, …, 0 ] (y^* − y^+) + R
       = f(y^+) + (∂f/∂y_1^+) y_1^* + ⋯ + (∂f/∂y_M^+) y_M^* + R,   (17)

where we have used the fact that y_1^+ = ⋯ = y_M^+ = 0. Since f is a strictly convex function, we have ∂^2 f/∂y^2 ≻ 0, and R is a positive number. Also, our initial assumption that H(x^+) ⪰ 0 is equivalent to A ⪰ 0, that is, ∂f/∂y_1^+ ≥ 0, …, ∂f/∂y_M^+ ≥ 0. It follows that f(y^*) − f(y^+) ≥ 0, where the equality holds when y^* = y^+. This contradicts the initial assumption that x^* is a global minimizer of g(x) and y^* ≠ y^+. We have thus proved that a given stationary point x^+ of g(x) is either a saddle point if H(x^+) is not positive semidefinite, or a global minimizer of g(x) if H(x^+) ⪰ 0.

Next, we prove that if a stationary point x^+ is found via gradient descent with an initial point x_j^(0) ≠ 0, 1 ≤ j ≤ J, then x^+ is a global minimizer of g(x). Suppose that ∂g/∂x^* = 0 and x^* is a saddle point. Again, we assume that the first M elements of x^* belong to S_0, while the remaining J − M elements belong to S_≠0. There exists an element i ∈ S_0 such that ∂f/∂y_i^* < 0 (otherwise H(x^*) ⪰ 0 and x^* is a global minimizer). Due to continuity, there exists ξ > 0 such that ∂f/∂y_i < 0 for every y_i ∈ C = {y : |y − y_i^*| < ξ}. It follows that ∂g/∂x_i = 2x_i(∂f/∂y_i) < 0 for x_i = √y_i, and ∂g/∂x_i > 0 for x_i = −√y_i. That is, x_i = 0 is not reachable by using a gradient descent method, given by x_i ← x_i − η(∂g/∂x_i), except when the component x_i^(0) of the initial point x^(0) is set to zero. Equivalently, the saddle point x^* is not reachable via gradient descent. This concludes the proof of the theorem.

For fixed z̄_n, the objective function (7) is a strictly convex function of w. Theorem 1 assures that in each iteration, via gradient descent, reaching a globally optimal solution of w is guaranteed. After the feature weighting vector is found, the pairwise distances among data samples are re-evaluated using the updated feature weights, and the probabilities P(x_i = NM(x_n)|w) and P(x_i = NH(x_n)|w) are re-computed using the newly obtained pairwise distances. The two steps are iterated until convergence. The implementation of the algorithm is very simple. It is coded in Matlab with less than one hundred lines. Except for the line search, no other built-in functions are used. The pseudo-code is presented in Algorithm 1.

In the following three subsections, we analyze the convergence and computational complexity of our algorithm, and present its extension to multiclass problems.

Input:  Data D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1}, kernel width σ, regularization parameter λ, stop criterion θ
Output: Feature weights w
1  Initialization: Set w^(0) = 1, t = 1;
2  repeat
3      Compute d(x_n, x_i | w^(t−1)), ∀ x_n, x_i ∈ D;
4      Compute P(x_i = NM(x_n) | w^(t−1)) and P(x_i = NH(x_n) | w^(t−1)) as in (4) and (5);
5      Solve for v through gradient descent using the update rule specified in (10);
6      w_j^(t) = v_j^2, 1 ≤ j ≤ J;
7      t = t + 1;
8  until ‖w^(t) − w^(t−1)‖ < θ;
9  w = w^(t).
Algorithm 1: Pseudo-code of the proposed feature selection algorithm.
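Putting the pieces together, the outer fixed-point iteration of Algorithm 1 might be sketched as follows. This is an illustration rather than the authors' Matlab code; it reuses zbar_vectors() and solve_v() from the two sketches given after Eqs. (5) and (10).

```python
import numpy as np

def local_learning_feature_selection(X, y, sigma, lam, theta=0.01, max_iter=50):
    """Outer fixed-point loop of Algorithm 1 (illustrative only).

    Relies on zbar_vectors() and solve_v() from the earlier sketches.
    """
    N, J = X.shape
    w = np.ones(J)                                 # line 1: w^(0) = 1
    for _ in range(max_iter):
        Z = zbar_vectors(X, y, w, sigma)           # lines 3-4: distances and NM/NH probabilities
        v0 = np.sqrt(w) + 1e-3                     # keep every initial component nonzero (cf. Theorem 1)
        w_new = solve_v(Z, lam, v0)                # lines 5-6: inner solve, then w_j = v_j^2
        if np.linalg.norm(w_new - w) < theta:      # line 8: stopping criterion
            return w_new
        w = w_new
    return w
```

For example, local_learning_feature_selection(X, y, sigma=2.0, lam=1.0) matches the (σ, λ) setting reported for the spiral experiment in Sec. 4.1.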
3.1 Convergence Analysis

We begin by studying the asymptotic behavior of the algorithm. If σ → +∞, for every w ≥ 0, we have

lim_{σ→+∞} P(x_i = NM(x_n)|w) = 1/|M_n|,  ∀ i ∈ M_n,   (18)

since lim_{σ→+∞} k(d) = 1. On the other hand, if σ → 0, by assuming that for every x_n, d(x_n, x_i|w) ≠ d(x_n, x_j|w) if i ≠ j, we have lim_{σ→0} P(x_i = NM(x_n)|w) = 1 if d(x_n, x_i|w) = min_{j∈M_n} d(x_n, x_j|w), and 0 otherwise. (In a degenerate case it is possible that d(x_n, x_i|w) = d(x_n, x_j|w), which, however, is a zero-probability event provided that the patterns contain some random noise. For simplicity, the degenerate case is not considered in our analysis.) A similar asymptotic behavior holds for P(x_i = NH(x_n)|w). From the above analysis, it follows that for σ → +∞ the algorithm converges to a unique solution in one iteration, since P(x_i = NM(x_n)|w) and P(x_i = NH(x_n)|w) are constants for any initial feature weights. On the other hand, for σ → 0, the algorithm searches for only one nearest neighbor when computing margins, and we empirically find that the algorithm may not converge. This suggests that the convergence behavior and convergence rate of the algorithm are fully controlled by the kernel width, which is formally stated in the following theorem.

Theorem 2. For the feature selection algorithm defined in Alg. 1, there exists σ^* such that lim_{t→+∞} ‖w^(t) − w^(t−1)‖ = 0 whenever σ > σ^*. Moreover, for a fixed σ > σ^*, the algorithm converges to a unique solution for any nonnegative initial feature weights w^(0).

We use the Banach fixed point theorem to prove the convergence theorem. We first state the fixed point theorem without proof, which can be found, for example, in [31].

Definition 1. Let U be a subset of a normed space Z, and ‖·‖ a norm defined in Z. An operator T : U → Z is called a contraction operator if there exists a constant q ∈ [0, 1) such that ‖T(x) − T(y)‖ ≤ q‖x − y‖ for every x, y ∈ U; q is called the contraction number of T. An element of a normed space Z is called a fixed point of T : U → Z if T(x) = x.

Theorem 3 (Fixed Point Theorem). Let T be a contraction operator mapping a complete subset U of a normed space Z into itself. Then the sequence generated as x^(t+1) = T(x^(t)), t = 0, 1, 2, …, with arbitrary x^(0) ∈ U, converges to the unique fixed point x^* of T. Moreover, the following estimation error bounds hold:

‖x^(t) − x^*‖ ≤ (q^t / (1 − q)) ‖x^(1) − x^(0)‖,  and  ‖x^(t) − x^*‖ ≤ (q / (1 − q)) ‖x^(t) − x^(t−1)‖.   (19)

Proof (of Theorem 2): The gist of the proof is to identify a contraction operator for the algorithm and to make sure that the conditions of Theorem 3 are met. To this end, we define P = {p : p = [P(x_i = NM(x_n)|w), P(x_j = NH(x_n)|w)]} and W = {w : w ∈ R^J, ‖w‖_1 ≤ β, w ≥ 0}, and specify the first step of the algorithm in a functional form as T1 : W → P, where T1(w) = p, and the second step as T2 : P → W, where T2(p) = w. Then, the algorithm can be written as w^(t) = (T2 ∘ T1)(w^(t−1)) ≜ T(w^(t−1)), where ∘ denotes functional composition and T : W → W. Since W is a closed subset of the finite-dimensional normed space R^J (or a Banach space) and thus complete [31], T is an operator mapping the complete subset W into itself. Next, note that for σ → +∞, the algorithm converges in one step. We have lim_{σ→+∞} ‖T(w_1, σ) − T(w_2, σ)‖ = 0 for any w_1, w_2 ∈ W. Therefore, in the limit, T is a contraction operator with contraction constant q = 0, that is, lim_{σ→+∞} q(σ) = 0. Therefore, for every ε > 0, there exists σ^* such that q(σ) ≤ ε whenever σ > σ^*. By setting ε < 1, the resulting operator T is a contraction operator. By the Banach fixed point theorem, our algorithm converges to a unique fixed point provided that the kernel width is properly selected. The above arguments establish the convergence theorem of the algorithm.

The theorem ensures the convergence of the algorithm if the kernel width is properly selected. This is a very loose condition, as our empirical results show that the algorithm always converges for a sufficiently large kernel width (see Fig. 5(b)). Also, the error bound in (19) tells us that the smaller the contraction number, the tighter the error bound and hence the faster the convergence rate. Our experiments suggest that a larger kernel width yields faster convergence.

Unlike many other machine learning algorithms (e.g., neural networks), the convergence and the solution of our algorithm are not affected by the initial value if the kernel width is fixed. This property has a very important consequence: even if the initial feature weights were wrongly selected (e.g., investigators have no or false prior information) and the algorithm started by computing erroneous nearest misses and hits for each sample, the theorem assures that the algorithm will eventually converge to the same solution obtained when one had perfect prior knowledge. The correctness of the proof of Theorem 2 is experimentally verified in Sec. 4.1.

3.2 Computational Complexity and Fast Implementation

The algorithm consists of two main parts: computing pairwise distances between samples and solving the ℓ1 optimization problem, the computational complexities of which in each iteration are O(N^2 J) and O(NJ), respectively. Here, J is the feature dimensionality and N is the number of samples. When N is sufficiently large (say 100), most of the CPU time is spent on the second task (see Fig. 5(a)). The computational complexity of our algorithm is comparable to those of RELIEF [18] and Simba [19], which are known for their computational efficiency. A close look at the update equation of v, given by (10), allows us to further reduce complexity. If some elements of v are very close to zero (say, less than 10^−4), the corresponding features can be eliminated from further consideration with a negligible impact on the subsequent iterations, thus providing a built-in mechanism for automatically removing irrelevant features during learning.

Our algorithm has a linear complexity with respect to the number of features. In contrast, some popular greedy search methods (e.g., forward search) require on the order of O(J^2) moves in a feature space [9]. However, when the sample size becomes excessively large, it can still be computationally intensive to run our algorithm. Considerable efforts have been made over the years to improve the computational efficiency of nearest-neighbor search algorithms [32]. It is possible to use similar techniques to reduce the number of distance evaluations actually performed in our algorithm, which will be our future work.

3.3 Feature Selection for Multiclass Problems

This section considers feature selection for multiclass problems. Some existing feature selection algorithms, originally designed for binary problems, can be naturally extended to multiclass settings, while for others the extension is not straightforward. For both embedded and wrapper methods, the extension largely depends on the capability of a classifier to handle multiclass problems [9]. In many cases, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method [13], [34], and then feature selection is performed for each binary problem. This strategy further increases the computational burden of embedded and wrapper methods. Our algorithm does not suffer from this problem. A natural extension of the margin defined in (2) to multiclass problems is [34]:

ρ_n(w) = min_{c∈Y, c≠y_n} d(x_n, NM^(c)(x_n)|w) − d(x_n, NH(x_n)|w)
       = min_{x_i ∈ D∖D_{y_n}} d(x_n, x_i|w) − d(x_n, NH(x_n)|w),   (20)

where Y is the set of class labels, NM^(c)(x_n) is the nearest neighbor of x_n from class c, and D_c is the subset of D containing only samples from class c. The derivation of our feature selection algorithm for multiclass problems by using the margin defined in (20) is straightforward.
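The only change needed for the multiclass margin (20) is the definition of the "miss" pool. A minimal sketch of the deterministic version is given below (our own illustration, with hypothetical function names).

```python
import numpy as np

def multiclass_margin(n, X, y, w):
    """Margin of Eq. (20): the nearest miss is searched over all samples outside class y_n."""
    d = lambda i: np.sum(w * np.abs(X[n] - X[i]))             # weighted Manhattan distance
    same  = [i for i in range(len(y)) if y[i] == y[n] and i != n]
    other = [i for i in range(len(y)) if y[i] != y[n]]        # D \ D_{y_n}
    return min(d(i) for i in other) - min(d(i) for i in same)
```

Note that the probabilistic formulation of Sec. 3 carries over verbatim, since M_n = {i : y_i ≠ y_n} and H_n are already defined purely through label (in)equality, so the same z̄_n computation applies unchanged to multiclass labels.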
4 EXPERIMENTS

We perform a large-scale experiment on eleven synthetic and real-world data sets to demonstrate the effectiveness of the newly proposed algorithm. The experiment is performed on a desktop with a Pentium 4 2.8GHz CPU and 2GB RAM.

4.1 Spiral Problem

This section presents a simulation study on Fermat's spiral problem, carefully designed to verify various properties of the algorithm theoretically established in Sec. 3.1. Fermat's spiral problem is a binary classification problem, where each class contains 230 samples distributed in a two-dimensional space, forming a spiral shape, as illustrated in Fig. 1. In addition to the first two relevant features, each sample is contaminated by a varying number of irrelevant features, where this number is set to {50, 500, 5000, 10000, 20000, 30000}. The number 30000 exceeds by far the number of features encountered in many scientific fields. For example, human beings have about 25000 genes, and hence nearly all gene expression microarray platforms have fewer than 25000 probes. The added irrelevant features are independently sampled from a zero-mean, unit-variance Gaussian distribution. Our task is to identify the first two relevant features. Note that only if these two features are used simultaneously can the two classes of samples be well separated. Most filter and wrapper approaches perform poorly on this example, since in the former the goodness of each feature is evaluated individually, while in the latter the search for relevant features is performed heuristically.

Fig. 2 illustrates the dynamics of our algorithm performed on the spiral data with 10000 irrelevant features. The algorithm iteratively refines the estimates of the weight vector w and the probabilities P(x_i = NH(x_n)|w) and P(x_i = NM(x_n)|w) until convergence. Each sample is colored according to its probability of being the nearest miss or hit of a given sample indicated by a black cross. We observe that, with uniform initial feature weights, the nearest neighbors defined in the original feature space can be completely different from the true ones. The plot shows that the algorithm converges to a perfect solution in just three iterations. This example also illustrates why similarity-based learning algorithms (e.g., KNN and SVM with an RBF kernel) perform poorly in the presence of copious irrelevant features. This is because the neighboring samples of a test sample provide misleading information.

Fig. 3. Feature weights learned on the spiral dataset with different numbers of irrelevant features, ranging from 50 to 30000. The y-axis represents the values of the feature weights, and the x-axis is the number of features, where the first two are always fixed to represent the two relevant features. Zero-valued feature weights indicate that the corresponding features are not relevant. The feature weights learned across all dimensionalities are almost identical, for the same input parameters.

Fig. 3 presents the feature weights that our algorithm learns on the spiral data for a varying number of irrelevant features. The results are obtained for parameters σ and λ respectively set to 2 and 1, while the same solution holds for a wide range of other values of kernel widths and regularization parameters (insensitivity to a specific choice of these parameters will be discussed shortly). Our algorithm performs remarkably well over a wide range of feature-dimensionality values, with the same parameters. We also note that the feature weights learned are almost identical across all feature-dimensionality values. This result is a consequence of Theorem 2, which may be explained as follows: suppose that we have two spiral data sets with 5000 and 10000 irrelevant features, respectively. Also, suppose that we first perform the algorithm on the second dataset, and that after some iterations the algorithm finds 5000 irrelevant features whose weights are very close to zero. Then, both problems are almost identical, except that the first problem has a uniform initial point (see line 1 of Alg. 1) and the second one has a non-uniform initial point. By Theorem 2, the algorithm converges to the same solution for both problems. Of course, due to the randomness of irrelevant features and the finite number of iteration steps, the two solutions are slightly different. For example, in Fig. 3, for the dataset with 30000 features, the algorithm selects one false feature as relevant, in addition to the two relevant ones. However, the weight associated with the selected irrelevant feature is much smaller than the weights of the two relevant ones.

One may be interested to know how many irrelevant features can be added to the data set before the algorithm fails. To answer this question, we conduct an experiment where the number of irrelevant features is continuously increased. We find that the algorithm attains almost identical solutions to those presented in Fig. 3 until one million irrelevant features are added (Fig. 4). The algorithm fails simply because our computer runs out of memory. This result suggests that our algorithm is capable of handling problems with an extremely large input data dimensionality, far beyond that needed in many data-analysis settings one may currently encounter. This result is very encouraging, and some theoretical analyses of the algorithm's sample complexity are presented in Sec. 5 that explain in part the algorithm's excellent performance.
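For readers who want to reproduce a setup in the spirit of this section, the sketch below generates a two-class spiral data set and pads it with zero-mean, unit-variance irrelevant features. The spiral parameterization, noise level, and random seed are illustrative assumptions of ours; the paper does not specify them.

```python
import numpy as np

def make_spiral_data(n_per_class=230, n_irrelevant=10000, noise=0.1, seed=0):
    """Two interleaved spiral arms (one per class) plus irrelevant N(0,1) features."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.5, 3.0 * np.pi, n_per_class)
    r = np.sqrt(t)                                    # Fermat's spiral: r = a*sqrt(t), with a = 1
    def arm(phase):
        pts = np.column_stack([r * np.cos(t + phase), r * np.sin(t + phase)])
        return pts + noise * rng.standard_normal(pts.shape)
    X_rel = np.vstack([arm(0.0), arm(np.pi)])         # the two relevant features
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)]).astype(int)
    X_irr = rng.standard_normal((2 * n_per_class, n_irrelevant))   # zero-mean, unit-variance noise
    return np.hstack([X_rel, X_irr]), y
```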
Fig. 2. The algorithm iteratively refines the estimates of weight vector w and probabilities P(x_i = NH(x_n)|w) and P(x_i = NM(x_n)|w) until convergence. The result is obtained on the spiral data with 10000 irrelevant features. Each sample is colored according to its probability of being the nearest miss or hit of a given sample indicated by a black cross. The plot shows that the algorithm converges to a perfect solution in just three iterations. (The figure is better viewed electronically.)

Fig. 4. Feature weights learned on the spiral data with one million features.

Fig. 5. (a) The CPU time it takes our algorithm to perform feature selection on the spiral data with different numbers of irrelevant features, ranging from 50 to 30000. The CPU time spent on solving the ℓ1 optimization problems is also reported (blue line). The plot demonstrates linear complexity with respect to the feature dimensionality. (b) Convergence analysis of our algorithm performed on the spiral data with 5000 irrelevant features, for λ = 1 and σ ∈ {0.01, 0.05, 0.5, 1, 10, 50}. The plots present θ = ‖w^(t) − w^(t−1)‖_2 as a function of the number of iteration steps.

Our algorithm is computationally very efficient. Fig. 5(a) shows the CPU time it takes the algorithm to perform feature selection on the spiral data with different numbers of irrelevant features. The stopping criterion in Alg. 1 is θ = 0.01. As can be seen from the figure, the algorithm runs for only 3.5s for the problem with 100 features, 37s for 1000 features, and 372s for 20000 features. The computational complexity is linear with respect to the feature dimensionality. It would be difficult for most wrapper methods to compete with ours in terms of computational complexity. Depending on the classifier used to search for relevant features, it may take several hours for a wrapper method to analyze the same dataset with 10000 features, and yet there is no guarantee that the optimal solution will be reached, due to heuristic search. The CPU time spent on solving the ℓ1 optimization problems is also reported, which accounts for about 2% of the total CPU time.

Fig. 5(b) presents the convergence analysis of our algorithm on the spiral data with 5000 irrelevant features, for λ = 1 and different kernel widths σ ∈ {0.01, 0.05, 0.5, 1, 10, 50}.
We observe that the algorithm converges for a wide range of σ values, and that a larger kernel width generally yields faster convergence. These results validate our theoretical convergence analysis presented in Sec. 3.1.

The kernel width σ and the regularization parameter λ are the two input parameters of the algorithm. Alternatively, they can be estimated through cross validation on training data. It is well known that cross validation may produce an estimate with a large variance. Fortunately, this does not pose a serious concern for our algorithm. In Figs. 6 and 7, we plot the feature weights learned with different kernel widths and regularization parameters. The algorithm performs well over a wide range of parameter values, always yielding the largest weights for the first two relevant features, while the other weights are significantly smaller. This suggests that the algorithm's performance is largely insensitive to a specific choice of the parameters σ and λ, which makes parameter tuning, and hence the implementation of our algorithm, easy even for researchers outside of the machine learning community.

Fig. 6. Feature weights learned on the spiral data with 5000 irrelevant features, for a fixed kernel width σ = 2 and different regularization parameters λ ∈ {0.1, 0.5, 1, 1.5, 2}.

Fig. 7. Feature weights learned on the spiral data with 5000 irrelevant features, for a fixed regularization parameter λ = 1 and different kernel widths σ ∈ {0.1, 0.5, 1, 3, 5}.

4.2 Experiments on UCI Data

This section presents our feature selection results obtained on seven benchmark UCI data sets [35], including banana, waveform, twonorm, thyroid, heart, diabetics, and splice. For each data set, the set of original features is augmented by 5000 irrelevant features, independently sampled from a Gaussian distribution with zero mean and unit variance. It should be noted that some features in the original feature sets may be irrelevant or weakly relevant. Unlike the spiral data, however, this information is unknown to us a priori. The data sets are summarized in Table 1.

TABLE 1
Summary of the data sets used in the experiment. The number of irrelevant features artificially added to the original ones is indicated in parentheses.

Dataset         | Train | Test | Feature
spiral          | 460   | /    | 2 (0 ∼ 10^6)
twonorm         | 120   | 7000 | 20 (5000)
waveform        | 120   | 4600 | 21 (5000)
banana          | 468   | 300  | 2 (5000)
thyroid         | 70    | 75   | 5 (5000)
diabetics       | 130   | 300  | 8 (5000)
heart           | 58    | 100  | 13 (5000)
splice          | 110   | 2175 | 60 (5000)
prostate cancer | 102   | /    | 22283
breast cancer   | 97    | /    | 24488
DLBCL           | 77    | /    | 5469

We compare our algorithm with five other algorithms, including the Kolmogorov-Smirnov (KS) test [4], AMS [14], RFE with an RBF kernel [16], Simba [19], and I-RELIEF [42]. As we mentioned before, our algorithm can be used for classification purposes (see Eq. (8)). The kernel width and regularization parameter can thus be estimated through ten-fold cross validation on training data, without resorting to other classifiers. The KS test is a nonparametric univariate method that determines the information content of each feature by using as test statistic the maximum difference of the empirical distribution functions between the samples of each class. AMS, along with RFE, is among the first to perform feature selection directly in the SVM formulation. The basic idea of AMS is to automatically tune the scaling parameters of a kernel by minimizing a generalization error bound. The scaling parameters can be used as the test statistic to evaluate the information content of each feature. The code is downloaded from [14]. The default settings of the algorithm are used, and the span bound is minimized. Since AMS is computationally very expensive, we apply AMS to the UCI data containing only 1000 irrelevant features. Simba is a local learning based algorithm that in part motivates the development of our algorithm. One major problem with Simba is its implementation: its objective function is characterized by many local minima. This problem is mitigated in Simba by restarting the algorithm from five different starting points. Nevertheless, reaching a globally optimal solution is not guaranteed. The code for Simba is downloaded from [19]. The non-linear sigmoid activation function is used. The number of passes over the training data is set to 5, while the default value is 1. All other parameters use their default values. The code for RFE is downloaded from [16]. The RBF kernel is used and the number of retained features is set to 30. It is difficult to specify the kernel width and regularization parameter of the SVM used in RFE. We estimate both parameters through ten-fold cross validation by using the original training data without the 5000 irrelevant features. It should be noted that in practical applications one would not have access to noise-free data. As with Simba, I-RELIEF is also a local learning based method. One major difference between I-RELIEF and ours is that the objective function of I-RELIEF is not directly related to the classification performance of a learning algorithm. Moreover, I-RELIEF imposes an ℓ2 constraint on the feature weights, and thus cannot provide a sparse solution. During the review process, the editor suggested that we combine I-RELIEF with a back-selection strategy similar to that used in RFE. In the presence of copious irrelevant features, there is no guarantee that a useful feature has to have a weight larger than half of the other features. Consequently, as with RFE, some useful features may be eliminated during the back-selection process. The number of retained features is set to 30, and the kernel width is the same as that used in our algorithm. For notational simplicity, we refer to it as I-RELIEF/BS. Except for RFE, where the computationally intensive tasks (i.e., SVM training) are executed in C, all other algorithms are programmed in Matlab.
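Sec. 4.2 states that σ and λ "can be estimated through ten-fold cross validation on training data, without resorting to other classifiers." One way to realize this, sketched below under our own assumptions (the paper does not spell out the grid or the exact scoring rule), is to score each (σ, λ) pair by the error of the built-in decision rule ρ̄ ≥ 0 on held-out folds, with the hit/miss pools drawn from the training part of each fold. The sketch relies on local_learning_feature_selection() from the sketch following Algorithm 1.

```python
import numpy as np
from itertools import product

def heldout_errors(X_tr, y_tr, X_te, y_te, w, sigma):
    """Errors of the rule 'correct iff expected margin >= 0' (cf. the discussion of Eq. (8)),
    with the hit/miss pools drawn from the training part of the fold."""
    errors = 0
    for x, label in zip(X_te, y_te):
        D = np.abs(x - X_tr)                          # |x - x_i| against all training samples
        k = np.exp(-(D @ w) / sigma)
        miss, hit = (y_tr != label), (y_tr == label)
        zbar = (k[miss] / k[miss].sum()) @ D[miss] - (k[hit] / k[hit].sum()) @ D[hit]
        errors += int(w @ zbar < 0)
    return errors

def select_sigma_lambda(X, y, sigmas, lams, n_folds=10, seed=0):
    """Grid search over (sigma, lambda) by 10-fold CV; uses local_learning_feature_selection() above."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best, best_err = None, np.inf
    for sigma, lam in product(sigmas, lams):
        err = 0
        for f in folds:
            tr = np.setdiff1d(np.arange(len(y)), f)
            w = local_learning_feature_selection(X[tr], y[tr], sigma, lam)
            err += heldout_errors(X[tr], y[tr], X[f], y[f], w, sigma)
        if err < best_err:
            best, best_err = (sigma, lam), err
    return best
```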
SVM (with an RBF kernel) is used to estimate the classification errors obtained by using the features selected by each algorithm. The structural parameters of SVM are estimated through ten-fold cross validation on the original training data without the 5000 irrelevant features. To eliminate statistical variations, each algorithm is run 10 times for each data set. In each run, a data set is first randomly partitioned into training and test data, a feature weight vector is learned, the top-ranked features are successively fed into SVM, and the minimum test error is recorded. Table 2 presents the averaged classification errors and standard deviations of each algorithm. For a rigorous comparison, a Student's paired two-tailed t-test is also performed. The p-value of the t-test represents the probability that two sets of compared results come from distributions with an equal mean. A p-value of 0.05 is considered statistically significant. The last row of Table 2 summarizes the win/tie/loss of each algorithm when compared to ours at the 0.05 p-value level. In our algorithm, after a feature weight vector is learned, the maximum value of the feature weights is normalized to 1, and the features with weights > 0.01 are considered useful. The false discovery rate (FDR), defined as the ratio between the number of artificially added, irrelevant features identified by our algorithm as useful ones and the total number of irrelevant features (i.e., 5000), is reported in Table 2. For reference, the classification errors of SVM performed on the original data and on the corrupted data are also reported. From these experimental results, we have the following observations:

TABLE 2
Classification errors and standard deviations (%) of SVM performed on the seven UCI data sets contaminated by 5000 useless features. FDR is defined as the ratio between the number of artificially added, irrelevant features identified by our algorithm as useful ones and the total number of irrelevant features (i.e., 5000). The last row summarizes the win/tie/loss of each algorithm when compared with ours at the 0.05 p-value level.

Dataset      | SVM (original features) | SVM (all features) | Our Method: Errors | Our Method: FDR | I-RELIEF/BS | KS        | AMS (1000) | RFE       | Simba
twonorm      | 2.8(0.3)  | 35.0(1.0) | 4.1(0.9)  | 2.2/1000 | 4.1(1.0)  | 4.6(0.9)  | 5.0(0.9)   | 48.9(0.4) | 15.4(3.9)
waveform     | 12.8(0.9) | 33.0(0.2) | 13.8(1.1) | 0.6/1000 | 15.4(1.6) | 14.0(1.4) | 18.1(1.9)  | 40.2(2.8) | 15.8(1.0)
banana       | 10.9(0.5) | 32.9(0.2) | 10.9(0.5) | 0/1000   | 41.9(5.2) | 32.3(4.9) | 35.3(11.0) | 32.9(0.2) | 45.0(1.9)
diabetics    | 27.2(1.9) | 34.3(1.9) | 27.7(2.4) | 7.4/1000 | 29.9(6.0) | 27.0(2.3) | 25.6(1.5)  | 42.3(4.2) | 33.3(8.2)
thyroid      | 6.3(3.2)  | 30.0(3.7) | 6.4(2.7)  | 0.4/1000 | 10.1(1.8) | 16.3(9.7) | 22.8(14.9) | 29.5(3.8) | 20.5(9.0)
heart        | 20.3(5.6) | 45.6(3.4) | 19.0(5.2) | 0.3/1000 | 20.5(6.0) | 23.1(3.5) | 23.1(4.0)  | 43.3(4.6) | 20.5(5.3)
splice       | 20.4(1.8) | 48.9(2.0) | 17.8(1.9) | 2.3/1000 | 22.0(2.2) | 23.5(1.0) | 24.6(3.0)  | 48.7(0.8) | 20.1(2.6)
win/tie/loss | /         | /         | /         | /        | 0/3/4     | 0/4/3     | 1/0/6      | 0/0/7     | 0/2/5

(1) SVM using all features performs poorly, which is consistent with the results reported in the literature (see, for example, [4], [14]). SVM using the features identified by our algorithm performs similarly to, or sometimes even slightly better than, SVM using the original features (e.g., heart and splice), which clearly demonstrates the effectiveness of our algorithm for the purpose of feature selection. Among the six algorithms, ours yields the best results in six out of the seven data sets. RFE is a popular method; however, nearly all RFE applications use a linear kernel. We observe that RFE with an RBF kernel does not perform well on data with a very high data dimensionality. This is possibly due to the poor performance of SVM in the presence of copious irrelevant features, leading to an inaccurate estimate of feature weights. We empirically find that in some cases RFE removes all of the useful features after the first iteration. Except for our algorithm, none of the competing methods perform well on the banana data.

(2) In addition to successfully identifying relevant features, our algorithm performs very well in removing irrelevant ones. The false discovery rate, averaged over the seven data sets, is only 0.19%. The feature weights learned on one realization of each data set are plotted in Fig. 8.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XX 20XX 12

TABLE 2
Classification errors and standard deviations (%) of SVM performed on the seven UCI data sets contaminated by 5000 useless features. FDR is defined as the ratio between the number of artificially added, irrelevant features identified by our algorithm as useful ones and the total number of irrelevant features (i.e., 5000). The last row summarizes the win/tie/loss of each algorithm when compared with ours at the 0.05 p-value level.

Dataset       SVM (original features)  SVM (all features)  Our Method (Errors)  Our Method (FDR)  I-RELIEF/BS  KS         AMS (1000)  RFE        Simba
twonorm       2.8(0.3)                 35.0(1.0)           4.1(0.9)             2.2/1000          4.1(1.0)     4.6(0.9)   5.0(0.9)    48.9(0.4)  15.4(3.9)
waveform      12.8(0.9)                33.0(0.2)           13.8(1.1)            0.6/1000          15.4(1.6)    14.0(1.4)  18.1(1.9)   40.2(2.8)  15.8(1.0)
banana        10.9(0.5)                32.9(0.2)           10.9(0.5)            0/1000            41.9(5.2)    32.3(4.9)  35.3(11.0)  32.9(0.2)  45.0(1.9)
diabetics     27.2(1.9)                34.3(1.9)           27.7(2.4)            7.4/1000          29.9(6.0)    27.0(2.3)  25.6(1.5)   42.3(4.2)  33.3(8.2)
thyroid       6.3(3.2)                 30.0(3.7)           6.4(2.7)             0.4/1000          10.1(1.8)    16.3(9.7)  22.8(14.9)  29.5(3.8)  20.5(9.0)
heart         20.3(5.6)                45.6(3.4)           19.0(5.2)            0.3/1000          20.5(6.0)    23.1(3.5)  23.1(4.0)   43.3(4.6)  20.5(5.3)
splice        20.4(1.8)                48.9(2.0)           17.8(1.9)            2.3/1000          22.0(2.2)    23.5(1.0)  24.6(3.0)   48.7(0.8)  20.1(2.6)
win/tie/loss  -                        -                   -                    -                 0/3/4        0/4/3      1/0/6       0/0/7      0/2/5
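The evaluation protocol that produces the entries of Table 2 can be sketched as follows. This is a minimal illustration under our own naming; the weight vector w stands in for the output of the proposed algorithm (not reproduced here), the classifier is an RBF-kernel SVM as in the experiments, and the FDR is computed over the artificially added features.

# Illustrative sketch of the per-run evaluation behind Table 2 (not the paper's code).
import numpy as np
from sklearn.svm import SVC

def evaluate_run(X_train, y_train, X_test, y_test, w, n_original):
    """w: feature weights learned on (X_train, y_train); n_original: number of
    original features, the remaining ones being the added irrelevant features."""
    order = np.argsort(w)[::-1]                     # rank features by learned weight
    errors = []
    # successively feed the top-ranked features into SVM (in practice one would
    # only scan a reasonable prefix of the ranking rather than all features)
    for k in range(1, len(order) + 1):
        idx = order[:k]
        clf = SVC(kernel="rbf").fit(X_train[:, idx], y_train)
        errors.append(1.0 - clf.score(X_test[:, idx], y_test))
    min_error = min(errors)                         # minimum test error of this run

    # FDR: fraction of the added irrelevant features that are kept as useful.
    w_norm = w / w.max()                            # normalize the maximum weight to 1
    useful = w_norm > 0.01                          # features considered useful
    n_irrelevant = len(w) - n_original              # e.g., 5000 added features
    fdr = useful[n_original:].sum() / float(n_irrelevant)
    return min_error, fdr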

Except for diabetics, our algorithm removes nearly all of the irrelevant features. For comparison, the feature weights learned using the original data are also plotted. We observe that the weights learned in the two cases are very similar, which is consistent with the result on the spiral data and the theoretical results of Sec. 3.1.

(3) The CPU time of the six algorithms, averaged over the ten runs, is presented in Table 3. In terms of computational efficiency, the KS test performs the best, ours the second, I-RELIEF/BS, RFE and Simba the third, and AMS the least efficient. On average, it takes our algorithm less than half a minute to process 5000 features. It should be noted that the CPU time of AMS is obtained by using only 1000 features and that RFE is implemented in C, and hence the comparison is somewhat in favor of AMS and RFE.

Above, we use classification errors as the main criterion to compare the different algorithms. Classification errors, however, may not tell the whole story, and some other criteria are worth mentioning. All five algorithms that we compare to ours are feature weighting methods. In our experiment, the test data is used to estimate the minimum classification errors. In practical applications, without test data, one may have to use a classifier learned on the training data to determine the number of features that will be used in the test stage, which requires considerable effort in parameter tuning. In our method, the regularization parameter can be learned simultaneously within the learning process, without using any other classifiers. Our method performs backward selection implicitly: when the weight of a feature falls below 10^-8, the feature is eliminated from further consideration. Unlike RFE and I-RELIEF/BS, where the number of discarded features in each iteration is pre-defined regardless of the values of the feature weights, in our algorithm, whether a feature should be removed and how many features should be removed are determined automatically. On the theoretical side, ours has a solid theoretical foundation that will be presented in Sec. 5, whereas it is difficult to perform a similar theoretical analysis for some of the algorithms considered herein. In the next section, we conduct an experiment to demonstrate that ours is effective in eliminating not only noisy features but also redundant ones.

4.3 Experiments on Microarray Data

In this section, we demonstrate the effectiveness of our algorithm using three microarray data sets: breast cancer [1], prostate cancer [52], and diffuse large B-cell lymphoma (DLBCL) [53]. The detailed data information is summarized in Table 1. For all three data sets, the number of genes is significantly larger than the number of samples. Another major characteristic of microarray data, unlike the UCI data considered in the previous section, is the presence of a significant number of redundant features (or co-regulated genes), since genes function in a modular fashion. It is well known that including redundant features may not improve, but sometimes deteriorates, classification performance [10]. From the clinical perspective, the examination of the expression levels of redundant genes may not improve clinical decisions but needlessly increases medical examination costs. Hence, our goal is to derive a gene signature with a minimum number of genes that achieves a highly accurate prediction performance.

We have shown in the previous section that our algorithm significantly outperforms AMS, RFE and Simba in terms of computational efficiency and accuracy. In this experiment, we therefore only consider the KS test and I-RELIEF/BS, which are the two most competitive algorithms with respect to performance. For the results reported in this paper, the kernel width and regularization parameter of our algorithm are set to 5 and 1, respectively. We empirically find that the algorithm yields nearly identical prediction performance for a wide range of parameter values. The same kernel width is also used in I-RELIEF/BS. We use KNN to estimate the performance of each algorithm. We do not spend additional effort tuning the value of K, but simply set it to 3. Due to the small sample size, the leave-one-out cross-validation (LOOCV) method is used.

(Fig. 8 consists of one panel per data set (banana, waveform, diabetics, heart, twonorm, splice, and thyroid), each plotting Feature Scores against the feature index, with one curve for the case of no irrelevant features and one for the case with 5000 added irrelevant features.)

Fig. 8. Feature weights learned in one sample trial of the seven UCI data sets with and without 5000 irrelevant features.
The dashed red line indicates the number of original features. The weights plotted on the left side of the dashed line
are associated with the original features, while those on the right, with the additional 5000 irrelevant features. The
feature weights learned in the two cases are very similar.

In each iteration, one sample is held out for testing and the remaining samples are used to identify a gene signature. The genes are then ranked in decreasing order of their corresponding feature weights, and the 50 top-ranked genes are successively fed into a KNN classifier. The process is repeated until each sample has been tested. The classification errors are finally averaged and the best results are reported in Table 4. The number of genes at which the minimum error is attained is also recorded. Since in each iteration of LOOCV, KNN classifies a test sample either correctly or incorrectly (i.e., 0 or 1), we are unable to perform a statistical test (e.g., a t-test) to rigorously quantify the performance of each algorithm, as we did in the previous section. Nevertheless, we observe from Table 4 that our algorithm yields a better prediction accuracy with a much smaller gene subset. This is because both I-RELIEF/BS and the KS test are unable to eliminate redundant features. If a gene is top ranked, its co-regulated genes will also have a high ranking score. To further demonstrate this, we plot in Fig. 9(a) the feature weight vectors learned by the three algorithms on the breast cancer data. For ease of presentation, the genes along the x-axis are arranged according to their t-test p-values, so that the first gene contains the most discriminant information according to the t-test. We observe that some of the top-ranked features in the t-test are not selected in the gene signature learned by our algorithm. One possible explanation is that these excluded genes are redundant with respect to the identified gene signature.

TABLE 4
Classification errors (%) of three algorithms performed on three microarray data sets. The number in parentheses is the number of genes at which the minimal classification error is attained.

Dataset          Our Method  I-RELIEF/BS  KS
Prostate Cancer  16.5 (6)    25.3 (9)     21.5 (13)
Breast Cancer    21.7 (4)    23.7 (28)    27.8 (39)
DLBCL            2.6 (10)    5.2 (7)      7.8 (23)

(Fig. 9(a) plots Feature Score against genes, with curves for KS, I-RELIEF and Ours; Fig. 9(b) plots CPU Time (Second) against the number of genes, from 500 up to the full gene set.)

Fig. 9. (a) Feature weights learned by the three algorithms performed on the breast cancer data. (b) The CPU time it takes our algorithm to identify a gene signature for the breast cancer data with a varying number of genes, ranging from 500 to 24481.
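A minimal sketch of this LOOCV evaluation is given below, assuming a function learn_feature_weights(X, y) that implements the proposed algorithm (not reproduced here); the classifier follows scikit-learn, and the 50 top-ranked genes are successively fed into a 3-NN classifier as described above.

# Illustrative sketch of the LOOCV gene-signature evaluation (not the paper's code).
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def loocv_error_curve(X, y, learn_feature_weights, top=50, k_neighbors=3):
    """Return the LOOCV error rate for each signature size 1..top."""
    n_samples = X.shape[0]
    mistakes = np.zeros(top)
    for train_idx, test_idx in LeaveOneOut().split(X):
        w = learn_feature_weights(X[train_idx], y[train_idx])  # weights on held-in data
        order = np.argsort(w)[::-1]                            # rank genes by weight
        for m in range(1, top + 1):
            genes = order[:m]
            knn = KNeighborsClassifier(n_neighbors=k_neighbors)
            knn.fit(X[train_idx][:, genes], y[train_idx])
            pred = knn.predict(X[test_idx][:, genes])
            mistakes[m - 1] += int(pred[0] != y[test_idx][0])  # 0/1 per held-out sample
    return mistakes / n_samples                                # error rate per signature size

The numbers reported in Table 4 would then correspond to the minimum of this error curve and the signature size at which it is attained.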

This experiment further demonstrates the computational efficiency of our algorithm. Fig. 9(b) presents the CPU time it takes our feature selection algorithm to identify a gene signature for the breast cancer data set with a varying number of genes, ranging from 500 to 24481. It takes only about 22 seconds to process all 24481 genes. If a filter method (e.g., the t-test) is first used to reduce the feature dimensionality to, say, 2000, as is almost always done in microarray data analysis, our algorithm runs for only about two seconds.

5 ANALYSIS OF SAMPLE COMPLEXITY

This section presents a theoretical study of our algorithm's sample complexity. The main purpose of the analysis is to explain why the proposed algorithm performs so well for high-dimensional data, as demonstrated in the previous section. As one can see from Eq. (8), the algorithm finds a feature weight vector aimed at minimizing an empirical logistic loss. Hence, it is a learning problem, and the analysis can be performed within the VC-theory framework. Specifically, we try to establish the dependence of the generalization performance of the proposed algorithm on the input data dimensionality. Our study suggests that the algorithm has a logarithmic sample complexity with respect to the input feature dimensionality. We should emphasize that for many existing feature selection algorithms (e.g., wrapper methods), due to heuristic search, it would be difficult to conduct such a theoretical analysis.

We begin by reviewing some basic concepts of statistical learning theory. Let {x, y} be a pair of an observation and its target value, sampled from a fixed but unknown joint distribution p(x, y). In the sequel, we absorb y into x for notational simplicity. Given a set of real-valued mapping functions F = {f(x|α) : α ∈ Ω}, parameterized by α, and a loss function L(f(x|α)), we are concerned with the problem of finding a parameter α that minimizes the expected loss R(α) = E[L(f(x|α))] = ∫ L(f(x|α)) p(x) dx. In real applications, the true distribution is rarely known, and one has access only to a limited number of observations X^N = {x_1, · · · , x_N} independently drawn from the unknown distribution. A natural way to solve a learning problem is to find a parameter α that minimizes the empirical loss R(α, X^N) = (1/N) Σ_{n=1}^{N} L(f(x_n|α)). We are interested in how well a learning algorithm trained on a limited number of training samples performs on unseen data. This can be studied under the VC theory, which relies on the uniform convergence of the empirical loss to the expected loss. It has been proved by [36] that if the bound sup_{α∈Ω} |R(α, X^N) − R(α)| is tight, then the function that minimizes the empirical loss is likely to have an expected loss that is close to the best in the function class.

A theorem, due to [37], provides an upper bound on the rate of the uniform convergence of a class of functions in terms of its covering number. Before we present the theorem, we first define the concept of a covering number. The interested reader may refer to [38] and [39] for more comprehensive coverage of this subject.

Definition 2. Let F = {f(x|α) : x ∈ R^J, α ∈ Ω} be a set of real-valued functions. Given N arbitrary data samples X^N = {x_1, · · · , x_N} ⊂ R^J, define F(X^N) = {f(X^N|α) = [f(x_1|α), · · · , f(x_N|α)]^T : α ∈ Ω}. We say that a set V = {v_1, · · · , v_K} ⊂ R^N ǫ-covers F(X^N) in the p-norm if, for all α, there exists v_k ∈ V so that ||f(X^N|α) − v_k||_p ≤ N^{1/p} ǫ, where ||·||_p is the p-norm. The covering number of F(X^N) in the p-norm, denoted N_p(F, ǫ, X^N), is the cardinality of the smallest set that ǫ-covers F(X^N). Define N_p(F, ǫ, N) = sup_{X^N} N_p(F, ǫ, X^N).

Theorem 4 (Pollard, 1984). For all ǫ > 0 and distributions p(x), we have

P( sup_{α∈Ω} |R(α, X^N) − R(α)| > ǫ ) ≤ 8 E[N_1(L, ǫ/8, X^N)] exp( −Nǫ² / (128M²) ),   (21)

where M = sup_{α,x} L(α, x) − inf_{α,x} L(α, x), and N_1 is the 1-norm covering number of the function class L.

In general, it is very difficult to estimate the covering number of an arbitrary function class. Some general bounds on covering numbers exist. For example, the covering number of a closed ball of radius r centered at the origin in R^J is bounded by (4r/ǫ)^J [40], where ǫ is the radius of the disks covering the ball. These bounds are usually too loose to be useful. Fortunately, there exists a tight bound for linear function classes, due to [41], which can be used to estimate the covering number for our purposes. We slightly modify the theorem of [41] to include the nonnegativity constraint on the feature weights.

Theorem 5. Let F = {w^T x : w ≥ 0, ||w||_1 = a, x ∈ R^J, ||x||_∞ = b}. Then we have

log_2 N_2(F, ǫ, N) ≤ ⌈ a²b² / ǫ² ⌉ log_2(J + 1),   (22)

where ⌈x⌉ denotes x rounded up to the nearest integer.

The proof of the theorem is similar to that in [41].
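To make the role of the feature dimensionality J in this bound concrete, a small worked instance of (22) is given below; the particular values a = b = 1 and ǫ = 1/4 are our own choice and serve only as an illustration.

% Worked instance of the covering-number bound (22), for illustration only.
% With a = b = 1 and \epsilon = 1/4:
\log_2 N_2(\mathcal{F}, \tfrac{1}{4}, N)
  \le \Big\lceil \frac{a^2 b^2}{\epsilon^2} \Big\rceil \log_2(J + 1)
  = 16 \log_2(J + 1).
% For J = 10^3 this is roughly 16 x 10 = 160, and for J = 10^6 roughly
% 16 x 20 = 320: a thousandfold increase in the number of features only
% about doubles the bound, reflecting its logarithmic dependence on J.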

By using Theorems 4 and 5, we establish the following bounds for our algorithm with σ = +∞ and σ → 0. Note that, when the kernel width goes to zero, the algorithm finds only one nearest neighbor for each pattern when computing margins, as shown in Sec. 3.1.

Theorem 6. Let ||w||_1 ≤ β, w ≥ 0, x ∈ R^J, and ||x||_∞ ≤ 1. For the proposed algorithm, if σ = +∞, for all ǫ > 0 and distributions p(x), we have

P( sup_w |R(w, X^N) − R(w)| > ǫ ) ≤ 8 (J + 1)^{⌈256β²/ǫ²⌉} exp( −Nǫ² / (512(2β + 1)²) ).   (23)

Proof: If σ = +∞,

z_n = Σ_{i∈M_n} P(x_i = NM(x_n)|w) |x_n − x_i| − Σ_{i∈H_n} P(x_i = NH(x_n)|w) |x_n − x_i|
    = (1/|M_n|) Σ_{i∈M_n} |x_n − x_i| − (1/|H_n|) Σ_{i∈H_n} |x_n − x_i|.   (24)

Hence, for a given dataset X^N, z_n is a constant vector independent of w. Construct a data set Z^N = [z_1, · · · , z_N]. It can be shown that ||z_n||_∞ ≤ 2. Define a class of linear functions G = {g(z) = w^T z : z ∈ R^J, ||w||_1 ≤ β, w ≥ 0, ||z||_∞ ≤ 2}. By Theorem 5, the covering number N_2(G, ǫ, N) ≤ (J + 1)^{⌈4β²/ǫ²⌉}. From the definition of the covering number and Jensen's inequality, we have N_1 ≤ N_2. Now let us consider the function class L = {l(g(z)) : g ∈ G}. In the proposed algorithm, l(·) is the logistic loss function, l(g(z)) = log(1 + exp(−g(z))). It is proved in [39] that if l(·) is a Lipschitz function with Lipschitz constant L, then the covering number of L satisfies N_1(L, ǫ, N) ≤ N_1(G, ǫ/L, N). The logistic loss function is a Lipschitz function with Lipschitz constant L = 1 [5]. Hence,

E[N_1(L, ǫ, X^N)] ≤ N_1(L, ǫ, N) ≤ N_1(G, ǫ, N) ≤ N_2(G, ǫ, N) ≤ (J + 1)^{⌈4β²/ǫ²⌉}.   (25)

By using Holder's inequality,

|l(g(z))| = |log(1 + exp(−g(z)))| ≤ |w^T z| + 1 ≤ ||w||_1 ||z||_∞ + 1 ≤ 2β + 1.   (26)

Hence, M = sup_{w,z} L(w, z) − inf_{w,z} L(w, z) ≤ 2(2β + 1). Plugging Eq. (25) into Theorem 4, we prove the theorem.

Theorem 7. Let ||w||_1 ≤ β, w ≥ 0, x ∈ R^J, and ||x||_∞ ≤ 1. For the proposed algorithm, if σ → 0, for all ǫ > 0 and distributions p(x), we have

P( sup_w |R(w, X^N) − R(w)| > ǫ ) ≤ 8 (N(N/2 − 1)² + 1)(J + 1)^{⌈256β²/ǫ²⌉} exp( −Nǫ² / (512(2β + 1)²) ).   (27)

If the number of samples is smaller than the feature dimensionality, then

P( sup_w |R(w, X^N) − R(w)| > ǫ ) < 2 (J + 1)^{⌈256β²/ǫ²⌉ + 3} exp( −Nǫ² / (512(2β + 1)²) ).   (28)

Proof: Consider the equation w^T |x_1 − x_2| = w^T |x_1 − x_3|, which divides the parameter space into two parts, where the nearest neighbor of x_1 is either x_2 or x_3. Assume for simplicity that there are N/2 samples in each class. There are N( C(N/2 − 1, 2) + C(N/2, 2) ) = N(N/2 − 1)² such hyperplanes, which divide the parameter space into at most N(N/2 − 1)² + 1 parts. In each of these parts, the nearest neighbors of a given sample are the same, independent of w. For the i-th part, construct a dataset Z^N = [z_1, · · · , z_N], where z_n = |x_n − NM(x_n)| − |x_n − NH(x_n)|, and define a class of linear functions G_i = {g_i(z) = w^T z : z ∈ R^J, ||w||_1 ≤ β, w ≥ 0, ||z||_∞ ≤ 2}. By Theorem 5, the covering number of G_i is upper bounded by (J + 1)^{⌈4β²/ǫ²⌉}, and the total covering number is therefore upper bounded by (N(N/2 − 1)² + 1)(J + 1)^{⌈4β²/ǫ²⌉}. Now, by using the same arguments that prove Theorem 6, we conclude the proof of the theorem.

In the current formulation, when the kernel width goes to zero, the algorithm may not converge. However, the algorithm is not agnostic. The current formulation is based on batch learning. Although not considered in this paper, it is possible to specify an online-learning version of our algorithm that updates the feature weights after seeing each sample, rather than the entire dataset. The convergence of this online feature selection algorithm does not depend on any specific value of the kernel width, but on the learning rate [43].

Both Theorems 6 and 7 can easily be written as the following PAC-style generalization error bounds, which facilitate the sample complexity analysis.

Theorem 8. Let ||w||_1 ≤ β, w ≥ 0, x ∈ R^J, and ||x||_∞ ≤ 1. For the proposed algorithm, if σ → 0, for all w and δ > 0, with probability at least 1 − δ, the following generalization error bound holds:

R(w) < R(w, X^N) + √( 512(2β + 1)² ( √( ln(J + 1)/(2N) ) + (4 ln(J + 1) + ln(2/δ))/N ) ),   (29)

and for σ = +∞, a similar generalization error bound holds with a minor difference in the constants:

R(w) < R(w, X^N) + √( 512(2β + 1)² ( √( ln(J + 1)/(2N) ) + (ln(J + 1) + ln(8/δ))/N ) ).   (30)

Theorem 8 can be proved by setting the right-hand sides of (23) and (27) to δ and solving for ǫ.
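As a sketch of this proof step, one way to carry it out for (23) is shown below; the elementary bounds ⌈x⌉ ≤ x + 1, √(u + v) ≤ √u + √v, and β(2β + 1) ≤ (2β + 1)² used here are our own simplifications, so the constants should be read as illustrative rather than definitive.

% Sketch: set the right-hand side of (23) equal to \delta and solve for \epsilon.
% Taking logarithms of
%   8 (J+1)^{\lceil 256\beta^2/\epsilon^2 \rceil}
%     \exp\!\big(-N\epsilon^2 / (512(2\beta+1)^2)\big) = \delta
% gives
\frac{N\epsilon^2}{512(2\beta+1)^2}
  = \Big\lceil \frac{256\beta^2}{\epsilon^2} \Big\rceil \ln(J+1) + \ln(8/\delta)
  \le \frac{256\beta^2}{\epsilon^2}\,\ln(J+1) + \ln(J+1) + \ln(8/\delta).
% Multiplying through by 512(2\beta+1)^2 \epsilon^2 / N and solving the
% resulting quadratic inequality in \epsilon^2 yields
\epsilon^2 \le 512\,\beta(2\beta+1)\sqrt{\frac{\ln(J+1)}{2N}}
  + \frac{512(2\beta+1)^2\big(\ln(J+1) + \ln(8/\delta)\big)}{N}
  \le 512(2\beta+1)^2\left(\sqrt{\frac{\ln(J+1)}{2N}}
  + \frac{\ln(J+1) + \ln(8/\delta)}{N}\right),
% which, together with R(w) - R(w, X^N) \le \epsilon holding with probability
% at least 1 - \delta, gives the form of the bound stated in (30).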

As can be seen from (29) and (30), both generalization error bounds depend logarithmically on the feature dimensionality J. An equivalent statement of Theorem 8 is that, for the obtained learning algorithms, the number of samples needed in order to maintain the same level of learning accuracy grows only logarithmically with the feature dimensionality. This dependence is very weak, and matches the best known bounds proved in various feature selection contexts.

Using the infinite kernel width in our algorithm amounts to making a linear assumption about the data model. In this case, our algorithm has a generalization error bound similar to that of ℓ1-regularized logistic regression. The main difference is that our algorithm has an implicit LOO regularization. Taking this regularization into account when deriving the error bound may lead to an even tighter bound.

A somewhat surprising result is that when the kernel width goes to zero our algorithm also has a logarithmic sample complexity. As discussed in Sec. 3, if we adopt a classification rule that classifies x by the sign of its margin ρ̄(x), our algorithm can be viewed as a one-nearest-neighbor classifier (1NN). It is well known that 1NN performs poorly for high-dimensional data. [19] proves that when feature selection is performed, the generalization error bound of 1NN depends logarithmically on the input feature dimensionality, but polynomially on the number of selected features (Theorem 1 in [19]). While our generalization bound in Eq. (29) is consistent with the result of [19], note, however, that our bound does not depend on the number of selected features, but on the total size of the feature weights (i.e., ||w||_1 bounded by β; see also Eq. (8)). This result is consistent with that of [45], namely that the size of the weights is more important than the size of the neural network. We perform some experiments on using our algorithm for classification (i.e., classifying x by the sign of its margin ρ̄(x)) on the UCI data sets contaminated by varying numbers of irrelevant features, ranging from 0 to 10000. Since the main focus of the paper is on feature selection, we report the detailed results in the supplementary data. As can be seen from Table 1S, the classification performance is largely insensitive to a growing number of irrelevant features. This result is very encouraging, and indicates that it is possible to develop a new classification algorithm with a logarithmic dependence on data dimensionality without making any assumptions about the data distribution.

Here we provide the sample complexity analysis only for two specific kernel widths. However, it is reasonable to expect that a relatively large kernel width should improve the generalization error bound of the algorithm over that derived when the kernel width goes to zero (consider the similar case where KNN usually performs better than 1NN). The result presented in Fig. 7 shows that a large kernel width can indeed remove irrelevant features, and our empirical result obtained on the spiral data set contaminated by up to one million irrelevant features suggests that the proposed algorithm has a logarithmic sample complexity.

It is worth noting that the above analysis for a general case with an arbitrary value of the kernel width is very difficult. Indeed, after decades of research, the covering number and similar techniques (e.g., Rademacher complexity and the VC dimension) have been worked out only for a small set of function classes. The problem of deriving generalization error bounds for arbitrary function classes is still largely open in the machine learning community, and we do not expect that this paper can solve this open problem. However, the experimentally verified high accuracy, computational efficiency, and ease of implementation of the proposed algorithm justify its presentation to the broader community at this stage of our theoretical development.

6 DISCUSSION

We conclude this paper by tracing back the origins of this work, comparing our algorithm with some related feature selection/weighting approaches, and summarizing the main contributions made in this paper.

Our approach is motivated to a great extent by the ideas implemented in the RELIEF algorithm [18]. RELIEF is considered one of the most successful feature weighting algorithms due to its simplicity and effectiveness [33]. It had long been regarded as a heuristic filter method until recently, when we mathematically proved that RELIEF is an online-learning algorithm that solves a convex optimization problem aimed at maximizing the averaged margin [42]. One major problem with RELIEF is that the nearest neighbors of a given sample are predefined in the original feature space, which typically yields erroneous nearest hits and misses in the presence of copious irrelevant features. RELIEF-F [49] mitigates this problem by searching for multiple, instead of just one, nearest neighbors when computing margins. Empirical studies have shown that RELIEF-F achieves a significant performance improvement over RELIEF [49]. However, the performance of both algorithms degrades significantly as the feature dimensionality increases. To address this problem, we recently proposed a new feature weighting algorithm referred to as I-RELIEF, which performs significantly better than RELIEF-F [42]. Similar to the algorithm proposed in this paper, I-RELIEF employs a probabilistic model to define the local information of a given sample. However, as with all other algorithms in the RELIEF family, the objective function optimized by I-RELIEF is not directly related to the classification performance of a learning algorithm. Moreover, I-RELIEF imposes an ℓ2 constraint on the feature weights, and thus is not able to remove redundant features and provide a sparse solution (see Fig. 9(a)). I-RELIEF also does not enjoy the same theoretical properties as ours.
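For reference, the classic RELIEF update discussed above can be sketched as follows. This is a schematic single-pass variant written by us for illustration (Manhattan distances, no random subsampling or per-feature normalization); it is not the algorithm proposed in this paper.

# Schematic sketch of the classic RELIEF feature-weighting update [18].
# Note: the nearest hit/miss are found in the original, unweighted feature space.
import numpy as np

def relief_weights(X, y):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for i in range(n_samples):
        dists = np.abs(X - X[i]).sum(axis=1)               # L1 distances to sample i
        dists[i] = np.inf                                   # exclude the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))      # nearest hit (same class)
        miss = np.argmin(np.where(~same, dists, np.inf))    # nearest miss (other class)
        # features that locally separate the classes gain weight
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples

The limitation noted above is visible here: the nearest hit and miss are computed once in the original feature space, regardless of the weights being learned, which is exactly what degrades RELIEF-type methods when many irrelevant features are present.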

This work is also motivated by the Simba algorithm recently proposed in [19]. Compared to RELIEF, Simba re-evaluates the distances according to the learned weight vector, and thus is superior to RELIEF. One major problem, however, is that the objective function optimized by Simba is characterized by many local minima, which can be mitigated by restarting the algorithm from several initial points [19]. Also, Simba represents a constrained nonlinear optimization problem that cannot be easily solved by conventional optimization techniques. We empirically find that Simba performs very well when the number of irrelevant features is small, but may fail completely when there exists a large number of irrelevant features (see Table 2). One possible explanation is that the chance of Simba becoming stuck in local minima increases dramatically with the number of features.

Feature selection is closely related to distance metric learning (see, e.g., NCA [46], LMNN [47] and LFE [48]). These algorithms are also based on local learning and share the same goal as ours of reducing data dimensionality, but they differ completely in their algorithmic formulations. Moreover, these algorithms all perform feature extraction, and it is unclear whether they enjoy the same theoretical properties outlined in Sec. 5.

The proposed algorithm embraces some fundamental concepts in machine learning. It is related to SVM in the sense that both algorithms solve a nonlinear problem by first transforming it into a linear problem and then solving the linear one so that the margin is maximized. Unlike SVM, the linearization in our approach is achieved by local learning, instead of projecting the data onto a higher (possibly infinite-dimensional) space, based on the concept that a given complex problem can be more easily, yet accurately enough, analyzed by parsing it into a set of locally linear ones. Local learning allows one to capture the local structure of the data, while the parameter estimation is performed globally to avoid possible overfitting. The idea of "fit locally and think globally" is also used in the well-known locally linear embedding (LLE) algorithm, which approximates a complex nonlinear manifold using a set of locally linear patches [44]. LLE is an algorithm for dimensionality reduction in unsupervised learning settings, while our algorithm is for supervised learning. Another important difference between the two algorithms is that LLE is based on the assumption that nearby points in the high-dimensional space remain adjacent in the reduced low-dimensional space, which may not be true in the presence of copious irrelevant features, as shown in Fig. 2.

The main contribution of this paper is that we provide a principled way to perform feature selection for classification problems with complex data distributions and very high data dimensionality. Our approach avoids any heuristic combinatorial search, and hence can be implemented efficiently. Unlike many existing methods, it has a solid theoretical foundation that ensures its performance. Moreover, its implementation and parameter tuning are easy, and the extension of the algorithm to multiclass settings is straightforward. We have experimentally demonstrated that our algorithm is already capable of handling many feature selection problems one may encounter in scientific research. Considering the increasing demand for analyzing data with a large number of features in many research fields, including bioinformatics, economics, and computer vision, we expect that the work presented in this paper will have a broad impact.

7 ACKNOWLEDGMENTS

The authors thank the associate editor Dr. Olivier Chapelle and three anonymous reviewers for numerous suggestions that significantly improved the quality of the paper. This work is in part supported by the Susan Komen Breast Cancer Foundation under grant No. BCTR0707587.

REFERENCES

[1] L. van't Veer, H. Dai, M. van de Vijver, et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, pp. 530–536, 2002.
[2] Y. Wang, J. Klijn, Y. Zhang, et al., "Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer," Lancet, vol. 365, pp. 671–679, 2005.
[3] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[4] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proc. 13th Adv. Neu. Info. Proc. Sys., 2001, pp. 668–674.
[5] A. Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance," in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 78–86.
[6] A. Y. Ng and M. I. Jordan, "Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection," in Proc. 18th Int. Conf. Mach. Learn., 2001, pp. 377–384.
[7] J. Lafferty and L. Wasserman, "Challenges in statistical machine learning," Statistica Sinica, vol. 16, pp. 307–322, 2006.
[8] M. Hilario and A. Kalousis, "Approaches to dimensionality reduction in proteomic biomarker studies," Briefings in Bioinformatics, vol. 9, no. 2, pp. 102–118, 2008.
[9] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
[10] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artif. Intell., vol. 97, no. 1-2, pp. 273–324, 1997.
[11] P. Pudil and J. Novovicova, "Novel methods for subset selection with respect to problem knowledge," IEEE Intell. Syst., vol. 13, no. 2, pp. 66–74, March 1998.
[12] D. Koller and M. Sahami, "Toward optimal feature selection," in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 284–292.
[13] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," J. Artif. Intell. Res., vol. 2, pp. 263–286, 1995.
[14] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines," Mach. Learn., vol. 46, no. 1, pp. 131–159, 2002.
[15] T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff, "Embedded methods," in Feature Extraction, Foundations and Applications, I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Eds. Springer-Verlag, 2006, pp. 137–165.
[16] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, no. 1-3, pp. 389–422, 2002.
[17] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," in Proc. 16th Adv. Neu. Info. Proc. Sys., 2004.
[18] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. 9th Int. Conf. Mach. Learn., 1992, pp. 249–256.
[19] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Margin based feature selection - theory and algorithms," in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 43–50.
[20] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," Ann. Statist., vol. 26, no. 5, pp. 1651–1686, 1998.
[21] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc. Ser. B, vol. 39, no. 1, pp. 1–38, 1977.
[22] C. Atkeson, A. Moore, and S. Schaal, "Locally weighted learning," Artif. Intell. Rev., vol. 11, no. 15, pp. 11–73, 1997.
[23] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[24] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[25] M. Y. Park and T. Hastie, "ℓ1 regularization path algorithm for generalized linear models," J. R. Statist. Soc. B, vol. 69, no. 4, pp. 659–677, 2007.
[26] L. Meier, S. van de Geer, and P. Buhlmann, "The group lasso for logistic regression," J. R. Statist. Soc. B, vol. 70, pp. 53–71, 2008.
[27] V. Roth, "The generalized LASSO," IEEE Trans. Neu. Net., no. 15, pp. 16–28, 2004.

[28] S. Rosset, "Following curved regularized optimization solution paths," in Proc. 17th Adv. Neu. Info. Proc. Sys., 2005, pp. 1153–1160.
[29] D. L. Donoho and M. Elad, "Optimally sparse representations in general nonorthogonal dictionaries by ℓ1 minimization," Proc. Natl. Acad. Sci. U.S.A., vol. 100, no. 5, pp. 2197–2202, 2003.
[30] L. Breiman, "Better subset regression using the nonnegative garrote," Technometrics, vol. 37, no. 4, pp. 373–384, 1995.
[31] R. Kress, Numerical Analysis. New York: Springer-Verlag, 1998.
[32] P. Zezula, G. Amato, V. Dohnal, and M. Batko, Similarity Search - The Metric Space Approach. Springer, 2006.
[33] T. G. Dietterich, "Machine learning research: four current directions," AI Magazine, vol. 18, no. 4, pp. 97–136, 1997.
[34] Y. Sun, S. Todorovic, J. Li, and D. Wu, "Unifying error-correcting and output-code AdaBoost through the margin concept," in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 872–879.
[35] A. Asuncion and D. Newman, "UCI machine learning repository," 2007.
[36] V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition. Nauka, Moscow (in Russian), 1974.
[37] D. Pollard, Convergence of Stochastic Processes. New York: Springer-Verlag, 1984.
[38] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[39] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[40] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin Amer. Math. Soc., vol. 39, no. 1, pp. 1–49, 2002.
[41] T. Zhang, "Covering number bounds of certain regularized linear function classes," J. Mach. Learn. Res., vol. 2, pp. 527–550, 2002.
[42] Y. Sun and J. Li, "Iterative RELIEF for feature weighting," in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 913–920.
[43] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. New York: Springer-Verlag, 2003.
[44] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[45] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network," IEEE Trans. Inform. Theory, vol. 44, no. 2, pp. 525–536, 1998.
[46] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in Proc. 17th Adv. Neu. Info. Proc. Sys., 2005, pp. 513–520.
[47] K. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. 18th Adv. Neu. Info. Proc. Sys., 2006, pp. 1473–1480.
[48] Y. Sun and D. Wu, "A RELIEF based feature extraction algorithm," in Proc. 8th SIAM Intl. Conf. Data Mining, 2008, pp. 188–195.
[49] I. Kononenko, "Estimating attributes: analysis and extensions of RELIEF," in Proc. Eur. Conf. Mach. Learn., 1994, pp. 171–182.
[50] Y. Sun, S. Goodison, J. Li, L. Liu, and W. Farmerie, "Improved breast cancer prognosis through the combination of clinical and genetic markers," Bioinformatics, vol. 23, no. 1, pp. 30–37, 2007.
[51] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[52] A. J. Stephenson, A. Smith, M. W. Kattan, J. Satagopan, V. E. Reuter, P. T. Scardino, and W. L. Gerald, "Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy," Cancer, vol. 104, no. 2, pp. 290–298, 2005.
[53] M. A. Shipp, K. N. Ross, P. Tamayo, et al., "Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning," Nat. Med., vol. 8, pp. 68–74, 2002.

Yijun Sun received two B.S. degrees, in electrical and mechanical engineering, from Shanghai Jiao Tong University, Shanghai, China, in 1995, and the M.S. and Ph.D. degrees in electrical engineering from the University of Florida, Gainesville, USA, in 2003 and 2004, respectively. Currently, he is a research scientist at the Interdisciplinary Center for Biotechnology Research and an affiliated faculty member of the Department of Electrical and Computer Engineering at the University of Florida. His research interests are mainly in machine learning, bioinformatics, and their applications to cancer study and microbial community analysis.

Sinisa Todorovic received his Ph.D. degree in electrical and computer engineering from the University of Florida in 2005. He was a Postdoctoral Research Associate in the Beckman Institute at the University of Illinois Urbana-Champaign between 2005 and 2008. Currently, Dr. Todorovic is an Assistant Professor in the School of EECS at Oregon State University. His research interests include computer vision and machine learning, with a focus on object/activity recognition and texture analysis. His synergistic activities include: Associate Editor of the Image and Vision Computing Journal, Program Chair of the 1st International Workshop on Stochastic Image Grammars 2009, and reviewer for all major journals and conferences in computer vision. He was awarded the Jack Neubauer Best Paper Award by the IEEE Vehicular Technology Society in 2004, and the Outstanding Reviewer award at the 11th IEEE International Conference on Computer Vision in 2007.

Steve Goodison obtained a BSc degree in Biochemistry from the University of Wales in 1989. He went on to receive a PhD in molecular biology from Oxford University in 1993 as a Wellcome Trust Scholar. Postdoctoral studies at Oxford University included diabetes research as a Wellcome Trust fellow and cancer research in the Dept. of Clinical Biochemistry. He joined the University of California, San Diego as an Assistant Professor in 2001, and is currently an Associate Professor at the University of Florida. His research interests are primarily in cancer research, and include molecular pathology, the function of specific genes in the metastatic spread of cancer, and biomarker discovery and development for clinical diagnosis and prognosis.
