Local-Learning-Based Feature Selection For High-Dimensional Data Analysis
Abstract—This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We
propose a new feature selection algorithm that addresses several major issues with prior work, including problems with algorithm
implementation, computational complexity and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear
problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin
framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making
any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on
a personal computer, while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features.
Theoretical analyses of the algorithm’s sample complexity suggest that the algorithm has a logarithmic sample complexity with
respect to the number of features. Experiments on eleven synthetic and real-world data sets demonstrate the viability of our formulation
of the feature selection problem for supervised learning and the effectiveness of our algorithm.
Index Terms—Feature selection, local learning, logistic regression, ℓ1 regularization, sample complexity.
a close-to-optimal solution even when the data contains one million irrelevant features.

We study the algorithm's properties from several different angles to explain why the proposed algorithm performs well in a high-dimensional space. We show that the algorithm can be regarded as finding a feature weight vector so that the upper bound of the leave-one-out cross-validation error of a nearest-neighbor classifier in the induced feature space is minimized. By using fixed point theory, we prove that, under a mild condition, the proposed algorithm converges to a solution as if it had perfect prior knowledge as to which features are relevant. We also conduct a theoretical analysis of the algorithm's sample complexity, which suggests that the algorithm has a logarithmic sample complexity with respect to the input data dimensionality. That is, the number of samples needed to maintain the same level of learning accuracy grows only logarithmically with respect to the feature dimensionality. This dependence is very weak, and matches the best known bounds proved in various feature selection contexts [5], [6]. Although logarithmic sample complexity is not new in the literature, it has previously been established only for linear models, whereas our algorithm makes no assumptions about the underlying data distribution.

In this paper, we also show that the aforementioned theoretical analysis of our feature selection algorithm may have an important implication in learning theory. Based on the proposed feature selection algorithm, we show that it is possible to derive a new classification algorithm with a generalization error bound that grows only logarithmically in the input data dimensionality, for an arbitrary data distribution. This is a very encouraging result, considering that a large part of machine learning research is focused on developing learning algorithms that behave gracefully when faced with the curse of dimensionality.

This paper is organized as follows. Sec. 2 reviews prior work, focusing on the main problems with existing methods that are addressed by our algorithm. The newly proposed feature selection algorithm is described in Sec. 3. This section also presents the convergence analysis of our algorithm in Sec. 3.1, its computational complexity in Sec. 3.2, and its extension to multiclass problems in Sec. 3.3. Experimental evaluation is presented in Sec. 4. The algorithm's sample complexity is discussed in Sec. 5, after which we conclude the paper by tracing back the origins of our work and pointing out the major differences and improvements made here as compared to other well-known algorithms.

2 LITERATURE REVIEW

Research on feature selection has been very active in the past decade [8], [10], [11], [12], [14], [18], [19]. This section gives a brief review of existing algorithms and discusses some major issues with prior work that are addressed by our algorithm. The interested reader may refer to [8] and [9] for more details.

Existing algorithms are traditionally categorized as wrapper or filter methods, with respect to the criteria used to search for relevant features [10]. In wrapper methods, a classification algorithm is employed to evaluate the goodness of a selected feature subset, whereas in filter methods criterion functions evaluate feature subsets by their information content, typically interclass distance (e.g., Fisher score) or statistical measures (e.g., p-value of a t-test), instead of optimizing the performance of any specific learning algorithm directly. Hence, filter methods are computationally much more efficient, but usually do not perform as well as wrapper methods.

One major issue with wrapper methods is their high computational complexity due to the need to train a large number of classifiers. Many heuristic algorithms (e.g., forward and backward selection [11]) have been proposed to alleviate this issue. However, due to their heuristic nature, none of them can provide any guarantee of optimality. With tens of thousands of features, which is the case in gene expression microarray data analysis, a hybrid approach is usually adopted, wherein the number of features is first reduced by using a filter method and then a wrapper method is applied to the reduced feature set. Nevertheless, it still may take several hours to perform the search, depending on the classifier used in the wrapper method. To reduce complexity, in practice, a simple classifier (e.g., a linear classifier) is often used to evaluate the goodness of feature subsets, and the selected features are then fed into a more complicated classifier in the subsequent data analysis. This gives rise to the issue of feature exportability – in some cases, a feature subset that is optimal for one classifier may not work well for others [8]. Another issue associated with a wrapper method is its capability to perform feature selection for multiclass problems. To a large extent, this property depends on the capability of the classifier used in the wrapper method to handle multiclass problems. In many cases, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method [13], [34], and then feature selection is performed for each binary problem. This strategy further increases the computational burden of a wrapper method. One issue that is rarely addressed in the literature is algorithmic implementation. Many wrapper methods require the training of a large number of classifiers and the manual specification of many parameters. This makes their implementation and use rather complicated, demanding expertise in machine learning. This is probably one of the main reasons why filter methods are more popular in the biomedical community [1], [2].

It is difficult to address the aforementioned issues directly in the wrapper-method framework. To overcome this difficulty, embedded methods have recently received increased interest (see, e.g., [4], [5], [14], [16], [17]). The interested reader may refer to [15] for an excellent review. Embedded methods incorporate feature selection into the learning process of a classifier.
A feature weighting strategy is usually adopted that uses real-valued numbers, instead of binary ones, to indicate the relevance of features in the learning process. This strategy has many advantages. For example, there is no need to pre-specify the number of relevant features. Also, standard optimization techniques (e.g., gradient descent) can be used to avoid a combinatorial search. Hence, embedded methods are usually computationally more tractable than wrapper methods. Still, computational complexity is a major issue when the number of features becomes excessively large. Other issues, including algorithm implementation, feature exportability and extension to multiclass problems, also remain.

Some recently developed embedded algorithms can be used for large-scale feature selection problems under certain assumptions. For example, [4], [14] propose to perform feature selection directly in the SVM formulation, where the scaling factors are adjusted using the gradient of a theoretical upper bound on the error rate. RFE [16] is a well-known feature selection method specifically designed for microarray data analysis. It works by iteratively training an SVM classifier with the current set of features, and then heuristically removing the features with small feature weights. As with wrapper methods, the structural parameters of the SVM may need to be re-estimated by using, for example, cross-validation during the iterations. Also, a linear kernel is usually used for computational reasons. ℓ1-SVM with a linear kernel [17], with proper parameter tuning, can lead to a sparse solution, where only relevant features receive non-zero weights. A similar algorithm is logistic regression with ℓ1 regularization. It has been proved in [5] that ℓ1-regularized logistic regression has a logarithmic sample complexity with respect to the number of features. However, the linearity assumption of the data models in these approaches limits their applicability to general problems.

3 OUR ALGORITHM

In this section, we present a new feature selection algorithm that addresses many issues with prior work discussed in Sec. 2. Let D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1} be a training data set, where x_n is the n-th data sample containing J features, y_n is its corresponding class label, and J ≫ N. For clarity, we here consider only binary problems, while in Sec. 3.3 our algorithm is generalized to address multiclass problems. We first define the margin. Given a distance function, we find two nearest neighbors of each sample x_n, one from the same class (called the nearest hit, NH), and the other from the different class (called the nearest miss, NM). The margin of x_n is then computed as

ρ_n = d(x_n, NM(x_n)) − d(x_n, NH(x_n)),    (1)

where d(·) is a distance function. For the purposes of this paper, we use the Manhattan distance to define a sample's margin and nearest neighbors, while other standard definitions may also be used. This margin definition is implicitly used in the well-known RELIEF algorithm [18], and was first mathematically defined in [19] (using the Euclidean distance) for the feature-selection purpose. An intuitive interpretation of this margin is a measure of how much the features of x_n can be corrupted by noise (or how much x_n can "move" in the feature space) before being misclassified. By the large margin theory [3], [20], a classifier that minimizes a margin-based error function usually generalizes well on unseen test data. One natural idea then is to scale each feature, and thus obtain a weighted feature space, parameterized by a nonnegative vector w, so that a margin-based error function in the induced feature space is minimized. The margin of x_n, computed with respect to w, is given by

ρ_n(w) = d(x_n, NM(x_n)|w) − d(x_n, NH(x_n)|w) = w^T z_n,    (2)

where z_n = |x_n − NM(x_n)| − |x_n − NH(x_n)|, and | · | is an element-wise absolute-value operator. Note that ρ_n(w) is a linear function of w and has the same form as the sample margin of SVM, given by ρ_SVM(x_n) = w^T φ(x_n), using a mapping function φ(·). An important difference, however, is that by construction the magnitude of each element of w in the above margin definition reflects the relevance of the corresponding feature in the learning process. This is not the case in SVM, except when a linear kernel is used, which, however, can capture only linear discriminant information. Note that the margin thus defined requires only information about the neighborhood of x_n, while no assumption is made about the underlying data distribution. This means that by local learning we can transform an arbitrary nonlinear problem into a set of locally linear ones.

The local linearization of a nonlinear problem enables us to estimate the feature weights by using a linear model that has been extensively studied in the literature. It also facilitates the mathematical analysis of the algorithm. The main problem with the above margin definition, however, is that the nearest neighbors of a given sample are unknown before learning. In the presence of many thousands of irrelevant features, the nearest neighbors defined in the original space can be completely different from those in the induced space (see Fig. 2). To account for the uncertainty in defining local information, we develop a probabilistic model, where the nearest neighbors of a given sample are treated as hidden variables. Following the principles of the expectation-maximization algorithm [21], we estimate the margin by computing the expectation of ρ_n(w), averaging out the hidden variables:

ρ̄_n(w) = w^T ( E_{i∼M_n}[|x_n − x_i|] − E_{i∼H_n}[|x_n − x_i|] )
        = w^T ( Σ_{i∈M_n} P(x_i = NM(x_n)|w) |x_n − x_i| − Σ_{i∈H_n} P(x_i = NH(x_n)|w) |x_n − x_i| ) = w^T z̄_n,    (3)

where M_n = {i : 1 ≤ i ≤ N, y_i ≠ y_n}, H_n = {i : 1 ≤ i ≤ N, y_i = y_n, i ≠ n}, E_{i∼M_n} denotes the expectation
computed with respect to M_n, and P(x_i = NM(x_n)|w) and P(x_i = NH(x_n)|w) are the probabilities of sample x_i being the nearest miss or hit of x_n, respectively. These probabilities are estimated via standard kernel density estimation:

P(x_i = NM(x_n)|w) = k(||x_n − x_i||_w) / Σ_{j∈M_n} k(||x_n − x_j||_w),  ∀ i ∈ M_n,    (4)

and

P(x_i = NH(x_n)|w) = k(||x_n − x_i||_w) / Σ_{j∈H_n} k(||x_n − x_j||_w),  ∀ i ∈ H_n,    (5)

where k(·) is a kernel function. Specifically, we use the exponential kernel k(d) = exp(−d/σ), where the kernel width σ is an input parameter that determines the resolution at which the data is locally analyzed. Other kernel functions can also be used, and descriptions of their properties can be found in [22].

To motivate the above formulation, we consider the well-known Fermat's spiral problem, in which two-class samples are distributed in a two-dimensional space, forming a spiral shape, as illustrated in Fig. 1(a). A possible decision boundary is also plotted. If one walks from point A to B along the decision boundary, at any given point (say, point C), one would obtain a linear problem locally. One possible linear formulation is given by Eq. (3). Clearly, in this spiral problem, both features are equally important. By projecting the transformed data z̄_n onto the feature weight vector w = [1, 1]^T, we note that most samples have positive margins (Fig. 1(b)). The above arguments generally hold for arbitrary nonlinear problems for a wide range of kernel widths, as long as the local linearity condition is preserved. We will demonstrate in the experiments that the algorithm's performance is indeed robust against the specific choice of kernel width.

Fig. 1. Fermat's spiral problem. (a) Samples belonging to two classes are distributed in a two-dimensional space, forming a spiral shape. A possible decision boundary is also plotted. If one walks from point A to B along the decision boundary, at any given point (say, point C), one would obtain a linear problem locally. (b) By projecting the transformed data z̄_n onto the direction specified by w, most samples have positive margins.

After the margins are defined, the problem of learning feature weights can be solved within the large margin framework. The two most popular margin formulations are SVM [3] and logistic regression [23]. Due to the nonnegativity constraint on w, the SVM formulation represents a large-scale optimization problem, and the problem size cannot be reduced by transforming it into the dual domain. For computational convenience, we therefore perform the estimation in the logistic regression formulation, which leads to the following optimization problem:

min_w Σ_{n=1}^N log( 1 + exp(−w^T z̄_n) ),  subject to w ≥ 0,    (6)

where w ≥ 0 means that each element of w is nonnegative.

In applications with a huge number of features, we expect that most features are irrelevant. For example, in cancer prognosis, most genes are not involved in tumor growth and/or spread [1], [2]. To encourage sparseness, one commonly used strategy is to add an ℓ1 penalty on w to the objective function [5], [17], [24], [25], [26], [27], [28]. Obtaining sparse solutions by introducing the ℓ1 penalty has been theoretically justified (see, for example, [29] and the references therein). With the ℓ1 penalty, we obtain the following optimization problem:

min_w Σ_{n=1}^N log( 1 + exp(−w^T z̄_n) ) + λ||w||_1,  subject to w ≥ 0,    (7)

where λ is a parameter that controls the penalty strength and, consequently, the sparseness of the solution.

The optimization formulation (7) can also be written as

min_w Σ_{n=1}^N log( 1 + exp(−w^T z̄_n) ),  subject to ||w||_1 ≤ β, w ≥ 0.    (8)

In statistics, the above formulation is called the nonnegative garrote [30]. For every solution to (7), obtained for a given value of λ, there is a corresponding value of β in (8) that gives the same solution. The optimization problem (8) has an interesting interpretation: if we adopt a classification rule where x_n is correctly classified if and only if its margin ρ̄_n(w) ≥ 0 (i.e., on average, x_n is closer to the patterns from the same class in the training data, excluding x_n, than to those from the opposite class), then Σ_{n=1}^N I(ρ̄_n(w) < 0) is the leave-one-out (LOO) classification error induced by w, where I(·) is the indicator function. Since the logistic loss function is an upper bound of the misclassification loss function, up to a constant factor, the physical meaning of our algorithm is to find a feature weight vector so that the upper bound of the LOO classification error in the induced feature space is minimized. Hence, the algorithm has two levels of regularization, i.e., the implicit LOO and the explicit ℓ1 regularization. We will shortly see that this property, together with the convergence property, leads to the superior performance of our algorithm in the presence of many thousands of irrelevant features. We will also see that the performance of our algorithm is largely insensitive to the specific choice of λ, due to the LOO regularization.
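As a concrete illustration of Eqs. (3)-(5) and (7), the following NumPy sketch computes the expected margin vectors z̄_n and evaluates the ℓ1-penalized logistic objective for a nonnegative weight vector. This is our own illustrative code, not the paper's Matlab implementation; all function and variable names are ours.

    import numpy as np

    def expected_margin_vectors(X, y, w, sigma):
        """Return zbar (N x J): expected margin vectors of Eq. (3) under weights w."""
        N, J = X.shape
        zbar = np.zeros((N, J))
        for n in range(N):
            diff = np.abs(X - X[n])               # |x_n - x_i| for every i, element-wise
            dist = diff @ w                       # weighted Manhattan distances ||x_n - x_i||_w
            k = np.exp(-dist / sigma)             # exponential kernel k(d) = exp(-d / sigma)
            miss = (y != y[n])
            hit = (y == y[n]); hit[n] = False     # exclude x_n itself from the hits
            p_miss = np.where(miss, k, 0.0); p_miss /= p_miss.sum()   # Eq. (4)
            p_hit = np.where(hit, k, 0.0);  p_hit /= p_hit.sum()      # Eq. (5)
            zbar[n] = p_miss @ diff - p_hit @ diff                    # zbar_n of Eq. (3)
        return zbar

    def objective(w, zbar, lam):
        """l1-penalized logistic loss of Eq. (7) at a nonnegative w."""
        margins = zbar @ w                        # rho_bar_n(w) = w^T zbar_n
        return np.logaddexp(0.0, -margins).sum() + lam * w.sum()

Because w is nonnegative, the ℓ1 penalty reduces to λ times the sum of the weights, which is what the last line uses.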
The entries of H(x^+) are given by

h_ij = [ 4 x_i x_j ∂²f/(∂y_i ∂y_j) + 2 (∂f/∂y_i) δ(i, j) ] evaluated at (x_i^+, x_j^+),  1 ≤ i ≤ J, 1 ≤ j ≤ J,

and R is a positive number. Also, our initial assumption that H(x^+) ⪰ 0 is equivalent to A ⪰ 0, that is, ∂f/∂y_1^+ ≥ 0, ..., ∂f/∂y_M^+ ≥ 0. It follows that f(y^*) − f(y^+) ≥ 0, where equality holds when y^* = y^+. This contradicts the initial assumption that x^* is a global minimizer of g(x) and y^* ≠ y^+. This finishes the proof that a given stationary point x^+ of g(x) is either a saddle point, if H(x^+) is not positive semidefinite, or a global minimizer of g(x), if H(x^+) ⪰ 0.

Next, we prove that if a stationary point x^+ is found via gradient descent with an initial point x_j^(0) ≠ 0, 1 ≤ j ≤ J, then x^+ is a global minimizer of g(x). Suppose that ∂g/∂x^* = 0 and x^* is a saddle point. Again, we assume that the first M elements of x^* belong to S_0, while the remaining J − M elements belong to S_{≠0}. There exists an element i ∈ S_0 such that ∂f/∂y_i^* < 0 (otherwise H(x^*) ⪰ 0 and x^* is a global minimizer). By continuity, there exists ξ > 0 such that ∂f/∂y_i < 0 for every y_i ∈ C = {y : |y − y_i^*| < ξ}. It follows that ∂g/∂x_i = 2x_i(∂f/∂y_i) < 0 for x_i = √(y_i), and ∂g/∂x_i > 0 for x_i = −√(y_i). That is, x_i = 0 is not reachable by a gradient descent method, given by x_i ← x_i − η(∂g/∂x_i), except when the component x_i^(0) of the initial point x^(0) is set to zero. Equivalently, the saddle point x^* is not reachable via gradient descent. This concludes the proof of the theorem.

For fixed z̄_n, the objective function (7) is a strictly convex function of w. Theorem 1 assures that, in each iteration, a globally optimal solution of w is reached via gradient descent. After the feature weighting vector is found, the pairwise distances among data samples are re-evaluated using the updated feature weights, and the probabilities P(x_i = NM(x_n)|w) and P(x_j = NH(x_n)|w) are re-computed using the newly obtained pairwise distances. The two steps are iterated until convergence. The implementation of the algorithm is very simple. It is coded in Matlab in less than one hundred lines. Except for the line search, no other built-in functions are used. The pseudo-code is presented in Algorithm 1.

Algorithm 1: Pseudo-code of the proposed feature selection algorithm
Input: Data D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1}, kernel width σ, regularization parameter λ, stop criterion θ
Output: Feature weights w
1  Initialization: set w^(0) = 1, t = 1;
2  repeat
3      Compute d(x_n, x_i | w^(t−1)), ∀ x_n, x_i ∈ D;
4      Compute P(x_i = NM(x_n) | w^(t−1)) and P(x_j = NH(x_n) | w^(t−1)) as in (4) and (5);
5      Solve for v through gradient descent using the update rule specified in (10);
6      w_j^(t) = v_j^2, 1 ≤ j ≤ J;
7      t = t + 1;
8  until ||w^(t) − w^(t−1)|| < θ;
9  w = w^(t).

In the following three subsections, we analyze the convergence and computational complexity of our algorithm, and present its extension to multiclass problems.
3.1 Convergence Analysis

We begin by studying the asymptotic behavior of the algorithm. If σ → +∞, then for every w ≥ 0 we have

lim_{σ→+∞} P(x_i = NM(x_n)|w) = 1/|M_n|,  ∀ i ∈ M_n,    (18)

since lim_{σ→+∞} k(d) = 1. On the other hand, if σ → 0, by assuming that for every x_n we have d(x_n, x_i|w) ≠ d(x_n, x_j|w) whenever i ≠ j, it follows that lim_{σ→0} P(x_i = NM(x_n)|w) = 1 if d(x_n, x_i|w) = min_{j∈M_n} d(x_n, x_j|w), and 0 otherwise¹. Similar asymptotic behavior holds for P(x_i = NH(x_n)|w). From the above analysis, it follows that for σ → +∞ the algorithm converges to a unique solution in one iteration, since P(x_i = NM(x_n)|w) and P(x_i = NH(x_n)|w) are constants for any initial feature weights. On the other hand, for σ → 0, the algorithm searches for only one nearest neighbor when computing margins, and we empirically find that the algorithm may not converge. This suggests that the convergence behavior and convergence rate of the algorithm are fully controlled by the kernel width, which is formally stated in the following theorem.

¹ In a degenerate case, it is possible that d(x_n, x_i|w) = d(x_n, x_j|w); this is, however, a zero-probability event provided that the patterns contain some random noise. For simplicity, the degenerate case is not considered in our analysis.

Theorem 2. For the feature selection algorithm defined in Alg. 1, there exists σ^* such that lim_{t→+∞} ||w^(t) − w^(t−1)|| = 0 whenever σ > σ^*. Moreover, for a fixed σ > σ^*, the algorithm converges to a unique solution for any nonnegative initial feature weights w^(0).

We use the Banach fixed point theorem to prove the convergence theorem. We first state the fixed point theorem without proof; the proof can be found, for example, in [31].

Definition 1. Let U be a subset of a normed space Z, and let || · || be a norm defined on Z. An operator T : U → Z is called a contraction operator if there exists a constant q ∈ [0, 1) such that ||T(x) − T(y)|| ≤ q||x − y|| for every x, y ∈ U; q is called the contraction number of T. An element of a normed space Z is called a fixed point of T : U → Z if T(x) = x.

Theorem 3 (Fixed Point Theorem). Let T be a contraction operator mapping a complete subset U of a normed space Z into itself. Then the sequence generated as x^(t+1) = T(x^(t)), t = 0, 1, 2, ..., with arbitrary x^(0) ∈ U, converges to the unique fixed point x^* of T. Moreover, the following estimation error bounds hold:

||x^(t) − x^*|| ≤ (q^t / (1 − q)) ||x^(1) − x^(0)||,  and  ||x^(t) − x^*|| ≤ (q / (1 − q)) ||x^(t) − x^(t−1)||.    (19)

Proof (of Theorem 2): The gist of the proof is to identify a contraction operator for the algorithm and make sure that the conditions of Theorem 3 are met. To this end, we define P = {p : p = [P(x_i = NM(x_n)|w), P(x_j = NH(x_n)|w)]} and W = {w : w ∈ R^J, ||w||_1 ≤ β, w ≥ 0}, and specify the first step of the algorithm in a functional form as T1 : W → P, where T1(w) = p, and the second step as T2 : P → W, where T2(p) = w. Then, the algorithm can be written as w^(t) = (T2 ∘ T1)(w^(t−1)) ≜ T(w^(t−1)), where (∘) denotes functional composition and T : W → W. Since W is a closed subset of the finite-dimensional normed space R^J (a Banach space) and thus complete [31], T is an operator mapping a complete subset W into itself. Next, note that for σ → +∞ the algorithm converges in one step. We have lim_{σ→+∞} ||T(w_1, σ) − T(w_2, σ)|| = 0 for any w_1, w_2 ∈ W. Therefore, in the limit, T is a contraction operator with contraction constant q = 0, that is, lim_{σ→+∞} q(σ) = 0. Hence, for every ε > 0 there exists σ^* such that q(σ) ≤ ε whenever σ > σ^*. By setting ε < 1, the resulting operator T is a contraction operator. By the Banach fixed point theorem, our algorithm converges to a unique fixed point provided the kernel width is properly selected. The above arguments establish the convergence theorem of the algorithm.

The theorem ensures the convergence of the algorithm if the kernel width is properly selected. This is a very loose condition, as our empirical results show that the algorithm always converges for a sufficiently large kernel width (see Fig. 5(b)). Also, the error bounds in (19) tell us that the smaller the contraction number, the tighter the error bound and hence the faster the convergence rate. Our experiments suggest that a larger kernel width yields faster convergence.

Unlike many other machine learning algorithms (e.g., neural networks), the convergence and the solution of our algorithm are not affected by the initial value if the kernel width is fixed. This property has a very important consequence: even if the initial feature weights were wrongly selected (e.g., investigators have no, or false, prior information) and the algorithm started by computing erroneous nearest misses and hits for each sample, the theorem assures that the algorithm will eventually converge to the same solution obtained when one has perfect prior knowledge. The correctness of the proof of Theorem 2 is experimentally verified in Sec. 4.1.

3.2 Computational Complexity and Fast Implementation

The algorithm consists of two main parts: computing pairwise distances between samples and solving the ℓ1 optimization problem, the computational complexities of which in each iteration are O(N²J) and O(NJ), respectively. Here, J is the feature dimensionality and N is the number of samples. When N is sufficiently large (say 100), most of the CPU time is spent on the second task (see Fig. 5(a)). The computational complexity of our algorithm is comparable to those of RELIEF [18] and Simba [19], which are known for their computational efficiency. A close look at the update equation of v, given by (10), allows us to further reduce the complexity. If some elements of v are very close to zero (say, less than 10^-4), the corresponding features can be eliminated from further consideration with a negligible impact on the subsequent iterations, thus providing a built-in mechanism for automatically removing irrelevant features during learning.

Our algorithm has linear complexity with respect to the number of features. In contrast, some popular greedy search methods (e.g., forward search) require on the order of O(J²) moves in the feature space [9]. However, when the sample size becomes excessively large, it can still be computationally intensive to run our algorithm. Considerable efforts have been made over the years to improve the computational efficiency of nearest-neighbor search algorithms [32]. It is possible to use similar techniques to reduce the number of distance evaluations actually performed in our algorithm, which will be our future work.

3.3 Feature Selection for Multiclass Problems

This section considers feature selection for multiclass problems. Some existing feature selection algorithms, originally designed for binary problems, can be naturally extended to multiclass settings, while for others the extension is not straightforward. For both embedded and wrapper methods, the extension largely depends on the capability of a classifier to handle multiclass problems [9]. In many cases, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method [13], [34], and then feature selection is performed for each binary problem. This strategy further increases the computational burden of embedded and wrapper methods. Our algorithm does not suffer from this problem. A natural extension of the margin defined in (2) to multiclass problems is [34]:

ρ_n(w) = min_{c∈Y, c≠y_n} d(x_n, NM^(c)(x_n)|w) − d(x_n, NH(x_n)|w)
       = min_{x_i ∈ D\D_{y_n}} d(x_n, x_i|w) − d(x_n, NH(x_n)|w),    (20)

where Y is the set of class labels, NM^(c)(x_n) is the nearest neighbor of x_n from class c, and D_c is the subset of D containing only the samples from class c. The derivation of our feature selection algorithm for multiclass problems by using the margin defined in (20) is straightforward.
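A small sketch of the multiclass margin of Eq. (20) (our illustration; names are ours): the nearest miss is searched over all samples whose label differs from y_n, and the nearest hit over the remaining samples sharing y_n.

    import numpy as np

    def multiclass_margin(X, y, w, n):
        """rho_n(w) of Eq. (20) for sample n, using the weighted Manhattan distance."""
        dist = np.abs(X - X[n]) @ w             # d(x_n, x_i | w) for every training sample
        same = (y == y[n]); same[n] = False     # candidates for the nearest hit
        other = (y != y[n])                     # D \ D_{y_n}: candidates for the nearest miss
        return np.min(dist[other]) - np.min(dist[same])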
4 EXPERIMENTS

We perform a large-scale experiment on eleven synthetic and real-world data sets to demonstrate the effectiveness of our algorithm. All experiments are run on a personal computer with 2GB RAM.

4.1 Spiral Problem

Fig. 2. The algorithm iteratively refines the estimates of the weight vector w and the probabilities P(x_i = NH(x_n)|w) and P(x_i = NM(x_n)|w) until convergence. The result is obtained on the spiral data with 10000 irrelevant features. Each sample is colored according to its probability of being the nearest miss or hit of a given sample indicated by a black cross. The plot shows that the algorithm converges to a perfect solution in just three iterations. (The figure is better viewed electronically.)

Fig. 4. Feature weights learned on the spiral data with one million features.

Fig. 5. (a) The CPU time it takes our algorithm to perform feature selection on the spiral data with different numbers of irrelevant features, ranging from 50 to 30000. The CPU time spent on solving the ℓ1 optimization problems is also reported (blue line). The plot demonstrates linear complexity with respect to the feature dimensionality. (b) Convergence analysis of our algorithm performed on the spiral data with 5000 irrelevant features, for λ = 1 and σ ∈ {0.01, 0.05, 0.5, 1, 10, 50}. The plots present θ = ||w^(t) − w^(t−1)||_2 as a function of the number of iteration steps.

... This result is very encouraging, and some theoretical analyses of the algorithm's sample complexity are presented in Sec. 5 that explain in part the algorithm's excellent performance.

Our algorithm is computationally very efficient. Fig. 5(a) shows the CPU time it takes the algorithm to perform feature selection on the spiral data with different numbers of irrelevant features. The stopping criterion in Alg. 1 is θ = 0.01. As can be seen from the figure, the algorithm runs for only 3.5s for the problem with 100 features, 37s for 1000 features, and 372s for 20000 features. The computational complexity is linear with respect to the feature dimensionality. It would be difficult for most wrapper methods to compete with ours in terms of computational complexity. Depending on the classifier used to search for relevant features, it may take several hours for a wrapper method to analyze the same dataset with 10000 features, and yet there is no guarantee that the optimal solution will be reached, due to the heuristic search. The CPU time spent on solving the ℓ1 optimization problems is also reported, and accounts for only about 2% of the total CPU time.

Fig. 5(b) presents the convergence analysis of our algorithm on the spiral data with 5000 irrelevant features ...
Fig. 6. Feature weights learned on the spiral data with 5000 irrelevant features, for a fixed kernel width σ = 2 and different regularization parameters λ ∈ {0.1, 0.5, 1, 1.5, 2}.

Fig. 7. Feature weights learned on the spiral data with 5000 irrelevant features, for a fixed regularization parameter λ = 1 and different kernel widths σ ∈ {0.1, 0.5, 1, 3, 5}.

TABLE 2. Classification errors and standard deviations (%) of SVM performed on the seven UCI data sets contaminated by 5000 useless features. FDR is defined as the ratio between the number of artificially added, irrelevant features identified by our algorithm as useful ones and the total number of irrelevant features (i.e., 5000). The last row summarizes the win/loss/tie of each algorithm when compared with ours at the 0.05 p-value level.

... Fig. 8. Except for diabetes, our algorithm removes nearly all of the irrelevant features. For comparison, the feature weights learned using the original data are also plotted. We observe that the weights learned in the two cases are very similar, which is consistent with the results on the spiral data and the theoretical results of Sec. 3.1.

(3) The CPU time of the six algorithms, averaged over the ten runs, is presented in Table 2. In terms of computational efficiency, the KS test performs the best, ours the second, I-RELIEF/BS, RFE and Simba the third, and AMS the least efficient. On average, it takes our algorithm less than half a minute to process 5000 features. It should be noted that the CPU time of AMS is obtained by using only 1000 features and that RFE is implemented in C; hence the comparison is somewhat in favor of AMS and RFE.

Above, we use classification errors as the main criterion to compare the different algorithms. Classification errors, however, may not tell the whole story, and some other criteria are worth mentioning. All five algorithms that we compare to our algorithm are feature weighting methods. In our experiment, the test data is used to estimate the minimum classification errors. In practical applications, without test data, one may have to use a classifier learned on the training data to determine the number of features that will be used in the test stage, which requires considerable effort on parameter tuning. In our method, the regularization parameter can be learned simultaneously within the learning process, without using any other classifiers. Our method also performs backward selection implicitly: when the weight of a feature is less than 10^-8, the feature is eliminated from further consideration. Unlike RFE and I-RELIEF/BS, where the number of discarded features in each iteration is pre-defined regardless of the values of their feature weights, in our algorithm when a feature should be removed, and how many of them should be removed, are automatically determined by the algorithm. On the theoretical side, ours has a solid theoretical foundation that will be presented in Sec. 5, whereas it is difficult to perform a similar theoretical analysis for some of the algorithms considered herein. In the next section, we conduct an experiment to demonstrate that ours is not only effective in eliminating noisy features but also redundant ones.

4.3 Experiments on Microarray Data

In this section, we demonstrate the effectiveness of our algorithm using three microarray data sets: breast cancer [1], prostate cancer [52], and diffuse large B-cell lymphoma (DLBCL) [53]. The detailed data information is summarized in Table 1. For all three data sets, the number of genes is significantly larger than the number of samples. Another major characteristic of microarray data, unlike the UCI data we considered in the previous section, is the presence of a significant number of redundant features (or co-regulated genes), since genes function in a modular fashion. It is well known that including redundant features may not improve, but sometimes deteriorates, classification performance [10]. From the clinical perspective, the examination of the expression levels of redundant genes may not improve clinical decisions but increases medical examination costs needlessly. Hence, our goal is to derive a gene signature with a minimum number of genes that achieves highly accurate prediction performance.

We have shown in the previous section that our algorithm significantly outperformed AMS, RFE and Simba in terms of computational efficiency and accuracy. In this experiment, we therefore only consider the KS test and I-RELIEF/BS, which are the two most competitive algorithms with respect to performance. For the results reported in this paper, the kernel width and the regularization parameter of our algorithm are set to 5 and 1, respectively. We empirically find that the algorithm yields nearly identical prediction performance for a wide range of parameter values. The same kernel width is also used in I-RELIEF/BS. We use KNN to estimate the performance of each algorithm. We do not spend additional effort to tune the value of K, but simply set it to 3. Due to the small sample size, the leave-one-out cross-validation (LOOCV) method is used: in each iteration, one sample is held out for testing and the remaining samples are used for training.
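The evaluation protocol just described (LOOCV with a 3-NN classifier in the learned feature space) can be sketched as follows. This is our own illustration: the paper does not prescribe an implementation, and scikit-learn, the helper name select_features, and the re-weighting of the test features are all assumptions made here for concreteness.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def loocv_error(X, y, select_features, k=3):
        """Leave-one-out error of a k-NN classifier in the weighted feature space."""
        N = len(y)
        errors = 0
        for n in range(N):
            train = np.arange(N) != n
            w = select_features(X[train], y[train])          # run feature selection on the training fold only
            Xw_train, Xw_test = X[train] * w, X[n:n + 1] * w  # scale (or implicitly select) features by w
            knn = KNeighborsClassifier(n_neighbors=k).fit(Xw_train, y[train])
            errors += int(knn.predict(Xw_test)[0] != y[n])
        return errors / N

Running the selection inside each fold keeps the held-out sample out of both the weight learning and the classifier training, which is what LOOCV requires.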
Fig. 8. Feature weights learned in one sample trial of the seven UCI data sets with and without 5000 irrelevant features. The dashed red line indicates the number of original features. The weights plotted on the left side of the dashed line are associated with the original features, while those on the right, with the additional 5000 irrelevant features. The feature weights learned in the two cases are very similar.

... gene subset. This is because both I-RELIEF/BS and the KS test are unable to eliminate redundant features. If a gene is top ranked, its co-regulated genes will also ... we plot in Fig. 9(a) the feature weight vectors learned ... the x-axis are arranged based on the p-values of a t-test ... the computational efficiency of our algorithm. Fig. 9(b) presents the CPU time it takes our feature selection algorithm to identify a gene signature for the breast cancer dataset with a varying number of genes, ranging from 500 to 24481. It takes only about 22 seconds to process all 24481 genes. If a filter method (e.g., a t-test) is first used to reduce the feature dimensionality to, say, 2000, as is almost always done in microarray data analysis, our algorithm runs for only about two seconds.

5 ANALYSIS OF SAMPLE COMPLEXITY

This section presents a theoretical study of our algorithm's sample complexity. The main purpose of the analysis is to explain why the proposed algorithm performs so well for high-dimensional data, as demonstrated in the previous section. As one can see from Eq. (8), the algorithm finds a feature weight vector aimed at minimizing an empirical logistic loss. Hence, it is a learning problem, and the analysis can be performed under the VC-theory framework. Specifically, we try to establish the dependence of the generalization performance of the proposed algorithm on the input data dimensionality. Our study suggests that the algorithm has a logarithmic sample complexity with respect to the input feature dimensionality. We should emphasize that for many existing feature selection algorithms (e.g., wrapper methods), due to heuristic search, it would be difficult to conduct such a theoretical analysis.

We begin by reviewing some basic concepts of statistical learning theory. Let {x, y} be a pair of observation and target value, sampled from a fixed but unknown joint distribution p(x, y). In the sequel, we absorb y into x for notational simplicity. Given a set of real-valued mapping functions F = {f(x|α) : α ∈ Ω}, parameterized by α, and a loss function L(f(x|α)), we are concerned with the problem of finding a parameter α that minimizes the expected loss R(α) = E[L(f(x|α))] = ∫ L(f(x|α)) p(x) dx. In real applications, the true distribution is rarely known, and one has access only to a limited number of observations X^N = {x_1, ..., x_N} independently drawn from the unknown distribution. A natural way to solve a learning problem is to find a parameter α that minimizes the empirical loss R(α, X^N) = (1/N) Σ_{n=1}^N L(f(x_n|α)). We are interested in how well a learning algorithm trained on a limited number of training data will perform on unseen data. This can be studied under the VC theory, which relies on the uniform convergence of the empirical loss to the expected loss. It has been proved in [36] that if the bound sup_{α∈Ω} |R(α, X^N) − R(α)| is tight, then the function that minimizes the empirical loss is likely to have an expected loss that is close to the best in the function class.

A theorem, due to [37], provides an upper bound on the rate of the uniform convergence of a class of functions in terms of its covering number. Before we present the theorem, we first define the concept of covering number. The interested reader may refer to [38] and [39] for a more comprehensive coverage of this subject.

Definition 2. Let F = {f(x|α) : x ∈ R^J, α ∈ Ω} be a set of real-valued functions. Given N arbitrary data samples X^N = {x_1, ..., x_N} ⊂ R^J, define F(X^N) = {f(X^N|α) = [f(x_1|α), ..., f(x_N|α)]^T : α ∈ Ω}. We say that a set V = {v_1, ..., v_K} ⊂ R^N ε-covers F(X^N) in the p-norm if for all α there exists v_k ∈ V such that ||f(X^N|α) − v_k||_p ≤ N^{1/p} ε, where || · ||_p is the p-norm. The covering number of F(X^N) in the p-norm, denoted N_p(F, ε, X^N), is the cardinality of the smallest set that ε-covers F(X^N). Define N_p(F, ε, N) = sup_{X^N} N_p(F, ε, X^N).

Theorem 4 (Pollard, 1984). For all ε > 0 and every distribution p(x), we have

P( sup_{α∈Ω} |R(α, X^N) − R(α)| > ε ) ≤ 8 E[ N_1(L, ε/8, X^N) ] exp( −Nε² / (128M²) ),    (21)

where M = sup_{α,x} L(α, x) − inf_{α,x} L(α, x), and N_1 is the 1-norm covering number of the function class L.

In general, it is very difficult to estimate the covering number of an arbitrary function class. Some general bounds on covering numbers exist. For example, the covering number of a closed ball of radius r centered at the origin in R^J is bounded by (4r/ε)^J [40], where ε is the radius of the disks covering the ball. These bounds are usually too loose to be useful. Fortunately, there exists a tight bound for the linear function class, due to [41], which can be used for estimating the covering number for our purposes. We slightly modify the theorem of [41] to include the nonnegativity constraint on the feature weights.

Theorem 5. Let F = {w^T x : w ≥ 0, ||w||_1 = a, x ∈ R^J, ||x||_∞ = b}. Then we have

log_2 N_2(F, ε, N) ≤ ⌈ a²b² / ε² ⌉ log_2(J + 1),    (22)

where ⌈x⌉ is the nearest integer of x towards infinity.

The proof of the theorem is similar to that in [41]. By using Theorems 4 and 5, we establish the following bounds for our algorithm with σ = +∞ and σ → 0. Note that, when the kernel width goes to zero, the algorithm finds only one nearest neighbor for each pattern when computing margins, as shown in Sec. 3.1.

Theorem 6. Let ||w||_1 ≤ β, w ≥ 0, x ∈ R^J, and ||x||_∞ ≤ 1. For the proposed algorithm, if σ = +∞, then for all ε > 0 and every distribution p(x), we have

P( sup_w |R(w, X^N) − R(w)| > ε ) ≤ 8 (J + 1)^{⌈256β²/ε²⌉} exp( −Nε² / (512(2β + 1)²) ).    (23)
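A quick way to read (23) as a sample-complexity statement (our rearrangement of the bound, not part of the original text): requiring the right-hand side to be at most a confidence level δ and solving for N gives

N ≥ (512(2β + 1)² / ε²) ( ⌈256β²/ε²⌉ ln(J + 1) + ln(8/δ) ),

so, for fixed β, ε and δ, the number of samples needed to keep the empirical and expected losses uniformly ε-close grows only as O(log J) with the feature dimensionality J.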
Using an infinite kernel width in our algorithm amounts to making a linear assumption about the data model. In this case, our algorithm has a generalization error bound similar to that of ℓ1-regularized logistic regression. The main difference is that our algorithm has an implicit LOO regularization; taking this regularization into account when deriving the error bound may lead to an even tighter bound.

A somewhat surprising result is that when the kernel width goes to zero our algorithm also has a logarithmic sample complexity. As discussed in Sec. 3, if we adopt a classification rule that classifies x by the sign of its margin ρ̄(x), our algorithm can be viewed as a one-nearest-neighbor (1NN) classifier. It is well known that 1NN performs poorly for high-dimensional data. [19] proves that when feature selection is performed, the generalization error bound of 1NN depends logarithmically on the input feature dimensionality, but polynomially on the number of selected features (Theorem 1 in [19]). While our generalization bound in Eq. (29) is consistent with the result of [19], note, however, that our bound does not depend on the number of selected features, but on the total size of the feature weights (i.e., ||w||_1, bounded by β; see also Eq. (8)). This result is consistent with the finding of [45] that the size of the weights is more important than the size of the neural network. We perform some experiments on using our algorithm for classification purposes (i.e., classifying x by the sign of its margin ρ̄(x)) on the UCI data sets contaminated by varying numbers of irrelevant features, ranging from 0 to 10000. Since the main focus of the paper is on feature selection, we report the detailed results in the supplementary data. As can be seen from Table 1S, the classification performance is largely insensitive to a growing number of irrelevant features. This result is very encouraging, and indicates that it is possible to develop a new classification algorithm with a logarithmic dependence on data dimensionality without making any assumption about the data distribution.
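One natural reading (ours, not the authors' code) of the classification rule just discussed is to assign a test point x to the class to which it is closer on average in the weighted feature space, using the same kernel-weighted averaging as in Eqs. (4)-(5); for binary labels this agrees with taking the sign of the expected margin ρ̄(x).

    import numpy as np

    def predict_by_expected_margin(x, X, y, w, sigma):
        """Assign x to the class with the smaller kernel-weighted expected distance (our interpretation)."""
        dist = np.abs(X - x) @ w                   # weighted Manhattan distances to the training samples
        k = np.exp(-dist / sigma)                  # exponential kernel weights, as in Eqs. (4)-(5)
        scores = {}
        for c in np.unique(y):
            mask = (y == c)
            scores[c] = np.sum(k[mask] * dist[mask]) / np.sum(k[mask])   # expected distance to class c
        return min(scores, key=scores.get)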
We here provide the sample complexity analysis only for two specific kernel widths. However, it is reasonable to expect that a relatively large kernel width should improve the generalization error bound of the algorithm over that derived when the kernel width goes to zero (consider the similar case in which KNN usually performs better than 1NN). The result presented in Fig. 7 shows that a large kernel width can indeed remove irrelevant features, and our empirical result obtained on the spiral dataset contaminated by up to one million irrelevant features suggests that the proposed algorithm has a logarithmic sample complexity.

It is worth noting that the above analysis for the general case with an arbitrary kernel width is very difficult. Indeed, after decades of research, the covering number and similar techniques (e.g., Rademacher complexity and the VC dimension) have been derived only for a small set of function classes. The problem of deriving generalization error bounds for arbitrary function classes is still largely open in the machine learning community, and we do not expect that this paper can solve this open problem. However, the experimentally verified high accuracy, computational efficiency, and ease of implementation of the proposed algorithm justify its presentation to the broader community at this stage of our theoretical development.

6 DISCUSSION

We conclude this paper by tracing back the origins of this work, comparing our algorithm with some related feature selection/weighting approaches, and summarizing the main contributions made in this paper.

Our approach is motivated to a great extent by the ideas implemented in the RELIEF algorithm [18]. RELIEF is considered one of the most successful feature weighting algorithms due to its simplicity and effectiveness [33]. It had long been regarded as a heuristic filter method until recently, when we mathematically proved that RELIEF is an online learning algorithm that solves a convex optimization problem aimed at maximizing the averaged margin [42]. One major problem with RELIEF is that the nearest neighbors of a given sample are predefined in the original feature space, which typically yields erroneous nearest hits and misses in the presence of copious irrelevant features. RELIEF-F [49] mitigates this problem by searching for multiple, instead of just one, nearest neighbors when computing margins. Empirical studies have shown that RELIEF-F achieves significant performance improvement over RELIEF [49]. However, the performance of both algorithms degrades significantly as the feature dimensionality increases. To address this problem, we recently proposed a new feature weighting algorithm referred to as I-RELIEF, which performs significantly better than RELIEF-F [42]. Similar to the algorithm proposed in this paper, I-RELIEF employs a probabilistic model to define the local information of a given sample. However, as with all other algorithms in the RELIEF family, the objective function optimized by I-RELIEF is not directly related to the classification performance of a learning algorithm. Moreover, I-RELIEF imposes an ℓ2 constraint on the feature weights, and thus is not able to remove redundant features or provide a sparse solution (see Fig. 9(a)); I-RELIEF also does not enjoy the same theoretical properties as our algorithm. This work is also motivated by the Simba algorithm recently proposed in [19]. Compared to RELIEF, Simba re-evaluates the distances according to the learned weight vector, and is thus superior to RELIEF. One major problem, however, is that the objective function optimized by Simba is characterized by many local minima, which can be mitigated by restarting the algorithm from several initial points [19]. Also, Simba represents a constrained nonlinear optimization problem that cannot be easily solved by conventional optimization techniques. We empirically find that Simba performs very well when the number of irrelevant features is small, but may fail completely when there exist a large
number of irrelevant features (see Table 2). One possible explanation is that the chance of Simba becoming stuck in local minima increases dramatically with the number of features.

Feature selection is closely related to distance metric learning (see, e.g., NCA [46], LMNN [47] and LFE [48]). These algorithms are also based on local learning and share the same goal as ours of reducing data dimensionality, but they differ completely in their algorithmic formulations. Moreover, these algorithms are all designed for feature extraction, and it is unclear whether they enjoy the same theoretical properties outlined in Sec. 5.

The proposed algorithm embraces some fundamental concepts in machine learning. It is related to SVM in the sense that both algorithms solve a nonlinear problem by first transforming it into a linear problem and then solving the linear one so that the margin is maximized. Unlike SVM, the linearization in our approach is achieved by local learning, instead of projecting the data onto a higher (possibly infinite-dimensional) space, based on the concept that a given complex problem can be more easily, yet accurately enough, analyzed by parsing it into a set of locally linear ones. Local learning allows one to capture the local structure of the data, while the parameter estimation is performed globally to avoid possible overfitting. The idea of "fit locally and think globally" is also used in the well-known locally linear embedding (LLE) algorithm, which approximates a complex nonlinear manifold using a set of locally linear patches [44]. LLE is an algorithm for dimensionality reduction in unsupervised learning settings, while our algorithm is for supervised learning. Another important difference between the two algorithms is that LLE is based on the assumption that nearby points in the high-dimensional space remain adjacent in the reduced low-dimensional space, which may not be true in the presence of copious irrelevant features, as shown in Fig. 2.

The main contribution of this paper is that we provide a principled way to perform feature selection for classification problems with complex data distributions and very high data dimensionality. The method avoids any heuristic combinatorial search, and hence can be implemented efficiently. Unlike many existing methods, ours has a solid theoretical foundation that ensures its performance. Moreover, its implementation and parameter tuning are easy, and the extension of the algorithm to multiclass settings is straightforward. We have experimentally demonstrated that our algorithm is already capable of handling many feature selection problems one may encounter in scientific research. Considering the increasing demand for analyzing data with a large number of features in many research fields, including bioinformatics, economics, and computer vision, we expect that the work presented in this paper will have a broad impact.

7 ACKNOWLEDGMENTS

The authors thank the associate editor, Dr. Olivier Chapelle, and the three anonymous reviewers for numerous suggestions that significantly improved the quality of the paper. This work is in part supported by the Susan Komen Breast Cancer Foundation under grant No. BCTR0707587.

REFERENCES

[1] L. van't Veer, H. Dai, M. van de Vijver, et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, pp. 530–536, 2002.
[2] Y. Wang, J. Klijn, Y. Zhang, et al., "Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer," Lancet, vol. 365, pp. 671–679, 2005.
[3] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[4] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proc. 13th Adv. Neu. Info. Proc. Sys., 2001, pp. 668–674.
[5] A. Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance," in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 78–86.
[6] A. Y. Ng and M. I. Jordan, "Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection," in Proc. 18th Int. Conf. Mach. Learn., 2001, pp. 377–384.
[7] J. Lafferty and L. Wasserman, "Challenges in statistical machine learning," Statistica Sinica, vol. 16, pp. 307–322, 2006.
[8] M. Hilario and A. Kalousis, "Approaches to dimensionality reduction in proteomic biomarker studies," Briefings in Bioinformatics, vol. 9, no. 2, pp. 102–118, 2008.
[9] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
[10] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artif. Intell., vol. 97, no. 1-2, pp. 273–324, 1997.
[11] P. Pudil and J. Novovicova, "Novel methods for subset selection with respect to problem knowledge," IEEE Intell. Syst., vol. 13, no. 2, pp. 66–74, March 1998.
[12] D. Koller and M. Sahami, "Toward optimal feature selection," in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 284–292.
[13] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," J. Artif. Intell. Res., vol. 2, pp. 263–286, 1995.
[14] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines," Mach. Learn., vol. 46, no. 1, pp. 131–159, 2002.
[15] T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff, "Embedded methods," in Feature Extraction, Foundations and Applications, I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Eds. Springer-Verlag, 2006, pp. 137–165.
[16] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, no. 1-3, pp. 389–422, 2002.
[17] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," in Proc. 16th Adv. Neu. Info. Proc. Sys., 2004.
[18] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. 9th Int. Conf. Mach. Learn., 1992, pp. 249–256.
[19] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Margin based feature selection - theory and algorithms," in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 43–50.
[20] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," Ann. Statist., vol. 26, no. 5, pp. 1651–1686, 1998.
[21] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc. Ser. B, vol. 39, no. 1, pp. 1–38, 1977.
[22] C. Atkeson, A. Moore, and S. Schaal, "Locally weighted learning," Artif. Intell. Rev., vol. 11, no. 1-5, pp. 11–73, 1997.
[23] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[24] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[25] M. Y. Park and T. Hastie, "ℓ1 regularization path algorithm for generalized linear models," J. R. Statist. Soc. B, vol. 69, no. 4, pp. 659–677, 2007.
[26] L. Meier, S. van de Geer, and P. Buhlmann, "The group lasso for logistic regression," J. R. Statist. Soc. B, vol. 70, pp. 53–71, 2008.
[27] V. Roth, "The generalized LASSO," IEEE Trans. Neu. Net., vol. 15, no. 1, pp. 16–28, 2004.
[28] S. Rosset, “Following curved regularized optimization solution paths,” in Proc. 17th Adv. Neu. Info. Proc. Sys., 2005, pp. 1153–1160.
[29] D. L. Donoho and M. Elad, “Optimally sparse representations in general nonorthogonal dictionaries by ℓ1 minimization,” Proc. Natl. Acad. Sci. U.S.A., vol. 100, no. 5, pp. 2197–2202, 2003.
[30] L. Breiman, “Better subset regression using the nonnegative garrote,” Technometrics, vol. 37, no. 4, pp. 373–384, 1995.
[31] R. Kress, Numerical Analysis. New York: Springer-Verlag, 1998.
[32] P. Zezula, G. Amato, V. Dohnal, and M. Batko, Similarity Search - The Metric Space Approach. Springer, 2006.
[33] T. G. Dietterich, “Machine learning research: four current directions,” AI Magazine, vol. 18, no. 4, pp. 97–136, 1997.
[34] Y. Sun, S. Todorovic, J. Li, and D. Wu, “Unifying error-correcting and output-code AdaBoost through the margin concept,” in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 872–879.
[35] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007.
[36] V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition. Nauka, Moscow (in Russian), 1974.
[37] D. Pollard, Convergence of Stochastic Processes. New York: Springer-Verlag, 1984.
[38] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[39] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[40] F. Cucker and S. Smale, “On the mathematical foundations of learning,” Bulletin Amer. Math. Soc., vol. 39, no. 1, pp. 1–49, 2002.
[41] T. Zhang, “Covering number bounds of certain regularized linear function classes,” J. Mach. Learn. Res., vol. 2, pp. 527–550, 2002.
[42] Y. Sun and J. Li, “Iterative RELIEF for feature weighting,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 913–920.
[43] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. New York: Springer-Verlag, 2003.
[44] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[45] P. L. Bartlett, “The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network,” IEEE Trans. Inform. Theory, vol. 44, no. 2, pp. 525–536, 1998.
[46] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood components analysis,” in Proc. 17th Adv. Neu. Info. Proc. Sys., 2005, pp. 513–520.
[47] K. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. 18th Adv. Neu. Info. Proc. Sys., 2006, pp. 1473–1480.
[48] Y. Sun and D. Wu, “A RELIEF based feature extraction algorithm,” in Proc. 8th SIAM Intl. Conf. Data Mining, 2008, pp. 188–195.
[49] I. Kononenko, “Estimating attributes: analysis and extensions of RELIEF,” in Proc. Eur. Conf. Mach. Learn., 1994, pp. 171–182.
[50] Y. Sun, S. Goodison, J. Li, L. Liu, and W. Farmerie, “Improved breast cancer prognosis through the combination of clinical and genetic markers,” Bioinformatics, vol. 23, no. 1, pp. 30–37, 2007.
[51] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[52] A. J. Stephenson, A. Smith, M. W. Kattan, J. Satagopan, V. E. Reuter, P. T. Scardino, and W. L. Gerald, “Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy,” Cancer, vol. 104, no. 2, pp. 290–298, 2005.
[53] M. A. Shipp, K. N. Ross, P. Tamayo, et al., “Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning,” Nat. Med., vol. 8, pp. 68–74, 2002.

Yijun Sun received two B.S. degrees, in electrical and mechanical engineering, from Shanghai Jiao Tong University, Shanghai, China, in 1995, and the M.S. and Ph.D. degrees in electrical engineering from the University of Florida, Gainesville, USA, in 2003 and 2004, respectively. Currently, he is a research scientist at the Interdisciplinary Center for Biotechnology Research and an affiliated faculty member at the Department of Electrical and Computer Engineering at the University of Florida. His research interests are mainly in machine learning, bioinformatics, and their applications to cancer study and microbial community analysis.

Sinisa Todorovic received his Ph.D. degree in electrical and computer engineering from the University of Florida in 2005. He was a Postdoctoral Research Associate in the Beckman Institute at the University of Illinois Urbana-Champaign from 2005 to 2008. Currently, Dr. Todorovic is an Assistant Professor in the School of EECS at Oregon State University. His research interests include computer vision and machine learning, with a focus on object/activity recognition and texture analysis. His synergistic activities include: Associate Editor of the Image and Vision Computing Journal, Program Chair of the 1st International Workshop on Stochastic Image Grammars 2009, and reviewer for all major journals and conferences in computer vision. He was awarded the Jack Neubauer Best Paper Award by the IEEE Vehicular Technology Society in 2004, and the Outstanding Reviewer award by the 11th IEEE International Conference on Computer Vision in 2007.

Steve Goodison obtained a BSc degree in Biochemistry from the University of Wales in 1989. He went on to receive a PhD in molecular biology from Oxford University in 1993 as a Wellcome Trust Scholar. Postdoctoral studies at Oxford University included diabetes research as a Wellcome Trust fellow and cancer research in the Dept. of Clinical Biochemistry. He joined the University of California, San Diego as an Assistant Professor in 2001, and is currently an Associate Professor at the University of Florida. His research interests are primarily in cancer research, and include molecular pathology, the function of specific genes in the metastatic spread of cancer, and biomarker discovery and development for clinical diagnosis and prognosis.