Local Feature Selection For Data Classification
Abstract—Typical feature selection methods choose an optimal global feature subset that is applied over all regions of the sample
space. In contrast, in this paper we propose a novel localized feature selection (LFS) approach whereby each region of the sample
space is associated with its own distinct optimized feature set, which may vary both in membership and size across the sample space.
This allows the feature set to optimally adapt to local variations in the sample space. An associated method for measuring the
similarities of a query datum to each of the respective classes is also proposed. The proposed method makes no assumptions about
the underlying structure of the samples; hence the method is insensitive to the distribution of the data over the sample space. The
method is efficiently formulated as a linear programming optimization problem. Furthermore, we demonstrate that the method is robust against the over-fitting problem. Experimental results on eleven synthetic and real-world data sets demonstrate the viability of the formulation and the effectiveness of the proposed algorithm. In addition, we show several examples where localized feature selection produces better results than a global feature selection method.
1 INTRODUCTION
Fig. 1. Block diagram of the proposed method, where each training sample $x^{(i)}$, $i = 1, \ldots, N$, is considered to be a representative point of its neighboring region and an optimal feature set (possibly different in size and membership) is selected for that region. Feature sets of all representative points are used for classification of a query datum $x_q$. The details of the feature selection and classification are presented in Sections 3.1 and 3.2, respectively.
Because the optimal feature set is no longer constant over the sample space, ordinary classifiers are no longer appropriate for the proposed method. We therefore propose a localized classification procedure that has been adapted for our purposes. We refer to the proposed algorithm as the Localized Feature Selection (LFS) method.

The LFS method has several advantages. First, we make no assumptions regarding the distribution of the data over the sample space. The proposed approach therefore allows us to handle variations of the samples in the same class over the sample space, and to accommodate irregular or disjoint sample distributions. Moreover, we show later that the performance of the LFS method is robust against the overfitting problem. The proposed method also has the advantage that the underlying optimization problem is formulated as a linear programming optimization problem. Furthermore, the feature selection processes for different regions of the sample space are independent of each other and can therefore be performed in parallel. The computer implementation of the method can therefore be fast and efficient.

An overview of the proposed method is shown in Fig. 1. An early version of this paper appeared in [27].

The remaining portion of this paper is organized as follows: Section 2 briefly reviews recent feature selection algorithms. Details of the proposed method for local feature selection and classification are presented in Section 3. In Section 4, experimental results, which demonstrate the performance of the proposed method over a range of synthetic and real-world data sets, are presented. Conclusions are drawn in Section 5.

2 RELATED WORK

Feature selection has been an active research area in past decades. In this section we briefly review some of the main ideas of various feature selection approaches for data classification.

Some of the conventional feature selection approaches assign a common discriminative feature set to the whole sample space without considering the local behavior of data in different regions of the feature space [7], [9], [28], [29]. For example, in [7] a common feature set is selected using a minimal-redundancy maximal-relevance criterion, which is based on mutual information. In [9] a common discriminative feature set is selected through maximizing a class separability criterion in a high-dimensional kernel space. In [29] a common feature set is computed using an evolutionary method, which is a combination of a differential evolution optimization method and a repair mechanism based on feature distribution measures. One conventional feature selection approach which seems to be close to the proposed LFS algorithm is feature selection using the Fisher criterion (FDA) [12], which computes a score for each feature based on maximizing between-class distances and minimizing within-class distances in the data space spanned by the corresponding feature (a minimal sketch of this scoring rule is given at the end of this section). The main drawback of this algorithm, besides ignoring the local behavior of the samples, is that it considers features independently, leading to a sub-optimal subset of features.

On the other hand, several approaches exist that try to improve classification accuracy by local investigation of the feature space. One such approach is the family of margin-based feature selection methods [23], [24], [25], [30], [31], [32], [33]. These methods are instance-based, where each feature is weighted to achieve maximal margin. The "margin" of a data point is defined as the distance between the nearest same-labeled data point (near-hit) and the nearest differently labeled data point (near-miss). RELIEF [25] detects those features which are statistically relevant to the target. One drawback of RELIEF is that the margins are not reevaluated during the learning process. Compared to RELIEF, the Simba algorithm [24] reevaluates margins based on the learned weight vector of features. However, since its objective function is non-convex, it is characterized by many local minima. Recently, a local margin-based feature selection method was presented in [23], which uses local learning to decompose a complex nonlinear problem into a set of locally linear problems. In [26] local information is embedded in feature selection through combining instance-based and model-based learning methods. Although all these approaches use local information to determine an optimal feature set, the selected feature set is still forced to model the entire sample space.
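For reference, the Fisher (FDA) score mentioned above can be computed per feature as in the following minimal Python sketch. It is a generic rendering of the standard criterion, not code from [12]; the array names and the small regularization constant are assumptions.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score per feature: between-class scatter of the feature means
    divided by the pooled within-class variance (higher is better)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += Xc.shape[0] * (Xc.mean(axis=0) - overall_mean) ** 2
        den += Xc.shape[0] * Xc.var(axis=0)
    return num / (den + 1e-12)   # small constant guards constant features
```

Features are then ranked by this score independently of one another, which is exactly the limitation noted above: the criterion is global and ignores interactions between features.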
3 PROPOSED METHOD

The proposed method is presented in two parts: feature selection and class similarity measurement. In the former, a discriminative subset of features is selected for each of the sample space regions. In the latter, a localized classifier structure for measuring the similarity of a query datum to a specific class is presented. The overfitting issue with regard to the proposed algorithm is discussed in Section 3.3.

3.1 Feature Selection

Assume that we encounter a classification problem with $N$ training samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N} \subset \mathbb{R}^M \times \mathcal{Y}$, where $\mathcal{Y} = \{Y_1, \ldots, Y_c\}$ is the set of class labels, $x^{(i)}$ is the $i$th training sample containing $M$ features, and $y^{(i)} \in \mathcal{Y}$ is its corresponding class label.

To implement the proposed localized feature selection scheme, we consider each training sample $x^{(i)}$ to be a representative point for its neighboring region and assign an $M$-dimensional indicator vector $f^{(i)} \in \{0,1\}^M$ to $x^{(i)}$ that indicates which features are optimal for local separation of classes. If the element $f_m^{(i)} = 1$, then the $m$th feature is selected for the $i$th sample; otherwise it is not. The optimal indicator vector $f^{(i)}$ is computed such that, in its respective subspace, the neighboring samples with class label similar to $y^{(i)}$ cluster as closely as possible around $x^{(i)}$, whereas samples with differing class labels are as far away as possible. No assumptions are made that require the classes to be unimodal, nor on the probability distribution of the samples. In this work, Euclidean distance is used as the distance measure.

The following will present the process of calculating $f^{(i)}$ corresponding to the representative point $x^{(i)}$.

3.1.1 Initial Formulation

Assume that $x_p^{(k,i)}$ is the projection of an arbitrary training sample $x^{(k)}$ into the subspace defined by $f^{(i)}$ as follows:

$$x_p^{(k,i)} = x^{(k)} \odot f^{(i)}, \quad k = 1, \ldots, N, \qquad (1)$$

where $\odot$ is the element-wise product. In the sequel, projection into the space defined by $f^{(i)}$ is implied, so dependence on $i$ in $x_p^{(k,i)}$ is suppressed.

We want to encourage clustering behaviour; i.e., in the neighborhood of $x_p^{(i)}$, we want to find an optimal feature subset $f^{(i)}$ so that, in the corresponding local co-ordinate system, we satisfy the following two goals:

1) neighboring samples of the same class are closely situated around $x_p^{(i)}$, and simultaneously,
2) neighboring samples with different classes are further removed from $x_p^{(i)}$.

To realize these goals, we define $N-1$ objective functions, which are weighted distances of all within- and between-class samples, to be respectively minimized and maximized as in (2),

$$\min_{f^{(i)}} \; w_j^{(i)} \big\| x_p^{(i)} - x_p^{(j)} \big\|_2, \quad j \in y^{(i)},\; j \neq i$$
$$\max_{f^{(i)}} \; w_j^{(i)} \big\| x_p^{(i)} - x_p^{(j)} \big\|_2, \quad j \notin y^{(i)}, \qquad (2)$$

where $y^{(i)}$ is the set of all training samples with class label similar to $y^{(i)}$. The quantity $w_j^{(i)}$ is the weight of the corresponding distance where, in order to concentrate on neighboring samples and reduce the effect of remote samples on the objective functions, higher weights are assigned to the closer samples of $x_p^{(i)}$. Weights decrease exponentially with increasing distance from $x_p^{(i)}$. However, measuring sample distances from $x_p^{(i)}$ is a challenging issue since these distances should be measured in the local co-ordinate system defined by $f^{(i)}$, which is unknown at the problem outset. To overcome this issue, we use an iterative approach for computing $f^{(i)}$, where at each iteration weights are determined based on the distances in the co-ordinate system defined at the previous iteration. The following discussion assumes the weights have been determined in this manner. Further discussion on the computation of the weights is given in Section 3.1.4.

There are constraints that must be considered in our optimization formulations. Since we are looking for an indicator vector $f^{(i)} = (f_1^{(i)}, f_2^{(i)}, \ldots, f_M^{(i)})^T$, the problem variables $f_m^{(i)}$, $m = 1, \ldots, M$, are restricted to 0 and 1, where $(\cdot)^T$ is the transpose operator. Because there must be at least one active feature in $f^{(i)}$, the null binary vector must be excluded, i.e., $1 \leq \mathbf{1}^T f^{(i)}$, where $\mathbf{1}$ is an $M$-dimensional vector with all elements equal to 1. Furthermore, we would like to limit the maximum number of active features to a user-settable value $\alpha$, i.e., $\mathbf{1}^T f^{(i)} \leq \alpha$, where $\alpha$ must be an integer number between 1 and $M$. Therefore, the feature selection problem for the neighboring region of $x^{(i)}$ can be written as follows:

$$\min_{f^{(i)}} \; w_j^{(i)} \big\| x_p^{(i)} - x_p^{(j)} \big\|_2, \quad j \in y^{(i)},\; j \neq i$$
$$\max_{f^{(i)}} \; w_j^{(i)} \big\| x_p^{(i)} - x_p^{(j)} \big\|_2, \quad j \notin y^{(i)}$$
$$\text{s.t.} \quad f_m^{(i)} \in \{0,1\},\; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T f^{(i)} \leq \alpha, \qquad (3)$$

where the notation $\{\,\}$ is used to indicate a discrete set, whereas the notation $[\,]$ is used later to indicate a continuous interval.

In the next section, the above optimization problem is reformulated into an efficient linear programming optimization problem.
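To make the localized objective concrete, the following minimal Python sketch evaluates, for one representative point $x^{(i)}$ and a candidate binary indicator vector, the weighted within-class and between-class distances that (2) and (3) trade off. It is an illustration under assumed inputs (the names X, y, f_i, w_i and the helper itself are hypothetical), not the authors' implementation.

```python
import numpy as np

def local_objectives(X, y, i, f_i, w_i):
    """Weighted within/between-class distances around x^(i) in the
    subspace selected by the binary indicator f_i (cf. (1)-(3)).

    X   : (N, M) training samples
    y   : (N,)   class labels
    i   : index of the representative point x^(i)
    f_i : (M,)   binary indicator vector (1 = feature selected)
    w_i : (N,)   weights w_j^(i); the entry w_i[i] is ignored
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Projection of every sample into the subspace defined by f_i, eq. (1).
    Xp = X * f_i
    d = np.linalg.norm(Xp - Xp[i], axis=1)     # distances to x_p^(i)

    same = (y == y[i])
    same[i] = False                            # exclude j = i
    diff = (y != y[i])

    intra = np.sum(w_i[same] * d[same])        # to be minimized
    inter = np.sum(w_i[diff] * d[diff])        # to be maximized
    return intra, inter
```

With uniform weights (w_i = np.ones(N)) this reduces to plain within/between-class distance sums; the exponential weighting of Section 3.1.4 then concentrates both sums on the neighborhood of $x^{(i)}$.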
3.1.2 Problem Reformulation

To obtain a well-behaved optimization problem, in the following we use the squared Euclidean distance instead of the Euclidean distance itself. It is apparent that the optimal solution of (3) is invariant to this replacement. Considering the sample projection definition in (1) and the fact that the problem variables $f_m^{(i)}$, $m = 1, \ldots, M$, are binary, each objective function of (3) can be simplified as follows:

$$w_j^{(i)} \big\| x_p^{(i)} - x_p^{(j)} \big\|_2^2 = w_j^{(i)} \big\| (x^{(i)} - x^{(j)}) \odot f^{(i)} \big\|_2^2 = w_j^{(i)} \sum_{m=1}^{M} d_{j,m}^{(i)2} f_m^{(i)2} = w_j^{(i)} \sum_{m=1}^{M} f_m^{(i)} d_{j,m}^{(i)2} = w_j^{(i)} D_j^{(i)T} f^{(i)}, \qquad (4)$$

where $D_j^{(i)} = (d_{j,1}^{(i)2}, d_{j,2}^{(i)2}, \ldots, d_{j,M}^{(i)2})^T$ and $d_j^{(i)} \triangleq x^{(i)} - x^{(j)}$. The term $f_m^{(i)2}$ in the second line is replaced with $f_m^{(i)}$ due to the first constraint in (3). The important conclusion drawn is that the objective functions are linear in terms of the problem variables.

Using the summation of all weighted within-class distances and all weighted between-class distances in the sub-feature space defined by $f^{(i)}$, we define the total intra-class distance and the total inter-class distance as in (5). The problem is then reformulated by simultaneously minimizing the former and maximizing the latter,

$$\text{total intra-class distance:} \quad \sum_{j \in y^{(i)}} w_j^{(i)} D_j^{(i)T} f^{(i)} \triangleq a^{(i)T} f^{(i)}$$
$$\text{total inter-class distance:} \quad \sum_{j \notin y^{(i)}} w_j^{(i)} D_j^{(i)T} f^{(i)} \triangleq b^{(i)T} f^{(i)}. \qquad (5)$$

We see that (3) is in the form of an integer program, which is known to be computationally intractable [34]. However, this issue is readily addressed through the use of a standard and widely accepted approximation of an integer programming problem [34], [35], [36]. Here, we replace (relax) the binary constraint in (3) with the linear inequalities $0 \leq f_m^{(i)} \leq 1$, $m = 1, \ldots, M$. This procedure restores the computational efficiency of the program. A randomized rounding procedure (to be discussed further), which maps the linear solution back onto a suitable point on the binary grid, then follows.

These reformulations result in (6), which is a multi-objective optimization problem consisting of two linear objective functions that are to be simultaneously minimized and maximized, along with $2M + 2$ linear constraints,

$$\min_{f^{(i)}} \; a^{(i)T} f^{(i)}, \qquad \max_{f^{(i)}} \; b^{(i)T} f^{(i)}$$
$$\text{s.t.} \quad f_m^{(i)} \in [0,1],\; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T f^{(i)} \leq \alpha. \qquad (6)$$

There are several ways to re-configure a multi-objective problem into a standard form [34], [37], [38] with a single objective function, e.g., a linear combination of the objective functions. In the multi-objective case, the concept of optimality is replaced with Pareto optimality. A Pareto optimal solution is one in which an improvement in one objective requires a degradation of another. Since our multi-objective optimization problem is convex (because both the objective functions and the constraints defined in (6) are convex), the set of achievable objectives $\mathcal{L}$ is also convex. The solution to a multi-objective optimization problem is not unique and consists of the set of all Pareto optimal points that are on the boundary of the convex set $\mathcal{L}$. Different points in the set correspond to different weightings between the two objective functions. The set of Pareto points is unique and independent of the methodology by which the two functions are weighted (for more detail about the Pareto optimal approach see [34]). In this paper, we use the $\epsilon$-constraint method as described by (7), such that instead of maximizing the total inter-class distance, we force it to be greater than some constant $\epsilon^{(i)}$. In this way we can map out the entire Pareto optimal set by varying a single parameter, $\epsilon^{(i)}$. One advantage of this approach is that we can guarantee the combined inter-class distances are in excess of the value of the parameter $\epsilon^{(i)}$,

$$\min_{f^{(i)}} \; a^{(i)T} f^{(i)}$$
$$\text{s.t.} \quad f_m^{(i)} \in [0,1],\; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T f^{(i)} \leq \alpha; \qquad b^{(i)T} f^{(i)} \geq \epsilon^{(i)}. \qquad (7)$$

The parameter $\epsilon^{(i)}$ must be determined such that the optimization problem defined in (7) is feasible. In the next section we present an approach to automatically determine a value of the parameter $\epsilon^{(i)}$ which guarantees that the feasible set is not empty.

3.1.3 Problem Feasibility

The optimization problem defined in (7) is feasible if there is at least one point that satisfies its constraints. The constraints $f_m^{(i)} \in [0,1]$, $m = 1, \ldots, M$, indicate that the optimum point must be inside a unit hyper-cube. The constraints $1 \leq \mathbf{1}^T f^{(i)} \leq \alpha$ indicate that the optimum point must be within the space between two parallel hyper-planes defined by $\mathbf{1}^T f^{(i)} = 1$ and $\mathbf{1}^T f^{(i)} = \alpha$. Since $\alpha$ is an integer number greater than or equal to 1, the space bounded by these two parallel hyper-planes is always non-empty and its intersection with the unit hyper-cube is also non-empty. In fact, the intersection of the spaces defined by $f_m^{(i)} \in [0,1]$, $m = 1, \ldots, M$, and $1 \leq \mathbf{1}^T f^{(i)} \leq \alpha$ is a polyhedron $P$ that can be seen as a unit cube from which two parts are removed; the first part is the intersection between the half-space $\mathbf{1}^T f^{(i)} < 1$ and the unit hyper-cube, and the second is the intersection between the half-space $\mathbf{1}^T f^{(i)} > \alpha$ and the unit hyper-cube (see Fig. 2). If the intersection between the polyhedron $P$ and the half-space defined by $b^{(i)T} f^{(i)} \geq \epsilon^{(i)}$, i.e., the last constraint, is non-empty, then the optimization problem is feasible.

Fig. 2. The polyhedron $P$ in the case of a 3-D original feature space, i.e., the data dimension $M$ is 3, where $\alpha$ is set to 2. It is a unit cube (defined by $0 \leq f_m^{(i)} \leq 1$, $m = 1, \ldots, 3$) from which two regions, i.e., the blue and red pyramids, are removed. The blue pyramid is the intersection between the unit cube and the half-space $\mathbf{1}^T f^{(i)} < 1$, and the red pyramid is the intersection between the half-space $\mathbf{1}^T f^{(i)} > \alpha$ and the unit cube.

The maximum value $\epsilon_{\max}^{(i)}$ that $\epsilon^{(i)}$ can take such that the intersection remains non-empty is the solution to the following feasibility LP problem:

$$\epsilon_{\max}^{(i)} = \max_{f^{(i)}} \; b^{(i)T} f^{(i)}$$
$$\text{s.t.} \quad f_m^{(i)} \in [0,1],\; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T f^{(i)} \leq \alpha. \qquad (8)$$

Effectively, (8) corresponds to an extreme Pareto point where the weighting given to the intra-class distance term (the first objective in (6)) is zero. Finally, we set $\epsilon^{(i)} = \beta\,\epsilon_{\max}^{(i)}$, where $\beta$ lies between zero and one. In this way, the optimization problem is always feasible, and by changing $\beta$ we can map out the entire Pareto optimal set corresponding to different relative weightings of intra- versus inter-class distances. Here we define the Pareto optimal point corresponding to a specific value of $\beta$ as $f_\beta^{(i)}$; furthermore, we define the set $\{f_\beta^{(i)}\}_{\beta \in [0,1]}$ as the complete Pareto optimal set. The final reformulation of the problem may therefore be expressed as:

$$\min_{f_\beta^{(i)}} \; a^{(i)T} f_\beta^{(i)}$$
$$\text{s.t.} \quad f_{m,\beta}^{(i)} \in [0,1],\; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T f_\beta^{(i)} \leq \alpha; \qquad b^{(i)T} f_\beta^{(i)} \geq \beta\,\epsilon_{\max}^{(i)}, \qquad (9)$$

where $f_\beta^{(i)} = (f_{1,\beta}^{(i)}, f_{2,\beta}^{(i)}, \ldots, f_{M,\beta}^{(i)})^T$. This formulation has the desirable form of a linear program and hence is convex.

The solution to (9) provides a solution for each element of $f_\beta^{(i)}$ over the continuous range $[0,1]$ that may be considered close to the corresponding binary Pareto optimal solution $\breve{f}_\beta^{(i)}$. To obtain $\breve{f}_\beta^{(i)}$, a randomized rounding process [34], [35], [36] is applied to the optimal point of (9), i.e., $f_\beta^{(i)}$, where $\breve{f}_{m,\beta}^{(i)}$ is set to one with probability $f_{m,\beta}^{(i)}$ and is set to zero with probability $(1 - f_{m,\beta}^{(i)})$ for $m = 1, \ldots, M$. To explore the entire region surrounding the Pareto optimal $f_\beta^{(i)}$, the randomized rounding process is repeated a number of times; among the resulting binary vectors, one that satisfies the constraints of (9) is retained as $\breve{f}_\beta^{(i)}$.
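As a concrete illustration of (8), (9) and the rounding step, the sketch below solves the two linear programs for a single representative point with scipy.optimize.linprog. It is a minimal reading of the formulation under assumed inputs (the cost vectors a_i and b_i are taken as given, e.g., computed from (5)); it is not the authors' implementation, and the function and parameter names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def solve_region_lp(a_i, b_i, alpha, beta, n_rounds=100, rng=None):
    """Solve the epsilon-constraint LP (9) for one representative point
    and round the relaxed solution to a feasible binary vector.

    a_i, b_i : (M,) total intra-/inter-class cost vectors from (5)
    alpha    : maximum number of active features
    beta     : Pareto parameter in [0, 1]
    """
    rng = np.random.default_rng() if rng is None else rng
    M = a_i.size
    ones = np.ones(M)

    # Shared constraints: 1 <= 1^T f <= alpha, with 0 <= f_m <= 1 as bounds.
    A_ub = np.vstack([-ones, ones])            # -1^T f <= -1 ,  1^T f <= alpha
    b_ub = np.array([-1.0, float(alpha)])
    bounds = [(0.0, 1.0)] * M

    # Feasibility LP (8): eps_max = max b_i^T f (linprog minimizes, so negate).
    res8 = linprog(-b_i, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    eps_max = -res8.fun
    eps = beta * eps_max

    # Epsilon-constraint LP (9): min a_i^T f  s.t.  b_i^T f >= eps.
    A_ub9 = np.vstack([A_ub, -b_i])            # -b_i^T f <= -eps
    b_ub9 = np.append(b_ub, -eps)
    res9 = linprog(a_i, A_ub=A_ub9, b_ub=b_ub9, bounds=bounds)
    f_relaxed = res9.x

    # Randomized rounding: sample binary vectors with P(f_m = 1) = f_relaxed[m]
    # and keep one that satisfies the constraints of (9).
    for _ in range(n_rounds):
        f_bin = (rng.random(M) < f_relaxed).astype(float)
        if 1 <= f_bin.sum() <= alpha and b_i @ f_bin >= eps:
            return f_relaxed, f_bin
    return f_relaxed, None    # no feasible rounded vector found within n_rounds
```

In Algorithm 1 a routine of this kind would be called once per representative point and per candidate value of $\beta$.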
3.1.4 Weight Definition

In order to compute the sub-feature set $\breve{f}^{(i)}$ corresponding to the representative point $x^{(i)}$, the proposed method focuses on the neighboring samples by assigning higher weights to them. However, the computation of the weights is dependent on the co-ordinate system, which is defined by $\breve{f}^{(i)}$, which is unknown at the problem outset. To overcome this problem, we use an iterative approach. At each iteration, the weights $w_j^{(i)}$, $j = 1, \ldots, N$, $j \neq i$ (see (3)), are computed using the previous estimates of $\breve{f}^{(i)}$, $i = 1, \ldots, N$. Initially, the weights are all assigned uniform values. Empirically, if two samples are close to each other in one space, they are also close in most of the other sub-spaces. Therefore we define $w_j^{(i)}$, using the distance between $x^{(i)}$ and $x^{(j)}$ in all $N$ subspaces obtained from the previous iteration, in the following manner:

$$w_j^{(i)} = \exp\!\left( -\frac{1}{N} \sum_{k=1}^{N} \big( d_{ij|k} - d_{ij|k}^{\min} \big) \right)$$
$$d_{ij|k} = \big\| (x^{(i)} - x^{(j)}) \odot \breve{f}^{(k)} \big\|_2$$
$$d_{ij|k}^{\min} = \begin{cases} \min_{v \in y^{(i)}} d_{iv|k}, & \text{if } y^{(j)} = y^{(i)} \\ \min_{v \notin y^{(i)}} d_{iv|k}, & \text{if } y^{(j)} \neq y^{(i)}, \end{cases} \qquad (10)$$

where $\breve{f}^{(k)}$, $k = 1, \ldots, N$, are known from the previous iteration. Such a definition implies that all the $w_j^{(i)}$ are normalized over $[0,1]$.

The pseudo code of the proposed feature selection method is presented in Algorithm 1, where the parameter $t$ is the number of iterations.
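The weight update in (10) can be written compactly as in the sketch below. This is an illustrative rendering under assumed array names (X, y, and F_prev holding the $N$ binary indicator vectors from the previous iteration), not the authors' code; it also assumes each class contains at least two samples.

```python
import numpy as np

def update_weights(X, y, i, F_prev):
    """Weights w_j^(i) of eq. (10) for representative point x^(i).

    X      : (N, M) training samples
    y      : (N,)   class labels
    F_prev : (N, M) binary indicator vectors from the previous iteration
    Returns w : (N,) with w[i] unused (left at 0).
    """
    X, y, F_prev = np.asarray(X, float), np.asarray(y), np.asarray(F_prev, float)
    N = X.shape[0]
    diff = X[i] - X                                   # x^(i) - x^(j) for all j
    # d[j, k] = || (x^(i) - x^(j)) * f^(k) ||_2 for every pair (j, k)
    d = np.linalg.norm(diff[:, None, :] * F_prev[None, :, :], axis=2)

    same = (y == y[i])
    same[i] = False                                   # exclude v = i
    other = (y != y[i])
    d_min_same = d[same].min(axis=0)                  # per-k minimum over same-class v
    d_min_other = d[other].min(axis=0)                # per-k minimum over other-class v

    w = np.zeros(N)
    for j in range(N):
        if j == i:
            continue
        d_min = d_min_same if y[j] == y[i] else d_min_other
        w[j] = np.exp(-np.mean(d[j] - d_min))         # exp of minus the average gap
    return w
```

Because $d_{ij|k} \geq d_{ij|k}^{\min}$ for the relevant neighbours, the exponent is non-positive and the resulting weights fall in $(0, 1]$, decaying with distance as stated above.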
3.2 Class Similarity Measurement

A consequence of the localized feature selection approach is that, since there is no common set of features across the sample space, conventional classifiers are inappropriate. We now discuss how to build a classifier for the localized scenario. The proposed classifier structure is based on measuring the similarity of a query datum $x_q$ to a specific class using the optimal feature sets specified by $\{\breve{f}^{(i)}\}_{i=1}^{N}$.

The proposed method assumes that the sample space consists of $N$ (possibly overlapping) regions, where each region is characterized by its representative point $x^{(i)}$, its class label $y^{(i)}$ and its optimal feature set $\breve{f}^{(i)}$. We define each region to be a hyper-sphere $Q^{(i)}$ in the co-ordinate system defined by $\breve{f}^{(i)}$, which is centered at $x_p^{(i)}$. The radius of $Q^{(i)}$ is determined such that the "impurity level" within $Q^{(i)}$ is less than the parameter $\gamma$. The "impurity level" is the ratio of the normalized number of samples with differing class label to the normalized number of samples with the same class label. In all our experiments, $\gamma$ is fixed at the value 0.2.

To assess the similarity $S_{Y_\ell}(x_q)$ of a query datum $x_q$ to class $Y_\ell \in \mathcal{Y}$, we measure the similarity of $x_q$ to all regions whose class label is $Y_\ell$. To this end we define a set of binary variables $s_i(x_q)$, $i = 1, \ldots, N$, such that $s_i(x_q)$ is set to 1 if $x_q \in Q^{(i)}$ and the class label of the nearest neighbor of $x_q$ is $y^{(i)}$; otherwise it is set to 0. The variable $s_i(x_q)$ may be interpreted as a weak classifier which shows the similarity of $x_q$ to the $i$th region. The similarity $S_{Y_\ell}(x_q)$ is then obtained as follows:

$$S_{Y_\ell}(x_q) = \frac{\sum_{i \in \mathcal{Y}_\ell} s_i(x_q)}{h_\ell}, \qquad (11)$$

where $\mathcal{Y}_\ell$ indicates the set of all regions whose class labels are $Y_\ell$. The cardinality of $\mathcal{Y}_\ell$ is $h_\ell$. After computing the similarity of $x_q$ to all classes, the class label of $x_q$ is the one which provides the largest similarity.

If the query sample $x_q$ does not fall in any of the $Q^{(i)}$s, our desire is to assign its class as the class label of the nearest sample to $x_q$. The question is "what coordinate system should be used to determine the nearest neighboring sample". To address this matter, we use a majority voting procedure over the class labels within the set of all nearest neighboring samples. This nearest neighbor set consists of those samples which have the nearest distances to the query datum as measured over each of the $N$ local co-ordinate systems. The number of votes for each class is normalized to the number of samples within that class. It is to be noted that, on the basis of our experiments, the percentage of such a situation occurring is very small (only 0.03 percent).

We now discuss a method for determining a suitable value for $\beta$ (which corresponds to the selection of a suitable point in the Pareto set). We examine different values of $\beta \in [0,1]$ in increments of 0.05. For each value, we solve (9) followed by the randomized rounding process. This determines the candidate local co-ordinate system for the respective value of $\beta$, i.e., $\breve{f}_\beta^{(i)}$, and therefore specifies the candidate $Q^{(i)}$ and the weak classifier $s_i$. The corresponding local clustering performance may then be determined using a leave-one-out cross-validation procedure, using the respective weak classifier results over the training samples situated within the corresponding $Q^{(i)}$ as a criterion of performance. The Pareto optimal point corresponding to the value of $\beta$ which yields the best local performance is then selected as the binary solution $\breve{f}^{(i)}$ at the current iteration (see line 11 of Algorithm 1).

Algorithm 1. Pseudo Code of the Proposed Feature Selection Algorithm
Input: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $t$, $\alpha$
Output: $\{\breve{f}^{(i)}\}_{i=1}^{N}$
1  Initialization: Set $\breve{f}^{(i)} = (0, \ldots, 0)^T$, $i = 1, \ldots, N$;
2  for iteration $= 1$ to $t$ do
3    $\breve{f}_{\mathrm{prev}}^{(i)} = \breve{f}^{(i)}$, $i = 1, \ldots, N$;
4    for $i = 1$ to $N$ do
5      Compute $w_j^{(i)}$, $j = 1, \ldots, N$, $j \neq i$, using $\{\breve{f}_{\mathrm{prev}}^{(k)}\}_{k=1}^{N}$ as in (10);
6      Compute $\epsilon_{\max}^{(i)}$ through solving (8);
7      for $\beta = 0$ to $1$ do
8        Compute $f_\beta^{(i)}$ through solving (9);
9        Compute $\breve{f}_\beta^{(i)}$ through randomized rounding of $f_\beta^{(i)}$;
10       end
11     Set $\breve{f}^{(i)}$ equal to the member of $\{\breve{f}_\beta^{(i)}\}_{\beta \in [0,1]}$ which yields the best local performance, as explained in Section 3.2;
12   end
13 end

3.3 Discussion about Overfitting

In the following we discuss the overfitting issue with the proposed method. Let the available feature pool be denoted by the set $\mathcal{X}$. Let us consider the idealized scenario where for each localized region we can partition $\mathcal{X}$ into the two disjoint sets $\mathcal{X}_R^{(i)}$ and $\mathcal{X}_I^{(i)}$ such that $\mathcal{X}_R^{(i)} \cup \mathcal{X}_I^{(i)} = \mathcal{X}$, $i = 1, \ldots, N$. The sets $\mathcal{X}_R^{(i)}$ and $\mathcal{X}_I^{(i)}$ contain only the relevant and irrelevant features, respectively. Let $h_R^{(i)}$ denote the cardinality of $\mathcal{X}_R^{(i)}$.

For the time being, let us consider the hypothetical situation where $\alpha = h_R^{(i)}$. We note that "relevant" features are those which encourage local clustering behaviour, which is quantified by the optimization problem of (9). We therefore make the assumption that all features in $\mathcal{X}_R^{(i)}$ are sufficiently relevant to be selected as local features by the proposed procedure; i.e., with high probability, they are the solution to (9), followed by the randomized rounding process. If we now let $\alpha$ grow above the value $h_R^{(i)}$, features in $\mathcal{X}_I^{(i)}$ become candidates for selection. Because these features do not encourage clustering, then with high probability these features must be given a low $f$-value in order to satisfy the optimality of (9). Thus there is a low probability that any feature in $\mathcal{X}_I^{(i)}$ will be selected by the randomized rounding procedure. We recall that any solution selected by the randomized rounding procedure must also satisfy the constraints; therefore, such a solution remains feasible, due to the inequality constraint involving $\alpha$ in (9). Therefore, in this idealized scenario, we see that as $\alpha$ grows, the number of selected features tends to saturate at the value $h_R^{(i)}$.

In the more practical scenario, the features may not separate so cleanly into the relevant and irrelevant groups as we have assumed, with the result that "partially relevant" features may continue to be selected as $\alpha$ grows. Therefore the risk of overfitting is not entirely eliminated for real data sets. Nevertheless, as we demonstrate in Section 4, a saturation effect of the number of selected features in real data scenarios is clearly evident.

In summary, the LFS algorithm inherently tends to select only relevant features and rejects irrelevant features. This imposes a limit on the number of selected features. Thus the LFS method tends to be immune to the overfitting problem.
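As an illustration of the class similarity measure of Section 3.2 (eq. (11)), the sketch below scores a query point against the stored regions. It assumes the region radii R have already been fixed from the impurity criterion, all array names are hypothetical, and the nearest-neighbour fallback for queries outside every region is omitted; it is not the authors' implementation.

```python
import numpy as np

def classify_query(x_q, X, y, F, R):
    """Assign a class to x_q using the localized similarity of eq. (11).

    x_q : (M,)   query datum
    X   : (N, M) training samples (the representative points)
    y   : (N,)   class labels
    F   : (N, M) binary feature indicators, one per representative point
    R   : (N,)   radii of the hyper-spheres Q^(i) (from the impurity rule)
    """
    X, y, F = np.asarray(X, float), np.asarray(y), np.asarray(F, float)
    N = X.shape[0]
    s = np.zeros(N)
    for i in range(N):
        # Work in the local co-ordinate system selected for region i.
        d = np.linalg.norm((X - x_q) * F[i], axis=1)
        inside = np.linalg.norm((X[i] - x_q) * F[i]) <= R[i]
        nearest_label = y[np.argmin(d)]
        # Weak classifier s_i: x_q lies in Q^(i) and its nearest neighbour
        # (in this co-ordinate system) carries the label y^(i).
        s[i] = 1.0 if inside and nearest_label == y[i] else 0.0

    classes = np.unique(y)
    # S_{Y_l} = (sum of s_i over regions labelled Y_l) / h_l, per eq. (11).
    scores = np.array([s[y == c].sum() / np.sum(y == c) for c in classes])
    return classes[np.argmax(scores)], dict(zip(classes, scores))
```

The normalization by the number of regions per class keeps the similarity scores comparable between classes of different sizes, which is the role of $h_\ell$ in (11).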
TABLE 1
Minimum Classification Error (in Percent) and Standard Deviation (in Percent) of the Different Algorithms
Data set | LFS | FDA | Simba | mRMR | KCSM | Logo | DEFS | SVM (no feature selection)
Sonar 22.87(3.92) 26.11(4.29) 25.24(3.67) 28.70(2.61) 26.85(3.49) 26.75(3.44) 27.81(6.67) 49.90(4.81)
DNA 13.41(1.88) 13.94(2.71) 14.43(4.77) 13.75(2.96) 35.95(17.04) 15.35(5.73) 18.68(5.02) 49.70(2.04)
Breast 6.37(1.33) 7.71(1.92) 8.89(1.28) 8.29(2.19) 7.73(1.63) 8.25(1.36) 11.01(2.49) 37.61(0.60)
Adult 22.27(1.46) 24.65(0.33) 24.65(0.67) 24.75(7.58) 24.85(1.10) 24.53(7.85) 26.37(2.05) 24.65(0.33)
ARR 33.06(2.60) 46.68(9.95) 33.53(6.46) 31.59(3.25) 33.34(9.039) 33.93(5.32) 31.42(4.73) 43.68(1.19)
ALLAML 1.66(3.51) 5.00(4.30) 25.50(16.49) 7.50(8.28) 32.50(17.76) 7.50(7.29) 14.66(10.50) 38.33(15.81)
Prostate 4.16(4.39) 6.66(6.57) 12.66(8.79) 8.33(7.58) 33.33(16.66) 8.33(7.85) 13.66(9.56) 57.50(10.72)
Duke-breast 10.83(7.90) 17.50(8.28) 30.83(12.86) 21.66(5.82) 33.33(16.19) 21.66(11.91) 26.66(14.61) 63.33(10.54)
Leukemia 3.33(4.30) 5.00(5.82) 12.00(8.06) 5.00(5.82) 32.50(13.86) 6.66(5.27) 16.83(10.85) 35.83(14.19)
Colon 9.16(0.08) 11.66(8.95) 34.50(13.58) 19.16(5.62) 12.50(9.00) 20.83(10.57) 26.66(14.12) 36.66(17.21)
Average 12.71 16.49 22.22 16.87 27.29 17.38 21.38 43.72
The minimum classification error and the corresponding standard deviation, as determined by the bootstrapping procedure described earlier, are presented in Table 1. For reference, the classification error rate of the SVM classifier applied to the data sets without any feature selection is also reported in the last column of Table 1. Since the performance in this case is generally very low, this result implies that, without feature selection, classification suffers from the presence of irrelevant features and the curse of dimensionality [43]. The best result for each data set is shown in bold. Among the seven algorithms, the proposed LFS algorithm yields the best results in nine out of the 10 data sets. The last row shows the classification error rates averaged over all data sets. This row indicates that the proposed LFS method performs noticeably better on average than the other seven algorithms.

TABLE 2. Characteristics of the Real-World Data Sets Used in the Experiments.

4.2 Iterative Weight Definition and Correct Feature Selection

As illustrated in Fig. 3, the distribution of class $Y_1$ of the synthetic data set has two disjoint subclasses, whereas class $Y_2$ is a compact class with one mode. Samples of the first subclass of $Y_1$ (marked '+') can be discriminated from samples of class $Y_2$ using only the relevant feature $x_1$. In a similar way, samples of the second subclass require only $x_2$, whereas samples of class $Y_2$ require both $x_1$ and $x_2$. The results of applying the proposed method to the synthetic data set over four successive iterations are shown in Fig. 4, where samples have been contaminated with additional irrelevant features ranging in number from 1 to 30,000. Each point shows the percentage of samples for which the expected feature(s) (i.e., $x_1$ for samples within the first subclass, $x_2$ for samples within the second subclass, and $\{x_1, x_2\}$ for samples within class $Y_2$) are correctly identified. It can be seen that the performance is refined from one iteration to another, especially for a higher number of irrelevant features. The most significant improvement happens at the second iteration; hence, as mentioned previously, the default value of $t$ is set to 2.

The data set "DNA" has a "ground truth", in that much better performance has been previously reported if the selected features are those with indexes in the interval between 61 and 120 [9], [44]. This observation provides a good means of evaluating LFS performance on a real-world data set. Fig. 5 shows the result of applying the proposed LFS method to the data set "DNA", where the height of each feature index indicates the percentage of representative points which select these ground-truthed features as a member of their optimal feature set. These results demonstrate that the proposed method mostly identifies features with indexes from 61 to 105. Thus they are well matched to the "ground truth". The proposed method also performs very well in discarding the artificially added irrelevant features, i.e., features with indexes from 181 to 280.

Fig. 5. Selected features for the "DNA" data set. The height corresponding to each feature index indicates what percentage of representative points select the respective feature as a discriminative feature, where $\alpha$ is set to a typical value of 5.

4.3 Sensitivity of the Proposed Method to $\alpha$ and $\gamma$

To show the sensitivity of the proposed method to the parameter $\alpha$, the classification error rate and the cardinality of the optimal feature sets (averaged over all $N$ sets) versus $\alpha$, for the data set "Sonar", are respectively shown in Figs. 6 and 7, where $\alpha$ ranges from 1 to the maximum possible value of $M = 160$. These results demonstrate the robustness of the proposed LFS algorithm against overfitting, as discussed in Section 3.3.

Note that estimating an appropriate value for the number of selected features is generally a challenging issue. This is usually estimated using a validation set or based on prior knowledge, which may not be available in some applications. As can be seen, the proposed LFS algorithm is not too sensitive to this parameter. Moreover, as illustrated in Fig. 7, the cardinality of the optimal feature sets saturates for a sufficiently large value of $\alpha$.

Fig. 6. Classification error rate of the proposed method for the data set "Sonar", where the parameter $\alpha$ ranges from 1 to the maximum possible value of $M = 160$.

Fig. 7. Averaged cardinality of the optimal feature sets $\breve{f}^{(i)}$, $i = 1, \ldots, N$, versus the parameter $\alpha$, where $\alpha$ ranges from 1 to the maximum possible value of $M = 160$.

The error rate of the proposed method versus the impurity level parameter $\gamma$ for the data set "Colon" is shown in Fig. 8, where $\gamma$ ranges from 0 to 1. Small (large) values of $\gamma$ can be interpreted as a small (large) radius of the hyper-spheres. This demonstrates that the error rate is not too sensitive to a wide range of values of $\gamma$. As one may intuitively guess, we found that impurity levels in the range of 0.1 to 0.4 are appropriate. As mentioned previously, throughout all our experiments, $\gamma$ is set to 0.2 without tuning. This value is seen to work well over all data sets.

Fig. 8. Classification error rate of the proposed method for the data set "Colon", where the parameter $\gamma$ ranges from 0 to 1.

4.4 Overlapping Feature Sets?

The reader may be interested to know if there is any overlap between the optimal feature sets of the representative points. To answer this question, the normalized histogram over all feature sets for the data set "ALLAML" is shown in Fig. 9, where the parameter $\alpha$ is set to a typical value of 5. The height of each feature index indicates what percentage of representative points select the respective feature. As is expected, there is some overlap between region-specific feature sets, but it is evident there does not appear to be one common feature set that works well over all regions of the sample space. Indeed, with the proposed method and these results, we assert that the assumption of a common feature set over the entire sample space is not necessarily optimal in real-world applications.

Fig. 9. Selected features for the "ALLAML" data set. The height of each feature index indicates what percentage of representative points select the respective feature as a discriminative feature, where $\alpha$ is set to the typical value of 5.

The most common features may be interpreted as the most informative features in terms of classification accuracy over the sample space. The less common features may be interpreted as being informative features, but only relevant for a small group of samples; e.g., in the context of biology/genetics applications, the less common features may be interpreted as being important in the discrimination of some small sub-population of samples.

One may be interested to know the classification accuracy in the context of a global selection scheme; i.e., we select the top five dominant features from Fig. 9 as produced by the LFS method, and then feed them into an SVM classifier. Using such a feature set, the error rate is 6.66 percent, which is in the range of that of the other methods, but nevertheless significantly greater than the error rate (1.66 percent) corresponding to the proposed LFS region-specific feature selection method (see Table 1). This example is a further demonstration of the effectiveness of modeling the feature space locally.
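The global-versus-local comparison just described can be reproduced schematically as follows. The sketch assumes the per-point binary indicators are stacked in an array F, and it uses scikit-learn's SVC with cross-validation purely as an illustration, not as the authors' exact experimental protocol.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def global_topk_svm_error(X, y, F, k=5):
    """Error of an SVM trained on the k features most frequently
    selected across all representative points (cf. Fig. 9).

    X : (N, M) data, y : (N,) labels, F : (N, M) binary indicators.
    """
    freq = F.mean(axis=0)                  # selection frequency per feature
    top_k = np.argsort(freq)[-k:]          # indices of the k dominant features
    acc = cross_val_score(SVC(), X[:, top_k], y, cv=5).mean()
    return 1.0 - acc
```

Comparing this global error against the LFS error of Table 1 is exactly the contrast drawn in the preceding paragraph.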
4.5 How Far Is the Binary Solution from the Relaxed One?

To demonstrate that the relaxed solutions are a proper approximation of the final binary solutions obtained from the randomized rounding process explained in Section 3.1, the normalized histogram over the $\ell_1$-norm distances between the relaxed solutions and their corresponding binary solutions is shown in Fig. 10. The height of each bar indicates what fraction of the representative points have the corresponding value as their $\ell_1$-norm distance. The $\ell_1$-norm distances are normalized relative to the data dimension $M$. As may be seen, the relaxed solutions are appropriate approximations of the binary solutions.

Fig. 10. Histogram of distances between relaxed solutions and their corresponding binary solutions for the data set "Prostate", where $\alpha$ is set to the typical value of 5.

4.6 CPU Time

The computational complexity for computing a feature set for each representative point depends mainly on the data dimension. Fig. 11 shows the CPU time taken by the proposed method (using MATLAB) to perform feature selection for one representative point on the synthetic data set, with the number of irrelevant features ranging from 1 to 30,000. As may be seen, the figure shows linear complexity of the LFS method with respect to feature dimensionality.

Note that the feature selection process for each representative point is independent of the others and can be performed in parallel. For instance, in the case of a data set with 100 training samples (i.e., $N = 100$) and 10,000 features (i.e., $M = 10{,}000$) on a typical desktop computer with 12 cores, the required processing time in the training phase is

Fig. 11. The CPU time (seconds) taken by the proposed algorithm to perform feature selection for one representative point $x^{(i)}$ with a given $\beta$ on the synthetic data set, where the parameter $\alpha$ is set to 2.

5 CONCLUSIONS

In this paper we present an effective and practical method for local feature selection for application to the data classification problem. Unlike most feature selection algorithms, which pick a "global" subset of features that is most representative for the given data set, the proposed algorithm instead picks "local" subsets of features that are most informative for the small region around each data point. The cardinality and identity of the feature sets can vary from data point to data point. The process of computing a feature set for each region is independent of the others and can be performed in parallel.

The LFS procedure is formulated as a linear program, which has the advantage of convexity and efficient implementation. The proposed algorithm is shown to perform well in practice, compared to previous state-of-the-art feature selection algorithms. Performance of the proposed algorithm is insensitive to the underlying distribution of the data. Furthermore, we have demonstrated that the method is relatively invariant to an upper bound on the number of selected features, and so is robust against the overfitting phenomenon.

ACKNOWLEDGMENTS

The authors wish to acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and MITACS.

REFERENCES

[1] I. K. Fodor, "A survey of dimension reduction techniques," Lawrence Livermore National Laboratory, Tech. Rep. UCRL-ID-148494, 2002.
[2] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, Jan. 2000.
[3] P. Langley, Selection of Relevant Features in Machine Learning. Defense Technical Information Center, 1994.
[4] A. R. Webb, Statistical Pattern Recognition. Hoboken, NJ, USA: Wiley, 2003.
[5] I. Jolliffe, Principal Component Analysis. Hoboken, NJ, USA: Wiley, 2005.
[6] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[7] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[8] H.-L. Wei and S. A. Billings, "Feature subset selection and ranking for data dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 162–166, Jan. 2007.
[9] L. Wang, "Feature selection with kernel class separability," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 9, pp. 1534–1546, Sep. 2008.
[10] H. Zeng and Y.-M. Cheung, "Feature selection and kernel learning for local learning-based clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1532–1547, Aug. 2011.
[11] N. Kwak and C.-H. Choi, "Input feature selection for classification problems," IEEE Trans. Neural Netw., vol. 13, no. 1, pp. 143–159, Jan. 2002.
[12] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ, USA: Wiley, 2001.
[13] A. Hyvärinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Netw., vol. 13, no. 4, pp. 411–430, 2000.
[14] J. B. Tenenbaum, V. De Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[15] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comput., vol. 15, no. 6, pp. 1373–1396, 2003.
[16] D. L. Donoho and C. Grimes, "Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data," Proc. Nat. Acad. Sci., vol. 100, no. 10, pp. 5591–5596, 2003.
[17] Y. M. Lui and J. R. Beveridge, "Grassmann registration manifolds for face recognition," in Proc. 10th Eur. Conf. Comput. Vis., 2008, pp. 44–57.
[18] X. He, D. Cai, S. Yan, and H.-J. Zhang, "Neighborhood preserving embedding," in Proc. 10th IEEE Int. Conf. Comput. Vis., 2005, vol. 2, pp. 1208–1213.
[19] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton, "Pattern classification using a mixture of factor analyzers," in Proc. 1999 IEEE Signal Process. Soc. Workshop Neural Netw. Signal Process. IX, 1999, pp. 525–534.
[20] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 1, no. 1, pp. 24–45, Jan. 2004.
[21] Y. Cheng and G. M. Church, "Biclustering of expression data," in Proc. Int. Conf. Intell. Syst. Mol. Biol., 2000, vol. 8, pp. 93–103.
[22] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2001, pp. 269–274.
[23] Y. Sun, S. Todorovic, and S. Goodison, "Local-learning-based feature selection for high-dimensional data analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1610–1626, Sep. 2010.
[24] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Margin based feature selection: Theory and algorithms," in Proc. 21st Int. Conf. Mach. Learn., 2004, p. 43.
[25] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. 9th Int. Workshop Mach. Learn., 1992, pp. 249–256.
[26] Z. Liu, W. Hsiao, B. L. Cantarel, E. F. Drabek, and C. Fraser-Liggett, "Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data," Bioinformatics, vol. 27, no. 23, pp. 3242–3249, 2011.
[27] N. Armanfard and J. P. Reilly, "Classification based on local feature selection via linear programming," in Proc. IEEE Int. Workshop Mach. Learn. Signal Process., 2013, pp. 1–6.
[28] G. Brown, A. Pocock, M.-J. Zhao, and M. Lujan, "Conditional likelihood maximisation: A unifying framework for information theoretic feature selection," J. Mach. Learn. Res., vol. 13, pp. 27–66, 2012.
[29] R. N. Khushaba, A. Al-Ani, and A. Al-Jumaily, "Feature subset selection using differential evolution and a statistical repair mechanism," Expert Syst. Appl., vol. 38, no. 9, pp. 11515–11526, 2011.
[30] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," in Proc. Eur. Conf. Mach. Learn., 1994, pp. 171–182.
[31] Y. Sun, "Iterative RELIEF for feature weighting: Algorithms, theories, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1035–1051, Jun. 2007.
[32] B. Chen, H. Liu, J. Chai, and Z. Bao, "Large margin feature weighting method via linear programming," IEEE Trans. Knowl. Data Eng., vol. 21, no. 10, pp. 1475–1488, Oct. 2009.
[33] B. Liu, B. Fang, X. Liu, J. Chen, and Z. Huang, "Large margin subspace learning for feature selection," Pattern Recog., vol. 46, pp. 2798–2806, 2013.
[34] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[35] M. T. Thai, "Approximation algorithms: LP relaxation, rounding, and randomized rounding techniques," Lecture Notes, University of Florida, 2013.
[36] A. Souza, "Randomized algorithm & probabilistic methods," Lecture Notes, Humboldt University of Berlin, 2001.
[37] C. L. Hwang, A. S. M. Masud, et al., Multiple Objective Decision Making: Methods and Applications. New York, NY, USA: Springer, 1979, vol. 164.
[38] G. Mavrotas, "Effective implementation of the ε-constraint method in multi-objective mathematical programming problems," Appl. Math. Comput., vol. 213, no. 2, pp. 455–465, 2009.
[39] K. Bache and M. Lichman, UCI Machine Learning Repository, 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[40] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proc. Nat. Acad. Sci., vol. 96, no. 12, pp. 6745–6750, 1999.
[41] F. Nie, H. Huang, X. Cai, and C. H. Ding, "Efficient and robust feature selection via joint l2,1-norms minimization," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1813–1821.
[42] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins, "Predicting the clinical status of human breast cancer by using gene expression profiles," Proc. Nat. Acad. Sci., vol. 98, no. 20, pp. 11462–11467, 2001.
[43] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Rand Corporation, 1962.
[44] G. John, DNA dataset (Statlog version): Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. [Online]. Available: https://www.sgi.com/tech/mlc/db/DNA.names

Narges Armanfard received the MSc degree in electrical engineering from Tarbiat Modares University, Tehran, Iran, in 2008. She is currently working towards the PhD degree in electrical engineering at McMaster University, Hamilton, Ontario, Canada. Since 2012, she has been a research assistant and also a teaching assistant in the Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada. Her research interests include signal, image, and video processing, specifically EEG and ECG signal analysis, machine learning, machine vision, video surveillance, and document image processing.

James P. Reilly (S'76-M'80) received the BASc degree from the University of Waterloo, Waterloo, ON, Canada, in 1973, and the MEng and PhD degrees from McMaster University, Hamilton, ON, Canada, in 1977 and 1980, respectively, all in electrical engineering. He was employed in the telecommunications industry for a total of seven years and was then appointed to the Department of Electrical & Computer Engineering at McMaster University in 1985 as an associate professor. He was promoted to the rank of full professor in 1992. He has been a visiting academic at the University of Canterbury, New Zealand, and the University of Melbourne, Australia. His research interests include several aspects of signal processing, specifically machine learning, EEG signal analysis, Bayesian methods, blind signal separation, blind identification, and array signal processing. He is a registered professional engineer in the province of Ontario.

Majid Komeili received the MSc degree in electrical engineering from Tarbiat Modares University, Tehran, Iran, in 2008. He is currently working toward the PhD degree in electrical engineering at the University of Toronto, Toronto, Ontario, Canada. Since 2012, he has been a research assistant and also a teaching assistant in the Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto, Toronto, Ontario, Canada. His research interests include signal, image, and video processing, specifically ECG and EEG signal analysis, machine learning, machine vision, video surveillance, and document image processing.