
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 6, JUNE 2016

Local Feature Selection for Data Classification


Narges Armanfard, James P. Reilly, and Majid Komeili

Abstract—Typical feature selection methods choose an optimal global feature subset that is applied over all regions of the sample
space. In contrast, in this paper we propose a novel localized feature selection (LFS) approach whereby each region of the sample
space is associated with its own distinct optimized feature set, which may vary both in membership and size across the sample space.
This allows the feature set to optimally adapt to local variations in the sample space. An associated method for measuring the
similarities of a query datum to each of the respective classes is also proposed. The proposed method makes no assumptions about
the underlying structure of the samples; hence the method is insensitive to the distribution of the data over the sample space. The
method is efficiently formulated as a linear programming optimization problem. Furthermore, we demonstrate the method is robust
against the over-fitting problem. Experimental results on eleven synthetic and real-world data sets demonstrate the viability of the
formulation and the effectiveness of the proposed algorithm. In addition we show several examples where localized feature selection
produces better results than a global feature selection method.

Index Terms—Local feature selection, classification, linear programming

1 INTRODUCTION

In many applications nowadays, data sets are characterized by hundreds or even thousands of features. Typically, there is often an insufficient number of objects to adequately represent the distribution of these high-dimensional feature spaces. Hence, dimensionality reduction is an important issue in a wide range of scientific disciplines. Many approaches for dimensionality reduction have been proposed in the literature [1], [2], [3]. Dimensionality reduction methods can be roughly categorized into two groups: feature extraction (also known as subspace learning) [4], [5], [6] and feature selection [7], [8], [9], [10], [11].

Feature extraction methods, like principal component analysis [5], linear discriminant analysis [12] and independent component analysis [13], mix original features to produce a new set of features. Since such features are a combination of the original features, the physical interpretation in terms of the original features may be lost. In addition to linear methods, there are also some nonlinear feature extraction methods which assume that the data of interest lie on an embedded nonlinear manifold [6], [14], [15], [16]. Manifold learning techniques often need a large amount of training data and dense sampling on a manifold. Such rich training data may not be available in some real-world applications [17]. On the other hand, in many applications it is desired to reduce not only the dimensionality, but also the number of features that are to be considered. Unlike feature extraction, feature selection returns a subset of the original features without any transformation.

In this study, the feature selection process is considered for data classification. Given a set of training samples and their associated classes, the feature selection problem involves finding a subset of features that leads to an "optimal" characterization of the different classes. Conventional feature selection algorithms select a single common feature set for characterizing all regions of the sample space. In fact, these methods assume that a single feature subset can optimally characterize sample space variations.

In this paper we offer an alternative to the conventional feature selection approaches by introducing the novel concept of localized feature selection (LFS), where the optimal feature subset varies over the sample space in a manner that optimally adapts to local variations. We propose that an enhanced mathematical description of the sample space could be obtained by allowing various groups of samples in different regions to be associated with their own distinct feature set, which is optimized for that specific region.

Embedding local information is not inherently a new idea. LLE [6], Isomap [14], NPE [18] and MFA [19] apply local information for feature extraction (not feature selection), in which the physical interpretation of the induced co-ordinate system is lost. Bi-clustering approaches [20], [21], [22] apply local information for clustering of samples and features simultaneously, but these are basically unsupervised learning algorithms. Some more related approaches such as Logo [23], Simba [24], Relief [25] and MetaDistance [26] consider local sample behavior for feature selection, but all these algorithms suffer from the requirement that the entire sample space be modeled by a single common feature set.

In this paper, the concept of localized feature selection is realized by considering each training sample as a representative point of its neighboring region and by selecting an optimal feature set for that region. The optimal feature set is such that, within its corresponding co-ordinate system, the within-class distances and the between-class distances are locally minimized and maximized, respectively.

N. Armanfard and J. P. Reilly are with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada. E-mail: {armanfn, reillyj}@mcmaster.ca.
M. Komeili is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada. E-mail: [email protected].
Manuscript received 30 Jan. 2015; accepted 7 June 2015. Date of publication 13 Sept. 2015; date of current version 12 May 2016. Recommended for acceptance by K. Borgwardt. Digital Object Identifier no. 10.1109/TPAMI.2015.2478471.
Fig. 1. Block diagram of the proposed method, where each training sample $\mathbf{x}^{(i)}$, $i = 1, \ldots, N$, is considered to be a representative point of its neighboring region and an optimal feature set (possibly different in size and membership) is selected for that region. The feature sets of all representative points are used for classification of a query datum $\mathbf{x}^{q}$. The details of the feature selection and classification stages are presented in Sections 3.1 and 3.2, respectively.
Since the optimal feature set is no longer constant over the sample space, ordinary classifiers are no longer appropriate for the proposed method. We therefore propose a localized classification procedure that has been adapted for our purposes. We refer to the proposed algorithm as the Localized Feature Selection (LFS) method.

The LFS method has several advantages. First, we make no assumptions regarding the distribution of the data over the sample space. The proposed approach therefore allows us to handle variations of the samples in the same class over the sample space, and to accommodate irregular or disjoint sample distributions. Moreover, we show later that the performance of the LFS method is robust against the overfitting problem. The proposed method also has the advantage that the underlying optimization problem is formulated as a linear program. Furthermore, the feature selection processes for different regions of the sample space are independent of each other and can therefore be performed in parallel. The computer implementation of the method can therefore be fast and efficient.

An overview of the proposed method is shown in Fig. 1. An early version of this paper appeared in [27]. The remaining portion of this paper is organized as follows: Section 2 briefly reviews recent feature selection algorithms. Details of the proposed method for local feature selection and classification are presented in Section 3. In Section 4, experimental results, which demonstrate the performance of the proposed method over a range of synthetic and real-world data sets, are presented. Conclusions are drawn in Section 5.

2 RELATED WORK

Feature selection has been an active research area in past decades. In this section we briefly review some of the main ideas of various feature selection approaches for data classification.

Some of the conventional feature selection approaches assign a common discriminative feature set to the whole sample space without considering the local behavior of data in different regions of the feature space [7], [9], [28], [29]. For example, in [7] a common feature set is selected using a minimal-redundancy maximal-relevance criterion, which is based on mutual information. In [9] a common discriminative feature set is selected through maximizing a class separability criterion in a high-dimensional kernel space. In [29] a common feature set is computed using an evolutionary method, which is a combination of a differential evolution optimization method and a repair mechanism based on feature distribution measures. One conventional feature selection approach which seems close to the proposed LFS algorithm is feature selection using the Fisher criterion (FDA) [12], which computes a score for each feature based on maximizing between-class distances and minimizing within-class distances in the data space spanned by the corresponding feature. The main drawback of this algorithm, besides ignoring the local behavior of the samples, is that it considers features independently, leading to a sub-optimal subset of features.

On the other hand, several approaches exist that try to improve classification accuracy by local investigation of the feature space. One such approach is the family of margin-based feature selection methods [23], [24], [25], [30], [31], [32], [33]. These methods are instance-based, where each feature is weighted to achieve maximal margin. The "margin" of a data point is defined as the distance between the nearest same-labeled data point (near-hit) and the nearest differently labeled data point (near-miss). RELIEF [25] detects those features which are statistically relevant to the target. One drawback of RELIEF is that the margins are not reevaluated during the learning process. Compared to RELIEF, the Simba algorithm [24] reevaluates margins based on the learned weight vector of the features. However, since its objective function is non-convex, it is characterized by many local minima. Recently, a local margin-based feature selection method was presented in [23], which uses local learning to decompose a complex nonlinear problem into a set of locally linear problems. In [26] local information is embedded in feature selection through combining instance-based and model-based learning methods.
Although all these approaches use local information to determine an optimal feature set, the selected feature set is still forced to model the entire sample space.

3 PROPOSED METHOD

The proposed method is presented in two parts: feature selection and class similarity measurement. In the former, a discriminative subset of features is selected for each of the sample space regions. In the latter, a localized classifier structure for measuring the similarity of a query datum to a specific class is presented. The overfitting issue with regard to the proposed algorithm is discussed in Section 3.3.

3.1 Feature Selection

Assume that we encounter a classification problem with $N$ training samples $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N} \subset \mathbb{R}^{M} \times \mathcal{Y}$, where $\mathcal{Y} = \{Y_1, \ldots, Y_c\}$ is the set of class labels, $\mathbf{x}^{(i)}$ is the $i$th training sample containing $M$ features and $y^{(i)} \in \mathcal{Y}$ is its corresponding class label.

To implement the proposed localized feature selection scheme, we consider each training sample $\mathbf{x}^{(i)}$ to be a representative point for its neighboring region and assign an $M$-dimensional indicator vector $\mathbf{f}^{(i)} \in \{0,1\}^{M}$ to $\mathbf{x}^{(i)}$ that indicates which features are optimal for local separation of the classes. If the element $f_m^{(i)} = 1$, then the $m$th feature is selected for the $i$th sample; otherwise it is not. The optimal indicator vector $\mathbf{f}^{(i)}$ is computed such that, in its respective subspace, the neighboring samples with class label similar to $y^{(i)}$ cluster as closely as possible around $\mathbf{x}^{(i)}$, whereas samples with differing class labels are as far away as possible. No assumptions are made that require the classes to be unimodal, nor on the probability distribution of the samples. In this work, Euclidean distance is used as the distance measure.

The following will present the process of calculating $\mathbf{f}^{(i)}$ corresponding to the representative point $\mathbf{x}^{(i)}$.

3.1.1 Initial Formulation

Assume that $\mathbf{x}_p^{(k,i)}$ is the projection of an arbitrary training sample $\mathbf{x}^{(k)}$ into the subspace defined by $\mathbf{f}^{(i)}$, as follows:

$$\mathbf{x}_p^{(k,i)} = \mathbf{x}^{(k)} \odot \mathbf{f}^{(i)}, \quad k = 1, \ldots, N, \tag{1}$$

where $\odot$ is the element-wise product. In the sequel, projection into the space defined by $\mathbf{f}^{(i)}$ is implied, so the dependence on $i$ in $\mathbf{x}_p^{(k,i)}$ is suppressed.

We want to encourage clustering behaviour, i.e., in the neighborhood of $\mathbf{x}_p^{(i)}$ we want to find an optimal feature subset $\mathbf{f}^{(i)}$ so that, in the corresponding local co-ordinate system, we satisfy the following two goals:

1) neighboring samples of the same class are closely situated around $\mathbf{x}_p^{(i)}$, and simultaneously,
2) neighboring samples with different classes are further removed from $\mathbf{x}_p^{(i)}$.

To realize these goals, we define $N-1$ objective functions, which are the weighted distances of all within- and between-class samples, to be respectively minimized and maximized as in (2):

$$\min_{\mathbf{f}^{(i)}} \; w_j^{(i)} \big\| \mathbf{x}_p^{(i)} - \mathbf{x}_p^{(j)} \big\|_2, \quad j \in y^{(i)},\; j \neq i,$$
$$\max_{\mathbf{f}^{(i)}} \; w_j^{(i)} \big\| \mathbf{x}_p^{(i)} - \mathbf{x}_p^{(j)} \big\|_2, \quad j \notin y^{(i)}, \tag{2}$$

where $y^{(i)}$ denotes the set of all training samples with class label similar to $y^{(i)}$. The quantity $w_j^{(i)}$ is the weight of the corresponding distance where, in order to concentrate on neighboring samples and reduce the effect of remote samples on the objective functions, higher weights are assigned to the closer samples of $\mathbf{x}_p^{(i)}$. Weights decrease exponentially with increasing distance from $\mathbf{x}_p^{(i)}$. However, measuring sample distances from $\mathbf{x}_p^{(i)}$ is a challenging issue since these distances should be measured in the local co-ordinate system defined by $\mathbf{f}^{(i)}$, which is unknown at the problem outset. To overcome this issue, we use an iterative approach for computing $\mathbf{f}^{(i)}$, where at each iteration the weights are determined based on the distances in the co-ordinate system defined at the previous iteration. The following discussion assumes the weights have been determined in this manner. Further discussion on the computation of the weights is given in Section 3.1.4.

There are constraints that must be considered in our optimization formulations. Since we are looking for an indicator vector $\mathbf{f}^{(i)} = (f_1^{(i)}, f_2^{(i)}, \ldots, f_M^{(i)})^T$, the problem variables $f_m^{(i)}, m = 1, \ldots, M$, are restricted to 0 and 1, where $(\cdot)^T$ is the transpose operator. Because there must be at least one active feature in $\mathbf{f}^{(i)}$, the null binary vector must be excluded, i.e., $1 \leq \mathbf{1}^T \mathbf{f}^{(i)}$, where $\mathbf{1}$ is an $M$-dimensional vector with all elements equal to 1. Furthermore, we would like to limit the maximum number of active features to a user-settable value $\alpha$, i.e., $\mathbf{1}^T \mathbf{f}^{(i)} \leq \alpha$, where $\alpha$ must be an integer between 1 and $M$. Therefore, the feature selection problem for the neighboring region of $\mathbf{x}^{(i)}$ can be written as follows:

$$\min_{\mathbf{f}^{(i)}} \; w_j^{(i)} \big\| \mathbf{x}_p^{(i)} - \mathbf{x}_p^{(j)} \big\|_2, \quad j \in y^{(i)},\; j \neq i,$$
$$\max_{\mathbf{f}^{(i)}} \; w_j^{(i)} \big\| \mathbf{x}_p^{(i)} - \mathbf{x}_p^{(j)} \big\|_2, \quad j \notin y^{(i)},$$
$$\text{s.t.} \quad f_m^{(i)} \in \{0, 1\}, \; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T \mathbf{f}^{(i)} \leq \alpha, \tag{3}$$

where the notation $\{\cdot\}$ is used to indicate a discrete set, whereas the notation $[\cdot]$ is used later to indicate a continuous interval.

In the next section, the above optimization problem is reformulated into an efficient linear programming optimization problem.
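To make the projection of (1) and the weighted distance objectives of (2) and (3) concrete, the following minimal NumPy sketch computes them for a single representative point. It is illustrative only (the authors' implementation is in MATLAB), and the array names `X`, `y`, `w` and `f` are assumptions rather than symbols from any released code.

```python
import numpy as np

def weighted_local_distances(X, y, w, f, i):
    """Weighted within- and between-class distances of (2), measured in the
    subspace selected by the binary indicator vector f of representative i.

    X : (N, M) training samples      y : (N,) class labels
    w : (N,) weights w_j^(i)         f : (M,) 0/1 indicator vector
    """
    Xp = X * f                                     # projection of (1): element-wise product
    d = np.linalg.norm(Xp - Xp[i], axis=1)         # ||x_p^(i) - x_p^(j)||_2 for every j
    same = (y == y[i]) & (np.arange(len(y)) != i)  # j in y^(i), j != i
    diff = (y != y[i])                             # j not in y^(i)
    return w[same] * d[same], w[diff] * d[diff]    # to be minimized / maximized, respectively
```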
3.1.2 Problem Reformulation

To obtain a well-behaved optimization problem, in the following we use the squared Euclidean distance instead of the Euclidean distance itself. It is apparent that the optimal solution of (3) is invariant to this replacement.

Considering the sample projection definition in (1) and the fact that the problem variables $f_m^{(i)}, m = 1, \ldots, M$, are binary, each objective function of (3) can be simplified as follows:

$$w_j^{(i)} \big\| \mathbf{x}_p^{(i)} - \mathbf{x}_p^{(j)} \big\|_2^2 = w_j^{(i)} \big\| (\mathbf{x}^{(i)} - \mathbf{x}^{(j)}) \odot \mathbf{f}^{(i)} \big\|_2^2 = w_j^{(i)} \sum_{m=1}^{M} \big( d_{j,m}^{(i)} f_m^{(i)} \big)^2 = w_j^{(i)} \sum_{m=1}^{M} f_m^{(i)} \big( d_{j,m}^{(i)} \big)^2 = w_j^{(i)} \mathbf{D}_j^{(i)T} \mathbf{f}^{(i)}, \tag{4}$$

where $\mathbf{D}_j^{(i)} = \big( (d_{j,1}^{(i)})^2, (d_{j,2}^{(i)})^2, \ldots, (d_{j,M}^{(i)})^2 \big)^T$ and $d_{j,m}^{(i)}$ is the $m$th element of $\mathbf{x}^{(i)} - \mathbf{x}^{(j)}$. The term $\big( f_m^{(i)} \big)^2$ in the second line is replaced with $f_m^{(i)}$ due to the first constraint in (3). The important conclusion drawn is that the objective functions are linear in terms of the problem variables.

Using the summation of all weighted within-class distances and all weighted between-class distances in the sub-feature space defined by $\mathbf{f}^{(i)}$, we define the total intra-class distance and the total inter-class distance as in (5). The problem is then reformulated by simultaneously minimizing the former and maximizing the latter:

$$\text{total intra-class distance:} \quad \sum_{j \in y^{(i)}} w_j^{(i)} \mathbf{D}_j^{(i)T} \mathbf{f}^{(i)} \triangleq \mathbf{a}^{(i)T} \mathbf{f}^{(i)},$$
$$\text{total inter-class distance:} \quad \sum_{j \notin y^{(i)}} w_j^{(i)} \mathbf{D}_j^{(i)T} \mathbf{f}^{(i)} \triangleq \mathbf{b}^{(i)T} \mathbf{f}^{(i)}. \tag{5}$$

We see that (3) is in the form of an integer program, which is known to be computationally intractable [34]. However, this issue is readily addressed through the use of a standard and widely accepted approximation of an integer programming problem [34], [35], [36]. Here, we replace (relax) the binary constraint in (3) with the linear inequalities $0 \leq f_m^{(i)} \leq 1, m = 1, \ldots, M$. This procedure restores the computational efficiency of the program. A randomized rounding procedure (to be discussed further), which maps the linear solution back onto a suitable point on the binary grid, then follows.

These reformulations result in (6), which is a multi-objective optimization problem consisting of two linear objective functions that are to be simultaneously minimized and maximized, along with $2M + 2$ linear constraints:

$$\min_{\mathbf{f}^{(i)}} \; \mathbf{a}^{(i)T} \mathbf{f}^{(i)}, \qquad \max_{\mathbf{f}^{(i)}} \; \mathbf{b}^{(i)T} \mathbf{f}^{(i)}$$
$$\text{s.t.} \quad f_m^{(i)} \in [0, 1], \; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T \mathbf{f}^{(i)} \leq \alpha. \tag{6}$$

There are several ways to re-configure a multi-objective problem into a standard form [34], [37], [38] with a single objective function, e.g., a linear combination of the objective functions. In the multi-objective case, the concept of optimality is replaced with Pareto optimality. A Pareto optimal solution is one in which an improvement in one objective requires a degradation of another. Since our multi-objective optimization problem is convex (because both the objective functions and the constraints defined in (6) are convex), the set of achievable objectives $\mathcal{L}$ is also convex. The solution to a multi-objective optimization problem is not unique and consists of the set of all Pareto optimal points that are on the boundary of the convex set $\mathcal{L}$. Different points in the set correspond to different weightings between the two objective functions. The set of Pareto points is unique and independent of the methodology by which the two functions are weighted (for more detail about the Pareto optimal approach see [34]). In this paper, we use the $\epsilon$-constraint method as described by (7), such that instead of maximizing the total inter-class distance, we force it to be greater than some constant $\epsilon^{(i)}$. In this way we can map out the entire Pareto optimal set by varying a single parameter, $\epsilon^{(i)}$. One advantage of this approach is that we can guarantee the combined inter-class distances are in excess of the value of the parameter $\epsilon^{(i)}$:

$$\min_{\mathbf{f}^{(i)}} \; \mathbf{a}^{(i)T} \mathbf{f}^{(i)}$$
$$\text{s.t.} \quad f_m^{(i)} \in [0, 1], \; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T \mathbf{f}^{(i)} \leq \alpha; \qquad \mathbf{b}^{(i)T} \mathbf{f}^{(i)} \geq \epsilon^{(i)}. \tag{7}$$

The parameter $\epsilon^{(i)}$ must be determined such that the optimization problem defined in (7) is feasible. In the next section we present an approach to automatically determine a value of the parameter $\epsilon^{(i)}$ which guarantees that the feasible set is not empty.

3.1.3 Problem Feasibility

The optimization problem defined in (7) is feasible if there is at least one point that satisfies its constraints. The constraints $f_m^{(i)} \in [0, 1], m = 1, \ldots, M$, indicate that the optimum point must be inside a unit hyper-cube. The constraints $1 \leq \mathbf{1}^T \mathbf{f}^{(i)} \leq \alpha$ indicate that the optimum point must be within the space between two parallel hyper-planes defined by $\mathbf{1}^T \mathbf{f}^{(i)} = 1$ and $\mathbf{1}^T \mathbf{f}^{(i)} = \alpha$. Since $\alpha$ is an integer greater than or equal to 1, the space bounded by these two parallel hyper-planes is always non-empty and its intersection with the unit hyper-cube is also non-empty. In fact, the intersection of the spaces defined by $f_m^{(i)} \in [0, 1], m = 1, \ldots, M$, and $1 \leq \mathbf{1}^T \mathbf{f}^{(i)} \leq \alpha$ is a polyhedron $P$ that can be seen as a unit cube from which two parts are removed; the first part is the intersection between the half-space $\mathbf{1}^T \mathbf{f}^{(i)} < 1$ and the unit hyper-cube, and the second is the intersection between the half-space $\mathbf{1}^T \mathbf{f}^{(i)} > \alpha$ and the unit hyper-cube (see Fig. 2). If the intersection between the polyhedron $P$ and the half-space defined by $\mathbf{b}^{(i)T} \mathbf{f}^{(i)} \geq \epsilon^{(i)}$, i.e., the last constraint, is non-empty, then the optimization problem is feasible.
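The linear objective vectors $\mathbf{a}^{(i)}$ and $\mathbf{b}^{(i)}$ of (5), which drive (6) through (9), are simple weighted sums of squared feature-wise differences. A minimal NumPy sketch follows; it is illustrative only, and the names `X`, `y` and `w` are assumptions.

```python
import numpy as np

def objective_vectors(X, y, w, i):
    """Assemble a^(i) and b^(i) of (5) for representative point i.

    X : (N, M) training samples, y : (N,) labels, w : (N,) weights w_j^(i).
    Row j of D holds D_j^(i), the element-wise squared differences between
    x^(i) and x^(j), so the squared projected distance is the linear form
    w_j * D_j^(i)^T f, as in (4).
    """
    D = (X - X[i]) ** 2
    same = (y == y[i]) & (np.arange(len(y)) != i)   # within-class, excluding j = i
    diff = (y != y[i])                              # between-class
    a = (w[same, None] * D[same]).sum(axis=0)       # total intra-class vector a^(i)
    b = (w[diff, None] * D[diff]).sum(axis=0)       # total inter-class vector b^(i)
    return a, b
```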
Fig. 2. The polyhedron $P$ in the case of a 3-D original feature space, i.e., the data dimension $M$ is 3, where $\alpha$ is set to 2. It is a unit cube (defined by $0 \leq f_m^{(i)} \leq 1, m = 1, \ldots, 3$) from which two regions, i.e., the blue and red pyramids, are removed. The blue pyramid is the intersection between the unit cube and the half-space $\mathbf{1}^T \mathbf{f}^{(i)} < 1$, and the red pyramid is the intersection between the half-space $\mathbf{1}^T \mathbf{f}^{(i)} > \alpha$ and the unit cube.

The maximum value $\epsilon_{\max}^{(i)}$ that $\epsilon^{(i)}$ can take such that the intersection remains non-empty is the solution to the following feasibility LP problem:

$$\epsilon_{\max}^{(i)} = \max_{\mathbf{f}^{(i)}} \; \mathbf{b}^{(i)T} \mathbf{f}^{(i)}$$
$$\text{s.t.} \quad f_m^{(i)} \in [0, 1], \; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T \mathbf{f}^{(i)} \leq \alpha. \tag{8}$$

Effectively, (8) corresponds to an extreme Pareto point where the weighting given to the intra-class distance term (the first objective in (6)) is zero. Finally, we set $\epsilon^{(i)} = \beta\, \epsilon_{\max}^{(i)}$, where $\beta$ lies between zero and one. In this way, the optimization problem is always feasible, and by changing $\beta$ we can map out the entire Pareto optimal set corresponding to different relative weightings of intra- versus inter-class distances. Here we define the Pareto optimal point corresponding to a specific value of $\beta$ as $\mathbf{f}_\beta^{(i)}$; furthermore, we define the set $\{\mathbf{f}_\beta^{(i)}\}_{\beta \in [0,1]}$ as the complete Pareto optimal set. The final reformulation of the problem may therefore be expressed as

$$\min_{\mathbf{f}_\beta^{(i)}} \; \mathbf{a}^{(i)T} \mathbf{f}_\beta^{(i)}$$
$$\text{s.t.} \quad f_{m,\beta}^{(i)} \in [0, 1], \; m = 1, \ldots, M; \qquad 1 \leq \mathbf{1}^T \mathbf{f}_\beta^{(i)} \leq \alpha; \qquad \mathbf{b}^{(i)T} \mathbf{f}_\beta^{(i)} \geq \beta\, \epsilon_{\max}^{(i)}, \tag{9}$$

where $\mathbf{f}_\beta^{(i)} = \big( f_{1,\beta}^{(i)}, f_{2,\beta}^{(i)}, \ldots, f_{M,\beta}^{(i)} \big)^T$. This formulation has the desirable form of a linear program and hence is convex. The solution to (9) provides a value for each element of $\mathbf{f}_\beta^{(i)}$ over the continuous range $[0, 1]$ that may be considered close to the corresponding binary Pareto optimal solution $\bar{\mathbf{f}}_\beta^{(i)}$. To obtain $\bar{\mathbf{f}}_\beta^{(i)}$, a randomized rounding process [34], [35], [36] is applied to the optimal point of (9), i.e., $\mathbf{f}_\beta^{(i)}$, where $\bar{f}_{m,\beta}^{(i)}$ is set to one with probability $f_{m,\beta}^{(i)}$ and is set to zero with probability $(1 - f_{m,\beta}^{(i)})$ for $m = 1, \ldots, M$. To explore the entire region surrounding the Pareto optimal $\mathbf{f}_\beta^{(i)}$, the randomized rounding process is repeated 1,000 times, and the point that simultaneously satisfies the constraints of (9) and provides the minimum value for the objective function of (9) is chosen as the binary Pareto point $\bar{\mathbf{f}}_\beta^{(i)}$. Among the binary Pareto optimal points $\{\bar{\mathbf{f}}_\beta^{(i)}\}_{\beta \in [0,1]}$, the one which yields the best local clustering of samples is chosen as the binary feature vector $\bar{\mathbf{f}}^{(i)}$ corresponding to the representative point $\mathbf{x}^{(i)}$. This process is explained in more detail in Section 3.2.

3.1.4 Weight Definition

In order to compute the sub-feature set $\bar{\mathbf{f}}^{(i)}$ corresponding to the representative point $\mathbf{x}^{(i)}$, the proposed method focuses on the neighboring samples by assigning higher weights to them. However, the computation of the weights is dependent on the co-ordinate system, which is defined by $\bar{\mathbf{f}}^{(i)}$ and is unknown at the problem outset. To overcome this problem, we use an iterative approach. At each iteration, the weights $w_j^{(i)}, j = 1, \ldots, N, j \neq i$ (see (3)), are computed using the previous estimates of $\bar{\mathbf{f}}^{(i)}, i = 1, \ldots, N$. Initially, the weights are all assigned uniform values. Empirically, if two samples are close to each other in one space, they are also close in most of the other sub-spaces. Therefore we define $w_j^{(i)}$, using the distance between $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ in all $N$ subspaces obtained from the previous iteration, in the following manner:

$$w_j^{(i)} = \exp\!\left( -\frac{1}{N} \sum_{k=1}^{N} \big( d_{ij|k} - d_{ij|k}^{\min} \big) \right),$$
$$d_{ij|k} = \big\| (\mathbf{x}^{(i)} - \mathbf{x}^{(j)}) \odot \bar{\mathbf{f}}^{(k)} \big\|_2,$$
$$d_{ij|k}^{\min} = \begin{cases} \min_{v \in y^{(i)}} d_{iv|k}, & \text{if } y^{(j)} = y^{(i)} \\ \min_{v \notin y^{(i)}} d_{iv|k}, & \text{if } y^{(j)} \neq y^{(i)}, \end{cases} \tag{10}$$

where $\bar{\mathbf{f}}^{(k)}, k = 1, \ldots, N$, are known from the previous iteration. Such a definition implies that all the $w_j^{(i)}$ are normalized over $[0, 1]$.

The pseudo code of the proposed feature selection method is presented in Algorithm 1, where the parameter $t$ is the number of iterations.

Algorithm 1. Pseudo Code of the Proposed Feature Selection Algorithm

Input: $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$; $t$; $\alpha$
Output: $\{\bar{\mathbf{f}}^{(i)}\}_{i=1}^{N}$
1  Initialization: Set $\bar{\mathbf{f}}^{(i)} = (0, \ldots, 0)^T$, $i = 1, \ldots, N$;
2  for iteration 1 to $t$ do
3    $\bar{\mathbf{f}}_{\text{prev}}^{(i)} = \bar{\mathbf{f}}^{(i)}$, $i = 1, \ldots, N$;
4    for $i = 1$ to $N$ do
5      Compute $w_j^{(i)}$, $j = 1, \ldots, N$, $j \neq i$, using $\{\bar{\mathbf{f}}_{\text{prev}}^{(k)}\}_{k=1}^{N}$ as in (10);
6      Compute $\epsilon_{\max}^{(i)}$ through solving (8);
7      for $\beta = 0$ to 1 do
8        Compute $\mathbf{f}_\beta^{(i)}$ through solving (9);
9        Compute $\bar{\mathbf{f}}_\beta^{(i)}$ through randomized rounding of $\mathbf{f}_\beta^{(i)}$;
10     end
11     Set $\bar{\mathbf{f}}^{(i)}$ equal to the member of $\{\bar{\mathbf{f}}_\beta^{(i)}\}_{\beta \in [0,1]}$ which yields the best local performance, as explained in Section 3.2;
12   end
13 end
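Steps 6 to 10 of Algorithm 1 can be prototyped with an off-the-shelf LP solver. The sketch below is an illustrative Python/SciPy rendering of (8), (9) and the randomized rounding step for one representative point and one value of $\beta$; it is not the authors' MATLAB implementation, and the fallback used when no rounded sample is feasible is an added assumption.

```python
import numpy as np
from scipy.optimize import linprog

def solve_region(a, b, alpha, beta, n_rounds=1000, rng=None):
    """Per-region LP of (8)-(9) plus randomized rounding (sketch).

    a, b : (M,) intra- and inter-class objective vectors of (5).
    alpha : maximum number of active features; beta in [0, 1].
    Returns a binary feature indicator vector for this region.
    """
    rng = np.random.default_rng() if rng is None else rng
    M = len(a)
    ones = np.ones(M)
    # Constraints 1 <= 1^T f <= alpha written as A_ub f <= b_ub.
    A_ub = np.vstack([ones, -ones])
    b_ub = np.array([alpha, -1.0])

    # (8): epsilon_max = max b^T f  (linprog minimizes, so negate b).
    res8 = linprog(-b, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * M, method="highs")
    eps = beta * (b @ res8.x)

    # (9): min a^T f subject additionally to b^T f >= beta * epsilon_max.
    A9 = np.vstack([A_ub, -b])
    b9 = np.append(b_ub, -eps)
    res9 = linprog(a, A_ub=A9, b_ub=b9, bounds=[(0, 1)] * M, method="highs")
    f_relaxed = res9.x

    # Randomized rounding: sample binary vectors with P(f_m = 1) = f_relaxed[m],
    # keep the feasible sample with the smallest objective value a^T f.
    best, best_val = None, np.inf
    for _ in range(n_rounds):
        r = (rng.random(M) < f_relaxed).astype(float)
        if 1 <= r.sum() <= alpha and b @ r >= eps and a @ r < best_val:
            best, best_val = r, a @ r
    # Fallback (an assumption, not from the paper): simple deterministic rounding.
    return best if best is not None else np.round(f_relaxed)
```

In Algorithm 1 this routine would be called for each representative point and for each candidate $\beta$, with the winning $\beta$ chosen by the local leave-one-out criterion of Section 3.2.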
3.2 Class Similarity Measurement

A consequence of the localized feature selection approach is that, since there is no common set of features across the sample space, conventional classifiers are inappropriate. We now discuss how to build a classifier for the localized scenario. The proposed classifier structure is based on measuring the similarity of a query datum $\mathbf{x}^{q}$ to a specific class using the optimal feature sets specified by $\{\bar{\mathbf{f}}^{(i)}\}_{i=1}^{N}$.

The proposed method assumes that the sample space consists of $N$ (possibly overlapping) regions, where each region is characterized by its representative point $\mathbf{x}^{(i)}$, its class label $y^{(i)}$ and its optimal feature set $\bar{\mathbf{f}}^{(i)}$. We define each region to be a hyper-sphere $Q^{(i)}$ in the co-ordinate system defined by $\bar{\mathbf{f}}^{(i)}$, which is centered at $\mathbf{x}_p^{(i)}$. The radius of $Q^{(i)}$ is determined such that the "impurity level" within $Q^{(i)}$ is less than the parameter $\gamma$. The "impurity level" is the ratio of the normalized number of samples with differing class label to the normalized number of samples with the same class label. In all our experiments, $\gamma$ is fixed at the value 0.2.

To assess the similarity $S_{Y_\ell}(\mathbf{x}^q)$ of a query datum $\mathbf{x}^q$ to class $Y_\ell \in \mathcal{Y}$, we measure the similarity of $\mathbf{x}^q$ to all regions whose class label is $Y_\ell$. To this end we define a set of binary variables $s_i(\mathbf{x}^q), i = 1, \ldots, N$, such that $s_i(\mathbf{x}^q)$ is set to 1 if $\mathbf{x}^q \in Q^{(i)}$ and the class label of the nearest neighbor of $\mathbf{x}^q$ is $y^{(i)}$; otherwise it is set to 0. The variable $s_i(\mathbf{x}^q)$ may be interpreted as a weak classifier which indicates the similarity of $\mathbf{x}^q$ to the $i$th region. The similarity $S_{Y_\ell}(\mathbf{x}^q)$ is then obtained as follows:

$$S_{Y_\ell}(\mathbf{x}^q) = \frac{\sum_{i \in \mathcal{Y}_\ell} s_i(\mathbf{x}^q)}{h_\ell}, \tag{11}$$

where $\mathcal{Y}_\ell$ indicates the set of all regions whose class label is $Y_\ell$, and the cardinality of $\mathcal{Y}_\ell$ is $h_\ell$. After computing the similarity of $\mathbf{x}^q$ to all classes, the class label of $\mathbf{x}^q$ is the one which provides the largest similarity.

If a query sample $\mathbf{x}^q$ does not fall in any of the $Q^{(i)}$s, our desire is to assign its class as the class label of the nearest sample to $\mathbf{x}^q$. The question is "what co-ordinate system should be used to determine the nearest neighboring sample?". To address this matter, we use a majority voting procedure over the class labels within the set of all nearest neighboring samples. This nearest neighbor set consists of those samples which have the nearest distances to the query datum as measured over each of the $N$ local co-ordinate systems. The number of votes for each class is normalized to the number of samples within that class. It is to be noted that, on the basis of our experiments, the percentage of such a situation occurring is very rare (only 0.03 percent).

We now discuss a method for determining a suitable value for $\beta$ (which corresponds to the selection of a suitable point in the Pareto set). We examine different values of $\beta \in [0, 1]$ in increments of 0.05. For each value, we solve (9) followed by the randomized rounding process. This determines the candidate local co-ordinate system for the respective value of $\beta$, i.e., $\bar{\mathbf{f}}_\beta^{(i)}$, and therefore specifies the candidate $Q^{(i)}$ and the weak classifier $s_i$. The corresponding local clustering performance may then be determined using a leave-one-out cross-validation procedure, using the respective weak classifier results over the training samples situated within the corresponding $Q^{(i)}$ as a criterion of performance. The Pareto optimal point corresponding to the value of $\beta$ which yields the best local performance is then selected as the binary solution $\bar{\mathbf{f}}^{(i)}$ at the current iteration (see line 11 of Algorithm 1).

3.3 Discussion about Overfitting

In the following we discuss the overfitting issue with the proposed method. Let the available feature pool be denoted by the set $\mathcal{X}$. Let us consider the idealized scenario where, for each localized region, we can partition $\mathcal{X}$ into two disjoint sets $\mathcal{X}_R^{(i)}$ and $\mathcal{X}_I^{(i)}$ such that $\mathcal{X}_R^{(i)} \cup \mathcal{X}_I^{(i)} = \mathcal{X}, i = 1, \ldots, N$. The sets $\mathcal{X}_R^{(i)}$ and $\mathcal{X}_I^{(i)}$ contain only the relevant and irrelevant features, respectively. Let $h_R^{(i)}$ denote the cardinality of $\mathcal{X}_R^{(i)}$.

For the time being, let us consider the hypothetical situation where $\alpha = h_R^{(i)}$. We note that "relevant" features are those which encourage local clustering behaviour, which is quantified by the optimization problem of (9). We therefore make the assumption that all features in $\mathcal{X}_R^{(i)}$ are sufficiently relevant to be selected as local features by the proposed procedure; i.e., with high probability, they are the solution to (9), followed by the randomized rounding process. If we now let $\alpha$ grow above the value $h_R^{(i)}$, features in $\mathcal{X}_I^{(i)}$ become candidates for selection. Because these features do not encourage clustering, with high probability they must be given a low $f$-value in order to satisfy the optimality of (9). Thus there is a low probability that any feature in $\mathcal{X}_I^{(i)}$ will be selected by the randomized rounding procedure. We recall that any solution selected by the randomized rounding procedure must also satisfy the constraints; therefore, such a solution remains feasible, due to the inequality constraint involving $\alpha$ in (9). Therefore, in this idealized scenario, we see that as $\alpha$ grows, the number of selected features tends to saturate at the value $h_R^{(i)}$.

In the more practical scenario, the features may not separate so cleanly into the relevant and irrelevant groups as we have assumed, with the result that "partially relevant" features may continue to be selected as $\alpha$ grows. Therefore the risk of overfitting is not entirely eliminated for real data sets. Nevertheless, as we demonstrate in Section 4, a saturation effect of the number of selected features in real data scenarios is clearly evident.

In summary, the LFS algorithm inherently tends to select only relevant features and to reject irrelevant features. This imposes a limit on the number of selected features. Thus the LFS method tends to be immune to the overfitting problem. This behavior is in contrast to that of current feature selection methods, which inherently do not penalize over-estimation of the number of selected features.

Further, the proposed algorithm deals with the effect of outlier training samples through the aggregation process of (11), where the final decision is based on the average of the "weak classifier" results $s_i(\mathbf{x}^q), i = 1, \ldots, N$. Since $s_i(\mathbf{x}^q)$ is either 0 or 1, if the number of outlier samples in each class is much smaller than the number of true samples within that class, as in well-behaved classification problems, then the effect of outlier samples on the final classification result is diminished.
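The aggregation of (11) referred to above reduces to a short loop over regions. The following is an illustrative Python sketch (not the authors' code); it assumes the hyper-sphere radii have already been fixed from the impurity criterion $\gamma$, and it omits the majority-voting fallback for queries that fall outside every region.

```python
import numpy as np

def class_similarity(xq, X, y, F, radii):
    """Similarity S_Y(xq) of (11) for every class label.

    xq : (M,) query datum            X : (N, M) training samples
    y  : (N,) class labels           F : (N, M) binary feature sets
    radii : (N,) radii of the hyper-spheres Q^(i)  (assumed precomputed)
    """
    N = len(y)
    s = np.zeros(N)                                    # weak classifiers s_i(xq)
    for i in range(N):
        f = F[i].astype(bool)                          # region i's local co-ordinate system
        d = np.linalg.norm(X[:, f] - xq[f], axis=1)    # distances of xq to all training samples
        inside = np.linalg.norm(xq[f] - X[i, f]) <= radii[i]
        if inside and y[int(np.argmin(d))] == y[i]:    # nearest neighbour agrees with region label
            s[i] = 1.0
    # Aggregate per class: sum of weak classifiers normalized by h_l, as in (11).
    return {c: s[y == c].sum() / np.sum(y == c) for c in np.unique(y)}
```

The predicted label is then simply the class with the largest similarity, e.g., `max(sims, key=sims.get)`.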
4 EXPERIMENTAL RESULTS

The performance of the proposed method is demonstrated by performing a large-scale experiment on one synthetic data set and ten real-world binary classification problems, and is compared against six well-known and state-of-the-art feature selection algorithms: FDA [12], Simba [24], mRMR [7], KCSM [9], Logo [23] and DEFS [29].

As shown in Fig. 3, the synthetic data set is distributed in a two-dimensional feature space where the class $Y_1$ data are split into two discrete clusters. The features $x_1$ and $x_2$ of all three subclasses are drawn from Normal distributions with unit variances. Besides the two relevant features $x_1$ and $x_2$, following [9], each sample is artificially contaminated by adding a varying number of irrelevant features, ranging in number from 1 to 30,000, as a means of testing the capability of the proposed method to detect only the most relevant features. The number 30,000 is deemed to be a reasonable upper limit for most scientific applications [23]. The artificial irrelevant features are independently sampled from a Gaussian distribution with zero mean and unit variance.

Fig. 3. Illustration of the synthetic data set in terms of its relevant features $x_1$ and $x_2$, after the feature values are transformed into their z-scores.

Details of the real-world data sets are summarized in Table 2. The total number of available samples in each case is the sum of the entries in columns 2 (# train) and 3 (# test). Following [23], the performance of the various feature selection algorithms on each data set is evaluated using a bootstrapping algorithm. To this end, each algorithm is run 10 times on each data set. For each run, the number of data points shown in column 2 of Table 2 is randomly selected to be the training set, and the remaining samples (whose number is indicated in the third column of Table 2) are used as test samples for that run. The average performance over all 10 runs is recorded. For a fair comparison between feature selection algorithms, the training and test sets for each run are common to all algorithms.

To increase the challenge of the classification problem, following [23], the sets of original features of the data sets "Sonar", "DNA", "Breast", "Adult" and "ARR" have been augmented by 100 irrelevant features, independently sampled from a zero-mean and unit-variance Gaussian distribution. The data sets "ALLAML", "Prostate", "Duke-breast", "Leukemia" and "Colon" are microarray data sets where, in each case, the number of features is significantly larger than the number of samples. Each feature variable in the synthetic data set and the real-world data sets has been transformed to its z-score values. These real data sets represent applications where expensive feature selection methods, such as an exhaustive search, cannot be used directly.

The code for our comparison feature selection methods is available on the respective authors' websites, with the exception of KCSM, which was obtained directly from the author. The default settings for each algorithm are used. In the case of the Simba algorithm, following [23], a nonlinear sigmoid activation function is used with the sigmoid parameter set to 1.

Apart from the parameter $\alpha$, which is analogous to the number of selected features in our comparison feature selection algorithms, the proposed method has two additional user-defined parameters: the number of iterations $t$ (see Section 3.1.4) and the level of impurity $\gamma$ (see Section 3.2). Generally, these parameters can be estimated through cross validation and tuned for each data set to provide the most accurate classification results. However, in this case, for a fair comparison, they are not tuned and are set respectively to 2 and 0.2, i.e., the default values. These values are fixed during all our experiments on all data sets.

The proposed algorithm is implemented in MATLAB and executed on a desktop with an Intel Core i7-2600 CPU @ 3.4 GHz and 16 GB RAM.

4.1 Classification Accuracy

Since the comparison feature selection algorithms do not inherently incorporate a classifier, an SVM classifier with an RBF kernel is used to estimate the classification accuracy corresponding to the features selected by our comparison feature selection algorithms on each data set. To this end, after performing feature selection on the training samples, an SVM classifier with the top-$t$ selected features is trained with the training data and tested on the test data. Default values for the SVM classifier are used for both the training and the test phase. In our experiments, $t$ ranges from 1 to 30 since there is no performance improvement for larger values, with the exception of the data set "Adult". Here the minimum error rate for all methods, except LFS and Logo, occurs outside this range and is equal to the case where no feature selection is performed, i.e., when all candidate features are selected. The performance in this case is 24.65 percent (see the last column of Table 1 for the Adult data set). For a fair comparison, the parameter $\alpha$ of the proposed LFS method is likewise restricted to the range from 1 to 30.
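The data preparation and bootstrap protocol described above (augmentation with irrelevant Gaussian features, z-scoring, and a random train/test split per run) can be mimicked with a few lines of code. This is an illustrative sketch under stated assumptions, not the authors' experimental script; function and variable names are hypothetical.

```python
import numpy as np

def prepare_run(X, y, n_train, n_irrelevant=100, rng=None):
    """One bootstrap run: append irrelevant N(0,1) features, z-score every
    feature, and draw a random train/test split of the required size."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((X.shape[0], n_irrelevant))       # irrelevant features
    Xa = np.hstack([X, noise])
    Xa = (Xa - Xa.mean(axis=0)) / (Xa.std(axis=0) + 1e-12)        # z-score each feature
    idx = rng.permutation(X.shape[0])
    train, test = idx[:n_train], idx[n_train:]
    return Xa[train], y[train], Xa[test], y[test]
```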
TABLE 1
Minimum Classification Error (in Percent) and Standard Deviation (in Percent) of the Different Algorithms

Data set LFS FDA Simba mRMR KCSM Logo DEFS SVM (no feature selection)
Sonar 22.87(3.92) 26.11(4.29) 25.24(3.67) 28.70(2.61) 26.85(3.49) 26.75(3.44) 27.81(6.67) 49.90(4.81)
DNA 13.41(1.88) 13.94(2.71) 14.43(4.77) 13.75(2.96) 35.95(17.04) 15.35(5.73) 18.68(5.02) 49.70(2.04)
Breast 6.37(1.33) 7.71(1.92) 8.89(1.28) 8.29(2.19) 7.73(1.63) 8.25(1.36) 11.01(2.49) 37.61(0.60)
Adult 22.27(1.46) 24.65(0.33) 24.65(0.67) 24.75(7.58) 24.85(1.10) 24.53(7.85) 26.37(2.05) 24.65(0.33)
ARR 33.06(2.60) 46.68(9.95) 33.53(6.46) 31.59(3.25) 33.34(9.039) 33.93(5.32) 31.42(4.73) 43.68(1.19)
ALLAML 1.66(3.51) 5.00(4.30) 25.50(16.49) 7.50(8.28) 32.50(17.76) 7.50(7.29) 14.66(10.50) 38.33(15.81)
Prostate 4.16(4.39) 6.66(6.57) 12.66(8.79) 8.33(7.58) 33.33(16.66) 8.33(7.85) 13.66(9.56) 57.50(10.72)
Duke-breast 10.83(7.90) 17.50(8.28) 30.83(12.86) 21.66(5.82) 33.33(16.19) 21.66(11.91) 26.66(14.61) 63.33(10.54)
Leukemia 3.33(4.30) 5.00(5.82) 12.00(8.06) 5.00(5.82) 32.50(13.86) 6.66(5.27) 16.83(10.85) 35.83(14.19)
Colon 9.16(0.08) 11.66(8.95) 34.50(13.58) 19.16(5.62) 12.50(9.00) 20.83(10.57) 26.66(14.12) 36.66(17.21)
Average 12.71 16.49 22.22 16.87 27.29 17.38 21.38 43.72

Standard deviations are presented in parentheses.

The minimum classification error and the corresponding standard deviation, as determined by the bootstrapping procedure described earlier, are presented in Table 1. For reference, the classification error rate of the SVM classifier applied to the data sets without any feature selection is also reported in the last column of Table 1. Since the performance in this case is generally very low, this result implies that, without feature selection, classification suffers from the presence of irrelevant features and the curse of dimensionality [43]. The best result for each data set is shown in bold. Among the seven algorithms, the proposed LFS algorithm yields the best results in nine out of the 10 data sets. The last row shows the classification error rates averaged over all data sets. This row indicates that the proposed LFS method performs noticeably better on average than the other seven algorithms.

TABLE 2
Characteristics of the Real-World Data Sets Used in the Experiments

Data set          # Train  # Test  # Features (M)
Sonar [28]            100     108        60 (100)
DNA [9]               100   3,086       180 (100)
Breast [28]           100     469        30 (100)
Adult [39]            100   1,505       119 (100)
ARR [40]              100     320       278 (100)
ALLAML [41]            60      12           7,129
Prostate [41]          90      12           5,966
Duke-breast [42]       30      12           7,129
Leukemia [28]          60      12           7,070
Colon [40]             50      12           2,000

The number of artificially added irrelevant features is indicated in parentheses.

4.2 Iterative Weight Definition and Correct Feature Selection

As illustrated in Fig. 3, the distribution of class $Y_1$ of the synthetic data set has two disjoint subclasses, whereas class $Y_2$ is a compact class with one mode. Samples of the subclass marked '+' in Fig. 3 can be discriminated from samples of class $Y_2$ using only the relevant feature $x_1$. In a similar way, samples of the other subclass of $Y_1$ require only $x_2$, whereas samples of class $Y_2$ require both $x_1$ and $x_2$. The results of applying the proposed method to the synthetic data set over four successive iterations are shown in Fig. 4, where the samples have been contaminated with additional irrelevant features ranging in number from 1 to 30,000. Each point shows the percentage of samples for which the expected feature(s) (i.e., $x_1$ for samples within subclass '+', $x_2$ for samples within the other subclass of $Y_1$, and $\{x_1, x_2\}$ for samples within class $Y_2$) are correctly identified. It can be seen that the performance is refined from one iteration to another, especially for a higher number of irrelevant features. The most significant improvement happens at the second iteration; hence, as mentioned previously, the default value of $t$ is set to 2.

Fig. 4. Percentage of correct feature selection over four successive iterations of the proposed algorithm for the synthetic data set, where the samples are contaminated with a varying number of irrelevant features. The parameter $\alpha$ is set to 2.

The data set "DNA" has a "ground truth", in that much better performance has been previously reported if the selected features are those with indexes in the interval between 61 and 120 [9], [44]. This observation provides a good means of evaluating LFS performance on a real-world data set. Fig. 5 shows the result of applying the proposed LFS method to the data set "DNA", where the height of each feature index indicates the percentage of representative points which select these ground-truthed features as a member of their optimal feature set. These results demonstrate that the proposed method mostly identifies features with indexes from 61 to 105. Thus they are well matched to the "ground truth". The proposed method also performs very well in discarding the artificially added irrelevant features, i.e., features with indexes from 181 to 280.
Fig. 5. Selected features for the "DNA" data set. The height corresponding to each feature index indicates what percentage of representative points select the respective feature as a discriminative feature, where $\alpha$ is set to a typical value of 5.

4.3 Sensitivity of the Proposed Method to $\alpha$ and $\gamma$

To show the sensitivity of the proposed method to the parameter $\alpha$, the classification error rate and the cardinality of the optimal feature sets (averaged over all $N$ sets) versus $\alpha$ for the data set "Sonar" are respectively shown in Figs. 6 and 7, where $\alpha$ ranges from 1 to the maximum possible value of $M = 160$. These results demonstrate the robustness of the proposed LFS algorithm against overfitting, as discussed in Section 3.3.

Fig. 6. Classification error rate of the proposed method for the data set "Sonar", where the parameter $\alpha$ ranges from 1 to the maximum possible value of $M = 160$.

Fig. 7. Averaged cardinality of the optimal feature sets $\bar{\mathbf{f}}^{(i)}, i = 1, \ldots, N$, versus the parameter $\alpha$, where $\alpha$ ranges from 1 to the maximum possible value of $M = 160$.

Note that estimating an appropriate value for the number of selected features is generally a challenging issue. This is usually estimated using a validation set or based on prior knowledge, which may not be available in some applications. As can be seen, the proposed LFS algorithm is not too sensitive to this parameter. Moreover, as illustrated in Fig. 7, the cardinality of the optimal feature sets saturates for a sufficiently large value of $\alpha$.

The error rate of the proposed method versus the impurity level parameter $\gamma$ for the data set "Colon" is shown in Fig. 8, where $\gamma$ ranges from 0 to 1. Small (large) values of $\gamma$ can be interpreted as a small (large) radius of the hyper-spheres. This demonstrates that the error rate is not too sensitive to a wide range of values of $\gamma$. As one may intuitively guess, we found that impurity levels in the range of 0.1 to 0.4 are appropriate. As mentioned previously, throughout all our experiments $\gamma$ is set to 0.2 without tuning. This value is seen to work well over all data sets.

Fig. 8. Classification error rate of the proposed method for the data set "Colon", where the parameter $\gamma$ ranges from 0 to 1.

4.4 Overlapping Feature Sets?

The reader may be interested to know if there is any overlap between the optimal feature sets of the representative points. To answer this question, the normalized histogram over all feature sets for the data set "ALLAML" is shown in Fig. 9, where the parameter $\alpha$ is set to a typical value of 5. The height of each feature index indicates what percentage of representative points select the respective feature. As expected, there is some overlap between region-specific feature sets, but it is evident that there does not appear to be one common feature set that works well over all regions of the sample space. Indeed, with the proposed method and these results, we assert that the assumption of a common feature set over the entire sample space is not necessarily optimal in real-world applications.

Fig. 9. Selected features for the "ALLAML" data set. The height of each feature index indicates what percentage of representative points select the respective feature as a discriminative feature, where $\alpha$ is set to the typical value of 5.

The most common features may be interpreted as the most informative features in terms of classification accuracy over the sample space. The less common features may be interpreted as being informative features, but only relevant for a small group of samples; e.g., in the context of biology/genetics applications, the less common features may be interpreted as being important in the discrimination of some small sub-population of samples.

One may be interested to know the classification accuracy in the context of a global selection scheme; i.e., we select the top five dominant features from Fig. 9 as produced by the LFS method, and then feed them into an SVM classifier. Using such a feature set, the error rate is 6.66 percent, which is in the range of that of the other methods, but nevertheless significantly greater than the error rate (1.66 percent) corresponding to the proposed LFS region-specific feature selection method (see Table 1). This example is a further demonstration of the effectiveness of modeling the feature space locally.
Fig. 10. Histogram of distances between the relaxed solutions and their corresponding binary solutions for the data set "Prostate", where $\alpha$ is set to the typical value of 5.

4.5 How Far Is the Binary Solution from the Relaxed One?

To demonstrate that the relaxed solutions are a proper approximation of the final binary solutions obtained from the randomized rounding process explained in Section 3.1, the normalized histogram over the $\ell_1$-norm distances between the relaxed solutions and their corresponding binary solutions is shown in Fig. 10. The height of each bar indicates what fraction of the representative points have the corresponding value as their $\ell_1$-norm distance. The $\ell_1$-norm distances are normalized relative to the data dimension $M$. As may be seen, the relaxed solutions are appropriate approximations of the binary solutions.

4.6 CPU Time

The computational complexity of computing a feature set for each representative point depends mainly on the data dimension. Fig. 11 shows the CPU time taken by the proposed method (using MATLAB) to perform feature selection for one representative point on the synthetic data set, with the number of irrelevant features ranging from 1 to 30,000. As may be seen, the figure shows linear complexity of the LFS method with respect to the feature dimensionality.

Fig. 11. The CPU time (seconds) taken by the proposed algorithm to perform feature selection for one representative point $\mathbf{x}^{(i)}$ with a given $\beta$ on the synthetic data set, where the parameter $\alpha$ is set to 2.

Note that the feature selection process for each representative point is independent of the others and can be performed in parallel. For instance, in the case of a data set with 100 training samples (i.e., $N = 100$) and 10,000 features (i.e., $M = 10{,}000$) on a typical desktop computer with 12 cores, the required processing time in the training phase is almost 25 seconds. Note again that this is the training-phase time, which is performed off-line. On the other hand, the test phase only involves testing whether the query datum is contained within the specified hyper-spheres and determining the class label of its nearest neighbors. This is much faster than the training process, since it requires no optimization. In our experiments, the test phase is typically performed in a fraction of a second.

5 CONCLUSIONS

In this paper we present an effective and practical method for local feature selection for application to the data classification problem. Unlike most feature selection algorithms, which pick a "global" subset of features that is most representative for the given data set, the proposed algorithm instead picks "local" subsets of features that are most informative for the small region around each data point. The cardinality and identity of the feature sets can vary from data point to data point. The process of computing a feature set for each region is independent of the others and can be performed in parallel.

The LFS procedure is formulated as a linear program, which has the advantage of convexity and efficient implementation. The proposed algorithm is shown to perform well in practice, compared to previous state-of-the-art feature selection algorithms. The performance of the proposed algorithm is insensitive to the underlying distribution of the data. Furthermore, we have demonstrated that the method is relatively invariant to an upper bound on the number of selected features, and so is robust against the overfitting phenomenon.

ACKNOWLEDGMENTS

The authors wish to acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and MITACS.

REFERENCES

[1] I. K. Fodor, "A survey of dimension reduction techniques," Lawrence Livermore National Laboratory, Tech. Rep. UCRL-ID-148494, 2002.
[2] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4-37, Jan. 2000.
[3] P. Langley, Selection of Relevant Features in Machine Learning. Defense Technical Information Center, 1994.
[4] A. R. Webb, Statistical Pattern Recognition. Hoboken, NJ, USA: Wiley, 2003.
[5] I. Jolliffe, Principal Component Analysis. Hoboken, NJ, USA: Wiley, 2005.
[6] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
[7] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[8] H.-L. Wei and S. A. Billings, "Feature subset selection and ranking for data dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 162-166, Jan. 2007.
[9] L. Wang, "Feature selection with kernel class separability," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 9, pp. 1534-1546, Sep. 2008.
[10] H. Zeng and Y.-M. Cheung, "Feature selection and kernel learning for local learning-based clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1532-1547, Aug. 2011.
[11] N. Kwak and C.-H. Choi, "Input feature selection for classification problems," IEEE Trans. Neural Netw., vol. 13, no. 1, pp. 143-159, Jan. 2002.
[12] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ, USA: Wiley, 2001.
[13] A. Hyvärinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Netw., vol. 13, no. 4, pp. 411-430, 2000.
[14] J. B. Tenenbaum, V. De Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[15] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comput., vol. 15, no. 6, pp. 1373-1396, 2003.
[16] D. L. Donoho and C. Grimes, "Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data," Proc. Nat. Acad. Sci., vol. 100, no. 10, pp. 5591-5596, 2003.
[17] Y. M. Lui and J. R. Beveridge, "Grassmann registration manifolds for face recognition," in Proc. 10th Eur. Conf. Comput. Vis., 2008, pp. 44-57.
[18] X. He, D. Cai, S. Yan, and H.-J. Zhang, "Neighborhood preserving embedding," in Proc. 10th IEEE Int. Conf. Comput. Vis., 2005, vol. 2, pp. 1208-1213.
[19] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton, "Pattern classification using a mixture of factor analyzers," in Proc. 1999 IEEE Signal Process. Soc. Workshop Neural Netw. Signal Process. IX, 1999, pp. 525-534.
[20] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 1, no. 1, pp. 24-45, Jan. 2004.
[21] Y. Cheng and G. M. Church, "Biclustering of expression data," in Proc. Int. Conf. Intell. Syst. Mol. Biol., 2000, vol. 8, pp. 93-103.
[22] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2001, pp. 269-274.
[23] Y. Sun, S. Todorovic, and S. Goodison, "Local-learning-based feature selection for high-dimensional data analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1610-1626, Sep. 2010.
[24] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Margin based feature selection: Theory and algorithms," in Proc. 21st Int. Conf. Mach. Learn., 2004, p. 43.
[25] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. 9th Int. Workshop Mach. Learn., 1992, pp. 249-256.
[26] Z. Liu, W. Hsiao, B. L. Cantarel, E. F. Drabek, and C. Fraser-Liggett, "Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data," Bioinformatics, vol. 27, no. 23, pp. 3242-3249, 2011.
[27] N. Armanfard and J. P. Reilly, "Classification based on local feature selection via linear programming," in Proc. IEEE Int. Workshop Mach. Learn. Signal Process., 2013, pp. 1-6.
[28] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, "Conditional likelihood maximisation: A unifying framework for information theoretic feature selection," J. Mach. Learn. Res., vol. 13, pp. 27-66, 2012.
[29] R. N. Khushaba, A. Al-Ani, and A. Al-Jumaily, "Feature subset selection using differential evolution and a statistical repair mechanism," Expert Syst. Appl., vol. 38, no. 9, pp. 11515-11526, 2011.
[30] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," in Proc. Eur. Conf. Mach. Learn., 1994, pp. 171-182.
[31] Y. Sun, "Iterative RELIEF for feature weighting: Algorithms, theories, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1035-1051, Jun. 2007.
[32] B. Chen, H. Liu, J. Chai, and Z. Bao, "Large margin feature weighting method via linear programming," IEEE Trans. Knowl. Data Eng., vol. 21, no. 10, pp. 1475-1488, Oct. 2009.
[33] B. Liu, B. Fang, X. Liu, J. Chen, and Z. Huang, "Large margin subspace learning for feature selection," Pattern Recog., vol. 46, pp. 2798-2806, 2013.
[34] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[35] M. T. Thai, "Approximation algorithms: LP relaxation, rounding, and randomized rounding techniques," Lecture Notes, University of Florida, 2013.
[36] A. Souza, "Randomized algorithm & probabilistic methods," Lecture Notes, Humboldt University of Berlin, 2001.
[37] C. L. Hwang, A. S. M. Masud, et al., Multiple Objective Decision Making: Methods and Applications. New York, NY, USA: Springer, 1979, vol. 164.
[38] G. Mavrotas, "Effective implementation of the ε-constraint method in multi-objective mathematical programming problems," Appl. Math. Comput., vol. 213, no. 2, pp. 455-465, 2009.
[39] K. Bache and M. Lichman, UCI Machine Learning Repository, 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[40] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proc. Nat. Acad. Sci., vol. 96, no. 12, pp. 6745-6750, 1999.
[41] F. Nie, H. Huang, X. Cai, and C. H. Ding, "Efficient and robust feature selection via joint l2,1-norms minimization," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1813-1821.
[42] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins, "Predicting the clinical status of human breast cancer by using gene expression profiles," Proc. Nat. Acad. Sci., vol. 98, no. 20, pp. 11462-11467, 2001.
[43] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Rand Corporation, 1962.
[44] G. John, DNA dataset (Statlog version): Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. [Online]. Available: https://www.sgi.com/tech/mlc/db/DNA.names

Narges Armanfard received the MSc degree in electrical engineering from Tarbiat Modares University, Tehran, Iran, in 2008. She is currently working towards the PhD degree in electrical engineering at McMaster University, Hamilton, Ontario, Canada. Since 2012, she has been a research assistant and also a teaching assistant in the Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada. Her research interests include signal, image, and video processing, specifically EEG and ECG signal analysis, machine learning, machine vision, video surveillance, and document image processing.

James P. Reilly (S'76-M'80) received the BASc degree from the University of Waterloo, Waterloo, ON, Canada, in 1973, and the MEng and PhD degrees from McMaster University, Hamilton, ON, Canada, in 1977 and 1980, respectively, all in electrical engineering. He was employed in the telecommunications industry for a total of seven years and was then appointed to the Department of Electrical & Computer Engineering at McMaster University in 1985 as an associate professor. He was promoted to the rank of full professor in 1992. He has been a visiting academic at the University of Canterbury, New Zealand, and the University of Melbourne, Australia. His research interests include several aspects of signal processing, specifically machine learning, EEG signal analysis, Bayesian methods, blind signal separation, blind identification, and array signal processing. He is a registered professional engineer in the province of Ontario.

Majid Komeili received the MSc degree in electrical engineering from Tarbiat Modares University, Tehran, Iran, in 2008. He is currently working toward the PhD degree in electrical engineering at the University of Toronto, Toronto, Ontario, Canada. Since 2012, he has been a research assistant and also a teaching assistant in the Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto, Toronto, Ontario, Canada. His research interests include signal, image, and video processing, specifically ECG and EEG signal analysis, machine learning, machine vision, video surveillance, and document image processing.
