Feature Selection and Its Use in Big Data
ABSTRACT Feature selection has been an important research area in data mining; it chooses a subset of relevant features for use in model building. This paper aims to provide an overview of feature selection methods for big data mining. First, it discusses the current challenges and difficulties faced when mining valuable information from big data. A comprehensive review of existing feature selection methods in big data is then presented. Herein, we approach the review from two aspects: methods specific to particular kinds of big data with certain characteristics, and applications of these methods in classification analysis. This approach distinguishes our review significantly from existing review work. This paper also highlights the current issues of feature selection in big data and suggests future research directions.
feature set [13]. Random or noisy features often prevent a classifier from learning correct correlations, and redundant or correlated features increase the complexity of a classifier without adding any useful information to it [14]. A variety of feature selection methods, such as filter, wrapper, and embedded approaches [13], [15], have been developed.

As mentioned above, scalability is a major issue in big data processing systems. Enormous redundancy and irrelevance account for much of it, not only consuming computing resources but also degrading processing performance. If this useless information can be removed while valuable clues are retained, the dimension of big data will be greatly lowered; consequently, both the computational efficiency and the processing performance of big data will improve. Studying feature selection approaches for big data, so as to obtain a feature subset with superior separability, is therefore of considerable necessity.

Recently, some researchers have applied these methods to high-dimensionality domains, such as DNA microarray analysis [16]–[19], text classification [20]–[23], information retrieval [24]–[26], and web mining [27]–[29]. Online feature selection methods have also been applied to streaming data [30], and valuable information has been extracted from noisy data [31], albeit on a small scale and with a huge dimension [32], [33].
B. CHALLENGES OF FEATURE SELECTION
Compared to traditional data, some influential points need to be highlighted when extracting valuable information from big data. Taking the 3V characteristics into consideration, traditional feature selection methods face the following threefold challenges in the case of big data: (1) traditional feature selection methods usually require large amounts of learning time, so it is hard for their processing speed to catch up with the change of big data; (2) big data generally not only include an immense amount of irrelevant and/or redundant features, but also contain noises of different degrees and types, which greatly increases the difficulty of selecting features; (3) some data are unreliable or forged, owing to different means of acquisition or even to loss, which further increases the complexity of feature selection.

Due to the properties of big data, existing feature selection methods face demanding challenges in a variety of phases, e.g., the speed of data processing, tracing concept drifts, and dealing with incomplete and/or noisy data. Thus, studying pertinent feature selection methods for big data is of considerable urgency. However, the available methods are extremely specific, and how to extract valuable information from big data by tackling and analyzing them is still an open issue.

Apart from our review, Bolón-Canedo et al. [12] presented a review of feature selection in the context of big data, which mainly describes the available feature selection methods classified by practical applications and future needs [12]. Unlike their work, we aim to review and compare the studies to date regarding the threefold challenges mentioned above, with an analysis of possible challenges and trends in future research. Additionally, we discuss the applications of feature selection methods to several specific kinds of data and to classification analysis.

The structure of our paper is as follows: Section II looks back at feature selection methods for traditional data. Next, the available feature selection methods and the difficulties of processing big data are analyzed in Section IV. Section VI summarizes the paper and provides several promising directions for further research.

II. BASIC FEATURE SELECTION FRAMEWORK
Feature selection, also known as variable selection, attribute selection or variable subset selection, is a data mining technique that targets selecting, from the whole feature set, an optimal subset of features that renders the best performance in terms of well-defined criteria. Here, a feature refers to an attribute of the data that represents the data in a certain aspect. Since feature selection performs well in simplifying the model, shortening training times, and reducing the variance of the model, researchers can interpret and understand the pattern of the data model more easily by using feature selection. Yu et al. [34] pointed out that a good feature selection method should be capable of selecting features with a high degree of correlation to the class attribute and of achieving optimal classification results.

A. FEATURE SELECTION FRAMEWORK
A feature selection method can be divided into two parts, i.e., a feature subset selection technique, which accounts for how to select features from the original entire set, and a feature subset evaluation technique, which determines how to evaluate the feature subsets [14], [35]. The process of feature selection is shown in Algorithm 1 and Fig. 1.

Algorithm 1 The process of feature selection
1: input the original dataset, X;
2: while the termination condition is not met do
3: generate the feature subset, F, by searching strategies;
4: evaluate the feature subset, F, by evaluation criteria;
5: end while
6: return F;

In Algorithm 1, the feature subset can be generated by searching strategies (Line 3), such as random search, the stepwise addition or deletion of features, and heuristic search methods. After a feature subset F has been obtained, its performance must be assessed (Line 4). Figure 1 depicts feature selection as a kind of learning method which aims to find the appropriate variable subset for users.

FIGURE 1. The flowchart of feature selection.
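To make the loop of Algorithm 1 concrete, the following minimal Python sketch pairs one possible search strategy (randomly generated candidate subsets) with one possible evaluation criterion (cross-validated accuracy of a k-nearest-neighbor classifier). The dataset, classifier, and iteration budget are illustrative assumptions, not choices prescribed by the framework.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def feature_selection(X, y, n_iter=50, seed=0):
    """Algorithm 1: repeatedly generate and evaluate feature subsets."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    best_subset, best_score = None, -np.inf
    for _ in range(n_iter):  # termination condition: a fixed iteration budget
        # Line 3: generate a candidate subset F by a (random) search strategy.
        mask = rng.random(n_features) < 0.5
        if not mask.any():
            continue
        # Line 4: evaluate F by an evaluation criterion (here, CV accuracy).
        score = cross_val_score(KNeighborsClassifier(),
                                X[:, mask], y, cv=5).mean()
        if score > best_score:
            best_subset, best_score = mask, score
    return best_subset, best_score  # Line 6: return F

X, y = load_breast_cancer(return_X_y=True)
subset, acc = feature_selection(X, y)
print(f"selected {subset.sum()} of {X.shape[1]} features, CV accuracy {acc:.3f}")
```

Swapping the random generator for a heuristic search, or the cross-validation criterion for a classifier-free measure, reproduces the other families discussed below.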
1) BENEFITS FROM FEATURE SELECTION
The basic idea of using feature selection is to obtain a new dataset with neither redundant features nor irrelevant features,
as well as containing the original data pattern and losing no useful information from the original dataset. Nowadays, feature selection methods are widely employed for their capability of dimension reduction, for instance in the fields of written-text and DNA-microarray analysis, and show their advantages when the number of features is large while the number of samples is small. Compared with feature extraction, feature selection aims to find the features that describe the original dataset precisely and briefly, whereas the latter aims to create new features based on the original dataset. It should be noted that some relevant features may be redundant, since there might be other features that are strongly correlated with them [36]–[38].

Performing feature selection on a data set has at least the following three advantages: (1) the selected features can be employed to build a brief model describing the original data and are thus beneficial for improving the performance criteria; (2) the selected features can reflect the core characteristics of the original data and are thus helpful for tracing concept drifts of the data expression with good robustness; and (3) the chosen features can help the decision-maker pick valuable information out of a large amount of noisy data [38], [39].

B. TAXONOMIES AND COMPARISONS
Commonly, feature selection techniques are classified into filter, wrapper, and embedded approaches according to the means of combining a classifier and a machine learning approach when selecting a feature subset [38], [40]–[42]; independently of that, they may be supervised or unsupervised [37], [43], [44]. The noticeable difference between the latter two kinds of methods lies in whether or not the class labels are available: a supervised method uses the available label information to evaluate the significance of features and provides rankings of these features, while an unsupervised method seeks hidden structures in unlabeled data and constructs a feature selector by means of the intrinsic properties of the data [45].

1) FILTER TECHNIQUES
The filter feature selection method is an algorithm that selects features without evaluating a performance metric of the classifier's model on the selected features [50]. It assumes the data are completely independent of classification algorithms and forms the feature subset according to the importance of a feature, measured by its contribution to the class attributes. The performance metric on the output of the classification algorithm is not employed to assess a feature subset; the measurement works only with the data distribution [39], [51], [52].

There are a variety of filter measures, which are classified according to the way of combining the features and the class attributes, such as distance-based measures, probability-based measures, mutual information-based measures, consistency measures, and neighborhood graph-based measures [53]. Therefore, the key to a filter approach lies in defining and exploring the relevance between each feature and the class attributes. Accordingly, the measurement of strongly relevant, weakly relevant and irrelevant features has been presented from different aspects by researchers [30], [52], [54]. For example, Wu et al. [30] defined relevance based on the exclusion of conditional independence, whereas Kira and Rendell [55] described the RELIEF algorithm to estimate the weights of the features. Representative filter methods are RELIEF [55], FOCUS [56], and MIFS [54].

The benefits of filter methods are that they are independent of a learning process, have good robustness to the concept drift of the data expression [53], [57], [58], and are time-effective because of their low computational complexity. However, they have the following drawbacks: they rely greatly on the stopping criteria (a threshold for determining when to stop these methods) and on the mechanisms for calculating the importance of a feature [59]. Besides, the strategy for seeking features is an influential factor in filter-based feature subset evaluation methods.

Although the selection process of filters relies little on the classification algorithms, the best filter measure is likely to be classifier specific, since different classifiers perform differently when combined with the same filter [35]. Recently, Freeman et al. [53] compared 16 commonly used filters and combined them with two classifiers, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM), on 40 datasets. Their empirical results in terms of classification accuracy give an indication of which filter measures may be appropriate for use with different classifiers.
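As a minimal illustration of the filter idea just described, the sketch below scores each feature by its estimated mutual information with the class attribute and keeps the top k, without ever training or consulting a classifier. The scoring measure and the cutoff k are our own illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter step: score each feature by its relevance to the class attribute.
# No classifier is trained; the measure depends only on the data distribution.
scores = mutual_info_classif(X, y, random_state=0)

k = 10  # stopping criterion: an a-priori threshold on the number of features
top_k = np.argsort(scores)[::-1][:k]
X_reduced = X[:, top_k]
print("top features by mutual information:", sorted(top_k.tolist()))
```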
2) WRAPPER TECHNIQUES
FIGURE 2. The wrapper procedure.

From Figure 2, an initial state, a termination condition, and a search engine are required during the process of a wrapper method. In addition, as wrappers are associated with the learning algorithm, the combination of features, the criterion for evaluating the performance, and the type of classifier are crucial factors that influence the classification results [63]. As with filters, the type of classifier has a considerable impact on the performance of wrapper-based feature selection methods, according to the research conducted by Shanab et al. [62].

3) EMBEDDED TECHNIQUES
Embedded methods incorporate the learning process of a classifier into feature selection [64] and search for an optimal feature subset as part of model training.

4) COMPARISONS
The above studies have resulted in many feature selection methods, most of which, however, aim only at specific backgrounds. In addition, comparing the performances of these methods is not easy. Herein, we provide a brief comparison of filter, wrapper, and embedded methods in Table 1.

Bolón et al. [13] compared some frequently used filter methods mentioned above in terms of whether they are univariate or multivariate and in terms of their computing cost. According to them, univariate methods are fast and scalable but ignore feature dependencies, while multivariate filters model feature dependencies at the cost of being slower and less scalable than univariate techniques [13].

Table 2 presents a comparison of these methods (where n is the number of samples and m is the number of features).

TABLE 2. Comparison of commonly used feature selection methods according to Bolón et al. [13].

III. VARIANTS AND EXTENSIONS OF FEATURE SELECTION
A. HYBRID AND ENSEMBLE METHODS
By combining the advantages of the above methods, various hybrid feature selection methods have been developed [16], [74]–[78], including the combination of two filter methods, and that of one filter strategy with one wrapper strategy. These hybrid methods can take advantage of the complementary strengths of the strategies they combine. Improved methods for specific problems have also been proposed, such as an improved Fisher score algorithm [87] and enhanced bare-bones particle swarm optimization (BPSO) [88]. Moreover, for some specific problems, such as unreliable data [89], [90], incomplete data [91]–[94], text data [95], and costly data [96], researchers have proposed corresponding feature selection methods. This kind of method usually concentrates on improving the performance of the search strategy for the optimal subset.
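One plausible rendering of the filter-plus-wrapper hybrid named above: a fast mutual-information filter first discards clearly irrelevant features, and a greedy forward wrapper then refines the survivors with a cross-validated classifier. The specific filter, classifier, and budgets are assumptions of this sketch, not a published method.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Stage 1 (filter): keep the 15 features most relevant to the class.
keep = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:15]

# Stage 2 (wrapper): greedy forward selection guided by CV accuracy.
clf = LogisticRegression(max_iter=5000)
selected, best = [], -np.inf
improved = True
while improved:
    improved = False
    for f in keep:                       # all trials share the same base set
        if f in selected:
            continue
        score = cross_val_score(clf, X[:, selected + [int(f)]], y, cv=5).mean()
        if score > best:
            best, best_f = score, int(f)
            improved = True
    if improved:
        selected.append(best_f)          # add the single best improving feature
print("hybrid selection:", sorted(selected), "CV accuracy:", round(best, 3))
```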
Similarly, differential evolution (DE) is a population-based optimization method with strong global search capability. In recent years, it has been used for feature selection with interesting classification results [38], [114], [115]. Additionally, two of our previous works on multi-objective feature selection are based on DE [96], [116]. As these two techniques are focused on different backgrounds, it is hard to compare and discuss which of the two is better. This is also a problem when comparing most of the available feature selection measures.

Huang [117] proposed a classification method using ant colony optimization (ACO), which optimizes both the feature subset and the parameters of an SVM. Su and Lin [57] incorporated an electromagnetism-like mechanism into a wrapper method. Lin et al. [118] integrated simulated annealing with SVM for feature selection.
C. OPTIMALITY MODELS
Note that, before selecting features, researchers should formulate the problem of feature selection as an optimization model. Some researchers have considered the problem as a single-objective model, e.g., maximizing the accuracy in classification [114], [115]. A feature selection problem, however, generally includes several conflicting objectives, e.g., the number of features, the performance in classification, and/or the reliability of features [89], [96]. Formulating feature selection as a multi-objective problem is the premise of obtaining a series of non-dominated feature subsets, which is beneficial for meeting various requirements in real-world applications. Herein, we introduce two examples. One is the previous results obtained by our experiments on the dataset 'Sonar' for the two-objective feature selection problem whose objectives are the number of selected features and the classification accuracy, shown in Fig. 4.

FIGURE 4. An example of a two-objective feature selection problem.

Fig. 4 shows a trend that, if more features are selected, a higher accuracy can be obtained. However, the maximal size of the variable subset is still far below the number of features in the original dataset. This means that the redundant and irrelevant features are removed from the original dataset and that our feature selection approach is able to work well. However, such a set of results can be viewed simply as a reference; for a multi-objective feature selection problem, balancing each objective and looking for the most suitable solution are desirable.

The other example is the mathematical model of a wrapper method for unreliable data [89]. Since the sample data are unreliable, the reliability degree (RD) in Eq. (1), not merely the classification accuracy (CA) in Eq. (2), is taken into account when evaluating a feature subset. Herein, if a feature i is selected, then x_i is set to 1; otherwise, x_i is set to 0. e_i is a value within [0, 1] that, without loss of generality, represents the reliability of feature i; the bigger the value, the higher the RD of this feature. For measuring CA, we adopt the one-nearest-neighbor classifier, where the testing dataset contains only one instance each time, to evaluate the classification accuracy of a solution (a feature subset). If the constructed classifier can predict the class of the testing datum x, then S_i is set to 1; otherwise, S_i is set to 0.

$$f_1(x) = \frac{\sum_{i=1}^{N} x_i \cdot e_i}{\sum_{i=1}^{N} x_i} \quad (1)$$

$$f_2(x) = \frac{1}{K} \sum_{i=1}^{K} S_i(x) \quad (2)$$
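The two objectives can be computed directly from Eqs. (1) and (2). In the sketch below, f1 is the average reliability of the selected features and f2 is the leave-one-out accuracy of a one-nearest-neighbor classifier restricted to those features; the synthetic data and the reliability values e_i are placeholders for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
N, n_samples = 20, 100                     # N features, illustrative data
X = rng.normal(size=(n_samples, N))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # class depends on two features
e = rng.uniform(size=N)                    # e_i: reliability of feature i in [0, 1]

def f1(x):
    """Eq. (1): average reliability degree of the selected features."""
    return np.sum(x * e) / np.sum(x)

def f2(x):
    """Eq. (2): 1-NN accuracy, with one test instance held out at a time."""
    sel = np.flatnonzero(x)
    K = len(X)
    S = 0
    for i in range(K):                     # each test set holds one instance
        train = np.delete(np.arange(K), i)
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(X[np.ix_(train, sel)], y[train])
        S += int(knn.predict(X[i, sel].reshape(1, -1))[0] == y[i])  # S_i
    return S / K

x = np.zeros(N, dtype=int)
x[[0, 1, 5]] = 1                           # a candidate feature subset
print(f"f1 (reliability) = {f1(x):.3f}, f2 (accuracy) = {f2(x):.3f}")
```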
D. LIMITATIONS OF TRADITIONAL FEATURE SELECTION METHODS FOR BIG DATA
Since practical data may include noises of different degrees, types and formats, or values that are approximately zero, which makes determining the degree of correlation between features difficult, feature selection is still an unsolved problem. In the case of big data, as vast, ever-growing data emerge while the existing measurements are inadequate, there is a growing need for efficient feature selection methods for big data.

Due to the three characteristics of big data mentioned above, different analytical modes must be considered for different application requirements [8].

Feature selection methods for traditional data are essentially offline methods. However, as big data contain an immense amount of irrelevant and/or redundant features in addition to their large volume, how to decrease the computational cost without deteriorating the classification accuracy is an urgent issue. Additionally, with respect to the wide variety of big data, efficient feature selection methods are required to extract valuable information from data of small size and various formats or types. Moreover, for dynamic data, traditional feature selection methods have difficulty tracking the changes of the data, and, since no complete knowledge is available in advance, constructing the classifier's model is difficult using traditional methods. Finally, on account of the limited precision of equipment or of environmental disturbances, dealing with the severe lack or unreliability of some attributes' values in big data needs more attempts.
IV. FEATURE SELECTION METHODS FOR BIG DATA
The complex characteristics of the data make it difficult to obtain a common feature selection method for big data; a method specific to a background is more feasible. Accordingly, in this section, we review the available feature selection methods for big data according to the particular types of data they handle and their applications in analysis. The first part covers static big data, dynamic data, missing data, heterogeneous data, unreliable data and imbalanced data, while the latter part consists of applications in text analysis. In addition, after looking back at the available feature selection methods for each kind of data, we also describe what can be done in terms of further research.
A. SPECIFIC TO SEVERAL PARTICULAR KINDS OF DATA
1) STATIC BIG DATA
The progress of science and technology contributes to a world full of information, and data are the clue to that information. Common characteristics, or even regularities, drawn from historical data may facilitate policy-making. For example, taking rainfall data or other meteorological information for an area during the past few decades, the month of this year in which heavy rain is most likely to occur can be inferred. Therefore, we can plan an outside activity more reasonably, or even take protective measures to avoid flooding. Clinical data have to be well preserved for long-term research in pathology. Moreover, the symptoms determine the doctor's diagnosis and the subsequent treatment. As a consequence, the relevance between symptoms and diagnoses has to be learnt so that unnecessary physical examinations can be avoided.

For static big data, due to the large scale or high dimension, the aim is to look for the inner pattern or construction of the data, followed by extracting useful information which will be subject to further use, for example prediction. Consequently, feature selection methods work like a pre-processor for finding valuable information in big data. Herein, we discuss methods from two aspects: large-scale data with a high dimension, and data with a small sample but a high dimension.

a: LARGE-SCALE DATA WITH A HIGH DIMENSION
As we have discussed in Section II-B.1, a series of measurements can be used to estimate the relevance between the features and the class attributes. For large-scale data, mRMR (max-relevancy and min-redundancy) is an efficient tool that can search for a set of features in which the relevance between each feature and the class is maximized (max-relevancy) while the pairwise information between the features in the set is minimized (min-redundancy) [119]–[122]. This is one of the mutual information-based measures, which were developed to cope with computational complexity, since pairwise comparisons for calculating the correlations between features must be conducted [34], [45], [93], [123], [124]. To further improve the performance of mRMR, Wang et al. [45] proposed an unsupervised feature selection method for dimensionality reduction.

MPMR: This measure provides a new criterion for unsupervised feature selection, called maximal projection and minimal redundancy, which is formulated with the use of a projection matrix.

mr²PSO: Unler et al. [123] presented a relevance and redundancy criterion based on mutual information. The basic idea of the proposed criterion is to maximize the prediction accuracy of the selected feature subset; it differs from mRMR in how it determines the information property of a feature subset. That means the relevance and redundancy mutual information acts only as an intermediate measure in the PSO algorithm, to improve the speed and performance of the search.
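The mRMR criterion can be approximated by a simple greedy loop: repeatedly add the feature whose relevance to the class, minus its average redundancy with the features already chosen, is largest. The sketch below estimates both terms with scikit-learn's mutual-information estimators; it is a schematic rendering of the criterion, not a reimplementation of any of the cited methods.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[1]

relevance = mutual_info_classif(X, y, random_state=0)   # I(feature; class)

def redundancy(i, j):
    """Estimated mutual information between two continuous features."""
    return mutual_info_regression(X[:, [i]], X[:, j], random_state=0)[0]

selected = [int(np.argmax(relevance))]     # start from the most relevant feature
while len(selected) < 8:                   # illustrative subset size
    candidates = [i for i in range(n) if i not in selected]
    # max-relevance minus mean pairwise redundancy with the chosen set
    scores = [relevance[i] - np.mean([redundancy(i, j) for j in selected])
              for i in candidates]
    selected.append(candidates[int(np.argmax(scores))])
print("mRMR-style selection:", sorted(selected))
```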
b: SMALL-SAMPLE DATA WITH A HIGH DIMENSION
For big data with a number of dimensions much bigger than the number of data, the dimension becomes a major barrier to developing a predictive model with high precision or to improving the efficiency of a feature selection method. In view of this, many scholars have attempted to design methods targeting data with a high dimension. He et al. [124] proposed a feature selection method based on mRMR, called MINT, which performs feature selection using both the training data and the unlabeled test data.

Apart from the mutual information-based measures, distance-based measures also facilitate the process of feature selection for small-scale, high-dimensional data. For example, Vijay et al. [125] presented an embedded technique by incorporating sparsity into a classifier, and Fang et al. [126] proposed an unsupervised feature selection method based on localities and similarities.

The aforementioned methods can effectively handle the problem of feature selection for high-dimensional data. However, on the one hand, they require information on all the features of all the data before selecting features, which is impractical for big data. On the other hand, filter and embedded techniques achieve their small computing consumption at the expense of a high degree of accuracy. How to improve accuracy with a small computing cost needs more attention in the future.

2) DYNAMIC DATA
Online feature selection methods belong to the stream mode; they reevaluate the existing features based on each newly received datum. The key challenge of online feature selection is how to make accurate predictions for an instance using a small number of active features [127]. A number of studies have been proposed to deal with dynamic data by means of feature selection [128]–[130].

SFS: Zhou [131] proposed streamwise feature selection, which evaluates each feature only once, when it is generated. The benefits of streamwise methods are that features are generated dynamically and that overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In their work, the candidate feature set is regarded as a dynamically generated stream, while knowledge of the structure of the feature space is required beforehand to heuristically control the choice of candidate features, which is often infeasible for real-world applications.

OFS: Wang et al. [127] assume that the dimension of the data is fixed and that the pattern of a datum can be acquired with the datum over time, where an online learner is employed to maintain a classifier involving only a small and fixed number of features.

OSFS: Conversely, with the number of training examples changing dynamically and more attention being paid to streaming features, Wu et al. [30] described streaming features as features that flow in one by one over time. They studied a framework with a small complexity cost, Fast-OSFS, to estimate the relevance between features and class attributes by calculating some probability values. An interesting point is that Fast-OSFS has a memory for redundant features, owing to its definition of the relationship based on conditional probability. This guarantees that, even though a redundant feature has been removed earlier, a new feature with the same kind of redundant information as the former feature can also be discovered and eliminated.

SAOLA: In contrast to the estimation of mutual information by conditional probability, Yu et al. [34] defined the redundancy between features and the relevance between features and class attributes using entropy models. Additionally, once a feature is deleted, it will not be investigated any more by the greedy algorithm, which only adds new features but never deletes them. Therefore, their method can be faster than Fast-OSFS.

For streaming data with a high dimension, great computing consumption may give rise to an incrementally, even exponentially, growing search space. In view of this, Fong et al. [132] have proposed a light-weight feature selection method based on heuristic algorithms.

APSOFS: APSOFS [132] finds a preferred combination of classification algorithms and light-weight feature selection algorithms. An interesting aspect of their work is the discussion of how the new functions of data stream mining algorithms can help overcome the incremental computation.

An interesting issue is streaming labels, namely, the number of class attributes being unknown while the size of the feature subset stays constant.

MLFS: Like streaming features, Lin et al. [133] made the assumption that labels arrive one at a time. Under this scenario, they first obtain an individual feature rank list and weights based on mRMR for each newly arrived label. Afterwards, on the basis of the fixed weight values, the distance between the final feature rank list and each individual feature rank list is calculated, and the final feature rank list that makes the distance minimal is what is needed. This is a kind of embedded method, with a filter to rank the features and a learning method to seek the optimal feature subset. It provides a new idea for streaming-label feature selection problems, which attract many domains like image retrieval and medical diagnosis.

In summary, Table 3 briefly compares the methods discussed above.

TABLE 3. Comparison of commonly used feature selection methods for dynamic data.

Similar to feature selection methods for large-scale data, in terms of online feature selection problems, filter and embedded techniques show great potential due to their small computational cost, while wrapper methods are seldom employed. However, the small complexity of filter methods is often accompanied by a low degree of accuracy. Therefore, embedded methods, as well as combined methods, will become new trends for further research.
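To give the streaming-feature setting a concrete shape, the following sketch processes features one at a time and keeps a new feature only if its estimated relevance to the class exceeds a threshold and it is not too correlated with any feature already kept. This is a deliberately simplified echo of the relevance/redundancy tests used by methods such as Fast-OSFS and SAOLA, with thresholds and data invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
n_samples = 300
y = rng.integers(0, 2, size=n_samples)

kept_columns = []              # the current feature subset
REL_MIN, RED_MAX = 0.02, 0.8   # illustrative relevance / redundancy thresholds

def on_new_feature(col):
    """Decide online whether a newly arrived feature enters the subset."""
    rel = mutual_info_classif(col.reshape(-1, 1), y, random_state=0)[0]
    if rel < REL_MIN:
        return False                       # irrelevant: discard immediately
    for old in kept_columns:               # redundancy check against kept set
        if abs(np.corrcoef(col, old)[0, 1]) > RED_MAX:
            return False
    kept_columns.append(col)
    return True

for t in range(20):                        # features flow in one by one
    new_col = y * rng.normal(1, 1, n_samples) if t % 4 == 0 \
              else rng.normal(size=n_samples)
    print(f"feature {t}: kept={on_new_feature(new_col)}")
```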
3) MISSING DATA
Missing data are very common in big data, on account of software disasters or the low resolution of hardware [134]–[136]. Bu et al. [136] have pointed out that the existence of missing data greatly increases the difficulty of processing data. For traditional data, a simple approach to processing a dataset with missing data is to directly delete the incomplete records. It is clear that this approach reduces the number of data, but some valuable information may also be discarded [93]. As mentioned above, even though some information is redundant or irrelevant, it is still retained in the original big dataset in its entirety for further analysis; therefore, this approach goes against the intention of big data. Another simple way is to seek features with a high degree of relevance to those with missing data, and to take the average value of the related features as the value of the missing data [137]; this kind of method assumes similarity measured by distance. Expectation maximization for missing data is established using a probabilistic model, in which iteration cannot be avoided when looking for a promising estimation. Similarly, low-rank approximation for missing data also has the demerit of repeated iterations [138]. Commonly, a parameter interpreted as a probability is required, which is all the more difficult to determine for big data. In summary, repeated iterations are not suitable for big data mining, as they may cause incremental computation.
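The "average of related features" idea of [137] can be sketched as follows: for a feature with missing entries, find the features most correlated with it on the observed rows and fill the holes with their standardized average. The correlation measure and the number of related features used are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = 0.9 * X[:, 0] + rng.normal(0, 0.3, size=200)  # feature 3 tracks feature 0
missing_rows = [5, 17, 42]
X[missing_rows, 3] = np.nan                              # entries lost for feature 3

j = 3
obs = ~np.isnan(X[:, j])                                 # rows where feature j is known

# Relevance of every other feature to feature j, computed on observed rows.
corr = np.array([abs(np.corrcoef(X[obs, k], X[obs, j])[0, 1]) if k != j else -1
                 for k in range(X.shape[1])])
related = np.argsort(corr)[::-1][:2]                     # the two most related features

# Standardize so that averaging across features is meaningful, then fill in.
mu, sd = X[obs, j].mean(), X[obs, j].std()
z = (X[:, related] - X[obs][:, related].mean(axis=0)) / X[obs][:, related].std(axis=0)
X[~obs, j] = mu + sd * z[~obs].mean(axis=1)
print("imputed values:", np.round(X[missing_rows, j], 3))
```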
What can we do to deal with incomplete big data? Bu et al. [139] have attempted to employ feature selection for clustering incomplete big data. In their method, feature selection aims to filter undesirable features in a set of complete data, measured by some entropy-based definitions, followed by a clustering model based on the selected feature subset. Yuan et al. [140] employed a feature selection method for multi-source learning of incomplete neural imaging data, which is a kind of large-scale data. The method utilizes the missing blocks to partition a dataset into several independent learning tasks, each having a classification model based on a feature selection method. These methods regard feature selection as a tool for the reconstruction of the original data. If a feature is redundant with the others, or if it is not relevant to the class attributes, tackling it at the expense of computing resources is unnecessary; when a feature without a recognized value is regarded as redundant or irrelevant, it is eliminated.

Moreover, since feature selection is clearly desirable due to the abundance of missing features in many real-world applications, researchers have attempted to select a subset of features by rough sets even though some features are missing, and to preserve the meaning of the features contained in the data set to avoid information loss. Qian and Shu [141] proposed a feature selection method for incomplete data based on mutual information measured by rough sets, which adopts a greedy forward search strategy over the whole set to accelerate the selection. The challenge for this kind of method is the use of mutual information based on rough set theory for constructing data models.

Incomplete data are an interesting but formidable issue for data mining. Due to the efficiency of feature selection in seeking the relationships among features and between features and class attributes, feature selection facilitates the reconstruction of a data model in line with an original data set with much missing information; further data mining techniques can then be applied. However, the large scale or high dimension of big data makes feature selection difficult, let alone big data in a dynamic environment. One challenge for the available methods is improving the consumption cost while still reconstructing a data model well without repeated iterations. Moreover, how to apply methods that are efficient for big data in a static environment to big data in a dynamic environment is still an open issue.

4) HETEROGENEOUS DATA
In most practical problems, data are often collected from different sources. Their features are often heterogeneous and consist of numerical and non-numerical features with different properties [142]–[145]. For example, in clinical research [146], medical data are collected from different sources, such as demographics, disease history, medication, allergies, biomarkers, medical images, or genetic markers, each of which offers a different partial view of a patient's condition [147]. As a result, it is difficult to evaluate heterogeneous features concurrently. As discussed previously, feature selection methods assign each feature a value of importance, and accordingly retain or discard a feature according to their inner measurement. Therefore, for heterogeneous data, differences in data format constitute the major obstacle for data mining, in particular in the field of big data.

The available feature selection methods for heterogeneous data can be roughly divided into numerical [148], [149] and non-numerical [150], [151] feature selection algorithms. Rough sets [152], [153] and mutual information [145] are two efficient tools for dealing with heterogeneous feature selection, with the notable difference that methods based on the former are computationally expensive. Under the circumstance of ever more heterogeneous data, effective methods are in great demand in terms of size and formats. However, there are only a very limited number of methods for heterogeneous data in the context of big data.
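One simple way to put numerical and non-numerical features on a common relevance scale, in the spirit of the mutual-information route mentioned above, is to estimate each feature's mutual information with the class while flagging the categorical columns as discrete. The toy data and the integer encoding below are assumptions of this sketch.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 12, n)                       # numerical feature
dept = rng.integers(0, 4, n)                      # categorical feature (coded 0..3)
marker = rng.normal(size=n)                       # numerical, uninformative
y = ((age > 55) | (dept == 2)).astype(int)        # class mixes both feature types

X = np.column_stack([age, dept, marker])
# discrete_features tells the estimator which columns are categorical,
# so heterogeneous features share one relevance scale.
scores = mutual_info_classif(X, y, discrete_features=[False, True, False],
                             random_state=0)
for name, s in zip(["age", "dept", "marker"], scores):
    print(f"{name:>6}: MI = {s:.3f}")
```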
5) UNRELIABLE DATA AND IMBALANCED DATA
Unreliable data are collected from equipment with deviations or under the influence of the outside environment. Each feature has a reliability value resulting from sensor precision, faulty equipment, environmental temperature changes, incorrect operation, etc. [90], [154], [155]. Gong et al. [89] propose a feature selection method for unreliable data in which the reliability of a feature is represented by a value between 0 and 1, and the mathematical model is constructed using this reliability value. Commonly, when dealing with unreliable data, fuzzy methods have the capability of describing the degree of uncertainty of each variable. For example, Chen [156] employs a cost-based fuzzy decision model to deal with unreliable systems, and Xie et al. [157] incorporate a designed fuzzy weighting function into their fuzzy control model under communication links. Since feature selection is a tool for mining data, if a datum is not totally reliable, or if there was a breakdown or a faulty operation when collecting the data, how can we make full use of it? In an industrial and mining context, there may be hundreds of sensors, each with its own accuracy, along an underground production line, and the supervised data from these sensors are delivered in real time to the upper detecting chamber. The transmitted data will surely have a significant influence on the judgment of the upper control. Consequently, designing useful feature selection methods for unreliable big data, in particular when data are transmitted in real time for further processing, is of great importance.
In machine learning and data mining, when the number of observations in one class is significantly rarer than in the other classes [158], processing the minority samples is difficult, which may lead to misclassifying or even ignoring them. These minority samples, however, often contain the more valuable information. Imbalanced data arise from the accumulation of data, in particular in the field of big data [159]. For instance, in the detecting chamber of a pit, the data collected from the sensors consist of the majority samples under a no-fault environment. When a breakdown occurs, for instance when a leash deviates from its set track, an abnormal value from the sensor is transmitted to the detecting chamber; these abnormal values constitute the minority samples.

Since one or more classes are underrepresented in the data set [160], some researchers treat all imbalanced data consistently with one versatile algorithm [159], [161], [162], while others deal with imbalanced data of various dimensions, ratios and numbers of classes [158].

For imbalanced data, several feature selection models have been attempted [158], [161], [163]–[168]. Embedded methods have been proven to be the most efficient tool for dealing with imbalanced data [169], [170], although, before they are applied, data-level approaches that balance the class attributes by re-sampling the training dataset are necessary, for example over-sampling and under-sampling: the former creates new samples of the minority class, as in SMOTE [171], while the latter reduces the number of majority-class samples, as in ACOS [172].

One obstacle to selecting feature subsets for imbalanced data is to avoid losing potentially useful information and altering the original data distribution, since feature selection has to achieve a trade-off between eliminating irrelevance and redundancy and retaining valuable features. In the environment of imbalanced big data, the inner pattern is hard to recognize and the computing cost should be kept low. Besides, removing potentially useful features would have an extreme effect on the classifier's accuracy. Therefore, improving the performance of feature selection methods for imbalanced data is a major challenge in imbalanced mining.
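A minimal sketch of the data-level step described above: the majority class is randomly under-sampled before filter scoring, so that the relevance estimates are not dominated by the majority class (a SMOTE-style over-sampler would play the same role for the minority class). The dataset and the scorer are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# A 95:5 imbalanced problem with a handful of informative features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)

# Data-level step: random under-sampling of the majority class.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
idx = np.concatenate([minority, majority])

scores_raw = mutual_info_classif(X, y, random_state=0)
scores_bal = mutual_info_classif(X[idx], y[idx], random_state=0)

print("top-4 features, original data :", np.argsort(scores_raw)[::-1][:4])
print("top-4 features, balanced data :", np.argsort(scores_bal)[::-1][:4])
```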
B. APPLICATIONS IN CLASSIFICATION ANALYSIS
1) TEXT CLASSIFICATION
The basic motivation for studying big data is to aggregate and process a huge amount of data as rapidly as possible to mine valuable information [173]–[177]. Based on this desirable information, researchers can avoid risks, confirm causes, and predict events [178]. One of the most important tasks in processing big data is to classify texts.

Today, the Internet is an essential part of people's lives, offering a great deal of convenience and many reference sources. Open review platforms on the web provide users with plenty of available content. Many chaotic reviews, however, consume users' valuable decision-making time. Under this circumstance, feature selection is a promising way to provide a filter and a reference. Wang et al. [179] proposed an effective feature selection method for classifying sentiment text in product reviews, where the experiment includes 1006 car-review documents. Although the method performs well in text classification, attempts are still needed for big data, where perhaps hundreds or thousands of clues must be analyzed. Assuming there is a high degree of redundancy and irrelevance among these clues, one difficult task for decision makers is to choose goods with the preferred performance at a low expense.

If a dynamic environment exists in text classification, what form could a feature selection method take? Nakanishi introduces a vector space model, the most popular and basic method for the comparison of concepts, events, and phenomena [180], to measure the similarity or correlation between queries and the information resources that are relevant to users' information needs. It is worth mentioning that Nakanishi's work can change the space dynamically for semantic computations and analyses as the data are updated. This is a breakthrough in text classification. In the big data environment, dynamic feature selection needs more attention.

2) MICROARRAY ANALYSIS
Microarray data are generated from microarray experiments, which generally have high dimensions and a small number of samples. A key issue in microarray experiments is the large number of irrelevant and redundant genes, whose elimination should make the process of obtaining the classifier easier [181]–[184]. As mentioned above, feature selection methods for big data with a high dimension and small samples have been discussed, and these are not suited to large numbers of repeated iterations.

Under this circumstance, Apolloni et al. [184] developed an efficient hybrid feature selection method combining a wrapper based on binary differential evolution with a rank-based filter, where the initial population consists of solutions built from the most relevant features obtained by the filter and of solutions randomly generated to promote the diversity of solutions. Similarly, Tabakhi et al. [185] proposed an embedded feature selection method for genes, which, unlike the former method, is unsupervised.

In view of the particularity of some sensitive cases, e.g. medical diagnosis, the research data are commonly very large on account of constant preservation [6]. The filter taxonomy is often chosen for its advantage of saving computation cost. However, in order to increase the classification accuracy, researchers tend to incorporate filters into other taxonomies or heuristic search strategies. Therefore, the accuracy of feature selection methods needs to be improved without increasing the complexity of these methods. Besides, the small number of samples remains an obstacle to improving the classification accuracy.
3) IMAGE CLASSIFICATION AND BACKGROUND SUPPRESSION
In the domain of image classification, diverse information is required for images to be classified; current attempts include [50] and [186]–[190]. On account of storage and computing costs, a larger number of features does not imply better classification. Therefore, feature selection is taken into consideration for image classification. Shang and Barnes [191] proposed a fuzzy-rough feature selection method which is then incorporated into machine learning for Mars image classification. Hierarchical image content analysis is provided by Vavilin and Jo [192] for dealing with image classification and retrieval in natural and urban scenes. Moreover, Chang et al. [193] employed k-fold feature selection, based on a concept similar to k-fold cross-validation, for image classification.

Although these studies are able to obtain good classification results, the human factor has been ignored. In view of this, Zhou et al. [194] proposed an eye-guided tracking feature selection method for this field, which explores the mechanisms of the human eye for processing visual information, based on mRMR and SVM. Their method takes a new look at image classification, even though it does not consider dynamic images. In the context of big data, certain types of images tend to have an influence on eye-tracking data. The diverse properties of images, such as color, edge distribution, illumination, weather, season, daytime, saturation, buildings, cars, trees, the sky, and roads [192], make image classification methods specific to different scenarios. Given this big domain of images, how we can extract valuable information from it and make use of it attracts attention, owing to the widespread use of digital equipment.

Background suppression targets detecting and analyzing text from video frames, where a media sequence of unknown length is assumed [195], [196]. Feature selection methods are able to seek out such information, and Nguyen et al. employ feature selection to deal with the background suppression problem [197]. Similar to unreliable data, the features of this kind of problem often have different importance, leading to more difficulty for feature selection. Moreover, the streamwise style of the data undoubtedly poses an even greater challenge.

V. FUTURE RESEARCH ISSUES
This paper has reviewed the available feature selection methods for dealing with big data. Some possible future directions are discussed in this section.

Data model: As mentioned above, we are now in the era of big data, with extremely large sizes and rapid changes. A changing environment under the conditions of large-scale data should also be taken into consideration for feature selection. The dynamic feature selection method is an open issue for researchers, since not only is the cost of accessing the features or data high, but we also usually cannot obtain all the features or data in advance in real-world applications. With the growth of streaming data and the development of dynamic feature selection methods, how to combine the two aspects for efficient and fast classification requires a lot of work.

High-dimension: Although some available feature selection strategies based on mutual information for high-dimensional data keep the computational consumption manageable, the definition of the relationships between features or with the class attributes, which has an influence on the final selection results, is still a great challenge.

Large-scale: With respect to the various formats of data, combined with their large scale, cloud computing and cooperative computing are new topics, yet only a handful of studies incorporate these parallel processing methods into feature selection for big data.

Data structure: In terms of semi-structured data and non-structured data, the importance of normalization should be stressed. If feature selection methods that facilitate seeking the internal patterns of these kinds of big data are designed, it will be easier for our interpreters to process and utilize them. However, most of the current research aims to find the feature subset
of structured big data, while the semi-structured and non-structured big data are absent.

Dynamic environment: Concerning data whose format, scale or other characteristics change over time, namely the dynamic case, there has been only a limited amount of research, even though several streamwise models are available. The processing speed is the main obstacle for these kinds of data, since in some cases it is more necessary for data users to obtain a brief and simple model that describes the main characteristics of the original model rather than a precise one. Using the obtained data model, future data can be processed roughly online, followed by other processing techniques offline.

Combined with parallel methods: Given the characteristics of big data, parallel processing methods have been applied to improve the efficiency of mining big data [198]. Since feature selection methods can lighten the processing load in inducing a data mining model, the combination of the merits of both feature selection and parallel processing is worth investigating.

a. Combined with CoEA: The cooperative evolutionary algorithm (CoEA), a parallel evolutionary algorithm (EA), generally has a rapid processing speed. If the divide-and-conquer idea of CoEA is applied to feature selection for big data, the problem of feature selection with high dimensions is split into several subproblems of feature selection with low dimensions. These low-dimensional subproblems can then easily be handled in parallel by EAs, which provides a feasible approach for improving the efficiency of feature selection (a minimal sketch of this idea follows after this list).
b. Combined with cloud computing: In recent years, cloud computing has been studied intensively. This lays the foundation for remote storage and distributed processing of big data. Heterogeneous parallel processing based on a cloud environment, however, can lead to many problems, such as the division of processing tasks and the cooperation among cloud resources. Moreover, in the light of the remarkable performance of GPUs in floating-point arithmetic and large-scale data processing, implementing feature selection methods on GPUs is also an effective way to improve the efficiency of big data mining.

Energy saving: Filter methods offer a lower computational cost than both wrapper and embedded methods, despite the fact that this comes at the expense of accuracy. Therefore, hybrid methods, efficient embedded methods and parallel techniques are desirable to save computing cost while not lowering the accuracy when dealing with big data.

Performance in real-time processing: Since the usefulness of a datum degrades over time, the time consumed in processing big data must be taken into account when designing a processing technique. Due to the merits of the dimensional reduction brought by feature selection, some researchers have attempted to apply feature selection methods to processing data in real time [199], [200]. For example, Zhang et al. [160] proposed a feature selection method based on the Fisher filter and a wrapper to reduce the number of features and thereby the processing time. When dealing with problems that require real-time processing, apart from the efficiency of feature selection, just as much attention must be paid to accuracy: a well-established data model will help further processing, whereas an incorrect one has a great impact on subsequent processing.

Practical problems: Since feature selection targets recognizing the inner pattern of a dataset and eliminating the irrelevance or redundancy in the dataset, applying feature selection as a data mining tool to practical problems is desirable in today's world of big data. However, there is a lack of research into practical cases on the basis of feature selection in the context of big data. It will be appreciated if the unreliability, the imbalance, and the heterogeneity of data are taken into consideration.
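A minimal sketch of the divide-and-conquer idea from the "Combined with CoEA" item above: the feature indices are partitioned into low-dimensional blocks, a deliberately tiny evolutionary search runs independently on each block in parallel, and the block winners are merged. The cooperation between subpopulations that a genuine CoEA would add is omitted here, and all parameters are illustrative.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

def evolve_block(block, generations=10, pop=8, seed=0):
    """A deliberately tiny (1 + pop) evolutionary search over one feature block."""
    rng = np.random.default_rng(seed)
    cols = np.array(block)

    def fitness(mask):
        return cross_val_score(KNeighborsClassifier(),
                               X[:, cols[mask]], y, cv=3).mean()

    best = np.ones(len(block), dtype=bool)          # incumbent: the whole block
    best_fit = fitness(best)
    for _ in range(generations):
        # offspring: copies of the incumbent, each with one random bit flipped
        offspring = np.tile(best, (pop, 1))
        offspring[np.arange(pop), rng.integers(0, len(block), pop)] ^= True
        for child in offspring:
            if child.any():
                fit = fitness(child)
                if fit > best_fit:
                    best, best_fit = child.copy(), fit
    return [int(c) for c in cols[best]]

if __name__ == "__main__":
    # Divide: split the features into 5 low-dimensional subproblems.
    blocks = [b.tolist() for b in np.array_split(np.arange(X.shape[1]), 5)]
    # Conquer: each subproblem is handled by an independent EA, in parallel.
    with ProcessPoolExecutor() as pool:
        winners = list(pool.map(evolve_block, blocks))
    merged = sorted(f for block in winners for f in block)
    print("features kept by the parallel subproblems:", merged)
```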
VI. CONCLUSION
Mining valuable information from big data is indeed difficult and challenging. As an important data preprocessing technique, feature selection can greatly improve the efficiency of utilizing data. This paper first reviews feature selection methods for traditional data and then comments in detail on the available feature selection methods for big data. On the one hand, although researchers have developed a large variety of feature selection methods for traditional data, these methods still have difficulties tackling the problem of feature selection for big data. On the other hand, the existing methods of big data feature selection have severe limitations in achieving an appropriate tradeoff between the accuracy of solutions and computational complexity. Moreover, for practical problems, even though more work is essential, we have some strategies and techniques specific to a background, which are reviewed in this paper. Besides, particular attention is paid to the applications of feature selection methods specific to several particular kinds of data and to classification analysis. It will be appreciated if our review work provides a reference for those who would like to explore big data mining via feature selection.

REFERENCES
[1] S. El-Sappagh, F. Ali, S. El-Masri, K. Kim, A. Ali, and K.-S. Kwak, ‘‘Mobile health technologies for diabetes mellitus: Current state and future challenges,’’ IEEE Access, to be published.
[2] R. Elshawi, S. Sakr, D. Talia, and P. Trunfio, ‘‘Big data systems meet machine learning challenges: Towards big data science as a service,’’ Big Data Res., vol. 14, pp. 1–11, Dec. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579617303957
[3] V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work and Think. Boston, MA, USA: Houghton Mifflin, 2013.
[4] R. J. Hathaway and J. C. Bezdek, ‘‘Extending fuzzy and probabilistic clustering to very large data sets,’’ Comput. Statist. Data Anal., vol. 51, no. 1, pp. 215–234, 2006.
[5] C. Lynch, ‘‘Big data: How do your data grow?’’ Nature, vol. 455, no. 7209, pp. 28–29, 2008.
[6] K. Davis, Ethics of Big Data: Balancing Risk and Innovation. Newton, MA, USA: O’Reilly and Associates Inc, 2012.
[7] O. B. Sezer, E. Dogdu, and A. M. Ozbayoglu, ‘‘Context-aware computing, learning, and big data in Internet of Things: A survey,’’ IEEE Internet Things J., vol. 5, no. 1, pp. 1–27, Feb. 2018.
[8] C. K. Emani, N. Cullot, and C. Nicolle, ‘‘Understandable big data: A survey,’’ Comput. Sci. Rev., vol. 17, pp. 70–81, Aug. 2015.
[9] A. Nara, ‘‘Big data: Techniques and technologies in geoinformatics,’’ Int. J. Geograph. Inf. Sci., vol. 29, no. 4, pp. 694–696, Apr. 2015.
[10] J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, 2011.
[11] Q. Tuo, H. Zhao, and Q. Hu, ‘‘Hierarchical feature selection with subtree based graph regularization,’’ Knowl.-Based Syst., vol. 163, pp. 996–1008, Jan. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705118305094
[12] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, ‘‘Recent advances and emerging challenges of feature selection in the context of big data,’’ Knowl.-Based Syst., vol. 86, pp. 33–45, Sep. 2015.
[13] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, ‘‘A review of feature selection methods on synthetic data,’’ Knowl. Inf. Syst., vol. 34, no. 3, pp. 483–519, 2013.
[14] J. Wu and Z. Lu, ‘‘A novel hybrid genetic algorithm and simulated annealing for feature selection and kernel optimization in support vector regression,’’ in Proc. IEEE 5th Int. Conf. Adv. Comput. Intell. (ICACI), Oct. 2012, pp. 999–1003.
[15] N. Abd-Alsabour, ‘‘A review on evolutionary feature selection,’’ in Proc. Eur. Modelling Symp., 2014, pp. 20–26.
[16] A. A. Raweh, M. Nassef, and A. Badr, ‘‘A hybridized feature selection and extraction approach for enhancing cancer prediction based on DNA methylation,’’ IEEE Access, vol. 6, pp. 15212–15223, 2018.
[17] A. Brankovic, M. Hosseini, and L. Piroddi, ‘‘A distributed feature selection algorithm based on distance correlation with an application to microarrays,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., to be published.
[18] E. Bonilla-Huerta, A. Hernández-Montiel, R. Morales-Caporal, and M. Arjona-López, ‘‘Hybrid framework using multiple-filters and an embedded approach for an efficient selection and classification of microarray data,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 13, no. 1, pp. 12–26, Jan./Feb. 2016. [Online]. Available: https://doi.org/10.1109/TCBB.2015.2474384
[19] M. Reboiro-Jato, F. Díaz, D. Glez-Peña, and F. Fdez-Riverola, ‘‘A novel ensemble of classifiers that use biological relevant gene sets for microarray classification,’’ Appl. Soft Comput., vol. 17, pp. 117–126, Apr. 2014.
[20] F. P. Shah and V. Patel, ‘‘A review on feature selection and feature extraction for text classification,’’ in Proc. Int. Conf. Wireless Commun., Signal Process. Netw. (WiSPNET), Mar. 2016, pp. 2264–2268.
[21] H. Wang, F. Dong, and L. Song, ‘‘Bubble-forming regime identification based on image textural features and the MCWA feature selection method,’’ IEEE Access, vol. 5, pp. 15820–15830, 2017.
[22] G. Forman, ‘‘An extensive empirical study of feature selection metrics for text classification,’’ J. Mach. Learn. Res., vol. 3, pp. 1289–1305, Mar. 2003.
[23] J. C. Gomez, E. Boiy, and M.-F. Moens, ‘‘Highly discriminative statistical features for email classification,’’ Knowl. Inf. Syst., vol. 31, no. 1, pp. 23–53, 2012.
[24] Z. Zhao, X. He, D. Cai, L. Zhang, W. Ng, and Y. Zhuang, ‘‘Graph regularized feature selection with data reconstruction,’’ IEEE Trans. Knowl. Data Eng., vol. 28, no. 3, pp. 689–700, Mar. 2016.
[25] J. G. Dy, C. E. Brodley, A. Kak, L. S. Broderick, and A. M. Aisen, ‘‘Unsupervised feature selection applied to content-based retrieval of lung images,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 3, pp. 373–378, Mar. 2003.
[27] M. H. Kamarudin, C. Maple, T. Watson, and N. S. Safa, ‘‘A LogitBoost-based algorithm for detecting known and unknown Web attacks,’’ IEEE Access, vol. 5, pp. 26190–26200, 2017.
[28] L. Zhao and X. Dong, ‘‘An industrial Internet of Things feature selection method based on potential entropy evaluation criteria,’’ IEEE Access, vol. 6, pp. 4608–4617, 2018.
[29] R. Wald, T. M. Khoshgoftaar, A. Napolitano, and C. Sumner, ‘‘Using Twitter content to predict psychopathy,’’ in Proc. IEEE 11th Int. Conf. Mach. Learn. Appl. (ICMLA), vol. 2, Dec. 2012, pp. 394–401.
[30] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, ‘‘Online feature selection with streaming features,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1178–1192, May 2013.
[31] W. Shu, W. Qian, and Y. Xie, ‘‘Incremental approaches for feature selection from dynamic data with the variation of multiple objects,’’ Knowl.-Based Syst., vol. 163, pp. 320–331, Jan. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705118304246
[32] P. Bugata and P. Drotár, ‘‘Weighted nearest neighbors feature selection,’’ Knowl.-Based Syst., vol. 163, pp. 749–761, Jan. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705118304908
[33] A. Lagrange, M. Fauvel, and M. Grizonnet, ‘‘Large-scale feature selection with Gaussian mixture models for the classification of high dimensional remote sensing images,’’ IEEE Trans. Comput. Imag., vol. 3, no. 2, pp. 230–242, Jun. 2017.
[34] K. Yu, X. Wu, W. Ding, and J. Pei, ‘‘Towards scalable and accurate online feature selection for big data,’’ in Proc. IEEE Int. Conf. Data Mining (ICDM), Dec. 2014, pp. 660–669.
[35] R. Kohavi and G. H. John, ‘‘Wrappers for feature subset selection,’’ Artif. Intell., vol. 97, nos. 1–2, pp. 273–324, 1997.
[36] H. Zhou, S. Han, and Y. Liu, ‘‘A novel feature selection approach based on document frequency of segmented term frequency,’’ IEEE Access, vol. 6, pp. 53811–53821, 2018.
[37] W. Zhou, C. Wu, Y. Yi, and G. Luo, ‘‘Structure preserving non-negative feature self-representation for unsupervised feature selection,’’ IEEE Access, vol. 5, pp. 8792–8803, 2017.
[38] I. Guyon and A. Elisseeff, ‘‘An introduction to variable and feature selection,’’ J. Mach. Learn. Res., vol. 3, no. 6, pp. 1157–1182, Jan. 2003.
[39] J. R. Vergara and P. A. Estévez, ‘‘A review of feature selection methods based on mutual information,’’ Neural Comput. Appl., vol. 24, no. 1, pp. 175–186, 2014.
[40] X.-Y. Liu, Y. Liang, S. Wang, Z.-Y. Yang, and H.-S. Ye, ‘‘A hybrid genetic algorithm with wrapper-embedded approaches for feature selection,’’ IEEE Access, vol. 6, pp. 22863–22874, 2018.
[41] F. Bagherzadeh-Khiabani, A. Ramezankhani, F. Azizi, F. Hadaegh, E. W. Steyerberg, and D. Khalili, ‘‘A tutorial on variable selection for clinical prediction models: Feature selection methods in data mining could improve the results,’’ J. Clin. Epidemiol., vol. 71, pp. 76–85, Mar. 2016.
[42] M. Dash and H. Liu, ‘‘Feature selection for classification,’’ Intell. Data Anal., vol. 1, nos. 1–4, pp. 131–156, 1997.
[43] R. Chen, N. Sun, X. Chen, M. Yang, and Q. Wu, ‘‘Supervised feature selection with a stratified feature weighting method,’’ IEEE Access, vol. 6, pp. 15087–15098, 2018.
[44] S. Wang and W. Zhu, ‘‘Sparse graph embedding unsupervised feature selection,’’ IEEE Trans. Syst., Man, Cybern., Syst., vol. 48, no. 3, pp. 329–341, Mar. 2018.
[45] S. Wang, W. Pedrycz, Q. Zhu, and W. Zhu, ‘‘Unsupervised feature selection via maximum projection and minimum redundancy,’’ Knowl.-Based Syst., vol. 75, pp. 19–29, Feb. 2015.
[46] N. Spolaôr, M. C. Monard, G. Tsoumakas, and H. D. Lee, ‘‘A systematic review of multi-label feature selection and a new method based on label construction,’’ Neurocomputing, vol. 180, pp. 3–15, Mar. 2015.
[47] J. C. Ang, A. Mirzal, H. Haron, and H. N. A. Hamed, ‘‘Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 13, no. 5, pp. 971–989, Sep. 2016.
[48] J. Li and H. Liu, ‘‘Challenges of feature selection for big data analytics,’’ IEEE Intell. Syst., vol. 32, no. 2, pp. 9–15, Mar./Apr. 2017.
[49] H. Liu and L. Yu, ‘‘Toward integrating feature selection algorithms for classification and clustering,’’ IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, Apr. 2005.
[26] P. Saari, T. Eerola, and O. Lartillot, ‘‘Generalizability and simplicity no. 4, pp. 491–502, Apr. 2005.
as criteria in feature selection: Application to mood classification in [50] A. Moghimi, C. Yang, and P. M. Marchetto, ‘‘Ensemble feature selection
music,’’ IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 6, for plant phenotyping: A journey from hyperspectral to multispectral
pp. 1802–1812, Aug. 2011. imaging,’’ IEEE Access, vol. 6, pp. 56870–56884, 2018.
[51] C. Yao, Y.-F. Liu, B. Jiang, J. Han, and J. Han, "LLE score: A new filter-based unsupervised feature selection method based on nonlinear manifold embedding and its application to image recognition," IEEE Trans. Image Process., vol. 26, no. 11, pp. 5257–5269, Nov. 2017.
[52] I. Tsamardinos and C. F. Aliferis, "Towards principled feature selection: Relevancy, filters and wrappers," in Proc. 9th Int. Workshop Artif. Intell. Statist. San Mateo, CA, USA: Morgan Kaufmann, 2003.
[53] C. Freeman, D. Kulić, and O. Basir, "An evaluation of classifier-specific filter measure performance for feature selection," Pattern Recognit., vol. 48, no. 5, pp. 1812–1826, 2015.
[54] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Trans. Neural Netw., vol. 5, no. 4, pp. 537–550, Jul. 1994.
[55] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. Int. Workshop Mach. Learn., 1992, pp. 249–256.
[56] H. Almuallim and T. G. Dietterich, "Learning Boolean concepts in the presence of many irrelevant features," Artif. Intell., vol. 69, nos. 1–2, pp. 279–305, 1994.
[57] C.-T. Su and H.-C. Lin, "Applying electromagnetism-like mechanism for feature selection," Inf. Sci., vol. 181, no. 5, pp. 972–986, 2011.
[58] P. Bermejo, L. de la Ossa, J. A. Gámez, and J. M. Puerta, "Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking," Knowl.-Based Syst., vol. 25, no. 1, pp. 35–44, 2012.
[59] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, no. 1, pp. 245–271, Dec. 1997.
[60] L. Ma, M. Li, Y. Gao, T. Chen, X. Ma, and L. Qu, "A novel wrapper approach for feature selection in object-based image classification using polygon-based cross-validation," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 3, pp. 409–413, Mar. 2017.
[61] T. M. Khoshgoftaar, A. Fazelpour, H. Wang, and R. Wald, "A survey of stability analysis of feature subset selection techniques," in Proc. IEEE 14th Int. Conf. Inf. Reuse Integr. (IRI), Aug. 2013, pp. 424–431.
[62] A. A. Shanab, T. M. Khoshgoftaar, and R. Wald, "Evaluation of wrapper-based feature selection using hard, moderate, and easy bioinformatics data," in Proc. IEEE Int. Conf. Bioinform. Bioeng. (BIBE), Nov. 2014, pp. 149–155.
[63] L.-Y. Qiao, X.-Y. Peng, and Y. Peng, "BPSO-SVM wrapper for feature subset selection," Dianzi Xuebao (Acta Electronica Sinica), vol. 34, no. 3, pp. 496–498, 2006.
[64] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "Hybrid feature selection by combining filters and wrappers," Expert Syst. Appl., vol. 38, no. 7, pp. 8144–8150, 2011.
[65] C. Hou, F. Nie, X. Li, D. Yi, and Y. Wu, "Joint embedding learning and sparse regression: A framework for unsupervised feature selection," IEEE Trans. Cybern., vol. 44, no. 6, pp. 793–804, Jun. 2014.
[66] K. Mistry, L. Zhang, S. C. Neoh, C. P. Lim, and B. Fielding, "A micro-GA embedded PSO feature selection approach to intelligent facial emotion recognition," IEEE Trans. Cybern., vol. 47, no. 6, pp. 1496–1509, Jun. 2017.
[67] J. Xie and W.-X. Xie, "Several feature selection algorithms based on the discernibility of a feature subset and support vector machines," Chin. J. Comput., vol. 37, pp. 1704–1718, Aug. 2014.
[68] T. Kari et al., "Hybrid feature selection approach for power transformer fault diagnosis based on support vector machine and genetic algorithm," IET Gener., Transmiss. Distrib., vol. 12, no. 21, pp. 5672–5680, 2018.
[69] J. Spilka, J. Frecon, R. Leonarduzzi, N. Pustelnik, P. Abry, and M. Doret, "Sparse support vector machine for intrapartum fetal heart rate classification," IEEE J. Biomed. Health Inform., vol. 21, no. 3, pp. 664–671, May 2017.
[70] A. Rakotomamonjy, "Variable selection using SVM based criteria," J. Mach. Learn. Res., vol. 3, pp. 1357–1370, Mar. 2003.
[71] R. Chakraborty and N. R. Pal, "Feature selection using a neural framework with controlled redundancy," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 1, pp. 35–50, Jan. 2015.
[72] E. Romero and J. M. Sopena, "Performing feature selection with multilayer perceptrons," IEEE Trans. Neural Netw., vol. 19, no. 3, pp. 431–441, Mar. 2008.
[73] B. Lerner, M. Levinstein, B. Rosenberg, H. Guterman, I. Dinstein, and Y. Romem, "Feature selection and chromosome classification using a multilayer perceptron neural network," in Proc. IEEE Int. Conf. Neural Netw., IEEE World Congr. Comput. Intell., vol. 6, Jun. 1994, pp. 3540–3545.
[74] R. Alzubi, N. Ramzan, H. Alzoubi, and A. Amira, "A hybrid feature selection method for complex diseases SNPs," IEEE Access, vol. 6, pp. 1292–1301, 2018.
[75] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proc. NIPS, vol. 12, 2000, pp. 668–674.
[76] Y. Zhang, C. Ding, and T. Li, "Gene selection algorithm by combining reliefF and mRMR," BMC Genomics, vol. 9, no. Suppl 2, p. S27, 2008.
[77] Y. Peng, Z. Wu, and J. Jiang, "A novel feature selection approach for biomedical data classification," J. Biomed. Inform., vol. 43, no. 1, pp. 15–23, 2010.
[78] A. E. Akadi, A. Amine, A. E. Ouardighi, and D. Aboutajdine, "A two-stage gene selection scheme utilizing MRMR filter and GA wrapper," Knowl. Inf. Syst., vol. 26, no. 3, pp. 487–500, 2011.
[79] Y. Luo, Y. Li, C. Zhou, and C. Xu, "Combining feature selectors in a product advertisement classification system," in Proc. IEEE 1st Asian Conf. Pattern Recognit. (ACPR), Nov. 2011, pp. 184–188.
[80] L. Hedjazi, J. Aguilar-Martin, M.-V. Le Lann, and T. Kempowsky-Hamon, "Membership-margin based feature selection for mixed type and high-dimensional data: Theory and applications," Inf. Sci., vol. 322, pp. 174–196, Nov. 2015.
[81] K. Nag and N. R. Pal, "A multiobjective genetic programming-based ensemble for simultaneous feature selection and classification," IEEE Trans. Cybern., vol. 46, no. 2, pp. 499–510, Feb. 2016.
[82] F. Alzami et al., "Adaptive hybrid feature selection-based classifier ensemble for epileptic seizure classification," IEEE Access, vol. 6, pp. 29132–29145, 2018.
[83] S. Nagi and D. K. Bhattacharyya, "Classification of microarray cancer data using ensemble approach," Netw. Model. Anal. Health Inform. Bioinform., vol. 2, no. 3, pp. 159–173, 2013.
[84] O. P. Günther et al., "A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers," BMC Bioinform., vol. 13, no. 1, p. 326, 2012.
[85] H. Wang, T. M. Khoshgoftaar, and A. Napolitano, "A comparative study of ensemble feature selection techniques for software defect prediction," in Proc. IEEE 9th Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2010, pp. 135–140.
[86] R. Xia, C. Zong, X. Hu, and E. Cambria, "Feature ensemble plus sample selection: Domain adaptation for sentiment classification," IEEE Intell. Syst., vol. 28, no. 3, pp. 10–18, May 2013.
[87] N. Gu, M. Fan, L. Du, and D. Ren, "Efficient sequential feature selection based on adaptive eigenspace model," Neurocomputing, vol. 161, pp. 199–209, Aug. 2015.
[88] Y. Zhang, D. Gong, Y. Hu, and W. Zhang, "Feature selection algorithm based on bare bones particle swarm optimization," Neurocomputing, vol. 148, pp. 150–157, Jan. 2015.
[89] D. Gong, Y. Hu, and Y. Zhang, "Feature selection method for unreliable data based on particle swarm optimization," Acta Electronica Sinica, vol. 7, no. 7, pp. 1320–1326, 2014.
[90] Z. Yong, G. Dun-Wei, and Z. Wan-Qiu, "Feature selection of unreliable data using an improved multi-objective PSO algorithm," Neurocomputing, vol. 171, pp. 1281–1290, Jan. 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231215010632
[91] J. DePasquale and R. Polikar, "Random feature subset selection for analysis of data with missing features," in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), Aug. 2007, pp. 2379–2384.
[92] A. Aussem and S. R. de Morais, "A conservative feature subset selection algorithm with missing data," Neurocomputing, vol. 73, nos. 4–6, pp. 585–590, 2010.
[93] G. Doquire and M. Verleysen, "Feature selection with missing data using mutual information estimators," Neurocomputing, vol. 90, pp. 3–11, Aug. 2012.
[94] M. Ramoni and P. Sebastiani, "Robust learning with missing data," Mach. Learn., vol. 45, no. 2, pp. 147–170, 2001.
[95] W. Zong, F. Wu, L.-K. Chu, and D. Sculli, "A discriminative and semantic feature selection method for text categorization," Int. J. Prod. Econ., vol. 165, pp. 215–222, Jul. 2015.
[96] Y. Zhang, D.-W. Gong, and M. Rong, "Multi-objective differential evolution algorithm for multi-label feature selection in classification," in Advances in Swarm and Computational Intelligence. Beijing, China: Springer, 2015, pp. 339–345.
[97] S. Peng, Q. Xu, X. B. Ling, X. Peng, W. Du, and L. Chen, "Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines," FEBS Lett., vol. 555, no. 2, pp. 358–362, 2003.
[98] C. H. Ooi and P. Tan, "Genetic algorithms applied to multi-class prediction for the analysis of gene expression data," Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.
[99] L. Yu, L. Hu, and L. Tang, "Stock selection with a novel sigmoid-based mixed discrete-continuous differential evolution algorithm," IEEE Trans. Knowl. Data Eng., vol. 28, no. 7, pp. 1891–1904, Jul. 2016.
[100] A. Gosh, S. Das, R. Mallipeddi, A. K. Das, and S. S. Dash, "A modified differential evolution with distance-based selection for continuous optimization in presence of noise," IEEE Access, vol. 5, pp. 26944–26964, 2017.
[101] Y. Zhang, D.-W. Gong, and J. Cheng, "Multi-objective particle swarm optimization approach for cost-based feature selection in classification," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 14, no. 1, pp. 64–75, Jan./Feb. 2017. [Online]. Available: https://doi.org/10.1109/TCBB.2015.2476796
[102] A. A. Naeini, M. Babadi, S. M. J. Mirzadeh, and S. Amini, "Particle swarm optimization for object-based feature selection of VHSR satellite images," IEEE Geosci. Remote Sens. Lett., vol. 15, no. 3, pp. 379–383, Mar. 2018.
[103] B. Tran, B. Xue, and M. Zhang, "A new representation in PSO for discretization-based feature selection," IEEE Trans. Cybern., vol. 48, no. 6, pp. 1733–1746, Jun. 2018.
[104] C. Yan, J. Ma, H. Luo, and J. Wang, "A hybrid algorithm based on binary chemical reaction optimization and tabu search for feature selection of high-dimensional biomedical data," Tsinghua Sci. Technol., vol. 23, no. 6, pp. 733–743, Dec. 2018.
[105] M. A. Tahir and A. Bouridane, "Novel round-robin tabu search algorithm for prostate cancer classification and diagnosis using multispectral imagery," IEEE Trans. Inf. Technol. Biomed., vol. 10, no. 4, pp. 782–793, Oct. 2006.
[106] S. A. Toussi, H. S. Yazdi, E. Hajinezhad, and S. Effati, "Eigenvector selection in spectral clustering using Tabu search," in Proc. IEEE 1st Int. eConf. Comput. Knowl. Eng. (ICCKE), Oct. 2011, pp. 75–80.
[107] I.-S. Oh, J.-S. Lee, and B.-R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, Nov. 2004.
[108] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, "Feature selection based on rough sets and particle swarm optimization," Pattern Recognit. Lett., vol. 28, no. 4, pp. 459–471, 2007.
[109] F. van den Bergh and A. P. Engelbrecht, "A study of particle swarm optimization particle trajectories," Inf. Sci., vol. 176, no. 8, pp. 937–971, 2006.
[110] E.-G. Talbi, L. Jourdan, J. Garcia-Nieto, and E. Alba, "Comparison of population based metaheuristics for feature selection: Application to microarray data classification," in Proc. ACS/IEEE Int. Conf. Comput. Syst. Appl., Mar./Apr. 2008, pp. 45–52.
[111] L.-Y. Chuang, C.-H. Yang, and J.-C. Li, "Chaotic maps based on binary particle swarm optimization for feature selection," Appl. Soft Comput., vol. 11, no. 1, pp. 239–248, 2011.
[112] A. Unler and A. Murat, "A discrete particle swarm optimization method for feature selection in binary classification problems," Eur. J. Oper. Res., vol. 206, no. 3, pp. 528–539, 2010.
[113] L.-Y. Chuang, C.-S. Yang, K.-C. Wu, and C.-H. Yang, "Gene selection and classification using Taguchi chaotic binary particle swarm optimization," Expert Syst. Appl., vol. 38, no. 10, pp. 13367–13377, 2011.
[114] U. K. Sikdar, A. Ekbal, and S. Saha, "Differential evolution based mention detection for anaphora resolution," in Proc. Annu. IEEE India Conf. (INDICON), Dec. 2013, pp. 1–6.
[115] A. Al-Ani, A. Alsukker, and R. N. Khushaba, "Feature subset selection using differential evolution and a wheel based search strategy," Swarm Evol. Comput., vol. 9, pp. 15–26, Apr. 2013.
[116] Y. Zhang, M. Rong, and D. Gong, "A multi-objective feature selection based on differential evolution," in Proc. IEEE Int. Conf. Control, Autom. Inf. Sci. (ICCAIS), Oct. 2015, pp. 302–306.
[117] C.-L. Huang, "ACO-based hybrid classification system with feature subset selection and model parameters optimization," Neurocomputing, vol. 73, nos. 1–3, pp. 438–448, 2009.
[118] S.-W. Lin, Z.-J. Lee, S.-C. Chen, and T.-Y. Tseng, "Parameter determination of support vector machine and feature selection using simulated annealing approach," Appl. Soft Comput., vol. 8, no. 4, pp. 1505–1512, 2008.
[119] Y. Zhang, J. Wu, and J. Cai, "Compact representation of high-dimensional feature vectors for large-scale image recognition and retrieval," IEEE Trans. Image Process., vol. 25, no. 5, pp. 2407–2419, May 2016.
[120] F. Gao, X. Zhang, Y. Huang, Y. Luo, X. Li, and L.-Y. Duan, "Data-driven lightweight interest point selection for large-scale visual search," IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2774–2787, Oct. 2018.
[121] J. M. N. Abad and A. Soleimani, "Novel feature selection algorithm for thermal prediction model," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 10, pp. 1831–1844, Oct. 2018.
[122] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[123] A. Unler, A. Murat, and R. B. Chinnam, "mr²PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification," Inf. Sci., vol. 181, no. 20, pp. 4625–4641, 2011.
[124] D. He, I. Rish, D. Haws, S. Teyssedre, Z. Karaman, and L. Parida, "MINT: Mutual information based transductive feature selection for genetic trait prediction," 2013. [Online]. Available: https://arxiv.org/abs/1310.1659
[125] V. Pappu, O. P. Panagopoulos, P. Xanthopoulos, and P. M. Pardalos, "Sparse proximal support vector machines for feature selection in high dimensional datasets," Expert Syst. Appl., vol. 42, pp. 9183–9191, 2015.
[126] X. Fang, Y. Xu, X. Li, Z. Fan, H. Liu, and Y. Chen, "Locality and similarity preserving embedding for feature selection," Neurocomputing, vol. 128, no. 5, pp. 304–315, Mar. 2014.
[127] J. Wang, P. Zhao, S. C. H. Hoi, and R. Jin, "Online feature selection and its applications," IEEE Trans. Knowl. Data Eng., vol. 26, no. 3, pp. 698–710, Mar. 2014.
[128] S. S. Naqvi, W. N. Browne, and C. Hollitt, "Feature quality-based dynamic feature selection for improving salient object detection," IEEE Trans. Image Process., vol. 25, no. 9, pp. 4298–4313, Sep. 2016.
[129] C. Tong and X. Shi, "Decentralized monitoring of dynamic processes based on dynamic feature selection and informative fault pattern dissimilarity," IEEE Trans. Ind. Electron., vol. 63, no. 6, pp. 3804–3814, Jun. 2016.
[130] M. Pratama, W. Pedrycz, and E. Lughofer, "Evolving ensemble fuzzy classifier," IEEE Trans. Fuzzy Syst., vol. 26, no. 5, pp. 2552–2567, Oct. 2018.
[131] J. Zhou, D. P. Foster, R. A. Stine, and L. H. Ungar, "Streamwise feature selection," J. Mach. Learn. Res., vol. 7, no. 1, pp. 1861–1885, Dec. 2006.
[132] S. Fong, R. Wong, and A. V. Vasilakos, "Accelerated PSO swarm search feature selection for data stream mining big data," IEEE Trans. Serv. Comput., vol. 9, no. 1, pp. 33–45, Jan. 2016.
[133] Y. Lin, Q. Hu, J. Zhang, and X. Wu, "Multi-label feature selection with streaming labels," Inf. Sci., vol. 372, pp. 256–275, Dec. 2016.
[134] S. Jin, F. Ye, Z. Zhang, K. Chakrabarty, and X. Gu, "Efficient board-level functional fault diagnosis with missing syndromes," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 35, no. 6, pp. 985–998, Jun. 2016.
[135] W. Xu, Z. He, E. Lo, and C.-Y. Chow, "Explaining missing answers to top-k SQL queries," IEEE Trans. Knowl. Data Eng., vol. 28, no. 8, pp. 2071–2085, Aug. 2016.
[136] F. Bu, Z. Chen, Q. Zhang, and X. Wang, "Incomplete big data clustering algorithm using feature selection and partial distance," in Proc. 5th Int. Conf. Digit. Home (ICDH), 2014, pp. 263–266.
[137] T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown, and D. Botstein, "Imputing missing data for gene expression arrays," Division Biostatist., Stanford Univ., Stanford, CA, USA, Tech. Rep., 1999.
[138] P. G. Clark, J. W. Grzymala-Busse, and W. Rzasa, "Mining incomplete data with singleton, subset and concept probabilistic approximations," Inf. Sci., vol. 280, pp. 368–384, Oct. 2014.
[139] F. Bu, Z. Chen, Q. Zhang, and L. T. Yang, "Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud," J. Supercomput., vol. 72, no. 8, pp. 2977–2990, 2015.
[140] L. Yuan et al., "Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data," NeuroImage, vol. 61, no. 3, pp. 622–632, 2012.
[141] W. Qian and W. Shu, "Mutual information criterion for feature selection from incomplete data," Neurocomputing, vol. 168, pp. 210–220, Nov. 2015.
[142] D. Song, Y. Luo, and J. Heflin, "Linking heterogeneous data in the semantic Web using scalable and domain-independent candidate selection," IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 143–156, Jan. 2017.
[143] S. Huda, J. Yearwood, H. F. Jelinek, M. M. Hassan, G. Fortino, and M. Buckland, "A hybrid feature selection with ensemble classification for imbalanced healthcare data: A case study for brain tumor diagnosis," IEEE Access, vol. 4, pp. 9145–9154, 2016.
[144] M. A. Hossain, X. Jia, and J. A. Benediktsson, "One-class oriented feature selection and classification of heterogeneous remote sensing images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, pp. 1606–1612, Apr. 2016.
[145] M. Wei, T. W. S. Chow, and R. H. M. Chan, "Heterogeneous feature subset selection using mutual information-based feature transformation," Neurocomputing, vol. 168, pp. 706–718, Nov. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S092523121500733X
[146] Y. Motai, N. A. Siddique, and H. Yoshida, "Heterogeneous data analysis: Online learning for medical-image-based diagnosis," Pattern Recognit., vol. 63, pp. 612–624, Mar. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320316302977
[147] S. Pölsterl, S. Conjeti, N. Navab, and A. Katouzian, "Survival analysis for high-dimensional, heterogeneous medical data: Exploring feature extraction as an alternative to feature selection," Artif. Intell. Med., vol. 72, pp. 1–11, Sep. 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0933365716300653
[148] T. W. S. Chow and D. Huang, "Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 213–224, Jan. 2005.
[149] J. Neumann, C. Schnörr, and G. Steidl, "Combined SVM-based feature selection and classification," Mach. Learn., vol. 61, nos. 1–3, pp. 129–150, 2005.
[150] T. W. S. Chow, P. Wang, and E. W. M. Ma, "A new feature selection scheme using a data distribution factor for unsupervised nominal data," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 499–509, Apr. 2008.
[151] N. Zhong, J. Dong, and S. Ohsuga, "Using rough sets with heuristics for feature selection," J. Intell. Inf. Syst., vol. 16, no. 3, pp. 199–214, 2001.
[152] Z. Pawlak and A. Skowron, "Rough sets: Some extensions," Inf. Sci., vol. 177, no. 1, pp. 28–40, 2007.
[153] R. Slowinski and D. Vanderpooten, "A generalized definition of rough approximations based on similarity," IEEE Trans. Knowl. Data Eng., vol. 12, no. 2, pp. 331–336, Mar. 2000.
[154] M.-G. Park and K.-J. Yoon, "Optimal key-frame selection for video-based structure-from-motion," Electron. Lett., vol. 47, no. 25, pp. 1367–1369, Dec. 2011.
[155] R. M. Balabin and S. V. Smirnov, "Variable selection in near-infrared spectroscopy: Benchmarking of feature selection methods on biodiesel data," Analytica Chim. Acta, vol. 692, nos. 1–2, pp. 63–72, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0003267011003539
[156] S.-P. Chen, "Time value of delays in unreliable production systems with mixed uncertainties of fuzziness and randomness," Eur. J. Oper. Res., vol. 255, no. 3, pp. 834–844, 2016.
[157] X.-P. Xie, D. Yue, and S.-L. Hu, "Fuzzy control design of nonlinear systems under unreliable communication links: A systematic homogenous polynomial approach," Inf. Sci., vols. 370–371, pp. 763–771, Nov. 2016.
[158] L. Yijing, G. Haixiang, L. Xiao, L. Yanan, and L. Jinling, "Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data," Knowl.-Based Syst., vol. 94, pp. 88–104, Feb. 2015.
[159] J. Fan, Z. Niu, Y. Liang, and Z. Zhao, "Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling," Neurocomputing, vol. 211, pp. 172–181, Oct. 2016.
[160] Z. Zhang, H. Chen, Y. Xu, J. Zhong, N. Lv, and S. Chen, "Multisensor-based real-time quality monitoring by means of feature extraction, selection and modeling for Al alloy in arc welding," Mech. Syst. Signal Process., vols. 60–61, pp. 151–165, Aug. 2015.
[161] S. Vluymans, D. S. Tarragó, Y. Saeys, C. Cornelis, and F. Herrera, "Fuzzy rough classifiers for class imbalanced multi-instance data," Pattern Recognit., vol. 53, pp. 36–45, May 2016.
[162] G. Haixiang, L. Yijing, L. Yanan, L. Jinling, and L. Xiao, "BPSO-AdaBoost-KNN ensemble learning algorithm for multi-class imbalanced data classification," Eng. Appl. Artif. Intell., vol. 49, pp. 176–193, Mar. 2016.
[163] W. C. Yeh, Y. T. Yang, and C. M. Lai, "A hybrid simplified swarm optimization method for imbalanced data feature selection," Austral. Acad. Bus. Econ. Rev., vol. 2, no. 3, pp. 263–275, 2017.
[164] W. D. Fisher, T. K. Camp, and V. V. Krzhizhanovskaya, "Anomaly detection in earth dam and levee passive seismic data using support vector machines and automatic feature selection," J. Comput. Sci., vol. 20, pp. 143–153, May 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877750316304185
[165] L. Yin, Y. Ge, K. Xiao, X. Wang, and X. Quan, "Feature selection for high-dimensional imbalanced data," Neurocomputing, vol. 105, no. 3, pp. 3–11, 2013.
[166] M. Alibeigi, S. Hashemi, and A. Hamzeh, "DBFS: An effective density based feature selection scheme for small sample size and high dimensional imbalanced data sets," Data Knowl. Eng., vols. 81–82, no. 4, pp. 67–103, 2012.
[167] U. Mahdiyah, M. I. Irawan, and E. M. Imah, "Integrating data selection and extreme learning machine for imbalanced data," Procedia Comput. Sci., vol. 59, pp. 221–229, Aug. 2015.
[168] J. A. Sáez, B. Krawczyk, and M. Woźniak, "Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets," Pattern Recognit., vol. 57, pp. 164–178, Sep. 2016.
[169] J. F. Díez-Pastor, J. J. Rodríguez, C. García-Osorio, and L. I. Kuncheva, "Random balance: Ensembles of variable priors classifiers for imbalanced data," Knowl.-Based Syst., vol. 85, pp. 96–111, Sep. 2015.
[170] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, "A novel ensemble method for classifying imbalanced data," Pattern Recognit., vol. 48, no. 5, pp. 1623–1637, 2015.
[171] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[172] H. Yu, J. Ni, and J. Zhao, "ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data," Neurocomputing, vol. 101, no. 3, pp. 309–318, 2013.
[173] A. K. Uysal, "On two-stage feature selection methods for text classification," IEEE Access, vol. 6, pp. 43233–43251, 2018.
[174] A. Onan and S. Korukoğlu, "A feature selection model based on genetic rank aggregation for text sentiment classification," J. Inf. Sci., vol. 43, no. 1, pp. 25–38, 2017.
[175] D. Agnihotri, K. Verma, and P. Tripathi, "Variable global feature selection scheme for automatic classification of text documents," Expert Syst. Appl., vol. 81, pp. 268–281, Sep. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417417302208
[176] M. Labani, P. Moradi, F. Ahmadizar, and M. Jalili, "A novel multivariate filter method for feature selection in text classification problems," Eng. Appl. Artif. Intell., vol. 70, pp. 25–37, Apr. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0952197617303172
[177] B. Parlak and A. K. Uysal, "On feature weighting and selection for medical document classification," in Developments and Advances in Intelligent Systems and Applications, Á. Rocha and L. Reis, Eds. Cham, Switzerland: Springer, 2018, pp. 269–282.
[178] T. Nakanishi, "A feature selection method for comparison of each concept in big data," in Proc. IEEE/ACIS 14th Int. Conf. Comput. Inf. Sci. (ICIS), Jun./Jul. 2015, pp. 229–234.
[179] S. Wang, D. Li, Y. Wei, and H. Li, "A feature selection method based on Fisher's discriminant ratio for text sentiment classification," in Web Information Systems and Mining, 2009, pp. 88–97.
[180] K. Kesorn and S. Poslad, "An enhanced bag-of-visual word vector space model to represent visual content in athletics images," IEEE Trans. Multimedia, vol. 14, no. 1, pp. 211–222, Feb. 2012.
[181] I. Jain, V. K. Jain, and R. Jain, "Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification," Appl. Soft Comput., vol. 62, pp. 203–215, Jan. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S156849461730577X
[182] Y. Li, T. Li, and H. Liu, "Recent advances in feature selection and its applications," Knowl. Inf. Syst., vol. 53, no. 3, pp. 551–577, Dec. 2017. [Online]. Available: https://doi.org/10.1007/s10115-017-1059-8
[183] S. Turgut, M. Dağtekin, and T. Ensari, "Microarray breast cancer data classification using machine learning methods," in Proc. Electr. Electron., Comput. Sci., Biomed. Eng. Meeting (EBBT), Apr. 2018, pp. 1–3.
[184] J. Apolloni, G. Leguizamón, and E. Alba, "Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments," Appl. Soft Comput., vol. 38, pp. 922–932, Jan. 2016.
[185] S. Tabakhi, A. Najafi, R. Ranjbar, and P. Moradi, "Gene selection for microarray data classification using a novel ant colony optimization," Neurocomputing, vol. 168, pp. 1024–1036, Nov. 2015.
[186] G. Taşkın, H. Kaya, and L. Bruzzone, "Feature selection based on high dimensional model representation for hyperspectral images," IEEE Trans. Image Process., vol. 26, no. 6, pp. 2918–2928, Jun. 2017.
[187] H. Lang, J. Zhang, X. Zhang, and J. Meng, "Ship classification in SAR image by joint feature and classifier selection," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 212–216, Feb. 2016.
[188] P. Bolourchi, H. Demirel, and S. Uysal, "Entropy-score-based feature selection for moment-based SAR image classification," Electron. Lett., vol. 54, no. 9, pp. 593–595, May 2018. [Online]. Available: https://digital-library.theiet.org/content/journals/10.1049/el.2017.4419
[189] I. Garali, M. Adel, S. Bourennane, and E. Guedj, "Histogram-based features selection and volume of interest ranking for brain PET image classification," IEEE J. Transl. Eng. Health Med., vol. 6, 2018, Art. no. 2100212.
[190] L. Zhang, Q. Zhang, B. Du, X. Huang, Y. Y. Tang, and D. Tao, "Simultaneous spectral-spatial feature selection and extraction for hyperspectral images," IEEE Trans. Cybern., vol. 48, no. 1, pp. 16–28, Jan. 2018.
[191] C. Shang and D. Barnes, "Fuzzy-rough feature selection aided support vector machines for Mars image classification," Comput. Vis. Image Understand., vol. 117, no. 3, pp. 202–213, 2013.
[192] A. Vavilin and K.-H. Jo, "Automatic context analysis for image classification and retrieval based on optimal feature subset selection," Neurocomputing, vol. 116, no. 10, pp. 201–207, 2013.
[193] C.-Y. Chang, S.-J. Chen, and M.-F. Tsai, "Application of support-vector-machine-based method for feature selection and classification of thyroid nodules in ultrasound images," Pattern Recognit., vol. 43, no. 10, pp. 3494–3506, Oct. 2010.
[194] X. Zhou, X. Gao, J. Wang, H. Yu, Z. Wang, and Z. Chi, "Eye tracking data guided feature selection for image classification," Pattern Recognit., vol. 63, pp. 56–70, Mar. 2017.
[195] Z. Shi, Z. Zou, and C. Zhang, "Real-time traffic light detection with adaptive background suppression filter," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 3, pp. 690–700, Mar. 2016.
[196] C.-C. Shen and J.-H. Yan, "High-order Hadamard-encoded transmission for tissue background suppression in ultrasound contrast imaging: Memory effect and decoding schemes," IEEE Trans. Ultrason., Ferroelectr., Freq. Control, vol. 66, no. 1, pp. 26–37, Jan. 2019.
[197] T. M. Nguyen, Q. M. J. Wu, and D. Mukherjee, "An online unsupervised feature selection and its application for background suppression," in Proc. IEEE 12th Conf. Comput. Robot Vis. (CRV), Jun. 2015, pp. 161–168.
[198] Y. Zhang et al., "Parallel processing systems for big data: A survey," Proc. IEEE, vol. 104, no. 11, pp. 2114–2136, Nov. 2016.
[199] R. Trichet and F. Bremond, "Dataset optimization for real-time pedestrian detection," IEEE Access, vol. 6, pp. 7719–7727, 2018.
[200] S. Lekha and M. Suchetha, "A novel 1-D convolution neural network with SVM architecture for real-time detection applications," IEEE Sensors J., vol. 18, no. 2, pp. 724–731, Jan. 2018.

MIAO RONG received the B.S. degree in electrical engineering and automation from the China University of Mining and Technology, Xuzhou, China, in 2014, where she is currently pursuing the Ph.D. degree in control theory and control engineering. Her main research interests include data mining and multiobjective optimization.

DUNWEI GONG received the B.S. degree in applied mathematics from the China University of Mining and Technology, Xuzhou, China, in 1992, the M.S. degree in control theory and its applications from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 1995, and the Ph.D. degree in control theory and control engineering from the China University of Mining and Technology, in 1999. He is currently a Professor and the Ph.D. Advisor of the School of Information and Electrical Engineering, China University of Mining and Technology. He is also with the School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, China. His main research interests include evolutionary computation, intelligence optimization, and data mining.

XIAOZHI GAO received the B.Sc. and M.Sc. degrees from the Harbin Institute of Technology, China, in 1993 and 1996, respectively, and the D.Sc. (Tech.) degree from the Helsinki University of Technology (now Aalto University), Finland, in 1999. He was with the Helsinki University of Technology from 2004 to 2018. He is currently with the School of Computing, University of Eastern Finland, Kuopio, Finland. He is also a Guest/Visiting Professor with the Harbin Institute of Technology, Beijing Normal University, and Shanghai Maritime University, China. He has published over 290 technical papers in refereed journals and international conferences. His current research interests are nature-inspired computing methods (e.g., neural networks, fuzzy logic, evolutionary computing, swarm intelligence, and artificial immune systems) with their applications in optimization, data mining, control, signal processing, and industrial electronics. He was an invited plenary speaker at the 2014 International Workshop on Synchro-Phasor Measurements for Smart Grid, the 2006 International Workshop on Nature Inspired Cooperative Strategies for Optimization, and the 2001 NATO Advanced Research Workshop on Systematic Organization of Information in Fuzzy Systems. He was the General Chair of the 2005 IEEE Mid-Summer Workshop on Soft Computing in Industrial Applications. He is an Associate Editor of Applied Soft Computing, the International Journal of Machine Learning and Cybernetics, Intelligent Automation and Soft Computing, the International Journal of Innovative Computing, Information and Control, and the International Journal of Swarm Intelligence and Evolutionary Computation. He also serves on the editorial boards of a few international journals.