Toward Integrating Feature Selection Algorithms for Classification and Clustering
Abstract—This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for
classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies,
evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection
algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature
selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature
selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of
doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are
included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of
feature selection research and development.
Index Terms—Feature selection, classification, clustering, categorizing framework, unifying platform, real-world applications.
1 INTRODUCTION
Some popular independent criteria are distance measures, information measures, dependency measures, and consistency measures [3], [5], [34], [53].

Distance measures are also known as separability, divergence, or discrimination measures. For a two-class problem, a feature X is preferred to another feature Y if X induces a greater difference between the two-class conditional probabilities than Y, because we try to find the feature that can separate the two classes as far as possible. X and Y are indistinguishable if the difference is zero.

Information measures typically determine the information gain from a feature. The information gain from a feature X is defined as the difference between the prior uncertainty and the expected posterior uncertainty using X. Feature X is preferred to feature Y if the information gain from X is greater than that from Y.

Dependency measures are also known as correlation measures or similarity measures. They measure the ability to predict the value of one variable from the value of another. In feature selection for classification, we look for how strongly a feature is associated with the class. A feature X is preferred to another feature Y if the association between feature X and class C is higher than the association between Y and C. In feature selection for clustering, the association between two random features measures the similarity between the two.

Consistency measures are characteristically different from the above measures because of their heavy reliance on the class information and the use of the Min-Features bias [3] in selecting a subset of features. These measures attempt to find a minimum number of features that separate classes as consistently as the full set of features can. An inconsistency is defined as two instances having the same feature values but different class labels.
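To make the independent criteria above concrete, the following sketch (ours, not taken from any surveyed algorithm) computes two of them for a single discrete feature on labelled data: the information gain of the feature and its inconsistency count, that is, the number of instances that share a feature value but do not belong to the majority class of that value.

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    # Shannon entropy of a collection of class labels.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    # Prior class uncertainty minus the expected posterior uncertainty given the feature.
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    expected_posterior = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - expected_posterior

def inconsistency_count(feature_values, labels):
    # Instances that agree on the feature value but disagree with the majority class of that value.
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    return sum(len(g) - max(Counter(g).values()) for g in groups.values())

# Toy two-class data: X separates the classes perfectly, Y does not.
X = [0, 0, 1, 1, 1, 0]
Y = [0, 1, 0, 1, 0, 1]
C = ['a', 'a', 'b', 'b', 'b', 'a']
print(information_gain(X, C), information_gain(Y, C))        # X scores higher than Y
print(inconsistency_count(X, C), inconsistency_count(Y, C))  # X has fewer inconsistencies

A distance or dependency measure would be used in the same way: score each candidate feature (or feature subset) and prefer the one with the better score.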
2.2.2 Dependent Criteria
A dependent criterion used in the wrapper model requires a predetermined mining algorithm in feature selection and uses the performance of the mining algorithm applied on the selected subset to determine which features are selected. It usually gives superior performance as it finds features better suited to the predetermined mining algorithm, but it also tends to be more computationally expensive, and may not be suitable for other mining algorithms [6]. For example, in a task of classification, predictive accuracy is widely used as the primary measure. It can be used as a dependent criterion for feature selection. As features are selected by the classifier that later on uses these selected features in predicting the class labels of unseen instances, accuracy is normally high, but it is computationally rather costly to estimate accuracy for every feature subset [41].

In a task of clustering, the wrapper model of feature selection tries to evaluate the goodness of a feature subset by the quality of the clusters resulting from applying the clustering algorithm on the selected subset. There exist a number of heuristic criteria for estimating the quality of clustering results, such as cluster compactness, scatter separability, and maximum likelihood. Recent work on developing dependent criteria in feature selection for clustering can be found in [20], [27], [42].

2.3 Stopping Criteria
A stopping criterion determines when the feature selection process should stop. Some frequently used stopping criteria are:

1. The search completes.
2. Some given bound is reached, where a bound can be a specified number (minimum number of features or maximum number of iterations).
3. Subsequent addition (or deletion) of any feature does not produce a better subset.
4. A sufficiently good subset is selected (e.g., a subset may be sufficiently good if its classification error rate is less than the allowable error rate for a given task).

2.4 Result Validation
A straightforward way for result validation is to directly measure the result using prior knowledge about the data. If we know the relevant features beforehand, as in the case of synthetic data, we can compare this known set of features with the selected features. Knowledge of the irrelevant or redundant features can also help: we do not expect them to be selected. In real-world applications, however, we usually do not have such prior knowledge. Hence, we have to rely on some indirect methods by monitoring the change of mining performance with the change of features. For example, if we use classification error rate as a performance indicator for a mining task, then for a selected feature subset we can simply conduct a "before-and-after" experiment to compare the error rate of the classifier learned on the full set of features and that learned on the selected subset [53], [89].
3 A CATEGORIZING FRAMEWORK FOR FEATURE SELECTION ALGORITHMS
Given the key steps of feature selection, we now introduce a categorizing framework that groups many existing feature selection algorithms into distinct categories, and summarize individual algorithms based on this framework.

3.1 A Categorizing Framework
There exists a vast body of available feature selection algorithms. In order to better understand the inner workings of each algorithm and the commonalities and differences among them, we develop a three-dimensional categorizing framework (shown in Table 1) based on the previous discussions. We understand that search strategies and evaluation criteria are two dominating factors in designing a feature selection algorithm, so they are chosen as two dimensions in the framework. In Table 1, under Search Strategies, algorithms are categorized into Complete, Sequential, and Random. Under Evaluation Criteria, algorithms are categorized into Filter, Wrapper, and Hybrid. We consider Data Mining Tasks as a third dimension because the availability of class information in Classification or Clustering tasks affects the evaluation criteria used in feature selection algorithms (as discussed in Section 2.2). In addition to these three basic dimensions, algorithms within the Filter category are further distinguished by specific evaluation criteria, including Distance, Information, Dependency, and Consistency. Within the Wrapper category, Predictive Accuracy is used for Classification and Cluster Goodness for Clustering.

TABLE 1
Categorization of Feature Selection Algorithms in a Three-Dimensional Framework

Many feature selection algorithms collected in Table 1 can be grouped into distinct categories according to these characteristics. The categorizing framework serves three roles. First, it reveals relationships among different algorithms: Algorithms in the same block (category) are most similar to each other (i.e., designed with similar search strategies and evaluation criteria, and for the same type of data mining tasks). Second, it enables us to focus our selection of feature selection algorithms for a given task on a relatively small number of algorithms out of the whole body. For example, knowing that feature selection is performed for classification, that the predictive accuracy of a classifier is suitable as the evaluation criterion, and that complete search is not suitable for the limited time allowed, we can conveniently limit our choices to two groups of algorithms in Table 1: one is defined by Classification, Wrapper, and Sequential; the other by Classification, Wrapper, and Random. Both groups have more than one algorithm available. Third, the framework also reveals what is missing in the current collection of feature selection algorithms. As we can see, there are many empty blocks in Table 1, indicating that no feature selection algorithm exists for these combinations, which might be suitable for potential future work. In particular, for example, current feature selection algorithms for clustering are limited to sequential search.

With the large number of existing algorithms seen in the framework, we summarize all the algorithms into three generalized algorithms corresponding to the filter model, the wrapper model, and the hybrid model, respectively.

3.2 Filter Algorithm
Algorithms within the filter model are illustrated through a generalized filter algorithm (shown in Table 2). For a given

TABLE 2
A Generalized Filter Algorithm
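The generalized filter algorithm can be sketched as follows, assuming a sequential forward search driven by an independent measure such as those of Section 2.2.1; this is our illustration, not the exact pseudocode of Table 2, and the function and parameter names are ours.

def filter_select(features, data, labels, measure, max_size):
    # Generalized filter search (sequential forward variant): grow the subset
    # greedily using only an independent measure, with no mining algorithm involved.
    selected, best_score = [], float('-inf')
    while len(selected) < max_size:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension with the independent measure.
        best_f = max(candidates, key=lambda f: measure(selected + [f], data, labels))
        score = measure(selected + [best_f], data, labels)
        if score <= best_score:
            break                 # stopping criterion: no further improvement
        selected.append(best_f)
        best_score = score
    return selected

A wrapper counterpart would simply replace measure with the estimated accuracy of a predetermined mining algorithm on each candidate subset.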
TABLE 4
A Generalized Hybrid Algorithm
different cardinalities. Basically, it starts the search from a given subset S0 (usually, an empty set in sequential forward selection) and iterates to find the best subsets at each increasing cardinality. In each round, for a best subset with cardinality c, it searches through all possible subsets of cardinality c + 1 by adding one feature from the remaining features. Each newly generated subset S' with cardinality c + 1 is evaluated by an independent measure M and compared with the previous best one. If S' is better, it becomes the current best subset S'_best at level c + 1. At the end of each iteration, a mining algorithm A is applied on S'_best at level c + 1 and the quality of the mined result is compared with that from the best subset at level c. If S'_best is better, the algorithm continues to find the best subset at the next level; otherwise, it stops and outputs the current best subset as the final best subset. The quality of results from a mining algorithm provides a natural stopping criterion in the hybrid model.
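This level-by-level procedure translates directly into pseudocode. The sketch below is our paraphrase of the description above, not the exact content of Table 4: measure stands for the independent measure M, and mine_quality for the quality of the result mined by algorithm A on a candidate subset.

def hybrid_select(features, data, labels, measure, mine_quality):
    best, best_quality = [], float('-inf')   # S0: the empty set in forward selection
    while len(best) < len(features):
        # The independent measure M picks the best subset at cardinality c + 1.
        candidates = [best + [f] for f in features if f not in best]
        next_best = max(candidates, key=lambda s: measure(s, data, labels))
        # The mining algorithm A decides whether the larger subset is really better.
        next_quality = mine_quality(next_best, data, labels)
        if next_quality <= best_quality:
            break                            # mined quality stops improving: stop here
        best, best_quality = next_best, next_quality
    return best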
4 TOWARD AN INTEGRATED SYSTEM FOR INTELLIGENT FEATURE SELECTION
Research on feature selection has been active for decades with attempts to improve well-known algorithms or to develop new ones. The proliferation of feature selection algorithms, however, has not brought about a general methodology that allows for intelligent selection from existing algorithms. In order to make a "right" choice, a user not only needs to know the domain well (this is usually not a problem for the user), but also is expected to understand technical details of available algorithms (discussed in previous sections). Therefore, the more algorithms available, the more challenging it is to choose a suitable one for an application. Consequently, a large number of algorithms are never attempted in practice and only a handful of algorithms are routinely used. There is thus a pressing need for intelligent feature selection that can automatically recommend the most suitable algorithm among many for a given application. In this section, we present an integrated approach to intelligent feature selection. First, we introduce a unifying platform which serves as an intermediate step toward building an integrated system for intelligent feature selection. Second, we illustrate the idea through a preliminary system based on our research.

4.1 A Unifying Platform
In Section 3.1, we developed a categorizing framework based on three dimensions (search strategies, evaluation criteria, and data mining tasks) from an algorithm designer's perspective. However, it would be impractical to require a domain expert or a user to keep abreast of such technical details about feature selection algorithms. Moreover, in most cases, it is not sufficient to decide the most suitable algorithm based merely on this framework. Recall the two groups of algorithms identified by the three dimensions in Section 3.1: each group still contains quite a few candidate algorithms. Assuming that we only have three wrapper algorithms, WSFG and WSBG in one group and LVW in the other group, additional information is required to decide the most suitable one for the given task. We propose a unifying platform (shown in Fig. 2) that expands the categorizing framework by introducing more dimensions from a user's perspective.

At the top, knowledge and data about feature selection are the two key determining factors. Currently, the knowledge factor covers the Purpose of feature selection, Time concern, expected Output Type, and the M/N Ratio, the ratio between the expected number of selected features M and the total number of original features N. The data factor covers Class Information, Feature Type, Quality of data, and the N/I Ratio, the ratio between the number of features N and the number of instances I. Each dimension is discussed below.

The purpose of feature selection can be broadly categorized into visualization, data understanding, data cleaning, redundancy and/or irrelevancy removal, and performance (e.g., predictive accuracy and comprehensibility) enhancement. Recall that feature selection algorithms are categorized into the filter model, the wrapper model, and the hybrid model. Accordingly, we can also summarize the different purposes of feature selection into these three categories to form a generic task hierarchy, as different purposes imply different evaluation criteria and, thus, guide the selection of feature selection algorithms differently. For the general purpose of redundancy and/or irrelevancy removal, algorithms in the filter model are good choices as they are unbiased and fast. To enhance mining performance, algorithms in the wrapper model should be preferred to those in the filter model as they are better suited to the mining algorithms [44], [48].
or M'' (shown in Fig. 3), where M' is an estimate of M by SetCover and M'' an estimate of M by QBB. With this system, Focus or ABB is recommended when either M' or N - M' is small, because they guarantee the optimal selected subset. However, the two could take an impractically long time to converge when neither condition holds. Therefore, either SetCover or QBB will be used, based on the comparison of M' and M''. These two algorithms do not guarantee optimal subsets, but they are efficient in generating near-optimal subsets.
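The recommendation logic just described can be expressed as a small dispatch rule. The sketch below paraphrases it; the threshold small_bound is our placeholder, and which of SetCover and QBB wins the final comparison is our guess, since the text only states that the comparison of M' and M'' decides.

def recommend(n_features, m_setcover, m_qbb, small_bound=10):
    # n_features = N, m_setcover = M' (estimate by SetCover), m_qbb = M'' (estimate by QBB).
    if m_setcover <= small_bound:
        return "Focus"      # few features expected to be selected: exhaustive Focus is feasible
    if n_features - m_setcover <= small_bound:
        return "ABB"        # few features expected to be removed: ABB is feasible
    # Neither condition holds: settle for an efficient, near-optimal algorithm.
    return "SetCover" if m_setcover <= m_qbb else "QBB"    # assumed tie-breaking rule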
The example in Fig. 3 verifies the idea of automatically choosing a suitable feature selection algorithm within a limited scope based on the unifying platform. All four algorithms share the following characteristics: using an independent evaluation criterion (i.e., the filter model), searching for a minimum feature subset, being time critical, and dealing with labelled data. The preliminary integrated system uses the M/N ratio to guide the selection of feature selection algorithms. How to substantially extend this preliminary work to a fully integrated system that incorporates all the factors specified in the unifying platform remains a challenging problem.

After presenting the concepts and state-of-the-art algorithms with a categorizing framework and a unifying platform, we now examine the use of feature selection in real-world data mining applications, where it has found many successes.

5 REAL-WORLD APPLICATIONS OF FEATURE SELECTION
The essence of these successful applications lies in the recognition of a need for effective data preprocessing: data mining can be effectively accomplished with the aid of feature selection. Data is often collected for many reasons other than data mining (e.g., required by law, easy to collect, or simply for the purpose of book-keeping). In real-world applications, one often encounters problems such as too many features, individual features unable to independently capture significant characteristics of the data, high dependency among the individual features, and emergent behaviors of combined features. Humans are ineffective at formulating and understanding hypotheses when data sets have large numbers of variables (possibly thousands in cases involving demographics and hundreds of thousands in cases involving Web browsing, microarray data analysis, or text document analysis), and people would find it easier to understand aspects of the problem in lower-dimensional subspaces [30], [72]. Feature selection can reduce the dimensionality to enable many data mining algorithms to work effectively on data with large dimensionality. Some illustrative applications of feature selection are showcased here.

Text Categorization. Text categorization [50], [70] is the problem of automatically assigning predefined categories to free text documents. This problem is of great practical importance given the massive volume of online text available through the World Wide Web, e-mails, and digital libraries. A major characteristic, or difficulty, of text categorization problems is the high dimensionality of the feature space. The original feature space consists of many unique terms (words or phrases) that occur in documents, and the number of terms can be hundreds of thousands for even a moderate-sized text collection. This is prohibitively high for many mining algorithms. Therefore, it is highly desirable to reduce the original feature space without sacrificing categorization accuracy. In [94], different feature selection methods are evaluated and compared in the reduction of a high-dimensional feature space in text categorization problems. It is reported that the methods under evaluation can effectively remove 50 percent to 90 percent of the terms while maintaining the categorization accuracy.
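As an illustration of the kind of term filtering evaluated in [94], the sketch below ranks the vocabulary of a bag-of-words matrix with a per-term score (information gain, one of the criteria studied there, or any measure from Section 2.2.1) and keeps only the top k terms; the function names and the binary term-presence encoding are our choices.

import numpy as np

def top_k_terms(doc_term_counts, labels, score_fn, k):
    # doc_term_counts: documents-by-terms count matrix; score_fn scores one term against the labels.
    presence = doc_term_counts > 0                     # binary term-presence encoding
    scores = np.array([score_fn(presence[:, j], labels)
                       for j in range(presence.shape[1])])
    keep = np.argsort(scores)[::-1][:k]                # indices of the k highest-scoring terms
    return keep, doc_term_counts[:, keep]

Dropping the lowest-scoring 50 to 90 percent of the terms in this way is the kind of reduction reported in [94].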
Image Retrieval. Feature selection is applied in [86] to content-based image retrieval. Recent years have seen a rapid increase in the size and number of image collections from both civilian and military equipment. However, we cannot access or make use of the information unless it is organized so as to allow efficient browsing, searching, and retrieving. Content-based image retrieval [77] is proposed to effectively handle large image collections. Instead of being manually annotated by text-based keywords, images are indexed by their own visual contents (features), such as color, texture, and shape. One of the biggest problems encountered in trying to make content-based image retrieval truly scalable to large image collections is still the "curse of dimensionality" [37]. As suggested in [77], the dimensionality of the feature space is normally of the order of 10^2. Dimensionality reduction is a promising approach to this problem. The image retrieval system proposed in [86] uses the theories of optimal projection to achieve optimal feature selection. Relevant features are then used to index images for efficient retrieval.

Customer Relationship Management. A case of feature selection is presented in [69] for customer relationship management. In a context where each customer means substantial revenue and the loss of one will likely trigger a significant segment to defect, it is imperative to have a team of highly experienced experts monitor each customer's intention and movement based on massively collected data. A set of key indicators is used by the team and has proven useful in predicting potential defectors. The problem is that it is difficult to find new indicators describing the dynamically changing business environment among many possible indicators (features). The machine-recorded data is simply too enormous for any human expert to browse and obtain any insight from. Feature selection is employed to search for new potential indicators in a dynamically changing environment. They are later presented to experts for scrutiny and adoption. This approach considerably improves the team's efficiency in finding new changing indicators.

Intrusion Detection. As network-based computer systems play increasingly vital roles in modern society, they have become the targets of our enemies and criminals. The security of a computer system is compromised when an intrusion takes place. Intrusion detection is often used as one way to protect computer systems. In [49], Lee et al. proposed a systematic data mining framework for analyzing audit data and constructing intrusion detection models.
Under this framework, a large amount of audit data is first analyzed using data mining algorithms in order to obtain the frequent activity patterns. These patterns are then used to guide the selection of system features as well as the construction of additional temporal and statistical features for another phase of automated learning. Classifiers based on these selected features are then inductively learned using the appropriately formatted audit data. These classifiers can be used as intrusion detection models since they can classify whether an observed system activity is "legitimate" or "intrusive." Feature selection plays an important role in building classification models for intrusion detection.

Genomic Analysis. Structural and functional data from analysis of the human genome has increased manyfold in recent years, presenting enormous opportunities and challenges for data mining [91], [96]. In particular, gene expression microarray is a rapidly maturing technology that provides the opportunity to assay the expression levels of thousands or tens of thousands of genes in a single experiment. These assays provide the input to a wide variety of data mining tasks, including classification and clustering. However, the number of instances in these experiments is often severely limited. In [91], for example, a case involving only 38 training data points in a 7,130-dimensional space is used to exemplify a situation that is becoming increasingly common in applications of data mining to molecular biology. In this extreme of very few observations on a large number of features, Xing et al. [91] investigated the possible use of feature selection on a microarray classification problem. All the classifiers tested in the experiments performed significantly better in the reduced feature space than in the full feature space.

6 CONCLUDING REMARKS AND FUTURE DIRECTIONS
This survey provides a comprehensive overview of various aspects of feature selection. We introduce two architectures, a categorizing framework and a unifying platform. They categorize the large body of feature selection algorithms, reveal future directions for developing new algorithms, and guide the selection of algorithms for intelligent feature selection. The categorizing framework is developed from an algorithm designer's viewpoint that focuses on the technical details of the general feature selection process. A new feature selection algorithm can be incorporated into the framework according to the three dimensions. The unifying platform is developed from a user's viewpoint and covers the user's knowledge about the domain and data for feature selection. The unifying platform is a necessary step toward building an integrated system for intelligent feature selection. The ultimate goal for intelligent feature selection is to create an integrated system that will automatically recommend the most suitable algorithm(s) to the user while hiding all technical details irrelevant to an application.

As data mining develops and expands to new application areas, feature selection also faces new challenges. We present here some challenges in the research and development of feature selection.

Feature Selection with Large Dimensionality. Classically, the dimensionality N is considered large if it is in the range of hundreds. However, in some recent applications of feature selection, the dimensionality can be tens or hundreds of thousands. Such high dimensionality causes two major problems for feature selection. One is the so-called "curse of dimensionality" [37]. As most existing feature selection algorithms have quadratic or higher time complexity in N, it is difficult for them to scale up to high dimensionality. Since algorithms in the filter model use evaluation criteria that are less computationally expensive than those of the wrapper model, the filter model is often preferred to the wrapper model when dealing with large dimensionality. Recently, algorithms of the hybrid model [15], [91] have been considered for handling data sets with high dimensionality. These algorithms focus on combining filter and wrapper algorithms to achieve the best possible performance with a particular mining algorithm at a time complexity similar to that of filter algorithms. Therefore, more efficient search strategies and evaluation criteria are needed for feature selection with large dimensionality. An efficient correlation-based filter algorithm is introduced in [95] to effectively handle large-dimensional data with class information. Another difficulty faced by feature selection with data of large dimensionality is the relative shortage of instances. That is, the dimensionality N can sometimes greatly exceed the number of instances I. In such cases, we should consider algorithms that work intensively along the I dimension, as seen in [91].

Feature Selection with Active Instance Selection. Traditional feature selection algorithms perform dimensionality reduction using whatever training data is given to them. When the training data set is very large, random sampling [14], [33] is commonly used to sample a subset of instances. However, random sampling is blind in that it does not exploit any data characteristics. The concept of active feature selection is first introduced and studied in [57]. Active feature selection promotes the idea of actively selecting instances for feature selection. It avoids pure random sampling and is realized by selective sampling [57], [60], which takes advantage of data characteristics when selecting instances. The key idea of selective sampling is to select only those instances with high probabilities of being informative in determining feature relevance. Selective sampling aims to achieve better or equally good feature selection results with a significantly smaller number of instances than random sampling. Although some selective sampling methods based on data variance or class information have proven effective on representative algorithms [57], [60], more research efforts are needed to investigate the effectiveness of selective sampling over the vast body of feature selection algorithms.

Feature Selection with New Data Types. The field of feature selection develops quickly, as data mining is an application-driven field where research questions tend to be motivated by real-world data sets. A broad spectrum of formalisms and techniques has been proposed in a large number of applications.
For example, work on feature selection mainly focused on labeled data until 1997. Since 1998, we have observed the increasing use of feature selection for unlabeled data. The best-known data type in traditional data analysis, data mining, and feature selection is N-dimensional vectors of measurements on I instances (or objects, individuals). Such data is often referred to as multivariate data and can be thought of as an N × I data matrix [84]. Since data mining emerged, a common form of data in many business contexts is records of individuals conducting transactions in applications like market basket analysis, insurance, direct-mail marketing, and health care. This type of data, if considered as an N × I matrix, has a very large number of possible attributes but is very sparse. For example, a typical market basket (an instance) can have tens of items purchased out of hundreds of thousands of available items. The significant and rapid growth of computer and Internet/Web techniques makes some other types of data more commonly available: text-based data (e.g., e-mails, online news, newsgroups) and semistructured data (e.g., HTML, XML). The wide deployment of various sensors, surveillance cameras, and Internet/Web monitoring lately poses the challenge of dealing with yet another type of data: data streams. Such data arrives over time, in a nearly continuous fashion, and is often available only once or for a limited amount of time [84]. As we have witnessed a growing amount of work on feature selection for unlabeled data, we can certainly anticipate more research and development on new types of data for feature selection. It does not seem reasonable to suggest that the existing algorithms can be easily modified for these new data types.

Related Challenges for Feature Selection. Shown in Section 5 are some exemplary cases of applying feature selection as a preprocessing step in very large databases collected from Internet, business, scientific, and government applications. Novel feature selection applications will be found where creative data reduction has to be conducted, as our ability to capture and store data has far outpaced our ability to process and utilize it [30]. Feature selection can help focus on relevant parts of data and improve our ability to process data. New data mining applications [4], [45] arise as techniques evolve. Scaling data mining algorithms to large databases is a pressing issue. As feature selection is one step in data preprocessing, changes need to be made to classic algorithms that require multiple database scans and/or random access to data. Research is required to overcome limitations imposed when it is costly to visit large data sets multiple times or to access instances at random, as in data streams [9]. Recently, it has been noticed that, in the context of clustering, many clusters may reside in different subspaces of very small dimensionality [32], with their sets of dimensions either overlapped or nonoverlapped. Many subspace clustering algorithms have been developed [72]. Searching for subspaces is not exactly a feature selection problem, as it tries to find many subspaces while feature selection tries to find only one. Feature selection can also be extended to instance selection [55] in scaling down data, which is a sister issue of scaling up algorithms. In addition to sampling methods [33], a suite of methods has been developed to search for representative instances so that data mining is performed in a focused and direct way [11], [61], [76]. Feature selection is a dynamic field closely connected to data mining and other data processing techniques. This paper attempts to survey this fast developing field, show some effective applications, and point out interesting trends and challenges. It is hoped that further and speedy development of feature selection can work with other related techniques to help evolve data mining into solutions for insights.

ACKNOWLEDGMENTS
The authors are very grateful to the anonymous reviewers and editor. Their many helpful and constructive comments and suggestions helped us significantly improve this work. This work is in part supported by a grant from the US National Science Foundation (No. 0127815), and from ET-I3.

REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, "Database Mining: A Performance Perspective," IEEE Trans. Knowledge and Data Eng., vol. 5, no. 6, pp. 914-925, 1993.
[2] H. Almuallim and T.G. Dietterich, "Learning with Many Irrelevant Features," Proc. Ninth Nat'l Conf. Artificial Intelligence, pp. 547-552, 1991.
[3] H. Almuallim and T.G. Dietterich, "Learning Boolean Concepts in the Presence of Many Irrelevant Features," Artificial Intelligence, vol. 69, nos. 1-2, pp. 279-305, 1994.
[4] C. Apte, B. Liu, P.D. Pednault, and P. Smyth, "Business Applications of Data Mining," Comm. ACM, vol. 45, no. 8, pp. 49-53, 2002.
[5] M. Ben-Bassat, "Pattern Recognition and Reduction of Dimensionality," Handbook of Statistics-II, P.R. Krishnaiah and L.N. Kanal, eds., pp. 773-791, North Holland, 1982.
[6] A.L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[7] A.L. Blum and R.L. Rivest, "Training a 3-Node Neural Network is NP-Complete," Neural Networks, vol. 5, pp. 117-127, 1992.
[8] L. Bobrowski, "Feature Selection Based on Some Homogeneity Coefficient," Proc. Ninth Int'l Conf. Pattern Recognition, pp. 544-546, 1988.
[9] P. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant, "Scaling Mining Algorithms to Large Databases," Comm. ACM, vol. 45, no. 8, pp. 38-43, 2002.
[10] G. Brassard and P. Bratley, Fundamentals of Algorithms. New Jersey: Prentice Hall, 1996.
[11] H. Brighton and C. Mellish, "Advances in Instance Selection for Instance-Based Learning Algorithms," Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 153-172, 2002.
[12] C. Cardie, "Using Decision Trees to Improve Case-Based Learning," Proc. 10th Int'l Conf. Machine Learning, P. Utgoff, ed., pp. 25-32, 1993.
[13] R. Caruana and D. Freitag, "Greedy Attribute Selection," Proc. 11th Int'l Conf. Machine Learning, pp. 28-36, 1994.
[14] W.G. Cochran, Sampling Techniques. John Wiley & Sons, 1977.
[15] S. Das, "Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection," Proc. 18th Int'l Conf. Machine Learning, pp. 74-81, 2001.
[16] M. Dash, "Feature Selection via Set Cover," Proc. IEEE Knowledge and Data Eng. Exchange Workshop, pp. 165-171, 1997.
[17] M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature Selection for Clustering-A Filter Solution," Proc. Second Int'l Conf. Data Mining, pp. 115-122, 2002.
[18] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis: An Int'l J., vol. 1, no. 3, pp. 131-156, 1997.
[19] M. Dash and H. Liu, "Handling Large Unsupervised Data via Dimensionality Reduction," Proc. 1999 SIGMOD Research Issues in Data Mining and Knowledge Discovery (DMKD-99) Workshop, 1999.
[20] M. Dash and H. Liu, "Feature Selection for Clustering," Proc. Fourth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD-2000), pp. 110-121, 2000.
[21] M. Dash, H. Liu, and H. Motoda, "Consistency Based Feature Selection," Proc. Fourth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD-2000), pp. 98-109, 2000.
[22] M. Dash, H. Liu, and J. Yao, "Dimensionality Reduction of Unsupervised Data," Proc. Ninth IEEE Int'l Conf. Tools with AI (ICTAI '97), pp. 532-539, 1997.
[23] M. Devaney and A. Ram, "Efficient Feature Selection in Conceptual Clustering," Proc. 14th Int'l Conf. Machine Learning, pp. 92-97, 1997.
[24] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall Int'l, 1982.
[25] J. Doak, "An Evaluation of Feature Selection Methods and Their Application to Computer Security," technical report, Univ. of California at Davis, Dept. Computer Science, 1992.
[26] P. Domingos, "Context Sensitive Feature Selection for Lazy Learners," AI Rev., vol. 14, pp. 227-253, 1997.
[27] J.G. Dy and C.E. Brodley, "Feature Subset Selection and Order Identification for Unsupervised Learning," Proc. 17th Int'l Conf. Machine Learning, pp. 247-254, 2000.
[28] U.M. Fayyad and K.B. Irani, "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," Proc. 13th Int'l Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.
[29] U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., pp. 495-515, AAAI Press/The MIT Press, 1996.
[30] U.M. Fayyad and R. Uthurusamy, "Evolving Data Mining into Solutions for Insights," Comm. ACM, vol. 45, no. 8, pp. 28-31, 2002.
[31] I. Foroutan and J. Sklansky, "Feature Selection for Automatic Classification of Non-Gaussian Data," IEEE Trans. Systems, Man, and Cybernetics, vol. 17, no. 2, pp. 187-198, 1987.
[32] J.H. Friedman and J.J. Meulman, "Clustering Objects on Subsets of Attributes," http://citeseer.ist.psu.edu/friedman02clustering.html, 2002.
[33] B. Gu, F. Hu, and H. Liu, "Sampling: Knowing Whole from Its Part," Instance Selection and Construction for Data Mining, pp. 21-38, 2001.
[34] M.A. Hall, "Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning," Proc. 17th Int'l Conf. Machine Learning, pp. 359-366, 2000.
[35] J. Han and Y. Fu, "Attribute-Oriented Induction in Data Mining," Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., pp. 399-421, AAAI Press/The MIT Press, 1996.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[37] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2001.
[38] M. Ichino and J. Sklansky, "Feature Selection for Linear Classifier," Proc. Seventh Int'l Conf. Pattern Recognition, pp. 124-127, 1984.
[39] M. Ichino and J. Sklansky, "Optimum Feature Selection by Zero-One Programming," IEEE Trans. Systems, Man, and Cybernetics, vol. 14, no. 5, pp. 737-746, 1984.
[40] A. Jain and D. Zongker, "Feature Selection: Evaluation, Application, and Small Sample Performance," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 153-158, Feb. 1997.
[41] G.H. John, R. Kohavi, and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," Proc. 11th Int'l Conf. Machine Learning, pp. 121-129, 1994.
[42] Y. Kim, W. Street, and F. Menczer, "Feature Selection for Unsupervised Learning via Evolutionary Search," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 365-369, 2000.
[43] K. Kira and L.A. Rendell, "The Feature Selection Problem: Traditional Methods and a New Algorithm," Proc. 10th Nat'l Conf. Artificial Intelligence, pp. 129-134, 1992.
[44] R. Kohavi and G.H. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[45] R. Kohavi, N.J. Rothleder, and E. Simoudis, "Emerging Trends in Business Analytics," Comm. ACM, vol. 45, no. 8, pp. 45-48, 2002.
[46] D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. 13th Int'l Conf. Machine Learning, pp. 284-292, 1996.
[47] I. Kononenko, "Estimating Attributes: Analysis and Extension of RELIEF," Proc. Sixth European Conf. Machine Learning, pp. 171-182, 1994.
[48] P. Langley, "Selection of Relevant Features in Machine Learning," Proc. AAAI Fall Symp. Relevance, pp. 140-144, 1994.
[49] W. Lee, S.J. Stolfo, and K.W. Mok, "Adaptive Intrusion Detection: A Data Mining Approach," AI Rev., vol. 14, no. 6, pp. 533-567, 2000.
[50] E. Leopold and J. Kindermann, "Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?" Machine Learning, vol. 46, pp. 423-444, 2002.
[51] H. Liu, F. Hussain, C.L. Tan, and M. Dash, "Discretization: An Enabling Technique," Data Mining and Knowledge Discovery, vol. 6, no. 4, pp. 393-423, 2002.
[52] Feature Extraction, Construction and Selection: A Data Mining Perspective, H. Liu and H. Motoda, eds. Boston: Kluwer Academic, 1998, second printing, 2001.
[53] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic, 1998.
[54] H. Liu and H. Motoda, "Less Is More," Feature Extraction, Construction and Selection: A Data Mining Perspective, pp. 3-12, chapter 1, 1998, second printing, 2001.
[55] Instance Selection and Construction for Data Mining, H. Liu and H. Motoda, eds. Boston: Kluwer Academic Publishers, 2001.
[56] H. Liu, H. Motoda, and M. Dash, "A Monotonic Measure for Optimal Feature Selection," Proc. 10th European Conf. Machine Learning, pp. 101-106, 1998.
[57] H. Liu, H. Motoda, and L. Yu, "Feature Selection with Selective Sampling," Proc. 19th Int'l Conf. Machine Learning, pp. 395-402, 2002.
[58] H. Liu and R. Setiono, "Feature Selection and Classification-A Probabilistic Wrapper Approach," Proc. Ninth Int'l Conf. Industrial and Eng. Applications of AI and ES, T. Tanaka, S. Ohsuga, and M. Ali, eds., pp. 419-424, 1996.
[59] H. Liu and R. Setiono, "A Probabilistic Approach to Feature Selection-A Filter Solution," Proc. 13th Int'l Conf. Machine Learning, pp. 319-327, 1996.
[60] H. Liu, L. Yu, M. Dash, and H. Motoda, "Active Feature Selection Using Classes," Proc. Seventh Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 474-485, 2003.
[61] D. Madigan, N. Raghavan, W. DuMouchel, C. Nason, M. Posse, and G. Ridgeway, "Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction," Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 173-190, 2002.
[62] A. Miller, Subset Selection in Regression, second ed. Chapman & Hall/CRC, 2002.
[63] P. Mitra, C.A. Murthy, and S.K. Pal, "Unsupervised Feature Selection Using Feature Similarity," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301-312, Mar. 2002.
[64] M. Modrzejewski, "Feature Selection Using Rough Sets Theory," Proc. European Conf. Machine Learning, P.B. Brazdil, ed., pp. 213-226, 1993.
[65] A.W. Moore and M.S. Lee, "Efficient Algorithms for Minimizing Cross Validation Error," Proc. 11th Int'l Conf. Machine Learning, pp. 190-198, 1994.
[66] A.N. Mucciardi and E.E. Gose, "A Comparison of Seven Techniques for Choosing Subsets of Pattern Recognition," IEEE Trans. Computers, vol. 20, pp. 1023-1031, 1971.
[67] P.M. Narendra and K. Fukunaga, "A Branch and Bound Algorithm for Feature Subset Selection," IEEE Trans. Computers, vol. 26, no. 9, pp. 917-922, Sept. 1977.
[68] A.Y. Ng, "On Feature Selection: Learning with Exponentially Many Irrelevant Features as Training Examples," Proc. 15th Int'l Conf. Machine Learning, pp. 404-412, 1998.
[69] K.S. Ng and H. Liu, "Customer Retention via Data Mining," AI Rev., vol. 14, no. 6, pp. 569-590, 2000.
[70] K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell, "Text Classification from Labeled and Unlabeled Documents Using EM," Machine Learning, vol. 39, pp. 103-134, 2000.
[71] A.L. Oliveira and A.S. Vincentelli, "Constructive Induction Using a Non-Greedy Strategy for Feature Selection," Proc. Ninth Int'l Conf. Machine Learning, pp. 355-360, 1992.
[72] L. Parsons, E. Haque, and H. Liu, "Subspace Clustering for High Dimensional Data: A Review," SIGKDD Explorations, vol. 6, no. 1, pp. 90-105, 2004.
[73] P. Pudil and J. Novovicova, "Novel Methods for Subset Selection with Respect to Problem Knowledge," Feature Extraction, Construction and Selection: A Data Mining Perspective, pp. 101-116, 1998, second printing, 2001.
[74] D. Pyle, Data Preparation for Data Mining. Morgan Kaufmann Publishers, 1999.
[75] C.E. Queiros and E.S. Gelsema, "On Feature Selection," Proc. Seventh Int'l Conf. Pattern Recognition, pp. 128-130, 1984.
[76] T. Reinartz, "A Unifying View on Instance Selection," Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 191-210, 2002.
[77] Y. Rui, T.S. Huang, and S. Chang, "Image Retrieval: Current Techniques, Promising Directions and Open Issues," Visual Comm. and Image Representation, vol. 10, no. 4, pp. 39-62, 1999.
[78] J.C. Schlimmer, "Efficiently Inducing Determinations: A Complete and Systematic Search Algorithm that Uses Optimal Pruning," Proc. 10th Int'l Conf. Machine Learning, pp. 284-290, 1993.
[79] J. Segen, "Feature Selection and Constructive Inference," Proc. Seventh Int'l Conf. Pattern Recognition, pp. 1344-1346, 1984.
[80] J. Sheinvald, B. Dom, and W. Niblack, "A Modelling Approach to Feature Selection," Proc. 10th Int'l Conf. Pattern Recognition, pp. 535-539, 1990.
[81] W. Siedlecki and J. Sklansky, "On Automatic Feature Selection," Int'l J. Pattern Recognition and Artificial Intelligence, vol. 2, pp. 197-220, 1988.
[82] D.B. Skalak, "Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms," Proc. 11th Int'l Conf. Machine Learning, pp. 293-301, 1994.
[83] N. Slonim, G. Bejerano, S. Fine, and N. Tishby, "Discriminative Feature Selection via Multiclass Variable Memory Markov Model," Proc. 19th Int'l Conf. Machine Learning, pp. 578-585, 2002.
[84] P. Smyth, D. Pregibon, and C. Faloutsos, "Data-Driven Evolution of Data Mining Algorithms," Comm. ACM, vol. 45, no. 8, pp. 33-37, 2002.
[85] D.J. Stracuzzi and P.E. Utgoff, "Randomized Variable Elimination," Proc. 19th Int'l Conf. Machine Learning, pp. 594-601, 2002.
[86] D.L. Swets and J.J. Weng, "Efficient Content-Based Image Retrieval Using Automatic Feature Selection," IEEE Int'l Symp. Computer Vision, pp. 85-90, 1995.
[87] L. Talavera, "Feature Selection as a Preprocessing Step for Hierarchical Clustering," Proc. Int'l Conf. Machine Learning (ICML '99), pp. 389-397, 1999.
[88] H. Vafaie and I.F. Imam, "Feature Selection Methods: Genetic Algorithms vs. Greedy-Like Search," Proc. Int'l Conf. Fuzzy and Intelligent Control Systems, 1994.
[89] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
[90] N. Wyse, R. Dubes, and A.K. Jain, "A Critical Evaluation of Intrinsic Dimensionality Algorithms," Pattern Recognition in Practice, E.S. Gelsema and L.N. Kanal, eds., pp. 415-425, Morgan Kaufmann, Inc., 1980.
[91] E. Xing, M. Jordan, and R. Karp, "Feature Selection for High-Dimensional Genomic Microarray Data," Proc. 18th Int'l Conf. Machine Learning, pp. 601-608, 2001.
[92] L. Xu, P. Yan, and T. Chang, "Best First Strategy for Feature Selection," Proc. Ninth Int'l Conf. Pattern Recognition, pp. 706-708, 1988.
[93] J. Yang and V. Honavar, "Feature Subset Selection Using a Genetic Algorithm," Feature Extraction, Construction and Selection: A Data Mining Perspective, pp. 117-136, 1998, second printing, 2001.
[94] Y. Yang and J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. 14th Int'l Conf. Machine Learning, pp. 412-420, 1997.
[95] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proc. 20th Int'l Conf. Machine Learning, pp. 856-863, 2003.
[96] L. Yu and H. Liu, "Redundancy Based Feature Selection for Microarray Data," Proc. 10th ACM SIGKDD Conf. Knowledge Discovery and Data Mining, 2004.

Huan Liu received the bachelor's degree from Shanghai JiaoTong University, and the MS and PhD degrees from the University of Southern California. Dr. Liu works in the Department of Computer Science and Engineering at Arizona State University (ASU), where he researches and teaches data mining, machine learning, and artificial intelligence and their applications to real-world problems. Before joining ASU, he conducted research at Telecom (now Telstra) Australia Research Laboratories and taught at the School of Computing, National University of Singapore. He has published books and technical papers on data preprocessing techniques for feature selection, extraction, construction, and instance selection. His research interests include data mining, machine learning, data reduction, customer relationship management, bioinformatics, and intelligent systems. Professor Liu has served on the program committees of many international conferences and is on the editorial board or an editor of professional journals. He is a member of the ACM, the AAAI, the ASEE, and a senior member of the IEEE.

Lei Yu received the bachelor's degree from Dalian University of Technology, China, in 1999. He is currently a PhD candidate in the Department of Computer Science and Engineering at Arizona State University. His research interests include data mining, machine learning, and bioinformatics. He has published technical papers in premier journals and leading conferences of data mining and machine learning. He is a student member of the IEEE and the ACM.