
Feature Selection with a General Hybrid Algorithm

Jerffeson Teixeira de Souza [email protected]


Nathalie Japkowicz [email protected]
Stan Matwin [email protected]
School of Information Technology and Engineering, University of Ottawa, Ottawa, K1N 6N5, Canada

Abstract

In this paper, we propose a new general hybrid system for the problem of feature selection in machine learning. FortalFS employs a search strategy that can be characterized as randomized, strongly guided: it uses the results of a filter as a seed to generate new subsets to be evaluated by a learning algorithm. In order to evaluate the power of this new algorithm, a number of experiments were designed and executed, and the results compared to well-known feature selection filter and wrapper algorithms. These experiments were run over thirteen datasets from the UCI Repository and three additional high-dimensional datasets, using three different induction algorithms: C4.5, Naive Bayes and k-Nearest Neighbour. The results show that FortalFS outperformed all other algorithms when considering the classification ability of the selected subsets, while achieving a dimensionality reduction of over 60%. In terms of time consumption, FortalFS required more time than the filters but less than any of the wrappers.

1. Introduction

Classification is a key problem in machine learning. Algorithms for classification have the ability to predict the outcome of a new situation after having been trained on data representing past experience. A number of factors influence the performance of classification algorithms, including the number and quality of features provided to describe the data, the training and testing data distribution, and others. The factor we focus on in this paper is the number and quality of features present in the sample data. Intuitively, one could think that the more features are used, the more accurate the classifier should be. However, what is seen in practice is that the more features there are, the more training examples are needed to induce an accurate classification model. Thus, the presence of a large number of features (some of them irrelevant) can hurt the classification capacity of a machine learning algorithm. This problem is sometimes referred to as "the curse of dimensionality" (Bellman, 1961) and can be explained by the fact that the number of solutions (classifiers) that should be considered increases exponentially in the number of features of the dataset, making it more difficult for a learning algorithm to find an accurate one. Also, as a consequence, when a large number of features is present, a greater computational effort is needed in order to compute these features.

The Feature Selection problem involves discovering a subset of features such that a classifier built only with this subset would have better predictive accuracy than a classifier built from the entire set of features. Other benefits of feature selection include a reduction in the amount of training data needed to induce an accurate classifier, which is consequently simpler and easier to understand, and a reduced execution time. In practice, feature selection algorithms will discover and select features of the data that are relevant to the task to be learned.

The remainder of this paper is composed of five sections. Section 2 is an overview of previous work on feature selection. Section 3 presents our new hybrid solution for feature selection. Section 4 presents an empirical evaluation of our method. Finally, Section 5 suggests several directions for future work and Section 6 concludes the paper by summarizing its contributions.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.
2. Previous Work

In this section, we begin by describing three general approaches to feature selection and then overview a number of important specific algorithms previously used to improve the performance of regular classifiers. For a more detailed overview of previous work in feature selection research, please see (Blum & Langley, 1997).

2.1. Filter, Wrapper and Hybrid Approaches

A large number of algorithms have already been proposed to solve the feature selection problem. Although significantly different with regards to 1) the search strategy they use to determine the right subset of features and 2) how each subset is evaluated, feature selection algorithms are usually classified in three general groups: filters, wrappers and hybrid algorithms.

In a filter model, the feature selection is performed as a pre-processing step to classification. Thus, the selection process is performed independently of the machine learning algorithm that will be used to induce the classifier. Filters are generally much less computationally expensive than wrappers. However, they may suffer from low performance if the evaluation criterion does not match the classifier well.

As opposed to filters, wrappers (John et al., 1994) use the learning algorithm as an integral part of the selection process. The idea behind wrappers comes from the fact that the optimal subset of features depends on the specific biases of the learning system, as observed in (John et al., 1994). Therefore, the selection of features should consider the characteristics of the classifier. Thus, in order to evaluate subsets, wrappers use the error rate of the classifier induced by the learning algorithm as their evaluation function. This aspect of wrappers results in higher accuracy in subset selection than simple filters achieve. However, since wrappers have to train a classifier for each subset evaluation, they are much more time consuming.

The idea behind hybrid systems for feature selection is to extract the good characteristics of filters and wrappers and combine them in one single solution. These systems are specially designed to select features accurately the way wrappers do while being time-efficient the way filters are. Hybrid algorithms usually achieve this behavior by pre-evaluating the features with a filter so as to reduce the search space to be considered by the subsequent wrapper. The term "hybrid" refers to the fact that two different evaluation methods are used: a filter-type evaluation (such as consistency, distance, information gain, and so on) and the classifier accuracy (like wrappers). In this paper, we propose a general method that follows the hybrid approach.

An overview of some important algorithms for the selection of features is presented next. These algorithms, and others, will be used in the evaluation part of the paper to contrast the results obtained with our proposed method to those obtained with currently available technology.

2.2. Filters

The Best-First feature selection algorithm (Xu et al., 1989) starts with an empty set of features and generates all possible single-feature expansions. The subset with the highest evaluation is then chosen and is expanded in the same way by adding single features. If expanding a subset results in no improvement, the search drops back to the next best unexpanded subset and continues from there. Given enough time, a Best-First search will explore the entire search space, so it is common to limit the number of expanded subsets that result in no improvement. The best subset found is returned when the search terminates.
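For concreteness, a minimal Java sketch of this search follows; the merit function is a hypothetical stub standing in for a filter's subset evaluation (e.g., a consistency or correlation measure), and the limit on non-improving expansions mirrors the setting later used in Table 1.

import java.util.*;

public class BestFirstSearch {

    // Hypothetical merit function; a real filter would score the subset
    // against the training data instead.
    static double evaluate(boolean[] s) {
        double m = 0;
        for (int i = 0; i < s.length; i++)
            if (s[i]) m += (i % 3 == 0) ? 0.5 : -0.1;
        return m;
    }

    static boolean[] search(int n, int maxNonImproving) {
        PriorityQueue<boolean[]> open =                       // most promising subset first
            new PriorityQueue<>((a, b) -> Double.compare(evaluate(b), evaluate(a)));
        Set<String> visited = new HashSet<>();
        boolean[] best = new boolean[n];                      // start from the empty subset
        double bestScore = evaluate(best);
        open.add(best);
        int stale = 0;
        while (!open.isEmpty() && stale < maxNonImproving) {
            boolean[] cur = open.poll();
            boolean improved = false;
            for (int f = 0; f < n; f++) {                     // all single-feature expansions
                if (cur[f]) continue;
                boolean[] child = cur.clone();
                child[f] = true;
                if (!visited.add(Arrays.toString(child))) continue;
                open.add(child);
                double score = evaluate(child);
                if (score > bestScore) { bestScore = score; best = child; improved = true; }
            }
            stale = improved ? 0 : stale + 1;                 // count non-improving expansions
        }
        return best;                                          // best subset found when search stops
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(search(9, 5)));
    }
}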
The Genetic Search feature selector (Vafaie & De Jong, 1992) applies a simple genetic algorithm to search through the subsets of features. Each chromosome, containing N genes representing the N features of the dataset, represents a subset to be considered. For each gene in a chromosome, the value 1 means that a feature is present in the subset, while the value 0 indicates that it is not. The algorithm starts by randomly generating an initial population of subsets. Each subset is then evaluated and, based on this evaluation, a number of genetic operations (selection, reproduction and mutation) is applied to this population. The fitness value calculated for each subset during evaluation indicates how well this subset represents the original dataset. A new population of subsets is generated and the process starts again until a pre-specified number of populations has been considered. At the end, the subset with the highest fitness value is selected.
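A minimal sketch of this bit-string encoding and of the genetic operations follows. The fitness function is a hypothetical stub, and binary tournament selection is one concrete choice for the selection operator, which the original description leaves open; the default parameters match those used later in Table 1.

import java.util.*;

public class GeneticSearch {
    static final Random RNG = new Random(42);

    // Hypothetical stub; the real fitness scores how well the subset
    // represents the original dataset.
    static double fitness(boolean[] s) {
        double f = 0;
        for (int i = 0; i < s.length; i++) if (s[i]) f += 1.0 / (i + 1);
        return f;
    }

    static boolean[] tournament(boolean[][] pop) {            // binary tournament selection
        boolean[] a = pop[RNG.nextInt(pop.length)], b = pop[RNG.nextInt(pop.length)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    static boolean[] search(int n, int popSize, int generations,
                            double pCross, double pMut) {
        boolean[][] pop = new boolean[popSize][n];
        for (boolean[] c : pop)                               // random initial population
            for (int i = 0; i < n; i++) c[i] = RNG.nextBoolean();
        for (int g = 0; g < generations; g++) {
            boolean[][] next = new boolean[popSize][];
            for (int k = 0; k < popSize; k++) {
                boolean[] a = tournament(pop), b = tournament(pop);
                boolean[] child = a.clone();
                if (RNG.nextDouble() < pCross) {              // one-point crossover
                    int cut = RNG.nextInt(n);
                    for (int i = cut; i < n; i++) child[i] = b[i];
                }
                for (int i = 0; i < n; i++)                   // bit-flip mutation
                    if (RNG.nextDouble() < pMut) child[i] = !child[i];
                next[k] = child;
            }
            pop = next;
        }
        boolean[] best = pop[0];                              // fittest subset of the last population
        for (boolean[] c : pop) if (fitness(c) > fitness(best)) best = c;
        return best;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(search(20, 50, 200, 0.6, 0.001)));
    }
}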
LVF (Liu & Setiono, 1996) randomly searches the space of subsets using a Las Vegas algorithm. At every iteration, a new subset is randomly generated. If the number of features in this subset is smaller than or equal to that of the current best subset, this subset is evaluated with the use of an inconsistency count (Liu & Setiono, 1996); subsets with more features than the current best are simply rejected. The inconsistency count is based on the intuition that the most frequent class label among the instances matching on this subset of features (that is, instances that have the same values for all features in the subset) is the most probable class label¹. An inconsistency threshold is used to reject subsets whose inconsistency rate is greater than this threshold. This process is repeated NumIter times².

¹ This inconsistency count is calculated as follows: within a group of matching instances, the inconsistency count is the number of instances in the group minus the number of instances in the group with the most frequent class value. The overall inconsistency rate is the sum of the inconsistency counts of all groups of matching instances divided by the total number of instances.

² In their original paper, the authors chose by experiment the number of iterations to be 77 · N⁵, where N is the initial number of features in the dataset. This number can be defined according to the application, where a compromise is made between fast and accurate results.
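To make footnote 1 concrete, the following Java sketch computes this inconsistency rate; the encoding of instances as integer feature vectors with a separate class array is our assumption for illustration.

import java.util.*;

public class InconsistencyRate {
    // Group instances that match on the selected features; each group
    // contributes (group size minus the count of its most frequent class);
    // the rate is the sum over groups divided by the number of instances.
    static double rate(int[][] data, int[] classes, boolean[] subset) {
        Map<String, Map<Integer, Integer>> groups = new HashMap<>();
        for (int r = 0; r < data.length; r++) {
            StringBuilder key = new StringBuilder();
            for (int f = 0; f < subset.length; f++)
                if (subset[f]) key.append(data[r][f]).append(',');   // projection key
            groups.computeIfAbsent(key.toString(), k -> new HashMap<>())
                  .merge(classes[r], 1, Integer::sum);
        }
        int inconsistent = 0;
        for (Map<Integer, Integer> counts : groups.values()) {
            int total = 0, majority = 0;
            for (int c : counts.values()) { total += c; majority = Math.max(majority, c); }
            inconsistent += total - majority;    // instances outside the majority class
        }
        return (double) inconsistent / data.length;
    }

    public static void main(String[] args) {
        int[][] x = { {0, 1}, {0, 1}, {1, 0} };
        int[] y = { 0, 1, 1 };
        // The first two instances match on both features but disagree on
        // the class, so the rate is 1/3.
        System.out.println(rate(x, y, new boolean[] { true, true }));
    }
}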
The Relief algorithm (Kira & Rendell, 1992b; Kira & Rendell, 1992a; Kononenko, 1994; Kononenko et al., 1996) weights each feature according to its relevance to the classification task. Initially, all weights are set to zero and then updated iteratively. In each iteration, the algorithm chooses a random instance in the dataset and estimates how well each feature value of this instance distinguishes between instances close to it. In this process, two groups of instances are selected: some of the closest instances belonging to the same class and some belonging to a different class. With these instances, Relief iteratively updates the weight of each feature according to how well the feature differentiates between data points from different classes while, simultaneously, recognizing data points from the same class. At the end, a threshold is used in such a way that only the features with weights above this value are selected. In an alternative version, a pre-defined number of features with the highest weights is selected.
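A minimal sketch of this weighting scheme is shown below, using a single NearHit and NearMiss per iteration over numeric features scaled to [0, 1]; this is a simplification, since the experiments in Section 4 use 10 NearHits and NearMisses (Table 1), which would average these differences over the ten nearest neighbours of each kind.

import java.util.*;

public class Relief {
    static double[] weights(double[][] x, int[] y, int iterations, long seed) {
        int n = x[0].length;
        double[] w = new double[n];
        Random rng = new Random(seed);
        for (int it = 0; it < iterations; it++) {
            int r = rng.nextInt(x.length);               // random target instance
            int hit = -1, miss = -1;
            double dHit = Double.MAX_VALUE, dMiss = Double.MAX_VALUE;
            for (int j = 0; j < x.length; j++) {         // nearest same-class and
                if (j == r) continue;                    // different-class neighbours
                double d = dist(x[r], x[j]);
                if (y[j] == y[r] && d < dHit)  { dHit = d;  hit = j; }
                if (y[j] != y[r] && d < dMiss) { dMiss = d; miss = j; }
            }
            if (hit < 0 || miss < 0) continue;
            for (int f = 0; f < n; f++)                  // reward separating the classes,
                w[f] += Math.abs(x[r][f] - x[miss][f])   // penalize separating the same class
                      - Math.abs(x[r][f] - x[hit][f]);
        }
        return w;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] x = { {0.1, 0.9}, {0.2, 0.8}, {0.9, 0.1}, {0.8, 0.3} };
        int[] y = { 0, 0, 1, 1 };
        System.out.println(Arrays.toString(weights(x, y, 100, 7)));
        // Features with weights above a threshold (0.01 in Table 1) are kept.
    }
}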
The Focus system (Almuallim & Dietterich, 1991; Almuallim & Dietterich, 1994) starts by searching through all individual features, looking for one that perfectly represents the original dataset; such a feature is found when its inconsistency count is equal to that of the initial dataset. If no single feature is found to meet this criterion, combinations of features of size two, three and so on are considered until a perfect subset is discovered. Unlike the one used in LVF, this inconsistency measure simply counts the number of instances with the same feature values but belonging to different classes.

2.3. Wrappers

In (John et al., 1994), the authors first proposed the use of the induction algorithm in the feature selection process. They claimed that the selection algorithm should take into account the biases of the target induction algorithm, and they proposed the use of n-fold cross-validation to evaluate the subsets of features. In order to search the space of subsets to be evaluated, AI search algorithms are considered, such as Backward Elimination and Forward Selection. In the latter case, the selection algorithm starts with an empty subset of features and greedily adds the features that most improve performance, until no further improvement is possible. Similarly to the Forward Selection used in the Forward Wrapper algorithm, the Backward Wrapper feature selection algorithm (John et al., 1994) applies the Backward Elimination strategy: it starts from the complete set and, in each iteration, generates new subsets by discarding a feature, until improvements cannot be achieved anymore. In the Random Wrapper, adapted from (John et al., 1994)³, subsets of features are iteratively and randomly generated and evaluated with the help of a machine learning algorithm. At the end, the subset with the best accuracy is returned.

³ The original wrapper algorithm proposed in (John et al., 1994) tried different search heuristics such as Backward Elimination, Forward Selection and Relief.
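For concreteness, a minimal Java sketch of the Forward Selection wrapper follows; the accuracy function is a hypothetical stub standing in for the cross-validated accuracy of the classifier induced on the candidate subset.

import java.util.*;

public class ForwardWrapper {
    // Hypothetical stub: the real wrapper trains a classifier (e.g., C4.5)
    // on the candidate subset and returns its cross-validated accuracy.
    static double accuracy(boolean[] subset) {
        double a = 0.5;
        for (int i = 0; i < subset.length; i++)
            if (subset[i]) a += (i % 2 == 0) ? 0.04 : -0.03;
        return a;
    }

    static boolean[] select(int n) {
        boolean[] current = new boolean[n];          // start from the empty subset
        double best = accuracy(current);
        while (true) {
            int bestFeature = -1;
            for (int f = 0; f < n; f++) {            // try adding each unused feature
                if (current[f]) continue;
                current[f] = true;
                double a = accuracy(current);
                if (a > best) { best = a; bestFeature = f; }
                current[f] = false;
            }
            if (bestFeature < 0) return current;     // no single addition helps: stop
            current[bestFeature] = true;             // greedily keep the best addition
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(select(8)));
    }
}

Backward Elimination is the mirror image: start from the full set and repeatedly drop the feature whose removal most improves the estimated accuracy.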
3. The FortalFS Algorithm

The idea behind FortalFS is to extract and combine the best characteristics of filters and wrappers into one algorithm, namely, an efficient heuristic used to search through subsets of features and a precise evaluation criterion, respectively. Thus, the FortalFS algorithm (Souza et al., 2002) uses results from another feature selection system as a starting point in the search through subsets of features, which are evaluated by a machine learning algorithm (Figure 1). Therefore, with an efficient heuristic, we can decrease the number of subsets of features to be evaluated by the learning algorithm, consequently decreasing computational effort (the great advantage of filters), and still be able to select an accurate subset (the big advantage of wrappers).

Initially, the k best subsets returned by a single run of an FS (Feature Selection) system (or the single results of k different runs, if the algorithm returns only one best subset) are stored in a two-dimensional array. This array is then condensed into a new array, called Adam, that simply stores the number of times each feature appeared in the k best subsets. Next, FortalFS iteratively generates new subsets of features in a guided random fashion, using Adam as a seed, and evaluates them with a ML system. The generation of a new subset is such that features with a high value in Adam have a better chance of being selected than those with a low one at each iteration. At the end, the subset with the best accuracy is returned. If subsets tie in terms of accuracy, the one with the lowest cardinality is returned.

FortalFS(D, NumIter)
    O = FeatureSelector(D)
    Adam = CalculateAdam(O)
    for i = 1 to NumIter
        S = GenerateSubset(Adam)
        if ErrorRate(S, D) < ErrorRate(Sbest, D) then
            Sbest = S
        else
            if ErrorRate(S, D) = ErrorRate(Sbest, D) and
               Card(S) < Card(Sbest) then
                Sbest = S
    return Sbest
----------
where:
    D - dataset.
    NumIter - number of iterations.

Figure 1. The FortalFS Algorithm.
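For illustration, the following Java sketch transcribes Figure 1 directly; the ErrorRate stub is hypothetical (the real evaluation trains a classifier and measures its cross-validated error on D), while CalculateAdam and GenerateSubset follow the definitions given below.

import java.util.*;

public class FortalFS {
    static final Random RNG = new Random(1);

    // Main loop of Figure 1: k filter results form O, Adam counts feature
    // occurrences, and NumIter guided-random subsets are evaluated; the
    // most accurate subset wins, with ties broken by lower cardinality.
    static boolean[] run(boolean[][] o, int numIter) {
        int k = o.length;
        int[] adam = calculateAdam(o);
        boolean[] best = generateSubset(adam, k);
        double bestErr = errorRate(best);
        for (int i = 1; i < numIter; i++) {
            boolean[] s = generateSubset(adam, k);
            double err = errorRate(s);
            if (err < bestErr || (err == bestErr && card(s) < card(best))) {
                best = s;
                bestErr = err;
            }
        }
        return best;
    }

    static int[] calculateAdam(boolean[][] o) {        // occurrences of each feature in O
        int[] adam = new int[o[0].length];
        for (boolean[] subset : o)
            for (int i = 0; i < adam.length; i++)
                if (subset[i]) adam[i]++;
        return adam;
    }

    static boolean[] generateSubset(int[] adam, int k) {   // guided random generation
        boolean[] s = new boolean[adam.length];
        for (int i = 0; i < s.length; i++)
            s[i] = adam[i] > RNG.nextInt(k + 1);           // chance grows with adam[i]
        return s;
    }

    static double errorRate(boolean[] s) {
        // Hypothetical stub: train a classifier on the features in s and
        // return its cross-validated error over the dataset.
        double e = 0.5;
        for (int i = 0; i < s.length; i++) if (s[i]) e -= 0.04 / (i + 1);
        return e;
    }

    static int card(boolean[] s) {
        int c = 0;
        for (boolean b : s) if (b) c++;
        return c;
    }

    public static void main(String[] args) {
        boolean[][] o = { {true, true, false, false},      // k = 3 filter results
                          {true, false, true, false},
                          {true, true, false, false} };
        System.out.println(Arrays.toString(run(o, 40)));
    }
}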
We describe next, in detail, each of the methods used by FortalFS.

FeatureSelector(D) runs a feature selection system, for example LVF, getting the k best subsets generated and storing them in the two-dimensional vector O.

There are a few characteristics that make a feature selection algorithm suitable to be used as the underlying algorithm in FortalFS. First, the algorithm must be non-deterministic; otherwise, the k best subsets would all be the same, and FortalFS would consequently select this same subset every time. For instance, Focus is not a good candidate because of its deterministic behaviour. Second, it should ideally be an anytime algorithm, that is, able to output several partial results during processing. This way, one can obtain the k best results in one single run of the algorithm. LVF is an example of such an algorithm. Finally, the algorithm should in fact be a selection algorithm, not a weighting algorithm such as the original Relief⁴. However, FortalFS can be modified to work with feature weights directly. We present and evaluate this modification later on.

⁴ The Relief version we presented previously and use in our experiments is a "selection" version of the original "weighting" Relief algorithm.

CalculateAdam(O) uses the following equation to create the Adam vector, which stores the number of occurrences of each feature in O:

    Adam = {ai, 1 ≤ i ≤ n}, where ai = Σj oji, with 1 ≤ j ≤ k and 1 ≤ i ≤ n.
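As a concrete illustration of this computation, suppose n = 4 and k = 3, with O containing the subsets (1, 1, 0, 0), (1, 0, 1, 0) and (1, 1, 0, 0). Then Adam = (3, 2, 1, 0): feature 1 appears in all three best subsets (a1 = 3), while feature 4 appears in none (a4 = 0).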
GenerateSubset(Adam) generates a new subset of features S in a guided random fashion, using Adam as a seed. The generation process works as follows. Let i denote a particular feature in Adam, and let S be a vector of n elements, where n is the total number of features in O. Element Si (of S) = 1 if feature i is included in the subset of features represented by S, and Si = 0 otherwise. Vector S is computed as follows:

    Si = 1 if ai > random(k), and Si = 0 otherwise,

where random(k) returns a random number between 0 and k. This procedure is such that, at each iteration, features with a high frequency have a better chance of being selected than those with a low one.
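To illustrate, assume random(k) draws uniformly from [0, k]. Then feature i is included with probability ai/k: a feature with ai = 0 can never pass the test ai > random(k) and is never selected, while a feature that appeared in all k best subsets passes it almost always.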
ErrorRate(S, D) makes use of a learning algorithm, inputting the subset S to generate a prediction model and receiving the error rate calculated for this model over dataset D by using cross-validation.
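A minimal sketch of such a cross-validated error estimate is given below; the Classifier interface and the 1-NN used in the example are our own illustrative stand-ins, not the Weka components used in the actual experiments.

import java.util.*;

public class CrossValidation {
    interface Classifier {
        void train(double[][] x, int[] y);
        int predict(double[] x);
    }

    // k-fold cross-validated error over the features kept by `subset`;
    // the classifier is pluggable (C4.5, Naive Bayes and k-NN in Section 4).
    static double errorRate(double[][] x, int[] y, boolean[] subset,
                            Classifier c, int folds, long seed) {
        int m = x.length;
        Integer[] idx = new Integer[m];
        for (int i = 0; i < m; i++) idx[i] = i;
        Collections.shuffle(Arrays.asList(idx), new Random(seed));
        int errors = 0;
        for (int f = 0; f < folds; f++) {
            List<double[]> trX = new ArrayList<>(); List<Integer> trY = new ArrayList<>();
            List<double[]> teX = new ArrayList<>(); List<Integer> teY = new ArrayList<>();
            for (int i = 0; i < m; i++) {
                double[] row = project(x[idx[i]], subset);
                if (i % folds == f) { teX.add(row); teY.add(y[idx[i]]); }
                else                { trX.add(row); trY.add(y[idx[i]]); }
            }
            c.train(trX.toArray(new double[0][]),
                    trY.stream().mapToInt(Integer::intValue).toArray());
            for (int i = 0; i < teX.size(); i++)
                if (c.predict(teX.get(i)) != teY.get(i)) errors++;
        }
        return (double) errors / m;
    }

    static double[] project(double[] row, boolean[] subset) {  // keep selected features only
        int n = 0;
        for (boolean b : subset) if (b) n++;
        double[] out = new double[n];
        for (int i = 0, j = 0; i < row.length; i++) if (subset[i]) out[j++] = row[i];
        return out;
    }

    public static void main(String[] args) {
        Classifier oneNN = new Classifier() {                  // tiny 1-NN for the demo
            double[][] tx; int[] ty;
            public void train(double[][] x, int[] y) { tx = x; ty = y; }
            public int predict(double[] q) {
                int best = 0; double bd = Double.MAX_VALUE;
                for (int i = 0; i < tx.length; i++) {
                    double d = 0;
                    for (int j = 0; j < q.length; j++) d += (q[j] - tx[i][j]) * (q[j] - tx[i][j]);
                    if (d < bd) { bd = d; best = ty[i]; }
                }
                return best;
            }
        };
        double[][] x = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {0, 0.2}, {0.9, 1} };
        int[] y = { 0, 0, 1, 1, 0, 1 };
        System.out.println(errorRate(x, y, new boolean[] { true, true }, oneNN, 3, 42));
    }
}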
We believe that this design will make FortalFS able to select accurate subsets of features in a reasonable time, mainly because the wrapper part of this new algorithm will have its search space reduced by the Adam vector. This way, only subsets with a high probability of being accurate will be considered. Yet, the use of the learner bias will make the evaluation of these subsets highly precise.

4. Empirical Evaluation

In order to evaluate FortalFS, the feature selection algorithms described in the previous work section were implemented and their performances compared to our new hybrid algorithm.

4.1. Methodology

Both the filters and the wrappers, as well as FortalFS, were implemented in Java by using the Weka library (?) as a starting point, and the datasets used were obtained from the UCI Repository (?). These particular datasets were chosen mainly because they have been used extensively to benchmark machine learning algorithms, as well as for their diversity in terms of number of features, number of instances, number of classes and type of features. The following 13 datasets were used: Credit (15 features, 690 instances), Labor (16, 57), Vote (16, 435), Primary Tumor (17, 339), Lymph (18, 148), Mushroom (22, 8124), Colic (23, 368), Autos (25, 205), Ionosphere (34, 351), Soybean (35, 683), Splice (60, 3190), Sonar (60, 208), Audiology (69, 226).

Performance measures such as the accuracy of the selected subset, the time used for selection and the number of features selected were obtained in each experiment.

All methods are compared using an independent evaluation to avoid the overfitting problem discussed in (?), which studies the overfitting problem that arises when comparing feature selection methods. In light of experimental evidence, the author warns that the cross-validation on the data used to guide the search for subsets of features in wrappers should not be used to compare methods, since it may suffer from overfitting. In order to avoid this problem, it is suggested that a separate testing set be used when comparing methods. The accuracy for each selected subset was therefore obtained as follows: the original dataset was randomly and equally split into a selection set and a testing set. The feature selection in all cases was performed considering only the selection set. For the wrappers, we used 5-fold cross-validation on the selection set as the evaluation strategy. The selected subset was then evaluated using 5-fold cross-validation on the testing set.
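The following sketch summarizes this protocol; the dataset loader and the two scoring stubs are placeholders, since the point is only the strict separation between the selection set and the testing set.

import java.util.*;

public class EvaluationProtocol {
    public static void main(String[] args) {
        double[][] data = loadDataset();                     // placeholder loader
        Collections.shuffle(Arrays.asList(data), new Random(17));
        double[][] selection = Arrays.copyOfRange(data, 0, data.length / 2);
        double[][] testing = Arrays.copyOfRange(data, data.length / 2, data.length);
        boolean[] subset = selectSubset(selection);          // selection sees only this half;
                                                             // wrappers run 5-fold CV inside it
        double acc = cvAccuracy(testing, subset, 5);         // independent held-out estimate
        System.out.println("held-out accuracy = " + acc);
    }

    static double[][] loadDataset() {                        // stub: random toy data
        double[][] d = new double[20][3];
        Random r = new Random(3);
        for (double[] row : d) { row[0] = r.nextDouble(); row[1] = r.nextDouble(); row[2] = r.nextInt(2); }
        return d;
    }

    static boolean[] selectSubset(double[][] sel) {          // stub: run FortalFS or a wrapper here
        return new boolean[] { true, false };
    }

    static double cvAccuracy(double[][] test, boolean[] subset, int folds) {
        return 0.0;                                          // stub: 5-fold CV on the testing set
    }
}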
Different induction algorithms have different biases, each of which guides a learning algorithm towards better hypotheses. This fact impacts the performance of feature selection algorithms, which consequently may be more suitable for some induction algorithms than others. With that in mind, we decided to run our experiments over the same datasets using three different learning algorithms (C4.5, Naive Bayes and k-Nearest Neighbour).

4.2. Experimental Settings

Table 1 describes the parameters used during the experiments for each one of the feature selection algorithms. Next, we present a short discussion on why such parameters were used.

Best-First
    Number of non-improving expansions before termination = 5

Genetic
    Number of populations = 200
    Size of population = 50
    Mutation probability = 0.001
    Crossover probability = 0.6

LVF
    Inconsistency threshold = initial inconsistency of the dataset
    Number of iterations = 77 · N⁵

Relief
    Number of iterations = number of instances in the dataset
    Number of NearHits and NearMisses considered = 10
    Selection threshold = 0.01

Random Wrappers
    Number of iterations = 10 · N and N²

FortalFS
    Number of iterations = 6 · N, 8 · N and 10 · N
    Underlying selection algorithm = LVF
    k = 10

Table 1. Experimental Settings. N is the number of features in the original dataset.

4.3. Results and Analysis

In the next section, we present and discuss the results obtained in our experiments with FortalFS and the other feature selection algorithms⁵ in a general manner. Following that, we focus on specific pairwise comparisons between FortalFS and some of the other algorithms.

A detailed description of the results can be found in Appendixes R1, R2 and R3 for C4.5, Naive Bayes and k-Nearest Neighbour, respectively, including the statistical significance of the accuracy difference between FortalFS and all other algorithms for each dataset and learning algorithm. The total number of experiments is 39, since we have 13 datasets on which we employ 3 different learning algorithms.

⁵ The following acronyms will be used in the tables/figures to refer to each algorithm: NFS (no feature selection: C4.5, Naive Bayes or k-Nearest Neighbour with the original dataset), FFS (best FortalFS result among the three settings), FS10 (FortalFS(10 · N)), FS8 (FortalFS(8 · N)), FS6 (FortalFS(6 · N)), FWR (Forward Wrapper), BWR (Backward Wrapper), W10N (Random Wrapper(10 · N)), WN2 (Random Wrapper(N²)), B-F (Best-First Search), Gen (Genetic Search), LVF (LVF), Rel (Relief), Foc (Focus).

4.3.1. Overall Results and Analysis

Figure 2 shows the number of experiments (out of 39) for which each algorithm performed the best or tied with the best when considering all three classifiers: C4.5, Naive Bayes and k-Nearest Neighbour. As shown in this figure, the three FortalFS settings (FS10, FS8 and FS6) are among the best algorithms in terms of accuracy. FortalFS(10 · N) had the best performance in 12 cases, FortalFS(8 · N) and Relief in 7, the Backward Wrapper in 6, and FortalFS(6 · N) and the Forward Wrapper in 5. When considering the best FortalFS result in each case (FFS), FortalFS performs at least as well as all other algorithms in 24 out of the 39 experiments.

Figure 2. Overall performance of all algorithms in terms of accuracy: number of datasets for which each algorithm performed the best or tied with the best.

Figure 3. Overall performance of all algorithms in terms of accuracy when considering the three classifiers: number of experiments (out of 39) in which each algorithm performed the best or tied with the best.

As expected, the FortalFS performance was proportional to the number of subsets considered; that is, FortalFS(10 · N) performed better than FortalFS(8 · N), which in turn was better than FortalFS(6 · N).

The Random Wrappers did worse than FortalFS and the Forward and Backward Wrappers in most cases. An important and expected conclusion that can be extracted from this result is that the strength of FortalFS and of the Forward and Backward Wrappers comes also from the search heuristic they apply, and not only from their strong evaluation method.

In the next sections, we will examine in more detail the differences obtained in the experiments between FortalFS and the three filters (Focus, Relief and LVF), along with the wrappers (Random, Forward and Backward).

In terms of time consumption (Figure 4), as expected, the wrappers and FortalFS consumed the most time, which shows the impact of the evaluation method. However, the three FortalFS settings were faster than all wrappers.

Figure 4. Overall time consumption (in minutes) for each algorithm, considering all experiments and the three classifiers.

Figure 5 shows the percentage of the features selected by each algorithm. The Forward Wrapper, Best-First and Genetic algorithms selected the smallest number of features overall, choosing respectively 16.91%, 19.76% and 21.95% of all features. FortalFS was able to achieve a dimensionality reduction of over 60% and still select very accurate subsets. The Backward Wrapper, with 1072 features selected (87.15%), is at the top of the list.

Figure 5. Percentage of the features selected by each algorithm in all experiments and the three classifiers, from a total number of 1230 features.
5. Future Work

There are several avenues for future work that could be explored:
5.1. More Experiments

The experiments presented here give us a very good idea of what we can expect when applying FortalFS in real situations. However, more experiments with more highly-dimensional and practical domains would be useful to confirm FortalFS's performance. In addition, since FortalFS can be parameterized to work with different machine learning algorithms and feature selection systems, it would be important to try other combinations of algorithms to evaluate how they behave with different datasets.

5.2. Time Analysis

Since FortalFS follows a wrapper approach when selecting features, it is expected that such a system will add some computational overhead. In this sense, it would be useful to carry out a series of experiments to evaluate how FortalFS performs in terms of time with large datasets and different combinations of ML algorithms and feature selection systems.

6. Conclusion

This paper proposed a general approach, FortalFS, to boost the performance of random-search-based feature selection systems. Besides being easy to understand and implement, empirical tests showed relevant improvements in comparison to other well-known selection algorithms when selecting features in high-dimensional datasets. Such good results could be obtained due to the fact that FortalFS uses the results of previously developed feature selection systems to drive its search over possible subsets of features, in such a way that only the subsets with a high probability of being the most relevant ones are considered for evaluation with a ML algorithm.
7. Acknowledgments

We would like to thank Dr. Huan Liu for his support with LVF, for sending us its code and for helping us in using it.

Finally, we acknowledge CAPES (Brazilian Federal Agency for Graduate Studies) as well as NSERC (Natural Sciences and Engineering Research Council of Canada) for their financial support.

References

Almuallim, H., & Dietterich, T. (1991). Learning with many irrelevant features. Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI'91) (pp. 547–552). Anaheim, CA: AAAI Press.

Almuallim, H., & Dietterich, T. (1994). Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69, 279–305.

Bellman, R. (1961). Adaptive control processes. Princeton University Press.

Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245–271.

John, G., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. Proceedings of the Eleventh International Conference on Machine Learning (ICML'94) (pp. 121–129).

Kira, K., & Rendell, L. (1992a). The feature selection problem: Traditional methods and a new algorithm. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 129–134). MIT Press.

Kira, K., & Rendell, L. (1992b). A practical approach to feature selection. Proceedings of the Ninth International Workshop on Machine Learning (pp. 249–256). Aberdeen, Scotland: Morgan Kaufmann.

Kononenko, I. (1994). Estimating attributes: Analysis and extensions of Relief. Proceedings of the European Conference on Machine Learning (pp. 171–182).

Kononenko, I., Robnik-Sikonja, M., & Pompe, U. (1996). ReliefF for estimation and discretization of attributes in classification. Artificial Intelligence: Methodology, Systems, Applications (pp. 31–40). IOS Press.

Liu, H., & Setiono, R. (1996). A probabilistic approach to feature selection: A filter solution. Proceedings of the Thirteenth International Conference on Machine Learning (ICML'96) (pp. 319–327).

Souza, J., Japkowicz, N., & Matwin, S. (2002). FortalBooster: A general technique for boosting feature selection systems. Proceedings of the Sixth IASTED International Conference on Artificial Intelligence and Soft Computing (ASC'02) (pp. 338–343). Banff, Canada: ACTA Press.

Vafaie, H., & De Jong, K. (1992). Genetic algorithms as a tool for feature selection in machine learning. Proceedings of the Fourth International Conference on Tools with Artificial Intelligence (pp. 200–204). Arlington, VA.

Xu, L., Yan, P., & Chang, T. (1989). Best first strategy for feature selection. Proceedings of the Ninth International Conference on Pattern Recognition (pp. 706–708). IEEE Computer Society Press.
