
World Academy of Science, Engineering and Technology
International Journal of Computer, Electrical, Automation, Control and Information Engineering Vol:1, No:4, 2007

Ant Colony Optimization for Feature Subset Selection

Ahmed Al-Ani
(A. Al-Ani is with the Faculty of Engineering, University of Technology, Sydney, GPO Box 123, Broadway, Australia; e-mail: [email protected].)

Abstract—The Ant Colony Optimization (ACO) is a metaheuristic inspired by the behavior of real ants in their search for the shortest paths to food sources. It has recently attracted a lot of attention and has been successfully applied to a number of different optimization problems. Due to the importance of the feature selection problem and the potential of ACO, this paper presents a novel method that utilizes the ACO algorithm to implement a feature subset search procedure. Initial results obtained using the classification of speech segments are very promising.

Keywords—Ant Colony Optimization, ant systems, feature selection, pattern recognition.

I. INTRODUCTION

The problem of feature selection has been widely investigated due to its importance to a number of disciplines such as pattern recognition and knowledge discovery. Feature selection allows the reduction of the feature space, which is crucial in reducing the training time and improving the prediction accuracy. This is achieved by removing irrelevant, redundant, and noisy features (i.e., selecting the subset of features that can achieve the best performance in terms of accuracy and computational time).

Blum and Langley [1] argued that most existing feature selection algorithms consist of the following four components:
- Starting point in the feature space. The search for feature subsets could start with (i) no features, (ii) all features, or (iii) a random subset of features.
- Search procedure. Ideally, the best subset of features can be found by evaluating all the possible subsets, which is known as exhaustive search. However, this becomes prohibitive as the number of features increases, since there are 2^N possible combinations for N features. Accordingly, several search procedures have been developed that are more practical to implement, but they are not guaranteed to find the optimal subset of features. These search procedures differ in their computational cost and the optimality of the subsets they find.
- Evaluation function. The existing feature selection evaluation functions can be divided into two main groups: filters and wrappers. Filters operate independently of any learning algorithm, where undesirable features are filtered out of the data before learning begins [2]. On the other hand, the performance of classification algorithms is used to select features for wrapper methods [3].
- Criterion for stopping the search. Feature selection methods must decide when to stop searching through the space of feature subsets. Some of the methods ask the user to predefine the number of selected features. Other methods are based on the evaluation function, e.g., whether the addition/deletion of any feature produces a better subset.

In this paper, we will mainly be concerned with the second component, which is the search procedure. In the next section, we give a brief description of some of the available search procedure algorithms and their limitations. An explanation of the Ant Colony Optimization (ACO) is presented in section three. Section four describes the proposed search procedure algorithm. Experimental results are presented in section five and a conclusion is given in section six.

II. THE AVAILABLE SEARCH PROCEDURES

A number of search procedures have been proposed in the literature. Some of the most famous ones are the stepwise, branch-and-bound, and Genetic Algorithm (GA) methods.

The stepwise search adds/removes a single feature to/from the current subset [4]. It considers local changes to the current feature subset, where a local change is simply the addition or deletion of a single feature. The stepwise search, which is also called Sequential Forward Selection (SFS)/Sequential Backward Selection (SBS), is probably the simplest search procedure. It is generally sub-optimal and suffers from the so-called "nesting effect": features that were once selected/deleted cannot later be discarded/re-selected. To overcome this problem, Pudil et al. [5] proposed a method to flexibly add and remove features, which they called "floating search".
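The stepwise idea can be made concrete with a short sketch. The code below is a minimal illustration of Sequential Forward Selection as described above, not the implementation used in this paper; `evaluate` is a stand-in for whatever subset scoring function (filter or wrapper) is adopted. Note that once a feature is added it is never removed, which is exactly the nesting effect mentioned above.

```python
# Minimal sketch of Sequential Forward Selection (SFS). `evaluate` is a
# hypothetical callable that scores a candidate feature subset (higher is
# better), e.g., the accuracy of a classifier trained on that subset.
def sfs(n_features, m, evaluate):
    """Greedily grow a subset of m features out of n_features."""
    selected = []
    remaining = set(range(n_features))
    while len(selected) < m:
        # Local change: try adding each remaining feature, keep the best one.
        best_f = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```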

The branch and bound algorithm [6] requires a monotonic evaluation function and is based on discarding subsets that do not meet a specified bound. When the size of the feature set is moderate, the branch and bound algorithm may find a practicable solution. However, this method becomes impracticable for feature selection problems involving a large number of features, especially because it may need to search the entire feasible region to find the optimal solution.

Another search procedure is based on the Genetic Algorithm (GA), which is a combinatorial search technique based on both random and probabilistic measures. Subsets of features are evaluated using a fitness function and then combined via crossover and mutation operators to produce the next generation of subsets [7]. The GA employs a population of competing solutions, evolved over time, to converge to an optimal solution. Effectively, the solution space is searched in parallel, which helps in avoiding local optima.

A GA-based feature selection solution would typically be a fixed-length binary string representing a feature subset, where the value of each position in the string represents the presence or absence of a particular feature. Promising results were achieved when comparing the performance of GA with other conventional methods [8].
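As a brief illustration of this representation (the names below are illustrative, not taken from [7] or [8]), a chromosome can be built and decoded as follows; the constraint on the number of '1's mirrors the setting used later in Section V.

```python
import random

# Sketch of a GA-style chromosome for feature selection: a fixed-length
# binary string in which position i indicates whether feature f_i is present.
def random_chromosome(n_features, n_selected):
    """Binary string with exactly n_selected ones."""
    bits = [0] * n_features
    for i in random.sample(range(n_features), n_selected):
        bits[i] = 1
    return bits

def decode(bits):
    """Return the feature subset encoded by a binary string."""
    return [i for i, b in enumerate(bits) if b == 1]
```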
We propose in this paper a subset search procedure that utilizes the ACO algorithm and aims at achieving similar or better results than GA-based feature selection.

III. ANT COLONY OPTIMIZATION

In real ant colonies, a pheromone, which is an odorous substance, is used as an indirect communication medium. When a source of food is found, ants lay some pheromone to mark the path. The quantity of the laid pheromone depends upon the distance, quantity and quality of the food source. When an isolated ant that moves at random detects a laid pheromone, it is very likely to decide to follow its path. This ant will itself lay a certain amount of pheromone, and hence reinforce the pheromone trail of that specific path. Accordingly, a path that has been used by more ants will be more attractive to follow. In other words, the probability with which an ant chooses a path increases with the number of ants that previously chose the same path. This process is hence characterized by a positive feedback loop [9].

Dorigo et al. [10] adopted this concept and proposed an artificial ant colony algorithm, called the Ant Colony Optimization (ACO) metaheuristic, to solve hard combinatorial optimization problems. ACO was originally applied to the classical traveling salesman problem [9], where it was shown to be an effective tool in finding good solutions. ACO has also been successfully applied to other optimization problems, including telecommunications networks, data mining, and vehicle routing [11, 12, 13].

For the classical Traveling Salesman Problem (TSP) [9], each artificial ant represents a simple "agent". Each agent explores the surrounding space and builds a partial solution based on local heuristics, i.e., distances to neighboring cities, and on information from previous attempts of other agents, i.e., the pheromone trail left on the paths used in those attempts.

In the first iteration, the solutions of the various agents are based only on local heuristics. At the end of the iteration, "artificial pheromone" is laid, with an intensity on each path proportional to the optimality of the solutions. As the number of iterations increases, the pheromone trails have a greater effect on the agents' solutions.

It is worth mentioning that ACO makes probabilistic decisions in terms of the artificial pheromone trails and the local heuristic information. This allows ACO to explore a larger number of solutions than greedy heuristics. Another characteristic of the ACO algorithm is pheromone trail evaporation, a process that decreases the pheromone trail intensity over time. According to [10], pheromone evaporation helps in avoiding rapid convergence of the algorithm towards a sub-optimal region.

In the next section, we present our proposed ACO algorithm and explain how it is used for searching the feature space and selecting an "appropriate" subset of features.
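This paper does not spell out the TSP transition formula, so the sketch below uses the standard Ant System decision rule [9] purely as an illustration of how pheromone trails and local heuristic information are combined probabilistically, and of how evaporation is applied. The parameter names alpha and beta follow common ACO usage rather than this paper; rho matches the trail persistence used later in Eq. (5).

```python
import random

# Illustrative Ant System-style decision: an ant at `current` picks the next
# city with probability proportional to pheromone^alpha * heuristic^beta,
# where `heuristic` could be, e.g., the inverse distance between cities.
def choose_next(current, unvisited, pheromone, heuristic, alpha=1.0, beta=2.0):
    weights = [(pheromone[current][j] ** alpha) * (heuristic[current][j] ** beta)
               for j in unvisited]
    r, acc = random.uniform(0, sum(weights)), 0.0
    for j, w in zip(unvisited, weights):
        acc += w
        if r <= acc:
            return j
    return unvisited[-1]

def evaporate(pheromone, rho):
    # Evaporation: trails are multiplied by rho each iteration, so a fraction
    # (1 - rho) of the pheromone evaporates, discouraging premature
    # convergence to a sub-optimal region [10].
    for row in pheromone:
        for j in range(len(row)):
            row[j] *= rho
```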

IV. THE PROPOSED SEARCH PROCEDURE

For a given classification task, the problem of feature selection can be stated as follows: given the original set, F, of n features, find a subset S, which consists of m features (m < n, S ⊂ F), such that the classification accuracy is maximized.

The feature selection problem representation exploited by the artificial ants includes the following:
- n features that constitute the original set, F = {f_1, ..., f_n}.
- A number of artificial ants that search through the feature space (n_a ants).
- τ_i, the intensity of the pheromone trail associated with feature f_i.
- For each ant j, a list that contains the selected feature subset, S_j = {s_1, ..., s_m}.

We propose to use a hybrid evaluation measure that is able to estimate the overall performance of subsets as well as the local importance of features. A classification algorithm is used to estimate the performance of subsets (i.e., a wrapper evaluation function). On the other hand, the local importance of a given feature is measured using the Mutual Information Evaluation Function (MIEF) [14], which is a filter evaluation function.

In the first iteration, each ant randomly chooses a feature subset of m features. Only the best k subsets, k < n_a, are used to update the pheromone trail and influence the feature subsets of the next iteration. In the second and following iterations, each ant starts with m − p features that are randomly chosen from the previously selected k-best subsets, where p is an integer that ranges between 1 and m − 1. In this way, the features that constitute the best k subsets have more chance to be present in the subsets of the next iteration. However, it is still possible for each ant to consider other features as well. For a given ant j, those features are the ones that achieve the best compromise between previous knowledge, i.e., the pheromone trails, and the local importance with respect to subset S_j, which consists of the features that have already been selected by that specific ant. The Updated Selection Measure (USM) is used for this purpose and is defined as:

USM_i^{S_j} =
\begin{cases}
\dfrac{\tau_i^{\eta}\,\bigl(LI_i^{S_j}\bigr)^{\kappa}}{\sum_{g \notin S_j} \tau_g^{\eta}\,\bigl(LI_g^{S_j}\bigr)^{\kappa}}, & \text{if } f_i \notin S_j,\\
0, & \text{otherwise}
\end{cases}
\qquad (1)

where LI_i^{S_j} is the local importance of feature f_i given the subset S_j. The parameters η and κ control the effect of the trail intensity and the local feature importance respectively. LI_i^{S_j} is defined as:

LI_i^{S_j} = I(C; f_i) \times \left[ \frac{2}{1 + \exp\bigl(-\alpha D_i^{S_j}\bigr)} - 1 \right] \qquad (2)

where

D_i^{S_j} = \min_{f_s \in S_j} \left[ \frac{H(f_i) - I(f_i; f_s)}{H(f_i)} \right] \times \frac{1}{|S_j|} \sum_{f_s \in S_j} \left[ \beta \left( \frac{I(C; \{f_i, f_s\})}{I(C; f_i) + I(C; f_s)} \right)^{\gamma} \right] \qquad (3)

The parameters α, β, and γ are constants, H(f_i) is the entropy of f_i, I(f_i; f_s) is the mutual information between f_i and f_s, I(C; f_i) is the mutual information between the class labels and f_i, and |S_j| is the cardinality of S_j. For a detailed explanation of the MIEF measure, the reader is referred to [14].
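To make the role of Eq. (1) concrete, the following sketch computes USM values for the features an ant has not yet selected. It is only an illustration: `local_importance` stands in for LI_i^{S_j} of Eqs. (2)-(3), which would in practice be computed from the entropy and mutual information terms of the MIEF measure [14].

```python
# Sketch of the Updated Selection Measure (Eq. 1). `tau` is the list of
# pheromone intensities, `subset` the features already chosen by one ant, and
# `local_importance(i, subset)` a stand-in for LI_i^{S_j} (Eqs. 2-3).
def usm_scores(tau, subset, local_importance, eta=1.0, kappa=1.0):
    """Return {feature index: USM value} for every feature outside `subset`."""
    candidates = [i for i in range(len(tau)) if i not in subset]
    raw = {i: (tau[i] ** eta) * (local_importance(i, subset) ** kappa)
           for i in candidates}
    total = sum(raw.values())
    return {i: v / total for i, v in raw.items()} if total > 0 else raw

def pick_next_feature(tau, subset, local_importance):
    # Step 3 of the algorithm below chooses the feature that maximizes USM.
    scores = usm_scores(tau, subset, local_importance)
    return max(scores, key=scores.get)
```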
Below are the steps of the algorithm:

1. Initialization:
   - Set τ_i = cc and Δτ_i = 0, (i = 1, ..., n), where cc is a constant and Δτ_i is the amount of change in the pheromone trail quantity for feature f_i.
   - Define the maximum number of iterations.
   - Define k, where the k-best subsets will influence the subsets of the next iteration.
   - Define p, where m − p is the number of features each ant will start with in the second and following iterations.
2. If in the first iteration,
   - For j = 1 to n_a,
     - Randomly assign a subset of m features to S_j.
   - Go to step 4.
3. Select the remaining p features for each ant:
   - For mm = m − p + 1 to m,
     - For j = 1 to n_a,
       - Given subset S_j, choose the feature f_i that maximizes USM_i^{S_j}.
       - S_j = S_j ∪ {f_i}.
   - Replace the duplicated subsets, if any, with randomly chosen subsets.
4. Evaluate the selected subset of each ant using a chosen classification algorithm:
   - For j = 1 to n_a,
     - Estimate the mean square error (MSE_j) of the classification results obtained by classifying the features of S_j.
   - Sort the subsets according to their MSE. Update the minimum MSE (if achieved by any ant), and store the corresponding subset of features.
5. Using the feature subsets of the best k ants:
   - For j = 1 to k,  /* update the pheromone trails */

     \Delta\tau_i =
     \begin{cases}
     \dfrac{\max_{g=1:k}(MSE_g) - MSE_j}{\max_{h=1:k}\bigl(\max_{g=1:k}(MSE_g) - MSE_h\bigr)}, & \text{if } f_i \in S_j,\\
     0, & \text{otherwise}
     \end{cases}
     \qquad (4)

     \tau_i = \rho\,\tau_i + \Delta\tau_i \qquad (5)

     where ρ is a constant such that (1 − ρ) represents the evaporation of the pheromone trails.
   - For j = 1 to n_a,
     - Randomly produce an (m − p)-feature subset for ant j, to be used in the next iteration, and store it in S_j.
6. If the number of iterations is less than the maximum number of iterations, go to step 3.

It is worth mentioning that there is little difference between the computational cost of the proposed algorithm and the GA-based search procedure. This is due to the fact that both of them evaluate the selected subsets using a "wrapper approach", which requires far more computational cost than evaluating the local importance of features using the "filter approach" adopted in the proposed algorithm.
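The steps above can be condensed into a short sketch. The code below is only an outline under several assumptions: `evaluate_mse(subset)` stands in for the wrapper evaluation (the MSE of a classifier trained on the subset), `local_importance(i, subset)` stands in for LI_i^{S_j} of Eqs. (2)-(3), duplicate-subset replacement in step 3 is omitted, and the exponents of Eq. (1) are taken as η = κ = 1, in which case the normalization does not change which feature maximizes the USM.

```python
import random

# Condensed sketch of the proposed ACO feature subset search (steps 1-6).
def aco_feature_selection(n, m, p, k, na, n_iter, evaluate_mse,
                          local_importance, rho=0.75, cc=1.0):
    tau = [cc] * n                                   # step 1: initialise trails
    best_subset, best_mse = None, float("inf")
    subsets = [random.sample(range(n), m) for _ in range(na)]   # step 2

    for _ in range(n_iter):                          # step 6 bounds the loop
        # Step 4: wrapper evaluation of each ant's subset.
        mses = [evaluate_mse(s) for s in subsets]
        order = sorted(range(na), key=lambda j: mses[j])
        if mses[order[0]] < best_mse:
            best_mse, best_subset = mses[order[0]], list(subsets[order[0]])

        # Step 5: pheromone update from the best k ants (Eqs. 4 and 5).
        top = order[:k]
        worst = max(mses[j] for j in top)
        denom = max(worst - mses[j] for j in top) or 1.0
        delta = [0.0] * n
        for j in top:
            for i in subsets[j]:
                delta[i] += (worst - mses[j]) / denom
        tau = [rho * tau[i] + delta[i] for i in range(n)]

        # Each ant keeps m - p features drawn from the best subsets, then
        # grows back to m features by greedily maximizing the USM (step 3).
        pool = list({i for j in top for i in subsets[j]})
        subsets = [random.sample(pool, min(m - p, len(pool))) for _ in range(na)]
        for s in subsets:
            while len(s) < m:
                cand = [i for i in range(n) if i not in s]
                s.append(max(cand, key=lambda i: tau[i] * local_importance(i, s)))
    return best_subset, best_mse
```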


V. EXPERIMENTAL RESULTS

We conducted an experiment to classify speech segments according to their manner of articulation. Six classes were considered: vowel, nasal, fricative, stop, glide, and silence. We used speech signals from the TIMIT database, where segment boundaries were identified.

Three different sets of features were extracted from each speech frame: 16 log mel-filter bank (MFB) features, 12 linear predictive reflection coefficients (LPR), and 10 wavelet energy bands (WVT). A context-dependent approach was adopted to perform the classification, so the features used to represent each speech segment Seg_n were the average frame features over the first and second halves of Seg_n and the average frame features of the previous and following segments (Seg_{n-1} and Seg_{n+1} respectively). Hence, the baseline feature sets based on MFB, LPR, and WVT consist of 64, 48 and 40 features respectively.

An Artificial Neural Network (ANN) was used to classify the features of each baseline set into one of the six manner-of-articulation classes. Segments from 152 speakers (56456 segments) were used to train the ANNs, and segments from 52 speakers (19228 segments) to test them. The obtained classification accuracies for MFB, LPR and WVT were 87.13%, 76.86% and 84.57% respectively. It is clear that MFB achieved the best performance among the three baseline sets; however, it used more features. The LPR set, on the other hand, was outperformed by WVT despite the fact that it used more features.

The three baseline feature sets were concatenated to form a new set of 152 features. The SFS, GA and proposed ACO algorithms were used to select from these features. For the SFS method, the algorithm starts with no features and then adds one feature at a time, such that the MIEF measure (Eq. 2) is maximized. The GA-based selection is performed using the following parameter settings: population size = 30, number of generations = 20, probability of crossover = 0.8, and probability of mutation = 0.05. The obtained strings are constrained to have the number of '1's matching a predefined number of desired features. The MSE of an ANN trained with 2000 randomly chosen segments is used as the fitness function. The parameters of the ACO algorithm described in the previous section are assigned the following values:
- η = κ = 1, which makes the trail intensity and the local measure equally important.
- α = 0.3, β = 1.65 and γ = 3, which are found to be an appropriate choice for this and other classification tasks.
- The number of ants, n_a = 30, and the maximum number of iterations, 20, are chosen to make the comparison with GA fair.
- k = 10. Thus, only the best n_a/3 ants are used to update the pheromone trails and affect the feature subsets of the next iteration.
- m − p = max(m − 5, round(0.65 × m)), where p is the number of remaining features that need to be selected in each iteration; p is therefore equal to 5 once m ≥ 13 (a small numeric sketch of this rule follows the list). The rationale behind this is that evaluating the importance of features locally becomes less reliable as the number of selected features increases. In addition, this reduces the computational cost, especially for large values of m.
- The initial value of the trail intensity is cc = 1, and the trail evaporation is 0.25, i.e., ρ = 0.75.
- Similar to the GA selection, the MSE of an ANN trained with 2000 randomly chosen segments is used to evaluate the performance of the selected subsets in each iteration.
The selected features of each method were classified using ANNs, and the obtained classification accuracies on the testing segments are shown in Fig. 1.

[Fig. 1. Performance of the feature selection methods: classification accuracy versus number of selected features (10 to 70) for SFS, GA and ACO, compared with the WVT, LPR and MFB baselines.]

It can be seen that the three feature selection methods were able to achieve classification accuracy similar to that of LPR with far fewer features than the LPR baseline set. However, ACO was the only method that achieved performance similar to WVT with a smaller number of features. Both ACO and GA achieved performance comparable to MFB using a similar number of features, with GA being slightly better. Note that SFS achieved good performance when selecting a small number of features, but its performance started to worsen as the desired number of features increased. The figure also shows that the overall performance of ACO is better than that of both GA and SFS, where the average classification accuracies of ACO, GA and SFS over all the cases are 84.22%, 83.49% and 83.19% respectively.

VI. CONCLUSION

In this paper, we presented a novel feature selection search procedure based on the Ant Colony Optimization metaheuristic. The proposed algorithm utilizes both the local importance of features and the overall performance of subsets to search through the feature space for optimal solutions. When used to select features for a speech segment classification problem, the proposed algorithm outperformed both stepwise- and GA-based feature selection methods. Experiments on other classification problems will be carried out in the future to further test the algorithm.

REFERENCES

[1] A.L. Blum and P. Langley. "Selection of relevant features and examples in machine learning". Artificial Intelligence, 97:245-271, 1997.
[2] M.A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.
[3] R. Kohavi. Wrappers for performance enhancement and oblivious decision graphs. PhD thesis, Stanford University, 1995.
[4] J. Kittler. "Feature set search algorithms". In C.H. Chen, editor, Pattern Recognition and Signal Processing. Sijthoff and Noordhoff, the Netherlands, 1978.
[5] P. Pudil, J. Novovicova, and J. Kittler. "Floating search methods in feature selection". Pattern Recognition Letters, 15:1119-1125, 1994.
[6] P.M. Narendra and K. Fukunaga. "A branch and bound algorithm for feature subset selection". IEEE Transactions on Computers, C-26:917-922, 1977.
[7] J. Yang and V. Honavar. "Feature subset selection using a genetic algorithm". IEEE Intelligent Systems, 13:44-49, 1998.
[8] M. Gletsos, S.G. Mougiakakou, G.K. Matsopoulos, K.S. Nikita, A.S. Nikita, and D. Kelekis. "A computer-aided diagnostic system to characterize CT focal liver lesions: design and optimization of a neural network classifier". IEEE Transactions on Information Technology in Biomedicine, 7:153-162, 2003.
[9] M. Dorigo, V. Maniezzo, and A. Colorni. "Ant System: optimization by a colony of cooperating agents". IEEE Transactions on Systems, Man, and Cybernetics, Part B, 26:29-41, 1996.
[10] T. Stützle and M. Dorigo. "The Ant Colony Optimization metaheuristic: algorithms, applications, and advances". In F. Glover and G. Kochenberger, editors, Handbook of Metaheuristics. Kluwer Academic Publishers, Norwell, MA, 2002.
[11] G. Di Caro and M. Dorigo. "AntNet: distributed stigmergetic control for communications networks". Journal of Artificial Intelligence Research, 9:317-365, 1998.
[12] R.S. Parpinelli, H.S. Lopes, and A.A. Freitas. "Data mining with an ant colony optimization algorithm". IEEE Transactions on Evolutionary Computation, 6:321-332, 2002.
[13] R. Montemanni, L.M. Gambardella, A.E. Rizzoli, and A.V. Donati. "A new algorithm for a dynamic vehicle routing problem based on ant colony system". Proceedings of ODYSSEUS 2003, 27-30, 2003.
[14] A. Al-Ani, M. Deriche, and J. Chebil. "A new mutual information based measure for feature selection". Intelligent Data Analysis, 7:43-57, 2003.

