Woa 1
Abstract—Medical datasets often contain many irrelevant and redundant features across patient records. Not all of these features are needed for medical decision-making. Moreover, the sheer volume of data increases dimensionality and degrades classifier performance. Many methods have recently been proposed to address this problem, and their results show that feature selection can be an effective solution. Feature selection methods mostly aim to reduce the size of the data and to improve the efficiency of learning algorithms by eliminating irrelevant and redundant features. In this paper, a meta-heuristic algorithm named FSWOA is proposed for feature selection. The algorithm is based on the hunting behavior of humpback whales and consists of three main steps: encircling prey, spiral bubble-net attacking, and search for prey. The performance of the proposed algorithm is evaluated on four standard medical datasets: Pima Indians Diabetes, Original Wisconsin Breast Cancer, Statlog, and Hepatitis. The results show that the proposed algorithm can reduce the dimensionality of medical datasets with acceptable accuracy for disease diagnosis.
Index Terms—Feature selection, Whale optimization algorithm (WOA), Dimensionality reduction, Disease diagnosis.
I. INTRODUCTION
Medical data mining has become one of the most important research topics in recent years. It relies on statistical analysis and reasoning, machine learning techniques, and pattern recognition to discover relations and hidden patterns in patient datasets. Generally, all activities in medicine can be divided into six areas: screening, diagnosis, treatment, prognosis, monitoring, and management [1]. Accuracy and sensitivity are particularly important in the diagnosis and prediction of diseases. The results of such analysis can help the doctor speed up the process of diagnosis and prognosis. As a result, the costs of treatment can be reduced and the level of health in society can be raised. In the real world, medical databases usually contain irrelevant and redundant features, which increase the dimension of the database and can lead to the curse of dimensionality. As a consequence, the accuracy, computational cost, and speed of the learning process are affected. Dimensionality reduction methods have been proposed to solve this problem. One of the best-known dimensionality reduction techniques is feature selection. The feature subset selection problem consists of identifying and selecting a useful subset of features from a larger set of often mutually redundant, possibly irrelevant, features with different associated importance [2, 3].
Feature selection techniques can be divided, according to their dependence on the classification algorithm, into two main categories: model-free and model-based [4]. In model-free methods, features are selected using statistical functions, independently of any specific data model. Common model-free methods include the F-score criterion, information gain, the correlation function, and the maximum relevance minimum redundancy technique. In the model-based approach, the selection of features depends on the performance of the predictor. Because they do not depend on a specific data model, model-free methods are better than model-based ones in terms of speed, scalability, and computational cost. However, this independence comes at the price of lower performance compared with model-based techniques.
In general, the problem space must be explored completely (exhaustive search) to extract the most effective subset of features. However, exhaustive search is infeasible for most real-world problems because of their high dimensionality, especially for NP-hard problems. Exploring the whole problem space and evaluating every state is very costly in terms of computational complexity and response time. Therefore, many meta-heuristic algorithms inspired by the foraging behavior of animals in nature have been proposed to search for optimal solutions. They trade off computational complexity against time and are mainly based on swarm intelligence, which emerges from cooperation and competition between agents. Consequently, several efficient meta-heuristic algorithms have been proposed for feature selection, such as Ant Colony Optimization (ACO) [5], Particle Swarm Optimization (PSO) [6], and Artificial Bee Colony (ABC) [7]. Recently, they have been applied to a large number of problems in the medical sciences [8, 9].
In this paper, a meta-heuristic algorithm named Feature Selection based on Whale Optimization Algorithm, or FSWOA for short, is proposed. FSWOA mainly aims to reduce the dimensionality of medical data. The algorithm is based on the hunting behavior of humpback whales and includes three main operators: encircling prey, spiral bubble-net attacking, and search for prey. The rest of the paper is organized as follows.
* Corresponding Author
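The three hunting operators named in the introduction (encircling prey, spiral bubble-net attacking, and search for prey) can be illustrated with a minimal Python sketch. It follows the general update rules of the Whale Optimization Algorithm in [16], not necessarily the authors' FSWOA implementation. The function names, the spiral constant b = 1, the clamping of positions to [0, 1], and the 0.5 threshold used to turn a continuous position into a feature mask are all assumptions made for illustration; the fitness function is supplied by the caller and is maximized.

```python
import math
import random


def woa_step(position, best, population, a, rng, b=1.0):
    """Move one whale once, using the three WOA operators.

    `a` shrinks from 2 to 0 over the run; `b` (spiral shape) is an assumed value.
    """
    if rng.random() < 0.5:
        A = 2.0 * a * rng.random() - a
        C = 2.0 * rng.random()
        # |A| < 1: encircle the best whale; otherwise search around a random whale.
        target = best if abs(A) < 1.0 else rng.choice(population)
        moved = [t - A * abs(C * t - x) for x, t in zip(position, target)]
    else:
        # Spiral bubble-net attack around the best whale.
        l = rng.uniform(-1.0, 1.0)
        spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
        moved = [abs(t - x) * spiral + t for x, t in zip(position, best)]
    # Keep every coordinate inside the assumed bounds [0, 1].
    return [min(1.0, max(0.0, v)) for v in moved]


def to_mask(position, threshold=0.5):
    """Hypothetical binarization: a feature is kept if its coordinate exceeds 0.5."""
    return [v > threshold for v in position]


def woa_feature_selection(fitness, dim, n_whales=30, max_iter=60, seed=0):
    """Search for the feature mask with the highest fitness found during the run."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(n_whales)]
    best = max(population, key=lambda w: fitness(to_mask(w)))
    best_score = fitness(to_mask(best))
    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter  # linear decrease from 2 towards 0
        population = [woa_step(w, best, population, a, rng) for w in population]
        for whale in population:
            score = fitness(to_mask(whale))
            if score > best_score:
                best, best_score = list(whale), score
    return to_mask(best), best_score
```

In a wrapper setting, `fitness` would typically train and score a classifier on the features selected by the mask; any such choice is left to the caller here.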
1243 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 9, September 2016
Precision (positive predictive value) = $\frac{TP}{TP + FP} \times 100$ (%)   (3)
Negative predictive value = $\frac{TN}{TN + FN} \times 100$ (%)   (4)
Accuracy (ACC) = $\frac{TP + TN}{TP + TN + FP + FN} \times 100$ (%)   (5)
B. Experimental Setup
The proposed algorithm is implemented in MATLAB on an Intel Core i5 CPU with 6 GB of RAM. To find the best subset of features, the algorithm is run 15 times using the evaluation functions described in Section IV-A. In each run, the dataset is first randomly split into a training set (70%) and a test set (30%). The proposed algorithm then uses the K-Nearest Neighbors algorithm with K = 3 to evaluate each subset of selected features. In addition, the maximum number of iterations is set to 60, the initial population size is set to 30, and the lower and upper bounds are set to 0 and 1, respectively.
TABLE III
Evaluation functions of the FSWOA algorithm (PPV and NPV in %)
Datasets                # NF   Statistic   AUC     PPV      NPV
Hepatitis               8      Max         0.971   88.89    100.00
                               Mean        0.752   75.99    76.56
Breast Cancer           4      Max         0.994   100.00   97.92
                               Mean        0.965   98.17    93.64
Pima Indians Diabetes   7      Max         0.771   80.83    70.59
                               Mean        0.655   75.37    59.67
Statlog                 6      Max         0.913   90.24    100.00
                               Mean        0.795   76.74    91.53
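The evaluation protocol of the experimental setup (a random 70%/30% split, a K-Nearest Neighbors classifier with K = 3, and the precision, negative predictive value, and accuracy measures) can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' MATLAB code: the function names are hypothetical, squared Euclidean distance is an assumed choice, and label 1 is assumed to denote the positive class.

```python
import random
from collections import Counter


def split_70_30(rows, labels, rng):
    """Shuffle the sample indices and split them 70% train / 30% test."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(0.7 * len(idx)))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training samples (assumed Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]


def project(rows, mask):
    """Keep only the columns whose mask entry is True."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]


def confusion(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN); label 1 is assumed to be the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Precision (PPV), NPV and accuracy in percent; NaN when undefined."""
    def pct(num, den):
        return 100.0 * num / den if den else float("nan")
    return {
        "PPV": pct(tp, tp + fp),
        "NPV": pct(tn, tn + fn),
        "ACC": pct(tp + tn, tp + tn + fp + fn),
    }


def evaluate_mask(rows, labels, mask, seed=0, k=3):
    """Score one feature subset with a single 70/30 split and a k-NN classifier."""
    rng = random.Random(seed)
    tr_x, tr_y, te_x, te_y = split_70_30(project(rows, mask), labels, rng)
    preds = [knn_predict(tr_x, tr_y, q, k) for q in te_x]
    return metrics(*confusion(te_y, preds))
```

Calling `evaluate_mask` with 15 different seeds and taking the maximum and mean of each measure would mirror the repeated-run protocol described above.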
TABLE IV
Feature reduction rate and subset of effective features
Dataset         Feature reduction rate (%)   Effective features
Hepatitis       57.89                        [27,5,28,11,15]
Breast Cancer   60.00                        [7,2,3,10]
Statlog         53.85                        [2,3,6,9,10,11]
PID             12.50                        [5,6,1,3,8,4,7]
[Figure: accuracy (%) of the WOA-based classifier versus number of features (3 to 13)]
Fig. 2. The accuracy of the classifier for different numbers of features in the Heart disease dataset.
Fig. 3. The accuracy of the classifier for different numbers of features in the Pima Indians Diabetes dataset.
V. CONCLUSION
Feature selection is the process of choosing a subset of features that maximizes the performance of the learning algorithm and reduces the dimensionality of the problem space. In this study, an efficient feature selection algorithm named FSWOA, based on the whale optimization algorithm, was proposed. The proposed algorithm searches the problem space and extracts a subset of optimal features for medical decision-making. Its performance was experimentally evaluated on four different medical datasets. The experimental results show that our method can reduce the dimension of medical datasets in disease diagnosis with acceptable accuracy.
REFERENCES
... Systems with Applications, vol. 37, pp. 2714-2723, Apr. 2010.
[4] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, 2003.
[5] M. Dorigo and T. Stützle, "The Ant Colony Optimization Metaheuristic: Algorithms, Applications, and Advances," in Handbook of Metaheuristics, F. Glover and G. A. Kochenberger, Eds. Boston, MA: Springer US, 2003, pp. 250-285.
[6] B. Xue, M. Zhang, and W. N. Browne, "Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach," IEEE Transactions on Cybernetics, vol. 43, pp. 1656-1671, 2013.
[7] B. Akay and D. Karaboga, "Artificial bee colony algorithm for large-scale problems and engineering design optimization," Journal of Intelligent Manufacturing, vol. 23, pp. 1001-1014, 2012.
[8] S. Al-Muhaideb and M. El Bachir Menai, "Hybrid Metaheuristics for Medical Data Classification," in Hybrid Metaheuristics, E.-G. Talbi, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 187-217.
[9] S. M. Vieira, L. F. Mendonça, G. J. Farinha, and J. M. C. Sousa, "Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients," Applied Soft Computing, vol. 13, pp. 3494-3504, Aug. 2013.
[10] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "Hybrid feature selection by combining filters and wrappers," Expert Systems with Applications, vol. 38, pp. 8144-8150, Jul. 2011.
[11] S. Kashef and H. Nezamabadi-pour, "An advanced ACO algorithm for feature subset selection," Neurocomputing, vol. 147, pp. 271-279, Jan. 2015.
[12] S. Şahan, K. Polat, H. Kodaz, and S. Güneş, "The Medical Applications of Attribute Weighted Artificial Immune System (AWAIS): Diagnosis of Heart and Diabetes Diseases," in Artificial Immune Systems: 4th International Conference, ICARIS 2005, Banff, Alberta, Canada, August 14-17, 2005, Proceedings, C. Jacob, M. L. Pilat, P. J. Bentley, and J. I. Timmis, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 456-468.
[13] C.-L. Huang, "ACO-based hybrid classification system with feature subset selection and model parameters optimization," Neurocomputing, vol. 73, pp. 438-448, Dec. 2009.
[14] H. H. Inbarani, A. T. Azar, and G. Jothi, "Supervised hybrid feature selection based on PSO and rough sets for medical diagnosis," Computer Methods and Programs in Biomedicine, vol. 113, pp. 175-185, Jan. 2014.
[15] J. Nahar, T. Imam, K. S. Tickle, and Y.-P. P. Chen, "Computational intelligence for heart disease diagnosis: A medical knowledge driven approach," Expert Systems with Applications, vol. 40, pp. 96-104, Jan. 2013.
[16] S. Mirjalili and A. Lewis, "The Whale Optimization Algorithm," Advances in Engineering Software, vol. 95, pp. 51-67, May 2016.
[17] K. Bache and M. Lichman, "UCI Machine Learning Repository," School of Information and Computer Science, University of California, Irvine, 2016. Available: http://archive.ics.uci.edu/ml