This article proposes a new feature selection algorithm called FSWOA (Feature Selection based on Whale Optimization Algorithm) based on the hunting behavior of humpback whales. The algorithm includes three main steps: encircling prey, spiral bubble-net attacking, and searching for prey. It is evaluated on four medical datasets and aims to reduce dimensionality while maintaining acceptable classification accuracy for disease diagnosis. Preliminary results show the proposed algorithm can reduce the dimensionality of medical datasets.


International Journal of Computer Science and Information Security (IJCSIS),

Vol. 14, No. 9, September 2016

Feature Selection Based on Whale Optimization Algorithm for Diseases Diagnosis
Hoda Zamani, Mohammad-Hossein Nadimi-Shahraki*
Faculty of Computer Engineering, Najafabad branch, Islamic Azad University, Najafabad, Iran


Abstract: Medical datasets often contain many irrelevant and redundant features across a series of patient records. Not all of these features are required to reach a medical decision. Moreover, the large size of the data increases the dimensionality and reduces the performance of the classifier. Many methods have recently been proposed to solve this problem, and their results show that feature selection can be an effective solution. Feature selection methods mostly aim to reduce the size of the data and to improve the efficiency of learning algorithms by eliminating unrelated and redundant features. In this paper, a meta-heuristic algorithm named FSWOA is proposed for feature selection. The algorithm is based on the hunting methods of humpback whales and consists of three main steps: encircling prey, spiral bubble-net attacking and search for prey. The performance of the proposed algorithm is evaluated on four standard medical datasets: Pima Indians Diabetes, Original Wisconsin Breast Cancer, Statlog and Hepatitis. The results show that the proposed algorithm can reduce the dimensionality of medical datasets with acceptable accuracy for disease diagnosis.

Index Terms: Feature selection, Whale optimization algorithm (WOA), Dimensionality reduction, Diseases diagnosis.

I. INTRODUCTION
Medical data mining has been one of the most important topics in recent years. It relies on analysis and statistical reasoning, machine learning techniques and pattern recognition to discover relations and hidden patterns in patient datasets. Generally, all activities in medicine can be divided into six areas: screening, diagnosis, treatment, prognosis, monitoring and management [1]. Accuracy and sensitivity are particularly important in the diagnosis and prediction of diseases. Such analysis can support the doctor and speed up the process of diagnosis and prognosis. Therefore, the costs of treatment can be reduced and the level of health in society can be increased. In the real world, medical databases are usually filled with irrelevant and redundant features, which increase the dimension of the database and expose it to the curse of dimensionality. As a result, the accuracy, computational cost and speed of the learning process are affected. Dimensionality reduction methods have been proposed to solve this problem. One of the best-known dimensionality reduction techniques is feature selection. The feature subset selection problem consists of identifying and selecting a useful subset of features from a larger set of often mutually redundant, possibly irrelevant, features with different associated importance [2, 3].

Feature selection techniques can be divided, based on their dependence on the classification algorithm, into two main categories: model-free and model-based [4]. In model-free methods, feature extraction is based on statistical functions and is independent of a specific data model. Common model-free methods include the F-score criterion, information gain, the correlation function and the maximum relevance minimum redundancy technique. In the model-based approach, feature selection and extraction depend on the performance of the predictor. Because they are independent of a specific data model, model-free methods are better than model-based ones in terms of speed, scalability and computational cost. However, this independence reduces their performance compared with model-based techniques.

In general, the problem space must be explored completely (exhaustive search) to extract the effective subset of features. However, exhaustive search is impossible for most real-world problems because of their high dimensionality, especially in NP-hard problems: a dataset with n features has 2^n - 1 non-empty feature subsets. Exploring the whole problem space and evaluating all states is therefore very costly in terms of computational complexity and response time. For this reason, many meta-heuristic algorithms inspired by the foraging behavior of animals in nature have been proposed to find optimal solutions. They trade off computational complexity against time and are mainly based on swarm intelligence, which arises from cooperation and competition between agents. Consequently, some efficient meta-heuristic algorithms have been proposed for feature selection, such as Ant Colony Optimization (ACO) [5], Particle Swarm Optimization (PSO) [6] and Artificial Bee Colony (ABC) [7]. Recently, they have been applied to a large number of applications in the medical sciences [8, 9].

In this paper, a meta-heuristic algorithm named Feature Selection based on Whale Optimization Algorithm, or FSWOA for short, is proposed. FSWOA mainly aims to reduce the dimensionality of medical data. The algorithm is based on the hunting methods of humpback whales and includes three main operators: encircling prey, spiral bubble-net attacking and search for prey. The rest of the paper is organized as follows.

* Corresponding Author
https://sites.google.com/site/ijcsis/
ISSN 1947-5500

Section II reviews related work on feature selection based on meta-heuristics. The proposed algorithm is described in Section III. In Section IV, the performance of the proposed algorithm is evaluated on well-known medical datasets and the results are reported. Finally, the conclusions are discussed in Section V.

II. RELATED WORKS
Feature selection methods have been applied to classification problems in order to select a reduced feature set that makes the classifier more accurate and faster [10]. For a large number of features, evaluating all states is computationally infeasible, which calls for meta-heuristic search methods [11]. These methods try to solve the challenges of real-world problems through competition and cooperation strategies between agents. Many studies on feature selection intersect with swarm intelligence. Advanced binary ACO (ABACO) was proposed for feature selection and dimension reduction [11]. In 2005, Şahan et al. applied the Attribute Weighted Artificial Immune System (AWAIS) to diagnose heart and diabetes diseases; the study showed the negative effects of irrelevant features on the disease diagnosis process [12]. In another study, Huang proposed a hybrid method combining the ant colony optimization algorithm and a support vector machine for feature selection [13]. Inbarani et al. offered a model based on PSO and rough sets for feature selection [14]. Nahar et al. used computational intelligence to diagnose heart disease [15]. Another swarm-based meta-heuristic optimization algorithm, inspired by the hunting behavior of humpback whales, was proposed by Mirjalili and Lewis [16] and is called the whale optimization algorithm (WOA).

Initialization:
  1. Generate k humpback whales
  2. Randomly distribute each whale in the problem space
  3. Evaluate and select the best whale
Main loop:
  4. Bubble-net attacking method
  5. Evaluate the position of each whale
  6. Select the best whale
  7. Search for prey (exploration phase)
  8. Termination condition satisfied? No: continue iterating. Yes: return the best subset.

Fig. 1. The flowchart of the proposed FSWOA
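As a rough illustration of the loop in Fig. 1, the following Python sketch runs a whale-style search over feature subsets. It is not the authors' MATLAB implementation. The position updates follow the standard WOA rules of Mirjalili and Lewis [16] in a simplified form with one scalar A and C per whale. The 0.5 threshold that turns a position in [0, 1] into a feature mask, the spiral constant b = 1 and the helper names `knn_accuracy` and `fswoa_sketch` are assumptions made for illustration. The defaults mirror the population size (30), iteration limit (60), bounds (0 and 1) and K = 3 reported in the experimental setup.

```python
import math
import random


def knn_accuracy(train, test, mask, k=3):
    """Accuracy of a k-NN classifier restricted to the features where mask is True."""
    idx = [j for j, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0  # an empty subset cannot classify anything
    correct = 0
    for x, y in test:
        # Rank training samples by squared Euclidean distance on the selected features.
        nearest = sorted(train, key=lambda s: sum((s[0][j] - x[j]) ** 2 for j in idx))[:k]
        votes = [label for _, label in nearest]
        correct += max(set(votes), key=votes.count) == y
    return correct / len(test)


def fswoa_sketch(train, test, n_features, n_whales=30, max_iter=60, b=1.0, seed=0):
    """Return (feature mask, accuracy) of the best whale found (hypothetical helper)."""
    rng = random.Random(seed)

    def to_mask(pos):
        return [v > 0.5 for v in pos]  # assumed binarization threshold

    def fitness(pos):
        return knn_accuracy(train, test, to_mask(pos))

    def clip(v):
        return min(1.0, max(0.0, v))  # lower and upper bounds 0 and 1

    # Initialization: scatter the whales randomly and select the best one.
    whales = [[rng.random() for _ in range(n_features)] for _ in range(n_whales)]
    best = max(whales, key=fitness)[:]
    best_fit = fitness(best)

    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter  # decreases linearly from 2 towards 0
        for i, x in enumerate(whales):
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # |A| < 1: shrink towards the best whale; otherwise explore
                # around a randomly chosen whale (search for prey).
                target = best if abs(A) < 1 else rng.choice(whales)
                new = [tj - A * abs(C * tj - xj) for tj, xj in zip(target, x)]
            else:
                # Spiral bubble-net move around the best whale.
                l = rng.uniform(-1.0, 1.0)
                new = [abs(bj - xj) * math.exp(b * l) * math.cos(2 * math.pi * l) + bj
                       for bj, xj in zip(best, x)]
            whales[i] = [clip(v) for v in new]
        # Re-evaluate every whale and keep the best subset found so far.
        for x in whales:
            f = fitness(x)
            if f > best_fit:
                best, best_fit = x[:], f
    return to_mask(best), best_fit
```

Here `train` and `test` are lists of `(feature_vector, label)` pairs. Because the fitness is a plain accuracy value, any other classifier could be substituted for the small k-NN helper without changing the search loop.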
III. PROPOSED ALGORITHM
This section describes the proposed meta-heuristic algorithm, named Feature Selection based on Whale Optimization Algorithm, or FSWOA for short. FSWOA is a new algorithm for feature selection based on the hunting methods of humpback whales. It includes three main steps: encircling prey, spiral bubble-net attacking and search for prey. Fig. 1 shows the flowchart and main steps of FSWOA. In the first step, the algorithm generates k humpback whales and randomly scatters them in the search space. Then, the position of each humpback whale is evaluated and the best whale is selected. The other whales try to update their positions towards the best whale. In the second step, the humpback whales start to attack with a bubble-net strategy. There are two strategies for bubble-net attacking: shrinking encircling and spiral position updating. This step corresponds to the exploitation phase, in which each whale suggests a subset of features. These feature subsets are then evaluated based on the accuracy of the classifier on the testing set. In the third step, the exploration phase, the humpback whales search for prey randomly according to the positions of each other.

IV. EXPERIMENTAL EVALUATION
In this section, the performance of the proposed algorithm is evaluated on four benchmark medical datasets downloaded from the UCI machine learning repository [17]. These datasets are Pima Indians Diabetes, Original Wisconsin Breast Cancer, Statlog and Hepatitis, which are popular for feature selection problems. Table I shows the statistical information of the datasets.

TABLE I
Statistical Information of Datasets

Dataset                           Features  Samples  Classes  Missing data
Pima Indians Diabetes             8         768      2        Yes
Original Wisconsin Breast Cancer  10        699      2        Yes
Statlog                           13        270      2        No
Hepatitis                         19        155      2        Yes


The Pima Indians Diabetes dataset (PID) is used to diagnose whether or not a person has diabetes based on clinical and laboratory data. The purpose of the Original Wisconsin Breast Cancer dataset is breast cancer diagnosis. Heart disease is predicted with the Statlog dataset. The objective of the Hepatitis dataset is to predict whether a person with hepatitis will live or die. Since real-world medical datasets are noisy and incomplete, these datasets are first normalized and then the proposed algorithm is evaluated on each dataset.

A. Evaluation Functions
The subsets of features selected by the FSWOA algorithm are evaluated by well-known evaluation functions: sensitivity, specificity, precision, negative predictive value (NPV), area under the curve (AUC) and accuracy. The sensitivity and specificity, shown in Eq. (1) and (2), indicate the proportion of samples in the positive and negative classes, respectively, that are correctly classified. Precision, or positive predictive value (PPV), and negative predictive value (NPV) are computed by Eq. (3) and (4). AUC is the area under the curve of the true positive rate plotted against the false positive rate; its value is between 0.0 and 1.0. The cost function of the proposed algorithm is defined by the accuracy of the classifier, shown in Eq. (5). Here TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives; the sum of TP and FP is the total number of subjects with a positive test, and the sum of FN and TN is the total number of subjects with a negative test.

$$\text{Sensitivity (true positive rate)} = \frac{TP}{TP + FN}\ (\%) \tag{1}$$

$$\text{Specificity (true negative rate)} = \frac{TN}{TN + FP}\ (\%) \tag{2}$$

$$\text{Precision (positive predictive value)} = \frac{TP}{TP + FP}\ (\%) \tag{3}$$

$$\text{Negative predictive value} = \frac{TN}{TN + FN}\ (\%) \tag{4}$$

$$\text{Accuracy (ACC)} = \frac{TP + TN}{TP + TN + FP + FN}\ (\%) \tag{5}$$

B. Experimental Setup
The proposed algorithm is implemented in MATLAB on an Intel Core-i5 CPU with 6 GB of RAM. To find the best subset of features, the algorithm is run 15 times using the evaluation functions described in Section IV-A. In each run, the dataset is first randomly split into a training set (70%) and a test set (30%). Then, the proposed algorithm uses the K-Nearest Neighbors algorithm with K=3 to evaluate the subset of selected features. In addition, the maximum number of iterations is set to 60, the initial population size is set to 30, and the lower and upper bounds are set to 0 and 1, respectively.

C. Experimental Results
Tables II and III show the results of the experimental evaluation of the proposed algorithm, where max and mean indicate the maximum and average values, respectively. The maximum accuracy of the proposed algorithm observed on these medical datasets is 87.10 % for Hepatitis, 97.86 % for Breast Cancer, 78.57 % for Pima Indians Diabetes and 77.05 % for Statlog (heart disease). Moreover, the proposed algorithm selects 6 features for the diagnosis of heart disease, 7 for Pima Indians Diabetes, 8 for Hepatitis and 4 for Breast Cancer.

Finally, the feature reduction rate is computed for all datasets. As shown in Table IV, the rates for Hepatitis, Breast Cancer, Pima Indians Diabetes and heart disease (Statlog) are 57.89 %, 60 %, 12.5 % and 53.85 %, respectively. In addition, Figs. 2 and 3 show the accuracy of the classifier for different numbers of selected features on the heart disease and diabetes datasets, respectively.

TABLE II
Evaluation functions (in %) of FSWOA algorithm

Datasets               # NF  Accuracy  Sensitivity  Specificity
Hepatitis              8
  Max                        87.10     100.00       94.12
  Mean                       73.33     70.51        79.97
Breast Cancer          4
  Max                        97.86     98.90        100.00
  Mean                       96.57     96.60        96.47
Pima Indians Diabetes  7
  Max                        78.57     90.65        63.46
  Mean                       70.87     82.72        48.28
Statlog                6
  Max                        77.05     100.00       82.76
  Mean                       62.84     94.71        64.36

TABLE III
Evaluation functions of FSWOA algorithm (PPV and NPV in %, AUC between 0 and 1)

Datasets               # NF  AUC    PPV     NPV
Hepatitis              8
  Max                        0.971  88.89   100.00
  Mean                       0.752  75.99   76.56
Breast Cancer          4
  Max                        0.994  100.00  97.92
  Mean                       0.965  98.17   93.64
Pima Indians Diabetes  7
  Max                        0.771  80.83   70.59
  Mean                       0.655  75.37   59.67
Statlog                6
  Max                        0.913  90.24   100.00
  Mean                       0.795  76.74   91.53
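The evaluation functions of Section IV-A and the feature reduction rate can be checked with a few lines of Python. This is a minimal sketch rather than the authors' code. The function names are hypothetical, and the reduction rate is assumed to be the percentage of original features that were removed. That assumed definition reproduces the reported rates when combined with the feature counts of Table I and the numbers of selected features (# NF) in Table II.

```python
def evaluation_functions(tp, fp, tn, fn):
    """Return Eqs. (1)-(5) in percent from confusion-matrix counts."""
    def pct(num, den):
        # A ratio is undefined when its denominator is zero.
        return 100.0 * num / den if den else float("nan")

    return {
        "sensitivity": pct(tp, tp + fn),               # Eq. (1)
        "specificity": pct(tn, tn + fp),               # Eq. (2)
        "precision": pct(tp, tp + fp),                 # Eq. (3), PPV
        "npv": pct(tn, tn + fn),                       # Eq. (4)
        "accuracy": pct(tp + tn, tp + tn + fp + fn),   # Eq. (5)
    }


def reduction_rate(n_total, n_selected):
    """Percentage of features removed (assumed definition)."""
    return 100.0 * (n_total - n_selected) / n_total


# (total features from Table I, selected features from Table II)
counts = {"Hepatitis": (19, 8), "Breast Cancer": (10, 4), "Statlog": (13, 6), "PID": (8, 7)}
rates = {name: round(reduction_rate(total, kept), 2) for name, (total, kept) in counts.items()}
print(rates)
# {'Hepatitis': 57.89, 'Breast Cancer': 60.0, 'Statlog': 53.85, 'PID': 12.5}
```

For example, Hepatitis keeps 8 of 19 features, so (19 - 8) / 19 gives 57.89 %, while PID keeps 7 of 8 features and therefore drops only 12.5 %.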

TABLE IV
Features reduction rate and subset of effective features

Dataset        Features reduction rate (%)  Effective features
Hepatitis      57.89                        [27,5,28,11,15]
Breast Cancer  60.00                        [7,2,3,10]
Statlog        53.85                        [2,3,6,9,10,11]
PID            12.50                        [5,6,1,3,8,4,7]

[Figure: accuracy of WOA (y-axis, 0 to 0.8) versus number of features (x-axis, 3 to 13)]
Fig. 2. The accuracy of the classifier for different numbers of features in the Heart disease dataset

Fig. 3. The accuracy of the classifier for different numbers of features in the Pima Indians Diabetes dataset

V. CONCLUSION
Feature selection is the process of choosing a subset of features that maximizes the performance of a learning algorithm and reduces the dimensionality of the problem space. In this study, an efficient feature selection algorithm named FSWOA was proposed based on the whale optimization algorithm. The proposed algorithm searches the problem space and extracts a subset of optimal features for medical decision-making. The performance of this algorithm was experimentally evaluated on four different medical datasets. The experimental results show that our method can reduce the dimension of medical datasets in disease diagnosis with an acceptable accuracy.

REFERENCES

[1] N. Esfandiari, M. R. Babavalian, A.-M. E. Moghadam, and V. K. Tabar, "Knowledge discovery in medicine: Current issue and future trend," Expert Systems with Applications, vol. 41, pp. 4434-4463, 2014.
[2] H. Liu and H. Motoda, "Feature Selection Methods," in Feature Selection for Knowledge Discovery and Data Mining. Boston, MA: Springer US, 1998, pp. 73-95.
[3] S. M. Vieira, J. M. C. Sousa, and T. A. Runkler, "Two cooperative ant colonies for feature selection using fuzzy models," Expert Systems with Applications, vol. 37, pp. 2714-2723, 2010.
[4] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, 2003.
[5] M. Dorigo and T. Stützle, "The Ant Colony Optimization Metaheuristic: Algorithms, Applications, and Advances," in Handbook of Metaheuristics, F. Glover and G. A. Kochenberger, Eds. Boston, MA: Springer US, 2003, pp. 250-285.
[6] B. Xue, M. Zhang, and W. N. Browne, "Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach," IEEE Transactions on Cybernetics, vol. 43, pp. 1656-1671, 2013.
[7] B. Akay and D. Karaboga, "Artificial bee colony algorithm for large-scale problems and engineering design optimization," Journal of Intelligent Manufacturing, vol. 23, pp. 1001-1014, 2012.
[8] S. Al-Muhaideb and M. El Bachir Menai, "Hybrid Metaheuristics for Medical Data Classification," in Hybrid Metaheuristics, E.-G. Talbi, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 187-217.
[9] S. M. Vieira, L. F. Mendonça, G. J. Farinha, and J. M. C. Sousa, "Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients," Applied Soft Computing, vol. 13, pp. 3494-3504, 2013.
[10] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "Hybrid feature selection by combining filters and wrappers," Expert Systems with Applications, vol. 38, pp. 8144-8150, 2011.
[11] S. Kashef and H. Nezamabadi-pour, "An advanced ACO algorithm for feature subset selection," Neurocomputing, vol. 147, pp. 271-279, 2015.
[12] S. Şahan, K. Polat, H. Kodaz, and S. Güneş, "The Medical Applications of Attribute Weighted Artificial Immune System (AWAIS): Diagnosis of Heart and Diabetes Diseases," in Artificial Immune Systems: 4th International Conference, ICARIS 2005, Banff, Alberta, Canada, August 14-17, 2005. Proceedings, C. Jacob, M. L. Pilat, P. J. Bentley, and J. I. Timmis, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 456-468.
[13] C.-L. Huang, "ACO-based hybrid classification system with feature subset selection and model parameters optimization," Neurocomputing, vol. 73, pp. 438-448, 2009.
[14] H. H. Inbarani, A. T. Azar, and G. Jothi, "Supervised hybrid feature selection based on PSO and rough sets for medical diagnosis," Computer Methods and Programs in Biomedicine, vol. 113, pp. 175-185, 2014.
[15] J. Nahar, T. Imam, K. S. Tickle, and Y.-P. P. Chen, "Computational intelligence for heart disease diagnosis: A medical knowledge driven approach," Expert Systems with Applications, vol. 40, pp. 96-104, 2013.
[16] S. Mirjalili and A. Lewis, "The Whale Optimization Algorithm," Advances in Engineering Software, vol. 95, pp. 51-67, 2016.
[17] K. Bache and M. Lichman (2016), UCI machine learning repository. School of Information and Computer Science, University of California, Irvine. Available: http://archive.ics.uci.edu/ml.


Hoda Zamani was born in Iran. She received both B.S. and M.S. degrees in software engineering, in 2012 and 2015 respectively, from the Faculty of Computer Engineering, Islamic Azad University of Najafabad (IAUN) in Iran. Her research interests include data mining, medical data mining and meta-heuristic algorithms.

Mohammad-Hossein Nadimi-Shahraki was born in Iran. He received his PhD in computer science, majoring in artificial intelligence and data mining, from University Putra of Malaysia (UPM) in 2010. His research interests include data mining, medical data mining, social network mining and big data mining. He was director general of research at IAUN from 2012 to 2014 and is currently dean of the Faculty of Computer Engineering of Islamic Azad University of Najafabad (IAUN) in Iran. Dr. Nadimi is a member of professional societies such as IEEE and IAENG. He received an award at the International Research and Technology Expo, Malaysia Invention & Innovation Awards 2010 (MTE2010). He was also recognized as top researcher at IAUN in 2012 and 2014, and his data mining book was recognized as top book in Iran in 2016.
