
Proceedings of the International Conference on Communication and Electronics Systems (ICCES 2018)
IEEE Xplore Part Number: CFP18AWO-ART; ISBN: 978-1-5386-4765-3

Correlation Based Feature Selection Algorithm for Machine Learning

N. Gopika
Department of Computer Science and Engineering
Government College of Technology
Coimbatore, India
Email: gopikagce@gmail.com

A. Meena Kowshalaya, M.E.
Assistant Professor, Department of Computer Science and Engineering
Government College of Technology
Coimbatore, India
Email: meenakowsalya@gct.ac.in

Abstract— Feature selection is an effective strategy to reduce dimensionality, remove irrelevant data and increase learning accuracy. The curse of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this paper we use three feature selection algorithms, namely Fast Correlation Based Feature Selection (FCBF), a variation of FCBF called Fast Correlation Based Feature Selection # (FCBF#), and Fast Correlation Based Feature Selection in Pieces (FCBFiP). The three algorithms are compared, and experimental results show that FCBFiP is efficient compared to FCBF and FCBF#.

Keywords— Fast Correlation Based Feature Selection (FCBF), FCBF#, FCBFiP, filter methods, wrapper methods
I. INTRODUCTION

Today's revolutionary paradigms include IoT, Big Data, social networks and machine learning. These paradigms process voluminous amounts of data every day. IoT devices sense and handle heterogeneous, high-dimensional data; these data are unstructured, noisy and difficult to manage. Industries also process voluminous amounts of data every day, depending on customer needs. The main challenges faced by industry are to satisfy customer expectations, support the performance of customer applications, reduce storage costs, and handle high-dimensional, complex tasks. One fundamental requirement for handling high-dimensional data with ease is to solve the curse of dimensionality problem.

The curse of dimensionality is addressed through feature extraction and feature selection methodologies. Feature extraction derives apt features from the original dataset effectively, but it is not suitable for all learning problems. Feature selection, as an alternative to feature extraction, is often used as a preprocessing step in machine learning. Feature selection is the process of identifying irrelevant features, removing redundant features present in the original dataset, and eliminating noisy and unreliable data.

Machine learning algorithms have a huge impact in IoT environments. Machine learning provides a predictive model that predicts the response to a problem by employing knowledge previously collected in a dataset. Feature selection consists of two important stages: first, select the features relevant to the class (basis) in the original dataset; second, identify redundant features and remove them. The benefits of performing feature selection [1] are that it prevents the model from overfitting, reduces storage requirements, increases accuracy, reduces computational cost, improves the interpretability of the predictive model, reduces computing resources and mitigates the curse of dimensionality.

Feature selection approaches are grouped into three categories: filter methods, wrapper methods and embedded methods [1, 2]. Filter methods preprocess the data. They consider the relationship between features to calculate and predict the target feature; various statistical tests are applied to the features to detect the highest ranked ones.

Wrapper methods evaluate subsets of features by their predictive accuracy using statistical resampling or cross validation. The model is trained on subsets of features chosen iteratively. This method is slower because of the extra training and cross validation. The embedded method is a hybrid that combines the filter and wrapper approaches. The filter and wrapper styles are illustrated in the sketch below.
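To make the distinction concrete, the following sketch (not from the paper; the synthetic dataset and all parameters are our own illustrative choices) uses scikit-learn's SelectKBest with mutual_info_classif as a filter method and RFE as a wrapper method.

# Hypothetical illustration of the filter and wrapper styles with
# scikit-learn; dataset and parameters are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 features, only 10 of which carry class information.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# Filter: rank features by a statistic (mutual information with the class)
# and keep the top k; no model is trained during the ranking.
filter_selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_filter = filter_selector.fit_transform(X, y)

# Wrapper: repeatedly train a model and drop the weakest features,
# which is slower but evaluates subsets by predictive behaviour.
wrapper_selector = RFE(estimator=LogisticRegression(max_iter=1000),
                       n_features_to_select=10)
X_wrapper = wrapper_selector.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)  # (500, 10) (500, 10)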
II. LITERATURE SURVEY

The authors in [4] use Relief, Fisher score and mutual information to identify the highest ranked features in the class. Two Differential Evolution (DE) based filter approaches are used to handle single-objective and multi-objective problem designs. A Mutual Information Feature Selection (MIFS) approach based on maximum relevance and minimum redundancy is applied to single-objective and multi-objective differential evolution for feature selection. These algorithms provide better performance for single and multiple objectives.

A filter algorithm [5] is exploited for discrete classification problems. The fast correlation based filter method is applied to continuous and discrete problems. Features are selected using the Relief algorithm to reduce dimensionality. Correlation based feature selection (CFS) is a heuristic technique for evaluating feature subsets. CFS can be applied to discrete and continuous features: if the feature is discrete, symmetric uncertainty can be used; if the feature is continuous, linear correlation (Pearson's correlation) can be applied.


These methods reduce the dimensionality of the original dataset and improve the performance of the learning algorithm.

A feature selection framework for classification and clustering was proposed in [6]. Design frameworks for different search strategy algorithms are discussed; frequently applied combinations of search strategies lead to intelligent feature selection procedures.

Five methodologies for dimensionality reduction are proposed in [7]: 1) document frequency (DF) removes redundant and unwanted documents; 2) information gain (IG) measures how many bits of information are present or absent in terms of a document; 3) the mutual information (MI) model finds word associations and their importance to the document; 4) the chi-square test measures the independence of two features; 5) term strength (TS) identifies closely related target documents. These methodologies reduce the feature space and improve application performance.

A novel technique was proposed in [8] to select good features from the original dataset using a statistical dependency approach. The maximum dependency in the dataset is found using the minimal redundancy maximal relevance technique. Minimal dependency identifies the redundant data with the help of mutual information, so that redundancy in the original dataset is reduced. Different classifiers were applied to the dataset to measure their performance.

A novel approach was presented in [9] to overcome overlooked software faults utilizing both feature selection and classification techniques. The system involves Hybrid Feature Selection (HFS), which finds a minimal and optimal set of features: highly correlated features are selected by combining feature ranking with subset selection. The result improves the system process, minimizes the cost and increases the prediction performance.

New methods called Orthogonal Forward Selection (OFS) and Orthogonal Backward Elimination (OBE) were proposed in [10]. These methods find an orthogonal space and perform feature subset selection efficiently. They mainly deal with real-world problems and the correlation among candidate features.

III. FAST CORRELATION BASED FEATURE SELECTION (FCBF) ALGORITHM

FCBF selects features by their goodness for classification [2]. A feature is selected if it is good, i.e. relevant to the class but not redundant to any other relevant feature. The correlation between pairs of features is measured, and relevant features are selected from the original dataset such that they are highly correlated with the class.

Two approaches are used to measure the correlation between two random features: classical linear correlation and a measure based on information theory. Numerical data use the classical linear correlation approach. For each pair (X, Y), the linear correlation coefficient r is given by equation (1):

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}    (1)

where \bar{x} is the mean of X and \bar{y} is the mean of Y. The value of r lies between -1 and 1. If r is 0, X and Y are independent; |r| = 1 indicates complete correlation.
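As a quick sanity check of equation (1), r can be computed directly with NumPy; the two arrays below are made-up sample values.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly 2x, so r should be near 1

# Direct transcription of equation (1).
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(r)                        # ~0.999
print(np.corrcoef(x, y)[0, 1])  # the same value from NumPy's built-in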
Alternatively, entropy can be used. Entropy measures the uncertainty of a feature. The entropy of a variable X is defined according to equation (2):

H(X) = -\sum_i P(x_i) \log_2 P(x_i)    (2)

The entropy of X given another feature Y, called the conditional entropy of X given Y, is computed according to equation (3):

H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)    (3)

where P(x_i) is the prior probability of each value of X and P(x_i|y_j) is the posterior probability of X given a value of Y. Information gain measures the goodness of a feature:

IG(X|Y) = H(X) - H(X|Y)    (4)

Symmetric uncertainty is another desired property for measuring the correlation of features:

SU(X,Y) = 2\left[\frac{IG(X|Y)}{H(X) + H(Y)}\right]    (5)

SU(X,Y) = 1 indicates that X and Y are completely correlated; SU(X,Y) = 0 indicates that they are uncorrelated.
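Equations (2)-(5) are straightforward to implement for discrete features. The helper functions below are a minimal sketch (our own naming, not the authors' code); the later sketches reuse symmetric_uncertainty.

import numpy as np
from collections import Counter

def entropy(x):
    # H(X) = -sum_i P(x_i) log2 P(x_i), equation (2).
    probs = np.array(list(Counter(x).values())) / len(x)
    return -np.sum(probs * np.log2(probs))

def conditional_entropy(x, y):
    # H(X|Y) = sum_j P(y_j) H(X | Y = y_j), equivalent to equation (3).
    x, y = np.asarray(x), np.asarray(y)
    return sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))

def information_gain(x, y):
    # IG(X|Y) = H(X) - H(X|Y), equation (4).
    return entropy(x) - conditional_entropy(x, y)

def symmetric_uncertainty(x, y):
    # SU(X,Y) = 2 * IG(X|Y) / (H(X) + H(Y)), equation (5).
    return 2.0 * information_gain(x, y) / (entropy(x) + entropy(y))

x = [0, 0, 1, 1]
print(symmetric_uncertainty(x, x))             # 1.0: completely correlated
print(symmetric_uncertainty(x, [0, 1, 0, 1]))  # 0.0: uncorrelated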
FCBF feature selection proceeds in two steps:
1. Select the features relevant to the class.
2. Iteratively eliminate the remaining features that are redundant with respect to the selected features.
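A simplified sketch of these two steps (our reading of FCBF [2], reusing symmetric_uncertainty from above; the threshold parameter is an assumption): features are ranked by SU against the class, and a feature is discarded as redundant when its SU with an already selected feature is at least its SU with the class.

def fcbf(features, labels, threshold=0.0):
    # Step 1: rank features by relevance, SU(feature, class), keeping
    # only those above the relevance threshold.
    ranked = sorted(((symmetric_uncertainty(f, labels), i)
                     for i, f in enumerate(features)), reverse=True)
    ranked = [(su, i) for su, i in ranked if su > threshold]

    # Step 2: walk the ranking and eliminate redundant features, i.e.
    # those more correlated with a kept feature than with the class.
    selected = []
    for su_class, i in ranked:
        if not any(symmetric_uncertainty(features[i], features[j]) >= su_class
                   for j in selected):
            selected.append(i)
    return selected

# Example: two identical relevant columns plus one irrelevant column.
cols = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(fcbf(cols, labels=[0, 0, 1, 1]))  # [1]: one duplicate kept, rest dropped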


FAST CORRELATION BASED FEATURE SELECTION # (FCBF#) ALGORITHM

FCBF# employs wrapper selection strategies [3]. Feature selection starts with all features; it then applies sequential backward elimination and uses the symmetric uncertainty value to find the dependency between the features. This process continues until no more eliminations can be made in the dataset.

FCBF# also exploits a sequential forward selection strategy. These methods are more balanced in selecting the best subset of k features. FCBF# eliminates features that are lightly correlated with the class. Only one feature is eliminated at every iteration, which makes the elimination process more balanced. These steps are repeated until no more elimination is possible, and cross validation is applied until a k-feature subset is obtained.

FCBF# enables efficient selection of a feature subset of any given size. The additional validations in FCBF# lead to better selection of a highly correlated feature subset. The main drawback is that FCBF# consumes more time to process the dataset due to the excess training.
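One way to picture the backward-elimination loop with cross validation (a rough sketch under our own assumptions, reusing symmetric_uncertainty from above; it scores features only against the class and omits the redundancy handling of the full algorithm):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def backward_eliminate(X, y, k):
    # Drop the feature with the weakest SU(feature, class) one at a
    # time, cross-validating each intermediate subset, until k remain.
    remaining = list(range(X.shape[1]))
    while len(remaining) > k:
        scores = [symmetric_uncertainty(X[:, j], y) for j in remaining]
        remaining.pop(int(np.argmin(scores)))
        acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X[:, remaining], y, cv=5).mean()
        print(len(remaining), "features left, CV accuracy", round(acc, 4))
    return remaining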
FAST CORRELATION BASED FEATURE SELECTION IN PIECES (FCBFiP) ALGORITHM

FCBFiP is a recent modification of FCBF# [1]. The features contained in the original dataset are divided into P pieces, and the least scoring feature of each piece is removed in every elimination step.

The previous version contains two main steps: the first evaluates the relevance of features to the target class based on the correlation between features, arranging them in descending order; the second identifies redundancy among the set, iteratively eliminating one feature from the original dataset while maintaining the best subset list.

In FCBFiP the size of each piece is defined as

S = N / P    (6)

where N is the number of features in the original dataset and P is the number of pieces. If P is small, time consumption is high; if P is large, a number of operations can be saved. Dividing the features into pieces also enables parallel computation and speeds up processing, since each piece of features is processed independently, without any dependency.

The main objective is to classify the features based on relevance to the class and redundancy. The computed symmetric uncertainty values are sorted, and the lowest scoring features are removed until a k-feature subset remains. A sketch of this per-piece elimination appears below.
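Putting equation (6) and the per-piece rule together, a minimal sketch (our own reconstruction of the idea in [1], reusing symmetric_uncertainty; the stopping rule is an assumption) looks like this:

import numpy as np

def fcbfip(X, y, k, P=4):
    # Split the surviving feature indices into P pieces and drop the
    # lowest scoring feature of each piece per pass (equation 6: each
    # piece holds about N/P features). Pieces are independent, so the
    # inner loop could run in parallel. Stops before a full pass would
    # undershoot the requested subset size k.
    remaining = list(range(X.shape[1]))
    while len(remaining) - P >= k:
        survivors = []
        for piece in np.array_split(remaining, P):
            scores = [symmetric_uncertainty(X[:, j], y) for j in piece]
            weakest = piece[int(np.argmin(scores))]
            survivors.extend(int(j) for j in piece if j != weakest)
        remaining = survivors
    return remaining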

IV. EXPERIMENTAL RESULT

The Python libraries Sklearn and Numpy are used for the implementation. Sklearn is a machine learning library for data analysis; Numpy supports high-level mathematical functions.

Table 1 and Table 2 compare the FCBF, FCBF# and FCBFiP algorithms based on best score and elapsed time. The best score value indicates the correlation between two features; the elapsed time indicates the total time taken by each algorithm to select the best feature subset.

Table 1. Fast correlation based feature selection algorithms: comparison with decision tree classifier

             | FCBF   | FCBF#  | 2 pieces | 4 pieces | 8 pieces | 16 pieces | 32 pieces
Best score   | 0.8202 | 0.8241 | 0.8252   | 0.7668   | 0.8213   | 0.81357   | 0.8152
Elapsed time | 3.9379 | 3.8129 | 5.5939   | 2.9219   | 1.7190   | 0.9539    | 0.4690

Table 2. Fast correlation based feature selection algorithms: comparison with logistic regression classifier

             | FCBF   | FCBF#  | 2 pieces | 4 pieces | 8 pieces | 16 pieces | 32 pieces
Best score   | 0.9031 | 0.9031 | 0.8759   | 0.8942   | 0.8842   | 0.9042    | 0.9037
Elapsed time | 3.733  | 3.6729 | 6.5789   | 2.969    | 1.5309   | 0.7970    | 0.4840

(The "pieces" columns report FCBFiP, i.e. FCBF in the stated number of pieces.)
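For reference, a harness in the spirit of Tables 1 and 2 could look like the following. This is entirely our own scaffolding: the paper's dataset is not specified here, so a synthetic discretized one stands in, cross-validated accuracy stands in for the best score, and fcbfip is the sketch from the previous section.

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset, discretized so the SU-based selectors apply.
X, y = make_classification(n_samples=1000, n_features=64,
                           n_informative=16, random_state=0)
X = (X > 0).astype(int)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    for P in (2, 4, 8, 16, 32):
        start = time.perf_counter()
        subset = fcbfip(X, y, k=16, P=P)
        elapsed = time.perf_counter() - start
        score = cross_val_score(clf, X[:, subset], y, cv=5).mean()
        print(f"{name}, P={P}: score={score:.4f}, time={elapsed:.4f}s")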

CONCLUSION

Feature selection is a definite solution to the curse of dimensionality problem. This paper revisited three feature selection algorithms, namely Fast Correlation Based Feature Selection (FCBF), a variation of FCBF called Fast Correlation Based Feature Selection # (FCBF#), and Fast Correlation Based Feature Selection in Pieces (FCBFiP). The performance of the three algorithms was compared, using best score and elapsed time as the metrics. The FCBF and FCBF# algorithms are simple and effective for high-dimensional datasets, but they consume excess time due to intensive training. The FCBFiP algorithm reduces the elapsed time by dividing the feature set into pieces: the larger the number of pieces, the smaller the elapsed time, without sacrificing the best score value.

REFERENCES
[1] Santiago Egea, Albert Rego Manez, Belen Carro, Antonio Sanchez-Esguevillas, and Jaime Lloret, "Intelligent IoT Traffic Classification Using Novel Search Strategy for Fast-Based-Correlation Feature Selection in Industrial Environments," IEEE Internet of Things Journal, vol. 5, no. 3, June 2018.
[2] Yu, Lei, and Huan Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003.
[3] Senliol, Baris, et al., "Fast Correlation Based Filter (FCBF) with a different search strategy," 23rd International Symposium on Computer and Information Sciences (ISCIS'08), IEEE, 2008.
[4] Hancer, Emrah, Bing Xue, and Mengjie Zhang, "Differential evolution for filter feature selection based on information theory and feature ranking," Knowledge-Based Systems 140 (2018): 103-119.
[5] Hall, Mark A., "Correlation-based feature selection of discrete and numeric class machine learning," (2000).
[6] Liu, Huan, and Lei Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering 17.4 (2005): 491-502.
[7] Yang, Yiming, and Jan O. Pedersen, "A comparative study on feature selection in text categorization," ICML, vol. 97, 1997.
[8] Peng, Hanchuan, Fuhui Long, and Chris Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence 27.8 (2005): 1226-1238.
[9] Jacob, Shomona, and Geetha Raju, "Software defect prediction in large space systems through hybrid feature selection and classification," Int. Arab J. Inf. Technol. 14.2 (2017): 208-214.
[10] Mao, Kezhi Z., "Orthogonal forward selection and backward elimination algorithms for feature subset selection," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34.1 (2004): 629-634.

