Correlation Based Feature Selection
Abstract— Feature selection is an effective strategy to reduce dimensionality, remove irrelevant data and increase learning accuracy. The curse of dimensionality poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this paper, we use three feature selection algorithms, namely Fast Correlation Based Feature Selection (FCBF), a variation of FCBF called Fast Correlation Based Feature Selection # (FCBF#), and Fast Correlation Based Feature Selection in Pieces (FCBFiP). The three algorithms are compared, and experimental results show that FCBFiP is more efficient than FCBF and FCBF#.

Keywords— Fast Correlation Based Feature Selection (FCBF), FCBF#, FCBFiP, filter methods, wrapper methods
I. INTRODUCTION

Today's revolutionary paradigms include IoT, Big Data, Social Networks and Machine Learning. These paradigms process voluminous amounts of data every day. IoT devices sense and handle heterogeneous, high dimensional data; these data are unstructured, noisy and difficult to manage. Industries also process voluminous amounts of data every day, depending upon customers' needs. The main challenges faced by industry are to satisfy customer expectations, support the performance of customer applications, reduce storage costs, handle high dimensional complex tasks, and so on. A fundamental requirement for handling high dimensional data with ease is to solve the curse of dimensionality problem.
The curse of dimensionality is addressed through feature extraction and feature selection methodologies. Feature extraction derives apt features from the original dataset effectively, but is not suitable for all learning problems. Feature selection, as an alternative to feature extraction, is often used as a preprocessing step in machine learning. Feature selection is the process of identifying irrelevant features, removing redundant features present in the original dataset, and eliminating noisy and unreliable data.
Machine learning algorithms have a huge impact in IoT environments. Machine learning provides a predictive model that predicts the response to a problem by employing knowledge previously collected in a dataset. Feature selection consists of two important stages: first, select the relevant features from the class (basis) in the original dataset; second, identify redundant features and remove them. The benefits of performing feature selection [1] are that it prevents the model from overfitting, reduces storage requirements, increases accuracy, reduces computational cost, improves the interpretability of the predictive model, reduces computing resources and mitigates the curse of dimensionality.

Feature selection approaches are grouped into three methods: filter methods, wrapper methods and embedded methods [1, 2]. Filter methods preprocess the data; they consider the relationship between features to calculate and predict the target feature. Various statistical tests are applied to the features to detect the highest ranked features, as sketched below.
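For illustration, the following is a minimal sketch of such a filter-style ranking using symmetric uncertainty on discrete toy data; the helper functions and the dataset are assumptions made here for illustration, not the paper's code.

import numpy as np

def entropy(x):
    # Shannon entropy of a discrete variable, in bits
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)),
    # with IG(X | Y) = H(X) + H(Y) - H(X, Y)
    hx, hy = entropy(x), entropy(y)
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])  # paired symbols
    ig = hx + hy - entropy(joint)
    return 0.0 if hx + hy == 0 else 2.0 * ig / (hx + hy)

# Rank features by their correlation with the class label.
X = np.random.randint(0, 3, size=(100, 5))   # toy discrete dataset
y = np.random.randint(0, 2, size=100)        # toy class labels
scores = [symmetric_uncertainty(X[:, j], y) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]           # highest ranked first
print(ranking, np.round(scores, 3))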
Wrapper methods evaluate subsets of features by their predictive accuracy, using statistical resampling or cross validation. The model is trained on subsets of features chosen iteratively. This method is slower because of the extra training and cross validation involved. The embedded method is a hybrid method that combines the filter method and the wrapper method.
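The following is a minimal sketch of a wrapper-style selector, assuming scikit-learn's RFECV with a logistic-regression estimator on toy data; it illustrates the approach in general, not the paper's implementation.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Toy data: 20 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Wrapper selection: the estimator is retrained on candidate
# subsets and scored by cross validation at every step.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
print("selected features:", selector.support_.nonzero()[0])

The repeated training inside the cross-validation loop is exactly what makes wrapper methods slower than filter methods, as noted above.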
II. LITERATURE SURVEY

The authors in [4] use Relief, Fisher score and mutual information to identify the highest ranked features in the class. Two Differential Evolution (DE) based filter approaches are used to handle single objective and multi objective problem designs. A Mutual Information Feature Selection (MIFS) approach based on minimum redundancy and maximum relevance is applied to single objective and multi objective differential evolution for feature selection. These algorithms provide better performance for single and multiple objectives.

The filter algorithm of [5] is exploited for discrete classification problems. The fast correlation based filter method is applied to continuous and discrete problems. Features are selected using the Relief algorithm to reduce dimensionality. Correlation based feature selection (CFS) is a heuristic technique for evaluating feature subsets. CFS can be applied to discrete and continuous features: if the feature is discrete, symmetric uncertainty can be used; if the feature is continuous, linear correlation (Pearson's correlation) can be used.
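For reference, these two standard measures can be written as follows (textbook definitions, not reproduced from this paper's earlier pages):

\[
SU(X, Y) = \frac{2\,IG(X \mid Y)}{H(X) + H(Y)}, \qquad
IG(X \mid Y) = H(X) - H(X \mid Y)
\]

\[
r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
\]

Symmetric uncertainty normalizes information gain to the range [0, 1], compensating for its bias toward features with more values; FCBF and its variants use SU to score both feature-class relevance and feature-feature redundancy.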
FCBF# enables efficient selection of a feature subset of any given size. The additional validations in FCBF# lead to better selection of highly correlated feature subsets. Its main drawback is that FCBF# consumes more time to process the dataset due to the excess training.
III. FAST CORRELATION BASED FEATURE SELECTION IN PIECES (FCBFiP) ALGORITHM
FCBFiP is a new modification of FCBF# [1]. The number of features contained in the original dataset is divided into P pieces, and the least scoring feature in each piece is removed at each elimination step.

The previous version contains two main steps. The first evaluates the relevance of the features to the target class based on the correlation between the features, and arranges the features in descending order. The second identifies redundancy among the set, iteratively eliminating one feature from the original dataset while maintaining the best subset list.

In FCBFiP, the size of each piece of the dataset is defined as

    Piece size = N / P        (6)

where N is the number of features in the original dataset and P is the number of pieces. In FCBFiP, if the value of P is small, time consumption is high; if the value of P is large, a number of operations can be saved. Dividing the features into pieces establishes parallel computation and speeds up processing time, since each piece of features works independently, without any dependency.

The main objective is to classify the features based on relevance to the class and redundancy. The computed symmetric uncertainty values are arranged in ascending order, and the least scoring features are removed until k subset features remain.
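The following is a minimal sketch of this piecewise elimination idea under the assumptions above (precomputed relevance scores, one worst feature dropped per piece per step); it omits the redundancy analysis of the full algorithm, and the function name is hypothetical.

import numpy as np

def fcbfip_eliminate(scores, P, k):
    # Split the feature indices into P pieces and drop the worst
    # scoring feature of each piece per step, until k features remain.
    remaining = list(np.argsort(scores)[::-1])   # relevance order
    while len(remaining) > k:
        pieces = np.array_split(np.array(remaining), P)
        survivors = []
        for piece in pieces:
            if len(piece) == 0:
                continue
            # drop the least scoring feature of this piece
            worst = piece[np.argmin([scores[i] for i in piece])]
            survivors.extend(i for i in piece if i != worst)
        if len(survivors) < k:        # do not overshoot the target size
            break
        remaining = survivors
    return remaining[:k]

# toy usage: 12 features, keep the best 4 using 3 pieces
scores = np.random.rand(12)
print(fcbfip_eliminate(scores, P=3, k=4))

Because each piece is processed independently, the per-piece loop is the part that can be run in parallel, which is where the speedup over FCBF# comes from.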
IV. EXPERIMENTAL RESULT

The Python libraries Sklearn and Numpy are used for the implementation. Sklearn is a machine learning library for data analysis; Numpy supports high level mathematical functions.

Table 1 and Table 2 compare the FCBF, FCBF# and FCBFiP algorithms based on best score and elapsed time. The best score value indicates the correlation between two features. The elapsed time indicates the total time taken by each algorithm to select the best feature subset.
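As a rough illustration of how such elapsed-time figures can be collected, the following sketch wraps any selector as a function and measures wall-clock time; dummy_selector is a placeholder standing in for FCBF, FCBF# or FCBFiP, not one of the paper's implementations.

import time
import numpy as np

def time_selector(select, X, y):
    # Measure the wall-clock time taken to pick a feature subset.
    start = time.perf_counter()
    subset = select(X, y)
    return subset, time.perf_counter() - start

def dummy_selector(X, y):
    # placeholder: pretend to select the first five features
    return list(range(min(5, X.shape[1])))

X = np.random.rand(1000, 50)
y = np.random.randint(0, 2, size=1000)
subset, elapsed = time_selector(dummy_selector, X, y)
print(f"selected {len(subset)} features in {elapsed:.4f} s")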
REFERENCES
[1]. Santiago Egea, Albert Rego Manez, Belen Carro, Antonio Sanchez-
Esguevillas, and Jaime Lloret. "Intelligent IoT Traffic Classification
Using Novel Search Strategy for Fast-Based-Correlation Feature
Selection in Industrial Environments." IEEE Internet of Things Journal,
vol. 5, no. 3, June 2018.
[2]. Yu, Lei, and Huan Liu. "Feature selection for high-dimensional
data: A fast correlation-based filter solution." Proceedings of the 20th
international conference on machine learning (ICML-03). 2003.
[3]. Senliol, Baris, et al. "Fast Correlation Based Filter (FCBF) with a
different search strategy." Computer and Information Sciences, 2008.
ISCIS'08. 23rd International Symposium on. IEEE, 2008.
[4]. Hancer, Emrah, Bing Xue, and Mengjie Zhang. "Differential
evolution for filter feature selection based on information theory and
feature ranking." Knowledge-Based Systems 140 (2018): 103-119.
[5]. Hall, Mark A. "Correlation-based feature selection of discrete and
numeric class machine learning." (2000).
[6]. Liu, Huan, and Lei Yu. "Toward integrating feature selection
algorithms for classification and clustering." IEEE Transactions on
knowledge and data engineering 17.4 (2005): 491-502.
[7]. Yang, Yiming, and Jan O. Pedersen. "A comparative study on
feature selection in text categorization." ICML. Vol. 97. 1997.
[8]. Peng, Hanchuan, Fuhui Long, and Chris Ding. "Feature selection
based on mutual information criteria of max-dependency, max-
relevance, and min-redundancy." IEEE Transactions on pattern analysis
and machine intelligence 27.8 (2005): 1226-1238.
[9]. Jacob, Shomona, and Geetha Raju. "Software defect prediction in
large space systems through hybrid feature selection and
classification." Int. Arab J. Inf. Technol. 14.2 (2017): 208-214.
[10]. Mao, Kezhi Z. "Orthogonal forward selection and backward
elimination algorithms for feature subset selection." IEEE Transactions
on Systems, Man, and Cybernetics, Part B (Cybernetics) 34.1 (2004):
629-634.