Feature Selection and Its Use in Big Data
ABSTRACT Feature selection has been an important research area in data mining; it chooses a subset of relevant features for use in model building. This paper aims to provide an overview of feature selection methods for big data mining. First, it discusses the current challenges and difficulties faced when mining valuable information from big data. A comprehensive review of existing feature selection methods in big data is then presented. Herein, we approach the review from two aspects: methods specific to particular kinds of big data with certain characteristics, and applications of these methods in classification analysis. This approach distinguishes our review significantly from existing review work. This paper also highlights the current issues of feature selection in big data and suggests future research directions.
feature set [13]. Random or noisy features often prevent a classifier from learning correct correlations, and redundant or correlated features increase the complexity of a classifier without adding any useful information to it [14]. A variety of feature selection methods, such as filter, wrapper, and embedded approaches [13], [15], have been developed.

As mentioned above, scalability is a major issue in big data processing systems. Enormous redundancy and irrelevance account for much of it, not only consuming computing resources but also degrading processing performance. If this useless information can be removed while valuable clues are retained, the dimension of big data will be greatly lowered; consequently, both the computational efficiency and the processing performance of big data will improve. Studying feature selection approaches for big data, so as to obtain a feature subset with superior separability, is therefore of considerable necessity.

Recently, some researchers have applied these methods to high-dimensionality domains, such as DNA microarray analysis [16]–[19], text classification [20]–[23], information retrieval [24]–[26], and web mining [27]–[29]. Online feature selection methods have also been applied to streaming data [30], and valuable information has been extracted from noisy data [31], albeit on a small scale and with a huge dimension [32], [33].
B. CHALLENGES OF FEATURE SELECTION
Compared to traditional data, some influential points need to be highlighted when extracting valuable information from big data. Taking the 3V characteristics into consideration, traditional feature selection methods face the following threefold challenges in the case of big data: (1) traditional feature selection methods usually require large amounts of learning time, so it is hard for their processing speed to catch up with the change of big data; (2) big data generally not only include an immense amount of irrelevant and/or redundant features, but also contain noises of different degrees and types, which greatly increases the difficulty of selecting features; (3) some data are unreliable or forged, owing to different means of acquisition or even to loss, which further increases the complexity of feature selection.

Due to the properties of big data, existing feature selection methods face demanding challenges in a variety of phases, e.g., the speed of data processing, tracing concept drifts, and dealing with incomplete and/or noisy data. Thus, studying pertinent feature selection methods for big data is of considerable urgency. However, the available methods are extremely specific, and how to extract valuable information from big data by tackling and analyzing them is still an open issue.

Apart from our review, Bolón-Canedo et al. [12] presented a review of feature selection in the context of big data, which mainly describes the available feature selection methods classified by practical applications and future needs [12]. Unlike their work, we aim to review and compare the studies to date regarding the threefold challenges mentioned above, with an analysis of possible challenges and trends in future research. Additionally, we discuss the applications of feature selection methods to several specific kinds of data and to classification analysis.

The structure of our paper is as follows: Section II looks back at feature selection methods for traditional data. Next, the available feature selection methods and the difficulties of processing big data are analyzed in Section IV. Section VI summarizes the paper and provides several promising directions for further research.

II. BASIC FEATURE SELECTION FRAMEWORK
Feature selection, also known as variable selection, attribute selection or variable subset selection, is a data mining technique that targets selecting, from the whole feature set, an optimal subset of features that renders the best performance in terms of well-defined criteria. Here, a feature refers to an attribute of the data that represents the data in a certain aspect. Since feature selection performs well in simplifying the model, shortening training times, and reducing the variance of the model, researchers can interpret and understand the pattern of the data model more easily by using feature selection. Yu et al. [34] pointed out that a good feature selection method should be capable of selecting features with a high degree of correlation to the class attribute and of achieving optimal classification results.

A. FEATURE SELECTION FRAMEWORK
A feature selection method can be divided into two parts, i.e., a feature subset selection technique, which accounts for how to select features from the original entire set, and a feature subset evaluation technique, which determines how to evaluate the feature subsets [14], [35]. The process of feature selection is shown in Algorithm 1 and Fig. 1.

Algorithm 1 The process of feature selection
1: input the original dataset, X;
2: while the termination condition is not met do
3: generate the feature subset, F, by searching strategies;
4: evaluate the feature subset, F, by evaluation criteria;
5: end while
6: return F;

In Algorithm 1, the feature subset can be generated by searching strategies (Line 3), such as random search, the stepwise addition or deletion of features, and heuristic search methods. After a feature subset F has been obtained, its performance must be assessed (Line 4). Figure 1 depicts feature selection as a kind of learning method which aims to find the appropriate variable subset for users.

FIGURE 1. The flowchart of feature selection.
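To make the loop of Algorithm 1 concrete, the following minimal Python sketch pairs one possible search strategy (randomly generated candidate subsets) with one possible evaluation criterion (cross-validated accuracy of a k-nearest-neighbor classifier). The dataset, classifier, and iteration budget are illustrative assumptions, not choices prescribed by the framework.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def feature_selection(X, y, n_iter=50, seed=0):
    """Algorithm 1: repeatedly generate and evaluate feature subsets."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    best_subset, best_score = None, -np.inf
    for _ in range(n_iter):  # termination condition: a fixed iteration budget
        # Line 3: generate a candidate subset F by a (random) search strategy.
        mask = rng.random(n_features) < 0.5
        if not mask.any():
            continue
        # Line 4: evaluate F by an evaluation criterion (here, CV accuracy).
        score = cross_val_score(KNeighborsClassifier(),
                                X[:, mask], y, cv=5).mean()
        if score > best_score:
            best_subset, best_score = mask, score
    return best_subset, best_score  # Line 6: return F

X, y = load_breast_cancer(return_X_y=True)
subset, acc = feature_selection(X, y)
print(f"selected {subset.sum()} of {X.shape[1]} features, CV accuracy {acc:.3f}")
```

Swapping the random generator for a heuristic search, or the cross-validation criterion for a classifier-free measure, reproduces the other families discussed below.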
1) BENEFITS FROM FEATURE SELECTION
The basic idea of using feature selection is to obtain a new dataset with neither redundant features nor irrelevant features,
as well as containing the original data pattern and losing no useful information from the original dataset. Nowadays, feature selection methods are widely employed for their capability of dimension reduction, for instance in the fields of written-text and DNA-microarray analysis, and show their advantages when the number of features is large while the number of samples is small. Compared with feature extraction, feature selection aims to find the features that describe the original dataset precisely and briefly, whereas the latter aims to create new features based on the original dataset. It should be noted that some relevant features may be redundant, since there might be other features that are strongly correlated with them [36]–[38].

Performing feature selection on a data set has at least the following three advantages: (1) the selected features can be employed to build a brief model describing the original data and are thus beneficial for improving the performance criteria; (2) the selected features can reflect the core characteristics of the original data and are thus helpful for tracing concept drifts of the data expression with good robustness; and (3) the chosen features can help the decision-maker pick valuable information out of a large amount of noisy data [38], [39].

B. TAXONOMIES AND COMPARISONS
Commonly, feature selection techniques are classified into filter, wrapper, and embedded approaches according to the means of combining a classifier and a machine learning approach when selecting a feature subset [38], [40]–[42]; independently of that, they may be supervised or unsupervised [37], [43], [44]. The noticeable difference between the latter two kinds of methods lies in whether or not the class labels are available: a supervised method uses the available label information to evaluate the significance of features and provides rankings of these features, while an unsupervised method seeks hidden structures in unlabeled data and constructs a feature selector by means of the intrinsic properties of the data [45].

1) FILTER TECHNIQUES
The filter feature selection method is an algorithm that selects features without evaluating a performance metric of the classifier's model on the selected features [50]. It assumes the data are completely independent of classification algorithms and forms the feature subset according to the importance of a feature, measured by its contribution to the class attributes. The performance metric on the output of the classification algorithm is not employed to assess a feature subset; the measurement works only with the data distribution [39], [51], [52].

There are a variety of filter measures, which are classified according to the way of combining the features and the class attributes, such as distance-based measures, probability-based measures, mutual information-based measures, consistency measures, and neighborhood graph-based measures [53]. Therefore, the key to a filter approach lies in defining and exploring the relevance between each feature and the class attributes. Accordingly, the measurement of strongly relevant, weakly relevant and irrelevant features has been presented from different aspects by researchers [30], [52], [54]. For example, Wu et al. [30] defined relevance based on the exclusion of conditional independence, whereas Kira and Rendell [55] described the RELIEF algorithm to estimate the weights of the features. Representative filter methods are RELIEF [55], FOCUS [56], and MIFS [54].

The benefits of filter methods are that they are independent of a learning process, have good robustness to the concept drift of the data expression [53], [57], [58], and are time-effective because of their low computational complexity. However, they have the following drawbacks: they rely greatly on the stopping criteria (a threshold for determining when to stop these methods) and on the mechanisms for calculating the importance of a feature [59]. Besides, the strategy for seeking features is an influential factor in filter-based feature subset evaluation methods.

Although the selection process of filters relies little on the classification algorithms, the best filter measure is likely to be classifier specific, since different classifiers perform differently when combined with the same filter [35]. Recently, Freeman et al. [53] compared 16 commonly used filters and combined them with two classifiers, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM), on 40 datasets. Their empirical results in terms of classification accuracy give an indication of which filter measures may be appropriate for use with different classifiers.
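As a minimal illustration of the filter idea just described, the sketch below scores each feature by its estimated mutual information with the class attribute and keeps the top k, without ever training or consulting a classifier. The scoring measure and the cutoff k are our own illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter step: score each feature by its relevance to the class attribute.
# No classifier is trained; the measure depends only on the data distribution.
scores = mutual_info_classif(X, y, random_state=0)

k = 10  # stopping criterion: an a-priori threshold on the number of features
top_k = np.argsort(scores)[::-1][:k]
X_reduced = X[:, top_k]
print("top features by mutual information:", sorted(top_k.tolist()))
```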
2) WRAPPER TECHNIQUES
FIGURE 2. The wrapper procedure.

From Figure 2, an initial state, a termination condition, and a search engine are required during the process of a wrapper method. In addition, as wrappers are associated with the learning algorithm, the combination of features, the criterion for evaluating the performance, and the type of classifier are crucial factors that influence the classification results [63]. As with filters, the type of classifier has a considerable impact on the performance of wrapper-based feature selection methods, according to the research conducted by Shanab et al. [62].

3) EMBEDDED TECHNIQUES
Embedded methods incorporate the learning process of a classifier into feature selection [64] and search for an optimal feature subset as part of model training.

4) COMPARISONS
The above studies have resulted in many feature selection methods, most of which, however, aim only at specific backgrounds. In addition, comparing the performances of these methods is not easy. Herein, we provide a brief comparison of filter, wrapper, and embedded methods in Table 1.

Bolón et al. [13] compared some frequently used filter methods mentioned above in terms of whether they are univariate or multivariate and in terms of their computing cost. According to them, univariate methods are fast and scalable but ignore feature dependencies, while multivariate filters model feature dependencies at the cost of being slower and less scalable than univariate techniques [13].

Table 2 presents a comparison of these methods (where n is the number of samples and m is the number of features).

TABLE 2. Comparison of commonly used feature selection methods according to Bolón et al. [13].

III. VARIANTS AND EXTENSIONS OF FEATURE SELECTION
A. HYBRID AND ENSEMBLE METHODS
By combining the advantages of the above methods, various hybrid feature selection methods have been developed [16], [74]–[78], including the combination of two filter methods, and that of one filter strategy with one wrapper strategy. These hybrid methods can take advantage of the complementary strengths of the strategies they combine. Improved methods for specific problems have also been proposed, such as an improved Fisher score algorithm [87] and enhanced bare-bones particle swarm optimization (BPSO) [88]. Moreover, for some specific problems, such as unreliable data [89], [90], incomplete data [91]–[94], text data [95], and costly data [96], researchers have proposed corresponding feature selection methods. This kind of method usually concentrates on improving the performance of the search strategy for the optimal subset.
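One plausible rendering of the filter-plus-wrapper hybrid named above: a fast mutual-information filter first discards clearly irrelevant features, and a greedy forward wrapper then refines the survivors with a cross-validated classifier. The specific filter, classifier, and budgets are assumptions of this sketch, not a published method.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Stage 1 (filter): keep the 15 features most relevant to the class.
keep = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:15]

# Stage 2 (wrapper): greedy forward selection guided by CV accuracy.
clf = LogisticRegression(max_iter=5000)
selected, best = [], -np.inf
improved = True
while improved:
    improved = False
    for f in keep:                       # all trials share the same base set
        if f in selected:
            continue
        score = cross_val_score(clf, X[:, selected + [int(f)]], y, cv=5).mean()
        if score > best:
            best, best_f = score, int(f)
            improved = True
    if improved:
        selected.append(best_f)          # add the single best improving feature
print("hybrid selection:", sorted(selected), "CV accuracy:", round(best, 3))
```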
Similarly, differential evolution (DE) is a population-based optimization method with strong global search capability. In recent years, it has been used for feature selection with interesting classification results [38], [114], [115]. Additionally, two of our previous works on multi-objective feature selection are based on DE [96], [116]. As these two techniques are focused on different backgrounds, it is hard to compare and discuss which of the two is better. This is also a problem when comparing most of the available feature selection measures.

Huang [117] proposed a classification method using ant colony optimization (ACO), which optimizes both the feature subset and the parameters of an SVM. Su and Lin [57] incorporated an electromagnetism-like mechanism into a wrapper method. Lin et al. [118] integrated simulated annealing with SVM for feature selection.
C. OPTIMALITY MODELS
Note that, before selecting features, researchers should formulate the problem of feature selection as an optimization model. Some researchers have considered the problem as a single-objective model, e.g., maximizing the accuracy in classification [114], [115]. A feature selection problem, however, generally includes several conflicting objectives, e.g., the number of features, the performance in classification, and/or the reliability of features [89], [96]. Formulating feature selection as a multi-objective problem is the premise of obtaining a series of non-dominated feature subsets, which is beneficial for meeting various requirements in real-world applications. Herein, we introduce two examples. One is the previous results obtained by our experiments on the dataset 'Sonar' for the two-objective feature selection problem whose objectives are the number of selected features and the classification accuracy, shown in Fig. 4.

FIGURE 4. An example of a two-objective feature selection problem.

Fig. 4 shows a trend that, if more features are selected, a higher accuracy can be obtained. However, the maximal size of the variable subset is still far below the number of features in the original dataset. This means that the redundant and irrelevant features are removed from the original dataset and that our feature selection approach is able to work well. However, such a set of results can be viewed simply as a reference; for a multi-objective feature selection problem, balancing each objective and looking for the most suitable solution are desirable.

The other example is the mathematical model of a wrapper method for unreliable data [89]. Since the sample data are unreliable, the reliability degree (RD) in Eq. (1), not merely the classification accuracy (CA) in Eq. (2), is taken into account when evaluating a feature subset. Herein, if a feature i is selected, then x_i is set to 1; otherwise, x_i is set to 0. e_i is a value within [0, 1] that, without loss of generality, represents the reliability of feature i; the bigger the value, the higher the RD of this feature. For measuring CA, we adopt the one-nearest-neighbor classifier, where the testing dataset contains only one instance each time, to evaluate the classification accuracy of a solution (a feature subset). If the constructed classifier can predict the class of the testing datum x, then S_i is set to 1; otherwise, S_i is set to 0.

$$f_1(x) = \frac{\sum_{i=1}^{N} x_i \cdot e_i}{\sum_{i=1}^{N} x_i} \quad (1)$$

$$f_2(x) = \frac{1}{K} \sum_{i=1}^{K} S_i(x) \quad (2)$$
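The two objectives can be computed directly from Eqs. (1) and (2). In the sketch below, f1 is the average reliability of the selected features and f2 is the leave-one-out accuracy of a one-nearest-neighbor classifier restricted to those features; the synthetic data and the reliability values e_i are placeholders for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
N, n_samples = 20, 100                     # N features, illustrative data
X = rng.normal(size=(n_samples, N))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # class depends on two features
e = rng.uniform(size=N)                    # e_i: reliability of feature i in [0, 1]

def f1(x):
    """Eq. (1): average reliability degree of the selected features."""
    return np.sum(x * e) / np.sum(x)

def f2(x):
    """Eq. (2): 1-NN accuracy, with one test instance held out at a time."""
    sel = np.flatnonzero(x)
    K = len(X)
    S = 0
    for i in range(K):                     # each test set holds one instance
        train = np.delete(np.arange(K), i)
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(X[np.ix_(train, sel)], y[train])
        S += int(knn.predict(X[i, sel].reshape(1, -1))[0] == y[i])  # S_i
    return S / K

x = np.zeros(N, dtype=int)
x[[0, 1, 5]] = 1                           # a candidate feature subset
print(f"f1 (reliability) = {f1(x):.3f}, f2 (accuracy) = {f2(x):.3f}")
```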
D. LIMITATIONS OF TRADITIONAL FEATURE SELECTION METHODS FOR BIG DATA
Since practical data may include noises of different degrees, types and formats, or values that are approximately zero, which makes determining the degree of correlation between features difficult, feature selection is still an unsolved problem. In the case of big data, as vast, ever-growing data emerge while the existing measurements are inadequate, there is a growing need for efficient feature selection methods for big data.

Due to the three characteristics of big data mentioned above, different analytical modes must be considered for different application requirements [8].

Feature selection methods for traditional data are essentially offline methods. However, as big data contain an immense amount of irrelevant and/or redundant features in addition to their large volume, how to decrease the computational cost without deteriorating the classification accuracy is an urgent issue. Additionally, with respect to the wide variety of big data, efficient feature selection methods are required to extract valuable information from data of small size and various formats or types. Moreover, for dynamic data, traditional feature selection methods have difficulty tracking the changes of the data, and, since no complete knowledge is available in advance, constructing the classifier's model is difficult using traditional methods. Finally, on account of the limited precision of equipment or of environmental disturbances, dealing with the severe lack or unreliability of some attributes' values in big data needs more attempts.
IV. FEATURE SELECTION METHODS FOR BIG DATA
The complex characteristics of the data make it difficult to obtain a common feature selection method for big data; a method specific to a background is more feasible. Accordingly, in this section, we review the available feature selection methods for big data according to the particular types of data they handle and their applications in analysis. The first part covers static big data, dynamic data, missing data, heterogeneous data, unreliable data and imbalanced data, while the latter part consists of applications in text analysis. In addition, after looking back at the available feature selection methods for each kind of data, we also describe what can be done in terms of further research.
A. SPECIFIC TO SEVERAL PARTICULAR KINDS OF DATA
1) STATIC BIG DATA
The progress of science and technology contributes to a world full of information, and data are the clue to that information. Common characteristics, or even regularities, drawn from historical data may facilitate policy-making. For example, taking rainfall data or other meteorological information for an area during the past few decades, the month of this year in which heavy rain is most likely to occur can be inferred. Therefore, we can plan an outside activity more reasonably, or even take protective measures to avoid flooding. Clinical data have to be well preserved for long-term research in pathology. Moreover, the symptoms determine the doctor's diagnosis and the subsequent treatment. As a consequence, the relevance between symptoms and diagnoses has to be learnt so that unnecessary physical examinations can be avoided.

For static big data, due to the large scale or high dimension, the aim is to look for the inner pattern or construction of the data, followed by extracting useful information which will be subject to further use, for example prediction. Consequently, feature selection methods work like a pre-processor for finding valuable information in big data. Herein, we discuss methods from two aspects: large-scale data with a high dimension, and data with a small sample but a high dimension.

a: LARGE-SCALE DATA WITH A HIGH DIMENSION
As we have discussed in Section II-B.1, a series of measurements can be used to estimate the relevance between the features and the class attributes. For large-scale data, mRMR (max-relevancy and min-redundancy) is an efficient tool that can search for a set of features in which the relevance between each feature and the class is maximized (max-relevancy) while the pairwise information between the features in the set is minimized (min-redundancy) [119]–[122]. This is one of the mutual information-based measures, which were developed to cope with computational complexity, since pairwise comparisons for calculating the correlations between features must be conducted [34], [45], [93], [123], [124]. To further improve the performance of mRMR, Wang et al. [45] proposed an unsupervised feature selection method for dimensionality reduction.

MPMR: This measure provides a new criterion for unsupervised feature selection, called maximal projection and minimal redundancy, which is formulated with the use of a projection matrix.

mr²PSO: Unler et al. [123] presented a relevance and redundancy criterion based on mutual information. The basic idea of the proposed criterion is to maximize the prediction accuracy of the selected feature subset; it differs from mRMR in how it determines the information property of a feature subset. That means the relevance and redundancy mutual information acts only as an intermediate measure in the PSO algorithm, to improve the speed and performance of the search.
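The mRMR criterion can be approximated by a simple greedy loop: repeatedly add the feature whose relevance to the class, minus its average redundancy with the features already chosen, is largest. The sketch below estimates both terms with scikit-learn's mutual-information estimators; it is a schematic rendering of the criterion, not a reimplementation of any of the cited methods.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[1]

relevance = mutual_info_classif(X, y, random_state=0)   # I(feature; class)

def redundancy(i, j):
    """Estimated mutual information between two continuous features."""
    return mutual_info_regression(X[:, [i]], X[:, j], random_state=0)[0]

selected = [int(np.argmax(relevance))]     # start from the most relevant feature
while len(selected) < 8:                   # illustrative subset size
    candidates = [i for i in range(n) if i not in selected]
    # max-relevance minus mean pairwise redundancy with the chosen set
    scores = [relevance[i] - np.mean([redundancy(i, j) for j in selected])
              for i in candidates]
    selected.append(candidates[int(np.argmax(scores))])
print("mRMR-style selection:", sorted(selected))
```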
b: SMALL-SAMPLE DATA WITH A HIGH DIMENSION
For big data with a number of dimensions much bigger than the number of data, the dimension becomes a major barrier to developing a predictive model with high precision or to improving the efficiency of a feature selection method. In view of this, many scholars have attempted to design methods targeting data with a high dimension. He et al. [124] proposed a feature selection method based on mRMR, called MINT, which performs feature selection using both the training data and the unlabeled test data.

Apart from the mutual information-based measures, distance-based measures also facilitate the process of feature selection for small-scale, high-dimensional data. For example, Vijay et al. [125] presented an embedded technique by incorporating sparsity into a classifier, and Fang et al. [126] proposed an unsupervised feature selection method based on localities and similarities.

The aforementioned methods can effectively handle the problem of feature selection for high-dimensional data. However, on the one hand, they require information on all the features of all the data before selecting features, which is impractical for big data. On the other hand, filter and embedded techniques achieve their small computing consumption at the expense of a high degree of accuracy. How to improve accuracy with a small computing cost needs more attention in the future.

2) DYNAMIC DATA
Online feature selection methods belong to the stream mode; they reevaluate the existing features based on each newly received datum. The key challenge of online feature selection is how to make accurate predictions for an instance using a small number of active features [127]. A number of studies have been proposed to deal with dynamic data by means of feature selection [128]–[130].

SFS: Zhou [131] proposed streamwise feature selection, which evaluates each feature only once, when it is generated. The benefits of streamwise methods are that features are generated dynamically and that overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In their work, the candidate feature set is regarded as a dynamically generated stream, while knowledge of the structure of the feature space is required beforehand to heuristically control the choice of candidate features, which is often infeasible for real-world applications.

OFS: Wang et al. [127] assume that the dimension of the data is fixed and that the pattern of a datum can be acquired with the datum over time, where an online learner is employed to maintain a classifier involving only a small and fixed number of features.

OSFS: Conversely, with the number of training examples changing dynamically and more attention being paid to streaming features, Wu et al. [30] described streaming features as features that flow in one by one over time. They studied a framework with a small complexity cost, Fast-OSFS, to estimate the relevance between features and class attributes by calculating some probability values. An interesting point is that Fast-OSFS has a memory for redundant features, owing to its definition of the relationship based on conditional probability. This guarantees that, even though a redundant feature has been removed earlier, a new feature with the same kind of redundant information as the former feature can also be discovered and eliminated.

SAOLA: In contrast to the estimation of mutual information by conditional probability, Yu et al. [34] defined the redundancy between features and the relevance between features and class attributes using entropy models. Additionally, once a feature is deleted, it will not be investigated any more by the greedy algorithm, which only adds new features but never deletes them. Therefore, their method can be faster than Fast-OSFS.

For streaming data with a high dimension, great computing consumption may give rise to an incrementally, even exponentially, growing search space. In view of this, Fong et al. [132] have proposed a light-weight feature selection method based on heuristic algorithms.

APSOFS: APSOFS [132] finds a preferred combination of classification algorithms and light-weight feature selection algorithms. An interesting aspect of their work is the discussion of how the new functions of data stream mining algorithms can help overcome the incremental computation.

An interesting issue is streaming labels, namely, the number of class attributes being unknown while the size of the feature subset stays constant.

MLFS: Like streaming features, Lin et al. [133] made the assumption that labels arrive one at a time. Under this scenario, they first obtain an individual feature rank list and weights based on mRMR for each newly arrived label. Afterwards, on the basis of the fixed weight values, the distance between the final feature rank list and each individual feature rank list is calculated, and the final feature rank list that makes the distance minimal is what is needed. This is a kind of embedded method, with a filter to rank the features and a learning method to seek the optimal feature subset. It provides a new idea for streaming-label feature selection problems, which attract many domains like image retrieval and medical diagnosis.

In summary, Table 3 briefly compares the methods discussed above.

TABLE 3. Comparison of commonly used feature selection methods for dynamic data.

Similar to feature selection methods for large-scale data, in terms of online feature selection problems, filter and embedded techniques show great potential due to their small computational cost, while wrapper methods are seldom employed. However, the small complexity of filter methods is often accompanied by a low degree of accuracy. Therefore, embedded methods, as well as combined methods, will become new trends for further research.
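To give the streaming-feature setting a concrete shape, the following sketch processes features one at a time and keeps a new feature only if its estimated relevance to the class exceeds a threshold and it is not too correlated with any feature already kept. This is a deliberately simplified echo of the relevance/redundancy tests used by methods such as Fast-OSFS and SAOLA, with thresholds and data invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
n_samples = 300
y = rng.integers(0, 2, size=n_samples)

kept_columns = []              # the current feature subset
REL_MIN, RED_MAX = 0.02, 0.8   # illustrative relevance / redundancy thresholds

def on_new_feature(col):
    """Decide online whether a newly arrived feature enters the subset."""
    rel = mutual_info_classif(col.reshape(-1, 1), y, random_state=0)[0]
    if rel < REL_MIN:
        return False                       # irrelevant: discard immediately
    for old in kept_columns:               # redundancy check against kept set
        if abs(np.corrcoef(col, old)[0, 1]) > RED_MAX:
            return False
    kept_columns.append(col)
    return True

for t in range(20):                        # features flow in one by one
    new_col = y * rng.normal(1, 1, n_samples) if t % 4 == 0 \
              else rng.normal(size=n_samples)
    print(f"feature {t}: kept={on_new_feature(new_col)}")
```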
3) MISSING DATA
Missing data are very common in big data, on account of software disasters or the low resolution of hardware [134]–[136]. Bu et al. [136] have pointed out that the existence of missing data greatly increases the difficulty of processing data. For traditional data, a simple approach to processing a dataset with missing data is to directly delete the incomplete records. It is clear that this approach reduces the number of data, but some valuable information may also be discarded [93]. As mentioned above, even though some information is redundant or irrelevant, it is still retained in the original big dataset in its entirety for further analysis; therefore, this approach goes against the intention of big data. Another simple way is to seek features with a high degree of relevance to those with missing data, and to take the average value of the related features as the value of the missing data [137]; this kind of method assumes similarity measured by distance. Expectation maximization for missing data is established using a probabilistic model, in which iteration cannot be avoided when looking for a promising estimation. Similarly, low-rank approximation for missing data also has the demerit of repeated iterations [138]. Commonly, a parameter interpreted as a probability is required, which is all the more difficult to determine for big data. In summary, repeated iterations are not suitable for big data mining, as they may cause incremental computation.
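The "average of related features" idea of [137] can be sketched as follows: for a feature with missing entries, find the features most correlated with it on the observed rows and fill the holes with their standardized average. The correlation measure and the number of related features used are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = 0.9 * X[:, 0] + rng.normal(0, 0.3, size=200)  # feature 3 tracks feature 0
missing_rows = [5, 17, 42]
X[missing_rows, 3] = np.nan                              # entries lost for feature 3

j = 3
obs = ~np.isnan(X[:, j])                                 # rows where feature j is known

# Relevance of every other feature to feature j, computed on observed rows.
corr = np.array([abs(np.corrcoef(X[obs, k], X[obs, j])[0, 1]) if k != j else -1
                 for k in range(X.shape[1])])
related = np.argsort(corr)[::-1][:2]                     # the two most related features

# Standardize so that averaging across features is meaningful, then fill in.
mu, sd = X[obs, j].mean(), X[obs, j].std()
z = (X[:, related] - X[obs][:, related].mean(axis=0)) / X[obs][:, related].std(axis=0)
X[~obs, j] = mu + sd * z[~obs].mean(axis=1)
print("imputed values:", np.round(X[missing_rows, j], 3))
```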
What can we do to deal with incomplete big data? Bu et al. [139] have attempted to employ feature selection for clustering incomplete big data. In their method, feature selection aims to filter undesirable features in a set of complete data, measured by some entropy-based definitions, followed by a clustering model based on the selected feature subset. Yuan et al. [140] employed a feature selection method for multi-source learning of incomplete neural imaging data, which is a kind of large-scale data. The method utilizes the missing blocks to partition a dataset into several independent learning tasks, each having a classification model based on a feature selection method. These methods regard feature selection as a tool for the reconstruction of the original data. If a feature is redundant with the others, or if it is not relevant to the class attributes, tackling it at the expense of computing resources is unnecessary; when a feature without a recognized value is regarded as redundant or irrelevant, it is eliminated.

Moreover, since feature selection is clearly desirable due to the abundance of missing features in many real-world applications, researchers have attempted to select a subset of features by rough sets even though some features are missing, and to preserve the meaning of the features contained in the data set to avoid information loss. Qian and Shu [141] proposed a feature selection method for incomplete data based on mutual information measured by rough sets, which adopts a greedy forward search strategy over the whole set to accelerate the selection. The challenge for this kind of method is the use of mutual information based on rough set theory for constructing data models.

Incomplete data are an interesting but formidable issue for data mining. Due to the efficiency of feature selection in seeking the relationships among features and between features and class attributes, feature selection facilitates the reconstruction of a data model in line with an original data set with much missing information; further data mining techniques can then be applied. However, the large scale or high dimension of big data makes feature selection difficult, let alone big data in a dynamic environment. One challenge for the available methods is improving the consumption cost while still reconstructing a data model well without repeated iterations. Moreover, how to apply methods that are efficient for big data in a static environment to big data in a dynamic environment is still an open issue.

4) HETEROGENEOUS DATA
In most practical problems, data are often collected from different sources. Their features are often heterogeneous and consist of numerical and non-numerical features with different properties [142]–[145]. For example, in clinical research [146], medical data are collected from different sources, such as demographics, disease history, medication, allergies, biomarkers, medical images, or genetic markers, each of which offers a different partial view of a patient's condition [147]. As a result, it is difficult to evaluate heterogeneous features concurrently. As discussed previously, feature selection methods assign each feature a value of importance, and accordingly retain or discard a feature according to their inner measurement. Therefore, for heterogeneous data, differences in data format constitute the major obstacle for data mining, in particular in the field of big data.

The available feature selection methods for heterogeneous data can be roughly divided into numerical [148], [149] and non-numerical [150], [151] feature selection algorithms. Rough sets [152], [153] and mutual information [145] are two efficient tools for dealing with heterogeneous feature selection, with the notable difference that methods based on the former are computationally expensive. Under the circumstance of ever more heterogeneous data, effective methods are in great demand in terms of size and formats. However, there are only a very limited number of methods for heterogeneous data in the context of big data.
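One simple way to put numerical and non-numerical features on a common relevance scale, in the spirit of the mutual-information route mentioned above, is to estimate each feature's mutual information with the class while flagging the categorical columns as discrete. The toy data and the integer encoding below are assumptions of this sketch.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 12, n)                       # numerical feature
dept = rng.integers(0, 4, n)                      # categorical feature (coded 0..3)
marker = rng.normal(size=n)                       # numerical, uninformative
y = ((age > 55) | (dept == 2)).astype(int)        # class mixes both feature types

X = np.column_stack([age, dept, marker])
# discrete_features tells the estimator which columns are categorical,
# so heterogeneous features share one relevance scale.
scores = mutual_info_classif(X, y, discrete_features=[False, True, False],
                             random_state=0)
for name, s in zip(["age", "dept", "marker"], scores):
    print(f"{name:>6}: MI = {s:.3f}")
```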
5) UNRELIABLE DATA AND IMBALANCED DATA
Unreliable data are collected from equipment with deviations or under the influence of the outside environment. Each feature has a reliability value resulting from sensor precision, faulty equipment, environmental temperature changes, incorrect operation, etc. [90], [154], [155]. Gong et al. [89] propose a feature selection method for unreliable data in which the reliability of a feature is represented by a value between 0 and 1, and the mathematical model is constructed using this reliability value. Commonly, when dealing with unreliable data, fuzzy methods have the capability of describing the degree of uncertainty of each variable. For example, Chen [156] employs a cost-based fuzzy decision model to deal with unreliable systems, and Xie et al. [157] incorporate a designed fuzzy weighting function into their fuzzy control model under communication links. Since feature selection is a tool for mining data, if a datum is not totally reliable, or if there was a breakdown or a faulty operation when collecting the data, how can we make full use of it? In an industrial and mining context, there may be hundreds of sensors, each with its own accuracy, along an underground production line, and the supervised data from these sensors are delivered in real time to the upper detecting chamber. The transmitted data will surely have a significant influence on the judgment of the upper control. Consequently, designing useful feature selection methods for unreliable big data, in particular when data are transmitted in real time for further processing, is of great importance.
In machine learning and data mining, when the number of observations in one class is significantly rarer than in the other classes [158], processing the minority samples is difficult, which may lead to misclassifying or even ignoring them. These minority samples, however, often contain the more valuable information. Imbalanced data arise from the accumulation of data, in particular in the field of big data [159]. For instance, in the detecting chamber of a pit, the data collected from the sensors consist of the majority samples under a no-fault environment. When a breakdown occurs, for instance when a leash deviates from its set track, an abnormal value from the sensor is transmitted to the detecting chamber; these abnormal values constitute the minority samples.

Since one or more classes are underrepresented in the data set [160], some researchers treat all imbalanced data consistently with one versatile algorithm [159], [161], [162], while others deal with imbalanced data of various dimensions, ratios and numbers of classes [158].

For imbalanced data, several feature selection models have been attempted [158], [161], [163]–[168]. Embedded methods have been proven to be the most efficient tool for dealing with imbalanced data [169], [170], although, before they are applied, data-level approaches that balance the class attributes by re-sampling the training dataset are necessary, for example over-sampling and under-sampling: the former creates new samples of the minority class, as in SMOTE [171], while the latter reduces the number of majority-class samples, as in ACOS [172].

One obstacle to selecting feature subsets for imbalanced data is to avoid losing potentially useful information and altering the original data distribution, since feature selection has to achieve a trade-off between eliminating irrelevance and redundancy and retaining valuable features. In the environment of imbalanced big data, the inner pattern is hard to recognize and the computing cost should be kept low. Besides, removing potentially useful features would have an extreme effect on the classifier's accuracy. Therefore, improving the performance of feature selection methods for imbalanced data is a major challenge in imbalanced mining.
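A minimal sketch of the data-level step described above: the majority class is randomly under-sampled before filter scoring, so that the relevance estimates are not dominated by the majority class (a SMOTE-style over-sampler would play the same role for the minority class). The dataset and the scorer are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# A 95:5 imbalanced problem with a handful of informative features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)

# Data-level step: random under-sampling of the majority class.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
idx = np.concatenate([minority, majority])

scores_raw = mutual_info_classif(X, y, random_state=0)
scores_bal = mutual_info_classif(X[idx], y[idx], random_state=0)

print("top-4 features, original data :", np.argsort(scores_raw)[::-1][:4])
print("top-4 features, balanced data :", np.argsort(scores_bal)[::-1][:4])
```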
B. APPLICATIONS IN CLASSIFICATION ANALYSIS
1) TEXT CLASSIFICATION
The basic motivation for studying big data is to aggregate and process a huge amount of data as rapidly as possible to mine valuable information [173]–[177]. Based on this desirable information, researchers can avoid risks, confirm causes, and predict events [178]. One of the most important tasks in processing big data is to classify texts.

Today, the Internet is an essential part of people's lives, offering a great deal of convenience and many reference sources. Open review platforms on the web provide users with plenty of available content. Many chaotic reviews, however, consume users' valuable decision-making time. Under this circumstance, feature selection is a promising way to provide a filter and a reference. Wang et al. [179] proposed an effective feature selection method for classifying sentiment text in product reviews, where the experiment includes 1006 car-review documents. Although the method performs well in text classification, attempts are still needed for big data, where perhaps hundreds or thousands of clues must be analyzed. Assuming there is a high degree of redundancy and irrelevance among these clues, one difficult task for decision makers is to choose goods with the preferred performance at a low expense.

If a dynamic environment exists in text classification, what form could a feature selection method take? Nakanishi introduces a vector space model, the most popular and basic method for the comparison of concepts, events, and phenomena [180], to measure the similarity or correlation between queries and the information resources that are relevant to users' information needs. It is worth mentioning that Nakanishi's work can change the space dynamically for semantic computations and analyses as the data are updated. This is a breakthrough in text classification. In the big data environment, dynamic feature selection needs more attention.

2) MICROARRAY ANALYSIS
Microarray data are generated from microarray experiments, which generally have high dimensions and a small number of samples. A key issue in microarray experiments is the large number of irrelevant and redundant genes, whose elimination should make the process of obtaining the classifier easier [181]–[184]. As mentioned above, feature selection methods for big data with a high dimension and small samples have been discussed, and these are not suited to large numbers of repeated iterations.

Under this circumstance, Apolloni et al. [184] developed an efficient hybrid feature selection method combining a wrapper based on binary differential evolution with a rank-based filter, where the initial population consists of solutions built from the most relevant features obtained by the filter and of solutions randomly generated to promote the diversity of solutions. Similarly, Tabakhi et al. [185] proposed an embedded feature selection method for genes, which, unlike the former method, is unsupervised.

In view of the particularity of some sensitive cases, e.g. medical diagnosis, the research data are commonly very large on account of constant preservation [6]. The filter taxonomy is often chosen for its advantage of saving computation cost. However, in order to increase the classification accuracy, researchers tend to incorporate filters into other taxonomies or heuristic search strategies. Therefore, the accuracy of feature selection methods needs to be improved without increasing the complexity of these methods. Besides, the small number of samples remains an obstacle to improving the classification accuracy.
3) IMAGE CLASSIFICATION AND BACKGROUND SUPPRESSION
In the domain of image classification, diverse information is required for images to be classified; current attempts include [50] and [186]–[190]. On account of storage and computing costs, a larger number of features does not imply better classification. Therefore, feature selection is taken into consideration for image classification. Shang and Barnes [191] proposed a fuzzy-rough feature selection method which is then incorporated into machine learning for Mars image classification. Hierarchical image content analysis is provided by Vavilin and Jo [192] for dealing with image classification and retrieval in natural and urban scenes. Moreover, Chang et al. [193] employed k-fold feature selection, based on a concept similar to k-fold cross-validation, for image classification.

Although these studies are able to obtain good classification results, the human factor has been ignored. In view of this, Zhou et al. [194] proposed an eye-guided tracking feature selection method for this field, which explores the mechanisms of the human eye for processing visual information, based on mRMR and SVM. Their method takes a new look at image classification, even though it does not consider dynamic images. In the context of big data, certain types of images tend to have an influence on eye-tracking data. The diverse properties of images, such as color, edge distribution, illumination, weather, season, daytime, saturation, buildings, cars, trees, the sky, and roads [192], make image classification methods specific to different scenarios. Given this big domain of images, how we can extract valuable information from it and make use of it attracts attention, owing to the widespread use of digital equipment.

Background suppression targets detecting and analyzing text from video frames, where a media sequence of unknown length is assumed [195], [196]. Feature selection methods are able to seek out such information, and Nguyen et al. employ feature selection to deal with the background suppression problem [197]. Similar to unreliable data, the features of this kind of problem often have different importance, leading to more difficulty for feature selection. Moreover, the streamwise style of the data undoubtedly poses an even greater challenge.

V. FUTURE RESEARCH ISSUES
This paper has reviewed the available feature selection methods for dealing with big data. Some possible future directions are discussed in this section.

Data model: As mentioned above, we are now in the era of big data, with extremely large sizes and rapid changes. A changing environment under the conditions of large-scale data should also be taken into consideration for feature selection. The dynamic feature selection method is an open issue for researchers, since not only is the cost of accessing the features or data high, but we also usually cannot obtain all the features or data in advance in real-world applications. With the growth of streaming data and the development of dynamic feature selection methods, how to combine the two aspects for efficient and fast classification requires a lot of work.

High-dimension: Although some available feature selection strategies based on mutual information for high-dimensional data keep the computational consumption manageable, the definition of the relationships between features or with the class attributes, which has an influence on the final selection results, is still a great challenge.

Large-scale: With respect to the various formats of data, combined with their large scale, cloud computing and cooperative computing are new topics, yet only a handful of studies incorporate these parallel processing methods into feature selection for big data.

Data structure: In terms of semi-structured data and non-structured data, the importance of normalization should be stressed. If feature selection methods that facilitate seeking the internal patterns of these kinds of big data are designed, it will be easier for our interpreters to process and utilize them. However, most of the current research aims to find the feature subset
of structured big data, while the semi-structured and non-structured big data are absent.

Dynamic environment: Concerning data whose format, scale or other characteristics change over time, namely the dynamic case, there has been only a limited amount of research, even though several streamwise models are available. The processing speed is the main obstacle for these kinds of data, since in some cases it is more necessary for data users to obtain a brief and simple model that describes the main characteristics of the original model rather than a precise one. Using the obtained data model, future data can be processed roughly online, followed by other processing techniques offline.

Combined with parallel methods: Given the characteristics of big data, parallel processing methods have been applied to improve the efficiency of mining big data [198]. Since feature selection methods can lighten the processing load in inducing a data mining model, the combination of the merits of both feature selection and parallel processing is worth investigating.

a. Combined with CoEA: The cooperative evolutionary algorithm (CoEA), a parallel evolutionary algorithm (EA), generally has a rapid processing speed. If the divide-and-conquer idea of CoEA is applied to feature selection for big data, the problem of feature selection with high dimensions is split into several subproblems of feature selection with low dimensions. These low-dimensional subproblems can then easily be handled in parallel by EAs, which provides a feasible approach for improving the efficiency of feature selection (a minimal sketch of this idea follows after this list).
b. Combined with cloud computing: In recent years, cloud computing has been studied intensively. This lays the foundation for remote storage and distributed processing of big data. Heterogeneous parallel processing based on a cloud environment, however, can lead to many problems, such as the division of processing tasks and the cooperation among cloud resources. Moreover, in the light of the remarkable performance of GPUs in floating-point arithmetic and large-scale data processing, implementing feature selection methods on GPUs is also an effective way to improve the efficiency of big data mining.

Energy saving: Filter methods offer a lower computational cost than both wrapper and embedded methods, despite the fact that this comes at the expense of accuracy. Therefore, hybrid methods, efficient embedded methods and parallel techniques are desirable to save computing cost while not lowering the accuracy when dealing with big data.

Performance in real-time processing: Since the usefulness of a datum degrades over time, the time consumed in processing big data must be taken into account when designing a processing technique. Due to the merits of the dimensional reduction brought by feature selection, some researchers have attempted to apply feature selection methods to processing data in real time [199], [200]. For example, Zhang et al. [160] proposed a feature selection method based on the Fisher filter and a wrapper to reduce the number of features and thereby the processing time. When dealing with problems that require real-time processing, apart from the efficiency of feature selection, just as much attention must be paid to accuracy: a well-established data model will help further processing, whereas an incorrect one has a great impact on subsequent processing.

Practical problems: Since feature selection targets recognizing the inner pattern of a dataset and eliminating the irrelevance or redundancy in the dataset, applying feature selection as a data mining tool to practical problems is desirable in today's world of big data. However, there is a lack of research into practical cases on the basis of feature selection in the context of big data. It will be appreciated if the unreliability, the imbalance, and the heterogeneity of data are taken into consideration.
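A minimal sketch of the divide-and-conquer idea from the "Combined with CoEA" item above: the feature indices are partitioned into low-dimensional blocks, a deliberately tiny evolutionary search runs independently on each block in parallel, and the block winners are merged. The cooperation between subpopulations that a genuine CoEA would add is omitted here, and all parameters are illustrative.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

def evolve_block(block, generations=10, pop=8, seed=0):
    """A deliberately tiny (1 + pop) evolutionary search over one feature block."""
    rng = np.random.default_rng(seed)
    cols = np.array(block)

    def fitness(mask):
        return cross_val_score(KNeighborsClassifier(),
                               X[:, cols[mask]], y, cv=3).mean()

    best = np.ones(len(block), dtype=bool)          # incumbent: the whole block
    best_fit = fitness(best)
    for _ in range(generations):
        # offspring: copies of the incumbent, each with one random bit flipped
        offspring = np.tile(best, (pop, 1))
        offspring[np.arange(pop), rng.integers(0, len(block), pop)] ^= True
        for child in offspring:
            if child.any():
                fit = fitness(child)
                if fit > best_fit:
                    best, best_fit = child.copy(), fit
    return [int(c) for c in cols[best]]

if __name__ == "__main__":
    # Divide: split the features into 5 low-dimensional subproblems.
    blocks = [b.tolist() for b in np.array_split(np.arange(X.shape[1]), 5)]
    # Conquer: each subproblem is handled by an independent EA, in parallel.
    with ProcessPoolExecutor() as pool:
        winners = list(pool.map(evolve_block, blocks))
    merged = sorted(f for block in winners for f in block)
    print("features kept by the parallel subproblems:", merged)
```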
VI. CONCLUSION
Mining valuable information from big data is indeed difficult and challenging. As an important data preprocessing technique, feature selection can greatly improve the efficiency of utilizing data. This paper first reviews feature selection methods for traditional data and then comments in detail on the available feature selection methods for big data. On the one hand, although researchers have developed a large variety of feature selection methods for traditional data, these methods still have difficulties tackling the problem of feature selection for big data. On the other hand, the existing methods of big data feature selection have severe limitations in achieving an appropriate tradeoff between the accuracy of solutions and computational complexity. Moreover, for practical problems, even though more work is essential, we have some strategies and techniques specific to a background, which are reviewed in this paper. Besides, particular attention is paid to the applications of feature selection methods specific to several particular kinds of data and to classification analysis. It will be appreciated if our review work provides a reference for those who would like to explore big data mining via feature selection.

REFERENCES
[1] S. El-Sappagh, F. Ali, S. El-Masri, K. Kim, A. Ali, and K.-S. Kwak, ‘‘Mobile health technologies for diabetes mellitus: Current state and future challenges,’’ IEEE Access, to be published.
[2] R. Elshawi, S. Sakr, D. Talia, and P. Trunfio, ‘‘Big data systems meet machine learning challenges: Towards big data science as a service,’’ Big Data Res., vol. 14, pp. 1–11, Dec. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579617303957
[3] V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work and Think. Boston, MA, USA: Houghton Mifflin, 2013.
[4] R. J. Hathaway and J. C. Bezdek, ‘‘Extending fuzzy and probabilistic clustering to very large data sets,’’ Comput. Statist. Data Anal., vol. 51, no. 1, pp. 215–234, 2006.
[5] C. Lynch, ‘‘Big data: How do your data grow?’’ Nature, vol. 455, no. 7209, pp. 28–29, 2008.
[6] K. Davis, Ethics of Big Data: Balancing Risk and Innovation. Newton, MA, USA: O’Reilly and Associates Inc, 2012.
[7] O. B. Sezer, E. Dogdu, and A. M. Ozbayoglu, ‘‘Context-aware computing, learning, and big data in Internet of Things: A survey,’’ IEEE Internet Things J., vol. 5, no. 1, pp. 1–27, Feb. 2018.
[8] C. K. Emani, N. Cullot, and C. Nicolle, ‘‘Understandable big data: A survey,’’ Comput. Sci. Rev., vol. 17, pp. 70–81, Aug. 2015.
[9] A. Nara, ‘‘Big data: Techniques and technologies in geoinformatics,’’ Int. J. Geograph. Inf. Sci., vol. 29, no. 4, pp. 694–696, Apr. 2015.
[10] J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, 2011.
[11] Q. Tuo, H. Zhao, and Q. Hu, ‘‘Hierarchical feature selection with subtree based graph regularization,’’ Knowl.-Based Syst., vol. 163, pp. 996–1008, Jan. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705118305094
[12] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, ‘‘Recent advances and emerging challenges of feature selection in the context of big data,’’ Knowl.-Based Syst., vol. 86, pp. 33–45, Sep. 2015.
[13] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, ‘‘A review of feature selection methods on synthetic data,’’ Knowl. Inf. Syst., vol. 34, no. 3, pp. 483–519, 2013.
[14] J. Wu and Z. Lu, ‘‘A novel hybrid genetic algorithm and simulated annealing for feature selection and kernel optimization in support vector regression,’’ in Proc. IEEE 5th Int. Conf. Adv. Comput. Intell. (ICACI), Oct. 2012, pp. 999–1003.
[15] N. Abd-Alsabour, ‘‘A review on evolutionary feature selection,’’ in Proc. Eur. Modelling Symp., 2014, pp. 20–26.
[16] A. A. Raweh, M. Nassef, and A. Badr, ‘‘A hybridized feature selection and extraction approach for enhancing cancer prediction based on DNA methylation,’’ IEEE Access, vol. 6, pp. 15212–15223, 2018.
[17] A. Brankovic, M. Hosseini, and L. Piroddi, ‘‘A distributed feature selection algorithm based on distance correlation with an application to microarrays,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., to be published.
[18] E. Bonilla-Huerta, A. Hernández-Montiel, R. Morales-Caporal, and M. Arjona-López, ‘‘Hybrid framework using multiple-filters and an embedded approach for an efficient selection and classification of microarray data,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 13, no. 1, pp. 12–26, Jan./Feb. 2016. [Online]. Available: https://doi.org/10.1109/TCBB.2015.2474384
[19] M. Reboiro-Jato, F. Díaz, D. Glez-Peña, and F. Fdez-Riverola, ‘‘A novel ensemble of classifiers that use biological relevant gene sets for microarray classification,’’ Appl. Soft Comput., vol. 17, pp. 117–126, Apr. 2014.
[20] F. P. Shah and V. Patel, ‘‘A review on feature selection and feature extraction for text classification,’’ in Proc. Int. Conf. Wireless Commun., Signal Process. Netw. (WiSPNET), Mar. 2016, pp. 2264–2268.
[21] H. Wang, F. Dong, and L. Song, ‘‘Bubble-forming regime identification based on image textural features and the MCWA feature selection method,’’ IEEE Access, vol. 5, pp. 15820–15830, 2017.
[22] G. Forman, ‘‘An extensive empirical study of feature selection metrics for text classification,’’ J. Mach. Learn. Res., vol. 3, pp. 1289–1305, Mar. 2003.
[23] J. C. Gomez, E. Boiy, and M.-F. Moens, ‘‘Highly discriminative statistical features for email classification,’’ Knowl. Inf. Syst., vol. 31, no. 1, pp. 23–53, 2012.
[24] Z. Zhao, X. He, D. Cai, L. Zhang, W. Ng, and Y. Zhuang, ‘‘Graph regularized feature selection with data reconstruction,’’ IEEE Trans. Knowl. Data Eng., vol. 28, no. 3, pp. 689–700, Mar. 2016.
[25] J. G. Dy, C. E. Brodley, A. Kak, L. S. Broderick, and A. M. Aisen, ‘‘Unsupervised feature selection applied to content-based retrieval of lung images,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 3, pp. 373–378, Mar. 2003.
[27] M. H. Kamarudin, C. Maple, T. Watson, and N. S. Safa, ‘‘A LogitBoost-based algorithm for detecting known and unknown Web attacks,’’ IEEE Access, vol. 5, pp. 26190–26200, 2017.
[28] L. Zhao and X. Dong, ‘‘An industrial Internet of Things feature selection method based on potential entropy evaluation criteria,’’ IEEE Access, vol. 6, pp. 4608–4617, 2018.
[29] R. Wald, T. M. Khoshgoftaar, A. Napolitano, and C. Sumner, ‘‘Using Twitter content to predict psychopathy,’’ in Proc. IEEE 11th Int. Conf. Mach. Learn. Appl. (ICMLA), vol. 2, Dec. 2012, pp. 394–401.
[30] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, ‘‘Online feature selection with streaming features,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1178–1192, May 2013.
[31] W. Shu, W. Qian, and Y. Xie, ‘‘Incremental approaches for feature selection from dynamic data with the variation of multiple objects,’’ Knowl.-Based Syst., vol. 163, pp. 320–331, Jan. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705118304246
[32] P. Bugata and P. Drotár, ‘‘Weighted nearest neighbors feature selection,’’ Knowl.-Based Syst., vol. 163, pp. 749–761, Jan. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705118304908
[33] A. Lagrange, M. Fauvel, and M. Grizonnet, ‘‘Large-scale feature selection with Gaussian mixture models for the classification of high dimensional remote sensing images,’’ IEEE Trans. Comput. Imag., vol. 3, no. 2, pp. 230–242, Jun. 2017.
[34] K. Yu, X. Wu, W. Ding, and J. Pei, ‘‘Towards scalable and accurate online feature selection for big data,’’ in Proc. IEEE Int. Conf. Data Mining (ICDM), Dec. 2014, pp. 660–669.
[35] R. Kohavi and G. H. John, ‘‘Wrappers for feature subset selection,’’ Artif. Intell., vol. 97, nos. 1–2, pp. 273–324, 1997.
[36] H. Zhou, S. Han, and Y. Liu, ‘‘A novel feature selection approach based on document frequency of segmented term frequency,’’ IEEE Access, vol. 6, pp. 53811–53821, 2018.
[37] W. Zhou, C. Wu, Y. Yi, and G. Luo, ‘‘Structure preserving non-negative feature self-representation for unsupervised feature selection,’’ IEEE Access, vol. 5, pp. 8792–8803, 2017.
[38] I. Guyon and A. Elisseeff, ‘‘An introduction to variable and feature selection,’’ J. Mach. Learn. Res., vol. 3, no. 6, pp. 1157–1182, Jan. 2003.
[39] J. R. Vergara and P. A. Estévez, ‘‘A review of feature selection methods based on mutual information,’’ Neural Comput. Appl., vol. 24, no. 1, pp. 175–186, 2014.
[40] X.-Y. Liu, Y. Liang, S. Wang, Z.-Y. Yang, and H.-S. Ye, ‘‘A hybrid genetic algorithm with wrapper-embedded approaches for feature selection,’’ IEEE Access, vol. 6, pp. 22863–22874, 2018.
[41] F. Bagherzadeh-Khiabani, A. Ramezankhani, F. Azizi, F. Hadaegh, E. W. Steyerberg, and D. Khalili, ‘‘A tutorial on variable selection for clinical prediction models: Feature selection methods in data mining could improve the results,’’ J. Clin. Epidemiol., vol. 71, pp. 76–85, Mar. 2016.
[42] M. Dash and H. Liu, ‘‘Feature selection for classification,’’ Intell. Data Anal., vol. 1, nos. 1–4, pp. 131–156, 1997.
[43] R. Chen, N. Sun, X. Chen, M. Yang, and Q. Wu, ‘‘Supervised feature selection with a stratified feature weighting method,’’ IEEE Access, vol. 6, pp. 15087–15098, 2018.
[44] S. Wang and W. Zhu, ‘‘Sparse graph embedding unsupervised feature selection,’’ IEEE Trans. Syst., Man, Cybern., Syst., vol. 48, no. 3, pp. 329–341, Mar. 2018.
[45] S. Wang, W. Pedrycz, Q. Zhu, and W. Zhu, ‘‘Unsupervised feature selection via maximum projection and minimum redundancy,’’ Knowl.-Based Syst., vol. 75, pp. 19–29, Feb. 2015.
[46] N. Spolaôr, M. C. Monard, G. Tsoumakas, and H. D. Lee, ‘‘A systematic review of multi-label feature selection and a new method based on label construction,’’ Neurocomputing, vol. 180, pp. 3–15, Mar. 2015.
[47] J. C. Ang, A. Mirzal, H. Haron, and H. N. A. Hamed, ‘‘Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 13, no. 5, pp. 971–989, Sep. 2016.
[48] J. Li and H. Liu, ‘‘Challenges of feature selection for big data analytics,’’ IEEE Intell. Syst., vol. 32, no. 2, pp. 9–15, Mar./Apr. 2017.
[49] H. Liu and L. Yu, ‘‘Toward integrating feature selection algorithms for classification and clustering,’’ IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, Apr. 2005.
[26] P. Saari, T. Eerola, and O. Lartillot, ‘‘Generalizability and simplicity no. 4, pp. 491–502, Apr. 2005.
as criteria in feature selection: Application to mood classification in [50] A. Moghimi, C. Yang, and P. M. Marchetto, ‘‘Ensemble feature selection
music,’’ IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 6, for plant phenotyping: A journey from hyperspectral to multispectral
pp. 1802–1812, Aug. 2011. imaging,’’ IEEE Access, vol. 6, pp. 56870–56884, 2018.
[51] C. Yao, Y.-F. Liu, B. Jiang, J. Han, and J. Han, "LLE score: A new filter-based unsupervised feature selection method based on nonlinear manifold embedding and its application to image recognition," IEEE Trans. Image Process., vol. 26, no. 11, pp. 5257–5269, Nov. 2017.
[52] I. Tsamardinos and C. F. Aliferis, "Towards principled feature selection: Relevancy, filters and wrappers," in Proc. 9th Int. Workshop Artif. Intell. Statist. San Mateo, CA, USA: Morgan Kaufmann, 2003.
[53] C. Freeman, D. Kulić, and O. Basir, "An evaluation of classifier-specific filter measure performance for feature selection," Pattern Recognit., vol. 48, no. 5, pp. 1812–1826, 2015.
[54] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Trans. Neural Netw., vol. 5, no. 4, pp. 537–550, Jul. 1994.
[55] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. Int. Workshop Mach. Learn., 1992, pp. 249–256.
[56] H. Almuallim and T. G. Dietterich, "Learning Boolean concepts in the presence of many irrelevant features," Artif. Intell., vol. 69, nos. 1–2, pp. 279–305, 1994.
[57] C.-T. Su and H.-C. Lin, "Applying electromagnetism-like mechanism for feature selection," Inf. Sci., vol. 181, no. 5, pp. 972–986, 2011.
[58] P. Bermejo, L. de la Ossa, J. A. Gámez, and J. M. Puerta, "Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking," Knowl.-Based Syst., vol. 25, no. 1, pp. 35–44, 2012.
[59] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, no. 1, pp. 245–271, Dec. 1997.
[60] L. Ma, M. Li, Y. Gao, T. Chen, X. Ma, and L. Qu, "A novel wrapper approach for feature selection in object-based image classification using polygon-based cross-validation," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 3, pp. 409–413, Mar. 2017.
[61] T. M. Khoshgoftaar, A. Fazelpour, H. Wang, and R. Wald, "A survey of stability analysis of feature subset selection techniques," in Proc. IEEE 14th Int. Conf. Inf. Reuse Integr. (IRI), Aug. 2013, pp. 424–431.
[62] A. A. Shanab, T. M. Khoshgoftaar, and R. Wald, "Evaluation of wrapper-based feature selection using hard, moderate, and easy bioinformatics data," in Proc. IEEE Int. Conf. Bioinform. Bioeng. (BIBE), Nov. 2014, pp. 149–155.
[63] L.-Y. Qiao, X.-Y. Peng, and Y. Peng, "BPSO-SVM wrapper for feature subset selection," Dianzi Xuebao (Acta Electronica Sinica), vol. 34, no. 3, pp. 496–498, 2006.
[64] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "Hybrid feature selection by combining filters and wrappers," Expert Syst. Appl., vol. 38, no. 7, pp. 8144–8150, 2011.
[65] C. Hou, F. Nie, X. Li, D. Yi, and Y. Wu, "Joint embedding learning and sparse regression: A framework for unsupervised feature selection," IEEE Trans. Cybern., vol. 44, no. 6, pp. 793–804, Jun. 2014.
[66] K. Mistry, L. Zhang, S. C. Neoh, C. P. Lim, and B. Fielding, "A micro-GA embedded PSO feature selection approach to intelligent facial emotion recognition," IEEE Trans. Cybern., vol. 47, no. 6, pp. 1496–1509, Jun. 2017.
[67] J. Xie and W.-X. Xie, "Several feature selection algorithms based on the discernibility of a feature subset and support vector machines," Chin. J. Comput., vol. 37, pp. 1704–1718, Aug. 2014.
[68] T. Kari et al., "Hybrid feature selection approach for power transformer fault diagnosis based on support vector machine and genetic algorithm," IET Gener., Transmiss. Distrib., vol. 12, no. 21, pp. 5672–5680, 2018.
[69] J. Spilka, J. Frecon, R. Leonarduzzi, N. Pustelnik, P. Abry, and M. Doret, "Sparse support vector machine for intrapartum fetal heart rate classification," IEEE J. Biomed. Health Inform., vol. 21, no. 3, pp. 664–671, May 2017.
[70] A. Rakotomamonjy, "Variable selection using SVM based criteria," J. Mach. Learn. Res., vol. 3, pp. 1357–1370, Mar. 2003.
[71] R. Chakraborty and N. R. Pal, "Feature selection using a neural framework with controlled redundancy," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 1, pp. 35–50, Jan. 2015.
[72] E. Romero and J. M. Sopena, "Performing feature selection with multilayer perceptrons," IEEE Trans. Neural Netw., vol. 19, no. 3, pp. 431–441, Mar. 2008.
[73] B. Lerner, M. Levinstein, B. Rosenberg, H. Guterman, I. Dinstein, and Y. Romem, "Feature selection and chromosome classification using a multilayer perceptron neural network," in Proc. IEEE Int. Conf. Neural Netw., IEEE World Congr. Comput. Intell., vol. 6, Jun. 1994, pp. 3540–3545.
[74] R. Alzubi, N. Ramzan, H. Alzoubi, and A. Amira, "A hybrid feature selection method for complex diseases SNPs," IEEE Access, vol. 6, pp. 1292–1301, 2018.
[75] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proc. NIPS, vol. 12, 2000, pp. 668–674.
[76] Y. Zhang, C. Ding, and T. Li, "Gene selection algorithm by combining reliefF and mRMR," BMC Genomics, vol. 9, no. Suppl 2, p. S27, 2008.
[77] Y. Peng, Z. Wu, and J. Jiang, "A novel feature selection approach for biomedical data classification," J. Biomed. Inform., vol. 43, no. 1, pp. 15–23, 2010.
[78] A. E. Akadi, A. Amine, A. E. Ouardighi, and D. Aboutajdine, "A two-stage gene selection scheme utilizing MRMR filter and GA wrapper," Knowl. Inf. Syst., vol. 26, no. 3, pp. 487–500, 2011.
[79] Y. Luo, Y. Li, C. Zhou, and C. Xu, "Combining feature selectors in a product advertisement classification system," in Proc. IEEE 1st Asian Conf. Pattern Recognit. (ACPR), Nov. 2011, pp. 184–188.
[80] L. Hedjazi, J. Aguilar-Martin, M.-V. Le Lann, and T. Kempowsky-Hamon, "Membership-margin based feature selection for mixed type and high-dimensional data: Theory and applications," Inf. Sci., vol. 322, pp. 174–196, Nov. 2015.
[81] K. Nag and N. R. Pal, "A multiobjective genetic programming-based ensemble for simultaneous feature selection and classification," IEEE Trans. Cybern., vol. 46, no. 2, pp. 499–510, Feb. 2016.
[82] F. Alzami et al., "Adaptive hybrid feature selection-based classifier ensemble for epileptic seizure classification," IEEE Access, vol. 6, pp. 29132–29145, 2018.
[83] S. Nagi and D. K. Bhattacharyya, "Classification of microarray cancer data using ensemble approach," Netw. Model. Anal. Health Inform. Bioinform., vol. 2, no. 3, pp. 159–173, 2013.
[84] O. P. Günther et al., "A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers," BMC Bioinform., vol. 13, no. 1, p. 326, 2012.
[85] H. Wang, T. M. Khoshgoftaar, and A. Napolitano, "A comparative study of ensemble feature selection techniques for software defect prediction," in Proc. IEEE 9th Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2010, pp. 135–140.
[86] R. Xia, C. Zong, X. Hu, and E. Cambria, "Feature ensemble plus sample selection: Domain adaptation for sentiment classification," IEEE Intell. Syst., vol. 28, no. 3, pp. 10–18, May 2013.
[87] N. Gu, M. Fan, L. Du, and D. Ren, "Efficient sequential feature selection based on adaptive eigenspace model," Neurocomputing, vol. 161, pp. 199–209, Aug. 2015.
[88] Y. Zhang, D. Gong, Y. Hu, and W. Zhang, "Feature selection algorithm based on bare bones particle swarm optimization," Neurocomputing, vol. 148, pp. 150–157, Jan. 2015.
[89] D. Gong, Y. Hu, and Y. Zhang, "Feature selection method for unreliable data based on particle swarm optimization," Acta Electronica Sinica, vol. 7, no. 7, pp. 1320–1326, 2014.
[90] Z. Yong, G. Dun-Wei, and Z. Wan-Qiu, "Feature selection of unreliable data using an improved multi-objective PSO algorithm," Neurocomputing, vol. 171, pp. 1281–1290, Jan. 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231215010632
[91] J. DePasquale and R. Polikar, "Random feature subset selection for analysis of data with missing features," in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), Aug. 2007, pp. 2379–2384.
[92] A. Aussem and S. R. de Morais, "A conservative feature subset selection algorithm with missing data," Neurocomputing, vol. 73, nos. 4–6, pp. 585–590, 2010.
[93] G. Doquire and M. Verleysen, "Feature selection with missing data using mutual information estimators," Neurocomputing, vol. 90, pp. 3–11, Aug. 2012.
[94] M. Ramoni and P. Sebastiani, "Robust learning with missing data," Mach. Learn., vol. 45, no. 2, pp. 147–170, 2001.
[95] W. Zong, F. Wu, L.-K. Chu, and D. Sculli, "A discriminative and semantic feature selection method for text categorization," Int. J. Prod. Econ., vol. 165, pp. 215–222, Jul. 2015.
[96] Y. Zhang, D.-W. Gong, and M. Rong, "Multi-objective differential evolution algorithm for multi-label feature selection in classification," in Advances in Swarm and Computational Intelligence. Beijing, China: Springer, 2015, pp. 339–345.
[97] S. Peng, Q. Xu, X. B. Ling, X. Peng, W. Du, and L. Chen, "Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines," FEBS Lett., vol. 555, no. 2, pp. 358–362, 2003.
[98] C. H. Ooi and P. Tan, "Genetic algorithms applied to multi-class prediction for the analysis of gene expression data," Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.
[99] L. Yu, L. Hu, and L. Tang, "Stock selection with a novel sigmoid-based mixed discrete-continuous differential evolution algorithm," IEEE Trans. Knowl. Data Eng., vol. 28, no. 7, pp. 1891–1904, Jul. 2016.
[100] A. Gosh, S. Das, R. Mallipeddi, A. K. Das, and S. S. Dash, "A modified differential evolution with distance-based selection for continuous optimization in presence of noise," IEEE Access, vol. 5, pp. 26944–26964, 2017.
[101] Y. Zhang, D.-W. Gong, and J. Cheng, "Multi-objective particle swarm optimization approach for cost-based feature selection in classification," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 14, no. 1, pp. 64–75, Jan./Feb. 2017. [Online]. Available: https://doi.org/10.1109/TCBB.2015.2476796
[102] A. A. Naeini, M. Babadi, S. M. J. Mirzadeh, and S. Amini, "Particle swarm optimization for object-based feature selection of VHSR satellite images," IEEE Geosci. Remote Sens. Lett., vol. 15, no. 3, pp. 379–383, Mar. 2018.
[103] B. Tran, B. Xue, and M. Zhang, "A new representation in PSO for discretization-based feature selection," IEEE Trans. Cybern., vol. 48, no. 6, pp. 1733–1746, Jun. 2018.
[104] C. Yan, J. Ma, H. Luo, and J. Wang, "A hybrid algorithm based on binary chemical reaction optimization and tabu search for feature selection of high-dimensional biomedical data," Tsinghua Sci. Technol., vol. 23, no. 6, pp. 733–743, Dec. 2018.
[105] M. A. Tahir and A. Bouridane, "Novel round-robin tabu search algorithm for prostate cancer classification and diagnosis using multispectral imagery," IEEE Trans. Inf. Technol. Biomed., vol. 10, no. 4, pp. 782–793, Oct. 2006.
[106] S. A. Toussi, H. S. Yazdi, E. Hajinezhad, and S. Effati, "Eigenvector selection in spectral clustering using Tabu search," in Proc. IEEE 1st Int. eConf. Comput. Knowl. Eng. (ICCKE), Oct. 2011, pp. 75–80.
[107] I.-S. Oh, J.-S. Lee, and B.-R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, Nov. 2004.
[108] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, "Feature selection based on rough sets and particle swarm optimization," Pattern Recognit. Lett., vol. 28, no. 4, pp. 459–471, 2007.
[109] F. van den Bergh and A. P. Engelbrecht, "A study of particle swarm optimization particle trajectories," Inf. Sci., vol. 176, no. 8, pp. 937–971, 2006.
[110] E.-G. Talbi, L. Jourdan, J. Garcia-Nieto, and E. Alba, "Comparison of population based metaheuristics for feature selection: Application to microarray data classification," in Proc. ACS/IEEE Int. Conf. Comput. Syst. Appl., Mar./Apr. 2008, pp. 45–52.
[111] L.-Y. Chuang, C.-H. Yang, and J.-C. Li, "Chaotic maps based on binary particle swarm optimization for feature selection," Appl. Soft Comput., vol. 11, no. 1, pp. 239–248, 2011.
[112] A. Unler and A. Murat, "A discrete particle swarm optimization method for feature selection in binary classification problems," Eur. J. Oper. Res., vol. 206, no. 3, pp. 528–539, 2010.
[113] L.-Y. Chuang, C.-S. Yang, K.-C. Wu, and C.-H. Yang, "Gene selection and classification using Taguchi chaotic binary particle swarm optimization," Expert Syst. Appl., vol. 38, no. 10, pp. 13367–13377, 2011.
[114] U. K. Sikdar, A. Ekbal, and S. Saha, "Differential evolution based mention detection for anaphora resolution," in Proc. Annu. IEEE India Conf. (INDICON), Dec. 2013, pp. 1–6.
[115] A. Al-Ani, A. Alsukker, and R. N. Khushaba, "Feature subset selection using differential evolution and a wheel based search strategy," Swarm Evol. Comput., vol. 9, pp. 15–26, Apr. 2013.
[116] Y. Zhang, M. Rong, and D. Gong, "A multi-objective feature selection based on differential evolution," in Proc. IEEE Int. Conf. Control, Autom. Inf. Sci. (ICCAIS), Oct. 2015, pp. 302–306.
[117] C.-L. Huang, "ACO-based hybrid classification system with feature subset selection and model parameters optimization," Neurocomputing, vol. 73, nos. 1–3, pp. 438–448, 2009.
[118] S.-W. Lin, Z.-J. Lee, S.-C. Chen, and T.-Y. Tseng, "Parameter determination of support vector machine and feature selection using simulated annealing approach," Appl. Soft Comput., vol. 8, no. 4, pp. 1505–1512, 2008.
[119] Y. Zhang, J. Wu, and J. Cai, "Compact representation of high-dimensional feature vectors for large-scale image recognition and retrieval," IEEE Trans. Image Process., vol. 25, no. 5, pp. 2407–2419, May 2016.
[120] F. Gao, X. Zhang, Y. Huang, Y. Luo, X. Li, and L.-Y. Duan, "Data-driven lightweight interest point selection for large-scale visual search," IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2774–2787, Oct. 2018.
[121] J. M. N. Abad and A. Soleimani, "Novel feature selection algorithm for thermal prediction model," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 10, pp. 1831–1844, Oct. 2018.
[122] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[123] A. Unler, A. Murat, and R. B. Chinnam, "mr²PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification," Inf. Sci., vol. 181, no. 20, pp. 4625–4641, 2011.
[124] D. He, I. Rish, D. Haws, S. Teyssedre, Z. Karaman, and L. Parida, "MINT: Mutual information based transductive feature selection for genetic trait prediction," 2013. [Online]. Available: https://arxiv.org/abs/1310.1659
[125] V. Pappu, O. P. Panagopoulos, P. Xanthopoulos, and P. M. Pardalos, "Sparse proximal support vector machines for feature selection in high dimensional datasets," Expert Syst. Appl., vol. 42, pp. 9183–9191, 2015.
[126] X. Fang, Y. Xu, X. Li, Z. Fan, H. Liu, and Y. Chen, "Locality and similarity preserving embedding for feature selection," Neurocomputing, vol. 128, no. 5, pp. 304–315, Mar. 2014.
[127] J. Wang, P. Zhao, S. C. H. Hoi, and R. Jin, "Online feature selection and its applications," IEEE Trans. Knowl. Data Eng., vol. 26, no. 3, pp. 698–710, Mar. 2014.
[128] S. S. Naqvi, W. N. Browne, and C. Hollitt, "Feature quality-based dynamic feature selection for improving salient object detection," IEEE Trans. Image Process., vol. 25, no. 9, pp. 4298–4313, Sep. 2016.
[129] C. Tong and X. Shi, "Decentralized monitoring of dynamic processes based on dynamic feature selection and informative fault pattern dissimilarity," IEEE Trans. Ind. Electron., vol. 63, no. 6, pp. 3804–3814, Jun. 2016.
[130] M. Pratama, W. Pedrycz, and E. Lughofer, "Evolving ensemble fuzzy classifier," IEEE Trans. Fuzzy Syst., vol. 26, no. 5, pp. 2552–2567, Oct. 2018.
[131] J. Zhou, D. P. Foster, R. A. Stine, and L. H. Ungar, "Streamwise feature selection," J. Mach. Learn. Res., vol. 7, no. 1, pp. 1861–1885, Dec. 2006.
[132] S. Fong, R. Wong, and A. V. Vasilakos, "Accelerated PSO swarm search feature selection for data stream mining big data," IEEE Trans. Serv. Comput., vol. 9, no. 1, pp. 33–45, Jan. 2016.
[133] Y. Lin, Q. Hu, J. Zhang, and X. Wu, "Multi-label feature selection with streaming labels," Inf. Sci., vol. 372, pp. 256–275, Dec. 2016.
[134] S. Jin, F. Ye, Z. Zhang, K. Chakrabarty, and X. Gu, "Efficient board-level functional fault diagnosis with missing syndromes," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 35, no. 6, pp. 985–998, Jun. 2016.
[135] W. Xu, Z. He, E. Lo, and C.-Y. Chow, "Explaining missing answers to top-k SQL queries," IEEE Trans. Knowl. Data Eng., vol. 28, no. 8, pp. 2071–2085, Aug. 2016.
[136] F. Bu, Z. Chen, Q. Zhang, and X. Wang, "Incomplete big data clustering algorithm using feature selection and partial distance," in Proc. 5th Int. Conf. Digit. Home (ICDH), 2014, pp. 263–266.
[137] T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown, and D. Botstein, "Imputing missing data for gene expression arrays," Division Biostatist., Stanford Univ., Stanford, CA, USA, Tech. Rep., 1999.
[138] P. G. Clark, J. W. Grzymala-Busse, and W. Rzasa, "Mining incomplete data with singleton, subset and concept probabilistic approximations," Inf. Sci., vol. 280, pp. 368–384, Oct. 2014.
[139] F. Bu, Z. Chen, Q. Zhang, and L. T. Yang, "Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud," J. Supercomput., vol. 72, no. 8, pp. 2977–2990, 2015.
[140] L. Yuan et al., "Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data," NeuroImage, vol. 61, no. 3, pp. 622–632, 2012.
[141] W. Qian and W. Shu, "Mutual information criterion for feature selection from incomplete data," Neurocomputing, vol. 168, pp. 210–220, Nov. 2015.
[142] D. Song, Y. Luo, and J. Heflin, "Linking heterogeneous data in the semantic Web using scalable and domain-independent candidate selection," IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 143–156, Jan. 2017.
[143] S. Huda, J. Yearwood, H. F. Jelinek, M. M. Hassan, G. Fortino, and M. Buckland, "A hybrid feature selection with ensemble classification for imbalanced healthcare data: A case study for brain tumor diagnosis," IEEE Access, vol. 4, pp. 9145–9154, 2016.
[144] M. A. Hossain, X. Jia, and J. A. Benediktsson, "One-class oriented feature selection and classification of heterogeneous remote sensing images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, pp. 1606–1612, Apr. 2016.
[145] M. Wei, T. W. S. Chow, and R. H. M. Chan, "Heterogeneous feature subset selection using mutual information-based feature transformation," Neurocomputing, vol. 168, pp. 706–718, Nov. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S092523121500733X
[146] Y. Motai, N. A. Siddique, and H. Yoshida, "Heterogeneous data analysis: Online learning for medical-image-based diagnosis," Pattern Recognit., vol. 63, pp. 612–624, Mar. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320316302977
[147] S. Pölsterl, S. Conjeti, N. Navab, and A. Katouzian, "Survival analysis for high-dimensional, heterogeneous medical data: Exploring feature extraction as an alternative to feature selection," Artif. Intell. Med., vol. 72, pp. 1–11, Sep. 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0933365716300653
[148] T. W. S. Chow and D. Huang, "Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 213–224, Jan. 2005.
[149] J. Neumann, C. Schnörr, and G. Steidl, "Combined SVM-based feature selection and classification," Mach. Learn., vol. 61, nos. 1–3, pp. 129–150, 2005.
[150] T. W. S. Chow, P. Wang, and E. W. M. Ma, "A new feature selection scheme using a data distribution factor for unsupervised nominal data," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 499–509, Apr. 2008.
[151] N. Zhong, J. Dong, and S. Ohsuga, "Using rough sets with heuristics for feature selection," J. Intell. Inf. Syst., vol. 16, no. 3, pp. 199–214, 2001.
[152] Z. Pawlak and A. Skowron, "Rough sets: Some extensions," Inf. Sci., vol. 177, no. 1, pp. 28–40, 2007.
[153] R. Slowinski and D. Vanderpooten, "A generalized definition of rough approximations based on similarity," IEEE Trans. Knowl. Data Eng., vol. 12, no. 2, pp. 331–336, Mar. 2000.
[154] M.-G. Park and K.-J. Yoon, "Optimal key-frame selection for video-based structure-from-motion," Electron. Lett., vol. 47, no. 25, pp. 1367–1369, Dec. 2011.
[155] R. M. Balabin and S. V. Smirnov, "Variable selection in near-infrared spectroscopy: Benchmarking of feature selection methods on biodiesel data," Analytica Chim. Acta, vol. 692, nos. 1–2, pp. 63–72, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0003267011003539
[156] S.-P. Chen, "Time value of delays in unreliable production systems with mixed uncertainties of fuzziness and randomness," Eur. J. Oper. Res., vol. 255, no. 3, pp. 834–844, 2016.
[157] X.-P. Xie, D. Yue, and S.-L. Hu, "Fuzzy control design of nonlinear systems under unreliable communication links: A systematic homogenous polynomial approach," Inf. Sci., vols. 370–371, pp. 763–771, Nov. 2016.
[158] L. Yijing, G. Haixiang, L. Xiao, L. Yanan, and L. Jinling, "Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data," Knowl.-Based Syst., vol. 94, pp. 88–104, Feb. 2015.
[159] J. Fan, Z. Niu, Y. Liang, and Z. Zhao, "Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling," Neurocomputing, vol. 211, pp. 172–181, Oct. 2016.
[160] Z. Zhang, H. Chen, Y. Xu, J. Zhong, N. Lv, and S. Chen, "Multisensor-based real-time quality monitoring by means of feature extraction, selection and modeling for Al alloy in arc welding," Mech. Syst. Signal Process., vols. 60–61, pp. 151–165, Aug. 2015.
[161] S. Vluymans, D. S. Tarragó, Y. Saeys, C. Cornelis, and F. Herrera, "Fuzzy rough classifiers for class imbalanced multi-instance data," Pattern Recognit., vol. 53, pp. 36–45, May 2016.
[162] G. Haixiang, L. Yijing, L. Yanan, L. Jinling, and L. Xiao, "BPSO-AdaBoost-KNN ensemble learning algorithm for multi-class imbalanced data classification," Eng. Appl. Artif. Intell., vol. 49, pp. 176–193, Mar. 2016.
[163] W. C. Yeh, Y. T. Yang, and C. M. Lai, "A hybrid simplified swarm optimization method for imbalanced data feature selection," Austral. Acad. Bus. Econ. Rev., vol. 2, no. 3, pp. 263–275, 2017.
[164] W. D. Fisher, T. K. Camp, and V. V. Krzhizhanovskaya, "Anomaly detection in earth dam and levee passive seismic data using support vector machines and automatic feature selection," J. Comput. Sci., vol. 20, pp. 143–153, May 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877750316304185
[165] L. Yin, Y. Ge, K. Xiao, X. Wang, and X. Quan, "Feature selection for high-dimensional imbalanced data," Neurocomputing, vol. 105, no. 3, pp. 3–11, 2013.
[166] M. Alibeigi, S. Hashemi, and A. Hamzeh, "DBFS: An effective density based feature selection scheme for small sample size and high dimensional imbalanced data sets," Data Knowl. Eng., vols. 81–82, no. 4, pp. 67–103, 2012.
[167] U. Mahdiyah, M. I. Irawan, and E. M. Imah, "Integrating data selection and extreme learning machine for imbalanced data," Procedia Comput. Sci., vol. 59, pp. 221–229, Aug. 2015.
[168] J. A. Sáez, B. Krawczyk, and M. Woźniak, "Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets," Pattern Recognit., vol. 57, pp. 164–178, Sep. 2016.
[169] J. F. Díez-Pastor, J. J. Rodríguez, C. García-Osorio, and L. I. Kuncheva, "Random balance: Ensembles of variable priors classifiers for imbalanced data," Knowl.-Based Syst., vol. 85, pp. 96–111, Sep. 2015.
[170] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, "A novel ensemble method for classifying imbalanced data," Pattern Recognit., vol. 48, no. 5, pp. 1623–1637, 2015.
[171] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[172] H. Yu, J. Ni, and J. Zhao, "ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data," Neurocomputing, vol. 101, no. 3, pp. 309–318, 2013.
[173] A. K. Uysal, "On two-stage feature selection methods for text classification," IEEE Access, vol. 6, pp. 43233–43251, 2018.
[174] A. Onan and S. Korukoğlu, "A feature selection model based on genetic rank aggregation for text sentiment classification," J. Inf. Sci., vol. 43, no. 1, pp. 25–38, 2017.
[175] D. Agnihotri, K. Verma, and P. Tripathi, "Variable global feature selection scheme for automatic classification of text documents," Expert Syst. Appl., vol. 81, pp. 268–281, Sep. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417417302208
[176] M. Labani, P. Moradi, F. Ahmadizar, and M. Jalili, "A novel multivariate filter method for feature selection in text classification problems," Eng. Appl. Artif. Intell., vol. 70, pp. 25–37, Apr. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0952197617303172
[177] B. Parlak and A. K. Uysal, "On feature weighting and selection for medical document classification," in Developments and Advances in Intelligent Systems and Applications, Á. Rocha and L. Reis, Eds. Cham, Switzerland: Springer, 2018, pp. 269–282.
[178] T. Nakanishi, "A feature selection method for comparison of each concept in big data," in Proc. IEEE/ACIS 14th Int. Conf. Comput. Inf. Sci. (ICIS), Jun./Jul. 2015, pp. 229–234.
[179] S. Wang, D. Li, Y. Wei, and H. Li, "A feature selection method based on Fisher's discriminant ratio for text sentiment classification," in Web Information Systems and Mining, 2009, pp. 88–97.
[180] K. Kesorn and S. Poslad, "An enhanced bag-of-visual word vector space model to represent visual content in athletics images," IEEE Trans. Multimedia, vol. 14, no. 1, pp. 211–222, Feb. 2012.
[181] I. Jain, V. K. Jain, and R. Jain, "Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification," Appl. Soft Comput., vol. 62, pp. 203–215, Jan. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S156849461730577X
[182] Y. Li, T. Li, and H. Liu, "Recent advances in feature selection and its applications," Knowl. Inf. Syst., vol. 53, no. 3, pp. 551–577, Dec. 2017. [Online]. Available: https://doi.org/10.1007/s10115-017-1059-8
[183] S. Turgut, M. Dağtekin, and T. Ensari, "Microarray breast cancer data classification using machine learning methods," in Proc. Electr. Electron., Comput. Sci., Biomed. Eng. Meeting (EBBT), Apr. 2018, pp. 1–3.
[184] J. Apolloni, G. Leguizamón, and E. Alba, "Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments," Appl. Soft Comput., vol. 38, pp. 922–932, Jan. 2016.
[185] S. Tabakhi, A. Najafi, R. Ranjbar, and P. Moradi, "Gene selection for microarray data classification using a novel ant colony optimization," Neurocomputing, vol. 168, pp. 1024–1036, Nov. 2015.
[186] G. Taşkın, H. Kaya, and L. Bruzzone, "Feature selection based on high dimensional model representation for hyperspectral images," IEEE Trans. Image Process., vol. 26, no. 6, pp. 2918–2928, Jun. 2017.
[187] H. Lang, J. Zhang, X. Zhang, and J. Meng, "Ship classification in SAR image by joint feature and classifier selection," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 212–216, Feb. 2016.
[188] P. Bolourchi, H. Demirel, and S. Uysal, "Entropy-score-based feature selection for moment-based SAR image classification," Electron. Lett., vol. 54, no. 9, pp. 593–595, May 2018. [Online]. Available: https://digital-library.theiet.org/content/journals/10.1049/el.2017.4419
[189] I. Garali, M. Adel, S. Bourennane, and E. Guedj, "Histogram-based features selection and volume of interest ranking for brain PET image classification," IEEE J. Transl. Eng. Health Med., vol. 6, 2018, Art. no. 2100212.
[190] L. Zhang, Q. Zhang, B. Du, X. Huang, Y. Y. Tang, and D. Tao, "Simultaneous spectral-spatial feature selection and extraction for hyperspectral images," IEEE Trans. Cybern., vol. 48, no. 1, pp. 16–28, Jan. 2018.
[191] C. Shang and D. Barnes, "Fuzzy-rough feature selection aided support vector machines for Mars image classification," Comput. Vis. Image Understand., vol. 117, no. 3, pp. 202–213, 2013.
[192] A. Vavilin and K.-H. Jo, "Automatic context analysis for image classification and retrieval based on optimal feature subset selection," Neurocomputing, vol. 116, no. 10, pp. 201–207, 2013.
[193] C.-Y. Chang, S.-J. Chen, and M.-F. Tsai, "Application of support-vector-machine-based method for feature selection and classification of thyroid nodules in ultrasound images," Pattern Recognit., vol. 43, no. 10, pp. 3494–3506, Oct. 2010.
[194] X. Zhou, X. Gao, J. Wang, H. Yu, Z. Wang, and Z. Chi, "Eye tracking data guided feature selection for image classification," Pattern Recognit., vol. 63, pp. 56–70, Mar. 2017.
[195] Z. Shi, Z. Zou, and C. Zhang, "Real-time traffic light detection with adaptive background suppression filter," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 3, pp. 690–700, Mar. 2016.
[196] C.-C. Shen and J.-H. Yan, "High-order Hadamard-encoded transmission for tissue background suppression in ultrasound contrast imaging: Memory effect and decoding schemes," IEEE Trans. Ultrason., Ferroelectr., Freq. Control, vol. 66, no. 1, pp. 26–37, Jan. 2019.
[197] T. M. Nguyen, Q. M. J. Wu, and D. Mukherjee, "An online unsupervised feature selection and its application for background suppression," in Proc. IEEE 12th Conf. Comput. Robot Vis. (CRV), Jun. 2015, pp. 161–168.
[198] Y. Zhang et al., "Parallel processing systems for big data: A survey," Proc. IEEE, vol. 104, no. 11, pp. 2114–2136, Nov. 2016.
[199] R. Trichet and F. Bremond, "Dataset optimization for real-time pedestrian detection," IEEE Access, vol. 6, pp. 7719–7727, 2018.
[200] S. Lekha and M. Suchetha, "A novel 1-D convolution neural network with SVM architecture for real-time detection applications," IEEE Sensors J., vol. 18, no. 2, pp. 724–731, Jan. 2018.

MIAO RONG received the B.S. degree in electrical engineering and automation from the China University of Mining and Technology, Xuzhou, China, in 2014, where she is currently pursuing the Ph.D. degree in control theory and control engineering. Her main research interests include data mining and multiobjective optimization.

DUNWEI GONG received the B.S. degree in applied mathematics from the China University of Mining and Technology, Xuzhou, China, in 1992, the M.S. degree in control theory and its applications from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 1995, and the Ph.D. degree in control theory and control engineering from the China University of Mining and Technology, in 1999. He is currently a Professor and the Ph.D. Advisor of the School of Information and Electrical Engineering, China University of Mining and Technology. He is also with the School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, China. His main research interests include evolutionary computation, intelligence optimization, and data mining.

XIAOZHI GAO received the B.Sc. and M.Sc. degrees from the Harbin Institute of Technology, China, in 1993 and 1996, respectively, and the D.Sc. (Tech.) degree from the Helsinki University of Technology (now Aalto University), Finland, in 1999. He was with the Helsinki University of Technology from 2004 to 2018. He is currently with the School of Computing, University of Eastern Finland, Kuopio, Finland. He is also a Guest/Visiting Professor with the Harbin Institute of Technology, Beijing Normal University, and Shanghai Maritime University, China. He has published over 290 technical papers in refereed journals and international conferences. His current research interests are nature-inspired computing methods (e.g., neural networks, fuzzy logic, evolutionary computing, swarm intelligence, and artificial immune systems) with their applications in optimization, data mining, control, signal processing, and industrial electronics. He was an invited plenary speaker at the 2014 International Workshop on Synchro-Phasor Measurements for Smart Grid, the 2006 International Workshop on Nature Inspired Cooperative Strategies for Optimization, and the 2001 NATO Advanced Research Workshop on Systematic Organization of Information in Fuzzy Systems. He was the General Chair of the 2005 IEEE Mid-Summer Workshop on Soft Computing in Industrial Applications. He is an Associate Editor of Applied Soft Computing, the International Journal of Machine Learning and Cybernetics, Intelligent Automation and Soft Computing, the International Journal of Innovative Computing, Information and Control, and the International Journal of Swarm Intelligence and Evolutionary Computation. He also serves on the editorial boards of a few international journals.