
International Journal of Control and Automation

Vol. 12, No. 6, (2019), pp. 276-295

An Empirical Study on Methods, Metrics and Evaluation on Feature Extraction in Big Data Analytics

1*. J. Jebamalai Robinson, Research Scholar, Bharathiar University, Coimbatore
2. Dr. V. Saravanan, Dean – Computer Studies, Dr. SNS College of Arts and Science, Coimbatore

ABSTRACT:
Big data usually refers to data whose high accumulation rate and complexity make it very difficult, or even impossible, to process with traditional techniques. Big data is often described as data of many varieties, accumulated in large volumes at exceptional velocity. Often, too much information is itself a cause of inefficiency in data mining. Irrelevant attributes add noise and can degrade the accuracy of the data model, and attributes may also be redundant, measuring the same feature. Such anomalies in the data are prone to skew the logic of data mining algorithms and can adversely affect model accuracy. Data with many such attributes is difficult to process when data mining algorithms are applied. The attributes in a data model determine the dimensionality of the processing space used by a particular algorithm: the greater the dimensionality, the higher the cost of computation in algorithm design and processing. To minimize this noise and high dimensionality, dimensionality reduction techniques are required for data mining to be effective. Feature selection and feature extraction are common approaches to the problem: the former selects the most relevant attributes, while the latter combines attributes into a reduced set of features. Feature extraction is thus a process of attribute reduction. Feature selection ranks the existing attributes by their predictive significance, whereas feature extraction actually transforms the attributes. Numerous studies have proposed methods and techniques for effective feature extraction. This paper is the outcome of a study on the methods, metrics and evaluation techniques for feature extraction in Big Data analytics.
Keyterms: Big data, Velocity, Variance, Attributes, Feature Extraction and Dimensionality

1. INTRODUCTION
Big data refers to data that is available in large volumes. It can be either structured or unstructured and floods in on a daily basis; the amount of data accumulated is not in itself what matters, but rather what the organization does with it. Big data analytics supports better decisions and strategic planning in most businesses [1]. Effective use of big data helps a company outperform its competitors. In many industries, incumbents and new entrants alike use strategies derived from analyzed data to compete and innovate. Big data supports vertical growth within an organization and also gives rise to new categories of organizations that can merge and analyze industry data. Such companies may hold vast information on products and services, buyers and marketers, and consumer behavior and preferences that can be analyzed.

The term “Big Data” is comparatively new, but the practice of collecting information and storing it in large volumes for analysis is very old. The essential character of Big Data lies in the 3 Vs (Variety, Volume and Velocity) [2]. The fundamental value of big data lies in how an organization uses its data, not in how much data it accumulates. Every company uses its data in its own way, and the more efficiently an organization uses it, the greater its chances of growth. An organization can draw on data from any source and analyze it to find answers that comprehensively improve growth in terms of cost reduction, reduced time complexity, better planning, reputation management and more.

ISSN: 2005-4297 IJCA
276
Copyright ⓒ 2019 SERSC
Feature extraction is a form of dimensionality reduction through which a raw data set is reduced to manageable groups for easier processing. One characteristic of such huge data is the presence of numerous variables, which drives up the cost of computation and resource utilization [3]. Feature extraction can be defined as the method by which variables are selected and combined, transforming them into features that effectively reduce the amount of data to be processed while maintaining the accuracy and completeness of the original data. It is typically used when the resources needed for processing must be reduced while the important information is kept intact, and it also reduces the amount of redundant data in a given analysis.
The speed of learning and the quality of generalization improve when the data is reduced by combining variables, and a feature subset can be sufficient for constructing data models. Because feature selection keeps subsets of the original attributes, its chief merit is that their physical meaning remains unchanged, which improves the readability and interpretability of the model [4]. For this reason, these techniques are widely adopted in real-time applications. Removing redundant and irrelevant features greatly reduces time complexity and storage cost without loss of significant information or degradation of learning performance [5].
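To make the distinction between the two approaches concrete, the following sketch (an illustration constructed for this survey, not taken from any of the reviewed papers; the variance-based ranking and PCA choices are our own assumptions) selects a subset of the original attributes versus extracting new combined features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples, 5 attributes. Two informative attributes, one
# redundant copy of the first, and two low-variance noise attributes.
informative = rng.normal(0, 3.0, size=(100, 2))
X = np.hstack([informative,
               informative[:, :1] + 0.01 * rng.normal(size=(100, 1)),  # redundant
               rng.normal(0, 0.1, size=(100, 2))])                     # noise

# Feature selection keeps a subset of the ORIGINAL attributes, here the
# k with the largest variance; their physical meaning is preserved.
k = 2
keep = np.argsort(X.var(axis=0))[::-1][:k]

# Feature extraction TRANSFORMS the attributes into k new combined
# features, here the top principal components from an SVD of centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_new = Xc @ Vt[:k].T

print(sorted(keep.tolist()), X_new.shape)
```

Note that the selected columns are original attributes and remain interpretable, whereas the extracted components are linear combinations of all attributes.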
The recent popularity of big data presents challenges for conventional feature extraction methods; at the same time, some unique characteristics of big data open up new opportunities for feature extraction research. This paper is intended to give a lucid account of the feature extraction methods used for each data category: structured, unstructured, multi-label and multi-view data. The metrics used for performance studies and the way those metrics are evaluated are also discussed. The rest of the paper is organized as follows. Section II reviews feature extraction for structured data, unstructured data, multi-label data and multi-view data respectively. Section III presents comparative results and an evaluation of the methods discussed, and Section IV concludes the paper with open research opportunities.

SECTION II
A. Feature Extraction methods for Big unstructured data
Unstructured data may have an internal structure but lacks a pre-defined schema. It may be textual or non-textual, and generated by either humans or machines. Many organizations have turned to a variety of software solutions for extracting important information from unstructured data. The essential benefit of such tools is their ability to extract useful information that can contribute to a company's success. As the rate of data accumulation is very high, industries constantly look for better ways to handle it efficiently. The following methods from the literature handle feature extraction in unstructured big data.


1. J. Wan et al [2019] proposed a Multi-Feature Extraction (MFE) method and, on this basis, designed an MFE scheme for AI-driven big data, applied to a hot-event detection algorithm. A multi-feature clustering model based on user attention is developed in two stages. In the first stage, the MFKE model is created to evaluate keywords and is combined with term frequency, through which the keywords are extracted. In the second stage, the hot events are captured by the algorithm while forming news clusters. Different parameter settings are analyzed to find the optimal configuration. Experiments on a large corpus show a significant increase in sensitivity, specificity and F-score [6].

2. Jinrong He et al [2017] proposed DGOD (“Decision Graph Based Outlier Detection”). The method first computes a decision graph score (DGS) for every sample, where the DGS is the ratio between the discriminant distance and the local density. The samples are then ranked by their DGS values, and the top r largest are returned as outliers. Experiments on both synthetic and real-time datasets confirmed significant effectiveness in detecting outliers while preserving shape and keeping dimensionality reduced, as evidenced by low chi-square results [7].
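The score described above can be sketched as follows. This is a loose reconstruction of the density/distance idea (the exact density estimator and discriminant distance used in [7] are not specified here; the k-nearest-neighbour density is our assumption):

```python
import numpy as np

def dgod_outliers(X, r, k=5):
    """Decision-graph-style score: for each sample, the ratio of its
    discriminant distance (distance to the nearest denser sample) over
    its local density. The top-r scores are flagged as outliers."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: inverse of the mean distance to the k nearest neighbours.
    knn = np.sort(D, axis=1)[:, 1:k + 1]
    density = 1.0 / (knn.mean(axis=1) + 1e-12)
    # Discriminant distance: distance to the nearest sample of higher density.
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(density > density[i])[0]
        delta[i] = D[i, higher].min() if len(higher) else D[i].max()
    dgs = delta / density            # large when a point is isolated AND sparse
    return np.argsort(dgs)[::-1][:r]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(60, 2)),       # one dense cluster
               np.array([[5.0, 5.0], [-4.0, 6.0]])])   # two planted outliers
print(sorted(dgod_outliers(X, r=2).tolist()))  # the planted points: [60, 61]
```

Points far from any dense region get both a large discriminant distance and a low density, so their ratio dominates the ranking.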

3. Jundong Li et al [2018] provided a lucid overview of recent research in feature extraction. Motivated by present difficulties, they revisited feature extraction from a data perspective and presented the available solutions for unstructured, structured, heterogeneous and streaming types of data. To clarify the similarities and differences among the existing algorithms for unstructured data, they categorized them into four groups: similarity-based, information-theoretic, sparse-learning-based and statistics-based. Open research challenges and evaluation metrics are also discussed [8].

4. Mingkui Tan et al [2014] introduced an adaptive FSS (“Feature Scaling Scheme”) for very-high-dimensional feature extraction in big data, reformulated as a convex SIP (“Semi-Infinite Problem”). To solve it, an effective feature generation paradigm is proposed. Unlike conventional gradient-based approaches, which optimize over all input features, the proposed method iterates, activating only a group of features at a time by solving MKL (“Multiple Kernel Learning”) subproblems. To speed up training, the MKL problems are solved in the primal form through a modified, accelerated proximal gradient technique, which also led to new caching techniques. The feature generation is guaranteed to converge globally under mild conditions and can achieve lower bias. The method is intended to solve two issues: group-based feature extraction for complex structures, and nonlinear feature extraction with explicit feature mappings. Experiments on a range of synthetic and real-time data with over a million data points showed that the proposed method had a


comprehensive performance measured in terms of Throughput and Time complexity when


compared with that of the other state-of-the art algorithms [9].

5. Kui Yu et al [2016] introduced SAOLA for feature extraction. Based on a theoretical analysis of bounds on pairwise correlations among features, the method deploys a novel comparison scheme and maintains a parsimonious model over time in an online fashion. Further, to deal with features that arrive in groups, the method is extended to group-SAOLA for online feature extraction; the extended model can stay sparse at both the group and individual levels. An empirical study on benchmark data showed that both algorithms scale to data of high dimension and outperform existing solutions in terms of sensitivity, specificity and F-score [10].

6. Jun Lee et al [2017] proposed the SIE model for classifying unstructured textual data from the web into five sensation-based features: sight, hearing, touch, smell and taste. Although sensation underlies all other human experience of the environment, sensational information is usually neglected, owing to the scarcity of sensory expressions and knowledge compared with sentiment analysis or opinion mining. First, a sensation measurement is assigned to every feature; then the model identifies which measurement is assigned to which feature; finally, the sensational feature with the strongest influence on human perceptual experience is identified. Evaluation is done with metrics such as correlation and entropy [11].

7. Thee Zin Win et al [2018] introduced MIM (“Mutual Information Measure”) based feature extraction for removing irrelevant and redundant features. Massive data demands efficient and effective mining techniques, and researchers are developing scalable algorithms that reduce mountains of accumulated data to nuggets. Memory usage and computational cost grow with data dimensionality, so reducing the dimension of the data improves performance considerably. The proposed methodology outperforms conventional algorithms, which can eliminate only irrelevant features but not redundant ones, as measured by entropy and correlation [12].
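As a minimal sketch of the measure behind this approach (the estimator below is the standard empirical plug-in mutual information; the thresholds and exact estimator used in [12] are not reproduced here), mutual information with the label exposes irrelevant features, while mutual information between features exposes redundant ones:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in bits for two discrete vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # class label
relevant  = y ^ (rng.random(1000) < 0.1)     # mostly tracks the label
redundant = relevant.copy()                  # duplicates an existing feature
noise     = rng.integers(0, 2, size=1000)    # irrelevant

# Rank by MI with the label to drop irrelevant features, then drop features
# whose MI with an already-kept feature is high (redundant).
print(mutual_information(relevant, y) > mutual_information(noise, y))   # True
print(mutual_information(redundant, relevant))  # near 1 bit: flags redundancy
```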
The table 2.1 gives an overview of the various methods and the metrics used for the Feature
selection of Unstructured Big data.

Sno  Method  Metrics
1    MFKE    Sensitivity, Specificity and F-score
2    FSA     Chi-square
3    DGOD    Correlation and Entropy
4    SIP     Throughput and Time complexity
5    SAOLA   Sensitivity, Specificity and F-score
6    SIE     Correlation and Entropy
7    MIM     Chi-square

Table 2.1 – Methods and Metrics for Feature Selection in Unstructured data

B. Feature Extraction methods for Big structured data

Structured data adheres to a pre-defined model and is therefore straightforward to analyze. It is normally tabular, with inter-relations among rows and columns, and is considered the conventional form of data storage, since earlier DBMSs could store, process and access only structured data. This section describes the various feature extraction methods for structured big data.

1. Tang et al [2015] studied the novel problem of unsupervised feature extraction for social media big data. They analyze how social media data differs from traditional attribute-value data and how hidden relationships extracted from linked data can help feature extraction, proposing LUFS for linked social media data. Systematic experiments on real-time data from social media sites show that the proposed framework is effective in terms of correlation and entropy when compared with other methods [13].

2. Huan Liu et al [2016] proposed a unified platform as an intermediary, together with an example demonstrating how the available feature extraction methods can be combined under a meta-algorithm that subsumes the individual methods. This lets a user deploy a suitable technique without detailed knowledge of each algorithm. A few real-time applications are included to demonstrate feature extraction in the mining process. The proposed method showed significant results in terms of sensitivity, specificity and F-score, and challenges and trends in feature selection are also discussed [14].

3. Lianzhi Li et al [2019] proposed a method for building an evolution model of educational service quality (EEM) in colleges and higher-education institutions, oriented to college resources on a big data platform. Experiments verify the model on evolving data from a 360-degree encyclopedia of the educational domain, and the analysis shows that the model can efficiently evaluate education quality in higher-education institutions. The metrics used for the experimental analysis include sensitivity, specificity and F-score [15].

4. Mehmet Burak Çatalkaya et al [2017] presented a software architecture that identifies features with high estimation power. The software is presented as a prototype, and the methods, techniques and algorithms used to develop it are explained. The prototype is applied to banking data and the results analyzed: information gain, chi-square and information value, when used together, yield better results, which is confirmed by the low chi-square value obtained and by a low logarithmic loss in the experiments [16].
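One of the scoring methods combined above can be sketched as follows. This is the generic chi-square feature score computed from an observed-vs-expected contingency table (the prototype's actual pipeline combining information gain and information value is not reproduced here):

```python
import numpy as np

def chi_square_score(feature, label):
    """Chi-square statistic between a discrete feature and a class label:
    sum of (observed - expected)^2 / expected over the contingency table.
    Larger scores indicate stronger feature-label association."""
    f_vals, f_idx = np.unique(feature, return_inverse=True)
    l_vals, l_idx = np.unique(label, return_inverse=True)
    observed = np.zeros((len(f_vals), len(l_vals)))
    np.add.at(observed, (f_idx, l_idx), 1)
    expected = observed.sum(1, keepdims=True) * observed.sum(0) / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=500)
predictive  = np.where(rng.random(500) < 0.8, label, 1 - label)  # tracks label
random_feat = rng.integers(0, 2, size=500)                       # independent

# A predictive feature scores far higher than an independent one.
print(chi_square_score(predictive, label) > chi_square_score(random_feat, label))
```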


5. J. A. Sáez et al [2019] introduced a novel feature extraction technique for unsupervised learning, where the difficulty arises because class labels cannot be used to choose the most discriminative features as is customary in supervised learning. The proposed system, Kolmogorov-Smirnov test based Unsupervised Feature Extraction (KSUFS), relies on computing estimated feature distributions that are then compared with the original ones using non-parametric statistical tests, so as to select the most representative input variables. Two versions of KSUFS are presented in the study: one designed for standard data and the other for big data problems. KSUFS is compared with other state-of-the-art unsupervised feature selection systems in a thorough experimental study covering both standard and big data problems. The results show that the proposed technique can beat the rest of the reference unsupervised FE strategies in terms of throughput and time complexity [17].

6. Sergio Ramírez-Gallego et al [2018] introduced an approach in which feature extraction methods are parallelized on big data platforms such as Apache Spark to boost performance and accuracy. A distributed framework over conventional feature selection, incorporating a variety of well-known measures, is developed. Experiments on broad, high-dimensional datasets as well as data with large sample counts show that the proposed method outperforms the sequential versions in terms of throughput and time complexity [18].

7. Ke Gong et al [2018] proposed a rationale-type feature extraction method with O(|U|) time complexity, growing linearly with the number of instances. Experiments used very-high-dimensional, large-scale data from the UCI repository, with nearly three million features. The results show that the proposed model is efficient and effective in terms of chi-square value and logarithmic loss. Moreover, the technique is also suitable for feature extraction on large-scale, gigantic-dimensional data that is very difficult to process with conventional methods [19].

8. Makoto Yamada et al [2018] proposed MPF (“Maximally Predictive Features”), which yields minimally redundant, highly predictive and more interpretable features. Its effectiveness is demonstrated by classifying phenotypes based on gene expression in patients diagnosed with prostate cancer and by detecting enzymes from protein structures. High accuracy is achieved while extracting only 20 features out of one million, a dimensionality reduction of 99.9 percent, measured by correlation and entropy. The algorithm is also made flexible enough to be applied on cloud platforms and can serve as a strong foundation for sophisticated predictive models in health-care applications [20].


The table 2.2 depicts the methods used and the metrics considered for feature selection in structured big data.

Sno  Method              Metrics
1    LUFS                Correlation and Entropy
2    Meta-Algorithm      Sensitivity, Specificity and F-score
3    EEM                 Sensitivity, Specificity and F-score
4    Chi-Square method   Chi-square value, Logarithmic loss
5    KSUFS               Throughput, Time complexity
6    FS-Apache Spark     Throughput, Time complexity
7    O(|U|) method       Chi-square value, Logarithmic loss
8    MPF                 Correlation and Entropy

Table 2.2 Method and Metrics for feature selection in Structured Big data

C. Feature Extraction methods for Big Multi-label data

Multi-label data assigns every sample a set of pre-defined labels. These labels are predictive and not mutually exclusive, for example the several topics relevant to a single document: the text may be on any subject and from various domains, yet it carries multiple labels at once. The available feature extraction methods for multi-label data are explained as follows.

1. Konstantinos Sechidis et al [2011] investigated the problem of stratification in multi-label data. The work considered a couple of stratification methods for multi-label data and compared them empirically with random sampling on numerous datasets under defined evaluation conditions. The results show interesting patterns in the utility of each method for certain types of multi-label data, evaluated using the entropy, correlation and RMSE metrics [21].

2. Conor Fahy et al [2019] proposed a dynamic feature mask for clustering very-high-dimensional data streams. Redundant features are masked, and clustering is performed only on the unmasked, relevant features; the masks are updated whenever the importance of a feature changes. The method is independent of the algorithm considered and can be applied to any density-based clustering technique that lacks a mechanism for handling feature drift. Evaluation on four high-dimensional data streams proved efficient in terms of sensitivity, specificity, RMSE and F-score, with improved cluster quality and reduced time complexity [22].
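The masking idea can be sketched as follows. This is a toy reconstruction: the sliding-window variance used as the importance estimate here is our own assumption, not the measure used in [22]:

```python
import numpy as np

class DynamicFeatureMask:
    """Maintain a running per-feature importance estimate over a sliding
    window and expose only the top-k features to a downstream clusterer,
    re-masking as importance drifts."""
    def __init__(self, k, window=50):
        self.k, self.window = k, window
        self.buffer = []
        self.mask = None

    def update(self, x):
        self.buffer.append(x)
        if len(self.buffer) > self.window:
            self.buffer.pop(0)                          # slide the window
        importance = np.var(self.buffer, axis=0)        # re-estimate importance
        self.mask = np.argsort(importance)[::-1][:self.k]  # adapt to drift
        return x[self.mask]                             # masked sample

rng = np.random.default_rng(0)
fm = DynamicFeatureMask(k=3)
# Stream: features 0-2 vary widely; features 3-9 are almost constant and
# therefore irrelevant to the cluster structure.
for _ in range(60):
    x = np.concatenate([rng.normal(0, 5.0, 3), rng.normal(0, 0.01, 7)])
    reduced = fm.update(x)
print(sorted(fm.mask.tolist()))  # the high-variance features: [0, 1, 2]
```

Because the mask is recomputed on every update, a feature whose variability changes mid-stream is picked up or dropped automatically.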

3. Wenjun Ke et al [2018] introduced the SCF method (“Score based Criteria for Fusion Feature Extraction”) for cancer prediction, aimed at improving the quality of the classification model. The method is evaluated on five large micro-array datasets and three very-low-dimensional datasets and shows a significant improvement in metrics such as Hamming loss and logarithmic loss. The experiments also show that, unlike the competing techniques, the proposed method can identify discriminative features, and it can be used efficiently as a pre-processing step in combination with other techniques [23].

4. Jun Huang et al [2018] introduced JFSC, a novel method performing joint feature extraction and classification. It learns both shared and label-specific features by considering pairwise label correlations, and the multi-label classifier is then built on the learned low-dimensional data in parallel. Its performance, in terms of throughput and time complexity, was found to be better than that of the state-of-the-art algorithms for multi-label data in the literature [24].
5. J. González-López et al [2019] proposed a distributed MI (Mutual Information) model for multi-label data using Apache Spark. Two approaches are proposed: MI maximization, and minimum redundancy with maximum relevance; the first selects a feature subset, while the second additionally reduces redundancy between the selected features. Experiments compare the distributed multi-label model on 10 different datasets, validated using chi-square and correlation values; MIM performs well and reduces time complexity by orders of magnitude [25].
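The second approach, minimum redundancy with maximum relevance, can be illustrated with a generic greedy mRMR sketch (this is the textbook formulation, not the distributed Spark implementation of [25]):

```python
import numpy as np

def mi(x, y):
    """Empirical mutual information (bits) between two discrete vectors."""
    xv, xi = np.unique(x, return_inverse=True)
    yv, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xv), len(yv)))
    np.add.at(joint, (xi, yi), 1)
    p = joint / joint.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return (p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum()

def mrmr(X, y, k):
    """Greedily pick the feature maximizing MI with the label minus its
    mean MI with the already-selected features (relevance - redundancy)."""
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            red = np.mean([mi(X[:, j], X[:, s]) for s in selected]) if selected else 0.0
            score = mi(X[:, j], y) - red
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
y  = rng.integers(0, 2, 800)
f0 = np.where(rng.random(800) < 0.9, y, 1 - y)    # strongly relevant
f1 = f0.copy()                                     # redundant duplicate of f0
f2 = np.where(rng.random(800) < 0.75, y, 1 - y)   # weakly relevant, not redundant
f3 = rng.integers(0, 2, 800)                       # irrelevant
X = np.column_stack([f0, f1, f2, f3])

# Plain MI maximization would pick the duplicate f1 second;
# mRMR penalizes the redundancy and prefers f2.
print(mrmr(X, y, 2))  # [0, 2]
```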

6. Jianhua Xu et al [2019] approximated the Moore-Penrose inverse of the kernel matrix for computing the feature space, and a kernel delta for computing the label space. The whole matrix is symmetrized within the trace operation, yielding an effective approximation and symmetrized representations. Based on orthogonal projections, maximizing a modified form of the model leads to a new eigenvalue problem whose solution gives the linear feature extraction. Experiments on 12 different datasets show that the proposed method outperforms seven existing methods in terms of entropy and correlation, and it also outperformed three other statistical tests in the same domain [26].

7. Lin Sun et al [2019] proposed a novel comprehensive evaluation function for correlation-based feature selection (NCFS). The NCFS is introduced as a fitness function into the original BPSO and an improved BPSO algorithm to advance multi-label classification in the early and later stages, respectively, and the optimization procedure terminates when the maximum number of iterations is reached. Next, the Lebesgue measure of the neighborhood class is produced to examine the neighborhood approximation precision and the dependence degree based on MNRS. Various properties are derived, and the connections among these measures are used to assess the uncertainty and correlations among the labels of multi-label data. Finally, a hybrid filter-wrapper feature selection algorithm using NCFS-BPSO is designed to first eliminate redundant features and reduce complexity, and a heuristic forward multi-label feature selection algorithm is proposed to improve multi-label classification performance. Experimental results on fifteen multi-label datasets show that the proposed algorithms are effective, in terms of sensitivity, specificity and F-score, at selecting significant features and achieving excellent classification performance in multi-label neighborhood decision systems [27].
8. A. A. Bidgoli et al [2018] proposed a many-objective optimization model that selects feature subsets for multi-label data based on four objectives: the number of features, two error measures (Hamming loss and logarithmic loss) and the time taken to extract the features. The many-objective problem is solved using a binary version of NSGA-III. Experiments on several multi-label benchmark datasets, under several many-objective assessments, show a marked improvement over its peer NSGA-II in terms of lower loss-metric values [28].

9. Jayaraman K Valadi et al [2017] introduced an efficient modification of the multi-label feature extraction methods available in the literature, consisting of two phases. In the first phase, the output label space is decomposed into smaller dimensions with the aid of simple matrix factorization; the feature extraction methods are then deployed directly in the reduced space. Simulated experiments with real-time data showed greater efficiency in terms of chi-square values [29].

10. Ali El-Zaart, Ziad Abdallah et al [2017] examined the two most significant existing methods (MIML-BOOST and MIML-SVM), whose drawback is that they do not take into consideration (a) the depiction of the basic characteristics of the image and (b) the correlations between labels. To overcome these issues, a novel algorithm (MIML-GABORLPP) is proposed that handles both limitations simultaneously: it uses a Gabor filter bank as feature descriptor to address the first, and applies Label Priority Power-set as the multi-label transformation to deal with label correlation. The experimental work shows that the results of MIML-GABORLPP are better, in terms of three assessment metrics (F-measure, F-score and RMSE), than other existing strategies [30].
The table 2.3 gives the methods and metrics used for feature extraction in multi-label big data.


Sno  Method                        Metrics
1    Random Sampling               Entropy, Correlation and RMSE
2    Dynamic Feature Mask          Sensitivity, Specificity, RMSE and F-score
3    SCF                           Hamming loss, Logarithmic loss
4    JFSC                          Throughput, Time complexity
5    MI Maximization               Chi-square, Correlation
6    Moore-Penrose inverse matrix  Entropy, Correlation
7    NCFS                          Sensitivity, Specificity and F-score
8    NSGA-III                      Hamming loss, Ranking loss
9    MLFS                          Chi-square
10   MIML-GABORLPP                 F-measure, F-score and RMSE

Table 2.3 Method and Metrics for Feature Selection in Multilabel Big data

D. Feature Extraction methods for Big Multi-View data

Multi-view data has a variety of feature sets extracted from the same raw data. Such data serves to investigate the results of multiple clusterings run against each other and across different feature sets, so as to define a notion of “freshness and interestingness”. Feature extraction on such data can surface both interesting and uninteresting results, which are useful either way. The available methods and techniques for feature extraction on multi-view data are discussed below.

1. Z. Wang et al [2015] proposed a novel feature extraction method using multi-view NMF combined with graph regularization, where the within-view relationships among the data are taken into consideration. The matrix factorization is performed by constructing a k-nearest-neighbor graph to integrate the local geometric information of each view, and a pair of update rules is applied in an iterative fashion to solve the optimization problem. The experimental results prove its effectiveness in terms of Entropy, Correlation and RMSE when compared with other methods [31].
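The graph-regularized factorization at the heart of such methods can be sketched with the standard multiplicative updates of graph-regularized NMF. The snippet below is a simplified single-view illustration on assumed toy data, not the exact multi-view algorithm of [31]:

```python
import numpy as np

def gnmf(X, A, k, lam=0.1, iters=200, seed=0):
    """Graph-regularized NMF sketch: X (m x n) ~ W (m x k) @ H (k x n),
    where the graph given by adjacency A (n x n) regularizes H.
    Illustrative single-view building block only."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    D = np.diag(A.sum(axis=1))  # degree matrix of the sample graph
    eps = 1e-9
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X + lam * H @ A) / (W.T @ W @ H + lam * H @ D + eps)
    return W, H

# Toy data: 6 samples (columns) with 5 features, fully connected toy graph.
X = np.random.default_rng(1).random((5, 6))
A = (np.ones((6, 6)) - np.eye(6)) / 5.0
W, H = gnmf(X, A, k=2)
print(np.linalg.norm(X - W @ H))  # reconstruction error after the updates
```

The multi-view method of [31] additionally couples one such factorization per view; only a single view is shown here.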

2. ZALL R et al [2016] proposed a two-view semi-supervised learning strategy called semi-supervised random correlation ensemble based on spectral clustering (SS_RCE). SS_RCE uses a multi-view technique based on spectral clustering that exploits the discriminative information in the different views to estimate the labeling of unlabeled samples. In order to enhance the discriminative power of the CCA features, the label information of both unlabeled and labeled samples is incorporated into CCA. Then, random correlation between within-class samples across views is used to extract multiple correlated features for training the component classifiers. A general model, namely SSMV_RCE, is also extended to build an ensemble technique that handles semi-supervised learning in the presence of multiple views. The proposed techniques are compared with existing multi-view feature extraction strategies using multi-view semi-supervised ensembles. Experimental results on various multi-view data sets are presented to demonstrate the adequacy of the proposed techniques as far as Sensitivity, Specificity, RMSE and F-Score are considered [32].

3. Zhiqiang Zuo et al [2014] proposed the MVMTFE framework, which handles features that are multi-view in nature, and applied it to image classification. The method learns the feature extraction matrix for every view together with the combination of view coefficients. In this way, it is not only intended to handle correlated and noisy features, but also to make use of the complementarity of the various views, which can further help in reducing the redundancy within each view. A specific algorithm is also developed for the optimization, through which each sub-problem can be solved. The experiments produced very low Hamming Loss and Logarithmic Loss, which makes the proposed method more stable than other techniques [33].

4. Hongfu Liu et al [2016] aimed to find the discriminative features in each view for better interpretation and representation, the intention being to provide a correct understanding of multi-view feature selection. Unlike existing work, which either incorrectly concatenates the features from different views or incurs huge time complexity to learn pseudo labels, they proposed a novel algorithm, Robust Multi-view Feature Selection (RMFS), which applies robust multi-view K-means to obtain robust and high-quality pseudo labels for sparse feature selection in an efficient manner. Non-trivially, the solution is derived by taking derivatives, and a K-means-like optimization is further provided to update several variables in a unified framework with a convergence guarantee. Extensive experiments on three real-world multi-view data sets illustrate the effectiveness and efficiency of RMFS in terms of Throughput and Time Complexity [34].

5. Xuan Wu et al [2019] presented a unique approach called SIMM for multi-view multi-label feature extraction. The method is intended to jointly exploit a shared subspace and view-specific information. For the shared subspace, SIMM minimizes the confusion loss and the multi-label loss, while view-specific discriminative features are also utilized. Intensive experiments carried out on real-world data clearly depict the performance improvement in terms of Chi-Square and Correlation [35].

6. Yasser Elmanzalawi et al [2018] introduced a novel multi-view feature extraction method based on CCA ("canonical correlation analysis"), which is used to extract unique features from multi-view data sets. The results obtained demonstrate that the model is effective in predicting KIRC ("kidney renal clear cell carcinoma") disease. The proposed method can also be used jointly with other methods such as CAN, RNA-Sequence and reversed-phase protein arrays. The results outperform the other models trained using a single view by achieving low values of Entropy and Correlation, and an integrated model is also brought in using the data-fusion method based on CCA feature extraction [36].


7. Wenzhang Zhuge et al [2018] introduced a unique framework, FESG, which is intended to learn both the transformation matrix and an ideal structured graph that contains the cluster information. A novel method is also proposed for extending FESG to multi-view feature extraction. The extension is named MFESG ("Multiple-view Feature Extraction with Structured Graph"), and it aims to learn the optimal weight for every view in an automated way. The experimental results show that the proposed methods are effective in terms of Sensitivity, Specificity and F-Score [37].

8. Olcay Kursun et al [2017] proposed a CCA-based technique backed by LDA for multi-view feature extraction on high-dimensional data. Canonical correlation analysis is applied to two sets of interrelated variables and the linear projections are calculated, after which the maximally correlated variates are refined using Fisher's Linear Discriminant Analysis. The results show considerable performance, as the method produced very low Hamming Loss and Ranking Loss [38].

9. Michele Volpi et al [2013] proposed an unsupervised multi-view feature extraction method that is applied before the classification process. A technique that automatically extracts blocks based on the global spectral correlation matrix is applied, and the correlation analysis is carried out with conventional kernel canonical correlation analysis. The proposed method is implemented in a multi-view setting (MVkCCA) in order to identify the projections in the data blocks. Experiments were carried out using LDA, and the proposed method shows increased appropriateness by producing a low Chi-Square value when compared to other methods [39].

Table 2.4 gives a lucid description of the methods and metrics used in the feature extraction of Multi-view big data.

Sno Method Metrics
1 Multi-view NMF Entropy, Correlation and RMSE
2 SSMV_RCE Sensitivity, Specificity, RMSE and F-Score
3 MVMTFE Hamming Loss, Logarithmic Loss
4 RMFS Throughput, Time Complexity
5 SIMM Chi-Square, Correlation
6 KIRC Entropy, Correlation
7 MFESG Sensitivity, Specificity and F-Score
8 LDA-CCA Hamming Loss, Ranking Loss
9 MVkCCA Chi-Square

Table 2.4 Methods and Metrics for the feature selection of Multi-View Big Data

Section III
A. COMPARATIVE RESULT ANALYSIS
The comparative analysis of the results obtained from the various methods, based on the metrics adopted, is given below. The metrics are grouped into sets irrespective of the data type considered in the methodologies. Set 1 is defined by S1 = {Sensitivity, Specificity, F-Score}, Set 2 by S2 = {Entropy, Correlation, RMSE}, Set 3 by S3 = {Throughput, Time Complexity}, Set 4 by S4 = {Hamming Loss, Ranking Loss, Logarithmic Loss} and Set 5 by S5 = {Chi-Square}.

Method Sensitivity Specificity F-Score
MFKE 0.83 0.78 0.85
SAOLA 0.62 0.85 0.82
META-ALGORITHM 0.94 0.96 0.91
EEM 0.81 0.72 0.82
DFM 0.94 0.69 0.76
NCFS 0.86 0.81 0.80
SSMV_RCE 0.74 0.90 0.79
MFESG 0.89 0.79 0.75

Table 3.1 Experimental results based on metric set 1
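For reference, the metrics of Set 1 follow directly from the confusion-matrix counts; a minimal Python sketch on a hypothetical binary example:

```python
def set1_metrics(y_true, y_pred):
    """Sensitivity (recall), Specificity and F-Score for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)          # true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f_score

# Hypothetical toy predictions: all three metrics come out to 2/3 here.
sens, spec, f1 = set1_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(sens, spec, f1)
```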

Figure 3.1 shows the graphical representation of Table 3.1, and it is identified from the graph that the Meta-Algorithm obtains the highest values for the metrics considered.

Figure 3.1 – Experimental Results based on Metric Set1


Table 3.2 shows the overview of the results obtained based on the metrics set2.

Method Entropy Correlation RMSE
DGOD 0.74 0.84 0.89
SIE 0.98 0.92 0.57
LUFS 0.65 0.76 0.74
MPF 0.87 0.94 0.95
RS 0.80 0.72 0.58
MOORE 0.74 0.85 0.78
NMF 0.96 0.68 0.84
KIRC 0.61 0.79 0.92

Table 3.2 Experimental results based on metric set 2

It is obvious from the table that the NMF method has a high Entropy value, which makes it less stable as the number of records in the data increases. The MPF method has a high Correlation, which makes it superior in handling dimensionality reduction, although its RMSE value is also high, indicating a larger prediction error. Figure 3.2 gives the graphical representation of the results in Table 3.2.
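For reference, the Set 2 metrics can be computed as follows (a minimal sketch on hypothetical values):

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error between two numeric sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def entropy(probs):
    """Shannon entropy (bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3)
print(pearson([1, 2, 3], [2, 4, 6]))           # 1.0 (perfectly correlated)
print(entropy([0.5, 0.5]))                     # 1.0 bit
```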

Figure 3.2 – Experimental Results based on Metric Set 2


Table 3.3 gives the empirical analysis of the results obtained based on Metric Set 3. The throughput and time complexity are measured in milliseconds. From the table it is evident that RMFS has the highest throughput and the lowest time complexity.

Method Throughput (ms) Time Complexity (ms)
SIP 658 451
KUFS 542 393
FS_APACHE 482 308
JFSC 854 691
RMFS 901 219

Table 3.3 Experimental results based on Metric Set 3

Figure 3.3 – Experimental Result based on Metric Set 3


Figure 3.3 gives the graphical representation of the results based on Metric Set 3, from the values listed in Table 3.3. Table 3.4 lists the experimental results based on Metric Set 4. The null values in the table indicate that the method did not consider the particular metric.

Method Hamming Loss Ranking Loss Logarithmic Loss
Chi-Square - - 0.265
O(|U|) Method - - 0.542
SCF 0.218 - 0.368
NGSA III 0.658 0.451 -
MVMTFE 0.521 - 0.129
LDA-CCA 0.458 0.468 -

Table 3.4 Experimental Results based on Metric Set 4


Figure 3.4 – Experimental results based on Metric Set 4


The graphical representation of the results shows that the SCF method has the least Hamming Loss of 0.218, NGSA III has the least Ranking Loss, and the lowest Logarithmic Loss is maintained by the MVMTFE method. Table 3.5 depicts the results obtained based on Metric Set 5.
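The Set 4 losses for multi-label predictions can be sketched as follows (hypothetical toy labels and scores; libraries such as scikit-learn provide equivalent `hamming_loss` and `label_ranking_loss` functions):

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of label positions predicted incorrectly (multi-label)."""
    wrong = sum(t != p
                for row_t, row_p in zip(Y_true, Y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / (len(Y_true) * len(Y_true[0]))

def ranking_loss(Y_true, scores):
    """Average fraction of (relevant, irrelevant) label pairs per sample
    that the predicted scores order wrongly."""
    total = 0.0
    for truth, s in zip(Y_true, scores):
        rel = [i for i, t in enumerate(truth) if t == 1]
        irr = [i for i, t in enumerate(truth) if t == 0]
        bad = sum(1 for i in rel for j in irr if s[i] <= s[j])
        total += bad / (len(rel) * len(irr))
    return total / len(Y_true)

# Hypothetical 2-sample, 3-label example.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 1]]
scores = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.4]]

print(hamming_loss(Y_true, Y_pred))  # 2 wrong positions out of 6 -> 0.333...
print(ranking_loss(Y_true, scores))  # 0.25 for this toy example
```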

Method Chi-Square
FSA 0.34
MIM 0.27
Chi-Square 0.62
O(|U|) Method 0.54
MI-Max 0.38
MLFS 0.23
SIMM 0.48
MVkCCA 0.67

Table 3.5 Experimental Results based on Metric Set 5
Figure 3.5 represents the graphical notation of the results tabulated in Table 3.5. From the figure it is understood that the MLFS method has the lowest Chi-Square value (0.23), which implies that it has the best data fitness when compared to all the other methods.
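The Chi-Square metric of Set 5 can be read as a goodness-of-fit score; a minimal sketch of Pearson's statistic on hypothetical counts:

```python
def chi_square(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical feature/class contingency counts vs. the counts expected
# under independence; a smaller statistic means a better fit.
stat = chi_square([30, 20, 10, 40], [25, 25, 25, 25])
print(stat)  # 20.0
```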


Figure 3.5 – Experimental Results based on Metric Set 5

SECTION IV

CONCLUSION
Big data analytics is one of the most needed and active research areas of the information era. The exponential growth in data accumulation makes it highly necessary to have a suitable technology to analyze the data and to bring out patterns and hidden knowledge for the business and strategic growth of any organization. One of the important and critical issues faced during analysis is the dimensionality of the large data. Feature selection and extraction are the two methods through which this issue is addressed. This paper is intended to produce a clear description of the available methods for feature extraction in Big Data, the metrics that were used, and a comparison of their results. The metrics used were segregated into five different sets and a cumulative representation was made through facts and figures. From the study, it is clear that there is still a large research gap and that there is more scope for improvement in the Feature Extraction domain of Big Data.

REFERENCES
1. M. Viceconti, P. Hunter and R. Hose, "Big Data, Big Knowledge: Big Data for Personalized
Healthcare," in IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 4, pp. 1209-1215,
July 2015
2. Elgendy, Nada & Elragal, Ahmed. (2014). Big Data Analytics: A Literature Review Paper. Lecture Notes in Computer Science. 8557. 214-227. 10.1007/978-3-319-08976-8_16
3. Kong, X., Chang, J., Niu, M. et al. Int J Adv Manuf Technol (2018) 99: 1101. https://doi.org/10.1007/s00170-016-9864-x


4. Li, Jundong & Liu, Huan. (2016). Challenges of Feature Selection for Big Data Analytics. IEEE Intelligent Systems. 32. 10.1109/MIS.2017.38
5. Li, Jundong & Cheng, Kewei & Wang, Suhang & Morstatter, Fred & Trevino, Robert & Tang, Jiliang & Liu, Huan. (2016). Feature Selection: A Data Perspective. ACM Computing Surveys. 50. 10.1145/3136625
6. J. Wan, P. Zheng, H. Si, N. N. Xiong, W. Zhang and A. V. Vasilakos, "An Artificial Intelligence
Driven Multi-Feature Extraction Scheme for Big Data Detection," in IEEE Access, vol. 7, pp. 80122-
80132, 2019
7. J. He, N. Xiong, "An effective information detection method for social big data", Multimedia Tools
Appl., vol. 77, pp. 11277-11305, 2018
8. Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and
Huan Liu. 2017. Feature Selection: A Data Perspective. ACM Comput. Surv. 50, 6, Article 94
(December 2017)
9. Mingkui Tan, Ivor W. Tsang, and Li Wang. 2014. Towards ultrahigh dimensional feature selection
for big data. J. Mach. Learn. Res. 15, 1 (January 2014), 1371-1429.
10. Kui Yu, Xindong Wu, Wei Ding, and Jian Pei. 2016. Scalable and Accurate Online Feature Selection
for Big Data. ACM Trans. Knowl. Discov. Data 11, 2, Article 16 (December 2016
11. Jun Lee, Kyoung-Sook Kim, YongJin Kwon, and Hirotaka Ogawa. 2017. Understanding human
perceptual experience in unstructured data on the web. In Proceedings of the International Conference
on Web Intelligence (WI '17). ACM, New York, NY, USA, 491-498
12. Thee Zin Win and Nang Saing Moon Kham. 2018. Mutual Information-based Feature Selection
Approach to Reduce High Dimension of Big Data. In Proceedings of the 2018 International
Conference on Machine Learning and Machine Intelligence (MLMI2018). ACM, New York, NY,
USA, 3-7
13. Tang, Jiliang & Liu, Huan. (2014). An Unsupervised Feature Selection Framework for Social Media Data. IEEE Transactions on Knowledge and Data Engineering. 26. 2914-2927.
14. Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE
Transactions on Knowledge & Data Engineering. Apr 1(4):491--502, (2015).
15. L. Li, "Evaluation Model of Education Service Quality Satisfaction in Colleges and Universities
Dependent on Classification Attribute Big Data Feature Selection Algorithm," 2019 International
Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), Changsha, China, 2019,
pp. 645-649
16. M. B. Çatalkaya, O. Kalipsiz, M. S. Aktas and U. O. Turgut, "Data Feature Selection Methods on
Distributed Big Data Processing Platforms," 2018 3rd International Conference on Computer Science
and Engineering (UBMK), Sarajevo, 2018, pp. 133-138
17. J. A. Sáez and E. Corchado, "KSUFS: A Novel Unsupervised Feature Selection Method Based on
Statistical Tests for Standard and Big Data Problems," in IEEE Access, vol. 7, pp. 99754-99770, 2019
18. S. Ramírez-Gallego et al., "An Information Theory-Based Feature Selection Framework for Big Data
Under Apache Spark," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no.
9, pp. 1441-1453, Sept. 2018
19. K. Gong, Y. Wang, M. Xu and Z. Xiao, "BSSReduce an $O(\left|U\right|)$ Incremental Feature
Selection Approach for Large-Scale and High-Dimensional Data," in IEEE Transactions on Fuzzy
Systems, vol. 26, no. 6, pp. 3356-3367, Dec. 2018


20. M. Yamada et al., "Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data," in
IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 7, pp. 1352-1365, 1 July 2018
21. Sechidis K., Tsoumakas G., Vlahavas I. (2011) On the Stratification of Multi-label Data. In:
Gunopulos D., Hofmann T., Malerba D., Vazirgiannis M. (eds) Machine Learning and Knowledge
Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913.
Springer, Berlin, Heidelberg
22. Fahy and S. Yang, "Dynamic Feature Selection for Clustering High Dimensional Data Streams," in
IEEE Access, vol. 7, pp. 127128-127140, 2019
23. W. Ke, C. Wu, Y. Wu and N. N. Xiong, "A New Filter Feature Selection Based on Criteria Fusion for
Gene Microarray Data," in IEEE Access, vol. 6, pp. 61065-61076, 2018
24. J. Huang, G. Li, Q. Huang and X. Wu, "Joint Feature Selection and Classification for Multilabel
Learning," in IEEE Transactions on Cybernetics, vol. 48, no. 3, pp. 876-889, March 2018
25. J. González-López, S. Ventura and A. Cano, "Distributed Selection of Continuous Features in
Multilabel Classification Using Mutual Information," in IEEE Transactions on Neural Networks and
Learning Systems,2019
26. J. Xu and Z. Mao, "Multilabel Feature Extraction Algorithm via Maximizing Approximated and
Symmetrized Normalized Cross-Covariance Operator," in IEEE Transactions on Cybernetics,2019
27. L. Sun, T. Yin, W. Ding and J. Xu, "Hybrid Multilabel Feature Selection Using BPSO and
Neighborhood Rough Sets for Multilabel Neighborhood Decision Systems," in IEEE Access, vol. 7,
pp. 175793-175815, 2019
28. A. A. Bidgoli, H. Ebrahimpour-Komleh and S. Rahnamayan, "A Many-objective Feature Selection
Algorithm for Multi-label Classification Based on Computational Complexity of Features," 2019
29. J. K. Valadi, P. T. Ovhal and K. J. Rathore, "A Simple Method of Solution For Multi-label Feature
Selection," 2019 IEEE International Conference on Electrical, Computer and Communication
Technologies (ICECCT), Coimbatore, India, 2019, pp. 1-4
30. E. Ziad Abdallah and M. Oueidat, "An Improved Framework For Image Multi-label Classification Using Gabor Feature Extraction," 2017 International Conference on Computer and Applications (ICCA), Doha, 2017, pp. 151-157
31. Z. Wang, X. Kong, H. Fu, M. Li and Y. Zhang, "Feature extraction via multi-view non-negative
matrix factorization with local graph regularization," 2015 IEEE International Conference on Image
Processing (ICIP), Quebec City, QC, 2015, pp. 3500-3504
32. ZALL R., KEYVANPOUR, "Semi-Supervised Multi-View Ensemble Learning Based On Extracting Cross-View Correlation", Advances in Electrical and Computer Engineering, Volume 16, Issue 2, Year 2016, On page(s): 111-124
33. Zhiqiang Zuo, Yong Luo, Dacheng Tao, and Chao Xu. 2014. Multi-view Multi-task Feature Extraction for Web Image Classification. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 1137-114
34. Hongfu Liu, Haiyi Mao and Yun Fu,"Robust Multi-View Feature Selection ",2016 IEEE 16th
International Conference on Data Mining
35. Xuan Wu, Qing-Guo Chen, Yao Hu, Dengbao Wang, Xiaodong Chang, Multi-View Multi-Label Learning with View-Specific Information Extraction, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)


36. Yasser Elmanzalawi, CCA based multi-view feature selection for multiomics data integration, 2018 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB 2018
37. W. Zhuge, F. Nie, C. Hou and D. Yi, "Unsupervised Single and Multiple Views Feature Extraction
with Structured Graph," in IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10,
pp. 2347-2359, 1 Oct. 2017
38. Kursun O., Alpaydin E. (2010) Canonical Correlation Analysis for MultiviewSemisupervised Feature
Extraction. In: Rutkowski L., Scherer R., Tadeusiewicz R., Zadeh L.A., Zurada J.M. (eds) Artificial
Intelligence and Soft Computing. ICAISC 2010. Lecture Notes in Computer Science, vol 6113.
Springer, Berlin, Heidelberg
39. Michele Volpi, Giona Matasci, Mikhail Kanevski, Devis Tuia, Multi-view feature extraction for hyperspectral image classification, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges (Belgium), 24-26 April 2013
