Expert Systems With Applications 57 (2016) 232–247
journal homepage: www.elsevier.com/locate/eswa
Ensemble of keyword extraction methods and classifiers in text classification
Aytuğ Onan a,∗, Serdar Korukoğlu b, Hasan Bulut b
a Celal Bayar University, Department of Computer Engineering, 45140 Muradiye, Manisa, Turkey
b Ege University, Department of Computer Engineering, 35100 Bornova, Izmir, Turkey
Article info

Article history:
Received 4 January 2016
Revised 22 March 2016
Accepted 26 March 2016
Available online 29 March 2016

Keywords:
Keyword extraction
Text classification
Ensemble learning
Scientific text classification

Abstract

Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain challenged by a high-dimensional feature space. Hence, extracting the most important/relevant words about the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). The study presents a comprehensive comparison of base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting). To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that the Bagging ensemble of Random Forest with the most-frequent based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most frequent based keyword extraction method and the Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
© 2016 Elsevier Ltd. All rights reserved.
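To make the idea of a keyword-based document representation concrete, the following minimal Python sketch illustrates one possible reading of the most frequent based keyword extraction described in the abstract. It is an illustration under stated assumptions, not the implementation evaluated in this study: the function names `most_frequent_keywords` and `keyword_features`, the tiny stop-word list, the tokenization rule and the binary presence weighting are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; not the list used in the study).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "for", "on", "with"}


def most_frequent_keywords(text, k=5):
    """Return the k most frequent non-stop-word tokens of a single document."""
    # Lowercase the text and keep alphabetic runs only (a simplifying assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Sort by descending count, then alphabetically, so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def keyword_features(documents, k=5):
    """Represent each document as a binary vector over the union of all keywords."""
    per_doc = [set(most_frequent_keywords(doc, k)) for doc in documents]
    vocabulary = sorted(set().union(*per_doc)) if per_doc else []
    # Binary presence weighting is an assumption made for this sketch.
    vectors = [[1 if word in keywords else 0 for word in vocabulary]
               for keywords in per_doc]
    return vocabulary, vectors
```

With k keywords per document, the feature space is bounded by k times the number of documents rather than by the full vocabulary, which is the dimensionality reduction the abstract refers to. The resulting vectors could then be passed to any base learner or ensemble method.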
∗ Corresponding author. Tel.: +90 232 3887221, +90 544 810 70 80; fax: +90 232 3399405.
E-mail addresses: [email protected], [email protected] (A. Onan), [email protected] (S. Korukoğlu), [email protected] (H. Bulut).
http://dx.doi.org/10.1016/j.eswa.2016.03.045
0957-4174/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Automatic keyword extraction is the process of identifying key terms, key phrases, key segments or keywords from a document that can appropriately represent the subject of the document (Beliga, Mestrovic, & Martincic-Ipsic, 2015). The Web is a very rich source of information which is progressively expanding. Hence, the number of digital documents available has been progressively expanding and manual keyword extraction can be an infeasible task. Keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Since keyword extraction provides a compact representation of the document, many applications, such as automatic indexing, automatic summarization, automatic classification, automatic clustering, and automatic filtering can benefit from the keyword extraction process (Zhang et al., 2008).
The automatic keyword generation process can be broadly divided into two categories: keyword assignment and keyword extraction (Siddiqi & Sharan, 2015). In keyword assignment, a set of possible keywords is selected from a controlled vocabulary of words, whereas keyword extraction identifies the most relevant words
available in the examined document (Beliga et al., 2015). Keyword extraction methods can be broadly grouped into four categories: statistical approaches, linguistic approaches, machine learning approaches and other approaches (Han & Kamber, 2006).
Text classification is an important subfield of text mining which assigns a text document into one or more predefined classes or categories. Several forms of text collections, such as news articles, digital libraries and Web pages are important sources of information (Han & Kamber, 2006). Hence, text classification is an important research direction in library science, information science and computer science (Jain, Raghuvanshi, & Shrivastava, 2012). Many applications of text mining can be modelled as a text classification problem. These applications include news filtering and organization, document organization and retrieval, opinion mining (sentiment analysis), and spam filtering (Aggarwal & Zhai, 2012).
A high-dimensional feature space is a typical challenge of text classification applications (Joachims, 2002). When all the words of the training documents are used as the features, the text classification process becomes a computationally intensive task (Onan & Korukoğlu, 2015). Hence, keywords of a text collection, which are the most important/relevant words about the content of the documents, can be good candidates to select as features in classification model construction (Liu & Wang, 2007; Rossi, Maracini, & Rezende, 2014). Machine learning algorithms, such as Naïve Bayes, k-nearest neighbour algorithm, support vector machines and artificial neural networks, have been successfully applied in classifying text documents (Sebastiani, 2002). Ensemble methods combine the decisions of a set of learning algorithms so that a more robust classification model with higher predictive performance can be built (Dietterich, 2000).
Considering these issues, this paper examines the effectiveness of statistical keyword extraction methods, base learning algorithms and ensemble methods in scientific text document classification. To the best of our knowledge, this is the first attempt that empirically evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. In the comparative evaluation, five popular ensemble methods (Boosting, Bagging, Dagging, Random Subspace and Voting) are utilized. The Naïve Bayes algorithm, support vector machines, logistic regression and the Random Forest algorithm are utilized as the base learning algorithms. In the experimental analysis, the domain-independent statistical keyword extraction framework proposed by Rossi et al. (2014) is utilized. In summary, the experimental study aims to answer the following research questions:

(1) Which configuration of statistical keyword extraction, classification and ensemble learning algorithms yields the highest performance in scientific text document classification?
(2) Is there an optimal number of keywords to represent the text documents and which number of keywords obtains promising results?

To the best of our knowledge, this is the first extensive empirical analysis which examines the predictive performance of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The presented classification scheme, which integrates the Bagging ensemble of Random Forest with the most-frequent based keyword extraction method, yields very promising results on scientific text classification. The rest of this paper is organized as follows. Section 2 briefly reviews the literature on keyword extraction and ensemble methods. Section 3 presents the statistical keyword extraction methods utilized in the experimental evaluations. Section 4 briefly describes the classification algorithms and Section 5 describes the ensemble learning methods. Section 6 presents the experimental results, discussion and statistical analysis of empirical results on the ACM document collection. Section 7 presents the results of ensemble classification schemes on a larger text document collection. Finally, Section 8 presents the concluding remarks.

2. Literature review

This section briefly reviews the literature on keyword extraction methods and the ensemble methods.

2.1. Related work on keyword extraction

In statistical keyword extraction methods, statistical measures, such as n-gram statistics, word frequency and the TF-IDF measure are utilized to identify keywords. The statistical keyword extraction methods can be domain-independent and do not require training data (Beliga et al., 2015). Matsuo and Ishizuka (2003) presented a statistical keyword extraction method from a single document. Initially, frequent terms are extracted. Then, the co-occurrence between each term and the frequent terms is evaluated. Based on the co-occurrence distributions, the significance of a term in the document is determined. The method does not require a training corpus and can yield comparable results to the TF-IDF measure. Turney (2003) presented an improved key phrase extraction algorithm, which uses statistical association among the key phrases to improve the coherence of the obtained keywords. In order to measure the association between key phrases, web mining is utilized. In another statistical keyword extraction method, a text document is represented as an undirected graph (Palshikar, 2007). The vertices of the graph contain the words of the document, whereas the edges are assigned values based on a statistical measure of dissimilarity between the two words.
In linguistic approaches, linguistic features of the document are utilized to identify keywords. These include lexical, syntactic, semantic and discourse analysis (Zhang et al., 2008). The linguistic keyword extraction methods are domain-dependent (Siddiqi & Sharan, 2015). Hulth (2003) examined the incorporation of linguistic knowledge, such as syntactic features, into the keyword extraction process. The experimental results indicated that linguistic features can obtain improvements over the use of only statistical measures, such as term frequency or n-grams. HaCohen-Kerner (2003) presented a keyword extraction model from abstracts and titles. In the model, text representation schemes, such as unigrams, bigrams and trigrams are utilized. Nguyen and Kan (2007) presented a key phrase extraction algorithm from scientific publications. In this method, linguistic features, such as the positions of phrases in the text documents and salient morphological phenomena, are taken into account. Krapivin, Autayeu, Marchese, Blanzieri, and Segata (2010) incorporated natural language processing methods to automatic key phrase extraction from scientific papers to enhance the performance of machine learning algorithms, such as support vector machines and Random Forests. The experimental results are obtained on the ACM dataset. The evaluations are done with expert-assigned key phrases and the key phrase extraction algorithm (KEA).
In machine learning approaches, a learning algorithm, such as support vector machines, Naïve Bayes or decision trees, is used to construct a classification model. In model construction, a training set of documents with tags is used and the model is validated via a test set of documents. The drawback of the machine learning based feature extraction models is the need to obtain a tagged set of documents. Witten, Paynter, Frank, Gutwin, and Nevill-Manning (1999) presented a simple and efficient key phrase extraction algorithm (KEA) which utilizes the Naïve Bayes algorithm for domain-based key phrase extraction. In this method, possible key phrases are determined by lexical methods and good key phrases are obtained by the machine learning algorithm. HaCohen-Kerner, Gross, and Masa (2005) examined the effectiveness of several automatic