Automatic Text Document Summarization Based On Machine Learning
Gabriel Silva, Rafael Ferreira
{gfps, rflm}@cin.ufpe.br
{rdl, htao}@cin.ufpe.br
Steven J. Simske
Hewlett-Packard Labs.
Fort Collins, CO 80528, USA
Marcelo Riss
Hewlett-Packard Brazil
Porto Alegre, RS, Brazil
ABSTRACT
The need for the automatic generation of summaries gained importance with the unprecedented volume of information available on the Internet. Automatic systems based on extractive summarization techniques select the most significant sentences of one or more texts to generate a summary. This article makes use of Machine Learning techniques to assess the quality of the twenty most referenced strategies used in extractive summarization, integrating them in a tool. Quantitative and qualitative aspects were considered in such an assessment, demonstrating the validity of the proposed scheme. The experiments were performed on the CNN-corpus, possibly the largest and most suitable test corpus today for benchmarking extractive summarization.
General Terms
Algorithms, Experimentation
Keywords
Text Summarization; Extractive features; Sentence Scoring Methods
1. INTRODUCTION
2.
3. THE SYSTEM
3.1 Feature Extraction
3.2 Text pre-processing
http://nlp.stanford.edu/software/corenlp.shtml
3.3 Classification model
The steps for creating the classification model used to select the sentences that will compose the summary are detailed here.
The first step aims to reduce the problems inherent to the feature extraction of each sentence. First, the feature vectors that have missing information, as well as outliers (vectors in which all features reach the maximum value), are eliminated.
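To make this cleaning step concrete, here is a minimal sketch, assuming the vectors are held in a NumPy array and that "maximum value" means the maximum observed per feature; the function name is ours, not the authors' code.

```python
import numpy as np

def clean_feature_vectors(X):
    """Drop sentence feature vectors with missing information, and
    outlier vectors in which every feature reaches the maximum value."""
    X = np.asarray(X, dtype=float)
    keep = ~np.isnan(X).any(axis=1)          # vectors with missing information
    col_max = np.nanmax(X, axis=0)           # per-feature maximum over the corpus
    keep &= ~(X == col_max).all(axis=1)      # vectors saturated at the maximum
    return X[keep]
```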
The second problem addressed here is dataset imbalance: whenever there is a large disparity in the number of examples of the training classes, the problem known in the literature as class imbalance arises. Classification models that are optimized with respect to overall accuracy then tend to become trivial models that almost always predict the majority class.
The algorithm chosen to address the balancing problem was SMOTE [3]. Its principle is to create artificial data based on the spatial relationships between examples of the minority class. Specifically, for each instance of the minority class, its k nearest neighbors are computed, for some integer k. Depending on the amount of oversampling required, some of these k nearest neighbors are randomly chosen. Synthetic samples are generated as follows: take the difference between the feature vector (sample) under consideration and its nearest neighbor; multiply this difference by a random number between zero and one; and add it to the feature vector under consideration. This selects a point along the line segment between the two points. This approach effectively forces the decision region of the minority class to become more general [3].
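The following is a minimal sketch of this synthetic-sample generation, assuming the minority class is given as a NumPy array; the function and its parameters are illustrative, and in practice an off-the-shelf SMOTE implementation (such as the one available for WEKA) would be used.

```python
import numpy as np

def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority-class samples in the style of SMOTE [3].

    minority: (n, d) array of minority-class feature vectors, with n > k.
    n_synthetic: number of artificial samples to create.
    """
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    n = minority.shape[0]
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    # For each sample, the indices of its k nearest neighbors (column 0 is itself).
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)               # pick a minority sample ...
        j = rng.choice(neighbors[i])      # ... and one of its k nearest neighbors
        gap = rng.random()                # random factor between zero and one
        # The new point lies on the segment between the sample and its neighbor.
        synthetic[s] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic
```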
Then, the system performs feature selection, an important tool for reducing the dimensionality of the vectors, as some features contribute to decreasing the efficiency of the classifier. Another contribution of this study is to identify which of the 20 features most used in extractive summarization over the last 10 years effectively contribute to a good classifier performance. The experiment was conducted on a corpus of 400 CNN news texts in English.
The experiments were performed with the attribute selection algorithms of WEKA (http://www.cs.waikato.ac.nz/ml/weka/); three were chosen and applied to the balanced dataset to define the best attributes of the vector: (i) CFS Subset Evaluator: evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them; (ii) Information Gain Evaluator: evaluates the worth of an attribute by measuring the information gain with respect to the class; (iii) SVM Attribute Evaluator: evaluates the worth of an attribute by using an SVM classifier. The top five features indicated by the selection methods were chosen. Figure 1 shows the profile of the selected features.
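The paper relies on WEKA's evaluators; as a rough Python analogue, the sketch below ranks features with mutual information (an information-gain surrogate) and with SVM-based recursive feature elimination. CFS has no direct scikit-learn counterpart, and the function names here are our own.

```python
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.svm import LinearSVC

def rank_features(X, y, names, top=5):
    """Rank sentence features with two analogues of the WEKA evaluators:
    information gain (approximated by mutual information) and an
    SVM-based recursive elimination (akin to SVM Attribute evaluation)."""
    # Information Gain analogue: mutual information between feature and class.
    gain = mutual_info_classif(X, y)
    by_gain = sorted(zip(names, gain), key=lambda p: -p[1])[:top]

    # SVM Attribute analogue: recursively drop features with small SVM weights.
    rfe = RFE(LinearSVC(dual=False), n_features_to_select=top).fit(X, y)
    by_svm = [name for name, kept in zip(names, rfe.support_) if kept]

    return by_gain, by_svm
```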
The selected features demonstrate the prevalence of language-independent features, such as the position of the sentence in the text, TF/IDF, and similarity. This allows the summarization of texts in different languages.
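As an illustration of such language-independent scores, the following sketch computes a simplified position, TF-ISF (a sentence-level TF/IDF analogue), and similarity feature per sentence; it is our own rendition, not the paper's exact feature set.

```python
import math
import re
from collections import Counter

def language_independent_features(sentences):
    """Compute three language-independent scores per sentence: position in
    the text, mean TF-ISF, and word overlap with the remaining sentences."""
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    # Sentence frequency of each word, for the inverse-frequency weight.
    sf = Counter(w for t in tokens for w in set(t))
    feats = []
    for i, t in enumerate(tokens):
        tf = Counter(t)
        size = max(len(t), 1)
        tfisf = sum(c * math.log(n / sf[w]) for w, c in tf.items()) / size
        position = 1.0 - i / n          # earlier sentences score higher
        rest = Counter(w for j, u in enumerate(tokens) if j != i for w in u)
        overlap = sum(min(c, rest[w]) for w, c in tf.items()) / size
        feats.append({"position": position, "tfisf": tfisf, "overlap": overlap})
    return feats
```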
Six classifiers were tested using the WEKA platform: Naive Bayes [8], MLP [7], SVM [7], KNN [1], AdaBoost [6], and Random Forest [2]. The results of the classifiers were compared with seven summarization systems: Open Text Summarizer (OTS), Text Compactor (TC), Free Summarizer (FS), Smmry (SUMM), Web Summarizer (WEB), Intellexer Summarizer (INT), and Compendium (COMP) [10].
Figure 2 presents the results of the proposed summarization method, showing the number of correct sentences chosen among the human-selected sentences that form the gold standard. This experiment used 400 texts from CNN news.
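The same comparison can be reproduced in Python along the following lines; the scikit-learn classifiers and their default settings are stand-ins for the WEKA configurations used in the paper.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=500),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Random Forest": RandomForestClassifier(),
}

def compare_classifiers(X, y, folds=10):
    """Cross-validate each candidate on the sentence vectors, where y marks
    whether a sentence belongs to the gold-standard summary."""
    for name, clf in CLASSIFIERS.items():
        scores = cross_val_score(clf, X, y, cv=folds)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```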
4. CONCLUSIONS
Automatic summarization opens a wide number of possibilities, such as the efficient classification, retrieval, and information-based compression of text documents. This paper presented an assessment of the most widely used sentence scoring methods for text summarization. The results demonstrate that a judicious choice of the set of automatic sentence summarization methods provides better quality summaries and also greater processing efficiency. The proposed system selected 554 more relevant sentences for the summaries, which means an improvement of more than 100% in relation to the best tool found in the literature. It was also evident that balancing the base of examples yields gains in the performance of the sentence selection system. The next step is the validation of the experiments on other summarization test corpora, with texts other than news articles. Although the CNN-corpus may possibly be the largest and best test corpus for assessing news articles today, the authors of this paper are promoting an effort to double its size in the near future, allowing even better testing capabilities.
5. ACKNOWLEDGMENTS
6. REFERENCES