
Automatic Text Document Summarization Based on Machine Learning

Gabriel Silva, Rafael Ferreira, Rafael Lins, Luciano Cabral, Hilário Oliveira
UFRPE/UFPE, Recife, PE, Brazil
{gfps, rflm}@cin.ufpe.br, {rdl, htao}@cin.ufpe.br

Steven J. Simske
Hewlett-Packard Labs., Fort Collins, CO 80528, USA
[email protected]

Marcelo Riss
Hewlett-Packard Brazil, Porto Alegre, RS, Brazil
[email protected]

ABSTRACT

The need for automatic generation of summaries gained importance with the unprecedented volume of information available on the Internet. Automatic systems based on extractive summarization techniques select the most significant sentences of one or more texts to generate a summary. This article uses Machine Learning techniques to assess the quality of the twenty most referenced strategies used in extractive summarization, integrating them in a tool. Quantitative and qualitative aspects were considered in the assessment, demonstrating the validity of the proposed scheme. The experiments were performed on the CNN-corpus, possibly the largest and most suitable test corpus today for benchmarking extractive summarization.

Categories and Subject Descriptors

I.2.7 [Natural Language Processing]: Text analysis.

General Terms

Algorithms, Experimentation

Keywords

Text Summarization; Extractive Features; Sentence Scoring Methods

1. INTRODUCTION

Automatic document summarization is a research area that was born in the early 1950s. Recently, with the pervasiveness of the Internet and the fast-growing number of text documents, the search for efficient automated systems for Text Summarization (TS) has gained importance; TS may even be seen as a way to compress information [12]. TS platforms may receive one or more documents as input to generate a summary. Such techniques are classified as extractive, when the summary is formed by sentences of the original document, or abstractive, when the summaries modify the original sentences chosen to yield a better-quality summary [11]. In general, abstractive summarization may be seen as a step beyond extractive summarization, and research in that area may be considered to be at its very beginning. Extractive summarization techniques select the sentences with the highest scores from the original document based on a set of criteria. Extractive summarization methods are better consolidated and may be considered efficient in the automatic generation of summaries [12, 11, 4].

Summaries may also be classified as generic or query-dependent (or query-driven). Generic summaries analyze the text as a whole, without prioritizing any aspect. Query-dependent summaries, on the other hand, look at the text trying to find sentences that may answer a query from the user.

Text summarization may also be seen as a text compression strategy. The vertical compression rate of a summary may be defined as the ratio between the number of sentences in the original document and the number of sentences in the summary; for instance, a 40-sentence document condensed into a 4-sentence summary has a vertical compression rate of 10:1 (90% of the sentences are removed). Another possibility is horizontal sentence compression, in which each sentence may be summarized by removing non-essential information. In this case the compression rate is measured by the ratio between the number of words in the original document and the number of words in the summary. Both compression rates are important factors that influence the overall quality and purpose of the summary. This paper focuses exclusively on extractive vertical summarization.

Extractive text summarization techniques are split into three categories [4]: word-based, sentence-based, and graph-based scoring methods. In word-based scoring, each word receives a score and the weight of each sentence is the sum of the scores of its constituent words. Sentence-based scoring analyzes the features of a sentence and its relation to the text; cue-phrases (such as "it is important", "in summary", etc.), resemblance to the title, and sentence position are examples of sentence-based scoring techniques. Finally, in graph-based methods, the score of a sentence reflects some relationship among sentences: when a word or sentence refers to another one, an edge is generated with a weight between them, and the sum of the weights incident on a sentence is its score.

This article analyzes 15 sentence scoring methods, and some variations of them, widely used and referenced in the document summarization literature of the last 10 years. The scoring methods comprise the feature vector that is used to train the classifier and to rank sentences, totaling 20 features. The key point of this paper is to use Machine Learning techniques to analyze such features so as to point out which of them contribute most to yielding good-quality summaries.

Quantitative and qualitative strategies are used here to assess the quality of the summaries. The quantitative assessment was performed using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [9], a measure widely accepted for such a purpose. In addition, three people analyzed each original text and generated summaries following the methodology described below. The qualitative assessment counts the number of sentences selected by the system that coincide with the sentences selected by all three human evaluators. The results obtained show the effectiveness of the proposed method: it selects twice as many relevant sentences to compose the summary and achieves results 71% better under the ROUGE-2 metric.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DocEng'15, September 8-11, 2015, Lausanne, Switzerland.
© 2015 ACM. ISBN 978-1-4503-3307-8/15/09 ...$15.00.
DOI: http://dx.doi.org/10.1145/2682571.2797099
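The word-based scoring family described above can be illustrated with a minimal sketch. This is a hypothetical toy example (naive tokenization and sentence splitting, no stop-word removal), not the system evaluated in this paper:

```python
from collections import Counter
import re

def word_frequency_scores(text, top_n=2):
    """Score each sentence by the summed frequency of its words.

    A minimal sketch of word-based scoring: each word's score is its
    frequency in the document, and a sentence's score is the sum of
    the scores of its constituent words.
    """
    # Naive sentence segmentation on '.', '!' and '?' (a real system
    # would use a tool such as Stanford CoreNLP).
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    scored = []
    for s in sentences:
        tokens = re.findall(r"[a-z']+", s.lower())
        scored.append((sum(freq[t] for t in tokens), s))
    # Extractive summary: the top_n highest-scoring sentences,
    # restored to their original document order.
    top = sorted(scored, reverse=True)[:top_n]
    return [s for _, s in sorted(top, key=lambda p: sentences.index(p[1]))]
```

Sentence-based and graph-based methods replace the per-word frequency score with sentence-level features or edge weights, but keep the same rank-and-select structure.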

2. THE CNN CORPUS

The CNN corpus, developed by Lins and his colleagues [9], consists of news texts extracted from the CNN website (www.cnn.com). The main advantage of this test corpus rests not only on the high quality of the writing, which uses grammatically correct standard English to report on general-interest subjects, but also on the fact that each news text is provided with its highlights: a summary of 3 to 5 sentences written by the original author(s). The highlights were the basis for the development of the gold standard, which was obtained by the injective mapping of each of the sentences in the highlights onto the original sentences of the text. This mapping process was performed by three different people, and the gold standard was formed by the most-voted mapped sentences. A very high degree of consistency in sentence selection was observed. The CNN-corpus is possibly the largest existing corpus for benchmarking extractive summarization techniques. The current version has 400 documents, written in the English language, totaling 13,228 sentences, of which 1,471 were selected for the gold standards, representing an average compression rate of 90%.

3. THE SYSTEM

The steps of the methodology for obtaining the extractive summaries are presented in the following sections.

3.1 Text pre-processing

The news articles obtained from the CNN website must be carefully chosen to contain only text; news articles with figures, videos, tables and other multimedia elements are discarded. Besides that, each article must be complete, with text, highlights, title, author(s), subject area, etc. All such data is inserted in an XML file.

The text part of the document is then processed for paragraph segmentation, sentence segmentation, stop-word removal and stemming. Each text paragraph is numbered, as well as each of its sentences. Then, sentence segmentation is performed by Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml). Stop words [5] are removed, since they are considered unimportant and can introduce noise; the stop words are predefined and stored in an array that is used for comparison against the words in the document. Word stemming [13] converts each word into its root form by removing prefixes and suffixes. After this stage the text is structured in XML and included in the XML file that corresponds to the news article. As the focus here is on the text part of the document, all other XML-file attributes will no longer be addressed in this paper.

3.2 Feature Extraction

After preprocessing, the XML document is represented by the set D = {S1, S2, ..., Sn}, where Si is a sentence in the document D. The preprocessed sentences are subjected to the feature extraction process, so that a feature vector Vi = {F1, F2, ..., F20} is generated for each sentence Si. As already mentioned, extractive summarization uses three scoring strategies [4]: (i) Word: assigns scores to the most important words; (ii) Sentence: accounts for features of sentences, such as position in the document, similarity to the title, etc.; (iii) Graph: uses the relationships between words and sentences.

Table 1 shows the features analyzed in this work and their kind of scoring. They correspond to the most widely acknowledged techniques for extractive summarization reported in the literature.

Table 1: Extractive summarization features and their type of scoring.

Feature  Name of Extractive Summarization Strategy  Type of Scoring
F01      Aggregate Similarity                       Graph
F02      Bushy Path                                 Graph
F03      Centrality                                 Sentence
F04      Heterogeneous Graph                        Graph
F05      Text Rank                                  Graph
F06      Cue-Phrase                                 Sentence
F07      Numerical Data                             Sentence
F08      Position Paragraph                         Sentence
F09      Position Text                              Sentence
F10      Resemblance Title                          Sentence
F11      Sentence Length                            Sentence
F12      Sentence Position in Paragraph             Sentence
F13      Sentence Position in Text                  Sentence
F14      Proper-Noun                                Word
F15      Co-Occurrence Bleu                         Word
F16      Lexical Similarity                         Word
F17      Co-Occurrence N-gram                       Word
F18      TF/IDF                                     Word
F19      Upper Case                                 Word
F20      Word Frequency                             Word
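As an illustration of the feature-extraction step, the sketch below computes a small feature vector covering three of the Table 1 features. The [0, 1] normalizations are illustrative assumptions, not the exact definitions used by the tool:

```python
import re

def sentence_features(sentences, title):
    """Build a small feature vector V_i for each sentence S_i.

    A hypothetical sketch of three of the Table 1 features:
    F13 (sentence position in text), F11 (sentence length) and
    F10 (resemblance to the title).
    """
    title_words = set(re.findall(r"\w+", title.lower()))
    max_len = max(len(s.split()) for s in sentences)
    n = len(sentences)
    vectors = []
    for i, s in enumerate(sentences):
        words = set(re.findall(r"\w+", s.lower()))
        position = 1.0 - i / n             # earlier sentences score higher
        length = len(s.split()) / max_len  # relative sentence length
        resemblance = (len(title_words & words) / len(title_words)
                       if title_words else 0.0)
        vectors.append([position, length, resemblance])
    return vectors
```

The remaining 17 features of Table 1 would be appended to each vector in the same way.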

3.3 Classification model

The steps for creating the classification model used to select the sentences that compose the summary are detailed here.

The first step has the purpose of reducing the problems inherent to the feature extraction of each sentence. First, the feature vectors that have missing information or outliers (when all features reach the maximum value) are eliminated. Another problem addressed here is the unbalance of the training basis: whenever there is a large disparity in the number of examples of the training classes, the problem known in the literature as class imbalance arises. Classification models that are optimized with respect to overall accuracy tend to become trivial models that almost always predict the majority class.

The algorithm chosen to address the balancing problem was SMOTE [3]. Its principle is to create artificial data based on the spatial relations between examples of the minority class. Specifically, for each instance of the minority class, its k nearest neighbors within that class are considered, for some integer k. Depending on the amount of oversampling required, neighbors among the k nearest are randomly chosen. Synthetic samples are generated as follows: take the difference between the feature vector (sample) under consideration and its chosen nearest neighbor, multiply this difference by a random number between zero and one, and add it to the feature vector under consideration. This selects a point along the line segment between the two samples. This approach effectively forces the decision region of the minority class to become more general [3].
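The interpolation step above can be sketched as follows. This is a simplified stand-in for SMOTE [3] (no per-class oversampling percentage, plain Euclidean neighbors), intended only to show the synthetic-sample construction:

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority-class samples, SMOTE-style.

    Simplified sketch: pick a random minority instance, pick one of
    its k nearest neighbours within the class, then interpolate at a
    random point on the line segment between the two feature vectors.
    """
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic
```

Because every synthetic point lies between two existing minority samples, the minority region is filled in rather than merely replicated.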
Next, the system performs feature selection, an important tool for reducing the dimensionality of the vectors, as some features contribute to decreasing the efficiency of the classifier. Another contribution of this study is to identify which of the 20 features most used in extractive summarization over the last 10 years effectively contribute to a good classifier performance. The experiment was conducted on the corpus of 400 CNN news texts in English.

The experiments were performed with the attribute selection algorithms of WEKA; three were chosen and applied on the balanced basis to define the best attributes of the vector: (i) CFS Subset Evaluator: evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them; (ii) Information Gain Evaluator: evaluates the worth of an attribute by measuring the information gain with respect to the class; (iii) SVM Attribute Evaluator: evaluates the worth of an attribute by using an SVM classifier. The top five features indicated by the selection methods were chosen. Figure 1 shows the profile of the selected features.

The selected features demonstrate the prevalence of language-independent features such as position in the text, TF/IDF and similarity. This allows summarizing texts in different languages.
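The information-gain criterion used by the second evaluator can be stated compactly: IG(F) = H(class) - H(class | F). A minimal sketch for discrete feature values (WEKA additionally discretizes numeric attributes first):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature) for one discrete attribute."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond
```

A feature that perfectly separates the classes scores 1 bit on a balanced binary problem; an uninformative one scores 0, and would be dropped.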
Figure 1: Selected Features

Six classifiers were tested using the WEKA platform: Naive Bayes [8], MLP [7], SVM [7], KNN [1], AdaBoost [6], and Random Forest [2]. The results of the classifiers were compared with those of seven summarization systems: Open Text Summarizer (OTS), Text Compactor (TC), Free Summarizer (FS), Smmry (SUMM), Web Summarizer (WEB), Intellexer Summarizer (INT), and Compendium (COMP) [10].

Figure 2 presents the evaluation of the proposed summarization method, showing the number of correct sentences chosen from among the human-selected sentences that form the gold standard. This experiment used 400 texts from CNN news.

Figure 2: Evaluation of the classifiers for summarization

The classifiers were tested with variations of parameters, with and without adjustment and balancing of the training basis. The technique chosen to validate the models was cross-validation. The tests performed on the unbalanced basis yielded an accuracy of 52%, while the balanced basis yielded 70% accuracy. The Naive Bayes classifier achieved the best result in all cases. In the qualitative evaluation it reached 969 and 1,082 correctly selected sentences in the unbalanced and balanced cases, respectively. In the unbalanced case Naive Bayes outperforms the second place (AdaBoost) by 7.42%, and it selects the same number of important sentences as KNN in the balanced case.

(System URLs: libots.sourceforge.net, www.textcompactor.com, freesummarizer.com, smmry.com, www.websummarizer.com, summarizer.intellexer.com; WEKA: http://www.cs.waikato.ac.nz/ml/weka/)
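The Gaussian variant of Naive Bayes, the classifier family that performed best here, can be sketched from first principles. This is a didactic toy under the usual feature-independence assumption, not WEKA's implementation [8]:

```python
import math
from collections import defaultdict

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes: per-class priors plus per-feature
    Gaussian likelihoods, combined in log space."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for xi, yi in zip(X, y):
            by_class[yi].append(xi)
        self.stats, n = {}, len(y)
        for c, rows in by_class.items():
            means = [sum(col) / len(rows) for col in zip(*rows)]
            variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                         for col, m in zip(zip(*rows), means)]
            self.stats[c] = (len(rows) / n, means, variances)
        return self

    def predict(self, X):
        preds = []
        for xi in X:
            best, best_lp = None, -math.inf
            for c, (prior, means, variances) in self.stats.items():
                lp = math.log(prior)  # log P(class)
                for v, m, var in zip(xi, means, variances):
                    # log of the Gaussian density for this feature
                    lp += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                if lp > best_lp:
                    best, best_lp = c, lp
            preds.append(best)
        return preds
```

In the system described here, each training example is a sentence's 20-dimensional feature vector, and the two classes are "belongs to the summary" and "does not".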

Figures 3 and 4 present the comparison of the Naive Bayes classifier results against the seven summarization systems. The superiority of the proposed method was verified in both evaluations. In the qualitative assessment the proposed method reached 1,082 correctly selected sentences, an improvement of more than 100% in relation to Text Compactor, the best tool found in the literature; in absolute numbers, it selected 554 more correct sentences. Using ROUGE, the Naive Bayes classifier achieved a result 61.3% better than Web Summarizer, the second place: the proposed method reached 71% accuracy while WEB obtained 44%. These results confirm the hypothesis that using machine learning techniques improves text summarization results.
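For reference, the ROUGE-2 recall score [9] used in this comparison counts reference bigrams recovered by the candidate summary. A minimal sketch (whitespace tokenization only; the full ROUGE package adds stemming and other options):

```python
from collections import Counter

def rouge_2_recall(candidate, reference):
    """ROUGE-2 recall: the fraction of the reference summary's bigrams
    that also appear in the candidate summary (clipped counts)."""
    def bigrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1))

    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum(min(n, cand[bg]) for bg, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```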

Figure 3: Evaluation of the summarization systems

Figure 4: Precision of the summarization systems using ROUGE-2

4. CONCLUSIONS AND LINES FOR FURTHER WORK

Automatic summarization opens a wide number of possibilities, such as the efficient classification, retrieval and information-based compression of text documents. This paper presents an assessment of the most widely used sentence scoring methods for text summarization. The results demonstrate that a judicious choice of the set of automatic sentence summarization methods provides better-quality summaries and also greater processing efficiency. The proposed system selects 554 more relevant sentences for the summaries, which means an improvement of more than 100% in relation to the best tool found in the literature. It was also evident that balancing the basis of examples yields gains in the performance of the sentence selection system.

The next step is the validation of the experiments on other summarization test corpora, with texts other than news articles. Although the CNN-corpus may possibly be the largest and best test corpus for assessing news articles today, the authors of this paper are promoting an effort to double its size in the near future, allowing even better testing capabilities.

5. ACKNOWLEDGMENTS

The research results reported in this paper have been partly funded by a R&D project between Hewlett-Packard do Brasil and UFPE, originated from tax exemption (IPI Law number 8.248, of 1991 and later updates).

6. REFERENCES

[1] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Mach. Learn., 6(1):37–66, Jan. 1991.
[2] L. Breiman. Random forests. Mach. Learn., 45(1):5–32, Oct. 2001.
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357, June 2002.
[4] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro. Assessing sentence scoring techniques for extractive text summarization. Expert Systems with Applications, 40(14):5755–5764, 2013.
[5] W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992.
[6] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148–156, 1996.
[7] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.
[8] G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pages 338–345, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[9] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens and S. Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[10] E. Lloret and M. Palomar. COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres. Natural Language Engineering, FirstView:1–40, 2012.
[11] E. Lloret and M. Palomar. Text summarisation in progress: a literature review. Artif. Intell. Rev., 37(1):1–41, Jan. 2012.
[12] A. Patel, T. Siddiqui, and U. S. Tiwary. A language independent approach to multilingual text summarization. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), RIAO '07, pages 123–132, Paris, France, 2007. Le Centre de Hautes Études Internationales d'Informatique Documentaire.
[13] C. Silva and B. Ribeiro. The importance of stop word removal on recall values in text categorization. In IJCNN 2003, volume 3, 2003.