Automatic Text Document Summarization Based On Machine Learning
Gabriel Silva, Rafael Ferreira
{gfps, rflm}@cin.ufpe.br
{rdl, htao}@cin.ufpe.br
Steven J. Simske
Hewlett-Packard Labs.
Fort Collins, CO 80528, USA
Marcelo Riss
Hewlett-Packard Brazil
Porto Alegre, RS, Brazil
ABSTRACT
The need for the automatic generation of summaries gained importance with the unprecedented volume of information available on the Internet. Automatic systems based on extractive summarization techniques select the most significant sentences of one or more texts to generate a summary. This article makes use of Machine Learning techniques to assess the quality of the twenty most referenced strategies used in extractive summarization, integrating them in a tool. Quantitative and qualitative aspects were considered in such an assessment, demonstrating the validity of the proposed scheme. The experiments were performed on the CNN-corpus, possibly the largest and most suitable test corpus today for benchmarking extractive summarization.
General Terms
Algorithms, Experimentation
Keywords
Text Summarization; Extractive features; Sentence Scoring Methods
1. INTRODUCTION
2.
3. THE SYSTEM
3.1 Feature Extraction
3.2 Text pre-processing
http://nlp.stanford.edu/software/corenlp.shtml
3.3 Classification model
The steps for creating the classification model used to select the sentences that will compose the summary are detailed here.
The first step aims to reduce the problems inherent to the feature extraction of each sentence. First, the feature vectors that have missing information, as well as outliers (vectors in which all features reach the maximum value), are eliminated.
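To make this cleaning step concrete, here is a minimal sketch, assuming the vectors are held in a NumPy array and that "maximum value" means the maximum observed per feature; the function name is ours, not the authors' code.

```python
import numpy as np

def clean_feature_vectors(X):
    """Drop sentence feature vectors with missing information, and
    outlier vectors in which every feature reaches the maximum value."""
    X = np.asarray(X, dtype=float)
    keep = ~np.isnan(X).any(axis=1)          # vectors with missing information
    col_max = np.nanmax(X, axis=0)           # per-feature maximum over the corpus
    keep &= ~(X == col_max).all(axis=1)      # vectors saturated at the maximum
    return X[keep]
```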
The second problem addressed here is dataset imbalance: whenever there is a large disparity in the number of examples of the training classes, the problem known in the literature as class imbalance arises. Classification models that are optimized with respect to overall accuracy then tend to become trivial models that almost always predict the majority class.
The algorithm chosen to address the balancing problem was SMOTE [3]. Its principle is to create artificial data based on the spatial relationships between examples of the minority class. Specifically, for each instance of the minority class, its k nearest neighbors are computed, for some integer k. Depending on the amount of oversampling required, some of these k nearest neighbors are randomly chosen. Synthetic samples are generated as follows: take the difference between the feature vector (sample) under consideration and its nearest neighbor; multiply this difference by a random number between zero and one; and add it to the feature vector under consideration. This selects a point along the line segment between the two points. This approach effectively forces the decision region of the minority class to become more general [3].
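The following is a minimal sketch of this synthetic-sample generation, assuming the minority class is given as a NumPy array; the function and its parameters are illustrative, and in practice an off-the-shelf SMOTE implementation (such as the one available for WEKA) would be used.

```python
import numpy as np

def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority-class samples in the style of SMOTE [3].

    minority: (n, d) array of minority-class feature vectors, with n > k.
    n_synthetic: number of artificial samples to create.
    """
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    n = minority.shape[0]
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    # For each sample, the indices of its k nearest neighbors (column 0 is itself).
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)               # pick a minority sample ...
        j = rng.choice(neighbors[i])      # ... and one of its k nearest neighbors
        gap = rng.random()                # random factor between zero and one
        # The new point lies on the segment between the sample and its neighbor.
        synthetic[s] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic
```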
Then, the system performs feature selection, an important tool for reducing the dimensionality of the vectors, as some features contribute to decreasing the efficiency of the classifier. Another contribution of this study is to identify which of the 20 features most used in extractive summarization over the last 10 years effectively contribute to a good classifier performance. The experiment was conducted on a corpus of 400 CNN news texts in English.
The experiments were performed with the attribute selection algorithms of WEKA (http://www.cs.waikato.ac.nz/ml/weka/); three were chosen and applied to the balanced dataset to define the best attributes of the vector: (i) CFS Subset Evaluator: evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them; (ii) Information Gain Evaluator: evaluates the worth of an attribute by measuring the information gain with respect to the class; (iii) SVM Attribute Evaluator: evaluates the worth of an attribute by using an SVM classifier. The top five features indicated by the selection methods were chosen. Figure 1 shows the profile of the selected features.
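The paper relies on WEKA's evaluators; as a rough Python analogue, the sketch below ranks features with mutual information (an information-gain surrogate) and with SVM-based recursive feature elimination. CFS has no direct scikit-learn counterpart, and the function names here are our own.

```python
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.svm import LinearSVC

def rank_features(X, y, names, top=5):
    """Rank sentence features with two analogues of the WEKA evaluators:
    information gain (approximated by mutual information) and an
    SVM-based recursive elimination (akin to SVM Attribute evaluation)."""
    # Information Gain analogue: mutual information between feature and class.
    gain = mutual_info_classif(X, y)
    by_gain = sorted(zip(names, gain), key=lambda p: -p[1])[:top]

    # SVM Attribute analogue: recursively drop features with small SVM weights.
    rfe = RFE(LinearSVC(dual=False), n_features_to_select=top).fit(X, y)
    by_svm = [name for name, kept in zip(names, rfe.support_) if kept]

    return by_gain, by_svm
```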
The selected features demonstrate the prevalence of language-independent features, such as the position of the sentence in the text, TF/IDF, and similarity. This allows the summarization of texts in different languages.
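As an illustration of such language-independent scores, the following sketch computes a simplified position, TF-ISF (a sentence-level TF/IDF analogue), and similarity feature per sentence; it is our own rendition, not the paper's exact feature set.

```python
import math
import re
from collections import Counter

def language_independent_features(sentences):
    """Compute three language-independent scores per sentence: position in
    the text, mean TF-ISF, and word overlap with the remaining sentences."""
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    # Sentence frequency of each word, for the inverse-frequency weight.
    sf = Counter(w for t in tokens for w in set(t))
    feats = []
    for i, t in enumerate(tokens):
        tf = Counter(t)
        size = max(len(t), 1)
        tfisf = sum(c * math.log(n / sf[w]) for w, c in tf.items()) / size
        position = 1.0 - i / n          # earlier sentences score higher
        rest = Counter(w for j, u in enumerate(tokens) if j != i for w in u)
        overlap = sum(min(c, rest[w]) for w, c in tf.items()) / size
        feats.append({"position": position, "tfisf": tfisf, "overlap": overlap})
    return feats
```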
Six classifiers were tested using the WEKA platform: Naive Bayes [8], MLP [7], SVM [7], KNN [1], AdaBoost [6], and Random Forest [2]. The results of the classifiers were compared with seven summarization systems: Open Text Summarizer (OTS), Text Compactor (TC), Free Summarizer (FS), Smmry (SUMM), Web Summarizer (WEB), Intellexer Summarizer (INT), and Compendium (COMP) [10].
Figure 2 presents the results of the proposed summarization method, showing the number of correct sentences chosen among the human-selected sentences that form the gold standard. This experiment used 400 texts from CNN news.
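The same comparison can be reproduced in Python along the following lines; the scikit-learn classifiers and their default settings are stand-ins for the WEKA configurations used in the paper.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=500),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Random Forest": RandomForestClassifier(),
}

def compare_classifiers(X, y, folds=10):
    """Cross-validate each candidate on the sentence vectors, where y marks
    whether a sentence belongs to the gold-standard summary."""
    for name, clf in CLASSIFIERS.items():
        scores = cross_val_score(clf, X, y, cv=folds)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```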
4. CONCLUSIONS
Automatic summarization opens a wide number of possibilities, such as the efficient classification, retrieval, and information-based compression of text documents. This paper presented an assessment of the most widely used sentence scoring methods for text summarization. The results demonstrate that a judicious choice of the set of automatic sentence summarization methods provides better quality summaries and also greater processing efficiency. The proposed system selected 554 more relevant sentences for the summaries, which means an improvement of more than 100% in relation to the best tool found in the literature. It was also evident that balancing the base of examples yields gains in the performance of the sentence selection system. The next step is the validation of the experiments on other summarization test corpora, with texts other than news articles. Although the CNN-corpus may possibly be the largest and best test corpus for assessing news articles today, the authors of this paper are promoting an effort to double its size in the near future, allowing even better testing capabilities.
5. ACKNOWLEDGMENTS
6. REFERENCES