TOPSIS With Multiple Linear Regression For Multi-Document Text Summarization
The huge amount of information on the Internet creates a pressing need for text
summarization. Text summarization is the process of selecting important sentences
from documents while keeping the main idea of the original documents. This paper
proposes a method that depends on the Technique for Order of Preference by Similarity to
Ideal Solution (TOPSIS). The first step in our model is extracting seven
features for each sentence in the document set. Multiple Linear Regression (MLR)
is then used to assign a weight to each of the selected features. The TOPSIS method is
then applied to rank the sentences. The sentences with high scores are selected for
inclusion in the generated summary. The proposed model is evaluated using the dataset
supplied by the Text Analysis Conference (TAC-2011) for English documents. The
performance of the proposed model is evaluated using the Recall-Oriented Understudy
for Gisting Evaluation (ROUGE) metric. The obtained results support the
effectiveness of the proposed model.
Given the large amount of information available on the Internet, there is a pressing need for
summarizing information. Summarizing information involves extracting the important sentences
from texts while preserving the main ideas of the summarized texts. This research proposes a
method based on the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS).
The first step in our proposed model is extracting seven features for each sentence of the texts
to be summarized. Multiple linear regression is then used to assign weights to the selected
features. The TOPSIS method is then applied to rank the sentences. The sentences with the
highest scores are selected for inclusion in the generated summary. The TAC-2011 dataset for the
English language was used. The results were tested using the ROUGE program and demonstrated the
efficiency of the proposed system.
Malallah and Ali Iraqi Journal of Science, 2017, Vol. 58, No.3A, pp: 1298-1307
DOI: 10.24996/ijs.2017.58.3A.14
1. Introduction
Owing to the fast development of information and communication technologies, an enormous quantity
of documents has been created and collected on the World Wide Web. This huge number of
documents makes it difficult for users to get useful information [1]. To deal with this problem of
information overload, Automatic Text Summarization (ATS) has been used as a solution. ATS is the
process of generating a single summary from a set of documents or from a single document
without losing the main ideas [2]. This process gives users a general overview of all related
documents and issues of interest while conveying the main content of the summarized documents;
it also reduces the time needed to obtain such briefs. Depending on the number of documents
to be summarized, ATS can be classified as Single Document Summarization (SDS) or Multi
Document Summarization (MDS). In SDS, only one document is summarized into a shorter one,
whereas in MDS a set of related documents on the same topic is summarized into one shorter summary
[3]. Summarization methods can also be classified as abstractive or extractive.
Abstractive summarization depends on Natural Language Processing (NLP)
strategies, which require a deep understanding of NLP techniques to analyze the sentences
and paragraphs of the documents, since some changes have to be made to the selected sentences. In
extractive summarization, by contrast, no change is applied to the sentences selected for inclusion in
the final summary [4]. Thus abstractive summarization seems to be more difficult and time-consuming
than extractive summarization [5]. Summarization can also be categorized as query-based
or generic. In query-based summarization, a summary is generated according to
the user query, and the documents are searched for matches with that query [6]. Generic
summarization, in contrast, creates a summary that includes the main content of the documents. One of
the main challenges for generic summarization is that no topic or query is available for the
summarization process [7].
2. Related Works
ATS reduces a large number of text documents to a smaller set of sentences that convey the main
ideas of these documents. Specialists in NLP are interested in discovering new methods for
summarization and in exploring a variety of models to approach ideal summarization. In this section
we review some of these methods [8].
In [9] the authors suggested a method for calculating the weights of the selected features. Five
different features were used: the first two are structural features, each composed of more than one
simple feature, while the remaining three are simple features. These five features are used
as input parameters to particle swarm optimization (PSO), which is trained on them and assigns a
weight to each one. Their results showed that the structural features received a higher average weight
than the simple features. In [10] the authors suggested a method based on selecting five features:
sentence position, sentence length, numerical information, thematic words and title
feature. A pseudo genetic algorithm was used to train on the dataset and assign a weight to each feature.
Their results showed that the importance of these features is in the following order: title feature,
sentence position, thematic words, sentence length and numerical information. In [11], a set of features
was extracted for each sentence; this set was used as input to a model consisting of three components:
Cellular Learning Automata (CLA), PSO, and fuzzy logic. The CLA was used to calculate the
similarity between sentences in order to reduce redundancy, the PSO was used to set a weight for
each feature, and the fuzzy logic was used to score the sentences. The scored sentences were
arranged in descending order, and the sentences with the highest scores were selected for inclusion in the
created summary. In [12], the authors proposed a method based on formulating the MDS problem
as a multi-objective optimization (MOO) problem. Two main objective functions were formulated:
redundancy reduction and content coverage. Redundancy was computed using the cosine
similarity between each pair of sentences in the dataset, whereas content coverage was computed using the
cosine similarity between each sentence and the mean of the document collection. An evolutionary
algorithm was used to combine these two objective functions, with the aim of minimizing the first
objective and maximizing the second. Their method obtained good results.
The fundamental objective of document summarization is the extraction of suitable and pertinent
sentences from the input document(s). One technique for acquiring the significant sentences is to
assign each sentence a weight that indicates its salience for selection into the
summary and then to select the top-weighted sentences [13]. This paper proposes a method for generic
extractive MDS of English text. It extracts seven features for each sentence in the
documents and then uses a mathematical model to assign a weight to each feature. The
mathematical model is based on Multiple Linear Regression (MLR). The weights of the selected
features are used as input to the TOPSIS algorithm. TOPSIS uses both the selected features and
their calculated weights to rank the sentences. We used the Text Analysis Conference (TAC-2011)
dataset to assess the summarization results.
3. Problem Statement and Formulation
To produce a good summary, any MDS system must consider two issues:
1- Relevancy: the goodness of the information included in the created summary. A
summary is considered relevant if it includes much information relevant to the main topic of the
documents.
2- Redundancy: the generated summary should include as little redundant information as possible so
that it covers most of the relevant topics.
Formally, a corpus consists of many clusters, and each cluster contains a set of documents
D on the same topic. The set D can be defined as D = {d1, d2, …, dn}, where n is the number of
distinct documents in D. D can also be represented as a set of sentences Si, i.e., D = {Si |
1 <= i <= M}, where M is the total number of sentences in D.
Our goal is to find a subset A of D, i.e., A ⊂ D, that satisfies both objectives: relevancy
maximization and redundancy reduction.
4. Basic Concepts
There are two main stages: Preprocessing and feature extraction.
4.1 Preprocessing
There are four steps in this stage.
A- Sentence segmentation: splitting the text into sentences according to the dot between them.
B- Tokenization: the process of splitting a sentence into words.
C- Stop Words Removal: words that appear frequently and do not carry the information needed to
identify the significant meaning of the document content are removed. There are a
variety of methods for specifying such a stop-word list. Presently, a number of English stop-word
lists are commonly used to help the text summarization process.
D- Stemming: the process of producing the root of a word. In this paper, word stemming is
performed using Porter's stemming algorithm [14].
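The four preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the tiny stop-word list, and the regular expressions are assumptions, and stemming is left as a pluggable `stem` function (identity by default) where a Porter stemmer would be supplied.

```python
import re

# Tiny illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "for"}

def split_sentences(text):
    """Step A: split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Step B: lower-case the sentence and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Step C: drop tokens found in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def preprocess(text, stem=lambda w: w):
    """Run steps A-D; `stem` defaults to the identity (plug in a Porter stemmer)."""
    return [[stem(t) for t in remove_stop_words(tokenize(s))]
            for s in split_sentences(text)]

doc = "The model ranks sentences. It uses 7 features for the ranking."
processed = preprocess(doc)
print(processed)
```

Splitting on punctuation alone mis-handles abbreviations such as "e.g.", so a production system would likely need a more careful segmenter.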
4.2 Features Extraction
An essential part of ATS is computing a feature score for every sentence. The features include:
sentence position, sentence length, numerical data, thematic word, title word, proper noun and centroid
value [15].
A- Sentence Position (SP): the first sentence is given the highest score, and the score decreases
according to the sentence position in the document. This feature can be computed according to Eq. (1).
$$F_1(s_i) = \frac{N - i + 1}{N} \tag{1}$$
where i is the position of the sentence s_i in a document of N sentences.
B- Sentence Length (SL): this feature is computed by dividing the sentence length by the length of the
longest sentence in the document, as in Eq. (2).
$$F_2(s_i) = \frac{L(s_i)}{L_{max}} \tag{2}$$
where L(s_i) is the length of sentence s_i and L_max is the length of the longest sentence in the document.
C- Numerical Data (ND): numerical data carry important information to be included in the summary. This
feature is calculated by dividing the number of numerical data in the sentence by the sentence length,
as in Eq. (3).
$$F_3(s_i) = \frac{Num(s_i)}{L(s_i)} \tag{3}$$
$$F_7(S_i) = \sum_{w=1}^{L(S_i)} C_w \tag{7}$$
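The first three feature scores (sentence position, sentence length, numerical data) can be illustrated with a short sketch. It assumes that sentence length is measured in tokens, that "numerical data" means digit-only tokens, and that the position score takes the form (N - i + 1) / N; the function names are hypothetical.

```python
def sentence_position(i, n):
    """F1: position score for the i-th sentence (1-based) of an n-sentence document."""
    return (n - i + 1) / n

def sentence_length(tokens, max_len):
    """F2: sentence length relative to the longest sentence in the document."""
    return len(tokens) / max_len

def numerical_data(tokens):
    """F3: share of tokens that are numbers (assumed: digit-only tokens)."""
    if not tokens:
        return 0.0
    return sum(t.isdigit() for t in tokens) / len(tokens)

# Toy document of two tokenized sentences (illustrative data only).
sentences = [
    ["sales", "rose", "12", "percent", "in", "2016"],
    ["analysts", "expected", "less"],
]
max_len = max(len(s) for s in sentences)
features = [
    (sentence_position(i, len(sentences)), sentence_length(s, max_len), numerical_data(s))
    for i, s in enumerate(sentences, start=1)
]
for row in features:
    print(row)
```

Each row of `features` could then serve as part of one row of the sentence-by-feature matrix used in the later ranking step.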
In the regression model, Y is the desired output, W holds the feature weights, and p is the number
of sentences in the collected document data set. To estimate the weights of the extracted features,
the model must be trained. Seventy documents from the TAC-2011
dataset are used for training. The seven extracted features (X1, X2, …, X7) that were described in
section 5 are used as input to the model. The desired output Y is computed as the cosine
similarity between each sentence of the selected training documents and each sentence of the manually
summarized documents, as in Eq. (11).
$$Similarity(A, B) = \frac{\sum_{i=1}^{z} A_i B_i}{\sqrt{\sum_{i=1}^{z} A_i^2} \, \sqrt{\sum_{i=1}^{z} B_i^2}} \tag{11}$$
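As a rough illustration of the cosine similarity used to build the desired output Y, the sketch below compares two tokenized sentences. Representing each sentence as a term-frequency vector over the shared vocabulary is an assumption of this example; the text does not state which vector representation is used.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def term_vectors(tokens_a, tokens_b):
    """Build term-frequency vectors over the shared vocabulary (assumed representation)."""
    vocab = sorted(set(tokens_a) | set(tokens_b))
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    return [ca[w] for w in vocab], [cb[w] for w in vocab]

a, b = term_vectors(["topsis", "ranks", "sentences"], ["topsis", "ranks", "features"])
sim = cosine_similarity(a, b)
print(round(sim, 4))
```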
$$W = (X^{T} X)^{-1} X^{T} Y \tag{12}$$
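The least-squares weight estimate can be sketched in plain Python by solving the normal equations (X^T X) W = X^T Y instead of forming the inverse explicitly. This is only an illustration under assumptions: the helper names are hypothetical, the toy data are invented, and no intercept term is included (a column of ones could be appended to X if one is wanted).

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a*w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def mlr_weights(x, y):
    """Least-squares weights W satisfying (X^T X) W = X^T Y."""
    xt = transpose(x)
    xtx = matmul(xt, x)
    xty = [sum(u * v for u, v in zip(row, y)) for row in xt]
    return solve(xtx, xty)

# Toy data: 4 sentences, 2 features; Y was generated as 0.5*x1 + 2.0*x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
Y = [0.5, 2.0, 2.5, 3.0]
weights = mlr_weights(X, Y)
print(weights)
```

With seven features, X would have seven columns and one row per training sentence, and Y would hold the similarity-based target for each sentence.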
TOPSIS Algorithm
Step 1: Input the decision matrix {section 5.3}
Output: sentences in descending order
With Algorithm 1, all the sentences are arranged in descending order of their scores.
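For orientation, the standard textbook form of TOPSIS can be sketched as below. This is not necessarily the exact variant used by the authors: it assumes vector normalization of each column, treats every feature as a benefit criterion (higher is better), and uses invented weights and feature values.

```python
import math

def topsis(matrix, weights):
    """Return a closeness score in [0, 1] for each row (sentence) of a decision matrix.

    Assumption of this sketch: all criteria are benefit criteria.
    """
    n_cols = len(weights)
    # Vector-normalize each column, then apply the feature weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n_cols)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    # Ideal (best) and anti-ideal (worst) value per criterion.
    best = [max(row[j] for row in v) for j in range(n_cols)]
    worst = [min(row[j] for row in v) for j in range(n_cols)]
    scores = []
    for row in v:
        d_best = math.sqrt(sum((x - b) ** 2 for x, b in zip(row, best)))
        d_worst = math.sqrt(sum((x - w) ** 2 for x, w in zip(row, worst)))
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores

# Toy decision matrix: rows are sentences, columns are feature scores.
decision = [
    [1.0, 0.9, 0.2],   # sentence 0
    [0.5, 0.4, 0.0],   # sentence 1
    [0.8, 1.0, 0.3],   # sentence 2
]
w = [0.5, 0.3, 0.2]    # illustrative weights, e.g. as estimated by regression
scores = topsis(decision, w)
# Sentence indices in descending order of closeness to the ideal solution.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Because the closeness score combines distances over all weighted features, a sentence cannot rank first on the strength of a single feature alone unless that feature's weight dominates.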
Input 1: the set of ranked sentences in descending order from the TOPSIS algorithm, called
$$Precision = \frac{S_i}{S_j}$$
where S_i is the number of sentences occurring in both the system and ideal summaries, and
S_j is the number of sentences in the system summary.
$$Recall = \frac{S_i}{S_k}$$
where S_i is the number of sentences occurring in both the system and ideal summaries, and
S_k is the number of sentences occurring in the ideal summary.
$$F\text{-}Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
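The sentence-level measures defined above (Si, Sj, Sk) can be computed as in the following sketch. Sentences are identified by hypothetical integer IDs, and the balanced F-score (harmonic mean of precision and recall) is assumed here; the ROUGE toolkit itself works on n-gram overlap rather than whole sentences.

```python
def evaluate(system, ideal):
    """Sentence-level precision, recall and balanced F-score of a system summary."""
    system, ideal = set(system), set(ideal)
    s_i = len(system & ideal)   # sentences in both summaries
    s_j = len(system)           # sentences in the system summary
    s_k = len(ideal)            # sentences in the ideal summary
    precision = s_i / s_j if s_j else 0.0
    recall = s_i / s_k if s_k else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# Illustrative sentence IDs only; not data from the paper.
p, r, f = evaluate(system=[1, 4, 7, 9], ideal=[1, 2, 4, 8, 9])
print(p, r, round(f, 4))
```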
7. Experimental Results
Table 2 shows the results of our proposed MDS method and of the system summaries included in the
TAC-2011 dataset [25] using ROUGE-1.
Precision Recall F-Score Precision Recall F-Score
As is clear, the results of the proposed method are better than those of the peer summaries, for
two reasons. First, the selected features improve the
performance of the TOPSIS method. Second, most ATS methods may be dominated by
a single feature that drives the sentence score up, whereas TOPSIS accounts for the effect of all
features of the selected sentences.
8. Conclusions
The need for MDS increases with the rapid growth of information on the Internet. In this paper a
method for MDS that depends on TOPSIS has been proposed. TOPSIS has two important advantages
over other sentence ranking techniques. First, TOPSIS ranks the
sentences according to the effect of all features, whereas in other methods the effect of one feature may
exceed the effect of the other features and give the sentence a high score. Second,
the weights of the features are calculated mathematically using MLR, which overcomes the problem of
assigning weights manually.
