0% found this document useful (0 votes)

58 views22 pages

Summarization From Medical Documents A Survey

Uploaded by

nivas kolla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

58 views22 pages

Summarization From Medical Documents A Survey

Uploaded by

nivas kolla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/220103096

Summarization from Medical Documents: A Survey

Article in Artificial Intelligence in Medicine · February 2005

DOI: 10.1016/j.artmed.2004.07.017 · Source: DBLP

CITATIONS READS

213 2,643

3 authors:

Stergos Afantenos Vangelis Karkaletsis

Institut de Recherche en Informatique de Toulouse National Center for Scientific Research Demokritos
43 PUBLICATIONS 722 CITATIONS 268 PUBLICATIONS 3,813 CITATIONS

SEE PROFILE SEE PROFILE

Panagiotis Stamatopoulos
National and Kapodistrian University of Athens
45 PUBLICATIONS 1,355 CITATIONS

SEE PROFILE

All content following this page was uploaded by Vangelis Karkaletsis on 19 March 2018.

The user has requested enhancement of the downloaded file.

Artificial Intelligence in Medicine (2005) 33, 157—177

http://www.intl.elsevierhealth.com/journals/aiim

Summarization from medical documents: a survey

Stergos Afantenosa,*, Vangelis Karkaletsisa, Panagiotis Stamatopoulosb,1

a
Software and Knowledge Engineering Laboratory, Institute of Informatics
and Telecommunications, National Centre for Scientific Research (NCSR)
‘‘Demokritos’’, 15310 Aghia Paraskevi Attikis, Athens, Greece
b
Department of Informatics, University of Athens, TYPA Buildings,
Panepistimiopolis, GR-15771 Athens, Greece

Received 16 December 2002; received in revised form 21 July 2004; accepted 21 July 2004

KEYWORDS Summary
Summarization from
medical documents; Objective: The aim of this paper is to survey the recent work in medical documents
Single-document summarization.
summarization; Background: During the last decade, documents summarization got increasing atten-
Multi-document tion by the AI research community. More recently it also attracted the interest of the
summarization; medical research community as well, due to the enormous growth of information that
Multi-media is available to the physicians and researchers in medicine, through the large and
summarization; growing number of published journals, conference proceedings, medical sites and
Extractive portals on the World Wide Web, electronic medical records, etc.
summarization; Methodology: This survey gives first a general background on documents summariza-
Abstractive tion, presenting the factors that summarization depends upon, discussing evaluation
summarization; issues and describing briefly the various types of summarization techniques. It then
Cognitive examines the characteristics of the medical domain through the different types of
summarization medical documents. Finally, it presents and discusses the summarization techniques
used so far in the medical domain, referring to the corresponding systems and their
characteristics.
Discussion and conclusions: The paper discusses thoroughly the promising paths for
future research in medical documents summarization. It mainly focuses on the issue of
scaling to large collections of documents in various languages and from different
media, on personalization issues, on portability to new sub-domains, and on the
integration of summarization technology in practical applications.
# 2004 Elsevier B.V. All rights reserved.

1. Introduction
* Corresponding author. Tel.: +30 210 6503149.
E-mail addresses: [email protected] (S. Afantenos),
[email protected] (V. Karkaletsis),
New technologies, such as high-speed networks
[email protected] (P. Stamatopoulos). and inexpensive massive storage, along with the
1
Tel.: +30 210 7752222. remarkable growth of the Internet, have led to an

0933-3657/$ — see front matter # 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.artmed.2004.07.017
158 S. Afantenos et al.

enormous increase in the amount and availability of tive segments at the expense of the rest is the main
on-line documents. This is also the case for medical challenge in summarization.’’
information, which is now available from a variety of Although initial work on summarization dates
sources. However, information is only valuable to back to the late 1950s and 1960s (e.g. [3,4]), fol-
the extent that it is accessible, easily retrieved and lowed by some sparse publications (e.g. [5,6]), most
concerns the personal interests of the user. The research in the field has been carried out during the
growing volume of data, the lack of structured last decade. During these last few years, research-
information, and the information diversity have ers examined a great variety of techniques and
made information and knowledge management a applied them in different domains and genres of
real challenge towards the effort to support the documents, in order to see which are the ones that
medical society. It has been realized that added yield the most practical results for each domain and
value is not gained merely through larger quantities genre.
of data, but through easier access to the required This survey presents the potential of summariza-
information at the right time and in the most sui- tion technology in the medical domain, based on the
table form. Thus, there is a strong need for examination of the state of the art, as well as of
improved means of facilitating information access. existing medical document types and summariza-
The medical domain suffers particularly from the tion applications. An important aspect of this survey
problem of information overload since it is crucial is that it is not restricted to a mere examination of
for physicians and researchers in medicine and biol- the various summarization techniques, but it exam-
ogy to have quick and efficient access to up-to-date ines the issues that arise in the use of these tech-
information according to their interests and needs. niques taking into account the characteristics of the
Considering, for instance, the scientific medical medical domain.
articles [1, p. 38] state that: ‘‘. . .there are five The structure of the survey is as follows. Second
journals which publish papers in the narrow speci- section presents a roadmap of summarization com-
alty for cardiac anesthesiology, but 35 different prising the factors that have to be taken into
anesthesia journals in general; approximately 100 account and the main techniques considered so
journals in the closely related fields of cardiology far in the summarization literature. Third section
(60) and cardiothoracic surgery (40); and over 1000 presents different types of medical documents and
journals in the more general field of internal med- the requirements that they introduce to the sum-
icine.’’ The situation becomes much worse if one marization process. Fourth section examines the
considers relevant journals or newsletters in other techniques used so far for summarization in medical
languages, Web sites with relevant information, documents. Finally, fifth section summarizes the
medical reports, etc. most interesting remarks of this survey and presents
Given the number and diversity of medical infor- promising paths for future research, while last sec-
mation sources, methods must be found that will tion concludes the paper.
enable users to quickly assimilate and determine the
content of a document. Summarization is one such
approach that can help users to quickly determine 2. Summarization roadmap
the main points of a document. Radev et al. [2]
provide the following definition for a summary: ‘‘A A summarization system in order to achieve its task
summary can be loosely defined as a text that is takes into account several factors. These factors
produced from one or more texts, that conveys concern mainly the type of input documents, the
important information in the original text(s), and purpose that the final summary should serve, and
that is no longer than half of the original text(s) and the possible ways of presenting a summary. Sum-
usually significantly less than that. Text here is used mary evaluation is also an important issue. These
rather loosely and can refer to speech, multimedia factors are examined in the following sections.
documents, hypertext, etc. The main goal of a Various techniques that have been used so far for
summary is to present the main ideas in a document document summarization are also presented. This
in less space. If all sentences in a text document presentation is necessary for the examination of
were of equal importance, producing a summary existing approaches to summarization from medical
would not be very effective, as any reduction in the documents.
size of a document would carry a proportional
decrease in its informativeness. Luckily, informa- 2.1. Summarization factors
tion content in a document appears in bursts, and
one can therefore distinguish between more and A detailed presentation of the factors that have to
less informative segments. Identifying the informa- be taken into account for the development of a
Summarization from medical documents: a survey 159

summarization system has been given in [7]. How- supposed to serve when presented to its reader, it
ever, as it is noted there, all these factors are hard can either be indicative or informative. An indica-
to define and therefore it is very difficult to capture tive summary does not claim any role of substituting
them precisely enough to guide summarization in the source document(s). Its purpose is merely to
various applications. The following presentation of alert its reader in relation to the contents of the
factors adopts the main categorization presented in original document(s), so that the reader can choose
[7]: input, purpose and output. For each of these which of the original documents to read further. The
categories, the factors considered as the most purpose of an informative summary, on the other
important ones are presented. hand, is to substitute the original document(s) as far
as coverage of information is concerned. Apart from
2.1.1. Input factors the indicative and informative summaries, there are
The main factors in this category are the following. also critical summaries [7,8], but, as far as we know,
no actual summarization system creates critical
2.1.1.1. Single-document versus multi-docu- summaries.
ment. This is the unit input parameter or the span
parameter, as Sparck-Jones [7] and Mani [8] respec- 2.1.2.2. Generic versus user-oriented summar-
tively call it, which in simple words is the number of ies. This factor concerns the information a system
documents that the system has to summarize. In needs to locate in order to produce a summary.
single-document summarization the system pro- Generic systems create a summary of a document
cesses just one document at a time, whereas in or a set of documents taking into account all the
multi-document summarization more than one information found in the documents. On the other
document are processed by the system. hand, user-oriented systems try to create a sum-
mary of the information found in the document(s)
2.1.1.2. Language. Another input factor is the which is relevant to a user query. In a sense, we can
number of languages in which the input documents say that the query-oriented summarization systems
are written. So, a system can be monolingual, multi- are user-focused, adapting each time to the verbally
lingual or cross-lingual. In the first case, the output expressed needs of the users, as viewed through the
language is the same as the input language. In the query they make or through their model (persona-
case of multilingual summarization systems, the lized summaries).
output language is the same as the input language,
but the system can handle a certain number of 2.1.2.3. General purpose versus domain-specific.
languages. In the final case of cross-lingual summar- General-purpose systems can be easily ported to a
ization, the system can accept a source text in a different domain (e.g. financial, medical). This can
specific language and deliver the summary in be done, for instance, by changing the resources
another language, not necessarily the same as the that characterize the domain (e.g. keywords, a
input one. domain-specific ontology), or by tuning specific
parameters which concern the selection of the most
2.1.1.3. Text versus multimedia summaries. An- appropriate techniques for the domain. On the
other important factor is the medium used to repre- other hand, domain-specific systems are able to
sent the content of the input document(s), as well as process documents belonging to a specific domain.
the output summary. Thus, we have text, or multi-
media (e.g. images, speech, video apart from tex- 2.1.3. Output factors
tual content) summarization. The most studied case These factors are related to the criteria that are
is, of course, text summarization. However, there used to judge the quality of the resulting summary
are also summarization systems that deal, for exam- as well as with the type of summary in terms of
ple, with the summarization of broadcast news [9] whether this is an extract from the original docu-
and of diagrams [10]. ment(s) or an abstraction.

2.1.2. Purpose factors 2.1.3.1. Output quality. The developer of a sum-

These factors concern the possible uses of the summarization system has to specify certain qualitative
mary, the potential readers of the summary, as well or quantitative criteria, which are related to the
as the domain(s) that must be covered by the sys- specific summarization task and the evaluation
tem. method that will be used (see Evaluation). Such
criteria may be among others the completeness,
2.1.2.1. Informative versus indicative summarie- the accuracy, the coherence of the resulting sum-
s. According to the function that the summary is mary, etc. If accuracy is crucial for a specific task,
160 S. Afantenos et al.

then the system developer must tune its system

Intrinsic, extrinsic

Intrinsic, extrinsic
accordingly, i.e. producing accurate summaries even
though these do not contain all the relevant results.

Evaluation

Extrinsic
Intrinsic

Intrinsic

Intrinsic
2.1.3.2. Extracts versus abstracts. Considering the
relation that the summary has to the source docu-
ment(s), a summary can either be an extract or an
abstract. An extract involves the selection and ver-

Language processing (to identify keywords)

Graph-based, statistics (cosine similarity,

batim inclusion of ‘‘material’’ from the source docu-

(to identify the RST relations markers)

use of thematic keywords, no revision
ment(s) in the summary; this ‘‘material’’ is usually

Statistics (Edmundsonian paradigm),

sentences, paragraphs or even phrases. The

Graph-based, cohesion relations,

Tree-based, language processing

excerpted textual units can be included in the
summary verbatim, or they can be processed further
in order to smooth the text flow. An abstract, on the

use of thesauri, revision

other hand, involves the identification of the most

vector space model)

language processing
salient concepts prevalent in the source docu-
ment(s), the fusion and the appropriate presenta-
tion of them, usually through Natural Language

no revision

no revision
Generation.

Method
2.2. Evaluation

Although the summarization community considers

Text regions (sentences,

evaluation as a critical issue, it still remains unclear

paragraphs, sections)
what the evaluation criteria should be. This is
mainly related to the subjective aspect of summar-
ization, in terms of whether or not a summary is of
‘‘good’’ quality. Existing evaluation techniques can

Paragraphs
Sentences

Sentences

Sentences
Sentences

Sentences
be split into two categories, intrinsic and extrinsic
Output

ones (see Section 5 in [11]). An intrinsic method

evaluates the outcome of a summarization system
independently of the purpose that the summary is
(scientific articles on specific topics)

supposed to serve. An extrinsic evaluation, on the

Representative systems employing extractive techniques

other hand, evaluates the produced summary in

Generic, domain-specific (news)

User-oriented, General purpose

User-oriented, domain-specific
(scientific and technical texts)

terms of a specific task.

An intrinsic method can measure the quality of
Generic, general purpose
Generic, domain-specific

Generic, domain-specific

Generic, domain-specific
the summary using criteria such as the integrity of
its sentences, the existence of anaphors without
(scientific articles)
(technical papers)

their referents (for an extract), the summary read-

ability (for an abstract), the fidelity of the summary
compared to the source document(s). Another way
to perform an intrinsic evaluation is to have human
Purpose

subjects create a ‘‘gold’’ summary, i.e. an ideal one,

which will be compared to the summary created by
the system. In this case evaluation can be more
multilingual (English,

quantitative and measure things such as precision

multilingual, text

or recall. The problem with this approach is that it is

Single-document,

Single-document,
Multi-document,

Multi-document,

usually difficult to make people agree on what

Chinese), text
English, text

English, text

constitutes a ‘‘gold’’ summary. One way to sidestep

this problem is to employ a utility-based measure, in
which a sentence is not assigned a Boolean value
Input

(belonging or not to the summary), but instead a

value in a scale, according to the opinion of each
Table 1

[18,19]

judge. Those values are then averaged and the top

[13]

[14]

[15]

[16]

[17]

sentences can be considered as forming the ‘‘gold’’

[3]

[4]

summary.
Summarization from medical documents: a survey 161

In an extrinsic method, the summary is evaluated normally based on a formula, which assigns a weight
in relation to the particular task it is supposed to to each sentence based on various factors. For
serve. Thus, such an evaluation can greatly vary example, the cue phrases or keywords that the
from system to system. Relevance assessment is an sentence contains, its location in the document,
extrinsic method and is usually performed by show- the fact that it may contain some non-trivial words
ing judges a document (summary or source) and a that are also found in the sections’ headings of the
topic and asking them to determine whether that document. The problem is that most of the times the
document is relevant to the topic. If, on average, produced summary is suffering from incoherencies
the choices for the source document and the sum- (semantic gaps, anaphora problems). Some of the
mary are the same, then the summary scores high in systems falling under this category post-process the
the evaluation. Of course, instead of providing a produced summary using revision techniques in
single topic, a list of topics can be provided asking order to resolve such problems.
the judges to choose one of them. Another example The second category of extractive techniques
is the evaluation of reading comprehension. In this concerns the creation of a graph (or tree) repre-
case, judges are given a document (either the sum- sentation of the document(s) to be summarized
mary or the source document) and are asked a set of exploiting machine learning and/or language pro-
questions. Their answers to these questions deter- cessing techniques. Several different representa-
mine their comprehension of the text. If the answers tions can be used:
to the summary and the corresponding source docu-
ment(s) are similar, then the summary is positively The nodes of the graph are the paragraphs of the
evaluated. document to be summarized and the edges repre-
sent the similarity between the paragraphs they
2.3. Summarization techniques connect. The paragraphs corresponding to nodes
with many edges can be extracted in order to form
Summarization techniques can be classified accord- the summary of the document.
ing to the factors presented in Summarization fac- The nodes of the graph are text items such as
tors. For example, they can be classified according words, phrases, proper names and the edges are
to the number of input documents (single-document cohesion relationships between these items, such
versus multi-document), to the type of these docu- as coreference, hypernymy. Once the graph for
ments (textual versus multimedia), to the output the document is created, the most salient nodes
types (extractive versus abstractive), etc. In this in the graph can be located, based for example on
section, the various summarization techniques are a user query. The set of salient nodes can then be
presented under the following classification: used to extract the corresponding sentences,
paragraphs, or even sections that will form the
extractive; summary.
abstractive; A tree representation can be created exploiting
multi-document; relations from the rhetorical structure theory
multimedia. (RST) [12]. The tree nodes are sentences, which
are connected using RSTrelations such as elabora-
An extra category is added to include a technique tion, antithesis, etc. Then in order to get the most
that although it presents similarities with techni- salient sentences, the tree must be traversed in
ques in the other categories, it has the special ch- order to build a partial ordering of the sentences
aracteristic of approaching summarization from a in terms of their importance. According to the
cognitive perspective aiming at simulating the hu- target compression rate, the top n sentences can
man summarizers’ tasks. be extracted and presented as a summary.

2.3.1. Extractive techniques Table 1 presents representative systems employ-

Two representative categories of extractive techni- ing the extractive techniques presented above, tak-
ques implemented by existing systems (see Table 1) ing into account the factors specified in
are presented below. Summarization factors. The input field concerns t-
The first one concerns statistical techniques he number of input documents, their language, and
based on what [8, pp. 47—53] calls the Edmundso- whether they contain only text. The purpose
nian paradigm. In this paradigm, each sentence field concerns whether the resulting summary is
should be ranked in relation to the other sentences, indicative or informative (in most of the cases it
so that the n highest ranked sentences could be is difficult to judge this), generic or user-oriented,
extracted and form the summary. The ranking is as well as whether the technique is a general
162 S. Afantenos et al.

purpose or a domain-specific one. The output field

Evaluation of system
concerns the ‘‘material’’ used to create the sum-
mary (sentences, paragraphs, sections). The table

components
also includes a field about the specific methods used

Evaluation
(e.g. statistics, language processing, use of revi-

Extrinsic
Intrinsic
sions, etc.), as well as a field on the evaluation a-
pproach. The lack of the corresponding information
in some field values denotes that a definite answer
cannot be given.

sentences, ontology-based annotation, NLG

Statistics (for scoring each UNL sentence),
2.3.2. Abstractive techniques

Syntactic processing of representative

Script activation, canned generation
The most straightforward way of creating abstracts
is to identify in a way the most important informa-
tion in the document(s), appropriately encode it and

Information extraction, NLG

removing redundant words,

then feed it to a natural language generation (NLG)
system [20,21], which generates the summary. Two

combining sentences
representative categories of abstractive techni-
ques, implemented by existing systems (see Table

sentences, NLG
2), are presented below.
In the first category the process of identifying and
encoding the most important information in the docu-

Method
ment(s) can be performed using prior knowledge
about the structure of this information. This knowl-
edge is represented through cognitive schemas such
as frames, scripts, templates. Thus, in such cases, the

representation in UNL
summary produced is not a generic one, but a rather
user-oriented one since the schema can be considered

Ontology-based
as a sort of user query. Different approaches in this

representation
category may be the following:

Conceptual
Templates

Clusters
Output
Scripts

Use of a script, i.e. a sort of a simple-structured

template with slots identifying common impor-
tant events over a domain. There is a separate
Table 2 Representative systems employing abstractive techniques

script for each domain. When a document is

Informative, user-oriented,

processed, the corresponding script is activated

Generic, domain-specific

and its slots filled with information from the

document. The activation can be performed
through the appearance of certain words, or by
domain-specific

domain-specific

domain-specific
(news articles)

the activation of another script. After the script

has been activated and filled, the summary can be
generated using in most cases simple techniques
Purpose

(canned text generation versus the more sophis-

ticated NLG techniques) due to the rather simple
structure of scripts.
Use of a MUC1-like domain-specific template, a
multilingual, text
Single-document,

Single-document,

sort of a relational database, having a more

Multi-document,

complex structure compared to scripts. The tem-

English, text

plate can be filled from a document using infor-

mation extraction techniques. The filled
Input

templates can then be processed in order to

transform them in an appropriate form for the

1
MUC (Message Understanding Conferences) were evalua-
[23,24]

tion conferences for information extraction systems; see

[22]

[25]

[26]

[27]

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
index.html.
Summarization from medical documents: a survey 163

NLG system. Processing is done using various While a technique extracting textual units — such
semantic operators, such as Change of Per- as sentences — from a single document, may cope
spective, Contradiction, Addition/Elaboration, with redundancy and preserve the coherence of the
Refinement, Agreement, etc. original document, extracting textual units from
multiple documents increases the redundancies
The second category involves techniques that do and incoherencies, since textual units are not pre-
not use prior knowledge about the structure of the viously connected across documents. Thus, abstrac-
important information to be used in the summary, tive techniques seem more appropriate for multi-
but instead they produce a semantic representation document summarization. However, extractive
of the document(s), which is then fed to the NLG techniques can also be used followed by a post-
system. Different approaches in this category may processing stage in order to ensure summary coher-
be the following: ence and cope with redundancy. In both cases, the
first step is to identify those documents talking
The documents are linguistically processed in order about a specific topic. This can be done, using
to identify noun phrases, verb phrases that can be classification techniques in terms of relevance to
linked to the concepts, attributes and relations of a a query or existing topic models, or using clustering
domain-specific ontology. Ontology-based annota- techniques. The next step is to identify the impor-
tions can then be used to select the important tant information to be added in the summary in the
document regions (sentences, paragraphs). These group of the topic-specific documents.
regions are then converted into some semantic In the extractive-based category, a system can
representation using the results of the linguistic apply firstly extractive techniques to each docu-
processing and the ontology-based annotations. ment separately in order to locate and rank the
This representation is then fed to an NLG system most important document regions (phrases, sen-
that produces the abstract. tences, paragraphs). The highest ranked regions
The summarization system can identify informa- from all the documents can then be combined and
tional equivalent paragraphs in the input docu- re-ranked using similarity measures, in order to
ment(s) using clustering techniques. From each minimize redundancy. Some cohesion rules can then
theme some representative sentences are be applied to the final set of document regions in
extracted, which can be analyzed syntactically order to produce the summary. Another approach
and then fed to a sentence generator in order to would be to work with all the documents from the
produce the abstract. beginning. For instance, the topic model produced
from the group of input documents can be compared
Table 2 presents representative systems employ- to sentence vectors from all the documents in order
ing abstractive techniques. Table 2 fields are the to get the, most similar to that topic, sentences.
same with the fields in Table 1. There are differ- The abstractive-based category can also involve
ences between the two tables concerning the out- extractive techniques at a pre-processing stage. In
put and the method fields, which are filled with such a case, the extracted document regions are
different values. The output field in the abstractive linguistically processed in order to be converted into
techniques concerns the ‘‘semantic representa- some representation, which can be used by an NLG
tion’’ used to create the summary (scripts, tem- system, which will then produce the abstract.
plates, ontology-based representation, clusters of Another approach involves the establishment and
informational equivalent document regions). The use of a set of intra- and inter-document relation-
method field is filled with the specific methods used ships, which could hold between various textual
(script activation, information extraction, syntactic units of the documents, such as words, sentences,
processing, NLG, etc.). paragraphs or whole documents, and which could
guide the identification of the most salient informa-
2.3.3. Multi-document summarization tion across the multiple documents. Such relation-
techniques ships concern not only the similarities across
Radev et al. [2] define multi-document summariza- documents, but also their differences (e.g. equiva-
tion as the process of producing a single summary of lence, contradiction, elaboration, etc.).
a set of related source documents. As they note, this To cope with the inherent problems of multi-
is a relatively new field where three major problems document summarization (redundancy, incoheren-
are introduced: (1) recognizing and coping cies), a different output representation can be used
with redundancy, (2) identifying important differ- instead of producing a summary containing, for
ences among documents, and (3) ensuring summary instance, the most salient sentences across docu-
coherence. ments or even their abstraction. Ando et al. [28] use
164 S. Afantenos et al.

Evaluation
a scatter plot per topic presenting the extracted

Intrinsic

Intrinsic
sentences visually to the user.
Table 3 includes representative examples of
multi-document summarization techniques. Com-
pared to Tables 1 and 2, there are differences
concerning the values in the output and the method

Statistics (vector space model), language processing

fields. The output field concerns whether the output
is an extract or abstract or even another represen-

Statistical (support vector machine, MD-MRD)

tation format. The method field is filled with the
specific methods used (statistical, language proces-

marginal relevance — MD-MRD), revision

Intra- and inter-document relationships

sing, revision stage, use of intra- and inter-docu-

Statistical (multi-document maximal

ment relationships).

2.3.4. Multimedia summarization techniques

In this section, work on multimedia summarization is
discussed. This involves such diverse fields as dialog
summarization, summarization of diagrams, and
video summarization. Due to the limited number
of relevant works and their field-specific features,
only one approach per field is presented.

Method
2.3.4.1. Dialogue summarization. Zechner [32,33]
works on dialogue summarization in unrestricted
domains and genres. Zechner’s works with tran-
scripts of human dialogs, which are generated either

most salient sentences

manually or automatically (in this case Zechner also

A scatter plot of the

Representative systems employing multi-document summarization techniques

addresses the issue of speech recognition errors).

The major steps in Zechner’s work are the following.
In the first step, input tokenization, all noise is
removed and the transcript is tokenized. Disfluency
Extracts

Extracts
Output

detection follows in which false starts, restarts or

repairs and filled pauses are annotated and cor-
rected. In this step the boundaries of the speakers’
sentences are also detected. In the next step, cross-
speaker information linking, pairs of question—
User-oriented, general purpose

answer among the speakers are identified and anno-

tated. Those pairs are extracted together later
Generic, general purpose

Generic, general purpose

when producing the summary, thus making the

resulting summary more coherent and informative.
The following step is topic segmentation where
segments of the discussion on a certain topic are
identified and a list of keywords, for each topic, is
Purpose

extracted. The final step, sentence ranking and

selection, uses a version of the maximal marginal
relevance (MMR) algorithm [34] to create a summary
of extracted sentences, for each topic.
Multi-document,

Multi-document,

2.3.4.2. Diagram summarization. Futrelle [10]

English, text

presented a preliminary and only partially imple-

mented work on diagram summarization, yet quite
innovative. His aim was to present a summary of
Input

the diagrams in a scientific paper, either by

selecting (i.e. extracting) one or more diagrams
Table 3

from the paper or by distilling (i.e. simplifying) a

[29]

[30]

[28]

[31]

diagram, or even by merging several diagrams.

Futrelle assumes that a structural description of
Summarization from medical documents: a survey 165

the diagrams can be obtained as metadata from the mented in the SimSum system [36]. SimSum is imple-
author or by parsing the diagrams. In order to mented as an object-oriented blackboard, which
achieve his goals he takes into account not only involves 79 object-oriented agents, each one per-
the structural description of each diagram, but forming a relatively simple task. For instance, the
the text in its caption or in the diagram itself. Context agent checks whether the context condi-
tions of the query are met by an input document, the
2.3.4.3. Video summarization. Merlino and Mark TexttoProposition agent transforms input sentences
[9] extract information from various media in order into propositions known to the domain ontology, the
to provide a summary of a broadcast news story. They Redundancy agent checks if a proposition has
use MITRE’s broadcast news navigator (BNN [35]) already been introduced. All the agents cooperate
which helps searching and browsing of news stories, in order to deliver the summary. They can access a
not from just one channel, but from a variety of common knowledge database, which contains the
sources. They use silence and speaker change from text and ontology concepts. Furthermore several
the audio, anchor and logo detection from the video RST relations have been implemented for discourse
and closed-captioned text in order to segment the level structures of the text.
stream into news stories. For the purpose of present-
ing a summary, they experiment with a diversity of
presentation methods, which include mixed-media. 3. The medical domain
The first one is key-frames, i.e. important single
frames and shots i.e. an important sequence of Medical Informatics represents the core theories,
frames. The second one is extracted single sentences, concepts and techniques of information applications
based on weights according to the presence of named in medicine. It involves four different levels depend-
entities (organizations, persons and places). The final ing on the focus from the cell to the population [40]:
one is named entities keywords.
Bioinformatics concerns molecular and cellular
2.3.5. Summarization from a cognitive science processes, such as gene sequences;
perspective Imaging informatics concerns tissues and organs,
The techniques presented so far pay little attention such as radiology imaging systems;
to the ways used by humans to create the summaries Clinical informatics concerns clinicians and
themselves. Endres-Niggemeyer et al. [36—39] on patients, involving applications of various clinical
the other hand, try to simulate the human cognitive specialties;
process of professional summarizers. They aim at Public health informatics concerns populations
developing an empirical model for summarization, involving applications such as the disease surveil-
based on professional summarizers, and to imple- lance systems.
ment this model into a system, which will ‘‘imitate’’
human process of summarizing. For this purpose, Medical information distributed through all the
they recruited six professional summarizers who above levels concerns various document types: sci-
worked on nine summarization processes. The whole entific articles, electronic medical records, semi-
process included the division of the tasks of the structured databases, web documents, e-mailed r-
professional summarizers into sub-tasks, the inter- eports, X-rays images, videos. The characteristics of
pretation of each sub-task into a more formal frame- each document type have to be taken into account
work (i.e. giving a name, functional definition, in the development of a summarization system. S-
etc.), and the hierarchical organization of the cientific articles are mainly composed of text and, in
resulting strategies according to their function. several cases, they have a sectioning that can be
Despite the diversity of the technical background exploited by a summarization system. Electronic
and cognitive profile of each professional summar- medical records contain structured data, apart from
izer and their idiosyncrasies on creating summaries, free text. Web documents may appear in health
the results are quite stunning. Quoting from [36, p. directories and catalogs, which need to be searched
129]: ‘‘83 strategies are used by all experts of the first in order to locate the interesting web pages.
group, 60 strategies are shared by five experts, The web pages layout is also another factor that
another 62 strategies are common knowledge of needs to be taken into account. E-mailed reports are
four summarizing experts, 79 strategies belong to mainly free text without any other structure. X-rays
the repertory of three summarization experts, 101 images, videos such as echocardiograms, represent
strategies are used by two experts, and 167 strate- a completely different document type, where text
gies are individual.’’ From all these strategies, 79 may not be included at all or may be a part of an
agents, which simulate them, were finally imple- image.
166 S. Afantenos et al.

Compared to other domains, medical documents of 270 Medical Journals over a period of 5 years. This
have certain unique characteristics that make very corpus is annotated with information such as the
challenging and attractive their analysis. Unique- Medline identifier, MeSH terms assigned by humans,
ness of medical documents is due to their volume, title, abstract, publication type, source and
their heterogeneity, as well as due to the fact that authors. The corpus was created for the experi-
they are the most rewarding documents to analyze, ments described in [40]. Abstracts represent a dif-
especially those concerning human medical infor- ferent type of document, since they are only
mation due to the expected social benefits (see composed of text and metadata (such as the MeSH
[41]). annotations in MEDLINE). A summarization system
must be able to exploit these metadata in several
3.1. Scientific articles ways. For instance, in multi-document summariza-
tion, the system must be able to locate first those
The number of scientific journals in the fields of abstracts that discuss a specific topic, and topic
health and biomedicine is unmanageably large, even information is found on the metadata of the
for a single specialty, making very difficult for phy- abstracts. On the other hand, a sub-language ana-
sicians and researchers to stay informed of the new lysis of the abstracts may be necessary in order to
results reported in their fields. Scientific articles identify certain characteristics that may affect the
may contain apart from text, structured data (e.g. performance of the language processing applica-
tables), graphs, or images. Therefore, depending on tion. As it is noted in Ariadne Genomics NLP white
the summarization task, the system may have to paper,4 their sub-language analysis indicated that
process various types of data. It may be the case, for ‘‘MEDLINE sentences often include idiosyncratic lin-
instance, that important information (e.g. experi- guistic constructs not necessarily reflected in gen-
mental results) is found in a table and needs to be eralized English grammar’’. This explains, as they
located and added to the summary. The article claim, why existing syntactic parsers with general
layout could also be exploited, since a large number grammar are not suitable for dealing with this type
of articles reporting on experimental results have of text.
an almost standard sectioning, the order of the
sections being Introduction, Methods, Statistical
Analysis, Results, Discussion, Previous Work, Lim- 3.3. Semi-structured databases
itations of the Study, and Conclusions. A study of the
types of scientific articles on the various fields of A number of databases have been built to provide
medicine is necessary for their processing either for access to biomedical information, such as informa-
summarization or for other language processing tion about protein function and cellular pathways.
tasks. More than 280 semi-structured databases exist cur-
rently.5 Some examples are the following:
3.2. Databases of abstracts
FlyBase: fruit fly genes and proteins;
Mouse Genome Database (MGB);
Despite the plethora of medical journals, most of
Protein Information Resource (PIR);
them are not freely accessible over the Web, due to
DIP: The database of interacting proteins.6
copyright reasons. Luckily, there are other online
databases, which contain the abstracts and citation
Let us take DIP, for example, which provides an
information of most articles on the general field of
integrated set of tools for browsing and extracting
Medicine. One such online database is MEDLINE,2
information about protein interaction networks. As
which contains abstracts from more than 3500 jour-
it is reported in [42], DIP database is implemented
nals. MEDLINE provides keyword searches and
as a relational database composed of four tables:
returns abstracts that contain the keywords. The
protein table, interaction table, method table,
abstracts are indexed according to the Medical
and reference table. The reference table lists all
Subject Headings (MeSH)3 thesaurus. Apart from
the references to different articles that demon-
the access to the abstracts, MEDLINE also provides
strate protein interactions and link them to
full citations of the articles along with links to the
MEDLINE database. Therefore, a summarization
articles themselves, in case they are online. Hersh
et al. [40] have created a corpus from Medline, 4
See http://host.ariadnegenomics.com/downloads/.
which consists of titles and abstracts from articles 5
See KDD Cup 2002 Task1: Information Extraction from
Biomedical Articles (http://www.biostat.wisc.edu/craven/
2
See http://www.ncbi.nlm.nih.gov/entrez/query.fcgi. kddcup/).
3 6
See http://www.nlm.nih.gov/mesh/meshhome.html. See http://dip.doe-mbi.ucla.edu.
Summarization from medical documents: a survey 167

system that exploits information from DIP should with interesting information for the summarization
be able to employ DIP tools for browsing the data- task, there may be interesting information stored in
base and even access relevant MEDLINE abstracts a table, etc. Web page layout should also be taken
(see previous discussion) through the reference into account in order to locate interesting informa-
table. tion inside the web page, especially in the case that
the catalog pages are generated dynamically from a
database. In addition, the identification of web
3.4. Web documents
pages that are relevant to the summarization task,
demands the use of web spidering techniques.
During the last few years, a number of specialized
Therefore, as it is the case of web information
health directories and catalogs (portals) have been
retrieval and extraction tasks in other domains
created, such as CliniWeb,7 HON,8 CISMeF,9 Medical
(see the results of the CROSSMARC project10 in
Matrix, Yahoo Health, HealthFinder. Some of these
[43]), summarizing from web documents needs to
catalogs, including the first three of the above, are
take into account the web catalog and web pages
additionally indexed with the MeSH Thesaurus. This
structure and features. It must also be able to
allows complex queries to be stated, which exploit
exploit metadata information (e.g. MeSH annota-
the hierarchical structure of the MeSH. CliniWeb, for
tions) already used by some of the existing web
instance, provides clinically oriented information,
catalogs. However, even in the cases where a web
including:
catalog is not indexed, the summarization system
must be able to employ existing medical ontologies,
1. Cataloging content per-page basis;
thesauri or lexica. This is essential for a scientific
2. Including only pages that have clinical content,
domain with rich terminology, where the informa-
i.e., excluding individual and institutional home
tion to be extracted or summarized needs to be as
pages, advertisements, and lists of links;
precise as possible.
3. Indexing with a higher level of specificity, using
MeSH as opposed to broad subject categories
such as Orthopedics or Cancer. CliniWeb provides 3.5. E-mailed reports
access to Web pages manually indexed by a large
subset (trees A—G) of MeSH, including the major With the advent of the Web, several web-based
trees, Diseases, Anatomy, and Chemicals and services, whose purpose is the exchange of opinions
Drugs. and news, have emerged. For example, ProMED-
mail11 is a public free service, which promotes
Another example of a health resources portal is the exchange of news concerning bursts of epi-
CISMeF, which aims to describe and index the main demics. Other non-free services, such as MDLinx,12
French-language health resources to assist health provide physicians and researchers with the oppor-
professionals and consumers in their search for tunity to subscribe and receive alerts concerning
electronic information available on the Internet. new findings on their specialty fields, described in
In April 2002, the number of indexed resources journal articles. The use of e-mailed reports for a
totaled over 9600 with a mean of 50 new sites each fast dissemination of epidemiological information
week. CISMeF uses two standard tools for organizing by the Internet shows an increasing success for
information: the Medline bibliographic database monitoring epidemiological events [44]. The
MeSH thesaurus and several metadata element sets, descriptive possibilities of these reports and their
including the Dublin Core. To index resources, CIS- ability to deal with unattended situations make
MeF uses four different concepts: ‘‘meta-term’’, them competitive for reporting emerging infectious
keyword, subheading, and resource type. CISMeF disease outbreaks and unusual disease patterns,
contains a thematic index, including medical spe- including biological threats. However, as [45] note,
cialities and an alphabetic index. analysts cannot feasibly acquire, manage, and
Web pages included in catalogs, such as the above digest the vast amount of information available
ones, form a different type of medical documents. through emailed reports or other information
These are .html pages, which may contain informa- sources 24 h a day, 7 days a week. In addition, access
tion from other media apart from text (e.g. images, to foreign language documents as well as the local
videos). Even in the case that they contain text only, news of other countries is generally limited. Even
there will be links pointing to other relevant pages when foreign language news is available, it is usually
7 10
See http://www.ohsu.edu/cliniweb/. See http://www.iit.demokritos.gr/skel/crossmarc.
8 11
See http://www.chu-rouen.fr/cismef/. See http://www.promedmail.org.
9 12
See http://www.chu-rouen.fr/cismef. See http://www.mdlinx.com/.
168 S. Afantenos et al.

no longer current by the time it gets translated and must be able to process the free text reports, which
reaches the hands of an analyst. This very real may be problematic due to the specific sub-language
problem raises an urgent need for the development used by the clinicians or due to the fact that the
of automated support for global tracking of infec- report is the result of dictation. Information from
tious disease outbreaks and emerging biological written reports may also be combined with informa-
threats. tion existing in structured data (tables, graphs) or
ProMED-mail is a service monitoring news on even information existing in other media (e.g. X-
infection disease outbreaks around the world, 7 days rays images, videos). The situation, however, may
a week. By providing early warning of outbreaks of become even more complex for a summarization
emerging and re-emerging diseases, ProMED aims at system that aims to summarize information for a
enabling public health precautions at all levels in a clinician, which is collected not only from the
timely manner to prevent epidemic transmission patient record but also from other records (similar
and to save lives. ProMED sources of information cases to the specific patient), from relevant scien-
include — among others — media reports, official tific articles or abstracts from journals or databases
reports, online summaries, local observers. Reports respectively. If a summarization system is to be
are also contributed by ProMED-mail subscribers. A integrated into the busy clinical workflow, it must
team of expert moderators investigate reports provide the clinician with such facilities.
before posting them to the network. Reports are
distributed by e-mail to subscribers and posted on 3.7. Multimedia documents
the ProMED-mail web site. ProMED-mail currently
reaches over 30,000 subscribers in 150 countries. Apart from documents in textual form, physicians and
ProMED-mail is also available, apart from English, in researchers produce and use several other docu-
Portuguese and in Spanish. Both of these lists cover ments, which are multimedia in nature. Such docu-
disease news and topics relevant to Portuguese- and ments can be graphs, such as cardiograms, images
Spanish-speaking countries, respectively. such as X-Rays, etc., videos, such as the various
E-mailed reports for monitoring infectious dis- echograms, e.g. echocardiograms, echoencephalo-
ease outbreaks and emerging biological threats grams, etc., or the medical videos used mainly for
represent a different type of medical documents. educational purposes, e.g. videos of clinical opera-
Such reports may contain apart from raw text, tions or videos of dialogs between the doctor and the
various types of information in attached files. The patient. Most of these documents are now transcribed
fact that these reports may be in several languages and stored in digital form, even connected to the
or may point to other sources, such as local news, specific patient record, giving the users the ability to
makes the summarization task even more difficult. A search and access them much faster than in the past.
sub-language analysis may also be necessary for This is a completely different type of medical docu-
these types of documents since they often follow ments, which contain very interesting information
a specific writing style and structure. Medical ter- that must be added to a summary. Techniques from
minology should also be taken into account, as it is areas other than language processing, such as image
the case in the other document types, exploiting processing and video analysis must also be employed
existing medical resources for the specific diseases in order to locate the information to be included in
or biological threats. the summary. In addition, several of these multimedia
documents are also linked with free text reports,
3.6. Electronic medical records which must also be used by the summarization sys-
tem. As noted in the discussion on electronic medical
Most hospitals keep a record for each of their records, a summarization system integrated in the
patients. Usually the records contain data of clinical workflow must be able to handle such docu-
patients in a standard structured form, with pre- ments. Concluding, in the medical domain, proces-
defined fields or tabular representations, as well as sing of multimedia documents is crucial for
free text fields containing unstructured informa- summarization and in general for information retrie-
tion, usually doctors’ reports about their patients val and extraction applications.
(either written reports or the result of dictation). As
Mckeown et al. [46] note, a patient record for any
single patient consists of many individual reports, 4. Summarization techniques in the
collected during a visit to hospital. For some medical domain
patients, this can be up to several hundred reports.
A system summarizing information from medical Most of the researchers extend to the medical
records needs to take several factors into account. It domain the techniques already used in other
Summarization from medical documents: a survey 169

domains. Based on the categorization given earlier Another project is MUSI [48]. MUSI stands for
in Summarization techniques, the techniques used ‘‘MUltilingual Summarization for the Internet’’
in the medical domain are classified under the and it is a cross-lingual summarization system,
following categories: which uses articles from The Journal of Anaesthe-
siology as input. The journal is freely accessible
extractive single-document summarization; online16 and its articles are written in Italian and
abstractive single-document summarization; English. MUSI takes those articles and creates sum-
extractive multi-document summarization; maries from them in French and German. The sys-
abstractive multi-document summarization; tem is query-based and it extracts sentences from
multimedia summarization; the input article according to the following criteria:
cognitive model based summarization. cue phrases, position of the sentences, query words
and compression rate. That is, MUSI follows the
Edmundsonian paradigm for the selection of the
In the following sections, various summarization sentences. Once the sentences have been
projects/systems are presented based on this cate- extracted, two approaches can be followed: either
gorization. they are used as they are to form the extractive
summary or they are converted into a semantic
representation to produce an abstractive summary.
4.1. Extractive single-document A third project exploiting extractive techniques is
summarization presented in [49] The most important aspect of this
approach is that it ranks the extracted sentences
One of the projects belonging in this category is according to the so-called cluster signature of the
MiTAP [45]. The aim of MiTAP (MITRE Text and Audio document. More specifically, their prototype system
Processing)13 is to monitor infectious disease out- takes medical documents (result of a query using a
breaks or other biological threats by monitoring search engine) as input and clusters them into
multiple information sources such as epidemiologi- groups. These groups are then analyzed for features
cal reports, newswire feeds, email, online news, with high support, called key features, forming a
television news and radio news in multiple lan- cluster signature that best characterizes each docu-
guages. All the captured information is filtered ment group. The summary is generated by matching
and the resulting information is normalized. Each the cluster signature to each sentence of the docu-
normalized article is passed through a zoner that ment to be summarized. Both the sentence and the
uses human-created rules to identify the source, cluster signature are represented using a vector
date, and other fields such as the article title and space model. The ranked sentences are then
body. The zoned messages are processed to identify selected and presented to the user as a summary.
paragraph, sentence and word boundaries as well as Johnson et al. [49] used for their experiments
part-of-speech tags. The processed messages are abstracts and full texts from the Journal of the
then fed into a named entity recognizer, which American Medical Association.
identifies person, organization and location names
as well as dates, diseases, and victim descriptions
using human-created rules. Finally, the document is 4.2. Abstractive single-document
processed by WebSumm [47], which generates a summarization
summary out of modified versions of extracted sen-
tences. For non-English sources, a machine transla- MUSI [48] is a system generating either extractive
tion system is used to translate the messages summaries (see the previous section) or abstractive
automatically into English. In addition to single- ones. In the case of abstractive summarization,
document summarization, MiTAP has recently incor- after the system has selected the sentences, it
porated two types of multi-document summariza- converts them into a predicate-argument structure
tion: Newsblaster [46]14 automatically clusters representation, instead of simply presenting them
articles and generates summaries based on each to the user. The steps in achieving that representa-
cluster, Alias I15 produces summaries on particular tion are: tokenization, morphological analysis,
entities and generates daily top 10 lists of diseases shallow syntactic parsing, chunking, dependency
in the news. analysis and mapping to the internal representa-
tion. After the representation has been achieved,
13
For more information on MiTAP, visit http://tides2000.mi- they create the summaries of those extracted
tre.org.
14 16
See http://www.cs.columbia.edu/nlp/newsblaster. You can access this online journal at http://anestit.unipa.it/
15
See http://www.alias-i.com/. esiait/esiaing/esianuming.htm.
170 S. Afantenos et al.

sentences using the natural language generation lighting the similarities and differences among the
(NLG) system Lexigen [50] for the French language documents.
and TG/2 [51] for German. The generation The input to the Centrifuser is articles, retrieved
systems produce indicative summaries of the by the search engine of the PERSIVAL system accord-
document content. Summaries include both trans- ing to the patient record and the user query. For
lated portions of the extracted sentences, and each article they create a topic tree which depicts
‘‘meta-statements’’ about the original document. the sectioning of each article. A composite topic
The latter provide the user with additional tree is then created by merging together all the
optional information about the content and struc- topic trees and adding details to each node such as
ture of the source text, the relevance of the relative typicality (i.e. how typical is that topic
extracted pieces of information as well as of the compared to the rest of the topics), position within
whole document with respect to the query, etc. the article, and various lexical forms in which it may
Users can customize the summary length, as well be expressed [53]. In the next step they try to match
as some other aspects concerning style and presen- the nodes of the topic trees with the query. The
tation. matched nodes do not contain any text, but instead
TRESTLE (Text Retrieval Extraction and Summar- they point to sections in the original documents,
ization Technologies for Large Enterprises) is a from which the most representative sentences
system, which produces single sentence summaries should be extracted. Since the compression rate
of Scrip17 pharmaceutical newsletters [52]. Their posed will not always allow for each topic to receive
system is in essence an Information Extraction sys- a sentence, the first step is to choose which topics
tem, which relies heavily on Named Entity (NE) are going to receive a sentence. In the next step
recognition. For this system, also drug names and they choose for each topic the representative sen-
diseases are named entities, apart from the classical tences. The final step for the creation of the sum-
ones, such as organization, person and location. mary involves the ordering of those sentences,
TRESTLE allows users to navigate through the Scrip which is achieved by first ordering the topics accord-
articles, and thus find the information that they are ing to each topic’s typicality, and then ordering the
interested in, using the named entities that the sentences themselves inside every topic, according
system has extracted, which are links to the original to the physical position of every sentence.
articles from which the NEs have been extracted.
Apart from this, TRESTLE also creates single sen-
tence summaries for each newsletter from the tem- 4.4. Abstractive multi-document
plate that was filled by the Information Extraction summarization
process. A link is also provided to the original news-
letter. Apart from informative extractive multi-document
summaries, Centrifuser creates indicative abstrac-
tive multi-document summaries, as well, which are
4.3. Extractive multi-document
used by the PERSIVAL users for searching papers. As
summarization
noted in Extractive multi-document summarization,
the approach of Kan et al. [53,54] leads to nodes in
Although the production of summaries from multiple
the topic trees which match with the query of the
documents is usually done with abstractive techni-
user. This, they argue, can be the first phase for
ques, Kan et al. [53,54] follow a different approach.
Natural Language Generation (NLG). In the next step
They argue that different types of summaries, such
of NLG, which they call planning, they try to figure
as indicative or informative, serve different infor-
out which nodes of the topic trees they will sum-
mational purposes and both can be useful, and that
marize. To achieve this, they determine which
extracting sentences for the creation of an informa-
nodes are relevant, irrelevant and intricate, based
tive multi-document summary ‘‘is well accepted
on how deep the nodes are, compared with the
since it is simple, fast and easy to evaluate’’. Their
query node, i.e. the node that matches the user
system, Centrifuser, which is the summarization
query. Thus, nodes that are descendants of the
engine of the PERSIVAL (PErsonilized Retrieval and
query node and are below depth k are considered
Summarization of Image, Video and Language) pro-
intricate, above depth k relevant and all the other
ject,18 produces both indicative and informative
nodes (i.e. the ones that are not descendants of the
multi-document summaries, with the aim of high-
query node) are irrelevant. In the final NLG step,
17
See http://www.pjpub.co.uk for more information on Scrip realization, the ordered information is converted to
newsletters. text. For a more thorough treatment of Centrifuser,
18
See http://persival.cs.columbia.edu/. see [55].
Summarization from medical documents: a survey 171

Apart from the abstractive indicative summaries answer, but instead gives as much information as
that Centrifuser produces, PERSIVAL produces possible about the question using some of the
another type of abstractive summary [1,56]. These query keywords.
summaries are not concerned with highlighting the
similarities and differences among several medical The input articles are first classified automati-
articles, but with the creation of an informative cally into three categories: prognosis, treatment,
abstractive summary. That summary is tailored diagnosis. The next step involves the identification
according to the preferences of two different types and extraction of the results, i.e. the tuples men-
of users: the physician, the patient or her relatives. tioned above. For this purpose, the authors are
The system should identify in the documents and exploiting the ‘‘rigid’’, as they call it, structure
extract tuples of the form: (Parameter(s), Finding, of the medical articles. This means that they try to
Relation). The relations can be any of the following locate the Results section and select the sentences
six types: association, prediction, risk, absence of that are relevant to the patient. The selected se-
association, absence of prediction and absence of ntences are then passed to the extraction module,
risk. They call those tuples results. Elhadad and that extracts in a template form, the following i-
Mckeown [56] using empirical methods, i.e. inter- nformation: the finding(s), the parameters, the
views with the physicians, concluded that a sum- relation, the degree of dependence of the para-
mary should fulfill the following qualitative criteria, meters, the article and the sentence it has been
in relation to the results: extracted from and various other minor informa-
tion. The templates are filled with the aid of hand-
Completeness and accuracy. The results should be crafted patterns.
complete and accurate, in the sense that all the The next step involves the determination of
relevant results and only them should be which portions, if any, of the extracted parameters
included. are relevant with the patient record. After that, the
Repetitions and contradictions. The system resulting templates are merged and ordered. To
should identify repetitions and contradictions achieve this, the templates are rendered into an
among the results. In order to do so, Elhadad internal ‘‘semantic’’ representation, in the form of
and Mckeown [56] have created a representation a graph. From this graph, they are able to identify
of the results, which allows them to identify repetitions and contradictions. A repetition occurs
relations such as subsumption and contradiction if two nodes are connected by more than one vertex,
among the results. and the vertices have ‘‘similar’’ types. What is
Coherence and cohesion. Coherence for [56] is similar and what is not has been ‘‘established’’ in
established by ‘‘accurate aggregation and order- interviews with physicians. A contradiction occurs in
ing of the related results’’. Cohesion is defined as the same situation, but now the vertices have dif-
follows: ‘‘two sentences are part of the same ferent types. Repetitions and contradictions are
paragraph, if and only if they are related.’’ used in order to create a more coherent summary.
Related are the sentences that present either With this method they manage to perform the mer-
the same finding, or the same parameter(s). ging of the templates. For the ordering, they use the
following criteria:
The system described in [56] takes input from
three different sources: Query based: a relation that answers the user
query is weighted higher.
Patient record. In general, the patient record Salience based: recitations and contradictions are
consists of structured documents, usually in tab- weighted higher.
ular form, and unstructured documents, and Domain based: studies with physicians show that
sometimes it can be very large. some relation types are more interesting than
Journal medical articles. Their system takes others. For instance a risk relation is weighted
as input a vast amount of online articles from higher than an association relation.
medical journals on the field of cardiology. In Source based: dependent relations from the same
fact, the articles that are input to the system template are presented together.
are the ones that globally match the patient, i.e.
the ones that contain information relevant to the The final step involves the creation of the sum-
patient. mary, through NLG techniques. In the final
The user query. Although the physician’s query is summary, all the medical terms are hyperlinked
posed in natural language, the system does not try to their definitions. This is achieved by conne-
to fully understand the question and give an cting the system of Elhadad and Mckeown [56]
172
Table 4 Summarization systems from medical documents
Input Purpose Output Method Evaluation
[45] Single-document Indicative, user-oriented, Sentences (extracts) Language processing (named entity Extrinsic
(also multi-document), mutilingual, domain-specific recognition, machine translation),
multimedia (text, audio, video) machine learning
[48] Single-document, multilingual, text Indicative, user-oriented, Sentences (extracts), Statistics (sentences extraction), Intrinsic, extrinsic
domain-specific abstracts language processing (semantic
representation for abstraction)
[49] Single-document, monolingual, text Indicative, generic, Sentences (extracts) Statistics (vector space model)
domain-specific
[52] Single-document, monolingual, text Indicative, generic, Abstracts Language processing
domain-specific (information extraction)
[55] Multi-document, monolingual, text Indicative-informative, Extracts, abstracts Statistics (clustering using similarity Extrinsic
generic, domain-specific measures), language processing
[56] Multi-document, monolingual, text Informative, user-oriented, Abstracts Language processing (information
domain-specific extraction, NLG)
[58] Single-document, video Generic, domain-specific Video sequences Image and video processing
(echocardiograms) (extracts)
[59] Single-document, video Generic, domain-specific Video sequences Image and video processing
(clinical operations, (extracts)
dialogues, presentations)
[61,62] Multi-document, monolingual, text Informative, user-oriented, Abstracts Agents simulating summarization tasks,
domain-specific language processing

S. Afantenos et al.
Summarization from medical documents: a survey 173

with DEFINDER, a text mining tool for extrac- segments of videos they create the dynamic sum-
ting definitions of terms from medical articles mary.
(see [57]).
Xingquan et al. [59], follow a similar approach to
4.5. Multimedia summarization [58] in order to parse the video stream into physical
units. Then video group detection, scene detection
Ebadollahi et al. [58] and Xingquan et al. [59] pre- and clustering strategies are used to mine the video
sent systems performing summarization of docu- content structure. Various visual and audio feature
ments which have multimedia content, processing techniques are utilized to detect some
echocardiograms and medical videos respectively. semantic cues, such as slides, face and speaker c-
The work presented in [49] is part of the PERSIVAL hanges, etc. within the video, and these detection
project mentioned above. In their study they are results are joined together to mine three types of
concerned with echocardiograms (ECGs). ECGs are events from the detected video scenes (presenta-
usually videotaped for archival purposes and tions by doctors or experts on video topics, clinical
recently they have started to be transcribed into operations to present details of diseases, and dia-
a digital format, which helps clinicians, and facil- logs between doctors and patients). Based on mined
itates the task of summarizing them. Summarizing video content structure and event information, a
an ECG, and video in general as seen in the work of scalable video skimming and summarization tool,
Merlino and Mark [9], involves extracting the most ClassMiner, has been constructed to visualize the
interesting video frames, which are called key- video overview and help users access video content.
frames, which enable the user to easily navigate Their system utilizes a four-layer video skimming,
through the ECGs and view their essential parts. For where levels 4 through 1 consist of representative
[58] summarizing an ECG involves two things: par- shots of clustered scenes, all scenes, all groups, and
sing the ECG and selecting the key-frames. The aim all shots of the video, respectively.
of the parser is to temporally segment the
sequences of the video into smaller units, which 4.6. Cognitive model based summarization
are called shots. A shot is a sequence of frames in
which the camera is uninterrupted. In the context of Based on the cognitive model used in the SimSum
ECG videos, a shot corresponds to a single position system (see Summarization from a cognitive science
and angle of the ultrasound transducer. The method perspective), Endres-Niggemeyer [61,62] presented
they use for the parsing is a special case of the its extension, the SummIt-BMT, which is concerned
algorithm presented in [60]. The next step is the with the summarization of MEDLINE abstracts and
key-frame selection, which extracts the most infor- articles for bone marrow transplantation, a specia-
mative (important) frames in the sequence of the lized field of internal medicine. SummIt-BMT is a
video. After mentioning several methods for query-based summarization system. In general, the
extracting key-frames, they conclude that for the summarization process is the following:
context of ECGs key-frames are ‘‘the local extrema
of the cardiac periodic expansive—contractive 1. A user forms a search scenario using concepts
motion’’, since ‘‘the time at which the cardiac from the domain ontology.
motion changes from expansive to contractive cor- 2. This scenario is mapped to a MEDLINE query. If
responds to the end-diastole and the time at which the outcome of the query points to journal arti-
the motion changes from contractive to expansive cles, they are included in the results.
corresponds to end-systole’’. Having performed the 3. A text retrieval component identifies the inter-
above two tasks, they create two summaries which esting pieces of text in the results.
they call static and dynamic. 4. Those pieces are summarized in relation to the
query scenario. Links to the original articles are
Static summary. This summary, in essence, is also given.
constituted from the selection of the extracted
key-frames, and it is useful for browsing the con- Although SummIt-BMT is based on SimSum it dif-
tent of the echo video. fers from it in several ways. It is not a presentational
Dynamic summary. This summary, also called clin- model anymore but a functional one. Thus, agents
ical summary among the clinicians, is a concate- simulating lower level cognitive processes have b-
nation of the small extracted sequences of the een replaced by functional ones. Text production
video. They chose to extract one (or more, based agents have been removed since SummIt-BMT does
on the needs of the clinicians) cycle of the heart not produce smooth text, but organized text clips
motion, known also as R—R cycle. By joining those that are linked to their source positions. As the
174 S. Afantenos et al.

application field is bone marrow transplantation medical summarization lead to certain interesting
(BMT), a BMT ontology was set up. Although several remarks concerning the promising paths for future
medical ontologies existed which are loosely related research. These remarks are presented below in
to the BMT field, a BMT-specific ontology was cre- terms of the summarization factors.
ated due to the fact that the existing ontologies did In terms of the input medium, almost all methods
not contain enough deep BMT knowledge for text concern summarization from text, although the
knowledge processing. The ontology they created is specific domain can provide a lot of useful input
very important for Summit-BMT, since it is being in other media as well (e.g. speech, images, videos).
used in almost all the stages of the summarization Summarizing information from different media (e.g.
process. spoken transcriptions and textual reports related to
A scenario interface reflecting everyday situa- the specific echo-videos) is an important issue for
tions of BMT physicians [63] helps users to state practical applications, representing a promising
their queries. Users fill in ontology concepts, which path for future research and development.
are for their convenience equipped with definitions Concerning the number of the input documents,
and explanations assembled from various sources on both categories of techniques (single and multi-
the web. From scenario forms and user-selected document) have been examined. As it is the case
ontology terms the system obtains structured in other domains, apart from medicine, single-docu-
queries in the predicate-logic form that it can ment summarization methods are mainly using
‘‘understand’’. Queries are given to the search extractive techniques, whereas almost all of the
engines, which return a set of documents, abstracts multi-document summarizers are based on abstrac-
and maybe journal articles, from MEDLINE. The tive techniques. However, the selection between
retrieved documents are checked for possible rele- the simpler extractive techniques and the more
vance by a text passage retrieval component. Irre- complex abstractive ones should not only be based
levant documents are discarded. From the final on the number of input documents, but also on the
set of documents, the summarization agents take available resources, tools and the summary purpose
the positive passages from text passage retrieval, and output factors.
represent their phrases and sentences in a predi- Concerning the language of the input docu-
cate-logic form, and examine them with human- ment(s), most of the existing systems are monolin-
style criteria: whether they are related to the user gual (English in almost all the cases). There are two
query, whether they are redundant, and so on. The cases (MiTAP, MUSI), where the multilingual aspect
agents remove items that do not meet their rele- was taken into account. In the MiTAP case, this was
vance criteria. due to the domain (monitoring disease outbreaks)
Table 4 summarizes the main features of the where the information sources are in various lan-
projects/systems presented in Summarization tech- guages. On the other hand, MUSI summarizes the
niques in the medical domain. articles of a bilingual Journal. In the medical domain
there is an enormous amount of documents in var-
ious categories (e.g. patient records) in other lan-
5. Promising paths for future research guages apart from English. There are resources and
tools in several other languages that can be
Although initial work on Summarization dates back exploited in building summarizers for handling more
to the late 1950s and 1960s (e.g. [3,4]), most than one language using either shallow or deeper
research on the field has been performed during approaches for language processing.
the last few years. The result is that the research In relation to purpose factors, the existing meth-
field has not yet achieved a mature state, and a ods mainly concern indicative summarization. The
variety of challenges still need to be overcome. The purpose of such summaries is to navigate the reader
scaling to large size collections of documents, the to the required information, which seems to be
use of more sophisticated natural language proces- sufficient for most practical applications in medi-
sing techniques for generating abstracts, the avail- cine as long as no better solution is available. The
ability of annotated summarization corpora for production of indicative summaries seems to ‘‘indi-
training and testing purposes, are some of these cate’’ that the shallow summarizing strategies used
challenges. so far are not enough for producing informative or
This is also the case for the domain of medical even critical summaries. Deeper language proces-
documents. The study of existing summarization sing techniques [64] and their combination with
techniques in other domains, the examination of shallow processing ones seems to be a promising
different types of medical documents and the study path for future research in NLP in general and more
of techniques reported so far in the literature for specifically in summarization.
Summarization from medical documents: a survey 175

There is a trend towards user-oriented summa- the development of portable summarization tech-
ries, which is reasonable since summarization sys- nology and the medical domain can provide the
tems in the medical domain aim to cover the necessary application areas.
information needs of different user types (clini- Concerning the output factors, the quality of the
cians, researchers, patients) and specific users. User summarization output is strongly related to the
involvement does not concern only the submission of summarization task. Therefore, qualitative and
a query to the system, but also the summary cus- quantitative criteria need to be established follow-
tomization and presentation according to the user’s ing a study of the domain and the users’ interests. In
model. The PERSIVAL system [56] maintains infor- terms of the decision between extractive and
mation about the users’ preferences taking into abstractive techniques, as noted above, this has
account their expertise in the domain, as well as to take into account several factors related to the
the users’ access tasks. The summary presentation input documents, the purpose of the summary, the
can also be affected by the user’s model (e.g. qualitative criteria established as well as the avail-
production of a summary in the form of hypertext, able resources and tools.
combination of text and images or video, etc.).
Personalized access to medical information is a
crucial issue and needs to be further investigated. 6. Conclusions
There is a lot of expertise from the application of
user modeling techniques in other domains which This survey presented the potential of summariza-
can also be exploited in the medical domain (see tion technology in the medical domain, based on the
[65,66]). examination of the state of the art, as well as of
Domain customization is another significant existing medical document types and summariza-
issue. Most of the existing medical summarization tion applications.
systems are able to process documents belonging in The challenges that the summarization research
specific sub-domains of medicine. Emphasis must has to overcome need to be viewed under the prism
be given to the development of technology that can of the requirements of the specific field. The scaling
be easily ported to new sub-domains. The devel- to large collections of documents in various lan-
opment of open architecture systems with reusable guages and from different media, the generation
and trainable components and resources is impera- of informative summaries using more sophisticated
tive in summarization technology. This is directly language and knowledge engineering techniques,
related to the ability of exploiting pre-existing the generation of personalized summaries, the port-
medical knowledge resources. There are currently ability to new sub-domains, the design of evaluation
various knowledge repositories such as the Unified scenarios which model real-world situations, the
Medical Language System (UMLS),19 and MeSH, integration of summarization technology in practi-
which can be exploited in several ways by summar- cal applications such as the clinical workflow, are
ization engines. For instance, they can be used to among the issues that the summarization commu-
locate interesting document(s), and interesting nity needs to focus on.
sentences inside those documents. They can even
be used to create conceptual representations of
the selected sentences in order to produce abstrac- Acknowledgements
tive summaries in the same or in a different lan-
guage. Such approaches are presented in the The authors would like to thank the anonymous
literature and can be further investigated. The reviewers, as well as Dr. Constantine D. Spyropoulos
development of customizable summarization tech- and Dr. George Paliouras, for their helpful and con-
nologies requires also in-depth study of the medical structive comments. Many thanks also to Ms. Eleni
document types and medical sub-language. A gen- Kapelou and Ms. Irene Doura for checking the use of
eral-purpose system must be able to exploit the English.
various characteristics of the medical documents.
For instance, the sectioning of scientific articles,
the specialized language used in e-mailed reports
or in patient records are important features that
References
can significantly affect the performance of the
involved language processing tools. In general, [1] McKeown KR, Jordan DA, Hatzivassiloglou V. Generating
patient-specific summaries of online literature. In: Hovy
the research community must cooperate towards E, Radev D, editors. Intelligent text summarization papers
from the 1998 AAAI symposium, vols. 34—43. Satabford, CA,
19
See http://www.nlm.nih.gov/research/umls/. USA: AAAI Press; 1998.
176 S. Afantenos et al.

[2] Radev D, Hovy E, McKeown K. Introduction to the special [25] Barzilay R, Elhadad M. Using lexical chains for text
issue on text summarization. Comput Linguist 2002;28(4). summarization. In: Mani I, Maybury MT, editors. Advances
[3] Luhn HP. The automatic creation of literature abstracts. IBM in automatic text summarization. 1999. p. 1110—121.
J Res Dev 1958;2(2):159—65. [26] Saggion H, Lapalme G. Generating indicative-onformative
[4] Edmundson HP. New methods in automatic extracting. J summaries with SumUM. Comput Linguist 2002;28(4):497—
Assoc Comput Mach 1969;16(2):264—85. 526.
[5] Paice CD. The automatic generation of literature abstracts: [27] Virach S, Potipiti T, Charoenporn T. UNL document summar-
an approach based on the identification of self-indicating ization. In: Proceedings of the First International Workshop
phrases. In: Oddy RN, Robertson SE, van Rijsbergen CJ, on Multimedia Annotation (MMA2001), Tokyo, Japan, Janu-
Williams PW, editors. Information retrieval research. Lon- ary 2001.
don: Butterworth; 1981. p. 172—91. [28] Ando R, Boguraev B, Byrd R, Neff M. Multi-document sum-
[6] Paice CD. Constructing literature abstracts by computer. marization by visualizing topical content. In: Proceedings of
Inform Process Manage 1990;26(1):171—86. the ANLP/NAACL 2000 Workshop on Automatic Summariza-
[7] Sparck-Jones K. Automatic summarizing: factors and tion, Seattle, WA, April 2000.
directions. In: Mani I, Maybury MT, editors. Advances in [29] Goldstein J, Mittal V, Carbonell J, Callan J. Creating and
automatic text summarization. 1999. p. 10—12 [chapter 1]. evaluating multi-document sentence extract summaries. In:
[8] Mani I. Automatic summarization. Volume 3 of Natural Proceedings of the 2000 ACM CIKM International Conference
language processing. Amsterdam/Philadelphia: John Benja- on Information and Knowledge Management, McLean, VA,
mins Publishing Company; 2001. USA, November 2000, p. 165—72.
[9] Merlino A, Mark M. An empirical study of the optimal pre- [30] Tsutomu H, Suzuki J, Isozaki H, Maeda E. NTT’s multiple
sentation of multimedia summaries of broadcast news. In: document summarization system for DUC 2003. In: Proceed-
Mani I, Maybury MT, editors. Advances in automatic text ings of the Workshop on Text Summarization, at the Human
summarization. 1999. p. 3910—401 [chapter 25]. Language Technology Conference 2003, Edmonton, Canada,
[10] Futrelle RP. Summarization of diagrams in documents. In: May 31—June 1, 2003.
Mani I, Maybury MT, editors. Advances in automatic text [31] Radev DR. A common theory of information fusion
summarization. 1999. p. 4030—421 [chapter 26]. from multiple text sources, step one: cross-document
[11] Mani I, Maybury MT, editors. Advances in automatic text structure. In: Proceedings of the First ACL SIGDIAL
summarization. The MIT Press; 1999. Workshop on Discourse and Dialogue, Hong Kong, October
[12] Mann WC, Thompson SA. Rhetorical structure theory: 2000.
towards a functional theory of text organization. Text [32] Zechner K. Automatic summarization of spoken dialogues in
1988;8(3):243—81. unrestricted domains. PhD thesis, Carnegie Mellon Univer-
[13] Dalianis H, Hassel M, de Smedt K, Liseth A, Lech TC, Wede- sity, School of Computer Science, Language Technologies
kind J. Porting and evaluation of automatic summarization. Institute, November 2001.
In: Holmboe H, editor. Nordisk Sprogteknologi. 1988. [33] Zechner K. Automatic summarization of open-domain multi-
[14] Abderrafih L. Multilingual alert agent with automatic text party dialogues in diverse genres. Computat Linguist
summarization. http://www.lehmam.freesurf.fr/automa- 2002;28(4):447—85.
tic_summarization.htm. [34] Carbonell J, Goldstein J. The use Of MMR, diversity based
[15] Hsin-Hsi C, Lin C-J. Multilingual news summarizer. In: Pro- reranking for reordering documents and producing summa-
ceedings of the18th International Conference on Computa- ries. In: Proceedings of the 21st Annual International ACM
tional Linguistics, July 31—August 4 2000. University of SIGIR Conference on Research and Development in Informa-
Saarlandes, p. 159—65. tion Retrieval, Melbourne, Australia. Poster Session, Augusts
[16] Salton G, Singhal A, Mandar M, Buckley C. Automatic text 1998.
structuring and summarization. Inform Process Manage [35] Merlino A, Morey D, Maybury M. Broadcast news navigation
1997;33(2):193—207. using story segments. In: Proceedings of the ACM Multi-
[17] Mani I, Bloedorn E. Summarizing similarities and differences media; 1997. p. 381—91.
among related documents. Inform Retrieval 1999;1(1):1—23. [36] Endres-Niggemeyer B. Summarizing information. Berlin:
[18] Marcu D. The Rhetorical parsing of natural language texts. Springer-Verlag; 1998.
In: Proceedings of the 35th Annual Meeting of the Association [37] Endres-Niggemeyer B. SimSum: an empirically founded
for Computational Linguistics. New Brunswick, New Jersey: simulation of summarizing. Inform Process Manage
Association for Computational Linguistics; 1997, p. 96—103. 2000;36(4):659—82.
[19] Marcu D. The theory and practice of discourse parsing and [38] Endres-Niggemeyer B, Maier E, Sigel A. How to implement a
summarization. The MIT Press; 2000. naturalistic model of abstracting: four core working steps of
[20] Reiter E, Dale R. Building applied natural language genera- an expert abstractor. Inform Process Manage 1995;31(5):
tion systems. Nat Language Eng 1997;3(1):57—87. 631—74.
[21] Reiter E, Dale R. Building natural language generation sys- [39] Endres-Niggemeyer B, Neugebauer E. Professional summar-
tems. In: Studies in natural language sing. Cambridge Uni- izing: no cognitive simulation without observation. J Am Soc
versity Press; 2000. Inform Sci 1998;49(6):486—506.
[22] DeJong G. An overview of the FRUMP system. In: Lehnert [40] Hersh W, Buckley C, Leone TJ, Hickam D. OHSUMED: an
WG, Ringle MH, editors. Strategies for natural language interactive retrieval evaluation and new large test collec-
processing. New Jersey: Erlbaum: Hillsdale; 1982. p. 149— tion for research. In: Proceedings of the 17th Annual ACM
76. SIGIR Conference; 1994. p. 192—201.
[23] Radev DR, McKeown KR. Generating natural language sum- [41] Cios Krzysztof J, William Moore G. Uniqueness of medical
maries from multiple on-line sources. Comput Linguist data mining. Artif Intell Med 2002;26:1—24.
1998;24(3):469—500. [42] Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S, Eisenberg
[24] Radev DR. Generating natural language summaries from D. DIP: the database of interacting proteins a research tool
multiple on-line sources: language reuse and regeneration. for studying cellular networks of protein interactions. Nucl
PhD thesis, Columbia University; 1999. Acids Res 2002;30:303—5.
Summarization from medical documents: a survey 177

[43] Karkaletsis V, Spyropoulos CD. Cross-lingual information [54] Kan M-Y, McKeown KR, Klavans JL. Domain-specific informa-
management from web pages. In: Proceedings of the tive and indicative summarization for information retrieval.
Ninth Panhellenic Conference in Informatics (PCI-2003); In: Workshop on text summarization (DUC 2001); 2001.
2003. [55] Kan, M-Y. Automatic text summarization as applied to infor-
[44] Woodall J. Official versus unofficial outbreak reporting mation retrieval: using indicative and informative summa-
through the internet International. J Med Inform ries. PhD dissertation, Columbia University, New York, USA,
1997;47:31—4. February; 2003.
[45] Damianos L, Day D, Hirschman L, Kozierok R, Mardis S, [56] Elhadad N, McKeown KR. Towards generating patient specific
McEntee T, et al. Real Users, Real Data, Real Problems: summaries of medical articles. In: Proceedings of the Auto-
The MiTAP System for Monitoring Bio Events. Proceedings matic Summarization Workshop (NAACL 2001); 2001.
of the Conference on Unified Science & Technology for [57] Klavans JL, Muresan S. DEFINDER: rule-based methods for
Reducing Biological Threats & Countering Terrorism (BTR the extraction of medical terminology and their associated
2002); 2002. 167—77. definitions from on-line text. In: Proceedings of the Amer-
[46] McKeown K, Elhadad N, Hatzivassiloglou V. Leveraging a ican Medical Informatics Association Annual Symposium,
common representation for personalized search and sum- AMIA 2000; 2000.
marization in a medical digital library. In: Proceedings of the [58] Ebadollahi S, Chang S-F, Wu H, Takoma S. Echocardiogram
Joint Conference on Digital Libraries; 2003. video summarization. In: Proceedings of the SPIE MI 2001;
[47] Mani I, Bloedorn E. Summarizing similarities and differences 2001.
among related documents. Inform Retrieval 1999;1(1):1— [59] Xingquan Z, Fan J, Hacid M-S, Elmagarmid AK. Classminer:
23. mining medical video for scalable skimming and summariza-
[48] Lenci A, Bartolini R, Calzolari N, Agua A, Busemann S, Cartier tion. In: Proceedings of the 10th ACM International Confer-
E, et al. Multilingual summarization by integrating linguistic ence on Multimedia (Demonstration); 2002. p. 79—80.
resources in the MLIS-MUSI project. In: Proceedings of the [60] Zabih R, Miller J, Mai K. A feature-based algorithm for
Third International Conference on Language Resources and detecting and classifying scene breaks. In: Proceedings of
Evaluation (LREC’02); 2002. the ACM Multimedia; 1993. p. 189—200.
[49] Johnson DB, Zou Q, Dionisio JD, Liu VZ, Chu WW. Modeling [61] Endres-Niggemeyer B. Empirical methods for ontology engi-
medical content for automated summarization. Ann NY Acad neering in bone marrow transplantation. In: International
Sci 2002;980:247—58. Workshop on Ontological Engineering on the Global Informa-
[50] Coch J, Chevreau K. Interactive multilingual generation. In: tion Infrastructure; 1999.
Gelbuckh A, editor. Computational linguistics and intelligent [62] Endres-Niggemeyer B. Human-style; 2001. www summa-
text processing. Lecture notes in computer science, vol. rization. http://www.ik.fh-hannover.de/ik/person/ben/
2004. Berlin: Springer-Verlag; 2001. human-stylesummanew.pdf.
[51] Busemann S. Best-first surface realization. In: Scott D, [63] Becher M, Endres-Niggemeyer B, Fichtner G. Scenario forms
editor. Eighth International Natural Language Generation for web information seeking and summarizing in bone mar-
Workshop Proceedings; 1996 p. 101—10. row transplantation. In: COLING 2002: Workshop on Multi-
[52] Gaizauskas R, Herring P, Oakes M, Beaulieu M, Willett P, lingual Summarization and Question Answering; 2002.
Fowkes H, et al. Intelligent access to text: integrating [64] Oepen S, Flickinger D, Uszkoreit H, Tsuji J. Introduction to
information extraction technology into text browsers. In: the special issue on recent achievements in the domain of
Proceedings of the Human Language Technology Conference HPSG-based parsing. J Nat Language Eng 2000;6(1):1—14.
(HLT 2001); 2001. p. 189—93. [65] Alfred K. Generic user modeling systems. User Model User-
[53] Kan M-Y, McKeown KR, Klavans JL. Applying natural language adapted Interaction 2001;11(12):49—63.
generation to indicative summarization. In: Proceedings of [66] Pierrakos D, Paliouras G, Papatheodorou C, Spyropoulos CD.
the Eighth European Workshop on Natural Language Gen- Web usage mining as a tool for personalization: a survey.
eration; 2001. User Model User-adapted Interaction 2003;13(4):311—72.

View publication stats

Arem Language
No ratings yet
Arem Language
3 pages
Structure of English-LET REVIEWER
100% (1)
Structure of English-LET REVIEWER
25 pages
Hebrew Names of God
100% (5)
Hebrew Names of God
57 pages
Pooja PDF
No ratings yet
Pooja PDF
21 pages
Power Transmission Lines Inspection Using Properly Equipped Unmanned Aerial Vehicle (UAV)
No ratings yet
Power Transmission Lines Inspection Using Properly Equipped Unmanned Aerial Vehicle (UAV)
6 pages
Afantenos&Al AIM Preprint
No ratings yet
Afantenos&Al AIM Preprint
22 pages
Outage Probability of Triple-Hop Mixed RF/FSO/RF Stratospheric Communication Systems
No ratings yet
Outage Probability of Triple-Hop Mixed RF/FSO/RF Stratospheric Communication Systems
7 pages
IDF - Ombrian - Hydrognomon
No ratings yet
IDF - Ombrian - Hydrognomon
2 pages
Force-Velocity Relationship Between Sprinting and Jumping Testing Procedures
No ratings yet
Force-Velocity Relationship Between Sprinting and Jumping Testing Procedures
5 pages
Homogenous Earth Approximation of Two-Layer Earth
No ratings yet
Homogenous Earth Approximation of Two-Layer Earth
10 pages
A 15 Forensicinvestigationofsubmersiondeaths
No ratings yet
A 15 Forensicinvestigationofsubmersiondeaths
10 pages
Paper8 PDF
No ratings yet
Paper8 PDF
14 pages
Healthcareand ERP
No ratings yet
Healthcareand ERP
7 pages
Liarosetal DMPCO2023
No ratings yet
Liarosetal DMPCO2023
6 pages
ICTExpress
No ratings yet
ICTExpress
2 pages
Assessment of Geothermal Resources For Power Gener
No ratings yet
Assessment of Geothermal Resources For Power Gener
7 pages
Data Analytics Platform For The Optimization of Waste Management Procedures
No ratings yet
Data Analytics Platform For The Optimization of Waste Management Procedures
7 pages
Naka 2018 Do Composite Resin Restorations Protect Cracked Teeth An in Vitro Study
No ratings yet
Naka 2018 Do Composite Resin Restorations Protect Cracked Teeth An in Vitro Study
7 pages
Architecture and DSP Implementation of A DVB-S2 Ba
No ratings yet
Architecture and DSP Implementation of A DVB-S2 Ba
9 pages
0513 Af 006564 - Withtags
No ratings yet
0513 Af 006564 - Withtags
8 pages
Traumacare 4 1030
No ratings yet
Traumacare 4 1030
5 pages
Reliabilityandvalidityofthe Satisfactionwith Life Scale SWLSina Greeksample
No ratings yet
Reliabilityandvalidityofthe Satisfactionwith Life Scale SWLSina Greeksample
9 pages
Diesel Spec
No ratings yet
Diesel Spec
10 pages
Term Paper On BEOWULF Cluster
No ratings yet
Term Paper On BEOWULF Cluster
5 pages
Road Traffic Prediction Using Artificial Neural Networks: September 2018
No ratings yet
Road Traffic Prediction Using Artificial Neural Networks: September 2018
6 pages
Water 16 00333
No ratings yet
Water 16 00333
19 pages
Economic Crises and Mortality A Review of The Lite
No ratings yet
Economic Crises and Mortality A Review of The Lite
9 pages
Road Traffic Prediction Loumiotis
No ratings yet
Road Traffic Prediction Loumiotis
6 pages
IDF HYDRONOGMON 2013kos - Ombrian
No ratings yet
IDF HYDRONOGMON 2013kos - Ombrian
2 pages
Defining A Management Function Based Architecture For 5G Network Slicing
No ratings yet
Defining A Management Function Based Architecture For 5G Network Slicing
8 pages
Kambas Et Al 54 61
No ratings yet
Kambas Et Al 54 61
10 pages
Healthcare and Erp
No ratings yet
Healthcare and Erp
7 pages
A Severe Covid19 Case Study
No ratings yet
A Severe Covid19 Case Study
11 pages
2021-Review of Eye Tracking Metrics Involved in Emotional and Cognitive Processes
No ratings yet
2021-Review of Eye Tracking Metrics Involved in Emotional and Cognitive Processes
20 pages
The Mangled Extremity and Attempt For Limb Salvage: Article
No ratings yet
The Mangled Extremity and Attempt For Limb Salvage: Article
7 pages
RMEand Mobiledevicesin Kindergarten
No ratings yet
RMEand Mobiledevicesin Kindergarten
15 pages
Fire Detection From Images Using Faster R-CNN and Multidimensional Texture Analysis
No ratings yet
Fire Detection From Images Using Faster R-CNN and Multidimensional Texture Analysis
6 pages
32D Antoniadou
No ratings yet
32D Antoniadou
28 pages
8 Dot Braille Code For Complex Nemeth Symbols
No ratings yet
8 Dot Braille Code For Complex Nemeth Symbols
6 pages
Masking Posterior Tooth Discolorations With Color
No ratings yet
Masking Posterior Tooth Discolorations With Color
8 pages
Clickbait Open Access-1
No ratings yet
Clickbait Open Access-1
8 pages
982 4438 1 PB
No ratings yet
982 4438 1 PB
9 pages
Prescriptive Analytics: Literature Review and Research Challenges
No ratings yet
Prescriptive Analytics: Literature Review and Research Challenges
15 pages
Weekly External Load Correlation in Season Microcycles With Game Running Performance and Training Quantification in Elite Young Soccer Players
No ratings yet
Weekly External Load Correlation in Season Microcycles With Game Running Performance and Training Quantification in Elite Young Soccer Players
14 pages
Sujet 25
No ratings yet
Sujet 25
9 pages
DFGT
No ratings yet
DFGT
9 pages
Construction Cost Analysis of Retaining Walls
No ratings yet
Construction Cost Analysis of Retaining Walls
7 pages
SCADA Implementations To Supervise The Water Networks Infrastructures in The City of Athens
No ratings yet
SCADA Implementations To Supervise The Water Networks Infrastructures in The City of Athens
7 pages
Effectiveness of RTSCTS Handshake in IEEE 80211a Wireless LANs
No ratings yet
Effectiveness of RTSCTS Handshake in IEEE 80211a Wireless LANs
3 pages
Unit 4stretching Technique
No ratings yet
Unit 4stretching Technique
9 pages
Sensor Fusionfor Predicting Vehicles Pathfor Collision Avoidance Systems
No ratings yet
Sensor Fusionfor Predicting Vehicles Pathfor Collision Avoidance Systems
15 pages
Environmental Monitoring Through Embedded System and Sensors
No ratings yet
Environmental Monitoring Through Embedded System and Sensors
8 pages
Construction of Ombrian Curves Using The Hydrognomon Software System
No ratings yet
Construction of Ombrian Curves Using The Hydrognomon Software System
2 pages
Gerodimosetal 2006
No ratings yet
Gerodimosetal 2006
14 pages
GGGGGGG 4
No ratings yet
GGGGGGG 4
10 pages
Major 317 Accidents
No ratings yet
Major 317 Accidents
13 pages
Modeling of CO2 Capture Via Chemical Absorption Processes An Extensiveliteraturereview
No ratings yet
Modeling of CO2 Capture Via Chemical Absorption Processes An Extensiveliteraturereview
21 pages
Pinisarafismalliarou
No ratings yet
Pinisarafismalliarou
9 pages
2012-JMSA Optimization of Routing Considering Uncertainties
No ratings yet
2012-JMSA Optimization of Routing Considering Uncertainties
9 pages
COVID-19 Pandemic: The Impact of The Social Media Technology On Higher Education
No ratings yet
COVID-19 Pandemic: The Impact of The Social Media Technology On Higher Education
27 pages
PASER A Curricula Synthesis System Based On Automa
No ratings yet
PASER A Curricula Synthesis System Based On Automa
14 pages
Final Report For Submission of PROJECT (2) - 1
No ratings yet
Final Report For Submission of PROJECT (2) - 1
1 page
Project Conference Paper
No ratings yet
Project Conference Paper
10 pages
Final Report For Submission of PROJECT
No ratings yet
Final Report For Submission of PROJECT
50 pages
Medical Report - 2
No ratings yet
Medical Report - 2
2 pages
Medical Reports Summarization Using Text-To-Text Transformer
No ratings yet
Medical Reports Summarization Using Text-To-Text Transformer
5 pages
Suyoga - C++ Viva Questions
No ratings yet
Suyoga - C++ Viva Questions
14 pages
Student Information Management System
No ratings yet
Student Information Management System
7 pages
Spanish Classes Syllabus 2016 2017
No ratings yet
Spanish Classes Syllabus 2016 2017
4 pages
The IFRS Foundation and Its Translation Policies
No ratings yet
The IFRS Foundation and Its Translation Policies
2 pages
Grade 6 DLL ENGLISH 6 Q1 Week 1
No ratings yet
Grade 6 DLL ENGLISH 6 Q1 Week 1
7 pages
Python Classes and Inheritance
No ratings yet
Python Classes and Inheritance
24 pages
Multiple Choice Questions
100% (2)
Multiple Choice Questions
4 pages
ENGLISH HL P1 Gr12 MEMO SEPT2024
No ratings yet
ENGLISH HL P1 Gr12 MEMO SEPT2024
12 pages
Dreambooth: Fine Tuning Text-To-Image Diffusion Models For Subject-Driven Generation
No ratings yet
Dreambooth: Fine Tuning Text-To-Image Diffusion Models For Subject-Driven Generation
21 pages
Use of English
No ratings yet
Use of English
61 pages
Iraqi Words Dictionary
No ratings yet
Iraqi Words Dictionary
45 pages
What's This?: It's A Pen
100% (1)
What's This?: It's A Pen
128 pages
Engligh 6 q2 - Gather Relevant Information Using Dictionary and Thesaurus
No ratings yet
Engligh 6 q2 - Gather Relevant Information Using Dictionary and Thesaurus
20 pages
Extra Grammar Exercises (Unit 1, Page 6) : Top Notch 3, Third Edition
No ratings yet
Extra Grammar Exercises (Unit 1, Page 6) : Top Notch 3, Third Edition
2 pages
XII 301 1 1 1 MS Unsigned
No ratings yet
XII 301 1 1 1 MS Unsigned
23 pages
Subject Verb Agreement
No ratings yet
Subject Verb Agreement
4 pages
Maranao Culture and Its Relationship To The Standard English Phonetic and Intonation Skills
No ratings yet
Maranao Culture and Its Relationship To The Standard English Phonetic and Intonation Skills
12 pages
List of Countries, Nationalities and Their Languages
No ratings yet
List of Countries, Nationalities and Their Languages
3 pages
Lesson Plan in English
No ratings yet
Lesson Plan in English
4 pages
Grade 10 Unit 6 - Natural Language Processing
No ratings yet
Grade 10 Unit 6 - Natural Language Processing
33 pages
Simple Present Tense Worksheet
No ratings yet
Simple Present Tense Worksheet
2 pages
Passive Voice in English Grammar 1
No ratings yet
Passive Voice in English Grammar 1
7 pages
G5mathq2w8 SLM
No ratings yet
G5mathq2w8 SLM
9 pages
Rusko-Angliski Razgovornik
No ratings yet
Rusko-Angliski Razgovornik
21 pages
PRESENT and PAST
No ratings yet
PRESENT and PAST
20 pages
Jayamala - Wikipedia
No ratings yet
Jayamala - Wikipedia
7 pages
English FAL P1 Nov 2015 Memo
No ratings yet
English FAL P1 Nov 2015 Memo
9 pages

Summarization From Medical Documents A Survey

Uploaded by

Summarization From Medical Documents A Survey

Uploaded by

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Summarization from Medical Documents: A Survey

Article in Artificial Intelligence in Medicine · February 2005

Stergos Afantenos Vangelis Karkaletsis

SEE PROFILE SEE PROFILE

The user has requested enhancement of the downloaded file.

Summarization from medical documents: a survey

2.1.2. Purpose factors 2.1.3.1. Output quality. The developer of a sum-

then the system developer must tune its system

Language processing (to identify keywords)

Graph-based, statistics (cosine similarity,

(to identify the RST relations markers)

Statistics (Edmundsonian paradigm),

Statistics (Edmundsonian paradigm),

Statistics (Edmundsonian paradigm),

Statistics (Edmundsonian paradigm),

Graph-based, cohesion relations,

Tree-based, language processing

use of thesauri, revision

vector space model)

Although the summarization community considers

Text regions (sentences,

ones (see Section 5 in [11]). An intrinsic method

supposed to serve. An extrinsic evaluation, on the

other hand, evaluates the produced summary in

Generic, domain-specific (news)

User-oriented, General purpose

terms of a specific task.

their referents (for an extract), the summary read-

subjects create a ‘‘gold’’ summary, i.e. an ideal one,

quantitative and measure things such as precision

or recall. The problem with this approach is that it is

usually difficult to make people agree on what

constitutes a ‘‘gold’’ summary. One way to sidestep

(belonging or not to the summary), but instead a

judge. Those values are then averaged and the top

sentences can be considered as forming the ‘‘gold’’

2.3.1. Extractive techniques Table 1 presents representative systems employ-

purpose or a domain-specific one. The output field

sentences, ontology-based annotation, NLG

Syntactic processing of representative

Syntactic processing of representative

Information extraction, NLG

removing redundant words,

 Use of a script, i.e. a sort of a simple-structured

script for each domain. When a document is

processed, the corresponding script is activated

and its slots filled with information from the

the activation of another script. After the script

(canned text generation versus the more sophis-

sort of a relational database, having a more

complex structure compared to scripts. The tem-

plate can be filled from a document using infor-

templates can then be processed in order to

tion conferences for information extraction systems; see

Statistics (vector space model), language processing

Statistical (support vector machine, MD-MRD)

marginal relevance — MD-MRD), revision

Intra- and inter-document relationships

Statistical (multi-document maximal

2.3.4. Multimedia summarization techniques

most salient sentences

A scatter plot of the

addresses the issue of speech recognition errors).

detection follows in which false starts, restarts or

answer among the speakers are identified and anno-

Generic, general purpose

Generic, general purpose

when producing the summary, thus making the

extracted. The final step, sentence ranking and

2.3.4.2. Diagram summarization. Futrelle [10]

presented a preliminary and only partially imple-

the diagrams in a scientific paper, either by

from the paper or by distilling (i.e. simplifying) a

diagram, or even by merging several diagrams.

View publication stats

Use of a script, i.e. a sort of a simple-structured