Summarization from medical documents: a survey
http://www.intl.elsevierhealth.com/journals/aiim
S. Afantenos a, V. Karkaletsis a, P. Stamatopoulos b

a Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research (NCSR) "Demokritos", 15310 Aghia Paraskevi Attikis, Athens, Greece
b Department of Informatics, University of Athens, TYPA Buildings, Panepistimiopolis, GR-15771 Athens, Greece
Received 16 December 2002; received in revised form 21 July 2004; accepted 21 July 2004
KEYWORDS: Summarization from medical documents; Single-document summarization; Multi-document summarization; Multi-media summarization; Extractive summarization; Abstractive summarization; Cognitive summarization

Summary

Objective: The aim of this paper is to survey recent work in medical document summarization.

Background: During the last decade, document summarization has received increasing attention from the AI research community. More recently it has also attracted the interest of the medical research community, due to the enormous growth of information available to physicians and researchers in medicine through the large and growing number of published journals, conference proceedings, medical sites and portals on the World Wide Web, electronic medical records, etc.

Methodology: This survey first gives a general background on document summarization, presenting the factors that summarization depends upon, discussing evaluation issues and briefly describing the various types of summarization techniques. It then examines the characteristics of the medical domain through the different types of medical documents. Finally, it presents and discusses the summarization techniques used so far in the medical domain, referring to the corresponding systems and their characteristics.

Discussion and conclusions: The paper discusses thoroughly the promising paths for future research in medical document summarization. It focuses mainly on the issues of scaling to large collections of documents in various languages and from different media, on personalization, on portability to new sub-domains, and on the integration of summarization technology into practical applications.

© 2004 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +30 210 6503149. E-mail addresses: [email protected] (S. Afantenos), [email protected] (V. Karkaletsis), [email protected] (P. Stamatopoulos).
1 Tel.: +30 210 7752222.
0933-3657/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.artmed.2004.07.017

1. Introduction

New technologies, such as high-speed networks and inexpensive massive storage, along with the remarkable growth of the Internet, have led to an
enormous increase in the amount and availability of on-line documents. This is also the case for medical information, which is now available from a variety of sources. However, information is only valuable to the extent that it is accessible, easily retrieved, and concerns the personal interests of the user. The growing volume of data, the lack of structured information, and the diversity of information have made information and knowledge management a real challenge in the effort to support the medical community. It has been realized that added value is not gained merely through larger quantities of data, but through easier access to the required information at the right time and in the most suitable form. Thus, there is a strong need for improved means of facilitating information access.

The medical domain suffers particularly from the problem of information overload, since it is crucial for physicians and researchers in medicine and biology to have quick and efficient access to up-to-date information according to their interests and needs. Considering, for instance, scientific medical articles, the authors of [1, p. 38] state that: "...there are five journals which publish papers in the narrow specialty for cardiac anesthesiology, but 35 different anesthesia journals in general; approximately 100 journals in the closely related fields of cardiology (60) and cardiothoracic surgery (40); and over 1000 journals in the more general field of internal medicine." The situation becomes much worse if one considers relevant journals or newsletters in other languages, Web sites with relevant information, medical reports, etc.

Given the number and diversity of medical information sources, methods must be found that will enable users to quickly assimilate and determine the content of a document. Summarization is one such approach that can help users to quickly determine the main points of a document. Radev et al. [2] provide the following definition for a summary: "A summary can be loosely defined as a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that. Text here is used rather loosely and can refer to speech, multimedia documents, hypertext, etc. The main goal of a summary is to present the main ideas in a document in less space. If all sentences in a text document were of equal importance, producing a summary would not be very effective, as any reduction in the size of a document would carry a proportional decrease in its informativeness. Luckily, information content in a document appears in bursts, and one can therefore distinguish between more and less informative segments. Identifying the informative segments at the expense of the rest is the main challenge in summarization."

Although initial work on summarization dates back to the late 1950s and 1960s (e.g. [3,4]), followed by some sparse publications (e.g. [5,6]), most research in the field has been carried out during the last decade. During these last few years, researchers have examined a great variety of techniques and applied them to different domains and genres of documents, in order to see which ones yield the most practical results for each domain and genre.

This survey presents the potential of summarization technology in the medical domain, based on an examination of the state of the art, as well as of existing medical document types and summarization applications. An important aspect of this survey is that it is not restricted to a mere examination of the various summarization techniques, but also examines the issues that arise in the use of these techniques, taking into account the characteristics of the medical domain.

The structure of the survey is as follows. The second section presents a roadmap of summarization, comprising the factors that have to be taken into account and the main techniques considered so far in the summarization literature. The third section presents the different types of medical documents and the requirements they introduce to the summarization process. The fourth section examines the techniques used so far for summarization of medical documents. Finally, the fifth section summarizes the most interesting remarks of this survey and presents promising paths for future research, while the last section concludes the paper.

2. Summarization roadmap

A summarization system, in order to achieve its task, takes into account several factors. These factors concern mainly the type of input documents, the purpose that the final summary should serve, and the possible ways of presenting a summary. Summary evaluation is also an important issue. These factors are examined in the following sections. Various techniques that have been used so far for document summarization are also presented. This presentation is necessary for the examination of existing approaches to summarization from medical documents.

2.1. Summarization factors

A detailed presentation of the factors that have to be taken into account for the development of a
summarization system has been given in [7]. However, as it is noted there, all these factors are hard to define, and therefore it is very difficult to capture them precisely enough to guide summarization in various applications. The following presentation of factors adopts the main categorization presented in [7]: input, purpose and output. For each of these categories, the factors considered the most important are presented.

2.1.1. Input factors
The main factors in this category are the following.

2.1.1.1. Single-document versus multi-document. This is the unit input parameter or the span parameter, as Sparck-Jones [7] and Mani [8] respectively call it, which in simple words is the number of documents that the system has to summarize. In single-document summarization the system processes just one document at a time, whereas in multi-document summarization more than one document is processed by the system.

2.1.1.2. Language. Another input factor is the number of languages in which the input documents are written. A system can be monolingual, multilingual or cross-lingual. In the first case, the output language is the same as the input language. In the case of multilingual summarization systems, the output language is again the same as the input language, but the system can handle a certain number of languages. In the final case of cross-lingual summarization, the system can accept a source text in a specific language and deliver the summary in another language, not necessarily the same as the input one.

2.1.1.3. Text versus multimedia summaries. Another important factor is the medium used to represent the content of the input document(s), as well as the output summary. Thus, we have text or multimedia (e.g. images, speech, video apart from textual content) summarization. The most studied case is, of course, text summarization. However, there are also summarization systems that deal, for example, with the summarization of broadcast news [9] and of diagrams [10].

2.1.2. Purpose factors

2.1.2.1. Indicative versus informative. Depending on the purpose that the summary is supposed to serve when presented to its reader, it can either be indicative or informative. An indicative summary does not claim any role of substituting the source document(s). Its purpose is merely to alert its reader to the contents of the original document(s), so that the reader can choose which of the original documents to read further. The purpose of an informative summary, on the other hand, is to substitute the original document(s) as far as coverage of information is concerned. Apart from the indicative and informative summaries, there are also critical summaries [7,8], but, as far as we know, no actual summarization system creates critical summaries.

2.1.2.2. Generic versus user-oriented summaries. This factor concerns the information a system needs to locate in order to produce a summary. Generic systems create a summary of a document or a set of documents taking into account all the information found in the documents. On the other hand, user-oriented systems try to create a summary of the information found in the document(s) which is relevant to a user query. In a sense, we can say that query-oriented summarization systems are user-focused, adapting each time to the verbally expressed needs of the users, as viewed through the query they make or through their model (personalized summaries).

2.1.2.3. General purpose versus domain-specific. General-purpose systems can be easily ported to a different domain (e.g. financial, medical). This can be done, for instance, by changing the resources that characterize the domain (e.g. keywords, a domain-specific ontology), or by tuning specific parameters which concern the selection of the most appropriate techniques for the domain. On the other hand, domain-specific systems are only able to process documents belonging to a specific domain.

2.1.3. Output factors
These factors are related to the criteria that are used to judge the quality of the resulting summary, as well as to the type of summary in terms of whether it is an extract from the original document(s) or an abstraction.
[Table 1: representative systems employing extractive techniques [4,14-19]. Fields: Input (single-document or multi-document; English or multilingual; text, e.g. scientific articles or technical papers), Output (generic or domain-specific summaries built from sentences or paragraphs), Method (e.g. statistics, language processing, with or without revision of the extract), and Evaluation (intrinsic or extrinsic).]

2.1.3.2. Extracts versus abstracts. Considering the relation that the summary has to the source document(s), a summary can either be an extract or an abstract. An extract involves the selection and verbatim inclusion of units (sentences, paragraphs, sections) of the source document(s) in the summary. An abstract, on the other hand, involves the identification of the salient concepts prevalent in the source document(s), and their fusion and appropriate presentation, usually through Natural Language Generation.

2.2. Evaluation

A central difficulty in evaluating summarization systems is deciding what the evaluation criteria should be. This is mainly related to the subjective aspect of summarization, in terms of whether or not a summary is of "good" quality. Existing evaluation techniques can be split into two categories, intrinsic and extrinsic. In an intrinsic method, the summary itself is evaluated, using criteria such as the integrity of its sentences or the existence of anaphors without a referent, possibly in comparison with the source document(s) or with a reference summary.
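When gold-standard extracts are available, one simple intrinsic measure is the overlap between the sentences a system selects and those a human selected. The following is a minimal sketch, not drawn from any of the surveyed systems; the index-set representation of an extract is an assumption for illustration:

```python
def overlap_scores(system_idx, gold_idx):
    """Precision, recall and F1 of system-extracted sentence indices
    against a human gold-standard extract of the same document."""
    system, gold = set(system_idx), set(gold_idx)
    tp = len(system & gold)  # sentences chosen by both system and human
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

For example, a system extract of sentences {0, 2, 5} scored against a gold extract {0, 1, 2} yields precision and recall of 2/3 each.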
In an extrinsic method, the summary is evaluated in relation to the particular task it is supposed to serve. Thus, such an evaluation can vary greatly from system to system. Relevance assessment is an extrinsic method and is usually performed by showing judges a document (summary or source) and a topic, and asking them to determine whether that document is relevant to the topic. If, on average, the choices for the source document and the summary are the same, then the summary scores high in the evaluation. Of course, instead of providing a single topic, a list of topics can be provided, asking the judges to choose one of them. Another example is the evaluation of reading comprehension. In this case, judges are given a document (either the summary or the source document) and are asked a set of questions. Their answers to these questions determine their comprehension of the text. If the answers for the summary and the corresponding source document(s) are similar, then the summary is positively evaluated.

2.3. Summarization techniques

Summarization techniques can be classified according to the factors presented in Section 2.1. For example, they can be classified according to the number of input documents (single-document versus multi-document), the type of these documents (textual versus multimedia), the output type (extractive versus abstractive), etc. In this section, the various summarization techniques are presented under the following classification:

- extractive;
- abstractive;
- multi-document;
- multimedia.

An extra category is added to include a technique that, although it presents similarities with techniques in the other categories, has the special characteristic of approaching summarization from a cognitive perspective, aiming at simulating the human summarizer's tasks.

2.3.1. Extractive techniques

The first category of extractive techniques concerns the selection of the most salient sentences of the document(s). Selection is normally based on a formula which assigns a weight to each sentence based on various factors: for example, the cue phrases or keywords that the sentence contains, its location in the document, or the fact that it may contain non-trivial words that are also found in the section headings of the document. The problem is that the produced summary often suffers from incoherencies (semantic gaps, anaphora problems). Some of the systems falling under this category post-process the produced summary using revision techniques in order to resolve such problems.

The second category of extractive techniques concerns the creation of a graph (or tree) representation of the document(s) to be summarized, exploiting machine learning and/or language processing techniques. Several different representations can be used:

- The nodes of the graph are the paragraphs of the document to be summarized, and the edges represent the similarity between the paragraphs they connect. The paragraphs corresponding to nodes with many edges can be extracted in order to form the summary of the document.
- The nodes of the graph are text items such as words, phrases and proper names, and the edges are cohesion relationships between these items, such as coreference and hypernymy. Once the graph for the document is created, the most salient nodes in the graph can be located, based for example on a user query. The set of salient nodes can then be used to extract the corresponding sentences, paragraphs, or even sections that will form the summary.
- A tree representation can be created exploiting relations from rhetorical structure theory (RST) [12]. The tree nodes are sentences, which are connected using RST relations such as elaboration, antithesis, etc. In order to get the most salient sentences, the tree is traversed to build a partial ordering of the sentences in terms of their importance. According to the target compression rate, the top n sentences can be extracted and presented as a summary.
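The sentence-weighting scheme of the first category can be sketched as follows. This is a minimal illustration, not the formula of any surveyed system; the feature weights and the cue-phrase list are assumptions:

```python
import re
from collections import Counter

CUE_PHRASES = ("in conclusion", "we found", "significantly")  # illustrative cue list

def extract_summary(text, n=2, cue_phrases=CUE_PHRASES):
    """Score each sentence by keyword frequency, location and cue phrases,
    then return the n highest-weighted sentences in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    scored = []
    for i, sent in enumerate(sentences):
        tokens = re.findall(r"[a-z]+", sent.lower())
        keyword = sum(freq[t] for t in tokens) / max(len(tokens), 1)  # frequent-word weight
        location = 1.0 / (i + 1)                                      # early sentences weigh more
        cue = sum(p in sent.lower() for p in cue_phrases)             # cue-phrase bonus
        scored.append((keyword + location + cue, i, sent))
    top = sorted(scored, reverse=True)[:n]
    # emit in original document order, to limit the incoherencies noted above
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```

Note that re-ordering the extracted sentences by document position is itself a crude form of the post-processing mentioned above; real systems apply much richer revision techniques.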
Table 1 presents representative systems employing extractive techniques. The output field concerns the "material" used to create the summary (sentences, paragraphs, sections). The table also includes a field about the specific methods used (e.g. statistics, language processing, use of revisions, etc.), as well as a field on the evaluation approach. The lack of the corresponding information in some field values denotes that a definite answer cannot be given.

2.3.2. Abstractive techniques

The representative categories of abstractive techniques implemented by existing systems (see Table 2) are presented below.

[Table 2: representative systems employing abstractive techniques [23-27]. Inputs are single-document, English, text (e.g. news articles); outputs are informative, user-oriented, domain-specific summaries built from semantic representations such as scripts, templates, conceptual representation in UNL, ontology-based representations and clusters; methods include information extraction, syntactic processing and combining of sentences, and NLG; evaluation is intrinsic or extrinsic. 1 MUC (Message Understanding Conferences) were evaluation conferences for information extraction: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html.]

In the first category, the process of identifying and encoding the most important information in the document(s) can be performed using prior knowledge about the structure of this information. This knowledge is represented through cognitive schemas such as frames, scripts and templates. Thus, in such cases, the summary produced is not a generic one but rather a user-oriented one, since the schema can be considered as a sort of user query. Different approaches in this category may be the following:
- The filled templates produced by an information extraction system are fed to an NLG system. Processing is done using various semantic operators, such as Change of Perspective, Contradiction, Addition/Elaboration, Refinement, Agreement, etc.

The second category involves techniques that do not use prior knowledge about the structure of the important information to be used in the summary; instead they produce a semantic representation of the document(s), which is then fed to the NLG system. Different approaches in this category may be the following:

- The documents are linguistically processed in order to identify noun phrases and verb phrases that can be linked to the concepts, attributes and relations of a domain-specific ontology. Ontology-based annotations can then be used to select the important document regions (sentences, paragraphs). These regions are then converted into some semantic representation using the results of the linguistic processing and the ontology-based annotations. This representation is then fed to an NLG system that produces the abstract.
- The summarization system can identify informationally equivalent paragraphs in the input document(s) using clustering techniques. From each theme some representative sentences are extracted, which can be analyzed syntactically and then fed to a sentence generator in order to produce the abstract.

Table 2 presents representative systems employing abstractive techniques. The Table 2 fields are the same as the fields in Table 1. There are differences between the two tables concerning the output and the method fields, which are filled with different values. The output field for the abstractive techniques concerns the "semantic representation" used to create the summary (scripts, templates, ontology-based representations, clusters of informationally equivalent document regions). The method field is filled with the specific methods used (script activation, information extraction, syntactic processing, NLG, etc.).

2.3.3. Multi-document summarization techniques

Radev et al. [2] define multi-document summarization as the process of producing a single summary of a set of related source documents. As they note, this is a relatively new field where three major problems are introduced: (1) recognizing and coping with redundancy, (2) identifying important differences among documents, and (3) ensuring summary coherence.

While a technique extracting textual units, such as sentences, from a single document may cope with redundancy and preserve the coherence of the original document, extracting textual units from multiple documents increases the redundancies and incoherencies, since textual units are not previously connected across documents. Thus, abstractive techniques seem more appropriate for multi-document summarization. However, extractive techniques can also be used, followed by a post-processing stage, in order to ensure summary coherence and cope with redundancy. In both cases, the first step is to identify those documents talking about a specific topic. This can be done using classification techniques, in terms of relevance to a query or to existing topic models, or using clustering techniques. The next step is to identify the important information to be added to the summary from the group of topic-specific documents.

In the extractive-based category, a system can first apply extractive techniques to each document separately in order to locate and rank the most important document regions (phrases, sentences, paragraphs). The highest-ranked regions from all the documents can then be combined and re-ranked using similarity measures, in order to minimize redundancy. Some cohesion rules can then be applied to the final set of document regions in order to produce the summary. Another approach would be to work with all the documents from the beginning. For instance, the topic model produced from the group of input documents can be compared to sentence vectors from all the documents in order to get the sentences most similar to that topic.

The abstractive-based category can also involve extractive techniques at a pre-processing stage. In such a case, the extracted document regions are linguistically processed in order to be converted into some representation, which can be used by an NLG system that will then produce the abstract. Another approach involves the establishment and use of a set of intra- and inter-document relationships, which could hold between various textual units of the documents, such as words, sentences, paragraphs or whole documents, and which could guide the identification of the most salient information across the multiple documents. Such relationships concern not only the similarities across documents, but also their differences (e.g. equivalence, contradiction, elaboration, etc.).

To cope with the inherent problems of multi-document summarization (redundancy, incoherencies), a different output representation can be used instead of producing a summary containing, for instance, the most salient sentences across documents or even their abstraction. Ando et al. [28] use
a scatter plot per topic, presenting the extracted sentences visually to the user.

Table 3 includes representative examples of multi-document summarization techniques. Compared to Tables 1 and 2, there are differences concerning the values in the output and the method fields.

[Table 3: representative multi-document summarization systems [28,30,31]. Inputs are multi-document, English, text; outputs include extracts; evaluation, where reported, is intrinsic.]

2.3.4. Multimedia summarization techniques

2.3.4.1. Dialogue summarization. Zechner [32,33] works on dialogue summarization in unrestricted domains and genres. Zechner works with transcripts of human dialogues, which are generated either
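Returning briefly to the extractive-based multi-document techniques of Section 2.3.3, the combine-and-re-rank step that minimizes redundancy is often implemented greedily: prefer high-salience regions that are dissimilar to those already selected. The following sketch is illustrative only and not taken from the surveyed systems; the word-overlap (Jaccard) similarity, the trade-off weight `lam` and the scores are assumptions:

```python
def jaccard(a, b):
    """Word-overlap similarity between two text regions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def rerank(regions, n=3, lam=0.7):
    """Greedy redundancy-aware selection. `regions` is a list of
    (salience_score, text) pairs pooled from all input documents."""
    pool = sorted(regions, reverse=True)
    chosen = []
    while pool and len(chosen) < n:
        # trade off salience against similarity to already-chosen regions
        best = max(
            pool,
            key=lambda r: lam * r[0]
            - (1 - lam) * max((jaccard(r[1], c[1]) for c in chosen), default=0.0),
        )
        pool.remove(best)
        chosen.append(best)
    return [text for _, text in chosen]
```

With this greedy scheme, a near-duplicate of an already-selected region is penalized and a lower-scored but novel region can be preferred.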
2.3.4.2. Diagram summarization. Futrelle [10] works on the summarization of diagrams. Structural descriptions of the diagrams can be obtained as metadata from the author or by parsing the diagrams. In order to achieve his goals he takes into account not only the structural description of each diagram, but also the text in its caption or in the diagram itself.

2.3.4.3. Video summarization. Merlino and Maybury [9] extract information from various media in order to provide a summary of a broadcast news story. They use MITRE's broadcast news navigator (BNN [35]), which supports searching and browsing of news stories, not from just one channel, but from a variety of sources. They use silence and speaker change from the audio, anchor and logo detection from the video, and closed-captioned text in order to segment the stream into news stories. For the purpose of presenting a summary, they experiment with a diversity of presentation methods, which include mixed media. The first is key-frames, i.e. important single frames, and shots, i.e. important sequences of frames. The second is extracted single sentences, weighted according to the presence of named entities (organizations, persons and places). The final one is named-entity keywords.

2.3.5. Summarization from a cognitive science perspective

The techniques presented so far pay little attention to the ways used by humans to create summaries themselves. Endres-Niggemeyer et al. [36-39], on the other hand, try to simulate the human cognitive process of professional summarizers. They aim at developing an empirical model for summarization, based on professional summarizers, and at implementing this model in a system which will "imitate" the human process of summarizing. For this purpose, they recruited six professional summarizers who worked on nine summarization processes. The whole process included the division of the tasks of the professional summarizers into sub-tasks, the interpretation of each sub-task into a more formal framework (i.e. giving it a name, a functional definition, etc.), and the hierarchical organization of the resulting strategies according to their function. Despite the diversity of the technical background and cognitive profile of each professional summarizer and their idiosyncrasies in creating summaries, the results are quite striking. Quoting from [36, p. 129]: "83 strategies are used by all experts of the first group, 60 strategies are shared by five experts, another 62 strategies are common knowledge of four summarizing experts, 79 strategies belong to the repertory of three summarization experts, 101 strategies are used by two experts, and 167 strategies are individual." From all these strategies, 79 agents, which simulate them, were finally implemented in the SimSum system [36]. SimSum is implemented as an object-oriented blackboard, which involves 79 object-oriented agents, each one performing a relatively simple task. For instance, the Context agent checks whether the context conditions of the query are met by an input document, the TexttoProposition agent transforms input sentences into propositions known to the domain ontology, and the Redundancy agent checks whether a proposition has already been introduced. All the agents cooperate in order to deliver the summary. They can access a common knowledge database, which contains the text and ontology concepts. Furthermore, several RST relations have been implemented for discourse-level structures of the text.

3. The medical domain

Medical Informatics represents the core theories, concepts and techniques of information applications in medicine. It involves four different levels depending on the focus, from the cell to the population [40]:

- Bioinformatics concerns molecular and cellular processes, such as gene sequences;
- Imaging informatics concerns tissues and organs, such as radiology imaging systems;
- Clinical informatics concerns clinicians and patients, involving applications of various clinical specialties;
- Public health informatics concerns populations, involving applications such as disease surveillance systems.

Medical information distributed through all the above levels concerns various document types: scientific articles, electronic medical records, semi-structured databases, web documents, e-mailed reports, X-ray images, videos. The characteristics of each document type have to be taken into account in the development of a summarization system. Scientific articles are mainly composed of text and, in several cases, they have a sectioning that can be exploited by a summarization system. Electronic medical records contain structured data, apart from free text. Web documents may appear in health directories and catalogs, which need to be searched first in order to locate the interesting web pages. The web page layout is another factor that needs to be taken into account. E-mailed reports are mainly free text without any other structure. X-ray images and videos such as echocardiograms represent a completely different document type, where text may not be included at all or may be part of an image.
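One way a summarizer can exploit such type-specific structure is to condition sentence weights on it; for scientific articles with near-standard sectioning, this might look as follows. The sketch is illustrative only: the heading list and the section weights are assumptions, as is the convention that headings appear alone on a line:

```python
import re

# illustrative weights: Results/Conclusions usually carry the key findings
SECTION_WEIGHTS = {"introduction": 0.5, "methods": 0.3, "results": 1.0,
                   "discussion": 0.8, "conclusions": 1.0}

def split_sections(article):
    """Split an article into (heading, body) pairs, assuming each known
    heading appears alone on its own line."""
    sections, current, buf = [], None, []
    for line in article.splitlines():
        name = line.strip().lower()
        if name in SECTION_WEIGHTS:
            if current is not None:
                sections.append((current, " ".join(buf)))
            current, buf = name, []
        elif line.strip():
            buf.append(line.strip())
    if current is not None:
        sections.append((current, " ".join(buf)))
    return sections

def weighted_sentences(article):
    """Attach a layout-based weight to every sentence of the article."""
    out = []
    for heading, body in split_sections(article):
        for sent in re.split(r"(?<=[.!?])\s+", body):
            if sent:
                out.append((SECTION_WEIGHTS[heading], sent))
    return out
```

The resulting weights could then be combined with the surface features of Section 2.3.1, so that, for example, a sentence in a Results section outranks an otherwise identical one in Methods.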
Compared to other domains, medical documents have certain unique characteristics that make their analysis very challenging and attractive. The uniqueness of medical documents is due to their volume and their heterogeneity, as well as to the fact that they are among the most rewarding documents to analyze, especially those concerning human medical information, due to the expected social benefits (see [41]).

3.1. Scientific articles

The number of scientific journals in the fields of health and biomedicine is unmanageably large, even for a single specialty, making it very difficult for physicians and researchers to stay informed of the new results reported in their fields. Scientific articles may contain, apart from text, structured data (e.g. tables), graphs, or images. Therefore, depending on the summarization task, the system may have to process various types of data. It may be the case, for instance, that important information (e.g. experimental results) is found in a table and needs to be located and added to the summary. The article layout could also be exploited, since a large number of articles reporting on experimental results have an almost standard sectioning, the order of the sections being Introduction, Methods, Statistical Analysis, Results, Discussion, Previous Work, Limitations of the Study, and Conclusions. A study of the types of scientific articles in the various fields of medicine is necessary for their processing, whether for summarization or for other language processing tasks.

3.2. Databases of abstracts

Despite the plethora of medical journals, most of them are not freely accessible over the Web, for copyright reasons. Luckily, there are other online databases which contain the abstracts and citation information of most articles in the general field of medicine. One such online database is MEDLINE,2 which contains abstracts from more than 3500 journals. MEDLINE provides keyword searches and returns abstracts that contain the keywords. The abstracts are indexed according to the Medical Subject Headings (MeSH)3 thesaurus. Apart from access to the abstracts, MEDLINE also provides full citations of the articles, along with links to the articles themselves in case they are online. Hersh et al. [40] have created a corpus from MEDLINE which consists of titles and abstracts from articles of 270 medical journals over a period of 5 years. This corpus is annotated with information such as the MEDLINE identifier, MeSH terms assigned by humans, title, abstract, publication type, source and authors. The corpus was created for the experiments described in [40]. Abstracts represent a different type of document, since they are composed only of text and metadata (such as the MeSH annotations in MEDLINE). A summarization system must be able to exploit these metadata in several ways. For instance, in multi-document summarization, the system must first be able to locate those abstracts that discuss a specific topic, and topic information is found in the metadata of the abstracts. On the other hand, a sub-language analysis of the abstracts may be necessary in order to identify certain characteristics that may affect the performance of the language processing application. As noted in the Ariadne Genomics NLP white paper,4 their sub-language analysis indicated that ''MEDLINE sentences often include idiosyncratic linguistic constructs not necessarily reflected in generalized English grammar''. This explains, as they claim, why existing syntactic parsers with a general grammar are not suitable for dealing with this type of text.

3.3. Semi-structured databases

A number of databases have been built to provide access to biomedical information, such as information about protein function and cellular pathways. More than 280 semi-structured databases currently exist.5 Some examples are the following:

FlyBase: fruit fly genes and proteins;
Mouse Genome Database (MGB);
Protein Information Resource (PIR);
DIP: The database of interacting proteins.6

Let us take DIP, for example, which provides an integrated set of tools for browsing and extracting information about protein interaction networks. As reported in [42], the DIP database is implemented as a relational database composed of four tables: the protein table, interaction table, method table, and reference table. The reference table lists all the references to the different articles that demonstrate protein interactions and links them to the MEDLINE database.

2 See http://www.ncbi.nlm.nih.gov/entrez/query.fcgi.
3 See http://www.nlm.nih.gov/mesh/meshhome.html.
4 See http://host.ariadnegenomics.com/downloads/.
5 See KDD Cup 2002 Task 1: Information Extraction from Biomedical Articles (http://www.biostat.wisc.edu/craven/kddcup/).
6 See http://dip.doe-mbi.ucla.edu.

Therefore, a summarization
system that exploits information from DIP should be able to employ the DIP tools for browsing the database, and even access relevant MEDLINE abstracts (see the previous discussion) through the reference table.

3.4. Web documents

During the last few years, a number of specialized health directories and catalogs (portals) have been created, such as CliniWeb,7 HON,8 CISMeF,9 Medical Matrix, Yahoo Health, and HealthFinder. Some of these catalogs, including the first three of the above, are additionally indexed with the MeSH thesaurus. This allows complex queries to be stated, which exploit the hierarchical structure of MeSH. CliniWeb, for instance, provides clinically oriented information, including:

1. Cataloging content on a per-page basis;
2. Including only pages that have clinical content, i.e., excluding individual and institutional home pages, advertisements, and lists of links;
3. Indexing with a higher level of specificity, using MeSH as opposed to broad subject categories such as Orthopedics or Cancer. CliniWeb provides access to Web pages manually indexed by a large subset (trees A-G) of MeSH, including the major trees Diseases, Anatomy, and Chemicals and Drugs.

Another example of a health resources portal is CISMeF, which aims to describe and index the main French-language health resources to assist health professionals and consumers in their search for electronic information available on the Internet. In April 2002, the number of indexed resources totaled over 9600, with a mean of 50 new sites each week. CISMeF uses two standard tools for organizing information: the MeSH thesaurus of the MEDLINE bibliographic database, and several metadata element sets, including the Dublin Core. To index resources, CISMeF uses four different concepts: ''meta-term'', keyword, subheading, and resource type. CISMeF contains a thematic index, including medical specialities, and an alphabetic index.

Web pages included in catalogs such as the above form a different type of medical document. These are .html pages, which may contain information from media other than text (e.g. images, videos). Even in the case that they contain text only, there will be links pointing to other relevant pages with interesting information for the summarization task, there may be interesting information stored in a table, etc. Web page layout should also be taken into account in order to locate interesting information inside the web page, especially in the case that the catalog pages are generated dynamically from a database. In addition, the identification of web pages that are relevant to the summarization task demands the use of web spidering techniques. Therefore, as is the case for web information retrieval and extraction tasks in other domains (see the results of the CROSSMARC project10 in [43]), summarizing from web documents needs to take into account the structure and features of the web catalog and web pages. It must also be able to exploit metadata information (e.g. MeSH annotations) already used by some of the existing web catalogs. However, even in the cases where a web catalog is not indexed, the summarization system must be able to employ existing medical ontologies, thesauri or lexica. This is essential for a scientific domain with rich terminology, where the information to be extracted or summarized needs to be as precise as possible.

3.5. E-mailed reports

With the advent of the Web, several web-based services whose purpose is the exchange of opinions and news have emerged. For example, ProMED-mail11 is a public free service which promotes the exchange of news concerning outbreaks of epidemics. Other, non-free services, such as MDLinx,12 provide physicians and researchers with the opportunity to subscribe and receive alerts concerning new findings in their specialty fields, described in journal articles. The use of e-mailed reports for the fast dissemination of epidemiological information over the Internet shows increasing success for monitoring epidemiological events [44]. The descriptive possibilities of these reports and their ability to deal with unattended situations make them competitive for reporting emerging infectious disease outbreaks and unusual disease patterns, including biological threats. However, as [45] note, analysts cannot feasibly acquire, manage, and digest the vast amount of information available through e-mailed reports or other information sources 24 h a day, 7 days a week. In addition, access to foreign-language documents, as well as to the local news of other countries, is generally limited.

7 See http://www.ohsu.edu/cliniweb/.
8 See http://www.chu-rouen.fr/cismef/.
9 See http://www.chu-rouen.fr/cismef.
10 See http://www.iit.demokritos.gr/skel/crossmarc.
11 See http://www.promedmail.org.
12 See http://www.mdlinx.com/.

Even when foreign-language news is available, it is usually
no longer current by the time it gets translated and reaches the hands of an analyst. This very real problem raises an urgent need for the development of automated support for the global tracking of infectious disease outbreaks and emerging biological threats.

ProMED-mail is a service monitoring news on infectious disease outbreaks around the world, 7 days a week. By providing early warning of outbreaks of emerging and re-emerging diseases, ProMED aims at enabling public health precautions at all levels in a timely manner, to prevent epidemic transmission and to save lives. ProMED's sources of information include, among others, media reports, official reports, online summaries, and local observers. Reports are also contributed by ProMED-mail subscribers. A team of expert moderators investigates reports before posting them to the network. Reports are distributed by e-mail to subscribers and posted on the ProMED-mail web site. ProMED-mail currently reaches over 30,000 subscribers in 150 countries. ProMED-mail is also available, apart from English, in Portuguese and in Spanish. Both of these lists cover disease news and topics relevant to Portuguese- and Spanish-speaking countries, respectively.

E-mailed reports for monitoring infectious disease outbreaks and emerging biological threats represent a different type of medical document. Such reports may contain, apart from raw text, various types of information in attached files. The fact that these reports may be in several languages, or may point to other sources such as local news, makes the summarization task even more difficult. A sub-language analysis may also be necessary for these types of documents, since they often follow a specific writing style and structure. Medical terminology should also be taken into account, as is the case for the other document types, exploiting existing medical resources for the specific diseases or biological threats.

3.6. Electronic medical records

Most hospitals keep a record for each of their patients. Usually the records contain patient data in a standard structured form, with predefined fields or tabular representations, as well as free-text fields containing unstructured information, usually doctors' reports about their patients (either written reports or the result of dictation). As Mckeown et al. [46] note, a patient record for any single patient consists of many individual reports, collected during a visit to hospital. For some patients, this can amount to several hundred reports.

A system summarizing information from medical records needs to take several factors into account. It must be able to process the free-text reports, which may be problematic due to the specific sub-language used by the clinicians, or due to the fact that the report is the result of dictation. Information from written reports may also have to be combined with information existing in structured data (tables, graphs) or even in other media (e.g. X-ray images, videos). The situation, however, may become even more complex for a summarization system that aims to summarize, for a clinician, information collected not only from the patient record but also from other records (cases similar to that of the specific patient), and from relevant scientific articles or abstracts, from journals or databases respectively. If a summarization system is to be integrated into the busy clinical workflow, it must provide the clinician with such facilities.

3.7. Multimedia documents

Apart from documents in textual form, physicians and researchers produce and use several other documents which are multimedia in nature. Such documents can be graphs, such as cardiograms; images, such as X-rays; videos, such as the various echograms (e.g. echocardiograms, echoencephalograms); or the medical videos used mainly for educational purposes, e.g. videos of clinical operations or videos of dialogs between the doctor and the patient. Most of these documents are now transcribed and stored in digital form, often even connected to the specific patient record, giving users the ability to search and access them much faster than in the past. This is a completely different type of medical document, containing very interesting information that must be added to a summary. Techniques from areas other than language processing, such as image processing and video analysis, must also be employed in order to locate the information to be included in the summary. In addition, several of these multimedia documents are also linked with free-text reports, which must also be used by the summarization system. As noted in the discussion on electronic medical records, a summarization system integrated into the clinical workflow must be able to handle such documents. Concluding, in the medical domain, the processing of multimedia documents is crucial for summarization, and in general for information retrieval and extraction applications.

4. Summarization techniques in the medical domain

Most researchers extend to the medical domain the techniques already used in other
domains. Based on the categorization given earlier in Summarization techniques, the techniques used in the medical domain are classified under the following categories:

extractive single-document summarization;
abstractive single-document summarization;
extractive multi-document summarization;
abstractive multi-document summarization;
multimedia summarization;
cognitive model based summarization.

In the following sections, various summarization projects/systems are presented based on this categorization.

4.1. Extractive single-document summarization

One of the projects belonging to this category is MiTAP [45]. The aim of MiTAP (MITRE Text and Audio Processing)13 is to monitor infectious disease outbreaks or other biological threats by monitoring multiple information sources, such as epidemiological reports, newswire feeds, email, online news, television news and radio news, in multiple languages. All the captured information is filtered and the resulting information is normalized. Each normalized article is passed through a zoner that uses human-created rules to identify the source, date, and other fields, such as the article title and body. The zoned messages are processed to identify paragraph, sentence and word boundaries, as well as part-of-speech tags. The processed messages are then fed into a named entity recognizer, which identifies person, organization and location names, as well as dates, diseases, and victim descriptions, using human-created rules. Finally, the document is processed by WebSumm [47], which generates a summary out of modified versions of extracted sentences. For non-English sources, a machine translation system is used to translate the messages automatically into English. In addition to single-document summarization, MiTAP has recently incorporated two types of multi-document summarization: Newsblaster [46]14 automatically clusters articles and generates summaries based on each cluster, while Alias-i15 produces summaries on particular entities and generates daily top-10 lists of diseases in the news.

Another project is MUSI [48]. MUSI stands for ''MUltilingual Summarization for the Internet'', and it is a cross-lingual summarization system which uses articles from The Journal of Anaesthesiology as input. The journal is freely accessible online16 and its articles are written in Italian and English. MUSI takes those articles and creates summaries from them in French and German. The system is query-based, and it extracts sentences from the input article according to the following criteria: cue phrases, position of the sentences, query words and compression rate. That is, MUSI follows the Edmundsonian paradigm for the selection of the sentences. Once the sentences have been extracted, two approaches can be followed: either they are used as they are to form the extractive summary, or they are converted into a semantic representation to produce an abstractive summary.

A third project exploiting extractive techniques is presented in [49]. The most important aspect of this approach is that it ranks the extracted sentences according to the so-called cluster signature of the document. More specifically, their prototype system takes medical documents (the result of a query using a search engine) as input and clusters them into groups. These groups are then analyzed for features with high support, called key features, forming a cluster signature that best characterizes each document group. The summary is generated by matching the cluster signature to each sentence of the document to be summarized. Both the sentence and the cluster signature are represented using a vector space model. The ranked sentences are then selected and presented to the user as a summary. Johnson et al. [49] used for their experiments abstracts and full texts from the Journal of the American Medical Association.

4.2. Abstractive single-document summarization

MUSI [48] is a system generating either extractive summaries (see the previous section) or abstractive ones. In the case of abstractive summarization, after the system has selected the sentences, it converts them into a predicate-argument structure representation, instead of simply presenting them to the user. The steps for achieving that representation are: tokenization, morphological analysis, shallow syntactic parsing, chunking, dependency analysis, and mapping to the internal representation.

13 For more information on MiTAP, visit http://tides2000.mitre.org.
14 See http://www.cs.columbia.edu/nlp/newsblaster.
15 See http://www.alias-i.com/.
16 You can access this online journal at http://anestit.unipa.it/esiait/esiaing/esianuming.htm.

After the representation has been achieved, they create the summaries of those extracted
sentences using the natural language generation (NLG) system Lexigen [50] for the French language and TG/2 [51] for German. The generation systems produce indicative summaries of the document content. Summaries include both translated portions of the extracted sentences and ''meta-statements'' about the original document. The latter provide the user with additional optional information about the content and structure of the source text, the relevance of the extracted pieces of information as well as of the whole document with respect to the query, etc. Users can customize the summary length, as well as some other aspects concerning style and presentation.

TRESTLE (Text Retrieval Extraction and Summarization Technologies for Large Enterprises) is a system which produces single-sentence summaries of Scrip17 pharmaceutical newsletters [52]. Their system is in essence an Information Extraction system, which relies heavily on Named Entity (NE) recognition. For this system, drug names and diseases are also named entities, apart from the classical ones such as organization, person and location. TRESTLE allows users to navigate through the Scrip articles, and thus find the information they are interested in, using the named entities that the system has extracted, which are links to the original articles from which the NEs have been extracted. Apart from this, TRESTLE also creates a single-sentence summary for each newsletter from the template that was filled by the Information Extraction process. A link is also provided to the original newsletter.

4.3. Extractive multi-document summarization

Although the production of summaries from multiple documents is usually done with abstractive techniques, Kan et al. [53,54] follow a different approach. They argue that different types of summaries, such as indicative or informative, serve different informational purposes and both can be useful, and that extracting sentences for the creation of an informative multi-document summary ''is well accepted since it is simple, fast and easy to evaluate''. Their system, Centrifuser, which is the summarization engine of the PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video and Language) project,18 produces both indicative and informative multi-document summaries, with the aim of highlighting the similarities and differences among the documents.

The input to Centrifuser consists of articles retrieved by the search engine of the PERSIVAL system according to the patient record and the user query. For each article they create a topic tree, which depicts the sectioning of the article. A composite topic tree is then created by merging together all the topic trees and adding details to each node, such as its relative typicality (i.e. how typical that topic is compared to the rest of the topics), its position within the article, and the various lexical forms in which it may be expressed [53]. In the next step they try to match the nodes of the topic trees with the query. The matched nodes do not contain any text; instead, they point to sections in the original documents, from which the most representative sentences should be extracted. Since the imposed compression rate will not always allow each topic to receive a sentence, the first step is to choose which topics are going to receive a sentence. In the next step they choose the representative sentences for each topic. The final step in the creation of the summary involves the ordering of those sentences, which is achieved by first ordering the topics according to each topic's typicality, and then ordering the sentences themselves inside every topic, according to the physical position of every sentence.

4.4. Abstractive multi-document summarization

Apart from informative extractive multi-document summaries, Centrifuser creates indicative abstractive multi-document summaries as well, which are used by the PERSIVAL users for searching papers. As noted in Extractive multi-document summarization, the approach of Kan et al. [53,54] leads to nodes in the topic trees which match the query of the user. This, they argue, can be the first phase of Natural Language Generation (NLG). In the next step of NLG, which they call planning, they try to figure out which nodes of the topic trees they will summarize. To achieve this, they determine which nodes are relevant, irrelevant or intricate, based on how deep the nodes are compared with the query node, i.e. the node that matches the user query. Thus, nodes that are descendants of the query node and are below depth k are considered intricate, those above depth k relevant, and all the other nodes (i.e. the ones that are not descendants of the query node) are irrelevant. In the final NLG step, realization, the ordered information is converted to text. For a more thorough treatment of Centrifuser, see [55].

17 See http://www.pjpub.co.uk for more information on Scrip newsletters.
18 See http://persival.cs.columbia.edu/.
Apart from the abstractive indicative summaries that Centrifuser produces, PERSIVAL produces another type of abstractive summary [1,56]. These summaries are not concerned with highlighting the similarities and differences among several medical articles, but with the creation of an informative abstractive summary. That summary is tailored according to the preferences of two different types of users: the physician, and the patient or her relatives. The system should identify in the documents and extract tuples of the form (Parameter(s), Finding, Relation). The relations can be any of the following six types: association, prediction, risk, absence of association, absence of prediction and absence of risk. They call those tuples results. Elhadad and Mckeown [56], using empirical methods, i.e. interviews with the physicians, concluded that a summary should fulfill the following qualitative criteria in relation to the results:

Completeness and accuracy. The results should be complete and accurate, in the sense that all the relevant results, and only them, should be included.
Repetitions and contradictions. The system should identify repetitions and contradictions among the results. In order to do so, Elhadad and Mckeown [56] have created a representation of the results which allows them to identify relations such as subsumption and contradiction among the results.
Coherence and cohesion. Coherence for [56] is established by ''accurate aggregation and ordering of the related results''. Cohesion is defined as follows: ''two sentences are part of the same paragraph, if and only if they are related.'' Related are the sentences that present either the same finding or the same parameter(s).

The system described in [56] takes input from three different sources:

Patient record. In general, the patient record consists of structured documents, usually in tabular form, and unstructured documents, and sometimes it can be very large.
Journal medical articles. Their system takes as input a vast amount of online articles from medical journals in the field of cardiology. In fact, the articles that are input to the system are the ones that globally match the patient, i.e. the ones that contain information relevant to the patient.
The user query. Although the physician's query is posed in natural language, the system does not try to fully understand the question and give an answer, but instead gives as much information as possible about the question using some of the query keywords.

The input articles are first classified automatically into three categories: prognosis, treatment, diagnosis. The next step involves the identification and extraction of the results, i.e. the tuples mentioned above. For this purpose, the authors exploit the ''rigid'', as they call it, structure of the medical articles. This means that they try to locate the Results section and select the sentences that are relevant to the patient. The selected sentences are then passed to the extraction module, which extracts, in template form, the following information: the finding(s), the parameters, the relation, the degree of dependence of the parameters, the article and the sentence the result has been extracted from, and various other minor information. The templates are filled with the aid of hand-crafted patterns.

The next step involves the determination of which portions, if any, of the extracted parameters are relevant to the patient record. After that, the resulting templates are merged and ordered. To achieve this, the templates are rendered into an internal ''semantic'' representation, in the form of a graph. From this graph, they are able to identify repetitions and contradictions. A repetition occurs if two nodes are connected by more than one vertex and the vertices have ''similar'' types. What is similar and what is not has been ''established'' in interviews with physicians. A contradiction occurs in the same situation, but now the vertices have different types. Repetitions and contradictions are used in order to create a more coherent summary. With this method they manage to perform the merging of the templates. For the ordering, they use the following criteria:

Query based: a relation that answers the user query is weighted higher.
Salience based: repetitions and contradictions are weighted higher.
Domain based: studies with physicians show that some relation types are more interesting than others. For instance, a risk relation is weighted higher than an association relation.
Source based: dependent relations from the same template are presented together.

The final step involves the creation of the summary, through NLG techniques. In the final summary, all the medical terms are hyperlinked to their definitions. This is achieved by connecting the system of Elhadad and Mckeown [56]
Table 4 Summarization systems from medical documents

Ref. | Input | Purpose | Output | Method | Evaluation
[45] | Single-document (also multi-document), multimedia (text, audio, video) | Indicative, user-oriented, multilingual, domain-specific | Sentences (extracts) | Language processing (named entity recognition, machine translation), machine learning | Extrinsic
[48] | Single-document, multilingual, text | Indicative, user-oriented, domain-specific | Sentences (extracts), abstracts | Statistics (sentence extraction), language processing (semantic representation for abstraction) | Intrinsic, extrinsic
[49] | Single-document, monolingual, text | Indicative, generic, domain-specific | Sentences (extracts) | Statistics (vector space model) | —
[52] | Single-document, monolingual, text | Indicative, generic, domain-specific | Abstracts | Language processing (information extraction) | —
[55] | Multi-document, monolingual, text | Indicative-informative, generic, domain-specific | Extracts, abstracts | Statistics (clustering using similarity measures), language processing | Extrinsic
[56] | Multi-document, monolingual, text | Informative, user-oriented, domain-specific | Abstracts | Language processing (information extraction, NLG) | —
[58] | Single-document, video (echocardiograms) | Generic, domain-specific | Video sequences (extracts) | Image and video processing | —
[59] | Single-document, video (clinical operations, dialogues, presentations) | Generic, domain-specific | Video sequences (extracts) | Image and video processing | —
[61,62] | Multi-document, monolingual, text | Informative, user-oriented, domain-specific | Abstracts | Agents simulating summarization tasks, language processing | —
with DEFINDER, a text mining tool for extracting definitions of terms from medical articles (see [57]).

4.5. Multimedia summarization

Ebadollahi et al. [58] and Xingquan et al. [59] present systems performing summarization of documents with multimedia content: echocardiograms and medical videos, respectively.

The work presented in [58] is part of the PERSIVAL project mentioned above. In their study they are concerned with echocardiograms (ECGs). ECGs are usually videotaped for archival purposes, and recently they have started to be transcribed into a digital format, which helps clinicians and facilitates the task of summarizing them. Summarizing an ECG, and video in general, as seen in the work of Merlino and Mark [9], involves extracting the most interesting video frames, called key-frames, which enable the user to easily navigate through the ECGs and view their essential parts. For [58], summarizing an ECG involves two things: parsing the ECG and selecting the key-frames. The aim of the parser is to temporally segment the sequences of the video into smaller units, which are called shots. A shot is a sequence of frames in which the camera is uninterrupted. In the context of ECG videos, a shot corresponds to a single position and angle of the ultrasound transducer. The method they use for the parsing is a special case of the algorithm presented in [60]. The next step is the key-frame selection, which extracts the most informative (important) frames in the sequence of the video. After mentioning several methods for extracting key-frames, they conclude that in the context of ECGs the key-frames are ''the local extrema of the cardiac periodic expansive-contractive motion'', since ''the time at which the cardiac motion changes from expansive to contractive corresponds to the end-diastole and the time at which the motion changes from contractive to expansive corresponds to end-systole''. Having performed the above two tasks, they create two summaries, which they call static and dynamic.

Static summary. This summary, in essence, is constituted from the selection of the extracted key-frames, and it is useful for browsing the content of the echo video.
Dynamic summary. [...] segments of videos they create the dynamic summary.

Xingquan et al. [59] follow a similar approach to [58] in order to parse the video stream into physical units. Then video group detection, scene detection and clustering strategies are used to mine the video content structure. Various visual and audio feature processing techniques are utilized to detect semantic cues within the video, such as slides, face and speaker changes, etc., and these detection results are joined together to mine three types of events from the detected video scenes (presentations by doctors or experts on video topics, clinical operations presenting details of diseases, and dialogs between doctors and patients). Based on the mined video content structure and event information, a scalable video skimming and summarization tool, ClassMiner, has been constructed to visualize the video overview and help users access video content. Their system utilizes four-layer video skimming, where levels 4 through 1 consist of representative shots of clustered scenes, all scenes, all groups, and all shots of the video, respectively.

4.6. Cognitive model based summarization

Based on the cognitive model used in the SimSum system (see Summarization from a cognitive science perspective), Endres-Niggemeyer [61,62] presented its extension, SummIt-BMT, which is concerned with the summarization of MEDLINE abstracts and articles for bone marrow transplantation, a specialized field of internal medicine. SummIt-BMT is a query-based summarization system. In general, the summarization process is the following:

1. A user forms a search scenario using concepts from the domain ontology.
2. This scenario is mapped to a MEDLINE query. If the outcome of the query points to journal articles, they are included in the results.
3. A text retrieval component identifies the interesting pieces of text in the results.
4. Those pieces are summarized in relation to the query scenario. Links to the original articles are also given.

Although SummIt-BMT is based on SimSum, it differs from it in several ways. It is not a presentational
Dynamic summary. This summary, also called clin- model anymore but a functional one. Thus, agents
ical summary among the clinicians, is a concate- simulating lower level cognitive processes have b-
nation of the small extracted sequences of the een replaced by functional ones. Text production
video. They chose to extract one (or more, based agents have been removed since SummIt-BMT does
on the needs of the clinicians) cycle of the heart not produce smooth text, but organized text clips
motion, known also as R—R cycle. By joining those that are linked to their source positions. As the
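The key-frame criterion quoted above (local extrema of the periodic expansive—contractive motion) reduces, once a per-frame motion signal is available, to finding the turning points of a one-dimensional curve. The sketch below illustrates this under the assumption that such a signal has already been computed (e.g. as a mean inter-frame intensity difference); it is a simplified stand-in, not the actual method of [58].

```python
import math

def keyframe_indices(motion):
    """Return the indices of the local extrema of a 1-D motion signal.

    `motion` holds one sample per frame; how that signal is derived from
    the echo video is assumed here, not taken from [58]. A sign change of
    the discrete derivative marks a turning point, i.e. end-diastole or
    end-systole in the cardiac interpretation.
    """
    extrema = []
    for i in range(1, len(motion) - 1):
        d_prev = motion[i] - motion[i - 1]
        d_next = motion[i + 1] - motion[i]
        if d_prev * d_next < 0:  # derivative changes sign: local max or min
            extrema.append(i)
    return extrema

# A synthetic periodic "expansive-contractive" signal with period 20:
signal = [math.sin(2 * math.pi * t / 20) for t in range(60)]
print(keyframe_indices(signal))  # extrema at t = 5, 15, 25, 35, 45, 55
```

On a real echo video the signal would be noisy, so some smoothing before the sign-change test would be needed; the turning-point idea itself is unchanged.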
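The four-step, query-based process described above can be sketched as a small pipeline. Every name and the keyword-matching logic below are illustrative assumptions, not SummIt-BMT's actual components (which rely on ontology concepts and predicate-logic representations rather than string matching):

```python
# Hypothetical sketch of a SummIt-BMT-style query-based flow.

def build_query(scenario_concepts):
    # Step 2: map the user's scenario concepts to a boolean,
    # MEDLINE-style query string (sorted for determinism).
    return " AND ".join(sorted(scenario_concepts))

def retrieve_passages(documents, scenario_concepts):
    # Step 3: keep only passages mentioning at least one concept,
    # retaining the source document identifier as the link back.
    hits = []
    for doc_id, passages in documents.items():
        for p in passages:
            if any(c.lower() in p.lower() for c in scenario_concepts):
                hits.append((doc_id, p))
    return hits

def summarize(hits, max_clips=2):
    # Step 4: return text clips linked to their source positions;
    # no smooth text is generated, mirroring the survey's description.
    return hits[:max_clips]

docs = {
    "pmid:1": ["Graft-versus-host disease after transplantation ...",
               "Unrelated paragraph about laboratory procedures."],
    "pmid:2": ["Conditioning regimens for bone marrow transplantation ..."],
}
concepts = {"transplantation", "graft-versus-host disease"}
print(build_query(concepts))
print(summarize(retrieve_passages(docs, concepts)))
```

The deliberate design point, taken from the survey's description, is that the output is a list of (source, clip) pairs rather than generated prose.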
As the application field is bone marrow transplantation (BMT), a BMT ontology was set up. Although several medical ontologies existed which are loosely related to the BMT field, a BMT-specific ontology was created, because the existing ontologies did not contain enough deep BMT knowledge for text knowledge processing. The ontology they created is very important for SummIt-BMT, since it is used in almost all the stages of the summarization process.

A scenario interface reflecting everyday situations of BMT physicians [63] helps users to state their queries. Users fill in ontology concepts, which are, for their convenience, equipped with definitions and explanations assembled from various sources on the web. From scenario forms and user-selected ontology terms the system obtains structured queries in the predicate-logic form that it can "understand". Queries are given to the search engines, which return a set of documents, abstracts and maybe journal articles, from MEDLINE. The retrieved documents are checked for possible relevance by a text passage retrieval component. Irrelevant documents are discarded. From the final set of documents, the summarization agents take the positive passages from text passage retrieval, represent their phrases and sentences in a predicate-logic form, and examine them with human-style criteria: whether they are related to the user query, whether they are redundant, and so on. The agents remove items that do not meet their relevance criteria.

Table 4 summarizes the main features of the projects/systems presented in "Summarization techniques in the medical domain".

5. Promising paths for future research

Although initial work on summarization dates back to the late 1950s and 1960s (e.g. [3,4]), most research in the field has been performed during the last few years. The result is that the research field has not yet achieved a mature state, and a variety of challenges still need to be overcome. The scaling to large collections of documents, the use of more sophisticated natural language processing techniques for generating abstracts, and the availability of annotated summarization corpora for training and testing purposes are some of these challenges.

This is also the case for the domain of medical documents. The study of existing summarization techniques in other domains, the examination of different types of medical documents and the study of techniques reported so far in the literature for medical summarization lead to certain interesting remarks concerning the promising paths for future research. These remarks are presented below in terms of the summarization factors.

In terms of the input medium, almost all methods concern summarization from text, although the specific domain can provide a lot of useful input in other media as well (e.g. speech, images, videos). Summarizing information from different media (e.g. spoken transcriptions and textual reports related to specific echo-videos) is an important issue for practical applications, representing a promising path for future research and development.

Concerning the number of the input documents, both categories of techniques (single- and multi-document) have been examined. As is the case in other domains apart from medicine, single-document summarization methods mainly use extractive techniques, whereas almost all of the multi-document summarizers are based on abstractive techniques. However, the selection between the simpler extractive techniques and the more complex abstractive ones should not only be based on the number of input documents, but also on the available resources and tools and on the summary purpose and output factors.

Concerning the language of the input document(s), most of the existing systems are monolingual (English in almost all cases). There are two cases (MiTAP, MUSI) where the multilingual aspect was taken into account. In the MiTAP case, this was due to the domain (monitoring disease outbreaks), where the information sources are in various languages. On the other hand, MUSI summarizes the articles of a bilingual journal. In the medical domain there is an enormous amount of documents in various categories (e.g. patient records) in languages other than English. There are resources and tools in several other languages that can be exploited in building summarizers handling more than one language, using either shallow or deeper approaches to language processing.

In relation to purpose factors, the existing methods mainly concern indicative summarization. The purpose of such summaries is to navigate the reader to the required information, which seems to be sufficient for most practical applications in medicine as long as no better solution is available. The production of indicative summaries seems to "indicate" that the shallow summarizing strategies used so far are not enough for producing informative or even critical summaries. Deeper language processing techniques [64], and their combination with shallow processing ones, seem to be a promising path for future research in NLP in general and in summarization in particular.
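As a point of reference for the extractive techniques discussed above, a minimal Luhn-style [3] sentence extractor fits in a few lines. The tokenization and stopword list below are simplifying assumptions and are not taken from any of the surveyed systems:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is", "was", "in", "and", "to", "for"}

def extract_summary(sentences, n=1):
    """Minimal Luhn-style extractive summarizer: score each sentence by
    the document-wide frequency of its content words (normalized by
    sentence length) and return the top-n sentences in original order."""
    tokenized = []
    for s in sentences:
        words = [w.strip(".,;:!?").lower() for w in s.split()]
        tokenized.append([w for w in words if w and w not in STOPWORDS])
    freq = Counter(w for toks in tokenized for w in toks)
    scores = [sum(freq[w] for w in toks) / (len(toks) or 1) for toks in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]

sents = [
    "Summarization systems compress documents.",
    "Medical summarization systems compress medical documents for clinicians.",
    "The weather was pleasant.",
]
print(extract_summary(sents, n=1))
```

Normalizing by sentence length keeps long sentences from dominating; the contrast with the abstractive systems surveyed above is that nothing outside the input sentences is ever generated.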
There is a trend towards user-oriented summaries, which is reasonable, since summarization systems in the medical domain aim to cover the information needs of different user types (clinicians, researchers, patients) and of specific users. User involvement does not concern only the submission of a query to the system, but also the summary customization and presentation according to the user's model. The PERSIVAL system [56] maintains information about the users' preferences, taking into account their expertise in the domain, as well as the users' access tasks. The summary presentation can also be affected by the user's model (e.g. production of a summary in the form of hypertext, combination of text and images or video, etc.). Personalized access to medical information is a crucial issue and needs to be further investigated. There is a lot of expertise from the application of user modeling techniques in other domains, which can also be exploited in the medical domain (see [65,66]).

Domain customization is another significant issue. Most of the existing medical summarization systems are able to process documents belonging to specific sub-domains of medicine. Emphasis must be given to the development of technology that can be easily ported to new sub-domains. The development of open-architecture systems with reusable and trainable components and resources is imperative in summarization technology. This is directly related to the ability to exploit pre-existing medical knowledge resources. There are currently various knowledge repositories, such as the Unified Medical Language System (UMLS, see http://www.nlm.nih.gov/research/umls/) and MeSH, which can be exploited in several ways by summarization engines. For instance, they can be used to locate interesting document(s) and interesting sentences inside those documents. They can even be used to create conceptual representations of the selected sentences in order to produce abstractive summaries in the same or in a different language. Such approaches are presented in the literature and can be further investigated. The development of customizable summarization technologies also requires in-depth study of the medical document types and the medical sub-language. A general-purpose system must be able to exploit the various characteristics of medical documents. For instance, the sectioning of scientific articles and the specialized language used in e-mailed reports or in patient records are important features that can significantly affect the performance of the involved language processing tools. In general, the research community must cooperate towards the development of portable summarization technology, and the medical domain can provide the necessary application areas.

Concerning the output factors, the quality of the summarization output is strongly related to the summarization task. Therefore, qualitative and quantitative criteria need to be established following a study of the domain and the users' interests. In terms of the decision between extractive and abstractive techniques, as noted above, this has to take into account several factors related to the input documents, the purpose of the summary, the qualitative criteria established, as well as the available resources and tools.

6. Conclusions

This survey presented the potential of summarization technology in the medical domain, based on the examination of the state of the art, as well as of existing medical document types and summarization applications.

The challenges that summarization research has to overcome need to be viewed under the prism of the requirements of the specific field. The scaling to large collections of documents in various languages and from different media, the generation of informative summaries using more sophisticated language and knowledge engineering techniques, the generation of personalized summaries, the portability to new sub-domains, the design of evaluation scenarios which model real-world situations, and the integration of summarization technology in practical applications such as the clinical workflow are among the issues that the summarization community needs to focus on.

Acknowledgements

The authors would like to thank the anonymous reviewers, as well as Dr. Constantine D. Spyropoulos and Dr. George Paliouras, for their helpful and constructive comments. Many thanks also to Ms. Eleni Kapelou and Ms. Irene Doura for checking the use of English.

References

[1] McKeown KR, Jordan DA, Hatzivassiloglou V. Generating patient-specific summaries of online literature. In: Hovy E, Radev D, editors. Intelligent text summarization: papers from the 1998 AAAI symposium. Stanford, CA, USA: AAAI Press; 1998. p. 34—43.
[2] Radev D, Hovy E, McKeown K. Introduction to the special issue on text summarization. Comput Linguist 2002;28(4).
[3] Luhn HP. The automatic creation of literature abstracts. IBM J Res Dev 1958;2(2):159—65.
[4] Edmundson HP. New methods in automatic extracting. J Assoc Comput Mach 1969;16(2):264—85.
[5] Paice CD. The automatic generation of literature abstracts: an approach based on the identification of self-indicating phrases. In: Oddy RN, Robertson SE, van Rijsbergen CJ, Williams PW, editors. Information retrieval research. London: Butterworth; 1981. p. 172—91.
[6] Paice CD. Constructing literature abstracts by computer. Inform Process Manage 1990;26(1):171—86.
[7] Sparck-Jones K. Automatic summarizing: factors and directions. In: Mani I, Maybury MT, editors. Advances in automatic text summarization. 1999. p. 1—12 [chapter 1].
[8] Mani I. Automatic summarization. Volume 3 of Natural language processing. Amsterdam/Philadelphia: John Benjamins Publishing Company; 2001.
[9] Merlino A, Maybury M. An empirical study of the optimal presentation of multimedia summaries of broadcast news. In: Mani I, Maybury MT, editors. Advances in automatic text summarization. 1999. p. 391—401 [chapter 25].
[10] Futrelle RP. Summarization of diagrams in documents. In: Mani I, Maybury MT, editors. Advances in automatic text summarization. 1999. p. 403—21 [chapter 26].
[11] Mani I, Maybury MT, editors. Advances in automatic text summarization. The MIT Press; 1999.
[12] Mann WC, Thompson SA. Rhetorical structure theory: towards a functional theory of text organization. Text 1988;8(3):243—81.
[13] Dalianis H, Hassel M, de Smedt K, Liseth A, Lech TC, Wedekind J. Porting and evaluation of automatic summarization. In: Holmboe H, editor. Nordisk Sprogteknologi. 1988.
[14] Abderrafih L. Multilingual alert agent with automatic text summarization. http://www.lehmam.freesurf.fr/automatic_summarization.htm.
[15] Hsin-Hsi C, Lin C-J. Multilingual news summarizer. In: Proceedings of the 18th International Conference on Computational Linguistics, University of Saarlandes, July 31—August 4, 2000. p. 159—65.
[16] Salton G, Singhal A, Mandar M, Buckley C. Automatic text structuring and summarization. Inform Process Manage 1997;33(2):193—207.
[17] Mani I, Bloedorn E. Summarizing similarities and differences among related documents. Inform Retrieval 1999;1(1):1—23.
[18] Marcu D. The rhetorical parsing of natural language texts. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. New Brunswick, New Jersey: Association for Computational Linguistics; 1997. p. 96—103.
[19] Marcu D. The theory and practice of discourse parsing and summarization. The MIT Press; 2000.
[20] Reiter E, Dale R. Building applied natural language generation systems. Nat Language Eng 1997;3(1):57—87.
[21] Reiter E, Dale R. Building natural language generation systems. Studies in natural language processing. Cambridge University Press; 2000.
[22] DeJong G. An overview of the FRUMP system. In: Lehnert WG, Ringle MH, editors. Strategies for natural language processing. Hillsdale, New Jersey: Erlbaum; 1982. p. 149—76.
[23] Radev DR, McKeown KR. Generating natural language summaries from multiple on-line sources. Comput Linguist 1998;24(3):469—500.
[24] Radev DR. Generating natural language summaries from multiple on-line sources: language reuse and regeneration. PhD thesis, Columbia University; 1999.
[25] Barzilay R, Elhadad M. Using lexical chains for text summarization. In: Mani I, Maybury MT, editors. Advances in automatic text summarization. 1999. p. 111—21.
[26] Saggion H, Lapalme G. Generating indicative-informative summaries with SumUM. Comput Linguist 2002;28(4):497—526.
[27] Virach S, Potipiti T, Charoenporn T. UNL document summarization. In: Proceedings of the First International Workshop on Multimedia Annotation (MMA2001), Tokyo, Japan, January 2001.
[28] Ando R, Boguraev B, Byrd R, Neff M. Multi-document summarization by visualizing topical content. In: Proceedings of the ANLP/NAACL 2000 Workshop on Automatic Summarization, Seattle, WA, April 2000.
[29] Goldstein J, Mittal V, Carbonell J, Callan J. Creating and evaluating multi-document sentence extract summaries. In: Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 2000. p. 165—72.
[30] Tsutomu H, Suzuki J, Isozaki H, Maeda E. NTT's multiple document summarization system for DUC 2003. In: Proceedings of the Workshop on Text Summarization at the Human Language Technology Conference 2003, Edmonton, Canada, May 31—June 1, 2003.
[31] Radev DR. A common theory of information fusion from multiple text sources, step one: cross-document structure. In: Proceedings of the First ACL SIGDIAL Workshop on Discourse and Dialogue, Hong Kong, October 2000.
[32] Zechner K. Automatic summarization of spoken dialogues in unrestricted domains. PhD thesis, Carnegie Mellon University, School of Computer Science, Language Technologies Institute, November 2001.
[33] Zechner K. Automatic summarization of open-domain multiparty dialogues in diverse genres. Comput Linguist 2002;28(4):447—85.
[34] Carbonell J, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (poster session), Melbourne, Australia, August 1998.
[35] Merlino A, Morey D, Maybury M. Broadcast news navigation using story segments. In: Proceedings of the ACM Multimedia; 1997. p. 381—91.
[36] Endres-Niggemeyer B. Summarizing information. Berlin: Springer-Verlag; 1998.
[37] Endres-Niggemeyer B. SimSum: an empirically founded simulation of summarizing. Inform Process Manage 2000;36(4):659—82.
[38] Endres-Niggemeyer B, Maier E, Sigel A. How to implement a naturalistic model of abstracting: four core working steps of an expert abstractor. Inform Process Manage 1995;31(5):631—74.
[39] Endres-Niggemeyer B, Neugebauer E. Professional summarizing: no cognitive simulation without observation. J Am Soc Inform Sci 1998;49(6):486—506.
[40] Hersh W, Buckley C, Leone TJ, Hickam D. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In: Proceedings of the 17th Annual ACM SIGIR Conference; 1994. p. 192—201.
[41] Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26:1—24.
[42] Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucl Acids Res 2002;30:303—5.
[43] Karkaletsis V, Spyropoulos CD. Cross-lingual information management from web pages. In: Proceedings of the Ninth Panhellenic Conference in Informatics (PCI-2003); 2003.
[44] Woodall J. Official versus unofficial outbreak reporting through the Internet. Int J Med Inform 1997;47:31—4.
[45] Damianos L, Day D, Hirschman L, Kozierok R, Mardis S, McEntee T, et al. Real users, real data, real problems: the MiTAP system for monitoring bio events. In: Proceedings of the Conference on Unified Science & Technology for Reducing Biological Threats & Countering Terrorism (BTR 2002); 2002. p. 167—77.
[46] McKeown K, Elhadad N, Hatzivassiloglou V. Leveraging a common representation for personalized search and summarization in a medical digital library. In: Proceedings of the Joint Conference on Digital Libraries; 2003.
[47] Mani I, Bloedorn E. Summarizing similarities and differences among related documents. Inform Retrieval 1999;1(1):1—23.
[48] Lenci A, Bartolini R, Calzolari N, Agua A, Busemann S, Cartier E, et al. Multilingual summarization by integrating linguistic resources in the MLIS-MUSI project. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02); 2002.
[49] Johnson DB, Zou Q, Dionisio JD, Liu VZ, Chu WW. Modeling medical content for automated summarization. Ann NY Acad Sci 2002;980:247—58.
[50] Coch J, Chevreau K. Interactive multilingual generation. In: Gelbukh A, editor. Computational linguistics and intelligent text processing. Lecture notes in computer science, vol. 2004. Berlin: Springer-Verlag; 2001.
[51] Busemann S. Best-first surface realization. In: Scott D, editor. Eighth International Natural Language Generation Workshop Proceedings; 1996. p. 101—10.
[52] Gaizauskas R, Herring P, Oakes M, Beaulieu M, Willett P, Fowkes H, et al. Intelligent access to text: integrating information extraction technology into text browsers. In: Proceedings of the Human Language Technology Conference (HLT 2001); 2001. p. 189—93.
[53] Kan M-Y, McKeown KR, Klavans JL. Applying natural language generation to indicative summarization. In: Proceedings of the Eighth European Workshop on Natural Language Generation; 2001.
[54] Kan M-Y, McKeown KR, Klavans JL. Domain-specific informative and indicative summarization for information retrieval. In: Workshop on Text Summarization (DUC 2001); 2001.
[55] Kan M-Y. Automatic text summarization as applied to information retrieval: using indicative and informative summaries. PhD dissertation, Columbia University, New York, USA; February 2003.
[56] Elhadad N, McKeown KR. Towards generating patient specific summaries of medical articles. In: Proceedings of the Automatic Summarization Workshop (NAACL 2001); 2001.
[57] Klavans JL, Muresan S. DEFINDER: rule-based methods for the extraction of medical terminology and their associated definitions from on-line text. In: Proceedings of the American Medical Informatics Association Annual Symposium, AMIA 2000; 2000.
[58] Ebadollahi S, Chang S-F, Wu H, Takoma S. Echocardiogram video summarization. In: Proceedings of the SPIE MI 2001; 2001.
[59] Xingquan Z, Fan J, Hacid M-S, Elmagarmid AK. ClassMiner: mining medical video for scalable skimming and summarization. In: Proceedings of the 10th ACM International Conference on Multimedia (Demonstration); 2002. p. 79—80.
[60] Zabih R, Miller J, Mai K. A feature-based algorithm for detecting and classifying scene breaks. In: Proceedings of the ACM Multimedia; 1993. p. 189—200.
[61] Endres-Niggemeyer B. Empirical methods for ontology engineering in bone marrow transplantation. In: International Workshop on Ontological Engineering on the Global Information Infrastructure; 1999.
[62] Endres-Niggemeyer B. Human-style WWW summarization; 2001. http://www.ik.fh-hannover.de/ik/person/ben/human-stylesummanew.pdf.
[63] Becher M, Endres-Niggemeyer B, Fichtner G. Scenario forms for web information seeking and summarizing in bone marrow transplantation. In: COLING 2002: Workshop on Multilingual Summarization and Question Answering; 2002.
[64] Oepen S, Flickinger D, Uszkoreit H, Tsujii J. Introduction to the special issue on recent achievements in the domain of HPSG-based parsing. J Nat Language Eng 2000;6(1):1—14.
[65] Kobsa A. Generic user modeling systems. User Model User-adapted Interaction 2001;11(1—2):49—63.
[66] Pierrakos D, Paliouras G, Papatheodorou C, Spyropoulos CD. Web usage mining as a tool for personalization: a survey. User Model User-adapted Interaction 2003;13(4):311—72.