Trasformada de Welvert
Special Issue:
Computational Linguistics
and its Applications
Guest Editor:
Yulia Ledeneva
Grigori Sidorov
Editorial Board
Juan Carlos Augusto (Argentina)
Costin Badica (Romania)
Vladimir Batagelj (Slovenia)
Francesco Bergadano (Italy)
Marco Botta (Italy)
Pavel Brazdil (Portugal)
Andrej Brodnik (Slovenia)
Ivan Bruha (Canada)
Wray Buntine (Finland)
Oleksandr Dorokhov (Ukraine)
Hubert L. Dreyfus (USA)
Jozo Dujmovic (USA)
Johann Eder (Austria)
Ling Feng (China)
Vladimir A. Fomichov (Russia)
Maria Ganzha (Poland)
Marjan Gušev (Macedonia)
N. Jaisankar (India)
Dimitris Kanellopoulos (Greece)
Hiroaki Kitano (Japan)
Samee Ullah Khan (USA)
Igor Kononenko (Slovenia)
Miroslav Kubat (USA)
Ante Lauc (Croatia)
Jadran Lenarcic (Slovenia)
Huan Liu (USA)
Suzana Loskovska (Macedonia)
Ramon L. de Mántaras (Spain)
Angelo Montanari (Italy)
Pavol Návrat (Slovakia)
Jerzy R. Nawrocki (Poland)
Nadia Nedjah (Brazil)
Franc Novak (Slovenia)
Marcin Paprzycki (USA/Poland)
Ivana Podnar Žarko (Croatia)
Karl H. Pribram (USA)
Luc De Raedt (Belgium)
Shahram Rahimi (USA)
Dejan Rakovic (Serbia)
Jean Ramaekers (Belgium)
Wilhelm Rossak (Germany)
Ivan Rozman (Slovenia)
Sugata Sanyal (India)
Walter Schempp (Germany)
Johannes Schwinn (Germany)
Zhongzhi Shi (China)
Oliviero Stock (Italy)
Robert Trappl (Austria)
Terry Winograd (USA)
Stefan Wrobel (Germany)
Konrad Wrona (France)
Xindong Wu (USA)
Editorial
Computational Linguistics and its Applications
This special issue contains several interesting papers
related to computational linguistics and its applications.
The papers were carefully selected by the guest editors
on the basis of peer reviews. We are happy that authors from various countries (USA, Spain, Mexico, China, Germany, Hungary, India, Japan, and Lithuania) chose this forum for presenting their high-quality research results.
The first paper, Recent Advances in Computational Linguistics by Yulia Ledeneva and Grigori Sidorov (Mexico), presents the authors' view of some aspects of the current state of computational linguistics and its applications.
The ability to paraphrase can be considered one of the important characteristics of natural language use, since it demonstrates correct understanding of phrases. This interesting phenomenon is discussed in the paper Paraphrase Identification using Weighted Dependencies and Word Semantics by Mihai C. Lintean and Vasile Rus (USA). The authors analyze paraphrases using syntactic and semantic (lexical) information, and present evaluation data.
Some important issues of automatic summarization
are discussed in the paper Challenging Issues of
Automatic Summarization: Relevance Detection and
Quality-based Evaluation by Elena Lloret and Manuel
Palomar (Spain). Namely, two related ideas are
presented. First, it is shown that the code quantity
principle (most important information) is applicable to
automatic summarization. Second, an evaluation of
quality of summaries is discussed.
Axel-Cyrille Ngonga Ngomo (Germany) presents in his paper Low-Bias Extraction of Domain-Specific Concepts a new approach to the extraction of domain-specific terms that aims to be independent of the specific domain. The approach is based on graph clustering.
In the work of Alberto Téllez-Valero, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda and Anselmo Peñas-Padilla (Mexico, Spain), Towards Multi-Stream Question Answering using Answer Validation, a method for determining the correctness of automatically generated answers is discussed. The method combines the outputs of several question answering systems.
The paper by Asif Ekbal and Sivaji Bandyopadhyay
(India) Named Entity Recognition using Appropriate
Unlabeled Data, Post-processing and Voting evaluates the improvement of NER obtained by using unlabeled data, post-processing and weighted voting of various models.
In the paper Assigning Library of Congress Classification Codes to Books Based Only on their Titles by Ricardo Ávila-Argüelles, Hiram Calvo, Alexander Gelbukh, and Salvador Godoy-Calderón (Mexico and Japan), experiments are presented on classifying books using only a very small amount of information, namely their titles.
Introduction
Phonetics/phonology,
Morphology,
Syntax,
Semantics,
Pragmatics, and
Discourse.
3.1 Morphological classification of languages
3.2
3.2.1 Model of analysis
3.2.2 Model of grammar
3.2.3 Computer implementation
3.3
3.3.1
3.3.2
and any other stem from the first stem. In this way,
starting from any stem we can generate any other stem.
The difference between static and dynamic method is
that in the former case, the algorithm is applied during
the compile time (when the dictionary is built), while in
the latter case, during runtime.
Note that the rules of these algorithms are different
from the rules that have to be developed to implement
analysis directly. For Russian, we use about 50 rules that are so intuitively clear that in fact any person learning Russian is aware of these rules of stem construction.
Here is an example of a stem transformation rule:
-VC, * → -C
which means: if the stem ends in a vowel (V) followed by a consonant (C) and the stem type contains the symbol * then remove this vowel. Being applied to the first stem молоток- of the noun молоток 'hammer', the rule gives the stem молотк- of the same noun.
This contrasts with about 1,000 rules necessary for direct analysis, which in addition are very superficial and counter-intuitive. For example, to analyze a non-first-stem word, [Mal85] uses rules that try to invert the effect of the mentioned rule: if the stem ends in a consonant, try to insert a vowel before it and look up each resulting hypothetical stem in the dictionary: for молотк-, try молоток-, молотек-, etc. This also slows down the system performance.
Two considerations are related to the simplicity of
our rules. First, we use the information about the type of
the stem stored in the dictionary. Second, often
generation of a non-first stem from the first one is
simpler than vice versa. More precisely, the stem that
appears in the dictionaries for a given language is the one
that allows simpler generation of other stems.
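To make the generation-oriented rule format concrete, here is a minimal sketch of how a rule such as "-VC, * → -C" could be applied; the vowel inventory, the stem-type encoding and the dictionary entry are illustrative assumptions rather than the authors' actual data structures.

```python
# Sketch: applying the stem-transformation rule "-VC, * -> -C":
# if the first stem ends in vowel + consonant and its stem type carries
# the '*' flag, drop that vowel to obtain the non-first stem.
# The vowel set and the example entry below are assumptions for illustration.

RUSSIAN_VOWELS = set("аеёиоуыэюя")

def apply_vowel_drop_rule(first_stem: str, stem_type: str) -> str:
    """Generate a non-first stem from the first stem, or return it unchanged."""
    if "*" in stem_type and len(first_stem) >= 2:
        vowel, consonant = first_stem[-2], first_stem[-1]
        if vowel in RUSSIAN_VOWELS and consonant not in RUSSIAN_VOWELS:
            return first_stem[:-2] + consonant  # remove the vowel before the final consonant
    return first_stem

# Hypothetical dictionary entry for молоток 'hammer' (stem type marked with '*'):
print(apply_vowel_drop_rule("молоток", "noun,*"))  # -> молотк
```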
3.3.3 Data preparation
3.3.4 Analysis process
3.3.5 Generation process
3.3.6
4 Computational linguistic applications
4.1 Information retrieval
n-grams [Man99].
Recent improvements to the following models should
be mentioned. As far as term selection is concerned:
MFS [Gar04, Gar06], Collocations [Bol04a, Bol04b,
Bol05, Bol08], Passages [Yin07].
As far as weighting of terms is concerned: Entropy,
Transition Point enrichment approach.
Various tasks where IR methods can be used are:
- Monolingual Document Retrieval,
- Multilingual Document Retrieval,
- Interactive Cross-Language Retrieval,
- Cross-Language Image, Speech and Video Retrieval,
- Cross-Language Geographical Information Retrieval,
- Domain-Specific Data Retrieval (Web, Medical, Scientific digital corpora) [Van08].
4.2 Question answering
4.3 Text summarization
4.3.1
4.3.2
Abstractive summarization approaches use information extraction, ontological information, information fusion, and compression. Automatically generated abstracts (abstractive summaries) move the summarization field from the use of purely extractive methods to the generation of abstracts that contain
4.3.3
4.3.4
4.4 Text generation
5.1
5.2 Graph methods
$$S(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|} \qquad (1)$$
where d is a parameter that is set between 0 and 1.
The score of each vertex is recalculated upon each
iteration based on the new weights that the neighboring
vertices have accumulated. The algorithm terminates
when the convergence point is reached for all the
vertices, meaning that the error rate for each vertex falls
below a pre-defined threshold.
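A minimal sketch of this iterative scoring (Equation (1)); the graph representation, damping value and convergence threshold are illustrative assumptions.

```python
# Sketch of the vertex-scoring iteration behind Equation (1):
# S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of S(Vj) / |Out(Vj)|
def rank_vertices(in_edges, out_degree, d=0.85, threshold=1e-4):
    """in_edges: {vertex: list of predecessors}; out_degree: {vertex: int}."""
    scores = {v: 1.0 for v in in_edges}
    while True:
        max_error = 0.0
        for v, preds in in_edges.items():
            new_score = (1 - d) + d * sum(
                scores[u] / out_degree[u] for u in preds if out_degree[u] > 0
            )
            max_error = max(max_error, abs(new_score - scores[v]))
            scores[v] = new_score
        if max_error < threshold:  # error rate below threshold for every vertex
            return scores

# Tiny example graph: A -> B, B -> A, A -> C
print(rank_vertices(in_edges={"A": ["B"], "B": ["A"], "C": ["A"]},
                    out_degree={"A": 2, "B": 1, "C": 0}))
```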
This vertex scoring scheme is based on a random-walk model, where a walker takes random steps on the graph, with the walk being modeled as a Markov process. Under certain conditions (when the graph is aperiodic and irreducible) the model is guaranteed to converge to a
stationary distribution of probabilities associated with the
vertices in the graph. Intuitively, the stationary
probability associated with a vertex represents the
probability of finding the walker at that vertex during the
random-walk, and thus it represents the importance of the
vertex within the graph.
Two of the most used algorithms are PageRank [Bri98] and HITS (Hyperlink-Induced Topic Search) [Kle99].
Undirected Graphs: Although traditionally applied
on directed graphs, algorithms for node activation or
ranking can be also applied to undirected graphs. In such
graphs, convergence is usually achieved after a larger
number of iterations, and the final ranking can differ
significantly compared to the ranking obtained on
directed graphs.
Weighted Graphs: When the graphs are built from
natural language texts, they may include multiple or
5.3
Conclusions
References
We present in this article a novel approach to the task of paraphrase identification. The proposed approach
quantifies both the similarity and the dissimilarity between two sentences. The similarity and dissimilarity are assessed based on lexico-semantic information, i.e., word semantics, and syntactic information in the
form of dependencies, which are explicit syntactic relations between words in a sentence. Word semantics
requires mapping words onto concepts in a taxonomy and then using word-to-word similarity metrics to
compute their semantic relatedness. Dependencies are obtained using state-of-the-art dependency parsers.
One important aspect of our approach is the weighting of missing dependencies, i.e., dependencies present
in one sentence but not the other. We report experimental results on the Microsoft Paraphrase Corpus, a
standard data set for evaluating approaches to paraphrase identification. The experiments showed that the
proposed approach offers state-of-the-art results. In particular, our approach offers better precision when
compared to other approaches.
Povzetek: The paper deals with comparing the content of two sentences, i.e., with paraphrases.
1 Introduction
We present in this paper a novel approach to the task of
paraphrase identification. Paraphrase is a text-to-text relation between two non-identical text fragments that express
the same idea in different ways. As an example of a paraphrase we show below a pair of sentences from the Microsoft Research (MSR) Paraphrase Corpus [5] in which
Text A is a paraphrase of Text B and vice versa.
Text A: York had no problem with MTA's insisting the decision to shift funds had been within its legal rights.
Text B: York had no problem with MTA's saying the decision to shift funds was within its powers.
Paraphrase identification is the task of deciding whether
two given text fragments have the same meaning. We focus
in this article on identifying paraphrase relations between
sentences such as the ones shown above. It should be noted
that paraphrase identification is different from paraphrase
extraction. Paraphrase extraction [1, 2] is the task of extracting fragments of texts that are in a paraphrase relation
from various sources. Paraphrase could be extracted, for
instance, from texts that contain redundant semantic content such as news articles from different media sources that
cover the same topic, or multiple English translations, by
different translators, of same source texts in a foreign language. Recognizing textual entailment [4, 20] is another
task related to paraphrase identification. Entailment is a
text-to-text relation between two texts in which one text
entails, or logically infers, the other. Entailment defines an
asymmetric relation between two texts, meaning that one text entails the other but the reverse does not necessarily hold.
2 What is a paraphrase?
A quick search with the query "What is a paraphrase?" on a major search engine reveals many definitions for the concept of paraphrase. Table 1 presents a small sample of such definitions. From the table, we notice that the most common feature in all these definitions is "different/own words".
That is, a sentence is a paraphrase of another sentence if
it conveys the same meaning using different words. While
these definitions seem to be quite clear, one particular type
of paraphrases, sentence-level paraphrases (among texts
the size of a sentence), do not seem to follow the above
definitions as evidenced by existing data sets of such paraphrases.
For sentential paraphrases, the feature of "different words" seems to be too restrictive, although not impossible.
As we will show later in the article, the MSR Paraphrase
corpus supports this claim as the paraphrases in the corpus
tend to have many words in common as opposed to using
different words to express the same meaning. While the
high lexical overlap of the paraphrases in the MSR corpus
can be explained by the protocol used to create the corpus - same keywords were used to retrieve same stories
from different sources on the web, in general, we could
argue that avoiding the high word overlap issue in sentential paraphrasing would be hard. Given an isolated sentence it would be quite challenging to omit/replace some
core concepts when trying to paraphrase. Here is an example of a sentence (instance 735 in the MSR corpus), "Counties with population declines will be Vermillion, Posey and Madison.", which would be hard to paraphrase using many other/different words. The difficulty is due to the large number of named entities in the sentence. Actually, its paraphrase in the corpus is "Vermillion, Posey and Madison County populations will decline.", which retains all the named entities from the original sentence, as it is close to impossible to replace them with other words. It is beyond the
[Table of paraphrase definitions referenced above; sources: Wikipedia, WordNet, Purdue's OWL, Bedford/St. Martin's, Pearson's Glossary, LupinWorks. The definition texts themselves were not recovered.]
3 Related work
Paraphrase identification has been explored in the past by
many researchers, especially after the release of the MSR
Paired Dependencies:
det(decision, the) = det(decision, the)
nsubj(be, decision) = nsubj(be, decision)
poss(power, its) = poss(right, its)
prep_within(be, power) = prep_within(be, right)
Unpaired Dependencies/Sentence 1: aux(be, had), amod(right-n, legal-a)
Unpaired Dependencies/Sentence 2: EMPTY
Figure 1: Example of dependency trees and sets of paired and non-paired dependencies (the dependency trees themselves are not reproduced here).
phase is similar to our approach for detecting common dependencies. In the second phase, they used a supervised classifier to detect whether the dissimilarities are important. There are two advantages of our approach compared to Qiu and colleagues' approach: (1) we use word semantics to compute similarities, (2) we take advantage of the dependency types and position in the dependency tree to weight dependencies, as opposed to simply using non-weighted/unlabeled predicate-argument relations.
Zhang and Patrick [22] offer another ingenious solution to identify sentence-level paraphrase pairs by transforming source sentences into canonicalized text forms at the lexical and syntactic level, i.e., generic and simpler forms than the original text. One of the surprising findings is that a baseline system based on a supervised decision tree classifier with simple lexical matching features leads to the best results compared to more sophisticated approaches that they or others experimented with. They also revealed limitations of the MSR Paraphrase Corpus. The fact that their text canonicalization features did not lead to better results than the baseline approach supports their finding that the sentential paraphrases, at least in the MSR corpus, share more words in common than one might expect given the standard definition of a paraphrase. The standard definition implies using different words when paraphrasing. Zhang and Patrick
used decision trees to classify the sentence pairs making
their approach a supervised one as opposed to our approach
which is minimally supervised - we only need to derive
the value of the threshold from training data for which it is
only necessary to know the distribution of true-false paraphrases in the training corpus and not the individual judgment for every instance in the corpus. They rely only on
lexical and syntactic features while we also use semantic
similarity factors.
We will compare the results of our approach on the MSR corpus with these related approaches. But first, we must detail the inner workings of our approach.
4 Approach
As mentioned earlier, our approach is based on the observation that two sentences express the same meaning, i.e.,
are paraphrases, if they have all or many words and syntactic relations in common. Furthermore, the two sentences
should have few or no dissimilar words or syntactic relations. In the example below, we show two sentences
with high lexical and syntactic overlap. The different information, legal rights in the first sentence and powers in
the second sentence, does not have a significant impact on
the overall decision that the two sentences are paraphrases,
which can be drawn based on the high degree of lexical and
syntactic overlap.
Text A: The decision was within its legal rights.
Text B: The decision was within its powers.
On the other hand, there are sentences that are almost
identical, lexically and syntactically, and yet they are not
paraphrases because the few dissimilarities make a big difference. In the example below, there is a relatively "small"
difference between the two sentences. Only the subject of
the sentences is different. However, due to the importance
of the subject relation to the meaning of any sentence the
high similarity between the sentences is sufficiently dominated by the "small" dissimilarity to make the two sentences non-paraphrases.
Text A: CBS is the leader in the 18 to 46 age group.
Text B: NBC is the leader in the 18 to 46 age group.
Thus, it is important to assess both similarities and
dissimilarities between two sentences S1 and S2 before making a decision with respect to them being paraphrases or not. In our approach, we capture the two
aspects, similarity or dissimilarity, and then find the
dominant aspect by computing a final paraphrase score
as the ratio of the similarity and dissimilarity scores:
Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2). If the paraphrase score is above a learned threshold T the sentences are deemed paraphrases. Otherwise, they are non-paraphrases.
The similarity and dissimilarity scores are computed
step procedure is used. In the first step, we take one dependency from the shorter sentence in terms of number of
dependencies (a computational efficiency trick) and identify dependencies of the same type in the other sentence.
In the second step, we compute a dependency similarity
score (d2dSim) using the word-to-word similarity metrics
applied to the two heads and two modifiers of the matched
dependencies. Heads and modifiers are mapped onto all
the corresponding concepts in WordNet, one concept for
each sense of the heads and modifiers. The similarity is
computed among all senses/concepts of the two heads and
modifiers, respectively, and then the maximum similarity is
retained. If a word is not present in WordNet exact matching is used. The word-to-word similarity scores are combined into one final dependency-to-dependency similarity
score by taking the weighted average of the similarities of
the heads and modifiers. Intuitively, more weight should
be given to the similarity score of heads and less to the
similarity score of modifiers because heads are the more
important words. Surprisingly, while trying to learn a good
weighting scheme from the training data we found that the
opposite should be applied: more weight should be given
to modifiers (0.55) and less to heads (0.45). We believe
this is true only for the MSR Paraphrase Corpus and this
weighting scheme should not be generalized to other paraphrase corpora. The MSR corpus was built in such a way
that favored highly similar sentences in terms of major content words (common or proper nouns) because the extraction of the sentences was based on keyword searching of
major events from the web. With the major content words
similar, the modifiers are the heavy lifters when it comes
to distinguishing between paraphrase and non-paraphrase
cases. Another possible approach to calculate the similarity score between dependencies is to rely only on the similarity of the most dissimilar items, either heads or modifiers.
We also tried this alternative approach, but it gave slightly
poorer results (around 2% decrease in performance), and
therefore, using a weighted scheme to calculate the similarity score for dependencies proved to be a better choice.
The dependency-to-dependency similarity score needs to
exceed a certain threshold for two matched dependencies to
be deemed similar. Empirically, we found out from training data that a good value for this threshold would be 0.5.
Once a pair of dependencies is deemed similar, we place
it into the paired dependencies set, along with the calculated dependency-to-dependency similarity value. All the
dependencies that could not be paired are moved into the
unpaired dependencies sets.
$$sim(S_1, S_2) = \sum_{d_1 \in S_1} \max_{d_2 \in S_2} \big[ d2dSim(d_1, d_2) \big]$$

$$diss(S_1, S_2) = \sum_{d \in \{unpaired_{S_1},\; unpaired_{S_2}\}} weight(d)$$
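A minimal sketch of the decision procedure described in this section: dependencies of the same type are paired greedily, each pair is scored with a weighted head/modifier word similarity, and the paraphrase decision compares sim to diss. The word_sim argument stands in for the WordNet word-to-word metrics; the 0.45/0.55 weights and the 0.5 pairing threshold follow the values reported above, but every implementation detail here is an assumption rather than the authors' code (in particular, unpaired dependencies get a flat weight of 1.0 instead of the depth- and type-based weights used in the paper).

```python
from typing import Callable, List, Tuple

Dependency = Tuple[str, str, str]  # (relation type, head, modifier)

def d2d_sim(d1: Dependency, d2: Dependency,
            word_sim: Callable[[str, str], float],
            w_head: float = 0.45, w_mod: float = 0.55) -> float:
    # Weighted average of head-head and modifier-modifier similarity.
    return w_head * word_sim(d1[1], d2[1]) + w_mod * word_sim(d1[2], d2[2])

def paraphrase_score(deps1: List[Dependency], deps2: List[Dependency],
                     word_sim: Callable[[str, str], float],
                     pair_threshold: float = 0.5) -> float:
    sim = 0.0
    paired2 = set()
    unpaired1 = []
    for d1 in deps1:
        # Consider only dependencies of the same type, keep the best-scoring one.
        candidates = [(d2d_sim(d1, d2, word_sim), i)
                      for i, d2 in enumerate(deps2)
                      if d2[0] == d1[0] and i not in paired2]
        score, idx = max(candidates, default=(0.0, None))
        if idx is not None and score >= pair_threshold:
            sim += score
            paired2.add(idx)
        else:
            unpaired1.append(d1)
    unpaired2 = [d for i, d in enumerate(deps2) if i not in paired2]
    # Flat weight of 1.0 per unpaired dependency (placeholder for the
    # depth- and type-dependent weights described in the paper).
    diss = float(len(unpaired1) + len(unpaired2))
    return sim / diss if diss > 0 else float("inf")
```

Two sentences would then be labeled as paraphrases when the returned score exceeds the threshold T learned from the training data.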
[Figure: system overview — Sentence 1 and Sentence 2 are each parsed into a set of dependencies; these are split into a set of paired/common dependencies and the sets of unpaired dependencies from sentence 1 and sentence 2; the final decision tests S / D > Threshold.]
System                                       Accuracy   Precision   Recall    F-measure
Uniform baseline                              0.6649     0.6649     1.0000     0.7987
Random baseline [3]                           0.5130     0.6830     0.5000     0.5780
Lexical baseline (from Zhang et al.) [22]     0.7230     0.7880     0.7980     0.7930
Corley and Mihalcea [3]                       0.7150     0.7230     0.9250     0.8120
Qiu [18]                                      0.7200     0.7250     0.9340     0.8160
Rus - average [19]                            0.7061     0.7207     0.9111     0.8048
Simple dependency overlap (Minipar) [13]      0.6939     0.7109     0.9093     0.7979
Simple dependency overlap (Stanford) [13]     0.6823     0.7064     0.8936     0.7890
Further rows (configuration labels not recovered):
                                              0.7206     0.7404     0.8928     0.8095
                                              0.7101     0.7270     0.9032     0.8056
                                              0.7038     0.7184     0.9119     0.8037
                                              0.7032     0.7237     0.8954     0.8005
                                              0.7177     0.7378     0.8928     0.8079
                                              0.7067     0.7265     0.8963     0.8025
                                              0.7067     0.7275     0.8936     0.8020
                                              0.7032     0.7138     0.9241     0.8055
in Phase 2: a cumulative similarity score and a cumulative dissimilarity score. The cumulative similarity score
Sim(S1 , S2 ) is computed from the set of paired dependencies by summing up the dependency-to-dependency
similarity scores (S2 in the equation for similarity score
represents the set of remaining unpaired dependencies in
the second sentence). Similarly, the dissimilarity score
Diss(S1 , S2 ) is calculated from the two sets of unpaired dependencies. Each unpaired dependency is weighted based
on two features: the depth of the dependency within the dependency tree and type of dependency. The depth is important because an unpaired dependency that is closer to the
root of the dependency tree, e.g., the main verb/predicate
of the sentence, is a stronger indicator of a big difference
between two sentences. In our approach, each unpaired dependency is initially given a perfect weight of 1.00, which
is then gradually penalized with a constant value (0.20 for
the Minipar output and 0.18 for the Stanford output), the
farther away it is from the root node. The penalty values
5 Summary of results
We experimented with our approach on the MSR Paraphrase Corpus [5]. The MSR Paraphrase Corpus is
the largest publicly available annotated paraphrase corpus which has been used in most of the recent studies
that addressed the problem of paraphrase identification.
The corpus consists of 5801 sentence pairs collected from
newswire articles, 3900 of which were labeled as paraphrases by human annotators. The whole set is divided
into a training subset (4076 pairs of which 2753 are
true paraphrases) which we have used to determine the optimum threshold T , and a test subset (1725 pairs of which
1147 are true paraphrases) that is used to report the performance results. We report results using four performance
metrics: accuracy (percentage of instances correctly predicted out of all instances), precision (percentage of predicted paraphrases that are indeed paraphrases), recall (percentage of true paraphrases that were predicted as such),
and f-measure (harmonic mean of precision and recall).
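For concreteness, the four reported metrics can be computed from raw prediction counts as follows (a generic sketch, not tied to the authors' evaluation scripts).

```python
def evaluation_metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, precision, recall and F-measure as defined in the text,
    with paraphrase treated as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure
```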
In Table 1 three baselines are reported: a uniform baseline in which the majority class (paraphrase) in the training data is always chosen, a random baseline taken from
[3], and a lexical baseline taken from [22] which uses
Table 3: Accuracy results for different WordNet metrics with optimum test threshold values
Metric           Acc.     Prec.    Rec.     F.
Lin (Minipar)    .7241    .7395    .9032    .8132
Lin (Stanford)   .7130    .7387    .8797    .8030
Path             .7183    .7332    .9058    .8105
L&C              .7165    .7253    .9233    .8124
W&P              .7188    .7270    .9241    .8138
J&C              .7217    .7425    .8901    .8097
Lesk             .7148    .7446    .8692    .8021
Vector           .7200    .7330    .9093    .8117
Vector pairs     .7188    .7519    .8614    .8029
6 Discussion
One item worth discussing is the annotation of the MSR
Paraphrase Corpus. Some sentences are intentionally labeled as paraphrases in the corpus even when the small
dissimilarities are extremely important, e.g. different numbers. Below is a pair of sentences from the corpus in
which the "small" difference in both the numbers and the
anonymous stocks in Text A are not considered important
enough for the annotators to judge the two sentences as
non-paraphrases.
Text A: The stock rose $2.11, or about 11 percent, to
close on Friday at $21.51 on the New York Stock Exchange.
Text B: PG&E Corp. shares jumped $1.63 or 8 percent
to $21.03 on the New York Stock Exchange on Friday.
This makes the corpus more challenging and makes the fully-automated solutions look less powerful than they would on a paraphrase corpus that followed the standard interpretation of what a paraphrase is, i.e., that the two texts have exactly the same meaning.
Another item worth discussing is the comparison of the
dependency parsers. Our experimental results show that
Minipar consistently outperforms Stanford, in terms of accuracy of our paraphrase identification approach. Minipar is also faster than Stanford, which first generates the
phrase-based syntactic tree for a sentence and then extracts
the corresponding sets of dependencies from the phrase-based syntactic tree. For instance, Minipar can parse 1725
pairs of sentences, i.e. 3450 sentences, in 48 seconds while
References
[1] Barzilay, R., and Lee, L. 2003. Learning to Paraphrase: An Unsupervised Approach Using Multiple Sequence Alignment. In Proceedings of NAACL
2003.
[2] Brockett, C., and Dolan, W. B. 2005. Support Vector Machines for Paraphrase Identification and Corpus Construction. In Proceedings of the 3rd International Workshop on Paraphrasing.
[3] Corley, C., and Mihalcea, R. 2005. Measuring the
Semantic Similarity of Texts. In Proceedings of the
ACL Workshop on Empirical Modeling of Semantic
Equivalence and Entailment. Ann Arbor, MI.
[4] Dagan, I., Glickman, O., and Magnini B. 2006.
The PASCAL Recognising Textual Entailment Challenge. In Quiñonero-Candela, J.; Dagan, I.; Magnini, B.; d'Alché-Buc, F. (Eds.), Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, 177–190, Springer, 2006.
[5] Dolan, B.; Quirk, C.; and Brockett, C. 2004. Unsupervised Construction of Large Paraphrase Corpora:
Exploiting Massively Parallel News Sources. In Proceedings of COLING, Geneva, Switzerland.
[6] Graesser, A.C.; Olney,A.; Haynes, B.; and Chipman,
P. 2005. Cognitive Systems: Human Cognitive Models in Systems Design. Erlbaum, Mahwah, NJ. chapter
AutoTutor: A cognitive system that simulates a tutor
that facilitates learning through mixed-initiative dialogue.
[7] Hays, D. 1964. Dependency Theory: A Formalism
and Some Observations. Language, 40: 511–525.
[8] Kozareva, Z., and Montoyo, A. 2006. Lecture Notes
in Artificial Intelligence: Proceedings of the 5th International Conference on Natural Language Processing (Fin-TAL 2006). chapter Paraphrase Identification
on the basis of Supervised Machine Learning Techniques.
[9] Ibrahim, A.; Katz B.; and Lin, J. 2003. Extracting Structural Paraphrases from Aligned Monolingual
Corpora. in Proceeding of the Second International
Workshop on Paraphrasing, (ACL 2003).
[10] Iordanskaja, L.; Kittredge, R.; and Polguère, A. 1991.
Natural Language Generation in Artificial Intelligence and Computational Linguistics. Lexical selection and paraphrase in a meaning-text generation
model, Kluwer Academic.
[11] Lin, D. 1993. Principle-Based Parsing Without Overgeneration. In Proceedings of ACL, 112120, Columbus, OH.
[12] Lin, D. 1995. A Dependency-based Method for
Evaluating Broad-coverage Parsers. In Proceedings
of IJCAI-95.
[13] Lintean, M.; Rus, V.; and Graesser, A. 2008. Using Dependency Relations to Decide Paraphrasing. In
Proceedings of the Society for Text and Discourse
Conference 2008.
[14] Marneffe, M. C; MacCartney, B.; and Manning, C.
D. 2006. Generating Typed Dependency Parsers
from Phrase Structure Parses. In Proceeding of LREC
2006.
[15] McNamara, D.S.; Boonthum, C.; Levinstein, I. B.;
and Millis, K. 2007. Handbook of Latent Semantic
Analysis. Erlbaum, Mahwah, NJ. chapter Evaluating
self-explanations in iSTART: comparing word-based
and LSA algorithms, 227241.
[16] Miller, G. 1995 WordNet: A Lexical Database of English. Communications of the ACM, v.38 n.11, p.3941.
[17] Patwardhan, S.; Banerjee, S.; and Pedersen, T. 2003.
Using Measures of Semantic Relatedness for Word
Sense Disambiguation. in Proceeding of the Fourth
International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City,
February.
[18] Qiu, L.; Kan M. Y.; and Chua T. S. 2006. Paraphrase
Recognition via Dissimilarity Significance Classification. In Proceeding of EMNLP, Sydney 2006.
[19] Rus, V.; McCarthy, P. M.; Lintean, M.; McNamara,
D. S.; and Graesser, A. C. 2008. Paraphrase Identification with Lexico-Syntactic Graph Subsumption.
In Proceedings of the Florida Artificial Intelligence
Research Society International Conference (FLAIRS2008).
[20] Rus, V.; McCarthy, P. M.; McNamara, D. S.; and
Graesser, A. C. 2008. A Study of Textual Entailment. International Journal of Artificial Intelligence
Tools, August 2008.
[21] Wu, D. 2005. Recognizing Paraphrases and Textual Entailment using Inversion Transduction Grammars. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI.
[22] Zhang, Y., and Patrick, J. 2005. Paraphrase Identification by Text Canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, 160–166.
This paper addresses the Automatic Summarization task from two different points of view, focusing on two main goals. On the one hand, a study of the suitability of the Code Quantity Principle for the Text Summarization task is described. This linguistic principle is implemented to select those sentences from a text which carry the most important information. Moreover, this method has been run over the DUC 2002
data, obtaining encouraging results in the automatic evaluation with the ROUGE tool. On the other hand,
the second topic discussed in this paper deals with the evaluation of summaries, suggesting new challenges
for this task. The main methods to perform the evaluation of summaries automatically have been described,
as well as the current problems existing with regard to this difficult task. With the aim of solving some of
these problems, a novel type of evaluation is outlined to be developed in the future, taking into account a
number of quality criteria in order to evaluate the summary in a qualitative way.
Povzetek: A method for text summarization is developed, based on finding the most important sentences.
1 Introduction
The huge amount of electronic information available on the Internet has made dealing with it increasingly difficult in recent years. The Automatic Summarization (AS) task helps users
condense all this information and present it in a brief way,
in order to make it easier to process the vast amount of
documents related to the same topic that exist these days.
Moreover, AS can be very useful for neighbouring Natural Language Processing (NLP) tasks, such as Information
Retrieval, Question Answering or Text Comprehension, because these tasks can take advantage of the summaries to
save time and resources [1].
A summary can be defined as a reductive transformation
of source text through content condensation by selection
and/or generalisation of what is important in the source [2].
According to [3], this process involves three stages: topic
identification, interpretation and summary generation. To
identify the topic in a document what systems usually do is
to assign a score to each unit of input (word, sentence, passage) by means of statistical or machine learning methods.
The stage of interpretation is what distinguishes extract-type summarization systems from abstract-type systems.
During interpretation, the topics identified as important are
fused, represented in new terms, and expressed using a
new formulation, using concepts or words not found in the
original text. Finally, when the summary content has been
created through abstracting and/or information extraction,
it requires techniques of Natural Language Generation to
build the summary sentences. When an extractive approach
is taken, there is no generation stage involved.
$$Sc_{s_i} = \frac{1}{\#NP_i} \sum_{w \in NP} |w| \qquad (1)$$
where:
#NP_i = number of noun-phrases contained in sentence i,
|w| = 1, when a word belongs to a noun-phrase.
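A minimal sketch of Formula (1); the noun-phrase chunker and stopword removal are external preprocessing steps, so only the arithmetic is shown, and the example chunks are invented for illustration.

```python
from typing import List

def sentence_score(noun_phrases: List[List[str]]) -> float:
    """Formula (1): every word inside a noun phrase contributes |w| = 1,
    and the total is normalised by the number of noun phrases #NP_i."""
    if not noun_phrases:
        return 0.0
    words_in_nps = sum(len(np) for np in noun_phrases)
    return words_in_nps / len(noun_phrases)

# Hypothetical chunker output for one sentence (stopwords already removed):
chunks = [["hurricane", "gilbert"], ["caribbean", "coast"], ["storm"]]
print(sentence_score(chunks))  # 5 words in 3 noun phrases -> 1.67
```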
In Figure 1, an example of how we compute the score of a pair of sentences is shown. Firstly, two sentences that belong to the original document can be seen. Then, chunks of these sentences are identified and stopwords are removed from them. Lastly, scores are calculated according to Formula 1. These sentences have been extracted from the DUC 2002 test data. Once we have the score for
3 Evaluating automatic
summarization
Evaluating summaries, either manually or automatically, is
a hard task. The main difficulty in evaluation comes from
the impossibility of building a fair gold-standard against
which the results of the systems can be compared [13].
Furthermore, it is also very hard to determine what a correct summary is, because there is always the possibility that a system generates a good summary that is quite different from any human summary used as an approximation to the correct output [4]. In Section 1, we mentioned the two approaches that can be adopted to evaluate an automatic summary: intrinsic or extrinsic evaluation. Intrinsic evaluation assesses mainly coherence and the summary's information content, whereas extrinsic methods focus on determining the effect of summarization on some other task, for instance Question Answering.
Next, in Section 3.1, we show how we evaluated the
novel source of knowledge and the results obtained. Afterwards, in Sections 3.2 and 3.3, we present the problems
of the evaluation and the automatic methods developed so
far, and we propose a novel idea for evaluating automatic
summaries based on quality criteria, respectively.
ROUGE-1:    0.42776   0.42241   0.41488   0.41132   0.40823
ROUGE-2:    0.21769   0.17177   0.21038   0.21075   0.20878
ROUGE-SU4:  0.17315   0.19320   0.16546   0.16604   0.16377
ROUGE-L:    0.38645   0.38133   0.37543   0.37535   0.37351
(each column corresponds to one of five compared summarization systems; the system labels were not recovered)
be the one that best discriminates between manual and automatically generated summaries. These measures are: (1)
a measure to evaluate the quality of any set of similarity
metrics, (2) a measure to evaluate the quality of a summary
using an optimal set of similarity metrics, and (3) a measure to evaluate whether the set of baseline summaries is
reliable or may produce biased results.
Despite the fact that many approaches have been developed, some important aspects of summaries, such as legibility, grammaticality, responsiveness or well-formedness
are still evaluated manually by experts. For instance, DUC assessors had a list of linguistic quality questions, and they manually assigned a mark to automatic summaries depending on the extent to which they accomplished each of these criteria.
the original source. Once performed, it could be used together with any other automatic methodology to measure the summary's informativeness.
Attempts to measure the quality of a summary have been previously described. In [32] indicativeness (by means of document topics) and sentence acceptability were evaluated by comparing automatic summaries with model ones.
More recent approaches have suggested automatic methods
to determine the coherence of a summary [33], or even an
analysis of several factors regarding readability, which can
be used for predicting the quality of texts [34].
As can be seen in Figure 3, the quality criteria aforementioned for the proposed methodology will include, among
others, coherence within the summary, how anaphoric expressions have been dealt with, whether the topic has been
identified correctly or not, or how language generation has
been used. The final goal is to set up an independent
summarization evaluation environment suitable for generic
summaries, which tests a summary's quality and decides
on whether the summary is correct or not, with respect
to its original document. Having a methodology like the one proposed here available would allow automatic summaries to be evaluated automatically, in an objective way, on their own, without comparing them to any gold-standard, in terms of more linguistic and readability aspects.
Acknowledgement
This research has been supported by the FPI grant (BES-2007-16268) from the Spanish Ministry of Science and Innovation, under the project TEXT-MESS (TIN2006-15265-C06-01).
References
[1] Hassel, M.: Resource Lean and Portable Automatic
Text Summarization. PhD thesis, Department of Numerical Analysis and Computer Science, Royal Institute of Technology, Stockholm, Sweden (2007)
[14] Marcu, D.: Discourse trees are good indicators of importance in text. In: Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization, MIT Press (1999) 123136
[15] Mihalcea, R.: Graph-based ranking algorithms for
sentence extraction, applied to text summarization.
In: Proceedings of the ACL 2004 on Interactive poster
and demonstration sessions. (2004) 20
[16] Radev, D.R., Erkan, G., Fader, A., Jordan, P., Shen,
S., Sweeney, J.P.: Lexnet: A graphical environment
for graph-based nlp. In: Proceedings of the 21st International Conference on Computational Linguistics
and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia. (July 2006)
4548
[17] Wan, X., Yang, J., Xiao, J.: Towards a unified
approach based on affinity graph to various multidocument summarizations. In: Proceedings of the
11th European Conference, ECDL 2007, Budapest,
Hungary. (2007) 297308
[18] Ji, S.: A textual perspective on Givón's quantity principle. Journal of Pragmatics 39(2) (2007) 292–304
[19] Givón, T.: A functional-typological introduction, II. Amsterdam: John Benjamins (1990)
[20] Ramshaw, L.A., Marcus, M.P.: Text chunking using
transformation-based learning. In: Proceedings of the
Third ACL Workshop on Very Large Corpora, Cambridge MA, USA. (1995)
[21] Lin, C.Y., Hovy, E.: Automatic evaluation of summaries using n-gram co-occurrence statistics. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology (HLTNAACL 2003). (2003) 7178
[22] Steinberger, J., Poesio, M., Kabadjov, M.A., Ježek, K.: Two uses of anaphora resolution in summarization. Information Processing & Management 43(6) (2007) 1663–1680
[23] Schlesinger, J.D., Okurowski, M.E., Conroy, J.M.,
O'Leary, D.P., Taylor, A., Hobbs, J., Wilson, H.:
Understanding machine performance in the context
of human performance for multi-document summarization. In: Proceedings of the DUC 2002 Workshop on Text Summarization (In conjunction with
the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization),
Philadelphia. (2002)
[24] Nenkova, A.: Summarization evaluation for text and
speech: issues and approaches. In: INTERSPEECH2006, paper 2079-Wed1WeS.1. (2006)
35
[25] Zhou, L., Lin, C.Y., Munteanu, D.S., Hovy, E.: Paraeval: Using paraphrases to evaluate summaries automatically. In: Proceedings of the Human Language
Technology / North American Association of Computational Linguistics conference (HLT-NAACL 2006).
New York, NY. (2006) 447454
[26] Endres-Niggemeyer, B.: Summarizing Information.
Berlin: Springer (1998)
[27] Nenkova, A., Passonneau, R., McKeown, K.: The
pyramid method: Incorporating human content selection variation in summarization evaluation. ACM
Transactions on Speech and Language Processing
4(2) (2007) 4
[28] Hovy, E., Lin, C.Y., Zhou, L., Fukumoto, J.: Automated summarization evaluation with basic elements.
In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).
Genoa, Italy. (2006)
[29] Fuentes, M., González, E., Ferrés, D., Rodríguez,
H.: Qasum-talp at duc 2005 automatically evaluated
with a pyramid based metric. In: the Document Understanding Workshop (presented at the HLT/EMNLP
Annual Meeting), Vancouver, B.C., Canada. (2005)
[30] Radev, D.R., Tam, D.: Summarization evaluation using relative utility. In: CIKM 03: Proceedings of the
twelfth international conference on Information and
knowledge management. (2003) 508511
[31] Amigó, E., Gonzalo, J., Peñas, A., Verdejo, F.:
QARLA: a framework for the evaluation of text summarization systems. In: ACL 05: Proceedings of the
43rd Annual Meeting on Association for Computational Linguistics. (2005) 280289
[32] Saggion, H., Lapalme, G.: Selective analysis for automatic abstracting: Evaluating indicativeness and acceptability. In: Proceedings of Content-Based Multimedia Information Access (RIAO). (2000) 747764
[33] Barzilay, R., Lapata, M.: Modeling local coherence:
An entity-based approach. In: Proceedings of the
43rd Annual Meeting of the Association for Computational Linguistics (ACL05), Ann Arbor, Michigan, Association for Computational Linguistics (June
2005) 141148
[34] Pitler, E., Nenkova, A.: Revisiting readability: A
unified framework for predicting text quality. In:
Proceedings of the 2008 Conference on Empirical
Methods in Natural Language Processing, Honolulu,
Hawaii, Association for Computational Linguistics
(October 2008) 186195
The availability of domain-specific knowledge models in various forms has led to the development of several tools and applications specialized in complex domains such as bio-medicine, tourism and chemistry.
Yet, most of the current approaches to the extraction of domain-specific knowledge from text are limited
in their portability to other domains and languages. In this paper, we present and evaluate an approach to
the low-bias extraction of domain-specific concepts. Our approach is based on graph clustering and makes
no use of a-priori knowledge about the language or the domain to process. Therefore, it can be used on
virtually any language. The evaluation is carried out on two data sets of different cleanness and size.
Povzetek: A language-independent method extracts terms from a text and then domain-specific concepts.
1 Introduction
The recent availability of domain-specific knowledge models in various forms has led to the development of information systems specialized in complex domains such as bio-medicine, tourism and chemistry. Domain-specific information systems rely on domain knowledge in forms such
as terminologies, taxonomies and ontologies to represent,
analyze, structure and retrieve information. While this integrated knowledge boosts the accuracy of domain-specific
information systems, modeling domain-specific knowledge
manually remains a challenging task. Therefore, considerable effort is being invested in developing techniques for
the extraction of domain-specific knowledge from various
resources in a semi-automatic fashion. Domain-specific
text corpora are widely used for this purpose. Yet, most of
the current approaches to the extraction of domain-specific
knowledge in the form of terminologies or ontologies are
limited in their portability to other domains and languages.
The limitations result from the knowledge-rich paradigm
followed by these approaches, i.e., from them demanding
hand-crafted domain-specific and language-specific knowledge as input. Due to these constraints, domain-specific information systems exist currently for a limited number of
domains and languages for which domain-specific knowledge models are available. An approach to remedy the high
human costs linked with the modeling of domain-specific
knowledge is the use of low-bias, i.e., knowledge-poor and
unsupervised approaches. They require little human effort
but more computational power to achieve the same goals as
their hand-crafted counterparts.
In this work, we propose the use of low-bias approaches
for the extraction of domain-specific terminology and concepts from text. Especially, we study the low-bias extraction of concepts out of text using a combination of
2 Related work
Approaches to concept extraction can be categorized by
a variety of dimensions including units processed, data
sources and knowledge support [20]. The overview of
techniques for concept extraction presented in this section focuses on the knowledge support dimension. Accordingly, we differentiate between two main categories of
1 http://www.nlm.nih.gov/mesh
$$SRE(w) = \frac{n \, f(w) \, p(w) \, e^{-\frac{(d(w)-\mu)^2}{2\sigma^2}}}{\sum_{i=1}^{n} f(c_1 \ldots c_i c_{i+2} \ldots c_n)}, \qquad (1)$$
where
d(w) is the number of documents in which w occurs,
μ and σ² are the mean and the variance of the distribution of n-grams in documents respectively,
p(w) is the probability of occurrence of w in the whole corpus,
f(w) is the frequency of occurrence of w in the whole corpus and
c_1 ... c_i c_{i+2} ... c_n are patterns such that ham(w, c_1 ... c_i c_{i+2} ... c_n) = 1.
4 BorderFlow
BorderFlow is a general-purpose graph clustering algorithm. It uses solely local information for clustering and
achieves a soft clustering of the input graph. The definition of cluster underlying BorderFlow was proposed by [6].
They state that a cluster is a collection of nodes that have
more links between them than links to the outside. When
considering a graph as the description of a flow system,
Flake et al.'s definition of a cluster implies that a cluster
X can be understood as a set of nodes such that the flow
within X is maximal while the flow from X to the outside is minimal. The idea behind BorderFlow is to maximize the flow from the border of each cluster to its inner
nodes (i.e., the nodes within the cluster) while minimizing
the flow from the cluster to the nodes outside of the cluster.
In the following, we will specify BorderFlow for weighted
directed graphs, as they encompass all other forms of noncomplex graphs.
$$F(X) = \frac{\Omega(b(X), X)}{\Omega(b(X), V \setminus X)} = \frac{\Omega(b(X), X)}{\Omega(b(X), n(X))}. \qquad (6)$$
Based on the definition of a cluster by [6], we define a cluster X as a node-maximal subset of V that maximizes the ratio F(X), i.e.:
$$\forall X' \subseteq V,\ \forall v \notin X : X' = X + v \Rightarrow F(X') < F(X). \qquad (7)$$
The idea behind BorderFlow is to select elements from
the border n(X) of a cluster X iteratively and insert them
in X until the border flow ratio F (X) is maximized, i.e.,
until Equation (7) is satisfied. The selection of the nodes
to insert in each iteration is carried out in two steps. In
a first step, the set C(X) of candidates u V \X which
maximize F(X + u) is computed as follows:
$$C(X) := \arg\max_{u \in n(X)} F(X + u). \qquad (8)$$
Prospective cluster members are elements of n(X). To ensure that the inner flow within the cluster is maximized in
the future, a second selection step is necessary. During
this second selection step, BorderFlow picks the candidates
u ∈ C(X) which maximize the flow Ω(u, n(X)). The final set of candidates C_f(X) is then
$$C_f(X) := \arg\max_{u \in C(X)} \Omega(u, n(X)). \qquad (9)$$
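A rough sketch of the clustering loop just described: starting from a seed, nodes from the border are added as long as they improve F(X), with ties broken by the flow toward the current neighborhood (Equation (9)). The flow Ω is reduced here to a plain sum of edge weights on an undirected weighted graph, and the stopping test is a direct check of Equation (7); this is a simplification for illustration, not the published implementation.

```python
# Sketch of BorderFlow on a weighted undirected graph {node: {neighbor: weight}}.
def omega(graph, sources, targets):
    # Total edge weight from any node in `sources` to any node in `targets`.
    return sum(w for u in sources for v, w in graph[u].items() if v in targets)

def border(graph, X):      # b(X): nodes of X with at least one neighbor outside X
    return {u for u in X if any(v not in X for v in graph[u])}

def neighbors(graph, X):   # n(X): nodes outside X adjacent to X
    return {v for u in X for v in graph[u] if v not in X}

def F(graph, X):
    b, n = border(graph, X), neighbors(graph, X)
    out_flow = omega(graph, b, n)
    return omega(graph, b, X) / out_flow if out_flow else float("inf")

def borderflow_cluster(graph, seed):
    X = {seed}
    while True:
        n = neighbors(graph, X)
        if not n:
            return X
        current = F(graph, X)
        scored = {u: F(graph, X | {u}) for u in n}
        best_F = max(scored.values())
        if best_F < current:          # Equation (7): X is node-maximal
            return X
        # Step 1: candidates maximizing F(X + u); step 2: maximal flow to n(X).
        C = [u for u in n if scored[u] == best_F]
        u = max(C, key=lambda c: omega(graph, {c}, n))
        X.add(u)

# Toy example: a dense triangle {a, b, c} weakly connected to {d, e}.
g = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0, "c": 1.0},
     "c": {"a": 1.0, "b": 1.0, "d": 0.1}, "d": {"c": 0.1, "e": 1.0}, "e": {"d": 1.0}}
print(borderflow_cluster(g, "a"))  # -> {'a', 'b', 'c'}
```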
4.2 Heuristics
One drawback of the method proposed above is that it demands the simulation of the inclusion of each node in n(X)
in the cluster X before choosing the best ones. Such an
implementation can be time-consuming as nodes in terminology graphs can have a high number of neighbors. The
need is for a computationally less expensive criterion for
selecting a nearly optimal node to optimize F (X). Let us
assume that X is large enough. This assumption implies
that the flow from the cluster boundary to the rest of the
graph is altered insignificantly when adding a node to the
cluster. Under this condition, the following two approximations hold:
$$\Omega(b(X), n(X)) \approx \Omega(b(X + v), n(X + v)) \qquad (11)$$
Under this assumption, one can show that the nodes that maximize F(X) maximize the following:
$$f(X, v) = \frac{\Omega(b(X), v)}{\Omega(v, V \setminus X)} \quad \text{for symmetrical graphs.} \qquad (14)$$

$$s(X) = \frac{1}{|X|} \sum_{v \in X} \frac{a(v, X) - b(v, V \setminus X)}{\max\{a(v, X),\, b(v, V \setminus X)\}}, \qquad (15)$$
where
$$a(v, X) = \frac{\sum_{v' \in n(v) \cap X} \Omega(v, v')}{|n(v) \cap X|} \qquad (16)$$
and
$$b(v, V \setminus X) = \max_{v' \in V \setminus X} \Omega(v, v'). \qquad (17)$$
$$purity(X) = \max_{C} \frac{|X \cap C^*|}{|X \cap M|}, \qquad (18)$$
where M is the set of all MeSH category labels, C is a MeSH category and C* is the set of labels of C and all its sub-categories. For our evaluation, we considered only clusters that contained at least one term that could be found in MeSH.
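A small sketch of the purity computation in Equation (18), with the MeSH hierarchy reduced to a plain mapping from each category to the set of labels it covers (an assumption for illustration only).

```python
def cluster_purity(cluster, category_labels):
    """Equation (18): best overlap with one category (including its
    sub-categories), relative to the cluster members found in MeSH at all.
    category_labels: {category: set of labels of C and its sub-categories}."""
    all_labels = set().union(*category_labels.values())
    labelled = set(cluster) & all_labels
    if not labelled:
        return None  # cluster not considered: no term found in MeSH
    best_overlap = max(len(set(cluster) & c_star) for c_star in category_labels.values())
    return best_overlap / len(labelled)

# Invented example with two categories:
labels = {"Fungi": {"c_albicans", "candida_albicans"}, "Bacteria": {"b_fragilis"}}
print(cluster_purity({"b_fragilis", "c_albicans", "candida_albicans", "unknown_term"}, labels))
```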
The results of the qualitative evaluation are shown in Table 2. The best cluster purity, 89.23%, was obtained when
6 Discussion
From a quantitative point of view, the average silhouette values on TREC were higher, with lower standard deviations σ. The difference in silhouette can conceivably be explained by the higher amount of noise contained in the BMC corpus. On the TREC corpus, a larger size of the feature vectors led to a higher value of the average silhouette of the clusters. The same relation could be observed between the number f of function words omitted and the average silhouette. The standard deviation was inversely proportional to the size of the feature vectors and the number of function words. The number of erroneous clusters (i.e., clusters with average silhouette value less than 0) was inversely proportional to the size of the feature vectors. This can be explained by the higher amount of information available, which led to a better approximation of the semantic similarity of the terms and, thus, to fewer clustering mistakes. In the worst case (f=100, s=100), 99.85%
of the clusters had positive silhouettes. From a qualitative
point of view, BorderFlow computed clusters with a high
purity based on low-level features extracted on a terminology extracted using low-bias techniques. As expected, the
average cluster purity was higher for clusters computed using the TREC data set. The results of the qualitative evaluation support the basic assumption underlying this work,
i.e., it is indeed possible to extract high-quality concepts
from text automatically without a-priori knowledge.
Acknowledgement
This work was supported by a grant of the German Ministry
for Education and Research.
References
[1] S. Ananiadou and J. McNaught. Text Mining for Biology and Biomedicine. Norwood, MA, USA, 2005.
[2] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern
Information Retrieval. ACM Press / Addison-Wesley,
1999.
[3] C. Biemann. Ontology learning from text: A survey
of methods. LDV Forum, 20(2):7593, 2005.
Figure 2: Distribution of the average silhouette values obtained by using BorderFlow and kNN on the TREC and BMC
data set with f =100 and s=100
  f    s     Silhouette μ ± σ (TREC)        Silhouette μ ± σ (BMC)         Erroneous clusters (TREC)   Erroneous clusters (BMC)
             kNN          BF                kNN          BF                kNN      BF                  kNN      BF
 100  100    0.68±0.22    0.91±0.13         0.37±0.28    0.83±0.13          73      10                  214       1
 100  200    0.69±0.22    0.91±0.11         0.38±0.27    0.82±0.12          68       1                  184       1
 100  400    0.70±0.20    0.92±0.11         0.41±0.26    0.83±0.12          49       1                  142       1
 250  100    0.81±0.17    0.93±0.09         0.23±0.31    0.80±0.14          10       2                  553       0
 250  200    0.84±0.13    0.94±0.08         0.23±0.31    0.80±0.14           5       2                  575       0
 250  400    0.84±0.12    0.94±0.08         0.24±0.32    0.80±0.14           2       1                  583       0

Table 1: Comparison of the distribution of the silhouette index over clusters extracted from the TREC and BMC corpora. BF stands for BorderFlow; μ is the mean of silhouette values over the clusters and σ the standard deviation of the distribution of silhouette values. Erroneous clusters are clusters with negative silhouettes. Bold fonts mark the best results in each experimental setting.
[4] Kenneth W. Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In
Proceedings of the 27th. Annual Meeting of the As-
        Level   f=100,s=100   f=100,s=200   f=100,s=400   f=250,s=100   f=250,s=200   f=250,s=400
TREC      1        86.81         81.84         81.45         89.23         87.62         87.13
TREC      2        85.61         79.88         79.66         87.67         85.82         86.83
TREC      3        83.70         78.55         78.29         86.72         84.81         84.63
BMC       1        78.58         78.88         78.40         72.44         73.85         73.03
BMC       2        76.79         77.28         76.54         71.91         73.27         72.39
BMC       3        75.46         76.13         74.74         69.84         71.58         70.41

Table 2: Cluster purity obtained using BorderFlow on TREC and BMC data. The upper section of the table displays the results obtained using the TREC corpus. The lower section of the table displays the same results on the BMC corpus. All results are in %.
Cluster members                                                 Seeds                                     Hypernym
b_fragilis, c_albicans, candida_albicans, l_pneumophila         c_albicans                                Etiologic agents
acyclovir, methotrexate_mtx, mtx, methotrexate                  methotrexate                              Drugs
embryo, embryos, mouse_embryos, oocytes                         embryo, embryos, mouse_embryos, oocytes   Egg cells
leukocytes, macrophages, neutrophils, platelets, pmns           platelets                                 Blood cells
flap, flaps, free_flap, muscle_flap, musculocutaneous_flap      flap, free_flap                           Flaps
leukocyte, monocyte, neutrophil, polymorphonuclear_leukocyte    polymorphonuclear_leukocyte               White blood cells

Table 3: Examples of clusters extracted from the TREC corpus. The relation between the elements of the clusters is displayed in the rightmost column. Cluster members in italics are erroneous.
[5] J. Dopazo and J. M. Carazo. Phylogenic reconstruction using a growing neural network that adopts the
topology of a phylogenic tree. Molecular Evolution,
44:226233, 1997.
[6] G. Flake, S. Lawrence, and C. L. Giles. Efficient
identification of web communities. In Proceedings of
the 6th ACM SIGKDD, pages 150160, Boston, MA,
2000.
[7] G. Heyer, M. Läuter, U. Quasthoff, T. Wittig, and
C. Wolff. Learning relations using collocations.
In Workshop on Ontology Learning, volume 38 of
CEUR Workshop Proceedings. CEUR-WS.org, 2001.
[8] Latifur Khan and Feng Luo. Ontology construction
for information selection. In Proceedings of the ICTAI02, pages 122127, Washington DC, USA, 2002.
IEEE Computer Society.
[9] A.-C. Ngonga Ngomo. Knowledge-free discovery of
multi-word units. In Proceedings of the 23rd Annual ACM Symposium on Applied Computing, pages
15611565. ACM Press, 2008.
[10] A.-C. Ngonga Ngomo. SIGNUM: A graph algorithm
for terminology extraction. In Proceedings of CICLing 2008, pages 8595. Springer, 2008.
[11] B. Omelayenko. Learning of ontologies for the web:
the analysis of existent approaches. In Proceedings of
the International Workshop on Web Dynamics, pages
268275, 2001.
[12] Yonggang Qiu and Hans-Peter Frei. Concept-based
query expansion. In Proceedings of SIGIR93, pages
160169, Pittsburgh, US, 1993.
Motivated by the continuous growth of the Web in the number of sites and users, several search engines attempt to extend their traditional functionality by incorporating question answering (QA) facilities. This extension seems natural but it is not straightforward since current QA systems still achieve poor performance
rates for languages other than English. Based on the fact that retrieval effectiveness has been previously
improved by combining evidence from multiple search engines, in this paper we propose a method that allows taking advantage of the outputs of several QA systems. This method is based on an answer validation
approach that decides about the correctness of answers based on their entailment with a support text, and
therefore, that reduces the influence of the answer redundancies and the system confidences. Experimental
results on Spanish are encouraging; evaluated over a set of 190 questions from the CLEF 2006 collection,
our method correctly answered 63% of the questions, outperforming the best participating QA system (53%) by a relative increase of 19%. In addition, when five answers per question were considered, our method could obtain the correct answer for 73% of the questions. In this case, it outperformed traditional
multi-stream techniques by generating a better ranking of the set of answers presented to the users.
Povzetek: The method is based on combining the answers of several QA systems.
1 Introduction
In the last two decades the discipline of Automatic Text Processing has shown impressive progress. It has found itself at the center of the information revolution triggered by the emergence of the Internet. In particular, research in information retrieval (IR) has led to a new generation of tools and products for searching and navigating the Web. The major examples of these tools are search engines such as Google1 and Yahoo2. These tools allow users to specify their information needs by short queries (expressed as a set of keywords), and respond to them with a ranked list of documents.
At present, fostered by diverse evaluation forums (TREC3, CLEF4, and NTCIR5), there are important efforts to extend the functionality of existing search engines.
1 http://www.google.com
Some of these efforts are directed towards the development of question answering (QA) systems, which are a new kind of retrieval tool capable of answering concrete questions. Examples of pioneering Web-based QA systems are START6 and DFKI7.
Despite all these efforts, the presence of QA systems on the Web is still small compared with traditional search engines. One of the reasons for this situation is that QA technology, in contrast to traditional IR methods, is not equally mature for all languages. For instance, in TREC 2004, the best QA system for English achieved an accuracy of 77% for factoid questions8 (Voorhees, 2004), whereas, two years later in CLEF 2006, the best QA system for Spanish could only obtain an accuracy of 55% for the same kind of questions (Magnini et al., 2006). Taking into account that Spanish is the third language with the largest presence on the Web9, and that it is the second language used
2 http://www.yahoo.com
3 The
6 http://start.csail.mit.edu
7 http://experimental-quetal.dfki.de
8 Questions that ask for short, fact-based answers such as the name of a person or location, the date of an event, the value of something, etc.
9 Internet World Stats (November 2007), http://www.internetworldstats.com/stats7.htm
2 Related work
Typically, QA systems consist of a single processing stream that executes three components sequentially: question analysis, document/passage retrieval, and answer selection (see e.g., (Hovy et al., 2000)). In this single-stream approach a kind of information combination is often performed within the last component. The goal of the answer selection component is to evaluate multiple candidate answers in order to choose from them the most likely answer for the question. There are several approaches for
answer selection, ranging from those based on lexical overlaps and answer redundancies (see e.g., (Xu et al, 2002))
to those based on knowledge intensive methods (see e.g.,
(Moldovan et al, 2007)).
Recently, an alternative approach known as multi-stream QA has emerged. In this approach the idea is to combine different QA strategies in order to increase the number of correctly answered questions. Multi-stream QA systems are mainly of two types: internal and external.
Internal multi-stream systems use more than one stream (in this case, more than one strategy) at each particular component. For instance, Pizzato and Mollá-Aliod (2005) describe a QA architecture that uses several document retrieval methods, and Chu-Carroll et al. (2003) present a QA system that applies two different methods at each system component.
On the other hand, external multi-stream systems directly combine the output of different QA systems. They employ different strategies to take advantage of the information coming from several streams. In the following we describe the main strategies used in external multi-stream QA systems. It is important to mention that most of these strategies are adaptations of well-known information fusion techniques from IR. Based on this fact, we propose organizing them into five general categories taking into consideration some ideas proposed elsewhere (Diamond, 1996; Vogt and Cottrell, 1999).
Skimming Approach. The answers retrieved by different streams are interleaved according to their original ranks. In other words, this method takes one answer in turn from each individual QA system and alternates them in order to construct the final combined answer list. This approach has two main variants. In the first one, which we call the Naïve Skimming Approach, the streams are selected randomly, whereas in the second variant, which we call the Ordered Skimming Approach, streams are ordered by their general confidence. In other words, QA systems are ordered by their global answer accuracy estimated from a reference question set. Some examples of QA systems that use this approach are described in (Clarke et al., 2002) and (Jijkoun and de Rijke, 2004).
Chorus Approach. This approach relies on the answer
redundancies. Basically, it ranks the answers in accordance with their repetition across different streams. Some systems
based on this approach are described in (de Chalendar et al,
2002), (Burger et al, 2002), (Jijkoun and de Rijke, 2004),
(Roussinov et al, 2005), and (Rotaru and Litman, 2005).
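To make the contrast between these combination strategies concrete, the following sketch implements a naïve skimming merge and a chorus-style redundancy vote over hypothetical ranked answer lists; the stream outputs, the function names and the fixed stream order are illustrative assumptions, not part of the original systems.

```python
from collections import Counter
from itertools import zip_longest

def naive_skimming(streams):
    """Interleave the ranked answer lists, taking one answer per stream in turn."""
    merged, seen = [], set()
    for rank_slice in zip_longest(*streams):
        for answer in rank_slice:
            if answer is not None and answer not in seen:
                seen.add(answer)
                merged.append(answer)
    return merged

def chorus(streams):
    """Rank answers by how many streams proposed them (redundancy voting)."""
    votes = Counter(a for stream in streams for a in set(stream))
    return [answer for answer, _ in votes.most_common()]

streams = [["Madrid", "Barcelona"], ["Madrid", "Seville"], ["Seville", "Madrid"]]
print(naive_skimming(streams))  # ['Madrid', 'Seville', 'Barcelona']
print(chorus(streams))          # 'Madrid' ranked first (proposed by all three streams)
```

An ordered skimming variant would simply sort the streams by their estimated global accuracy before interleaving them.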
Dark Horse Approach. This approach can be considered
as an extension of the Ordered Skimming Approach. It also
considers the confidence of streams, however, in this case,
these confidences are computed separately for each different answer type. That is, using this approach, a QA system
will have different confidence values associated to factoid,
definition and list questions. A QA system based on this
strategy is described in (Jijkoun and de Rijke, 2004).
Web Chorus Approach. This approach uses the Web information to evaluate the relevance of answers. It basically
ranks the answers based on the number of Web pages containing the answer terms along with the question terms. It
was proposed by Magnini et al (2001), and subsequently it
was also evaluated in (Jijkoun and de Rijke, 2004).
Answer Validation Approach. In this approach the decision about the correctness of an answer is based on its entailment with a given support text. This way of selecting answers not only helps assure the correctness of answers but also their consistency with the snippets that will be shown to the users. This approach was suggested by Peñas et al. (2007), and has been implemented by Glöckner et al. (2007)10.
In addition, combinations of different approaches have also been used. For instance, Jijkoun and de Rijke (2004) describe a QA architecture that combines a chorus-based method with the dark horse approach. Its evaluation results indicate that this hybrid approach outperformed the results obtained by systems based on one single multi-stream strategy11.
In this paper we propose a new multi-stream QA method based on the answer validation approach. We decided to use this approach because it does not rely on any confidence estimates for the input streams, nor does it depend exclusively on answer redundancies. These characteristics make it very appropriate for working with poorly performing QA systems such as those currently available for most languages other than English.
Our method differs from existing answer-validation multi-stream methods (Glöckner, 2006; Tatu et al., 2006) in two respects. First, it is the only one specially suited for Spanish, and second, whereas the other two methods are based on a deep semantic analysis of texts, ours is based only on a lexical-syntactic analysis of documents. We consider this last difference very important for constructing Web applications, since it makes our method more easily portable across languages.
In particular, the proposed answer validation method is based on a supervised learning approach that considers a combination of two kinds of attributes: on the one hand, attributes that indicate the compatibility between the question and the answer, and on the other hand, attributes that allow evaluating the textual entailment between the question-answer pair and the given support text. The first kind of attributes has been previously used in traditional single-stream QA systems (e.g., (Vicedo, 2001)), whereas the second group of attributes is commonly used by answer validation (AV) and textual entailment recognition (RTE) systems (e.g., (Kozareva et al., 2006; Jijkoun and de Rijke, 2005)). In this case, our method not only considers attributes that indicate the overlap between the question-answer pair and the support text, but also includes some attributes that evaluate the non-overlapped information. In some sense, these new attributes allow analyzing
10 (Harabagiu and Hickl, 2006) also describes an answer validation approach for multi-stream QA; nevertheless, it is an internal approach.
11 They compared their hybrid method against all other approaches except the answer validation.
the situations where there is a high overlap but not necessarily an entailment relation between these two elements.
The following section describes in detail the proposed
method.
Question | Candidate answer | Support text
4.1 Preprocessing
The objective of this process is to extract the main content elements from the question, answer and support text, which will subsequently be used for deciding on the correctness of the answer. This process considers two basic tasks: on the one hand, the identification of the main constituents of the question-answer pair, and on the other hand, the detection of the core fragment of the support text as well as the consequent elimination of the unnecessary information.
Commonly, the support text is a short paragraph of at most 700 bytes according to CLEF evaluations which provides the context necessary to support the correctness of a given answer. However, in many cases, it contains more information than required, damaging the performance of RTE methods based on lexical-syntactic overlaps. For instance, the example of Table 1 shows that only the last sentence (a smaller text fragment) is useful for validating the given answer, whereas the rest of the text only contributes to producing an irrelevant overlap.
In order to reduce the support text to the minimum useful text fragment we proceed as follows:
First, we apply shallow parsing to the support text, obtaining the syntactic tree (Sparsed).
Second, we match the content terms (nouns, verbs, adjectives and adverbs) from the question constituents against the terms from Sparsed. In order to capture the morphological inflections of words we compare them using the Levenshtein edit distance. Mainly, we consider that two different words are equal if their distance value is greater than 0.6.
Third, based on the number of matched terms, we align the question constituents with the syntagms from the support text.
Fourth, we match the answer constituent against the syntactic tree (Sparsed). The idea is to find all occurrences of the answer in the given support text.
Fifth, we determine the minimum context of the answer in the support text that contains all matched syntagms. This minimum context (represented by a sequence of words around the answer) is what we call the core fragment. In the case that the support text includes several occurrences of the answer, we select the one with the smallest context.
Applying the procedure described above we determine that the core fragment of the support text shown in Table 1 is in an Iraqi invasion of Kuwait.
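A minimal sketch of the matching step is given below, assuming a normalized string similarity (difflib's ratio is used here in place of an explicit normalized Levenshtein computation) and the 0.6 threshold mentioned above; the function names and data layout are illustrative, not the authors' implementation.

```python
import difflib

def similar(w1, w2, threshold=0.6):
    """Treat two words as equal if their normalized similarity exceeds the threshold."""
    return difflib.SequenceMatcher(None, w1.lower(), w2.lower()).ratio() > threshold

def matched_positions(question_terms, support_tokens):
    """Positions in the support text whose tokens match some question content term."""
    return [i for i, tok in enumerate(support_tokens)
            if any(similar(tok, q) for q in question_terms)]

def core_fragment(support_tokens, answer_positions, question_terms):
    """Pick the answer occurrence whose window covering the matched terms is smallest."""
    matches = matched_positions(question_terms, support_tokens)
    def span(pos):
        return max(matches + [pos]) - min(matches + [pos])
    best = min(answer_positions, key=span)
    lo, hi = min(matches + [best]), max(matches + [best])
    return support_tokens[lo:hi + 1]
```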
The attributes of this category are of two main types: (i) attributes that measure the overlap between the support text and the hypothesis (an affirmative sentence formed by combining the question and the answer); and (ii) attributes that denote the differences between these two components.
It is important to explain that, unlike other RTE methods, we do not use the complete support text; instead we only use its core fragment T. On the other hand, we do not need to construct a hypothesis text either; instead we use as the hypothesis the set of question-answer constituents (the union of Q and A, which we call H).
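Assuming T and H have already been reduced to sets of content terms grouped by category, the overlap and non-overlap attributes could be collected roughly as follows; the category inventory and attribute names are placeholders, not the exact feature set of the paper.

```python
def overlap_features(T_terms, H_terms, categories):
    """categories maps a category name (noun, verb, person, ...) to its vocabulary set."""
    feats = {}
    for cat, vocab in categories.items():
        t = {w for w in T_terms if w in vocab}
        h = {w for w in H_terms if w in vocab}
        feats[f"overlap_{cat}"] = len(t & h)
        feats[f"nonoverlap_{cat}"] = len(h - t)  # material in H not supported by T
    return feats
```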
Overlap characteristics
These attributes express the degree of overlap in number
of words between T and H. In particular, we compute
the overlap for each type of content term (nouns, verbs,
adjectives and adverbs) as well as for each type of named
entity (names of persons, places, organizations, and other
5 Experimental results
5.1 Experimental setup
The test set used gathers the outputs of all QA systems participating in the QA track of CLEF 2006 (Magnini et al., 2006), and it was employed in the first Spanish Answer Validation Exercise (Peñas et al., 2006).
5.1.2 Evaluation measure
Accuracy = (|right_answers| + |right_nils|) / |questions|    (1)
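For instance, under this measure a stream that returns 88 right answers and 12 right nil answers over the 190 test questions obtains an accuracy of (88 + 12)/190, that is, approximately 0.53.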
5.2 Experiments
5.2.1
Stream | Right answers | Right nils | Wrong answers | Accuracy
1 | 25 | 16 | 77 | 0.22
2 | 48 | 17 | 46 | 0.34
3 | 49 | 7 | 113 | 0.30
4 | 34 | 10 | 92 | 0.23
5 | 10 | 1 | 179 | 0.06
6 | 24 | 5 | 142 | 0.15
7 | 16 | 3 | 138 | 0.10
8 | 88 | 12 | 69 | 0.53
9 | 31 | 7 | 125 | 0.20
10 | 26 | 10 | 125 | 0.19
11 | 15 | 11 | 78 | 0.14
12 | 85 | 12 | 63 | 0.51
13 | 33 | 10 | 88 | 0.23
14 | 21 | 18 | 34 | 0.21
15 | 57 | 13 | 89 | 0.37
16 | 45 | 12 | 102 | 0.30
17 | 64 | 16 | 55 | 0.42
Dark Horse Method, also obtained relevant results. However, it is necessary to mention that in our implementations these methods made use of a perfect estimation of these confidences15. For that reason, and given that in a real scenario it is practically impossible to obtain such perfect estimations, we consider that our proposal is more robust than these two methods.
The results from Table 3 also give evidence that the presence of several deficient streams (which generate a lot of incorrect answers) seriously affects the performance of the Naïve Skimming Method. This phenomenon also had an important effect on the Chorus Method, which is normally reported as one of the best multi-stream approaches.
Finally, it is important to comment that we attribute the poor results achieved by the Web Chorus Method to the quantity of online information for Spanish (which is considerably less than that for English). In order to obtain better results it would be necessary to apply some question/answer expansions, using for instance synonyms and hyperonyms.
5.2.2
Table 3: Results from the first experiment: general evaluation of the proposed method
could answer some nil questions. In particular, our method correctly responded to 63% of the questions, outperforming the best input stream by a relative increase of 19%.
This experiment also helped to reveal another important characteristic of our method. It could correctly reject several answers without using any information about the confidence of the streams and without considering any restriction on the answer frequencies.
5.2.3 Third experiment: combination of our method with other approaches
In (Jijkoun and de Rijke, 2004), Jijkoun and de Rijke describe a multi-stream QA architecture that combines the Chorus and the Dark Horse Methods. Its evaluation results indicate that this combination outperformed the results obtained by other systems based on one single multi-stream strategy16.
Motivated by this result, we designed a third experiment which considered the combination of our method with other confidence-based methods, in particular the Dark Horse Method and the Ordered Skimming Method. The combination of our method with these two other approaches was performed as follows. In a first stage, our method selected a set of candidate answers; then, in a second stage, a confidence-based method ordered the candidate answers in accordance with its own ranking criteria17.
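A rough sketch of this two-stage combination, with the answer validator and the per-stream confidences left as placeholders, could look as follows.

```python
def combine(candidates, validate, stream_confidence):
    """candidates: list of (answer, stream_id); validate: answer -> bool."""
    accepted = [(a, s) for a, s in candidates if validate(a)]          # stage 1: answer validation
    ranked = sorted(accepted,
                    key=lambda pair: stream_confidence[pair[1]],       # stage 2: confidence-based ordering
                    reverse=True)
    return [a for a, _ in ranked]
```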
Table 5 shows the results from the combination of these
methods. On the one hand, these results confirm the conclusions of Jijkoun and de Rijke since they also indicate
that the combination of methods outperformed the results
obtained by individual approaches. On the other hand, and
most important, these results demonstrate the competence
of our method since they show that its individual result outperformed that from the combination of the Chorus Method
with the Dark Horse Method (stated by Jijkoun and de Rijke as the best configuration for a multi-stream QA system).
16 In their experiments, as mentioned in Section 2, they did not consider
the answer validation approach.
17 Given that we use the same implementations for the confidence-based
methods that those described in the first experiment, in this case we also
used a perfect estimation of the streams confidences.
Table 4: Results from the second experiment: the impact of rejecting less reliable answers
Table 5: Results from the third experiment: combination of our method with other approaches
redundancy of answers across streams, it turned out to be less sensitive to their low frequency than other approaches. For instance, it outperformed the Chorus Method by 5%.
The proposed method allowed us to significantly reduce the number of wrong answers presented to the user. In relation to this aspect, our method was especially adequate for dealing with nil questions. It correctly responded to 65% of the nil questions, outperforming the best input stream by a relative increase of 8%.
The combination of our method with the Dark Horse approach only produced a slight improvement of 1%. This fact indicates that our method does not require knowing the input stream confidences.
Finally, it is clear that any improvement in the answer
validation module will directly impact the performance of
the proposed multi-stream method. Hence, our future work
will be mainly focused on enhancing this module by: (i)
considering some new features in the entailment recognition process, (ii) including a process for treatment of temporal restrictions, and (iii) using Wordnet in order to consider synonyms and hyperonyms for computing the term
and structure overlaps.
Acknowledgement
This work was done under partial support of Conacyt
(scholarship 171610). We also thank CLEF organizers for
the provided resources.
References
Burger JD, Ferro L, Greiff WR, Henderson JC, Mardis S, Morgan A, Light M (2002) MITRE's Qanda at TREC-11. In: TREC
Chu-Carroll J, Czuba K, Prager JM, Ittycheriah A (2003) In question answering, two heads are better than one. In: HLT-NAACL
Clarke CLA, Cormack GV, Kemkes G, Laszlo M, Lynam TR, Terra EL, Tilker PL (2002) Statistical selection of exact answers (MultiText experiments for TREC 2002). In: TREC
Dagan I, Glickman O, Magnini B (2005) The PASCAL recognising textual entailment challenge. In: Quiñonero J, Dagan I, Magnini B, d'Alché-Buc F (eds) MLCW, Springer, Lecture Notes in Computer Science, vol 3944, pp 177-190
Dalmas T, Webber BL (2007) Answer comparison in automated question answering. J Applied Logic 5(1):104-120
de Chalendar G, Dalmas T, Elkateb-Gara F, Ferret O, Grau B, Hurault-Plantet M, Illouz G, Monceaux L, Robba I, Vilnat A (2002) The question answering system QALC at LIMSI, experiments in using Web and WordNet. In: TREC
de Sousa P (2007) El español es el segundo idioma más usado en las búsquedas a través de Google (English translation: Spanish is the second language most used in Google searches). CDT internet.net, URL http://www.cdtinternet.net/
Diamond T (1996) Information retrieval using dynamic evidence combination. PhD Dissertation Proposal, School of Information Studies, Syracuse University
Glöckner I (2006) Answer validation through robust logical inference. In: Peters et al (2007), pp 518-521
Glöckner I, Hartrumpf S, Leveling J (2007) Logical validation, answer merging and witness selection - a case study in multi-stream question answering. In: RIAO 2007, Large-Scale Semantic Access to Content, Pittsburgh, USA
Harabagiu SM, Hickl A (2006) Methods for using textual entailment in open-domain question answering. In: ACL, The Association for Computational Linguistics
Hovy EH, Gerber L, Hermjakob U, Junk M, Lin CY (2000) Question answering in Webclopedia. In: TREC
Jijkoun V, de Rijke M (2004) Answer selection in a multi-stream open domain question answering system. In: McDonald S, Tait J (eds) ECIR, Springer, Lecture Notes in Computer Science, vol 2997, pp 99-111
Jijkoun V, de Rijke M (2005) Recognizing textual entailment: Is word similarity enough? In: Candela JQ, Dagan I, Magnini B, d'Alché-Buc F (eds) MLCW, Springer, Lecture Notes in Computer Science, vol 3944, pp 449-460
Kozareva Z, Vázquez S, Montoyo A (2006) University of Alicante at QA@CLEF2006: Answer Validation Exercise. In: Peters et al (2007), pp 522-525
Magnini B, Negri M, Prevete R, Tanev H (2001) Is it the right answer?: exploiting web redundancy for answer validation. In: ACL, The Association for Computational Linguistics, Morristown, NJ, USA, pp 425-432
Magnini B, Giampiccolo D, Forner P, Ayache C, Jijkoun V, Osenova P, Peñas A, Rocha P, Sacaleanu B, Sutcliffe RFE (2006) Overview of the CLEF 2006 multilingual question answering track. In: Peters et al (2007), pp 223-256
Moldovan DI, Clark C, Harabagiu SM, Hodges D (2007) COGEX: A semantically and contextually enriched logic prover for question answering. J Applied Logic 5(1):49-69
Peñas A, Rodrigo Á, Sama V, Verdejo F (2006) Overview of the Answer Validation Exercise 2006. In: Peters et al (2007), pp 257-264
Peñas A, Rodrigo Á, Sama V, Verdejo F (2007) Testing the reasoning for question answering validation. Journal of Logic and Computation (3), DOI 10.1093/logcom/exm072
Peters C, Clough P, Gey FC, Karlgren J, Magnini B, Oard DW, de Rijke M, Stempfhuber M (eds) (2007) Evaluation of Multilingual and Multi-modal Information Retrieval, 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Alicante, Spain, September
This paper reports how appropriate unlabeled data, post-processing and voting can be effective in improving the performance of a Named Entity Recognition (NER) system. The proposed method is based on a combination of the following classifiers: Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM). The training set consists of approximately 272K wordforms. The proposed method is tested with Bengali. A semi-supervised learning technique has been developed that uses the unlabeled data during training of the system. We have shown that simply relying upon the use of large corpora during training is not in itself sufficient for performance improvement. We describe measures to automatically select effective documents and sentences from the unlabeled data. In addition, we have used a number of techniques to post-process the output of each of the models in order to improve the performance. Finally, we have applied a weighted voting approach to combine the models. Experimental results show the effectiveness of the proposed approach, with overall average recall, precision, and f-score values of 93.79%, 91.34%, and 92.55%, respectively, which represents an improvement of 19.4% in f-score over the least performing baseline ME based system and an improvement of 15.19% in f-score over the best performing baseline SVM based system.
Povzetek: Razvita je metoda za prepoznavanje imen, ki temelji na uteženem glasovanju več klasifikatorjev.
1 Introduction
Named Entity Recognition (NER) is an important tool in almost all Natural Language Processing (NLP) application areas such as Information Extraction [1], Machine Translation [2], Question Answering [3] etc. The objective of NER is to identify and classify every word/term in a document into some predefined categories like person name, location name, organization name, miscellaneous name (date, time, percentage and monetary expressions etc.) and "none-of-the-above". The challenge in the detection of named entities (NEs) is that such expressions are hard to analyze using rule-based NLP because they belong to the open class of expressions, i.e., there is an infinite variety and new expressions are constantly being invented.
In recent years, automatic NER systems have become a popular research area, and a considerable number of studies have addressed the development of such systems. These can be classified into three main classes [4], namely rule-based NER, machine learning-based NER and hybrid NER.
Rule-based approaches focus on extracting names using a number of hand-crafted rules. Generally, these systems consist of a set of patterns using grammatical (e.g., part of speech), syntactic (e.g., word precedence) and orthographic features (e.g., capitalization) in combination with dictionaries [5]. A NER system called FASTUS, based on carefully handcrafted regular expressions, has been proposed in [6][7]. They divided the task into three steps: recognizing phrases, recognizing patterns and merging incidents, while [8] uses extensive specialized resources such as gazetteers, white and yellow pages. The NYU system [9] was introduced that uses handcrafted rules. A rule-based Greek NER system [10] has been developed in the context of the R&D project MITOS1. The NER system consists of three processing stages: linguistic pre-processing, NE identification and NE classification. The linguistic pre-processing stage involves some basic tasks: tokenisation, sentence splitting, part of speech (POS) tagging and stemming. Once the text has been annotated with POS tags, a stemmer is used. The aim of the stemmer is to reduce the size of the lexicon as well as the size and complexity of the NER grammar. The NE identification phase involves the detection of their boundaries, i.e., the start and end of all the possible spans of tokens that are likely to belong to a NE. Classification involves three sub-stages: application of classification rules, gazetteer-based classification, and par-
1 http://www.iit.demokritos.gr/skel/mitos
systems were developed with the help of a number of features and gazetteers. The method of improving the performance of NER system using appropriate unlabeled data,
post-processing and voting has been reported in [45].
Other than Bengali, work on Hindi can be found in [46], which uses a CRF model with a feature induction technique to automatically construct the features that maximally increase the conditional likelihood. A language independent method for Hindi NER has been reported in [47]. Sujan et al. [48] reported a ME based system with a hybrid feature set that includes statistical as well as linguistic features. A MEMM-based system has been reported in
[49]. As part of the IJCNLP-08 NER shared task, various works of NER in Indian languages using various approaches can be found in IJCNLP-08 NER Shared Task on
South and South East Asian Languages (NERSSEAL)2 . As
part of this shared task, [50] reported a CRF-based system
followed by post-processing which involves using some
heuristics or rules. A CRF-based system has been reported
in [51], where it has been shown that the hybrid HMM
model can perform better than CRF.
Srikanth and Murthy [52] developed a NER system for
Telugu and tested it on several data sets from the Eenaadu
and Andhra Prabha newspaper corpora. They obtained overall f-measures between 80% and 97% with person, location
and organization tags. For Tamil, a CRF-based NER system has been presented in [53] for the tourism domain.
This approach can take care of morphological inflections
of NEs and can handle nested tagging with a hierarchical
tagset containing 106 tags. Shishtla et al. [54] developed
a CRF-based system for English, Telugu and Hindi. They
suggested that character n-gram based approach is more effective than the word based models. They described the
features used and the experiments to increase the recall of
NER system.
In this paper, we have reported a NER system for Bengali
by combining the outputs of the classifiers, namely ME,
CRF and SVM. In terms of native speakers, Bengali is the
seventh most spoken language in the world, second in India
and the national language of Bangladesh. We have manually annotated a portion of the Bengali news corpus, developed from the web-archive of a leading Bengali newspaper
with Person name, Location name, Organization name and
Miscellaneous name tags. We have also used the IJCNLP08 NER Shared Task data that was originally annotated
with a fine-grained NE tagset of twelve tags. This data
has been converted into the forms to be tagged with NEP
(Person name), NEL (Location name), NEO (Organization
name), NEN (Number expressions), NETI (Time expressions) and NEM (Measurement expressions). The NEN,
NETI and NEM tags are mapped to point to the miscellaneous entities. The system makes use of the different contextual information of the words along with the variety of
orthographic word level features that are helpful in predicting the various NE classes. We have considered both language independent as well as language dependent features.
2 http://ltrc.iiit.ac.in/ner-ssea-08
popular languages and predominantly spoken in the eastern part of India. In terms of native speakers, Bengali is the
seventh most spoken language in the World, second in India
and the national language in Bangladesh. In the literature,
there has been no initiative of corpus development from the
web in Indian languages and specifically in Bengali.
Newspapers are a huge source of readily available documents, and the Web is a great source of language data. In Bengali, there are some newspapers (like Anandabazar Patrika, Bartaman, Dainik, Ittefaq etc.), published from Kolkata and Bangladesh, which have their internet editions on the web, and some of them also make their archives available. A collection of documents from the archive of a newspaper, stored on the web, may be used as a corpus, which in turn can be used in many NLP applications.
We have followed the method of developing the Bengali news corpus in terms of language resource acquisition using a web crawler, language resource creation that
includes HTML file cleaning, code conversion and language resource annotation that involves defining a tagset
and subsequent tagging of the news corpus. A web crawler
has been designed that retrieves the web pages in Hyper
Text Markup Language (HTML) format from the news
archive. Various types of news (International, National,
State, Sports, Business etc.) are collected in the corpus
and so a variety of linguistics features of Bengali are covered. The Bengali news corpus is available in UTF-8 and
contains approximately 34 million wordforms.
A news corpus, whether in Bengali or in any other language has different parts like title, date, reporter, location,
body etc. To identify these parts in a news corpus the tagset
described in Table 1 has been defined. Details of this corpus development work can be found in [55].
The date, location, reporter and agency tags present in
the web pages of the Bengali news corpus have been automatically named entity (NE) tagged. These tags can identify the NEs that appear in some fixed places of the newspaper. In order to achieve reasonable performance for NER,
supervised machine learning approaches are more appropriate and this requires a completely tagged corpus. This
requires the selection of an appropriate NE tagset.
With respect to the tagset, the main feature that concerns
us is its granularity, which is directly related to the size of
the tagset. If the tagset is too coarse, the tagging accuracy
will be much higher, since only the important distinctions
are considered, and the classification may be easier both
by human manual annotators as well as the machine. But,
some important information may be missed out due to the
coarse grained tagset. On the other hand, a too fine-grained
tagset may enrich the supplied information but the performance of the automatic named entity tagger may decrease.
A much richer model is required to be designed to capture
the encoded information when using a fine grained tagset
and hence, it is more difficult to learn.
When we are about to design a tagset for the NE disambiguation task, the issues that need consideration include the type of application (some applications may require
23,181 / 200K / 19,749 / 2 (approx.)
more complex information whereas only category information may be sufficient for some tasks), tagging techniques
to be used (statistical, rule based which can adopt large
tagsets very well, supervised/unsupervised learning). Further, a large amount of annotated corpus is usually required
for statistical named entity taggers. A too fine grained
tagset might be difficult to use by human annotators during the development of a large annotated corpus. Hence,
the availability of resources needs to be considered during
the design of a tagset.
During the design of the tagset for Bengali, our main aim was to build a small but clean and completely tagged corpus for Bengali. The resources can be used for conventional usages like Information Retrieval, Information Extraction, Event Tracking Systems, Web People Search etc. We have used the CoNLL 2003 shared task tagset as a reference point for our tagset design.
We have used a NE tagset that consists of the following
four tags:
1. Person name: Denotes the names of people. For
example, sachin[Sachin] /Person name, manmohan
singh[Manmohan Singh]/Person name.
2. Location name: Denotes the names of places. For
example, jadavpur[Jadavpur]/Location name, new
delhi[New Delhi]/Location name.
3. Organization name: Denotes the names of organizations. For example, infosys[Infosys]/Organization
name, jadavpur vishwavidyalaya[Jadavpur University]/Organization name.
4. Miscellaneous name: Denotes the miscellaneous NEs
that include date, time, number, monetary expressions, measurement expressions and percentages. For
example, 15th august 1947[15th August 1947]/Miscellaneous name, 11 am[11 am]/Miscellaneous
name, 110/Miscellaneous name, 1000 taka[1000 rupees]/Miscellaneous name, 100%[100%]/ Miscellaneous name and 100 gram[100 gram]/ Miscellaneous
name.
We have manually annotated approximately 200K wordforms of the Bengali news corpus. The annotation has been carried out by one expert and edited by another expert. The corpus is in the Shakti Standard Format (SSF) [56]. Some statistics of this corpus are shown in Table 2.
We have also used the NE tagged corpus of the IJCNLP Shared Task on Named Entity Recognition for South
Definitions: Header of the news document; Headline of the news document; 1st headline of the title; 2nd headline of the title; Date of the news document; Bengali date; Day; English date
7035 / 122K / 5921 / 2 (approx.)
Tag | Definition
reporter | Reporter-name
agency | Agency providing news
location | The news location
body | Body of the news document
p | Paragraph
table | Information in tabular form
tc | Table Column
tr | Table row
person name, location name, organization name, number expression, time expression and measurement expressions. The number, time and measurement expressions are
mapped to belong to the Miscellaneous name tag. Other
tags of the shared task have been mapped to the other-thanNE category. Hence, the final tagset is shown in Table 5.
In order to properly denote the boundaries of the NEs,
the four NE tags are further subdivided as shown in Table
6. In the output, these sixteen NE tags are directly mapped
to the four major NE tags, namely Person name, Location
name, Organization name and Miscellaneous name.
3.1 Approaches
NLP research around the world has taken giant leaps in the
last decade with the advent of effective machine learning
algorithms and the creation of large annotated corpora for
various languages. However, annotated corpora and other
lexical resources have started appearing only very recently
in India. In this paper, we have reported a NER system by
combining the outputs of the classifiers, namely ME, CRF
Table 4: Named entity tagset for Indian languages (IJCNLP-08 NER Shared Task Tagset)
NE Tag | Meaning | Example
NEP | Person name | sachin/NEP, sachin ramesh tendulkar/NEP
NEL | Location name | kolkata/NEL, mahatma gandhi road/NEL
NEO | Organization name | jadavpur bishbidyalya/NEO, bhaba eytomik risarch sentar/NEO
NED | Designation | chairrman/NED, sangsad/NED
NEA | Abbreviation | b a/NEA, c m d a/NEA, b j p/NEA, i.b.m/NEA
NEB | Brand | fanta/NEB
NETP | Title-person | shriman/NED, shri/NED, shrimati/NED
NETO | Title-object | american beauty/NETO
NEN | Number | 10/NEN, dash/NEN
NEM | Measure | tin din/NEM, panch keji/NEM
NETE | Terms | hiden markov model/NETE, chemical reaction/NETE
NETI | Time | 10 i magh 1402/NETI, 10 am/NETI
Tagset used | Meaning
Person name | Single word/multiword person name
NEL, Location name | Single word/multiword location name
NEO, Organization name | Single word/multiword organization name
Miscellaneous name | Single word/multiword miscellaneous name
NNE |
NE Tag | Meaning | Example
PER | Single word person name | sachin/PER, rabindranath/PER
LOC | Single word location name | kolkata/LOC, mumbai/LOC
ORG | Single word organization name | infosys/ORG
MISC | Single word miscellaneous name | 10/MISC, dash/MISC
B-PER, I-PER, E-PER | Beginning, Internal or the End of a multiword person name | sachin/B-PER ramesh/I-PER tendulkar/E-PER, rabindranath/B-PER thakur/E-PER
B-LOC, I-LOC, E-LOC | Beginning, Internal or the End of a multiword location name | mahatma/B-LOC gandhi/I-LOC road/E-LOC, new/B-LOC york/E-LOC
B-ORG, I-ORG, E-ORG | Beginning, Internal or the End of a multiword organization name | jadavpur/B-ORG bishvidyalya/E-ORG, bhaba/B-ORG eytomik/I-ORG risarch/I-ORG sentar/E-ORG
B-MISC, I-MISC, E-MISC | Beginning, Internal or the End of a multiword miscellaneous name | 10 i/B-MISC magh/I-MISC 1402/E-MISC, 10/B-MISC am/E-MISC
NNE | Other than NEs | kara/NNE, jal/NNE
and SVM frameworks in order to identify NEs from a Bengali text and to classify them into Person name, Location name, Organization name and Miscellaneous name. We have developed two different systems with the SVM model, one using forward parsing (SVM-F) that parses from left to right and the other using backward parsing (SVM-B) that parses from right to left. The SVM system has been developed based on [57], which performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. We have used the YamCha toolkit (http://chasen.org/~taku/software/yamcha), an SVM based tool for detecting classes in documents and formulating the NER task as a sequence labeling problem. Here, the pairwise multi-class decision method and the polynomial kernel function have been used. We have used the TinySVM-0.04 classifier, which seems to be the best optimized among publicly available SVM toolkits. We have used the Maximum Entropy package (http://homepages.inf.ed.ac.uk/s0450736/software/maxent/maxent-20061005.tar.bz2). We have used the C++ based CRF++ package (http://crfpp.sourceforge.net) for NER.
During testing, it is possible that the classifier produces a sequence of inadmissible classes (e.g., B-PER followed by LOC). To eliminate such sequences, we define a transition probability between word classes P(ci|cj) to be equal to 1 if the sequence is admissible, and 0 otherwise. The prob-
4 http://cl.aist-nara.ac.jp/~taku-ku/software/
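A small illustration of this filtering idea is sketched below; the admissibility rule shown is a simplified assumption covering only the B-XXX/I-XXX continuation case, not the full set of constraints used by the system.

```python
def admissible(prev, curr):
    # a B-XXX or I-XXX tag opens/continues a multiword NE, so the next tag
    # must be I-XXX or E-XXX of the same type (simplified assumption)
    if prev.startswith(("B-", "I-")):
        return curr.startswith(("I-", "E-")) and curr[2:] == prev[2:]
    return True

def sequence_probability(tags, emission_scores):
    """Multiply per-token scores by P(c_i | c_j) in {0, 1}."""
    prob = 1.0
    for i, tag in enumerate(tags):
        prob *= emission_scores[i][tag]
        if i > 0 and not admissible(tags[i - 1], tag):
            return 0.0   # inadmissible sequences such as B-PER followed by LOC are discarded
    return prob
```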
in the NER tasks. We have considered different combinations from the following set in order to identify the best set of features for NER in Bengali:
F = {w_i-m, ..., w_i-1, w_i, w_i+1, ..., w_i+n, |prefix|
recognizing miscellaneous NEs, such as time expressions, measurement expressions and numbers.
Infrequent word: The frequencies of the words in the training corpus have been calculated. A cut-off frequency has been chosen in order to consider the words that occur more often than the cut-off frequency in the training corpus. The cut-off frequency is set to 10. A binary valued feature Infrequent is defined to check whether the current token appears in this list or not.
Length of a word: This binary valued feature is used to check whether the length of the current word is less than three or not. This is based on the observation that very short words are rarely NEs.
The above set of language independent features, along with their descriptions, is shown in Table 7. The baseline models have been developed with the language independent features.
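As an illustration, a few of these language independent features could be encoded as binary values along the following lines; the function name and the exact membership test for the word list are assumptions of this sketch.

```python
def word_features(w, word_list):
    """Binary orthographic features for a single token (illustrative subset of Table 7)."""
    has_digit = any(ch.isdigit() for ch in w)
    return {
        "CntDgt":      int(has_digit),
        "FourDgt":     int(len(w) == 4 and w.isdigit()),
        "TwoDgt":      int(len(w) == 2 and w.isdigit()),
        "CntDgtCma":   int(has_digit and "," in w),
        "CntDgtPrctg": int(has_digit and "%" in w),
        "Infrequent":  int(w in word_list),   # membership in the precompiled word list (see text)
        "Length":      int(len(w) < 3),       # very short words are rarely NEs
    }
```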
Table 7: Descriptions of the language independent features. Here, i represents the position of the current word and w_i represents the current word.
Feature | Description
ContexT | ContexT_i = w_i-m, ..., w_i-1, w_i, w_i+1, ..., w_i+n, where w_i-m and w_i+n are the previous m-th and the next n-th word
Suf | Suf_i(n) = suffix string of length n of w_i if |w_i| >= n; ND (= 0) if |w_i| <= (n-1) or w_i is a punctuation symbol
Pre | Pre_i(n) = prefix string of length n of w_i if |w_i| >= n; ND (= 0) if |w_i| <= (n-1) or w_i is a punctuation symbol
NE |
FirstWord |
CntDgt | CntDgt_i = 1 if w_i contains a digit, 0 otherwise
FourDgt | FourDgt_i = 1 if w_i consists of four digits, 0 otherwise
TwoDgt | TwoDgt_i = 1 if w_i consists of two digits, 0 otherwise
CntDgtCma | CntDgtCma_i = 1 if w_i contains a digit and a comma, 0 otherwise
CntDgtPrd | CntDgtPrd_i = 1 if w_i contains a digit and a period, 0 otherwise
CntDgtSlsh | CntDgtSlsh_i = 1 if w_i contains a digit and a slash, 0 otherwise
CntDgtHph | CntDgtHph_i = 1 if w_i contains a digit and a hyphen, 0 otherwise
CntDgtPrctg | CntDgtPrctg_i = 1 if w_i contains a digit and a percentage sign, 0 otherwise
Infrequent | Infrequent_i = I_{Infrequent word list}(w_i)
Length | Length_i = 1 if |w_i| < 3, 0 otherwise
domain (Politics, Sports, Business). Statistics of this corpus is shown in Table 10.
Gazetteer | Number of entries | Source
 | 115 | Manually prepared
 | 94 | Manually created from the news corpus
Person prefix | 245 | Manually created from the news corpus
Middle name | 1491 | Semi-automatically from the news corpus
Surname | 5,288 | Semi-automatically from the news corpus
Common Location | 547 | Manually developed
Action verb | 221 | Manually prepared
Designation words | 947 | Semi-automatically prepared from the news corpus
First names | 72,206 | Semi-automatically prepared from the news corpus
Location name | 5,125 | Semi-automatically prepared from the news corpus
Organization name | 2,225 | Manually prepared
Month name | 24 | Manually prepared
Weekdays | 14 | Manually prepared
Measurement expressions | 52 | Manually prepared
Table 9: Descriptions of the language dependent features. Here, i represents the position of the current word and w_i represents the current word.
Feature | Description
FirstName | FirstName_i = I_{First name list}(w_i)
MidName | MidName_i = I_{Middle name list}(w_i)
SurName | SurName_i = I_{Surname list}(w_i) OR I_{Surname list}(w_i+1)
Funct | Funct_i = I_{Function word list}(w_i)
MonthName | MonthName_i = I_{Month name list}(w_i)
WeekDay | WeekDay_i = I_{Week day list}(w_i)
MeasureMent | Measurement_i = I_{Measurement word list}(w_i+1) OR I_{Measurement list}(w_i+1)
POS | POS_i = POS tag of the current word
NESuf | NESuf_i = I_{NE suffix list}(w_i)
OrgSuf | OrgSuf_i = I_{Organization suffix word list}(w_i) OR I_{Organization suffix word list}(w_i+1)
ComLoc | ComLoc_i = I_{Common location list}(w_i)
ActVerb | ActVerb_i = I_{Action verb list}(w_i) OR I_{Action verb list}(w_i+1)
DesG | DesG_i = I_{Designation word list}(w_i-1)
PerPre | PerPre_i = I_{Person prefix word list}(w_i-1)
LocName | LocName_i = I_{Location name list}(w_i)
OrgName | OrgName_i = I_{Organization name list}(w_i)
35,143 / 940,927 / 27 / 9,998,972 / 285 / 152,617
 | Training | Development | Test
# of sentences | 21,340 | 3,367 | 2,501
# of wordforms | 272,000 | 50,000 | 35,000
# of NEs | 22,488 | 3,665 | 3,178
Avg. length of NE | 1.5138 | 1.6341 | 1.6202
R (in %) | 73.55 | 75.97 | 77.14 | 77.09
P (in %) | 71.45 | 75.45 | 75.48 | 75.14
FS (in %) | 72.49 | 75.71 | 76.30 | 76.10

R (in %) | 75.26 | 79.03 | 81.37 | 81.29
P (in %) | 74.91 | 80.62 | 80.14 | 79.16
FS (in %) | 74.41 | 79.82 | 80.75 | 80.21

R (in %) | 78.26 | 82.07 | 84.56 | 84.42
P (in %) | 76.91 | 83.75 | 82.60 | 82.58
FS (in %) | 77.58 | 82.90 | 83.57 | 83.49
7. If the current word is not tagged as B-XXX/I-XXX/NNE but the following word is tagged as B-XXX/I-XXX/E-XXX, then the current word is assigned the tag B-XXX.
8. If the words tagged as NNE contain the variable length NE suffixes (used as a feature in the baseline models), then the words are assigned the NE tags. The types of the NE tags are determined by the types of the suffixes (e.g., the Person tag is assigned if the word matches a person name suffix).
Evaluation results have demonstrated recall, precision, and f-score values of 81.55%, 78.67%, and 80.8%, respectively.
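For illustration, rule 7 above could be implemented roughly as follows; the function name and the simplified tag handling are assumptions of the sketch.

```python
def apply_rule_7(tags):
    """Promote a word to B-XXX when it is not part of an NE but the next word continues one."""
    fixed = list(tags)
    for i in range(len(fixed) - 1):
        nxt = fixed[i + 1]
        current_ok = fixed[i].startswith(("B-", "I-")) or fixed[i] == "NNE"
        if not current_ok and nxt.startswith(("B-", "I-", "E-")):
            fixed[i] = "B-" + nxt.split("-", 1)[1]
    return fixed

print(apply_rule_7(["NNE", "PER", "I-PER", "E-PER"]))
# ['NNE', 'B-PER', 'I-PER', 'E-PER']
```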
Sentences added | ME | CRF | SVM-F | SVM-B (FS in %)
0 | 80.8 | 86.33 | 86.91 | 86.5
107 | 81.2 | 86.9 | 87.27 | 87.13
213 | 81.67 | 87.35 | 87.53 | 87.41
311 | 81.94 | 87.93 | 88.12 | 87.99
398 | 82.32 | 88.11 | 88.25 | 88.18
469 | 82.78 | 88.66 | 88.83 | 88.71
563 | 82.94 | 89.03 | 89.17 | 89.08
619 | 83.56 | 89.12 | 89.27 | 89.15
664 | 83.79 | 89.28 | 89.35 | 89.22
691 | 83.85 | 89.34 | 89.51 | 89.37
701 | 83.87 | 89.34 | 89.55 | 89.37
722 | 83.87 | 89.34 | 89.55 | 89.37
bined system selects the classification which is proposed by the majority of the models. If all four outputs are different, then the output of the SVM-F system is selected.
2. Cross validation f-score values: The training data is divided into N portions. We employ the training by using N-1 portions, and then evaluate the remaining portion. This is repeated N times. In each iteration, we have evaluated the individual system following the same methodology, i.e., by including the various gazetteers and the same set of post-processing techniques. At the end, we get N f-score values for each system. The final voting weight for a system is given by the average of these N f-score values. Here, we set the value of N to 10. We have defined two different types of weights depending on the cross validation f-score, as follows:
Total F-Score: In the first method, we have assigned the overall average f-score of a classifier as its weight.
Tag F-Score: In the second method, we have assigned the average f-score value of the individual tag as the weight for that model.
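A compact sketch of the resulting weighted vote is shown below; the per-tag weights are hypothetical values, not the cross-validation scores of the actual systems.

```python
from collections import defaultdict

def vote(token_predictions, tag_fscore):
    """token_predictions: {model: tag}; tag_fscore: {model: {tag: weight}}."""
    scores = defaultdict(float)
    for model, tag in token_predictions.items():
        scores[tag] += tag_fscore[model].get(tag, 0.0)   # Tag F-Score weighting
    return max(scores, key=scores.get)

preds = {"ME": "B-PER", "CRF": "B-PER", "SVM-F": "B-LOC", "SVM-B": "B-PER"}
weights = {m: {"B-PER": 0.9, "B-LOC": 0.85} for m in preds}
print(vote(preds, weights))   # 'B-PER'
```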
Experimental results of the voted system are presented in Table 17. Evaluation results show that the system achieves the highest performance for the voting scheme Tag F-Score. Voting shows (Tables 16-17) an overall improvement of 8.84% over the least performing ME based system and 3.16% over the best performing SVM-F system in terms of f-score values.

Model | ME | CRF | SVM-F | SVM-B
Post-processed | 80.8 | 86.33 | 86.91 | 86.50
(1) + Bootstrapping | 82.01 | 87.36 | 88.05 | 87.81
(2) + Document selection | 83.05 | 88.72 | 88.88 | 88.83
(3) + Sentence selection | 83.87 | 89.34 | 89.55 | 89.37

R (in %) | P (in %) | FS (in %)
93.19 | 89.35 | 91.23
93.85 | 89.97 | 92.17
93.98 | 91.46 | 92.71
6.5
The systems have been tested with a gold standard test set
of 35K wordforms. Approximately, 25% of the NEs are
unknown in the test set. Experimental results of the test
set for the baseline models have shown the f-score values
of 73.15%, 76.35%, 77.36%, and 77.23% in the ME, CRF,
SVM-F, and SVM-B based systems, respectively. Results
have demonstrated the improvement in f-scores by 8.35%,
9.67%, 8.82% and 8.83% in the ME, CRF, SVM-B, and
SVM-F models, respectively, by including the language
specific features, context features and post-processing techniques. Appropriate unlabeled sentences are then selected
by the document and sentence selection methods to be included into the training set. Models have shown the fscores of 83.77%, 89.02%, 89.17%, and 89.11% in the ME,
CRF, SVM-F, and SVM-B models, respectively. Experi-
R (in %) | P (in %) | FS (in %)
92.91 | 89.77 | 91.31
93.55 | 90.16 | 91.82
93.79 | 91.34 | 92.55
R (%) | P (%) | FS (%)
66.53 | 63.45 | 64.95
69.32 | 65.11 | 67.15
74.02 | 72.55 | 73.28
78.64 | 76.89 | 77.75
80.02 | 80.21 | 80.15
81.57 | 79.05 | 80.29
93.79 | 91.34 | 92.55
7 Conclusion
In this paper, we have reported a NER system by combining
the classifiers, namely ME, CRF and SVM with the help of
weighted voting techniques. We have manually annotated
a portion of the Bengali news corpus, developed from the
web archive of a leading Bengali newspaper. In addition,
we have also used the IJCNLP-08 NER shared task data
tagged with a fine-grained NE tag set of twelve tags. We
have converted this data with the NE tags denoting person
name, location name, organization name and miscellaneous
name. The individual models make use of the different
contextual information of the words, several orthographic
word-level features and the binary valued features extracted
from the various gazetteers that are helpful to predict the
NE classes. A number of features are language independent in nature. We have used an unsupervised learning
technique to generate lexical context patterns to be used as
features of the classifiers. We have described the method of
selecting appropriate unlabeled documents and sentences
from a large collection of unlabeled data. This eliminates
the necessity of manual annotation for preparing the NE
annotated corpus. We have also shown how several heuristics for ME, n-best output of CRF and the class splitting
technique of SVM are effective in improving the performance of the corresponding model. Finally, the outputs of
the classifiers have been combined with the three different
weighted voting techniques. It has been shown that a combination of several models performs better than any single one. Results show the effectiveness of the proposed NER system, which outperforms other existing systems by impressive margins. Thus, it can be concluded that contextual information of the words, several post-processing methods and the use of appropriate unlabeled data can yield reasonably good performance. Results also suggest that a combination of several classifiers is more effective than any single classifier.
References
[1] Cunningham, H.: GATE, a General Architecture for Text Engineering. Computers and the Humanities 36 (2002) 223-254
[2] Babych, B., Hartley, A.: Improving Machine Translation Quality with Automatic Named Entity Recognition. In: Proceedings of the EAMT/EACL 2003 Workshop on MT and other Language Technology Tools. (2003) 1-8
[4] Wu, Y.C., Fan, T.K., Lee, Y.S., Yen, S.J.: Extracting Named Entities using Support Vector Machines. In: Springer-Verlag. (2006)
[5] Budi, I., Bressan, S.: Association Rules Mining for Name Entity Recognition. In: Proceedings of the Fourth International Conference on Web Information Systems Engineering. (2003)
[52] Srikanth, P., Murthy, K.N.: Named Entity Recognition for Telugu. In: Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages. (2008) 41-50
In: EMNLP '02: Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Morristown, NJ, USA, Association for Computational Linguistics (2002) 125-132
[64] Riloff, E., Jones, R.: Learning dictionaries for information extraction by multi-level bootstrapping. In: AAAI '99/IAAI '99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference, Menlo Park, CA, USA, American Association for Artificial Intelligence (1999) 474-479
Shakti
In:
[66] Strzalkowski, T., Wang, J.: A self-learning universal concept spotter. In: Proceedings of the 16th conference on Computational linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1996) 931-936
[57] Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA (1995)
[68] Ekbal, A., Bandyopadhyay, S.: Bengali Named Entity Recognition using Support Vector Machine. In: Proceedings of NERSSEAL, IJCNLP-08. (2008) 51-58
Many publishers follow the Library of Congress Classification (LCC) scheme to indicate a classification code on the first pages of their books. This is useful for many libraries worldwide because it makes it possible to search and retrieve books by content type, and this scheme has become a de facto standard. However, not every book has been pre-classified by the publisher; in particular, in many universities, new dissertations have to be classified manually. Although there are many systems available for automatic text classification, all of them use extensive information which is not always available, such as the index, abstract, or even the whole content of the work. In this work, we present our experiments on supervised classification of books using only their titles, which would allow massive automatic indexing. We propose a new text comparison measure, which mixes two well-known text classification techniques: the Lesk voting scheme and Term Frequency (TF). In addition, we experiment with different weightings as well as logical-combinatorial methods such as ALVOT in order to determine the contribution of the title to correct classification. We found this contribution to be approximately one third, as we correctly classified 36% (on average for each branch) of 122,431 previously unseen titles (in total) upon training with 489,726 samples (in total) of one major branch (Q) of the LCC catalogue.
Povzetek: Opisan je postopek klasifikacije knjig na osnovi naslovov v ameriški kongresni knjižnici.
Introduction
Related work
Training set | Test set | Precision
800,000 | 50,000 | 55.00%
19,000 | 1,029 | 80.99%
800,000 | 50,000 | 86.00%
1,500,000 | 7,200 | 16.90% (** see below)
Our system: 489,726 | 122,431 | 86.89%
Information Extraction
3.1
C A B C
A D E I
H G H I
QA103
12
C A C C
F D G F
I G D I
12
QA247
QA 242.5
E A B C J K
J D E F L M
L G H I N
17
QA1 | 2 | 0 | 1 | 2 | 5
QA103 | 1 | 1 | 2 | 3 | 7
QA242.5 | 0 | 1 | 1 | 0 | 2
QA247 | 1 | 1 | 1 | 1 | 4
4.
5.
6.
3.2
3.3
IDF(w) = log( |classes| / |{C : C is a class and w ∈ C}| )
1.
2.
3.
4.
79
QA1 | 2/12 | 0/12 | 1/12 | 2/12 | 5/12 = 0.4167 | 3 | 3.4167
QA103 | 1/12 | 1/12 | 2/12 | 3/12 | 7/12 = 0.5833 | 4 | 4.5833
QA242.5 | 0/6 | 1/6 | 1/6 | 0/6 | 2/6 = 0.3333 | 2 | 2.3333
QA247 | 1/17 | 1/17 | 1/17 | 1/17 | 4/17 = 0.2353 | 4 | 4.2353
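Reading the rows above, the combined score appears to add the number of query words that received votes in a class to that class's normalized vote mass: for QA103 this gives 4 + 7/12, approximately 4.5833, the largest value of the four classes, so QA103 would be selected in this example.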
2/12
0/12
1/12
2/12
5/12
1/12
1/12
2/12
3/12
7/12
0/6
1/6
1/6
0/6
2/6
1/17
1/17
1/17
1/17
4/17
IDF
0.124939
0.124939
0
0.124939
0.010
0.010
0.000
0.031
0.052
0.000
0.020
0.000
0.000
0.020
0.007
0.007
0.000
0.007
0.022
0.038
0.038
0.000
0.059
3.4
3.5
Feature sets
Comparison criterion
Similarity function
Object evaluation (row) given a feature set
Class evaluation (column) for all feature sets
Rule of Solution
2.
3.
4.
3.5.1
where
f (T i , Q j )
3.5.2
i,
Tp
where:
Ti is the title to be classified,
Qj is a class to be evaluated for similarity with Ti.
The title to be classified will be compared with all the titles from each class, so that the class which on average has the greatest similarity will be chosen.
Consider the following example:
Title to classify: PRACTICAL MATHEMATICS,
Length = 2.
From Table 7 we can see that the selected class
would be the title with greatest similarity with the title to
be classified, i.e., QA39. Figure 2 shows a screenshot of
the system.
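A minimal sketch of this strategy, with a toy word-overlap similarity standing in for the paper's comparison measure and with the class contents assumed for the example, is given below.

```python
def similarity(title_a, title_b):
    """Toy word-overlap similarity between two titles (placeholder measure)."""
    a, b = set(title_a.split()), set(title_b.split())
    return len(a & b) / max(len(a), 1)

def classify(title, classes):
    """classes: {class_code: [training titles]}; returns the class with the best average similarity."""
    def avg_sim(titles):
        return sum(similarity(title, t) for t in titles) / len(titles)
    return max(classes, key=lambda c: avg_sim(classes[c]))

classes = {"QA39": ["MATHEMATICS USE"], "QA37": ["BIOMATHEMATICS"]}
print(classify("PRACTICAL MATHEMATICS", classes))   # 'QA39'
```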
Algorithms 4 and 4': Title Classification Using Logical-Combinatorial methods.
1.
2.
3.
4.
5.
f T
T p Q
STM | Length | Sub-Class | Title
0 | 1 | QA37 | BIOMATHEMATICS
1 | 2 | QA39 | MATHEMATICS USE
1 | 3 | QA303 | PURE MATHEMATICS COURSE
1 | 6 | QA5 | MATHEMATICS JA GLENN GH LITTLER DICTIONARY
1 | 5 | QA501 | PRACTICAL DESCRIPTIVE GEOMETRY, GRANT
1 | 4 | QA76 | INTERNATIONAL JOURNAL COMPUTER MATHEMATICS
1 | 5 | QA76.58 | PRACTICAL PARALLEL COMPUTING STEPHEN MORSE
1 | 6 | QA37 | MATHEMATICS MEASUREMENTS MERRILL RASSWEILER MERLE HARRIS
1 | 7 | QA37.2 | APPLIED FINITE MATHEMATICS RICHARD COPPINS PAUL UMBERGER
1 | 7 | QA37.2 | EUCLIDEAN SPACES PREPARED LINEAR MATHEMATICS COURSE TEAM
1 | 6 | QA37.2 | FOUNDATIONS MATHEMATICS KENNETH BERNARD HENRY WELLENZOHN
1 | 6 | QA37.2 | MATHEMATICS APPLICATIONS LAURENCE HOFFMANN MICHAEL ORKIN
Logical-
4.1
Training records | 515,721 | 8,837
Subclasses | 8,243 | 402
Keywords | 1,454,615 | 28,398
Test records | 515,721 | 8,387

Algorithm | 0 | 1 | 2 | 3 | 4 | 4'
Uncovered | 0 | 228 | 0 | 5,435 | 0 | 0
Covered | 515,721 | 515,493 | 515,721 | 510,286 | 8,837 | 8,837
Success | 178,654 | 433,861 | 177,945 | 396,689 | 8,214 | 7,822
Failure | 337,067 | 81,860 | 337,776 | 119,032 | 623 | 1,015
Precision | 34.64% | 84.16% | 34.50% | 77.74% | 92.95% | 88.51%
4.2
Subclasses | 8,377
Keywords | 1,441,220
Test records | 122,431
Acknowledgements
4.2.1
0 | 39,482 | 77,609 | 33.72%
1 | 48,857 | 68,234 | 41.73%
2 | 39,023 | 78,068 | 33.33%
3 | 48,274 | 68,817 | 41.23%
4.2.2 Evaluation by position
Position | 0 VFT | 1 VFTP | 2 VFTP-TFIDF | DPT | VT-MLC | VT-MLC
1 | 28.07% | 35.67% | 27.76% | 36.13% | 35.47% | 25.81%
2 | 12.37% | 12.28% | 12.08% | 10.15% | 8.20% | 9.71%
3 | 7.60% | 6.88% | 7.51% | 3.83% | 4.62% | 6.01%
4 | 5.51% | 4.65% | 5.42% | 2.05% | 3.31% | 4.49%
5 | 4.18% | 3.40% | 4.11% | 1.19% | 2.51% | 3.54%
Total | 57.73% | 56.88% | 62.88% | 53.35% | 54.11% | 49.56%
A lexical unit is a word or a collocation. Extracting lexical knowledge is an essential and difficult task in NLP. Methods for extracting lexical units are discussed, and we present a method for the identification of lexical unit boundaries. The problem of the need for large training corpora is also discussed. Identifying lexical boundaries within a text, as opposed to the traditional window method or a full parsing approach, allows human judgment to be reduced significantly.
Povzetek: Opisana je metoda za avtomaticno identifikacijo leksikalnih enot.
1 Introduction
Identification of a lexical unit is an important problem in many natural language processing tasks and refers to the process of extracting meaningful word chains. The term lexical unit is fuzzy and embraces a great variety of notions. The definition of the lexical unit differs according to the researchers' interests and standpoint. It also depends on the methods of extraction that provide researchers with lists of lexical items. Most lexical units are single words, or are constructed as binary items consisting of a node and its collocates found within a previously selected span. A lexical unit can be: (1) a single word, (2) the habitual co-occurrence of two words, and (3) a frequent recurrent uninterrupted string of words. The second and third notions correspond to the definition of a collocation or a multi-word unit. It is common to consider a single word as a lexical unit. A wide variety of definitions of collocation is presented in Violeta Seretan's work [12]. Fragments of corpus, or strings of words consisting of collocating words, are called collocational chains [7]. No final, agreed definition of collocation has yet been reached. Many syntactical, statistical and hybrid methods have been proposed for collocation extraction [13], [1], [5], [4]. In [10], it is shown that MWEs are far more diverse and interesting than is standardly appreciated. MWEs constitute a key problem that must be resolved in order for linguistically precise NLP to succeed. Although traditionally seen as a language-independent task, collocation extraction nowadays relies more and more on the linguistic preprocessing of texts prior to the application of statistical measures. A language-oriented review of the existing extraction work is provided in [14].
In our work we compare the Dice and Gravity Counts methods for the identification of lexical units by applying them under the same conditions. The definition of what a lexical unit is in a linguistic sense is not discussed in this paper.
3 Combinability measures
Two different statistical measures of collocability (Dice and Gravity Counts) are applied in this work. A good overview of combinability methods is presented in [3].
Gravity Counts are based on the evaluation of the combinability of two words in a text that takes into account a variety of frequency features, such as the individual frequencies of the words, the frequency of the pair of words and the number of types in the selected span. The token/type ratio is used slightly differently: usually this ratio is computed for a whole document or corpus, whereas here it is calculated for a word within the selected span only. In our experiments we used a span equal to 1. The expression of Gravity Counts is as follows:

G(x, y) = log( f(x, y) · n(x) / f(x) ) + log( f(x, y) · n'(y) / f(y) )

where f(x, y) is the frequency of the pair of words x and y in the corpus; n(x) is the number of types to the right of x; f(x) is the frequency of x in the corpus; n'(y) is the number of types to the left of y; f(y) is the frequency of y in the corpus. We set the level of collocability at a Gravity Counts value of 1 in our experiments. This decision was based on the shape of the curve found in [3].
The Dice coefficient can be used to calculate the co-occurrence of words or word groups. This ratio is used, for instance, in the collocation compiler XTract [11] and in the lexicon extraction system Champollion [6]. It is defined as follows [11]:

Dice(x, y) = 2 · f(x, y) / ( f(x) + f(y) )
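For illustration, a minimal sketch computing both measures over adjacent word pairs (span = 1), following the definitions above; the helper names are ours.

```python
from collections import Counter, defaultdict
from math import log

def combinability_scores(tokens):
    """Compute Dice and Gravity Counts for all adjacent word pairs in a token list."""
    word_freq = Counter(tokens)
    pair_freq = Counter(zip(tokens, tokens[1:]))
    right_types = defaultdict(set)   # distinct words seen immediately to the right of x
    left_types = defaultdict(set)    # distinct words seen immediately to the left of y
    for x, y in pair_freq:
        right_types[x].add(y)
        left_types[y].add(x)

    dice, gravity = {}, {}
    for (x, y), fxy in pair_freq.items():
        dice[(x, y)] = 2.0 * fxy / (word_freq[x] + word_freq[y])
        gravity[(x, y)] = (log(fxy * len(right_types[x]) / word_freq[x])
                           + log(fxy * len(left_types[y]) / word_freq[y]))
    return dice, gravity
```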
Figure 1: Identified lexical units of an example sentence, combinability values and collocability level at value 1. / Flat rate
/ corrections / are applied to all expenditure / under / the measure or measures / concerned / unless the deficiencies / were limited to
certain areas of expenditure / individual / projects / or types of project in which case they are applied to those areas of expenditure only/
Figure 2: Identified lexical units of an example sentence, combinability values, collocability level at value 1 and average
minimum law applied. / Flat rate / corrections / are applied to / all / expenditure / under the measure / or measures / concerned /
unless the / deficiencies / were limited to certain / areas of expenditure / individual / projects / or / types of / project / in which / case /
they are / applied to those / areas of / expenditure only /
Figure 3: Identified lexical units of an example sentence, combinability values, collocability level at value 1 and absolute
minimum law applied. / Flat rate / corrections / are applied to all expenditure / under the measure or measures / concerned / unless
the / deficiencies / were limited to certain / areas of expenditure / individual / projects / or / types of project / in which case / they are /
applied to those / areas of / expenditure only /
Figure 4: The number of lexical units (types) in the selected corpus (x-axis has logarithmic scale)
[Table: the most frequent lexical units identified with Gravity Counts and with Dice (average minimum law) when 100%, 10% and 1% of the corpus is used; the top units are function-word sequences such as "of the", "the", "and", "in the", "to the", "on the", "for the", "at the", "and the", "to be".]
[Table: segmentations of two example sentences ("she might have been the headmistress of a certain type of girls school, now almost extinct, or a mother superior in an enclosed order." and "at any rate there could be no doubt that she had found the temptation of the flesh resistible.") into lexical units when 100%, 10% and 1% of the corpus is used.]
6 Conclusions
The numbers of lexical units in most languages is comparable and amounts to 6-7 millions. It should be applicable
for the most of indoeuropean languages. The lexical unit
is very important in NLP and is applied widely. But the
notion of lexical unit is not clear and hard to define. We
propose a definition of a lexical unit as a sequence of wordforms extracted from row text by using collocability feature
and setting boundaries of lexical units. This approach is
more clear compared to a widely used n-gram definition
of a lexical unit. The boundaries are predictable and easier controlled compared to n-gram model. The result of
setting lexical boundaries for the small and large corpora
References
[1] Brigitte Orliac, Mike Dillinger (2003) Collocation extraction for machine translation, MT Summit IX, New Orleans, USA, 23-27 September 2003, pp. 292-298.
[2] Christian Boitet, Youcef Bey, Mutsuko Tomokio, Wenjie Cao, Hervé Blanchon (2006) IWSLT-06: experiments with commercial MT systems and lessons from subjective evaluations, International Workshop on Spoken Language Translation: Evaluation Campaign on Spoken Language Translation [IWSLT 2006], November 27-28, 2006, Kyoto, Japan, pp. 23-30.
[3] Daudaravicius V., Marcinkeviciene R. (2004) Gravity Counts for the Boundaries of Collocations, International Journal of Corpus Linguistics, 9(2), pp. 321-348.
[4] Dekang Lin (1998) Extracting collocations from text corpora, In First Workshop on Computational Terminology, Montreal, 1998, pp. 57-63.
[5] Gael Dias (2003) Multiword unit hybrid extraction, In Proceedings of the ACL Workshop on Multiword
[11] Stubbs, M. (2002) Two quantitative methods of studying phraseology in English, International Journal of Corpus Linguistics, 7(2), pp. 215-244.
[12] Violeta Seretan (2008) Collocation Extraction Based on Syntactic Parsing, Ph.D. thesis, University of Geneva.
[13] Violeta Seretan and Eric Wehrli (2006) Accurate collocation extraction using a multilingual parser, In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), Sydney, Australia, pp. 953-960.
[14] Violeta Seretan and Eric Wehrli (2006) Multilingual collocation extraction: Issues and solutions, In Proceedings of COLING/ACL Workshop on Multilingual Language Resources and Interoperability, 2006, Sydney, Australia, pp. 40-49.
Introduction
Related work
Problem definition
3.1
3.2
4.1 DIMASP-C
Document database
1  From George Washington to George W. Bush are 43 Presidents
2  Washington is the capital of the United States
3  George Washington was the first President of the United States
4  the President of the United States is George W. Bush

Document database (words by integer identifiers)
1  <1,2,3,4,2,5,6,7,8,9>
2  <3,10,11,12,13,11,14,15>
3  <2,3,16,11,17,18,13,11,14,15>
4  <11,18,13,11,14,15,10,2,5,6>
Figure 2: Data structure built by DIMASP-C for the database of Table 1. [The figure shows an indexed array of the word pairs <wi, wi+1> occurring in the database, from <From,George> (index 1) to <43,Presidents> (index 24), each entry holding its frequency Cf and the list of positions where the pair occurs.]
pair is frequent (i.e., its frequency reaches the threshold); in such a case DIMASP-C, based on the structure, grows the pattern while its frequency (the number of documents where the pattern can grow) remains greater than or equal to the threshold. When a pattern cannot grow, it is a possible maximal sequential pattern (PMSP), and it is used to update the final maximal sequential pattern set. Since DIMASP-C starts by finding 3-MSPs or longer, at the end all the frequent pairs that were not used for any PMSP, and all the frequent words that were not used for any frequent pair, are added as maximal sequential patterns.
In order to be efficient, the number of comparisons made when a PMSP is added to the MSP set (Fig. 4) needs to be reduced. For this reason, a k-MSP is stored according to its length k; that is, there is a k-MSP set for each k. In this way, before adding a k-PMSP as a k-MSP, the k-PMSP must not already be in the k-MSP set and must not be a subsequence of any longer MSP. When a PMSP is added, all its subsequences are eliminated.
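A minimal Python sketch of this bookkeeping, assuming patterns are stored as tuples of word identifiers grouped by length and that "subsequence" is taken as a contiguous subsequence (an assumption; the paper's definition may be more general); the names are illustrative.

```python
def add_pmsp(pmsp, msp_sets):
    """Add a candidate maximal pattern, keeping only maximal sequences.
    msp_sets: dict mapping length k -> set of k-MSPs (tuples of word ids)."""
    k = len(pmsp)

    def is_subseq(short, long_seq):
        # contiguous subsequence test for word sequences
        n, m = len(short), len(long_seq)
        return any(long_seq[i:i + n] == short for i in range(m - n + 1))

    # discard if already present or contained in a longer MSP
    if pmsp in msp_sets.get(k, set()):
        return
    if any(is_subseq(pmsp, longer)
           for kk, s in msp_sets.items() if kk > k for longer in s):
        return
    # remove shorter MSPs that the new pattern subsumes
    for kk in [kk for kk in msp_sets if kk < k]:
        msp_sets[kk] = {s for s in msp_sets[kk] if not is_subseq(s, pmsp)}
    msp_sets.setdefault(k, set()).add(pmsp)
```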
To avoid repeating all the work of discovering all the MSPs when new documents are added to the database, DIMASP-C only preprocesses the part
http://kdd.ics.uci.edu/databases/reuters21578/
reuters21578.html
4.1.1
[Figure: performance comparison of GSP, cSPADE, GenPrefixSpan, DELISP and DIMASP-DC; execution time (sec.) against the number of documents in the DDB, for different frequency thresholds (panel b: performance comparison with threshold 15).]
While the frequency condition holds do
    i ← i + 1, i.e. i ← i.NextNode
end-while
If |PMSP| ≥ 3 then add the PMSP to the MSP set
    MSP set ← add the k-PMSP to the MSP set    // step 3.1.1
end-for
For all the cells C in the Array do the addition of the 2-MSPs
    If C.f reaches the threshold and C.mark = not used then add it as a 2-MSP
        2-MSP set ← add C.<wi, wi+1>
4.2
[Figure panels: execution time (sec.) of steps 2 and 3 of DIMASP-DC for thresholds 5, 15 and 2 (panel e: DIMASP-DC with threshold 2), and incremental scalability of DIMASP-DC and cSPADE with threshold 15 (panel f), as a function of the number of documents in the DDB.]
Input: A document T
Output: The data structure
For all the pairs [ti, ti+1] in T do
    // if [ti, ti+1] is not in the Array, add it
    PositionNode.Pos ← index of [ti, ti+1] in the Array;
    Array[index].Positions ← new PositionNode
    Array[index].Freq ← Array[index].Freq + 1
    Array[LastIndex].Positions.NextIndex ← index;
    Array[LastIndex].Positions.NextPos ← PositionNode;
    LastIndex ← index;
End-for
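A rough Python rendering of this structure-building step, assuming each document arrives as a list of tokens; the field names mirror the pseudocode (Positions, Freq, NextPair), but the layout is a sketch, not the paper's exact implementation.

```python
from collections import defaultdict

def build_structure(documents):
    """Array keyed by word pair: a frequency counter plus the positions where the
    pair occurs, with each position node linked to the pair that follows it."""
    array = defaultdict(lambda: {"Freq": 0, "Positions": []})
    for doc_id, tokens in enumerate(documents):
        last = None  # position node of the previous pair in this document
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            node = {"Doc": doc_id, "Pos": i, "NextPair": None, "NextNode": None}
            array[pair]["Positions"].append(node)
            array[pair]["Freq"] += 1
            if last is not None:
                last["NextPair"] = pair   # link consecutive pairs, as in the pseudocode
                last["NextNode"] = node
            last = node
    return array
```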
[Figure: quantity of MSPs found by DIMASP-D, plotted by pattern length and by the number of documents in the DDB.]
[Figure: example of the data structure built for a short character sequence, showing the array of items and pairs, their positions (Pos), and the NextIndex / NextPos links between consecutive pairs.]
[Figure: execution time (sec.) of steps 1 and 2 and number of MSPs (in hundreds) for the Autobiography (panel a) and Letters (panel b) collections, as a function of corpus size in thousands of words, for thresholds 2, 5 and 10.]
Concluding remarks
References
Introduction
ReALIS [2] [4], the REciprocal And Lifelong Interpretation System, is a new post-Montagovian [15] [17] theory concerning the formal interpretation of sentences constituting coherent discourses [9], with a lifelong model [1] of the lexical, interpersonal and cultural/encyclopaedic knowledge of interpreters at its centre, including their reciprocal knowledge of each other. The decisive theoretical feature of ReALIS lies in a peculiar reconciliation of three objectives which are all worth accomplishing in formal semantics but could not be reconciled so far.
The first aim concerns the exact formal basis itself (Montague's Thesis [20]): human languages can be described as interpreted formal systems. The second aim concerns compositionality: the meaning of a whole is a function of the meaning of its parts, practically postulating the existence of a homomorphism from syntax to semantics, i.e. a rule-to-rule correspondence between the two sides of grammar.
In Montague's interpretation systems a traditional logical representation played the role of an intermediate level between the syntactic representation and the world model, but Montague argued that this intermediate level of representation can, and should, be eliminated. (If there is a compositional mapping from syntax to discourse representation and a compositional mapping from discourse representation to the world model, then their composition must be a compositional mapping directly from syntax to model.) The post-Montagovian history of formal semantics [17] [9], however, seems to have proven the opposite, some principle of discourse
Definition
The DRS condition [e: p t r1 ... rK] [10] (e.g. [e: resemble now Peter Paul]) can be formulated with the aid of this function as follows (with i and t fixed): the function assigns p to (Pred, e), t to (Temp, e), and rk to (Arg, k, e) for k = 1, ..., K.
Example
[Example (1a-h): fragments of the ReALIS lexical entries for the Hungarian items glossed as 'German' and 'swimming', for bajnok 'champion', and for the case suffix -ra: each entry introduces an eventuality referent (e.g. egerm, esw), a predicate referent (e.g. pGerman, pswimming), argument referents (e.g. rgerm, rsw) and categorial/ordering requirements such as Adj: Ord,+1, Nei, Cat,+2, N.]
In ReALIS the lexical representation belonging to a morpheme typically contains a reference to a predicate (e.g. pchampion) furnished with argument referents (e.g. rch above in (1h)), a temporal referent and a referent referring to the fact that the given predicate holds true (the eventuality referent ech refers to the fact that somebody is a champion). In the analysis provided in this paper, temporal referents are ignored for the sake of simplicity. As was mentioned earlier, this eventuality construction is registered by an internal function.
The lexical representation belonging to a morpheme should predict, about these referents, how they will connect to referents coming from other lexical representations retrieved in the course of the dynamic interpretation of a sentence when the given morpheme gets into the given sentence. We mean the extension of the function practically responsible for identification, and of the function responsible for scope hierarchy and/or rhetorical relations. Lexical items thus impose requirements on the potential intrasentential environments accommodating the given morphemes, and provide offers for the items of other morphemes to help them (these other morphemes) find them.
In what follows we provide comments on a few (but not all) lexical requirements and offers. Let the verb (hasonlít 'resemble') be the first (1b), with its eight-row-long lexical description. The first row contains the eventuality representation of the semantic contribution of this verb, which consists of an eventuality referent (referring to the fact that somebody resembles somebody), a predicate referent, a temporal referent (ignored), and two argument referents. What the second row says is that the
Implementation
Conclusion
Acknowledgement
Special thanks are due to the National Scientific Fund of
Hungary (OTKA K60595) for their financial support.
References
[1] Alberti, G. (2000): Lifelong Discourse Representation Structures, Gothenburg Papers in Computational Linguistics 005, 13-20.
[2] Alberti, G. (2004): ReAL Interpretation System, in Hunyadi, L., Rákosi, Gy., Tóth, E. (eds.) Preliminary Papers of the Eighth Symposium on Logic and Language, Univ. of Debrecen, 1-12.
[3] Alberti, G. (2005): Accessible Referents in Opaque Belief Contexts, in Herzig, A., van Ditmarsch, H. (eds.) Belief Revision and Dynamic
Classifier combination techniques have been applied to a number of natural language processing problems.
This paper explores the use of bagging and boosting as combination approaches for coreference resolution. To the best of our knowledge, this is the first effort that examines and evaluates the applicability of
such techniques to coreference resolution. In particular, we (1) outline a scheme for adapting traditional
bagging and boosting techniques to address issues, like entity alignment, that are specific to coreference
resolution, (2) provide experimental evidence which indicates that the accuracy of the coreference engine
can potentially be increased by use of multiple classifiers, without any additional features or training data,
and (3) implement and evaluate combination techniques at the mention, entity and document level.
Povzetek: Kombiniranje ucnih algoritmov je uporabljeno za iskanje koreferenc.
1 Introduction
Classifier combination techniques have been applied
to many problems in natural language processing (NLP). Popular examples include the ROVER
system [Fiscus1997] for speech recognition, the
Multi-Engine Machine Translation (MEMT) system [Jayaraman and Lavie2005], and also part-of-speech tagging [Brill and Wu1998, Halteren et al.2001].
Even outside the domain of NLP, there have
been numerous interesting applications for classifier combination techniques in the areas of biometrics
[Tulyakov and Govindaraju2006],
handwriting recognition [Xu et al.1992] and data mining [Aslandogan and Mahajani2004] to name a few. Most
of these techniques have shown a considerable improvement over the performance of single-classifier baseline
systems and, therefore, led us to consider implementing
such a multiple classifier system for coreference resolution
as well. To the best of our knowledge, this is the first
effort that utilizes classifier combination techniques for
improving coreference resolution.
This study shows the potential for increasing the accuracy of the coreference resolution engine by combining
multiple classifier outputs and describes the combination
techniques that we have implemented to establish and tap
into this potential. Unlike other domains where classifier combination has been implemented, the coreference
resolution application presents a unique set of challenges
that prevent us from directly using traditional combination
schemes [Tulyakov et al.2008]. We, therefore, adapt some
of these popular yet simple techniques to suit our application, and study the results of the implementation.
The main advantage of using combination techniques is
that in cases where we have multiple classification engines,
we do not merely use the classifier with highest accuracy,
but instead, we combine all of the available classification
engines attempting to achieve results superior to the single
best engine. This is based on the assumption that the errors
made by each of the classifiers are not identical and therefore if we intelligently combine multiple classifier outputs,
we may be able to correct some of these errors.
The main contributions of this paper are:
demonstrating the potential for improvement over the baseline: by implementing a system that behaves like an oracle, in which we combine the outputs of several coreference resolution classifiers with knowledge of the truth, i.e. the correct output generated by a human, we
have shown that the output of the combination of multiple classifiers has the potential to be significantly higher
in accuracy than any of the individual classifiers. This
has been proven in certain other areas of NLP; here, we
to the same person; so do Mary and sister. Furthermore, John and Mary are named mentions, sister is a
nominal mention and his is a pronominal mention.
The basic coreference system is similar to the one described by Luo et al. [Luo et al.2004]. In such a system,
the mentions in a document are processed sequentially, and
at each step, a mention is either linked to one of existing entities, or used to create a new entity. At the end of this process, each possible partition of the mentions corresponds to
a unique sequence of link or creation actions, each of which
is scored by a statistical model. The one with the highest
score is output as the final coreference result.
Experiments reported in the paper are done on the ACE
2005 data [NIST2005], which is available through the Linguistic Data Consortium (LDC). The dataset consists of
599 documents from rich and diversified sources (called
genres in this paper), which include newswire articles, web
logs, and Usenet posts, transcription of broadcast news,
broadcast conversations and telephone conversations. We
reserve the last 16% of the documents of each source as the test
set, and use the rest of the documents as the training set.
The number of documents, words, mentions and entities of
this data split are tabulated in Table 1.
3 Bagging
One way to obtain multiple classifiers is via
bagging or bootstrap aggregating, proposed by
Breiman [Breiman1996] to improve the classification
by combining outputs of classifiers that are trained using
randomly-generated training sets. We have implemented
bagging by using semi-randomly generated subsets of the
entire training set and also subsets of the feature set.
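A minimal sketch of how such bagged training configurations could be generated; the 12/3 split of document-sampled and feature-sampled classifiers mirrors the counts reported later in Section 5.1, while the sampling fractions are our own illustrative assumption, not the paper's procedure.

```python
import random

def make_bagged_configs(train_docs, features, n_doc_bags=12, n_feat_bags=3,
                        doc_frac=0.8, feat_frac=0.8, seed=0):
    """Return classifier configurations, each a dict of training documents and features."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_doc_bags):
        # subset of the training documents, full feature set
        docs = rng.sample(list(train_docs), int(doc_frac * len(train_docs)))
        configs.append({"docs": docs, "features": list(features)})
    for _ in range(n_feat_bags):
        # full training set, subset of the features
        feats = rng.sample(list(features), int(feat_frac * len(features)))
        configs.append({"docs": list(train_docs), "features": feats})
    return configs
```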
Table 1: Statistics of ACE 2005 data: number of documents, words, mentions and entities in the training and test set.
DataSet    #Docs   #Words    #Mentions   #Entities
Training   499     253771    46646       16102
Test       100     45659     8178        2709
Total      599     299430    54824       18811
3.2 Oracle
In this paper, we refer to an oracle system which uses
knowledge of the truth. In this case, truth, called gold standard henceforth, refers to mention detection and coreference resolution done by a human for each document. It
is possible that this gold standard may have errors and is
not perfect truth, but, as in most NLP systems, the human-annotated gold standard is considered the reference for purposes of evaluating computer-based coreference resolution.
To understand the oracle itself, consider an example in
which we have two classifiers, and their outputs for the
same input document are C1 and C2 , as illustrated in Figure 1. The number of entities in C1 and C2 may not be the
same and even in cases where they are, the number of mentions in corresponding entities may not be the same. In fact,
even finding the corresponding entity in the other classifier
or in the gold standard output G is not a trivial problem and
requires us to be able to align any two classifier outputs.
The alignment between any two coreference labelings,
say C1 and G, for a document is done by finding the best
one-to-one map between the entities of C1 and the entities of G, using the algorithm explained by Luo [Luo2005].
To align the entities of C1 with those of G, under the assumption that an entity in C1 may be aligned with at most
only one entity in G and vice versa, we need to generate a bipartite graph between the entities of C1 and G.
Now the alignment task is a maximum bipartite matching
problem. This is solved by using the Kuhn-Munkres algorithm [Kuhn1955, Munkres1957]. The weights of the
edges of the graph, in this case, are entity-level alignment
measures. The metric we use is a relative measure of the
similarity between the two entities. To compute the similarity metric for the entity pair (R, S), we use the formula shown in Equation 1, where the intersection (∩) is the commonality with attribute-weighted partial scores. Attributes are things such as (ACE) entity type, subtype,

2 |R ∩ S| / ( |R| + |S| )    (1)
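As an illustration, here is a small sketch of the alignment step, using the similarity above in unweighted form (the paper weights the intersection by mention attributes) and SciPy's assignment solver in place of a hand-rolled Kuhn-Munkres implementation; the function names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def entity_similarity(r, s):
    """2|R∩S| / (|R|+|S|) with entities as sets of mention identifiers."""
    return 2.0 * len(r & s) / (len(r) + len(s))

def align_entities(sys_entities, gold_entities):
    """One-to-one alignment of system and gold entities (maximum bipartite matching)."""
    sim = np.array([[entity_similarity(r, s) for s in gold_entities]
                    for r in sys_entities])
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize total similarity
    return [(i, j, sim[i, j]) for i, j in zip(rows, cols) if sim[i, j] > 0]
```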
Si = (1 / (N - 1)) Σ_{j ≠ i} Sc(Combi, Cj)    (2)

Si = Σ_{j ≠ i} Sc(Combi, Cj)    (3)

Si = (1 / (N - 1)) Σ_{j ≠ i} Sc(Combi, Combj)    (4)
4 Boosting
Σ_{j ≠ i} Sc(Ai, Aj)    (5)
5 Evaluation
[Table fragment: Accuracy (%) values 77.52, 79.16, 75.81, 78.53.]
5.1 Bagging
The classifiers for the following experiments were generated using bagging techniques described in Section 3.1.
A total of 15 classifiers (C1 to C15 ) were generated, 12
of which were obtained by semi-random sampling of the
training set and the remaining 3 by sampling of the feature
set. We also make use of the baseline classifier C0 , which
was trained using the full training and feature sets. The
accuracy of classifiers C0 to C15 has been summarized in
Table 2. The agreement between the generated classifiers
output was found to be in the range of 93% to 95%. In
this paper, the metric used to compute the accuracy of the
coreference resolution is the Constrained Entity-Alignment
F-Measure (CEAF) [Luo2005] with the entity-pair similarity measure in Equation 1.
Oracle. To conduct the oracle experiment described in
Section 3.2, we train 1 to 15 classifiers, whose outputs are
aligned to the gold standard. For all system-generated entities aligned with a gold entity, we pick the one with the
highest score as the output. We measure the performance
for varying number of classifiers, and the result is plotted
in Figure 5.
First, we observe a steady and significant increase in
CEAF for every additional classifier. This is not surprising
since an additional classifier can only improve the alignment score. Second, it is interesting to note that the oracle
performance is 87.58% for a single input classifier C1 , i.e.
an absolute gain of 9% compared to the baseline. This is
because the availability of gold entities makes it possible to
remove many false-alarm entities. Finally, the performance
of the oracle output when all 15 classifiers (C1 to C15 ) are
used as input is 94.59%, a 16.06% absolute improvement.
The oracle experiment is a cheating one since the gold
standard is used. Nevertheless, it helps us understand the
performance bound of combining multiple classifiers and
the quantitative contribution of every additional classifier.
Preliminary classifier combination approaches. While the
oracle result is encouraging, a natural question is: how
much performance gain can be attained if the gold standard is not available. To answer this question, we replace
the gold standard with one of the 15 classifiers C1 to C15 ,
and align the rest of the classifiers. This is done in a round-robin
fashion as described in Section 3.3. The best performance
of this procedure is 77.93%. The sum-rule combination
output had an accuracy of 78.65% with a slightly different
baseline of 78.81%. In other words, these techniques do
not yield a statistically significant increase in CEAF relative to the baseline. This is not entirely surprising, as the 15 classifiers C1 to C15 are highly correlated.
Mention-level majority voting. This experiment is conducted to evaluate and understand the mention-level majority voting technique for coreference resolution. Compared with the baseline, the results of this experiment are
not statistically better, but they give us valuable insight into
[Table fragment: Accuracy (%) values 78.53, 78.82, 79.08, 78.37.]
noisy nature of the full system C0 which is used for alignment. We also observe that mentions spread across different alignments often have low counts and are often
tied in count. Therefore, it is important to set a minimum
threshold for accepting these low-count majority votes and
also investigate better tie-breaking techniques.
5.2 Boosting
This experiment is conducted to evaluate the documentlevel boosting technique for coreference resolution. Table 3
shows the results of this experiment with the ratio of the
number of training documents to the number of test documents equal to 80:20, F-measure threshold Fthresh = 74%
and percentile threshold Pthresh = 25%. The accuracy increases by 0.7%, relative to the baseline. Due to computational complexity considerations, we used fixed values
for the parameters. Therefore, these values may be suboptimal and may not correspond to the best possible increase in accuracy.
6 Related work
A large body of literature related to statistical methods for coreference resolution is available [Ng and Cardie2003, Yang et al.2003, Ng2008,
Poon and Domingos2008, McCallum and Wellner2003].
Poon and Domingos [Poon and Domingos2008] use an
unsupervised technique based on joint inference across
mentions and Markov logic as a representation language
for their system on both MUC and ACE data. Ng [Ng2008]
proposed a generative model for unsupervised coreference
resolution that views coreference as an EM clustering process. In this paper, we make use of a coreference engine
similar to the one described by Luo et al. [Luo et al.2004],
where a Bell tree representation and a Maximum entropy
framework are used to provide a naturally incremental
framework for coreference resolution. To the best of
our knowledge, this is the first effort that utilizes classifier combination techniques to improve coreference
resolution. Combination techniques have earlier been
applied to various applications including machine translation [Jayaraman and Lavie2005] and part-of-speech
tagging [Brill and Wu1998]. However, the use of these
techniques for coreference resolution presents a unique
set of challenges, such as the issue of entity alignment
Acknowledgement
The authors would like to acknowledge Ganesh N. Ramaswamy for his guidance and support in conducting the
research presented in this paper.
References
[Halteren et al.2001] H. Van Halteren, J. Zavrel, and W. Daelemans. 2001. Improving accuracy in word class tagging
through the combination of machine learning systems. Computational Linguistics, 27.
[Jayaraman and Lavie2005] S. Jayaraman and A. Lavie. 2005.
Multi-engine machine translation guided by explicit word
matching. In Proc. of ACL.
[Kuhn1955] H. W. Kuhn. 1955. The hungarian method for the
assignment problem. Naval Research Logistics Quarterly, 2.
[Luo et al.2004] X. Luo, A. Ittycheriah, H. Jing, N. Kambhatla,
and S. Roukos. 2004. A mention-synchronous coreference
resolution algorithm based on the bell tree. In Proc. of ACL.
[Luo2005] X. Luo. 2005. On coreference resolution performance
metrics. In Proc. of EMNLP.
[McCallum and Wellner2003] A. McCallum and B. Wellner.
2003. Toward conditional models of identity uncertainty
with application to proper noun coreference. In Proc. of IJCAI/IIWeb.
[Munkres1957] J. Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the Society of
Industrial and Applied Mathematics, 5(1).
[Ng and Cardie2003] V. Ng and C. Cardie. 2003. Bootstrapping
coreference classifiers with multiple machine learning algorithms. In Proc. of EMNLP.
[Ng2008] V. Ng. 2008. Unsupervised models for coreference
resolution. In Proc. of EMNLP.
[NIST2005] NIST. 2005. ACE05 evaluation. www.nist.gov/speech/tests/ace/ace05/index.html.
[Brill and Wu1998] E. Brill and J. Wu. 1998. Classifier combination for improved lexical disambiguation. In Proc. of COLING.
Temporal multi-document summarization (TMDS) aims to capture evolving information of a single topic
over time and produce a summary delivering the main information content. This paper presents a cascaded regression analysis based macro-micro importance discriminative model for the content selection of
TMDS, which mines the temporal characteristics at different levels of topical detail in order to provide the
cue for extracting the important content. Temporally evolving data can be treated as dynamic objects that
have changing content over time. Firstly, we extract important time points with macro importance discriminative model, then extract important sentences in these time points with micro importance discriminative
model. Macro and micro importance discriminative models are combined to form a cascaded regression
analysis approach. The summary is made up of the important sentences evolving over time. Experiments
on five Chinese datasets demonstrate the encouraging performance of the proposed approach, but the problem is far from solved.
Povzetek: Metoda kaskadne regresije je uporabljena za izdelavo zbirnega besedila.
1 Introduction
Multi-document summarization is a technology of information compression, which is largely an outgrowth of the late
twentieth-century ability to gather large collections of unstructured information on-line. The explosion of the World
Wide Web has brought a vast amount of information, and
thus created a demand for new ways of managing changing
information. Multi-document summarization is the process
of automatically producing a summary delivering the main
information content from a set of documents about an explicit or implicit topic, which helps to acquire information
efficiently. It has drawn much attention in recent years and
is valuable in many applications, such as intelligence gathering, hand-held devices and aids for the handicapped.
Temporal multi-document summarization (TMDS) is the
natural extension of multi-document summarization, which
captures evolving information of a single topic over time.
The greatest difference from traditional multi-document
summarization is that it deals with the dynamic collection
about a topic changing over time. It is assumed that a user
has access to a stream of news stories that are on the same
topic, but that the stream flows rapidly enough that no one
has the time to look at every story. In this situation, a person
would prefer to dive into the details that include the most
important, evolving concepts within the topic and have a
trend analysis.
The key problem of summarization is how to identify
important content and remove redundant content. The
common problem for summarization is that the information in different documents inevitably overlaps with each
other, and therefore effective summarization methods are
needed to contrast their similarities and differences. However, the application scenarios above, where the objects to be summarized concern specific topics and evolve with time, raise new challenges for traditional summarization algorithms. One challenge for TMDS is that the information in the summary must contain the evolving content, so we need to take these temporally evolving characteristics into account effectively during the summarization process. Thus a good TMDS must include as much information as possible while keeping it as novel as possible. In this paper, we focus on how to summarize series of news reports in a generic, extractive way.
Considering the temporal characteristics of series of news reports at different levels of topical detail, redundancy is a useful feature. We adopt cascaded regression analysis to model the temporal redundancy from the macro and micro views, hierarchically extracting important information with the macro and micro importance discriminative models: we detect the important time points with the macro importance discriminative model, and extract the important sentences within them with the micro importance discriminative model. The macro and micro models are combined to form a cascaded regression analysis model. This method not only reduces the complexity of the problem, but also fully mines the temporal characteristics of the data as it evolves over time. The summary is made up of
2 Related work
Trigger is the set of trigger words, and Scope is the sentence description containing events. With reference to the definition of an event in the ACE evaluation1, a trigger word indicates the existence of an event. However, since Chinese event extraction technology is not mature, we ignore the remaining attributes of an event, including type, subtype, modality, polarity, genericity and tense. Generally, a trigger word is a verb or an action noun; we consider only verbs so as to simplify the question. Thus, the j-th sentence can also be simply formalized as follows: sj = {vk}, j, k = 1...n. The importance of a sentence depends on the importance of the verbs it contains. Based on the above analysis, we choose the time point and the verb as the content units of importance discrimination from the macro and micro views, respectively.
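A minimal sketch of the macro-micro cascade described above, with the two regression-based importance scorers left as abstract parameters (their features and training are described in the paper); the number of time points kept and the per-time-point sentence quota are illustrative assumptions.

```python
def cascaded_summary(time_points, macro_score, micro_score, k_points=5, rate=0.1):
    """time_points: {t: [sentences]}. First keep the k most important time points
    (macro model), then pick the top sentences within them (micro model)."""
    top_points = sorted(time_points, key=macro_score, reverse=True)[:k_points]
    summary = []
    for t in sorted(top_points):                       # assumes time points sort chronologically
        sents = sorted(time_points[t], key=micro_score, reverse=True)
        quota = max(1, int(rate * len(time_points[t])))  # compression-rate style quota
        summary.extend(sents[:quota])
    return summary
```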
left_Slope(ti) = RI(left) / Td(left)    (1)

right_Slope(ti) = RI(right) / Td(right)    (2)
[Equations (3) and (4): the verb weight based on n / of(vk), and the standardization of X using E(X) and Variance(X).]
4 Experiments
4.1 Corpus and evaluation metrics
TMDS is a new research area, and there is no public corpus or evaluation metric; therefore we had to build both the corpus and the evaluation metric.
Corpus. Our Chinese corpus construction includes two parts: the construction of the raw corpus and the construction of the reference summaries. Five groups of Chinese data sets were chosen from Sina's2 international news topics between 2005 and 2006. Table 1 illustrates the settings of the corpus: there are five topics, 78 time points, 734 articles and 13,486 sentences. For each data set, we had experts annotate three groups of reference summaries at compression rates of 10% and 20%.
ID   #time points   #articles   #sentences
1    20             214         4310
2    25             250         5253
3    3              17          101
4    20             158         2278
5    10             95          1544
F = Σ_{i} Fi    (5)

Fi = 2 Pi Ri / (Pi + Ri)    (6)
Method        F (10%)   F (20%)
TFIOF         17.29%    27.18%
Slope         19.11%    27.56%
Variance      14.81%    27.02%

Method        F (10%)   F (20%)
Micro         19.11%    27.56%
Macro+Micro   20.46%    29.13%
Y. Ledeneva, G. Sidorov
Y. Ledeneva, G. Sidorov
M.C. Lintean, V. Rus    19
E. Lloret, M. Palomar    29
A.-C. N. Ngomo    37
A. Téllez-Valero, M. Montes-y-Gómez, L. Villaseñor-Pineda, A. Peñas-Padilla    45
A. Ekbal, S. Bandyopadhyay    55
R. Ávila-Argüelles, H. Calvo, A. Gelbukh, S. Godoy-Calderón    77
V. Daudaravicius    85
R.A. García-Hernández, J.Fco. Martínez-Trinidad, J.A. Carrasco-Ochoa    93
G. Alberti, J. Kleiber    103