Semantic Processing For Text Entailment With VENSES
Proper Name + Definite Expression
Rte5 – DevSet - TH Pair 10

(CNN) -- Malawians are rallying behind Madonna as she awaits a ruling Friday on whether she can adopt a girl from the southern African nation. The pop star, who has three children, adopted a son from Malawi in 2006. She is seeking to adopt Chifundo “Mercy” James, 4. “Ninety-nine percent of the people calling in are saying, let her take the baby,” said Marilyn Segula, a presenter at Capital FM, which broadcasts in at least five cities, including the capital, Lilongwe.
Madonna has three children.

Indefinite Expression + Proper Name
Rte5 – TestSet - TH Pair 83

A Ugandan spy who set up a bogus charity and embezzled thousands of dollars of funding meant for Aids patients has been jailed for 10 years. Teddy Sseezi Cheeye, 51, took $56,000 (£38,000) from the Global Fund charity, which aims to prevent HIV, tuberculosis and malaria. He set up an NGO, the Uganda Centre for Accountability, which received cash in 2005 to do HIV/Aids community work. But the High Court in Kampala heard Cheeye siphoned off the funds instead.
Teddy Sseezi Cheeye is an Ugandan spy.

Definite Expression + Proper Name
Rte5 – TestSet - TH Pair 269

The eruption happened at around 1:30 PM local time, the United States Geological Survey reported. The volcano had erupted four times on Friday, billowing ash up to 51,000 feet up into the air. These are the latest in a series of eruptions from Mount Redoubt, which started on March 22. The volcano had not erupted since a four-month period in 1989-90. The Alaska Volcano Observatory set its alert level at red, the highest possible level, meaning that an eruption is imminent, and that it would send a "significant emission of volcanic ash into the atmosphere."
Mount Redoubt is located in Alaska.

As can be gathered from the headers and the highlighted portions of texts, the cases to be covered all involve a proper noun which can be either a person’s name or a location. There are three different configurations to account for, which require basically a search for the type of

ref_ex(sn1, Madonna, [+ref, def0, nil, nil, -pro, -ana, -class], 3, fem, sing, [human], subj/theme_unaff)
ref_ex(sn1, existence, [+ref, +def, very, nil, -pro, -ana, +class], 3, neut, sing, [place, state], subj/actor)
ref_ex(sn9, it, [+ref, +def, nil, nil, +pro, -ana, +le], 3, neu, sing, [any], subj/agent)

As can be seen from the representations, proper nouns are marked def0, +ref and -class; common nouns, on the contrary, are marked +def/-def, +ref and +class. Pronouns do not have the attribute CLASS but +/-le, which stands for Lexically Expressed. The vector includes Functional Features (Person, Gender, Number) and Semantic Features in the sense of General Nouns or Inherent Features. At the end of the vector or Prolog term, we report the grammatical function and semantic role associated with the head noun, which can be found in syntactic and dependency representations by means of the index positioned at the beginning.

4. Evaluation and Ablation test

The evaluation results we present aim to give a comprehensive picture of the system performance over all the datasets made available with RTE. It is worth recalling that the first two challenges contained very short Texts compared to the average Text size of the following challenges. In particular, Texts contained in the RTE4 test set and in the RTE5 development and test sets are much longer than those contained in RTE3. The difference in treatment of these datasets is quite obvious: modeling a paragraph-long text is certainly much harder. In addition, RTE5 texts have a certain number of T/H pairs where the content of the Hypothesis is scattered among a number of sentences in the paragraph. This makes the task much harder than in all those cases in which semantic matching can be concentrated on just one sentence in the Text paragraph.

As will be noticed in the data reported in Table 1 below, there is a remarkable difference between the results obtained on the Development set and on the Test set: 10 percentage points of accuracy. A possible reason for this is the fact that the RTE5 Test set contains many more cases of difficult-to-spot entailment relations.
It is a fact that a great number of T/H pairs contain Texts where the relevant relations are scattered over more than one sentence, thus making the semantic matching task harder to perform.

Subtask    Accuracy(%)
ir         61.00
qa         65.00
ie         58.50
Average    61.50
Table 1: Official results for Run 1 of our system – No Ranking

Subtask    Precision(%)
ir         58.62
qa         67.14
ie         66.00
Average    64.45
Table 2: Official results for Run 2 of our system – Ranking

Results for past RTE datasets as a whole average 63% (but see Table 4 below). Results for the Contradiction Dataset are as follows:
Accuracy measured as ratio of Correct Pairs/All Pairs: 108/131 = 0.8245
Results for the Development and the Test set of RTE 5 are as follows:
DEVELOPMENT set: Accuracy measured as ratio of Correct Pairs/All Pairs: 0.73
It is important to notice that in all cases, with no exception whatsoever, the percentage of True T/H pairs found is higher than the percentage of False.

We report below a table with the overall results the system obtains on all RTE datasets.

4.1 Ablation Test

We carried out one ablation test in which we removed the matching procedures related to Grady Ward’s MOBY Thesaurus as well as to Roget’s Thesaurus. In fact, what we eliminated was a procedure which used “lexical fields” for semantic similarity matching in all cases of non-identical lemmas. We used this procedure after eliminating cases of antonymy, which could degrade the semantic similarity matching. After the filter for antonyms, matching was carried out on lemmas as usual. Access to thesauri can in some cases contribute important and relevant information, but this is not always guaranteed, as shown by the results of the test reported in Table 3 below. In particular, we may notice that in one case, the IR subtask, removing the procedure improved accuracy by 4.5 percentage points (from 61.00 to 65.50). So, even though in the remaining subtasks there is always a reduction of the overall accuracy, it is interesting to notice that not all tasks behave in the same way.

This type of “sloppy” semantic similarity matching is fired every time the system needs approximate or fuzzy similarity information. In particular, it is never permitted whenever precise information is required, as for instance in what we call General Consistency Checking procedures. These procedures are carried out to check for the presence of Quantified Expressions, information related to Spatio-Temporal Location, and any kind of numerical information present in the Hypothesis that has to be present also in the Text. On the contrary, whenever we look for attributes, modifiers and other similar adjuncts of the arguments expressed in predicate-argument structure, we allow access to lexical fields contained in thesauri. This may also apply in all copulative constructions, whenever a certain property is being associated with the subject of the predication.

These matching procedures are scattered all over the evaluation algorithm. What we did was simply to disable access to them, by inserting a dummy pair of values (nil, nil) in place of the two variables that had to be taken into consideration by the matching procedure, and by inserting a cut (in Prolog, an instruction that blocks backtracking, here used to force a failure) in place of the procedure itself, which was hidden.

Subtask    Accuracy1    Accuracy2
ir         61.00        65.50
qa         65.00        58.00
ie         58.50        52.50
Average    61.50        58.67
Table 3: Ablation Test results compared with Run1

5. Conclusions and Future Work

We presented our improvements to VENSES, our system for semantic evaluation, which uses a proprietary complete system of text analysis based on a deep system called GETARUNS. We introduced a number of new modules that take advantage of the output of the anaphora resolution algorithm and exploit its representations to attempt bridging coreference. In case constraints are respected, the system looks for similar relations in web ontologies, to confirm the anaphoric link. We also implemented Augmented FSA both at tagging and at dependency levels. The results are very encouraging and we saw an improvement of 8% overall.
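As a side check on the ablation figures of Section 4.1, the per-subtask changes in Table 3 can be recomputed from the reported accuracies. The short Python sketch below is only illustrative: the variable names (run1, ablated, deltas) are our own assumptions, the numbers are copied from Table 3, and the code is not part of the VENSES system, which is written in Prolog.

```python
# Accuracy figures (percent) copied from Table 3; all names are illustrative.
run1 = {"ir": 61.00, "qa": 65.00, "ie": 58.50}      # Accuracy1: Run 1, full system
ablated = {"ir": 65.50, "qa": 58.00, "ie": 52.50}   # Accuracy2: thesaurus matching disabled


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Change per subtask in percentage points (ablated run minus full system).
deltas = {task: round(ablated[task] - run1[task], 2) for task in run1}
avg_run1 = round(mean(run1.values()), 2)
avg_ablated = round(mean(ablated.values()), 2)

for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f}")
print(f"average: {avg_run1:.2f} -> {avg_ablated:.2f}")
```

Under these figures only the IR subtask gains from removing the thesaurus-based matching (+4.5 points), while QA and IE lose 7 and 6 points respectively, which is the asymmetry discussed in Section 4.1.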
References