Evaluating Part-of-Speech Tagging and Parsing
Patrick Paroubek
Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur
LIMSI-CNRS, Orsay, France
[email protected]
Abstract The aim of this chapter is to introduce the reader to the evaluation of part-of-
speech (POS) taggers and parsers. After a presentation of both POS tagging and
parsing, describing the tasks and the existing formalisms, we introduce general
considerations about the evaluation of Natural Language Processing (NLP). Then
we address the issue of segmenting the input data into linguistic units, a crucial
step in any evaluation related to language processing. We conclude with a review
of the current evaluation methodologies and of the average levels of performance
generally achieved for POS tagging and parsing.
1 POS Tagging
Part-of-speech (POS) tagging is the identification of the morphosyntactic
class of each word form using lexical and contextual information. Here is how
Brill’s tagger (Brill, 1995) tags the first sentence of this paragraph. Each line
holds respectively: a token number, a word form, a POS tag, and a short tag
description.
0 part-of-speech JJ adjective
1 tagging VBG verb, gerund or present participle
2 is VBZ verb, present tense, 3rd person, singular
3 the DT determiner
4 identification NN noun, singular or mass
5 of IN preposition or subordinating conjunction
6 the DT determiner
7 morphosyntactic JJ adjective
8 class NN noun, singular or mass
9 of IN preposition or subordinating conjunction
10 each DT determiner
11 word NN noun, singular or mass
12 form NN noun, singular or mass
13 using VBG verb, gerund or present participle
14 lexical JJ adjective
15 and CC conjunction, coordinating
16 contextual JJ adjective
17 information NN noun, singular or mass
Brill’s tagger uses the Penn Treebank1 tagset (Marcus et al., 1993). A
tagset comprises all the tags used to represent the various word classes. Ideally, a
tagset should be able to capture all the morphosyntactic information
present in the lexical descriptions of the words, if any is available; to encode
the information needed to disambiguate POS tags in context; and, last of all,
to represent the information that will be needed by the linguistic processing
for which POS tagging is a preliminary phase. We give below a short description
of the 36 tags of the Penn Treebank tagset (Marcus et al., 1993).
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-third person singular present
32. VBZ Verb, third person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
The selection of linguistic features from the lexical descriptions, and
how they are associated with POS tags, is always a difficult choice. Arbitrary
linguistic choices, the application for which tagging is done, the performance
expected of the tagger, and finally the disambiguation power offered by the
current language technology are all important factors in determining lexical
feature selection. For instance, Chanod and Tapanainen (1995) have shown
that one way to improve the performance of a POS tagger for French is to
exclude gender information from the tags of nouns and adjectives (there
is less ambiguity to solve, and therefore less chance for the tagger to make
an error). The gender information can be recovered afterwards by means of a
lexicon and a few rules (Tufis, 1999).
It is very difficult to draw a precise boundary around the morphosyntactic
information associated with POS tags, since it concerns morphology (e.g.,
verb tense), morphosyntax (e.g., the noun/verb distinction), syntax (e.g., identifi-
cation of the case of pronouns, accusative versus dative), and semantics (e.g., the
distinction between common and proper nouns). Often it is represented by lexi-
cal descriptions which make explicit the way linguistic features are organised
into a hierarchy and the constraints that hold between them (some features are
defined only for specific morphosyntactic categories; the notion of tense, for
instance, is restricted to verbs). Here is an example of a
lexical description of the word form “results”:
[ word form = ‘‘results’’
[ category = noun
subcategory = common
morphology = [ number = plural
gender = neuter2
lemma = ‘‘result’’ ]]
[ category = verb
subcategory = main
morphology = [ form = indicative
tense = present
number = singular
person = third
lemma = ‘‘result’’]]]
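Such a description maps naturally onto a nested data structure. Here is a minimal sketch of ours in Python (the field names simply mirror the example above), with one entry per possible reading of the ambiguous form:

# Illustrative sketch: a lexical description as a nested dictionary,
# with one reading per morphosyntactic category of "results".
import pprint

results_entry = {
    "word form": "results",
    "readings": [
        {"category": "noun",
         "subcategory": "common",
         "morphology": {"number": "plural",
                        "gender": "neuter",
                        "lemma": "result"}},
        {"category": "verb",
         "subcategory": "main",
         "morphology": {"form": "indicative",
                        "tense": "present",
                        "number": "singular",
                        "person": "third",
                        "lemma": "result"}},
    ],
}

# A lexical lookup returns all readings; a POS tagger picks one in context.
pprint.pprint([r["category"] for r in results_entry["readings"]])  # ['noun', 'verb']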
2 Parsing
Parsing is an analysis task which aims at identifying the constraints that control
the arrangement of the various linguistic units into sentences, and hence the
ordering of words. An automatic parser tries to extract from the textual data it
is given as input a description of the organisation and function of the linguistic
elements it finds in the data. The syntactic description can then be used by the
application for which the parser was developed.
In Natural Language Processing (NLP), parsing has been studied since
the early 1960s, first to develop theoretical models of human language syn-
tax and general “deep”3 parsers. After a period during which the formalisms
evolved to take into account more and more lexical information (linguistic
descriptions anchored in words), the last decade has seen renewed interest
Figure 1. An example of dependency annotation of the sentence “John likes to decide whatever suits her”, from Monceaux (2002); the figure shows a dependency tree whose arcs bear the labels main, subj, obj, and infmark.
describe a parser realised with finite state automata. An introduction to the use
of statistical methods for parsing can be found in Manning and Schütze (2002).
A presentation of the various approaches that have been tried for parsing, along
with the main milestones of the domain, is given in Wehrli (1997) and Abeillé
and Blache (2000); Abeillé (1993) describes the formalisms
inspired by logic programming (based on the unification operation),
like the “lexical functional grammar” (LFG), the “generalized phrase structure
grammar” (GPSG), the “head-driven phrase structure grammar” (HPSG), and
the “tree adjoining grammar” (TAG).
LFG is a lexical theory that represents grammatical structure by means of
two kinds of objects linked together by correspondences: the functional struc-
tures (f-structures), which express grammatical relations by means of attribute-
value pairs (attributes may be features such as tense, or functions such as
subject); and the constituent structures (c-structures), which have the form of
phrase structure trees. Information about the c-structure category of each word
as well as its f-structure is stored in the lexicon. The grammar rules encode con-
straints between the f-structure of any non-terminal node and the f-structures
of its daughter nodes. The functional structure must validate the completeness
and coherence condition: all grammatical functions required by a predicate
must be present but no other grammatical function may be present.
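As an illustration, the f-structure of the sentence “The program prints results” discussed earlier could be sketched as the following attribute-value structure (a simplified rendering of ours, not taken from the LFG literature):

# Hypothetical, simplified f-structure for "The program prints results".
# PRED names the predicate and the grammatical functions it requires;
# SUBJ and OBJ are themselves (embedded) f-structures.
f_structure = {
    "PRED": "print<SUBJ, OBJ>",
    "TENSE": "present",
    "SUBJ": {"PRED": "program", "DEF": True, "NUM": "sg"},
    "OBJ": {"PRED": "result", "NUM": "pl"},
}

# Completeness and coherence: every function required by PRED is present,
# and no other governable grammatical function appears.
governable = {"SUBJ", "OBJ", "OBL", "COMP", "XCOMP"}
assert {k for k in f_structure if k in governable} == {"SUBJ", "OBJ"}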
In GPSG, phrase structure is encoded by means of context-free rules, which
are divided into immediate dominance rules and linear precedence rules. The
formalism is equipped with the so-called slash feature to handle unbounded
movements in a context-free fashion. GPSG offers a high-level, compact
representation of language, at the cost of sometimes problematic computation.
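The division of labour between the two rule types can be illustrated with a toy fragment (a sketch of ours, not a GPSG implementation): immediate dominance rules state which daughters a node may have, while linear precedence rules state the order in which the daughters may appear.

# Hypothetical toy ID/LP grammar fragment.
ID_RULES = {"S": {"NP", "VP"}, "VP": {"V", "NP"}}   # what a node may contain
LP_RULES = {("NP", "VP"), ("V", "NP")}              # left category < right category

def respects_lp(daughters):
    """Check that no ordered pair of daughters violates an LP rule."""
    return not any((b, a) in LP_RULES
                   for i, a in enumerate(daughters)
                   for b in daughters[i + 1:])

print(respects_lp(["NP", "VP"]))  # True
print(respects_lp(["VP", "NP"]))  # False: LP requires NP < VP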
EAGLES (King and Maegaard, 1998) used the role of the human operator as
a guide to recast the question of evaluation in terms of the users’ perspective. The
resulting evaluation methodology is centred on the consumer report paradigm.
EAGLES distinguishes three kinds of evaluation: adequacy evaluation (does the
system fit the users’ needs), progress evaluation (how does the current version
of a system compare with previous ones), and diagnostic evaluation (where does
the system fail).
The word classes are either a refinement of the ones inherited from
Latin grammar (where, for instance, the class of nouns groups the words des-
ignating entities, objects, notions, and concepts), classes inferred from statistical
data according to an arbitrary feature set, or a mix of both.
If we use basic linguistic terminology in the example of “The program prints
results”, POS tagging will identify the word form “prints” as a verb in the third
person singular of the present indicative (and not as a noun), and parsing
will tell us that the form “program” is the subject of the verb form “prints”,
and that the form “results” is the direct object complement of the verb form
“prints”.
Note that the majority of parsing algorithms require the result of a prelim-
inary POS tagging analysis or incorporate a POS tagging function. Note also
that the definitions we have just given of POS tagging and parsing rely on the
definition of what constitutes a word, a not-so-trivial question, as we will see in
Section 3.1.
Table 1. Example of error amplification when using token segmentation instead of word
segmentation (two errors instead of one).
Figure 2. Variation of POS tagging accuracy depending on text genre. The graph (Illouz,
2000) gives the number of texts of a given genre (ordinate) as a function of tagging precision
(abscissa), measured on the Brown corpus (500 texts of 2,000 words each), with the Tree Tagger
using the Penn Treebank tagset.
Rajman, 2002) does not provide enough information on either the lan-
guage coverage or the robustness of the tagger.
Not only the size of the corpus but also its type can influence
the accuracy measure. To show how the performance of a POS tagger varies
depending on the kind of data it processes, we give in Figure 2 the variation
in tagging accuracy of the Tree Tagger (a freely available probabilistic POS
tagger which uses the Penn Treebank tagset) as a function of text genre,
measured on the Brown corpus (500 texts of 2,000 words each). The accuracy
varies from 85% to 98%, with an average value of 94.6% (Illouz, 2000). Of
course, it is recommended to test on material different from that which
served to train the system, since performance will invariably be better
on the training material (van Halteren, 1999).
Things get more complicated as soon as we consider cases other
than the one in which both the tagger and the reference data assign exactly
one tag per token. The accuracy measure then no longer permits a fair com-
parison between different taggers if they are allowed to propose partially
disambiguated taggings. Van Halteren (1999) proposes in such cases to use
the average tagging perplexity, i.e., the average number of tags per word as-
signed by the system,11 or to have recourse to precision and recall, the now
well-known evaluation measures from Information Retrieval.
Let us denote by $t_i$ the set of tags assigned to the $i$-th word form $w_i$ by
a tagger, and by $r_i$ the set of tags assigned to the same word form in the refer-
ence annotations. The precision and recall for this word form are,
respectively, the ratio of the number of correct tags to the number of tags
assigned by the system, $P(w_i) = \frac{|t_i \cap r_i|}{|t_i|}$, and the ratio of the number of correct
tags to the number of tags assigned in the reference, $R(w_i) = \frac{|t_i \cap r_i|}{|r_i|}$. By aver-
aging these quantities over all the word forms, we obtain the measures over
the whole corpus, $P = \frac{1}{N} \sum_{i=1}^{N} P(w_i)$, and similarly
for $R$. Often precision and recall are combined into one single value,
the f-measure, whose formula accepts as a parameter the relative importance12
$\alpha$ given to precision over recall: $F = \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}}$ (Manning and Schütze, 2002).
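A minimal sketch of ours of these three measures, assuming each tagging is represented as one set of tags per word, and that both the tagger and the reference assign at least one tag to every word:

def tagging_scores(system_tags, reference_tags, alpha=0.5):
    """Precision, recall, and f-measure for possibly ambiguous taggings.
    Both arguments are lists of sets of tags, one set per word form."""
    n = len(system_tags)
    precision = sum(len(t & r) / len(t)
                    for t, r in zip(system_tags, reference_tags)) / n
    recall = sum(len(t & r) / len(r)
                 for t, r in zip(system_tags, reference_tags)) / n
    f_measure = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f_measure

# Example: the tagger leaves the second word partially disambiguated.
system = [{"DT"}, {"NN", "VBZ"}, {"VBZ"}]
reference = [{"DT"}, {"NN"}, {"VBZ"}]
print(tagging_scores(system, reference))  # (0.833..., 1.0, 0.909...)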
In the very frequent case where only one tag per word form is assigned in
the reference annotation, precision and recall have very intuitive interpretations.
Recall is the proportion of words whose assigned tags include the correct one.
Precision is the ratio between the recall and the average number of tags assigned
per word by the tagger. This second quantity is closely related to the average ambi-
guity (Tufis and Mason, 1998), the average number of tags assigned by a
lexicon to the words of a corpus, which integrates both the a priori ambiguity of
the corpus and the delicacy13 of the tagset used in the lexicon. Average am-
biguity can be used to quantify the relative difficulty of the task of
tagging the corpus, i.e., how much ambiguity remains to be solved, since some
word forms already have an unambiguous tagging in the lexicon.
Note that precision is a global performance measure which gives no
information about the distribution of errors over the various linguistic
phenomena or the various genres of text, or about the types of error made. Two
taggers with similar precision values do not necessarily make the same
errors at the same locations. Therefore, it may be of interest to quantify the
similarity between two taggings of the same text. A measure initially
developed for this very purpose, albeit for human annotators, is the κ
(kappa) coefficient (Carletta, 1996), which compensates for the cases where
the two taggings agree by chance.
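For two fully disambiguated taggings of the same text, the coefficient can be sketched as follows (a sketch of ours; chance agreement is estimated here from the two taggings' own tag distributions, one common convention):

from collections import Counter

def kappa(tags_a, tags_b):
    """Kappa coefficient between two sequences of tags for the same words."""
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Probability that the two taggings agree on a word by chance.
    chance = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (observed - chance) / (1.0 - chance)

print(kappa(["DT", "NN", "VBZ", "NN"], ["DT", "NN", "NN", "NN"]))  # ~0.56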
Other approaches use measures from Information Theory (Resnik and
Yarowsky, 1997), like the per-word cross-entropy, which measures the dis-
tance between a stochastic process q and a reference stochastic process p. In
this approach, tagging is considered a stochastic process which associates
with each word form a probability distribution over the set of tags. If we suppose
that the reference process is stationary14 and ergodic,15 and that two subse-
quent taggings are independent events, then for a sufficiently large corpus
the cross-entropy can be computed easily (Cover and Thomas, 1991).
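Under these assumptions, the per-word cross-entropy amounts to averaging $-\sum_t p(t|w_i) \log_2 q(t|w_i)$ over the corpus. A minimal sketch of ours, representing each distribution as a dictionary from tags to probabilities (and assuming q is non-zero wherever p is):

import math

def per_word_cross_entropy(p_dists, q_dists):
    """Average cross-entropy (bits per word) between reference (p) and
    system (q) tag distributions, given one distribution per word."""
    total = 0.0
    for p, q in zip(p_dists, q_dists):
        total -= sum(prob * math.log2(q[tag])
                     for tag, prob in p.items() if prob > 0)
    return total / len(p_dists)

reference = [{"NN": 1.0}, {"VBZ": 1.0}]
system = [{"NN": 0.9, "VBZ": 0.1}, {"VBZ": 0.8, "NN": 0.2}]
print(per_word_cross_entropy(reference, system))  # ~0.24 bits per word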
Let us mention another pair of measures, which was used in the GRACE
evaluation campaign (Adda et al., 1999): precision and decision. Precision
measures how often a word was assigned a single, correct tag. Decision
measures the ratio between the number of words which were
assigned a single tag and the total number of words. The originality of this
pair of measures lies in the possibility of plotting the whole range of performance
values reachable by a system if one were to disambiguate some or all of
the taggings left ambiguous by the tagger.
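A minimal sketch of ours of the two measures, reading precision, in line with note 16, as being computed over the words that received a single tag:

def grace_measures(system_tags, reference_tags):
    """system_tags: list of sets of tags (possibly ambiguous);
    reference_tags: list of singleton sets (one correct tag per word)."""
    decided = [(t, r) for t, r in zip(system_tags, reference_tags)
               if len(t) == 1]
    precision = sum(t == r for t, r in decided) / len(decided)
    decision = len(decided) / len(system_tags)
    return precision, decision

# Three of four words are decided; two of those three are correct.
system = [{"DT"}, {"NN", "VBZ"}, {"NN"}, {"VBZ"}]
reference = [{"DT"}, {"NN"}, {"NN"}, {"VBD"}]
print(grace_measures(system, reference))  # (0.666..., 0.75)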
In the literature, most results mention precision values which are
almost always greater than 90% and sometimes reach 99%. Already in DeRose
(1988), the Volsunga tagger achieved 96% precision for English on the
Brown corpus. The best result in the GRACE evaluation of French taggers
was 97.8% precision on a corpus of classic literature and of the Le Monde news-
paper. In the same evaluation, a lexical tagging (assigning to each word form
all the tags found for it in the lexicon) achieved 88% precision.
This result dropped to 59% precision16 when a few contextual rule files were
applied in an attempt to artificially reduce the ambiguous taggings to one single tag
per word. But let us remind the reader that all these measures must be con-
sidered with caution, since they depend strongly on the size and composition of
the tagset, as well as on the segmentation algorithms and on the genre of the
text processed. Furthermore, evaluation results are given on a per-word basis,
which is not necessarily an appropriate unit for some applications, where units
like the sentence, the paragraph, or the document are often more pertinent. For
instance, for a 15-word sentence and a tagging precision of 96% at the word
level, we only get a tagging precision of 0.96^15 ≈ 54.2% at the sentence level, i.e., almost
one sentence in two contains a tagging error. Conversely, to achieve 95% tagging
precision at the sentence level, we would need a tagger achieving
0.95^(1/15) ≈ 99.66% precision at the word level.
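The arithmetic behind these two figures, assuming tagging errors are independent across the words of a sentence:

# Sentence-level precision for 15-word sentences: all 15 words must be
# tagged correctly, assuming independent word-level errors.
print(0.96 ** 15)        # ~0.542: sentence-level precision
# Word-level precision needed for 95% sentence-level precision.
print(0.95 ** (1 / 15))  # ~0.9966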
Although POS tagging seems a far simpler task than parsing, a
POS tagger is a complex system combining several functions (tokeniser,
word/sentence segmenter, context-free tagger, POS tag disambiguator) which
may use external linguistic resources like a lexicon and a tagset. Evaluat-
ing such systems implies clear choices about the criteria that will effec-
tively be taken into account during evaluation. Evaluation cannot be reduced
to the simple measurement of tagging accuracy: factors like processing
speed (number of words tagged per second), software portability (on which
operating systems can the tagger run, and how easily can it be integrated with other
modules), robustness (is the system tolerant to large variations in the charac-
teristics of the input data), the delicacy of the tagset (how fine a linguistic distinction
can be made between two word classes), and the multilingualism of the system
all constitute different dimensions of the evaluation space, the importance of
which varies depending on the purpose of the evaluation.
<NV>Il arrive</NV> <GP>en retard</GP>, avec, <GP>dans sa poche</GP>, <GN>un discours</GN> qu'<NV>il est</NV> <GA>obligé</GA> <PV>de garder</PV>
Figure 3. Example of reference annotation of the EASY evaluation campaign (constituent
brackets shown; in the original figure the constituents are further linked by the relation labels
suj, cod, comp, cpl-v, and mod-n) for the sentence: “He arrives late, with, in his pocket, a
speech which he is obliged to keep.”
Table 2. Performance range of four parsers of French, and of their combination, on the
questions of the TREC Question Answering track corpus.

                         Precision           Recall
Noun phrase              31.5% to 86.6%      38.7% to 86.6%
Verb phrase              85.6% to 98.6%      80.5% to 98.6%
Prepositional phrase     60.5% to 100%       60.5% to 100%
6 Conclusion
When POS taggers and parsers are integrated in an application, only quan-
titative black-box methodologies are available to gauge their performance. This
approach is characteristic of technology-oriented evaluation, which mostly
interests integrators and developers, contrary to user-oriented evaluation, for
which the interaction with the end user is a key element of the evaluation
process.
Although corpus-based automatic evaluation procedures do provide
most of the information needed to assess the performance of a POS tagger
or parser, recourse to the opinion of a domain expert is essential,
not only to interpret the results returned by the automatic
evaluation procedures, but also to provide the knowledge needed to define the
conditions under which the evaluation measures will be taken.
POS tagging evaluation methodology is now mature, and there are enough
results in the literature to compare POS taggers on sufficiently sound
grounds, provided one has the proper evaluation tools and an annotated corpus,
the cost of which is rather high, not only because of the manpower needed but
also because of the annotation quality required.
For parsing, the situation is less clear, if only because of the greater
variety of the syntactic formalisms and of the analysis algorithms; it is very
difficult to compare on a fair basis systems that use different formalisms.
However, the situation is beginning to change with the emergence of new evalu-
ation protocols based on grammatical relations (Carroll et al., 2003) instead
of constituents, and of large-scale evaluation campaigns, like the EASY-
EVALDA campaign of the French TECHNOLANGUE program for parsers of French (Vilnat
et al., 2003).
Notes
1. A Treebank is a large corpus completely annotated with syntactic information (trees) in a consistent
way.
2. In English, gender for nouns is only useful for analysing constructions with pronouns.
3. A “deep” parser describes for all the word forms of a sentence, in a complete and consistent way, the
various linguistic elements present in the sentence and the structures they form; on the contrary, a “shallow”
parser only provides a partial description of the structures.
4. This is particularly true of any batch-processing activity like POS tagging and parsing.
5. Texts of all kinds, including emails and transcriptions produced by automatic speech recognition.
6. We will refrain from using the term type to refer to word forms, to avoid any confusion with other
meanings of this term.
7. Languages like Chinese are written without word separators.
8. Tokens are indexed with indices made of the position of the current token in the compound word,
associated with the total number of tokens in the compound, e.g., of/1.2 course/2.2.
9. Transcription of oral dialogues, recorded in various everyday life situations.
10. The error rate is simply the complement of the accuracy: error rate = 1 − accuracy.
11. Note that this measure is most meaningful when given together with the corresponding standard
deviation.
12. In general α = 0.5.
13. The level of refinement in linguistic distinction offered by the tagset, in general, correlated with the
number of tags: the finer the distinctions, the larger the tagset.
14. A stochastic process is stationary when its statistical characteristics do not depend on the initial
conditions.
15. A stochastic process is ergodic when observations made over time on a single realisation of the
process coincide with observations made over the same states on a large number of realisations.
16. The precision decreases because, as the ambiguous taggings are resolved, they become unambiguous
and are thus taken into account in the computation of the precision, whereas before they were only taken
into account in the measurement of the decision.
17. Here is an example where the A parentheses cross the B parentheses: (A (B A) B).
18. Grammar Evaluation Interest Group.
References
Abeillé, A. (1991). Analyseurs syntaxiques du français. Bulletin Semestriel de
l’Association pour le Traitement Automatique des Langues, 32:107–120.
Abeillé, A. (1993). Les nouvelles syntaxes. Armand Colin, Paris, France.