Nested Named Entity Recognition

Jenny Rose Finkel and Christopher D. Manning


Computer Science Department
Stanford University
Stanford, CA 94305
{jrfinkel|manning}@cs.stanford.edu

Abstract

Many named entities contain other named entities inside them. Despite this fact, the field of named entity recognition has almost entirely ignored nested named entity recognition, but due to technological, rather than ideological reasons. In this paper, we present a new technique for recognizing nested named entities, by using a discriminative constituency parser. To train the model, we transform each sentence into a tree, with constituents for each named entity (and no other syntactic structure). We present results on both newspaper and biomedical corpora which contain nested named entities. In three out of four sets of experiments, our model outperforms a standard semi-CRF on the more traditional top-level entities. At the same time, we improve the overall F-score by up to 30% over the flat model, which is unable to recover any nested entities.

1 Introduction

Named entity recognition is the task of finding entities, such as people and organizations, in text. Frequently, entities are nested within each other, such as Bank of China and University of Washington, both organizations with nested locations. Nested entities are also common in biomedical data, where different biological entities of interest are often composed of one another. In the GENIA corpus (Ohta et al., 2002), which is labeled with entity types such as protein and DNA, roughly 17% of entities are embedded within another entity. In the AnCora corpus of Spanish and Catalan newspaper text (Martí et al., 2007), nearly half of the entities are embedded. However, work on named entity recognition (NER) has almost entirely ignored nested entities and instead chosen to focus on the outermost entities.

We believe this has largely been for practical, not ideological, reasons. Most corpus designers have chosen to skirt the issue entirely, and have annotated only the topmost entities. The widely used CoNLL (Sang and Meulder, 2003), MUC-6, and MUC-7 NER corpora, composed of American and British newswire, are all flatly annotated. The GENIA corpus contains nested entities, but the JNLPBA 2004 shared task (Collier et al., 2004), which utilized the corpus, removed all embedded entities for the evaluation. To our knowledge, the only shared task which has included nested entities is the SemEval 2007 Task 9 (Márquez et al., 2007b), which used a subset of the AnCora corpus. However, in that task all entities corresponded to particular parts of speech or noun phrases in the provided syntactic structure, and no participant directly addressed the nested nature of the data.

Another reason for the lack of focus on nested NER is technological. The NER task arose in the context of the MUC workshops, as small chunks which could be identified by finite state models or gazetteers. This then led to the widespread use of sequence models, first hidden Markov models, then conditional Markov models (Borthwick, 1999), and, more recently, linear chain conditional random fields (CRFs) (Lafferty et al., 2001). All of these models suffer from an inability to model nested entities.

In this paper we present a novel solution to the problem of nested named entity recognition. Our model explicitly represents the nested structure, allowing entities to be influenced not just by the labels of the words surrounding them, as in a CRF, but also by the entities contained in them, and in which they are contained. We represent each sentence as a parse tree, with the words as leaves, and with phrases corresponding to each entity (and a node which joins the entire sentence). Our trees look just like syntactic constituency trees, such as those in the Penn TreeBank (Marcus et al., 1993), but they tend to be much flatter.
[Figure 1 (tree not reproduced): An example of our tree representation over nested named entities. The sentence is from the GENIA corpus: "PEBP2 alpha A1, alpha B1, and alpha B2 proteins bound the PEBP2 site within the mouse GM-CSF promoter." PROT is short for PROTEIN.]

This model allows us to include parts of speech in the tree, and therefore to jointly model the named entities and the part of speech tags. Once we have converted our sentences into parse trees, we train a discriminative constituency parser similar to that of (Finkel et al., 2008). We found that on top-level entities, our model does just as well as more conventional methods. When evaluating on all entities our model does well, with F-scores ranging from slightly worse than performance on top-level only, to substantially better than top-level only.

2 Related Work

There is a large body of work on named entity recognition, but very little of it addresses nested entities. Early work on the GENIA corpus (Kazama et al., 2002; Tsuruoka and Tsujii, 2003) only worked on the innermost entities. This was soon followed by several attempts at nested NER in GENIA (Shen et al., 2003; Zhang et al., 2004; Zhou et al., 2004) which built hidden Markov models over the innermost named entities, and then used a rule-based post-processing step to identify the named entities containing the innermost entities. Zhou (2006) used a more elaborate model for the innermost entities, but then used the same rule-based post-processing method on the output to identify non-innermost entities. Gu (2006) focused only on proteins and DNA, by building separate binary SVM classifiers for innermost and outermost entities for those two classes.

Several techniques for nested NER in GENIA were presented in (Alex et al., 2007). Their first approach was to layer CRFs, using the output of one as the input to the next. For inside-out layering, the first CRF would identify the innermost entities, the next layer would be over the words and the innermost entities to identify second-level entities, etc. For outside-in layering the first CRF would identify outermost entities, and then successive CRFs would identify increasingly nested entities. They also tried a cascaded approach, with separate CRFs for each entity type. The CRFs would be applied in a specified order, and then each CRF could utilize features derived from the output of previously applied CRFs. This technique has the problem that it cannot identify nested entities of the same type; this happens frequently in the data, such as the nested proteins at the beginning of the sentence in Figure 1. They also tried a joint labeling approach, where they trained a single CRF, but the label set was significantly expanded so that a single label would include all of the entities for a particular word. Their best results were from the cascaded approach.

Byrne (2007) took a different approach, on historical archive text. She modified the data by concatenating adjacent tokens (up to length six) into potential entities, and then labeled each concatenated string using the C&C tagger (Curran and Clark, 1999). When labeling a string, the "previous" string was the one-token-shorter string containing all but the last token of the current string. For single tokens the "previous" token was the longest concatenation starting one token earlier.

SemEval 2007 Task 9 (Márquez et al., 2007b) included a nested NER component, as well as noun sense disambiguation and semantic role labeling. However, the parts of speech and syntactic tree were given as part of the input, and named entities were specified as corresponding to noun phrases in the tree, or particular parts of speech. This restriction substantially changes the task. Two groups participated in the shared task, but only one (Márquez et al., 2007a) worked on the named entity component. They used a multi-label AdaBoost.MH algorithm, over phrases in the parse tree which, based on their labels, could potentially be entities.
[Figure 2 (tree not reproduced): An example of a subtree, for the phrase "mouse GM-CSF promoter", after it has been annotated and binarized. Features are computed over this representation. An @ indicates a chart parser active state (incomplete constituent).]

Finally, McDonald et al. (2005) presented a technique for labeling potentially overlapping segments of text, based on a large margin, multilabel classification algorithm. Their method could be used for nested named entity recognition, but the experiments they performed were on joint (flat) NER and noun phrase chunking.

3 Nested Named Entity Recognition as Parsing

Our model is quite simple – we represent each sentence as a constituency tree, with each named entity corresponding to a phrase in the tree, along with a root node which connects the entire sentence. No additional syntactic structure is represented. We also model the parts of speech as preterminals, and the words themselves as the leaves. See Figure 1 for an example of a named entity tree. Each node is then annotated with both its parent and grandparent labels, which allows the model to learn how entities nest. We binarize our trees in a right-branching manner, and then build features over the labels, unary rules, and binary rules. We also use first-order horizontal Markovization, which allows us to retain some information about the previous node in the binarized rule. See Figure 2 for an example of an annotated and binarized subtree. Once each sentence has been converted into a tree, we train a discriminative constituency parser, based on (Finkel et al., 2008).

It is worth noting that if you use our model on data which does not have any nested entities, then it is precisely equivalent to a semi-CRF (Sarawagi and Cohen, 2004; Andrew, 2006), but with no length restriction on entities. Like a semi-CRF, we are able to define features over entire entities of arbitrary length, instead of just over a small, fixed window of words like a regular linear chain CRF.

We model part of speech tags jointly with the named entities, though the model also works without them. We determine the possible part of speech tags based on distributional similarity clusters. We used Alexander Clark's software,[1] based on (Clark, 2003), to cluster the words, and then allow each word to be labeled with any part of speech tag seen in the data with any other word in the same cluster. Because the parts of speech are annotated with the parent (and grandparent) labels, they determine what, if any, entity types a word can be labeled with. Many words, such as verbs, cannot be labeled with any entities. We also limit our grammar based on the rules observed in the data. The rules whose children include part of speech tags restrict the possible pairs of adjacent tags. Interestingly, the restrictions imposed by this joint modeling (both observed word/tag pairs and observed rules) actually result in much faster inference (and therefore faster train and test times) than a model over named entities alone. This is different from most work on joint modeling of multiple levels of annotation, which usually results in significantly slower inference.

[1] http://www.cs.rhul.ac.uk/home/alexc/RHUL/Downloads.html
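To make the encoding described above concrete, the following is a minimal sketch (ours, not the authors' code; the Node and build_entity_tree names are hypothetical) of converting a sentence with nested entity spans into the entity-only tree: entity phrases over POS preterminals over word leaves, all joined under a ROOT node. Parent/grandparent annotation and right-branching binarization would be applied to this tree afterwards.

```python
# Sketch of the tree encoding: entities become phrase nodes, parts of speech
# are preterminals, words are leaves, and a ROOT node joins the sentence.
# Entity spans are (start, end, label) with end exclusive; nested spans must
# be properly contained in their parents.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def build_entity_tree(words, pos_tags, entities):
    """Build a ROOT tree whose only internal structure is the entity nesting."""
    # Process outer spans before the spans they contain.
    spans = sorted(entities, key=lambda s: (s[0], -(s[1] - s[0])))

    def build(start, end, remaining):
        children = []
        i = start
        while i < end:
            # Outermost remaining entity span starting at position i, if any.
            span = next((s for s in remaining if s[0] == i and s[1] <= end), None)
            if span is not None:
                s, e, label = span
                rest = [x for x in remaining if x is not span]
                children.append(Node(label, build(s, e, rest)))
                i = e
            else:
                # POS preterminal over a single word leaf.
                children.append(Node(pos_tags[i], [Node(words[i])]))
                i += 1
        return children

    return Node("ROOT", build(0, len(words), spans))


# Example with hypothetical labels: "Bank of China" as an ORG with a nested LOC.
tree = build_entity_tree(
    ["Bank", "of", "China"],
    ["NNP", "IN", "NNP"],
    [(0, 3, "ORG"), (2, 3, "LOC")],
)
```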
Local Features:
  label_i
  word_i + label_i
  word_{i-1} + label_i
  word_{i+1} + label_i
  distsim_i + label_i
  distsim_{i-1} + label_i
  distsim_{i+1} + label_i
  shape_i + label_i
  shape_{i-1} + label_i
  shape_{i+1} + label_i
  distsim_i + distsim_{i-1} + label_i
  shape_i + shape_{i+1} + label_i
  shape_{i-1} + shape_i + label_i
  word_{i-1} + shape_i + label_i
  shape_i + word_{i+1} + label_i
  words in a 5 word window
  prefixes up to length 6
  suffixes up to length 6

Pairwise Features:
  label_{i-1} + label_i
  word_i + label_{i-1} + label_i
  word_{i-1} + label_{i-1} + label_i
  word_{i+1} + label_{i-1} + label_i
  distsim_i + label_{i-1} + label_i
  distsim_{i-1} + label_{i-1} + label_i
  distsim_{i+1} + label_{i-1} + label_i
  distsim_{i-1} + distsim_i + label_{i-1} + label_i
  shape_i + label_{i-1} + label_i
  shape_{i-1} + label_{i-1} + label_i
  shape_{i+1} + label_{i-1} + label_i
  shape_{i-1} + shape_i + label_{i-1} + label_i
  shape_{i-1} + shape_{i+1} + label_{i-1} + label_i

Table 1: The local and pairwise NER features used in all of our experiments. Consult the text for a full description of all features, which includes feature classes not in this table.
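To make the templates in Table 1 concrete, here is a minimal sketch (our own, with a hypothetical helper name) of how a few local and pairwise feature strings could be instantiated for one word position, given parallel per-token lists of words, labels, distributional similarity clusters, and shapes.

```python
# Sketch (hypothetical helper, not the authors' code) of instantiating a few
# of the Table 1 templates as feature strings for position i.
def local_and_pairwise_features(i, words, labels, distsims, shapes):
    feats = [
        f"label={labels[i]}",                          # label_i
        f"word+label={words[i]}|{labels[i]}",          # word_i + label_i
        f"distsim+label={distsims[i]}|{labels[i]}",    # distsim_i + label_i
        f"shape+label={shapes[i]}|{labels[i]}",        # shape_i + label_i
    ]
    if i > 0:
        feats += [
            f"labels={labels[i-1]}|{labels[i]}",                  # label_{i-1} + label_i
            f"word+labels={words[i]}|{labels[i-1]}|{labels[i]}",  # word_i + label_{i-1} + label_i
        ]
    return feats
```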

3.1 Discriminative Constituency Parsing

We train our nested NER model using the same technique as the discriminatively trained, conditional random field-based, CRF-CFG parser of (Finkel et al., 2008). The parser is similar to a chart-based PCFG parser, except that instead of putting probabilities over rules, it puts clique potentials over local subtrees. These unnormalized potentials know what span (and split) the rule is over, and arbitrary features can be defined over the local subtree, the span/split and the words of the sentence. The inside-outside algorithm is run over the clique potentials to produce the partial derivatives and normalizing constant which are necessary for optimizing the log likelihood. Optimization is done by stochastic gradient descent.

The only real drawback to our model is runtime. The algorithm is O(n³) in sentence length. Training on all of GENIA took approximately 23 hours for the nested model and 16 hours for the semi-CRF. A semi-CRF with an entity length restriction, or a regular CRF, would both have been faster. At runtime, the nested model for GENIA tagged about 38 words per second, while the semi-CRF tagged 45 words per second. For comparison, a first-order linear chain CRF trained with similar features on the same data can tag about 4,000 words per second.

4 Features

When designing features, we first made ones similar to the features typically designed for a first-order CRF, and then added features which are not possible in a CRF, but are possible in our enhanced representation. This includes features over entire entities, features which directly model nested entities, and joint features over entities and parts of speech. When features are computed over each label, unary rule, and binary rule, the feature function is aware of the rule span and split.

Each word is labeled with its distributional similarity cluster (distsim), and a string indicating orthographic information (shape) (Finkel et al., 2005). Subscripts represent word position in the sentence. In addition to those below, we include features for each fully annotated label and rule.

Local named entity features. Local named entity features are over the label for a single word. They are equivalent to the local features in a linear chain CRF. However, unlike in a linear chain CRF, if a word belongs to multiple entities then the local features are computed for each entity. Local features are also computed for words not contained in any entity. Local features are in Table 1.

Pairwise named entity features. Pairwise features are over the labels for adjacent words, and are equivalent to the edge features in a linear chain CRF. They can occur when pairs of words have the same label, or over entity boundaries where the words have different labels. Like with the local features, if a pair of words are contained in, or straddle the border of, multiple entities, then the features are repeated for each. The pairwise features we use are shown in Table 1.

Embedded named entity features. Embedded named entity features occur in binary rules where one entity is the child of another entity. For our embedded features, we replicated the pairwise features, except that the embedded named entity was treated as one of the words, where the "word" (and other annotations) were indicative of the type of entity, and not the actual string that is the entity. For instance, in the subtree in Figure 2, we would compute word_i + label_{i-1} + label_i as PROT-DNA-DNA for i = 18 (the index of the word GM-CSF). The normal pairwise feature at the same position would be GM-CSF-DNA-DNA.
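All of these feature classes enter the model through the clique potentials of Section 3.1. As a reading aid, and in our own notation rather than anything taken verbatim from the paper or from (Finkel et al., 2008), the conditional likelihood being optimized can be sketched as follows, where t is a candidate tree for sentence s, r ranges over the local subtrees (rules with their spans and splits) of t, f is the feature vector, and θ the learned weights:

```latex
% Sketch of the CRF-CFG objective (our notation; see Finkel et al., 2008).
% The score of a tree is the sum of clique potentials over its local subtrees,
% and Z(s) sums over all trees licensed by the grammar for sentence s.
\[
  P(t \mid s; \theta) \;=\;
    \frac{\exp \sum_{r \in t} \theta^{\top} f(r, s)}
         {\sum_{t' \in \mathcal{T}(s)} \exp \sum_{r \in t'} \theta^{\top} f(r, s)}
\]
\[
  \mathcal{L}(\theta) \;=\; \sum_{(t, s) \in \mathcal{D}} \log P(t \mid s; \theta)
  \quad \text{(optimized by stochastic gradient descent, with gradients from inside--outside)}
\]
```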
GENIA – Testing on All Entities

              # Test    Nested NER Model                 Semi-CRF Model
              Entities  (train on all entities)          (train on top-level entities)
                        Precision   Recall   F1          Precision   Recall   F1
Protein       3034      79.04       69.22    73.80       78.63       64.04    70.59
DNA           1222      69.61       61.29    65.19       71.62       57.61    63.85
RNA           103       86.08       66.02    74.73       79.27       63.11    70.27
Cell Line     444       73.82       56.53    64.03       76.59       59.68    67.09
Cell Type     599       68.77       65.44    67.07       72.12       59.60    65.27
Overall       5402      75.39       65.90    70.33       76.17       61.72    68.19

Table 2: Named entity results on GENIA, evaluating on all entities.


GENIA – Testing on Top-level Entities Only

              # Test    Nested NER Model                 Semi-CRF Model
              Entities  (train on all entities)          (train on top-level entities)
                        Precision   Recall   F1          Precision   Recall   F1
Protein       2592      78.24       72.42    75.22       76.16       72.61    74.34
DNA           1129      70.40       64.66    67.41       71.21       62.00    66.29
RNA           103       86.08       66.02    74.73       79.27       63.11    70.27
Cell Line     420       75.54       58.81    66.13       76.59       63.10    69.19
Cell Type     537       69.36       70.39    69.87       71.11       65.55    68.22
Overall       4781      75.22       69.02    71.99       74.57       68.27    71.28

Table 3: Named entity results on GENIA, evaluating on only top-level entities.

Whole entity features. We had four whole entity features: the entire phrase; the preceding and following word; the preceding and following distributional similarity tags; and the preceding distributional similarity tag with the following word.

Local part of speech features. We used the same POS features as (Finkel et al., 2008).

Joint named entity and part of speech features. For the joint features we replicated the POS features, but included the parent of the POS, which either is the innermost entity type, or would indicate that the word is not in any entities.

5 Experiments

We performed two sets of experiments, the first set over biomedical data, and the second over Spanish and Catalan newspaper text. We designed our experiments to show that our model works just as well on outermost entities, the typical NER task, and also works well on nested entities.

5.1 GENIA Experiments

5.1.1 Data

We performed experiments on the GENIA v.3.02 corpus (Ohta et al., 2002). This corpus contains 2000 Medline abstracts (≈500k words), annotated with 36 different kinds of biological entities, and with parts of speech. Previous NER work using this corpus has employed 10-fold cross-validation for evaluation. We wanted to explore different model variations (e.g., level of Markovization, and different sets of distributional similarity clusterings) and feature sets, so we needed to set aside a development set. We split the data by putting the first 90% of sentences into the training set, and the remaining 10% into the test set. This is the exact same split used to evaluate part of speech tagging in (Tsuruoka et al., 2005). For development we used the first half of the data to train, and the next quarter of the data to test.[2] We made the same modifications to the label set as the organizers of the JNLPBA 2004 shared task (Collier et al., 2004). They collapsed all DNA subtypes into DNA; all RNA subtypes into RNA; all protein subtypes into protein; kept cell line and cell type; and removed all other entities. However, they also removed all embedded entities, while we kept them.

As discussed in Section 3, we annotated each word with a distributional similarity cluster. We used 200 clusters, trained using 200 million words from PubMed abstracts. During development, we found that fewer clusters resulted in slower inference with no improvement in performance.

[2] This split may seem strange: we had originally intended a 50/25/25 train/dev/test split, until we found the previously used 90/10 split.
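As a concrete illustration of the label-set modification described above, a collapsing step could look like the sketch below. This is our own sketch, not the shared-task organizers' scripts, and the example GENIA subtype names are assumptions about the corpus labeling.

```python
# Sketch (ours, not the JNLPBA organizers' scripts) of collapsing GENIA entity
# subtypes to the five classes used here: protein, DNA, RNA, cell_line, cell_type.
# Any other entity type is dropped (mapped to None).
KEEP = {"cell_line", "cell_type"}
COLLAPSE_PREFIXES = ("protein", "DNA", "RNA")


def collapse_label(genia_label):
    """Map a GENIA subtype (e.g. the assumed 'DNA_domain_or_region') to its class."""
    if genia_label in KEEP:
        return genia_label
    for prefix in COLLAPSE_PREFIXES:
        if genia_label == prefix or genia_label.startswith(prefix + "_"):
            return prefix
    return None  # entity types outside the five classes are removed
```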
JNLPBA 2004 – Testing on Top-level Entities Only

              # Test    Nested NER Model             Semi-CRF Model                   Zhou & Su (2004)
              Entities  (train on all entities)      (train on top-level entities)
                        Precision  Recall  F1        Precision  Recall  F1            Precision  Recall  F1
Protein       4944      66.98      74.58   70.57     68.15      62.68   65.30         69.01      79.24   73.77
DNA           1030      62.96      66.50   64.68     65.45      52.23   58.10         66.84      73.11   69.83
RNA           115       63.06      60.87   61.95     64.55      61.74   63.11         64.66      63.56   64.10
Cell line     487       49.92      60.78   54.81     49.61      52.16   50.85         53.85      65.80   59.23
Cell type     1858      75.12      65.34   69.89     73.29      55.81   63.37         78.06      72.41   75.13
Overall       8434      66.78      70.57   68.62     67.50      59.27   63.12         69.42      75.99   72.55

Table 4: Named entity results on the JNLPBA 2004 shared task data. Zhou and Su (2004) was the best system at the shared task, and is still state-of-the-art on the dataset.
[Figure 3 (tree not reproduced): An example sentence from the AnCora corpus, along with its English translation. The sentence is "A doble partido, el Barça es el favorito ", afirma Makaay, delantero del Deportivo." and its translation is "At double match, the Barça is the favorite ", states Makaay, attacker of Deportivo."]

5.1.2 Experimental Setup

We ran several sets of experiments, varying between all entities, or just top-level entities, for training and testing. As discussed in Section 3, if we train on just top-level entities then the model is equivalent to a semi-CRF. Semi-CRFs are state-of-the-art and provide a good baseline for performance on just the top-level entities. Semi-CRFs are strictly better than regular, linear chain CRFs, because they can use all of the features and structure of a linear chain CRF, but also utilize whole-entity features (Andrew, 2006). We also evaluated the semi-CRF model on all entities. This may seem like an unfair evaluation, because the semi-CRF has no way of recovering the nested entities, but we wanted to illustrate just how much information is lost when using a flat representation.

5.1.3 Results

Our named entity results when evaluating on all entities are shown in Table 2 and when evaluating on only top-level entities are shown in Table 3. Our nested model outperforms the flat semi-CRF on both top-level entities and all entities.

While not our main focus, we also evaluated our models on parts of speech. The model trained on just top level entities achieved POS accuracy of 97.37%, and the one trained on all entities achieved 97.25% accuracy. The GENIA tagger (Tsuruoka et al., 2005) achieves 98.49% accuracy using the same train/test split.

5.1.4 Additional JNLPBA 2004 Experiments

Because we could not compare our results on the NER portion of the GENIA corpus with any other work, we also evaluated on the JNLPBA corpus. This corpus was used in a shared task for the BioNLP workshop at Coling in 2004 (Collier et al., 2004). They used the entire GENIA corpus for training, and modified the label set as discussed in Section 5.1.1. They also removed all embedded entities, and kept only the top-level ones. They then annotated new data for the test set. This dataset has no nested entities, but because the training data is GENIA we can still train our model on the data annotated with nested entities, and then evaluate on their test data by ignoring all embedded entities found by our named entity recognizer.
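The evaluation step just described, keeping only the outermost predicted entities, amounts to discarding any predicted span contained in another predicted span. The sketch below is our own, with hypothetical names, and is only meant to make that filtering step concrete.

```python
# Sketch (ours, hypothetical names): keep only top-level entities, i.e. drop
# every predicted span that is strictly contained inside another predicted span.
# Spans are (start, end, label) with end exclusive.
def top_level_only(spans):
    def contained(inner, outer):
        return (outer[0] <= inner[0] and inner[1] <= outer[1]
                and (inner[0], inner[1]) != (outer[0], outer[1]))
    return [s for s in spans if not any(contained(s, o) for o in spans)]


# Example: a nested PROT inside a DNA span is dropped before scoring.
print(top_level_only([(15, 21, "DNA"), (18, 19, "PROT")]))
# -> [(15, 21, 'DNA')]
```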
AnCora Spanish – Testing on All Entities

                # Test    Nested NER Model                 Semi-CRF Model
                Entities  (train on all entities)          (train on top-level entities)
                          Precision   Recall   F1          Precision   Recall   F1
Person          1778      65.29       78.91    71.45       75.10       32.73    45.59
Organization    2137      86.43       56.90    68.62       47.02       26.20    33.65
Location        1050      78.66       46.00    58.05       84.94       13.43    23.19
Date            568       87.13       83.45    85.25       79.43       29.23    42.73
Number          991       81.51       80.52    81.02       66.27       28.15    39.52
Other           512       17.90       64.65    28.04       10.77       16.60    13.07
Overall         7036      62.38       66.87    64.55       51.06       25.77    34.25

Table 5: Named entity results on the Spanish portion of AnCora, evaluating on all entities.
AnCora Spanish – Testing on Top-level Entities Only

                # Test    Nested NER Model                 Semi-CRF Model
                Entities  (train on all entities)          (train on top-level entities)
                          Precision   Recall   F1          Precision   Recall   F1
Person          1050      57.42       66.67    61.70       71.23       52.57    60.49
Organization    1060      77.38       40.66    53.31       44.33       49.81    46.91
Location        279       72.49       36.04    48.15       79.52       24.40    37.34
Date            290       72.29       57.59    64.11       71.77       51.72    60.12
Number          519       57.17       49.90    53.29       54.87       44.51    49.15
Other           541       11.30       38.35    17.46       9.51        26.88    14.04
Overall         3739      50.57       49.72    50.14       46.07       44.61    45.76

Table 6: Named entity results on the Spanish portion of AnCora, evaluating on only top-level entities.

This experiment allows us to show that our named entity recognizer works well on top-level entities, by comparing it with prior work. Our model also produces part of speech tags, but the test data is not annotated with POS tags, so we cannot show POS tagging results on this dataset.

One difficulty we had with the JNLPBA experiments was with tokenization. The version of GENIA distributed for the shared task is tokenized differently from the original GENIA corpus, but we needed to train on the original corpus as it is the only version with nested entities. We tried our best to retokenize the original corpus to match the distributed data, but did not have complete success. It is worth noting that the data is actually tokenized in a manner which allows a small amount of "cheating." Normally, hyphenated words, such as LPS-induced, are tokenized as one word. However, if the portion of the word before the hyphen is in an entity, and the part after is not, such as BCR-induced, then the word is split into two tokens: BCR and -induced. Therefore, when a word starts with a hyphen it is a strong indicator that the prior word and it span the right boundary of an entity. Because the train and test data for the shared task do not contain nested entities, fewer words are split in this manner than in the original data. We did not intentionally exploit this fact in our feature design, but it is probable that some of our orthographic features "learned" this fact anyway. This probably harmed our results overall, because some hyphenated words, which straddled boundaries in nested entities and would have been split in the original corpus (and were split in our training data), were not split in the test data, prohibiting our model from properly identifying them.

For this experiment, we retrained our model on the entire, retokenized, GENIA corpus. We also retrained the distributional similarity model on the retokenized data. Once again, we trained one model on the nested data, and one on just the top-level entities, so that we can compare performance of both models on the top-level entities. Our full results are shown in Table 4, along with the current state-of-the-art (Zhou and Su, 2004). Besides the tokenization issues harming our performance, Zhou and Su (2004) also employed clever post-processing to improve their results.

5.2 AnCora Experiments

5.2.1 Data

We performed experiments on the NER portion of AnCora (Martí et al., 2007). This corpus has Spanish and Catalan portions, and we evaluated on both. The data is also annotated with parts of speech, parse trees, semantic roles and word senses.
AnCora Catalan – Testing on All Entities

                # Test    Nested NER Model                 Semi-CRF Model
                Entities  (train all entities)             (train top-level entities only)
                          Precision   Recall   F1          Precision   Recall   F1
Person          1303      89.01       50.35    64.31       70.08       46.20    55.69
Organization    1781      68.95       83.77    75.64       65.32       41.77    50.96
Location        1282      76.78       72.46    74.56       75.49       36.04    48.79
Date            606       84.27       81.35    82.79       70.87       38.94    50.27
Number          1128      86.55       83.87    85.19       75.74       38.74    51.26
Other           596       85.48       8.89     16.11       64.91       6.21     11.33
Overall         6696      78.09       68.23    72.83       70.39       37.60    49.02

Table 7: Named entity results on the Catalan portion of AnCora, evaluating on all entities.
AnCora Catalan – Testing on Top-level Entities Only

                # Test    Nested NER Model                 Semi-CRF Model
                Entities  (train all entities)             (train top-level entities only)
                          Precision   Recall   F1          Precision   Recall   F1
Person          801       67.44       47.32    55.61       62.63       67.17    64.82
Organization    899       52.21       74.86    61.52       57.68       73.08    64.47
Location        659       54.86       67.68    60.60       62.42       57.97    60.11
Date            296       62.54       66.55    64.48       59.46       66.89    62.96
Number          528       62.35       70.27    66.07       63.08       68.94    65.88
Other           342       49.12       8.19     14.04       45.61       7.60     13.03
Overall         3525      57.67       59.40    58.52       60.53       61.42    60.97

Table 8: Named entity results on the Catalan portion of AnCora, evaluating on only top-level entities.

The corpus annotators made a distinction between strong and weak entities. They define strong named entities as "a word, a number, a date, or a string of words that refer to a single individual entity in the real world." If a strong NE contains multiple words, it is collapsed into a single token. Weak named entities "consist of a noun phrase, being it simple or complex" and must contain a strong entity. Figure 3 shows an example from the corpus with both strong and weak entities. The entity types present are person, location, organization, date, number, and other. Weak entities are very prevalent; 47.1% of entities are embedded.

For Spanish, files starting with 7–9 were the test set, 5–6 were the development test set, and the remainder were the development train set. For Catalan, files starting with 8–9 were the test set, 6–7 were the development test set, and the remainder were the development train set. For both, the development train and test sets were combined to form the final train set. We removed sentences longer than 80 words. Spanish has 15,591 training sentences, and Catalan has 14,906.

5.2.2 Experimental Setup

The parts of speech provided in the data include detailed morphological information, using a similar annotation scheme to the Prague TreeBank (Hana and Hanová, 2002). There are around 250 possible tags, and experiments on the development data with the full tagset were unsuccessful. We removed all but the first two characters of each POS tag, resulting in a set of 57 tags which more closely resembles that of the Penn TreeBank (Marcus et al., 1993). All reported results use our modified version of the POS tag set.

We took only the words as input, none of the extra annotations. For both languages we trained a 200 cluster distributional similarity model over the words in the corpus. We performed the same set of experiments on AnCora as we did on GENIA.

5.2.3 Results and Discussion

The full results for Spanish when testing on all entities are shown in Table 5, and for only top-level entities are shown in Table 6. For part of speech tagging, the nested model achieved 95.93% accuracy, compared with 95.60% for the flatly trained model. The full results for Catalan when testing on all entities are shown in Table 7, and for only top-level entities are shown in Table 8. POS tagging results were even closer on Catalan: 96.62% for the nested model, and 96.59% for the flat model.

It is not surprising that the models trained on all entities do significantly better than the flatly trained models when testing on all entities.
The story is a little less clear when testing on just top-level entities. In this case, the nested model does 4.38% better than the flat model on the Spanish data, but 2.45% worse on the Catalan data. The overall picture is the same as for GENIA: modeling the nested entities does not, on average, reduce performance on the top-level entities, but a nested entity model does substantially better when evaluated on all entities.

6 Conclusions

We presented a discriminative parsing-based method for nested named entity recognition, which does well on both top-level and nested entities. The only real drawback to our method is that it is slower than common flat techniques. While most NER corpus designers have defenestrated embedded entities, we hope that in the future this will not continue, as large amounts of information are lost due to this design decision.

Acknowledgements

Thanks to Mihai Surdeanu for help with the AnCora data. The first author was supported by a Stanford Graduate Fellowship. This paper is based on work funded in part by the Defense Advanced Research Projects Agency through IBM. The content does not necessarily reflect the views of the U.S. Government, and no official endorsement should be inferred.

References

Beatrice Alex, Barry Haddow, and Claire Grover. 2007. Recognising nested named entities in biomedical text. In BioNLP Workshop at ACL 2007, pages 65–72.

Galen Andrew. 2006. A hybrid markov/semi-markov conditional random field for sequence segmentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2006).

A. Borthwick. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University.

Kate Byrne. 2007. Nested named entity recognition in historical archive text. In ICSC '07: Proceedings of the International Conference on Semantic Computing, pages 589–596.

Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. In Proceedings of the tenth Annual Meeting of the European Association for Computational Linguistics (EACL), pages 59–66.

Nigel Collier, J. Kim, Y. Tateisi, T. Ohta, and Y. Tsuruoka, editors. 2004. Proceedings of the International Joint Workshop on NLP in Biomedicine and its Applications.

J. R. Curran and S. Clark. 1999. Language independent NER using a maximum entropy tagger. In CoNLL 1999, pages 164–167.

Jenny Finkel, Shipra Dingare, Christopher Manning, Malvina Nissim, Beatrice Alex, and Claire Grover. 2005. Exploring the boundaries: Gene and protein identification in biomedical text. In BMC Bioinformatics 6 (Suppl. 1).

Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based conditional random field parsing. In ACL/HLT-2008.

Baohua Gu. 2006. Recognizing nested named entities in GENIA corpus. In BioNLP Workshop at HLT-NAACL 2006, pages 112–113.

Jiří Hana and Hana Hanová. 2002. Manual for morphological annotation. Technical Report TR-2002-14, UK MFF CKL.

Jun'ichi Kazama, Takaki Makino, Yoshihiro Ohta, and Jun'ichi Tsujii. 2002. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain (ACL 2002).

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In ICML 2001, pages 282–289. Morgan Kaufmann, San Francisco, CA.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

L. Márquez, L. Padró, M. Surdeanu, and L. Villarejo. 2007a. UPC: Experiments with joint learning within SemEval Task 9. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007).

L. Márquez, L. Villarejo, M.A. Martí, and M. Taulé. 2007b. SemEval-2007 Task 09: Multilevel semantic annotation of Catalan and Spanish. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007).

M.A. Martí, M. Taulé, M. Bertran, and L. Márquez. 2007. AnCora: Multilingual and multilevel annotated corpora. MS, Universitat de Barcelona.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Flexible text segmentation with structured multilabel classification. In HLT '05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 987–994.

Tomoko Ohta, Yuka Tateisi, and Jin-Dong Kim. 2002. The GENIA corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the second international conference on Human Language Technology Research, pages 82–86.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003.

Sunita Sarawagi and William W. Cohen. 2004. Semi-markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems 17, pages 1185–1192.

Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2003. Effective adaptation of a hidden markov model-based named entity recognizer for biomedical domain. In Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine. Association for Computational Linguistics (ACL 2003).

Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2003. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the ACL-03 Workshop on Natural Language Processing in Biomedicine, pages 41–48.

Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. Developing a robust part-of-speech tagger for biomedical text. In Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pages 382–392.

Jie Zhang, Dan Shen, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2004. Enhancing HMM-based biomedical named entity recognition by studying special phenomena. Journal of Biomedical Informatics, 37(6):411–422.

GuoDong Zhou and Jian Su. 2004. Exploring deep knowledge resources in biomedical name recognition. In Joint Workshop on Natural Language Processing in Biomedicine and Its Applications at Coling 2004.

Guodong Zhou, Jie Zhang, Jian Su, Dan Shen, and Chewlim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics, 20(7):1178–1190.

Guodong Zhou. 2006. Recognizing names in biomedical texts using mutual information independence model and SVM plus sigmoid. International Journal of Medical Informatics, 75:456–467.
