[Figure 1 shows a tree over the sentence “PEBP2 alpha A1 , alpha B1 , and alpha B2 proteins bound the PEBP2 site within the mouse GM-CSF promoter .”, with part-of-speech preterminals (NN, etc.) and nested named entity nodes above them.]

Figure 1: An example of our tree representation over nested named entities. The sentence is from the GENIA corpus. PROT is short for PROTEIN.
but they tend to be much flatter. This model allows us to include parts of speech in the tree, and therefore to jointly model the named entities and the part of speech tags. Once we have converted our sentences into parse trees, we train a discriminative constituency parser similar to that of Finkel et al. (2008). We found that on top-level entities, our model does just as well as more conventional methods. When evaluating on all entities our model does well, with F-scores ranging from slightly worse than performance on top-level only, to substantially better than top-level only.

2 Related Work

There is a large body of work on named entity recognition, but very little of it addresses nested entities. Early work on the GENIA corpus (Kazama et al., 2002; Tsuruoka and Tsujii, 2003) only worked on the innermost entities. This was soon followed by several attempts at nested NER in GENIA (Shen et al., 2003; Zhang et al., 2004; Zhou et al., 2004) which built hidden Markov models over the innermost named entities, and then used a rule-based post-processing step to identify the named entities containing the innermost entities. Zhou (2006) used a more elaborate model for the innermost entities, but then used the same rule-based post-processing method on the output to identify non-innermost entities. Gu (2006) focused only on proteins and DNA, by building separate binary SVM classifiers for innermost and outermost entities for those two classes.

Several techniques for nested NER in GENIA were presented in Alex et al. (2007). Their first approach was to layer CRFs, using the output of one as the input to the next. For inside-out layering, the first CRF would identify the innermost entities, the next layer would be over the words and the innermost entities to identify second-level entities, etc. For outside-in layering the first CRF would identify outermost entities, and then successive CRFs would identify increasingly nested entities. They also tried a cascaded approach, with separate CRFs for each entity type. The CRFs would be applied in a specified order, and then each CRF could utilize features derived from the output of previously applied CRFs. This technique has the problem that it cannot identify nested entities of the same type; this happens frequently in the data, such as the nested proteins at the beginning of the sentence in Figure 1. They also tried a joint labeling approach, where they trained a single CRF, but the label set was significantly expanded so that a single label would include all of the entities for a particular word. Their best results were from the cascaded approach.

Byrne (2007) took a different approach, on historical archive text. She modified the data by concatenating adjacent tokens (up to length six) into potential entities, and then labeled each concatenated string using the C&C tagger (Curran and Clark, 1999). When labeling a string, the “previous” string was the one-token-shorter string containing all but the last token of the current string. For single tokens the “previous” token was the longest concatenation starting one token earlier.

SemEval 2007 Task 9 (Márquez et al., 2007b) included a nested NER component, as well as noun sense disambiguation and semantic role labeling. However, the parts of speech and syntactic tree were given as part of the input, and named entities were specified as corresponding to noun phrases in the tree, or particular parts of speech. This restriction substantially changes the task. Two groups participated in the shared task, but only one (Márquez et al., 2007a) worked on the named entity component. They used a multi-label AdaBoost.MH algorithm, over phrases in the parse tree which, based on their labels, could potentially be entities.
[Figure 2 shows a binarized subtree whose nodes are annotated with parent and grandparent labels: a DNA node (parent=ROOT) dominating a PROT node (parent=DNA, grandparent=ROOT), NN preterminals (parent=PROT or DNA, grandparent=DNA or ROOT), and an active state @DNA (parent=ROOT, prev=NN, first=PROT).]
Figure 2: An example of a subtree after it has been annotated and binarized. Features are computed over
this representation. An @ indicates a chart parser active state (incomplete constituent).
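To make the representation in Figure 2 concrete, the following sketch (illustrative Python, not the system's actual code; the class and label formats are invented) shows the two transforms the figure depicts: annotating each node with its parent and grandparent labels, and right-branching binarization with “@” active states that retain the previous child's label (first-order horizontal Markovization).

    class Tree:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []   # empty list => leaf (a word)

    def annotate(tree, parent="ROOT", grandparent=None):
        """Copy the tree, appending parent (and grandparent) labels to each node."""
        if not tree.children:                # words are left unchanged
            return Tree(tree.label)
        label = f"{tree.label}[parent={parent}" + (
            f",grandparent={grandparent}]" if grandparent else "]")
        kids = [annotate(c, parent=tree.label, grandparent=parent)
                for c in tree.children]
        return Tree(label, kids)

    def binarize(tree):
        """Right-branching binarization; an '@' node is an incomplete constituent
        whose label remembers the previous child (first-order Markovization)."""
        if len(tree.children) <= 2:
            return Tree(tree.label, [binarize(c) for c in tree.children])
        head, rest = tree.children[0], tree.children[1:]
        active = Tree(f"@{tree.label.lstrip('@')}[prev={head.label}]", rest)
        return Tree(tree.label, [binarize(head), binarize(active)])

    # e.g., a DNA phrase containing a nested PROT:
    # t = Tree("DNA", [Tree("NN", [Tree("the")]),
    #                  Tree("PROT", [Tree("NN", [Tree("PEBP2")])]),
    #                  Tree("NN", [Tree("site")])])
    # binarize(annotate(t))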
Finally, McDonald et al. (2005) presented a technique for labeling potentially overlapping segments of text, based on a large margin, multilabel classification algorithm. Their method could be used for nested named entity recognition, but the experiments they performed were on joint (flat) NER and noun phrase chunking.

3 Nested Named Entity Recognition as Parsing

Our model is quite simple – we represent each sentence as a constituency tree, with each named entity corresponding to a phrase in the tree, along with a root node which connects the entire sentence. No additional syntactic structure is represented. We also model the parts of speech as preterminals, and the words themselves as the leaves. See Figure 1 for an example of a named entity tree. Each node is then annotated with both its parent and grandparent labels, which allows the model to learn how entities nest. We binarize our trees in a right-branching manner, and then build features over the labels, unary rules, and binary rules. We also use first-order horizontal Markovization, which allows us to retain some information about the previous node in the binarized rule. See Figure 2 for an example of an annotated and binarized subtree. Once each sentence has been converted into a tree, we train a discriminative constituency parser, based on Finkel et al. (2008).

It is worth noting that if you use our model on data which does not have any nested entities, then it is precisely equivalent to a semi-CRF (Sarawagi and Cohen, 2004; Andrew, 2006), but with no length restriction on entities. Like a semi-CRF, we are able to define features over entire entities of arbitrary length, instead of just over a small, fixed window of words like a regular linear chain CRF.

We model part of speech tags jointly with the named entities, though the model also works without them. We determine the possible part of speech tags based on distributional similarity clusters. We used Alexander Clark's software (http://www.cs.rhul.ac.uk/home/alexc/RHUL/Downloads.html), based on (Clark, 2003), to cluster the words, and then allow each word to be labeled with any part of speech tag seen in the data with any other word in the same cluster. Because the parts of speech are annotated with the parent (and grandparent) labels, they determine what, if any, entity types a word can be labeled with. Many words, such as verbs, cannot be labeled with any entities. We also limit our grammar based on the rules observed in the data. The rules whose children include part of speech tags restrict the possible pairs of adjacent tags. Interestingly, the restrictions imposed by this joint modeling (both observed word/tag pairs and observed rules) actually result in much faster inference (and therefore faster train and test times) than a model over named entities alone. This is different from most work on joint modeling of multiple levels of annotation, which usually results in significantly slower inference.

3.1 Discriminative Constituency Parsing

We train our nested NER model using the same technique as the discriminatively trained, conditional random field-based CRF-CFG parser of Finkel et al. (2008).
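For reference, the objective such a CRF-CFG parser optimizes can be written as follows; this is our rendering of the standard form from Finkel et al. (2008), not an equation reproduced from this paper. For a sentence s and a tree t decomposed into one-level subtrees (rules r together with their spans and splits),

\[
P(t \mid s; \theta) \;=\; \frac{1}{Z_s} \prod_{r \in t} \phi(r \mid s; \theta),
\qquad
\phi(r \mid s; \theta) \;=\; \exp\!\Big(\sum_{k} \theta_k \, f_k(r, s)\Big),
\qquad
Z_s \;=\; \sum_{t' \in \tau(s)} \prod_{r \in t'} \phi(r \mid s; \theta),
\]

where \(\tau(s)\) is the set of trees over s licensed by the grammar and the \(f_k\) are the features of Section 4, computed over the local subtree, its span and split, and the words. The normalizer \(Z_s\) and the gradient of the log likelihood are computed with the inside-outside algorithm, as described below.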
Local Features:
label_i
word_i + label_i
word_{i-1} + label_i
word_{i+1} + label_i
distsim_i + label_i
distsim_{i-1} + label_i
distsim_{i+1} + label_i
shape_i + label_i
shape_{i-1} + label_i
shape_{i+1} + label_i
distsim_i + distsim_{i-1} + label_i
shape_i + shape_{i+1} + label_i
shape_{i-1} + shape_i + label_i
word_{i-1} + shape_i + label_i
shape_i + word_{i+1} + label_i
words in a 5 word window
prefixes up to length 6
suffixes up to length 6

Pairwise Features:
label_{i-1} + label_i
word_i + label_{i-1} + label_i
word_{i-1} + label_{i-1} + label_i
word_{i+1} + label_{i-1} + label_i
distsim_i + label_{i-1} + label_i
distsim_{i-1} + label_{i-1} + label_i
distsim_{i+1} + label_{i-1} + label_i
distsim_{i-1} + distsim_i + label_{i-1} + label_i
shape_i + label_{i-1} + label_i
shape_{i-1} + label_{i-1} + label_i
shape_{i+1} + label_{i-1} + label_i
shape_{i-1} + shape_i + label_{i-1} + label_i
shape_{i-1} + shape_{i+1} + label_{i-1} + label_i
Table 1: The local and pairwise NER features used in all of our experiments. Consult the text for a full
description of all features, which includes feature classes not in this table.
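To make the templates in Table 1 concrete, the sketch below (illustrative Python, not the actual feature extractor; the string formatting is invented) instantiates a few local and pairwise features at position i, where words, distsim, and labels are parallel per-token lists.

    import re

    def shape(word):
        # crude orthographic shape: uppercase -> X, lowercase -> x, digit -> d
        s = re.sub(r"[A-Z]", "X", word)
        s = re.sub(r"[a-z]", "x", s)
        return re.sub(r"[0-9]", "d", s)

    def local_features(words, distsim, labels, i):
        # a few of the Table 1 local templates (assumes 0 < i < len(words) - 1)
        return [
            f"label={labels[i]}",
            f"word={words[i]}|label={labels[i]}",
            f"prevword={words[i-1]}|label={labels[i]}",
            f"distsim={distsim[i]}|label={labels[i]}",
            f"shape={shape(words[i])}|label={labels[i]}",
        ]

    def pairwise_features(words, distsim, labels, i):
        # a few of the Table 1 pairwise templates over the label edge (i-1, i)
        edge = f"{labels[i-1]}-{labels[i]}"
        return [
            f"labels={edge}",
            f"word={words[i]}|labels={edge}",
            f"distsim={distsim[i]}|labels={edge}",
            f"shape={shape(words[i])}|labels={edge}",
        ]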
The parser is similar to a chart-based PCFG parser, except that instead of putting probabilities over rules, it puts clique potentials over local subtrees. These unnormalized potentials know what span (and split) the rule is over, and arbitrary features can be defined over the local subtree, the span/split and the words of the sentence. The inside-outside algorithm is run over the clique potentials to produce the partial derivatives and normalizing constant which are necessary for optimizing the log likelihood. Optimization is done by stochastic gradient descent.

The only real drawback to our model is runtime. The algorithm is O(n^3) in sentence length. Training on all of GENIA took approximately 23 hours for the nested model and 16 hours for the semi-CRF. A semi-CRF with an entity length restriction, or a regular CRF, would both have been faster. At runtime, the nested model for GENIA tagged about 38 words per second, while the semi-CRF tagged 45 words per second. For comparison, a first-order linear chain CRF trained with similar features on the same data can tag about 4,000 words per second.

4 Features

When designing features, we first made ones similar to the features typically designed for a first-order CRF, and then added features which are not possible in a CRF, but are possible in our enhanced representation. This includes features over entire entities, features which directly model nested entities, and joint features over entities and parts of speech. When features are computed over each label, unary rule, and binary rule, the feature function is aware of the rule span and split.

Each word is labeled with its distributional similarity cluster (distsim), and a string indicating orthographic information (shape) (Finkel et al., 2005). Subscripts represent word position in the sentence. In addition to those below, we include features for each fully annotated label and rule.

Local named entity features. Local named entity features are over the label for a single word. They are equivalent to the local features in a linear chain CRF. However, unlike in a linear chain CRF, if a word belongs to multiple entities then the local features are computed for each entity. Local features are also computed for words not contained in any entity. Local features are in Table 1.

Pairwise named entity features. Pairwise features are over the labels for adjacent words, and are equivalent to the edge features in a linear chain CRF. They can occur when pairs of words have the same label, or over entity boundaries where the words have different labels. Like with the local features, if a pair of words are contained in, or straddle the border of, multiple entities, then the features are repeated for each. The pairwise features we use are shown in Table 1.

Embedded named entity features. Embedded named entity features occur in binary rules where one entity is the child of another entity. For our embedded features, we replicated the pairwise features, except that the embedded named entity was treated as one of the words, where the “word” (and other annotations) were indicative of the type of entity, and not the actual string that is the entity. For instance, in the subtree in Figure 2, we would compute word_i + label_{i-1} + label_i as PROT-DNA-DNA for i = 18 (the index of the word GM-CSF). The normal pairwise feature at the same po-
[Table 2: GENIA – Testing on All Entities]
Table 4: Named entity results on the JNLPBA 2004 shared task data. Zhou and Su (2004) was the best
system at the shared task, and is still state-of-the-art on the dataset.
[Figure 3 shows a tree with NP, ORGANIZATION, and PERSON nodes over part-of-speech preterminals, spanning the sentence glossed in English as: “At double match , the Barça is the favorite ” , states Makaay , attacker of Deportivo .]
Figure 3: An example sentence from the AnCora corpus, along with its English translation.
ence with no improvement in performance.

5.1.2 Experimental Setup

We ran several sets of experiments, varying between all entities, or just top-level entities, for training and testing. As discussed in Section 3, if we train on just top-level entities then the model is equivalent to a semi-CRF. Semi-CRFs are state-of-the-art and provide a good baseline for performance on just the top-level entities. Semi-CRFs are strictly better than regular, linear chain CRFs, because they can use all of the features and structure of a linear chain CRF, but also utilize whole-entity features (Andrew, 2006). We also evaluated the semi-CRF model on all entities. This may seem like an unfair evaluation, because the semi-CRF has no way of recovering the nested entities, but we wanted to illustrate just how much information is lost when using a flat representation.

5.1.3 Results

Our named entity results when evaluating on all entities are shown in Table 2 and when evaluating on only top-level entities are shown in Table 3. Our nested model outperforms the flat semi-CRF on both top-level entities and all entities.

While not our main focus, we also evaluated our models on parts of speech. The model trained on just top-level entities achieved POS accuracy of 97.37%, and the one trained on all entities achieved 97.25% accuracy. The GENIA tagger (Tsuruoka et al., 2005) achieves 98.49% accuracy using the same train/test split.

5.1.4 Additional JNLPBA 2004 Experiments

Because we could not compare our results on the NER portion of the GENIA corpus with any other work, we also evaluated on the JNLPBA corpus. This corpus was used in a shared task for the BioNLP workshop at Coling in 2004 (Collier et al., 2004). They used the entire GENIA corpus for training, and modified the label set as discussed in Section 5.1.1. They also removed all embedded entities, and kept only the top-level ones. They then annotated new data for the test set. This dataset has no nested entities, but because the training data is GENIA we can still train our model on the data annotated with nested entities, and then evaluate on their test data by ignoring all embedded entities found by our named entity recognizer.
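Concretely, “ignoring all embedded entities” amounts to a post-filter on the recognizer's output before scoring. A minimal sketch (ours; the (start, end, label) span format is hypothetical) that keeps only entities not nested inside another predicted entity:

    def top_level_only(entities):
        # entities: list of (start, end, label) tuples with end exclusive
        keep = []
        for i, (s, e, lab) in enumerate(entities):
            nested = any(s2 <= s and e <= e2 and (s2, e2) != (s, e)
                         for j, (s2, e2, _) in enumerate(entities) if j != i)
            if not nested:
                keep.append((s, e, lab))
        return keep

    # e.g. top_level_only([(0, 6, "PROT"), (0, 2, "PROT"), (8, 12, "DNA")])
    #      -> [(0, 6, "PROT"), (8, 12, "DNA")]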
AnCora Spanish – Testing on All Entities
Table 5: Named entity results on the Spanish portion of AnCora, evaluating on all entities.
AnCora Spanish – Testing on Top-level Entities Only
Table 6: Named entity results on the Spanish portion of AnCora, evaluating on only top-level entities.
This experiment allows us to show that our named entity recognizer works well on top-level entities, by comparing it with prior work. Our model also produces part of speech tags, but the test data is not annotated with POS tags, so we cannot show POS tagging results on this dataset.

One difficulty we had with the JNLPBA experiments was with tokenization. The version of GENIA distributed for the shared task is tokenized differently from the original GENIA corpus, but we needed to train on the original corpus as it is the only version with nested entities. We tried our best to retokenize the original corpus to match the distributed data, but did not have complete success. It is worth noting that the data is actually tokenized in a manner which allows a small amount of “cheating.” Normally, hyphenated words, such as LPS-induced, are tokenized as one word. However, if the portion of the word before the hyphen is in an entity, and the part after is not, such as BCR-induced, then the word is split into two tokens: BCR and -induced. Therefore, when a word starts with a hyphen it is a strong indicator that it and the prior word span the right boundary of an entity. Because the train and test data for the shared task do not contain nested entities, fewer words are split in this manner than in the original data. We did not intentionally exploit this fact in our feature design, but it is probable that some of our orthographic features “learned” this fact anyway. This probably harmed our results overall, because some hyphenated words, which straddled boundaries in nested entities and would have been split in the original corpus (and were split in our training data), were not split in the test data, prohibiting our model from properly identifying them.

For this experiment, we retrained our model on the entire, retokenized, GENIA corpus. We also retrained the distributional similarity model on the retokenized data. Once again, we trained one model on the nested data, and one on just the top-level entities, so that we can compare performance of both models on the top-level entities. Our full results are shown in Table 4, along with the current state-of-the-art (Zhou and Su, 2004). Besides the tokenization issues harming our performance, Zhou and Su (2004) also employed clever post-processing to improve their results.

5.2 AnCora Experiments

5.2.1 Data

We performed experiments on the NER portion of AnCora (Martí et al., 2007). This corpus has Spanish and Catalan portions, and we evaluated on both. The data is also annotated with parts of speech, parse trees, semantic roles and word senses.
AnCora Catalan – Testing on All Entities
Table 7: Named entity results on the Catalan portion of AnCora, evaluating on all entities.
AnCora Catalan – Testing on Top-level Entities Only
Table 8: Named entity results on the Catalan portion of AnCora, evaluating on only top-level entities.
The corpus annotators made a distinction between strong and weak entities. They define strong named entities as “a word, a number, a date, or a string of words that refer to a single individual entity in the real world.” If a strong NE contains multiple words, it is collapsed into a single token. Weak named entities “consist of a noun phrase, being it simple or complex” and must contain a strong entity. Figure 3 shows an example from the corpus with both strong and weak entities. The entity types present are person, location, organization, date, number, and other. Weak entities are very prevalent; 47.1% of entities are embedded.

For Spanish, files starting with 7–9 were the test set, 5–6 were the development test set, and the remainder were the development train set. For Catalan, files starting with 8–9 were the test set, 6–7 were the development test set, and the remainder were the development train set. For both, the development train and test sets were combined to form the final train set. We removed sentences longer than 80 words. Spanish has 15,591 training sentences, and Catalan has 14,906.

5.2.2 Experimental Setup

The parts of speech provided in the data include detailed morphological information, using a similar annotation scheme to the Prague TreeBank (Hana and Hanová, 2002). There are around 250 possible tags, and experiments on the development data with the full tagset were unsuccessful. We removed all but the first two characters of each POS tag, resulting in a set of 57 tags which more closely resembles that of the Penn Treebank (Marcus et al., 1993). All reported results use our modified version of the POS tag set.

We took only the words as input, none of the extra annotations. For both languages we trained a 200 cluster distributional similarity model over the words in the corpus. We performed the same set of experiments on AnCora as we did on GENIA.
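The tagset reduction described above is a simple prefix truncation; a minimal sketch (the example tags are illustrative, not quoted from the corpus documentation):

    def coarse_pos(tag):
        # keep only the first two characters of an AnCora-style morphological tag,
        # collapsing ~250 fine-grained tags to ~57 coarse ones
        return tag[:2]

    # e.g. coarse_pos("NCFS000") -> "NC" (common noun)
    #      coarse_pos("VMIP3S0") -> "VM" (main verb)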
5.2.3 Results and Discussion

The full results for Spanish when testing on all entities are shown in Table 5, and for only top-level entities are shown in Table 6. For part of speech tagging, the nested model achieved 95.93% accuracy, compared with 95.60% for the flatly trained model. The full results for Catalan when testing on all entities are shown in Table 7, and for only top-level entities are shown in Table 8. POS tagging results were even closer on Catalan: 96.62% for the nested model, and 96.59% for the flat model.

It is not surprising that the models trained on all entities do significantly better than the flatly trained models when testing on all entities. The story is a little less clear when testing on just top-level entities. In this case, the nested model does 4.38% better than the flat model on the Spanish data, but 2.45% worse on the Catalan data. The overall picture is the same as for GENIA: modeling the nested entities does not, on average, reduce performance on the top-level entities, but a nested entity model does substantially better when evaluated on all entities.

Acknowledgments

Thanks to Mihai Surdeanu for help with the AnCora data. The first author was supported by a Stanford Graduate Fellowship. This paper is based on work funded in part by the Defense Advanced Research Projects Agency through IBM. The content does not necessarily reflect the views of the U.S. Government, and no official endorsement should be inferred.

References

Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. In Proceedings of the Tenth Annual Meeting of the European Association for Computational Linguistics (EACL), pages 59–66.

Nigel Collier, J. Kim, Y. Tateisi, T. Ohta, and Y. Tsuruoka, editors. 2004. Proceedings of the International Joint Workshop on NLP in Biomedicine and its Applications.

J. R. Curran and S. Clark. 1999. Language independent NER using a maximum entropy tagger. In CoNLL 1999, pages 164–167.

Jenny Finkel, Shipra Dingare, Christopher Manning, Malvina Nissim, Beatrice Alex, and Claire Grover. 2005. Exploring the boundaries: Gene and protein identification in biomedical text. In BMC Bioinformatics 6 (Suppl. 1).

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML 2001, pages 282–289. Morgan Kaufmann, San Francisco, CA.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Tomoko Ohta, Yuka Tateisi, and Jin-Dong Kim. 2002. The GENIA corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the Second International Conference on Human Language Technology Research, pages 82–86.
Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2003. Effective adaptation of a hidden Markov model-based named entity recognizer for biomedical domain. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine. Association for Computational Linguistics (ACL 2003).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003.

Jie Zhang, Dan Shen, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2004. Enhancing HMM-based biomedical named entity recognition by studying special phenomena. Journal of Biomedical Informatics, 37(6):411–422.

Guodong Zhou, Jie Zhang, Jian Su, Dan Shen, and Chew-Lim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics, 20(7):1178–1190.