G22.2591 - Advanced Natural Language Processing - Spring 2004
Name Recognition
Why Name Recognition?
familiar with most of the names. Without capitalization, it can be hard to tell
unfamiliar organization names from common noun phrases.
Hand-coded rules
For a specific domain, it is possible to do very well with hand-coded rules and
dictionaries. On the MUC-6 evaluation (a very favorable situation, where the
source and general topic of the test data were known in advance), the SRA
system, based on hand-coded rules, achieved F=96.4. Writing rules by hand,
however, requires some skill and considerable time.
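
As a reminder of what the F score above measures, here is a minimal sketch of the F-measure as the weighted harmonic mean of precision and recall. It assumes the usual balanced weighting (beta = 1), which the notes do not state explicitly; the function name and the numbers in the usage line are hypothetical and are not the SRA system's actual precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1.0 weights the two equally (an assumption here; the notes
    only report a single F value).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was found correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values, purely for illustration.
print(round(100 * f_measure(0.97, 0.95), 1))
```

Because it is a harmonic mean, a high F requires both precision and recall to be high; a system cannot compensate for poor recall with very high precision.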
The hand-coded rules take advantage of:
  known names (through lists of well-known places, organizations, and people)
  characteristic suffixes for organizations (Corp., Associates, ...) and locations (Island, Bay)
  first names for people
  titles for people
  other mentions of the same name in an article
Note that the type decision is sometimes based on left context (for example, a
title preceding a person's name) and sometimes on right context (for example, an
organization suffix at the end of the name), so taggers that operate
deterministically from left to right or from right to left would have
difficulty performing optimally.
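
The kinds of evidence listed above can be sketched as a toy rule-based tagger. This is only an illustration under stated assumptions, not the SRA system or any actual MUC-6 rule set: the word lists, the function names (`capitalized_runs`, `tag_names`), and the label names are all hypothetical, and the input is assumed to be already tokenized. Candidate names are taken to be maximal runs of capitalized tokens, which is a simplification (it would mishandle sentence-initial words, for instance).

```python
# Tiny illustrative word lists; a real system would use far larger ones.
KNOWN_NAMES = {"New York": "LOCATION", "IBM": "ORGANIZATION"}
ORG_SUFFIXES = {"Corp.", "Associates"}
LOC_SUFFIXES = {"Island", "Bay"}
FIRST_NAMES = {"Fred", "Mary"}
TITLES = {"Mr.", "Ms.", "Dr."}


def capitalized_runs(tokens):
    """Yield (start, end) spans of maximal runs of capitalized tokens."""
    start = None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel closes a final run
        if tok[:1].isupper():
            if start is None:
                start = i
        elif start is not None:
            yield start, i
            start = None


def tag_names(tokens):
    """Return (name, label) pairs for the names the rules can type."""
    first_pass = []
    for start, end in capitalized_runs(tokens):
        words = tokens[start:end]
        # Left context: a title marks what follows as a person,
        # but is not itself part of the name.
        has_title = len(words) > 1 and words[0] in TITLES
        if has_title:
            words = words[1:]
        name = " ".join(words)
        # Right context: the last word of a multi-word name may be a suffix.
        suffix = words[-1] if len(words) > 1 else None
        if name in KNOWN_NAMES:
            label = KNOWN_NAMES[name]
        elif suffix in ORG_SUFFIXES:
            label = "ORGANIZATION"
        elif suffix in LOC_SUFFIXES:
            label = "LOCATION"
        elif has_title or words[0] in FIRST_NAMES:
            label = "PERSON"
        else:
            label = None  # undecided on local evidence alone
        first_pass.append((name, label))

    # Other mentions: an undecided name inherits the type of an identical
    # name typed anywhere else in the article, earlier or later.
    seen = {name: label for name, label in first_pass if label is not None}
    return [(name, label or seen[name]) for name, label in first_pass
            if name in seen]


text = ("Mr. Harriman founded Harriman Associates near Oyster Bay ; "
        "later Harriman moved to New York .")
for name, label in tag_names(text.split()):
    print(label, name)
```

The second pass over the whole article is what lets an untitled mention receive a type even when the decisive mention comes later in the text, which a single deterministic left-to-right pass could not do.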
Supervised training
Like POS tagging and chunking, named entity recognition has been tried with
many different machine learning methods. More than for the syntactic tasks,
performance on NE recognition depends on the variety of resources that are
brought to bear. CoNLL evaluations are relatively 'pure': the systems
basically just learn from the provided training corpus. On the other hand, 'real'
systems make use of as many lists and as much training data as are available. This
has a substantial effect on performance. In addition, performance is strongly
affected by the domain of the training and test data. These two effects can