An additional file is published online only. To view this file please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2011-000766).

1Department of Health Sciences Research, Mayo Clinic, ...

ABSTRACT
Objective This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.

... the same techniques in the clinical domain. They grouped existing methods largely into the following three types:
1. Heuristics-based approaches based on linguistic theories and rules4–8
2. Supervised machine learning approaches with ...
This paper describes our coreference resolution system, MedCoref, developed by the Mayo Clinic natural language processing (NLP) program for Track 1C. We developed a multi-pass sieve system in Java along the same lines for clinical notes by adapting the existing sieves and adding additional sieves, and then integrated these sieves with FHMM (factorial hidden Markov model) anaphora resolution. Additionally, we performed a thorough study on pronominal coreference resolution considering the two approaches. The code for our system, MedCoref, is available at https://sourceforge.net/projects/ohnlp/files/MedCoref under the unrestrictive open-source Apache v2 license. This enables hospital systems to use our system, which leverages the benefits of the Stanford coreference resolution system combined with adaptations suitable for clinical narratives, and allows them to adapt the system to their environment.

DATA
The Track 1C data consist of three sets from three different institutions: Partners HealthCare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh. The data from the University of Pittsburgh contain two types of notes: discharge notes and progress notes. All protected health information is fully de-identified. In the training set, gold standard markables and chains are manually annotated. The training set contains a total of 492 notes (Partners: 136, Beth: 115, Pittsburgh: 119 discharge and 122 progress notes) and the test set contains a total of 322 notes (Partners: 94, Beth: 79, Pittsburgh: 77 discharge and 72 progress notes).

METHODS
The markables for coreference analysis could be classified (as per ACE guidelines, see http://projects.ldc.upenn.edu/ace/docs/English-Entities-Guidelines_v6.6.pdf) into proper mentions (proper names), nominal mentions (noun phrases whose head is a common noun), and pronoun mentions. In coreference analysis research20 30 and broader NLP research,22 deterministic hierarchical systems that apply rules in the order of precision are shown to be effective. On the other hand, NLP tasks, such as clinical concept extraction (mention detection), are additionally handled through machine learning approaches.31

Figure 2 shows the system architecture. The eight sieves are analogous to inclusion criteria where at least one of them needs to be satisfied. The two filters are similar to exclusion criteria where even when one is matched, the mention pairs are not linked. Set-up C uses a rule-based pronoun sieve as the final step. Set-up A uses the FHMM-based sieve that is unaware of the mention clusters. For set-up B, we merge the chains of set-ups A and C. In general English, rule-based systems were shown to be the most effective for coreference resolution.20 We investigated whether this is true for clinical narratives; that is, we resolved the coreference in proper mentions and nominal mentions using the initial sieves. The final sieve resolved the pronominal coreference using the information gathered about the entities (clusters of mentions). Alternatively, we used the system of Li et al to resolve pronominal coreference. We not only compared the performance of the pronominal coreference methodologies individually (in addition to the other chains), but also compared individual methodologies against chains merged from both methods.
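To make this control flow concrete, the following is a minimal sketch, not the actual MedCoref source: the Mention record, the SievePipeline class, and the representation of sieves and filters as predicates are all our own illustration of the inclusion/exclusion logic described above.

import java.util.List;
import java.util.function.BiPredicate;

/**
 * Minimal sketch (hypothetical names, not the MedCoref API) of the
 * architecture in figure 2: eight sieves act as inclusion criteria and
 * two filters act as exclusion criteria.
 */
public class SievePipeline {
    /** A markable; Track 1C distinguishes proper, nominal, and pronoun mentions. */
    public record Mention(String text, String section, int position) {}

    private final List<BiPredicate<Mention, Mention>> sieves;   // ordered by precision
    private final List<BiPredicate<Mention, Mention>> filters;  // vicinity, section

    public SievePipeline(List<BiPredicate<Mention, Mention>> sieves,
                         List<BiPredicate<Mention, Mention>> filters) {
        this.sieves = sieves;
        this.filters = filters;
    }

    /** A pair is linked if at least one sieve accepts it (inclusion)
     *  and no filter matches it (exclusion). */
    public boolean link(Mention mention, Mention antecedent) {
        boolean accepted = sieves.stream().anyMatch(s -> s.test(mention, antecedent));
        if (!accepted) return false;
        boolean excluded = filters.stream().anyMatch(f -> f.test(mention, antecedent));
        return !excluded;
    }
}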
Relationship detection order
The different sieves used in the system, according to the order of application, are displayed in figure 2. The mentions in each document are ordered by their appearance. For each sieve, coreferential relationships are tested for each pair of mentions, starting from the last appearing (probable) mention. For each mention, a probable antecedent is searched for starting from the closest mention. The assumption is that in narrative text, given two antecedents with similar properties, the closer antecedent is more likely to have a coreferential relationship with the mention, since there are fewer intervening words that could disturb the relationship. Such an assumption makes sense for clinical narratives, a sublanguage that typically does not contain complex or nested sentences.32
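The pair-testing order just described might be sketched as follows; the names here (AntecedentSearch, Sieve, applySieve) are our own, and the cluster-merging step is omitted.

import java.util.List;

/** Sketch of the antecedent search order (hypothetical names, not MedCoref). */
public class AntecedentSearch {
    public interface Sieve { boolean accepts(String mention, String antecedent); }

    /**
     * Mentions are assumed sorted by their appearance in the document.
     * The loop starts from the last appearing mention, and candidate
     * antecedents are tried from the closest mention backwards, so the
     * nearer of two similar antecedents is preferred.
     */
    public static void applySieve(List<String> mentions, Sieve sieve) {
        for (int i = mentions.size() - 1; i > 0; i--) {       // last mention first
            for (int j = i - 1; j >= 0; j--) {                // closest antecedent first
                if (sieve.accepts(mentions.get(i), mentions.get(j))) {
                    System.out.printf("link: %s -> %s%n",
                            mentions.get(i), mentions.get(j));
                    break;  // stop at the closest accepted antecedent
                }
            }
        }
    }
}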
Figure 1 Clinical narrative with markables.

Figure 2 MedCoref coreference system architecture. The sieves of the system are horizontally arranged from 1 to 8 (1: right exact match; 2: relative pronoun or abbreviation or synonym; 3: head match and word inclusion and compatible modifier(s); 4: head match and word inclusion; ...). If a sieve detects a relationship, the mentions pass through vicinity and section filters. Set-up C uses a rule-based pronoun sieve as the final step. Set-up A uses the FHMM-based sieve that is unaware of the mention clusters. For set-up B, we merge the chains in the other set-ups. FHMM, factorial hidden Markov model.

Section filter
In general English, if two mentions have the same surface text, more than 95% of the time the mentions corefer.20 However, in clinical narratives, this might not be the case for several reasons. For instance, mentions of a problem or treatment could be related to different persons because of the information recorded in the 'family medical history' section. As such, a non-chronic problem that a patient had previously, or a test undergone previously, as recorded in the 'history of present illness' section does not have a relationship with the current problem or test. Similarly, a treatment in the 'current medications' section need not be related to another one in the 'discharge medications' section. Clinical notes are often divided into sections, or segments, such as 'history of present illness' or 'past medical history.' Those sections can be helpful in identifying coreferred pairs. Intuitively, two mentions from unrelated sections are less likely to corefer.

Vicinity filter
Unlike proper mentions, nominal mentions in the same document could refer to completely different entities, as their primary role is to describe a closer antecedent proper mention. For example, consider the sentences in box 1. The 'pathology' in the second sentence and 'pathology' in the final sentence refer to different tests. Hence, we designed a second filter that rejects relationships if the mentions contain only terms from a stop list compiled by us as part of the MedTagger project (see online supplementary file).
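As a rough illustration of the two exclusion filters: the concrete section-compatibility rules and the MedTagger stop list are in the supplementary material, so everything below, including the example stop terms and the family-history heuristic, is an assumed simplification.

import java.util.Set;

/** Illustrative filters only; the real rules and stop list differ. */
public class ExclusionFilters {
    // Hypothetical stop terms; the actual list comes from the MedTagger project.
    private static final Set<String> STOP_TERMS = Set.of("pathology", "culture");

    /** Section filter sketch: a mention recorded under 'family medical history'
     *  refers to a different person, so it should not link to mentions elsewhere. */
    public static boolean sectionFilterRejects(String sectionA, String sectionB) {
        boolean aFamily = "family medical history".equalsIgnoreCase(sectionA);
        boolean bFamily = "family medical history".equalsIgnoreCase(sectionB);
        return aFamily != bFamily;  // exactly one mention is in family history
    }

    /** Stop-term filter sketch: reject a pair when both mentions consist only
     *  of stop terms, since such nominals describe a nearby antecedent instead. */
    public static boolean stopTermFilterRejects(String mentionA, String mentionB) {
        return STOP_TERMS.contains(mentionA.toLowerCase())
                && STOP_TERMS.contains(mentionB.toLowerCase());
    }
}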
Sieves
Sieve 1 accepts mentions that match exactly when aligned to the right and the antecedent has a higher number of words. Since two mentions with the same name in a clinical document need not corefer, we found it helpful to perform a right-aligned match. This is useful in scenarios such as that shown in box 2, where the first and the last 'echocardiograms' are different.

Sieve 2 accepts a pair when the mention is a relative pronoun that is governed by the antecedent, as detected by rules based on part-of-speech tags (the two mentions immediately follow each other or are separated only by a verb). The part-of-speech tags are assigned by the OpenNLP POS tagger trained for clinical text.34 It also accepts mention pairs where one of them is an abbreviation of another, as detected using the abbreviation list assembled from the Unified Medical Language System (UMLS; version 2011AA) using the tool presented in Liu et al.35 The medical domain favors brevity, so recognizing abbreviations is important for medical text processing.
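Sieve 1's right-aligned match can be pictured as below; this is a sketch under our own naming, and the example strings in main are ours, not from the corpus.

/** Sketch of sieve 1: right-aligned exact match with a longer antecedent. */
public class RightAlignedMatch {
    /** Accepts the pair when every word of the mention matches the
     *  corresponding word of the antecedent counted from the right,
     *  and the antecedent has more words than the mention. */
    public static boolean accepts(String mention, String antecedent) {
        String[] m = mention.trim().split("\\s+");
        String[] a = antecedent.trim().split("\\s+");
        if (a.length <= m.length) return false;      // antecedent must be longer
        for (int k = 1; k <= m.length; k++) {
            if (!a[a.length - k].equalsIgnoreCase(m[m.length - k])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A bare repeat is not linked, but a right-aligned specialization is.
        System.out.println(accepts("echocardiogram", "echocardiogram"));               // false
        System.out.println(accepts("echocardiogram", "transthoracic echocardiogram")); // true
    }
}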
Table 2 Accuracy of the machine learning-based pronoun sieve

System | Beth | Partners | Discharge | Progress
FHMM   | 66%  | 63.5%    | 60%       | 61.5%

Beth, Beth Israel Deaconess Medical Center; FHMM, factorial hidden Markov model; Partners, Partners HealthCare.

However, as shown by the p values measured using the Student t test between each sieve and its successive one in table 3, we conclude that the sieves contribute toward improving the system gradually.
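The significance test between successive sieve configurations could be run along these lines; the library choice (Apache Commons Math) and the per-document scores are our assumptions, since the paper only states that Student t tests were used (table 3).

import org.apache.commons.math3.stat.inference.TTest;

/** Sketch: paired Student t test between two successive sieve configurations. */
public class SieveSignificance {
    public static void main(String[] args) {
        // Hypothetical per-document F scores before and after adding a sieve.
        double[] withoutNewSieve = {0.78, 0.81, 0.75, 0.80, 0.79, 0.77};
        double[] withNewSieve    = {0.80, 0.83, 0.78, 0.82, 0.80, 0.79};
        double p = new TTest().pairedTTest(withoutNewSieve, withNewSieve);
        System.out.printf("p value for the added sieve: %.4f%n", p);
    }
}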
Most metrics capture the notion of correctness through precision (the ratio of true positives among all the system outputs) and the notion of completeness through recall (the ratio of true positives among the total number of positives). An F score represents the overall performance as the harmonic mean of the precision and recall. In the i2b2/VA/Cincinnati challenge on coreference resolution, system performance was measured using MUC,39 B-CUBED, ...
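In symbols, writing TP, FP, and FN for the counts of true positives, false positives, and false negatives:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R} \]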
DISCUSSION
Table 4 shows some of the true positives, false positives, and false negatives of the system.

The examples for true positives illustrate how the rules worked as we intended. Table 4 refers each true positive to the corresponding sieve defined in the Methods section and figure 2. The false positives occur mainly because of the lack of knowledge of semantics. For example, in the first false positive example, the 'baseline creatinine' and 'creatinine' are the same kind of test conducted at different instances and so are not coreferential.
On the other hand, the performance increment after addition of the state-of-the-art machine learning-based pronoun sieve is only 7.7%. We believe that this is because our machine learning-based sieve learns features based on mentions and is unaware of the global properties of the entity (mention cluster) itself. Others, such as Mitkov,46 observe that 'Machine learning algorithms for pronoun resolution do not necessarily perform better than the traditional rule-based approaches.'

While we have used supervised machine learning for clinical information extraction tasks, such as named entity recognition,47 association extraction,31 and drug adverse effect extraction,48 machine learning-based systems are still used sparingly at an enterprise level by Mayo Clinic49 and other organizations, such as the Regenstrief Institute.50 51 Systems trained using supervised machine learning algorithms are often sensitive to the distribution of data, and a model trained on clinical notes from one institution may perform poorly on those from another. For example, Wagholikar et al52 showed recently that a machine learning model for concept extraction trained on the i2b2/VA/Cincinnati corpus achieved a significantly lower F score when tested on the Mayo Clinic corpus. Other researchers recently reported this phenomenon for part-of-speech tagging.53 Such poor performance will then be cascaded to higher-level tasks, such as coreference resolution and semantic role labeling. Besides the inherent challenges pertaining to the peculiar sublanguage, the difficulty in applying machine learning to clinical NLP may be attributed to the difficulty in developing a corpus annotation standard across institutions and use cases, preparing large annotated corpora conforming to the standard, and the limitation of data sharing in the domain. Hence, we chose the hybrid approach, where the deterministic framework allows experts to add rules or modify existing ones while taking advantage of machine learning techniques where possible.

With the i2b2/VA/Cincinnati shared task test corpus, the accuracy of MedCoref remained consistent (F score of 0.84).29 The best system, which is a machine learning system, has an F score of 0.92. The minimum F score for the task was 0.58. Our system scored at the median (exact median F score of 0.85) among the 20 teams that participated, ranking 11th. These results, including ours, are inflated to some degree, since it is not an end-to-end evaluation where named entities, such as treatment and problem, need to be automatically extracted from text. In addition, it would be sound but incomplete to evaluate NLP systems using a rather homogeneous corpus, especially for the case of clinical narratives, which seem to have entirely different characteristics depending on where they originate.54 In practice, the portability and adaptability of a system are important concerns for clinical NLP applications.

There are two other rule-based systems in the competition29 besides ours (Hinote et al and Gooch et al). The rest used rules for preprocessing (Yang et al), for deciding the order of the machine learning components (Rink et al), or used completely supervised approaches (Anick et al, Cai et al, Xu et al, etc). By adding more rules (such as the semantic clues like dates and locations used by Hinote et al, or the Wikipedia abbreviations used by Gooch et al), experts could further improve and locally customize MedCoref. This could readily accommodate using outputs from machine learning systems as well. Our system is flexible in accommodating additional components and integrating different technologies, and it is suitable for practical use.
CONCLUSION
We designed a multi-pass sieve system for coreference resolution in clinical notes. We demonstrated that, using relatively simple rules, basic part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. Pronominal coreference resolution is shown to be more accurate when an entity-centered approach is used rather than a mention-centered approach. The source code of the system described in this paper is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.

Acknowledgments This research was evaluated using the gold standard ...

16. Denis P, Baldridge J. Specialized models and ranking for coreference resolution. Empirical Methods in Natural Language Processing. Honolulu, HI: Association for Computational Linguistics, 2008:660–9.
17. Rahman A, Ng V. Supervised models for coreference resolution. Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2009:968–77.
18. Haghighi A, Klein D. Unsupervised coreference resolution in a nonparametric Bayesian model. Prague, Czech Republic: Association for Computational Linguistics, 2007:848–55.
19. Ng V. Unsupervised models for coreference resolution. Empirical Methods in Natural Language Processing. Waikiki, USA: Association for Computational Linguistics, 2008:640–9.
20. Raghunathan K, Lee H, Rangarajan S, et al. A multi-pass sieve for coreference resolution. Empirical Methods in Natural Language Processing. Sydney, Australia: Association for Computational Linguistics, 2010:492–501.
21. Lee H, Peirsman Y, Chang A, et al. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. CoNLL-2011 Shared Task. Portland, Oregon, USA: Association for Computational Linguistics, 2011:73–9.
45. Denis P, Baldridge J. Joint determination of anaphoricity and coreference resolution using integer programming. North American Chapter of the Association for Computational Linguistics (NAACL). Rochester, NY, USA: Association for Computational Linguistics, 2007:236–43.
46. Mitkov R. Comparing pronoun resolution algorithms. Comput Intell 2007;23:262–97.
47. Jonnalagadda S, Cohen T, Wu S, et al. Enhancing clinical concept extraction with distributional semantics. J Biomed Inform 2012;45:129–40.
48. Sohn S, Kocher JP, Chute CG, et al. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. J Am Med Inform Assoc 2011;18(Suppl 1):i144–9.
49. Chute CG, Beck SA, Fisk TB, et al. The Enterprise Data Trust at Mayo Clinic: a semantically integrated warehouse of biomedical data. J Am Med Inform Assoc 2010;17:131–5.
50. Friedlin J, Overhage M, Al-Haddad MA, et al. Comparing methods for identifying pancreatic cancer patients using electronic data sources. AMIA Annual Symposium Proceedings. Washington DC, USA: AMIA, 2010.
51. Friedlin J, McDonald CJ. A natural language processing system to extract and code concepts relating to congestive heart failure from chest radiology reports. AMIA Annu Symp Proc 2006:269–73.
52. Wagholikar K, Torii M, Jonnalagadda S, et al. Feasibility of pooling annotated corpora for clinical concept extraction. AMIA Clinical Research Informatics Summit. San Francisco, USA: AMIA, 2012.
53. Fan J, Prasad R, Yabut R, et al. Part-of-speech tagging for clinical text: wall or bridge between institutions? AMIA Annu Symp Proc 2011;2011:382–91.
54. Patterson O, Hurdle J. Document clustering of clinical narratives: a systematic study of clinical sublanguages. AMIA Annu Symp Proc 2011:1099–107.