Coref Analysis

This document summarizes a research paper that describes a coreference resolution system submitted for a shared task. The system uses a multi-pass sieve approach that applies rules in order of precision to link mentions of medical entities. It combines a rule-based approach with a supervised machine learning method using factorial hidden Markov models. The system achieved an F1 score of 0.836 on the training set and 0.843 on the test set for linking mentions of problems, treatments, tests, people and pronouns in clinical notes.

Research and applications

Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules
Siddhartha Reddy Jonnalagadda,1 Dingcheng Li,1 Sunghwan Sohn,1
Stephen Tze-Inn Wu,1 Kavishwar Wagholikar,1 Manabu Torii,2 Hongfang Liu1

1 Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
2 Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, USA

Correspondence to Dr Siddhartha Reddy Jonnalagadda, Department of Health Sciences Research, Mayo Clinic, 200 First St SW, Rochester, MI 55905, USA; [email protected]

Received 11 December 2011
Accepted 19 May 2012
Published Online First 16 June 2012

An additional file is published online only. To view this file please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2011-000766).

Downloaded from https://academic.oup.com/jamia/article/19/5/867/719408 by guest on 23 June 2023

ABSTRACT
Objective This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.
Materials and methods The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.
Results The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, BLANC, and CEAF F score) for the training set and 0.843 for the test set.
Discussion A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts.
Conclusion Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.

BACKGROUND AND SIGNIFICANCE
A coreference relation is a relation between mentions referring to the same entity. Coreference resolution is the task of determining coreference relations (whether two mentions refer to the same entity or not). Coreferential expressions are common in clinical narratives,1 and therefore understanding coreference relations plays a critical role in use cases requiring discourse-level analysis of clinical documents, such as compiling a patient profile. Despite interest in coreference resolution in the general English domain, little research has been conducted in the clinical domain. Since the language and description style in clinical documents differs from common English,2 it is necessary to understand the characteristics of clinical text to properly perform coreference resolution.

Zheng et al3 performed a comprehensive methodological review of coreference resolution in general English and argued that it may be possible to apply the same techniques in the clinical domain. They grouped existing methods largely into the following three types:
1. Heuristics-based approaches based on linguistic theories and rules4–8
2. Supervised machine learning approaches with binary classification of markable mention/entity pairs9–15 or classification by ranking markables16 17
3. Unsupervised machine learning approaches, such as non-parametric Bayesian models18 or expectation-maximization clustering.19

In the current work, we employed a multi-pass sieve framework to exploit a heuristic-based approach along with a supervised machine learning method, specifically factorial hidden Markov models (FHMMs). Raghunathan et al20 developed a multi-pass system that applies tiers of resolution models one at a time. Each tier (sieve) consists of similar deterministic rules and builds on outputs of previously applied sieves. Sieves yielding a higher precision are arranged first in the system. Some of the sieves include: pronoun, head match, appositive, and demonym. Lee et al21 applied this system in the general domain for the CoNLL-2011 shared task for coreference analysis and produced the best performance. Their work is based on the 'method of successive approximation' for learning that was successfully used previously for named entity classification,22 23 machine translation,24 and dependency parsing.25

Li et al26 used FHMMs for pronominal anaphora resolution.27 FHMMs are an extension of traditional hidden Markov models,28 where the hidden state at each time step t (ie, word o_t) is expanded to contain more than one random variable. Their state-of-the-art anaphora resolution system uses features, such as part of speech, gender, grammatical number (singular/plural), and concept class. Figure 1 shows Li et al's model, where the hidden states h at each word are factored into three components: coreference features cr, part-of-speech tags pos, and an operation variable op to control reference. This factorization allows the learning of complex hidden states even with limited training data.

The 2011 i2b2/VA/Cincinnati challenge29 focuses on coreferential relations between common, clinically relevant classes in medical text. These classes include problem, treatment, test, person, and pronoun. Coreferring mentions are to be paired together, and the pairs are to be linked to form a chain that represents the entity being referenced. The aim of the challenge is to produce coreferential chains of these mentions at document level (ie, coreference relations are made across paragraphs or sections within the same document, but not across documents).
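To make the two ideas above concrete (sieves applied in decreasing order of precision, and coreferring pairs linked into chains), the following is a minimal Python sketch. It is not the MedCoref implementation, which is written in Java; the names `Chains`, `exact_match`, `head_match`, and `resolve` are hypothetical, the sieves are toy mention-pair predicates, and treating the last token as the head word is an assumption made only for illustration.

```python
from typing import Callable, Dict, List

Mention = List[str]                      # a mention as a list of lowercase tokens
Sieve = Callable[[Mention, Mention], bool]


class Chains:
    """Union-find over mention indices; each set is one coreference chain."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i: int, j: int) -> None:
        root_i, root_j = self.find(i), self.find(j)
        if root_i != root_j:
            # Keep the earliest mention as the representative of the chain.
            self.parent[max(root_i, root_j)] = min(root_i, root_j)

    def groups(self) -> List[List[int]]:
        by_root: Dict[int, List[int]] = {}
        for i in range(len(self.parent)):
            by_root.setdefault(self.find(i), []).append(i)
        return list(by_root.values())


def exact_match(antecedent: Mention, mention: Mention) -> bool:
    return antecedent == mention


def head_match(antecedent: Mention, mention: Mention) -> bool:
    # Toy head finder (assumption): treat the last token as the head word.
    return antecedent[-1] == mention[-1]


def resolve(mentions: List[Mention], sieves: List[Sieve]) -> List[List[int]]:
    chains = Chains(len(mentions))
    for sieve in sieves:                              # highest precision first
        for m in range(len(mentions) - 1, 0, -1):     # last mention first
            for a in range(m - 1, -1, -1):            # closest antecedent first
                if chains.find(a) == chains.find(m):
                    continue                          # linked by an earlier pass
                if sieve(mentions[a], mentions[m]):
                    chains.union(a, m)
                    break
    return chains.groups()


mentions = [
    ["congestive", "heart", "failure"],
    ["the", "failure"],
    ["echocardiogram"],
    ["congestive", "heart", "failure"],
]
print(resolve(mentions, [exact_match, head_match]))  # [[0, 1, 3], [2]]
```

Because every later sieve only adds links to chains built by earlier ones, a lower-precision rule can extend the output of a high-precision rule but never undo it.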

J Am Med Inform Assoc 2012;19:867–874. doi:10.1136/amiajnl-2011-000766


Figure 1 Factorial hidden Markov model (FHMM) coreference resolution model. [Diagram not reproduced. Nodes shown at time steps t-1 and t: h, op, pos, cr, i, g, n, e, and o.]
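One way to see why factoring a hidden state can help when training data are limited is to count transition parameters. The sketch below is only an illustration under strong assumptions: the cardinalities in `sizes` are hypothetical, and the factored count assumes each sub-variable evolves in its own independent chain, which is a simplification and not the dependency structure of Li et al's model.

```python
from math import prod
from typing import Dict


def joint_transition_params(sizes: Dict[str, int]) -> int:
    """Free parameters of one transition table over the joint hidden state."""
    n = prod(sizes.values())      # number of joint states
    return n * (n - 1)            # n rows, each a distribution with n-1 free values


def factored_transition_params(sizes: Dict[str, int]) -> int:
    """Free parameters when each sub-variable has its own independent chain."""
    return sum(k * (k - 1) for k in sizes.values())


# Hypothetical cardinalities for the three components named in the text.
sizes = {"op": 3, "pos": 40, "cr": 50}

print(joint_transition_params(sizes))     # 35994000
print(factored_transition_params(sizes))  # 4016
```

Under these assumed sizes the joint table has several thousand times more parameters to estimate than the factored one, which is the intuition behind learning complex hidden states from few examples.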

This paper describes our coreference resolution system, MedCoref, developed by the Mayo Clinic natural language processing (NLP) program for Track 1C. We developed a multi-pass sieve system in Java along the same lines for clinical notes by adapting the existing sieves and adding additional sieves, and then integrated these sieves with FHMM anaphora resolution. Additionally, we performed a thorough study on pronominal coreference resolution considering the two approaches. The code for our system, MedCoref, is available at https://sourceforge.net/projects/ohnlp/files/MedCoref under the unrestrictive open-source Apache v2 license. This enables hospital systems to use our system, which leverages the benefits of the Stanford coreference resolution system combined with adaptations suitable for clinical narratives, and allows them to adapt the system to their environment.

DATA
The Track 1C data consist of three sets from three different institutions: Partners HealthCare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh. The data from the University of Pittsburgh contain two types of notes: discharge notes and progress notes. All protected health information is fully de-identified. In the training set, gold standard markables and chains are manually annotated. The training set contains a total of 492 notes (Partners: 136, Beth: 115, Pittsburgh: 119 discharge and 122 progress notes) and the test set contains a total of 322 notes (Partners: 94, Beth: 79, Pittsburgh: 77 discharge and 72 progress notes).

METHODS
The markables for coreference analysis could be classified (as per ACE guidelines, see http://projects.ldc.upenn.edu/ace/docs/English-Entities-Guidelines_v6.6.pdf) into proper mentions (proper names), nominal mentions (noun phrase whose head is a common noun), and pronoun mentions. In coreference analysis research20 30 and broader NLP research,22 deterministic hierarchical systems that apply rules in the order of precision are shown to be effective. On the other hand, NLP tasks, such as clinical concept extraction (mention detection), are additionally handled through machine learning approaches.31

Figure 2 shows the system architecture. The eight sieves are analogous to inclusion criteria where at least one of them needs to be satisfied. The two filters are similar to exclusion criteria: if even one of them is matched, the mention pair is not linked. Set-up C uses a rule-based pronoun sieve as the final step. Set-up A uses the FHMM-based sieve that is unaware of the mention clusters. For set-up B, we merge the chains in set-up A and C. In general English, rule-based systems were shown to be the most effective for coreference resolution.20 We investigated whether this is true for clinical narratives, that is, we resolved the coreference in proper mentions and nominal mentions using the initial sieves. The final sieve resolved the pronominal coreference using the information gathered about the entities (clusters of mentions). Alternatively, we used the system of Li et al to resolve pronominal coreference. We not only compared the performance of the pronominal coreference methodologies individually (in addition to the other chains), but we also compared individual methodologies against chains merged from both methods.

Relationship detection order
The different sieves used in the system, according to the order of application, are displayed in figure 2. The mentions in each document are ordered by their appearance. For each sieve, the
coreferential relationships are tested for each pair of mentions starting from the last appearing (probable) mention. For each mention, a probable antecedent is searched for starting from the closest mention. The assumption is that in narrative text, given two antecedents with similar properties, the closer antecedent is more likely to have a coreferential relationship with the mention, since there are fewer intervening words that could disturb the relationship. Such an assumption makes sense for clinical narratives, which form a sublanguage that typically does not contain complex or nested sentences.32

Figure 2 MedCoref coreference system architecture. The sieves of the system are horizontally arranged from 1 to 8. If a sieve detects a relationship, the mentions pass through vicinity and section filters. Set-up C uses a rule-based pronoun sieve as the final step. Set-up A uses the FHMM-based sieve that is unaware of the mention clusters. For set-up B, we merge the chains in the other set-ups. FHMM, factorial hidden Markov model.
[Diagram not reproduced. Input: clinical narrative with markables. Sieves, each feeding the section filter and the vicinity filter and contributing to the information about entities:
1. Right exact match
2. Relative pronoun or abbreviation or synonym
3. Head match and word inclusion and compatible modifier(s)
4. Head match and word inclusion
5. Head match and compatible modifier(s)
6. Relaxed head match and word inclusion
7. (Stemmed head match and bag of stemmed words match) OR related words match
8a. FHMM-based pronoun sieve (set-up A)
8b. Rule-based pronoun sieve (set-up C)
Set-up B combines 8a and 8b.]

Section filter
In general English, if two mentions have the same surface text, more than 95% of the time the mentions corefer.20 However, in clinical narratives, this might not be the case for several reasons. For instance, mentions of a problem or treatment could be related to different persons because of the information recorded in the 'family medical history' section. Likewise, a non-chronic problem that a patient had previously, or a test undergone previously, as recorded in the 'history of present illness' section does not have a relationship with the current problem or test. Similarly, a treatment in the 'current medications' section need not be related to another one in the 'discharge medications' section. Clinical notes are often divided into sections, or segments, such as 'history of present illness' or 'past medical history.' Those sections can be helpful in identifying coreferred pairs. Intuitively, two mentions associated with the same term appearing in two sections, 'history of present illness' and 'diagnosis,' have a higher probability of being a coreferred pair than two mentions associated with 'family history' and 'diagnosis.' We adapted SecTag developed by Denny et al33 to associate each sentence in the clinical notes to section headers. The sections that the mentions belong to, the class of the mention (eg, problem, treatment, etc), and a list of chronic problems are used to create the rule-based filter that rejects the relationships detected by sieves. The rules are part of the open-source code and some of them are represented in table 1.

Vicinity filter
Unlike proper mentions, nominal mentions in the same document could refer to completely different entities, as their primary role is to describe a closer antecedent proper mention. For example, consider the sentences in box 1. The 'pathology' in the second sentence and 'pathology' in the final sentence refer to different tests. Hence, we designed a second filter that rejects relationships if the mentions contain only terms from a list of stop terms that we compiled as part of the MedTagger project (see online supplementary file).

Sieves
Sieve 1 accepts mentions that match exactly when aligned to the right and the antecedent has a higher number of words. Since two mentions with the same name in a clinical document need not corefer, we found it helpful to perform a right-aligned match. This will be useful in scenarios, such as that shown in box 2 where the first and the last 'echocardiograms' are different.
Sieve 2 accepts a pair when the mention is a relative pronoun that is governed by the antecedent as detected by rules based on part-of-speech tags (the two are adjacent or separated only by a verb). The part-of-speech tags are assigned by the OpenNLP POS tagger trained for clinical text.34 It also accepts mention pairs where one of them is an abbreviation of another as detected using the abbreviation list assembled from the Unified Medical Language System (UMLS; version 2011AA) using the tool presented in Liu et al.35 The medical domain favors brevity. Recognizing abbreviations is important for medical

Table 1 Sample rules used in section filter

Section 1                            Section 2                            Entity type  Coref
Laboratory_data|6.41.144             assessment_and_plan|13               Treatment    NO
Procedures|5.23                      discharge_medications|5.37.106.125   Treatment    NO
History_present_illness|5.28         discharge_medications|5.37.106.125   Problem      NO
History_present_illness|5.28         laboratory_data|6.41.144             Test         NO
History_present_illness|5.28         disposition_plan|13.51.156.278       Person       NO
Personal_and_social_history|5.34.78  hospital_course|5.32                 Person       YES
Hospital_course|5.32                 discharge_instructions|8.42          Person       YES
Principal_procedures|5.23.51         hospital_course|5.32                 Test         YES
Discharge_diagnosis|5.22.45          discharge_medications|5.37.106.125   Treatment    YES
Discharge_diagnosis|5.22.45          hospital_course|5.32                 Problem      YES

A mention of type 'Entity type' in Section 1 has no coreference relationship with another mention of the same name in Section 2 if Coref is NO.
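The rules in table 1 can be read as a lookup keyed on the two section names and the entity type. The following Python sketch shows one possible encoding. It is not the MedCoref code: the function name `section_filter_allows` is hypothetical, the numeric suffixes after `|` in the section names are omitted because their meaning is not described here, the rules are treated as ordered pairs exactly as listed, and letting a pair through when no rule applies is an assumption.

```python
from typing import Dict, Tuple

# (section 1, section 2, entity type) -> may the pair corefer?
# Entries are copied from table 1; section names are lowercased.
SECTION_RULES: Dict[Tuple[str, str, str], bool] = {
    ("laboratory_data", "assessment_and_plan", "treatment"): False,
    ("procedures", "discharge_medications", "treatment"): False,
    ("history_present_illness", "discharge_medications", "problem"): False,
    ("history_present_illness", "laboratory_data", "test"): False,
    ("history_present_illness", "disposition_plan", "person"): False,
    ("personal_and_social_history", "hospital_course", "person"): True,
    ("hospital_course", "discharge_instructions", "person"): True,
    ("principal_procedures", "hospital_course", "test"): True,
    ("discharge_diagnosis", "discharge_medications", "treatment"): True,
    ("discharge_diagnosis", "hospital_course", "problem"): True,
}


def section_filter_allows(section_1: str, section_2: str, entity_type: str,
                          default: bool = True) -> bool:
    """Return False when a rule rejects the link between same-named mentions.

    Pairs not covered by any rule fall back to `default` (an assumption).
    """
    key = (section_1.lower(), section_2.lower(), entity_type.lower())
    return SECTION_RULES.get(key, default)


print(section_filter_allows("History_present_illness", "laboratory_data", "Test"))  # False
print(section_filter_allows("Hospital_course", "discharge_instructions", "Person"))  # True
```

A sieve that proposes a link would consult this filter before the link is accepted, so a NO rule overrides any sieve.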

language processing and information retrieval systems. Expanding acronyms to their full names can be helpful in coreference resolution. For example, detecting a coreferred pair, such as 'CHF' and 'the failure' is more difficult than detecting a coreferred pair such as 'congestive heart failure' and 'the failure' through language processing. If we expand an abbreviation to its corresponding full name, its coreferred mentions, if any, can be detected using traditional coreference techniques.

Sieves 3 through 6 are implemented in the same fashion as by Raghunathan et al.20 They take into account: (a) head match: whether the head of the mention matches the head of one of the mentions in the antecedent entity; (b) compatible modifiers: whether all the noun and adjective modifiers of the mention are present in a single mention of the antecedent entity; and (c) word inclusion: whether all the words in the mention are present among the words in the mentions from the antecedent entity. Sieve 3 requires all the above three conditions to match. Sieve 4 uses head match and word inclusion. Sieve 5 uses head match and compatible modifiers. Sieve 6 uses word inclusion together with a relaxed head match, where the head of the mention is present in any part of the mentions in the antecedent entity.

For sieves 2 and 7, we used synonyms and other relationships extracted from the UMLS. From each document, we extracted three sets of mentions by semantic type: tests, problems, and treatments. Each mention was mapped to a varying number of concept unique identifiers (CUIs) by MedTagger.36 Next, all CUIs for a mention were connected to the CUIs for each in-set mention, using the UMLS MRREL table. We only considered relationships of types: synonym (for sieve 2), and parent–child and narrow–broad (sieve 7).

Sieve 7 uses the Porter Stemmer algorithm to stem words constituting the mentions, and closed-class words, such as prepositions and articles, are dropped. The mention pair is accepted as a coreferred pair if (1) the stems of the headwords are the same, and (2) the remaining stemmed words in one of the mentions are all in the other mention. For example, the mention 'shortness of breathing' (stemmed to 'short breath') is mapped to the antecedent mention 'very short breath' (stemmed to 'very short breath'). Such an approach had been used previously in the general domain by Yang et al37 and Zitouni et al.38 Stemming headwords is important because some medical terms refer to the same thing although they have different forms. The second criterion is based on the assertion that the modifiers of a noun phrase (eg, adjectives, prepositions, numbers, possessives, proper nouns, non-finites, and quantifiers) carry important information for coreference resolution. Two noun phrases with the same head string may refer to distinct entities if their modifiers do not match.

Sieve 8 for set-up C is a rule-based pronoun sieve. As shown in figure 2, the seven sieves that were applied prior to this sieve collect information about entities referred by the mentions in the same document. Based on the collected information pertaining to the entity's grammatical number (singular/plural depending on the pronouns already in the mentions in the entity and part-of-speech tags), gender (male/female depending on the pronouns in the entity and the markable class), and animacy (person/object based on the pronouns and markable type), each pronoun is assigned to an antecedent entity when each of these features match.

FHMM is sieve 8 for set-up A. Three adjustments to Li et al's model22 were made for the i2b2/VA/Cincinnati shared task. First, due to speed concerns, we do not incrementally copy cr (coreference) features for words o_t that are not mentions. This modification would have little impact because, as described in Li et al's paper, in a first-order HMM employed in the current system, neighboring words that are separated by more than one word are assumed independent. A corollary to this first adjustment is that the dependencies between neighboring words no longer exist; only those between neighboring mentions are kept. The model described in figure 1 is still applicable to the current system, where the t-th observation o_t is now a mention rather than a word.

Second, we divide pronouns (as represented in the pos (part-of-speech) model) from the OpenNLP POS tagger into finer categories, such as non-personal pronouns or first/second/third-person pronouns. Training the pronoun resolution model at this granularity yields intuitive empirical information that the system may make use of, based on the discourse context32 of 'clinical notes.' Second-person pronouns refer to the patient in most cases. First-person pronouns are ambiguous and refer to either the patient or the care provider.

Third, in the original model, an entity e (a subvariable of cr) was a named entity from some recognition algorithm, such as the input concepts given in Track 1C. This is not useful given the first adjustment, and e is therefore modified to represent the mention's concept type (ie, problem, treatment, test, and person).

Box 1 Example for vicinity filter
Patient underwent a total abdominal hysterectomy in 02/90 for a 4x3.6x2 cm cervical mass felt to be a fibroid at Vanor.
Pathology revealed poorly differentiated squamous cell carcinoma of the cervix with spots of vaginal margins and metastatic squamous cell carcinoma in the cardinal ligaments with extensive lymphatic invasion.
.
She underwent exploratory laparotomy and had a bilateral salpingo-oophorectomy and appendectomy.
Pathology was negative for tumor and showed peritubal and periovarian adhesions.

Box 2 Example for right exact match sieve
Echocardiogram showed moderate anterior pericardial effusion of approximately 600 cc with diastolic indications of the right ventricle and low velocity paradox.
.
She had a follow-up echocardiogram.
Echocardiogram showed left ventricle at the upper limits of normal for size, low normal function, moderate to mild effusion with pericardial pressures exceeding right atrial pressures, and right ventricular pressures at various points of patient's cycle without any change in the effusion from 06/11.

Evaluation metrics
A mention pair identified as belonging to the same entity is a true positive when that is confirmed by the gold standard; otherwise, it is a false positive. When a mention pair that belongs to the same entity as per the gold standard is not linked by the system, it is a false negative. A true positive or false positive occurs when at least one sieve detects the link, and vice
versa. Most metrics capture the notion of correctness through precision (the ratio of true positives among all the system outputs) and the notion of completeness through recall (the ratio of true positives among the total number of positives). An F score represents the overall performance as a harmonic mean of the precision and recall.

In the i2b2/VA/Cincinnati challenge on coreference resolution, system performance was measured using MUC,39 B-CUBED (B3),40 entity-based CEAF,41 and BLANC,42 similar to SEMEVAL-201043 and CoNLL-2011.44 For official evaluation, an unweighted average of B-CUBED, CEAF, and MUC, that is, MELA (mention, entity, and link average45), was used, without including BLANC. Scores of the three metrics were averaged with equal weights, in the same manner as CoNLL-2011.

The reported performance measures were calculated using the Python script provided by the challenge organizers.

RESULTS
To evaluate performance of the machine learning-based pronoun sieve on each of the four training corpus parts in Track 1C, FHMM probability models were trained on the other three corpus parts using relative frequency estimation. For the test corpus, we combined all four training models.

The FHMM was evaluated using the ratio of the number of correctly resolved relationships over the total number of relationships, consistent with Li et al27 and Ge et al.13 Table 2 shows the accuracy results on the training corpus.

Table 2 Accuracy of the machine learning-based pronoun sieve

System  Beth  Partners  Discharge  Progress
FHMM    66%   63.5%     60%        61.5%

Beth, Beth Israel Deaconess Medical Center; FHMM, factorial hidden Markov model; Partners, Partners HealthCare.

The performance as per the evaluation metrics defined above of the different sieves and set-ups on the development set are shown in table 3.

After the initial right exact-match sieve, the recalls (for all metrics) gradually increased, with a slight decrease in precision for proper and nominal mentions (sieve 2 contains relative pronouns). Altogether, the average F score does not increase substantially. Hence, we might conclude that sieves 4–7, which are modified versions of sieve 3 (head match), are not effective compared to other sieves. This is consistent with Raghunathan et al's system,20 where the MUC F score (that first uses non-complete match) increased by 1.2% between sieve 3 and sieve 8. However, as shown by the p values measured using the Student t test between each sieve and its successive one in table 3, we conclude that the sieves contribute toward improving the system gradually.

DISCUSSION
Table 4 shows some of the true positives, false positives, and false negatives of the system.

The examples for true positives illustrate how the rules worked as we intended. Table 4 refers each true positive to the corresponding sieve defined in the Methods section and figure 2. The false positives occur mainly because of the lack of knowledge of semantics. For example, in the first false positive example, the 'baseline creatinine' and 'creatinine' are the same kind of tests conducted at different instances and so are considered different as per the definition of this task. Within our framework, it is possible to create a new filter that rejects such mention pairs based on the domain knowledge that a baseline test is different from a test conducted at a later instance. The false positives also occur because of the insufficient gathering of the context. For example, in the last example of the true positives, the mention 'he' is compatible with the mention 'the patient' and hence they are linked. In the last example of the false positives, although the mentions 'this' and 'his prealbumin' are grammatically compatible, the second mention refers to time and the first mention refers to test. If an aggregate system extracted the named entities of type time (in addition to test), this situation would be resolved.

The order of the sieves was adapted from the Stanford coreference system for general English. When we added a few completely novel sieves such as using UMLS relationships other than exact synonyms, stemming, and bag of words match, we added them at the end (right before pronominal resolution). However, the effect of altering the order could be investigated independently in the future.

The performance of the machine learning-based pronoun sieve is 10% less than the corresponding performance for general English.27 This might be attributed to the distinguishing features of the clinical notes,32 and an improvement in this performance might need the addition of features specific to these notes. Such features would take into account the differences between the various semantic types, medical specialties, and types of notes.

The purpose of the initial sieves is to gather global information about the entities in the document. After addition of the rule-based pronoun sieve, the average F score increases by 10.7 percentage points (from 0.729 to 0.836 in table 3).
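The rule-based pronoun sieve relies on exactly this gathered information: a pronoun is attached to an antecedent entity only when grammatical number, gender, and animacy all match. The sketch below illustrates that check in Python. It is not the MedCoref code: the `Entity` fields, the small `PRONOUN_FEATURES` lexicon, and the tie-break that prefers the most recently mentioned compatible entity are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    name: str
    number: str         # "singular" or "plural"
    gender: str         # "male", "female", or "neutral"
    animacy: str        # "person" or "object"
    last_position: int  # token offset of the entity's most recent mention


# Hypothetical pronoun lexicon: pronoun -> (number, gender, animacy).
PRONOUN_FEATURES = {
    "he": ("singular", "male", "person"),
    "she": ("singular", "female", "person"),
    "it": ("singular", "neutral", "object"),
    "these": ("plural", "neutral", "object"),
}


def resolve_pronoun(pronoun: str, position: int,
                    entities: List[Entity]) -> Optional[Entity]:
    """Return the closest preceding entity whose three features all match."""
    features = PRONOUN_FEATURES.get(pronoun.lower())
    if features is None:
        return None
    candidates = [
        e for e in entities
        if e.last_position < position
        and (e.number, e.gender, e.animacy) == features
    ]
    # Assumption: among compatible entities, prefer the most recent one.
    return max(candidates, key=lambda e: e.last_position, default=None)


entities = [
    Entity("the patient", "singular", "male", "person", 3),
    Entity("his daughter", "singular", "female", "person", 10),
    Entity("the effusion", "singular", "neutral", "object", 15),
]
print(resolve_pronoun("he", 20, entities).name)  # the patient
print(resolve_pronoun("it", 20, entities).name)  # the effusion
print(resolve_pronoun("she", 5, entities))       # None
```

The better the earlier sieves populate each entity's attributes, the fewer candidates survive this check, which is one reading of why the pronoun sieve benefits from running last.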

Table 3 Cumulative performance on development set as sieves are added

Sieves                                 B3 P|R|F           MUC P|R|F          BLANC P|R|F        CEAF P|R|F         Average of F scores  p Value
[1]                                    0.869|0.909|0.889  0.726|0.335|0.458  0.931|0.559|0.605  0.868|0.654|0.746  0.698
[1, 2]                                 0.865|0.916|0.89   0.726|0.366|0.487  0.927|0.561|0.607  0.865|0.665|0.752  0.710                <10^-6
[1, 2, 3]                              0.863|0.923|0.892  0.730|0.412|0.527  0.926|0.567|0.617  0.861|0.681|0.760  0.726                <10^-6
[1, 2, 3, 4]                           0.863|0.923|0.892  0.730|0.412|0.527  0.926|0.567|0.617  0.861|0.681|0.760  0.726                0.03
[1, 2, 3, 4, 5]                        0.856|0.927|0.890  0.705|0.436|0.539  0.908|0.570|0.619  0.843|0.684|0.755  0.728                <10^-6
[1, 2, 3, 4, 5, 6]                     0.853|0.929|0.889  0.701|0.443|0.543  0.906|0.570|0.620  0.839|0.686|0.755  0.729                <10^-6
[1, 2, 3, 4, 5, 6, 7]                  0.852|0.930|0.889  0.696|0.447|0.545  0.903|0.570|0.620  0.836|0.686|0.754  0.729                <10^-6
Set-up A=[1, 2, 3, 4, 5, 6, 7, 8a]     0.883|0.930|0.906  0.691|0.716|0.703  0.884|0.690|0.754  0.801|0.814|0.808  0.806                <10^-6
Set-up B=[1, 2, 3, 4, 5, 6, 7, 8a+8b]  0.874|0.933|0.903  0.693|0.789|0.738  0.856|0.889|0.872  0.770|0.843|0.805  0.815                <10^-6
Set-up C=[1, 2, 3, 4, 5, 6, 7, 8b]     0.900|0.936|0.918  0.739|0.798|0.767  0.937|0.808|0.862  0.802|0.843|0.822  0.836                <10^-6

The sieve numbers [1, 2, 3, 4, 5, 6, 7, 8a, 8b] are defined in figure 2.
Bold indicates that row corresponds to the best system for the metric.
The final F scores were averaged over all measurements. The p value was based on the Student t test comparing the mention pairs in the current sieve with those in the previous sieve.
F, F score; P, precision; R, recall.
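As a worked check on table 3, the short Python sketch below recomputes two of its quantities. The function names are hypothetical. It assumes that the 'Average of F scores' column is the unweighted mean of the B3, MUC, and CEAF F scores with BLANC left out, as the official evaluation is described; this assumption reproduces the printed values for the rows checked here.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f(b3_f: float, muc_f: float, ceaf_f: float) -> float:
    """Unweighted mean of the three F scores (BLANC excluded)."""
    return (b3_f + muc_f + ceaf_f) / 3


# MUC for sieve [1]: P = 0.726, R = 0.335
print(round(f_score(0.726, 0.335), 3))         # 0.458

# Row [1]: B3 F = 0.889, MUC F = 0.458, CEAF F = 0.746
print(round(average_f(0.889, 0.458, 0.746), 3))  # 0.698

# Set-up C: B3 F = 0.918, MUC F = 0.767, CEAF F = 0.822
print(round(average_f(0.918, 0.767, 0.822), 3))  # 0.836
```

Including the BLANC F score in the mean would give 0.842 for set-up C rather than the tabulated 0.836, so the three-metric reading is the one consistent with the table.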

Table 4 Example outputs of the system


Each entry lists Sentence 1, Sentence 2, and the sieve that produced the link (no sieve is listed for false negatives).

True positives

Sieve 1
Sentence 1: The diagnosis, therefore, was relapsed C difficile colitis.
Sentence 2: An abdominal CAT scan revealed thickened bowel wall and thumb printing, primarily involving the cecum and right colon greater than the left, consistent with C difficile colitis.

Sieve 2
Sentence 1: 2. Chronic pleural effusion.
Sentence 2: Briefly, the patient has a history of chronic obstructive pulmonary disease, ethanol abuse, chronic pleural effusions, and chronic renal insufficiency.

Sieve 3
Sentence 1: The patient is an 85-year-old white male with a history of ischemic bowel status post recent admission for urosepsis and C difficile colitis.
Sentence 2: With intravenous hydration the BUN and creatinine fell to 12/1.9 which is within normal limits for this patient.

Sieve 5
Sentence 1: 1) Serratia urosepsis treated with ceftizoxime.
Sentence 2: Initially treated with intravenous ceftizoxime, gentamicin, and Flagyl for presumed sepsis, either with urine or bowel source.

Sieve 6
Sentence 1: Patient is a 28 year old gravida IV, para 2 with metastatic cervical cancer admitted with a question of malignant pericardial effusion.
Sentence 2: Given the patient’s history of cervical cancer, the pericardial effusion was felt most likely to be malignant.

Sieve 8
Sentence 1: The patient was alert and oriented throughout the admission; however, by personality, he is somewhat cantankerous and demanding of the nurses.
Sentence 2: The patient was alert and oriented throughout the admission; however, by personality, he is somewhat cantankerous and demanding of the nurses.

False positives

Sieve 1
Sentence 1: The patient has chronic renal insufficiency with baseline creatinine 1.8-2.
Sentence 2: Creatinine had risen to 4.3 on admission presumed secondary to sepsis and dehydration.

Sieve 5
Sentence 1: The patient is an 85-year-old white male with a history of ischemic bowel status post recent admission for urosepsis and C difficile colitis.
Sentence 2: The patient has a history of ischemic bowel status post SMA Percutaneous Transluminal Coronary Angioplasty with recent admission for gram negative rod urosepsis complicated by C difficile colitis.

Sieve 6
Sentence 1: The patient had a PICC line placed and will continue a 8 week course of antibiotics.
Sentence 2: Continue to take the antibiotics as directed.

Sieve 6
Sentence 1: 18. Heparin Lock Flush (Porcine) 100 unit/ml Syringe Sig: Two (2) ML Intravenous DAILY (Daily) as needed: 10 ml NS followed by 2 ml of 100 Units/ml heparin (200 units heparin) each lumen Daily and PRN.
Sentence 2: 18. Heparin Lock Flush (Porcine) 100 unit/ml Syringe Sig: Two (2) Ml Intravenous DAILY (Daily) as needed: 10 ml NS followed by 2 ml of 100 Units/ml heparin (200 units heparin) each lumen Daily and PRN.

Sieve 8
Sentence 1: His prealbumin is up slightly from last week’s level of <7 to 11 this week.
Sentence 2: His prealbumin is up slightly from last week’s level of <7 to 11 this week.

False negatives

Sentence 1: 2. Chronic pleural effusion.
Sentence 2: 3) Loculated pleural effusions.

Sentence 1: 1. Colitis.
Sentence 2: The patient has a history of ischemic bowel status post SMA Percutaneous Transluminal Coronary Angioplasty with recent admission for gram negative rod urosepsis complicated by C difficile colitis.

Sentence 1: She also received Cisplatin 35 per meter squared on 06/19 and Ifex and Mesna on 06/18.
Sentence 2: She continued to have no change in her shortness of breath or cardiac examination and was discharged home on 06/22/91 after completing her 5-FU and Cisplatin chemotherapy.
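
Table 4 attributes each link to the sieve that produced it. To illustrate how passes applied one after another can build up entity clusters, here is a minimal Python sketch. It is not the MedCoref implementation: the two rules (`exact_match` and `plural_match`) are hypothetical stand-ins and do not correspond to the sieves defined in figure 2, and the greedy merge into the first matching cluster is a simplifying assumption.

```python
from typing import Callable, List

Cluster = List[str]
# A sieve decides whether two clusters should be merged into one entity.
Sieve = Callable[[Cluster, Cluster], bool]


def exact_match(a: Cluster, b: Cluster) -> bool:
    """Hypothetical strict rule: some pair of mentions is identical, ignoring case."""
    return any(x.lower() == y.lower() for x in a for y in b)


def plural_match(a: Cluster, b: Cluster) -> bool:
    """Hypothetical looser rule: identical after dropping trailing 's' characters."""
    def norm(mention: str) -> str:
        return mention.lower().rstrip("s")
    return any(norm(x) == norm(y) for x in a for y in b)


def run_sieves(mentions: List[str], sieves: List[Sieve]) -> List[Cluster]:
    """Apply sieves in the given order; each pass sees the clusters built so far."""
    clusters: List[Cluster] = [[m] for m in mentions]
    for sieve in sieves:
        merged: List[Cluster] = []
        for cluster in clusters:
            for target in merged:
                if sieve(target, cluster):
                    # Greedy: join the first earlier cluster that matches.
                    target.extend(cluster)
                    break
            else:
                merged.append(cluster)
        clusters = merged
    return clusters


mentions = [
    "C difficile colitis",
    "Chronic pleural effusion",
    "C difficile colitis",
    "chronic pleural effusions",
]
for cluster in run_sieves(mentions, [exact_match, plural_match]):
    print(cluster)
```

Because every pass receives whole clusters rather than isolated mention pairs, a later rule can consult everything already known about an entity, and adding or reordering a rule only means editing the list passed to `run_sieves`.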

On the other hand, the performance increment after addition of the state-of-the-art machine learning-based pronoun sieve is only 7.7%. We believe that this is because our machine learning-based sieve learns features based on mentions and is unaware of the global properties of the entity (mention cluster) itself. Others such as Mitkov46 observe, ‘Machine learning algorithms for pronoun resolution do not necessarily perform better than the traditional rule-based approaches.’

While we have used supervised machine learning for clinical information extraction tasks, such as named entity recognition,47 association extraction,31 and drug adverse effect extraction,48 machine learning-based systems are still used sparingly at an enterprise level by Mayo Clinic49 and other organizations, such as Regenstrief Institute.50 51 Systems trained using supervised machine learning algorithms are often sensitive to the distribution of data, and a model trained on clinical notes from an institution may perform poorly on those from another. For example, Wagholikar et al52 showed recently that a machine learning model for concept extraction trained on the i2b2/VA/Cincinnati corpus achieved a significantly lower F score when tested on the Mayo Clinic corpus. Other researchers recently reported this phenomenon for part-of-speech tagging.53 Such poor performances will then be cascaded to higher-level tasks, such as coreference resolution and semantic role labeling. Besides the inherent challenges pertaining to the peculiar sublanguage, the difficulty in applying machine learning to clinical NLP may be attributed to the difficulty in developing a corpus annotation standard across institutions and use cases, preparing large annotated corpora conforming to the standard, and limitation of data sharing in the domain. Hence, we chose the hybrid approach, where the deterministic framework allows experts to add rules or modify existing ones while taking advantage of machine learning techniques where possible.

With the i2b2/VA/Cincinnati shared task test corpus, the accuracy of MedCoref remained consistent (F score of 0.84).29 The best system, which is a machine learning system, has an F score of 0.92. The minimum F score for the task was 0.58. Our system scored at the median (exact median F score of 0.85) among the 20 teams that participated and ranked at 11. These results, including ours, are inflated to some degree, since it is not an end-to-end evaluation where named entities, such as treatment and problem, need to be automatically extracted from text. In addition, it would be sound but incomplete to evaluate NLP systems using a rather homogeneous corpus, especially for the case of clinical narratives that seem to have entirely different characteristics depending on where they originate.54 In practice, the portability and adaptability of a system is an important concern for clinical NLP applications.

There are two other rule-based systems in the competition29 besides ours (Hinote et al and Gooch et al). The rest used rules for preprocessing (Yang et al), deciding the order of the machine learning components (Rink et al) or used completely supervised approaches (Anick et al, Cai et al, Xu et al, etc). Experts, through adding more rules (such as semantic clues like dates and locations by Hinote et al and Wikipedia abbreviations by Gooch et al), could further improve and locally customize MedCoref. This could readily accommodate using outputs from machine learning systems as well. Our system is flexible in
accommodating additional components and integrating different technologies and it is suitable for practical use.

CONCLUSION
We designed a multi-pass sieve system for coreference resolution in clinical notes. We demonstrated that, using relatively simple rules, basic part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. Pronominal coreference resolution is shown to be more accurate when an entity-centered approach is used rather than a mention-centered approach. The source code of the system described in this paper is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.

Acknowledgments This research was evaluated using the gold standard developed as part of the 2011 i2b2/VA/Cincinnati challenge.

Contributors The seven authors are justifiably credited with authorship, according to the authorship criteria. SJ, HL: conception, design, development, analysis and interpretation of data, drafting of the manuscript, final approval given; DL: acquisition of data, analysis and interpretation of data, final approval given; SS, SW, KW: development, critical revision of the manuscript, final approval given; MT: critical revision of the manuscript, final approval given.

Funding This work was funded by National Science Foundation ABI:0845523 and National Institute of Health R01LM009959A1 to Dr Hongfang Liu.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.

Data sharing statement The code and the accompanying data are available as open-source at https://sourceforge.net/projects/ohnlp/files/MedCoref.

REFERENCES
1. Chapman WW, Savova GK, Zheng J, et al. Anaphoric reference in clinical reports: characteristics of an annotated corpus. J Biomed Inform. Published Online First: 9 February 2012 doi:10.1016/j.jbi.2012.01.010
2. Coden A, Pakhomov S, Ando R, et al. Domain-specific language models and lexicons for tagging. J Biomed Inform 2005;38:422-30.
3. Zheng J, Chapman WW, Crowley RS, et al. A review of general methodologies and applications in the clinical domain. J Biomed Inform 2011;44:1113-22.
4. Sidner CL. Focusing for interpretation of pronouns. Comput Ling 1981;7:217-31.
5. Rich E, LuperFoy S. An Architecture for Anaphora Resolution. Proceedings of the Second Conference on Applied Natural Language Processing 1988. Austin, Texas: Association for Computational Linguistics, 1988:18-24.
6. Lappin S, Leass HJ. An algorithm for pronominal anaphora resolution. Comput Ling 1994;20:535-61.
7. Kennedy C, Boguraev B. Anaphora for Everyone: Pronominal Anaphora Resolution Without a Parser. Proceedings of the 16th International Conference on Computational Linguistics. Association for Computational Linguistics, 1996:113-18.
8. Castano J, Zhang J, Pustejovsky J. Anaphora resolution in biomedical literature. Proceedings of the International Symposium on Reference Resolution for NLP 2002. Alicante, Spain: Universidad de Alicante, 2002.
9. Soon WM, Ng HT, Lim DC. A machine learning approach to coreference resolution of noun phrases. Comput Ling 2001;27:521-44.
10. Yang X, Zhou G, Su J, et al. Coreference Resolution Using Competition Learning Approach. The 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan: Association for Computational Linguistics, 2003:176-83.
11. Yang X, Su J, Tan CL. Kernel-based Pronoun Resolution with Structured Syntactic Knowledge. The 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia: Association for Computational Linguistics, 2006:41-8.
12. Culotta A, Wick M, Hall R, et al. First-order Probabilistic Models for Coreference Resolution. Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT/NAACL). Rochester, NY: Association for Computational Linguistics, 2007:81-8.
13. Ge N, Hale J, Charniak E. A statistical approach to anaphora resolution. The Sixth Workshop on Very Large Corpora. Montreal, Canada: Association for Computational Linguistics, 1998:161-70.
14. Yang X, Su J, Lang J, et al. An entity-mention model for coreference resolution with inductive logic programming. Proceedings of ACL-08: HLT. Columbus, USA: Association for Computational Linguistics, 2008:843-51.
15. Nicolae C, Nicolae G. Bestcut: a Graph Algorithm for Coreference Resolution. Empirical Methods in Natural Language Processing. Sydney, Australia: Association for Computational Linguistics, 2006:275-83.
16. Denis P, Baldridge J. Specialized Models and Ranking for Coreference Resolution. Empirical Methods in Natural Language Processing. Honolulu, HI: Association for Computational Linguistics, 2008:660-9.
17. Rahman A, Ng V. Supervised models for coreference resolution. Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2009:968-77.
18. Haghighi A, Klein D. Unsupervised Coreference Resolution in a Nonparametric Bayesian Model. Prague, Czech Republic: Association of computational linguistics, 2007:848-55.
19. Ng V. Unsupervised models for coreference resolution. Empirical methods in natural language processing. Waikiki, USA: Association for Computational Linguistics, 2008:640-9.
20. Raghunathan K, Lee H, Rangarajan S, et al. A multi-pass sieve for coreference resolution. Empirical Methods Natural Language Processing. Proceedings of the Association for Computational Linguistics. Sydney, Australia: Association for Computational Linguistics, 2010:492-501.
21. Lee H, Peirsman Y, Chang A, et al. Stanford’s Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. CoNLL-2011 Shared Task, 2011. Portland, Oregon, USA: Association for Computational Linguistics, 2011:73-9.
22. Jonnalagadda SR, Topham P. NEMO: extraction and normalization of organization names from PubMed affiliations. J Biomed Discov Collab 2010;5:50-75.
23. Collins M, Singer Y. Unsupervised models for named entity classification. Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. College Park, USA: Association for Computational Linguistics, 1999:189-96.
24. Brown PF, Pietra VJD, Pietra SAD, et al. The mathematics of statistical machine translation: Parameter estimation. Comput Ling 1993;19:263-311.
25. Spitkovsky VI, Alshawi H, Jurafsky D. From baby steps to Leapfrog: how Less is More in unsupervised dependency parsing. North American Association for Computational Linguistics. Los Angeles, USA: Association for Computational Linguistics, 2010:751-9.
26. Li D, Miller T, Schuler W. A pronoun anaphora resolution system based on factorial hidden markov models. Proceedings of the Association for Computational Linguistics. Portland, USA: Association for Computational Linguistics, 2011.
27. Ghahramani Z, Jordan MI. Factorial hidden markov models. Machine Learn 1997;29:1-31.
28. Eddy SR. Hidden markov models. Curr Opin Struct Biol 1996;6:361-5.
29. Uzuner O, Bodnari A, Shen S, et al. Evaluating the state of the art in coreference resolution for electronic medical records. J Am Med Inform Assoc 2012;19:786-91.
30. Baldwin B. CogNIAC: high precision coreference with limited knowledge and linguistic resources. Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts. Madrid, Spain: Association for Computational Linguistics, 1997.
31. Jonnalagadda S. An Effective Approach to Biomedical Information Extraction with Limited Training Data [PhD]. Phoenix: Arizona State University, 2011.
32. Friedman C, Kra P, Rzhetsky A. Two biomedical sublanguages: a description based on the theories of Zellig Harris. J Biomed Inform 2002;35:222-35.
33. Denny JC, Spickard A 3rd, Johnson KB, et al. Evaluation of a method to Identify and Categorize section Headers in clinical documents. J Am Med Inform Assoc 2009;16:806-15.
34. OpenNLP. http://opennlp.sourceforge.net/index.html (accessed 1 Mar 2012).
35. Liu H, Lussier YA, Friedman C. A study of abbreviations in the UMLS. Proc AMIA Symp 2001:393-7.
36. Jonnalagadda S, Sohn S, Wu S, et al. MedTagger: the fast NLP pipeline for Mayo’s clinical Data Warehouse. Submitted for Peer-review. Rochester, MN: Mayo Clinic, 2012.
37. Yang X, Zhou G, Su J, et al. Improving noun phrase coreference resolution by matching strings. In: Su KY, Tsujii JI, Lee JH, et al, eds. Natural Language Processing - IJCNLP 2004. Berlin/Heidelberg: Springer, 2005:22-31.
38. Zitouni I, Sorensen J, Luo X, et al. The Impact of Morphological Stemming on Arabic Mention Detection and Coreference Resolution. Ann Arbor, USA: Association for Computational Linguistics, 2005:63-70.
39. Vilain M, Burger J, Aberdeen J, et al. A model-theoretic coreference scoring scheme. Proceedings of MUC-6. Columbia, USA: Linguistic Data Consortium, 1995:45-52.
40. Bagga A, Baldwin B. Algorithms for scoring coreference chains. Proceedings of the LREC Workshop on Linguistic Coreference. Granada, Spain: European Language Resources Association, 1998:563-6.
41. Luo X. On coreference Resolution Performance Metrics. Vancouver, Canada: EMNLP, 2005:25-32.
42. Recasens M, Hovy E. BLANC: Implementing the Rand Index for coreference evaluation. Nat Lang Eng 2011;17:485-510.
43. Recasens M, Martí T, Taulé M, et al. SemEval-2010 task 1: coreference resolution in multiple languages. Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009). Boulder, USA: Association for Computational Linguistics, 2009:70-5.
44. Pradhan S, Ramshaw L, Marcus M, et al. CoNLL-2011 shared task: Modeling Unrestricted coreference in OntoNotes. Conference on Computational Natural Language Learning. Portland, USA: ACL Special Interest Group on Natural Language Learning, 2011.


45. Pascal D, Baldridge J. Joint determination of anaphoricity and coreference resolution using integer programming. North American Chapter of the Association for Computational Linguistics (NAACL). Rochester, NY, USA: Association for Computational Linguistics, 2007:236-43.
46. Mitkov R. Comparing pronoun resolution algorithm. Comput Intell 2007;23:262-97.
47. Jonnalagadda S, Cohen T, Wu S, et al. Enhancing clinical concept extraction with distributional semantics. J Biomed Inform 2012;45:129-40.
48. Sohn S, Kocher JP, Chute CG, et al. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. J Am Med Inform Assoc 2011;18(Suppl 1):i144-9.
49. Chute CG, Beck SA, Fisk TB, et al. The Enterprise Data Trust at Mayo Clinic: a semantically integrated warehouse of biomedical data. J Am Med Inform Assoc 2010;17:131-5.
50. Friedlin J, Overhage M, Al-Haddad MA, et al. Comparing methods for identifying Pancreatic Cancer patients using electronic data Sources. AMIA Annual Symposium Proceedings. Washington DC, USA: AMIA, 2010.
51. Friedlin J, McDonald CJ. A natural language processing system to extract and code concepts relating to congestive heart failure from chest radiology reports. AMIA Annu Symp Proc 2006:269-73.
52. Wagholikar K, Torii M, Jonnalagadda S, et al. Feasibility of pooling annotated corpora for clinical concept extraction. AMIA Clinical Research Informatics Summit. San Francisco, USA: AMIA, 2012.
53. Fan J, Prasad R, Yabut R, et al. Part-of-speech tagging for clinical text: Wall or Bridge between institutions? AMIA Annu Symp Proc 2011;2011:382-91.
54. Patterson O, Hurdle J. Document clustering of clinical narratives: a Systematic study of clinical sublanguages. AMIA Annu Symp 2011:1099-107.
