
ORIGINAL REPORT

Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes
Yujia Bao, MA1; Zhengyi Deng, MS2; Yan Wang, MD2; Heeyoon Kim, MS1; Victor Diego Armengol2; Francisco Acevedo, MD2;
Nofal Ouardaoui3; Cathy Wang, MS3,4; Giovanni Parmigiani, PhD3,4; Regina Barzilay, PhD1; Danielle Braun, PhD3,4; and
Kevin S. Hughes, MD2,5
abstract

PURPOSE The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that
help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic
variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the
penetrance—risk of cancer for germline mutation carriers—or prevalence of germline genetic mutations.
MATERIALS AND METHODS We conducted literature searches in PubMed and retrieved paper titles and abstracts
to create an annotated data set for training and evaluating the two machine learning classification models. Our
first model is a support vector machine (SVM) which learns a linear decision rule on the basis of the bag-of-
ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN)
which learns a complex nonlinear decision rule on the basis of the raw title and abstract. We evaluated the
performance of the two models on the classification of papers as relevant to penetrance or prevalence.
RESULTS For penetrance classification, we annotated 3,740 paper titles and abstracts and evaluated the two
models using 10-fold cross-validation. The SVM model achieved 88.93% accuracy—percentage of papers that
were correctly classified—whereas the CNN model achieved 88.53% accuracy. For prevalence classification,
we annotated 3,753 paper titles and abstracts. The SVM model achieved 88.92% accuracy and the CNN model
achieved 88.52% accuracy.
CONCLUSION Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence.
By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning
knowledge of gene–cancer associations and keep the knowledge bases for clinical decision support tools up
to date.
JCO Clin Cancer Inform. © 2019 by American Society of Clinical Oncology

ASSOCIATED CONTENT
Appendix
Author affiliations and support information (if applicable) appear at the end of this article.
Accepted on August 22, 2019 and published at ascopubs.org/journal/cci on September 23, 2019: DOI https://doi.org/10.1200/CCI.19.00042

INTRODUCTION
The medical literature is growing exponentially, and nowhere is this more apparent than in genetics. In 2010, a PubMed search for BRCA1 yielded 7,867 papers, whereas in 2017 the same search retrieved nearly double that amount (14,266 papers). As the literature on individual genes increases, so does the number of pathogenic gene variants that are clinically actionable. Panel testing for hereditary cancer susceptibility genes identifies many patients with pathogenic variants in genes that are less familiar to clinicians, and it is not feasible for clinicians to understand the clinical implications of these pathogenic variants by conducting their own comprehensive literature review. Thus, clinicians need help with monitoring, collating, and prioritizing the medical literature. In addition, clinicians need clinical decision support tools with which to facilitate decision making for their patients. These tools depend on a knowledge base of metadata on these genetic mutations that is both up to date and comprehensive.1

Natural language processing (NLP) is an area of artificial intelligence that focuses on problems involving the interpretation and understanding of free text by a nonhuman system.2,3 Traditional NLP approaches have relied almost exclusively on rules-based systems in which domain experts predefine a set of rules that are used to identify text with specific content. However, defining these rules is laborious and challenging as a result of variations in language, format, and syntax.4 Modern NLP approaches instead rely on machine learning, by which predictive models are learned directly from a set of texts that have been annotated for the specific target.

NLP has been applied in fields that are relevant to medical and health research.2,5,6 For example, in the field of oncology, researchers have used NLP to identify and classify patients with cancer, assign staging, and determine cancer recurrence.7-9 NLP also plays an important role in accelerating literature review by classifying papers as relevant to the topic of

Downloaded from ascopubs.org by 185.50.250.155 on September 24, 2019 from 185.050.250.155
Copyright © 2019 American Society of Clinical Oncology. All rights reserved.

CONTEXT
Key Objective
In the current study, we developed natural language processing approaches using a support vector machine (SVM) and
convolutional neural network (CNN) to identify abstracts that are relevant to the penetrance and prevalence of pathogenic
germline cancer susceptibility mutations.
Knowledge Generated
Using an annotated database of 3,919 abstracts, both the SVM and CNN classifiers achieve high accuracy in prevalence and penetrance classification. The SVM model had accuracies of 88.92% and 88.93% for prevalence and penetrance classification, respectively, both higher than those of the CNN (88.52% for prevalence and 88.53% for penetrance).
Relevance
The natural language processing approaches we developed achieve high accuracy in classifying abstracts as relevant to
penetrance and prevalence of genetic mutations. These classifiers can facilitate literature review and information synthesis
for both academic research and clinical decision making.

interest.10,11 Several studies developed and improved machine learning approaches on the basis of the publicly available literature collections of 15 systematic literature reviews.11-14 These reviews were conducted by evidence-based practice centers to evaluate the efficacy of medications in 15 drug categories.13 Frunza et al15 used a complement Naïve Bayes (NB) approach to identify papers on the topic of the dissemination strategy of health care services for elderly people, achieving 63% precision. Fiszman et al16 proposed an approach to identify papers relevant to cardiovascular risk factors (56% recall and 91% precision). Miwa et al17 extended an existing approach to classify social and public health literature on the topics of cooking skills, sanitation, tobacco packaging, and youth development.

However, no NLP approaches have been developed specifically for classifying literature on the penetrance—risk of cancer for germline mutation carriers—or prevalence of germline genetic mutations. To our knowledge, no annotated data set is available for the purpose of developing a machine learning method with which to identify relevant papers in this domain. In the current study, we aimed to create a human-annotated data set of abstracts on cancer susceptibility genes and develop a machine learning–based NLP approach to classify abstracts as relevant to the penetrance or prevalence of pathogenic genetic mutations.

MATERIALS AND METHODS
Institutional review board approval was not needed as no human data were analyzed.

Establishing an Annotated Data Set
To develop effective machine learning models for the automatic identification of relevant papers, we created a human-annotated data set. We used three different PubMed queries (queries 1 to 3) to search for relevant papers to create the data set. Details of the query development process and the specific queries used are available in the Appendix.

Penetrance was included in the initial queries—query 1 and query 2—as the initial motivation for this work was to identify abstracts on cancer penetrance of genetic mutations. As we began annotating abstracts, we realized that many of the abstracts contained prevalence information; therefore, we decided to develop a classifier with which to identify prevalence as well. Query 3 was broad and not restricted to prevalence or penetrance.

We considered different gene–cancer combinations from the All Syndrome Known to Man Evaluator (ASK2ME),18 a recently developed clinical decision support tool with which clinicians can estimate the age-specific cancer risk of germline mutation carriers. This tool captures many of the important gene–cancer combinations included in common genetic testing panels. We opted to use the title and abstract of each paper as the input for our models for three main reasons. First, this information can be automatically downloaded via EDirect,19 whereas automatically downloading full-text papers was not feasible as a result of licensing issues. Second, the title and abstract of each paper can be downloaded in free text form, whereas full-text papers are not generally available in a common format and one needs to handle PDF, HTML, and others. Last but not least, annotating the title and abstract is less time consuming than annotating the full text; therefore, obtaining a large training data set is feasible. Each paper—on the basis of title and abstract—was annotated for the following fields by a team of human annotators from Dana-Farber Cancer Institute and Massachusetts General Hospital (coauthors on this publication), with a minimum of two human annotators per abstract. Two fields—penetrance and prevalence—were used to classify papers as relevant to penetrance and prevalence. Other fields—polymorphism, ambiguous penetrance, and ambiguous prevalence—were annotated and used as exclusion criteria.


• Penetrance: the presence of information about risk of cancer for germline mutation carriers
• Prevalence: the presence of information about the proportion of germline mutation carriers in the general population or among individuals with cancer
• Polymorphism: the presence of information only on a germline genetic variant present in more than 1% of the general population
• Ambiguous penetrance: unresolved disagreement between human annotators on the penetrance label, or the impossibility of determining the penetrance label solely on the basis of the title and the abstract
• Ambiguous prevalence: unresolved disagreement between human annotators on the prevalence label, or the impossibility of determining the prevalence label solely on the basis of the title and the abstract

Our goal was to develop models that could accurately classify papers with subject matter pertaining to the penetrance and prevalence of rare germline mutations. Papers that were annotated as polymorphism or ambiguous were not used for model training, validation, or testing.

Models
Overview. We trained two independent classifiers, one to classify an abstract as relevant to penetrance and one to classify an abstract as relevant to prevalence. We used the two models described below to develop the classifiers.

SVM. Our first model is an SVM. We first tokenized the input title and abstract and converted them to a standard bag-of-ngrams vector representation. Specifically, we represented each title and abstract by a vector wherein each entry is the term frequency–inverse document frequency of the corresponding ngram. The term frequency–inverse document frequency increases in proportion to the frequency of the ngram in this particular abstract and is offset by the frequency of the ngram in the entire data set. Thus, the resulting representation serves to give less weight to common words that add little information, such as articles. Finally, we used this bag-of-ngrams representation as the input for a linear SVM to predict the corresponding label.

CNN. Our second model is a CNN.20 This model directly takes the tokenized title and abstract as its input and applies a one-dimensional convolution over the input sequence. It then uses max-over-time pooling to aggregate the information into a vector representation. Finally, it uses a multilayer perceptron to predict the label from the obtained representation. Unlike the linear SVM, the CNN model is capable of learning nonlinear decision rules.

Model Evaluation
For both the penetrance and prevalence classification tasks, we evaluated performance using 10-fold cross-validation. For each fold, 80% of the data were treated as the training set (for model training), 10% of the data were treated as the validation set (for hyperparameter selection), and the remaining 10% of the data were treated as the testing set (for model evaluation). In addition, we compared SVM and CNN with a baseline NB model (detailed model configurations for all three models are presented in the Appendix) and reported the average performance on the testing set across all 10 folds. We used accuracy—the percentage of papers that were correctly classified—and F1 score as our evaluation metrics. Here, the F1 score is the harmonic mean of precision (the percentage of positive predictions that are true positives) and recall (the percentage of all true positives that are predicted as positive). Learning curves were constructed that demonstrate how the number of papers annotated in the training set affected the accuracy of the models. We also plotted the receiver operating characteristic (ROC) curve to compare model performance at various thresholds.

RESULTS
Data Set
The final human-annotated data set contained 3,919 annotated papers (Table 1). Of these, 989 were on penetrance and 1,291 were on prevalence. We excluded papers that were labeled as polymorphism related. For the task of penetrance classification, we further excluded papers with an ambiguous penetrance label, which reduced the annotated data set to 3,740 papers. For the task of prevalence classification, we excluded papers with an ambiguous prevalence label, which reduced the annotated data set to 3,753 papers (Table 1).

Model Performance
Table 2 shows the performance of the SVM and CNN models. Both models outperformed the NB model on the two classification tasks. The SVM model achieved 0.8893 accuracy and 0.7753 F1 score in penetrance classification and 0.8892 accuracy and 0.8329 F1 score in prevalence classification. Although the CNN model has more flexibility in modeling, it underperformed by a small margin compared with the SVM model. Figures 1A and 1B show the ROC curves for penetrance and prevalence classification, respectively. The y-axis is the true-positive rate, also known as sensitivity or recall. The x-axis is the false-positive rate, which represents the probability of false alarm. The ROC curve provides a comparison of model performance at different decision thresholds. On the two classification tasks, both the SVM and the CNN models achieved similar area under the ROC curve, and both outperformed the NB model. Figures 2A and 2B depict the learning curves for the three models for penetrance and prevalence classification, respectively. We observed that when only 50 annotated abstracts were used for training, the SVM model achieved approximately 0.85 accuracy for both tasks, whereas the CNN model underperformed compared with the baseline NB model; however, the learning curve of the CNN model improved steadily as the training set increased.
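The SVM described in Models pairs a TF-IDF bag-of-ngrams representation with a linear SVM. A minimal sketch with scikit-learn follows; the toy abstracts, labels, and hyperparameters are illustrative assumptions, not the study's actual configuration or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for annotated title+abstract strings (1 = penetrance-relevant).
texts = [
    "lifetime risk of breast cancer in BRCA1 mutation carriers",
    "cumulative cancer risk estimates for germline PALB2 carriers",
    "penetrance of ovarian cancer among BRCA2 mutation carriers",
    "prevalence of germline mutations in an unselected cohort",
    "frequency of pathogenic variants in a population-based series",
    "mutation carrier frequency among patients with colorectal cancer",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-ngrams (unigrams + bigrams) weighted by TF-IDF, fed to a linear SVM.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LinearSVC(C=1.0),
)
model.fit(texts, labels)

print(model.predict(["age-specific risk of cancer for mutation carriers"])[0])
```

The TF-IDF weighting downweights ngrams that occur across the whole collection, matching the description above of giving common words less influence.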

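The CNN described in Models embeds tokens, applies a one-dimensional convolution, aggregates with max-over-time pooling, and classifies with a multilayer perceptron. A minimal NumPy forward-pass sketch follows; every dimension and the random weights are illustrative assumptions (a real model learns them from the annotated data by gradient descent).

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, kernel_size, n_filters, hidden = 50, 8, 3, 4, 6

# Learnable parameters (random placeholders here).
E = rng.normal(size=(vocab_size, embed_dim))           # embedding table
W_conv = rng.normal(size=(n_filters, kernel_size, embed_dim))
W1 = rng.normal(size=(n_filters, hidden))              # MLP layer 1
W2 = rng.normal(size=(hidden,))                        # MLP output layer

def cnn_score(token_ids):
    """Forward pass: embed -> 1-D conv -> max-over-time -> MLP -> logit."""
    x = E[token_ids]                                   # (seq_len, embed_dim)
    seq_len = len(token_ids)
    # One-dimensional convolution over the token sequence.
    conv = np.array([
        [np.sum(W_conv[f] * x[t:t + kernel_size])
         for t in range(seq_len - kernel_size + 1)]
        for f in range(n_filters)
    ])                                                 # (n_filters, positions)
    pooled = conv.max(axis=1)                          # max-over-time pooling
    h = np.maximum(0.0, pooled @ W1)                   # ReLU hidden layer
    return float(h @ W2)                               # scalar logit

logit = cnn_score([3, 17, 42, 5, 9, 21])
```

Max-over-time pooling keeps, for each filter, only its strongest response anywhere in the abstract, which is what lets a fixed-size vector summarize variable-length input.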

TABLE 1. Summary of the Annotated Data Set


Data Set No. of Positive Papers (%) No. of Negative Papers (%)
Original data set
Penetrance 989 (25.24) 2,930 (74.76)
Prevalence 1,291 (32.94) 2,628 (67.06)
Both penetrance and prevalence* 389 (9.92) 3,530 (90.08)
Polymorphism 295 (7.53) 3,624 (92.47)
Ambiguous penetrance 119 (3.04) 3,800 (96.96)
Ambiguous prevalence 101 (2.58) 3,818 (97.42)
After excluding polymorphism and ambiguous papers
Penetrance 904 (23.07) 2,836 (72.37)
Prevalence 1,230 (31.39) 2,523 (64.38)

NOTE. Numbers in parentheses are percentages with respect to the total set of 3,919 papers.
*Models for penetrance and prevalence were trained independently.
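The exclusions summarized in Table 1 reduce to a simple filter over the annotation fields. A sketch of that logic follows; the record layout and field names are hypothetical, not the study's actual schema.

```python
# Each annotated paper carries boolean annotation fields (hypothetical layout).
papers = [
    {"id": 1, "penetrance": True,  "polymorphism": False, "ambiguous_penetrance": False},
    {"id": 2, "penetrance": False, "polymorphism": True,  "ambiguous_penetrance": False},
    {"id": 3, "penetrance": False, "polymorphism": False, "ambiguous_penetrance": True},
    {"id": 4, "penetrance": False, "polymorphism": False, "ambiguous_penetrance": False},
]

def penetrance_training_set(records):
    """Drop polymorphism papers and those with an ambiguous penetrance label."""
    return [r for r in records
            if not r["polymorphism"] and not r["ambiguous_penetrance"]]

kept = penetrance_training_set(papers)
print([r["id"] for r in kept])  # papers 2 and 3 are excluded
```

The analogous prevalence filter would test an `ambiguous_prevalence` field instead, which is why the two tasks end up with different data set sizes (3,740 vs 3,753).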

For prevalence classification, the two learning curves show a flattening trend after the number of papers reached 1,000. Table 3 shows the penetrance and prevalence performance of the SVM model for different cancer types. For both penetrance and prevalence classification, the accuracy of the SVM classifier is consistent across different cancer types. Accuracies ranged from 0.8471 to 0.8945 for penetrance classification, and from 0.8729 to 0.9103 for prevalence classification.

DISCUSSION
The growing number of cancer susceptibility genes identified and the burgeoning literature on these genes are overwhelming for clinicians and even for researchers. Machine learning algorithms can help to identify the relevant literature. In the current study, we have created a data set that contains almost 4,000 human-annotated papers regarding cancer susceptibility genes. Using this data set, we developed two models to classify papers as relevant to the penetrance or prevalence of cancer susceptibility genes. The SVM model we developed achieved 88.93% accuracy for penetrance and 88.92% accuracy for prevalence, which outperformed the more complex CNN model. As we have shown in Figures 2A and 2B, our models perform better as the number of papers in the training set increases. Although the curves will plateau at some point, the increasing trend indicates that model performance will continue to improve as more annotated papers are added to the training set.

To maximize efficiency, SVM-based NLP approaches have been developed to identify relevant papers in the medical literature for various topics. In 2005, Aphinyanaphongs et al21 developed the first SVM method to assist systematic literature review by identifying relevant papers in the domain of internal medicine. Several similar approaches were subsequently proposed, including an approach developed by Wallace et al22 that incorporates active learning to reduce annotation cost.11,22 Wallace et al22 reduced the number of papers that must be reviewed manually by approximately 50% while capturing all important papers for systematic review. Fiszman et al16 developed a system using symbolic relevance processing to identify potentially relevant papers for cardiovascular risk factor guidelines. The recall of their system was 56% and its precision 91%.16 Whereas most existing methods have focused on the clinical literature, Miwa et al17 recently extended the scope of their approach to include the social science literature. CNN-based NLP methods have been developed for short text and sentence classification20,23-25; however, few methods have been developed and tested for classifying medical literature. Using the risk-of-bias text classification data sets, Zhang et al26 developed a CNN model to assess the study design bias in literature on randomized clinical trials. The accuracy of the model ranged from 64% to 75%. The high accuracy and F1 score of the models we developed demonstrate that these models can be used to classify prevalence and penetrance papers regarding

TABLE 2. Performance of Two Natural Language Processing Models Developed for Penetrance and Prevalence Classification
Penetrance Classification Prevalence Classification

Model Accuracy (95% CI) F1 Score (95% CI) Accuracy (95% CI) F1 Score (95% CI)
Naïve Bayes 0.8762 (0.8645 to 0.8879) 0.7256 (0.7001 to 0.7510) 0.8702 (0.8556 to 0.8849) 0.7956 (0.7750 to 0.8163)
SVM 0.8893 (0.8821 to 0.8965) 0.7753 (0.7607 to 0.7900) 0.8892 (0.8800 to 0.8983) 0.8329 (0.8190 to 0.8467)
CNN 0.8853 (0.8736 to 0.8970) 0.7523 (0.7264 to 0.7782) 0.8852 (0.8773 to 0.8930) 0.8210 (0.8068 to 0.8351)

Abbreviations: CNN, convolutional neural network; SVM, support vector machine.
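The metrics in Table 2 follow the definitions in Model Evaluation: accuracy is the fraction of papers classified correctly, and F1 is the harmonic mean of precision and recall. A minimal computation from confusion-matrix counts follows; the counts are made up for illustration.

```python
# Hypothetical confusion-matrix counts for one test fold.
tp, fp, fn, tn = 80, 15, 20, 285

accuracy = (tp + tn) / (tp + fp + fn + tn)          # fraction classified correctly
precision = tp / (tp + fp)                          # positive predictions that are right
recall = tp / (tp + fn)                             # true positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 4), round(f1, 4))  # → 0.9125 0.8205
```

Because F1 is a harmonic mean, it is dragged toward the weaker of precision and recall, which is why F1 in Table 2 sits well below accuracy for the minority-class penetrance task.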


FIG 1. Receiver operating characteristic curves for (A) penetrance classification and (B) prevalence classification (true-positive rate vs false-positive rate). Bands around curves and numbers after the ± sign indicate one SE. AUC (penetrance): NB 0.9356 ± 0.0101, SVM 0.9471 ± 0.0077, CNN 0.9393 ± 0.0085. AUC (prevalence): NB 0.9278 ± 0.0097, SVM 0.9478 ± 0.0066, CNN 0.9462 ± 0.0061. AUC, area under the receiver operating characteristic curve; CNN, convolutional neural network; NB, Naïve Bayes; SVM, support vector machine.
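The ROC curves in Figure 1 trace the true-positive rate against the false-positive rate as the decision threshold varies, and the AUC equals the probability that a randomly chosen positive paper is scored above a randomly chosen negative one. A dependency-free sketch of that rank-statistic computation on toy scores:

```python
def roc_auc(labels, scores):
    """AUC = probability that a random positive outscores a random negative
    (ties count half), which equals the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]   # one positive is outscored by a negative
print(roc_auc(labels, scores))            # → 0.888...
```

This pairwise view explains why AUC compares models "at various thresholds" at once: no single cutoff is chosen; only the ranking of scores matters.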

cancer susceptibility genes. This approach will be useful for physicians to prioritize literature and understand the clinical implications of pathogenic variants. In addition, this NLP approach has the potential to assist systematic literature review and meta-analysis in the same domain. We have conducted another study to test its efficiency and comprehensiveness in identifying important papers for meta-analyses, which is reported separately.27

Although our approach achieves high performance, there are some limitations. One weakness of our approach is the dependence on data that are available in the title and abstract. This is partly a result of limitations in access to full-text publications, but also because of the variety of formats in which full-text publications are stored. The proposed models do not work for papers that do not have an abstract or that have an incomplete abstract. When the abstract is ambiguous for humans, misclassification can also occur. In the annotated training data set, there are 119 papers (3.0%) that have ambiguous penetrance information, and 101 papers (2.6%) that have ambiguous prevalence information. Although we excluded these from model training, classifying new abstracts that are ambiguous remains challenging.

The abstract is an important component of a published work and is usually available publicly. A well-written and complete abstract provides concise yet critical information that is pertinent to the study, can facilitate the capture of key content by the reader, and can greatly facilitate NLP. When abstracts are not clearly written or leave out critical findings of the study, the efficacy of NLP models that are

FIG 2. Learning curves of the three models on the task of (A) penetrance classification and (B) prevalence classification (testing accuracy vs No. of training examples, 0 to 3,000). Bands around curves indicate one SE. CNN, convolutional neural network; NB, Naïve Bayes; SVM, support vector machine.


TABLE 3. Performance of the Support Vector Machine Model for Different Cancer Types
Cancer Type No. of Papers Proportion of Penetrance Papers Accuracy
Task: Penetrance classification
Breast cancer 1,669 0.2882 0.8808
Ovarian cancer 1,535 0.2189 0.8945
Colorectal cancer 846 0.2624 0.8747
Endometrial cancer 250 0.3320 0.8880
Pancreatic cancer 242 0.2645 0.8471
Gastric cancer 224 0.2991 0.8839
Prostate cancer 195 0.4564 0.8615
Brain cancer 95 0.2737 0.8737
Cancer Type No. of Papers Proportion of Prevalence Papers Accuracy
Task: Prevalence classification
Breast cancer 1,674 0.3871 0.8805
Ovarian cancer 1,539 0.3008 0.9103
Colorectal cancer 842 0.2755 0.8729
Endometrial cancer 245 0.2163 0.8857
Pancreatic cancer 241 0.4232 0.8838
Gastric cancer 221 0.2986 0.8778
Prostate cancer 188 0.4255 0.8830
Brain cancer 98 0.1939 0.8776

NOTE. One abstract may belong to multiple cancer types.
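As the note to Table 3 states, one abstract may belong to multiple cancer types, so per-type accuracies are computed over overlapping subsets of papers. A sketch of that bookkeeping with fabricated records:

```python
from collections import defaultdict

# (cancer types tagged on the abstract, true label, predicted label) — toy records.
results = [
    ({"breast", "ovarian"}, 1, 1),
    ({"breast"},            0, 0),
    ({"ovarian"},           1, 0),
    ({"colorectal"},        0, 0),
]

correct = defaultdict(int)
total = defaultdict(int)
for types, y_true, y_pred in results:
    for t in types:                      # one paper counts toward every type it carries
        total[t] += 1
        correct[t] += int(y_true == y_pred)

accuracy = {t: correct[t] / total[t] for t in total}
print(accuracy["breast"], accuracy["ovarian"])  # → 1.0 0.5
```

This overlap is why the per-type paper counts in Table 3 sum to more than the total number of annotated abstracts.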

based on abstract text decreases. There is a need for authors to report their findings in sufficient detail if NLP methods are to be effective in the future.

One approach to handle important studies that do not have an abstract or that do not report sufficient detail in the abstract is to develop classification algorithms on the basis of the full text. Usually, full texts provide much more information on penetrance and prevalence. Developing algorithms to extract and read information from full texts may ultimately lead to higher accuracy; however, numerous issues would have to be solved to develop algorithms that are based on full text, including retrieving the PDF files of numerous papers automatically (and resolving access issues); automatically extracting text, figures, and tables from a PDF or other published format; and developing more complex classification models for additional labels.

Another potential limitation is that our training data set for this study was limited to abstracts on genes captured by the ASK2ME software and to papers indexed by PubMed. However, although we trained our models using articles indexed by PubMed, it is important to note that the classifiers can be applied to any abstract. ASK2ME captures many of the well-studied hereditary cancer syndromes, and the models were developed to identify abstracts irrespective of specific gene–cancer associations. Expanding our search beyond these resources could be interesting. It should also be noted that, as our models were developed for rare genetic mutations, abstracts on polymorphism were excluded.

As we have shown, the CNN model did not outperform the SVM model. This is true for both classification tasks and is not surprising, as neural networks typically require much larger amounts of annotated data for training. As an alternative to annotating more data, one may further improve model performance by asking human annotators to provide justifications for their decisions.28 These justifications can be in the form of highlighting parts of the original input abstract that informed the classification decision. Recently, Zhang et al26 and Bao et al29 showed that providing these justifications to the model can significantly improve classification performance when a limited amount of training data are available.

In the current study, we developed two models with which to classify abstracts that are relevant to the penetrance or prevalence of cancer susceptibility genes. Our models achieve high performance and have the potential to reduce the literature review burden. With the exponential growth of the medical literature, our hope is to use computing power to help clinicians and researchers search for and prioritize knowledge in this field and to keep knowledge bases that are used by clinical decision support tools, such as ASK2ME,1,18 up to date.


AFFILIATIONS
1 Massachusetts Institute of Technology, Boston, MA
2 Massachusetts General Hospital, Boston, MA
3 Harvard T.H. Chan School of Public Health, Boston, MA
4 Dana-Farber Cancer Institute, Boston, MA
5 Harvard Medical School, Boston, MA

CORRESPONDING AUTHOR
Danielle Braun, PhD, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Department of Data Sciences, Dana-Farber Cancer Institute, 677 Huntington Ave, SPH 4th Floor, Boston, MA 02115; e-mail: [email protected].

EQUAL CONTRIBUTION
Y.B. and Z.D. contributed equally to this work and should be considered cofirst authors.

PRIOR PRESENTATION
Presented at Massachusetts General Hospital Clinical Research Day, Boston, MA, October 5, 2017; the Dana-Farber/Harvard Cancer Center Junior Investigator Symposium, Boston, MA, November 6, 2017; the Dana-Farber Cancer Institute Biostatistics and Computational Biology Annual Retreat, Boston, MA, January 18, 2018; and the 2018 Dana-Farber Cancer Institute/Frontier Science and Technology Research Foundation Marvin Zelen Memorial Symposium, Boston, MA, April 6, 2018.

SUPPORT
Supported by National Cancer Institute Grants No. 5T32-CA009337-32 and 4P30-CA006516-51 and the Koch Institute/Dana-Farber/Harvard Cancer Center Bridge Project (Footbridge).

AUTHOR CONTRIBUTIONS
Conception and design: Yujia Bao, Giovanni Parmigiani, Regina Barzilay, Danielle Braun, Kevin S. Hughes
Financial support: Giovanni Parmigiani, Regina Barzilay, Kevin S. Hughes
Administrative support: Kevin S. Hughes
Provision of study materials or patients: Kevin S. Hughes
Collection and assembly of data: Yujia Bao, Zhengyi Deng, Yan Wang, Heeyoon Kim, Victor Diego Armengol, Francisco Acevedo, Cathy Wang, Danielle Braun, Kevin S. Hughes
Data analysis and interpretation: Yujia Bao, Zhengyi Deng, Heeyoon Kim, Victor Diego Armengol, Nofal Ouardaoui, Giovanni Parmigiani, Regina Barzilay, Danielle Braun, Kevin S. Hughes
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/jco/site/ifc.

Giovanni Parmigiani
Leadership: Phaeno Biotech
Stock and Other Ownership Interests: HRA Health
Consulting or Advisory Role: Biogen, Konica Minolta
Patents, Royalties, Other Intellectual Property: Patent: Genetic Alterations in Malignant Gliomas; Copyright: BayesMendel software (Inst)
Expert Testimony: Natera
Travel, Accommodations, Expenses: Konica Minolta

Regina Barzilay
Honoraria: Merck
Consulting or Advisory Role: Janssen Pharmaceuticals
Research Funding: Bayer, Amgen

Kevin S. Hughes
Stock and Other Ownership Interests: Hughes RiskApps
Honoraria: Focal Therapeutics, 23andMe, Hologic
Consulting or Advisory Role: Health Beacons

No other potential conflicts of interest were reported.

REFERENCES
1. Braun D, Yang J, Griffin M, et al: A clinical decision support tool to predict cancer risk for commonly tested cancer-related germline mutations. J Genet Couns
27:1187-1199, 2018
2. Yim W-W, Yetisgen M, Harris WP, et al: Natural language processing in oncology: A review. JAMA Oncol 2:797-804, 2016
3. Hirschberg J, Manning CD: Advances in natural language processing. Science 349:261-266, 2015
4. Buckley JM, Coopey SB, Sharko J, et al: The feasibility of using natural language processing to extract clinical information from breast pathology reports. J Pathol
Inform 3:23, 2012
5. Murff HJ, FitzHenry F, Matheny ME, et al: Automated identification of postoperative complications within an electronic medical record using natural language
processing. JAMA 306:848-855, 2011
6. Sevenster M, Bozeman J, Cowhy A, et al: A natural language processing pipeline for pairing measurements uniquely across free-text CT reports. J Biomed
Inform 53:36-48, 2015
7. Carrell DS, Halgrim S, Tran DT, et al: Using natural language processing to improve efficiency of manual chart abstraction in research: The case of breast cancer
recurrence. Am J Epidemiol 179:749-758, 2014
8. Jouhet V, Defossez G, Burgun A, et al: Automated classification of free-text pathology reports for registration of incident cases of cancer. Methods Inf Med
51:242-251, 2012
9. Friedlin J, Overhage M, Al-Haddad MA, et al: Comparing methods for identifying pancreatic cancer patients using electronic data sources. AMIA Annu Symp
Proc 2010:237-241, 2010
10. Harmston N, Filsell W, Stumpf MP: What the papers say: Text mining for genomics and systems biology. Hum Genomics 5:17-29, 2010
11. Jonnalagadda S, Petitti D: A new iterative method to reduce workload in systematic review process. Int J Comput Biol Drug Des 6:5-17, 2013
12. Matwin S, Kouznetsov A, Inkpen D, et al: A new algorithm for reducing the workload of experts in performing systematic reviews. J Am Med Inform Assoc
17:446-453, 2010
13. Cohen AM, Hersh WR, Peterson K, et al: Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc
13:206-219, 2006

JCO Clinical Cancer Informatics 7



Copyright © 2019 American Society of Clinical Oncology. All rights reserved.
Bao et al

14. Ji X, Ritter A, Yen PY: Using ontology-based semantic similarity to facilitate the article screening process for systematic reviews. J Biomed Inform 69:33-42,
2017
15. Frunza O, Inkpen D, Matwin S, et al: Exploiting the systematic review protocol for classification of medical abstracts. Artif Intell Med 51:17-25, 2011
16. Fiszman M, Bray BE, Shin D, et al: Combining relevance assignment with quality of the evidence to support guideline development. Stud Health Technol Inform
160:709-713, 2010
17. Miwa M, Thomas J, O’Mara-Eves A, et al: Reducing systematic review workload through certainty-based screening. J Biomed Inform 51:242-253, 2014
18. ASK2ME: All Syndromes Known to Man Evaluator. https://ask2me.org/
19. Kans J: Entrez Direct: E-utilities on the UNIX command line. Entrez Programming Utilities Help. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/books/NBK179288/
20. Kim Y: Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2014, pp 1746-1751
21. Aphinyanaphongs Y, Tsamardinos I, Statnikov A, et al: Text categorization models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc
12:207-216, 2005
22. Wallace BC, Trikalinos TA, Lau J, et al: Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics 11:55, 2010
23. Kalchbrenner N, Grefenstette E, Blunsom P: A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp 655-665
24. Lai S, Xu L, Liu K, et al: Recurrent convolutional neural networks for text classification. Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin,
TX, January 25-30, 2015.
25. Zhang X, Zhao J, LeCun Y: Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 2015, pp
649-657
26. Zhang Y, Marshall I, Wallace BC: Rationale-augmented convolutional neural networks for text classification. Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, 2016, pp 795-804
27. Deng Z, Yin K, Bao Y, et al: Validation of a semiautomated natural language processing-based procedure for meta-analysis of cancer susceptibility gene penetrance. JCO Clin Cancer Inform doi: 10.1200/CCI.19.00043
28. Zaidan OF, Eisner J, Piatko CD: Using “annotator rationales” to improve machine learning for text categorization. Human Language Technologies 2007: The
conference of the North American chapter of the Association for Computational Linguistics; proceedings of the main conference, 2007, pp 260-267
29. Bao Y, Chang S, Yu M, et al: Deriving machine attention from human rationales. Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, 2018, pp 1903-1913


Using Natural Language Processing to Review Medical Literature

APPENDIX

Query Development

Initially, we performed PubMed searches using query 1 to ensure that we were able to identify enough positive abstracts for model training. Query 1 includes the following search terms:

Query 1: ("gene name"[TIAB] OR "medical subject headings (MeSH) for that gene" OR "related syndrome name"[TIAB] OR "MeSH for that syndrome") AND ("Risk"[Mesh] OR "Risk"[TI] OR "Penetrance"[TIAB] OR "Hazard ratio"[TIAB]) AND ("cancer name"[Mesh] OR "cancer name"[TIAB])

As our annotated data set grew, we found that query 1 missed several important papers. We updated the PubMed query to query 2 using the following search terms:

Query 2: ("gene name"[TIAB] OR "medical subject headings (MeSH) for that gene" OR "related syndrome name"[TIAB] OR "MeSH for that syndrome") AND ("Risk"[Mesh] OR Risk*[TIAB] OR Penetrance*[TIAB] OR Hazard Ratio*[TIAB] OR Odds Ratio*[TIAB]) AND ("cancer name"[Mesh] OR cancer name*[TIAB])

The training of the classifiers was an iterative process, and toward the end of the study we expanded the PubMed query to capture a broader range of studies. We updated the PubMed query to query 3 using the following search terms:

Query 3: ("gene name"[TIAB] OR "medical subject headings (MeSH) for that gene" OR "related syndrome name"[TIAB] OR "MeSH for that syndrome")

Model Details

We provide details on the model configurations and hyperparameters. Our code is available at https://github.com/YujiaBao/PubmedClassifier.

For the Naïve Bayes model, we tuned the following configuration on the basis of the validation performance:

• Range of ngrams: (1,2), (1,3), (1,4)
• Using sublinear tf scaling or not
• Additive Laplace smoothing parameter: 1e-2, 1e-3

For the support vector machine model, we tuned the following configuration on the basis of the validation performance:

• Range of ngrams: (1,2), (1,3), (1,4)
• Using sublinear tf scaling or not
• Weight of L2 regularization: 1e-4, 1e-5

For the convolutional neural network model, we represented each word by a 300-dimensional pretrained word embedding (Pyysalo S et al: http://bio.nlplab.org/pdf/pyysalo13literature.pdf) and applied a dropout of rate 0.1 on the word embeddings (Srivastava N, et al: J Mach Learn Res 15:1929-1958, 2014). For the one-dimensional convolutions, we used filter windows of 3, 4, and 5, with 100 feature maps each. We used ReLU activations for the multilayer perceptron. All parameters were optimized using Adam with a learning rate of 0.0001. We applied early stopping when the validation loss failed to improve for 10 epochs (Kingma DP: https://arxiv.org/abs/1412.6980). We tuned the following configuration on the basis of the validation performance:

• Finetuning the word embeddings or not
• Using a hidden layer of dimension 50 or not
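The query templates in the Query Development section are parameterized by gene, syndrome, and cancer names. As a rough sketch of how the query 2 template could be instantiated for a concrete gene-cancer pair (the helper name, the exact MeSH field tags, and the BRCA1/breast cancer example values are illustrative assumptions, not part of the released pipeline):

```python
def build_query2(gene, gene_mesh, syndrome, syndrome_mesh, cancer):
    """Fill the query 2 template with concrete gene/syndrome/cancer terms."""
    gene_clause = (
        f'"{gene}"[TIAB] OR "{gene_mesh}"[Mesh] OR '
        f'"{syndrome}"[TIAB] OR "{syndrome_mesh}"[Mesh]'
    )
    risk_clause = (
        '"Risk"[Mesh] OR Risk*[TIAB] OR Penetrance*[TIAB] OR '
        'Hazard Ratio*[TIAB] OR Odds Ratio*[TIAB]'
    )
    cancer_clause = f'"{cancer}"[Mesh] OR {cancer}*[TIAB]'
    # Join the three clauses with AND, as in the published template.
    return f"({gene_clause}) AND ({risk_clause}) AND ({cancer_clause})"

query = build_query2(
    gene="BRCA1",
    gene_mesh="BRCA1 Protein",
    syndrome="Hereditary Breast and Ovarian Cancer Syndrome",
    syndrome_mesh="Hereditary Breast and Ovarian Cancer Syndrome",
    cancer="breast cancer",
)
```

The resulting string can be passed to PubMed directly or through the Entrez E-utilities (reference 19).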

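The bag-of-ngrams representation with sublinear tf scaling used by the SVM can be sketched in a few lines. Sublinear scaling replaces a raw count tf with 1 + log(tf); the whitespace/lowercase tokenizer below is a simplifying assumption, not the preprocessing described in the paper:

```python
import math
from collections import Counter

def bag_of_ngrams(text, ngram_range=(1, 2), sublinear_tf=True):
    """Token n-gram counts, optionally with sublinear tf scaling (1 + log tf)."""
    tokens = text.lower().split()
    counts = Counter()
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    if sublinear_tf:
        return {gram: 1.0 + math.log(c) for gram, c in counts.items()}
    return dict(counts)

features = bag_of_ngrams("risk of breast cancer risk", ngram_range=(1, 2))
```

The tuned choices in the appendix (ngram range, sublinear tf on or off) correspond to the `ngram_range` and `sublinear_tf` arguments here.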
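The early-stopping rule described for the convolutional neural network (stop when the validation loss has not improved for 10 consecutive epochs) can be sketched as follows; the class name and interface are illustrative assumptions, not the authors' code:

```python
class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch after computing the validation loss, and training would break as soon as it returns True.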
