
Methods in Molecular Biology 1939

Richard S. Larson · Tudor I. Oprea, Editors

Bioinformatics and Drug Discovery

Third Edition
METHODS IN MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes:
http://www.springer.com/series/7651
Bioinformatics and Drug Discovery

Third Edition

Edited by

Richard S. Larson
Department of Pathology, University of New Mexico, Albuquerque, NM, USA

Tudor I. Oprea
Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
Editors

Richard S. Larson
Department of Pathology
University of New Mexico
Albuquerque, NM, USA

Tudor I. Oprea
Department of Internal Medicine
University of New Mexico
Albuquerque, NM, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic)


Methods in Molecular Biology
ISBN 978-1-4939-9088-7    ISBN 978-1-4939-9089-4 (eBook)
https://doi.org/10.1007/978-1-4939-9089-4
Library of Congress Control Number: 2019932160

© Springer Science+Business Media, LLC, part of Springer Nature 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface

A remarkable number of novel bioinformatics methods and techniques have become avail-
able in recent years, enabling us to more rapidly identify new molecular and cellular
therapeutic targets. It is safe to say that bioinformatics has now taken its place as an essential
tool in the process of rational drug discovery.
The first (2005), second (2012), and now third editions of Bioinformatics and Drug
Discovery offer many examples that illustrate the dramatic improvement in our ability to
understand the requirements for manipulating proteins and genes toward desired therapeu-
tic and clinical effects.
This is partly due to our growing ability to modulate protein and gene functions, which
has been facilitated by the emergence of novel technologies and their seamless digital
integration. To address the rapidly changing landscape of bioinformatics methods and
technologies, this edition has been updated to include four major topics: (1) Translational
Bioinformatics in Drug Discovery; (2) Informatics in Drug Discovery; (3) Clinical Research
Informatics in Drug Discovery; and (4) Clinical Informatics in Drug Discovery. The topics
covered include new technologies in target identification, genomic analysis, cheminformatics,
chemical mixture informatics, protein analysis, text mining, and network or
pathway analyses, as well as drug repurposing.
It is virtually impossible for an individual investigator to be familiar with all these
techniques, so we have adopted a slightly different chapter format from other titles published
in the Methods in Molecular Biology series. Each chapter introduces the theory and application of the
technology, followed by practical procedures derived from these technologies and software.
Meanwhile, the pipeline of methodologies, and the biological analyses they perform, have
grown over time.
Bioinformatics and Drug Discovery is intended for those interested in the different
aspects of drug design, including academicians (biologists, informaticists, chemists, and
biochemists), clinicians, and scientists at pharmaceutical companies. This edition’s chapters
have been written by well-established investigators who regularly employ the methods they
discuss. The editors hope this book will provide readers with insight into key topics,
accompanied by reliable step-by-step directions for reproducing the techniques described.

Albuquerque, NM, USA Richard S. Larson


Tudor I. Oprea

Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contributors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

PART I TRANSLATIONAL BIOINFORMATICS IN DRUG DISCOVERY


1 Miniaturized Checkerboard Assays to Measure Antibiotic Interactions. . . . . . . . . 3
Melike Cokol-Cakmak and Murat Cokol
2 High-Throughput Screening for Drug Combinations . . . . . . . . . . . . . . . . . . . . . . . 11
Paul Shinn, Lu Chen, Marc Ferrer, Zina Itkin,
Carleen Klumpp-Thomas, Crystal McKnight, Sam Michael,
Tim Mierzwa, Craig Thomas, Kelli Wilson, and Rajarshi Guha
3 Post-processing of Large Bioactivity Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Jason Bret Harris
4 How to Develop a Drug Target Ontology: KNowledge
Acquisition and Representation Methodology (KNARM) . . . . . . . . . . . . . . . . . . . . 49
Hande Küçük McGinty, Ubbo Visser, and Stephan Schürer

PART II INFORMATICS IN DRUG DISCOVERY

5 A Guide to Dictionary-Based Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . 73
Helen V. Cook and Lars Juhl Jensen
6 Leveraging Big Data to Transform Drug Discovery . . . . . . . . . . . . . . . . . . . . . . . . . 91
Benjamin S. Glicksberg, Li Li, Rong Chen, Joel Dudley, and Bin Chen
7 How to Prepare a Compound Collection Prior to Virtual Screening . . . . . . . . . . 119
Cristian G. Bologa, Oleg Ursu, and Tudor I. Oprea
8 Building a Quantitative Structure-Property Relationship
(QSPR) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Robert D. Clark and Pankaj R. Daga
9 Isomeric and Conformational Analysis of Small Drug
and Drug-Like Molecules by Ion Mobility-Mass Spectrometry
(IM-MS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Shawn T. Phillips, James N. Dodds, Jody C. May, and John A. McLean

PART III CLINICAL RESEARCH INFORMATICS IN DRUG DISCOVERY

10 A Computational Platform and Guide for Acceleration
of Novel Medicines and Personalized Medicine . . . . . . . . . . . . . . . . . . . . . . . 181
Ioannis N. Melas, Theodore Sakellaropoulos, Junguk Hur,
Dimitris Messinis, Ellen Y. Guo, Leonidas G. Alexopoulos,
and Jane P. F. Bai
11 Omics Data Integration and Analysis for Systems Pharmacology . . . . . . . . . . . . . . 199
Hansaim Lim and Lei Xie

12 Bioinformatics-Based Tools and Software in Clinical Research:
A New Emerging Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Parveen Bansal, Malika Arora, Vikas Gupta, and Mukesh Maithani
13 Text Mining for Drug Discovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Si Zheng, Shazia Dharssi, Meng Wu, Jiao Li, and Zhiyong Lu

PART IV CLINICAL INFORMATICS IN DRUG DISCOVERY

14 Big Data Cohort Extraction for Personalized Statin Treatment
and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Terrence J. Adam and Chih-Lin Chi
15 Drug Signature Detection Based on L1000 Genomic
and Proteomic Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Wei Chen and Xiaobo Zhou
16 Drug Effect Prediction by Integrating L1000 Genomic
and Proteomic Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
Wei Chen and Xiaobo Zhou
17 A Bayesian Network Approach to Disease Subtype Discovery . . . . . . . . . . . . . . . . 299
Mei-Sing Ong

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Contributors

TERRENCE J. ADAM  Department of Pharmaceutical Care and Health Systems, Health
Informatics, Social and Administrative Pharmacy, University of Minnesota College of
Pharmacy, Minneapolis, MN, USA
LEONIDAS G. ALEXOPOULOS  School of Mechanical Engineering, National Technical
University of Athens, Zografou, Greece
MALIKA ARORA  Multidisciplinary Research Unit, Guru Gobind Singh Medical College,
Faridkot, India
JANE P. F. BAI  Office of Clinical Pharmacology, Center for Drug Evaluation and Research,
US Food and Drug Administration, Silver Spring, MD, USA
PARVEEN BANSAL  University Centre of Excellence in Research, Baba Farid University of
Health Sciences, Faridkot, India
CRISTIAN G. BOLOGA  Division of Translational Informatics, Department of Internal
Medicine, University of New Mexico School of Medicine, Albuquerque, NM, USA
BIN CHEN  Bakar Computational Health Sciences Institute, University of California, San
Francisco, CA, USA; Department of Pediatrics and Human Development, Michigan State
University, Grand Rapids, MI, USA; Department of Pharmacology and Toxicology,
Michigan State University, Grand Rapids, MI, USA
LU CHEN  National Center for Advancing Translational Science, Rockville, MD, USA
RONG CHEN  Department of Genetics and Genomic Sciences, Institute of Next Generation
Healthcare, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Sema4, A
Mount Sinai Venture, Stamford, CT, USA
WEI CHEN  Department of Radiology, Wake Forest University Medical School,
Winston-Salem, NC, USA
CHIH-LIN CHI  University of Minnesota School of Nursing, Minneapolis, MN, USA
ROBERT D. CLARK  Simulations Plus, Inc., Lancaster, CA, USA
MURAT COKOL  Department of Molecular Biology and Microbiology, Tufts University School
of Medicine, Boston, MA, USA; Laboratory of Systems Pharmacology, Harvard Medical
School, Boston, MA, USA
MELIKE COKOL-CAKMAK  Faculty of Engineering and Natural Sciences, Sabanci University,
Tuzla, Istanbul, Turkey
HELEN V. COOK  School of Clinical Medicine, University of Cambridge, Cambridge, UK;
Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical
Sciences, University of Copenhagen, Copenhagen, Denmark
PANKAJ R. DAGA  Simulations Plus, Inc., Lancaster, CA, USA
SHAZIA DHARSSI  National Center for Biotechnology Information (NCBI), National
Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
JAMES N. DODDS  Department of Chemistry, Center for Innovative Technology, Vanderbilt-
Ingram Cancer Center, Vanderbilt Institute of Chemical Biology, Vanderbilt Institute for
Integrative Biosystems Research and Education, Vanderbilt University, Nashville, TN,
USA
JOEL DUDLEY  Department of Genetics and Genomic Sciences, Institute of Next Generation
Healthcare, Icahn School of Medicine at Mount Sinai, New York, NY, USA
MARC FERRER  National Center for Advancing Translational Science, Rockville, MD, USA


BENJAMIN S. GLICKSBERG  Bakar Computational Health Sciences Institute, University of
California, San Francisco, CA, USA; Department of Genetics and Genomic Sciences,
Institute of Next Generation Healthcare, Icahn School of Medicine at Mount Sinai,
New York, NY, USA
RAJARSHI GUHA  Vertex Pharmaceuticals, Rockville, MD, USA
ELLEN Y. GUO  College of Pharmacy, University of Illinois at Chicago, Chicago, IL, USA
VIKAS GUPTA  University Centre of Excellence in Research, Baba Farid University of Health
Sciences, Faridkot, India
JASON BRET HARRIS  Collaborative Drug Discovery (CDD), Inc., Burlingame, CA, USA
JUNGUK HUR  Office of Clinical Pharmacology, Center for Drug Evaluation and Research,
US Food and Drug Administration, Silver Spring, MD, USA; Department of Biomedical
Sciences, University of North Dakota, School of Medicine and Health Sciences, Grand Forks,
ND, USA
ZINA ITKIN  National Center for Advancing Translational Science, Rockville, MD, USA
LARS JUHL JENSEN  Novo Nordisk Foundation Center for Protein Research, Faculty of Health
and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
CARLEEN KLUMPP-THOMAS  National Center for Advancing Translational Science,
Rockville, MD, USA
HANDE KÜÇÜK MCGINTY  Department of Computer Science, University of Miami, Coral
Gables, FL, USA; Collaborative Drug Discovery, Inc., Burlingame, CA, USA
JIAO LI  Institute of Medical Information and Library, Chinese Academy of Medical Sciences
and Peking Union Medical College, Beijing, China
LI LI  Department of Genetics and Genomic Sciences, Institute of Next Generation
Healthcare, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Sema4,
A Mount Sinai Venture, Stamford, CT, USA
HANSAIM LIM  The Ph.D. Program in Biochemistry, The Graduate Center, The City
University of New York, New York, NY, USA
ZHIYONG LU  National Center for Biotechnology Information (NCBI), National Library
of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
MUKESH MAITHANI  Multidisciplinary Research Unit, Guru Gobind Singh Medical College,
Faridkot, India
JODY C. MAY  Department of Chemistry, Center for Innovative Technology, Vanderbilt-
Ingram Cancer Center, Vanderbilt Institute of Chemical Biology, Vanderbilt Institute
for Integrative Biosystems Research and Education, Vanderbilt University, Nashville,
TN, USA
CRYSTAL MCKNIGHT  National Center for Advancing Translational Science, Rockville,
MD, USA
JOHN A. MCLEAN  Department of Chemistry, Center for Innovative Technology, Vanderbilt-
Ingram Cancer Center, Vanderbilt Institute of Chemical Biology, Vanderbilt Institute
for Integrative Biosystems Research and Education, Vanderbilt University, Nashville,
TN, USA
IOANNIS N. MELAS  Office of Clinical Pharmacology, Center for Drug Evaluation and
Research, US Food and Drug Administration, Silver Spring, MD, USA; Translational
Bioinformatics, UCB Pharma, Slough, UK
DIMITRIS MESSINIS  Office of Clinical Pharmacology, Center for Drug Evaluation and
Research, US Food and Drug Administration, Silver Spring, MD, USA; School of
Mechanical Engineering, National Technical University of Athens, Zografou, Greece
SAM MICHAEL  National Center for Advancing Translational Science, Rockville, MD, USA

TIM MIERZWA  National Center for Advancing Translational Science, Rockville, MD, USA
MEI-SING ONG  Department of Population Medicine, Harvard Medical School and
Harvard Pilgrim Health Care Institute, Boston, MA, USA; Computational Health
Informatics Program, Boston Children’s Hospital, Boston, MA, USA
TUDOR I. OPREA  Division of Translational Informatics, Department of Internal Medicine,
University of New Mexico School of Medicine, Albuquerque, NM, USA
SHAWN T. PHILLIPS  Department of Chemistry, Center for Innovative Technology,
Vanderbilt-Ingram Cancer Center, Vanderbilt Institute of Chemical Biology, Vanderbilt
Institute for Integrative Biosystems Research and Education, Vanderbilt University,
Nashville, TN, USA
THEODORE SAKELLAROPOULOS  Office of Clinical Pharmacology, Center for Drug
Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA;
Department of Pathology, New York University School of Medicine, New York, USA
STEPHAN SCHÜRER  Department of Molecular and Cellular Pharmacology, Miller School
of Medicine, University of Miami, Miami, FL, USA; Center for Computational Science,
University of Miami, Coral Gables, FL, USA
PAUL SHINN  National Center for Advancing Translational Science, Rockville, MD, USA
CRAIG THOMAS  National Center for Advancing Translational Science, Rockville, MD, USA
OLEG URSU  Merck Research Laboratories, Boston, MA, USA; Division of Translational
Informatics, Department of Internal Medicine, University of New Mexico School of
Medicine, Albuquerque, NM, USA
UBBO VISSER  Department of Computer Science, University of Miami, Coral Gables, FL, USA
KELLI WILSON  National Center for Advancing Translational Science, Rockville, MD, USA
MENG WU  Institute of Medical Information and Library, Chinese Academy of Medical
Sciences and Peking Union Medical College, Beijing, China
LEI XIE  The Ph.D. Program in Biochemistry, The Graduate Center, The City University
of New York, New York, NY, USA; Department of Computer Science, Hunter College,
The City University of New York, New York, NY, USA
SI ZHENG  Institute of Medical Information and Library, Chinese Academy of Medical
Sciences and Peking Union Medical College, Beijing, China
XIAOBO ZHOU  Department of Radiology, Wake Forest University Medical School, Winston-
Salem, NC, USA; School of Biomedical Informatics, The University of Texas, Health Science
Center at Houston, Houston, TX, USA
Part I

Translational Bioinformatics in Drug Discovery


Chapter 1

Miniaturized Checkerboard Assays to Measure Antibiotic Interactions

Melike Cokol-Cakmak and Murat Cokol

Abstract
Drugs may have synergistic or antagonistic interactions when combined. Checkerboard assays, where two
drugs are combined in many doses, allow sensitive measurement of drug interactions. Here, we describe a
protocol to measure the pairwise interactions among three antibiotics, in duplicate, in 5 days, using only
two 96-well microplates and standard laboratory equipment.

Key words Drug interactions, Checkerboard assay, Drug synergy

1 Introduction

Drug combinations may exhibit a surprisingly high or low effect on a
phenotype given the effects of constituent drugs, corresponding to
synergistic or antagonistic drug interactions, respectively
[1–4]. Experimental measurement of a drug interaction involves
the preparation of combinations of constituent drugs in various
concentrations [5]. A commonly used experimental setup for pair-
wise drug interaction measurement is the checkerboard assay,
where two drugs are combined in a 2D matrix where the dose of
each drug is linearly increased in one axis [6]. In such a setting,
synergistic drug pairs will be more efficacious in many of the
combinations, while high growth will be observed in antagonistic
pairs.
Although in use for many decades, the preparation of a check-
erboard assay is difficult, due to experimental variation of single-
drug effects. In addition, checkerboard assays are often conducted
in an 8 × 8 matrix of concentration combinations, resulting in
significant cost in time and resources [6]. Here, we describe a
simple and reproducible protocol to determine the pairwise antibi-
otic interactions using miniaturized checkerboard assays.


2 Materials

2.1 Preparation of Bacterial Culture

1. Aliquots of Escherichia coli in 25% glycerol (see Note 1).
2. LB Broth Powder.
3. 15 ml breathable cell culture tube.
4. Pipette pump.
5. 5 ml cell culture serological pipette.
6. Manual pipette.
7. 200 μl tips.
8. Incubator.
9. Tube rotator.
10. 1.5 ml semi-micro cuvette.
11. Spectrophotometer.

2.2 Dose-Response and Checkerboard Assays

1. Drugs X, Y, and Z (see Note 2).
2. DMSO.
3. 1.5 ml Eppendorf microcentrifuge tubes.
4. Manual pipette.
5. 20 μl and 1000 μl tips.
6. Vortex mixer.
7. 96-well plates.
8. Reagent reservoir.
9. Breathable sealing film.
10. Microplate reader.

3 Methods

Carry out all protocols at room temperature. Thaw new aliquots of
bacteria and drugs each day. Prior to experiments, prepare LB Broth
by adding 25 g of powder to 1 L of distilled water, autoclave
it at 121 °C for 15 min, and store the autoclaved media at room
temperature. Dissolve drugs X, Y, and Z in DMSO at a concentration
of 2 mM, and freeze aliquots in 1.5 ml Eppendorf tubes at −20 °C.

3.1 Day 1: Start Bacterial Culture

1. Take one aliquot of Escherichia coli from −80 °C.
2. Add 100 μl of bacterial culture to 5 ml of growth media in a
culture tube.
3. Leave to grow overnight on a tube rotator in a 37 °C incubator.

3.2 Day 2: Serial Dilution Dose Response

1. Take one aliquot of drugs X, Y, and Z from −20 °C, leave them
at room temperature for 10 min, and prepare for serial dilution
of these drugs.
2. Prepare LB-10% sol by mixing LB media and solvent (DMSO)
in a 9:1 ratio.
3. Prepare LB-10% drug X by mixing LB media and drug X in a
9:1 ratio.
4. Vortex and add 20 μl of the 1:10 diluted solvent (LB-10% sol)
to 10 wells in a 96-well plate.
5. Vortex and add 20 μl of the 1:10 diluted drug X (LB-10% drug
X) into the first well.
6. Take 20 μl of content from the first well and add it to the second
well. Dilute the drug concentration serially by adding 20 μl of
content to the adjacent well below, continuing until the ninth
well (see Fig. 1a).
7. Discard the last 20 μl of content from the ninth well (the last
well of the column is used as a no-drug control).
8. Repeat steps 3–7 for the drugs Y and Z (see Note 3).
9. Measure the OD600 of a 1:10 dilution of the culture started
on Day 1.
10. Dilute the cells in growth media to an OD of 0.01 (see Note 4).
11. Add 80 μl of cells onto the drug serial dilutions prepared in
step 8. The final drug concentration in each well is shown in
Fig. 1a (and sketched in the code after this list).
12. Seal the plate to avoid evaporation.
13. Leave the plate for 12 h at 37 °C in a shaker at 150 rpm.
14. Start a new bacterial culture to use on Day 3 (repeat Subheading 3.1).
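
The final concentrations produced by this serial dilution can be worked out in a few lines of R (the language listed among the software in Chapter 2 of this volume). This is a minimal sketch, not part of the published protocol; it assumes only the 2 mM DMSO stocks described above, the 1:10 dilution into LB, and the 1:5 dilution that occurs when 80 μl of cells are added to the 20 μl of drug solution in each well.

  # Final well concentrations for the serial dilution (sketch; assumes the
  # 2 mM stocks above). LB-10% drug is a 1:10 dilution of stock; adding
  # 80 ul of cells to the 20 ul already in a well dilutes it a further 1:5.
  stock_uM <- 2000                       # 2 mM stock, in uM
  top_uM   <- stock_uM / 10 / 5          # well 1 after cells are added
  final_uM <- c(top_uM / 2^(0:8), 0)     # wells 1-9, plus the no-drug control
  round(final_uM, 2)                     # 40, 20, 10, ..., 0.16, 0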

3.3 Day 3: Linear Dilution Dose Response

1. Measure OD600 absorbance for the serial dilution dose-response
plate from Day 2 (see Fig. 1b).
2. Normalize growth by dividing the growth in each well by the
growth in the no-drug control. For each drug, choose as 1× the
dose that is twice the minimum concentration resulting in
no growth.
3. For each drug, prepare LB-10% drug by mixing LB media and
drug in a 9:1 ratio, where the drug’s concentration is 50× the
dose chosen in step 2. Similarly, prepare LB-10% sol by mixing
LB media and solvent (DMSO) in a 9:1 ratio.
4. Prepare linearly increasing doses of drugs X, Y, and Z in ten
concentrations by mixing LB-10% drug and LB-10% sol in the
volumes shown in Fig. 2a (see Note 3 and the sketch after this
list).
5. Measure the OD600 of a 1:10 dilution of the culture started
on Day 2.

Fig. 1 Serial dilution dose-response experiment. (a) Preparation of serial dilution dose response for one drug
and corresponding final concentrations of the drug. (b) Normalized growth in serial dilution of drugs X, Y, and
Z. Each rectangle here represents a well of 96-well plate. Concentrations of each drug chosen for the next
experiment are shown in orange

Fig. 2 Linear dilution dose-response experiment. (a) Preparation of linear dilution dose response for each drug
and corresponding final concentrations of the drug. (b) Normalized growth in linear dilution of drugs X, Y, and
Z. Each rectangle here represents a well of 96-well plate. Concentrations of each drug chosen for the next
experiment are shown in orange

6. Dilute the cells in growth media to an OD of 0.01 (see Note 4).
7. Add 80 μl of cells onto the drug linear dilutions prepared in
step 4. The final drug concentration in each well is shown in
Fig. 2a as ratios of 1×.
8. Seal the plate to avoid evaporation.
9. Leave the plate for 12 h at 37 °C in a shaker at 150 rpm.
10. Start a new bacterial culture to use on Day 4 (repeat
Subheading 3.1).
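
One way to arrive at the mixing volumes of Fig. 2a is sketched below. The target doses (0.1×–1× in ten equal steps) and the 100 μl mix volume per dose are hypothetical placeholders; only the 50× strength of LB-10% drug and the 1:5 dilution upon adding cells come from the protocol.

  # Mixing volumes for a linear dose series (sketch; the authoritative
  # volumes are those in Fig. 2a). LB-10% drug is at 50x; 20 ul of mix +
  # 80 ul of cells dilutes each mix 1:5. Target doses and the 100 ul mix
  # volume per dose are hypothetical.
  target_x <- seq(0.1, 1.0, length.out = 10)  # final doses, as ratios of 1x
  mix_x    <- target_x * 5                    # dose required in the 20 ul mix
  total_uL <- 100                             # mix volume prepared per dose
  drug_uL  <- total_uL * mix_x / 50           # volume of 50x LB-10% drug
  sol_uL   <- total_uL - drug_uL              # topped up with LB-10% sol
  data.frame(target_x, drug_uL, sol_uL)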

3.4 Day 4: Checkerboard Assay Experiment

1. Measure OD600 absorbance for the linear dilution dose-response
plate from Day 3 (see Fig. 2b).
2. For each drug, choose the concentration that resulted in 80%
growth inhibition (IC80) as 1× (see Note 5).
3. For drug X, label four tubes as LB-drugX0, LB-drugX1,
LB-drugX2, and LB-drugX3, and add 189 μl of LB media to
these tubes.
4. To each tube, add 0, 7, 14, or 21 μl of 100× drug X, and add
21, 14, 7, or 0 μl of solvent (DMSO), as shown in Fig. 3a (the
resulting doses are sketched in the code after this list).
5. Repeat steps 3 and 4 for the drugs Y and Z (see Note 6).
6. Preparation of a 4 × 4 checkerboard assay for drug X + drug Y
is shown in Fig. 3a.

Fig. 3 Miniaturized checkerboard assay. (a) Preparation of drug mixes and placement of each drug in a 96-well
plate for a 4 × 4 checkerboard. Each rectangle here represents a well of a 96-well plate. Drug X and drug Y pairs
are used as an example for preparation. (b) Interpretation of drug pair results in a 4 × 4 checkerboard assay as
additive, synergistic, or antagonistic

7. Add 10 μl of LB-drugX0, LB-drugX1, LB-drugX2, and
LB-drugX3 to each well in the first, second, third, and fourth
rows, respectively.
8. Add 10 μl of LB-drugY0, LB-drugY1, LB-drugY2, and
LB-drugY3 to each well in the first, second, third, and fourth
columns, respectively.
9. Repeat steps 6–8 for X + Z and Y + Z, in duplicate, which
corresponds to one 96-well plate (4 × 4 × 6 = 96) (see Note 3).
10. Measure the OD600 of a 1:10 dilution of the culture started
on Day 3.
11. Dilute the cells in growth media to an OD of 0.01 (see Note 4).
12. Add 80 μl of cells onto the 4 × 4 checkerboard assays.
13. Seal the plate to avoid evaporation.
14. Leave the plate for 12 h at 37 °C on a shaker at 150 rpm.
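
The doses produced by the mixing scheme in steps 3 and 4 can be checked with a short calculation. A minimal sketch in R, using only volumes stated in the protocol:

  # Final doses from the 4x4 mixing scheme (sketch). Each tube: 189 ul LB
  # + v ul of 100x drug + (21 - v) ul DMSO, 210 ul total. Each well gets
  # 10 ul of an X tube + 10 ul of a Y tube + 80 ul of cells (100 ul total).
  v       <- c(0, 7, 14, 21)        # ul of 100x drug per tube
  tube_x  <- 100 * v / 210          # drug concentration in each tube (in x)
  final_x <- tube_x * 10 / 100      # after 10 ul into a 100 ul well
  round(final_x, 2)                 # 0.00 0.33 0.67 1.00, i.e., thirds of IC80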

3.5 Day 5: Checkerboard Assay Results

1. Measure OD600 absorbance for the checkerboard assay plate
from Day 4.
2. Example results for additive, synergistic, or antagonistic drug
pairs are shown in Fig. 3b.
3. For each experiment, count the number of wells where there is
no growth. This count will be high for synergistic drug pairs,
medium for additive drug pairs, and low for antagonistic drug
pairs. Compare results from replicates.
4. For further exploration of how to score checkerboard assays,
the reader is encouraged to consult refs. 2, 4, 6–8.
We have previously used this miniaturized checkerboard assay
protocol in two antibiotic interaction screens, where all pairwise
interaction scores for 24 compounds (276 pairs) were determined
in replicate. For these screens, we developed a scoring method
based on the Loewe additivity model, where negative, zero, or positive
values correspond to synergy, additivity, or antagonism, respectively.
MATLAB functions that use the 4 × 4 growth metrics and compute a
drug interaction score are shared as the supplementary material of
ref. 8, as well as all the raw growth measurements recorded in this
screen.
In these screens, we found that the pairwise interactions
among fusidic acid, oxacillin, and amikacin cover all three possible
drug interaction types: fusidic acid and oxacillin are synergistic,
fusidic acid and amikacin are additive, and oxacillin and amikacin
are antagonistic. We suggest that the reader use these three drugs
when trying this protocol, in order to observe the full extent of the
drug interaction phenotypes. The reader may use the simple scoring
method described in Day 5, step 3 of the protocol (sketched below) or
the more involved synergy metric described in ref. 8. With materials
that can be found in an undergraduate laboratory class, our protocol
describes an efficient and reproducible method to measure antibiotic
interactions.
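
The counting rule of Day 5, step 3 is easy to automate. The sketch below illustrates that rule only; the 10% growth threshold is an arbitrary placeholder, and this is not the Loewe-based MATLAB scoring of ref. 8.

  # Score a 4x4 checkerboard by counting no-growth wells (sketch of Day 5,
  # step 3). 'od' is the 4x4 matrix of OD600 readings for one drug pair;
  # 'od_ctrl' is the no-drug control reading.
  score_checkerboard <- function(od, od_ctrl, threshold = 0.1) {
    growth <- od / od_ctrl               # normalize to no-drug growth
    sum(growth < threshold)              # high: synergy; low: antagonism
  }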

4 Notes

1. While antibiotic interaction in E. coli is the example here, any
species can be substituted, with its respective growth media and
growth conditions used instead.
2. Any small molecule that inhibits growth, and its corresponding
solvent, can be used.
3. In our protocol, there is 2% solvent in all microplate growth
experiments, ensuring that the effects we observe are not due to
the solvent.
4. Since the cell density influences the inhibitory concentration of
a drug, it is important that the cells used are at an OD = 0.01.
5. In our experience, we have found IC80 to be the most informative
top concentration in a miniaturized checkerboard assay.
6. Although we need only 160 μl of each concentration (10 μl × 4 per
interaction assay × 4 interaction assays), we prepare 210 μl for
ease of calculation and pipetting.

References

1. Zimmermann GR, Lehár J, Keith CT (2007) Multi-target therapeutics: when the whole is greater than the sum of the parts. Drug Discov Today 12:34–42
2. Berenbaum MC (1989) What is synergy? Pharmacol Rev 41:93–141
3. Chevereau G, Bollenbach T (2015) Systematic discovery of drug interaction mechanisms. Mol Syst Biol 11:807
4. Cokol M (2013) Drugs and their interactions. Curr Drug Discov Technol 10:106–113
5. Chandrasekaran S et al (2016) Chemogenomics and orthology-based design of antibiotic combination therapies. Mol Syst Biol 12:872
6. Cokol M et al (2011) Systematic exploration of synergistic drug pairs. Mol Syst Biol 7:544
7. Greco WR, Bravo G, Parsons JC (1995) The search for synergy: a critical review from a response surface perspective. Pharmacol Rev 47:331–385
8. Mason DJ, Stott I, Ashenden S, Karakoc I, Meral S, Weinstein ZB, Kuru N, Bender A, Cokol M (2017) Prediction of antibiotic interactions using descriptors derived from compound molecular structure. J Med Chem 60(9):3902–3912. https://doi.org/10.1021/acs.jmedchem.7b00204
Chapter 2

High-Throughput Screening for Drug Combinations

Paul Shinn, Lu Chen, Marc Ferrer, Zina Itkin, Carleen Klumpp-Thomas,
Crystal McKnight, Sam Michael, Tim Mierzwa, Craig Thomas,
Kelli Wilson, and Rajarshi Guha

Abstract
The identification of drug combinations as alternatives to single-agent therapeutics has traditionally been a
slow, largely manual process. In the last 10 years, high-throughput screening platforms have been devel-
oped that enable routine screening of thousands of drug pairs in an in vitro setting. In this chapter, we
describe the workflow involved in screening a single agent versus a library of mechanistically annotated,
investigational, and approved drugs in a full dose-response matrix scheme, with viability as the readout.
We provide details of the automation required to run the screen, the informatics required to process data
from the screening robot, and the subsequent analysis and visualization of the datasets.

Key words Drug combination screening, Acoustic dispensing, Automation, Compound manage-
ment, Synergy

1 Introduction

High-throughput screening for compounds that affect cell viability
has been utilized as a method for discovery of novel treatments for
various human diseases. For patients with cancer and certain infec-
tious diseases, combinations of drugs are given to achieve maximal
clinical benefit. An additional benefit of a clinically synergistic drug
combination is that both drugs may be synergistic at a low dose,
which can reduce off target toxicities. For infectious diseases such as
HIV, drug combinations are critical to prevent infectious agents
from acquiring mutations to evade the action of a single drug. The
search for novel synergistic drug pairs requires the development of a
systematic, large-scale screening platform. CombinatoRX, a bio-
tech company acquired in 2014 by Horizon Discovery, was the first
to publish a series of papers utilizing drug combination screening to
explore synergistic drug responses in various disease models such as
cancer and drug-resistant bacteria [1–3]. A recent study spear-
headed by AstraZeneca and NCI-DREAM utilized a


crowdsourcing approach to predict synergistic drug combinations


for treatment of B-cell lymphoma [4].
The development of a methodology for large-scale testing of
drug combinations in vitro was advanced by the incorporation of
acoustic dispensing technology, which allows for the flexibility of an
anywhere-to-anywhere compound transfer. Given that drug com-
bination screening requires two or more compounds present in a
single well, contact-based transfer methods would be costly in time
and resources to reduce the possibility of contamination between
transfer steps. Using a noncontact dispenser greatly reduces the
amount of sample and consumables used as well as the complexity
that would be involved if traditional contact-based pipetting had
been applied.
Here we report the methods and workflow specifically for drug
combination screening that has been implemented and optimized
at the National Institutes of Health’s National Center for Advanc-
ing Translational Sciences (NCATS). This drug combination
screening platform has been applied to multiple areas of drug
discovery including cancer, malaria, Ebola, and various other dis-
ease models [5–8] and as of 2016 has tested over 200,000 discrete
drug combinations. This automated screening platform has
required the use of in-house software development as well as inte-
gration of various instrumentations in order to achieve an almost
fully automated workflow. We typically refer to this as Matrix
screening, due to the layout of the drug combinations in a grid
format on the final plate. The workflow presented here was utilized
for screening of the Ewing’s sarcoma cell line and has been
published [8].

2 Materials

The DMSO stock solutions are stored at −20 °C, but all other
operations occur at room temperature.

2.1 Consumables

1. Dimethyl sulfoxide (DMSO): 100% DMSO, ACS grade.
2. 1.4 mL Matrix 2D barcode tube (sample tube): Thermo Scientific, #3711.
3. 96-well Society for Biomolecular Screening (SBS) footprint
rack that holds sample tubes (compound source rack).
4. SepraSeal cap (cap): Thermo Scientific, #4463.
5. 96-well polypropylene compound plate (intermediate plate):
VWR, #82006-704.
6. 384-well polypropylene compound plate (mother plate):
Greiner, #784201.
7. SBS footprint reservoir (DMSO reservoir).

8. 384-well cyclic olefin copolymer (COC) plate (acoustic source
plate): Greiner, #788876.
9. P25 JANUS tips (P25 tips): Perkin-Elmer, #6000689.
10. Biomek FX P30XL tips (P30XL tips): Beckman Coulter,
#A22288.
11. Biomek FX P30 tips (P30 tips): Axygen, FX-1536-30FP-R-S.
12. DMSO-resistant adhesive foil seal (foil seal): 4titude,
4Ti-0512.
13. Deionized water.
14. 70% ethanol.
15. White 1536, tissue culture treated, high base plates (assay
plate): Aurora, EWB04100A.
16. T175 tissue culture flasks.
17. TC71, Ewing’s sarcoma cancer cell line, DSMZ repository
#ACC 516.
18. RPMI-1640 cell culture media, Thermo Fisher Scientific
#11875093.
19. Fetal Bovine Serum, GE Healthcare Life Sciences,
#SH30071.03.
20. Penicillin-streptomycin, Thermo Fisher Scientific #15140122.
21. 0.25% Trypsin-EDTA, Thermo Fisher Scientific #25200056.
22. CellTiter-Glo® One Solution (CellTiter-Glo): Promega
G7573.

2.2 Equipment and Instrumentation

1. Benchtop vortex mixer.
2. Sonicating water bath.
3. Automated compound store (ACS): Brooks Automation, A3+.
4. Automated decapper: Univo, #DC480.
5. TubeAuditor: automated volume measurement device from
Brooks Automation.
6. JANUS liquid handler (JANUS): Perkin-Elmer.
7. Handheld barcode scanner.
8. Matrix WellMate bulk liquid dispenser (WellMate), Thermo
Scientific.
9. Handheld 8-channel pipettor.
10. Biomek FX liquid handler (FX), Beckman Coulter.
11. Benchtop centrifuge.
12. Handheld pipettor.
13. Rubber roller.

14. ATS-100 acoustic dispenser (ATS-100): EDC Biosystems, Gen4+.
15. Multidrop Combi dispenser (Multidrop), Thermo Fisher Sci-
entific, 5840300.
16. Multidrop Combi dispensing cassette (cassette), Thermo
Fisher Scientific #24073290.
17. Metal, foam gasketed lid (compound plate lid).
18. Clear assay lid.
19. Stainless steel, rubber gasketed assay lid (assay lid).
20. ViewLux reader (ViewLux): Perkin-Elmer.
21. Polystyrene Universal Microplate Lid (plastic lid), Corning
#3098.
22. Automated acoustic plate reformatter (HRB): HighRes Bioso-
lutions, ACell.

2.3 Software Components

1. Microsoft Excel or equivalent spreadsheet program.
2. Matrix Script Plate Generator (MSPG).
3. R 3.3.1 and the ncgcmatrix package.

3 Methods

3.1 Preparation of Stock Compound Solution

1. Prepare compound stock solutions by weighing compound
into a sample tube to make 800 μL of a 10 mM DMSO solution
(see the sketch after this list).
2. Cap and vortex the sample tube for 10 s at 3200 rpm. Visually
inspect that the compound has completely dissolved; sonicate
the sample tube for up to 10 s in a sonicating water bath to
assist in dissolution, if necessary.
3. Register the sample tube barcode to sample ID association in
the database, and load the sample tube into the ACS.
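
A weigh-out helper for these stocks is sketched below in R (the language listed in Subheading 2.3). The function names are ours; only the 800 μL volume and 10 mM concentration come from the protocol.

  # Weigh-out arithmetic for 10 mM DMSO stocks (sketch; function names are
  # hypothetical). mw is the compound's molecular weight in g/mol.
  mg_to_weigh <- function(mw, vol_uL = 800, conc_mM = 10) {
    conc_mM * (vol_uL / 1e6) * mw        # (mmol/L) * L * (mg/mmol) = mg
  }
  dmso_uL_to_add <- function(mass_mg, mw, conc_mM = 10) {
    mass_mg / mw / conc_mM * 1e6         # mg / (mg/mmol) = mmol; / (mmol/L) = L
  }
  mg_to_weigh(500)                       # 4 mg of a 500 g/mol compound
  dmso_uL_to_add(4, 500)                 # 800 uL of DMSO to reach 10 mM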

3.2 Compound Source Rack Plate Map Creation

1. Based on prior IC50 determination of the compounds of interest,
prepare a Matrix screening request form following the
template format as shown in Figs. 1a and 2.
2. Identify a list of available sample tubes from the chemical
inventory system, and input the list of sample tube barcodes
to the ACS to cherry-pick the compounds needed to prepare
the acoustic source plate.
3. Remove the compound source racks from the ACS, and allow
the samples to thaw at room temperature. Briefly centrifuge the
compound source racks for 30 s at 234 × g (see Note 1).
Export the cherry-pick plate map from the ACS database to
Microsoft Excel, and save (Fig. 1b). This file is called the
compound source rack plate map.

Fig. 1 An overview of the process to prepare an acoustic source plate and related files

Fig. 2 An example of the matrix screening request form in Microsoft Excel format with the required information
filled in. Each drug combination should be listed by row. Drug A should be listed with the name, starting
concentration in the assay in nanomolar (nM), dilution factor of the drug in the screen, and internal compound
ID. Drug B in the combination should be listed next with the same information. The researcher should also
specify the size of the matrix block as well as information regarding the number of replicates needed and
assay types used

3.3 Preparation of All Files Needed for Creation of the Acoustic Source Plate

The MSPG application will create the appropriate files needed for
each critical instrument used in creation of the acoustic source
plate. MSPG will take the submitted request form and create
the JANUS worklist file to prepare the mother plates, whose contents
are then transferred to acoustic source plates. It will also generate
transfer script files that are used on the ATS-100 for acoustic
dispensing, a worklist file used to schedule the movement of plates
on the HRB, and a plate map of the assay plate which is used for
data analysis (see Note 2).
1. Open the MSPG application, and click on Matrix Order which
will open the “Import Compound Combinations Wizard”
(Figs. 1c and 3).
2. Click Next to begin the Wizard. On the second window, select
the “Browse” button, and select the Matrix screening request
form (Fig. 2) containing all the drug combinations you wish to
process. After selecting the file, a preview of the combinations
will appear in the window as seen in Fig. 4. Click “Next” to
advance to the next screen.
3. Use the drop-down menu to select the Excel worksheet tab
that contains the compound pairs. For each Compound A and

Fig. 3 Select the “Matrix Order” menu item to initiate the wizard that will walk you through the use of the tool,
which will open the import compound combination wizards

Fig. 4 Browse to and select the request form to display a preview of the desired matrix combinations

Fig. 5 Data fields from the matrix request form are associated with the variable fields in the MSPG

Compound B pair, use the drop-down menus to map the
appropriate columns in the spreadsheet to the appropriate
column headers named “Agent Column,” “Concentration
Column,” and “Dilution Factor Column.” Click “Next”
(Fig. 5). Verify that the columns in the Review window have
been mapped properly, and click Finish (Fig. 6). The software
has now recorded the requested drug combinations to
be made.
4. In the next window, input the assay parameters in the Source
Plate Settings window (Fig. 7a). Copy the Compound source

Fig. 6 The review window provides verification that the columns from the spreadsheet were mapped correctly

Fig. 7 (a) Input parameters appropriate for the matrix screen are entered; (b) the compound source plate map
is pasted into the MSPG; (c) successful generation of a “Master” Excel file is visible in the MSPG
“Spreadsheets” window

rack plate map from Excel and paste it into the Plate Map window
of the MSPG, and click “Submit” to create the Master file
(Fig. 7b). The various tabs of the Master file will be visible in
the MSPG “Spreadsheets” tab (Fig. 7c). Save the Master file
from the MSPG as an Excel file in a folder where you want all
future MSPG files deposited (see Note 2).
5. Click on the “Script Files” tab within MSPG, and select the
“Source Plate” type, “Destination Plate” type, and “Matrix
Size” pertaining to the parameters in the Matrix order request
form. Click on “Load Picklist” and browse and open the Mas-
ter file. Click “Create Scripts” to generate the ATS-100 scripts,
HRB worklist, and assay plate maps which will be automatically
saved to a sub-folder of the project folder where the Master file
resides (Figs. 8 and 9).
6. Close the MSPG.
7. Open the Master file in Excel, and separately save each of the
“Janus,” “Intermediate,” and “DMSO” tabs as comma-
separated value (CSV) files in the same folder location as the
Master file (see Note 3). Close the Master file.

Fig. 8 After the script, datamap, and worklist files are successfully created, you can preview the compound
dispense patterns from source to destination in the MSPG

Fig. 9 The JANUS worklist files define the source to destination compound transfers from compound source
rack to mother plate. The HRB worklist directs the Cellario scheduler to move plates through the HRB, pairing
the source (acoustic source plate) and destination (assay plate) plates at the ATS-100 to perform the acoustic
liquid transfer. The acoustic transfer script (ATS) file instructs the ATS-100 to dispense compound from a
specific source well location to a specific destination well location. The datamap file is loaded to
the data analysis system to identify the compound pair IDs, concentrations, and volumes dispensed to each
well

3.4 384-Well Mother Plate and Acoustic Source Plate Preparation

1. Prime the JANUS tubing with water until all air bubbles are
flushed out (see Note 4). Remove the caps from the sample
tubes using the automated decapper.
2. Scan and associate the compound source rack barcodes to the
JANUS deck positions using the attached handheld barcode
scanner, and then place the compound source racks into their
assigned deck locations. Supply P25 tip boxes, the necessary
number of empty mother plates as determined by the JANUS
worklist file, and a filled DMSO reservoir to the JANUS deck
(see Fig. 10 and Note 5).
3. Start the JANUS protocol, and when prompted, import the
“DMSO” worklist file to begin the transfer. This worklist file
directs the JANUS to transfer the required amount of DMSO

Fig. 10 JANUS deck layout

from the reservoir into the mother plates to create the
requested dilution for screening.
4. If no intermediate plate is needed, proceed to step 5. If an
intermediate plate is needed (see Note 3), fill all the wells of an
intermediate plate with 45 μL DMSO using the WellMate, and
place the intermediate plate into the “Intermed Plate” location
on the JANUS deck. Start the JANUS protocol, and when
prompted, import the “Intermediate” worklist file which will
transfer compounds from the compound source rack into the
intermediate plate. Remove the intermediate plate from the
JANUS, and mix the odd columns of the intermediate plate
ten times using a handheld 8-channel pipettor. Using clean
tips, transfer 5 μL of solution from every odd column to
every adjacent even column, and repeat the mixing after each
transfer. Return the intermediate plate to the JANUS.
5. Start the JANUS protocol, and when prompted, import the
“Janus” worklist file. This will transfer compound from the
compound source rack and intermediate plate into columns
1 and 11 of the mother plates (Fig. 1e).
6. Remove the compound source racks from the JANUS, and
apply caps manually.
7. Process the sample tubes through the TubeAuditor to update
the database with the new volumes, and load them back into
the ACS for long-term storage.
8. Transfer the mother plates to the WellMate, and dispense
10 μL DMSO to columns 2–10 and 12–20 of each mother

Fig. 11 The “10 × 10 matrix serial dilution” protocol performs a twofold serial dilution from columns 1 to
9 and 11 to 19 of each mother plate. Columns 10 and 20 remain DMSO only

plate (Fig. 1f). Briefly centrifuge the mother plates for 30 s at
234 × g (see Note 6).
9. Place the mother plates onto the FX in the appropriate locations
(see Notes 7 and 8). Run the “10 × 10 matrix serial dilution”
protocol, which will perform intraplate serial dilutions in each
mother plate (see Figs. 1g and 11, and the sketch after this list).
10. Transfer 8 μL of sample from the mother plate to the acoustic
source plate using the “384-well to 384-well acoustic transfer”
protocol on the FX (see Note 9). Briefly centrifuge the acoustic
source plate for 30 s at 234 × g (Fig. 1h).
11. Press a foil seal on to each acoustic source plate using a rubber
roller, and store at room temperature until it is ready for use
(see Note 10).
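
The dose grid this produces can be written down directly. A minimal sketch follows; the top concentrations are hypothetical, and the structure (nine twofold dilutions per drug plus a DMSO-only zero from columns 10 and 20) is the only part taken from the protocol.

  # Dose grid behind one 10 x 10 matrix block (sketch; top concentrations
  # are hypothetical). Nine twofold dilutions per drug plus a zero dose.
  doses <- function(top_uM) c(top_uM / 2^(0:8), 0)
  grid  <- expand.grid(drugA_uM = doses(40), drugB_uM = doses(10))
  nrow(grid)    # 100 dose combinations per drug pair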

3.5 Acoustic Dispensing from Acoustic Source Plate to Assay Plate on the HRB System

1. In Cellario 2.0 on the HRB computer, create a new order using
Cellario’s Cherry Pick Wizard (Figs. 12a–d and 13, Note 11).
Copy the acoustic transfer scripts from the network folder to
the “EDC_Scripts” folder on the HRB computer.
2. Spin down the acoustic source plates for 60 s at 234 × g.
Manually remove the foil seals, and apply compound plate lids

Fig. 12 Using the Cellario Cherry Pick Wizard to create an order. (a). Import the HRB worklist to Cellario using
the Cherry Pick Wizard. (b). Define the source and (c). destination plate types as well as the starting location
for the first plate of each type in the Ambistore. (d). Verify that the physical location of the source and
destination plates matches what is virtually represented in Cellario

Fig. 13 The acoustic transfer protocol has separately defined steps (threads) for
each source and destination (Dest) plates. The ACell robot shuttles the source
and Dest plates according to the plate order in the HRB worklist from device to
device as each step in their respective threads is executed. The source and Dest
plates intersect at the Cherry Pick step where the Cellario scheduling software
instructs the ATS-100 to dispense a specific script file from the appropriate
acoustic source plate to the designated assay plate

Fig. 14 The HRB automated acoustic plate reformatting system

to the acoustic source plates, and load them to available slots in


the AmbiStore hotel.
3. Cover each assay plate with a clear assay lid, and load the assay
plates to a free slot in the AmbiStore (Fig. 14). All plates should
have A1 facing outward.
4. Perform an inventory scan of the AmbiStore which will scan
and record all AmbiStore column barcodes into the internal
database. Export the barcodes from the database to Excel.
5. Using Excel, associate the barcodes and AmbiStore positions to
the plate names in the HRB worklist file. This will be used later
to update the datamaps with the plate barcode for data analysis.
6. Initialize the HRB system (see Notes 12 and 13), and start the
order to automatically transfer compounds from the acoustic
source plates to the assay plates.
7. After all the transfers are complete, remove the acoustic source
plates from the AmbiStore, and apply foil seals with a rubber
roller. Transfer the assay plates to the laboratory for cell
addition.
8. Update the datamaps with the actual assay plate barcodes, and
copy the datamaps to the primary screening system’s project
folder.

3.6 Cell Plating

1. Visually inspect cells in the flask to ensure the cell morphology
appears uniform and that there are not large numbers of
detached cells. Ensure that media is not cloudy or oddly colored,
which can be an indication of fungal or bacterial contamination.
Move flasks into a sterile biosafety cabinet.
2. Aspirate all media from the flask and discard. Add 7 mL of
trypsin, and place the flask into a standard tissue culture incu-
bator for 5 min. Visually inspect flasks periodically until all cells
are detached.
3. Add 7 mL of complete culture media to the flask to neutralize
the trypsin. Transfer all contents of the flask to a 15 mL conical
tube and put on the lid. Place conical tube into a tabletop
centrifuge along with an equally weighted balance tube, and
spin at 233  g for 5 min to pellet cells.
4. Remove the tube from the centrifuge and verify that a cell pellet
is present. Aspirate as much volume from the tube as possible
and discard. Do not aspirate the cell pellet.
5. Add complete media to the cell pellet, and use a pipettor to
resuspend the pellet in media. Ensure the cells are homoge-
nously distributed in the media (see Note 14).
6. Count the cells using the desired method, and determine the
number of cells per mL. Determine the total number of cells
needed for plating by multiplying the number of cells per well
by the total number of plates by 1536. Calculate the total
volume of media for plating by multiplying the volume per
well by the total number of plates by 1536. Add 20% additional
volume to account for dead volume, priming volume, and extra
plates needed for imaging and growth rate calculations (see
Note 15 and the sketch after this list).
7. Add the total number of cells needed into the total volume of
media needed to fill all assay plates. Ensure the solution is well
mixed and without clumps to ensure homogenous distribution
of cells. Ensure the vessel used for diluting the cells has a wide
enough opening to accommodate the Multidrop cassette
tubing.
8. Load the cassette onto the Multidrop. Clean the cassette by
priming 10 mL of deionized water followed by 10 mL of 70%
ethanol followed by 10 mL of deionized water. While cleaning,
ensure that dispenser streams are linear and continuous.
9. Program the Multidrop to dispense 5 μL per well to 1536-well
standard height plates and filling columns 2–48 for dispensing.
Column 1 is left empty as a background control.
10. Place the cassette tubing into the cell solution, and prime for
10 s to ensure all tubing is filled with the cell solution. Cover
the opening of the cell dilution vessel with its lid to ensure no

large objects drop into the vessel. Remove all plastic lids from
the assay plates.
11. Place the first plate on the Multidrop in the correct orientation,
well A01 should be in the upper-left-hand corner of the plat-
form. Press start to begin dispensing cells. This can be done in a
biosafety cabinet or in an open laboratory environment (see
Note 16).
12. Once the 1536-well assay plate is filled, remove the 1536-well
assay plate from the Multidrop, and ensure that all wells are
uniform in appearance. If the liquid level is uniform, apply an
assay lid to the 1536-well assay plate (see Note 17). Repeat
until all assay plates are complete.
13. All lidded 1536-well assay plates should then be moved to the
incubator on the primary screening system with standard con-
ditions, 37  C, 95% humidity, and 5% CO2, and the odd
numbered plate barcode facing outward.
14. Following plating, the cassette should be cleaned using the
same protocol as in step 8. If any tips appear to be clogged, use
the syringe tool that comes with the cassette to clear any
blockages. After cleaning is complete, empty all liquid from
the tubing using the empty button on the Multidrop. Store the
cassette in the box provided.
15. All vessels that contained cells should be discarded in a biohaz-
ard disposal box.
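
The plating arithmetic of step 6 is sketched below; all inputs except the 5 μL dispense volume, the 1536-well count, and the 20% overage are hypothetical placeholders.

  # Cell and media requirements for plating (sketch of step 6; the cell
  # density and plate count are hypothetical).
  cells_per_well  <- 500       # assay-specific seeding density
  vol_per_well_uL <- 5         # Multidrop dispense volume (step 9)
  n_plates        <- 20        # number of 1536-well assay plates
  overage         <- 1.2       # +20% for dead volume, priming, extra plates

  total_cells    <- cells_per_well * n_plates * 1536 * overage
  total_media_mL <- vol_per_well_uL * n_plates * 1536 * overage / 1000
  c(total_cells = total_cells, total_media_mL = total_media_mL)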

3.7 Reading Assay Plates on the Primary Screening System

1. Scan and inventory the assay plate barcodes in the primary
screening system’s assay plate incubator (see Note 18).
2. Incubate the assay plates for 48 h.
3. Set up to run the read protocol on the primary screening
system (Fig. 15). Using the barcodes in the inventory database,
create an assay file in CSV format containing the assay plate
barcodes which were originally loaded. The assay plates will be
handled in the order the barcodes are listed in the CSV file.
4. Create a method file containing the steps of the read protocol
using the Method Editor software on the Dispatcher computer
for the robotic system (see Note 19).
5. Compress the ViewLux data files, 1536-well assay datamaps,
ATS-100 dispense error logs, and assay file with cell line names
to a ZIP file, and save to a shared network folder for retrieval
from informatics for data analysis.

3.8 Data Processing

1. Plate data files, along with the plate maps which specify the
compound/control composition and concentrations, are
loaded into an Oracle database using an in-house software

Fig. 15 A typical ViewLux read of a 1536-well assay plate tested in pairwise drug
combination screening

stack that parses a wide variety of formats, such as EnVision,
ViewLux, InCell, and others (see Note 20).
2. Apply necessary corrections if a plate material-specific artifact is
observed. For example, correction for luminescent cross talk is
necessary when using the CellTiter-Glo readout on cyclic olefin
copolymer plates. Calibration of the parameters needed for cross
talk correction (e.g., percentage of signal bleeding) before the
assay is required.
3. A number of plate-level quality control metrics are computed
on the raw or corrected data (e.g., coefficient of variation (CV),
Z′ [9], signal to background ratio, SSMD [10]); see the sketch
after this list. Plates that fail quality control may be rerun or else
removed from consideration in the following steps (resulting in
the loss of some combinations).
4. Plate data are visualized in the form of heatmaps (one per plate)
and manually inspected for assay artifacts (e.g., layout-specific
pattern). If observed, a correction algorithm is applied. If the
result of the correction is unsatisfactory, the affected blocks
(based on the wells affected by the artifact) are ignored from
future analysis (see Note 21).
5. Sample wells are normalized to the in-plate positive and nega-
tive controls based on the raw or corrected data (e.g., borte-
zomib and DMSO, respectively) (see Note 22).

6. Using the pre-specified combination layout mapping file (see
Subheading 3.5), plate-level data is deconvoluted into individual
6 × 6 or 10 × 10 blocks using the normalized (or the
corrected, if necessary) data.
7. A Matrix Quality Control (mQC) evaluation [11] is computed
for each block that classifies the quality of the combination
responses as bad, medium, and good along with a confidence
score (ranging between 0 and 1) for each category. Other
reproducibility metrics (e.g., minimum significant ratio
(MSR), standard deviation, etc.) using replicates may also be
computed to check for other assay artifacts. This allows the
users and downstream analyses to ignore failed or low-quality
combination responses (see Note 22).
8. The combination response for each block is then analyzed to
compute multiple synergy metrics. These metrics have been
described previously [12, 13] and provide multiple character-
izations of the combination response by quantifying the devia-
tion of the observed response from a predefined model of
additivity. Currently we compute synergy metrics based on
the highest single agent (HSA) [14] and Bliss [12] additivity
models (see the sketch following this list).
9. Each block, comprising the combination response and com-
puted properties (synergy metrics, mQC score), is stored in an
Oracle database and made available to users via a web interface
(https://tripod.nih.gov/matrix-client). For the exemplar
screen discussed in this chapter, the processed data can be
found at https://tripod.nih.gov/matrix-client/rest/matrix/
blocks/265/table.
10. After processed data has been deployed to the public website, it
may be downloaded for alternative in-depth analyses.
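As a rough illustration of the two additivity models in step 8, the minimal C# sketch below computes summed per-well Bliss and HSA excess for one combination block. It assumes responses have already been converted to fractional effects between 0 and 1; the published metrics [12–14] involve further normalization and aggregation choices, and the type and method names here are ours, not part of the screening software.

using System;

// Minimal sketch: per-well Bliss and HSA excess for one combination block.
// rowEffect[i] / colEffect[j] are single-agent fractional effects (0 = no
// effect, 1 = full effect); combo[i, j] is the observed effect of dose pair (i, j).
public static class SynergySketch
{
    public static (double blissExcess, double hsaExcess) ExcessSums(
        double[] rowEffect, double[] colEffect, double[,] combo)
    {
        double bliss = 0.0, hsa = 0.0;
        for (int i = 0; i < rowEffect.Length; i++)
        {
            for (int j = 0; j < colEffect.Length; j++)
            {
                double ea = rowEffect[i], eb = colEffect[j];
                bliss += combo[i, j] - (ea + eb - ea * eb); // Bliss independence expectation
                hsa += combo[i, j] - Math.Max(ea, eb);      // highest-single-agent expectation
            }
        }
        return (bliss, hsa);
    }
}

A positive excess indicates a response beyond the additivity prediction (possible synergy); a negative excess suggests antagonism.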

4 Notes

1. Samples in the Matrix racks can splash up to the cap during


routine handling. Centrifugation before removing the cap pre-
vents cross-contamination between adjacent wells.
2. The MSPG creates an Excel file (Master file) that contains all
the assay-specific details and can be modified and reused at a
later date. The Master file contains multiple tabs, each of which
contains a file needed to operate an instrument used in the
process. The tabs are as follows:
(a) “Master”—summarized version of the combinations
requested in the request form
(b) “PlateMap” —copy of the imported compound source
rack plate map

(c) "Frequency"—displays the volume usage of each compound in the whole screen
(d) “Source#”—summary of the sample IDs, their initial
source well location, final destination well location, and
dilutions needed to prepare the compound dilutions on
each source plate
(e) “Input” —spreadsheet that can be imported to the MSPG
to create all the ATS files, datamaps, and robot worklists to
perform the acoustic compound transfers
(f) “Intermediate” —JANUS worklist used to transfer com-
pound from the compound source rack plate to the inter-
mediate plate
(g) “Janus” —JANUS worklist used to transfer compound
from the compound source rack plate and intermediate
plates to the mother plate
(h) “DMSO” —JANUS worklist used to transfer compound
from the DMSO reservoir to the mother plate
(i) “Summary” —overview of the total screening parameters
3. The “Janus,” “Intermediate,” and “DMSO” JANUS files are
needed in the next protocol section to create the desired start-
ing concentrations for each compound using the 10 mM
DMSO stock solution. If the final concentration cannot be
practically achieved by diluting the 10 mM DMSO stock with
the addition of DMSO in the mother plate, the MSPG will call
for the creation of an intermediate plate. The intermediate
plate will contain 10- and 100-fold dilutions of the 10 mM
DMSO stock. If using an intermediate plate, the Janus process
will proceed in this order: Intermediate, DMSO, Janus.
4. Flushing all the air bubbles out of the JANUS tubing is neces-
sary to ensure accurate pipetting.
5. Unless otherwise noted, each plate placed on any instrument in
this protocol should be oriented with the A01 well in the
upper-left corner as viewed from the operator’s perspective
standing in front of the instrument.
6. A centrifugation step is necessary after compound addition on
the JANUS and DMSO addition on the WellMate to concen-
trate the liquid to the bottom of the well and eliminate air
bubbles. This will ensure that the serial dilution was performed
accurately.

7. FX deck layout

8. The "10 × 10 matrix serial dilution" FX protocol can process


up to six 384-well mother plates in one run. Place the mother
plates into the front input stacker (Stacker2) of the FX. Place an
empty P30XL tip box at TL1 and a full box of P30XL tips at
P16. The FX is programmed to automatically rearray the
P30XL tips at P16 to columns 1 and 12 of the empty P30XL
tip box at TL1 using the Advanced Selective Tip head. After a
mother plate is delivered to the pipetting zone at Stacker2, the
FX attaches tips at TL1 then begins the serial dilution by
indexing the head from left to right aspirating, dispensing,
and mixing the user-defined sample volumes from well to
well. P30XL tips are discarded after five serial dilution steps
and new tips are attached to reduce the carryover effect. To
ensure proper mixing, the FX is programmed to mix 80% of the
total well volume ten times in each well by aspirating at 0.5 mm
above the well bottom and dispensing at 2.5 mm above the well
bottom [6]. After each mother plate is processed, the Stacker
returns the mother plate to the output location and the cycle is
repeated until all the mother plates are serially diluted.
9. The “384-well to 384-well acoustic transfer” protocol will load
up to 384 tips, pipette up a defined volume from the mother
plate and deposit that volume into an empty acoustic source
plate. Place the mother plate at P1, the empty acoustic source
plate at P4, and a box of P30 tips at TL1. The P30 tip box is
arrayed with tips only in the locations that correspond to real
sample in the mother plate. Start the protocol and transfer 8 μL
of sample from the mother plate to the acoustic source plate.
Repeat the protocol until all mother plates have been transferred to acoustic source plates.

10. Since DMSO is hygroscopic, it is best to use the acoustic


source plates as soon as possible to avoid compound dilution
in the well. The excessive uptake of water can also negatively
affect the operation of the ATS-100 dispenser causing the
liquid level height to be out of range of the dispenser’s cali-
bration setting. To avoid this problem, we do not dispense
more than 10 μL into the acoustic source plates. Heat sealing
the acoustic source plates should be avoided. The cyclic olefin
copolymer (COC) material that comprises the source plate
requires higher sealing temperatures for a longer time com-
pared to standard polypropylene plates. This can deform the
wells and warp the plate which can result in poor performance
on the ATS-100.
11. For troubleshooting purposes, it is helpful to physically seg-
regate the 384-well acoustic plates from the 1536-well assay
plates into separate columns of the HRB AmbiStore. Addi-
tionally, it is helpful to load the plates in sequential barcode
order according to the HRB worklist file into the HRB
AmbiStore with the lowest plate number at the bottom of
each HRB AmbiStore hotel column. The HRB AmbiStore
numbers the rows ascending from bottom to top. Finally, it is
also helpful to load the 1536-well assay plates in sequential
barcode order. The HRB is equipped with two ATS-100s.
The HRB worklist file can be divided into two orders and the
assay plates can be prepared on both ATS-100s simulta-
neously to reduce the overall run time.
12. Before the ATS-100 is used for an acoustic transfer, the 1536-well destination gripper needs to be attached, and the system should be refilled with clean distilled water. The ATS-100 should be thoroughly primed with distilled water using the onboard software and pump immediately prior to operation.
The agitator should be checked for any large air bubbles as
this can slow down the transfer time. Any air in the agitator
should be released by slightly unscrewing the screw on top to
allow air to escape while priming the instrument. Once the
water level has reached the top of the agitator, stop priming
the instrument, then retighten the screw.
13. Before starting the acoustic transfer on the HRB, do a walk-
through of the system and visually verify there are no foreign
objects at any of the instrument stations where a source or
destination plate may be moved. You can expect the system to
produce four completed assay plates per hour per ATS-100.
Add cells to the 1536-well assay plates within 24 h of spotting
to avoid compound evaporation in the plate.
14. When selecting a volume for resuspension of cell pellets,
attempt to resuspend in a volume that will result in

approximately 1 million cells/mL. Most automated cell counters report a large range of cell densities that can be accurately counted; however, the accuracy of the cell count is typically poor at very low (less than 50,000 cells/mL) or very high (more than 5 million cells/mL) densities. One million cells/mL is a density that can be accurately counted manually or with an automated counter.
15. Extra plates are highly recommended to include during cell
plating. One clear bottom plate for imaging should be made
in which at least four columns of cells are plated. This plate
should be imaged throughout the assay to ensure cells are
healthy with the usual morphology. If the imaging plate
begins to show signs of contamination or unhealthy cells, it
can be assumed that all assay plates are also contaminated and
should be discarded to prevent waste of expensive reagents.
One additional assay plate should be made as well to allow for
cell growth rate calculations. A recent publication demon-
strated that cell growth rate can have a large effect on com-
pound IC50 [7]. One way to correct for this effect is to
incorporate the cell doubling time into the IC50 calculation.
This can be done by adding CellTiter-Glo to an assay plate
immediately after cell plating to determine the time 0 signal.
The assay endpoint data should have a DMSO control that
can be used as the final signal and growth rate from time 0 to
assay endpoint can be determined.
16. Plating of cells in an open laboratory environment typically
has a low risk of contamination of the cells during plating. If
gloves are worn and clean lids are applied shortly after plating,
the exposure of the cells to the open air is limited. If the assay
is to be shorter than 72 h, contamination of cells during the
course of the assay is uncommon. However, longer assays
have more time to allow contaminants to proliferate. If con-
tamination does occur within a short-term assay, the cell flasks
should be examined for contamination as it is likely that the
cells were contaminated prior to plating. If the assay is to last
longer than 72 h, plating cells in a biosafety cabinet is pre-
ferred to reduce the possibility of contamination.
17. Each cassette contains eight tips that each fill four rows of a
1536-well assay plate. This arrangement means that if a single
tip has some sort of issue during plating, then a four-row block
of the plate will be compromised. This can sometimes be seen
right after cell plating as four rows of wells with too much or
too little volume compared to the adjacent four rows. This
often will appear in the final data as a four-row stripe across
the plate where the signal in that four-row stripe will be higher
or lower than surrounding rows. We refer to this as a “Multi-
drop effect,” which can affect the data in that portion of the

plate. To prevent this, ensure all tips are clean and unclogged
prior to cell plating and monitor Multidrop tips while plating
to ensure streams are linear and consistent. If a tip becomes
clogged while plating, attempt to unblock the tip before start-
ing a new plate.
18. An inventory is done on the system where the barcode of each
plate is read and added into the database to track the location of
each plate in the incubator. It is important that the barcodes of
these plates are in the inventory database; this way, when other
processes are running on the system, there is no possibility that
occupied slots will be accessed, causing a crash. A Keyence barcode
reader attached to the gripper on the robotic arm allows the
barcodes to be read in an automated fashion.
19. The Method File typically consists of the following:
(a) Dispense 2 μL CellTiter-Glo into each of the 32 rows of
the 1536-well assay plates, containing cells which have
incubated with compound, using four tips in parallel on
a solenoid valve dispenser.
(b) Incubate at room temperature in an auxiliary hotel on the
system for 15 min.
(c) Read luminescence on the ViewLux using a 2 s exposure time with slow speed, medium gain, and 2× binning. The units are relative luminescence units (Fig. 5).
(d) Plate is returned to the home location after the read has
finished.
20. For inhibition assays, where the positive control has a lower signal than the negative control (e.g., viability and toxicity assays), sample wells are normalized using the formula 100 × (c − n)/(n − p) + 100, where c, n, and p are the sample well value, the median of the negative control well values, and the median of the positive control well values, respectively; under this convention a well matching the negative control normalizes to 100 and a well matching the positive control to 0. For activation assays, where the positive control has a higher signal, sample wells are instead normalized using the formula 100 × (c − n)/(p − n). In some assays, the positive control may not be available or dispensed properly. This issue can be overcome by including an empty column with no cells, particularly for inhibition assays; we then employ the median value of the no-cell wells as a proxy for the positive control, assuming that a cytotoxic compound leads to maximum cell killing (a code sketch of these normalization formulas follows Note 22).
21. When measuring synergy, the absolute response to drug treatment, not only the EC50, is crucial to the accurate estimation of synergy. However, the responses within a plate may drift due to unexpected artifacts (edge effects, variability in liquid dispensing, signal crosstalk, etc.). To correct for minor artifacts, we take all the intraplate replicates (e.g., controls, no-cell wells, single/combination replicates) and fit the raw values as a function of compound composition, row, and column using a generalized additive model. Here we assume that each well is differentially but linearly affected by the layout-specific artifact. We then compute the expected control value on a per-well basis and normalize the response using these corrected control values. For an inhibition assay, for example, sample wells are normalized using the formula 100 × (c − E(n))/(E(n) − p) + 100, where E(n), the corrected negative control value, is substituted for n.
22. The mQC method is a classification model trained on a set of crowdsourced assessments of combination response quality.
The model takes into account a number of features of the
combination response including DMSO response, spatial auto-
correlation of responses, and the presence or absence of dose
response of the single agents. The current implementation of
mQC is not very fast and for large combination screens can take
a long time to run. While it can be sped up by reducing the
number of randomizations required for p-value computation,
this can lead to loss of accuracy. In practice, we do not compute
mQC during the data processing step and instead batch up the
mQC computation for multiple screens and run the calcula-
tions in batch on our cluster.
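The normalization in Notes 20 and 21 can be sketched in code as follows. This is a minimal illustration assuming the sign conventions given above (for an inhibition assay, a well matching the negative control scores 100 and one matching the positive control scores 0); the class and helper names are ours, not part of the screening software.

using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of the control-based normalization in Note 20.
public static class PlateNormalization
{
    public static double Median(IEnumerable<double> values)
    {
        var s = values.OrderBy(v => v).ToArray();
        int m = s.Length / 2;
        return s.Length % 2 == 1 ? s[m] : (s[m - 1] + s[m]) / 2.0;
    }

    // Inhibition assay (positive control signal below negative control):
    // c = n gives 100, c = p gives 0.
    public static double NormalizeInhibition(double c, double n, double p)
        => 100.0 * (c - n) / (n - p) + 100.0;

    // Activation assay (positive control signal above negative control):
    // c = n gives 0, c = p gives 100.
    public static double NormalizeActivation(double c, double n, double p)
        => 100.0 * (c - n) / (p - n);
}

For the per-well correction in Note 21, the corrected negative control value E(n) would simply replace n in NormalizeInhibition.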

Acknowledgments

We would like to thank Lesley Mathews-Griner, John Keller, and


Don Liu for their contributions to the development of the NCATS
combination screening platform.

References

1. Borisy AA et al (2003) Systematic discovery of multicomponent therapeutics. Proc Natl Acad Sci 100:7977–7982
2. Lehár J et al (2007) Chemical combination effects predict connectivity in biological systems. Mol Syst Biol 3:80
3. Lehár J et al (2009) Synergistic drug combinations tend to improve therapeutically relevant selectivity. Nat Biotechnol 27:659–666
4. Bansal M et al (2014) A community computational challenge to predict the activity of pairs of compounds. Nat Biotechnol 32:1213–1222
5. Attene-Ramos M et al (2013) The Tox21 robotic platform for the assessment of environmental chemicals—from vision to reality. Drug Discov Today 18:716–723
6. Yasgar A, Shinn P, Jadhav A et al (2008) Compound management for quantitative high-throughput screening. JALA 13:79–89
7. Hafner M et al (2016) Growth rate inhibition metrics correct for confounders in measuring sensitivity to cancer drugs. Nat Methods 13:521–527
8. Heske CM et al (2017) Matrix screen identifies synergistic combination of PARP inhibitors and nicotinamide phosphoribosyltransferase (NAMPT) inhibitors in Ewing sarcoma. Clin Cancer Res 23(23):7301–7311
9. Malo N, Hanley J, Cerquozzi S, Pelletier J, Nadon R (2006) Statistical practice in high-throughput screening data analysis. Nat Biotechnol 24:167–175
10. Zhang XD (2007) A pair of new statistical parameters for quality control in RNA interference high-throughput screening assays. Genomics 89:552–561
11. Chen L et al (2016) mQC: a heuristic quality-control metric for high-throughput drug combination screening. Sci Rep 6:37741
12. Borisy AA et al (2003) Systematic discovery of multicomponent therapeutics. Proc Natl Acad Sci 100:7977–7982
13. Schindler M (2017) Theory of synergistic effects: Hill-type response surfaces as 'null-interaction' models for mixtures. Theor Biol Med Model 14:15
14. Cokol M et al (2011) Systematic exploration of synergistic drug pairs. Mol Syst Biol 7:544
Chapter 3

Post-processing of Large Bioactivity Data


Jason Bret Harris

Abstract
Bioactivity data is a valuable scientific data type that needs to be findable, accessible, interoperable, and
reusable (FAIR) (Wilkinson et al. Sci Data 3:160018, 2016). However, results from bioassay experiments
often exist in formats that are difficult to interoperate across and reuse in follow-up research, especially when
attempting to combine experimental records from many different sources. This chapter details common
issues associated with the processing of large bioactivity data and methods for handling these issues in a
post-processing scenario. Specifically described are observations from a recent effort (Harris, http://www.
scrubchem.org, 2017) to post-process massive amounts of bioactivity data from the NIH’s PubChem
Bioassay repository (Wang et al., Nucleic Acids Res 42:1075–1082, 2014).

Key words Bioactivity, Bioassay, ScrubChem, PubChem, Hit-calls, Big data, Data integration

1 Introduction

This chapter will explain an approach that was used in a recent data
integration project (http://www.ScrubChem.org) [1] to efficiently
process billions of data points from a large public bioassay reposi-
tory known as PubChem [2]. The approach is generalizable and
applicable to other data types and sources. It is also designed to be
efficient in cost, time, and ease of development. For example,
ScrubChem is the largest curation effort of public bioactivity
data, containing millions of bioassay records; yet it rebuilds entirely
in less than a day, and it requires only limited scientific program-
ming and data science skills to develop. The goal of this chapter is to
provide a how-to-guide for other technically inclined researchers
who are subject matter experts in their respective fields and in need
of post-processing their research data.
There are many research uses for aggregating and repurposing
legacy bioactivity data. The simplest use is referencing of experi-
mental protocols from previous efforts in order to inform and
accelerate the design of a current research goal. Prior results
can also be used to create predictive models from

computational technologies such as QSAR, docking, and machine learning. These data-driven reuse cases are mostly applied in phar-
macology and toxicological research to aid in the discovery and
safety evaluation for new products and therapeutics. Despite the
many applications in reusing data, there is a highly disconnected
cycle between those generating, storing, distributing, and reusing
it. Due to this disunion, data is often not readily reusable and
requires significant post-processing.
Post-processing bioassay data generally involves harmonization
of terminologies, concepts, and formats. The complexity of bioas-
say designs makes this a difficult task. The problems to
resolve become increasingly large as data is made increasingly
more available in public repositories (e.g., PubChem [2], ChEMBL
[3], DrugBank [4], BindingDB [5], Tox21 [6], Toxcast [7], CTD
[8], Pharos [9], iLINCS [10]). The PubChem project is by far the
largest collection of public bioactivity data. It was created as a free
repository in order to increase the FAIR-ness (findable, accessible,
interoperable, and reusable) [11] of chemical and biological data.
Within a relatively short period, over 87 authoritative sources of
experimental data contributed approximately 1.2 million bioassay
records with results from 2.3 million compounds and thousands of
biological targets. PubChem easily succeeded in making bioassay
data more findable and accessible; however, improving the interoperability and reusability of data between these now large amounts of
records is an ongoing challenge. Many specific issues related to
interoperability are presented in this chapter, and examples from
the ScrubChem effort to post-process PubChem are provided to
illustrate these issues.
Post-processing large bioactivity data involves integrating
results from separate experimental records into an interoperable
and relational database. To be interoperable, a database should be
designed to facilitate structured queries that reliably access and
aggregate data across many records. The quality of data, diversity
of data types, and reuse cases will affect the exact design of the
database as well as the amount of curation and enrichment of meta-
information needed in order to achieve a desired level of interoper-
ability. Ontologies such as BAO [12] and MIABE [13] are available
that can aid in designing the database so that it can represent
important bioassay concepts. The most basic concepts include
target (e.g., protein, gene, cell, tissue, organism, metabolism),
taxonomy (e.g., human, mouse, yeast), perturbagen (e.g., chemi-
cal, substance), endpoint (e.g., IC50, EC50, CC50, phenotype),
outcome (active, inactive), endpoint measurement(s) or value(s),
concentration(s) or dose(s), exposure time (e.g., 24 h, every 2 h),
system or format (e.g., cell-based, tissue, whole organism), detec-
tion technology or method (e.g., bioluminescence, radiolabeling),
and modality or mode-of-action (e.g., inhibition, agonism, antag-
onism). Additional metadata such as source or publication are
useful to retain for maintaining provenance.

Fig. 1 Annotating a Justification (concentration ranges)

A more complex con-
cept called a Justification, as referred to in ScrubChem [1], is also
useful to resolve. Resolving a Justification involves flagging data
points that are most relevant for justifying an experimental Out-
come. For example, a value describing an IC50 measurement
would be flagged as a Justification for an Outcome as compared
to other available data points comprising each single-dose measure-
ment used in its derivation. Figure 1 illustrates this concept, and it is
further described in Subheading 4 at the end of the chapter.
In Fig. 1, two gray boxes represent the general format of a
PubChem Bioassay record. Each assay (AID) contains a Descrip-
tion section for defining the assay protocol and definitions (name
and description, unit, etc.) for each test result value (TID). There is
also a Results section which contains each tested perturbagen
(SID), an Outcome for each SID, and all TID values used to
support the Outcome. Shown in red are pseudo-values that relate
to the light blue fields on their immediate lefts. The table above
each gray box is a summary of a single post-processing issue involv-
ing the lack of annotations to describe concentration ranges. The
impact of this issue is an inability to distinguish some of the useful


TID values used to justify reported Outcomes. The pseudosolution
involves categorizing TIDs as a max-, min-, mid-, or single-dose
concentration value. Counts for affected record types (assays, che-
micals, bioactivities) and a few example assays (AIDs) affected by
this issue are shown in the last column. Highlighted in a light blue
box is a count for 181 million bioactivity results that were tested at a
single concentration and resulted in substances with an inactive
outcome. This observation is significant because high-throughput
screening campaigns usually test many compounds at a single dose
and in high concentration to quickly detect and remove inactive
compounds before proceeding forward with an experiment involv-
ing a range of dose-response concentrations. This means that the
majority of compounds confirmed as inactive do not have an activity-at-50%-concentration summary value (AC value) as their Justification. This is important since PubChem provides a true or false flag (AC flag) to mark AC50-like values and also their individual test concentrations (TC flag); however, inactive compounds with only a single test concentration as a Justification are not immediately distinguishable from cases where a summary AC50-like value is used. Such values are significant to use as Justifications for inactive outcomes (approximately 181 million times). To resolve this issue, further post-processing is needed to better flag the results from single-dose experiments.
In Fig. 2, results from 887 experiments targeting the human
estrogen receptor (ESR1) are shown before and after applying a
filtered view on ScrubChem.org. This figure is meant to highlight
the usefulness of filtering that can be done across assays which is
only achievable after annotating a Justification for the readouts in
each assay. Without filtering (top view), there are over 11 million
result records (test values) available, as shown in the Records count. Using the Filter by Justifications reduces the number of records viewed to approximately 262,000 values. A significant
difference between the two data tables is that records with a test
value name (TID name) of “RBA Published Value” are
retained and “RBA Activity Comment” removed after applying
the Filter by Justifications. This particular filter is useful to identify
result values immediately of importance for summarizing (i.e.,
justifying) the Outcome of each experiment. Many other readout
types (e.g., % inhibition, percent inhibition, IC50, inhibition con-
centration at 50%) can similarly be flagged and retained as Justifica-
tions with the use of relatively simple dictionaries.
The goal of extracting Justifications from many bioassay
records is to have a representative value that can be combined
into a single consensus result for comparing similar experiments,
referred to as a hit-call. Hit-calls can be created to summarize data
points with the same chemical, target, and general assay design
(e.g., a similar modality and endpoint). The exact requirements
for comparing and aggregating data into hit-calls will vary and depend on an intended reuse case for the data. Simple quality metrics about hit-calls can also be used to interrogate the reproducibility of data, such as the number of records or evidences used to support the hit-call and also the total agreement between those evidences.

Fig. 2 Selecting Justification across assay records
A case study involving these introductory concepts is presented
in Subheading 3. The Methods explain how to choose hardware
and software for parsing large amounts of data quickly and afford-
ably. Also explained are general approaches to download, parse,
annotate/curate, add metadata, and set up a scalable database.
This chapter concludes with a conceptual example of how to
query a database using Justifications as a filter and aggregate this
data into summary hit-calls.

2 Materials

An object-oriented programming language such as C# or Java is


recommended to process source data. Sources usually provide data
in json and xml formats which can be transformed into class objects
for easier processing. It is also recommended to use schemas for
creating template classes. In cases where no schema is available, free
services such as jsontoclass [14] or xmltocsharp [15] can be used to
build model classes from representative source files. An advantage
of using a high-level language such as C# or Java is the ability to
create try-catch [16] statements which are helpful when data does
not follow a well-structured format. Format exceptions are com-
mon when processing data from many sources. Familiarity with a
structured query language (SQL) and database engine such as
MySQL is needed for storing and extracting transformed data. If
the database will be made available through web services, an enter-
prise database solution such as MS SQL or ORACLE is recom-
mended. Programming and database environments can be
substituted as needed to accommodate license restrictions, available
computing environment, and specific developer experience. The
goal for establishing a developing environment is to have efficiency
in both compute and development time. A 64-bit workstation/
server environment with multiple cores and approximately 8GB of
RAM per logical processor is recommended for parallelization and
scaling during the data parsing/processing. At least four storage
disks, preferably solid-state, are also recommended in order to
optimize IO (read/write) during data parsing/processing. A data-
base management engine such as MySQL Workbench or SSMS
(MS SQL Server Management Studio) and SSIS (SQL Server Inte-
gration Services) is also recommended for easier setup of the data-
base, importing of data, indexing of fields, and performing SQL
queries during data exploration.
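As a minimal sketch of the try-catch pattern recommended above, the snippet below deserializes one record into a schema-generated class, logging and skipping malformed files instead of aborting the run. PCAssayContainer is a hypothetical stand-in for whatever class the schema tooling (e.g., xsd.exe applied to the source schema) would generate.

using System;
using System.IO;
using System.Xml.Serialization;

// Stub standing in for the class generated from the source schema; the real
// class would carry the record's fields.
public class PCAssayContainer { }

public static class RecordReader
{
    // Returns null (and logs) on malformed input instead of halting the run.
    public static PCAssayContainer TryParse(string xmlPath)
    {
        var serializer = new XmlSerializer(typeof(PCAssayContainer));
        try
        {
            using (var stream = File.OpenRead(xmlPath))
                return (PCAssayContainer)serializer.Deserialize(stream);
        }
        catch (InvalidOperationException ex) // XmlSerializer wraps format errors
        {
            Console.Error.WriteLine($"Skipping {xmlPath}: {ex.Message}");
            return null;
        }
    }
}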

3 Methods

3.1 Hardware Configuration

Processing individual assay records is scalable by distributing loads across many processors, depending on the available system memory
and disk IO speeds. In the case of ScrubChem, 11 separate proces-
sors were efficiently used in a system with 12-core dual Xeon
processors, 64-bit OS, and 192GB of RAM. A single drive (storage
disk #1) is needed for storing all 1.2 million raw data records. A
second drive (storage disk #2) is needed for dedicated writing of
data processed in memory to disk. If write or read times become
bottlenecked, additional disks (striped or separated) can be used to
increase IO speeds. Each processor handles around 100,000
records with a maximum memory utilization per record of

approximately 20 GB. PubChem bioassay records can range from


MBs to GBs in size, and memory will often be the bottleneck for
processing such large records. For other projects, the utilization of
additional processors can be benchmarked in trial runs to deter-
mine the memory and IO requirements. The goal during bench-
marking is to balance the system with enough jobs to utilize the
bulk of its memory throughout an entire runtime. Another drive
(storage disk #3) is useful for keeping caches of web pages/data
taken from API calls to external resources. A final drive (storage
disk #4) will be used for storing the database.

3.2 Basic Software

Setting up a development environment is straightforward. It is recommended to follow the installation instructions and tools provided
recommended to follow installation instructions and tools provided
with each software. For developing in C#, MS Visual Studios is a
convenient tool for project management and source control. The
project code should be designated to build/run on a 64-bit system
in order to utilize greater than 2 GB of memory per process. The
database in this case is MS SQL, and it should be designated to
build on a dedicated drive (storage disk #4). SSIS is to be used for
recursive data importing. It can also be used for simple data trans-
formations. Microsoft SSMS is used for indexing fields and query-
ing the database. Similar tools are available within the Java
development environment.

3.3 Downloading Primary Data (Disk 1)

Bulk data is usually made available via an FTP site (e.g., ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/XML). There is often a
schema available (e.g., ftp://ftp.ncbi.nlm.nih.gov/pubchem/
specifications/pubchem.xsd) to use as a template for creating clas-
ses to store and process the data. Data records are not guaranteed to
follow expected formats, and try-catch [16] statements are useful to
implement during processing in order to handle those errors
cleanly. Many records do not change often, and a change log
(e.g., ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/assay.
ftpdump.history) can be used to save in bandwidth or processing
time if implemented. After decompression, the PubChem bioassays
fit on 557GB of disk space.
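To sketch the download step, the snippet below mirrors a single file from the FTP site using the classic .NET FtpWebRequest API. The skip-if-present check is a crude stand-in for consulting the change log noted above before re-fetching; names are illustrative.

using System.IO;
using System.Net;

public static class BulkDownload
{
    // Mirror one remote file to disk, skipping files already present.
    public static void Fetch(string ftpUrl, string localPath)
    {
        if (File.Exists(localPath)) return; // already mirrored

        var request = (FtpWebRequest)WebRequest.Create(ftpUrl);
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        using var response = (FtpWebResponse)request.GetResponse();
        using var remote = response.GetResponseStream();
        using var local = File.Create(localPath);
        remote.CopyTo(local);
    }
}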

3.4 Parsing Primary Data and Annotating (Disk 2)

3.4.1 Parser

A Parser simply identifies a class of data (e.g., assay title, assay target, source) and extracts it. The Parser works on a single bioassay record at a time, but it can be split up to work in parallel across many processors. Each processor should have dedicated output files for writing all data onto storage disk #2.
In the case of ScrubChem, outputs are separated by processor into a result file (e.g., output_results_processor_1.txt) and a description file (e.g., output_descriptions_processor_1.txt), which later can be more easily imported into a normalized database schema.
are a reflection of the database tables, and appropriate keys (e.g.,
assay ID, test ID, panel ID) should be maintained for joining their

data records. Optimization of the parsing process is explained in


Subheading 3.1 of the Methods and is dependent on data size and
available resources. The configuration used on the ScrubChem
project takes approximately 4 h to finish parsing.
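A minimal sketch of this layout follows: records are partitioned round-robin across workers, and each worker owns its own pair of output files so writes never contend. The Parser helpers are hypothetical stand-ins for the project-specific extraction logic.

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical stand-ins for the project-specific extraction logic.
public static class Parser
{
    public static object ParseRecord(string path) => path;
    public static IEnumerable<string> ResultLines(object record) => Array.Empty<string>();
    public static IEnumerable<string> DescriptionLines(object record) => Array.Empty<string>();
}

public static class ParallelParser
{
    // recordPaths: raw record files on storage disk #1; outDir: write disk #2.
    public static void Run(IReadOnlyList<string> recordPaths, int workers, string outDir)
    {
        Parallel.For(0, workers, w =>
        {
            using var results = new StreamWriter(Path.Combine(outDir, $"output_results_processor_{w}.txt"));
            using var descriptions = new StreamWriter(Path.Combine(outDir, $"output_descriptions_processor_{w}.txt"));
            for (int i = w; i < recordPaths.Count; i += workers) // round-robin partition
            {
                var record = Parser.ParseRecord(recordPaths[i]);
                foreach (var line in Parser.ResultLines(record)) results.WriteLine(line);
                foreach (var line in Parser.DescriptionLines(record)) descriptions.WriteLine(line);
            }
        });
    }
}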

3.4.2 Annotator

The Annotator makes changes to the data and is incrementally
constructed over the course of a project. Building up the functions
of an Annotator involves querying of the database after each itera-
tion of building it in order to identify quality issues involving
missing, unresolved, or incorrect data/concepts. The Annotator
is then updated to incorporate logical and programmatic fixes to
address the identified issues. The database is then rebuilt to accom-
modate the desired changes. This approach is faster than
performing case-by-case updates on a large relational database,
especially after it has been indexed.
The complexity of the Annotator will depend on the quality of
original data and a user’s intended applications for the data. Mini-
mum information, as described in the Introduction, is needed in
order to derive basic hit-calls. Particularly important is a Justifica-
tion (measurement) and Modality (action) which are used in an
experimental design to test the perturbation of a molecular target.
These concepts are further discussed in Subheading 4.

3.5 Enriching Primary Data with External Linkages to Metadata (Disk 3)

Primary data is usually basic in detail, and therefore additional metadata is often needed to enrich the descriptions of an experiment. In the case of ScrubChem, other NCBI databases such as GenBank and PubChem Compound are accessed through API
services to gather additional information for targets or chemicals.
This allows integrating in useful features such as additional identi-
fiers, taxonomies, synonyms, physical properties, sequences, etc.
The number of API calls can become very large (nearly
300,000,000 pages in the case of ScrubChem), and due to the
iterative process of building up an Annotator, it is useful to cache
many of these pages on a local disk (storage disk #3). The simple
syntax of most API URLs allows keeping a local copy of the page
using a transformation of the URL in regular file structures for
reference before making a call to the web version. The local cache of
pages can be cleaned up over time to bring in updated data as
needed.
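A minimal sketch of such a cache, assuming the URL itself (escaped) serves as the local file name:

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Local page cache (storage disk #3): the API URL is transformed into a file
// path; the web call is made only on a cache miss.
public static class ApiCache
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<string> GetAsync(string url, string cacheRoot)
    {
        // transform the URL into a safe relative file name
        // (very long URLs may require hashing instead)
        string cachePath = Path.Combine(cacheRoot, Uri.EscapeDataString(url));

        if (File.Exists(cachePath))
            return File.ReadAllText(cachePath);       // cache hit: no web call

        string page = await Http.GetStringAsync(url); // cache miss: fetch once
        File.WriteAllText(cachePath, page);           // keep a local copy
        return page;
    }
}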

3.6 Bulk Uploading Parsed Data into a Database (Disk 4)

This step involves building a loader function to bulk upload output data from the Parser's output on storage disk #2 into a database on storage disk #4. This bulk reading and writing is faster if the read
and write disks are separated. The schema for a database needs to be
described within SSIS when setting it up for loading. Since parsing
is done in parallel and output records are stored across separate files
for each processor, primary keys need to be generated at the same
time as loading. SSIS allows for bulk uploading separate files into a

database and also the building of primary keys with an auto incre-
ment. Simple transformations and checks on the data can also be
managed from within an SSIS workflow. If an alternative database
engine is used, it is recommended to identify the bulk loading
features of that engine. Since loading uses an auto increment to
build primary keys, this procedure has to be done serially. However,
the description and result tables can be loaded independently,
taking 30 min and 2 h, respectively.

3.7 Building, Indexing, and Querying the Database

Microsoft's SSMS can be used to initialize the database tables and fields. It is also useful for adding indexes and querying the database. It provides a typical workbench environment for managing and
exploring the database. Field data types (e.g., int, char, varchar)
should be assigned carefully in order to optimize disk space utiliza-
tion. For assigning appropriate lengths to fields, it is useful to keep a
report of the longest records during parsing. Long varchars slow
down loading but can be useful in avoiding truncated data. MS
SQL does a decent job at compressing (shrinking) the database
after loading. Indexes should be built only for the fields most often
used in queries. SSIS allows for profiling queries to see which fields
add the most time and are good candidates for indexing. For
example, CID (chemical ID) is a ScrubChem data type that is
indexed because it is often used to access data points for specific
chemicals. The descriptions table is relatively small (approximately
seven million rows) and has a fairly low cost to add indexes (about
5 min compute time/index). The results table is relatively large
(approximately 1.5 billion rows) and therefore should have very few
indexes outside of its primary key. Once indexed the database can
be quickly queried on specific tables or joined across tables. An
example query might conceptually be described as “select all che-
micals, outcomes, and their Justifications for the androgen receptor
protein target and the modality of inhibition.” The specific fields
used in the SQL syntax of this query will depend on the structure of
the database.
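Purely as an illustration, such a query might look like the following T-SQL, held here in a C# constant; the table and column names (Descriptions, Results, IsJustification, etc.) are hypothetical and would follow the actual schema.

// Hypothetical schema: Descriptions (one row per assay) joined to Results
// (one row per tested substance readout) on the assay ID (AID).
public static class ExampleQuery
{
    public const string Sql = @"
        SELECT r.CID, r.Outcome, r.JustificationValue
        FROM   Results r
        JOIN   Descriptions d ON d.AID = r.AID
        WHERE  d.TargetName      = 'androgen receptor'
          AND  d.Taxonomy        = 'human'
          AND  d.Modality        = 'inhibition'
          AND  r.IsJustification = 1;";
}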

3.8 Building a Hit-Call

The previously described query in Subheading 3.7 of the Methods section would return records with a single Justification for each
outcome of a chemical tested for inhibition against the human
androgen receptor. In many cases, there are multiple records for
the same chemical that are tests from separate bioassay experiments.
These can be treated as replicates, and a hit-call is needed to
combine all available replicates into a single consensus result. For
example, a chemical such as estradiol may have been tested three
separate times all with an outcome of active with respect to inhibi-
tion. These three evidences for active outcomes can be combined
into a single consensus hit-call of active (n = 3, where n is the number of evidences, and r = 3/3 or 1, where r is the ratio of agreement between all evidences). If desired, this hit-call can be compared to other hit-calls for its relative reproducibility using the


n and r values. It is important to define the target as human
androgen receptor inhibition and to not combine data from
another modality (e.g., agonism) into constructing this hit-call. If
the retrieved Justifications for the outcome are based on similar
endpoints such as IC50s, then the quantities can be averaged as part
of the hit-call.
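A minimal sketch of this aggregation, using a hypothetical Evidence record for one replicate: the majority outcome becomes the consensus call, n counts the evidences, r is the fraction agreeing (three active replicates give n = 3 and r = 1, as in the estradiol example), and comparable IC50s are averaged.

using System.Collections.Generic;
using System.Linq;

public record Evidence(long Cid, string Outcome, double? Ic50);

public static class HitCalls
{
    // Collapse replicate evidences for one (chemical, target, modality) into a
    // consensus hit-call with n (evidence count) and r (agreement ratio).
    public static (string call, int n, double r, double? meanIc50) Consensus(
        IReadOnlyList<Evidence> replicates)
    {
        int n = replicates.Count;
        var majority = replicates
            .GroupBy(e => e.Outcome)
            .OrderByDescending(g => g.Count())
            .First();
        double r = (double)majority.Count() / n;
        // average comparable endpoints (e.g., IC50s) when present
        var ic50s = replicates.Where(e => e.Ic50.HasValue).Select(e => e.Ic50.Value).ToList();
        double? meanIc50 = ic50s.Count > 0 ? ic50s.Average() : (double?)null;
        return (majority.Key, n, r, meanIc50);
    }
}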

4 Notes

The Justification concept, illustrated in Figs. 1 and 2, is needed to interoperate on data across a large number of records. Experimental
records vary widely in their reported metadata (e.g., control runs,
prior replicates, comments, statistical tests), and these can contain
values that are not relevant to the experimental measurements used
as a Justification for the reported experimental Outcome. Annotat-
ing the Justifications for each assay allows easier extraction and then
comparison of significant results from sets of different experiments.
PubChem Bioassay records often lack annotations to identify Jus-
tifications. This provides an opportunity to demonstrate how an
Annotator can be built to iteratively find and mark Justifications.
For example, after building and attempting to query data in the
first version of ScrubChem, a question was left unanswered: “What
are the Justifications for each outcome?” There were many different
kinds of results to consider as Justifications between different
assays. PubChem does provide a true or false flag, called an active
concentration (AC), for depositors to mark summary dose-
response values (e.g., IC50, EC50, AC50). Also provided is a
flag, called a test concentration (TC) flag, for depositors to mark
individual test concentration values. Extracting meaningful values
from experiments is made easier with these flags. However, a sub-
ject matter expert immediately will realize that many assay designs
only contain a single-test dose at a high concentration or a small
range of doses in order to efficiently screen for activity before
proceeding to further testing. In this tiered approach, any chemical
not showing activity at a high dose range is considered inactive and
no longer pursued with a more accurate series of full dose-response
concentrations. Therefore, there are no AC50-like values available
as Justifications for most inactive results. Recognizing this screen-
ing design concept allowed for a function to be built in the Anno-
tator which flags inactive results from assays that have no AC flag
and only TC flags. The function was further developed to flag the
highest concentration values tested and use them as a Justification
for inactive outcomes. This logic was applied during a subsequent parsing and rebuild of the database, and in the new database, over 181 million data points were available as Justifications from inactive screening results.
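A sketch of how such an Annotator rule might look, with hypothetical field names for the parsed readout values and flags:

using System.Collections.Generic;
using System.Linq;

// Hypothetical parsed readout value carrying the PubChem AC/TC flags.
public class TidValue
{
    public bool IsActiveConcentration; // AC flag (summary dose-response value)
    public bool IsTestConcentration;   // TC flag (individual test concentration)
    public double Concentration;
    public bool IsJustification;
}

public static class JustificationAnnotator
{
    // For inactive results in assays with TC flags but no AC flag, mark the
    // highest tested concentration as the Justification for the outcome.
    public static void FlagInactiveSingleDose(List<TidValue> tids, string outcome)
    {
        bool hasAc = tids.Any(t => t.IsActiveConcentration);
        var tcs = tids.Where(t => t.IsTestConcentration).ToList();
        if (outcome == "Inactive" && !hasAc && tcs.Count > 0)
            tcs.OrderByDescending(t => t.Concentration).First().IsJustification = true;
    }
}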

References

1. Harris J (2017) ScrubChem. http://www.scrubchem.org
2. Wang Y et al (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:1075–1082
3. Bento AP et al (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:1083–1090
4. Wishart DS et al (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34:D668–D672
5. Gilson MK et al (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 44:D1045–D1053
6. Toxicology in the 21st Century (Tox21)
7. Dix DJ et al (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95:5–12
8. Davis AP et al (2017) The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res 45:D972–D978
9. Nguyen DT et al (2017) Pharos: collating protein information to shed light on the druggable genome. Nucleic Acids Res 45:D995–D1002
10. Pilarczyk M, Medvedovic M, Fazel Najafabadi M et al (2016) iLINCS: web-platform for analysis of LINCS data and signatures. ilincs.org. https://doi.org/10.5281/zenodo.167472
11. Wilkinson MD et al (2016) The FAIR guiding principles for scientific data management and stewardship. Sci Data 3:160018
12. Visser U et al (2011) BioAssay Ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinformatics 12:257
13. Orchard S et al (2011) Minimum information about a bioactive entity (MIABE). Nat Rev Drug Discov 10:661–669
14. jsontoclass. http://www.jsontoclass.com
15. xmltocsharp. http://xmltocsharp.azurewebsites.net
16. try-catch (C# Reference). https://msdn.microsoft.com/en-us/library/6dekhbbc.aspx
Chapter 4

How to Develop a Drug Target Ontology: KNowledge Acquisition and Representation Methodology (KNARM)

Hande Küçük McGinty, Ubbo Visser, and Stephan Schürer

Abstract
Technological advancements in many fields have led to huge increases in data production, including data
volume, diversity, and the speed at which new data is becoming available. In accordance with this, there is a
lack of conformity in the ways data is interpreted. This era of “big data” provides unprecedented oppor-
tunities for data-driven research and “big picture” models. However, in-depth analyses—making use of
various data types and data sources and extracting knowledge—have become a more daunting task. This is
especially the case in life sciences where simplification and flattening of diverse data types often lead to
incorrect predictions. Effective applications of big data approaches in life sciences require better,
knowledge-based, semantic models that are suitable as a framework for big data integration, while avoiding
oversimplifications, such as reducing various biological data types to the gene level. A huge hurdle in
developing such semantic knowledge models, or ontologies, is the knowledge acquisition bottleneck.
Automated methods are still very limited, and significant human expertise is required. In this chapter, we
describe a methodology to systematize this knowledge acquisition and representation challenge, termed
KNowledge Acquisition and Representation Methodology (KNARM). We then describe application of the
methodology while implementing the Drug Target Ontology (DTO). We aimed to create an approach,
involving domain experts and knowledge engineers, to build useful, comprehensive, consistent ontologies
that will enable big data approaches in the domain of drug discovery, without the currently common
simplifications.

Key words Knowledge acquisition, Ontology, Drug target ontology, Semantic web, Big data, Seman-
tic model, KNARM

1 Introduction

Gruber defines an ontology as a formal and explicit specification of


a shared conceptualization for a domain of interest [1]. Almost
three decades ago, CommonKADS presented a widely accepted
methodology for knowledge acquisition and ontology building
which described workflows for manual ontology building
[2, 3]. Following that, nearly two decades ago, the idea of using
semantic web applications for representing life sciences data and
knowledge started gaining more attention in the life sciences


community [4–9]. Wache and colleagues [10] summarized the


existing approaches and tools that can help scientists build powerful
ontologies. Around the same time, Blagosklonny and colleagues
[4, 5] described how ontologies could be utilized for bioinformat-
ics and drug discovery research and that they can be powerful tools
for life scientists. Today’s well-cited, highly accessed, well-
described, and well-maintained ontologies such as Gene Ontology
(GO) [11] and ChEBI [12] are among the first that showed how
semantic web technologies could be wielded into creating common
vocabularies. However, two decades after the abovementioned
milestones were developed, we still lack sophisticated methodolo-
gies for knowledge acquisition and data representation using
semantic web technologies [2, 4–6, 9, 13–24].
Understanding the bigger picture without oversimplification,
by making use of different databases available and extracting knowl-
edge from data, is becoming a more daunting task in the era of big
data [18]. Life sciences data are not only increasing in numbers but
also are fitting more into the description of big data by being too
large, too dynamic, and too complex for conventional data tools to
handle [20, 25]. Screening technologies and computational algo-
rithms have become very powerful, capable of creating diverse types
of data increasingly faster and cheaper, such as gene sequencing,
RNASeq gene expression, and microscopy imaging data. Such large
and dynamic data are typically scattered in different databases, in
many different formats (i.e., traditional relational databases,
NoSQL databases, ontologies, etc.). Additionally, currently avail-
able complex life sciences data is not being efficiently translated into
a format that is unambiguously readable and understandable by
humans and machines. Furthermore, different types of data from
gene expression, small molecule biochemical data to cell phenotyp-
ing via imaging, make it harder to manage, consolidate, integrate,
and analyze these data.
For our purposes, we define big data as data that is high in
volume (terabytes and larger), complex (interconnected with over
25 highly accessed databases [18] and over 600 ontologies [23])
with various types of data (varying from gene sequencing to cell
imaging), and dynamic (growing exponentially [18, 25]) for con-
ventional data tools to store, manage, and analyze.
Related with our research, we have created two major ontolo-
gies: BioAssay Ontology (BAO) [20, 21, 26–28] and Drug Target
Ontology (DTO) [20, 24, 29]. BioAssay Ontology (BAO) [28] is
aimed at describing and modeling assay data by using formal
description logic (DL) and semantic web technologies. Drug Tar-
get Ontology (DTO) uses formal description logic to provide a
classification of (protein) drug targets based on function and phy-
logenetics. Rich annotations of (protein) drug targets along with
other chemical, biological, and clinical classifications and relations
to diseases and tissue expression are also formally described in DTO

using DL. The large number of different assays, as well as their complexity and diverse data types, motivated us to look for a methodology that helped us acquire knowledge and formalize large amounts of data in the development of BAO.
Many different approaches have been presented for handling
biological and chemical data for ontologies [1, 9, 11, 15, 23,
30–40]. Currently, one focus is on combining existing databases
and using machine-operated data mining tools or relying on com-
plete manual ontology building. However, creating a systematic
methodology that effectively combines human and machine cap-
abilities for extracting knowledge and representing it in an ontol-
ogy is crucial for better understanding of the data. The existing
literature lacks a formal methodology or workflow dealing with
knowledge acquisition of large amounts of textual data and forma-
lizing that information into a semantic knowledge model.
Confronted with that challenge and as part of our research, we
created and implemented a hybrid methodology, KNowledge
Acquisition and Representation Methodology (KNARM), that
handles big data in life sciences in the form of large amounts of
textual information and translates it into axioms by using descrip-
tion logic (DL). In addition, the methodology and tools we built
help update the ontologies faster and more accurately by semi-
automating the ontology building process (Fig. 1). As our projects
grew in size and focus, we also developed a systematically deepen-
ing modeling (SDM) approach for modeling life sciences data
described in detail in the Metadata Creation and Knowledge Mod-
eling section of this methodology.

2 Methods

KNowledge Acquisition and Representation Methodology


(KNARM) consists of nine steps that allow domain experts and
knowledge engineers to build useful, consistent ontologies forma-
lizing biomedical knowledge. This methodology aims at acquiring
knowledge from data scattered in different databases and ontolo-
gies, combining them in a meaningful fashion that is understand-
able by humans and machines by effectively combining human and
machine capabilities. In this way, we attempt to allow users to
understand, query, and analyze their data better by formalizing it
using semantic web technologies.

2.1 Sub-language Analysis

Sub-language analysis is a technique for discovering units of information or knowledge and the relationships among these units
within existing knowledge sources, including published literature
or corpora of narrative text. As the first step of formalization of the
data, we recommend starting with the existing literature and/or
reports for the data. While reading the text data, we recommend an
active reading by creating use cases and taking notes, aiming to identify patterns and the units of information, concepts, and facts in the data that have a recurring pattern.

Fig. 1 Steps of KNowledge Acquisition and Representation Methodology (KNARM). This figure shows the nine steps and flow of KNARM. Following agile principles, there are feedback loops present before finalizing ontologies. The circular flow also represents that the ontology building process is a continuous effort, allowing ontology engineers to iteratively add more concepts and knowledge
A “unit of information” is a concept, relationship, or data
property contained in the data in hand. A use case is a list of actions,
event steps that users might follow, questions that can be asked by
users, and/or scenarios that users may find themselves in. Example
use cases might be:
1. Search for proteins that are in the same kinase branch as target X where there were validated chemical hits from external or internal sources.

2. I have assay X. What are the other assays that have the same
design or technology but different targets?
3. What assay technologies have been used against my kinase of
interest? Which cell lines?
After identifying units of information and patterns and listing some possible use cases, the ontology engineers can introduce the domain experts to their preliminary analysis or continue to work with them toward the next steps of the methodology.

2.2 In-House Unstructured Interview

After identification of the key concepts and units of information during sub-language analysis, we perform an interview with the
domain experts who work in the same team. This step can be
performed separately after the sub-language analysis or in a hybrid
fashion with the previous step. The unstructured interview is aimed
at understanding the data and their purposes better with the help of
the domain experts. It can be performed in a more directed fashion
by using the previously identified knowledge units or could be
treated as a separate process. Together with the previous step, this
step also helps identify the knowledge units and key concepts of
the data.

2.3 Sub-language Recycling

Following the identification of knowledge units through the textual data of the domain, literature, and unstructured interview with the
domain experts, we perform a search on the existing ontologies and
databases. The aim of the search on the databases and ontologies is
to ascertain the already formalized knowledge units that are identi-
fied. We perform and encourage reuse of existing—relevant and
well-maintained—ontologies, aligning them with ontology in
development and using cross-references (annotated as Xref in the
ontology) to the various databases that contain the same knowl-
edge units and concepts that we determined to formalize. By
recycling the sub-language, not only we save time and effort but
also reuse widely accepted conceptualization of knowledge. In this
way, we also aim to help life scientists by sparing them the painful
data alignment practices and by helping them avoid redundant
and/or irrelevant data available in different data resources.
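For illustration, the minimal sketch below checks a few candidate knowledge units against the BioPortal search service to see whether they have already been formalized in existing ontologies. The API key, the example terms, and the choice of BioPortal are assumptions made for this sketch; it is not part of KNARM itself.

# Minimal sketch: look up candidate knowledge units in the BioPortal
# search service to find concepts that are already formalized.
# The API key and the example terms are placeholders.
import requests

BIOPORTAL_SEARCH = "https://data.bioontology.org/search"
API_KEY = "YOUR-BIOPORTAL-API-KEY"  # hypothetical placeholder

candidate_units = ["kinase", "ion channel", "tissue"]

for term in candidate_units:
    response = requests.get(BIOPORTAL_SEARCH,
                            params={"q": term, "apikey": API_KEY})
    response.raise_for_status()
    hits = response.json().get("collection", [])
    # Report a few matching classes and their IRIs; matches are
    # candidates for reuse or cross-referencing (Xref) in the ontology.
    for hit in hits[:3]:
        print(term, "->", hit.get("prefLabel"), hit.get("@id"))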

2.4 Metadata Creation and Knowledge Modeling
In this step, we combine the knowledge units and essential concepts identified so far with those recycled from the existing databases and ontologies to create the metadata describing the domain of the data to be modeled. Metadata creation can be a cumbersome task that may be performed at different levels, by defining subsets of metadata for various details of the data. For example, with our systematically deepening approach to formalization (i.e., the systematically deepening modeling (SDM) approach), we started with the metadata for proteins and genes, followed by metadata for diseases, tissues, and small molecules. The SDM approach allows us to focus on one aspect at a time and to extract more detailed (i.e., deeper) metadata, which later allows creating more complex axioms (i.e., modeling of concepts).

[Fig. 2 diagram: DTO classes (protein, gene, disease, tissue, cell lines, ion) and the example ABL1 protein/ABL1 gene, linked by relations such as is a, is equivalent to, has participant, has gene template / is gene template of, has associated disease, has associated ion, and is expressed in tissue, with alignments to external ontologies including BAO, LINDO, UBERON, BRENDA, DOID, and ChEBI]

Fig. 2 Conceptual modeling example, showing modeling of an example kinase (ABL1) and how some of its
axioms relate to the different ontologies created using KNARM

In combination with metadata creation comes a very important step in knowledge acquisition and representation: knowledge modeling. Here, we define knowledge modeling as the use of axioms to define concepts, with the aim of inferring new knowledge from existing data through this axiomatic modeling of concepts.
While modeling, we focus on one aspect at a time and create more complex axioms as we go deeper into the knowledge. The detailed metadata extracted is used at different levels to create axioms without overwhelming the reasoners and other semantic web technologies with deeply nested axioms. By dividing the knowledge into detail levels and representing the different detail levels in different ontologies, we also make it easy to reuse concepts and axioms (also see the modular architecture in the Semi-Automated Ontology Building section and Fig. 2).

This step can be performed within the team first and then discussed with the collaborators and other scientists. Alternatively, a bigger initiative can be set up to agree on the metadata, axioms, and knowledge models (examples include the OBO Foundry ontologies [22]).
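To make the axiomatic modeling more concrete, the sketch below expresses a few of the ABL1 relationships from Fig. 2 as OWL axioms in Turtle and loads them with the Python rdflib package. The URIs and property names are simplified placeholders for illustration, not the actual DTO identifiers.

# Minimal sketch of axiomatic modeling for the ABL1 example in Fig. 2.
# URIs and property names are simplified placeholders, not actual
# DTO identifiers.
from rdflib import Graph

TURTLE = """
@prefix : <http://example.org/dto-sketch#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:ABL1_protein a owl:Class ;
    rdfs:subClassOf :kinase ;
    rdfs:subClassOf [ a owl:Restriction ;
        owl:onProperty :has_gene_template ;
        owl:someValuesFrom :ABL1_gene ] ;
    rdfs:subClassOf [ a owl:Restriction ;
        owl:onProperty :has_associated_disease ;
        owl:someValuesFrom :disease ] .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")
print(len(g), "triples loaded")  # the axioms are now queryable triples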

2.5 Structured Interview
A structured interview consists of closed-ended questions aimed at the domain experts. For our purposes, we use the metadata created for the knowledge obtained so far to conduct an interview with collaborators who are involved in data creation, as well as with scientists who are not. The aim of the structured interview is to identify any important points that might have been missed by the knowledge engineers and domain experts so far. In this step, the identified metadata is presented in the context of the data obtained by the knowledge engineers. The data can be dissected based on the identified metadata, and this dissected information can also be presented to the collaborators.

2.6 Knowledge Acquisition Validation
This step can be considered the first feedback loop. Its aim is to identify any knowledge that has been missed or misinterpreted. By this step, the sub-language has been identified and recycled, the metadata has been created, and the data has been dissected based on the metadata; the knowledge engineers present all of these to the domain experts. They can also be presented to a small group of users based on the use cases. If missed or misinterpreted knowledge exists, we recommend starting from the first step and reiterating the steps listed above.

2.7 Database Formation
After validating that the acquired knowledge is correct and consistent, we start building the backbone for the representation of the knowledge. The first step is to create a database to collect the data in a schema that will facilitate the knowledge engineering. Typically, this will be a relational database. The domain experts may prefer different means of handling and editing their data, such as a set of flat files, but we recommend using a database as the main data feed to the ontology that will be created as the final product. The details of the database are designed based on the acquired metadata, the data types collected, and their relations (Fig. 3 shows an example database schema). Ideally, the database should contain the metadata as well as the knowledge units and key concepts identified in the knowledge acquisition steps. Information that the database may not hold directly includes the specific relationships or axioms involving the different knowledge units and key concepts identified during knowledge acquisition. We place the relationships among the pieces of data in the next step, during the ontology building process.
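As a minimal illustration of such a schema, the sketch below creates a simplified kinase table with a few of the columns shown in Fig. 3. SQLite is used only to keep the example self-contained; the actual DTO backbone is a MySQL database with many more tables and columns.

# Minimal sketch of a relational backbone for the ontology data,
# loosely following the Kinase_Table columns in Fig. 3. SQLite is
# used for self-containedness; DTO itself uses MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE kinase (
        id INTEGER PRIMARY KEY,
        uniprot TEXT,        -- UniProt accession (cross-reference)
        gene TEXT,           -- gene symbol
        tdl TEXT,            -- target development level (Tclin, ...)
        idg_family TEXT,     -- IDG protein family
        kinase_group TEXT,   -- phylogenetic group
        kinase_family TEXT,  -- family within the group
        protein_name TEXT
    )
""")
conn.execute(
    "INSERT INTO kinase (uniprot, gene, tdl, idg_family, protein_name) "
    "VALUES (?, ?, ?, ?, ?)",
    ("P00519", "ABL1", "Tclin", "Kinase", "Tyrosine-protein kinase ABL1"),
)
for row in conn.execute("SELECT gene, tdl, protein_name FROM kinase"):
    print(row)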

2.8 Semiautomated Ontology Building
After placing the dissected data, organized by the metadata, together with the metadata itself into the database, we convert the data into a more meaningful format that allows inference of new knowledge that is not explicit in the flat representation in the database. This is achieved using semantic web technologies, mainly an ontology.

[Fig. 3 diagram: excerpt of the relational schema, with tables such as Kinase_Table, IC_Table, IC_Rel_Table, GPCR_Table, NR_Table, ProteinInfo, ProteinTissue, ProteinDisease, GeneInfo, UniProtIDs2EntrezID, TCRD_Names, and lincs_proteins, holding columns for UniProt and Entrez cross-references, gene symbols, target development levels (tdl), IDG families, family/subfamily classifications, and tissue and disease annotations]
Fig. 3 Excerpt of the database schema used to create DTO

Building an ontology is particularly relevant for representing complex knowledge involving hierarchies of concepts (i.e., classes in the ontology) and many specific relationships (i.e., object properties) among concepts, together with their data properties. In this way, the flat data obtained can be used to create axioms that represent the current knowledge. With the help of DL reasoners, inferring new knowledge and performing complex queries for analysis and exploration become possible and easily operable. We previously reported a modular architecture [20, 24, 26] for building ontologies; it allows easier management and sharing of ontology files, standardized vocabularies, and axiomatic representations of knowledge. Modularization and ontology development can be performed manually. However, especially while building DTO, we created all vocabulary files and some of the axioms from the database using a Java application, OntoJog [24], which will be released soon. This process adds another layer to the modularization, separating axioms that are automatically created by software from axioms that are manually asserted in the ontology by expert knowledge engineers [20] (Fig. 4).

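The sketch below illustrates the flavor of this automated step: database rows are turned into OWL classes with labels and cross-references. It is a simplified stand-in for what a tool like OntoJog automates, not OntoJog itself; the namespace, table, and property names are assumptions.

# Minimal sketch of semiautomated vocabulary generation: each database
# row becomes an OWL class with a label and a UniProt cross-reference.
# This only illustrates the idea behind tools like OntoJog.
import sqlite3
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

DTO = Namespace("http://example.org/dto-sketch#")  # placeholder namespace
XREF = DTO["xref"]  # simplified cross-reference property

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kinase (gene TEXT, protein_name TEXT, uniprot TEXT)")
conn.execute("INSERT INTO kinase VALUES ('ABL1', "
             "'Tyrosine-protein kinase ABL1', 'P00519')")

g = Graph()
for gene, name, uniprot in conn.execute(
        "SELECT gene, protein_name, uniprot FROM kinase"):
    cls = DTO[gene + "_protein"]
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.label, Literal(name)))
    g.add((cls, XREF, Literal("UniProt:" + uniprot)))

print(g.serialize(format="turtle"))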

2.9 Ontology Validation
The final step in the proposed workflow is ontology validation. The domain experts as well as the knowledge engineers perform different tests to find out whether the information in the ontology is accurate. In addition, different reasoners can be run on the ontology to check its consistency. Additional software can be implemented to test different aspects of the ontology (e.g., Java programs that compare the database with the ontology classes, object properties, data properties, etc.). Finally, queries for the different use cases can be run to check whether the ontology implementation answers the questions it was meant to answer. If there are any inconsistencies or inaccuracies in the ontology, the knowledge engineer and the domain expert should go back to the ontology building step. If the inconsistencies are fundamental, we recommend starting from the first step and retracing the steps that led to the inconsistent knowledge. Domain experts and ontology engineers can also choose to go back to the “Metadata Creation and Knowledge Modeling” or the “Sub-language Recycling” step.
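As one example of such a consistency check, the minimal sketch below loads an ontology file and runs a DL reasoner via the owlready2 package, which raises an error if the ontology is inconsistent. The file name is a placeholder, and owlready2 (which bundles the HermiT reasoner and requires a Java runtime) is simply one convenient choice for the example.

# Minimal sketch of a reasoner-based consistency check with owlready2,
# which bundles the HermiT DL reasoner (a Java runtime is required).
# "dto_sketch.owl" is a placeholder file name.
from owlready2 import get_ontology, sync_reasoner
from owlready2 import OwlReadyInconsistentOntologyError

onto = get_ontology("file://dto_sketch.owl").load()
try:
    with onto:
        sync_reasoner()  # classify the ontology and check consistency
    print("ontology is consistent")
except OwlReadyInconsistentOntologyError:
    print("ontology is inconsistent; revisit the previous KNARM steps")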

2.10 Implementation of the Drug Target Ontology (DTO) Using KNARM
As a part of the Illuminating the Druggable Genome (IDG) project [41], we designed and implemented the Drug Target Ontology (DTO) [29]. The long-term goal of the IDG project [41] is to catalyze the development of novel therapeutics that act on novel drug targets, which are currently poorly understood and poorly annotated but are likely targetable. The project puts particular emphasis on the most common drug target protein families: G-protein-coupled receptors (GPCRs), nuclear receptors, ion channels, and protein kinases. We therefore focused initially on formally classifying, annotating, and modeling these specific protein families in their role as drug targets; DTO covers proteins known as putative drug targets, including many aspects that describe the relevant properties of these proteins in their role as drug targets. While creating DTO, we further advanced the methodology and ontology architecture that we used for BAO [26] and other application ontologies from the LINCS project (e.g., the LIFE ontology) [20]. A longer-term goal for DTO is to integrate it with the assays (formally described in BAO) that are used to identify and characterize small molecules that modulate these targets. This will result in an integrated drug discovery knowledge framework.

2.10.1 Sub-language Analysis and In-House Unstructured Interview for DTO
The initial interviews and sub-language analysis steps involved determining the different classifications of the drug targets and their properties. The IDG project defined a drug target [24, 29, 42] as “A material entity, such as native (gene product) protein, protein complex, microorganism, DNA, etc., that physically interacts with a therapeutic or prophylactic drug (with some binding affinity) and where this physical interaction is (at least partially) the cause of a (detectable) clinical effect” [24]. Currently, DTO focuses on protein targets.
The IDG drug targets have been categorized into four major classes with respect to the depth of investigation from a clinical, biological, and chemical standpoint (a sketch of this classification logic follows the list):
1. Tclin are targets for which a molecule in advanced stages of
development, or an approved drug, exists and is known to bind
to that target with high potency.
2. Tchem are proteins for which no approved drug or molecule in
clinical trials is known to bind with high potency but which can
be specifically manipulated with small molecules in vitro.
3. Tbio are targets that do not have known drug or small molecule
activities that satisfy the Tchem activity thresholds but were
annotated with a Gene Ontology Molecular Function or
biological process with an experimental evidence code or tar-
gets with confirmed OMIM phenotype(s) [43].
4. Tdark refers to proteins that have been described at the sequence level, do not satisfy the Tclin/Tchem/Tbio criteria, and meet two of the following three conditions: a fractional PubMed publication count [44] below five, three or fewer NCBI Gene RIF annotations [45], and 50 or fewer commercial antibodies, counted from data made available by the Antibodypedia database [46].
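A minimal sketch of this classification logic, with invented attribute names (in practice, the TDL assignments are computed in and obtained from the TCRD database):

# Minimal sketch of the TDL assignment logic described above. The
# attribute names are invented for illustration; actual TDL values are
# computed in and obtained from the TCRD database.
def target_development_level(has_potent_drug_or_clinical_molecule,
                             has_potent_small_molecule,
                             has_go_experimental_or_omim,
                             fractional_pubmed_count,
                             generif_count,
                             antibody_count):
    if has_potent_drug_or_clinical_molecule:
        return "Tclin"
    if has_potent_small_molecule:
        return "Tchem"
    if has_go_experimental_or_omim:
        return "Tbio"
    # Tdark requires at least two of the three "darkness" conditions
    dark_conditions = [fractional_pubmed_count < 5,
                       generif_count <= 3,
                       antibody_count <= 50]
    if sum(dark_conditions) >= 2:
        return "Tdark"
    return "Tbio"  # simplification for targets matching no other class

print(target_development_level(False, False, False, 2, 1, 10))  # Tdark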
DTO proteins have further been classified based on their struc-
tural (sequence/domains) and functional properties. Here we give
a high-level summary of the classifications for kinases, ion channels,
GPCRs, and nuclear receptors.
Most of the 578 kinases covered in the current version of DTO are protein kinases (PKs). These 514 PKs are categorized into ten groups that are further subcategorized into 131 families and 82 subfamilies. The 62 nonprotein kinases are categorized into five groups depending upon the substrates phosphorylated by these proteins. These five groups are further subcategorized into 25 families and 7 subfamilies. Two kinases have not yet been categorized into any of the above types or groups.
The 334 ion channel proteins (out of 342 covered in the current version of DTO) are categorized into 46 families, 111 subfamilies, and 107 sub-subfamilies. Similarly, the 827 GPCRs covered in the current version of DTO are categorized into 6 classes, 61 families, and 14 subfamilies. Additional information on whether a receptor has a known endogenous ligand or is currently orphan is mapped to the individual proteins. Finally, the 48 nuclear hormone receptors are categorized into 19 NR families.

Following our review of the free-form text about the data at hand, the domain experts in the group helped answer the ontology engineers’ questions. At times, the reviews of the free-form text were performed together with the domain experts. This process is defined as the unstructured interview, because there is no predefined set of questions asked of the domain expert. The questions are asked in a conversation-like environment to better understand the various characteristics of proteins as drug targets (such as protein domains, binding ligands, functions, mutations, binding sites, tissue expression, disease associations, and many protein family-specific concepts) and to identify patterns among the various kinds of molecular entities, their parts, functions, roles, related biomedical concepts, and their uses, as well as their functions in drug discovery assays and projects.
The above classifications of the proteins were performed by the domain experts and provided to the ontology engineers as Excel sheets. Other classification questions were also discussed, such as how best to classify mutated and modified proteins. The different properties identified in this first step are used in subsequent steps to create metadata, model the knowledge, and axiomatize it in the ontology-building process.

2.10.2 Sub-language Recycling for DTO
While designing the ontology, we decided to add the UniProt IDs for the proteins and the Entrez IDs [30] for the genes as cross-references. In addition, we wanted to include the textual definitions for the genes and the proteins. We also cross-referenced the synonymous names and symbols for the molecules that already exist in different databases.
We aimed to create the Drug Target Ontology (DTO) as a comprehensive resource by importing existing information about the biological and chemical molecules that DTO contains. In this way, we aim to help life scientists query and retrieve information about the different drug targets they are working on. To do so, we wrote various scripts in Java to retrieve information from different databases, including UniProt and the NCBI databases for the genes’ Entrez IDs.
In addition to several publicly available databases and datasets, including the DISEASES and TISSUES databases [44, 47], we also used the collaborators’ TCRD database [42] to retrieve information about proteins, genes, and their target development levels (TDLs), as well as tissue and disease information. The DISEASES and TISSUES databases were developed in the Jensen group from several resources, including advanced text mining, and include a scoring system that provides a consensus of the various integrated data sources. We retrieved the proteins with their tissue and disease relationships and the confidence scores given for those relationships. These data were loaded into our database and later used to create the ontology axioms that refer to the probabilistic values of the relationships.
In addition to the larger-scale information derived from the databases mentioned above, a vast amount of manual curation of the proteins and genes was performed in the team by the curators and domain experts. Most significantly improved were the drug target classifications for kinases, ion channels, nuclear receptors, and GPCRs. For most protein kinases, we followed the phylogenetic tree classification originally proposed by Sugen and the Salk Institute [48]. Protein kinases not covered by this resource were manually curated and classified mainly based on information in UniProt [49] and the literature. Non-protein kinases were curated and classified based on their substrate chemotypes. We also added pseudokinases, which are increasingly recognized as relevant drug targets. We continue updating manual annotations and classifications as new data become available. Nuclear receptors were organized following the IUPHAR classification. GPCRs were classified based on information from several sources, primarily GPCRDB (http://www.gpcr.org/7tm/) and IUPHAR, as we previously implemented in our GPCR ontology [50]. However, not all GPCRs were covered, and we are aligning the GPCR ontology with other resources to complete the classification of several understudied receptors. We are also incorporating ligand chemotype-based classification. A basic classification of ion channels is available in IUPHAR [51]. Manual classification of the 342 ion channels is in progress to provide better classifications as required, including domain functions, subunit topology, and heteromer and homomer formation.
Protein domains were annotated using the Pfam web service. The domain sequences and domain annotations were extracted using custom scripts. Several of the kinase domains were manually curated based on their descriptions. For nuclear receptors, we identified and annotated the ligand-binding domains, which are the most relevant as drug targets. For GPCRs, we identified 7TM domains for the majority (780 out of 827) of GPCRs. Ion channel domains were annotated and transmembrane domains identified; additional ion channel characteristics, such as regulatory and gating mechanisms and transported ions, were curated for ion channel drug targets. Additional subclassification and annotation are in progress and will further improve this module.
In addition to the curated drug target family function-specific domain annotations, we generated comprehensive Pfam domain annotations for the kinase module [42]. The domain sequences were compared to the PDB chain sequences by BLAST, and e-values were calculated. For significant hits, domain identities were computed using the EMBOSS software suite. These results were used to align and identify critical selectivity residues, such as the gatekeeper and the hinge-binding motif (publication in preparation). These annotations also allowed integration with the KINOMEscan assays from the LINCS project [52]. These domains are classified manually based on curated annotations to generate meaningful, interpretable assertions in DTO.


2.10.3 Metadata Creation for DTO and Knowledge Modeling
Based on the sub-language analysis, the in-house unstructured interview, and the sub-language recycling, the next step in formalizing the descriptions is creating a set of metadata.
The metadata creation step combines an analysis of already-existing standards (e.g., Pfam annotations) with an understanding of the patterns in the data at hand. For the first version of DTO, we decided to add the following axioms for the different protein classes (not a complete list):
1. Kinase relationships.
(a) Protein-gene relationships.
(b) Protein-disease relationships.
(c) Protein-tissue relationships.
(d) Target development level relationships.
(e) Has quality pseudokinase relationships.
2. GPCR relationships.
(a) Protein-gene relationships.
(b) Protein-disease relationships.
(c) Protein-tissue relationships.
(d) Target development level relationships.
(e) Has ligand-type relationships.
3. IC relationships.
(a) Protein-gene relationships.
(b) Protein-disease relationships.
(c) Protein-tissue relationships.
(d) Target development level relationships.
(e) Has channel activity.
(f) Has gating mechanism.
(g) Has quaternary organization.
(h) Has topology.
4. NR relationships.
(a) Protein-gene relationships.
(b) Protein-disease relationships.
(c) Protein-tissue relationships.
(d) Target development level relationships.

Target development levels (TDL: Tclin, Tchem, Tbio, Tdark) from TCRD [42] were assigned using the has target development level relationship, based on the criteria set by the IDG project. Each protein has an axiom annotating its target development level (TDL), i.e., Tclin, Tchem, Tbio, or Tdark. The protein is linked to its gene by the has gene template relation.
The gene is associated with a disease based on evidence from the DISEASES database. The protein is also associated with an organ, tissue, or cell line based on evidence from the TISSUES database. Important disease targets were modeled, based on the protein-disease associations, as having strong, at least some, or at least weak evidence. DTO uses the following hierarchical relations to declare the relation between a protein and an associated disease extracted from the DISEASES database. In the DISEASES database [44], each disease-protein association is measured by a Z-score. In DTO the relationships are translated as follows (see the sketch after this list):
1. Has associated disease with at least weak evidence from DIS-
EASES (translated for Z-scores between zero and 2.4).
2. Has associated disease with at least some evidence from DIS-
EASES (translated for Z-scores between 2.5 and 3.5).
3. Has associated disease with strong evidence from DISEASES
(translated for Z-scores between 3.6 and 5).
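A minimal sketch of this translation, assuming the Z-score bands listed above (the function name and out-of-band handling are our own):

# Minimal sketch translating DISEASES Z-scores into the DTO
# evidence-level relations listed above. The band boundaries follow
# the text; the function name and out-of-band handling are our own.
def disease_evidence_relation(z_score):
    if 0 <= z_score <= 2.4:
        return "has associated disease with at least weak evidence"
    if 2.5 <= z_score <= 3.5:
        return "has associated disease with at least some evidence"
    if 3.6 <= z_score <= 5:
        return "has associated disease with strong evidence"
    return None  # outside the bands used in DTO

print(disease_evidence_relation(4.2))  # strong evidence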

2.10.4 Structured Interview for DTO
Based on the metadata created, we interviewed researchers within and outside the group. This step confirms that the interpretation of the text data is correct and accurate. Additionally, it can be used in combination with other methods to decide on a concept’s proper name; in this case, we chose to use existing names from well-known databases such as UniProt [49].
The aim of this step is to finalize the names and types of the concepts used in the metadata, and to make sure that the ontology engineer is on the same page as the domain experts before starting to write the axioms. Therefore, this step can be combined with the next step, i.e., knowledge acquisition validation.

2.10.5 Knowledge Acquisition Validation (KA Validation) for DTO
In this case, after the metadata creation and the various interviews and reviews of the data, an ontology engineer runs several scripts to check the consistency of the data. In addition, a domain expert performs a thorough manual review of the extracted data. Before database formation, the metadata is also reviewed, and the domain experts use it to group the extracted data. The modeling of the knowledge is confirmed through ontology engineer and domain expert reviews. The structured data is then shared with research scientists inside and outside the team, especially with the scientists in the IDG project, to make sure that the information contained is valid. Corrections, where necessary, were made to the data and metadata provided.


2.10.6 Database Formation for DTO
Previously, we had engineered the BioAssay Ontology (BAO) in a rigorous way using version control and Protégé. However, the modularization approach, despite its many benefits, requires tracking many vocabulary files and ID ranges (to avoid conflicts). In addition to the vocabulary files, BAO consists mostly of expert-constructed, manual axioms defining assays and various related BAO concepts. For DTO, on the other hand, much information was extracted from third-party databases and then consolidated by curators and domain experts. The use of external resources also requires a mechanism for frequent updates of the ontology. To facilitate that process and to better track DTO modules, vocabularies, and ID ranges, a more efficient and less error-prone method to manage all information was required.
For DTO, a new MySQL database was built to handle all data and metadata. The Drug Target Ontology (DTO) uses various external databases and ontologies as information sources. Data from these databases were retrieved via web-based applications and in-house-built scripts, and all data were stored in a relational database. The database schema for DTO is provided in Fig. 3.

2.10.7 Semiautomated Ontology Building for DTO
The ontology was then built from this database in an automated way using a Java application, OntoJog [24], which will be released and described separately soon. This process builds all vocabulary files and modules, along with the axioms that can be constructed automatically from the information in the database. In addition, all the external modules were built. The various vocabularies and modules are organized hierarchically via direct and indirect imports leading to DTO_core. DTO_core is then imported, along with the expert-asserted axioms and the external modules, into DTO_complete (Fig. 4).

2.11 Knowledge Modeling of the Drug Target Ontology
In BAO, the formal descriptions of assays were manually axiomatized. DTO, which was created for the IDG project, focuses on the biomolecules and their binding partners, such as the specific ions for ion-channeling proteins or the small-molecule ligands for GPCRs, as well as their relationships to specific diseases and tissues.
We used several tools, including Java, the OWL API, and Jena, to build the ontology in a semiautomated way, leveraging our local database and implementing the new modularization architecture detailed below.

2.12 A New Modular Architecture for the Drug Target Ontology
The modular design of DTO adds an additional layer on top of our previously reported modular architecture developed for BAO [26]. Specifically, we separate out the module of auto-generated simple axioms, which are created using native-DTO concepts and/or various pieces of data imported from external databases after internal preprocessing. Following the auto-generated axioms, complex axioms are formed by ontologists or knowledge engineers. This way, automatic updates do not affect expert-formalized knowledge. The modular design is illustrated in Fig. 4, and the new approach is detailed below.

Fig. 4 Modular architecture of DTO showing the core principles and levels of DTO’s architecture with direct
and indirect imports

First, we determine an abstract horizon between the TBox and the ABox. The TBox contains vocabularies and modules. Vocabularies define the conceptualization without dependencies; they are self-contained and well-defined with respect to the domain, and they contain concepts, relations, and data properties (i.e., native-DTO concepts). We can have n of these vocabularies and modules, which are combined into DTO_core.
Second, once the native vocabularies and modules are defined, we can design modules that import modules from our domain of discourse and also from third-party ontologies. Once these ontologies are imported, alignment takes place. Alignments are defined for concepts and relations using equivalence or subsumption DL constructs, and they depend on the domain experts and/or the cross-references made in the ontologies. For DTO, the most significant alignment is between the UBERON and BRENDA ontologies for the tissue information.

We combine these modules at the DTO_complete level. We can have one DTO_complete file or multiple files, each modeled for a different purpose, e.g., tailored to a different user group or area of research (e.g., kinases, GPCRs).
At the third level, the modules with axioms that can be generated automatically are created. The auto-generated modules have interdependent axioms, i.e., these axioms can be generated using native-DTO concepts and/or concepts imported from external ontologies. At this level, one can create any number of gluing modules, which import other modules with or without dependencies.
The fourth level contains axioms created manually. The manual modules are an optional level, and they inherit the automatically created axioms. Examples of axioms at this level include protein modifications and mutations and protein-drug interactions.
The fifth level contains the TBox released based on the modules created in the fourth phase. Depending on the end users, the modules are combined without loss of generality. With this methodology, we make sure that we only ship physical files that contain our knowledge and what is absolutely necessary.
At the sixth and last level, the necessary ABox modules (i.e., instances of the concepts created in the TBoxes) can be created. ABoxes can be loaded into a triple store or a distributed file system (Hadoop DFS [53]) in a way that allows pseudo-parallel reasoning. In another layer, using modules, we can define views on the knowledge base. These are files that contain imports (both direct and indirect) from various TBox and ABox modules for the end user; in database terminology, they can be seen as views.
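A sketch of how such layered imports can look in practice, using placeholder module IRIs rather than the real DTO files:

# Minimal sketch of the layered owl:imports structure described above,
# using placeholder module IRIs instead of the real DTO files.
from rdflib import Graph
from rdflib.namespace import OWL

TURTLE = """
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix : <http://example.org/dto-sketch/> .

:DTO_core a owl:Ontology ;
    owl:imports :protein_vocabulary , :gene_vocabulary .

:DTO_complete a owl:Ontology ;
    owl:imports :DTO_core , :external_tissue_module ,
                :auto_generated_axioms , :manually_asserted_axioms .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")
for module, imported in g.subject_objects(OWL.imports):
    # each printed pair mirrors one import edge of the architecture
    print(module.split("/")[-1], "imports", imported.split("/")[-1])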

2.12.1 Ontology Validation for DTO
Several checkpoints for data validation are performed throughout the methodology. For example, after the data is extracted in the Sub-language Recycling step, the ontology engineer runs several scripts to check the consistency of the data, and the domain expert performs a thorough manual review of the extracted data. The second checkpoint in the methodology is during the Database Formation step: several scripts check whether the extracted data has been properly imported into the database, under the appropriate metadata categories, as units of information along with the metadata. Once the Semiautomated Ontology Building step is complete, the ontology engineer runs the available reasoners to check the consistency of the information. Furthermore, several SPARQL queries are run to flag any discrepancies. If there are any issues in the ontology, the ontology engineer and the domain experts can decide to step back and repeat the previous steps of the methodology.
Another ontology validation script for DTO is designed to read the DTO vocabulary and module files and compare them to the previous version of the ontology. This script generates reports of all new (i.e., not present in the previous version), deleted (i.e., not present in the current version), and changed classes or properties, based on their URIs and labels.

Any issues resulting from the tests are then discussed among the ontology engineers and domain experts. GitHub is used to store the different versions of the ontology to help audit the quality control (QC) and ontology validation (OV) process. Once all QC and OV procedures are completed with no errors, DTO is released on the public GitHub and its web page [29].
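As one concrete example of such checks, the sketch below runs a SPARQL query that flags classes lacking an rdfs:label. The file name is a placeholder, and this particular query is an illustration, not one of the actual DTO QC scripts.

# Minimal sketch of a SPARQL-based discrepancy check: flag OWL classes
# that have no rdfs:label. The file name is a placeholder, and this
# query is an illustration, not one of the actual DTO QC scripts.
from rdflib import Graph

g = Graph()
g.parse("dto_sketch.owl")  # rdflib infers RDF/XML from the extension

QUERY = """
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cls
WHERE {
    ?cls a owl:Class .
    FILTER NOT EXISTS { ?cls rdfs:label ?label }
    FILTER isIRI(?cls)   # skip anonymous (blank node) classes
}
"""

for row in g.query(QUERY):
    print("class without a label:", row.cls)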

3 Notes

Complex life sciences data fits the big data description due to its large volume (terabytes and larger), complexity (interconnected with over 25 highly accessed databases [18] and over 600 ontologies [23]), variety (many technologies generating different data types, such as gene sequencing, RNA-Seq gene expression, and microscopy imaging data), and dynamic nature (growing exponentially and changing fast [18, 25]). New tools are required to store, manage, integrate, and analyze such data while avoiding oversimplification. It is a challenge to design applications involving such big data sets aimed at advancing human knowledge. One approach is to develop a knowledge-based integrative semantic framework, such as an ontology, that formalizes how the different data types fit together given the current understanding of the domain of investigation. Building ontologies is time-consuming and limited by the knowledge acquisition process, which is typically done manually by domain experts and knowledge engineers.
In this chapter, we described a methodology, KNowledge
Acquisition and Representation Methodology (KNARM), as a
guided approach, involving domain experts and knowledge engi-
neers, to build useful, comprehensive, consistent ontologies that
will enable big data approaches in the domain of drug discovery,
without the currently common simplifications. It is designed to
help with the challenge of acquiring and representing knowledge
in a systematic, semiautomated way. We applied this methodology
in the implementation of the Drug Target Ontology (DTO).
While technological innovations continue to drive the increase
of data generation in the biomedical domains across all dimensions
of big data, novel bioinformatics and computational methodologies
will facilitate better integration and modeling of complex data and
knowledge.
Although the above-described methodology is still a work in
progress, it provided a systematic process for building concordant
ontologies such as BioAssay Ontology (BAO) and Drug Target

Ontology (DTO) [20]. The proposed method helps to find a starting point and facilitates the practical implementation of an
ontology. The interview steps in our methodology, which involve
domain experts’ manual contributions, are crucial to acquire the
knowledge and formalize it accurately and consistently. A critical
current effort is to further formalize and automate this approach.
Beyond the methodology for ontology generation and in par-
ticular knowledge acquisition, we are also developing new tools to
improve the interaction between ontology developers and users
given the reality of rapidly advancing knowledge and the need for
more dynamic environment in which user requests can be
incorporated in real time via direct information exchange with
ontology developers. The long-term prospect is a global dynamic
knowledge framework to integrate and model increasingly large
and complex datasets to help solve the most challenging biomedical
research problems.

Acknowledgments and Funding

This work was supported by NIH grants U54CA189205 (Illuminating the Druggable Genome Knowledge Management Center, IDG-KMC), U24TR002278 (Illuminating the Druggable Genome Resource Dissemination and Outreach Center, IDG-RDOC), U54HL127624 (BD2K LINCS Data Coordination and Integration Center, DCIC), and U01LM012630-02 (BD2K, Enhancing the efficiency and effectiveness of digital curation for biomedical “big data”). The IDG-KMC and IDG-RDOC (http://druggablegenome.net/) are components of the Illuminating the Druggable Genome (IDG) project (https://commonfund.nih.gov/idg) awarded by the National Cancer Institute (NCI) and the National Center for Advancing Translational Sciences (NCATS), respectively. The BD2K LINCS DCIC is awarded by the National Heart, Lung, and Blood Institute through funds provided by the trans-NIH Library of Integrated Network-Based Cellular Signatures (LINCS) Program (http://www.lincsproject.org/) and the trans-NIH Big Data to Knowledge (BD2K) initiative (https://commonfund.nih.gov/bd2k). IDG, LINCS, and BD2K are NIH Common Fund projects.

References
1. Gruber TR (1993) Towards principles for the design of ontologies used for knowledge sharing. Int J Hum Comput Stud 43(5–6):907–928
2. CommonKADS. http://commonkads.org/
3. Schreiber G, Wielinga B, de Hoog R, Akkermans H, Van de Velde W (1994) CommonKADS: a comprehensive methodology for KBS development. IEEE Expert 9(6):28–37
4. Barnes JC (2002) Conceptual biology: a semantic issue and more. Nature 417(6889):587–588
5. Blagosklonny MV, Pardee AB (2002) Conceptual biology: unearthing the gems. Nature 416(6879):373
6. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL (2000) GenBank. Nucleic Acids Res 28(1):15–18
7. Heflin J, Hendler J (2000) Semantic interoperability on the web. Department of Computer Science, University of Maryland, College Park, MD
8. Noy NF, Fergerson RW, Musen MA (2000) The knowledge model of Protege-2000: combining interoperability and flexibility. In: Knowledge engineering and knowledge management methods, models, and tools. Springer, New York, pp 17–32
9. Stevens R, Goble CA, Bechhofer S (2000) Ontology-based knowledge representation for bioinformatics. Brief Bioinform 1(4):398–414
10. Wache H, Voegele T, Visser U, Stuckenschmidt H, Schuster G, Neumann H, Hübner S (2001) Ontology-based integration of information: a survey of existing approaches. In: IJCAI-01 workshop: ontologies and information sharing. Citeseer, New Jersey, pp 108–117
11. Yeh I, Karp PD, Noy NF, Altman RB (2003) Knowledge acquisition, consistency checking and concurrency control for gene ontology (GO). Bioinformatics 19(2):241–248
12. Degtyarenko K, De Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M (2008) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 36(suppl 1):D344–D350
13. Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF (2010) The description logic handbook: theory, implementation and applications, 2nd edn. Cambridge University Press, New York, NY
14. Buchanan BG, Barstow D, Bechtal R, Bennett J, Clancey W, Kulikowski C, Mitchell T, Waterman DA (1983) Constructing an expert system. Build Exper Sys 50:127–167
15. Natale DA, Arighi CN, Blake JA, Bona J, Chen C, Chen S-C, Christie KR, Cowart J, D’Eustachio P, Diehl AD, Drabkin HJ, Duncan WD, Huang H, Ren J, Ross K, Ruttenberg A, Shamovsky V, Smith B, Wang Q, Zhang J, El-Sayed A, Wu CH (2011) The representation of protein complexes in the protein ontology (PRO). BMC Bioinformatics 12(1):1
16. Clark AM, Litterman NK, Kranz JE, Gund P, Gregory K, Bunin BA, Cao L (2016) BioAssay templates for the semantic web data science: challenges and directions. PeerJ Comput Sci 2(8):e61
17. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J (2008) Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform 41(5):706–716
18. Cook CE, Bergman MT, Finn RD, Cochrane G, Birney E, Apweiler R (2015) The European Bioinformatics Institute in 2016: data growth and integration. Nucleic Acids Res 44(D1):D20–D26
19. Hitzler P, Krötzsch M, Rudolph S (2009) Foundations of semantic web technologies. Chapman and Hall (CRC), Florida
20. Küçük-McGinty H, Metha S, Lin Y, Nabizadeh N, Stathias V, Vidovic D, Koleti A, Mader C, Duan J, Visser U, Schurer S (2016) IT405: building concordant ontologies for drug discovery. In: International conference on biomedical ontology and BioCreative (ICBO BioCreative 2016), Oregon
21. Schurer SC, Vempati U, Smith R, Southern M, Lemmon V (2011) BioAssay ontology annotations facilitate cross-analysis of diverse high-throughput screening data sets. J Biomol Screen 16(4):415–426
22. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ et al (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11):1251–1255
23. Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, Musen MA (2011) BioPortal: enhanced functionality via new web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res 39(2):W541–W545
24. Lin Y, Mehta S, Küçük-McGinty H, Turner JP, Vidovic D, Forlin M, Koleti A, Nguyen D-T, Jensen LJ, Guha R, Mathias SL, Ursu O, Stathias V, Duan J, Nabizadeh N, Chung C, Mader C, Visser U, Yang JJ, Bologa CG, Oprea TI, Schürer SC (2017) Drug target ontology to classify and integrate drug discovery data. J Biomed Semantics 8(1):50
25. Ma’ayan A (2017) Complex systems biology. J R Soc Interface 14(134):1742–5689
26. Abeyruwan S, Vempati UD, Küçük-McGinty H, Visser U, Koleti A, Mir A, Sakurai K, Chung C, Bittker JA, Clemons PA, Brudz S, Siripala A, Morales AJ, Romacker M, Twomey D, Bureeva S, Lemmon V, Schürer SC (2014) Evolving BioAssay ontology (BAO): modularization, integration and applications. J Biomed Semantics 5(Suppl 1):S5
27. BAOSearch. http://baosearch.ccs.miami.edu/
28. Visser U, Abeyruwan S, Vempati U, Smith R, Lemmon V, Schurer S (2011) BioAssay ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinformatics 12(1):257
29. Drug Target Ontology. http://drugtargetontology.org/
30. Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA, Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J (2010) Modeling biomedical experimental processes with OBI. J Biomed Semantics 1(Suppl 1):S7
31. Callahan A, Cruz-Toledo J, Dumontier M (2013) Ontology-based querying with Bio2RDF’s linked open data. J Biomed Semantics 4(Suppl 1):S1
32. Ceusters W, Smith B (2006) A realism-based approach to the evolution of biomedical ontologies. AMIA Annu Symp Proc:121–125
33. The Gene Ontology Consortium (2015) Gene Ontology Consortium: going forward. Nucleic Acids Res 43(D1):D1049–D1056. https://doi.org/10.1093/nar/gku1179
34. Decker S, Erdmann M, Fensel D, Studer R (1999) Ontobroker: ontology based access to distributed and semi-structured information. In: Database semantics. Springer, New York, pp 351–369
35. Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5(2):199–220
36. Köhler J, Philippi S, Lange M (2003) SEMEDA: ontology based semantic integration of biological databases. Bioinformatics 19(18):2420–2427
37. Basic Formal Ontology (BFO) Project. http://www.ifomis.org/bfo
38. Pease A, Niles I, Li J (2002) The suggested upper merged ontology: a large ontology for the semantic web and its applications. In: Working notes of the AAAI-2002 workshop on ontologies and the semantic web
39. Sure Y, Erdmann M, Angele J, Staab S, Studer R, Wenke D (2002) OntoEdit: collaborative ontology development for the semantic web. Springer, New York
40. Welty CA, Fikes R (2006) A reusable ontology for fluents in OWL. In: Formal ontology in information systems. Frontiers in artificial intelligence and applications. IOS, pp 226–236
41. NIH Illuminating the Druggable Genome | NIH Common Fund. https://commonfund.nih.gov/idg/index
42. TCRD Database. http://habanero.health.unm.edu/tcrd/
43. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33:D514–D517
44. Pletscher-Frankild S, Pallejà A, Tsafou K, Binder JX, Jensen LJ (2015) DISEASES: text mining and data integration of disease–gene associations. Methods 74:83–89
45. NCBI (2017) Gene RIF. https://www.ncbi.nlm.nih.gov/gene/about-generif
46. Kiermer V (2008) Antibodypedia. Nat Methods 5(10):860
47. Santos A, Tsafou K, Stolte C, Pletscher-Frankild S, O’Donoghue SI, Jensen LJ (2015) Comprehensive comparison of large-scale tissue expression datasets. PeerJ 3:e1054
48. Sugen and the Salk Institute (2012) http://kinase.com/human/kinome/phylogeny.html
49. The UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212. https://doi.org/10.1093/nar/gku989
50. Przydzial MJ, Bhhatarai B, Koleti A, Vempati U, Schürer SC (2013) GPCR ontology: development and application of a G protein-coupled receptor pharmacology knowledge framework. Bioinformatics 29(24):3211–3219
51. Pawson AJ, Sharman JL, Benson HE, Faccenda E, Alexander SP, Buneman OP, Davenport AP, McGrath JC, Peters JA, Southan C, Spedding M, Yu W, Harmar AJ, NC-IUPHAR (2013) The IUPHAR/BPS Guide to PHARMACOLOGY: an expert-driven knowledgebase of drug targets and their ligands. Nucleic Acids Res 42(D1):D1098–D1106
52. Vidović D, Koleti A, Schürer SC (2014) Large-scale integration of small molecule-induced genome-wide transcriptional responses, kinome-wide binding affinities and cell-growth inhibition profiles reveal global trends characterizing systems-level drug action. Front Genet 5:342
53. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). IEEE Computer Society, Washington, DC, pp 1–10
Part II

Informatics in Drug Discovery


Chapter 5

A Guide to Dictionary-Based Text Mining


Helen V. Cook and Lars Juhl Jensen

Abstract
PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per
year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety,
and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text
mining provides a means to automatically read this corpus and to extract the relations found therein as
structured information. Having data in a structured format is a huge boon for computational efforts to
access, cross reference, and mine the data stored therein. This is increasingly useful as biological research is
becoming more focused on systems and multi-omics integration. This chapter provides an overview of the
steps that are required for text mining: tokenization, named entity recognition, normalization, event
extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail
on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based
approach and provides the text mining evidence for STRING and several other databases.

Key words Automated text processing, Dictionary-based approach, Named entity recognition,
PubMed, Structured information, Text mining, Text normalization

1 Introduction

PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year [1]. Even within
specialized topics, it is not possible for a researcher to read any
field in its entirety, so everyone is lacking a complete picture of the
scientific knowledge in any given field at any time. This dearth of
perspective is in part compensated for by databases such as UniProt
that manually curate domain-specific knowledge and provide it in a
structured form [2]. Such databases are essential collections of
information as they facilitate finding information about the proper-
ties and functions of proteins and provide programmatic access to
it. Efforts to curate scientific facts into databases are greatly valued
by the community for the ease with which structured information
can be extracted, but they require significant investments of time,
resources, and money to create and maintain [3]. Further, where
databases do exist, they are necessarily limited in scope—as


comprehensive as UniProt is for proteins, it makes no effort to also cover noncoding RNAs.
Biomedical text mining, sometimes called BioNLP, is becoming an essential part of the research toolbox, as it can be used to automatically and quickly extract and organize facts that are scattered throughout the literature into a comprehensive structured collection. NLP refers to natural language processing, a term that can mean either processing text in a general sense using a variety of methods or, more specifically, parsing the grammatical structure of sentences to better understand the text. The term BioNLP uses the more general meaning of NLP: processing biomedical text by any method.
Biomedical text mining spans a wide range of entities and their
relations. Relations that can be extracted using text mining include
protein interactions with diseases [4], RNA [5], cellular compart-
ments [6] or tissues [7], or any other pairs that could be part of a
biologically interesting relation. In addition to describing
biological relationships, text mining has also been used to describe
trends in biomedical research [8] and to identify proteins that have
not been researched compared to their importance [9].
As text mining uses the existing literature as a source, the
individual facts extracted by text mining are inherently not novel.
However, text mining enables information to be combined from
multiple papers in potentially novel ways to facilitate discovery
[10, 11]. For example, an early result of text mining is the discovery
of a connection between magnesium deficiency and migraines, an
example of such “undiscovered public knowledge” [12]. Text
mining further gives new life to facts that are buried in old articles
that are unlikely to be unearthed manually by human researchers,
but that can be quickly searched by computers. Having data in the
structured format provided by the text mining results is a huge
boon for computational efforts to access, cross reference, and mine
the data stored therein. This is increasingly useful as biological
research is becoming more focused on systems and multi-omics
integration.
Beyond providing a resource to make interactions available, the
results of text mining can be used in other ways. Text mining is used
to prefilter abstracts for hand curation for the biomarker database
miRandola [13] and also for the protein-protein interaction data-
bases MINT, DIP, and BIND [14–17]. Reflect [18] is a browser
plug-in that will augment web pages by tagging proteins and
chemicals in the text and providing a clickable link that opens a
pop-up containing more information about each entity.
The aims of this chapter are twofold: first, to provide an imple-
mentation agnostic overview of the process of text mining and,
second, to describe how to use the dictionary-based tagger from
the JensenLab [19]. This tagger is used to provide the protein-
protein interactions for the text mining channel of the STRING

database [20] and gives good recall and precision also on other
domains [21]. In the next section, we walk through the process of
text mining in general and briefly introduce the different
approaches that can be taken along with their respective merits.
The third section focuses on the data that will need to be prepared
in order to apply a dictionary-based system, using the specific
example of the JensenLab tagger. We will then discuss some limita-
tions of text mining in general and this system in particular as well
as ways to circumvent them.

2 Text Mining Fundamentals

Information retrieval and information extraction are two closely related but different tasks [22]. Information retrieval describes
the task of retrieving documents that match a given query, whereas
information extraction describes the task of retrieving facts from
documents. In this article, we are interested in the latter—our goal
is to extract facts, relationships, or interactions between given
entities directly out of the text without human intervention or
assistance.
Information extraction can be implemented in a variety of ways,
including purely statistical methods that consider the frequency of
words in the document and NLP methods that parse the text and
attempt to understand the specific context that words are used
in. The BioCreative [23] and BioNLP shared task [24] contests
are run every year or two, respectively, and pose challenges of
biological interest and relevance to the text mining community.
Reference [25] reviews the many other community challenges
that have been run, their impacts on the field, and their limitations.
The organizers provide training data, and submissions to each
contest are evaluated against a common test set. These submissions
are good resources to survey for a comparison of the efficacy and
speed of a range of different approaches to biomedical text mining.

2.1 Corpora
Text mining is performed on a set of documents, often abstracts or full texts from biomedical journals, but also supplemental materials
[26], patient records [27–29], drug information sheets, or package
inserts [30] can be used to discover correlations between diseases
and stratify patients or to discover drug interactions. Semi-
structured sources, such as encyclopedia entries and UniProt
descriptions, are also good sources, as the structure of the docu-
ment can provide additional information to infer relations. For
example, if a habitat is found within the habitat subsection of an Encyclopedia of Life entry for a given species, it is extremely likely that
the species lives in the discovered habitat without requiring any
further evidence from co-occurrence counts [31]. As an extreme
example, the genome itself can also be used as a corpus, as in [32]

which used text mining techniques to discover motifs in the enhancer regions of the human genome.
Medline abstracts and full texts are available through an API
[33] and can be downloaded for free. Although only a subset of full
texts are available as open-access, it is worth using this content
where it is possible. Compared to abstracts only, using full texts
gives a higher recall for protein-protein and disease-gene interac-
tions while maintaining the same precision [34]. That open-access
articles are readily available for text mining is another compelling
reason for authors to prefer open-access publishing to traditional
publication venues [35, 36]. Documents that are available in PDF
format can be converted to text using conversion software, for
example, pdftotext [37], or layout-aware programs, such as
LAPDFText [38]. Tables and figures are often discarded, but can
also be a source of information [39].
Different sources of documents use language differently;
patient records use a very different language and a different set of
abbreviations than scientific articles. Nevertheless, they can also be
successfully text mined, even if written in a non-English
language [29].
The encoding of the corpus should be consistent between all
documents in the corpus. Documents are generally encoded in
ASCII or UTF-8, but older documents and documents in other
languages may use other encodings, for example, Windows code
pages or Japanese Shift JIS. Document encodings can be converted
by using the Unix and Mac command line utility GNU iconv [40]
or an encoding-aware editor such as vim [41].
Further details on corpora are out of scope for this document,
but the 2016 review by Przybyla et al. [42] covers publications and
other sources that can be used for text mining in some more detail,
as well as the structured formats that these documents will be
found in.

2.2 Tokenization Any text mining approach must first prepare the text to be read by
the software by dividing the text into individual words. This process
of identifying the word boundaries is called tokenization, and each
word is referred to as a token. Word boundaries are for the most
part obvious for text in English and other languages in which words
are separated by spaces; however, contractions and hyphenated
words can make tokenization less straightforward. For most text
mining methods, text must also be segmented into sentences.
Here, abbreviations that are terminated with a period are the
main challenge, since in all other cases, a period will clearly denote
the end of the sentence. Documents that have been converted from
PDF and that contain columns can introduce additional complications,
where text segment boundaries have not been split correctly and
words run together. Layout-aware PDF conversion software
may help mitigate this problem.
Many text mining packages use the Stanford parser [43], which
provides tokenization along with some other useful features such as
co-reference resolution [44]. The Python natural language toolkit
[45] is another popular option.
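To make this concrete, the following is a minimal sketch of sentence segmentation and tokenization using the Python natural language toolkit (it assumes NLTK is installed and fetches the "punkt" sentence model on first use):

    import nltk
    nltk.download("punkt", quiet=True)  # sentence segmentation model, fetched once
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "CDK1 is a cyclin-dependent kinase. It drives the cell cycle."
    for sentence in sent_tokenize(text):  # periods ending abbreviations are the hard case
        print(word_tokenize(sentence))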

2.3 NER and Normalization Named entity recognition (NER) refers to the task of locating any
terms of interest in the corpus. Under a dictionary approach, this
involves scanning the corpus for all terms that appear in the dictio-
nary. Once a term has been identified, it can then be mapped onto a
single corresponding identifier in a predefined ontology or taxo-
nomic resource, for example, a taxonomic identifier for species [46]
or a database identifier for a protein [2]. This process is called
normalization. With a dictionary, normalization is essentially auto-
matic (modulo any disambiguation that needs to take place) since
the dictionary will be originally constructed from an ontology,
where each name is already tied to its normalized form. Conversely,
under machine learning methods, normalization is a separate task
that will be learned after observing sufficient training examples.
The JensenLab tagger and LINNAEUS are both dictionary-
based systems that perform joint NER and normalization and that
are broadly applicable to any domain [19, 47]. Alternatively, Tag-
gerOne, which uses semi-Markov models, requires training data
and can be applied to any domain [48]. NERsuite takes a third
approach and performs part-of-speech tagging (to determine which
words are nouns, verbs, adjectives, and so forth), lemmatization
(to reduce nouns to their singular forms and verbs to their infinitive
forms), and chunking (to identify noun phrases) prior to named
entity recognition and normalization [49]. All the software men-
tioned here is open source.
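As a toy illustration of joint dictionary-based NER and normalization (this is not the tagger's actual implementation, and the names and identifiers below are made up for the example), consider the following Python sketch, which mimics the case- and hyphen-insensitive matching described later in this chapter:

    import re

    # miniature dictionary: synonym -> normalized identifier (values are invented)
    dictionary = {
        "cdk1": "ENSP00000378699",
        "cyclin dependent kinase 1": "ENSP00000378699",
    }

    def normalize(text):
        # ignore case and treat hyphens as spaces; for the ASCII text used here,
        # both are 1:1 replacements, so character offsets stay aligned
        return text.lower().replace("-", " ")

    def tag(text):
        normalized = normalize(text)
        matches = []
        # longest names first, so multiword synonyms win over their substrings
        for name in sorted(dictionary, key=len, reverse=True):
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", normalized):
                matches.append((m.start(), m.end(), dictionary[name]))
        return matches

    print(tag("Cyclin-dependent kinase 1 (CDK1) is activated by cyclin B."))
    # -> [(0, 25, 'ENSP00000378699'), (27, 31, 'ENSP00000378699')]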

2.4 Event Extraction Once entities are located in the text, the next step is to find relations
between them. This process is referred to as event extraction.
Following NER, the JensenLab tagger uses co-occurrence counts
to determine relationships between entities, but other statistical
methods can be used to determine co-occurrence relationships
[50]. A method similar to the tagger’s co-occurrence scoring is
term frequency-inverse document frequency (tf-idf) weighting
[51]. This gives more weight to terms that are globally rare but
used frequently within a small set of documents. Frequencies of
co-occurring words can be captured with N-grams, which are lists
of all runs of N consecutive words in a
document [52]. N-grams can be compared to identify documents
on similar topics or rare phrases within a document. These methods
tend to require a large corpus to generate accurate statistics, and
they do not use the meaning of the words in the document to
influence the statistics.
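For instance, tf-idf weights over unigrams and bigrams can be computed in a few lines of Python; scikit-learn is an assumed dependency here, and any tf-idf implementation would do:

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "CDK1 phosphorylates substrates during mitosis",
        "CDK1 and CDK2 regulate progression through the cell cycle",
        "SDS is a detergent commonly used for gel electrophoresis",
    ]

    # ngram_range=(1, 2) builds both unigrams and bigrams (N-grams with N <= 2)
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    weights = vectorizer.fit_transform(documents)  # rows: documents, columns: N-gram weights
    print(weights.shape)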
Whereas co-occurrence relies on statistical relationships
between word occurrences, grammar-based approaches use part-
of-speech tagging to determine the semantics of the sentence by
parsing its grammatical structure. Many machine learning techni-
ques use the parts of speech, and possibly other characteristics, as
features to build a model on. A full review of machine learning-
based systems is beyond the scope of this chapter; however, exam-
ples include the Turku Event Extraction System (TEES) [53] and
the VERSE system, which expands on TEES [54].
A new and fundamentally different approach to mapping rela-
tionships between terms is word embedding, such as word2vec
[55]. This class of methods associates each word in a document
with a vector in a high-dimensional space and then adjusts the
values of the vectors to bring together vectors of words that occur
in similar contexts. This has the effect that words that are synonyms
will have similar vectors, and relationships between concepts will be
preserved. GloVe [56] and fastText [57] are two newer implemen-
tations of word embedding. FastText is unique in that it uses
sub-word frequencies to determine the vector representations of
words. The implementation of word2vec, from [55], has been run
over abstracts in PubMed, and the resulting vectors are available for
use [58].

2.5 Benchmarking There are two ways that the results of text mining can be evaluated.
First, the entities that are identified can be compared against a
manually annotated corpus to evaluate how well the system per-
forms compared to trained humans. Since humans too are prone to
error, and texts are often truly ambiguous, annotations made by
different human annotators will not always agree. To reliably assess
the results of text mining, there should be good agreement
between annotators (e.g., above 90%) to ensure that the corpus
has been annotated as consistently as possible. Both the inter-
annotator agreement and the text mining system’s agreement
with the consensus of the human annotators are generally reported
as F-score, as the harmonic mean of precision and recall, or as
Cohen’s kappa if annotations are of a set number of classes. To
create annotations, browser-based tools such as tagtog [59] and
brat [60] provide simple interfaces to highlight entities in text and
normalize them. PubAnnotation is a repository of hand-annotated
corpora that is used by the biomedical text mining community to
distribute annotations [61]. Creators of new corpora are encour-
aged to upload their work here so that it may be found and reused.
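As a minimal sketch of how these agreement measures can be computed (scikit-learn is an assumed dependency, and the token labels below are fabricated):

    from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

    # per-token class labels from two annotators over the same five tokens
    annotator_a = ["protein", "disease", "protein", "none", "disease"]
    annotator_b = ["protein", "disease", "none", "none", "disease"]

    # treating annotator A as the reference: micro-averaged precision/recall/F-score
    precision, recall, f_score, _ = precision_recall_fscore_support(
        annotator_a, annotator_b, average="micro")
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(precision, recall, f_score, kappa)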
The second benchmarking strategy is to evaluate the
co-occurrence results against a gold standard. Here, we look
directly at the end result, e.g., co-occurrences, instead of the preci-
sion and recall of specific matches. In some domains, well-defined
gold standard sets are available, for example, disease-protein inter-
actions from OMIM [62] or drug-protein interactions from Drug-
Bank [63]. The resulting scores can be compared against known
values, and a probability or a confidence can be assigned to the text
mining results. Calibration curves can be used to map the
co-occurrence scores onto a probability that the interaction is
true. Ideally, the curve would be S-shaped such that the probability
is low for low co-occurrence scores, but becomes high after a
threshold co-occurrence score is reached. The curve can be defined
for entities in the gold standard, and then the curve can be applied
to all entities.
The text mining for STRING is benchmarked using the second
approach, where the gold standard for functional protein interac-
tions comes from KEGG pathways [64]. Protein interactions are
considered to be true positives if the proteins appear in the same
KEGG pathway together. In order to assign a probability to each
interaction, STRING draws a benchmarking curve based on the
number of interactions mentioned in a publication and the percent-
age of the interactions that are true according to the gold standard.
Publications that contain very many interactions are reporting on
high-throughput screens which are prone to high false-positive
rates, whereas studies that report a smaller number of protein
interactions have a higher probability of each interaction being
true. By generating these curves for proteins that are in KEGG,
benchmarking scores can be extrapolated for protein interactions
where neither protein is in KEGG.

3 Dictionary-Based Text Mining with the JensenLab Tagger

This section details the input formats that are required to use the
JensenLab tagger. Although we focus on a particular implementa-
tion here, the same information will be needed as input to other
dictionary-based text mining pipelines. The tagger internally toke-
nizes the text, performs named entity recognition and normaliza-
tion, and can optionally calculate co-occurrences between user-
provided pairs of item types. However, benchmarking of the results
must be done by the user of the software.
The tagger is made available as a Docker image [65], which can be
downloaded from http://hub.docker.com/r/larsjuhljensen/tagger/.
Alternatively, the source code is available at
http://bitbucket.org/larsjuhljensen/tagger, and instructions are available
in the readme on how to install it on a Mac or Linux system.
The tagger interface can be used directly from the command
line, or it can be called with a Python wrapper. The tagger reads all
documents bytewise and matches bytes from the dictionary against
bytes of the corpus. This means that words containing Unicode
characters will be matched correctly, but the tagger will report all
positions in byte coordinates (as opposed to character coordinates).
If you use UTF-8 text in the dictionary or the corpus, the Python
wrapper can be used to convert positions between bytes and
characters so that the reported positions are correct. The tagger will
process the tens of millions of documents in PubMed in under 2 h
on most servers [19].
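If you work with the raw output instead of using the wrapper, the conversion is straightforward in Python; a minimal sketch (assuming the reported offsets fall on character boundaries, as the tagger's matches do):

    def byte_to_char(text, byte_offset):
        # decode the UTF-8 prefix of the given byte length and count its characters
        return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))

    text = "β-catenin binds CDK1"
    print(byte_to_char(text, 2))  # "β" occupies two bytes, so byte offset 2 is character offset 1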

3.1 Dictionary Creation: Entities and Names When performing text mining on corpora that cover a broad range
of topics, the recall of dictionary approaches is limited due to the
vast multitude of entities that need to be covered by the dictionary.
However, in the biomedical space, the terms of interest are gener-
ally better defined and limited in scope. Thus, it is feasible and
straightforward to make dictionaries that comprehensively cover
all the terms of a given type.
The dictionary, also sometimes called a lexicon, thesaurus, or
gazetteer, is simply a list of all the entities of interest and all of their
possible synonyms with common spelling, or orthographic, varia-
tions. The tagging software will then match each of these terms
against the text and return the locations at which they match.
Additionally, the results are filtered through a stopword list,
which will reject matches of any words that are known to give
many false positives.
It can be a significant investment of work to create a new
dictionary from scratch, so most dictionaries are built either from
existing databases or ontologies. For example, the dictionary used
to text mine the diseases database [4] was created from the disease
ontology [66]. Ontologies define the vocabularies, properties, and
relationships for the entities in a given subject. Many ontologies
have already been created for a range of biomedical domains, which
may provide a starting point for a new dictionary. The Open
Biomedical Ontologies Foundry [67] is a collection of ontologies
that are intended to be interoperable and standardized. BioPortal
hosts a library of more than 260 different ontologies, along with
related resources and tools that can be used to aid dictionary
creation [68].
What makes an ontology useful for text mining is primarily
having a complete list of entities and all of their possible synonyms.
This is not necessarily the same as what makes the ontology useful
for answering questions about the relationships between entities, a
more traditional use of ontologies. For text mining, having a loose
hierarchy of terms is sufficient in terms of structure, and an exten-
sive set of interrelationships between entities is not needed.
Since it is essential to have a good list of synonyms to have good
recall, it is worth putting some care into assembling the dictionary
from as many diverse sources as possible to cover lexical variants and
names that are commonly used in different disciplines. For exam-
ple, if you are using Ensembl identifiers for proteins, it would be
worthwhile to import their UniProt synonyms as well. Be aware
that terms that are used in a controlled vocabulary such as an
ontology may become out of date and incomplete as new terms
are generated too quickly to be included. If the ontology does not
include synonyms for each term, it may be possible to map syno-
nyms from other resources or other ontologies. Ontology align-
ment tools, such as AgreementMaker [69], may be helpful to map
one ontology onto another. The OntoBiotope ontology [70],
which describes bacterial environments, contains the eukaryotes
branch of NCBI taxonomy since any eukaryote is a possible envi-
ronment for bacteria to live in. If both bacterial habitats and eukar-
yotes are to be tagged, the user must decide whether such
duplication of terms is desired—does it make sense for a tomato
to be tagged as both types? In the dictionaries we built, we decided
that terms should only be tagged with one type, and we include the
item only in the class it resembles most. If you are tagging only one
type of entity, this is not an issue. Plurals and any alternate forms of
the word, such as adjectives, must be added to the dictionary
explicitly. For proteins, the species prefixes should also be added,
for example, the synonyms hCDK1 and mCDK1 for human and
mouse CDK1, respectively. The dictionary is case insensitive, mean-
ing that terms in the dictionary with any case will match terms in
the text with any case. The JensenLab tagger uses a custom hashing
algorithm that ignores case and hyphens. For example, “cyclin-
dependent kinase 1” will hash to the same result as “Cyclin Depen-
dent Kinase 1” so alternate forms involving hyphens do not need to
be added to the dictionary. Note that the run time of the tagger
does not scale with the size of the dictionary, so whereas adding ten
times as many names to the dictionary will increase memory use, it
will not noticeably slow down the tagger performance. It is there-
fore safe to generate both “s” and “es” plural forms, and so forth,
for each entry in the dictionary even if this results in many nonsense
words, as it will improve recall without impacting performance or
precision. These steps to expand the recall may result in false-
positive matches that can then be blocked with the stopword list.
To make the tagger memory efficient and to facilitate the use of
very large dictionaries, it represents each entity by an integer (called
a serial) rather than by their names as strings. The dictionary files
are prepared by assigning a serial number to each entity. The
entities file contains the assigned serial number, another numeric
value that represents the type of the entity, and lastly a string that
represents the normalized form of the entity. The names file then
cross references the serial number and contains an entry for all
synonyms for the entity. Scripts that convert the ontology format,
obo, to this dictionary format are also available in the larsjuhljensen
bitbucket repository.
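A minimal sketch of writing such files in Python follows; the serials are arbitrary, type 9606 marks human proteins as described in the next paragraph, and the disease type code and identifiers are invented for the example:

    entities = [
        (1, 9606, "ENSP00000378699"),  # a human protein (type = NCBI taxid)
        (2, -26, "DOID:14330"),        # a disease (type code here is illustrative)
    ]
    names = [
        (1, "CDK1"),
        (1, "cyclin-dependent kinase 1"),
        (1, "hCDK1"),
        (2, "Parkinson's disease"),
        (2, "Parkinson disease"),
    ]

    with open("entities.tsv", "w") as out:
        for serial, entity_type, normalized_id in entities:
            out.write(f"{serial}\t{entity_type}\t{normalized_id}\n")

    with open("names.tsv", "w") as out:
        for serial, synonym in names:
            out.write(f"{serial}\t{synonym}\n")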
A list of predefined entity types is available in the readme file for
the tagger, but more can be added. The convention we follow is
that the types of protein entities are specified by using the taxid of
their species as the type, and all positive types are interpreted as
genes/proteins from the corresponding NCBI taxonomic identi-
fier, but it is possible to use alternative type definitions with the
tagger. The proteins for a species will only be tagged if a name for
the species corresponding to the protein type has also been tagged
in the document. This helps prevent false positives caused by gene
names from organisms that the text is not about and further helps
reduce the ambiguity caused by many organisms sharing gene
names, which will be discussed later.
Expanding the list of synonyms with all possible plurals and
variants is done to gain recall. In some cases, this will lead to
generating terms that have a different meaning or that should not
be tagged. These false positives will be filtered out in the next step,
where we apply a stopword list to increase precision.
Pre-made dictionaries for proteins, species, GO terms, tissues,
diseases, environments, chemicals, and phenotypes are linked from
the readme file in the bitbucket repository. Stopwords and groups
files for these dictionaries are also available from the same location.

3.2 Stopwords After generating the dictionaries, the tagger should be run over the
intended corpus, and the output should be inspected for false
positives. Depending on the size of the corpus, it will be impossible
to look at all the results, so priority should be given to the terms
that result in the most hits, since these will affect the overall quality
the most. Any terms that should not be tagged should be added to
the stopword list so that this specific term will not be tagged.
Whereas the dictionary is case insensitive, the stopwords are case
sensitive, which allows for specific case variants to be stopworded
while still tagging other variants that are not on the stopword list.
This includes hyphens, so “cm-1” can be stopworded (likely meaning
per centimeter, cm⁻¹), while “CM-1” (a gene found in Arabidopsis)
would still be tagged if it were present in the dictionary.
Similarly, “Ran” is the symbol for a human gene and is tagged, but
“ran” is stopworded as it otherwise would result in a plethora of
false positives.
The format of the stopwords file is two columns separated by a
tab, the first containing the term and the second containing “t” if
the term is a stopword and “f” if it is not a stopword.
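A minimal sketch of reading such a file and filtering matches in Python (the filename is hypothetical):

    stopwords = {}
    with open("stopwords.tsv") as f:
        for line in f:
            term, flag = line.rstrip("\n").split("\t")
            stopwords[term] = (flag == "t")

    def keep_match(matched_text):
        # reject a match only if this exact, case-sensitive string is flagged "t"
        return not stopwords.get(matched_text, False)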

3.3 Groups Providing the hierarchical relationships between the entities in a
groups file enables more general categories of the entity to be
tagged along with the specific term. For example, the disease
ontology provides the relationship that Parkinson’s disease is a
subclass of neurodegenerative diseases. Running the tagger with a
groups file that includes this relation will result in every instance of
Parkinson’s disease in the text also being tagged with neurodegen-
erative disease. This is useful for building an index, so that a query
for neurodegenerative diseases could also retrieve mentions of Par-
kinson’s disease even though the more general term did not appear
in the text. Groups files are also used for disambiguation and for
co-occurrence counting, which is discussed in later sections. The
groups file uses the serial numbers defined in the entities file to map
children to parents. In case of a tree structure or directed acyclic
graph, all relationships must be specified in the groups file, i.e., a
child term must be explicitly linked not only to its parents but also
to its grandparents. Because relationships are not inferred through
indirect relationships, it is also possible to represent group member-
ships that do not fit a tree structure.
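A minimal sketch of expanding a direct child-to-parent map into the fully explicit form the groups file requires (the serials are invented):

    # direct is-a relations between serials, e.g., 3 = Parkinson's disease,
    # 2 = neurodegenerative disease, 1 = disease of the nervous system
    parents = {3: {2}, 2: {1}}

    def ancestors(serial):
        seen, stack = set(), list(parents.get(serial, ()))
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(parents.get(p, ()))
        return seen

    with open("groups.tsv", "w") as out:
        for child in parents:
            for ancestor in ancestors(child):
                out.write(f"{child}\t{ancestor}\n")  # child linked to every ancestor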

3.4 Disambiguation The same name is often used to refer to multiple entities; such
names are referred to as homonyms or ambiguous names. We try to
disambiguate them at several stages of the text mining process.
During the dictionary build, if a gene name is the prescribed
standard gene name for one gene, and is also a rarely used synonym
for another gene, it is most likely that any term in the corpus refers
to the former, and so the term will only be added as a name for
it. Names may also conflict across types of entities, for example,
“ECD” can be used to refer to the recommended human gene
symbol for ecdysoneless homolog, a regulator of p53 stability and
function, or to refer to the abbreviation for endocardial cushion
defect, a congenital heart condition. In such cases, the dictionary
build should only allow this name to appear in one of the diction-
aries. There is no strict rule that dictionaries must not contain the
same name for different types, and the tagger will tag them all, but
it is better to disambiguate them prior to doing named entity
recognition so that the results are much easier to work with.
If protein names conflict across organisms, these are all added
to the dictionary and are disambiguated at tagging time. For exam-
ple, the gene cyclin-dependent kinase 1, or CDK1, which promotes
progression through the cell cycle, has the same name in both
human and mouse. In order to not tag all instances of this protein
as both mouse and human CDK1 when only one was intended by
the author, the tagger will require the species to be tagged in the
text first. The only exception to this rule is for human proteins,
which can be tagged without requiring that the species be explicitly
mentioned in the text, because very often the species is implied to
be human in biomedical abstracts. This differs from the tagging of
other types, which have no interdependencies.
Using a groups file will further help disambiguate terms.
Groups files for proteins are generated from eggNOG orthology
relationships between proteins, which are hierarchical as of egg-
NOG 4.5 [71]. Often protein names refer to the same gene in
multiple organisms, as with CDK1, but this is not always the case.
For example, the gene symbol CDC2 refers to different genes in
S. cerevisiae and Sz. pombe, and these should be treated differently
than matches in the CDK1 example. If the text refers to a term that
is the name of a protein for two different species, and the names of
both species are present in the document, and the proteins are in
the same orthology group, then the protein will be tagged as
belonging to both species. If both species are identified, but the
proteins are not in the same orthologous group, then it will not be
tagged. In the last case, when both species are identified, the tagger
attempts to determine if one of the species is the main topic of the
article, and if it can determine this by counting the number of other
protein mentions that are unique to that organism, then it will
disambiguate the ambiguous protein to that species.

3.5 Co-occurrences The tagger will optionally score co-occurrences between terms of
the same or different types, as specified in the type-pairs file. To
score co-occurrences, the tagger will count terms that occur in the
same sentence, same paragraph, and same document with decreas-
ing levels of weight. The score is then adjusted for the fact that
some terms occur very frequently in the corpus, and so they have a
high prior probability of co-occurring with any other protein just
by their abundance.
$$c_{i,j} = \sum \left( \delta_s(i,j)\, w_s + \delta_p(i,j)\, w_p + \delta_d(i,j)\, w_d \right) \qquad (1)$$

where the sum is over all sentences, paragraphs, and documents; δ_s(i,j) evaluates to 1 if terms i and j occur in the same sentence and 0 otherwise, and similarly for paragraphs (δ_p) and documents (δ_d).

$$s(i,j) = c_{i,j}^{\alpha} \left( \frac{c_{i,j}\; c_{\cdot,\cdot}}{c_{i,\cdot}\; c_{\cdot,j}} \right)^{1-\alpha} \qquad (2)$$

where a dot (·) subscript denotes summation over all entities of the same type. These results should
then be benchmarked, as previously discussed.
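A minimal sketch of Eqs. 1 and 2 in Python follows; the corpus representation and the weights (decreasing from sentence to document) and exponent α are illustrative, not the tagger's built-in values:

    from collections import defaultdict
    from itertools import combinations

    w_s, w_p, w_d, alpha = 2.0, 1.0, 0.5, 0.6  # illustrative weights and exponent

    def cooccurrence_scores(corpus):
        # corpus: list of documents; document: list of paragraphs;
        # paragraph: list of sentences; sentence: set of entity serials
        c = defaultdict(float)
        for document in corpus:
            doc_entities = set()
            for paragraph in document:
                par_entities = set().union(*paragraph) if paragraph else set()
                for sentence in paragraph:
                    for pair in combinations(sorted(sentence), 2):
                        c[pair] += w_s          # same-sentence co-occurrence (Eq. 1)
                for pair in combinations(sorted(par_entities), 2):
                    c[pair] += w_p              # same-paragraph co-occurrence
                doc_entities |= par_entities
            for pair in combinations(sorted(doc_entities), 2):
                c[pair] += w_d                  # same-document co-occurrence

        total = sum(c.values())                 # c..
        margin = defaultdict(float)             # c_i. and c_.j
        for (i, j), count in c.items():
            margin[i] += count
            margin[j] += count
        # Eq. 2: adjust for the marginal abundance of each entity
        return {(i, j): count ** alpha * (count * total / (margin[i] * margin[j])) ** (1 - alpha)
                for (i, j), count in c.items()}

    corpus = [[[{"CDK1", "CCNB1"}, {"CDK1"}], [{"CDK1", "CCNB1", "WEE1"}]]]
    print(cooccurrence_scores(corpus))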
The type-pairs file can be used to request only that a subset of
the possible pairs of types are scored for co-occurrences; otherwise
the full cross product of nonprotein types will be scored. Cross-
species interactions between proteins can also be specified explicitly
in the type-pairs file. Specifying relationships between the entities
via a groups file will not impact the scoring.

4 Challenges and Shortcomings

This chapter has so far described the steps to approach text mining,
following the specific example of the JensenLab tagger. Here, we
would like to discuss some of the common errors that are generated
by dictionary systems and possible ways that they can be mitigated.
The main source of false positives comes from the fact that
dictionary-based methods will faithfully identify all instances of a
word, including those where the word refers to a different concept.
A dictionary approach is not able to distinguish between homo-
nyms such as SDS, which is commonly used in the biomedical
literature to refer to both a human protein and a detergent. These
two uses cannot be distinguished without looking at the context of
the sentence. Methods such as conditional random fields [72],
which build a model of the words that surround entities, can be
used to better discriminate between such homographs.
False negatives result from the fact that a dictionary approach
can only find what is in the dictionary. If the dictionary is incom-
plete, then the missing terms cannot be matched. Machine learning
approaches may be able to extrapolate from training data to return
novel results but require a source of well-annotated training data.
In some cases, the term will be present in the dictionary but will
be referred to in a way that the dictionary will not recognize it. For
example, a phrase such as “CDK1-3” should match three entities:
CDK1, CDK2, and CDK3. Some systems, such as the one described
in [73], recognize disjoint entities like this.
A similar issue occurs with co-references: words or phrases such
as “it” or “the protein” that refer to an entity without themselves
naming it. Resolving co-references correctly requires understanding
the structure of the sentence, since “it” refers back to the last noun
of the correct type for the sentence to make sense. This is not a major issue for our
co-occurrence scoring approach, since not resolving the
co-reference will degrade a same-sentence mention to a same-
paragraph mention. If co-reference resolution is needed, Stanford
provides a co-reference resolver in their CoreNLP package, which
achieves an F-score of 60.0 on English text using neural network-
based co-reference resolution. Also see the review [74] for an
overview of co-reference resolution methods.
Any text mining approach must account for the fact that lan-
guage itself is imprecise. Even though scientific language aims to be
clear and rigorous, the ground truth is often ambiguous, and
human readers may disagree on the correct interpretation of the
text. For example, a viral capsid is the protein structure that sur-
rounds the virus, and it is assembled from protein monomers,
which are also called capsid. Specifically, what the word “capsid”
refers to is sometimes unclear. Further, scientific articles often have
word limits, which means that important information about an
experiment can be relegated to sections like online supplemental
methods that may not be included in the text mining corpus. It has
been reported by the UniProt curators [75], who manually curate
articles for the UniProt database, that it is often a challenge to
determine in which animal model an experiment has been per-
formed. If this is challenging for human curators because the
information is not available, then we cannot expect text mining to
do better. Lastly, text mining done perfectly will extract exactly
what is stated in the text, regardless of whether or not it is actually
true. Changes in truth could be due to new discoveries causing old
“facts” to be disproven or due to shifts in word usage over time.
Changes in the definitions and terminology used to describe diseases
and treatments, in recommended nomenclature, and in the style of
language used [76] can pose challenges for text mining, if the
changes in use are not considered or are not part of the training
data (if applicable).

References

1. Lu Z (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database 2011:1–13. https://doi.org/10.1093/database/baq036
2. The UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212. https://doi.org/10.1093/nar/gku989
3. Attwood T, Agit B, Ellis L (2015) Longevity of biological databases. EMBnet.journal 21.0. http://journal.embnet.org/index.php/embnetjournal/article/view/803
4. Pletscher-Frankild S et al (2015) DISEASES: text mining and data integration of disease-gene associations. Methods 74:83–89. https://doi.org/10.1016/j.ymeth.2014.11.020
5. Junge A et al (2017) RAIN: RNA-protein association and interaction networks. Database 2017:baw167
6. Binder JX et al (2014) COMPARTMENTS: unification and visualization of protein subcellular localization evidence. Database 2014:1–9. https://doi.org/10.1093/database/bau012
7. Santos A et al (2015) Comprehensive comparison of large-scale tissue expression datasets. PeerJ 3:e1054. https://doi.org/10.7717/peerj.1054
8. Meaney C et al (2016) Text mining describes the use of statistical and epidemiological methods in published medical research. J Clin Epidemiol 74:124–132. https://doi.org/10.1016/j.jclinepi.2015.10.020
9. IDG Knowledge Management Center (2016) Unexplored opportunities in the druggable genome. Nat Rev Drug Discov. http://www.nature.com/nrd/posters/druggablegenome/index.html
10. Swanson DR (1986) Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med 30:7–18
11. Swanson DR, Smalheiser NR (1996) Undiscovered public knowledge: a ten-year update. KDD-96 Proceedings 56(2):103–118. https://doi.org/10.2307/4307965
12. Swanson DR (1988) Migraine and magnesium: eleven neglected connections. Perspect Biol Med
13. Russo F et al (2018) miRandola 2017: a curated knowledge base of non-invasive biomarkers. Nucleic Acids Res 46:D354–D359. https://doi.org/10.1093/nar/gkx854
14. Orchard S et al (2014) The MIntAct project - IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 42:358–363. https://doi.org/10.1093/nar/gkt1115
15. Xenarios I et al (2002) DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30(1):303–305. https://doi.org/10.1093/nar/30.1.303
16. Bader GD, Betel D, Hogue CWV (2003) BIND: the biomolecular interaction network database. Nucleic Acids Res 31(1):248–250. https://doi.org/10.1093/nar/gkg056
17. Rodriguez-Esteban R (2009) Biomedical text mining and its applications. PLoS Comput Biol 5(12):1–5. https://doi.org/10.1371/journal.pcbi.1000597
18. Pafilis E et al (2009) Reflect: augmented browsing for the life scientist. Nat Biotechnol 27(6):508–510. https://doi.org/10.1038/nbt0609-508
19. Pafilis E et al (2013) The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLoS ONE 8(6):2–7. https://doi.org/10.1371/journal.pone.0065390
20. Szklarczyk D et al (2016) The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res 45(D1):D362–D368. https://doi.org/10.1093/nar/gkw937
21. Cook H, Pafilis E, Jensen L (2016) A dictionary- and rule-based system for identification of bacteria and habitats in text. In: Proceedings of the 4th BioNLP shared task workshop, p 50–55
22. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7(2):119–129. https://doi.org/10.1038/nrg1768
23. Arighi CN et al (2014) BioCreative-IV virtual issue. Database 2014:1–6. https://doi.org/10.1093/database/bau039
24. Deléger L et al (2016) Overview of the bacteria biotope task at BioNLP shared task 2016. In: Proceedings of the 4th BioNLP shared task workshop, p 12–22
25. Huang CC, Lu Z (2016) Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 17(1):132–144. https://doi.org/10.1093/bib/bbv024
26. Yepes AJ, Verspoor K (2014) Literature mining of genetic variants for curation: quantifying the importance of supplementary material. Database 2014:bau003. https://doi.org/10.1093/database/bau003
27. Roque FS et al (2011) Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol 7(8):e1002141. https://doi.org/10.1371/journal.pcbi.1002141
28. Ford E et al (2016) Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc 23(5):1007–1015. https://doi.org/10.1093/jamia/ocv180
29. Thomas CE et al (2014) Negation scope and spelling variation for text-mining of Danish electronic patient records. In: Proceedings of the 5th international workshop on health text mining and information analysis, p 64–68
30. Kuhn M et al (2016) The SIDER database of drugs and side effects. Nucleic Acids Res 44(D1):D1075–D1079. https://doi.org/10.1093/nar/gkv1075
31. Pafilis E et al (2015) ENVIRONMENTS and EOL: identification of environment ontology terms in text and the annotation of the encyclopedia of life. Bioinformatics 31(11):1872–1874. https://doi.org/10.1093/bioinformatics/btv045
32. Yang Y et al (2017) Exploiting sequence-based features for predicting enhancer-promoter interactions. Bioinformatics 33(14):i252–i260. https://doi.org/10.1093/bioinformatics/btx257
33. Sayers E (2010) A general introduction to the E-utilities. National Center for Biotechnology Information (US), Bethesda, MD, pp 1–10
34. Westergaard D et al (2017) Text mining of 15 million full-text scientific articles. bioRxiv. https://doi.org/10.1101/162099
35. Eysenbach G (2006) Citation advantage of open access articles. PLoS Biol 4(5):692–698. https://doi.org/10.1371/journal.pbio.0040157
36. Handke C, Guibault L, Vallbé JJ (2015) Is Europe falling behind in data mining? Copyright's impact on data mining in academic research. In: New avenues for electronic publishing in the age of infinite collections and citizen science: scale, openness and trust. Proceedings of the 19th international conference on electronic publishing (Elpub 2015), pp 120–130. https://doi.org/10.3233/978-1-61499-562-3-120
37. Noonburg D. XpdfReader. http://www.xpdfreader.com/
38. Ramakrishnan C et al (2012) Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol Med 7:7. https://doi.org/10.1186/1751-0473-7-7
39. Kim D, Hong Y (2011) Figure text extraction in biomedical literature. PLoS ONE 6(1):1–11. https://doi.org/10.1371/journal.pone.0015338
40. Free Software Foundation. iconv. http://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.15/iconv.1.html
41. Moolenaar B. Vim. https://vim.sourceforge.io/
42. Przybyla P et al (2016) Text mining resources for the life sciences. Database 2016:1–30. https://doi.org/10.1093/database/baw145
43. Chen D, Manning CD (2014) A fast and accurate dependency parser using neural networks. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), p 740–750
44. Recasens M, De Marneffe MC, Potts C (2013) The life and death of discourse entities: identifying singleton mentions. In: Proceedings of NAACL-HLT 2013, p 627–633. http://www.aclweb.org/anthology-new/N/N13/N13-1071.pdf
45. NLTK Project. Natural Language Toolkit. http://www.nltk.org/
46. Sayers EW et al (2009) Database resources of the national center for biotechnology information. Nucleic Acids Res 37:D5–D15. https://doi.org/10.1093/nar/gkn741
47. Gerner M, Nenadic G, Bergman CM (2010) LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 11(1):85. https://doi.org/10.1186/1471-2105-11-85
48. Leaman R, Lu Z (2016) TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics 32(18):2839–2846. https://doi.org/10.1093/bioinformatics/btw343
49. Cho H-C et al. NERsuite: a named entity recognition toolkit. https://github.com/nlplab/nersuite
50. Hogenboom F et al (2011) An overview of event extraction from text. CEUR Workshop Proceedings 779:48–57
51. Ramos J (2003) Using TF-IDF to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning, p 1–4
52. Damashek M (1995) Gauging similarity with n-grams: language-independent categorization of text. Science 267(5199):843–848. https://doi.org/10.1126/science.267.5199.843
53. Björne J, Salakoski T (2015) TEES 2.2: biomedical event extraction for diverse corpora. BMC Bioinformatics 16(Suppl 16):S4. https://doi.org/10.1186/1471-2105-16-S16-S4
54. Lever J, Jones SJM (2016) VERSE: event and relation extraction in the BioNLP 2016 shared task. In: Proceedings of the 4th BioNLP shared task workshop, p 42–49
55. Mikolov T, Yih W-T, Zweig G (2013) Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT 2013, p 746–751. https://www.aclweb.org/anthology/N/N13/N13-1090.pdf
56. Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. https://doi.org/10.3115/v1/D14-1162
57. Bojanowski P et al (2016) Enriching word vectors with subword information. arXiv:1607.04606. http://arxiv.org/abs/1607.04606
58. Pyysalo S et al (2012) Distributional semantics resources for biomedical text processing
59. Cejuela JM et al (2014) Tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles. Database 2014:1–8. https://doi.org/10.1093/database/bau033
60. Stenetorp P, Pyysalo S, Topic G. Brat rapid annotation tool. http://brat.nlplab.org/
61. Database Center for Life Science. PubAnnotation. http://www.pubannotation.org/
62. Johns Hopkins University McKusick-Nathans Institute of Genetic Medicine. Online Mendelian Inheritance in Man, OMIM. https://www.omim.org/
63. Law V et al (2014) DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res 42(D1):1091–1097. https://doi.org/10.1093/nar/gkt1068
64. Kanehisa M et al (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 45(Database):D353–D361
65. Docker Inc. Docker. https://www.docker.com/
66. Jupp S et al (2015) A new ontology lookup service at EMBL-EBI. CEUR Workshop Proceedings 1546:118–119
67. Smith B et al (2007) The OBO foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11):1251–1255. https://doi.org/10.1038/nbt1346
68. Whetzel PL et al (2011) BioPortal: enhanced functionality via new Web services from the national center for biomedical ontology to access and use ontologies in software applications. Nucleic Acids Res 39(Suppl 2):541–545. https://doi.org/10.1093/nar/gkr469
69. Faria D et al (2013) The AgreementMakerLight ontology matching system. Springer, pp 527–541. https://doi.org/10.1007/978-3-642-41030-7_38
70. Nédellec C (2013) OntoBiotope. INRA
71. Huerta-Cepas J et al (2015) eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 44(Database issue):286–293. https://doi.org/10.1093/nar/gkv1248
72. Finkel JR, Kleeman A, Manning CD (2008) Feature-based, conditional random field parsing. In: Proceedings of the 46th meeting of the ACL, p 959–967
73. Tang B et al (2013) Recognizing and encoding disorder concepts in clinical text using machine learning and vector space. In: Proceedings of the ShARe/CLEF evaluation lab. http://www.clef-initiative.eu/documents/71612/d596ae25-c4b3-4a9a-be4a-648a77712aaf
74. Zheng J et al (2011) Coreference resolution: a review of general methodologies and applications in the clinical domain. J Biomed Inform 44(6):1113–1122. https://doi.org/10.1016/j.jbi.2011.08.006
75. Jensen LJ (2017) Personal communication
76. Thompson P et al (2016) Text mining the history of medicine. PLoS ONE 11(1):1–33. https://doi.org/10.1371/journal.pone.0144717
Chapter 6

Leveraging Big Data to Transform Drug Discovery


Benjamin S. Glicksberg, Li Li, Rong Chen, Joel Dudley, and Bin Chen

Abstract
The surge of public disease and drug-related data availability has facilitated the application of computational
methodologies to transform drug discovery. In the current chapter, we outline and detail the various
resources and tools one can leverage in order to perform such analyses. We further describe in depth the
in silico workflows of two recent studies that have identified possible novel indications of existing drugs.
Lastly, we delve into the caveats and considerations of this process to enable other researchers to perform
rigorous computational drug discovery experiments of their own.

Key words Systems pharmacology, Drug discovery, Big data, Electronic medical records, Clinical
informatics, Bioinformatics, Drug repurposing, Drug repositioning, Gene expression data,
Pharmacogenomics

1 Introduction

Preclinical drug discovery efforts are typically led by target-based or
systems (phenotypic)-based strategies. In target-based screenings,
the underlying goal is development around a certain target or path-
way with a priori evidence for its role in a disease or phenotype.
Systems-based screenings are typically performed in a high-
throughput fashion without an initial hypothesis of the target. In
these models, many drugs, with and without known pharmacology,
are tested against an assay evaluating properties of the phenotype of
interest. From 1998 to 2006, 70% of first-in-class drugs discovered
have been target-based, with only 30% directed by systems-based
approaches [1]. This traditional drug discovery framework is both
costly and time-consuming, with a relatively low overall average success rate of
9.6% across all diseases [2]. While the highest average overall success
rate is 26.1%, for hematology, the lowest is a mere 5.1%, for
oncology.
Other issues with traditional clinical trial design revolve around
biased trials and selective publishing [3] that affect both internal
and external validity [4] and therefore the implications of studies.


One study evaluated the diversity of race, ethnicity, age, and sex in
participants of cancer trials [5]. They found large differences in
representation in all these realms: for instance, lower enrollment
fractions in Hispanic and African-American participants compared
to Caucasian participants ( p < 0.001 for both comparisons). They
also found an inverse relationship between age and enrollment
fraction across all racial and ethnic groups and significant differ-
ences for men and women depending on the disease (e.g., men had
higher enrollment fractions for lung cancer; p < 0.001). These
differences in representation also have serious implications in prac-
tice when, for instance, a treatment studied in one population is
given to another. Making clinical decisions based on studies with
these issues can lead to expensive, sub-optimal treatment rates or
missed opportunities at best and harmful events at worst. There are
plenty of examples of randomized controlled trials that were judged
to be beneficial but shown to be harmful (e.g., fluoride treatment
for osteoporosis) [6]. Another weakness of clinical trial design is the
relaxing of inclusion criteria for the disease group in order to bolster
study numbers, which is especially problematic in heterogeneous
diseases.
These issues, along with the vast resources required for these
studies and overall low success rates, necessitate complementary
approaches. The recent surge of biomedical information pertaining
to molecular characterizations of diseases and chemical compounds
has facilitated a new age of drug discovery through computational
predictions [7, 8]. By analyzing FDA-approved compounds to
discover novel indications, one can leverage the massive amount
of research and effort that have already been completed and bypass
many steps in the traditional drug development framework. While
there are many innovative strategies to reduce cost and improve
success rates during the traditional drug discovery process [9, 10],
drug repurposing is a viable and growing discipline with documen-
ted advantages: for instance, traditional drug discovery pipelines
take on average from 12 to 16 years from inception to market and
cost on average one to two billion dollars, while successful drug
repurposing can be done in less than half the time (6 years on
average) at a quarter of the cost (~$300 million) [11, 12]. Incorpor-
ating drug repurposing strategies into workflows can drastically
increase productivity for biopharmaceutical companies [13], and
there are growing numbers of successful examples that prove its
worth [14–17].
Thalidomide, for instance, was originally developed in Ger-
many and England as a sedative and was prescribed to treat morn-
ing sickness in pregnant women. It soon became apparent that this
treatment caused severe and devastating skeletal birth defects in
thousands of babies born to mothers taking it during the first
trimester of pregnancy. Years later, after the drug was banned for
this purpose, it was serendipitously found to be effective for the
treatment of erythema nodosum leprosum, a very serious compli-
cation of leprosy. In a subsequent double-blind study of over 4500
patients with this condition, thalidomide treatment led to full
remission in 2 weeks for 99% patients [18]. Not only is thalidomide
the current (and only) standard of care for erythema nodosum
leprosum, it has also proven beneficial in other disorders. Soon
after, in fact, it was found to significantly improve survival in
patients with multiple myeloma [19]. Celgene, the biopharmaceu-
tical company responsible for driving the resurgence of thalidomide
through these repurposed indications, derived 75% or more of its
revenue in 2016 from this one drug, primarily for treating multiple
myeloma [20].
This case, although extreme, illustrates the value and possibi-
lities of drug repurposing even in the face of documented failure.
There are countless avenues to explore for new indications of old
drugs, which are not necessarily surveyed in traditional clinical trial
design strategies. In the current chapter, we will outline and
describe the tools and best-practice methodologies that can be
used to successfully leverage big data for drug discovery by detail-
ing the pipelines of two recent studies as templates. We further
discuss limitations and important considerations of this process in
Subheading 4. We expect that one can apply the approach to
discover new therapeutics for other diseases of interest after reading
this chapter.

2 Materials

In this section, we provide an overview of the materials and tools
that can be used to perform an in silico drug discovery experiment,
along with how to go about accessing them. We also provide a
visual guide for this process in Fig. 1. First, we will discuss recom-
mended software and computational resources that can be used for
this purpose. Next we describe the types of data that are typically
used in these experiments along with ontological resources on how
they can be effectively integrated. We then go into the specific
databases that house disease and chemogenomic-related gene
expression data. We conclude with specialized software and
packages that can enhance figure making capabilities to visualize
ensuing results.

2.1 Software and Computational Resources Drug discovery using big data resources is accomplished computationally.
For the individual researcher, a modern computer is really
the only hardware requirement. For software, essentially any pro-
gramming language whether open-source (e.g., R [21], Python
[22]) or closed and commercially licensed (e.g., SAS [SAS Institute
Inc., Cary, NC]) can perform data organization, statistical analyses,
and figure generation with the inclusion of freely available packages
[Fig. 1 image: workflow panels labeled Molecular Profiles, Big Data Resources, Ontological Harmonization, Pre-processing/Meta-analysis, Chemogenomic Associations, Ranking, and Validation]
Fig. 1 Full generic workflow for enabling drug discovery through big data resources. We start by illustrating the mechanism behind disease and drug gene expression signatures and highlighting a few public repositories where they can be found. Based on research focus, it is often recommended to harmonize these disparate sources of data through use of ontologies. Multiple signatures per disease or drug can be integrated through rigorous meta-analysis procedures. Chemogenomic association testing assesses similarity between drug and disease signatures and can be performed using procedures like the Kolmogorov-Smirnov (KS) test. These drug signatures can then be ranked according to their correlation, or anticorrelation, to the disease signature of interest. Drug signatures that are highly anticorrelated to the disease signature are potential treatment candidates. Drug candidates that are selected for follow-up need further validation, in the form of in silico (e.g., other external datasets or electronic medical records), in vitro, and/or in vivo experiments
(e.g., SciPy [23] for Python) when needed. There exist numerous
resources (e.g., edX; https://www.edx.org) that provide an intro-
duction to using these programming languages for data science.
For infrastructure, cloud computing (e.g., Amazon Web Services
Cloud) enables managing large datasets and performing computa-
tionally intensive tasks without the need to build in-house
clusters.
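As a small taste of the kind of computation involved (the gene-level fold changes below are fabricated, and this is only a sketch of the correlation-based ranking idea illustrated in Fig. 1, not the full pipeline):

    from scipy import stats

    # hypothetical log2 fold changes for the same five genes
    disease_signature = [2.1, 1.8, -0.5, 3.0, -2.2]  # disease vs. healthy
    drug_signature = [-1.9, -1.5, 0.4, -2.8, 2.0]    # drug-treated vs. untreated

    rho, p_value = stats.spearmanr(disease_signature, drug_signature)
    print(rho, p_value)  # a strongly negative rho flags a reversal (treatment) candidate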

2.2 Ontologies and Reference Databases The landscape of big data in biomedicine is expansive yet unsystematic,
encompassing a tremendous number of data points across
a multitude of modalities. As such, there are clear hurdles in precise
characterization and proper integration procedures. To address
these challenges, researchers have created tools and ontologies to
map and normalize these disparate data types in order to facilitate
data harmonization and reproducible methodologies. This is espe-
cially important for the purposes of leveraging big data for drug
discovery, as many different data types have to be seamlessly and
reproducibly integrated along the entire drug discovery pipeline. A
more comprehensive list of these various resources can be found in
other related reviews [24].
The exponential growth of data entities with heterogeneous
data types from multiple resources calls for developing ontologies
to define entities and centralized reference databases. Meta-
thesauruses, like the Unified Medical Language System (UMLS)
[25], have been developed to organize, classify, standardize, and
distribute key terminology in biomedical information systems and
are invaluable for computational biology research. Essentially, each
medical term (e.g., disease) has a Concept Unique Identifier (CUI)
code that then can be referenced from other related ontologies.
The Systematized Nomenclature of Medicine-Clinical Terms
(SNOMED-CT; http://www.ihtsdo.org/snomed-ct/), for
instance, is one of the largest healthcare-related ontologies and
has over 300,000 medical concepts ranging from body structure
to clinical findings. There are specific ontologies that aim to char-
acterize the continually evolving representations of the phenotypic
space. Many clinical datasets encode diseases using International
Classification of Diseases (ICD) codes, for instance. The Disease
Ontology (http://disease-ontology.org/) organizes and standar-
dizes various aspects (e.g., synonyms, relationships, ICD codes) of
disease-oriented clinical topics into representative terms. The
Human Phenotype Ontology (http://human-phenotype-ontol
ogy.github.io/) has a similar goal but expands to broader
aspects of human phenotypes, including abnormalities and side
effects. There are many resources that organize information per-
taining to various properties of genetic data, such as dbSNP
(https://www.ncbi.nlm.nih.gov/projects/SNP/), dbVar
(https://www.ncbi.nlm.nih.gov/dbvar), RegulomeDB (informa-
tion regarding regulatory elements; http://www.regulomedb.
org/), and UniProt (protein sequence and functional information;
http://www.uniprot.org/). The 1000 Genomes Project (http://
www.internationalgenome.org/) is a collection of thousands of
whole genomes from populations across the globe. The Exome
Aggregation Consortium (ExAC) has aggregated whole-
exome data for over 60,000 unrelated individuals from multiple
contributing projects at the time of this publication. There are also
resources that are specifically focused on compiling information
related to the associations of genotypes and phenotypes. The
Online Mendelian Inheritance in Man (OMIM; https://www.
omim.org/) is an online compendium of genotype/phenotype
information focusing on Mendelian disorders. As of June
15, 2017, this resource has information on over 6000 phenotypes
for which the molecular basis is known and over 3700 genes with
phenotype-causing mutations. Other related resources include the
Human Gene Mutation Database (HGMD; http://www.hgmd.cf.
ac.uk/) and ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/).
The Mouse Genome Informatics (MGI; http://www.informatics.
jax.org/) is a collection of data pertaining to various experiments
performed on mice, including disease models with phenotype and
genotype information.
In the drug space, medication and prescription data are often
unsystematically organized due to issues like conflicting nomencla-
ture or spellings (e.g., brand name vs. generic name,
American vs. British terminology). Additionally, the same medica-
tion may be prescribed with different dosages, minor formulation
differences, and routes of administration. All the information is
presented as free text in the medical records, requiring additional
text mining effort to connect them to other modalities. RxNorm
[26] was developed as a UMLS-based, standardized medication
vocabulary that intersects with known clinical knowledge reposi-
tories like Micromedex (Micromedex Solutions, Truven Health
Analytics, Inc. Ann Arbor, MI). Essentially, related complex drug
name strings can be mapped to a common identifier. Linking
medications to an RxNorm identifier facilitates easy connection to
other related resources that include other modalities of data, such
as SIDER [27] and Offsides/Twosides [28], which document known
and predicted drug-drug interactions and connect medica-
tions to known clinical side effects. Through public databases like
DrugBank [29], one can cross-reference drug targets, pharmaco-
logical properties, and clinical indications. Using RepurposeDB
(http://repurposedb.dudleylab.org/) [30], one can explore the
known drug repurposing space and explore various factors (e.g.,
chemical properties) that might underlie these successes. The US
National Institutes of Health clinical trial repository (https://
ClinicalTrials.gov) is a collection of trial data across various
diseases and treatments along with study outcomes; it opens the
door to accessing and reanalyzing these data [31], such as for
research into medication efficacy at various trial stages.
PubChem (https://pubchem.ncbi.nlm.nih.gov/) is a reference
database for information on the biological activities of small mole-
cules, including compound structure and substance information.
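
As a minimal sketch of this kind of normalization (assuming the
public RxNav REST service and its rxcui endpoint; the example
strings are illustrative), free-text medication names can be
collapsed to RxNorm identifiers as follows:

import requests

def name_to_rxcui(name):
    """Map a free-text medication string to an RxNorm concept identifier."""
    resp = requests.get("https://rxnav.nlm.nih.gov/REST/rxcui.json",
                        params={"name": name})
    resp.raise_for_status()
    ids = resp.json().get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None  # None when no concept matches

# Brand and generic strings for the same ingredient should collapse to
# the same concept, enabling joins against DrugBank, SIDER, and others.
for raw in ("Tylenol", "acetaminophen"):
    print(raw, "->", name_to_rxcui(raw))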

2.3 Data Resources that Can Be Leveraged for Drug Discovery

A growing priority of increasing data accessibility through open-access
efforts has considerably boosted computational drug discovery
capabilities. In fact, Greene et al. created the Research
Parasite Award, which honors researchers who perform rigorous
secondary analyses of existing, open-access data to generate novel
insights independent of the original investigators [32]. These
data exist in many forms, including clinical (e.g., disease comorbid-
ities), genomic (e.g., genetic, transcriptomic), and proteomic.
Drug discovery is multifaceted at its core, at the very least
involving some combination of chemical data in addition to those
mentioned above. This is even more relevant for computational
drug discovery, where complex intersections of many dispa-
rate data types are required. Fortunately, there are many public
repositories that contain a plethora of data around these spheres,
which can be connected utilizing the aforementioned ontologies
and reference databases. We outline the various data sources in the
current section and present two detailed examples of successful
drug discovery applications of these datasets in Subheading 3.

2.3.1 Hospitals and Academic Health Centers

Due to regulatory requirements, almost all American medical centers
and health systems store patient data collected during outpatient
and hospital visits using software platforms such as those
provided by EPIC (Epic Systems Corporation, Madison, WI) and
Cerner (Cerner Corporation, Kansas City, MO). These electronic
medical records (EMRs), or alternately electronic health records
(EHRs), are composed of a number of data types such as disease
diagnoses, medication prescriptions, lab test results, surgical pro-
cedures, and physician notes. Generally, data warehouse adminis-
tration takes responsibility for housing and creating an anonymized
or de-identified version of the EMR. Affiliated faculty and research-
ers then apply for access to this resource through their Institutional
Review Board (IRB). While the primary role of EMR is for institu-
tional or administrative purposes, one important benefit of the
digitization of health records is that they can be more easily adapted
for powerful healthcare research purposes. For drug discovery,
these data can be used to discover genotype-phenotype relationships
that could drive target selection or to analyze medication
efficacy and side effects in a real-world context [33]. The implica-
tions and impact of precise electronic phenotyping procedures for
these types of analyses will be discussed in Subheading 4.
There are many hospital systems and affiliated medical centers
that have successfully used their in-house EMR systems for impor-
tant scientific and clinical discoveries [34–38]. The Mount Sinai
Hospital and the Icahn School of Medicine at Mount Sinai organize
and protect their EMR data (beginning in 2003) within the Mount
Sinai Data Warehouse, which comprises over 7.5 million
patients and 2 billion points of data, including disease diagnoses
and lab test results. The University of California, San Francisco
(UCSF) is leading a massive effort to coordinate EMR data from
five UC medical centers. The University of California Research
eXchange (UC ReX; https://myresearch.ucsf.edu/uc-rex) pro-
vides a framework for UC-affiliated researchers to query
de-identified clinical and demographic data, providing a natural
cross-validation opportunity to compare findings across these sites.
Data analyses using EMRs can be enhanced with the inclusion
of genetic data from biobanks or repositories of biological samples
(e.g., blood) of recruited participants generally from a hospital
setting. Many institutions have frameworks set up that facilitate
this type of research, such as The Charles Bronfman Institute of
Personalized Medicine BioMe biobank within the Mount Sinai
Hospital system, BioVU from the Vanderbilt University Medical
Center, and DiscovEHR, which is a collaboration between Regen-
eron Genetics Center and the Geisinger Health System. Coupling
clinical and genetic data has led to important findings in the area of
drug and target discovery. Using the DiscovEHR resource, for
instance, researchers recently characterized the distribution and
clinical impact of rare, functional variants (i.e., deleterious) in
whole-exome sequences for over 50,000 individuals [39]. The
associations they identified add insight to the current understand-
ing of therapeutic targets and clinically actionable genes.
A limitation of these biobanks and EMR systems is that they are
often restricted to researchers affiliated with the associated institu-
tion. There are a number of initiatives and resources that allow
researchers to apply for access of these types of data. For instance,
the Centers for Medicare & Medicaid Services offer relevant
healthcare-related data dumps (https://data.medicare.gov/) for
sites that accept Medicare including hospitals, nursing home, and
hospices with information on various factors and outcomes (e.g.,
infection rates). The DREAM challenges (http://www.
dreamchallenges.org) consist of various "open science" prediction
challenges, some of which are disease-centric and incorporate real
clinical datasets often coupled with other
modalities of data (e.g., genetics). In the Alzheimer’s Disease Big
Data DREAM Challenge #1 (Synapse ID: syn2290704; June–October
2014), challengers were tasked with developing the best
performing machine learning model to predict disease progression,
using actual genetic data (e.g., genotypes) and clinical data (e.g.,
cognitive assessments) from patients with mild cognitive
impairment, early Alzheimer’s disease, and elderly controls. The
UK Biobank (http://www.ukbiobank.ac.uk/) is an unprecedented
health resource of over 500,000 recruited participants aged 40–69
with genetic (genotyping) data and a wide variety of clinical data, includ-
ing longitudinal follow-ups (original data collected from 2006 to
2010). These data include online questionnaires (e.g., about diet,
cognitive function), EMR, blood biochemistry (e.g., hormone
levels), and urinalysis, among others. Further, more specialized
data are available for a subset of participants, including 24-h activity
monitoring for a week (n = 100,000) and image scans (e.g., brain,
heart, abdomen; n = 100,000).
What distinguishes the UK Biobank from other large-scale
initiatives is that it is a “fair access” biobank [40], one that has
infrastructure to facilitate collection, storage, protection, and dis-
tribution (with data update releases) to allow academic and indus-
trial researchers to apply for access. As of July 2016, there have been
around 100 publications featuring or involving data from the UK
Biobank (http://www.ukbiobank.ac.uk/published-papers/; latest
update), although this number is rapidly increasing. While these
studies include major findings in the realm of genetics and pheno-
types (e.g., diseases and health outcomes), very few are directed at
drug discovery. Wain et al. have demonstrated the value of these
data for the identification of potential drug targets, specifically for
lung function and chronic obstructive pulmonary disease (COPD)
The authors leveraged the large cohort size to select ~50,000
individuals at the extreme ends of lung function (e.g., forced
expiratory volume in 1 s) for their analyses. From a GWAS, they
identified 97 signals (43 reported as novel) and created a polygenic
risk score from around ninety-six alleles that confers a 3.7-fold change in
COPD risk between high- and low-risk scored individuals. The
authors then analyzed these signals in the context of variant func-
tion (i.e., deleteriousness) that cause expression changes in other
genes (eQTLs), resulting in 234 genes with potentially causal
effects on lung function. Seven of these 234 genes are already
targets of approved drugs or drugs currently in development. The
remaining genes, along with others they identified by expanding
their network via protein-protein interactions, represent potential
novel targets for drug development.

2.3.2 Genomic Experimental Data Repositories

Each research study is crucial for furthering scientific knowledge,
but sharing the experimental data collected can additionally
benefit other researchers in the field directly. The pool-
ing of several sources of data can allow for meta-analyses and other
types of research not possible in isolation, thereby facilitating fur-
ther discoveries. There have been outstanding efforts to provide a
framework for researchers to deposit and share data, with many
high-impact journals even requiring it for publication. There are
other reviews that go into detail about these resources, but we will
describe a few here.
The Gene Expression Omnibus (GEO) is a public repository
from NCBI that allows researchers to upload high-quality func-
tional genomics data (e.g., microarray) that meet their guidelines
[42]. GEO organizes, stores, and freely distributes these data to the
research community. As of 2017, GEO already hosts over two
million samples. ArrayExpress (http://www.ebi.ac.uk/
arrayexpress/) [43] is an archive of functional genomics data
(some overlap with GEO) comprising over 70,000 experiments,
2.2 million assays, and 45 terabytes of data as of July 2017. The
Immunology Database and Analysis Portal (ImmPort; www.
immport.org) is a related resource focused on data from immunol-
ogy studies of various types, focuses, and species that provides a
framework to share and use genomic data of clinical samples.
The Cancer Genome Atlas (TCGA; https://cancergenome.
nih.gov/) from the NIH is a comprehensive resource that collects
data relating to the genomic aspects of over 33 types of cancer.
TCGA harmonizes clinical, sequencing, transcriptomic, and other
data types from various studies in a user-friendly Genomic Data
Commons Data Portal (https://portal.gdc.cancer.gov/). As of
Data Release 6.0 (May 9, 2017), TCGA encompasses around
250,000 files from almost 15,000 patients with cancer in 29 pri-
mary sites (e.g., kidney). As the data from TCGA are both
immense and complex, Firehose GDAC (https://gdac.bro
adinstitute.org/) was developed to systematize analyses and pipe-
lines using TCGA in order to facilitate smooth research efforts. The
Cancer Cell Line Encyclopedia (CCLE; http://portals.bro
adinstitute.org/ccle/) is a public compendium of over 1000 cell
lines that aims to characterize the genetic and pharmacologic
aspects of human cancer models. The Genotype-Tissue Expression
(GTEx; https://gtexportal.org/home/) portal is a collection of
genetic and gene expression data from the Broad Institute, with
data for 544 healthy individuals in 53 tissues as of the V6p release.
Using this resource, researchers have been able to delineate cis and
trans expression quantitative trait loci (eQTL), a unique capability
from having access to both genotype and gene expression data.

2.3.3 Chemogenomic Data Resources

The last crucial component of computational drug discovery is a
resource that relates the effects of chemical exposure on biological
systems. The Connectivity Map (CMap; https://portals.bro
adinstitute.org/cmap/) is a landmark repository that seeks to pro-
vide a systematic representation of the transcriptomic effects of cell
lines being treated with various drugs. The current version (build
02) of CMap contains over 7000 drug-induced gene expression
profiles for 1309 compounds across five cell lines.
A spiritual successor of CMap, the Library of Integrated Network-Based
Cellular Signatures (LINCS; http://www.lincsproject.org/) project is a
large-scale collaboration from 12 institutions. As of the writing of
this chapter, LINCS encompasses 350 datasets using 14 different
methods (e.g., KINOMEscan) across six subject areas (e.g., bind-
ing, proteomics) and 11 biological processes (e.g., gene expression,
cell proliferation). A large component of LINCS is L1000 Connec-
tivity Map – a collection of assays that measure transcriptomic
profiles for a variety of pharmacological and genetic (knockdown
and overexpression) perturbations across different cell lines. L1000
contains data for roughly 20,000 compounds across over 50 differ-
ent cell lines – a substantial increase from CMap. One large differ-
ence between these two resources, however, is that the L1000 assay
directly measures only 978 “landmark” genes and uses imputation
methods to gain information for others. Researchers can obtain
these datasets through a convenient data portal (http://lincsportal.
ccs.miami.edu/dcic-portal/). As referenced above, there are
countless successful research applications using these resources in
the fields of target discovery, drug discovery, and drug repurposing
along with many others.

2.4 Visualization Tools and Software

Effective data visualization is pertinent for network-based target
(e.g., key driver) discovery or data exploration in drug discovery.
The programming languages mentioned before can produce high-
quality figures of results. While the base graphical output style is
often adequate, there are a number of packages that can be used to
create superior and publication-ready graphics, including ggplot2
[44] for R, matplotlib [45], PyX (http://pyx.sourceforge.net/),
and seaborn (http://seaborn.pydata.org/) for Python. D3
(https://d3js.org/), a JavaScript library, can also create impressive
static and interactive figures for featuring on web sites. While these
languages can generate network figures through packages such as
NetworkX (http://networkx.github.io/) for Python, specialized
software such as Cytoscape [46], Gephi [47], and igraph
(http://igraph.org/) or online tools such as Plot.ly
(http://plot.ly/) may be preferred for this use case.
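
As a small illustration, the sketch below draws a toy drug-disease
network with NetworkX and matplotlib; the nodes and edge weights are
invented purely for display:

import matplotlib.pyplot as plt
import networkx as nx

# Toy network: edges weighted by hypothetical signature-reversal scores.
G = nx.Graph()
G.add_edge("IF/TA", "kaempferol", weight=-0.6)
G.add_edge("IF/TA", "esculetin", weight=-0.5)
G.add_edge("HCC", "niclosamide", weight=-0.7)

pos = nx.spring_layout(G, seed=0)  # deterministic layout for reproducibility
nx.draw_networkx(G, pos, node_color="lightsteelblue")
nx.draw_networkx_edge_labels(
    G, pos, edge_labels={e: G.edges[e]["weight"] for e in G.edges})
plt.axis("off")
plt.show()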

3 Methods

In the current section, we will describe a general workflow that
leverages public data for drug discovery, exemplified in two of
our recent studies. We provide the general step-by-step framework
for performing such experiments in Fig. 2. While both studies are
methodologically similar, they have disparate focuses, illustrating
the wide variety of possibilities of this framework. In the first study,
Li and Greene et al. sought to identify novel therapeutic options for
chronic allograft damage, specifically to limit the progression of
interstitial fibrosis and tubular atrophy (IF/TA) [48]. In the next
study, Chen and Wei et al. discovered a novel, potentially
therapeutic drug to treat hepatocellular carcinoma (HCC), revealing
reduced growth of cancer cells in both in vitro and in vivo
models [49].

[Fig. 2 diagram: Step 1, identify relevant datasets (Gene Expression
Omnibus (GEO) GSE accessions) for the disease (Dx) of interest; Step 2,
obtain the data; Step 3, extract case/control samples (e.g., via regular
expressions); Step 4, generate the Dx signature by differential
expression (e.g., edgeR, RankProd); Step 5, query a drug library (CMap;
Library of Integrated Network-Based Cellular Signatures (LINCS)); Step 6,
compute drug-Dx correlations (reverse gene expression score) to yield a
high-priority drug list]

Fig. 2 Step-by-step process for performing in silico drug discovery starting from a disease of interest.
Researchers can go to the GEO online portal and search for their disease of interest, which will result in
experiments and microarray datasets that other scientists have generated and uploaded. Having identified all
datasets of interest, the data can be downloaded individually or via a tool, such as GEOquery [80] for R. Case
and control data will have to be identified either manually or through regular expressions from the name
strings. Next, disease signatures can be derived through various tools that perform differential expression
analysis, such as edgeR [81] for RNA-Seq and RankProd and SAM for microarray data [82]. Drug libraries,
such as CMap and LINCS, provide RNA expression data for a large number of compounds across different cell
lines. Next, a correlation analysis can be performed comparing the disease signature and all drug signatures
producing a reversal score for each drug. Significant scores can be ranked from the top most correlated
signatures to the most anticorrelated signatures. These hits can then be used to select a candidate drug based
on study goals, such as prioritizing candidates that would reverse disease signature (i.e., top anticorrelated
hits)
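
To make Steps 1-3 concrete, the sketch below uses GEOparse (a Python
counterpart to GEOquery; an assumption that it suits the pipeline) to
fetch a series and split samples by title. The accession and the
regular expressions are placeholders to adapt to the disease of interest.

import re
import GEOparse  # pip install GEOparse

# Steps 1-2: identify and download a GEO series (placeholder accession).
gse = GEOparse.get_GEO(geo="GSE12345", destdir="./geo_cache")

# Step 3: assign case/control status by pattern-matching sample titles.
cases, controls = [], []
for name, gsm in gse.gsms.items():
    title = gsm.metadata["title"][0].lower()
    if re.search(r"tumor|disease|if/ta", title):
        cases.append(name)
    elif re.search(r"normal|control|non-tumor", title):
        controls.append(name)

print(f"{len(cases)} cases, {len(controls)} controls")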

3.1 Disease Signatures

A disease signature is the unique molecular state (e.g., RNA expres-
sion profile) of a phenotype (e.g., type 2 diabetes) that is altered
from “wild-type” or healthy. Typically, these signatures can be
characterized by multimodal biological data, including gene
expression in tissues and protein composition in the microbiome.
In the Li and Greene et al. study, they obtained kidney transplant
microarray datasets from GEO in addition to an in-house dataset.
They first restricted the possible datasets to those in humans with
biopsy or peripheral blood samples, leaving five that were eligible.
As is typical in these types of analyses, the phenotypic descriptors
from the datasets did not directly coincide. As such, for each study,
they selected data from specific biopsies that met their criteria for
suitable comparison (i.e., “moderate” and “severe” IF/TA from
one study, IF/TA “II” and “III” from another, etc.) including both
case and control conditions. To rectify any possible conflicting
probe annotations due to platform or versioning, they
re-annotated all probes to the latest gene identifiers using AILUN
[50]. In addition to ensuring consistent annotations, it is also impor-
tant to account for potential discrepancies in expression measure-
ments across studies and platforms. Accordingly, they standardized
each dataset using quantile normalization to allow for more
precise integration. The integrated dataset comprised 275 samples in
two tissues from three different microarray platforms.
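
Quantile normalization itself is compact to express; the following is a
standard formulation in pandas (the study's exact implementation may
differ, e.g., in how ties are broken):

import numpy as np
import pandas as pd

def quantile_normalize(df):
    """Force every column (sample) of a genes x samples matrix to share
    the same distribution: each value becomes the mean of all samples'
    values at its rank."""
    ranks = df.rank(method="first").astype(int)               # 1..n per column
    ref = pd.Series(np.sort(df.values, axis=0).mean(axis=1))  # reference distribution
    return ranks.apply(lambda col: ref.iloc[col - 1].to_numpy())
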
With the datasets normalized and integrated, a meta-analysis
can be performed to create a consensus disease signature across
multiple studies. There are many differing methodologies to per-
form meta-analysis on microarray data that we discuss in Subhead-
ing 4. Li and Greene et al. performed two meta-analysis techniques
and included genes that had robust expression profiles significant in
both, thus maximizing the methodological strengths of each. The
first method involved evaluating the differential expression effect
size for each study. Specifically, these effect sizes were combined using a
fixed-effect inverse-variance model by which the effect size from
each study is weighted by the inverse of the intra-study variance
resulting in a meta-effect size. From this approach, they identified
996 FDR-corrected significant genes that were measured in at least
four studies and concordantly expressed in the same direction
(FDR < 0.05).
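
In standard notation (ours, not the study's), the fixed-effect
inverse-variance model combines per-study effect sizes \theta_i with
intra-study variances \sigma_i^2 as

\hat{\theta} = \frac{\sum_{i=1}^{k} w_i \theta_i}{\sum_{i=1}^{k} w_i},
\qquad w_i = \frac{1}{\sigma_i^2},
\qquad \mathrm{SE}(\hat{\theta}) = \frac{1}{\sqrt{\sum_{i=1}^{k} w_i}}

so each study is weighted by the inverse of its variance, and the
z-statistic \hat{\theta}/\mathrm{SE}(\hat{\theta}) yields the per-gene
meta-analysis p value that is then FDR-corrected.
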
For the second method, they then utilized results from the
significance analysis of microarrays (SAM) [51] of each study, clas-
sifying significant differentially expressed genes between IF/TA
and non-IF/TA groups using q < 0.1 threshold. For the meta-
analysis, they performed a Fisher’s exact test comparing the number
of studies in which each gene was significantly differentially
expressed by the hypergeometric distribution (p < 0.05). From this
method, they identified 510 genes that were significant in at least
three of the studies. With the two results from the meta-analyses in
hand, they restricted their disease signature to genes that were
significantly differentially expressed using both methods, resulting
in 85 genes used for drug signature comparison.
Chen and Wei et al. also leveraged public, external datasets to
generate a signature for the disease of interest: HCC. The authors
constructed a multitiered pipeline that utilized various open-source
databases to build a high-confidence HCC disease signature that
was evaluated at multiple stages. To build the HCC disease signa-
ture, they first obtained RNA sequencing profiles for 200 HCC and
50 adjacent non-tumor samples from GDAC and their
corresponding clinical labels from TCGA. For subsequent evalua-
tion of this disease signature, they downloaded data from GEO by
querying “hepatocellular carcinoma,” resulting in seven potential
independent datasets that had at least three samples in both disease
and control (i.e., non-tumor liver sample) groups. Like the other
study, they converted all probes to the most recent build and
collapsed them to the gene level by mean value and then performed
quantile normalization. As mentioned above, datasets from public
resources may not coincide with the current research question or
focus. Like before, different methodologies have to be employed to
make best use of the exploitable aspects of interest (see Subheading
4). For the current study, in order to ensure that gene expression
data from these HCC samples were of high quality, robust enough,
and related to the cell lines that were later used for experimental
validation, they performed an initial assessment of the similarity of
each expression profile to cancer cell lines that are already charac-
terized in detail. For the background set, they downloaded gene
expression data for 1019 cancer cell lines, including 25 that were
HCC-specific, from CCLE, and represented each by its expres-
sion profile of the 5000 genes that varied the most across all samples.
For the similarity assessment, they performed a ranked-based
Spearman correlation of the CCLE set to both the TCGA and
GEO samples of interest. They then performed a Mann-Whitney
test to differentiate the correlation outcomes of the samples of
interest to the HCC-specific cell lines against the non-specific
ones. The resulting p value outcomes of the sets of interest were
compared to outcomes of 1000 random tissue samples (assessed via
the same method) obtained from the Expression Project for Oncol-
ogy (expO; GSE2109). From the original datasets of interest,
samples were removed from consideration if their association was
lower than 95% of the random samples based on the null distribu-
tion ( p < 0.05). From this thorough assessment, eight tumor
samples from TCGA and one dataset from GEO were removed
from consideration. After restricting to high-quality TCGA data-
sets (n = 192 tumor and 50 non-tumor samples), the authors built
the initial HCC disease signature model using DESeq (version 1)
[52] to perform differential expression analysis across various
log2 fold change (FC) and significance thresholds.
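
A simplified sketch of the cell-line similarity assessment described
above is shown below (the full pipeline additionally compares the
result against a null distribution built from expO tissue samples,
omitted here):

import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

def hcc_similarity_pvalue(sample_expr, ccle_expr, is_hcc_line):
    """One-sided test that a tumor sample correlates more strongly with
    HCC cell lines than with non-HCC lines.

    sample_expr : (n_genes,) expression vector for one sample
    ccle_expr   : (n_genes, n_lines) matrix of CCLE profiles
    is_hcc_line : (n_lines,) boolean mask marking HCC-specific lines
    """
    rho = np.array([spearmanr(sample_expr, ccle_expr[:, j])[0]
                    for j in range(ccle_expr.shape[1])])
    return mannwhitneyu(rho[is_hcc_line], rho[~is_hcc_line],
                        alternative="greater").pvalue
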
To fine-tune and identify the best disease signatures, the
authors evaluated them on the 1736 patient samples from the six
curated datasets from GEO. For each study, they removed genes
outside the HCC signature from the expression data and then per-
formed principal component analysis to build classifiers, with the
first principal component representing the variation between tumor
and non-tumor samples. They found that FC > 2.0 and p < 1E-20
thresholds led to the best separation of tumor and non-tumor
samples (median AUC = 0.995) across all studies. Based on this
best threshold, the integrated HCC signature consisted of 163 up-
and 111 downregulated genes. In order to use the LINCS L1000, a
drug gene expression database with 978 genes profiled, they also
created another reduced disease signature.
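
The threshold-tuning step can be sketched as scoring each candidate
signature by how well the first principal component separates tumor
from non-tumor samples in an independent dataset (function and table
layout are our assumptions):

from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def signature_auc(expr, labels, signature_genes):
    """expr: pandas DataFrame (genes x samples); labels: 1 = tumor,
    0 = non-tumor; returns the AUC achieved by the first PC."""
    sub = expr.loc[expr.index.intersection(signature_genes)].T  # samples x genes
    pc1 = PCA(n_components=1).fit_transform(sub.values).ravel()
    # PC1 sign is arbitrary, so keep the better-oriented AUC
    return max(roc_auc_score(labels, pc1), roc_auc_score(labels, -pc1))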

3.2 Drug Signatures

Drug signatures are similar in nature to disease signatures in that
they represent the global perturbation of gene expression com-
pared to vehicle control. In the case of drug signatures, the pertur-
bation is treatment exposure instead of disease state. As mentioned
in Subheading 2, there are many databases that contain gene
expression data for a variety of drugs across many different cell
lines, tissues, and organisms. In the Li and Greene et al. study,
they collected all drug-induced transcriptional profiles for all
drugs in CMap (n = 1309) across all experiments (n = 6100).
They then created a consensus, representative profile for each drug,
merging data from the associated studies using the prototype
ranked list (PRL) method [53]. The PRL method works through
a hierarchical majority-voting scheme over all ranked gene expression
lists for a single compound, where consistently up- or downregulated
genes are weighted toward the top of their respective extremes.
The various merging methodologies and their considerations will
be discussed in the Notes.
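
As a simplified stand-in for PRL, a Borda-style aggregation conveys the
flavor of merging per-experiment rankings into one consensus list (the
actual PRL scheme is hierarchical rather than a flat average):

import numpy as np

def consensus_ranking(rank_lists):
    """rank_lists: list of (n_genes,) arrays, where rank_lists[k][g] is
    gene g's rank in experiment k (0 = most upregulated). Returns gene
    indices ordered from consensus-most-upregulated downward."""
    mean_ranks = np.vstack(rank_lists).mean(axis=0)
    return np.argsort(mean_ranks)
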
Chen and Wei et al. utilized both CMap and LINCS data to
generate drug signatures. For each drug of all CMap, they per-
formed an initial quality control step keeping only instances (i.e.,
cell line experiment data) where its profile correlated ( p < 0.05)
with at least one other profile (of the same drug), leaving 1329
high-quality instances. They gathered a high-quality list of data
from LINCS by only including landmark genes (n = 978), restrict-
ing instances to HepG2 and Huh7 cell lines, and removing poor
quality perturbations, resulting in 2816 profiles. As the authors
were interested in repurposing an already approved compound,
they intersected drugs from both data sources with DrugBank, leav-
ing 380 instances of 249 drugs. Out of these 249 drugs, 83 were
common between the two sources.

3.3 Integrating Disease and Drug Signatures: In Silico Methodologies for Therapeutic Predictions

The general methodology used to correlate drug-induced and
disease-state gene expression profiles is well established and has
been used successfully in a wide variety of studies. How well, and
in what direction, drug and biological profiles correlate can direct
drug discovery and repurposing efforts. The disease-state profile
represents gene expression patterns that diverge from the norm.
Accordingly, identifying a compound with an opposing (i.e., antic-
orrelated) profile can push biological expression patterns back to an
unperturbed state. Typically, the signature matching is performed
using a KS test [54]. As this type of analysis can produce many
candidate predictions, we recommend additional refinement and
filtration steps to prioritize drugs or targets with the highest confi-
dence of success to move forward with testing.
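
A minimal sketch of such a KS-based matching score, in the spirit of
the original CMap formulation (the studies below use modified
variants), is shown here; gene ranks are assumed 1-based, and
significance can then be estimated by rescoring randomly drawn
signatures of the same size:

import numpy as np

def ks_enrichment(positions, n_genes):
    """Weighted KS-like statistic for where a gene set sits in a ranked list."""
    v = np.sort(np.asarray(positions, dtype=float))  # 1-based ranks of tag genes
    t = len(v)
    a = np.max(np.arange(1, t + 1) / t - v / n_genes)
    b = np.max(v / n_genes - np.arange(t) / t)
    return a if a > b else -b

def connectivity_score(drug_ranks, up_genes, down_genes):
    """Negative scores flag drugs predicted to reverse the disease signature.

    drug_ranks : dict gene -> 1-based rank in the drug-induced profile
                 (rank 1 = most upregulated by the drug)
    """
    n = len(drug_ranks)
    ks_up = ks_enrichment([drug_ranks[g] for g in up_genes], n)
    ks_down = ks_enrichment([drug_ranks[g] for g in down_genes], n)
    # As in CMap, same-signed enrichments are uninformative and score zero.
    return 0.0 if np.sign(ks_up) == np.sign(ks_down) else ks_up - ks_down
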
In the Li and Greene et al. study, they compared the correlation
for 1309 consensus compound signatures from CMap to their
IF/TA disease signature using a modified KS test. The significance
of these scores was calculated by comparing specific KS scores to
those from a random permutation of 1000 drug signatures, pro-
ducing a ranked list of potential treatments that were significantly
anticorrelated to the disease signature (p ≤ 0.05). To further refine
this list of candidates, they performed a literature review of the top
hits and excluded drugs from their list of possibilities that would
impede IF/TA improvement due to their side effect profile (i.e.,
those with negative neurological side effects). From this step, they
decided to further pursue kaempferol and esculetin as candidate
therapeutic drugs to treat renal fibrosis. To add further evidence to
their findings, the authors performed a separate, additional in silico
experiment to try to characterize potential immune-related effects
of these two compounds by evaluating the specific immune cell types
they may influence during anti-fibrosis activity. As such,
they first matched the 1309 drug expression profiles to 221 immune
cell state profiles procured from an immune-cell pharmacology map
[55] to look for enrichments in specific immune cell subsets. This
analysis predicted that esculetin and kaempferol would inhibit both
adaptive and innate immune cells (e.g., CD4 T cells) in IF/TA.
In the Chen and Wei et al. study, they were unable to use their
best performing HCC signature profile of 274 genes as only
30 genes mapped to the landmark LINCS set. Accordingly, they
relaxed the threshold (p < 0.001 and FC > 2.0) to increase the
power of the signature for this comparison task, producing a signa-
ture of 44 genes. The original signature was correlated to 1174
distinct drugs in CMap and the reduced signature was correlated to
249 distinct drugs in LINCS. Similar to the previous study, the
authors utilized a nonparametric, rank-based pattern methodology
based on the KS statistic to identify compounds curated from both
CMap and LINCS that are anticorrelated with the HCC signature.
They also assessed significance of these predictions through ran-
dom permutations and multiple testing correction (FDR < 0.05):
there were 302 drugs from CMap and 39 drugs from LINCS that
were anticorrelated at this threshold, with 16 overlapping. Out of
this high-confidence intersected set, the top hit from ranking across
both libraries was niclosamide. This drug had previously established
antitumor properties in other cancers, but had not been assessed in
HCC animal models. Therefore, it was selected as a candidate drug
to undergo subsequent validation experiments.
Like the previous study, Chen and Wei et al. developed an
innovative, additional in silico approach to evaluate the global
performance of the predictions from their pipeline. They hypothe-
sized that their top predictions (i.e., anticorrelated drugs) should
be enriched for gold standard, or established, treatments of HCC.
Accordingly, they extracted data from clinicaltrials.gov querying
the terms "hepatocellular carcinoma" and "liver cancer" and filter-
ing out trials that studied tumors and cancer in general.
From this list of 960 trials, they extracted 76 drugs from the
“interventions” column that appeared in more than one trial as
their list of gold standard HCC drugs. When mapped to the che-
mogenomic databases, there were 7 found in CMap and 16 in
LINCS. Using ssGSEA from GSVA [56], a gene set enrichment
package, with permutation testing (n = 10,000), they found that these
gold standard drugs were more likely to reverse the HCC gene
signature in both CMap (p = 0.012) and LINCS (p = 0.018),
increasing the confidence of testing other hits in the lab.
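
A lightweight alternative to the ssGSEA-based check, shown only to
convey the idea, is a rank-sum test asking whether gold standard drugs
receive more strongly reversing (more negative) scores than the rest:

from scipy.stats import mannwhitneyu

def gold_standard_enrichment(scores, gold):
    """scores: dict drug -> reversal score (more negative = stronger
    reversal); gold: set of established drugs for the disease."""
    g = [s for d, s in scores.items() if d in gold]
    rest = [s for d, s in scores.items() if d not in gold]
    return mannwhitneyu(g, rest, alternative="less").pvalue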

3.4 Experimental Validation of Predictions

While the prediction methodology outlined above is powerful in its
own right, it is often necessary to validate these findings in an
experimental setting. Convincing validation of these predictions
can be achieved in a multitude of ways in both in vitro and in vivo
models. The considerations and criteria for this decision are dis-
cussed in Subheading 4. To investigate the utility of kaempferol and
esculetin for fibrosis, Li and Greene et al. utilized the human kidney
2 (HK-2) cell line in vitro to determine perturbations in cellular
pathways following drug exposure, specifically targeting biological
aspects from their in silico-based hypotheses. As such, they showed
kaempferol significantly reduced TGF-β1-mediated expression of
SNAI1 (p = 0.014 for 15 μM exposure) and reversed CDH1
downregulation (p = 0.045). They also found that esculetin treat-
ment inhibits Wnt/β-catenin signaling in renal tubular cells:
esculetin-treated cells caused a significant decrease in CCND1
protein levels after Wnt agonist stimulation (p = 0.0054 for
60 μM exposure).
In addition to the encouraging in vitro results, they performed
an additional experiment to assess the effects of kaempferol and
esculetin on renal interstitial fibrosis in vivo. Specifically, they used a
unilateral ureteric obstruction (UUO) mouse model to study the gene
expression, histological, and immunohistochemistry (IHC) effects of
treatment on renal fibrogenesis. Mice (Balb/c mice from Jackson
Laboratory) were administered kaempferol (n = 5), esculetin
(n = 5), or saline (n = 6) from 2 days prior to UUO surgery
procedure until sample collection 7 days post-UUO. They found
encouraging results supporting a beneficial role of these drugs in
treating renal fibrosis: mice treated with both drugs had signifi-
cantly lower amounts of interstitial collagen and renal fibrosis in
UUO kidneys by picrosirius red staining compared to controls
(p = 0.0009 for kaempferol; p = 0.0011 for esculetin). Their
targeted analyses also allowed them to assess their hypotheses about the
mechanisms by which these drugs are effective in this system: for
instance, kaempferol caused a significant reduction in the gene
expression of Snai1 (p = 0.038), a key transcription factor in the
TGF-β signaling pathway, confirming their in vitro findings.
Like the first study, Chen and Wei et al. performed a series of
both in vitro and in vivo experiments to systematically evaluate their
prediction of niclosamide as an HCC drug candidate. Due to the poor
water solubility of niclosamide, which could hamper its effective-
ness, they also performed these experiments using niclosamide
ethanolamine salt (NEN), which is known to have better systemic
bioavailability and could have a better chance of reaching the tumor
site. As an initial evaluation, the authors performed a cell viability
assay of these two drugs on HCC cells to calculate inhibitory
concentrations. They found that niclosamide and NEN both
reduced HCC cell viability, being over sevenfold more cytotoxic
to HCC cells than primary hepatocytes. With these findings, the
authors felt confident to assess the antitumor effects of niclosamide
and NEN in a mouse primary HCC model. Mice with induced
HCC were treated with food containing niclosamide or NEN
(cases) or with autoclaved food (control). At 12 weeks, the livers
were extracted and analyzed histologically. The authors found that
niclosamide and NEN both reduced the number of tumor nodules
compared to controls, but the effect of NEN was much more
pronounced. Furthermore, the NEN-treated mice had lower over-
all liver weights compared to both control and niclosamide groups
( p < 0.001 for both), but there was no significant difference
between niclosamide and control groups. This provided early evi-
dence that NEN might be the superior treatment candidate.
Moving closer in relevance for human treatment, they evalu-
ated the effect of these drugs on mice bearing orthotopic patient-
derived xenografts (PDX) derived from HCC tissue of resected
livers from three patients with the disease. After implantation,
these mice were separated into groups like before, being fed either
regular food or food with NEN or niclosamide. As in the previous
experiment, the NEN group had the most pro-
nounced effects in inhibiting PDX growth based on both biolumi-
nescence and tumor volume ( p < 0.05 for both) and did not
significantly lower body weight. Niclosamide treatment, however,
did not significantly reduce tumor growth. At the end of this
experiment, they found that levels of niclosamide were over 15-fold
greater in xenografts in the NEN group compared to the niclosa-
mide group, providing further evidence for the limited bioavailabil-
ity of the latter.
As an adjunct to this experiment, the authors assessed how
niclosamide and NEN compared to sorafenib, a standard of
care for HCC. In this PDX model, the combination of NEN and
sorafenib resulted in decreased tumor volume compared to the
control group (p = 0.013), NEN only (p = 0.030), and sorafenib
only (p = 0.024). Treatment with niclosamide and
sorafenib, however, did not result in reduced tumor volumes com-
pared to the other groups. These experiments further verified that
NEN was the preferred candidate over niclosamide for HCC
treatment.
Relating to the original computational predictions, the
authors experimentally assessed the capacity for these drugs to
reverse the HCC gene signature they defined. As a first step, they
treated HepG2 cells for 6 h with 10 μM niclosamide, 10 μM NEN,
or 10 μM dimethyl sulfoxide (control). As hypothesized, they
found that both niclosamide (p = 1.1E-7; Spearman correla-
tion of -0.32) and NEN (p = 9.8E-6; Spearman correlation of
-0.26) significantly reversed the 274-gene HCC expression profile.
Furthermore, they observed similar gene expression changes for
both drugs compared to control (p < 2.2E-16; Spearman
correlation of 0.87). As a complement to the in vitro assessment
of gene expression changes, the authors additionally assessed effects
on gene expression within the PDX model. They found that the
differential gene expression profile between NEN-treated animals
and controls was significantly anticorrelated with the HCC profile
(p < 3.9E-6; Spearman correlation of -0.25). Out of all the
genes in the HCC profile, the authors observed that the antic-
orrelation signal was mostly driven by 20 upregulated and 29 sup-
pressed genes from both the in vitro and in vivo experiments (FDR
<0.25).
As a final piece to the puzzle, the authors sought to characterize
by which biological mechanism NEN attenuates HCC. They
focused their attention on the chaperone proteins heat shock protein
90 (HSP90) and cell division cycle 37 (CDC37), which regulate
kinases inhibited by NEN. They first confirmed that NEN inhibited
the HSP90/CDC37 interaction and then assessed to which protein
NEN binds. They found that NEN binds to CDC37 and also
enhanced its thermal stability. To support this finding, they
observed that CDC37 was overexpressed in 80% of HCC tissues
compared to normal livers. To assess whether CDC37 is, in fact,
mediating the effects of NEN in treating HCC, they performed a
final experiment in which RNA interference was used to knock-
down CDC37 in HCC cells treated with NEN. As hypothesized,
they found that HCC cells lacking CDC37 expression were less
sensitive to NEN, supporting the notion that the antiproliferative
effects of NEN in HCC are partly dependent on CDC37. With
these exhaustive in vitro and in vivo experiments, the authors
successfully provided a molecular context of how their computa-
tionally predicted treatment for HCC worked at multiple biological
levels.

4 Notes

In this chapter, we have briefly detailed the vast array of public
resources that can enable computational drug discovery. The
methods used by Li and Greene et al. and Chen and Wei et al. can
hopefully serve as a guide on how to leverage big data for drug
discovery from prediction to validation. It is important, however, to
note the limitations and caveats of various aspects of these pipelines
and the special considerations that should be accounted for.

4.1 Computational Considerations

As software is continually updated and new versions of tools and
packages are released, there is a large issue with reproducibility in
research, often due to incompatibility of computing environments
[57]. This often leads to inconsistent results, even when using the
same computational protocols. Beaulieu-Jones and Greene have
recently proposed a system that, if adopted, could avoid these
potential incompatibility issues [58]. They outline a pipeline that
combines Docker, a container technology that isolates analyses from
the native operating system environment, with software that continu-
ally reruns the pipeline whenever new updates are released for the
data or underlying packages. We generally recommend using open-
source programming frameworks and distributable notebooks,
such as Jupyter Notebook (http://jupyter.org/) and RMarkDown
(http://rmarkdown.rstudio.com/), and following the emerging
best practices for reproducible research whenever possible in
order to most easily facilitate data sharing and research reproduc-
tion. With this said, the most important step toward enabling
reproducible research is a willingness to share the data and code
in a public repository. GitHub (http://github.com/) and Bit-
bucket (http://bitbucket.org/) are two common code repos, and
it is often encouraged to deposit large datasets in repositories like
Synapse (http://www.synapse.org/).
Another issue plaguing biomedical research is the ever-
changing nature of genome builds and identifiers, which can affect
delineation of reference alleles, genomic coordinates, and probe
annotations, among many others. As an early solution to this
problem, Chen et al. built AILUN, a fully automated online
tool that will re-annotate all microarray datasets to the latest anno-
tations, allowing for compatible analyses across differing
versions [50].
4.2 Robustness of Disease Signatures

The concept of a disease signature is continually evolving through
the development of better methods to quantify health, the refined
understanding of the roles of disease pathophysiology, and the
increase in both the availability and types of biological data col-
lected. The extensive public repositories described above contain a
wealth of data pertaining to a variety of diseases across many disease
models, tissues, and conditions (i.e., exposures). To fully leverage
these data and increase power, it is possible to collect and integrate
multiple related datasets, but this requires vigilant and methodical
quality control for effective and accurate integration in both the
phenotypic and genomic space. There are, however, potential criti-
cal issues to consider when performing a meta-analysis experiment
of gene expression disease signatures, particularly in microarray
experiments [59]. In addition to issues arising from experiments
having mismatching platforms (described above), the disease char-
acterization that is the basis of each experiment may differ, such as
having different inclusion criteria or disease definitions.
Population-level differences in racial, demographic, or environ-
mental backgrounds of these studies can have a large effect on
gene expression levels that are unrelated to disease and could lead
to spurious associations if not properly controlled for [60]. One
solution to deal with these and other potential unmodeled factors
would be to utilize surrogate variable analysis to overcome gene
expression heterogeneity within these studies [61].
One study assessed the robustness of disease signatures for over
8000 microarrays across over 400 experiments in GEO for a broad
range of diseases and tissue types [62]. Fortunately, they concluded
that the gene expression signatures within diseases are more con-
cordant than tissue expression across diseases, lending further cred-
ibility to utilizing such data for meta-analysis studies. While this
finding is encouraging, it may not always be true for new datasets or
alternative public repositories. Another consideration is the type of
statistical analysis performed to complete the meta-analysis. We
have illustrated a few in experiments described within Subheading
3, but there are many other options. In fact, researchers have
systematically compared eight different meta-analysis methods for
combining multiple microarray experiments [63]. They found that
these different approaches resulted in substantially different error
rates in classifying the disease of interest when using the same
datasets. As such, one must be mindful of the studies to be
included and the methods used in meta-analysis experiments.

4.3 Drug Profile Variation

Many related studies, including the highlighted one by Li and
Greene et al., performed a meta-analysis to integrate multiple
drug profiles into a consensus signature. Not surprisingly, there
are caveats of this process pertaining to varying biological contexts
that can affect its validity and utility [64]. Researchers integrated
gene expression data for ~11,000 drugs from LINCS with their
chemical structure and bioactivity data from PubChem to assess
correlation between structure and expression [65]. Specifically,
they systematically evaluated the effect of various biological condi-
tions, namely, cell line, dose, and treatment duration, on this rela-
tionship. They found that compounds that are more structurally
similar tend to have similar transcriptomic profiles as well, but this
relationship depends on cell line. For instance, PC3 and VCAP, both
prostate-related cancer cell lines, generally have significantly differ-
ent patterns of similarity between structure and gene expression.
Separately, they also found an interesting relationship between
structure and gene expression in the context of dose: drugs with
high structure similarity have stronger gene expression concor-
dance at higher than lower doses. Another research group merged
profiles of 1302 drugs from CMap across doses and cell lines to
create a drug network with the goal of using it to predict drug effect
and mechanism of action [53]. They noted that creating a consen-
sus signature for a drug that has inconsistent effects on different cell
lines could dilute its unique biological effects. That being said, they
found even across heterogeneous cell types, a consensus drug sig-
nature could still be well classified in their drug network given a
sufficiently large collection of data. In any case, further understand-
ing and characterization of relationship between compound and
biological context will undoubtedly improve accuracy of computa-
tional predictions and thereby bolster rates of successful translation
into the clinic.

4.4 Selection of Validation Model

With a candidate drug selected for a disease of interest, it is impera-
tive to perform a series of coordinated experiments (e.g., PK/PD/
toxicity studies) and to address legal considerations (e.g., IP pro-
tection) to gauge its eligibility. Further, these decisions may be
particularly critical due to time and financial limitations, where
setbacks, mistakes, or poor choices could halt future progress.
Drug candidates often fail to successfully translate into the clinic
due to two main reasons: improper safety and efficacy assessments,
both possibly a direct result of poor early target validation and
unreliable preclinical models [66].
In terms of preclinical testing, the large number of avenues and
models can be overwhelming, especially in light of variable reliabil-
ity. While the accessibility to massive quantities of biological big
data has been transformative in the field of drug discovery, not all
experiments are of equal utility. In studying HCC, for example,
researchers recently found that half of public HCC cell lines do not
resemble actual HCC tumors in terms of gene expression patterns
[67]. Interestingly, another group found that rarely used ovarian
cancer cell lines actually have higher genetic similarity to ovarian
tumors than more commonly used ones [68]. Due to phenomena
like these, we recommend performing rigorous examinations like
those above and leveraging knowledge from existing evaluations of
these data when possible before usage. The nuances of subsequent
steps of the validation process are beyond the scope of this chapter,
but there are countless resources that delve into specifics regarding
proper procedures for each stage [69–72].

4.5 Drug Discovery Using EMR Data

EMR systems contain a great deal of disease- and phenotype-
related information that can be leveraged for drug discovery as
“real-world data.” The multifaceted data that are collected in
EMR facilitate flexibility in devising research questions that may
be beyond the scope or focus of the original experiment. As such,
unanticipated connections may be formed to biological aspects that
would not be collected in traditional prospective study designs.
Additionally, the large sample sizes and natural collection of longi-
tudinal, follow-up information relating to patient outcomes from
treatment are invaluable advantages over traditional randomized
clinical trials [73].
As mentioned, EMR frameworks are primarily designed for
infrastructural support and to facilitate billing. Like other research
datasets, the raw data is often messy, incomplete, and subject to
biases. Therefore, certain aspects may not be as reliable as unbiased
collection measures. Despite better refined structuring, ICD code-
based disease classification overall is often insufficient to accurately
capture disease status [74]. In this notable example, clinicians
manually reviewed the charts of 325 patients that had been
recorded with the ICD-9 code for chronic kidney disease stage
3 (585.3). They found that 47% of these patients did not have
any clinical indicators for the disease. Including other types of
data, like medications, in phenotyping criteria often leads to better
accuracy. For instance, researchers found that electronic phenotyp-
ing algorithms that require at least two ICD-9 diagnoses, prescrip-
tion of antirheumatic medication, and participation of a
rheumatologist resulted in the highest positive predictive value to
identify rheumatoid arthritis [75]. In fact, a recent study analyzed
ten common diseases (e.g., Parkinson’s disease) in an EMR system
and compared the accuracy of classification through ICD codes,
medications (i.e., those primarily prescribed for disease treatment),
or clinician notes (i.e., if the disease was mentioned in the visit
report) [76]. The “true” disease classification was determined by a
physician’s manual chart review. It is clear that certain diseases are
more often and better classified by certain modalities, such as
clinician notes for atrial fibrillation or ICD codes for Parkinson's
disease, but combinations of all three generally lead to the highest
predictive power and lower rates of error. Fortunately, notable
efforts by groups such as the electronic medical records and geno-
mics (eMERGE) consortium (https://emerge.mc.vanderbilt.edu/)
have led to rigorous, standardized electronic phenotyping algo-
rithms that identify case and control cohorts utilizing multiple
dimensions of EMR data to establish inclusion and exclusion criteria.
The Phenotype KnowledgeBase (PheKB; https://phekb.org/) is a
collaborative effort to collect, validate, and publish such algorithms
for transportability and reproducibility for use in any medical sys-
tem's EMR. With this in mind, researchers must understand these
caveats and take as much care as when incorporating public
expression data.
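
As an illustration of such multimodal phenotyping, the sketch below
encodes a rule in the spirit of the rheumatoid arthritis algorithm
described above; the table layouts, column names, and code lists are
assumptions for demonstration only:

import pandas as pd

def rheumatoid_arthritis_cases(dx, rx, visits):
    """dx: patient_id, icd9; rx: patient_id, drug_class; visits:
    patient_id, specialty. Returns patient IDs meeting all criteria."""
    ra_dx = dx[dx["icd9"].str.startswith("714")]        # RA ICD-9 family
    two_codes = ra_dx.groupby("patient_id").size() >= 2
    on_dmard = set(rx.loc[rx["drug_class"] == "antirheumatic",
                          "patient_id"])
    saw_rheum = set(visits.loc[visits["specialty"] == "rheumatology",
                               "patient_id"])
    return set(two_codes[two_codes].index) & on_dmard & saw_rheum
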
Strict data access stipulations and possible disparate EMR sys-
tem frameworks oftentimes make cross-institution replication
efforts difficult. To address this issue, there are resources that
release public software and analytical tools for healthcare-related
research, like i2b2 (https://www.i2b2.org/). OHDSI (http://
www.ohdsi.org/) is a community of researchers who collaborate
to solve biomedical problems across multiple disciplines, enabling
reproducibility of research through a large-scale, open-source
workflow using observational health data [77]. Although
EMR-based studies require stringent IRB approval, there is a grow-
ing concern for patient privacy and confidentiality, especially as this
type of research, and those with access to these data, continues to
expand [78]. On the other side of this issue, the stringency of data
access makes reproducibility difficult. These types of multi-hospital
EMR studies both facilitate cross-validation of findings and
isolate any potential role of geographic environment.
Despite these challenges, EMR-based research will continue to
evolve to produce even more outstanding insights that may direct
drug discovery. Open platforms like the UK Biobank will be essen-
tial to allow more researchers to perform this type of research. With
nomenclature standardization practices improving and resources
growing, integration with developing resources of other biological
and environmental modalities (e.g., pollution data) and sensor-
based data collection [79] will allow for a multi-scale understanding
of findings. The integration of these types of data with state-of-the-
art machine learning approaches, such as deep learning, can push
predictive power well beyond the current success rates. We hope
to see findings from these efforts continue to transform clinical
care, leading to more cost-effective and efficient drug development
along with better patient outcomes and satisfaction.

Acknowledgments

The research is supported by R21 TR001743, U24 DK116214,


and K01 ES028047 (to BC). The content is solely the responsibil-
ity of the authors and does not necessarily represent the official
views of the National Institutes of Health.

References
1. Eder J, Sedrani R, Wiesmann C (2014) The discovery of first-in-class drugs: origins and evolution. Nat Rev Drug Discov 13(8):577–587
2. Mullard A (2016) Parsing clinical success rates. Nat Rev Drug Discov 15(7):447
3. Every-Palmer S, Howick J (2014) How evidence-based medicine is failing due to biased trials and selective publication. J Eval Clin Pract 20(6):908–914
4. Rothwell PM (2006) Factors that can affect the external validity of randomised controlled trials. PLoS Clin Trials 1(1):e9
5. Murthy VH, Krumholz HM, Gross CP (2004) Participation in cancer clinical trials: race-, sex-, and age-based disparities. JAMA 291(22):2720–2726
6. Rothwell PM (2005) External validity of randomised controlled trials: “to whom do the results of this trial apply?”. Lancet 365(9453):82–93
7. Hodos RA, Kidd BA, Shameer K, Readhead BP, Dudley JT (2016) In silico methods for drug repurposing and pharmacology. Wiley Interdiscip Rev Syst Biol Med 8(3):186–210
8. Paik H, Chen B, Sirota M, Hadley D, Butte AJ (2016) Integrating clinical phenotype and gene expression data to prioritize novel drug uses. CPT Pharmacometrics Syst Pharmacol 5(11):599–607
9. Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, Schacht AL (2010) How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat Rev Drug Discov 9(3):203–214
10. Caskey CT (2007) The drug development crisis: efficiency and safety. Annu Rev Med 58:1–16
11. Nosengo N (2016) Can you teach old drugs new tricks? Nature 534(7607):314–316
12. Scannell JW, Blanckley A, Boldon H, Warrington B (2012) Diagnosing the decline in pharmaceutical R&D efficiency. Nat Rev Drug Discov 11(3):191–200
13. Ashburn TT, Thor KB (2004) Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov 3(8):673–683
14. Jahchan NS, Dudley JT, Mazur PK, Flores N, Yang D, Palmerton A, Zmoos AF, Vaka D, Tran KQ, Zhou M et al (2013) A drug repositioning approach identifies tricyclic antidepressants as inhibitors of small cell lung cancer and other neuroendocrine tumors. Cancer Discov 3(12):1364–1377
15. Pessetto ZY, Chen B, Alturkmani H, Hyter S, Flynn CA, Baltezor M, Ma Y, Rosenthal HG, Neville KA, Weir SJ et al (2017) In silico and in vitro drug screening identifies new therapeutic approaches for Ewing sarcoma. Oncotarget 8(3):4079–4095
16. Dudley JT, Sirota M, Shenoy M, Pai RK, Roedder S, Chiang AP, Morgan AA, Sarwal MM, Pasricha PJ, Butte AJ (2011) Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Sci Transl Med 3(96):96ra76
17. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Sage J, Butte AJ (2011) Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med 3(96):96ra77
18. Stephens T, Brynner R (2009) Dark remedy: the impact of thalidomide and its revival as a vital medicine. Basic Books
19. Attal M, Harousseau JL, Leyvraz S, Doyen C, Hulin C, Benboubker L, Yakoub Agha I, Bourhis JH, Garderet L, Pegourie B et al (2006) Maintenance therapy with thalidomide improves survival in patients with multiple myeloma. Blood 108(10):3289–3294
20. From nightmare drug to Celgene blockbuster, thalidomide is back. Bloomberg. https://www.bloomberg.com/news/articles/2016-08-22/from-nightmare-drug-to-celgene-blockbuster-thalidomide-is-back
21. R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
22. Van Rossum G, Drake FL (2003) Python language reference manual. Network Theory
23. Jones E, Oliphant T, Peterson P (2014) SciPy: open source scientific tools for Python
24. Chen B, Wang H, Ding Y, Wild D (2014) Semantic breakthrough in drug discovery. Synthesis Lectures on the Semantic Web: Theory and Technology 4(2):1–142
25. Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267–D270
26. Liu S, Ma W, Moore R, Ganesan V, Nelson S (2005) RxNorm: prescription for electronic drug information exchange. IT Professional 7(5):17–23
27. Kuhn M, Letunic I, Jensen LJ, Bork P (2016) The SIDER database of drugs and side effects. Nucleic Acids Res 44(D1):D1075–D1079
28. Tatonetti NP, Ye PP, Daneshjou R, Altman RB (2012) Data-driven prediction of drug effects and interactions. Sci Transl Med 4(125):125ra131
29. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue):D668–D672
30. Shameer K, Glicksberg BS, Hodos R, Johnson KW, Badgeley MA, Readhead B, Tomlinson MS, O’Connor T, Miotto R, Kidd BA et al (2017) Systematic analyses of drugs and disease indications in RepurposeDB reveal pharmacological, biological and epidemiological factors influencing drug repositioning. Brief Bioinform
31. Geifman N, Bollyky J, Bhattacharya S, Butte AJ (2015) Opening clinical trial data: are the voluntary data-sharing portals enough? BMC Med 13:280
32. Greene CS, Garmire LX, Gilbert JA, Ritchie MD, Hunter LE (2017) Celebrating parasites. Nat Genet 49(4):483–484
33. Yao L, Zhang Y, Li Y, Sanseau P, Agarwal P (2011) Electronic health records: implications for drug discovery. Drug Discov Today 16(13–14):594–599
34. Wang G, Jung K, Winnenburg R, Shah NH (2015) A method for systematic discovery of adverse drug events from clinical notes. J Am Med Inform Assoc 22(6):1196–1204
35. Crosslin DR, Robertson PD, Carrell DS, Gordon AS, Hanna DS, Burt A, Fullerton SM, Scrol A, Ralston J, Leppig K et al (2015) Prospective participant selection and ranking to maximize actionable pharmacogenetic variants and discovery in the eMERGE network. Genome Med 7(1):67
36. Xu H, Aldrich MC, Chen Q, Liu H, Peterson NB, Dai Q, Levy M, Shah A, Han X, Ruan X et al (2015) Validating drug repurposing signals using electronic health records: a case study of metformin associated with reduced cancer mortality. J Am Med Inform Assoc 22(1):179–191
37. Kirkendall ES, Kouril M, Minich T, Spooner SA (2014) Analysis of electronic medication orders with large overdoses: opportunities for mitigating dosing errors. Appl Clin Inform 5(1):25–45
38. Ramirez AH, Shi Y, Schildcrout JS, Delaney JT, Xu H, Oetjens MT, Zuvich RL, Basford MA, Bowton E, Jiang M et al (2012) Predicting warfarin dosage in European-Americans and African-Americans using DNA samples linked to an electronic health record. Pharmacogenomics 13(4):407–418
39. Dewey FE, Murray MF, Overton JD, Habegger L, Leader JB, Fetterolf SN, O’Dushlaine C, Van Hout CV, Staples J, Gonzaga-Jauregui C et al (2016) Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 354(6319)
40. Yuille M, Dixon K, Platt A, Pullum S, Lewis D, Hall A, Ollier W (2010) The UK DNA banking network: a “fair access” biobank. Cell Tissue Bank 11(3):241–251
41. Wain LV, Shrine N, Artigas MS, Erzurumluoglu AM, Noyvert B, Bossini-Castillo L, Obeidat M, Henry AP, Portelli MA, Hall RJ et al (2017) Genome-wide association analyses for lung function and chronic obstructive pulmonary disease identify new loci and potential druggable targets. Nat Genet 49(3):416–425
42. Edgar R, Domrachev M, Lash AE (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30(1):207–210
43. Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E, Dylag M, Kurbatova N, Brandizi M, Burdett T et al (2015) ArrayExpress update – simplifying data submissions. Nucleic Acids Res 43(Database issue):D1113–D1116
44. Wickham H (2016) ggplot2: elegant graphics for data analysis, 2nd edn. Springer
45. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95
46. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
47. Bastian M, Heymann S, Jacomy M (2009) Gephi: an open source software for exploring and manipulating networks. Icwsm 8:361–362
48. Li L, Greene I, Readhead B, Menon MC, Kidd BA, Uzilov AV, Wei C, Philippe N, Schroppel B, He JC et al (2017) Novel therapeutics identification for fibrosis in renal allograft using integrative informatics approach. Sci Rep 7:39487
49. Chen B, Wei W, Ma L, Yang B, Gill RM, Chua MS, Butte AJ, So S (2017) Computational discovery of niclosamide ethanolamine, a repurposed drug candidate that reduces growth of hepatocellular carcinoma cells in vitro and in mice by inhibiting cell division cycle 37 signaling. Gastroenterology 152(8):2022–2036
50. Chen R, Li L, Butte AJ (2007) AILUN: reannotating gene expression data automatically. Nat Methods 4(11):879
51. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98(9):5116–5121
52. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106
53. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, Murino L, Tagliaferri R, Brunetti-Pierri N, Isacchi A et al (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S A 107(33):14621–14626
54. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935
55. Kidd BA, Wroblewska A, Boland MR, Agudo J, Merad M, Tatonetti NP, Brown BD, Dudley JT (2016) Mapping the effects of drugs on the immune system. Nat Biotechnol 34(1):47–54
56. Hanzelmann S, Castelo R, Guinney J (2013) GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14:7
57. Dudley JT, Butte AJ (2010) In silico research in the era of cloud computing. Nat Biotechnol 28(11):1181–1185
58. Beaulieu-Jones BK, Greene CS (2017) Reproducibility of computational workflows is automated using continuous analysis. Nat Biotechnol 35(4):342–346
59. Ramasamy A, Mondry A, Holmes CC, Altman DG (2008) Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med 5(9):e184
60. Klebanov L, Yakovlev A (2006) Treating expression levels of different genes as a sample in microarray data analysis: is it worth a risk? Stat Appl Genet Molec Biol 5(1):1–9
61. Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3(9):1724–1735
62. Dudley JT, Tibshirani R, Deshpande T, Butte AJ (2009) Disease signatures are robust across tissues and experiments. Mol Syst Biol 5:307
63. Campain A, Yang YH (2010) Comparison study of microarray meta-analysis methods. BMC Bioinformatics 11:408
64. Chen B, Ma L, Paik H, Sirota M, Wei W, Chua MS, So S, Butte AJ (2017) Reversal of cancer gene expression correlates with drug efficacy and reveals therapeutic targets. Nat Commun (in press)
65. Chen B, Greenside P, Paik H, Sirota M, Hadley D, Butte AJ (2015) Relating chemical structure to cellular response: an integrative analysis of gene expression, bioactivity, and structural data across 11,000 compounds. CPT Pharmacometrics Syst Pharmacol 4(10):576–584
66. Smith C (2003) Drug target validation: hitting the target. Nature 422(6929):341, 343, 345 passim
67. Chen B, Sirota M, Fan-Minogue H, Hadley D, Butte AJ (2015) Relating hepatocellular carcinoma tumor samples and cell lines using gene expression data in translational research. BMC Med Genet 8(Suppl 2):S5
68. Domcke S, Sinha R, Levine DA, Sander C, Schultz N (2013) Evaluating cell lines as tumour models by comparison of genomic profiles. Nat Commun 4:2126
69. Hefti FF (2008) Requirements for a lead compound to become a clinical candidate. BMC Neurosci 9(Suppl 3):S7
70. Empfield JR, Leeson PD (2010) Lessons learned from candidate drug attrition. IDrugs 13(12):869–873
71. Hughes JP, Rees S, Kalindjian SB, Philpott KL (2011) Principles of early drug discovery. Br J Pharmacol 162(6):1239–1249
72. Meanwell NA (2011) Improving drug candidates by design: a focus on physicochemical properties as a means of improving compound disposition and safety. Chem Res Toxicol 24(9):1420–1456
73. Bate A, Juniper J, Lawton AM, Thwaites RM (2016) Designing and incorporating a real world data approach to international drug development and use: what the UK offers. Drug Discov Today 21(3):400–405
74. Cipparone CW, Withiam-Leitch M, Kimminau KS, Fox CH, Singh R, Kahn L (2015) Inaccuracy of ICD-9 codes for chronic kidney disease: a study from two practice-based research networks (PBRNs). J Am Board Fam Med 28(5):678–682
75. Chung CP, Rohan P, Krishnaswami S, McPheeters ML (2013) A systematic review of validated methods for identifying patients with rheumatoid arthritis using administrative or claims data. Vaccine 31(Suppl 10):K41–K61
76. Wei WQ, Teixeira PL, Mo H, Cronin RM, Warner JL, Denny JC (2016) Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J Am Med Inform Assoc 23(e1):e20–e27
77. Yoon D, Ahn EK, Park MY, Cho SY, Ryan P, Schuemie MJ, Shin D, Park H, Park RW (2016) Conversion and data quality assessment of electronic health record data at a Korean tertiary teaching hospital to a common data model for distributed network research. Healthc Inform Res 22(1):54–58
78. Barrows RC Jr, Clayton PD (1996) Privacy, confidentiality, and electronic medical records. J Am Med Inform Assoc 3(2):139–148
79. Shameer K, Badgeley MA, Miotto R, Glicksberg BS, Morgan JW, Dudley JT (2017) Translational bioinformatics in the era of real-time biomedical, health care and wellness data streams. Brief Bioinform 18(1):105–124
80. Davis S, Meltzer PS (2007) GEOquery: a bridge between the gene expression omnibus (GEO) and BioConductor. Bioinformatics 23(14):1846–1847
81. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140
82. Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J (2006) RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics 22(22):2825–2827
Chapter 7

How to Prepare a Compound Collection Prior to Virtual Screening
Cristian G. Bologa, Oleg Ursu, and Tudor I. Oprea

Abstract
Virtual screening is a well-established technique that has proven to be successful in the identification of
novel biologically active molecules, including drug repurposing. Whether for ligand-based or for structure-
based virtual screening, a chemical collection needs to be properly processed prior to in silico evaluation.
Here we describe our step-by-step procedure for handling very large collections (up to billions) of
compounds prior to virtual screening.

Key words Cheminformatics, Drug discovery, Online services, Property filtering, Unwanted struc-
tures, Virtual screening

1 Introduction

The process of drug discovery is complex and requires an interdis-


ciplinary effort to design effective and safe drugs. The cost of
developing a new drug that gains market approval is currently
more than $2.6 billion [1]. A significant portion of these costs
goes toward preclinical research and development. High-throughput
screening (HTS) is widely used in both the pharmaceutical industry
and academia to discover new ligands for receptors, enzymes, ion
channels, and other pharmacological targets. This approach relies on
automation to screen large numbers of compounds to identify
bioactive hits. Because the median hit rate for HTS is typically
around 0.3% [2], it requires significant resources and large compound
libraries. Virtual screening [3, 4] can evaluate millions of
compounds at a fraction of the cost of a full HTS screen and
frequently yields higher hit rates [5].
There are two major categories of virtual screening, structure-
based and ligand-based, and their workflow often incorporates
ADMET profiling [6]. Structure-based virtual screening uses the
knowledge of the target three-dimensional structure to dock tested
compounds to known or proposed binding sites. Ligand-based


virtual screening exploits the knowledge of known active and inac-


tive molecules through chemical, shape, and pharmacophore simi-
larity searches. Depending on what is already known about a target
and its ligands, different virtual screening approaches are preferred.
When both the target crystal structure and ligand information are
available, data fusion approaches can be employed to derive
a consensus score.
There is an unprecedented availability of chemical and
biological structure-property data, facilitated by the production
and availability of large-scale bioactivity databases, e.g., PubChem
[7] and ChEMBL [8]. Indeed, big data approaches have largely
eliminated the issue of data access, accelerating the development of
predictive methods that complement high-throughput biology and
chemistry technologies.
Cheminformatics technologies initially served a retrospective
function, i.e., “mixing [. . .] information resources to transform
data into information and information into knowledge for the
intended purpose of making better decisions” [9]. However, the
demand for predictive models starting from molecular structures is
increasing, moving the emphasis on prospective use, i.e., virtual
screening. The accuracy of prospective cheminformatics models
relies on deep knowledge mining [10], increased data completeness
[11], and advances in machine learning [12]. Paralleled by signifi-
cant improvements in hardware and computational power, these
technologies have resulted in increasingly more efficient approaches
to drug discovery and active compound design.
Our group has contributed (Fig. 1) to the development of
several classes of novel small molecule antagonists for G protein-
coupled formyl peptide [13, 14] and estrogen [15] receptors
(FPR1, FPR2, and GPER, respectively), the first selective GPER
agonist [16] G-1, as well as an inhibitor for GLUT5, a fructose-
only transporter [17]. Cheminformatics-based alternatives to vir-
tual screening, such as hierarchical treelike classification of scaffolds
[18], successfully used to identify new bioactive compounds for the
estrogen receptor alpha (ERα) and for 5-lipoxygenase [19], are not
discussed here. However, an interactive open-source tool for hier-
archical tree scaffold navigation is freely available under the GNU
General Public License [20].
In this chapter, we provide a step-by-step description of the
compound preparation procedures prior to virtual screening, based
on our three decades of cumulative experience with virtual screening. We
consider this to be one of the most critical steps in the virtual
screening process: without proper normalization and filtering
of the input structures, the chance of finding promiscuous hits
increases, as do attrition rates in drug discovery. This
procedure is also relevant to scaffold-based navigation, as well as
compound acquisition [21].

Fig. 1 Chemical structures, identifiers, PubChem IDs, and confirmed biological


activities for four chemical probes identified using an integrated virtual and
biomolecular screening approach. ERα and ERβ are the nuclear estrogen recep-
tors alpha and beta, respectively

What is virtual screening? It is a cheminformatics technology


designed to computationally evaluate large numbers of com-
pounds, in order to rapidly identify structures of interest to be
submitted for biological assays. Its experimental counterpart,
high-throughput screening (HTS), is also aimed at sifting through
a large number of structures, (often) based on single-point, single-
experiment results. Both procedures rely on our ability to process,
using cheminformatics tools, a large number of structures
[5]. However, post-HTS analyses [22] are often clouded by the
presence of reactive species [23] or optically interfering [24] com-
ponents that can be the result of sample degradation in biochemical
assays [25], the tendency of chemicals to aggregate [26] or to turn
up as frequent hitters [27]. Online computational tools geared to
remove “unwanted” or “promiscuous” molecular species are now
available, as highlighted below.

The logical progression HTS primary hits → HTS secondary
hits → HTS confirmed actives → chemical probe series [→ lead
series] → drug candidate → launched drug has emerged as a result
of two forces:
1. Stronger demand in the pharmaceutical industry to provide
good quality candidate drugs, as well as good quality
leads [28].
2. A significantly increased effort in the academic sector to screen
large amounts of chemicals, e.g., within the NIH Roadmap
Molecular Libraries Initiative [29], and identify chemical
probes [30], with the ultimate goal of assisting drug
discovery [31].
Genomics, side-effect and pharmacovigilance evaluation
[32, 33], screening drug libraries against neglected diseases [34],
drug-target identification from side-effect data mining [35], and
finding novel drug targets using ligand-based virtual screening [36]
are some of the recently emerged strategies from the academic
sector to identify novel drugs and drug targets as well as new uses
for old drugs [37]. Within the Clinical and Translational Science
Awards framework [38], academic investigators are more likely to
“de-risk” compounds of industrial interest [31].
A set of property filters known as the “rule of five” (Ro5) [39]
has been adopted by both pharmaceutical industry and academia,
with the goal of restricting small molecule synthesis to the property
space defined by ClogP (octanol/water partition coefficient), MW
(molecular weight), HDO (number of hydrogen bond donors),
and HAC (number of hydrogen bond acceptors). The property
distribution of chemical and drug databases in the Ro5 space is well
understood [40] and is remarkably similar when comparing drugs
with “non-drugs” [41]. However, the chemical space coverage of
known drugs is severely limited, when compared to other com-
pound collections [42]. Many chemical libraries are now Ro5 com-
pliant, and a significant proportion of these chemicals can be
described as “drug-like” [41].
Smaller and simpler chemicals are easier to optimize [43]
toward the drug candidate status. The lead-like concept, now
firmly established in drug discovery [44], bears relevance to the
space of chemical probes as well [45], which is quite different from
the property space occupied by high-activity molecules found in
medicinal chemistry literature [28]. We begin our in-depth descrip-
tion of the compound preparation procedures within the context of
“unwanted structure” removal [25] and of property filtering [44],
since removal of promiscuous or just plain undesired chemicals will
result in a reduced search space and shorter CPU time. How-
ever, the responsibility of implementing such criteria in chemical
database evaluation resides with the end user and should be

regarded as context-dependent. We describe our pipeline, based on


experience with software from OpenEye [46], MESA [47], and
ChemAxon [48]. Software with similar functionality is also avail-
able from BIOVIA (Pipeline Pilot) [49], Chemical Computing
Group [50], Certara [51], and open-source packages [52].

2 Materials

1. Software to convert chemical structures based on standard file


formats (e.g., SDF, mol2, etc.) to canonical isomeric SMILES
[53], InChI [54], or equivalent representations of chemical
structures, e.g., [55, 56].
2. Software to handle canonical isomeric SMILES (or equivalent)
and provide chemical fingerprints, e.g., using ChemAxon [48],
OpenEye [57], Sybyl-X [51], Mesa Analytics and Computing
[58], MDL Keys [59], or Chemical Computing Group’s MOE
[60]. Several alternatives are available under the open-source,
open-access agreement, such as OpenBabel [61] and the
Chemistry Development Kit, CDK [62].
3. Software to compute chemical properties from structures, for
example, to calculate the octanol/water partition coefficient,
LogP [63] with CLogP [64] and KowWIN [65] or ALogPS
[66], among many LogP predictors; some of these property
calculators are also available under the open-access agreement
at VCCLAB.org [67] and at Molinspiration [68].
4. Software to compute molecular descriptors and chemical
fingerprints [69].
5. Software to cluster chemical structures from fingerprints or
from computed properties [70, 71]; cheminformatics applica-
tions of clustering were discussed elsewhere [72].
6. Software to convert SMILES (or equivalent) into 3D coordi-
nates using ChemAxon, CORINA [73], OMEGA [74],
Molinspiration [68], etc.; alternatively, PubChem, ChemSpi-
der, and BIOVIA allow end users to use 3D coordinates for
chemical structures.
7. Software to appropriately handle experimental design based on
multidimensional spaces, e.g., MODDE from Sartorius [75].

3 Methods

3.1 Assemble the Collection(s)

Most pharmaceutical companies have procured large compound
collections (10⁶–10⁷ molecules), including marketed drugs and
other high-activity compounds, which we termed Reals
[44]. The academic sector has a similar collection (440,000

chemicals as of January 2018), the Molecular Libraries Small Mole-


cules Repository, MLSMR [76]. Reals are a valuable resource and
are routinely screened against novel targets. One can argue that
these collections reflect the chemistry used to address targets from
the past, and that novel targets require novel chemistry, since the
intellectual property landscape for yesterday’s chemistry is by now
overcrowded. However, such arguments do not exclude Reals from
being considered for virtual screening. These physical collections
also include commercially available chemicals, the superset of these
being the Tangibles—termed this way because one can conceivably
acquire them or synthesize them in-house using tractable chemistry
[44]. Any collection prepared for virtual screening should sample
both the in-house and the “external” chemical spaces. In addition
to Reals and Tangibles, one can also consider the Virtuals. Up to
and including 17 non-hydrogen atoms, Virtuals exceed 166 billion
organic small molecules [77]; these cannot be all made (at least
with current chemistry) but can essentially be used as a theoretical
“resource” for virtual screening. If the goal is to find the shortest
path from virtual screen to clinical practice, then priority should be
given to marketed drugs, which is a significantly smaller collection,
i.e., below five thousand small molecules and biologics [78].
Having appropriate informatics systems to access these virtual
and existing compounds via fingerprints, 2D or 3D descriptors, or
other measured or computed property spaces is key to the screening
strategy. The largest collections of Tangibles are summarized in
Table 1. As of January 2018, these collections exceed 90 million
unique chemicals (e.g., PubChem), from more than 900 unique
suppliers (e.g., BIOVIA). Some collections (ZINC, PubChem) are
available at no cost. However, other resources are available on a
subscription basis only (also indicated in Table 1). Large Virtuals
subsets, derived from systematic enumeration [77], can be down-
loaded from http://gdb.unibe.ch/.

3.2 Clean Up the Collection

There is no “perfect” chemical database, unless it contains rather
simple molecules (e.g., NaCl, H2O) or a rather small number of molecules.
The user needs to spend significant effort in cleaning up the
collection, whether it includes Virtuals, Reals, or Tangibles. Some
chemical vendors provide their own solution to this problem. We
prefer FILTER, a program available from OpenEye [79], although
one can “wash” a chemical collection via MOE [50] or the Pipeline
Pilot from BIOVIA [49]. Regardless of the method used, the user
needs to take some early decisions regarding the collection’s
“make-up.” One obvious suggestion is to remove (or at least flag)
“unwanted” chemical structures, such as those depicted in Fig. 2;
other substructures are discussed below.

Table 1
Sources for chemical databases available for evaluation

Company name | Web address | Number of compounds | Description

Aldrich Market Select | https://www.sigmaaldrich.com/chemistry/chemistry-services/aldrich-market-select.html | 14 million in-stock products (no “virtual chemistry”), over eight million unique structures | More than 80 of the most reliable suppliers in the compound and building block market

MCULE | https://mcule.com/database/ | Over 5.6 million unique in-stock compounds | An integrated drug discovery platform providing IT infrastructure, drug discovery tools, a high-quality compound database, and professional compound delivery

BIOVIA Available Chemicals Directory | http://accelrys.com/products/collaborative-science/databases/sourcing-databases/biovia-available-chemicals-directory.html | Over 25 million products, over ten million unique chemicals, including 3D models | More than 900 suppliers; access fee required

ZINC | http://zinc.docking.org/browse/subsets/ | Over 120 million potentially purchasable compounds, more than 12 million in stock | Extracted from over 200 catalogs of purchasable compounds; the entire data collection is available for free

ChemSpider | http://www.chemspider.com/ | 63 million chemical structures | Almost 300 data sources; not known how many chemicals are commercially available; the entire set is NOT available for free download (only 1000 structures/day)

eMolecules | http://www.emolecules.com/ | Over seven million screening compounds | 20 “reliable” chemical suppliers; structures available for free, access fee required for vendor and price information

PubChem | http://pubchem.ncbi.nlm.nih.gov/ | Over 230 million substances, over 90 million unique chemicals | 580 data sources (300 chemical vendors); not all chemicals are commercially available; the entire collection is available for free

Fig. 2 Examples of chemical substructures that may cause interference with biochemical assays [25] under
HTS conditions. Modified from Ref. [40] with permission

3.2.1 Remove Garbage from the Collection

Split covalently bound salt counterions, remove small fragments
(salts), and normalize charges. This is clearly an instance where the user
is confronted with multiple choices. For typical pharmaceutical
screening, it is advisable to remove unwanted structures, such as
those depicted in Fig. 2. One should always consider “unwanted”
structures in context—for example, a large number of antineoplas-
tic agents would be considered as “reactive species” according to
Fig. 2. Furthermore, many flavor compounds are monofunctional
aldehydes. Thus, when seeking actives in oncology or in flavor
science, certain substructure-based filters need to be individually
evaluated and, quite likely, excluded.
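As a concrete illustration, the sketch below uses the open-source RDKit (one of the toolkits listed in Table 2); FILTER, MOE, and Pipeline Pilot provide equivalent functionality. It keeps the largest fragment (stripping salt counterions) and neutralizes charges; the choices shown are assumptions for a typical pharmaceutical screening library, not a universal recipe.

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def clean_structure(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # unparseable entry: flag for review
        return None
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)  # strip salts
    mol = rdMolStandardize.Uncharger().uncharge(mol)             # normalize charges
    return Chem.MolToSmiles(mol)         # canonical SMILES of the parent

print(clean_structure("CC(=O)[O-].[Na+]"))  # -> 'CC(=O)O' (acetate parent)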

3.2.2 Verify Molecular Structure Integrity

In order to be correctly understood and processed by a computer,
structures must be entered in a “computer-friendly” format,
which is not necessarily “human-friendly”. A significant number of
molecules may be incorrectly perceived by the computer [80]
because of chiral centers that do not observe the Cahn-Ingold-
Prelog convention, or because they are bridged structures, etc.
This can become a significant source of errors in structure-activity
relationship studies [81, 82]. Because visual inspection for all the
structures is not an option in really large collections, one has to use
an automated procedure for detection (and perhaps correction) of
some of the invalid entries. If specialized software for this opera-
tion, e.g., CheD [83] or Structure Checker [48], is not available,
good results in detecting errors can be achieved after converting the
original structural files (usually in SDF format) to SMILES using
two or more conversion tools (OpenEye babel, ChemAxon mol-
convert, etc), followed by canonicalization (option smioCanonical
in OpenEye babel, option u in ChemAxon SMILES export
options, etc.) and then by comparing the resulting SMILES. The
number of invalid entries differs significantly among chemical ven-
dors, ranging from under 0.05% to 10% or higher. A totally auto-
mated method for error detection and removal of faulty structures
needs to be implemented prior to large-scale screening of any
collection, be it Reals or Virtuals.
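A hedged sketch of the dual-converter cross-check described above, assuming both RDKit and OpenBabel are installed: each SDF record is parsed independently by the two toolkits, both results are re-canonicalized by the same engine (the two toolkits' native canonical SMILES are not directly comparable), and disagreements or outright parsing failures are flagged.

from rdkit import Chem
from openbabel import pybel

def consistent_record(sdf_block: str) -> bool:
    """True if RDKit and OpenBabel agree on the structure's identity."""
    rd_mol = Chem.MolFromMolBlock(sdf_block)
    if rd_mol is None:
        return False                       # RDKit cannot perceive the record
    try:
        ob_smiles = pybel.readstring("mol", sdf_block).write("smi").split()
    except Exception:
        return False                       # OpenBabel failed to parse it
    ob_as_rd = Chem.MolFromSmiles(ob_smiles[0]) if ob_smiles else None
    if ob_as_rd is None:
        return False
    # canonicalize both through one engine so the strings are comparable
    return Chem.MolToSmiles(rd_mol) == Chem.MolToSmiles(ob_as_rd)

Flagged records are best treated as candidates for manual review rather than automatic deletion, since mismatches can also reflect legitimate differences in aromaticity or stereochemistry perception between toolkits.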

3.2.3 Generate Unique, Normalized SMILES

Once canonical SMILES are derived, one should store just unique
SMILES by verifying structure identity while ignoring compound
IDs or chemical names. If the Virtuals or Tangibles are compiled
from a large number of vendors, there is a good chance
that this will clean up 50% or more of the starting collection. At this
step, it is advisable to use a list of “preferred” or “trusted” vendors
first. Such lists are developed with time—so first-time users must
take some risks in this step. Whenever the budget is limited, a script
to rank low-price structures first could be used. Besides price,
minimum required purity and maximum guaranteed shipping win-
dow are other important criteria to be considered when selecting
from multiple vendors that sell the same compounds.
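A minimal deduplication sketch along these lines: canonical SMILES serve as the identity key, and duplicates are resolved by a trusted-vendor rank and then by price. The field names and the vendor preference table are illustrative assumptions.

from rdkit import Chem

TRUSTED = {"VendorA": 0, "VendorB": 1}    # hypothetical preference order

def deduplicate(entries):
    """entries: iterable of dicts with 'smiles', 'vendor', 'price' keys."""
    best = {}
    for e in entries:
        mol = Chem.MolFromSmiles(e["smiles"])
        if mol is None:
            continue                      # invalid entries were handled earlier
        key = Chem.MolToSmiles(mol)       # unique canonical SMILES
        rank = (TRUSTED.get(e["vendor"], 99), e["price"])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, e)
    return [e for _, e in best.values()]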

3.2.4 Other Unwanted Substructures

Based on the cumulative experience with high-throughput screening
using the MLSMR collection (available in PubChem), we have
identified a set of 100 scaffolds which have a significantly
higher-than-average tendency to exhibit bioactivity in primary
assays. While some of these scaffolds may be legitimately present
in hits under certain bioassay conditions, we have implemented a
SMARTS-based removal procedure for such promiscuous patterns
at our open-access website [84]. In addition to Badapple (promis-
cuous pattern removal) [85], we offer three different FILTER-
compliant tools, as well as a fragment-based drug-like filter
[41]. These tools, summarized in Table 2, can be used free of
charge for the purpose of academic research.
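The sketch below shows the mechanics of SMARTS-based substructure flagging in the spirit of Smartsfilter and Badapple (Table 2), using RDKit. The two patterns shown (a quinone and an acyl halide, both Fig. 2-type interference motifs) are illustrative; production filters rely on curated SMARTS sets such as those cited above.

from rdkit import Chem

UNWANTED_SMARTS = {
    "quinone": "O=C1C=CC(=O)C=C1",
    "acyl_halide": "C(=O)[Cl,Br,I]",
}
UNWANTED = {name: Chem.MolFromSmarts(s) for name, s in UNWANTED_SMARTS.items()}

def flag_unwanted(smiles: str):
    """Return the names of all unwanted patterns found in the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ["unparseable"]
    return [name for name, patt in UNWANTED.items()
            if mol.HasSubstructMatch(patt)]

print(flag_unwanted("O=C1C=CC(=O)C=C1"))  # -> ['quinone']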

3.3 FILTER for Lead-Likeness

After cleanup, the collection can be processed to remove compounds
that do not have certain 2D-based properties, for example,
those that define leads [25, 43, 44]. Compounds that pass the
filters can be prioritized for virtual screening. When examining a
collection of natural products, however, lead-like filters may be
unsuitable [42]. When a set of lead-like structures is desired, we

Table 2
UNM biocomputing public web applications (available at http://datascience.unm.edu/public-
biocomputing-apps)

Application Description Powered by


Badapple Bioactivity datamining associative promiscuity ChemAxon
pattern learning engine
ClusterMols Cluster molecular datasets Mesa, OpenEye, RDKit, Scitouch
Convert Convert mol formats ChemAxon
Depict Depict molecules ChemAxon
DGeom Distance-geometry conformer generation RDKit, JSmol
DrugCentral Drug knowledge integration database Django, PostgreSql, ChemAxon
Drug- Drug-likeness using DRUGS/ACD fragments ChemAxon
likeness frequencies
iPHACE Integrative navigation in pharmacological space ChemAxon
JSME JavaScript molecular editor peter.ertl.com
MolProps Molecular properties and aggregate stats OpenBabel, RDKit, Scitouch,
Gnuplot, Gnuplot-Py
Rockit ROC-curve plotter R, ROCR, RPy, Gnuplot, Gnuplot-
Py
SIM2D 2D similarity ChemAxon
Smartsfilter SMARTS filtering with built in Glaxo, Blake, and ChemAxon
Oprea SMARTS sets

recommend clustering (see Subheading 3.7) the “nonlead-like”
set and including a representative subset of these compounds (up to 30%),
as they are likely to capture additional scaffolds. It remains the
responsibility of the end user to apply, or discard, the lead-like
concept, or to adjust the parameters prior to acquiring/screening
compounds. Some of our suggestions for exclusions according to
lead-likeness are as follows (see also Note 1; a minimal sketch
implementing these rules follows the list):
• More than four rings.
• More than three fused aromatic rings (avoid polyaromatic rings, since they are likely to be processed by cytochrome P450 enzymes, leading to epoxides and possibly other carcinogens).
• HDO > 4; HDO ≤ 5 is one of the Ro5 criteria, but 80% of drugs have HDO ≤ 3 [40].
• More than four halogens, except fluorine (avoid “pesticides”); a notable exception is the crop-protectant business; in such situations, the collection must be processed with entirely different criteria.
• More than two CF3 groups (avoid highly halogenated molecules).
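A minimal sketch of these exclusion rules using RDKit descriptors; the aromatic-ring count is used as a stand-in for the fused-aromatic-ring rule (proper fused-ring-system perception requires more machinery), and the thresholds simply mirror the list above.

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

CF3 = Chem.MolFromSmarts("C(F)(F)F")

def passes_leadlike(mol: Chem.Mol) -> bool:
    if rdMolDescriptors.CalcNumRings(mol) > 4:
        return False
    if rdMolDescriptors.CalcNumAromaticRings(mol) > 3:   # proxy for fused rule
        return False
    if rdMolDescriptors.CalcNumHBD(mol) > 4:             # HDO > 4
        return False
    heavy_halogens = sum(1 for a in mol.GetAtoms()
                         if a.GetAtomicNum() in (17, 35, 53))  # Cl, Br, I
    if heavy_halogens > 4:
        return False
    if len(mol.GetSubstructMatches(CF3)) > 2:
        return False
    return True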
The “unwanted” list is likely to reflect a “cultural bias” that is
particular to each company. For example, companies active in con-
traceptive research might regard steroids favorably at this stage,
whereas other companies may want to actively exclude them from
the collection at an early stage. Similar arguments could be made
for the β-lactam (e.g., penicillins) and cephem (e.g., cephalosporins)
scaffolds, as well as for peptides. An additional step may include
removal of known redox cycling compounds [86], frequent hitters
[27] or promiscuous binders [26], and the removal of compounds
that contain fragments responsible for cytotoxicity (see Fig. 2).
The effort to systematically evaluate the collection can be
regarded as the initial step, since in-depth manipulation is likely to
take place only once prior to all virtual screens, assuming that targets
are similar and that the drug discovery projects have similar goals,
e.g., orally available drugs that should (or should not) permeate the
blood-brain barrier. However, the screening set may be just the
Tangibles or the known drugs subsets. The collection may therefore
require very different processing criteria, which are target-
dependent and goal-dependent: targets located in the lung require
a different pharmacokinetic profile, e.g., for inhalation therapy,
compared to targets located in the urinary tract that may require
good aqueous solubility at pH 5 or on the skin (LogP between
5 and 7 is ideal for such topical agents). Such biases need to be
considered as early as possible in the library assembly stage, because
they reduce the size of the chemical space that needs to be sampled.

3.4 Search for Similarity If Known Active Molecules Are Available

Whenever high-activity molecules are available from the literature,
patents, or in-house data, the user is advised to perform a
similarity search on the entire Virtuals or Tangibles for similar
molecules (see Subheading 3.7) and to actively seek to include
them in the virtual screening deck, even though they might have
been removed during the previous steps. These molecules should
serve as positive controls, i.e., they should be retrieved at the end of
the virtual or high-throughput screen as “hits,” if the similarity
principle holds (see also Note 2).
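A hedged sketch of such a 2D similarity search using RDKit Morgan fingerprints: every library compound is ranked by its best Tanimoto similarity to any known active. The fingerprint parameters and the 0.7 cutoff are illustrative choices, not recommendations from the text.

from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def find_analogs(actives_smiles, library_smiles, cutoff=0.7):
    fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
    active_fps = [fpgen.GetFingerprint(Chem.MolFromSmiles(s))
                  for s in actives_smiles]
    hits = []
    for smi in library_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        fp = fpgen.GetFingerprint(mol)
        # best Tanimoto similarity to any known active
        best = max(DataStructs.TanimotoSimilarity(fp, afp)
                   for afp in active_fps)
        if best >= cutoff:
            hits.append((smi, best))
    return sorted(hits, key=lambda t: -t[1])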

3.5 Explore Alternative Structures

The user should seek alternate structures by modifying [87] the
canonical isomeric SMILES, since these may occur in solution or at
the ligand-receptor interface:
• Tautomerism, which shifts one hydrogen along a path of alternating single/double bonds, mostly involving nitrogen and oxygen (e.g., imidazole); the reader is encouraged to consult the tautomers issue of the Journal of Computer-Aided Molecular Design [88].
• Acid/base equilibria, which explore different protonation states by assigning formal charges to those chemical moieties that are likely to be charged (e.g., phosphate or guanidine) and by assigning charges to some of those moieties that are likely to be charged under different microenvironmental conditions (“chargeable” moieties such as tetrazole and aliphatic amine).
• Exploration of alternate structures whenever chiral centers are not specified (OpenEye’s omega2 flipper option, ChemAxon’s Stereoisomer Generator Plugin), since 3D structure conversion from SMILES in some cases does not “explode” all possible states. Other examples include pseudo-chiral centers such as pyramidal (“flappy”) nitrogen inversions that explore non-charged, nonaromatic, pseudo-chiral nitrogens (three substituents), since these are easily interconverted in three dimensions.
Exploring alternate structures is advisable prior to processing
any collection with computational means, e.g., for diversity analy-
sis (see also Note 3). The results will influence any virtual screen.
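The sketch below enumerates tautomers and unassigned stereocenters with RDKit, as an open-source stand-in for the ChemAxon and OpenEye tools named above. Protonation-state (“protomer”) enumeration requires a pKa model and is deliberately left out of this sketch.

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers

def expand_states(smiles: str):
    """Enumerate tautomers, then stereoisomers, of a parent structure."""
    mol = Chem.MolFromSmiles(smiles)
    out = set()
    for taut in rdMolStandardize.TautomerEnumerator().Enumerate(mol):
        for iso in EnumerateStereoisomers(taut):
            out.add(Chem.MolToSmiles(iso))
    return sorted(out)

# e.g., the 2-hydroxypyridine/2-pyridone tautomer pair:
print(expand_states("Oc1ccccn1"))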

3.6 Generate 3D Structures

The effort of exploring one or more conformers per molecule is
quite relevant for virtual screening and for other 3D methods (see
also Note 4). For example, one or multiple conformers per mole-
cule are evaluated during shape (ligand-based) and target
(structure-based) virtual screening. Some docking programs
require a separate 3D conversion step [89], e.g., using CORINA
[73], Catalyst [49], OMEGA [74], or Molinspiration [68]. A web-
site that discusses 3D conformer generation software and provides
links to other tools is available [90].
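A minimal SMILES-to-3D sketch using RDKit’s ETKDG conformer generator as an open-source stand-in for CORINA, OMEGA, or ChemAxon; the conformer count, random seed, and MMFF cleanup are illustrative defaults.

from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_3d(smiles: str, n_conformers: int = 5):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))   # explicit H for geometry
    params = AllChem.ETKDGv3()
    params.randomSeed = 42                          # reproducible embedding
    AllChem.EmbedMultipleConfs(mol, numConfs=n_conformers, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)          # brief MMFF94 relaxation
    return mol                                      # multi-conformer molecule

# write all conformers of a test molecule (aspirin) to an SD file
mol = smiles_to_3d("CC(=O)Oc1ccccc1C(=O)O")
writer = Chem.SDWriter("conformers.sdf")
for conf in mol.GetConformers():
    writer.write(mol, confId=conf.GetId())
writer.close()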

3.7 Select Chemical Structure Representatives

Screening compounds that are similar to known actives increases
the likelihood of finding new active compounds, but it may not lead
to different chemotypes, which are highly desirable in the industrial
context (see also Note 5). The severity of this situation is
increased if the original actives are covered by third-party patents
or if the lead scaffold is toxic. Sometimes, the processed collection
may simply be too large to be evaluated in detail or even to be
submitted to a virtual screen. In such cases, a strategy based on
clustering and perhaps on statistical molecular design (SMD) is a
better alternative, compared to random selection. Clustering meth-
ods aim at grouping molecules into “families” (clusters) of related
structures that are perceived—at a given resolution—to be different
from other chemical families. With clustering, the end user has the
ability to select one or more representatives from each family. SMD
methods aim at sampling various areas of chemical space and select-
ing representatives from each area. Some software is designed to
select compounds from multidimensional spaces, but the outcome
depends on several factors, as discussed below.

3.7.1 Chemical Descriptors

Chemical descriptors are used to encode chemical structures and
properties of compounds: 2D/3D binary fingerprints or counts of
different substructural features or perhaps (computed) physico-
chemical properties, e.g., MW, ClogP, HDO, and HAC, as well as
other types of steric, electronic, electrostatic, topologic, or
hydrogen-bonding descriptors. The choice of what descriptors to
use, and in what context, depends on the size of collection, on the
software and hardware available, as well as on the time constraints
given for a particular selection process.

3.7.2 Similarity (Dissimilarity) Measure

Chemical similarity is used to quantify the “distance” between a
pair of compounds (dissimilarity, or 1 minus similarity) or how
related the two compounds are (similarity). The basic tenet of
chemical similarity is that molecules exhibiting similar features are
expected to have similar biologic activity [91], although this has
been challenged by the same author, who highlights the existence
of “activity cliffs” where similarity fails [92]. Since most inferences
in bioactivity discovery remain rooted on similarity, we continue to
use chemical (or molecular) similarity. By definition, similarity
relates to a particular framework: that of a descriptor system
(a metric by which to judge similarity), as well as that of an object
or class of objects; we need a reference point to which objects can
be compared [93]. Similarity depends on the choice of molec-
ular descriptors [94], the choice of the weighting scheme(s), and
the similarity coefficient itself. The coefficient is typically based on
Tanimoto’s symmetric distance-between-patterns [95] and on
Tversky’s asymmetric contrast model [96]. Multiple types of meth-
ods are available for chemical similarity evaluation [91, 97–100].

3.7.3 Clustering Algorithms

Clustering algorithms can be classified using many criteria and
implemented in different ways (see Subheading 2, item 5, for a
short list of clustering software). Hierarchical clustering methods
have traditionally been used to a greater extent, in part due to
computational simplicity; more recently, chemical structure
classifications have begun to examine nonhierarchical methods. In
practice, the choice among factors (descriptors, similarity measure,
clustering algorithm) also depends on the available hardware and
software resources, the size and diversity of the collection to be
clustered, and not least on the user’s experience in producing a
useful classification with the ability to predict property values. We
prefer Mesa’s clustering method [70] for its
ability to provide asymmetric clustering and to deal with the “false
singletons” (borderline compounds that are often assigned to one
of at least two equally distant chemical families).
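For orientation, the sketch below clusters a library with the Butina algorithm shipped with RDKit; Mesa’s asymmetric method, preferred above, is a commercial alternative. The 0.35 distance cutoff (1 minus Tanimoto similarity) is an illustrative resolution, not a recommendation.

from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.ML.Cluster import Butina

def cluster_library(smiles_list, cutoff=0.35):
    fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
    fps = [fpgen.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]
    # condensed lower-triangle distance matrix (1 - Tanimoto similarity)
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)
    return clusters  # tuple of tuples; the first member of each is the centroid

Selecting the centroid (and optionally one or two peripheral members) of each cluster yields a representative screening subset.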

3.7.4 Statistical Molecular Design

SMD can be applied to rationally select collection representatives,
as illustrated for building block selection in combinatorial synthesis
planning [101]. Various methods for experimental design [102],
such as fractional factorial or composite designs, can be applied for
sampling large solution spaces, in particular if only a rather small
screening deck can be investigated in the first round.

3.7.5 Randomness

Finally, the “unexpected,” that component which invites chance, as
discussed by Taleb [103, 104], justifies the random inclusion of a
particular subset of molecules to the virtual screening deck. These
molecules should not be subject to any processing (other than
correct structural representation, normalization, and tautomer/
protomer representation), i.e., they should be entirely random.
We cannot document if randomness is more successful compared
to rational methods, nor do we suggest that criteria for rational
selection should be taken lightly. However, serendipity plays a
major role in drug discovery [105]. Therefore, we should allow a
certain degree of randomness in the final selection. If randomly
selected compounds are included, the final list of compounds
should be verified, once more, for uniqueness—to avoid duplicates.

4 Notes

1. Unless justified by prior data, it may be useful to filter out


molecules that contain:
(a) More than nine connected single bonds not in a ring, or
more than eight connected unsubstituted single bonds
not in a ring.

(b) Macrocycles with more than 22 atoms in a ring or macro-


cycles with more than 14 flexible bonds prior to virtual
screening, as high flexibility has been shown to decrease
the accuracy of docking [106].
2. Wherever the 3D structure of the bioactive conformation is
available, e.g., an active ligand co-crystallized in the target
binding site, a 3D similarity search should be performed in
conjunction with a 2D-based one. These queries are likely to
yield different, often nonoverlapping, results. Submitting
hits from both searches to biomolecular screening and other
experiments is preferred.
3. If alternative structures are not explored prior to virtual screen-
ing, the method will sample only a limited state of the “parent”
compounds. These changes are likely to occur in reality, since
the receptor and the solvent environment or simple Brownian
motion will influence the particular 3D and chemical state
(s) that the parent molecule is sampling. Their combinatorial
explosion needs to be, within limits, explored at the SMILES
level, before the 3D structure generation step.
4. Wherever possible, a combination of 2D and 3D methods for
virtual screening is preferred. We have shown that, when the
query molecule is a steroid, 2D methods will invariably yield
steroid-containing molecules as top-ranking hits [107].
5. Primary literature and patents should always be consulted prior
to launching a virtual screening campaign. A variety of online
tools are available for seeking bioactive molecules: PubChem
and ChEMBL for literature, SureCHEMBL [108] for patents,
Probe Miner [109] for chemical probes, the Guide to Pharma-
cology [110] for pharmacological substances, and DrugCentral
for approved drugs. The latest version of the
Protein Data Bank [111] should be consulted with respect to
availability of 3D structures for the target of interest.

5 Conclusions

The above procedure can be summarized as follows:


1. Assemble the collection starting from in-house and online
databases.
2. Clean up the collection by removing “garbage,” verifying
structural integrity, and making sure that only unique struc-
tures are screened.
3. Perform property filtering to remove unwanted structures
based on substructures or property profiling or various scoring
schemes; the collection can become the virtual screening set at
this stage or can be further subdivided in a target- and project-
dependent manner.
4. Use similarity to given actives to seek compounds with related
properties.
5. Explore the possible stereoisomers, tautomers, and protomers.
6. Generate the 3D structures in preparation for virtual screening
or for computation of 3D descriptors.
7. Use clustering or statistical molecular design to select com-
pound representatives for acquisition.
8. Add a random subset to the final list of compounds. The final
list can now be submitted for virtual screening.

Acknowledgment

This work was supported, in part, by NIH grants R21GM095952,


U54MH084690, and U24CA224370. We thank Jeremy Yang for
useful discussions.

References

1. Avorn J (2015) The $2.6 billion pill – methodologic and policy considerations. N Engl J Med 372:1877–1879
2. Sukuru SCK, Jenkins JL, Beckwith REH, Scheiber J, Bender A, Mikhailov D, Davies JW, Glick M (2009) Plate-based diversity selection based on empirical HTS data to enhance the number of hits and their chemical diversity. J Biomol Screen 14:690–699
3. Horvath D (1997) A virtual screening approach applied to the search for trypanothione reductase inhibitors. J Med Chem 40:2412–2423
4. Walters WP, Stahl MT, Murcko MA (1998) Virtual screening: an overview. Drug Discov Today 3:160–178
5. Fara DC, Oprea TI, Prossnitz ER, Bologa CG, Edwards BS, Sklar LA (2006) Integration of virtual and physical screening. Drug Discov Today Technol 3:377–385
6. Oprea TI, Matter H (2004) Integrating virtual screening in lead discovery. Curr Opin Chem Biol 8:349–358
7. The PubChem service is hosted by the National Center for Biotechnology Information at NIH. http://pubchem.ncbi.nlm.nih.gov/
8. ChEMBL is a database of bioactive drug-like molecules hosted by the European Bioinformatics Institute at EMBL. https://www.ebi.ac.uk/chembldb/
9. Brown F (2005) Chemoinformatics: a ten year update. Curr Opin Drug Discov Devel 8:296–302
10. Mewes HW, Wachinger B, Stümpflen V (2010) Perspectives of a systems biology of the synapse: how to transform an indefinite data space into a model? Pharmacopsychiatry 43:S2–S8
11. Mestres J, Gregori-Puigjané E, Valverde S, Solé RV (2008) Data completeness: the Achilles heel of drug-target networks. Nat Biotechnol 26:983–984
12. Schwaighofer A, Schroeter T, Mika S, Blanchard G (2009) How wrong can we get? A review of machine learning approaches and error bars. Comb Chem High Throughput Screen 12:453–468
13. Edwards BS, Bologa CG, Young SM, Prossnitz ER, Sklar LA, Oprea TI (2005) Integration of virtual screening with high throughput flow cytometry to identify novel small molecule formylpeptide receptor antagonists. Mol Pharmacol 68:1301–1310
14. Young SM, Bologa CG, Fara D, BJK B, Strouse JJ, Arterburn JB, Ye RD, Oprea TI, Prossnitz ER, Sklar LA, Edwards BS (2009) Duplex high-throughput flow cytometry screen identifies two novel formylpeptide receptor family probes. Cytometry 75A:253–263
15. Dennis M, Burai R, Ramesh C, Petrie W, Alcon S, Nayak T, Bologa C, Leitão A, Brailoiu E, Deliu E, Dun NS, Sklar LA, Hathaway H, Arterburn JB, Oprea TI, Prossnitz ER (2009) In vivo effects of a GPR30 antagonist. Nat Chem Biol 5:421–427
16. Bologa CG, Revankar CM, Young SM, Edwards BS, Arterburn JB, Parker MA, Tkachenko SE, Savchuck NP, Sklar LA, Oprea TI, Prossnitz ER (2006) Virtual and biomolecular screening converge on a selective agonist for GPR30. Nat Chem Biol 2:207–212
17. George Thompson AM, Ursu O, Babkin P, Iancu CV, Whang A, Oprea TI, Choe JY (2016) Discovery of a specific inhibitor of human GLUT5 by virtual screening and in vitro transport evaluation. Sci Rep 6:24240
18. Koch MA, Schuffenhauer A, Scheck M, Wetzel S, Casaulta M, Odermatt A, Ertl P, Waldmann H (2005) Charting biologically relevant chemical space: a structural classification of natural products (SCONP). Proc Natl Acad Sci U S A 102:17272–17277
19. Renner S, van Otterlo W, Dominguez Seoane M, Möcklinghoff S, Hofmann B, Wetzel S, Schuffenhauer A, Ertl P, Oprea TI, Steinhilber D, Brunsveld L, Rauh D, Waldmann H (2009) Bioactivity-guided mapping of and navigation in chemical space by means of hierarchical scaffold trees. Nat Chem Biol 5:585–592
20. Wetzel S, Klein K, Renner S, Rauh D, Oprea TI, Mutzel P, Waldmann H (2009) Interactive exploration of chemical space with Scaffold Hunter. Nat Chem Biol 5:581–583
21. Olah MM, Bologa CG, Oprea TI (2004) Strategies for compound selection. Curr Drug Discov Technol 1:211–220
22. Oprea TI, Bologa CG, Edwards BS, Prossnitz EA, Sklar LA (2004) Post-HTS analysis: an empirical compound prioritization scheme. J Biomol Screen 10:419–425
23. Rishton GM (1997) Reactive compounds and in vitro false positives in HTS. Drug Discov Today 2:382–384
24. Young SM, Bologa CG, Oprea TI, Prossnitz ER, Sklar LA, Edwards BS (2005) Screening with HyperCyt high throughput flow cytometry to detect small-molecule formyl peptide receptor ligands. J Biomol Screen 10:374–382
25. Rishton GM (2003) Nonleadlikeness and leadlikeness in biochemical screening. Drug Discov Today 8:86–96
26. McGovern SL, Caselli E, Grigorieff N, Shoichet BK (2002) A common mechanism underlying promiscuous inhibitors from virtual and high-throughput screening. J Med Chem 45:1712–1722
27. Roche O, Schneider P, Zuegge J, Guba W, Kansy M, Alanine A, Bleicher K, Danel F, Gutknecht EM, Rogers-Evans M, Neidhart W, Stalder H, Dillon M, Sjögren E, Fotouhi N, Gillespie P, Goodnow R, Harris W, Jones P, Taniguchi M, Tsujii S, von der Saal W, Zimmermann G, Schneider G (2002) Development of a virtual screening method for identification of ‘frequent hitters’ in compound libraries. J Med Chem 45:137–142
28. Oprea TI (2002) Lead structure searching: are we looking for the appropriate properties? J Comput-Aided Mol Design 16:325–334
29. Austin CP, Brady LS, Insel TR, Collins FS (2004) NIH molecular libraries initiative. Science 306:1138–1139
30. Oprea TI, Bologa CG, Boyer S, Curpan RF, Glen RC, Hopkins AL, Lipinski CA, Marshall GR, Martin YC, Ostopovici-Halip L, Rishton G, Ursu O, Vaz RJ, Waller C, Waldmann H, Sklar LA (2009) A crowdsourcing evaluation of the NIH chemical probes. Nat Chem Biol 5:441–447
31. Collins FS (2010) Research agenda. Opportunities for research and NIH. Science 327:36–37
32. Boguski MS, Mandl KD, Sukhatme VP (2009) Repurposing with a difference. Science 324:1394–1395
33. Toney JH, Fasick JI, Singh S, Beyrer C, Sullivan DJ Jr (2009) Purposeful learning with drug repurposing. Science 325:1139–1140
34. Chong CR, Sullivan DJ Jr (2007) New uses for old drugs. Nature 448:645–646
35. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P (2008) Drug target identification using side-effect similarity. Science 321:263–266
36. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, Kuijer MB, Matos RC, Tran TB, Whaley R, Glennon RA, Hert J, Thomas KLH, Edwards DD, Shoichet BK, Roth BL (2009) Predicting new molecular targets for known drugs. Nature 462:175–181
37. Ashburn TT, Thor KB (2004) Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov 3:673–683
38. CTSA. http://www.ncrr.nih.gov/clinical_
45. Oprea TI, Allu TK, Fara DC, Rad RF, Ostopovici L, Bologa CG (2007) Lead-like, drug-like or “Pub-like”: how different are they? J Comput Aided Mol Des 21:113–119
46. OpenEye Scientific Software, Santa Fe, NM. http://www.eyesopen.com/
47. Mesa Analytics & Computing, Santa Fe, NM. http://www.mesaac.com/
48. ChemAxon Kft, Budapest, Hungary. https://www.chemaxon.com/
49. Accelrys Inc., San Diego, CA. http://www.accelrys.com/
50. Chemical Computing Group. http://www.chemcomp.com/
51. Certara, Princeton, NJ. https://www.certara.com/
52. Ambure P, Aher RB, Roy K (2014) Recent advances in the open access cheminformatics toolkits, software tools, workflow environments, and databases. In: Zhang W (ed) Computer-aided drug discovery. Methods in pharmacology and toxicology. Humana Press, New York, NY
53. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36
54. The International Chemical Identifier, InChI, was an IUPAC project. http://www.iupac.org/inchi/
research_resources/clinical_and_transla 55. OEChem Toolkit, Openeye Scientific Soft-
tional_science_awards/ ware, Santa Fe, NM. http://www.eyesopen.
39. Lipinski CA, Lombardo F, Dominy BW, Fee- com/oechem-tk
ney PJ (1997) Experimental and computa- 56. Open Babel. http://openbabel.sourceforge.
tional approaches to estimate solubility and net/
permeability in drug discovery and develop-
ment settings. Adv Drug Deliv Rev 23:3–25 57. raphSim TK Openeye Scientific Software,
Santa Fe, NM. http://www.eyesopen.com/
40. Oprea TI (2000) Property distribution of graphsim-tk
drug-related chemical databases. J Comput
Aided Mol Des 14:251–264 58. MACCSKeys320Generator, Mesa analytics
and computing LLC, Santa Fe, NM. http://
41. Ursu O, Oprea TI (2010) Model-free drug- www.mesaac.com/
likeness from fragments. J Chem Inf Model
50:1387–1394 59. Durant JL, Leland BA, Henry DR, Nourse JG
(2002) Reoptimization of MDL keys for use
42. Wester MJ, Pollock SN, Coutsias EA, Allu in drug discovery. J Chem Inf Comput Sci
TK, Muresan S, Oprea TI (2008) Scaffold 42:1273–1280
topologies. 2. Analysis of chemical databases.
J Chem Inf Model 48:1311–1324 60. MOE: The molecular operating environment
from chemical computing group Inc., Mon-
43. Teague SJ, Davis AM, Leeson PD, Oprea TI treal, QC. http://www.chemcomp.com/
(1999) The design of leadlike combinatorial
libraries. Angew Chem Int Ed 38:3743–3748 61. Open Babel: the open source chemistry tool-
German version: Angew. Chem. 111, 3962-- box. http://openbabel.org/wiki/Main_Page
3967 62. CDK is a Java library for structural chemo-
44. Hann MM, Oprea TI (2004) Pursuing the and bioinformatics. http://cdk.sf.net/
leadlikeness concept in pharmaceutical 63. Leo A (1993) Estimating LogPoct from struc-
research. Curr Opin Chem Biol 8:255–263 tures. Chem Rev 5:1281–1306
How to Prepare a Compound Collection Prior to Virtual Screening 137

64. CLOGP is available from BioByte Corpora- organic small molecules in the chemical uni-
tion, Claremont, CA. http://www.biobyte. verse database GDB-17. J Chem Inf Model
com/ 52:2864–2875
65. EPI Suite v4.11, U.S. Environmental Protec- 78. Ursu O, Holmes J, Knockel J, Bologa CG,
tion Agency. https://www.epa.gov/tsca- Yang JJ, Mathias SL, Nelson SJ, Oprea TI
screening-tools/epi-suitetm-estimation-pro (2017) DrugCentral: online drug compen-
gram-interface dium. Nucleic Acids Res 45:D932–D939
66. Tetko IV, Tanchuk VY (2002) Application of 79. FILTER is available from OpenEye Scientific
associative neural networks for prediction of Software Inc., Santa Fe, NM. http://www.
lipophilicity in ALOGPS 2.1 program. J eyesopen.com/products/applications/filter.
Chem Inf Comput Sci 42:1136–1145 html
http://vcclab.org/lab/alogps/index.html 80. Olah M, Mracec M, Ostopovici L, Rad R,
67. The virtual computational chemistry labora- Bora A, Hadaruga N, Olah I, Banda M,
tory (VCCLAB) as a number of on-line soft- Simon Z, Mracec M, Oprea TI (2004) WOM-
ware modules. Available at http://vcclab. BAT: world of molecular bioactivity. In:
org/ Oprea TI (ed) Cheminformatics in drug dis-
68. Molinspiration has a number of property cal- covery. Wiley-VCH, New York, NY (in press)
culators, including 3D conformer generation. 81. Coats EA (1998) The CoMFA steroids as a
http://molinspiration.com/ benchmark dataset for development of
69. Yap CW (2011) PaDEL-descriptor: An open 3D-QSAR methods. In: Kubinyi H,
source software to calculate molecular Folkers G, Martin YC (eds) 3D QSAR in
descriptors and fingerprints. J Comput drug design. Volume 3. Recent advances.
Chem 32:1466–1474 Kluwer/ESCOM, Dordrecht, The Nether-
70. Measures, mesa analytics and computing lands, pp 199–213
LLC, Santa Fe, NM. http://www.mesaac. 82. Oprea TI, Olah M, Ostopovici L, Rad R, Mra-
com/ cec M (2003) On the propagation of errors in
71. ChemoMine plc, Cambridge, UK. http:// the QSAR literature. In: Ford M,
www.chemomine.co.uk/ Livingstone D, Dearden J, Van de Water-
beemd H (eds) EuroQSAR 2002–Designing
72. MacCuish JD, MacCuish NE (2010) Chap- drugs and crop protectants: processes, pro-
man & Hall/CRC mathematical & computa- blems and solutions. Blackwell Publishing,
tional biology. In: Clustering in New York, NY, pp 314–315
bioinformatics and drug discovery, vol 40.
CRC press, Boca Raton, FL, p 244 83. Chemical Database Management Software,
TimTec Inc. http://software.timtec.net/
73. Gasteiger J, Rudolph C, Sadowski J (1990) ched.htm
Automatic generation of 3D-atomic coordi-
nates for organic molecules. Tetrahedron 84. Public web applications from UNM Biocom-
Comput Methodol 3:537–547 CORINA is puting are available at http://pasilla.health.
available from Molecular Networks GmbH unm.edu
and Altamira LLC; https://www.mn-am. 85. Yang JJ, Ursu O, Lipinski CA, Sklar LA,
com/ Oprea TI, Bologa CG (2016) Badapple: pro-
74. Hawkins PCD, Skillman AG, Warren GL, miscuity patterns from noisy evidence. J
Ellingson BA, Stahl MT (2010) Conformer Chem 8:29
generation with OMEGA: Algorithm and val- 86. Johnston PA (2011) Redox cycling com-
idation using high quality structures from the pounds generate H2O2 in HTS buffers con-
Protein Databank and Cambridge Structural taining strong reducing reagents-real hits or
Database. J Chem Inf Model 50:572–584 promiscuous artifacts? Curr Opin Chem Biol
OpenEye Scientific Software Inc., Santa Fe, 15:174–182
NM; http://www.eyesopen.com/ 87. Kenny PW, Sadowski J (2004) Structure mod-
75. MODDE is available from Umetrics, a divi- ification in chemical databases. In: Oprea TI
sion of Sartorius Stedim biotech. https:// (ed) Cheminformatics in drug discovery.
webshop.umetrics.com/ Wiley-VCH, New York, NY (in press)
76. The MLSMR collection can be datamined 88. Martin YC (2010) Perspectives in drug dis-
using the PubChem interface (keyword, covery and design: tautomers and tautomer-
MLSMR). http://pubchem.ncbi.nlm.nih. ism. J Comput Aided Mol Design
gov/ 24:473–638
77. Ruddigkeit L, van Deursen R, Blum LC, Rey- 89. Sadowski J, Gasteiger J (1993) From atoms
mond JL (2012) Enumeration of 166 billion and bonds to three-dimensional atomic
138 Cristian G. Bologa et al.

coordinates: automatic model builders. Chem 103. Taleb NN (2005) Fooled by randomness: the
Rev 93:2567–2581 hidden role of chance in the markets and life.
90. See the Metabolomics Fiehn Lab site: http:// Random House, New York
fiehnlab.ucdavis.edu/staff/kind/ 104. Taleb NN (2007) The Black Swan. The
ChemoInformatics/Concepts/3D- impact of the highly improbable. Random
conformer/ House, New York
91. Johnson MA, Maggiora GM (1990) Con- 105. Sneader W (2005) Drug discovery: a history.
cepts and applications of molecular similarity. Wiley, New York
Wiley-VCH, New York, NY 106. Boström J, Norrby P-O, Liljefors T (1998)
92. Maggiora GM (2006) On outliers and activity Conformational energy penalties of protein-
cliffsWhy QSAR often disappoints. J Chem bound ligands. J Comput-Aided Mol Design
Inf Model 46:1535 12:383–396
93. Oprea TI (2002) Chemical space navigation 107. Prossnitz ER, Arterburn JB, Edwards BS,
in lead discovery. Cur Opin Chem Biol Sklar LA, Oprea TI (2006) Steroid-binding
6:384–389 GPCRs: new drug discovery targets for old
94. Todeschini R, Consonni V (2008) Handbook ligands. Expert Opin Drug Discov
of molecular descriptors, 2nd edn. Wiley- 1:137–150
VCH, Weinheim, Germany 108. Papadatos G, Davies M, Dedman N,
95. Tanimoto TT (1961) Non-linear model for a Chambers J, Gaulton A, Siddle J, Koks R,
computer assisted medical diagnostic proce- Irvine SA, Pettersson J, Goncharoff N,
dure. Trans NY Acad Sci Ser 2(23):576–580 Hersey A, Overington JP (2016) Sure-
96. Tversky A (1977) Features of similarity. ChEMBL: a large-scale, chemically annotated
Psychol Rev 84:327–352 patent document database. Nucleic Acids Res
44:D1220–D1228 Available at https://www.
97. Willett P (1987) Similarity and clustering surechembl.org/search/
techniques in chemical information systems.
In: Research Studies Press. Letchworth, 109. Antolin AA, Tym JE, Komianou A, Collins I,
England Workman P, Al-Lazikani B (2017) Objective,
quantitative, data-driven assessment of chem-
98. Willett P (2000) Chemoinformatics–similar- ical probes. Cell Chem Biol 25(2):P194–205.
ity and diversity in chemical libraries. Curr Op E5 in press. Available at http://probeminer.
Biotech 11:85–88 icr.ac.uk/#/
99. Lewis RA, Pickett SD, Clark DE (2000) 110. Harding SD, Sharman JL, Faccenda E,
Computer-aided molecular diversity analysis Southan C, Pawson AJ, Ireland S, Gray AJG,
and combinatorial library design. Rev Com- Bruce L, Alexander SPH, Anderton S,
put Chem 16:1–51 Bryant C, Davenport AP, Doerig C,
100. Martin YC (2001) Diverse viewpoints on Fabbro D, Levi-Schaffer F, Spedding M,
computational aspects of molecular diversity. Davies JA, NC-IUPHAR (2018) The
J Comb Chem 3:231–250 IUPHAR/BPS Guide to PHARMACOL-
101. Linusson A, Gottfries J, Lindgren F, Wold S OGY in 2018: updates and expansion to
(2000) Statistical molecular design of build- encompass the new guide to IMMUNO-
ing blocks for combinatorial chemistry. J Med PHARMACOLOGY. Nucleic Acids Res 46:
Chem 43:1320–1328 D1091–D1106 Available at http://
102. Eriksson L, Johansson E, Kettaneh-Wold N, guidetopharmacology.org/
Wikström C, Wold S (2000) Design of experi- 111. Berman HM, Westbrook J, Feng Z,
ments: principles and applications. Umetrics Gilliland G, Bhat TN, Weissig H, Shindyalov
Academy, Umeå, Sweden IN, Bourne PE The protein data bank.
Nucleic Acids Res 28:235–242 Available at
http://www.rcsb.org/
Chapter 8

Building a Quantitative Structure-Property Relationship (QSPR) Model

Robert D. Clark and Pankaj R. Daga

Abstract
Knowing the physicochemical and general biochemical properties of a compound is critical to understand-
ing how it behaves in different biological environments and to anticipating what is likely to happen in
situations where that behavior cannot be measured directly. Quantitative structure-property relationship
(QSPR) models provide a way to predict those properties even before a compound has been synthesized
simply by knowing what its structure would be. This chapter describes a general workflow for compiling the
data upon which a useful QSPR model is built, curating it, evaluating that model’s performance, and then
analyzing the predictive errors with an eye toward identifying systematic errors in the input data. The focus
here is on models for the absorption, distribution, metabolism, and excretion (ADME) properties of drugs
and toxins, but the considerations explored are general and applicable to any QSPR.

Key words ADME, Data curation, QSAR, QSPR, Regression

1 Introduction

Successful drugs are necessarily able to exert a desired effect once
they reach their molecular target, but they can only do so if they are
able to get to that target from the point at which they are adminis-
tered to the patient. Their ability to do so reflects the interplay
between several nontarget molecular properties collectively known
as ADME properties for the pharmacokinetic steps they affect:
absorption, distribution, metabolism, and excretion. Measuring
such properties requires synthesis and purification, but their values
can be estimated even before synthesis by quantitative structure-
property relationship analysis of compounds that have already been
made and assayed. The QSPR examples discussed here are all
concerned with ADME properties, but the methodology is equally
applicable to non-ADME properties in areas such as pesticide dis-
covery and development, environmental dispersion, or toxicology.
The distinction between QSPRs and quantitative structure-
activity relationships (QSARs) is somewhat arbitrary, but for the
purposes of this chapter QSPRs cover physicochemical properties
like aqueous solubility, lipophilicity (logP), ionization constants
(pKa’s), and linear (i.e., non-saturating) plasma protein binding,
as well as the general xenobiotic metabolizing systems involved in
Phase I and Phase II metabolism: cytochrome P450s (CYPs),
UDP-glucuronosyl transferases (UGTs), and sulfotransferases.
The latter are distinguished from specific off-target effects (e.g.,
binding to related receptors or kinases) by their promiscuity.
The model-building process described here focuses on 2D
ligand-based “statistical” methods. “Structure-based” (docking)
methods, which model interactions between molecules and
macromolecular targets directly, are less well-suited to most QSPR
problems because the properties involved tend to reflect
interactions that are more or less isotropic (i.e., that lack a specific
orientation in 3D space), that involve a broad envelope of
conformations, or both.
sites of CYP metabolism using 3D models [1], but the enzymes
involved tend to have unusually fluid binding pockets and often
have multiple binding sites within the pocket [2]. That flexibility
makes quantitative docking studies difficult at best. Regardless, the
same basic processes are involved in ligand preparation and results
analysis as in the examples described here.
An exhaustive review of the science behind model-building
methods for QSPR is beyond the scope of this chapter. Rather, it
is presumed that the reader already has some experience with one or
more methods or with the general literature. The focus here is on a
general workflow that the authors have found reliable and robust in
constructing the sort of industrial-strength QSPR models
distributed as part of the ADMET Predictor™ software from
Simulations Plus, Inc. [3]. The examples provided are illustrated
using that software, but the considerations and the “watchouts”
reflect general characteristics of the data and the complications
often encountered regardless of the program being used.
Most QSPR models used in ADME make quantitative predic-
tions of the target property value, and the discussion below is
therefore cast primarily in terms of regression model outputs. Issues
for the classification models that are more commonly used for
toxicological QSPR analysis are formulated somewhat differently.
Their output predictions are ultimately determined by comparing
some continuous model output to a threshold, so the underlying
workflow is essentially the same. They do differ from regression
models in how their performance is evaluated, so that aspect is
addressed separately in Subheading 3.7.

2 Materials

QSPR model building involves retrieving, manipulating, and storing
chemical structures and data related to such structures. It also
requires a way to generate descriptors and model definition files.
Finally, a process for applying those models to new compounds is
required. The programs and support files used in these activities
constitute the relevant “materials.”

2.1 Structural Data

Classical two-dimensional line drawings are very flexible and effec-
tive for conveying connectivity (“2D”) information. Adoption of a
few conventions makes it possible to convey a sense of depth as well.
Indeed, such “2.5 D” depictions can be quite beautiful and elegant.
Most existing computer programs are unable to process drawings
directly, however, which led to the adoption of standardized line
notations that specify the connectivity of a molecule. (Chemical
structures can also be extracted from text documents by applying
optical structure recognition (OSR), but the OSR tools the authors
have worked with are not yet reliable enough to be applied without
careful manual review.)
widely used line notations are SMILES and InChI. The former
conveys enough molecular connectivity and bond type information
to specify a specific 2D structure, whereas the latter is a hierarchical
decomposition that can describe more than one molecular struc-
tural form of a compound.
Different software systems need to exchange structural infor-
mation, and several standard file formats have been developed to
facilitate that exchange. Files of SMILES strings can serve this
purpose, but their lack of coordinates and associated data can be a
serious limitation. MOL and SD files are more flexible; they store
explicit connectivity tables that specify atomic coordinates and
provide more molecular context. SD files have the added virtues
of being able to store multiple structures in a single file and to
associate nonstructural data with each structure.
Protein Data Bank (PDB) files [4] are primarily designed to
store structural data for macromolecular crystal structures but can
be an important source of 3D structural information for bound
ligands. They require considerable preprocessing for use in building
2D QSPRs, in part because they lack explicit bond information for
ligands; however, once processed, they should be stored in SMILES
or SD format.
1. The simplified molecular-input line-entry system (SMILES)
was created at Daylight Chemical Information Systems, Inc.
[5]. Aromatic atoms are represented by their atomic symbols in
lowercase, and nonaromatic atoms are in uppercase. Numbers
are used to indicate nonadjacent atoms that are bonded to each
other (i.e., typically for ring closures), and parentheses set off
branches. Double bonds are represented by equals signs and
triple bonds by hashes. Hydrogen atoms are normally implicit
(to satisfy valence requirements) as are single and aromatic
bonds. Hence styrene is represented by C=Cc1ccccc1, and
2-cyanopyridine can be represented as n1c(C#N)cccc1 (see the
sketch following this list). Details
can be found on the Daylight web site [6]. SMILES are com-
pact and relatively “human readable,” which makes them the
favored way to represent structures in supporting information.
“Classic” SMILES files contain the SMILES string for a single
compound and put the compound identifier in the filename
(with a file extension of “.smi”). There is no industry-standard
format for storing multiple SMILES strings and associated data
in a single file, though many groups and companies have cre-
ated their own variations.
2. The IUPAC International Chemical Identifier (InChI) is a
chemical substance identification system that is designed to
link records in different data compilations that pertain to the
same compound [7, 8]. Basically, it encodes non-hydrogen
atoms (bridging hydrogen atoms that take part in three-center
bonds are treated as part of the core structure) and bonds
connecting those atoms as a “parent core structure”; bond types
are not differentiated. No coordinates
are stored.
The parent structure in the InChI string consists of several
layers set off by “/” characters, each of which provides addi-
tional molecular details about the compound. The first layer is
the empirical formula, whereas a separate layer (starting with
“/c”) encodes the connectivity of heavy atoms in the core
structure. A layer starting with “/h” indicates the position of
fixed and tautomerizable (“immoveable” and “moveable,”
respectively) hydrogens. Tautomeric, isotopic, stereochemical,
and protonation states can be specified as extra features
encoded in additional layers. Because the “standard InChI”
distinguishes between molecules on the basis of connectivity,
stereochemistry, and isotopic composition, but not on proton-
ation state or tautomeric assignments, it is useful for identifying
functionally duplicate records from different data sources (see
the sketch following this list).
3. There are many available sketching programs enabling user
input of a molecule by directly drawing it (e.g., in the free
MedChem Designer™ program from Simulations Plus, Inc.
[9]). Most commercial and open-source sketchers will export
structures in MOL or SD format, and nearly all cheminfor-
matics programs will read them. These formats were created by
MDL for use in its database systems. As a result of subsequent
mergers in the cheminformatics industry, these formats are now
overseen by Dassault Systèmes BIOVIA. The MOL format only allows for
one compound structure per file and makes no provision for
named compound attributes. The compound identifier is gen-
erally used as the filename, but it also appears in the first line of
the file itself. The rest of the file consists of some additional
header information and a connection table made up of two
blocks: one that specifies atomic coordinates, elemental and
isotopic identity, charge, etc. and another that indicates which
atoms a given bond connects as well as the type of bond and
other properties such as stereo information. SD is an extended
file format that includes compound attributes and the ability to
store multiple compounds in a single file.
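
To make the interconversion between these representations concrete, the
following short Python sketch uses the open-source RDKit toolkit
(mentioned under Subheading 2.3); the two example molecules are the
styrene and 2-cyanopyridine SMILES given above, and the InChI calls
assume an RDKit build with InChI support. This is an illustrative
sketch, not part of the formal workflow described in this chapter.

    from rdkit import Chem

    examples = {"styrene": "C=Cc1ccccc1",
                "2-cyanopyridine": "n1c(C#N)cccc1"}

    for name, smi in examples.items():
        mol = Chem.MolFromSmiles(smi)      # returns None for unparseable input
        if mol is None:
            print(name, "could not be parsed")
            continue
        print(name)
        print("  canonical SMILES:", Chem.MolToSmiles(mol))
        print("  InChI:           ", Chem.MolToInchi(mol))
        print("  InChIKey:        ", Chem.MolToInchiKey(mol))

Because alternative valid SMILES for the same molecule canonicalize to
the same string, and because the standard InChI ignores protonation and
tautomeric assignments, these identifiers underpin the duplicate checks
described in Subheading 3.2.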

2.2 Structure and Property Value Databases

1. ChEMBL [10, 11] is an extensive compilation of literature data
that is a good source of assay data for QSPR model construc-
tion. It is well-curated given its scale but does contain tran-
scriptional and encoding errors. All entries include references
to the source of the data, which makes the necessary secondary
curation straightforward. Note, however, that data from
reviews and compilations that may not be well-annotated are
included alongside primary sources.
2. PubChem [12] is an uncurated data repository, and its reliabil-
ity is uneven as a result [13]. Property data comes predomi-
nantly from high-throughput screening (HTS) assays. Most
ADME assays of interest are not amenable to HTS assay for-
mats, so the endpoints used are often surrogates for the actual
property of interest—e.g., kinetic solubility in phosphate-
buffered saline (AID 1996 [14])—which limits their usefulness
for building global models of thermodynamic solubility in pure
water (see Note 1). Results from confirmation assays can be
useful as external test sets.
3. ChemSpider [15] is designed primarily to facilitate the sharing
of chemical structures and their names. It is an excellent tool
for resolving ambiguities in nomenclature. It also contains a
considerable number of physicochemical measurements, though some
care needs to be taken to avoid mistaking in silico predictions
for experimental data and to handle HTS results with due care.
Data deposited by chemical vendors can be valuable, but reli-
ability varies with the company.
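
Retrieving records programmatically avoids transcription errors when
assembling a data set. The sketch below queries PubChem's PUG REST
interface for a compound by name; the query name and the property list
are arbitrary examples, and the JSON field names follow the PUG REST
conventions at the time of writing, so treat them as assumptions to be
verified against the current documentation.

    import json
    import urllib.request

    name = "aspirin"  # illustrative query compound
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           + name + "/property/CanonicalSMILES,MolecularWeight/JSON")

    with urllib.request.urlopen(url) as response:
        record = json.load(response)

    # One entry per matching PubChem compound identifier (CID)
    for prop in record["PropertyTable"]["Properties"]:
        print(prop["CID"], prop["CanonicalSMILES"], prop["MolecularWeight"])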

2.3 Software

A very wide range of software is available for carrying out the
manipulations described here. The minimal functionality required
includes being able to:
1. Read in and write out structures in SMILES and/or SD format
as well as the ability to associate data with each structure.
2. Display 2D structures and allow them to be modified readily.
3. Detect duplicate entries by name or structure, preferably with
an ability to ignore stereochemical and isotopic differences.
4. Selectively extract constituent molecular units from a com-
pound’s structure.
5. Standardize molecular structures and identify any that cannot
be standardized.
6. Generate molecular descriptors from input structures.
7. Generate a QSPR model that quantitatively relates the descrip-
tors to a selected endpoint.
8. Plot the distribution of signed error in prediction (residuals) as
a function of the endpoint of interest (the dependent variable)
and individual descriptors.
These functions need not all be carried out by a single program,
but if a combination of programs is used, they need to be able to
interchange information effectively. Too many such programs exist
to explore them in detail here. ADMET Predictor and most other
commercial QSAR packages support the full workflow directly. All
the commercial QSAR packages have been developed with sophis-
ticated graphical user interfaces (GUIs) to support complex work-
flows and reduce the likelihood of “pilot error.” Scripting in R [16]
is one example of an open-source environment that supports most
of these functions when used in conjunction with the Chemistry
Development Kit (CDK [17]), which can be used to generate 2D
coordinates from connection tables (e.g., SMILES) and to generate
commonly used molecular descriptors. R itself provides a range of
machine-learning functions; however, like most open-source tools,
R lacks a GUI and requires custom scripting to apply them.
Workflow tools such as KNIME [18] and Pipeline Pilot [19]
can be used to connect functional subunits from multiple sources,
whether they are open-source (e.g., CDK and RDKit [20]) or
commercial programs. Linking disparate programs requires some
care, however, because there is increased risk of inconsistencies
between programs, e.g., differences in structural standardization
or how qualifiers like “>” and missing data flags are handled.
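
A few of the functions enumerated above (reading SD files with
associated data, structure-based duplicate detection, and qualifier
handling) can be sketched in a handful of lines with RDKit. The input
file name and the "Solubility" field below are hypothetical
placeholders, and a production workflow would add the standardization
and error reporting discussed in Subheading 3.

    from rdkit import Chem

    by_structure = {}   # InChIKey -> list of record names
    qualified = []      # records whose values carry "<" or ">" qualifiers

    for mol in Chem.SDMolSupplier("assay_data.sdf"):   # hypothetical file
        if mol is None:                                # unparseable record
            continue
        name = mol.GetProp("_Name") if mol.HasProp("_Name") else "unnamed"
        fields = mol.GetPropsAsDict()                  # nonstructural SD data
        value = str(fields.get("Solubility", ""))      # hypothetical endpoint
        if value.startswith(("<", ">")):               # keep qualifiers distinct
            qualified.append(name)
        by_structure.setdefault(Chem.MolToInchiKey(mol), []).append(name)

    duplicates = {k: v for k, v in by_structure.items() if len(v) > 1}
    print(len(duplicates), "duplicated structures;",
          len(qualified), "qualified values")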

2.4 Descriptors

The quality and generality of a QSPR model depend to a consider-
able extent on the descriptors used in its construction. A very wide
range of descriptors is available for generating QSPRs and QSARs
[21]. Different research groups and software vendors tend to favor
their own particular sets, but each can be broadly categorized as
falling into one of a few classes.
1. Constitutional Descriptors
Constitutional descriptors capture various aspects of molecular
size (molecular weight, number of heavy atoms, number of
bonds, etc.) and elemental composition (number of bromines,
number of double bonds, etc.).
2. Substructural Descriptors
Substructural descriptors indicate whether a particular sub-
structure is present in a compound. They may appear mixed
in with other descriptors or used in isolation. In the latter case,
they are usually used as binary (present or absent) “fingerprint”
vectors, but count vectors in which the number of times each
substructure occurs can also be used. Circular fingerprints [22]
are especially widely used. They define a characteristic substruc-
ture around each atom in terms of the properties of atoms in its
immediate neighborhood. ECFP_4 fingerprints, for example,
take the kinds of atoms 1, 2, 3, and 4 bonds away from a central
atom into account when determining which bit in the circular
fingerprint is set to 1 for each atom in a molecule. The number
of possible atom environments is too large to handle computa-
tionally, so they are condensed (“hashed”) into fingerprints of
manageable length.
3. Whole-Molecule Descriptors
Classical topological indices summarize the overall molecular
shape based on molecular connectivity (topology). Descriptors
like electrotopological (E-state) indices [23] and circular fin-
gerprints capture details about the atomic environment of the
atoms in a molecule. E-state indices are the sum of all per-
turbed base values for a given atom type (e.g., the carbons in
terminal methyl groups or alcohol oxygens) in a molecule.
More physically based molecular descriptors can be calculated
from atomic electronegativities, partial charges, “hardness,”
polarizabilities, etc. These include maximal and minimal values
as well as eigenvalues from autocorrelation vectors [24]. Such
descriptors are more commonly available in commercial soft-
ware packages than in open-source programs.
4. 3D Descriptors
All of the descriptors described above can be evaluated from
molecular connectivity alone; they are independent of 3D
structure and of molecular conformation. Three-dimensional
descriptors, on the other hand, can be very useful in QSAR
studies for targeted biochemical activities and are featured in
some commercial CYP prediction software. Applying them to
QSPR analysis in general, however, requires special techniques
for dealing with substrate and enzyme flexibility that are
beyond the scope of this chapter.
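
As a concrete illustration of the first three descriptor classes, the
sketch below computes a few constitutional and whole-molecule
descriptors and a hashed circular fingerprint with RDKit. RDKit's
Morgan fingerprint with radius 2 is broadly comparable to, though not
identical with, ECFP_4, and the example molecule is arbitrary.

    from rdkit import Chem
    from rdkit.Chem import AllChem, Descriptors

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

    # Constitutional and whole-molecule descriptors
    print("molecular weight:", Descriptors.MolWt(mol))
    print("heavy atoms:     ", mol.GetNumHeavyAtoms())
    print("calculated logP: ", Descriptors.MolLogP(mol))  # Crippen estimate
    print("TPSA:            ", Descriptors.TPSA(mol))

    # Hashed circular fingerprint: a binary substructural descriptor vector
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    print("bits set:", fp.GetNumOnBits(), "of", fp.GetNumBits())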

3 Methods

3.1 Compiling a Data Set

Collecting good data is key to building a robust and reliable QSPR
model. It can come from internal corporate databases or the pri-
mary literature as well as from commercial or publicly accessible
data compilations (see Note 2).
The primary literature is the best source of property data, but
building a large enough data set to be useful requires a considerable
investment in time and effort, since each publication needs to be
considered on its own merits. Published compilations are a much
more convenient source of data, but these should also be handled
with care unless they were compiled by experts in the field and
include primary literature references for all entries. Relying on an
incompletely annotated compilation means complete reliance on
the domain knowledge of the authors, which is—in our experi-
ence—quite prone to error. Worse, there is often no way to track
down what the error was, which means that the data point or points
affected must be discarded altogether. Note, too, that if you do not
know the primary reference for a data point you suspect is bad, you
will not be able to identify associated observations—from the same
paper or from the same group—that are also in error.
Commercial databases and handbooks are also valuable data
sources; however, both can be quite expensive, and handbooks
often require generating structures from compound names (see
below). A ChEMBL search is usually the best way to start collecting
data, in part because it returns structures in electronic form. It also
represents a good compromise between ready accessibility to rela-
tively large amounts of data and reliability of the property values
retrieved. Combining several large data sets (excluding those that
are compilations) from such a search and building a preliminary
model can provide a sense of how feasible building a global model
for the property of interest is going to be. Once the likely quality of
the QSPR can be assessed, it may be possible to justify adding data
from more resource-intensive sources.

3.2 Curating Structures

Regardless of the source of compound structures, they need to be
checked for errors and, where possible, corrected. Programs will
generally notify the user of gross errors like pentavalent carbons or
atom types that cannot be handled correctly. In some cases (e.g.,
unusual phosphorous oxidation states or spurious bonds between
ions in a salt), the structure can often be corrected based on its
name or some other identifier (e.g., CAS number); however, there
are several more subtle kinds of error to check for [13].
1. Check for multiple entries with the same name. In many cases,
these represent independent measurements on the same com-
pound, but they may also represent different compounds that
share a constituent (e.g., a free base and one of its salts).
Occasionally, however, a check for duplicate names uncovers
misassigned structures. This is distressingly common when
structures have been generated automatically from compound
names drawn from the older literature and passed from data-
base to database. Often, they involve ambiguous split names
such as ethanolamine acetate, which could be the ester or the
salt. They are also relatively common for natural products [13].
2. Structures should be standardized so that alternative represen-
tations of the same group are converted to a consistent form.
Nitro groups are particularly problematic in mixed data sets.
They should then be scanned to identify structural duplicates
bearing different identifiers. Absent a structural search facility,
such duplicate checking can be done using InChI codes (see
above).
3. As noted above, most QSPR modeling is done using 2D
descriptors that are insensitive to chirality. Hence it is prudent
to either consolidate the entries for diastereomers (see Subheading 3.3, step 2) if
the discrepancies among their property values are within the
range of experimental error or to set them aside (see Note 3). If
the property of interest is one where chirality should not have
an effect (e.g., solubility) yet endpoint values differ, the asso-
ciated primary literature should be examined; doing so may
reveal errors relevant to other data from the same source.
4. At a minimum, a search should be run to identify entries for
different tautomers of the same molecule. In most cases, one
will predominate in aqueous solution, and that structure
should be used. Any discrepancy in observed property values
for different tautomers bears investigation.
Determining which tautomer is most prevalent can be chal-
lenging, but it should be done if possible; simply “standardiz-
ing” to a canonical tautomeric form regardless of likely
prevalence in solution should be avoided. Fortunately, carefully
formulated (though nontrivial) transformation rules can do an
adequate job of identifying the dominant form in most cases (a
minimal sketch follows this list).
5. As a last step, it is a good idea to look through a tiled view of the
data set structures, especially if the software you are using
supports 2D alignment. Often an inconsistent or incorrect
structure will “jump out” when browsing through such “wall-
paper,” even when the data set consists of several thousand
structures (Fig. 1; see Note 4).
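
The tautomer handling in step 4 can be sketched with RDKit's
rdMolStandardize module. The default canonicalizer below only
illustrates the mechanics of collapsing alternative drawings so that
duplicate checks succeed; as cautioned above, it does not attempt to
pick the tautomer most prevalent in solution, and a production rule set
would be considerably more elaborate.

    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize

    enumerator = rdMolStandardize.TautomerEnumerator()  # rule-based canonicalizer

    for smi in ["CC(=O)C", "C=C(O)C"]:   # keto and enol drawings of acetone
        mol = Chem.MolFromSmiles(smi)
        canon = enumerator.Canonicalize(mol)
        # Both drawings collapse to one structure and hence one InChIKey,
        # so an InChI-based duplicate scan will now flag them as one compound.
        print(smi, "->", Chem.MolToSmiles(canon), Chem.MolToInchiKey(canon))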

3.3 Curating Property Data

It is possible for a model to be better than the raw data, but only
insofar as it has been carefully curated. Reported error rates in the
literature are quite high [25], which agrees with the authors’ per-
sonal experience. Unfortunately, many of the most common kinds
of error can confer undue leverage on the affected data points,
which risks biasing predictions significantly. Several steps can and
should be taken to ensure that the accuracy of a QSPR model is
limited by the random noise in the experimental data rather than by
errors that could have been found and addressed.

Fig. 1 Tiled display (“wallpaper”) from ADMET Predictor 8.5 showing structures for a rat liver microsome
clearance data set taken from ChEMBL
1. Plot the distribution of your data as a function of the property
to be modeled. Errors in units often make themselves apparent
as secondary peaks offset from the main body of properties by a
factor of 1000, typically because “μ” was substituted for “m” in
a table header or vice versa. They can also arise because quali-
fiers such as “<” were lost in transcription. When such an error
in units is found, other data from the same source should also
be considered for revision.
2. Most QSPR properties of practical interest reflect behavior in
aqueous solution, so they are intrinsically molecular in nature.
Therefore, salts and other admixtures are expected to reflect the
properties of their constituent molecules independently, once
secondary effects (e.g., on ionic strength and pH) have been
accounted for. If the property values reported for replicate
entries are reasonably comparable, the records should be merged
and the average value (or geometric mean for models built on a
logarithmic scale; see below) should be used (see Note 5).

Note that property values can be consistent without being
the same. In particular, a solubility value of >1 mg/mL is
consistent with a second entry of 2.5 mg/mL for a compound
of the same name. This may seem obvious, but automated
comparisons can easily be fooled by such nominal
discrepancies.
When inconsistent property values are associated with the
same identifier and the structures are the same, check the
primary sources. Other nonduplicated observations from the
flawed source should also be examined and corrected as appro-
priate, whether they reflect errors in the primary source or were
introduced during transcription. Errors uncovered in this way
may involve units, but they may also involve species, tissue, etc.
Regardless, other data from the same source will likely need to
be corrected as well.
3. Having more data is not necessarily better. This is obvious
when it is a matter of the accuracy of individual data points,
but it also pertains to a “clumpy” sampling of chemistry space
that can result from the tendency to pursue a synthesis strategy
of “methyl, ethyl, butyl, futile, etc.”—i.e., generating a series of
closely related analogs from readily available reagents and a key
synthetic intermediate. This is important for establishing pat-
ent coverage and for local QSARs: the target activity may vary
considerably across the analogs, but the property that is the
focus of the QSPR analysis does not. Such a situation often
arises when data from papers documenting large-scale pharma-
ceutical projects are included in the data set.
Having many nearly duplicate structures with very similar
property values in a data set can distort model performance
statistics, especially if they share a bias. The best way to check
for this is to cluster the compounds by structure (e.g., using
substructural fingerprints or self-organizing maps and molecu-
lar descriptors [24]) and examine the distribution of property
values within the clusters. If necessary, a representative subset
of structures (see Subheading 3.6 below) can be used that
covers the full range of analog properties.
4. Whether a QSPR should be modeled on a linear or logarithmic
scale depends primarily on how its experimental error varies
with its value. Analytical errors for positive unbounded proper-
ties like solubility typically are proportional to the value itself.
When that is the case, the error of the logarithm is independent
of the value, so modeling the log levels out the influence of
errors in individual observations. Applying the log transform to
properties that reflect equilibria has the additional advantage of
putting them on a “natural” scale where they are proportional
to differences in Gibbs free energy (ΔG).

Some ADME properties are bounded. Percent unbound in
plasma, for example, can range from 0 to 100%. Experimental
errors in such data are often quite variable on a linear scale,
being higher at midrange than near the lower and upper
bounds. In such situations, applying the logit transform should
be considered:
logit(x) = ln(x) − ln(1 − x)
where ln(x) is the natural logarithm of x. The actual base used is
not important so long as the inverse transformation (“antilo-
git”) applied to return to the original scale makes use of the
same base for the exponentiation.
antilogit(z) = exp(z)/(1 + exp(z))

(A short code sketch of these transforms follows this list.)
5. When the data is log-transformed, nominal values ≤ 0 must be
set aside. The same is true for values of 0 or 100% when a logit
transform is applied. Such values cannot be literally correct;
they typically reflect limitations on the dynamic range of the
analysis system used and are better represented as qualified
values such as “<1 μg/mL.” In fact, any value near the
extremes in the data set (“outliers”) may be significantly less
reliable than others, so it may be advisable to set them aside or
assign them to the test set (see Note 3).
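
The transforms in steps 2 and 4 take only a few lines of Python; the
sketch below uses natural logarithms, and the replicate and
fraction-unbound values are invented purely for illustration.

    import math

    def logit(x):
        return math.log(x) - math.log(1.0 - x)

    def antilogit(z):
        return math.exp(z) / (1.0 + math.exp(z))

    # Step 2: merge replicate measurements on a log scale (geometric mean)
    replicates = [12.0, 15.0, 9.5]            # invented solubilities, ug/mL
    geo_mean = math.exp(sum(math.log(v) for v in replicates) / len(replicates))

    # Step 4: transform a bounded property such as fraction unbound in plasma
    fu = 0.25                                 # invented fraction unbound
    z = logit(fu)
    assert abs(antilogit(z) - fu) < 1e-12     # round trip recovers the value

    print("geometric mean = %.2f ug/mL, logit(%.2f) = %.3f" % (geo_mean, fu, z))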

3.4 Data Set Partitioning

It is essential to set aside a substantial fraction—usually at least
10%—of the data set as an external test set. It is critical that the
data from these observations are not used at any stage of the model-
building process. That way the corresponding structures can serve
as surrogates for new structures: how well their properties are
predicted provides an unbiased estimate of the model’s perfor-
mance when presented with truly novel compounds.
Optimizing the partition of the data set into an external test set
and a training pool requires striking a balance between the expected
predictive power of the model when applied to novel structures and
how well that predictive performance is characterized. A more
diverse training pool is more informative and confers greater statis-
tical power on the model produced than a less diverse one can,
whereas a more diverse external test set typically more accurately
represents new compounds to which the model may be applied. On
the other hand, observations that are individually less informative
(more typical) serve to average out noise in the data.
As noted above, the data available to build QSPR models is
naturally clumpy in chemistry space because of the way analog
synthesis is carried out. One good way to address such inhomoge-
neity is to group compounds together based on structural descrip-
tors [24], by their target property values, or by a subsidiary
attribute such as molecular weight or lipophilicity (logP). The
groups obtained are then sampled more or less evenly. This kind
of procedure is called “stratified sampling”; it tends to yield better
model performance than simple random sampling [26].
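
One simple realization of such a stratified split is to bin compounds
on a subsidiary attribute and sample within bins. The sketch below
stratifies on molecular weight with scikit-learn; the synthetic arrays,
bin edges, and 10% test fraction are all illustrative choices, not
prescriptions.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(seed=7)
    mol_wt = rng.uniform(150.0, 650.0, size=500)  # stand-in for a real attribute
    y = rng.normal(size=500)                      # stand-in property values

    # Bin on molecular weight so both partitions span the same chemistry space
    bins = np.digitize(mol_wt, [250.0, 350.0, 450.0, 550.0])
    train_idx, test_idx = train_test_split(
        np.arange(len(y)), test_size=0.10, stratify=bins, random_state=7)

    print(len(train_idx), "training-pool compounds,",
          len(test_idx), "external test compounds")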
Regardless of how the partitioning is done, it is best that it has
some element of randomness—e.g., the sampling within groups—
so that the degree of stability to small changes (robustness) of a
model can be assessed accurately.
Most model-building systems create several different verifica-
tion (internal test) sets from the training pool, generally at random;
training pool observations that are not in a particular verification
set are then used to train the respective QSPR model. Performance
on the verification set can then be used to reduce the risk of over-
fitting the model—i.e., avoid having the model “learn” the noise in
the data set as well as the signal.
In some software packages, the performance statistics are aver-
aged across the verification sets in lieu of having a fully external test
set. Such “cross-validation” may overestimate how well a QSPR
model performs [27]. We believe the use of an external test set is
essential to avoid overtraining.

3.5 Model Building

Many factors contribute to how well the model-building process
works for any particular program. As a result, how each stage is best
carried out depends on the detailed nature of the other stages.
Moreover, most people will not have access to equally sophisticated
versions of more than two or three fundamentally different
approaches and know how to apply them properly. Rather than
attempt a thorough exploration of the pluses and minuses of the
various methodologies here, we will describe in general terms the
workflow most commonly used in constructing QSPR models for
ADMET Predictor. That said, some form of the basic elements of
the workflow given is applicable to most other programs as well:
1. The list of potential descriptors is filtered to remove any with
low variance and any that show unacceptably high pairwise or
multilinear correlation with other descriptors.
2. The descriptors that remain are sorted in descending order of
sensitivity—i.e., the extent to which they themselves can
account for variation in the target property across the training
pool. (Where necessary, promising groups of descriptors are
generated using a genetic algorithm (GA) [28].)
3. An initial number of descriptors and a number of hidden-layer
neurons are selected for the first artificial neural network
ensemble architecture to be trained. Later, models with more
descriptors and/or more neurons in the hidden layer are
trained.
4. A set of 165 artificial neural networks (ANNs) are trained, each
with its own randomly assigned training and verification sets.
Training continues iteratively so long as predictive performance
on the verification set continues to improve; that performance
will begin to degrade when the model starts memorizing the
uninformative variation—the noise—in the training set data.
5. The 33 networks exhibiting the best performance on the train-
ing pool are combined to form the final QSPR model, which is
an ensemble of neural nets—an ANNE. A regression model
prediction is obtained by averaging the individual network
predictions, whereas a classification model prediction is
obtained by tallying “votes” across the ensemble or by averag-
ing network outputs.
6. For classification models, a confidence estimate is generated
based on the strength of the consensus between networks in
the ensemble [29].
7. A final model is selected by comparing ensemble performance
on the training, verification, and test sets across a range of ANN
architectures defined by the number of input descriptors and
the number of neurons in the hidden layer.
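
The train/verify/ensemble-average pattern in steps 3–5 can be
approximated with scikit-learn. The sketch below is a drastically
simplified stand-in for the procedure described above (far fewer and
smaller networks, no descriptor sensitivity ranking, no architecture
scan); the ensemble sizes and hidden-layer width are arbitrary.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    def train_ensemble(X_pool, y_pool, n_nets=15, n_keep=5, hidden=8):
        """Train small ANNs on random train/verification splits; keep the best."""
        scored = []
        for seed in range(n_nets):
            X_tr, X_ver, y_tr, y_ver = train_test_split(
                X_pool, y_pool, test_size=0.2, random_state=seed)
            net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000,
                               random_state=seed)
            net.fit(X_tr, y_tr)
            scored.append((net.score(X_ver, y_ver), seed, net))  # verification R^2
        scored.sort(reverse=True, key=lambda t: (t[0], t[1]))
        return [net for _, _, net in scored[:n_keep]]

    def ensemble_predict(nets, X):
        # A regression ensemble prediction is the mean of the member predictions
        return np.mean([net.predict(X) for net in nets], axis=0)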

3.6 Performance Assessment

Model performance is assessed based on how well property values
predicted for the external test set compare to the observed property
values. Aggregated performance statistics are important, but exam-
ining individual predictions is more apt to improve a model.
1. Pearson’s correlation coefficient (r2) between observed and
predicted values is widely used as measure of a regression
model’s overall performance. It is a good measure of correla-
tion between variables but is suboptimal as a measure of pre-
dictive performance in that it can be relatively high even when
all predictions are off by the same amount or in the same
proportion. The root mean square error (RMSE) is generally
a more appropriate statistic to use:
RMSE = sqrt( Σ (ŷi − yi)² / n )

where yi is an observed value, ŷi is the corresponding predicted
value, and n is the number of observations in the test set. The
mean absolute error (MAE) is an analogous nonparametric
statistic that is less sensitive to distortion by extreme outliers
(see the sketch following this list).
2. Classification performance is sometimes measured by “accu-
racy,” which is the fraction of all predictions that are correct;
however, most QSPR data sets are unbalanced, in that the
positive class is typically underrepresented. In such cases, it
makes sense to evaluate performance on each class separately.
Doing so yields four statistics:
sensitivity = TP/(TP + FN)
specificity = TN/(TN + FP)
PPV = TP/(TP + FP)
NPV = TN/(TN + FN)
where TP and FP are the numbers of true and false positives,
TN and FN are the numbers of true and false negatives, and
PPV and NPV are the positive and negative predictive values,
respectively. A form of the accuracy that is balanced by the
relative size of the two classes is given by Youden’s index J:
J = sensitivity + specificity − 1
Youden’s index is optimized in the course of ANN training in
ADMET Predictor. Its value is determined by the decision
threshold chosen once training is complete. A rather more
abstract statistic—the area under the receiver operating charac-
teristic curve (AUC ROC)—is used by some other programs; it
integrates performance across all possible decision thresholds.
3. The confidence estimates mentioned above represent one way
to determine whether a new molecule is too dissimilar to those
upon which the QSPR model is based for the prediction for it
to be trusted. An alternative approach is to compare the
descriptor values for the new compound to those used to
build the model. ADMET Predictor flags a molecule as “out
of scope” if any of the model descriptor values lies more than
10% above or below the range of that descriptor in the model’s
data set. Some programs use fingerprint similarity to identify
out-of-scope structures.
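
All of the statistics above are easy to compute directly, as the
sketch below shows. The out_of_scope function mimics, but is not
identical to, the 10% range rule described for ADMET Predictor; here
the margin is taken as 10% of each descriptor's training-set range,
which is an assumption rather than that program's exact definition.

    import numpy as np

    def rmse(y_obs, y_pred):
        d = np.asarray(y_pred) - np.asarray(y_obs)
        return float(np.sqrt(np.mean(d ** 2)))

    def mae(y_obs, y_pred):
        return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_obs))))

    def class_stats(tp, fp, tn, fn):
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        return {"sensitivity": sens, "specificity": spec,
                "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
                "Youden J": sens + spec - 1.0}

    def out_of_scope(x_new, X_train, margin=0.10):
        """Flag any descriptor outside the training range widened by `margin`."""
        lo, hi = X_train.min(axis=0), X_train.max(axis=0)
        pad = margin * (hi - lo)
        return bool(np.any((x_new < lo - pad) | (x_new > hi + pad)))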

3.7 Analyzing Outliers

The process outlined above usually does not produce a finished
QSPR model in the first round; rather, it provides a way to identify
more subtle kinds of error in the input data.
1. Overall performance statistics like RMSE are important for
comparing the performance of different regression models,
but examining plots of predicted vs observed property values
is much more informative and should always be done. Such a
visual examination highlights anomalously high or low predic-
tions. Outliers in the test set usually indicate a weakness in the
model, but outliers in the training set may be cases where the
experimental value is in error. Finding the latter kind of outlier
is good, since it indicates that the model is robust enough to
make an accurate prediction despite being given wrong infor-
mation. Identifying all of the data points in such a plot that
come from the same literature source as an outlier can indicate
a whole block of observations that are also mispredicted but to
a lesser degree because the model was able to partially offset the
aggregate bias arising from the error. Correcting all of the
errors almost always improves (decreases) the overall RMSE (see
the plotting sketch following this list).

Fig. 2 Identifying systematic bias due to errors in reported units by examining plots of predicted versus
observed CYP rat liver microsomal intrinsic clearance (RLM CLint). (LEFT) Highlighted points represent
uncorrected values from Hibi et al. [30]. (RIGHT) Highlighted points represent uncorrected values from
Röver et al. [31]
2. Highlighting blocks of observations from a single source can
identify data sources that generate no obvious outliers. Figure 2
shows results from an interim model that revealed two blocks
of flawed rat liver microsome clearance data. The compounds
in the lower block were too low by a factor of 1000. ChEMBL
provided the wrong units for the upper set of outliers—μL/
min/mg of protein were reported instead of mL/min per g
of liver. The (erroneous) information that the first block of
compounds contributed to the model evidently overlapped
enough with the information from other compounds that the
model was unable to reconcile the discrepancy. The structures
in the second block were unique enough that the model was
able to accommodate the errors. Nonetheless, correcting the
error improved the overall fit and predictivity.
3. Classification models are not amenable to the analysis described
above, but the confidence estimates mentioned earlier can serve a
similar function. It often happens that there are a handful of
predictions that every single network in the ensemble gets
wrong. This sometimes reflects experimental problems (e.g.,
in HTS data) that cannot be resolved, but it often reflects the
same kind of systematic errors that come out of the regression
analyses described above—i.e., cases where it is the input data
that are wrong, not the predictions.
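
The predicted-versus-observed plot described in step 1, with one
literature source highlighted as in Fig. 2, takes only a few lines of
matplotlib. The arrays and the "suspect source" block below are
invented placeholders that simulate a unit-error bias.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    observed = rng.normal(1.0, 0.8, size=120)   # invented log-scale values
    predicted = observed + rng.normal(0.0, 0.25, size=120)
    suspect = np.zeros(120, dtype=bool)
    suspect[-10:] = True                        # hypothetical flawed source
    predicted[suspect] += 1.5                   # simulate a unit-error block

    plt.scatter(observed[~suspect], predicted[~suspect], s=12,
                label="other sources")
    plt.scatter(observed[suspect], predicted[suspect], s=24,
                label="suspect source")
    lims = [observed.min(), observed.max()]
    plt.plot(lims, lims, linewidth=1)           # identity line for reference
    plt.xlabel("observed"); plt.ylabel("predicted"); plt.legend()
    plt.show()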

3.8 Iterate

Once you have used your initial model to identify second-order
errors in the data set, you will be ready to repeat the process by
building a refined model. A second or third iteration may be
required before all the addressable errors have been found, but
the effort invested in improving the data will ultimately pay off in
increased accuracy and robustness of the resulting QSPR model.

4 Notes

1. It is worthwhile to invest some time in studying how the


property being modeled is actually measured experimen-
tally—and how it was measured in the past. Very high solubi-
lities, for example, were historically measured by determining
the amount of material that would dissolve in 1000 g of
water rather than by the amount present in 1 mL of saturated
solution, as is standard now. This distinction is often lost in
compilation, a misunderstanding that can lead to incorrect
molar solubilities.
Tabulated melting points often include values for the tem-
perature at which compounds decompose. In the primary litera-
ture, such “melting points” are generally qualified by appending
“(dec.)”; overlooking this distinction can lead to problems in
generating or evaluating a QSPR model (e.g., [32]).
Kinetic solubility obtained from turbidimetric analysis is
not the same property as solubility at thermodynamic equilib-
rium in pure water, and solubility in buffered solution (e.g., at
pH 7.4) is not the same as solubility in pure water.
Finally, microsomal esterases will distort liver CYP micro-
somal clearance models when data for ester prodrugs are
included that lack a control incubation without NADPH [33].
2. Nominal property values of interest are subject in most cases to
significant variability, depending on exactly how they are
measured. In principle, getting all of the data from any single
source will tend to minimize such variability and thereby
increase the consistency (precision) of QSPR predictions. His-
torically, this has led many pharmaceutical companies to rely on
models built entirely on in-house property data, which always
comes at some risk of compromising the robustness and accu-
racy (increasing the bias) of those predictions.
This is particularly true when the data are obtained by
high- or medium-throughput assays that are themselves
in vitro models of the desired property: measuring Caco-2 or
Madin-Darby canine kidney (MDCK) cell monolayer perme-
ability to estimate effective human intestinal permeability (Peff)
is a prominent example of this. S. Yamashita et al. showed that
Caco-2 permeability data for the same set of compounds can
vary widely across laboratories due to the nonuniformity in
experimental conditions across organizations [34] and the het-
erogeneity of the cell line [35].
In such cases the “true” property values for some kinds of
compounds will consistently be under- or overpredicted. If
predictions are only ever compared to measurements using
exactly the same procedure as that used to generate the training
data for the model, this is unlikely to be a problem. Problems
are likely to arise, however, if biased predictions are compared
to measurements made outside the company or used as input
to a pharmacokinetic simulation, for example.
Any model built solely from in-house data should be eval-
uated by application to a truly external data set, if only to assess
how to interpret values from the literature.
3. It is a good idea to set aside outliers and observations with
explicitly qualified values as a “soft” test set. The errors in
prediction for such observations should not be included when
model performance statistics are calculated, but predictions
that are consistent with a qualified value nonetheless lend sup-
port to the QSPR model that generates them.
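One minimal way to implement this bookkeeping is sketched below in Python; the column names and qualifier conventions are hypothetical and would need to be adapted to the data set at hand.

import numpy as np
import pandas as pd

# Hypothetical table: 'qualifier' marks censored values such as '>' or
# '<'; those rows form the "soft" test set described above.
df = pd.DataFrame({
    "obs":       [2.1, 3.0, 4.5, 1.2],
    "pred":      [2.0, 3.4, 5.0, 1.1],
    "qualifier": ["",  ">",  "",  "<"],
})

hard = df[df["qualifier"] == ""]   # scored in performance statistics
soft = df[df["qualifier"] != ""]   # inspected, but not scored

rmse = np.sqrt(np.mean((hard["obs"] - hard["pred"]) ** 2))
# A qualified value still "supports" the model when the prediction
# falls on the correct side of the censoring threshold.
supported = ((soft["qualifier"] == ">") & (soft["pred"] > soft["obs"])) | \
            ((soft["qualifier"] == "<") & (soft["pred"] < soft["obs"]))
print(rmse, int(supported.sum()))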
4. We encountered an interesting abstraction error recently where
a structure from a paper containing a single badly predicted
fraction unbound in plasma “jumped out” of the spreadsheet
display as being out of place. The structure, drawn from a
ChEMBL search for fraction unbound in rat plasma, was a
rather unusual cyclic acyl enamine (Fig. 3), whereas the
substituents for the other analogs were much more mundane:
succinimide, imidazolidinones, etc. Examination of the table of
substituent structures in the original paper [36] suggested that
the substituent structure for one compound had been pasted into
the table so that it overlapped the one above, thereby obscuring
the carbonyl oxygen. Correcting the structure brought predictions
for it and other compounds in the paper into line with predictions
based on the rest of the data set.

Fig. 3 Structures retrieved from ChEMBL for a paper by Pevarello et al. [36]. The
image at the right is taken from Table 3 in that paper (adapted with permission.
Copyright 2005 American Chemical Society), which lays out the heterocyclic
para-substituents for a series of 2-phenylpropionamide analogs. The dotted line
shows where the background from the substituent for compound 22 overlaps
the structure for the substituent for compound 21, obscuring its carbonyl
oxygen. A check of the synthesis details section confirmed that 21 is, in fact,
a succinimide
5. Physicochemical properties are generally insensitive to R/S
stereochemistry and to mixture composition, in that enantio-
mers have the same solubility, pKa, etc. They may differ some-
what in metabolism, but—with some notable exceptions—
usually not by a large amount. The ADME properties of race-
mic mixtures, on the other hand, can be very different from
those of either enantiomer. The melting points and solubilities
of many racemic amino acid mixtures are a case in point: the
enantiomers pair up in the crystal, increasing its stability and
reducing its solubility. Differential effects on transporters are
also possible, with one enantiomer affecting transport and/or
metabolism of the other [37].
A property of the racemic mixture in such cases is not
properly a molecular property of either enantiomer, just as the
melting point and solubility of a salt are attributes of the
combination rather than of either molecular component
alone. Those properties may well be of practical importance
but few if any model-building programs are likely to handle
data for such mixtures properly.

Acknowledgments

The authors would like to acknowledge Michael S. Lawless, Marvin
Waldman, and Walter S. Woltosz of Simulations Plus, Inc., for their
careful reading of the manuscript and their insightful suggestions.

References

1. Pragyan P, Kesharwani SS, Nandekar PP, Rathod V, Sangamwa AT (2014) Predicting drug metabolism by CYP1A1, CYP1A2, and CYP1B1: insights from MetaSite, molecular docking and quantum chemical calculations. Mol Divers 18(4):865–878
2. Houston JB, Kenworthy KE (2000) In vitro-in vivo scaling of CYP kinetic data not consistent with the classical Michaelis-Menten model. Drug Metab Dispos 28(3):246–254
3. ADMET Predictor™. Simulations Plus Inc., Lancaster, CA, USA
4. RCSB Protein Data Bank. https://www.rcsb.org/pdb/home/home.do
5. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36. https://doi.org/10.1021/ci00057a005
6. SMILES–A simplified chemical language. Daylight Chemical Information Systems, Inc. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
7. The IUPAC international chemical identifier (InChI). International Union of Pure and Applied Chemistry. https://iupac.org/who-we-are/divisions/division-details/inchi/
8. Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D (2015) InChI, the IUPAC international chemical identifier. J Cheminform 7:23
9. MedChem Designer™: chemical structure drawing and property prediction. Simulations Plus, Inc. http://www.simulations-plus.com/software/medchem-designer/
10. ChEMBL. EMBL-EBI. https://www.ebi.ac.uk/chembl/
11. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Krüger FA, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington J (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:1083–1090. https://doi.org/10.1093/nar/gkt1031
12. Wang YL, Bryant SH, Cheng TJ, Wang JY, Gindulyte A, Shoemaker BA, Thiessen PA, He SQ, Zhang J (2017) PubChem BioAssay: 2017 update. Nucleic Acids Res 45(D1):D955–D963. https://doi.org/10.1093/nar/gkw1118
13. Waldman M, Fraczkiewicz R, Clark RD (2015) Tales from the war on error: the art and science of curating QSAR data. J Comput Aided Mol Des 29:897
14. AID 1996: aqueous solubility from MLSMR stock solutions (2009) Available via National Center for Biotechnology Information. https://pubchem.ncbi.nlm.nih.gov/bioassay/1996. Accessed Nov 2017
15. ChemSpider: search and share chemistry. Royal Society of Chemistry. http://www.chemspider.com/
16. What is R? The R Foundation. https://www.r-project.org/about.html
17. Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G, Evelo CT, Guha R, Steinbeck C (2017) The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform 9:33
18. About KNIME. KNIME. https://www.knime.com/. Accessed 17 Nov 2017
19. BIOVIA Pipeline Pilot. Dassault Systèmes. http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/
20. Tosco P, Stiefl N, Landrum G (2014) Bringing the MMFF force field to the RDKit: implementation and validation. J Cheminform 6:37
21. Todeschini R, Consonni V (2000) Handbook of molecular descriptors. Wiley-VCH, Weinheim; New York
22. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–752. https://doi.org/10.1021/ci100050t
23. Hall LH, Kier LB (1995) Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. J Chem Inf Comput Sci 35:1039–1045
24. Yan A, Gasteiger J (2003) Prediction of aqueous solubility of organic compounds by topological descriptors. QSAR Comb Sci 22:821–829. https://doi.org/10.1002/qsar.200330822
25. Tiikkainen P, Bellis L, Light Y, Franke L (2013) Estimating error rates in bioactivity databases. J Chem Inf Model 53(10):2499–2505. https://doi.org/10.1021/ci400099q
26. May R, Maier H, Dandy G (2010) Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Netw 23:283–294
27. Clark RD (2003) Boosted leave-many-out cross-validation: the effect of training set and test set diversity on PLS statistics. J Comput Aided Mol Des 17:265–275
28. Žuvela P, Liu JJ, Macur K, Bączek T (2015) Molecular descriptor subset selection in theoretical peptide quantitative structure–retention relationship model development using nature-inspired optimization algorithms. Anal Chem 87(19):9876–9883. https://doi.org/10.1021/acs.analchem.5b02349
29. Clark RD, Liang W, Lee AC, Lawless MS, Fraczkiewicz R, Waldman M (2014) Using beta binomials to estimate classification uncertainty for ensemble models. J Cheminform 6(1):34
30. Hibi S, Ueno K, Nagato S, Kawano K, Ito K, Norimine Y, Takenaka O, Hanada T, Yonaga M (2012) Discovery of 2-(2-oxo-1-phenyl-5-pyridin-2-yl-1,2-dihydropyridin-3-yl)benzonitrile (perampanel): a novel, noncompetitive α-amino-3-hydroxy-5-methyl-4-isoxazolepropanoic acid (AMPA) receptor antagonist. J Med Chem 55(23):10584–10600. https://doi.org/10.1021/jm301268u
31. Röver S, Andjelkovic M, Bénardeau A, Chaput E, Guba W, Hebeisen P, Mohr S, Nettekoven M, Obst U, Richter WF, Ullmer C, Waldmeier P, Wright MB (2013) 6-Alkoxy-5-aryl-3-pyridinecarboxamides, a new series of bioavailable cannabinoid receptor type 1 (CB1) antagonists including peripherally selective compounds. J Med Chem 56(24):9874–9896. https://doi.org/10.1021/jm4010708
32. Ran Y, Jain N, Yalkowsky SH (2001) Prediction of aqueous solubility of organic compounds by the general solubility equation (GSE). J Chem Inf Comput Sci 41(5):1208–1217. https://doi.org/10.1021/ci010287z
33. Beaulieu PL, Marte JD, Garneau M, Luo L, Stammers T, Telang C, Wernic D, Kukolj G, Duan J (2015) A prodrug strategy for the oral delivery of a poorly soluble HCV NS5B thumb pocket 1 polymerase inhibitor using self-emulsifying drug delivery systems (SEDDS). Bioorg Med Chem Lett 25:210–215
34. Yamashita F, Fujiwara S-I, Hashida M (2002) The "latent membrane permeability" concept: QSPR analysis of inter/intralaboratorically variable Caco-2 permeability. J Chem Inf Comput Sci 42(2):408–413. https://doi.org/10.1021/ci010317y
35. Sambuy Y, Angelis ID, Ranaldi G, Scarino ML, Stammati A, Zucco F (2005) The Caco-2 cell line as a model of the intestinal barrier: influence of cell and culture-related factors on Caco-2 cell functional characteristics. Cell Biol Toxicol 21(1):1–26. https://doi.org/10.1007/s10565-005-0085-6
36. Pevarello P, Brasca MG, Orsini P, Traquandi G, Longo A, Nesi M, Orzi F, Piutti C, Sansonna P, Varasi M, Cameron A, Vulpetti A, Roletto F, Alzani R, Ciomei M, Albanese C, Pastori W, Marsiglio A, Pesenti E, Fiorentini F, Bischoff JR, Mercurio C (2005) 3-Aminopyrazole inhibitors of CDK2/cyclin A as antitumor agents. 2. Lead optimization. J Med Chem 48:2944–2956
37. Borgstrom L, Nyberg L, Jonsson S, Lindberg C, Paulson J (1989) Pharmacokinetic evaluation in man of terbutaline given as separate enantiomers and as the racemate. Br J Clin Pharmacol 27(1):49–56. https://doi.org/10.1111/j.1365-2125.1989.tb05334.x
Chapter 9

Isomeric and Conformational Analysis of Small Drug and Drug-Like
Molecules by Ion Mobility-Mass Spectrometry (IM-MS)
Shawn T. Phillips, James N. Dodds, Jody C. May, and John A. McLean

Abstract
This chapter provides a broad overview of ion mobility-mass spectrometry (IM-MS) and its applications in
separation science, with a focus on pharmaceutical applications. A general overview of fundamental ion
mobility (IM) theory is provided with descriptions of several contemporary instrument platforms which are
available commercially (i.e., drift tube and traveling wave IM). Recent applications of IM-MS toward the
evaluation of structural isomers are highlighted and placed in the context of both a separation and
characterization perspective. We conclude this chapter with a guided reference protocol for obtaining
routine IM-MS spectra on a commercially available uniform-field IM-MS.

Key words Isomers, Drugs, Conformation, Ion mobility spectrometry, Ion mobility-mass spectrom-
etry, IM-MS

1 Introduction

The process of developing new drug candidates has changed
significantly over time, as a result of the Human Genome Project and
other technological advances in computational modeling and bio-
informatics [1]. For example, high-throughput screening methods
provide unparalleled capacity to screen millions of chemical struc-
tures for potential drug efficacy [2, 3], as opposed to simply devel-
oping a target candidate and anticipating relevant biochemical
action. Regardless of the approach taken toward the production of
novel drug candidate molecules, whether by reverse pharmacology,
the classical approach, or natural product discovery [4–6], most
drugs take several years to develop and millions of dollars to
become marketable, given the required validation through in-depth
clinical trials, evaluation of safety risks, and FDA approval
[7]. As part of this development process, the analytical need to


study the structural characteristics of these small molecules is
imperative.
Structural characterization of potential drug candidates, either
derived from natural sources or synthesized in the laboratory, is a
complex and time-consuming process, and for rapid analyses, phar-
maceutical companies value the high degree of analytical selectivity
and sensitivity afforded by modern mass spectrometry
(MS) methods toward overall quality control of synthesized pro-
ducts and characterization of new drug targets [8, 9]. Although
mass spectrometers are highly selective, often able to assign a
molecular formula for a target analyte based solely on molecular
mass measurement, isomeric species are difficult to differentiate by
traditional MS methods, even with the addition of tandem MS/MS
approaches [10, 11]. Because the biological function of chemical
compounds can change with their structural variation, isomeric
species are a highly researched area of the pharmaceutical field
[12]. Condensed phase separation techniques such as gas or liquid
chromatography are often utilized to separate complex mixtures
prior to mass analysis and provide the ability to separate isomers by
differences in chemical properties, such as polarity or boiling point.
These methods, while effective, are often highly selective to narrow
classes of isomers and are not inherently high-throughput techni-
ques. In this chapter, we describe the application of IM-MS, an
emerging analytical technique for structurally characterizing small
molecule isomer systems, with a particular focus on the role of
IM-MS in characterizing biological systems and relevant pharma-
ceutical applications.

1.1 Isomers

Isomers are defined as compounds having the same molecular
formula but differing in their overall chemical structure [13]. Iso-
meric species are further subdivided into categories that reflect their
structural variations, which may include covalent bond rearrange-
ments (constitutional isomers), stereochemical variations (stereoi-
somers), or rotational isomers, commonly referred to as rotamers.
As constitutional isomers vary in skeletal structure between constit-
uent atoms, these isomers possess a broad scope of biological
activity based upon their particular structural arrangements. For
example, the molecular formula C8H9NO2 is reported to have
33 isomers by the PubChem database [14], and many of these
isomers have unique chemical behavior and physiological function
(Fig. 1). In one case, paracetamol (more commonly known as
acetaminophen) is a well-known analgesic, yet its constitutional
isomer, methyl anthranilate, functions as a bird repellent and a
flavor additive in drinks [15]. The structural makeup of constitu-
tional isomers can also vary widely depending on their biological
class. For example, lipid isomers typically vary in alkyl chain position
and cis/trans double bond positioning [16], while peptides tend to

Fig. 1 Structures related to four constitutional isomers of chemical formula C8H9NO2 and corresponding
function or typical use

have sequence order variation or even amino acid substitutions
comprised of the same chemical formula (e.g., leucine/
isoleucine) [17].
In addition to constitutional isomers, compounds with the
same molecular formula can differ in stereochemistry (i.e., diaster-
eomers and enantiomers), resulting in varying chemical and physi-
cal properties. For example, ethambutol is a 204 Da molecule
(C10H24N2O2) possessing two stereocenters. In the (+) form (S,S),
ethambutol is frequently used to treat tuberculosis [18]. However,
with inversion of chirality at its two stereocenters to form (−) or
(R,R) ethambutol, the molecule is known to cause blindness
[19]. Isomers also exist for two compounds possessing the same
chemical scaffold and chirality. For example, rotamers are small
molecule conformers where multiple three-dimensional molecular
structures can arise as a result of rotation around a single bond. In
some cases this bond rotation gives rise to atropisomerism, which is
the restriction of rotation around a single covalent bond which
results in distinct optical isomers. A commonly cited rotamer exam-
ple which exhibits freedom of rotation around a single bond is the
Newman projections of butane [20]. In other cases where rotation
is not restricted, different stable conformations are still possible,
especially in protein analysis. Small molecules in particular are
noted for producing a variety of conformations as a result of their
flexibility [21]. Because of the diverse chemical activity that can
exist within constitutional and conformational isomers, finding
useful and efficient ways to explore the structure of these molecules
can provide insight into their specific chemical properties.
For the past 15 years, IM-MS has made major contributions to the
analysis of constitutional and conformational isomers [22–24]. As a
result of the commercialization and rapid adoption of IM-MS
instrumentation [25–27], the Web of Science database lists over
3500 articles in the last decade related to IM-MS studies [28].
While ion mobility has been traditionally utilized to study large
biological systems, more recently it has been applied to the study of

smaller (<400 Daltons) drug and drug-like molecules [29, 30].
In this chapter, we describe the technique and theory of IM-MS,
provide examples of the use of IM-MS to characterize various small
drug and drug-like molecules, and provide some basic methodol-
ogy toward collecting and analyzing IM-MS data using a commer-
cially available IM-MS platform (Agilent 6560) as an example [27].

1.2 Instrumentation and Theory

IM-MS is an emerging analytical technique that separates gas-phase
ions in two dimensions based upon molecular size and weight. In
the mobility dimension, analyte ions are separated based upon their
two-dimensional orientationally averaged size in the gas phase
(collision cross section, CCS), which provides information regard-
ing their size and shape [26, 31]. In the mass spectrometer dimen-
sion, separation is based upon the mass-to-charge ratio (m/z) of
the analyte ion, which is directly correlated to its intrinsic molecular
formula. Combined, IM-MS provides unique and important infor-
mation regarding the gas-phase density preferences of different
classes of molecules, which can help identify unknown compounds
that share similar structural scaffolds [32, 33].
There are four basic components of an IM-MS instrument: the
ion source, the ion mobility separator, the mass analyzer, and the
detector (Fig. 2a). The type and arrangement of these components
can vary depending on instrument vendor and experiment applica-
tion [34, 35].
In the source region, analyte ions are commonly generated by
electrospray ionization (ESI), which allows the option for directly
coupling an LC separation. In ESI, ions enter the source as a liquid
and are vaporized using a combination of gas flows and electric
fields, which ultimately generate gas-phase ions. While ESI is the
most commonly used ion source, other ion source types include
laser and chemical ionization (e.g., MALDI and APCI) [36]. ESI
commonly produces protonated and deprotonated ions ([M+H]+,
[M−H]−), as well as various alkali metal cation species, such as
[M+Na]+ and [M+K]+, where M represents the neutral form of the
molecule. Once ions are generated, they are released into the ion
mobility spectrometer where they are separated based on their
gas-phase size and shape (CCS). Following the mobility separation,
ions enter the mass analyzer where they are separated by their mass-
to-charge ratio (m/z). For more in-depth information regarding
the experiment, we refer the reader to a recent literature review
which covers the various IM techniques and instrumentation in
detail [26, 37].
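For readers who want to anticipate where these adduct species will appear in a spectrum, the short Python sketch below computes monoisotopic adduct m/z values from exact atomic masses; the thalidomide mass is included purely as a familiar example (compare the [M+H]+ ion near m/z 259.07 encountered in Subheading 3.3).

# Monoisotopic adduct m/z values for a neutral monoisotopic mass M.
M = 258.0641           # thalidomide, C13H10N2O4 (monoisotopic, Da)
m_e = 0.00054858       # electron mass (Da)
adducts = {
    "[M+H]+":  M + 1.0072765,          # add a bare proton
    "[M+Na]+": M + 22.9897693 - m_e,   # add Na atom, remove one electron
    "[M+K]+":  M + 38.9637065 - m_e,   # add K atom, remove one electron
    "[M-H]-":  M - 1.0072765,          # remove a proton, keep electrons
}
for name, mz in adducts.items():
    print(f"{name}: {mz:.4f}")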
Ion mobility techniques can be broadly separated into two
method types: time-dispersive methods, which include drift tube
and traveling wave ion mobility spectrometry (DTIMS and
TWIMS, respectively), and space-dispersive methods, which pri-
marily include high field-asymmetric waveform IM and differential
ion mobility spectrometry (FAIMS and DMS, respectively) which

Fig. 2 (a) Block diagram of a typical IM-MS instrument. Ions are separated in the presence of a neutral drift
gas by (b) a linear electric field along a series of ring electrodes (DTIMS) or (c) by a pulse wave generated by
applied sequential voltage along a series of ring electrodes (TWIMS)

collectively operate as mobility-filtering devices. The examples
presented in this chapter will focus on recent applications of the time-
dispersive methods of DTIMS and TWIMS, which collectively
represent the majority of IM instrumentation currently
utilized [26].

1.2.1 Drift Tube Ion Mobility

In drift tube ion mobility (DTIMS), the IM region consists of a
series of ring electrodes contained within a neutral drift gas (typi-
cally helium or nitrogen) (Fig. 2b) [38, 39]. DTIMS is operated at
one of two pressure regimes: low (1–10 Torr) and elevated
(ca. 760 Torr) pressures. Ion transmission is typically more
efficient at reduced pressure but results in somewhat lower IM
resolving power owing to greater diffusion. As ions are introduced
into the drift tube, they are drawn through the drift region as a
result of an applied electric field along the ring electrodes. During
ion drift, the ions interact with the buffer gas at low energy, and
molecules with smaller rotationally averaged surface area (smaller
CCS) traverse the region faster as a result of fewer collisions.
Mathematically, the CCS of the analyte ion can be calculated
using the Mason-Schamp equation [27, 32], where K0 is the
measured mobility of the ion, z is the charge of the ion, T is the
temperature of the drift gas, N0 is the number density of the
drift gas at standard temperature and pressure, and μ is the reduced
mass of the ion-neutral pair. The terms e and kB are the elementary
charge and Boltzmann's constant, respectively:

$$\mathrm{CCS} = \frac{3ze}{16N_0}\left(\frac{2\pi}{\mu k_B T}\right)^{1/2}\frac{1}{K_0} \qquad (1)$$
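For readers who wish to evaluate Eq. 1 directly, the following Python sketch does so in SI units; the example mobility, mass, and temperature are illustrative values rather than measurements from this chapter.

import math

e = 1.602176634e-19    # elementary charge (C)
kB = 1.380649e-23      # Boltzmann constant (J/K)
N0 = 2.6867811e25      # gas number density at STP (m^-3)

def mason_schamp_ccs(K0, z, T, m_ion_da, m_gas_da=28.0134):
    """CCS in m^2 from reduced mobility K0 in m^2/(V s); masses in Da."""
    da = 1.66053907e-27                                    # kg per dalton
    mu = m_ion_da * m_gas_da / (m_ion_da + m_gas_da) * da  # reduced mass
    return (3 * z * e) / (16 * N0 * K0) * math.sqrt(2 * math.pi / (mu * kB * T))

# Example: a 258 Da singly charged ion in nitrogen with
# K0 = 1.30 cm^2/(V s) = 1.30e-4 m^2/(V s) at 298 K.
print(mason_schamp_ccs(1.30e-4, 1, 298.0, 258.0) * 1e20)  # ~164 A^2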

1.2.2 Traveling Wave Ion Mobility

Similar to DTIMS, a traveling wave ion mobility drift cell uses an
inert buffer gas and a series of ring electrodes to move ions through
the drift region (Fig. 2c) [40, 41]. In TWIMS, ion pulses are
mobility separated by sequentially applying a direct current voltage
to the rings in a series along the drift cell to create a migrating
potential along the length of the cell. These sequential low-voltage
pulses generate waves of electric potential that push ions through
the drift region. As the wave propels the ions forward through the
device, low-energy elastic collisions occur between the analyte ions
and the buffer gas. Smaller ions experience fewer collisions with the
buffer gas and, as a result, traverse the drift region faster than larger
ions, resulting in shorter drift times. This mechanism is almost
identical to what is experienced in DTIMS, but with the exception
that larger ions are slower in TWIMS as a result of “falling over” the
wave pulses during their transit through the cell. The drift times are
converted to collision cross sections through a calibration proce-
dure which takes into account the drift times and CCS of known
internal standards [42, 43]. The ion mobility spectra obtained from
both TWIMS and DTIMS are qualitatively similar.
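A minimal sketch of such a calibration in Python appears below, assuming the widely used power-law form CCS' = A·(td')^B with singly charged calibrants so that charge and reduced-mass corrections can be folded into the tabulated values; the numbers are illustrative, not real calibrant data.

import numpy as np

# Corrected drift times (ms) and known CCS values (A^2) for a set of
# hypothetical singly charged calibrant ions.
td_cal = np.array([3.2, 4.1, 5.6, 7.0])
ccs_cal = np.array([130.0, 154.0, 190.0, 222.0])

# Fit ln(CCS) = ln(A) + B * ln(td) to obtain the power-law parameters.
B, lnA = np.polyfit(np.log(td_cal), np.log(ccs_cal), 1)
A = np.exp(lnA)

def td_to_ccs(td_ms):
    """Convert a corrected drift time (ms) to a calibrated CCS (A^2)."""
    return A * td_ms ** B

print(td_to_ccs(5.0))  # interpolated CCS for an unknown at 5.0 ms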

1.3 Current Work in Isomer Structural Separations

Historically, there are many reported studies where DTIMS and
TWIMS have been used to observe both constitutional and confor-
mational structures of large molecules, where structural differences
are significant and readily measured [44–46]. In this section, we
will present some recent examples of the use of TWIMS and
DTIMS to separate constitutional and conformational isomers in
small molecule systems, which have been given less attention in the
literature, but nonetheless are important avenues for developing
IM-MS for the separation and characterization of drug and drug-
like small molecule isomer systems.

1.3.1 Separation of Constitutional Isomers

Often chromatographic resolution can be difficult to achieve for
molecules with similar polarities, and hence constitutional isomers
have been studied in detail by IM-MS. In an effort to differentiate
molecules of interest from complex matrices (e.g., biological sam-
ples and natural product extracts), mobility techniques have inves-
tigated the separation of a wide variety of chemical classes including
lipids [47], carbohydrates [48], peptides [42, 49], and fossil fuels
[50]. As a specific example, we will consider the highly studied
isomer system of leucine and isoleucine, which represent a classi-
cally studied isomer pair from an analytical separation perspective,
as both compounds have the same chemical formula (C6H13NO2)
and hence cannot be distinguished by MS measurements alone.
From an ion mobility perspective, leucine and isoleucine have

been shown to be differentiable using several ion mobility
techniques, including FAIMS [51] and TWIMS [52]. In a recent study
by Dodds et al., 11 different leucine/isoleucine isomers were studied
using DTIMS, with a focus on the positive ion forms of these
molecules, [M+H]+ [24]. The subclasses of isomers investigated
in this study include enantiomers (two molecules whose stereo-
chemistry is opposite at every chiral center), diastereomers (two
molecules with multiple chiral centers of which some, but not all,
have opposite stereocenters), and constitutional isomers related to
leucine and isoleucine. A plot of the range of diversity in CCS for
these compounds appears in Fig. 3. As illustrated in the figure, the

[Fig. 3 near here. Panel (a) lists the 11 C6H13NO2 isomers with their
measured CCS values (Å2): (1) N,N-dimethylglycine ethyl ester,
127.5 ± 0.3; (2) 6-aminocaproic acid, 129.3 ± 0.2; (3) L-tert-leucine
(2S), 132.4 ± 0.3; (4) D-tert-leucine (2R), 132.5 ± 0.2; (5) L-allo-
isoleucine (2S,3R), 132.9 ± 0.2; (6) D-allo-isoleucine (2R,3S),
133.1 ± 0.3; (7) L-isoleucine (2S,3S), 133.5 ± 0.3; (8) D-isoleucine
(2R,3R), 133.3 ± 0.3; (9) L-leucine (2S), 135.1 ± 0.3; (10) D-leucine
(2R), 135.2 ± 0.3; (11) L-norleucine, 136.6 ± 0.3. Panel (b) shows ion
mobility spectrum overlays (relative abundance vs. collision cross
section, 120–145 Å2): (I) all isomers; (II) constitutional isomers 1, 3,
and 11 with their equal-ratio mixture; (III) L-isoleucine and L-leucine
with their mixture; (IV) diastereomers; (V) enantiomers.]
Fig. 3 (a) Leucine/isoleucine isomers with chemical formula C6H13NO2 examined in this study. Experimental
cross sections with respective standard deviations are shown at the right with corresponding stereochemistry.
(b) (I) Experimental IM spectrum overlays for all isomer compounds (standard error bars omitted for clarity).
(II) Overlay of the IM spectra corresponding to N,N-dimethylglycine ethyl ester, L-tert-leucine, and L-norleucine
and the IM spectrum corresponding to the mixture (black). (III) Overlays of L-isoleucine and L-leucine in
addition to the equal ratio mixture. (IV and V) Overlays of diastereomers and enantiomers, respectively.
Adapted with permission from ref. 24, Copyright 2017 American Chemical Society

differences in the CCS vary depending on isomer type. For exam-
ple, the enantiomers (e.g., L-leucine and D-leucine) show no statis-
tical difference in their measured collision cross sections, and
similarly, diastereomers (e.g., L-isoleucine and L-alloisoleucine)
exhibit different CCS values but are still very challenging to differ-
entiate (typically 0.4% difference in CCS). However, as the mole-
cules become more structurally diverse, the isomers possess more
significant differences in cross section. Specifically, three constitu-
tional isomers (N,N-dimethylglycine ethyl ester, L-tert-leucine,
and L-norleucine) are structurally distinct enough to yield baseline
or near-baseline separation and possess 3.6% and 3.1% difference in
their respective empirical cross sections.
In addition to illustrating the separation of various constitu-
tional isomers, the authors proposed a mathematical relationship
correlating the percent difference in cross section of any two iso-
mers of interest with respect to instrumental resolving power (Rp).
Briefly, the efficiency of ion mobility instruments is described quan-
titatively in terms of resolving power, defined for DTIMS instru-
ments as the ion drift time (td) divided by the width of the peak at
half height (full width at half maximum height, FWHM):
$$R_p = \frac{t_d}{\mathrm{FWHM}} \qquad (2)$$
As two isomers become more structurally similar (i.e., closer in
terms of their cross-sectional areas), higher levels of instrument
efficiency (Rp) are required to resolve the isomeric species of inter-
est. The final equation proposed in the above study relates separa-
tion efficiency, termed two-peak resolution (Rp-p), to instrumental
resolving power and analyte cross-sectional difference (Δ CCS%):
$$R_{p\text{-}p} = 0.00589 \times R_p \times \Delta\mathrm{CCS}\% \qquad (3)$$
To illustrate the utility of the above equation, consider two
isomers of interest that possess cross-sectional differences of 1.0%
(e.g., 200 Å2 and 202 Å2). The above equation predicts that
separating these isomers to half height resolution (0.83 Rp-p)
would require ca. 140 Rp. In this manner the study by Dodds
and coworkers can predict how efficiently two isomers of interest
will separate on a specific instrument platform, provided that the
CCS of each analyte is previously known and the resolving power of
the ion mobility instrument well characterized.
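Equation 3 doubles as a convenient back-of-the-envelope calculator; the short Python sketch below reproduces the worked example from the text, in which a 1.0% CCS difference requires roughly 140 resolving power for half-height (0.83 Rp-p) separation.

def delta_ccs_percent(ccs1, ccs2):
    """Percent CCS difference between two isomers (Eq. 3 input)."""
    return abs(ccs1 - ccs2) / min(ccs1, ccs2) * 100.0

def two_peak_resolution(rp, dccs_pct):
    """Eq. 3: predicted two-peak resolution for a given resolving power."""
    return 0.00589 * rp * dccs_pct

def required_rp(rpp_target, dccs_pct):
    """Eq. 3 rearranged: resolving power needed for a target Rp-p."""
    return rpp_target / (0.00589 * dccs_pct)

print(required_rp(0.83, delta_ccs_percent(200.0, 202.0)))  # ~141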

1.3.2 Separation of Conformational Isomers

Another biological class where IM-MS has been utilized to facilitate
the separation of isomers is carbohydrates. Carbohydrates, or sac-
charides, are a class of compounds which includes sugars, starches,
and cellulose. Carbohydrates are challenging systems to study with
most analytical techniques as they commonly exist as complex
mixtures in nature with variations in skeletal structure, bond coor-
dination, and stereochemistry (Fig. 4). Studies are further

Fig. 4 Complexity of carbohydrate isomers represented graphically pertaining to
both constitutional isomers (connectivity isomers) and stereoisomers (axial/
equatorial substitutions, chair/boat conformations, and α/β glycosidic linkage
orientation)

complicated because many of the compounds in these mixtures
have the same molecular formula (isomers), which again prevents
them from being fully characterized using traditional mass spec-
trometry methods. This makes ion mobility a particularly
intriguing tool to explore carbohydrate systems.
A recent example of the utility of IM to study carbohydrates
was carried out by Li and coworkers [53]. In their study of sugars,
TWIMS was used to measure drift time profiles for ions of
monosaccharide-glycolaldehydes and disaccharides of various sim-
ple sugars. While this work is another example of the ability of ion
mobility to separate constitutional isomers, there were occasions
where the presence of multiple conformations appeared, even for
what was believed to be a single analyte ion. For example, the
monosaccharide-glycolaldehyde β-D-glucopyranosyl-2-glycolalde-
hyde (β-D-glc-GA) was analyzed in negative mode ESI and pro-
duced a drift time profile that generates two distinct peaks for the
deprotonated ion. The authors propose that the appearance of
these multiple conformations may be attributed to several cyclic/
acyclic forms produced by hemiacetal formation in the gas phase.
While most conformers are typically noted for large biomolecular
species (i.e., proteins), these small molecule analytes produced
multiple distributions (peaks) for one ion form. Thus this work
illustrated the possibility to observe conformers even in a relatively
small molecule system.
The appearance of conformational isomers for small drug-like
molecules has also become evident in studies related to thalidomide
in the authors’ laboratory. Thalidomide is a small molecule drug
that is currently used primarily for the treatment of specific cancers
and for alleviating various symptoms of leprosy. However, histori-
cally thalidomide is recognized as one of the first drugs whose
different enantiomer forms produced drastically different

[Fig. 5 near here. (a) Positive mode [M+H]+ drift time profiles with
peaks at (I) 152.5 Å2, (II) 157.8 Å2, (III) 162.3 Å2, and (IV) 169.0 Å2;
(b) negative mode [M−H]− profiles with peaks at (V) 153.5 Å2 and
(VI) 159.8 Å2. Traces for (+)- and (−)-thalidomide overlap in both
panels.]

Fig. 5 Drift time profiles of (R) and (S) thalidomide enantiomers with corresponding CCS observed in both
positive and negative mode (a and b, respectively) for nitrogen drift gas. Structures are illustrated at the
bottom for chiral reference

biological effects. Figure 5 illustrates the structures and drift time
profiles for the [M+H]+ and [M−H]− ions of (S,−)-thalidomide
and (R,+)-thalidomide along with corresponding CCS. The drift
time profiles of both positive and negative mode ions for both
enantiomers are identical. This result is expected as stereochemical
differences in small molecules are not expected to result in different
drift time distributions. However, multiple peaks were detected for
both enantiomers of thalidomide in each ionization mode.
Structurally, this observation can be rationalized as a consequence
of multiple molecular conformations arising from rotation around
the single N–C bond that links the two ring moieties.
The development of methods for characterizing small drug and
drug-like molecules has become an important focal point for the
pharmaceutical industry. As a result there has been a concentrated
effort toward discovering new technologies to aid in the develop-
ment of new drugs. The examples presented in this work provide
strong support for the use of IM-MS as an important analytical
technology in exploring the structural diversity of constitutional
and conformational isomers of small drug and drug-like molecules.
In the following sections, we provide the materials and methods
necessary to obtain IM-MS spectra for small molecules (here
S-thalidomide, 258 Da) using a commercially available IM-MS
platform (Agilent 6560).

2 Materials

1. The sample preparation described here is for use with direct
infusion via a syringe pump operated at low flow rates (i.e.,
5–100 μL/min). As with any analytical study, it is beneficial to
have analyte samples and solvents with optimal purity for anal-
ysis. In this study, S-thalidomide was purchased from Sigma-
Aldrich, and the solvent (Optima LC-MS grade Water) was
obtained from Fisher Scientific. A range of analyte concentra-
tion can be used and should be selected based upon instrument
sensitivity and limits of detection. Standard protocol for this
instrument using direct infusion recommends an analyte con-
centration of 1–10 μg/mL. For the experiment described here,
a 10 μg/mL sample of S-thalidomide was prepared using
10 mM ammonium acetate in water.
2. The instrument must be tuned prior to data collection in order
to perform at optimal sensitivity, resolution, and accuracy.
Toward this end a commercially available tune mix solution
containing several standards over a range of masses and mobi-
lities was used (Agilent Tune Mix, part number G1969-
85000). Specifics of the tuning method are described in
Subheading 3.
3. Direct infusion was carried out using a 500 μL glass syringe and
a KD Scientific syringe pump. Flow rates were as described in
Subheading 3 to follow.
4. Collection of IM-MS data was obtained using Agilent Mas-
sHunter Acquisition Software (v.7.00). Data workup of
IM-MS distributions was performed using Agilent IM-MS
Browser Software (v. 7.02).

3 Methods

The method described below is for new users to the Agilent 6560
IM-MS instrument who have general knowledge of traditional
mass spectrometry operation. It is intended to provide the novice
user with the basic steps necessary to obtain routine IM-MS spec-
tra. It is not intended to provide an exhaustive description of all
instrument settings and their uses. The reader should consult their
service manual for extended instructions and a comprehensive
description of settings. Explicit notes are added following the
Subheading 3.

3.1 Preparing the Instrument for Direct Infusion

1. To ensure drift time and collision cross section reproducibility,
check instrument pressures in each of the following instrument
compartments: the high-pressure funnel region should be set
to 4.80 Torr (±0.02) and the trap funnel region pressure at
3.80 Torr (±0.01), and the drift tube pressure should be main-
tained at 3.95 Torr (±0.01) (see Note 1). A pressure regulation
manifold (alternate gas kit, Agilent) is recommended here in
order to automatically maintain these pressure settings. Alter-
natively, the user may choose to monitor and make manual
adjustments to the drift tube pressure in order to maintain
the precision of the IM measurements.
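A trivial Python guard for these pressure checks is shown below; the setpoints and tolerances are the ones quoted in step 1, and the helper name is our own.

# Setpoints and tolerances (Torr) quoted in step 1 above.
setpoints = {
    "high-pressure funnel": (4.80, 0.02),
    "trap funnel":          (3.80, 0.01),
    "drift tube":           (3.95, 0.01),
}

def pressure_ok(region, reading_torr):
    """True if a pressure reading is within tolerance of its setpoint."""
    target, tol = setpoints[region]
    return abs(reading_torr - target) <= tol

print(pressure_ok("drift tube", 3.96))  # True: within +/-0.01 Torr
print(pressure_ok("drift tube", 3.99))  # False: re-equilibrate first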
2. Open the Agilent MassHunter Workstation Data Acquisition
Program. Under the “Context” menu (Fig. 6a), choose the
“Tune” setting to perform an autotune. The source tempera-
tures and voltages will be preset for the Agilent Tune Mixture
(Fig. 6b), and the ion polarity and scan mode will be selected as
a function of the molecules of interest per individual experi-
ment (low mass mode (50–250 m/z), normal mode
(50–1700 m/z), or high mass range (100–3200 m/z)). In
this case we are investigating thalidomide (258 Da) and have
selected the normal instrument tuning mode (50–1700 m/z).
3. Select the “Tune and Calibration” tab (Fig. 6c). Select desired
ionization mode (positive mode is used here), TOF (time of
flight), mass calibration/check, and the corresponding mass

Fig. 6 Agilent MassHunter Workstation Data Acquisition Program tune page with callouts for various
commands

range depending on the analyte to be studied. Once settings
have been chosen, click "Apply" to apply these settings prior to
tuning (Fig. 6d).
4. Ensure that the calibrant line contains adequate calibrant solu-
tion and is plumbed into the primary nebulizer before starting
the autotune. It is recommended that at minimum, a quarter of
the calibration bottle volume (ca. 30 mL) be filled with calibra-
tion solution prior to tuning. Turn on the tune mix by right-
clicking in the Q-TOF selection box (Fig. 6e) and selecting
calibrant.
5. When the tune mix ions appear in the spectra window (e.g., m/
z 322, 622, and 922 in positive ion mode) (see Note 2), click
on the “Start TOF Mass calibration” button (Fig. 6f).
6. Upon tune completion (see Note 3), a calibration report will be
generated as a portable document file (PDF). From this report,
ensure that signal intensity is greater than 1 × 10^5 ion counts
and that the ion mobility resolution is between 40 and 60 (see
Note 4). After tuning, click the “Instrument State” tab
(Fig. 6g), and then click “Save” and “Apply.”
7. Once the instrument is tuned in the desired ion mode, switch
to the acquisition mode under the context menu (Fig. 6a) to
start collecting data.

3.2 Data Acquisition (Consult Fig. 7)

1. Load the preexisting low mass method in the drop-down box in
the method editor, and click "Apply" (Fig. 7a). If the user is
interested in collecting collision cross sections, the method
should include a voltage gradient in the drift tube region in
order to subtract out the non-mobility flight time. This voltage
gradient can be accessed under the “Advanced Parameters” tab
of the method editor screen (Fig. 7b) (see Note 5).
2. Load the sample in the syringe and syringe pump for direct
infusion. Ensure that the syringe pump is set with the correct
syringe diameter in order to output the correct flow rates.
Although flow rate will vary based on specific instrument
source settings and sensitivity, the flow rate used here is
1–10 μL/min (see Note 6). Turn on the syringe pump. Once
sample reaches the instrument, analyte ion peaks begin to
appear in the mass window.
3. To set up the data acquisition, select the “Sample Run” tab at
the bottom of the method editor screen (Fig. 7c). In the
sample run mode, name the file (Fig. 7d), and select the path
directory (Fig. 7e) for your file.
4. To acquire data, click the forward arrow (Fig. 7f) in the sample
run screen.

Fig. 7 Sample run dialogue box including file directory information and sample name

3.3 Data Workup

1. After data acquisition is complete, load the "Agilent MassHun-
ter IM-MS Browser” software, and open the desired file.
Once the file has opened, select “Condense File” under the
“Actions” tab (Fig. 8a). Condensing files will compress the
data from each experimental sequence into a single frame,
which is convenient for viewing multiple segment runs (e.g.,
CCS experiments) or long infusion experiments. Note that
condensing the file is an optional step.
2. In IM-MS Browser, you can view the resulting mass spectrum
(Fig. 8b), the drift spectrum data (Fig. 8c), and the 2-D plot of
mass-to-charge vs drift time (Fig. 8d) for all ions in the sample.
In the example spectrum of thalidomide (Fig. 8e), we will focus
on the mass-to-charge ion at 259.0708, [MþH]+.
3. Expand the Counts vs. Mass-To-Charge window by right-
clicking and holding on the mass axis and dragging over the
desired mass range. (To expand or contract any axis, right-click
hold any axis, and move the mouse right or left.) To move the
peak right or left, left click and hold the axis and drag accord-
ingly (Fig. 8b).
4. To obtain a drift time spectrum for a specific mass-to-charge
region, right-click and drag over the desired ion in the drift
time vs m/z region. This produces a box around the desired ion

Fig. 8 Agilent MassHunter IM-MS Browser interface. (a) Mass spectrum of thalidomide, (b) drift spectra
window with expanded window for thalidomide drift profile, and (c) 2-D IM-MS window with callout of the
thalidomide [MþH]+ ion species

(Fig. 8f). Command "Ctrl X" copies the selected region, and
"Ctrl D" pastes the spectrum in the user drift spectra window
(Fig. 8g).
5. This work flow was repeated for each thalidomide enantiomer
in both positive and negative ion mode, and a processed version
is illustrated in the main text as in Fig. 5.
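The box selection in step 4 amounts to integrating the two-dimensional intensity map over a narrow m/z window; for users working with exported data rather than the Browser, a minimal Python equivalent is sketched below. The arrays are synthetic stand-ins, and the export format of a particular instrument will differ.

import numpy as np

# Synthetic stand-ins for an exported drift time x m/z intensity matrix.
drift_axis = np.linspace(10, 40, 300)     # drift time axis (ms)
mz_axis = np.linspace(100, 400, 600)      # m/z axis (coarse, ~0.5 m/z bins)
intensity = np.random.rand(300, 600)      # rows: drift bins, cols: m/z bins

# Sum intensity over a +/-0.5 m/z window around the [M+H]+ ion of
# interest to obtain its extracted drift time profile.
target_mz, half_width = 259.0708, 0.5
window = np.abs(mz_axis - target_mz) <= half_width
drift_profile = intensity[:, window].sum(axis=1)
print(drift_axis[np.argmax(drift_profile)])  # apex drift time (ms)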

4 Notes

1. If using the telnet software provided by Agilent Technologies to
monitor the instrument pressures, it is beneficial to allow
approximately 20 min for the instrument pressures to equilibrate
prior to data acquisition.
brate prior to data acquisition.
2. If tune mix does not appear prior to calibration, it is possible
that an obstruction has lodged in the nebulizer. It may be
necessary to clean or exchange the primary nebulizer.
3. If calibration fails, it may be necessary to load a preexisting tune
file and mass calibrate with the loaded file. A common cause for
failing the instrument calibration is insufficient ion signal for
one or more of the tune mix compounds.
4. If the mobility resolution is not between 40 and 60, the user
may wish to repeat the autotune procedure.

5. For a detailed explanation of the voltage gradient method, we
refer the reader to ref. 32 and its supporting information, which
describe the non-mobility component of the drift time and the
subsequent conversion to CCS.
6. If minimizing sample consumption is important, the user
may wish to conduct step 4 prior to turning on the
syringe pump.

Acknowledgments

This work was supported in part using the resources of the Center
for Innovative Technology at Vanderbilt University. Financial sup-
port for aspects of this research was provided by The National
Institutes of Health (NIH Grant R01GM092218) and under Assis-
tance Agreement No. 83573601 awarded by the US Environmen-
tal Protection Agency (EPA). This work has not been formally
reviewed by the EPA, and the EPA does not endorse any products
or commercial services mentioned in this publication. Further-
more, the content is solely the responsibility of the authors and
should not be interpreted as representing the official views and
policies, either expressed or implied, of the funding agencies and
organizations.

References

1. Chial H (2008) DNA sequencing technologies key to the human genome project. Nature Education 1:219
2. Pareek CS, Smoczynski R, Tretyn A (2011) Sequencing technologies and genome sequencing. J Appl Genet 52:413–435
3. Zhang JH, Chung TD, Oldenburg KR (1999) A simple statistical parameter for use in evaluation and validation of high throughput screening assays. J Biomol Screen 4:67–73
4. Takenaka T (2001) Classical vs reverse pharmacology in drug discovery. BJU Int 88:7–10
5. Harvey AL, Edrada-Ebel R, Quinn RJ (2015) The re-emergence of natural products for drug discovery in the genomics era. Nat Rev Drug Discov 14:111–129
6. Vaidya ADB (2014) Reverse pharmacology-a paradigm shift for drug discovery and development. Curr Res Drug Discov 1:39–44
7. Roses AD (2008) Pharmacogenetics in drug discovery and development: a translational perspective. Nat Rev Drug Discov 7:807–817
8. Nageswara Rao R, Talluri MV (2007) An overview of recent applications of inductively coupled plasma-mass spectrometry (ICP-MS) in determination of inorganic impurities in drugs and pharmaceuticals. J Pharm Biomed Anal 43:1–13
9. Kauppila TJ, Wiseman JM, Ketola RA, Kotiaho T, Cooks RG, Kostiainen R (2006) Desorption electrospray ionization mass spectrometry for the analysis of pharmaceuticals and metabolites. Rapid Commun Mass Spectrom 20:387–392
10. Cooks RG (1995) Special feature: historical. Collision-induced dissociation: readings and commentary. J Mass Spectrom 30:1215–1221
11. Wells JM, McLuckey SA (2005) Collision-induced dissociation (CID) of peptides and proteins. Methods Enzymol 402:148–185
12. Nguyen LA, He H, Pham-Huy C (2006) Chiral drugs: an overview. Int J Biomed Sci 2:85–100
13. McMurry J (2008) Organic chemistry, 7th edn. Cengage Learning, Stamford, CT
14. NCBI.NLM.NIH.Gov. Search terms "C8H9NO2". Accessed 18 Apr 2017
15. Ferreres F, Giner JM, Tomás-Barberán FA (1994) A comparative study of hesperetin and methyl anthranilate as markers of the floral origin of citrus honey. J Sci Food Agric 65:371–372
16. Groessl M, Graf S, Knochenmuss R (2015) High resolution ion mobility-mass spectrometry for separation and identification of isomeric lipids. Analyst 140:6904–6911
17. Xiao Y, Vecchi MM, Wen D (2016) Distinguishing between leucine and isoleucine by integrated LC-MS analysis using Orbitrap fusion mass spectrometer. Anal Chem 88:10757–10766
18. Takayama K, Kilburn JO (1989) Inhibition of synthesis of arabinogalactan by ethambutol in Mycobacterium smegmatis. Antimicrob Agents Chemother 33:1493–1499
19. Chatterjee VK, Buchanan DR, Friedmann AI, Green M (1986) Ocular toxicity following ethambutol in standard dosage. Br J Dis Chest 80:288–291
20. Carey R (1996) Organic chemistry, 3rd edn. McGraw Hill, New York, pp 89–92
21. Kothiwale S, Mendenhall JL, Meiler J (2015) BCL::CONF: small molecule conformational sampling using a knowledge based rotamer library. J Cheminform 7:47
22. Paglia G, Williams JP, Menikarachchi L, Thompson JW, Tyldesley-Worster R, Halldórsson S, Rolfsson O, Moseley A, Grant D, Langridge J, Palsson BO, Astarita G (2014) Ion mobility derived collision cross sections to support metabolomics applications. Anal Chem 86:3985–3993
23. Enders JR, McLean JA (2009) Chiral and structural analysis of biomolecules using mass spectrometry and ion mobility-mass spectrometry. Chirality 21:253–264
24. Dodds JN, May JC, McLean JA (2017) Investigation of the complete suite of the leucine and isoleucine isomers: toward prediction of ion mobility separation capabilities. Anal Chem 89:952–959
25. Pringle SD, Giles K, Wildgoose JL, Williams JP, Slade SE, Thalassinos K, Bateman RH, Bowers MT, Scrivens JH (2007) An investigation of the mobility separation of some peptide and protein ions using a new hybrid quadrupole/travelling wave IMS/oa-ToF instrument. Int J Mass Spectrom 261:1–12
26. May JC, McLean JA (2015) Ion mobility-mass spectrometry: time-dispersive instrumentation. Anal Chem 87:1422–1436
27. May JC, Goodwin CR, Lareau NM, Leaptrot KL, Morris CB, Kurulugama RT, Mordehai A, Klein C, Barry W, Darland E, Overney G, Imatani K, Stafford GC, Fjeldsted JC, McLean JA (2014) Conformational ordering of biomolecules in the gas phase: nitrogen collision cross sections measured on a prototype high resolution drift tube ion mobility-mass spectrometer. Anal Chem 86:2107–2116
28. Web of Science. Thomson Reuters. Search terms "Ion Mobility" AND "Mass Spectrometry". Articles from 2002 to 2017. Accessed 15 May 2017
29. Paglia G, Astarita G (2017) Metabolomics and lipidomics using traveling-wave ion mobility mass spectrometry. Nat Protoc 12:797–813
30. Stow SM, Lareau NM, Hines KM, McNees CR, Goodwin CR, Bachmann BO, McLean JA (2014) In: Havlíček V, Spížek J (eds) Natural products analysis: instrumentation, methods, and applications. John Wiley & Sons, Inc., Hoboken, NJ, pp 397–432
31. Sundarapandian S, May JC, McLean JA (2010) Dual source ion mobility mass-spectrometer for direct comparison of ESI and MALDI collision cross section measurements. Anal Chem 82:3247–3254
32. Mason EA, McDaniel EW (1988) Transport properties of ions in gases. John Wiley and Sons, Indianapolis, IN
33. Glaskin RS, Valentine SJ, Clemmer DE (2010) A scanning frequency mode for ion cyclotron mobility spectrometry. Anal Chem 82:8266–8271
34. Cumeras R, Figueras E, Davis CE, Baumbach JI, Grácia I (2015) Review on ion mobility spectrometry. Part 1: current instrumentation. Analyst 140:1376–1390
35. Cumeras R, Figueras E, Davis CE, Baumbach JI, Grácia I (2015) Review on ion mobility spectrometry. Part 2: hyphenated methods and effects of experimental parameters. Analyst 140:1391–1410
36. Adamov A, Mauriala T, Teplov V, Laakia J, Pedersen CS, Kotiaho T, Sysoev AA (2010) Characterization of a high resolution drift tube ion mobility spectrometer with a multi-ion source platform. Int J Mass Spectrom 298:24–29
37. Kanu AB, Dwivedi P, Tam M, Matz L, Hill HH Jr (2008) Ion mobility-mass spectrometry. J Mass Spectrom 43:1–22
38. Jurneczko E, Kalapothakis J, Campuzano ID, Morris M, Barran PE (2012) Effects of drift gas on collision cross sections of a protein standard in linear drift tube and traveling wave ion mobility mass spectrometry. Anal Chem 84:8524–8531
39. Ujma J, Giles K, Morris M, Barran PE (2016) New high resolution ion mobility mass spectrometer capable of measurements of collision cross sections from 150 to 520 K. Anal Chem 88:9469–9478
40. Giles K, Williams JP, Campuzano I (2011) Enhancements in travelling wave ion mobility resolution. Rapid Commun Mass Spectrom 25:1559–1566
41. Shvartsburg AA, Smith RD (2008) Fundamentals of traveling wave ion mobility spectrometry. Anal Chem 80:9689–9699
42. Bush MF, Campuzano ID, Robinson CV (2012) Ion mobility mass spectrometry of peptide ions: effects of drift gas and calibration strategies. Anal Chem 84:7124–7130
43. Hines KM, May JC, McLean JA, Xu L (2016) Evaluation of collision cross section calibrants for structural analysis of lipids by traveling wave ion mobility-mass spectrometry. Anal Chem 88:7329–7336
44. Lanucara F, Holman SW, Gray CJ, Eyers CE (2014) The power of ion mobility-mass spectrometry for structural characterization and the study of conformational dynamics. Nat Chem 6:281–294
45. May JC, McLean JA (2015) A uniform field ion mobility study of melittin and implications of low-field mobility for resolving fine cross-sectional detail in peptide and protein experiments. Proteomics 15:2862–2871
46. Shvartsburg AA, Tang K, Smith RD (2009) Two-dimensional ion mobility analyses of proteins and peptides. Methods Mol Biol 492:417–445
47. Kliman M, May JC, McLean JA (2011) Lipid analysis and lipidomics by structurally selective ion mobility-mass spectrometry. Biochim Biophys Acta (BBA) - Molecular and Cell Biology of Lipids 1811:935–945
48. Gaye MM, Nagy G, Clemmer DE, Pohl NL (2016) Multidimensional analysis of 16 glucose isomers by ion mobility spectrometry. Anal Chem 88:2335–2344
49. Fenn LS, McLean JA (2013) Structural separations by ion mobility-MS for glycomics and glycoproteomics. Methods Mol Biol 951:171–194
50. Lalli PM, Corilo YE, Rowland SM, Marshall AG, Rodgers RP (2015) Isomeric separation and structural characterization of acids in petroleum by ion mobility mass spectrometry. Energy Fuel 29:3626–3633
51. Barnett DA, Ells B, Guevremont R, Purves RW (1999) Separation of leucine and isoleucine by electrospray ionization-high field asymmetric waveform ion mobility spectrometry-mass spectrometry. J Am Soc Mass Spectrom 10:1279–1284
52. Knapman TW, Berryman JT, Campuzano I, Harris SA, Ashcroft AE (2010) Considerations in experimental and theoretical collision cross-section measurements of small molecules using travelling wave ion mobility spectrometry-mass spectrometry. Int J Mass Spectrom 298:17–23
53. Li H, Bendiak B, Siems WF, Gang DR, Hill HH Jr (2013) Ion mobility mass spectrometry analysis of isomeric disaccharide precursor, product and cluster ions. Rapid Commun Mass Spectrom 27:2699–2709
Part III

Clinical Research Informatics in Drug Discovery


Chapter 10

A Computational Platform and Guide for Acceleration
of Novel Medicines and Personalized Medicine
Ioannis N. Melas, Theodore Sakellaropoulos, Junguk Hur,
Dimitris Messinis, Ellen Y. Guo, Leonidas G. Alexopoulos,
and Jane P. F. Bai

Abstract
In the era of big data and informatics, computational integration of data across the hierarchical structures of
human biology enables the discovery of new druggable targets of disease and new modes of action of a drug. We
present herein a computational framework and guide for integrating drug targets, gene expression data,
transcription factors, and prior knowledge of protein interactions to computationally construct the signal-
ing network (mode of action) of a drug. In a similar manner, a disease network is constructed using its
disease targets. Drug candidates are then prioritized computationally by ranking the closeness between the
disease network and each drug's signaling network. Furthermore, we describe the use of
the most perturbed HLA genes to assess the safety risk for immune-mediated adverse reactions such as
Stevens-Johnson syndrome/toxic epidermal necrolysis.

Key words Gene expression, Integer linear programming, Drug targets, Network protein interac-
tions, HLA genes

1 Introduction

Advances on many fronts of the biological sciences since the completion of
the human genome project have accelerated exponentially with the
aid of informatics technologies, thus making a large quantity of
biological data available across the hierarchy of the human body.
As a result, expert-guided curation of data and biological informa-
tion has made available a large number of high-quality databases and
knowledge bases for systems-based research, which facilitate our
understanding of diseases and the predictive assessment of drug toxic-
ity and accelerate the development of novel medicines for treating diseases.
From our experiences [1–5], we conclude that constructing a
drug's mode of action, from drug targets to protein interaction
networks to differentially expressed genes, may be an effective way
to advance medicines for treating diseases, either for new chemical

entities or repurposed drugs. The druggable target(s) of a disease
and the associated network can be constructed in the same fashion.
With a chemical's or a drug's mode of action and a druggable
network of a target disease in hand, one can computationally prior-
itize candidates for further testing.
To illustrate the framework of our computational platform, we
describe herein the materials needed and how to implement the
computational methods, thereby providing a guide for reference,
as shown in Fig. 1. We also summarize a guide for personalized
medicine, in which one can use a patient's cells to assess how he/she
might adversely respond to a drug (Fig. 1). In summary, the
computational platform described herein includes materials (drug
targets, genome-wide transcriptional expression data, prior knowledge
of protein interactions, and transcription factors) and computational
methods (identifying the differentially expressed genes following
exposure to a chemical/drug; applying integer linear programming
(ILP) to construct the signaling network (mode of action) of a drug
candidate, or the network of a disease from its druggable target; and
computationally prioritizing drug candidates). Last but not least, a
potential application for assessing personalized safety risk is
illustrated with the example of carbamazepine and the HLA-DQB1 gene.

Fig. 1 The framework of a computational platform and guide for precision medicine, in which the steps consist of
(a) compilation of the needed data from databases, (b) computational construction of a drug's mode of action
and a disease network, (c) prioritization of drug candidates, and (d) individualized assessment of the risk for
treatment-related severe/fatal adverse reactions

2 Materials

Before setting up computations, one needs to compile all the data.
Depending on the specific goal of a project, we first compile a list of
drugs or a list of toxins. To computationally link a drug’s mode of
action to its pharmacological effects under investigation, we com-
pile the data from various credible databases and knowledge bases.
The ones that we frequently use are described below.

2.1 Drug Targets

Several high-quality pharmacological databases containing detailed
information on the on-targets and off-targets of individual
drugs/chemicals are publicly accessible. The most frequently used
ones are highlighted below.
DrugBank [6] is a unique bioinformatics and cheminformatics
resource with detailed data of drugs and their targets (see Note 1).
The current release version 5.0 contains 8261 drug entries includ-
ing 2201 FDA-approved small molecule drugs, 233 FDA-approved
protein and peptide drugs, 94 nutraceuticals, and over 6000 exper-
imental drugs. The complete DrugBank data are freely available for
noncommercial use in XML format. The downloaded XML file can
be processed using any programming language to parse out the
needed information.
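
As a minimal sketch (assuming the DrugBank 5.x XML schema; the namespace and element names below should be verified against the release actually downloaded), drug-target pairs could be extracted in Python along these lines:

# Sketch: extract drug-target pairs from the DrugBank XML download.
# The tag names follow the DrugBank 5.x schema as an assumption;
# verify them against the release you downloaded.
import xml.etree.ElementTree as ET

NS = "{http://www.drugbank.ca}"

def drug_target_pairs(xml_path):
    pairs = []
    # iterparse streams the >500 MB file instead of loading it at once
    for _, elem in ET.iterparse(xml_path):
        if elem.tag != NS + "drug" or elem.find(NS + "targets") is None:
            continue  # skip non-drug and nested drug elements
        drug_name = elem.findtext(NS + "name")
        for target in elem.iter(NS + "target"):
            target_name = target.findtext(NS + "name")
            if drug_name and target_name:
                pairs.append((drug_name, target_name))
        elem.clear()  # release memory for the processed element
    return pairs

Streaming with iterparse keeps memory use modest even for the full database dump.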
DrugCentral, an online drug compendium, is another informa-
tion source of chemistry and pharmacology of approved drugs by
the Food and Drug Administration (FDA) and other regulatory
agencies [7]. Included in this database are approximately
900 curated biomolecules of human and pathogen origin, on
which 1578 FDA-approved drugs act [8]. The Therapeutic Target
Database (TTD) [9] contains a smaller list of FDA-approved drugs
than DrugBank but many more experimental drugs (currently 14,853).
STITCH [10] is a database of known and predicted interac-
tions between chemicals and proteins. The STITCH database
(v 5.0) covers 9,643,763 proteins from 2031 organisms and links
them to 430,000 compounds. Each link is annotated with a score
reflecting the level of confidence for this interaction, based on the
sources from which it was retrieved (see Note 2). STITCH is the
most comprehensive among these three databases since it contains
the drug targets from the DrugBank and TTD, as well as several
other sources.

2.2 Genome-Wide Transcriptional Expression Data

There are multiple large-scale biological databases available [11],
some of them dedicated to transcriptional expression data.
ArrayExpress [12] and the Gene Expression Omnibus (GEO) [13],
the latter organized by the National Center for Biotechnology
Information (NCBI), are the two main sources of publicly available
gene expression data. As of today, ArrayExpress contains 69,728
experiments performed using 2,201,027 different assays and 45.30 terabytes of
archived data. GEO contains 2,076,947 samples and 16,892 dif-
ferent platforms. In addition to ArrayExpress and NCBI GEO,
there are other gene expression databases, and the two we use are
highlighted below.
Drug Toxicity Signature Generation Center (DToxS) [14],
part of the Library of Integrated Network-Based Cellular Signa-
tures (LINCS) consortium of NIH, hosts the gene expression data
of cardiomyocytes (PromoCell) following exposure to drugs. Raw
sequencing data (Level 0), unique molecular identifier counts
(Level 1), fold change data (Level 2), and differentially expressed
genes (Level 3) can be downloaded. Depending on the context of
use, one can opt for the data at the optimal level to a specific need
(see Note 3).
The Connectivity Map (CMap) [15] contains the drug pertur-
bation signatures generated following chemical or drug treatment
of cancer cell lines, mostly HL60, MCF7, and PC3 and to a lesser
extent, ssMCF7, and SKMEL5. As of today, CMap has profiled
1309 drugs/compounds and includes more than 7000 gene
expression profiles using Affymetrix GeneChip Human Genome
U133A Array platform. The perturbed gene expression profiles are
given in the form of ranked profile, where the genes are ordered and
ranked by their relative gene expression changes by drug perturba-
tion compared to untreated controls (see Note 4).
Typical use of CMap involves merging of multiple profiles to a
representative profile for each drug since most of the compounds in
CMap have more than one experimental condition (different cells
and concentrations). These ranked expression profiles can be
merged using the Kruskal-Borda (Kru-Bor) strategy for each drug
[16]. Briefly, a distance metric using Spearman’s Footrule is com-
puted among all the ranked profiles. Then, the two closest profiles
are merged through a majority voting system and re-ranked into a
new merged profile. This merging procedure continues until a
single consensus profile is obtained.
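
A minimal sketch of this merging scheme, simplified relative to the published Kru-Bor procedure [16] (e.g., merged profiles are not weighted by the number of profiles they already contain), could look as follows, with each profile given as a list of gene identifiers ordered from most up- to most downregulated:

# Sketch of Kruskal-Borda (Kru-Bor) merging of ranked gene profiles.
# Simplified relative to the published procedure: profiles are lists
# of gene IDs ordered from most up- to most downregulated, and all
# profiles are assumed to contain the same gene set.
def footrule(p, q):
    """Spearman's Footrule: sum of absolute rank differences."""
    rank_q = {g: i for i, g in enumerate(q)}
    return sum(abs(i - rank_q[g]) for i, g in enumerate(p))

def borda_merge(p, q):
    """Merge two ranked profiles by summed (Borda) rank, then re-rank."""
    rank_p = {g: i for i, g in enumerate(p)}
    rank_q = {g: i for i, g in enumerate(q)}
    return sorted(p, key=lambda g: rank_p[g] + rank_q[g])

def kru_bor(profiles):
    """Iteratively merge the two closest profiles until one remains."""
    profiles = list(profiles)
    while len(profiles) > 1:
        # find the pair with the smallest Footrule distance
        i, j = min(
            ((a, b) for a in range(len(profiles))
             for b in range(a + 1, len(profiles))),
            key=lambda ab: footrule(profiles[ab[0]], profiles[ab[1]]),
        )
        merged = borda_merge(profiles[i], profiles[j])
        profiles = [pr for k, pr in enumerate(profiles) if k not in (i, j)]
        profiles.append(merged)
    return profiles[0]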

2.3 Protein Interactions Reflective of Biological Connectivity

Protein-protein interaction (PPI) networks facilitate the transition
from an agnostic correlation-based analysis to a mechanism-based
one by providing the necessary biological context. This can be done
via enrichment analysis, where the observed expression signatures
are compared with known pathways, or via pathway construction
analysis, where the most probable (de novo) pathway is recon-
structed from known interactions and the correlations among the
measured proteins.
Considerable efforts have been put into identifying and collect-
ing protein-protein interactions. Despite the central role of the "inter-
actome" in modern systems biology, no authoritative source is
available, and the data are stored in multiple databases. Some of
the most widely used databases include Reactome, BioGRID, DIP,
IntAct, MINT, and STRING (Table 1). These databases differ with

Table 1
Databases of protein interactions and transcription factors described in Subheading 2

Name | Nature of database | URL link

Protein interactions
Reactome | A curated pathway database | http://www.reactome.org/
BioGRID | A curated database of physical and genetic interactions | https://thebiogrid.org/
DIP | A curated database of experimentally determined interactions between proteins | http://dip.doe-mbi.ucla.edu/dip/Main.cgi
IntAct | Molecular interaction database with data from literature curation or submission by users | http://www.ebi.ac.uk/intact/
MINT | Experimentally verified protein-protein interactions that were curated by experts | http://mint.bio.uniroma2.it/
STRING | Functional protein interaction networks | http://string-db.org/

Transcription factors
DBD | A transcription factor prediction database | http://www.transcriptionfactor.org/
JASPAR | A transcription factor binding profile database | http://jaspar.binf.ku.dk/
AnimalTFDB | Curated transcription factors | http://www.bioguo.org/AnimalTFDB
Swiss-Prot | Curated transcription factors | http://web.expasy.org/docs/swiss-prot_guideline.html
TRANSFAC | Curated transcription factors | Commercially available (see Note 5)

respect to their policies for inclusion of specific interactions as well
as their preferred format for data storage. With respect to policy,
some databases require manual curation by an expert and/or exper-
imental evidence, while others may also include “predicted” inter-
actions based on text mining or other machine learning techniques.
The most common formats for storing interactions are SIF (simple
interaction format), BioPAX, and SBML.
Upon selecting a knowledge base or database and parsing the
data into an appropriate format, some extra “cleaning” steps are
usually required. These include (a) removing nonfunctional inter-
actions, (b) removing nodes that are uncontrollable (i.e., they
cannot be affected by a drug target) or unobservable (i.e., they do
not affect any node whose state can be observed experimen-
tally), and (c) applying confidence weights to reactions or nodes if
such relevant information is available. Finally, if the network is too
large and cannot be handled effectively by the optimizer, network
compression techniques such as the ones published previously [5]
can be useful.
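
As a minimal sketch of cleaning step (b) above (assuming a tab-delimited SIF file and hypothetical lists of drug-target nodes and experimentally observed transcription factors), the controllability and observability checks reduce to two reachability tests:

# Sketch: prune a prior knowledge network to controllable/observable nodes.
# Assumes a SIF file ("source<tab>interaction<tab>target") and hypothetical
# lists of drug-target nodes and experimentally observed TF nodes.
import networkx as nx

def load_sif(path):
    g = nx.DiGraph()
    with open(path) as fh:
        for line in fh:
            src, interaction, dst = line.rstrip("\n").split("\t")
            g.add_edge(src, dst, interaction=interaction)
    return g

def prune(g, drug_targets, observed_tfs):
    # controllable: reachable from at least one drug target
    controllable = set()
    for t in drug_targets:
        if t in g:
            controllable |= {t} | nx.descendants(g, t)
    # observable: can reach at least one observed transcription factor
    observable = set()
    for tf in observed_tfs:
        if tf in g:
            observable |= {tf} | nx.ancestors(g, tf)
    return g.subgraph(controllable & observable).copy()

# usage (file and node names are placeholders):
# g = load_sif("network.sif")
# g_clean = prune(g, ["EGFR"], ["TP53", "MYC"])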

For our published analysis of drug-induced lung injury [3],
Reactome, a curated pathway database, was used. In particular, we
used its "functional interactions," since the complete database
includes a lot of "meta" reactions. As an example of a "functional
interaction," the transition of a protein from the cytoplasm to
the nucleus is considered a reaction.

2.4 Transcription Factors

Though transcription factor (TF) databases are not as abundant as
PPI ones, several are available, as summarized in Table 1: DBD and
JASPAR include reviewed as well as predicted TFs, while more
conservative databases such as AnimalTFDB, Swiss-Prot, and
TRANSFAC host manually curated lists of TFs. Once the appropriate
database is selected and the data are parsed into the appropriate
format, the TF-gene interactions can be added to the prior knowledge
network as regular edges.

3 Methods

3.1 Identification of Novel Drug Candidates

For either identifying new drug candidates or repurposing an
approved drug for a new indication, the same computational frame-
work can be utilized. For a disease of interest, one needs to identify
the disease targets to be activated or to be inactivated for optimal
treatment benefits. For example, to treat anthrax, potential
approaches include blocking internalization of anthrax toxins,
antagonizing the actions of individual toxins, or activating the
pathways that can beneficially counteract the toxic effects of toxins.

3.1.1 Identifying the Differentially Expressed Genes Following Exposure to a Drug/Chemical

In order to identify which genes are differentially expressed
between the unstimulated state and exposure to a chemical, the
first step is to calculate the fold change by dividing each gene's
expression level from treated cells by that from the non-stimulated
cells. Given at least triplicate measurements, a reasonable approach
is to choose only those genes with a fold change greater than 2 and
a p-value less than 0.05, using a two-tailed two-sample unequal
variance Student's t-test with Benjamini-Hochberg correction [17].
After finalizing the list of the differentially expressed genes, one
might opt for the most significantly perturbed genes by applying a
cutoff, usually between 1% and 5%. For example, having a dataset of
1000 differentially expressed genes, if a 5% cutoff is used, then one
will get the 50 most upregulated genes and the 50 most down-
regulated genes. Alternatively, one can apply a cutoff of a specific
number. For instance, one can choose 250 to get the 250 most
upregulated genes and the 250 most downregulated genes. For the
dataset such as the CMap, all 22,283 Affymetrix probe IDs are
listed in order from the most upregulated to the most downregulated.
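
As a minimal sketch (assuming hypothetical NumPy matrices treated and control with genes as rows and replicate measurements as columns), this selection could be implemented as follows:

# Sketch: select differentially expressed genes (fold change >= 2,
# BH-adjusted p < 0.05) from hypothetical treated/control matrices
# with genes as rows and replicate measurements as columns.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def de_genes(treated, control, genes, fc_cutoff=2.0, alpha=0.05):
    fold_change = treated.mean(axis=1) / control.mean(axis=1)
    # two-tailed, two-sample, unequal-variance (Welch) t-test per gene
    _, pvals = stats.ttest_ind(treated, control, axis=1, equal_var=False)
    # Benjamini-Hochberg correction for multiple testing
    reject, padj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    # keep genes changed at least fc_cutoff-fold in either direction
    keep = reject & ((fold_change >= fc_cutoff) |
                     (fold_change <= 1.0 / fc_cutoff))
    return [(g, fc, p)
            for g, fc, p, k in zip(genes, fold_change, padj, keep) if k]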

3.1.2 Applying Optimization- or Enrichment-Based Approaches to Construct the Signaling Network (Mode of Action) of a Drug

A pathway construction approach similar to those published
previously [3, 18, 19] can be applied to construct the signaling
pathway of candidate drugs based on their transcriptomic profiles.
These computational methods receive as input the transcriptomic
profile of a candidate drug, prior knowledge of transcription
regulation, and protein connectivity, and they identify a small, easily
interpretable signaling network that most likely yields the observed
gene expression signature.
At the heart of these approaches lies the assumption that a
compound acting on the phosphoproteomic level (e.g., a kinase
inhibitor) will deregulate a number of pathways, and this deregula-
tion will propagate downstream and affect the activities of key
transcription factors that then affect the expression of their down-
stream target genes. By observing the differential gene expression
upon drug perturbation, one may be able to infer which TFs were
deregulated upstream and eventually the source of the deregulation
in the signaling network, i.e., the targets of the candidate
compound.
Below are the detailed guides for three different algorithms
with their own unique utility and requirements. The software
tools and algorithms are summarized in Table 2.

Integer Linear Programming

As published previously [3], we employed an integer linear pro-
gramming (ILP) formulation to model the mechanics of signal
transduction from one node to the next through the signaling
network, to the TFs all the way to the differentially expressed
genes, and identify a small subset of the signaling network that
best explains the observed transcriptomic signature. This

Table 2
List of computational algorithms and the needed optimizer for constructing a drug’s mode of action

Name | Nature of computational algorithm | URL link for tool or source code
Integer linear programming | Models the mechanics of signal transduction from one node to the next through the signaling network | http://pubs.rsc.org/en/content/articlelanding/ib/2015/C4ib00294f#!divAbstract
IBM ILOG CPLEX | A mathematical optimizer for ILP | http://www-03.ibm.com/software/products/en/ibmilogcpleoptistud
CellNOptR | Genetic algorithm to reconstruct signaling topologies | Python implementation: https://github.com/cellnopt/cellnopt; Cytoscape: http://www.cellnopt.org/cytocopter/
X2K | An enrichment-based approach to identifying upstream regulators of differentially expressed genes | http://www.maayanlab.net/X2K

subnetwork approximates the mode of action of the interrogated
compound.
An implementation of the ILP algorithm is available in Python
and can be run on the command line; the source code of ILP can be
found in the supplementary information of our previous publica-
tion [3]. Note it requires a working installation of IBM ILOG
CPLEX which is free of charge for academic use. After installing
all dependencies, the ILP formulation is run on a Unix command
line by invoking the ILP.py script (e.g., Python ILP.py). The fol-
lowing files need to be provided by the user in the same working
directory as ILP.py:
– Ilp_options.txt: Specifies options for ILP.py including cutoff
limits and input files.
– Inputs.txt: Tab-delimited file containing the nodes in the signal-
ing network perturbed by the interrogated compound (i.e.,
drug targets).
– Measurements.txt: Tab-delimited file containing the differen-
tially expressed genes upon compound perturbation and the
sign of the deregulation (1 for overexpression and −1 for
underexpression).
– Network_file: Defines the signaling network to be used as a
scaffold by the ILP formulation in .sif format.
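
For illustration, minimal input files (with hypothetical node and gene names; the exact layout should be checked against the published source code [3]) might look as follows. The formulation is then run as described above (e.g., Python ILP.py), and the resulting Network_out.dot can be rendered with GraphViz (e.g., dot -Tpdf Network_out.dot -o network.pdf).

# Inputs.txt (drug targets, one node per line; names hypothetical)
MAPK1
MAPK3

# Measurements.txt (gene <tab> sign of deregulation)
FOS	1
JUN	-1

# Network_file (.sif: source <tab> interaction <tab> target)
MAPK1	activates	FOS
MAPK3	inhibits	JUN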
We have already compiled a genome-wide signaling network
based on the Reactome functional interaction network (FIN) [20]
and prior knowledge of transcription regulation. The users may use
their own resources instead. The algorithm returns the following
files:
– Xs.txt: A list containing all nodes in the signaling network also
present in the solution, together with their corresponding acti-
vation value (1 for activation and −1 for inhibition).
– Us.txt: A list containing all reactions in the signaling network
also present in the solution, together with the corresponding
activation values.
– Network_out.dot: The optimized network structure in .dot
format ready to be parsed by GraphViz and visualized.
Optionally, a list of drug targets may be provided (if known) to
constrain the search close to these targets. We [21] recently published a Python
package facilitating the use of this algorithm. The algorithm returns
a subset of the prior knowledge signaling network that best explains
the observed transcriptomic profile and approximates the drug’s
mode of action, in .dot format that can easily be visualized using
GraphViz (http://www.graphviz.org/).

CellNOptR

In the work by Saez-Rodriguez et al. [18], the authors
employed a genetic algorithm to reconstruct signaling topologies
based on experimental data. Even though they haven’t addressed
the identification of drug mode of action based on transcriptomic
data yet, their method could be used in that capacity. CellNOptR is
available in Bioconductor, and there are also Python and Cytoscape
implementations available. The user will have to provide a list of
differentially expressed genes and also provide his/her own
resources of genome-wide signaling network and transcription reg-
ulation, and CellNOptR will return a small subset of the prior
knowledge network that best explains the data at hand. The genetic
algorithm is not as efficient as an ILP formulation; however, recent
advancements in CellNOptR now support an answer set program-
ming formulation that should be as efficient as an equivalent ILP
formulation.

X2K

For the X2K algorithm [19], an enrichment-based approach was
used to identify upstream regulators of differentially expressed
genes. X2K has integrated resources of prior knowledge of protein
connectivity and transcription regulation. The user has to provide a
list of differentially expressed genes, and X2K returns a list of
upstream kinases whose deregulation most probably yields the
observed transcriptomic profile (i.e., drug mode of action). X2K
is written in Java with no other dependencies (see Table 2 for the
resource to download the X2K algorithm).
There are a few conceptual differences between X2K and the
ILP formulation described above. X2K reconstructs the network
biology in an iterative manner by starting at the gene expression
level and building upstream to the transcription factor level and
eventually the proteomic level. With X2K, a signaling molecule is
included in the solution if and only if it has a significant number of
connections to other nodes in the solution. The ILP formulation,
on the other hand, will include nodes in the solution even if they do
not have numerous connections to the solution, as long as they help
construct a directed minimum spanning tree that brings together
most of the differentially expressed genes. X2K provides an intuitive
graphical user interface, where the user pastes a list of deregulated
genes and clicks “Start analysis” and the software returns a list of
transcription factors and kinases that best explain the observed
transcriptomic profile.

3.1.3 Applying ILP to Construct the Network of a Druggable Target for a Disease

As described above, the ILP algorithm requires a gene expression
signature and (optionally) a set of initially perturbed nodes. Drugs
offer an ideal application for this since most drugs have known
targets and their gene expression signature can be retrieved from
databases like CMap. However, other types of external perturba-
tion, such as infections, can be analyzed in the same way. In our
recent publication [21], we used the same technique to analyze the

modes of action of Bacillus anthracis (anthrax) infection. In partic-
ular, anthrax is known to cause cleavage of MAPKs and NLRP1 by
secreting its lethal toxin (LT), as well as to increase levels of cAMP
through its secreted edema toxin (ET). These proteins were used as
analogues to drug targets, and the gene expression signature of
anthrax was recovered from publicly available data.
This analysis yielded a signaling network indicative of anthrax’s
mode of action which could then be compared with the drug net-
works described above to identify novel drug candidates to treat the
infection (see Subheading 3.1.5 for more details on drug reposi-
tioning). In the case of anthrax, the immediate targets of the
infection, mainly cleavage of MAPKs and increased level of cAMP,
and the networks reflecting their modes of action can be the targets
for drug repositioning to alleviate the immediate consequences of
infection. In a similar manner, for any one disease of interest, once
the disease target is identified, one can adopt this approach to
constructing its disease biological network based on the data and
knowledge at hand (see Note 6).

3.1.4 Identification of Druggable Targets for a Disease

Gene Set Enrichment Analysis for Drug Effects (Therapeutic or Toxic)

Chemical-perturbed genes can be subjected to a functional enrich-
ment analysis that identifies overrepresented (or enriched)
biological functions or pathways, which are frequently defined as
Gene Ontology (GO) terms, pathways such as the Kyoto Encyclopedia
of Genes and Genomes (KEGG) [22] and Reactome [23], or other
previously defined gene signatures in the Molecular Signatures Data-
base (MSigDB) [24] (see Table 3 for a list of tools and databases).
Multiple statistical tests are employed for performing enrich-
ment analyses, including Fisher's exact test (or its modified
version), the hypergeometric test, and the binomial test. Fisher's
exact test is most widely used to examine the significance of the
association between the list of genes to be examined and the
annotations to be tested against. A 2 × 2 contingency table is
prepared for each annotation, as given in Table 4 below. In this
example, there are 300 genes in the user's gene set and a total of
25,000 genes in the whole genome, and a 2 × 2 contingency table
is generated for Pathway_A. Fisher's exact test on this table results
in a p-value < 2.2e−16, suggesting there is a statistically significant
overrepresentation of the genes belonging to Pathway_A among
the user's gene set.
Web-based tools include, but are not limited to, the Gene Set
Enrichment Analysis (GSEA) [25, 26], the Database for Annota-
tion, Visualization, and Integrated Discovery (DAVID) [27, 28],
ConceptGen [29], Enrichr [30], and Protein ANalysis THrough
Evolutionary Relationships (PANTHER) [31], as listed
in Table 3. A list of genes can be provided as input directly to these
web-based services. GSEA is also available in other integrated anal-
ysis platforms such as GenePattern, while DAVID provides applica-
tion programming interface (API) accessibility. There are also

Table 3
List of web-based tools for identifying druggable targets for a disease

Service | URL | Type
PHAROS | https://pharos.nih.gov/idg/index | Annotation
Gene Ontology (GO) | http://www.geneontology.org/ | Annotation
Kyoto Encyclopedia of Genes and Genomes (KEGG) | http://www.genome.jp/kegg/ | Annotation
Gene Set Enrichment Analysis (GSEA) | http://www.broadinstitute.org/gsea/ | Enrichment analysis
Database for Annotation, Visualization, and Integrated Discovery (DAVID) | http://david.abcc.ncifcrf.gov/ | Enrichment analysis
ConceptGen | http://conceptgen.ncibi.org | Enrichment analysis
Enrichr | http://amp.pharm.mssm.edu/Enrichr/ | Enrichment analysis
Protein ANalysis THrough Evolutionary Relationships (PANTHER) | http://pantherdb.org/ | Enrichment analysis
GenePattern | http://software.broadinstitute.org/cancer/software/genepattern/ | Integrated analysis
GAGE | https://bioconductor.org/packages/release/bioc/html/gage.html | R package
topGO | https://bioconductor.org/packages/release/bioc/html/topGO.html | R package
GSEABase | https://bioconductor.org/packages/release/bioc/html/GSEABase.html | R package
GOstats | https://bioconductor.org/packages/release/bioc/html/GOstats.html | R package
GOexpress | https://bioconductor.org/packages/release/bioc/html/GOexpress.html | R package
goseq | http://bioconductor.org/packages/release/bioc/html/goseq.html | R package

Table 4
The 2 × 2 contingency table

                 | User genes | Genome
In Pathway_A     | 30         | 170
Not in Pathway_A | 270        | 24,560
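
As a minimal sketch, Fisher's exact test on the counts of Table 4 can be reproduced with SciPy, testing for overrepresentation with a one-sided alternative:

# Sketch: Fisher's exact test on the Table 4 contingency counts.
from scipy.stats import fisher_exact

table = [[30, 170],      # in Pathway_A:  user genes, rest of genome
         [270, 24560]]   # not in Pathway_A
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(odds_ratio, p_value)  # a very small p-value indicates enrichment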

multiple R packages designed for functional enrichment analysis,
which include GAGE, topGO, GSEABase, GOstats, GOexpress,
and EnrichR. These R packages typically run with GO terms and
KEGG pathways as the gene annotation sets and accept gene lists as
input along with optional statistical significance information. Typi-
cally, the web-based tools such as DAVID provide user-friendly
access and thus are suitable for analyzing small number of gene
sets by non-bioinformaticians, while the R packages enable a seam-
less integration into existing high-throughput data such as micro-
array and RNA-Seq analysis pipelines.

Pathway-Based Analysis to Identify Potential Therapeutic Target(s)

Pathway-based analysis can be leveraged to identify driver mutation
candidates for cancer. Mutations are mapped onto a protein-protein
interaction (PPI) network, and their centrality in the network with
respect to each other, or with respect to key signaling molecules
known to be implicated in the disease mechanism, is calculated.
Mutated genes with high centrality may be promising drug targets,
considering their important roles in the network. Centrality mea-
sures have been used to prioritize disease genes based on protein
connectivity, for example, by the DIseAse MOdule detection
(DIAMOnD) algorithm [32], and to analyze biological networks [33].
The centrality of a node in a network represents the importance of
that node in the network and can be measured in many different
ways. The most commonly used centrality measures include degree,
eigenvector, closeness, and betweenness, each measuring a specific
type of importance [34]. Briefly, degree centrality relates to the
number of connected neighbors, where nodes with higher degrees
are considered more important. In eigenvector centrality, a node is
considered more important if it is connected to many "central"
nodes; thus, both the quantity and the quality of connections are
taken into account. Closeness centrality is defined here as the sum of
the lengths of the shortest paths to all other nodes in the network;
therefore, the smaller the closeness centrality, the more important
the node is in the network. In betweenness centrality, the importance
of a node is higher if it occurs on many shortest paths between other
nodes, serving as a bridge between them.
To calculate the centrality of genes in a network, several
approaches may be used. Multiple packages or libraries are available
for use in a programming environment, including igraph
(http://igraph.org/) for R, networkx (https://networkx.github.
io/) for Python, and Clairlib (http://www.clairlib.org) for Perl.
For example, the function all_shortest_paths in igraph returns all
shortest paths between two nodes in the network. By applying this
function, calculating all shortest paths from/to all mutated genes,
and keeping track of the number of shortest paths going through
these mutations, a measure of centrality is obtained. Moreover, the
function betweenness in igraph returns the same metric taking into
account all shortest paths. Those who are not familiar with
programming may use CentiScaPe [35], a centrality-analysis plugin
for the Cytoscape platform. Once various centrality measures are
obtained, the most central genes (e.g., the top 10 in each measure)
can be deemed promising targets, as sketched below.
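
As a minimal sketch (using a hypothetical PPI edge list and a hypothetical set of mutated genes; note that networkx's closeness_centrality follows the reciprocal convention, so larger values mean more central), the four measures can be computed as follows:

# Sketch: rank hypothetical mutated genes by network centrality.
import networkx as nx

# hypothetical undirected PPI network and mutated gene set
g = nx.Graph([("EGFR", "GRB2"), ("GRB2", "SOS1"), ("SOS1", "KRAS"),
              ("KRAS", "BRAF"), ("BRAF", "MAP2K1")])
mutated = {"KRAS", "BRAF"}

measures = {
    "degree": nx.degree_centrality(g),
    "eigenvector": nx.eigenvector_centrality(g, max_iter=1000),
    "closeness": nx.closeness_centrality(g),
    "betweenness": nx.betweenness_centrality(g),
}
for name, scores in measures.items():
    ranked = sorted(mutated, key=lambda n: scores[n], reverse=True)
    print(name, [(n, round(scores[n], 3)) for n in ranked])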

3.1.5 Computing and Prioritizing the Drug Candidates Using the Shortest Distance or Scoring Their Ability to Reverse the Disease Network for Drug Repurposing

Having calculated the modes of action of candidate drugs based on
their transcriptomic profiles, their fitness to a specific indication may
be prioritized and predicted by calculating the overlap of their
modes of action with known disease mechanisms. In the work by
Lamb et al. [36], the authors demonstrated that compounds which
reverse the transcriptomic profile of a disease may have a therapeu-
tic effect. Melas et al. [3] advanced this idea further and showed
that compounds whose modes of action disrupt the key signaling
pathways deregulated in a disease may also have a therapeutic effect
for treating that disease. Assuming the signaling pathways in a
disease are known (see Note 7), the overlap with a drug's mode of
action is quantified via Fisher's exact test on the included gene sets.
One approach is to take into account the directionality
of the signature [21]. In particular, instead of identifying the drug
whose mode of action has the highest overlap with the disease, one
identifies the drug that most closely resembles the "reversed"
disease network. The ILP algorithm produces a qualitative descrip-
tion of the signaling effects in which proteins assume discrete states,
so the "reversed" network is simply a network where the signs of the
nodes are flipped. In this context, the distance between two networks
is the Euclidean distance between the node signatures of the
two networks (drug and disease).
Other distance metrics are also an option. For example, the
Manhattan and Hamming distances were also considered as candi-
dates. Since the compared vectors are defined in the discrete space
of {−1, 0, 1}^N, where N is the number of nodes, different distance
metrics amount to different trade-offs between predicting opposite
effects (−1 vs 1) and predicting non-effects (−1 or 1 vs 0). The
misclassification costs (trade-offs) for the different metrics are listed
in Table 5.

Table 5
Misclassification costs (trade-offs) for the different distance metrics

Classification error | Euclidean | Manhattan | Hamming
−1 vs 1 | 4 | 2 | 1
−1 or 1 vs 0 | 1 | 1 | 1
In a recent work [21], we opted for the Euclidean distance in
order to maximize the pharmacological effect of a drug on the
network of the disease at hand, or to minimize the synergies between
drug and disease. With this metric, we computed the distance
between every drug and the reversed disease network to prioritize
the drugs. Then, we referenced the published pharmacological effects
of drugs to manually curate, for their therapeutic potential, the list of
the top 10 drug candidates that were closest to the reversed network
of the disease target. Among the top 10 drugs, two were reportedly
potent in protecting macrophages from the toxicities of anthrax toxins.
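
As a minimal sketch (with hypothetical signed node vectors over a common node set, encoded as −1/0/1 as produced by the ILP formulation), ranking drugs by Euclidean distance to the reversed disease network could look like this:

# Sketch: rank drugs by Euclidean distance between signed node vectors
# (-1/0/1 per node) and the reversed disease network.
import numpy as np

def rank_drugs(drug_signatures, disease_signature):
    """drug_signatures: dict name -> np.array; disease_signature: np.array."""
    reversed_disease = -disease_signature  # flip the node signs
    dists = {name: np.linalg.norm(sig - reversed_disease)
             for name, sig in drug_signatures.items()}
    return sorted(dists.items(), key=lambda kv: kv[1])

# hypothetical three-node example
disease = np.array([1, -1, 0])
drugs = {"drugA": np.array([-1, 1, 0]),   # close to the reversed network
         "drugB": np.array([1, 1, -1])}
print(rank_drugs(drugs, disease))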

3.2 Personalized Safety Risk by Identifying Drug-Induced Perturbation of HLA Genes in Individuals

Although the underlying mechanisms of treatment-emergent
adverse reactions are diverse and often not fully understood,
individuals' genomes play an important role in their suscepti-
bility to severe adverse reaction(s) associated with specific drugs.
For drug-induced Stevens-Johnson syndrome and toxic epider-
mal necrolysis (SJS/TEN), specific human leukocyte antigen (HLA)
alleles have been identified as biomarkers for such serious and
potentially fatal adverse drug reactions [37, 38]. HLA alleles
identified in genome-wide association studies (GWAS) for SJS/TEN
differ among ethnic groups [39]. These HLA alleles are good
references for risk assessment. GWAS are expensive, however, and
gene expression offers an alternative to gain insight into individual
patients' susceptibility. Taking carbamazepine as an example, several
of the most perturbed HLA genes were computationally identified
using the gene expression data from CMap, where the cancer cell
lines were derived from Caucasians (see Subheading 2.2) [2]. Among
them, HLA-DQB1*0201 is associated with carbamazepine-associated
SJS/TEN in Caucasians.
Briefly, a patient’s immune cells can be isolated from his/her
blood samples and then be exposed to a drug. The users can use
gene expression data produced from these cells to computationally
identify the HLA genes most significantly perturbed by the
drug [2]. One can computationally identify drug perturbation
signatures related to HLA genes by a simple frequency-based
approach to select the most frequently perturbed HLA genes
(either up- or downregulated) by a drug. In the future, it may be
possible for the HLA alleles identified to be compared to what has
been published by GWAS studies for a specific serious adverse
reaction to assess and mitigate the risk for the patient.
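
A minimal sketch of such a frequency-based selection (assuming hypothetical ranked profiles, one per perturbation experiment, each ordered from the most up- to the most downregulated gene) might be:

# Sketch: frequency-based selection of the most perturbed HLA genes.
# Assumes hypothetical ranked profiles (lists of gene symbols ordered
# from most up- to most downregulated) for repeated drug exposures.
from collections import Counter

def perturbed_hla(profiles, cutoff_frac=0.05):
    counts = Counter()
    for profile in profiles:
        k = max(1, int(len(profile) * cutoff_frac))
        extremes = profile[:k] + profile[-k:]  # most up- and downregulated
        counts.update(g for g in extremes if g.startswith("HLA-"))
    return counts.most_common()

# usage: perturbed_hla([profile1, profile2, ...]) returns HLA genes
# ranked by how often they appear among the most perturbed genes.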

4 Notes

1. When downloading drug targets from DrugBank, considering
the size of the XML file (over 500 MB when uncompressed), it
is highly recommended to use a programming language to
extract the list of drug targets from this file.
2. Since STITCH contains predicted interactions, a score above
the median value is recommended to ensure the level of
confidence and an adequate number of proteins. One should
also reference the literature for the context of use.
3. For the gene expression data published by DToxS, a reasonable
choice would be to begin with Level 2 data.
4. When using CMap data, the profile can be further processed to
collapse the profiles at the gene level, rather than the transcript
level on the Affymetrix array. Using the latest gene annotation
(e.g., the Entrez Gene database), those transcripts belonging
to the same gene could be merged either by arithmetic average
or median, and then the profiles need to be re-ranked. This will
help the downstream analysis to be more straightforward by
avoiding situations where transcripts belonging to the same
gene have completely different directionality (up- and down-
regulation).
the data less informative by ignoring transcript-specific expres-
sion information.
5. Among transcription factor databases, TRANSFAC is not
open-access but offers special prices for academic/non-profit
users.
6. Curating computed results for the context of biology and
pharmacology by referencing published reports and literature
is crucial to avoid garbage in and garbage out. Team work is
needed by collaborating scientists with expertise in bioinfor-
matics, pharmacology, biology, and disease.
7. For a disease in a specific organ, it is critical to reference a
database for the proteins and genes expressed in a specific
organ of interest. One such database is GTEx (https://com
monfund.nih.gov/GTEx/index).

Acknowledgments

The authors would like to acknowledge the ORISE Fellowships to
IM, TS, and DM via CDER's Critical Path and FDA's Medical
Countermeasures grants to JPFB.
Disclaimer. This article reflects the views of the authors and should
not be construed to represent FDA’s views or policies. The mention
of commercial products, their sources, or their use in connection
with material reported herein is not to be construed as either an
actual or implied endorsement of such products by the FDA.
Author Contributions. JPFB conceived and edited the manuscript;
IM, TS, DM, JH, and EG contributed to various sections of the
manuscript; LA commented on the manuscript.
References

1. Hur J, Guo AY, Loh WY, Feldman EL, Bai JP (2014) Integrated systems pharmacology analysis of clinical drug-induced peripheral neuropathy. CPT Pharmacometrics Syst Pharmacol 3:e114. https://doi.org/10.1038/psp.2014.11
2. Hur J, Zhao C, Bai JP (2015) Systems pharmacological analysis of drugs inducing Stevens-Johnson syndrome and toxic epidermal necrolysis. Chem Res Toxicol 28(5):927–934. https://doi.org/10.1021/tx5005248
3. Melas IN, Sakellaropoulos T, Iorio F, Alexopoulos LG, Loh WY, Lauffenburger DA, Saez-Rodriguez J, Bai JP (2015) Identification of drug-specific pathways based on gene expression data: application to drug induced lung injury. Integr Biol (Camb) 7(8):904–920. https://doi.org/10.1039/c4ib00294f
4. Hur J, Liu Z, Tong W, Laaksonen R, Bai JP (2014) Drug-induced rhabdomyolysis: from systems pharmacology analysis to biochemical flux. Chem Res Toxicol 27(3):421–432. https://doi.org/10.1021/tx400409c
5. Melas IN, Samaga R, Alexopoulos LG, Klamt S (2013) Detecting and removing inconsistencies between experimental data and signaling network topologies using integer linear programming on interaction graphs. PLoS Comput Biol 9(9):e1003204. https://doi.org/10.1371/journal.pcbi.1003204
6. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue):D668–D672. https://doi.org/10.1093/nar/gkj067
7. Ursu O, Holmes J, Knockel J, Bologa CG, Yang JJ, Mathias SL, Nelson SJ, Oprea TI (2017) DrugCentral: online drug compendium. Nucleic Acids Res 45(D1):D932–D939. https://doi.org/10.1093/nar/gkw993
8. Santos R, Ursu O, Gaulton A, Bento AP, Donadi RS, Bologa CG, Karlsson A, Al-Lazikani B, Hersey A, Oprea TI, Overington JP (2017) A comprehensive map of molecular drug targets. Nat Rev Drug Discov 16(1):19–34. https://doi.org/10.1038/nrd.2016.230
9. Yang H, Qin C, Li YH, Tao L, Zhou J, Yu CY, Xu F, Chen Z, Zhu F, Chen YZ (2016) Therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information. Nucleic Acids Res 44(D1):D1069–D1074. https://doi.org/10.1093/nar/gkv1230
10. Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M (2016) STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res 44(D1):D380–D384. https://doi.org/10.1093/nar/gkv1277
11. Zou D, Ma L, Yu J, Zhang Z (2015) Biological databases for human research. Genomics Proteomics Bioinformatics 13(1):55–63. https://doi.org/10.1016/j.gpb.2015.01.006
12. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, Kurbatova N, Lukk M, Malone J, Mani R, Pilicheva E, Rustici G, Sharma A, Williams E, Adamusiak T, Brandizi M, Sklyar N, Brazma A (2011) ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 39(Database issue):D1002–D1004. https://doi.org/10.1093/nar/gkq1040
13. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A (2013) NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res 41(Database issue):D991–D995. https://doi.org/10.1093/nar/gks1193
14. DToxS–drug toxicity signature generation center–Data & resources. https://martip03.u.hpc.mssm.edu/about.php. Accessed Feb 2017
15. Lamb J (2007) The connectivity map: a new tool for biomedical research. Nat Rev Cancer 7(1):54–60. https://doi.org/10.1038/nrc2044
16. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, Murino L, Tagliaferri R, Brunetti-Pierri N, Isacchi A, di Bernardo D (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S A 107(33):14621–14626. https://doi.org/10.1073/pnas.1000138107
17. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 57(1):289–300
18. Terfve C, Cokelaer T, Henriques D, MacNamara A, Goncalves E, Morris MK, van Iersel M, Lauffenburger DA, Saez-Rodriguez J (2012) CellNOptR: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms. BMC Syst Biol 6:133. https://doi.org/10.1186/1752-0509-6-133
19. Chen EY, Xu H, Gordonov S, Lim MP, Perkins MH, Ma'ayan A (2012) Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers. Bioinformatics 28(1):105–111. https://doi.org/10.1093/bioinformatics/btr625
20. Wu G, Feng X, Stein L (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biol 11(5):R53. https://doi.org/10.1186/gb-2010-11-5-r53
21. Bai JP, Sakellaropoulos T, Alexopoulos LG (2017) A biologically-based computational approach to drug repurposing for anthrax infection. Toxins 9(3):E99
22. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30
23. Vastrik I, D'Eustachio P, Schmidt E, Joshi-Tope G, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol 8(3):R39. https://doi.org/10.1186/gb-2007-8-3-r39
24. Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P (2015) The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst 1(6):417–425. https://doi.org/10.1016/j.cels.2015.12.004
25. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550. https://doi.org/10.1073/pnas.0506580102
26. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34(3):267–273. https://doi.org/10.1038/ng1180
27. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA (2003) DAVID: database for annotation, visualization, and integrated discovery. Genome Biol 4(5):3
28. Huang da W, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57. https://doi.org/10.1038/nprot.2008.211
29. Sartor MA, Mahavisno V, Keshamouni VG, Cavalcoli J, Wright Z, Karnovsky A, Kuick R, Jagadish HV, Mirel B, Weymouth T, Athey B, Omenn GS (2010) ConceptGen: a gene set enrichment and gene set relation mapping tool. Bioinformatics 26(4):456–463. https://doi.org/10.1093/bioinformatics/btp683
30. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, Clark NR, Ma'ayan A (2013) Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14:128. https://doi.org/10.1186/1471-2105-14-128
31. Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, Thomas PD (2017) PANTHER version 11: expanded annotation data from gene ontology and reactome pathways, and data analysis tool enhancements. Nucleic Acids Res 45(D1):D183–D189. https://doi.org/10.1093/nar/gkw1138
32. Ghiassian SD, Menche J, Barabasi AL (2015) A DIseAse MOdule detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS Comput Biol 11(4):e1004120. https://doi.org/10.1371/journal.pcbi.1004120
33. Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761–764. https://doi.org/10.1038/nature09182
34. Newman M (2010) Networks: an introduction. OUP, Oxford
35. Scardoni G, Petterlini M, Laudanna C (2009) Analyzing biological network parameters with CentiScaPe. Bioinformatics 25(21):2857–2859. https://doi.org/10.1093/bioinformatics/btp517
36. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935. https://doi.org/10.1126/science.1132939
37. Chen P, Lin JJ, Lu CS, Ong CT, Hsieh PF, Yang CC, Tai CT, Wu SL, Lu CH, Hsu YC, Yu HY, Ro LS, Lu CT, Chu CC, Tsai JJ, Su YH, Lan SH, Sung SF, Lin SY, Chuang HP, Huang LC, Chen YJ, Tsai PJ, Liao HT, Lin YH, Chen CH, Chung WH, Hung SI, Wu JY, Chang CF, Chen L, Chen YT, Shen CY, Taiwan SJSC (2011) Carbamazepine-induced toxic effects and HLA-B*1502 screening in Taiwan. N Engl J Med 364(12):1126–1133. https://doi.org/10.1056/NEJMoa1009717
38. Pavlos R, Mallal S, Phillips E (2012) HLA and pharmacogenetics of drug hypersensitivity. Pharmacogenomics 13(11):1285–1306. https://doi.org/10.2217/pgs.12.108
39. Daly AK, Donaldson PT, Bhatnagar P, Shen Y, Pe'er I, Floratos A, Daly MJ, Goldstein DB, John S, Nelson MR, Graham J, Park BK, Dillon JF, Bernal W, Cordell HJ, Pirmohamed M, Aithal GP, Day CP, Study D, International SAEC (2009) HLA-B*5701 genotype is a major determinant of drug-induced liver injury due to flucloxacillin. Nat Genet 41(7):816–819. https://doi.org/10.1038/ng.379
Chapter 11

Omics Data Integration and Analysis for Systems Pharmacology

Hansaim Lim and Lei Xie

Abstract
Systems pharmacology aims to understand drug actions on a multi-scale from atomic details of drug-target
interactions to emergent properties of biological network and rationally design drugs targeting an inter-
acting network instead of a single gene. Multifaceted data-driven studies, including machine learning-based
predictions, play a key role in systems pharmacology. In such works, the integration of multiple omics data is
the key initial step, followed by optimization and prediction. Here, we describe the overall procedures for
drug-target association prediction using REMAP, a large-scale off-target prediction tool. The method
introduced here can be applied to other relation inference problems in systems pharmacology.

Key words Big data, Machine learning, Collaborative filtering, Drug repurposing, Off-target
identification

1 Introduction

The conventional procedure for drug discovery starts with finding
a small molecule that can effectively act on the intended target,
which is often a single target that is believed to be one of the causal
components in a disease process. Although it has been the domi-
nant drug discovery paradigm for several decades, it has several
challenges and limitations. Many complex diseases are multifaceted
and multigenic. Targeting a single gene may not be adequate to
modulate the whole disease process. Moreover, drug actions are
often the result of a series of complex biological processes involving
drug-target interactions and their modulation of biological net-
works. A drug often interacts with not only its intended target
but also other unexpected targets, referred to as off-targets. Unex-
pected off-target binding may lead to the possibility of deadly side
effects (or adverse drug reactions), which have rarely been noticed
during the early stages of the procedures [1–3]. For instance, a
cholesteryl ester transfer protein inhibitor, torcetrapib, was found
to cause adverse cardiovascular events, with many deaths, due to its
off-target activities and was withdrawn during a phase 3 clinical trial
[4, 5], and a fatty acid amide hydrolase inhibitor resulted in at least
six casualties in a phase 1 clinical trial [6]. In addition, the off-target
binding may be essential for the therapeutic effect on treating
multigenic complex diseases. For example, the anticancer effect of
HIV protease inhibitors may come from their binding to multiple
human kinases [7]. Furthermore, the identification of the drug
off-targets across the human genome will offer new opportunities
in drug repurposing, a process that uses existing drugs for new
clinical indications [8, 9]. Unfortunately, we have limited under-
standing of proteome-wide drug-target interactions since only a
small number of all possible targets are examined for each drug. In
addition, the tested drug molecules are only a part of the whole
chemical space, which further increases the sparseness of our
knowledge in drug-target associations. Thus, the conventional
drug discovery paradigm has become less successful in tackling
complex diseases, and there is an urgent need for a new paradigm
that allows systematic and large-scale drug-target screening at an
early stage of drug development.
One possible new paradigm is data-driven molecular screening
from the early stages of drug development [10]. Biological and
clinical data are enormous, heterogeneous, dynamic, and noisy, that
is, a huge amount of data is rapidly changing across diverse domains
and contains intrinsic noise from biology as well as experimental
errors [11]. Pharmacological data also shares these properties,
making it difficult to collect and manage reliable data and requiring
systematic approaches to analyze thousands of drugs against many
partners of interest, including protein targets and adverse drug
effects. The continuously increasing amount of health-related
data makes the development and application of such methods
more interesting and promising. A new field, called systems phar-
macology, aims to design therapeutics to target gene interaction
networks instead of a single gene. The development of systems
pharmacology will benefit from harnessing machine learning and
statistical techniques to discover new treatment strategies from big
data [12]. Various types of computational methods have been
developed to systematically analyze large amounts of pharmacolog-
ical data [13].
Many computational tools have been developed to harness big
data for systems pharmacology. These techniques range from infer-
ring drug-target-disease side effect associations to stoichiometric
modeling of genome-scale metabolic network and dynamic model-
ing of signaling and regulatory network to physiological-based
pharmacokinetics modeling [14]. Due to the limited space, here
we will focus on computational methods that represent omics data
as a bipartite graph (e.g., linking drugs and targets) and infer new
biological relations (e.g., drug-target association). Several machine
learning-based techniques have shown their potential to predict
binary relations, particularly drug-target associations. Regarding
chemical substructures of drugs as features, Koutsoukas et al.
showed that drug-target associations can be predicted based on
the Gaussian kernel similarity of features [15]. Gaussian interaction
profile (GIP) also utilizes Gaussian kernel to measure similarity
between drug-target interaction profiles [16], and GIP was later
extended by applying weighted nearest neighbor to predict drug-
target interactions for drugs without a known target [17]. Kerne-
lized Bayesian matrix factorization with twin kernels (KBMF2K)
successfully combined two Gaussian kernels (drug-side and target-
side kernels) into matrix factorization algorithms to predict drug-
target associations [18]. The use of such machine learning
approaches is also expanding as collections of large-scale,
multi-domain biological data sets, such as the Harmonizome [19],
become available. Here we will focus on the prediction of drug-target
associations using REMAP, a large-scale collaborative filtering
approach to predicting off-target associations [20]. Collaborative
filtering is a technique that predicts one's future responses or
preferences based on the decision histories of a large group to which
one belongs; it is frequently used in recommender systems on
commercial platforms [21, 22]. To predict not only off-targets but
novel targets as well, we also developed COSINE [23]. This chapter
is divided into two parts: preprocessing the data into a network
model with compatible formats, and running the two optimization
algorithms to make predictions.
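To make the collaborative filtering idea concrete, the following Python fragment is a minimal sketch (not the authors' implementation) of one-class matrix factorization, the core operation behind such predictions: a binary drug-target matrix R is approximated by the product of two low-rank factor matrices, and the reconstructed entries serve as prediction scores for unobserved pairs. All function names and parameter values here are illustrative only.

import numpy as np

def factorize(R, rank=5, lr=0.01, reg=0.1, epochs=200, seed=0):
    """Approximate a binary association matrix R (drugs x targets) as U @ V.T."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(epochs):
        E = R - U @ V.T                  # reconstruction error on all entries
        U += lr * (E @ V - reg * U)      # gradient step for the drug factors
        V += lr * (E.T @ U - reg * V)    # gradient step for the target factors
    return U @ V.T                       # dense score matrix P

# Toy example: 4 drugs x 3 targets with two known associations.
R = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
print(np.round(factorize(R), 2))  # unknown entries receive continuous scores

Unknown pairs that share latent structure with the known associations receive higher reconstructed scores, which is exactly the intuition exploited when recommending off-targets.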

2 Materials

REMAP is available online at https://omictools.com/remap-tool.


Clicking the REMAP icon on the title opens the GitHub repository
of REMAP, where readers can find all relevant data and code
used in this chapter. Please download and decompress the reposi-
tory. Throughout this chapter, we will assume that the decom-
pressed repository is under “/home/user/REMAP” directory.
ZINC database [24] contains chemical-protein associations based
on direct binding assays at 10 μM. The processed data set is avail-
able in the repository, under “/home/user/REMAP/benchmark/
ZINC_raw” directory. Under the “ZINC_raw” directory,
“ZINC_DTI.tsv” contains active chemical-protein associations
(drug-target interactions) from ZINC database, and “ZINC_-
chemicals.tsv” and “ZINC_proteins.fas” contain unique chemicals
and proteins with chemical structure and protein sequence infor-
mation, respectively. To apply collaborative filtering techniques, we
need to convert the data files into three matrices, chemical-chemical
similarity matrix, protein-protein similarity matrix, and chemical-
protein association matrix. The steps to convert these files into the
appropriate matrix format are given in Subheading 3.
To complete this chapter, we will use the Python and MathWorks
Matlab programming languages. Matlab 2008b or higher is
required, and Python 2.7 or higher in the Unix environment
(e.g., Ubuntu) is recommended. Python 3 is required for the
multicore script (see Note 5). We use ChemAxon JChem software
[25] to calculate chemical structure similarity and NCBI BLAST
[26] to calculate protein sequence similarity. All software tools used
here are cross-platform (i.e., they are available in Windows, Mac,
and Linux). In Subheading 3, we provide the commands for each
step in a Linux environment so that readers can follow the steps
without a graphical user interface (GUI), since working in a
non-GUI environment is often necessary to maximize the compu-
tational resources available for the research processes. One example
environment is a (virtual) machine running on Ubuntu with
Matlab, JChem, and BLAST installed. Python 2 and 3 come preinstalled
on Ubuntu 14.04 or higher versions. To set up the environment,
please follow the installation instructions for each software tool.
For a GUI environment, please follow the steps in the relevant user
manuals for the tools.

3 Methods

3.1 Overview

Data preprocessing is a significant step in data-driven research,
including the machine learning-based polypharmacology that we
focus on here. So far, we have the text files for the matrices (Sub-
heading 2). Assuming that we have m unique chemicals and
n unique proteins, the chemical-chemical similarity information
can be represented as an m × m matrix D, where Di,j is the
similarity score between the ith and the jth drugs, and the n × n
protein-protein similarity matrix T can be similarly defined. All entries
of the matrices D and T are between 0 and 1, where 0 is for no
measurable similarity and 1 is for identity. The chemical-protein
association file can be regarded as an undirected bipartite graph—a
network where the only allowed connections are from one set
(drugs) to another set (targets)—between drugs and targets with
binary edges (Fig. 1). Therefore, the chemical-protein associations
may be represented as an m × n matrix R, where Ri,j = 1 if the
ith drug is known to be associated with the jth target protein. As in
Fig. 1, the known connections are often sparse, and our purpose is
to infer some unknown associations based on the two types of
similarity and the known associations. The first part of Subheading
3 is to guide readers to process raw data from the ZINC database
into the input matrices for our methods of choice: REMAP and
COSINE. The two methods take the known chemical-protein
associations, chemical-chemical similarity scores, and protein-

Fig. 1 Drug-target network and its matrix representation. (a) Drug-target associations can be represented as a
network containing three types of connections. A bipartite graph connects drugs and targets. The solid pink
lines and the blue dashed lines represent the known drug-target associations and the unknown off-targets,
respectively. The sky-blue lines and the green lines represent chemical-chemical similarity and protein-
protein similarity networks, respectively. For conciseness, only six drugs and five targets are drawn here. (b)
The matrix representation of the network in (a). The matrices D, T, and R represent the drug-drug similarity, target-
target similarity, and known drug-target association networks, respectively. Color-filled entries of the matrices
represent the connections in (a). The values in D and T are nonnegative numbers with a maximum of 1 for
identity. The values in R are either 0 or 1, where 1 represents a known association. Note that the
entries for the unknown off-targets in the matrix R (blue-filled) start at 0, and the goal is to predict these
unknown associations using the known associations and the two similarity matrices

protein similarity scores as matrices. We start from the list of known


associations and the lists of unique chemicals and proteins. Our
goal here is to create three matrices that represent the inputs. Since
there are 12,384 unique chemicals and 3500 unique proteins in the
ZINC data set, chemical-protein association matrix size must be
12,384 by 3500, chemical-chemical similarity matrix size must be
12,384 by 12,384, and protein-protein similarity matrix size must
be 3500 by 3500. For more details, refer to Lim et al. [20].
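As an aside, the following minimal Python sketch (an illustration only; the protocol itself builds these matrices in Matlab in the steps below) shows the shapes involved when the three inputs are held as sparse matrices. The example entries are toy values.

import numpy as np
from scipy import sparse

m, n = 12384, 3500  # unique chemicals and proteins in the ZINC data set

# Known chemical-protein association matrix R (binary, m x n); toy triples here.
rows, cols = np.array([0, 2]), np.array([1, 0])
R = sparse.coo_matrix((np.ones(2), (rows, cols)), shape=(m, n)).tocsr()

# Chemical-chemical similarity D (m x m) and protein-protein similarity T (n x n)
# start as identity (self-similarity = 1) and are filled from the similarity
# files produced in Subheading 3.3.
D = sparse.identity(m, format="lil")
T = sparse.identity(n, format="lil")
D[0, 2] = D[2, 0] = 0.8  # an illustrative Tanimoto similarity entry
print(R.shape, D.shape, T.shape)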

3.2 Command-Line Notations

We will use both bash commands and Matlab commands. To
minimize confusion, the commands are in Courier New font
with gray shade. Bash commands start with a dollar sign, and
Matlab commands start with two greater-than signs. We will not
use multiple commands in one sentence. In cases where a one-line
command is written on multiple lines of the page, we will specify
that the command is a one-line command. For example, $ echo
This is a bash command! means to type "echo This is a bash
command!" on a terminal screen, and >> disp('This is a Matlab
command!'); disp('Big Data in Drug Discovery'); means to
type "disp('This is a Matlab command!'); disp('Big Data in Drug
Discovery');" on the Matlab command line. A long Matlab command
can be separated at each semicolon. The Matlab command above
can be either a one-line command or two lines, such as >> disp
('This is a Matlab command!'); followed by >> disp('Big Data
in Drug Discovery'); without changing the results. The leading
dollar signs and the leading greater-than signs are not to be typed by
the readers. We expect that the readers are familiar with basic
bash commands, such as changing directories, viewing and editing
text files, and moving and copying files (see Note 1).

3.3 Data Preprocessing

1. To calculate protein-protein similarity based on sequence similarity,
we first build a BLAST database from the unique proteins in the
ZINC data set. Please refer to the online documentation for
installation instructions [27] (see Note 2). To build the BLAST DB,
change directory to where the protein FASTA file is and run the
following command on the terminal.

$ makeblastdb -in ZINC_proteins.fas -dbtype prot -out ZINC

This generates a BLAST database named “ZINC” that contains


the 3,500 proteins from the “ZINC_proteins.fas” file.
2. Next, we will query all 3,500 proteins against the database just
created. Run (one-line command) on the terminal:

$ blastp -query ZINC_proteins.fas -db ZINC -evalue 1e-5


-outfmt 6 > ZINC_blast_result.dat

This performs a sequence-based homology search for the 3,500 proteins
with an e-value cutoff of 1e-5. The output format will be concise
(tabular), showing one pair of proteins per line with the bit score in
the last column. The compressed version of the
BLAST result file is available under “list” directory. To get


more detailed output, omit “-outfmt 6” from the command
above. To search for stricter homology, decrease e-value cutoff
from the command, for example, “-evalue 1e-10” instead of
“-evalue 1e-5” with the rest unchanged (see Note 3).
3. Next, we need to calculate similarity scores for protein pairs and
convert the protein IDs to protein indices. To do this, move to
“script” directory and run

$ python blast2sim.py > ../list/prot_prot_sim_Idx.csv

The result file should be in the “list” directory. The first two
columns are the protein indices, and the third column is the
sequence-based similarity score for the protein pair. The
protein-protein matrix may not be symmetric. In other
words, the similarity score from query protein 1 against target
protein 2 may be different from that of query protein 2 against
target protein 1. However, the self-query similarity score is
always 1. The output file does not contain self-query scores as
they can be easily added in Matlab.
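The repository's blast2sim.py performs this conversion; since its internals are not reproduced here, the following Python fragment is only a hedged sketch of one common normalization: dividing each pair's best bit score by the query protein's self-hit bit score, which yields scores in [0, 1] with self-similarity 1. The id2idx mapping (protein ID to integer index) and the file layout are assumptions of this sketch.

import csv
from collections import defaultdict

def blast_to_sim(blast_file, id2idx):
    best = defaultdict(float)             # (query, subject) -> best bit score
    with open(blast_file) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            q, s, bit = row[0], row[1], float(row[11])   # -outfmt 6 columns
            best[(q, s)] = max(best[(q, s)], bit)
    for (q, s), bit in best.items():
        if q == s:
            continue                      # self-hits are added later in Matlab
        self_bit = best.get((q, q))
        if self_bit:                      # normalize by the query's self-hit
            print(f"{id2idx[q]},{id2idx[s]},{min(bit / self_bit, 1.0):.4f}")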
4. Comma-separated files can easily be read in Matlab to generate
a numerical matrix. Thus, we generate the chemical-chemical
similarity file in the same format as protein-protein similarity.
Under “list” directory, run

$ cut -f4 ZINC_chemicals.tsv | tail -n +2 > ZINC_chemical_


structures.txt

The resulting file should contain only the SMILES string for
each chemical, without the header line. This file is the input for
ChemAxon JChem software (see Note 4).
5. Assuming ChemAxon JChem is installed under “/home/
user/” directory, run

$ /home/user/ChemAxon/JChem/bin/screenmd ZINC_chemical_
structures.txt ZINC_chemical_structures.txt -k ECFP -g -c
-M Tanimoto > ZINC_tanimoto.dat

See Note 5 for more information. This process will take some
time (approximately 9 min on a 3.40 GHz quad-core desktop
machine). The output file is a matrix of Tanimoto distances
between chemicals based on their 2D structures. Open the
output file by running

$ less -S ZINC_tanimoto.dat
on the terminal and confirm that the diagonal entries are zero
since the scores are distances, not similarities.
6. We need to convert the distances to similarity scores by sub-
tracting each distance from 1. Also, as stated in Lim et al. [20],
chemical-chemical similarity scores lower than 0.5 are filtered
out. Under the "script" directory, run

$ python tani2sim.py > ../list/chem_chem_sim_Idx_05.csv

(see Note 6). The output file format is the same as “prot_prot_-
sim_Idx.csv” file, but the indices are a chemical index instead of
a protein index.
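A hedged Python sketch of this distance-to-similarity conversion is shown below (the repository's tani2sim.py may differ in details). It assumes the screenmd output is a whitespace-separated distance matrix with a leading label column, converts each Tanimoto distance d to a similarity 1 - d, and keeps only upper-triangle pairs at or above the cutoff.

CUTOFF = 0.5

def tanimoto_to_sim(matrix_file):
    with open(matrix_file) as fh:
        for i, line in enumerate(fh, start=1):            # 1-based chemical indices
            dists = [float(x) for x in line.split()[1:]]  # skip the label column
            for j, d in enumerate(dists, start=1):
                sim = 1.0 - d
                if i < j and sim >= CUTOFF:               # upper triangle only
                    print(f"{i},{j},{sim:.4f}")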
7. Finally, we need to convert the text-based drug-target associa-
tions into index-based associations. Under “script” directory,
run

$ python dti2index.py >../list/chem_prot_Idx.csv

The output file contains only two columns: chemical index and
protein index. There is no need to output the third column as
the chemical-protein association matrix is binary.
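For illustration, a hedged Python sketch of such an ID-to-index conversion follows (the actual dti2index.py may differ). Each entity ID is mapped to a 1-based index in order of first appearance, which is convenient for Matlab's 1-based indexing.

def load_index(path, column=0):
    idx = {}
    with open(path) as fh:
        next(fh)                               # assume a header line
        for line in fh:
            key = line.rstrip("\n").split("\t")[column]
            idx.setdefault(key, len(idx) + 1)  # 1-based indices
    return idx

# Each (chemical ID, protein ID) pair from the association file would then be
# printed as "chem_idx[c],prot_idx[p]" to produce chem_prot_Idx.csv.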
8. Next, we need to create Matlab-readable matrices from the
comma-separated files. Open Matlab and move to the “list”
directory. Run

>> cline=csvread(’./chem_chem_sim_Idx_05.csv’); chem_


chem=sparse(cline(:,1),cline(:,2),cline(:,3),12384,
12384); chem_chem=chem_chem+chem_chem’+speye(12384);

(one-line or separated at each semicolon to three-line com-


mands) on the Matlab command window (see Note 7). The
“chem_chem” variable is a sparse, symmetric matrix containing
the Tanimoto similarity scores for chemicals (i.e., the matrix
D in Fig. 1). Note that we make it symmetric by adding the
transposition matrix to itself. A sparse identity matrix of the
same size is added to make self-similarity equal to 1.
9. To read protein-protein similarity scores and create the similar-
ity matrix, run

>> pline=csvread(’./prot_prot_sim_Idx.csv’); prot_prot=


sparse(pline(:,1),pline(:,2),pline(:,3),3500,3500);
prot_prot=prot_prot+speye(3500);

(one- or three-line commands) on the Matlab command win-


dow. The “prot_prot” variable is the protein-protein similarity
matrix (i.e., the matrix T in Fig. 1). As noted above, the
protein-protein similarity matrix may not be symmetric, but


self-similarity is equal to 1 (see Note 8).
10. To read and create a chemical-protein association matrix, run

>> cpline=csvread(’./chem_prot_Idx.csv’); chem_prot=


sparse(cpline(:,1),cpline(:,2),1,12384,3500);

(one- or two-line commands) on the Matlab command win-


dow. The “chem_prot” variable is the known chemical-protein
association matrix (i.e., the matrix R in Fig. 1).
11. Now we have all matrices needed to run REMAP and/or
COSINE. Save the matrices into Matlab matrix files for later
uses. For example, on the Matlab command window, run

>> save('/home/user/REMAP/chem_prot_zinc.mat','chem_prot');

to save the chemical-protein association matrix into “chem_-


prot_zinc.mat” file under “/home/user/REMAP” directory
(see Note 9). The three matrices are available under “BiDD/
matrix” directory in the GitHub repository.
So far, we created the three inputs from the ZINC data set:
chemical-protein association matrix, chemical-chemical struc-
ture similarity matrix, and the protein-protein sequence simi-
larity matrix. Since there are tunable parameters for REMAP
and COSINE that may affect the predictive performance, we
will optimize the parameters by a tenfold cross-validation and
grid search strategy. Briefly, approximately 10% of the chemicals
are tested after training the algorithms with the remaining 90%.
Then the area under the receiver operating characteristic curve
(AUC or AUROC) is averaged over the ten folds. Such tenfold
cross-validation is repeated while changing one parameter at a
time through a range of values for each parameter.
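A minimal Python sketch of this cross-validation scheme is given below (using scikit-learn utilities; this is an illustration, not the repository code). The targets of each held-out fold of chemicals are hidden during training, and the AUROC is averaged over the ten folds; train_and_score stands for any function that maps a training matrix to a full score matrix, such as the factorization sketch earlier in this chapter.

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def cv_auc(R, train_and_score, n_splits=10, seed=0):
    aucs = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(R):
        R_train = R.copy()
        R_train[test_idx, :] = 0          # hide the test chemicals' targets
        P = train_and_score(R_train)
        # assumes each fold contains both positive and negative entries
        aucs.append(roc_auc_score(R[test_idx].ravel(), P[test_idx].ravel()))
    return float(np.mean(aucs))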

3.4 Parameter Optimization and Prediction

1. Open the Matlab command window, and change directory to
where the matrices are stored. Then, load each of the three
matrices using the "load" command. For example, on the Matlab
command window, run

>> load(‘/home/user/REMAP/BiDD/matrix/chem_prot_zinc.
mat’);

to load chemical-protein association matrix stored in the


“chem_prot_zinc.mat” file.
2. Once the three matrices are loaded, check the names of loaded
variables. The command

>> who

shows the names of the loaded matrices. These names are the
variable names that were used to save the matrices into files.
The three names should appear as "chem_prot_zinc, chem_chem_zinc,
prot_prot_zinc".
3. Run

>> addpath(‘/home/user/REMAP/matlab_codes/’);

to let Matlab search for the specified directory. To run the


parameter optimization code for REMAP, run

>> REMAP_optimization(chem_prot_zinc,chem_chem_zinc,
prot_prot_zinc)

(see Note 10). Once the grid search is done, the best para-
meters will be shown on the Matlab command window. For
detailed performance comparisons, open the output file out-
side the Matlab command window. The output file should be
located under “BiDD” directory (one level higher than the
current working directory). Based on the grid search, the best
parameters were rank = 100, p6 = 0.99, and p7 = 0.33, and
they yielded an average AUC of 0.9661. Using these para-
meters, we can continue to get the prediction scores for each
chemical-protein pair.
4. Now that we have the optimal parameters, the actual prediction
step is simply to use all known associations to get the score
matrix P. In other words, we do not divide our data into
training and test sets. We will use “REMAP.m” script, which
takes the same inputs and the user-defined parameters. First, we
define parameters. On Matlab command window, run

>> para=[0.1,0.1,0.01,100,100,0.99,0.33];

followed by

>> REMAP(chem_prot_zinc,chem_chem_zinc,prot_prot_zinc,’/
home/user/REMAP/REMAP_prediction.txt’,’false’,0.7,para);

(one-line command). This script runs REMAP using the user-


defined parameters, removes the known associations from the
prediction matrix, and outputs the predicted pairs having pre-
dicted scores greater than the cutoff value, which we set at 0.7
in our command above. For the score normalization option, see Note 11.
5. The output file should contain three columns, the chemical
index, the protein index, and the predicted score for one pre-
dicted pair in each line. We need to convert the indices into the
chemical and protein IDs. On the terminal screen, run

$ python index2dti.py > /home/user/REMAP/BiDD/results/RE-


MAP_prediction_IDs.csv

(one-line command). The output file contains human-readable


forms of chemical and protein IDs, instead of numerical
indices. Adjust the cutoff score, and repeat from step 4 above
to include more (or fewer) predicted pairs. This concludes the
procedures to predict drug-target pairs using REMAP and
COSINE (see Note 12). The predicted pairs can be structurally
analyzed (e.g., molecular docking analysis) or experimentally
validated (e.g., in vitro binding assays).

4 Notes

1. Being familiar with some basic bash commands is helpful for


following directions in this chapter. Specifically, checking the
current working directory using $ pwd and changing the work-
ing directory using $ cd are essential. While viewing and edit-
ing text files may be done using GUI (e.g., double click to open
files), using $ less and one of the common text editors (e.g.,
nano, gedit, or vim) is recommended. The commands $ cp and
$ mv copy and move files, respectively.
2. The installed BLAST package should be included in the envi-
ronment path variable. Assuming the BLAST package is
installed under “/home/user/blast/ncbi-blast-2.6.0+” direc-
tory, this can be done by adding “export PATH ¼ "$PATH:/
home/user/blast/ncbi-blast-2.6.0+/bin/"” (one line) at the
end of the “.profile” file (note the leading dot for a hidden file),
which is under the user’s home directory. Once the line is
added, save the file and run $ source .profile to have the
changes take effect. Running $ which makeblastdb should
print out “/home/user/blast/ncbi-blast-2.6.0+/bin/make-
blastdb” on the terminal screen if the setting is correct.
3. We showed that the BLAST e-value cutoff does not signifi-
cantly affect the predictive performance of REMAP on the
ZINC data set based on the methods that we introduced in
the manuscript [20].
4. The usage of the commands may be different depending on the


operating system. Here,
$ cut -f4 file.txt is how to get the fourth column of the
tab-separated file, and
$ tail -n +2 file.txt is used to print from the second line of
the file.
5. For users with a large number of chemicals (e.g., 50,000 or
more chemicals), we recommend using multicore-enabled
script, “get_tani_sim_multicore.py”, under “REMAP/
scripts/” directory. This script distributes the tasks in steps 5
and 6 (under Subheading 3.3) to multiple processors, which
not only speeds up the task but saves the storage space for the
output files. Running the following command will distribute
5,000 chemicals to each process and collect similarity scores of
0.5 or higher. The result is the same as described in step 6,
Subheading 3.3.

$ python3 REMAP/scripts/get_tani_sim_multicore.py ZINC_


chemical_structures.txt 5000 0.5 chem_chem_sim_Idx_05.csv

Note that to use the multicore script, the global variable,


“screenmd_path_global” in the script has to be changed to
where the JChem package is installed (e.g., “/home/user/
ChemAxon/JChem/bin"). Similar to the environment path
variable for the BLAST package (see Note 2), adding "export
PATH="$PATH:/home/user/ChemAxon/JChem/bin/""
(one line) at the end of the ".profile" file can simplify the
command.
6. The calculated Tanimoto distance scores are between 0 and 1.
The chemical-chemical similarity score is calculated by sub-
tracting the Tanimoto distance from 1 so that the similarity
score is higher when the distance is lower and vice versa. The
cutoff score can be controlled in the “tani2sim.py” script.
Chemical-chemical pairs with a similarity score lower than the
cutoff are ignored. In other words, if the calculated similarity
between chemical 1 and chemical 8 is 0.4 (cutoff = 0.5), the
chemical matrix will have 0 value at its first row and eighth
column entry. Lowering the cutoff to 0.4 will result in the same
entry being a similarity score, 0.4. While lowering the cutoff
score will include more information, it is not necessarily better
for predictions. In addition, lowering the cutoff score dramati-
cally increases the resulting file size and the matrix density,
especially for larger data sets.
7. The “csvread” function reads the comma-separated file,
“sparse” function creates a sparse matrix, an apostrophe trans-
poses the matrix, and “speye” creates a sparse identity matrix.
8. Since we are not assuming that the protein-protein similarity is


symmetric, it is important not to add the transposition matrix
of protein-protein similarity matrix here.
9. The command >> save('/home/user/REMAP/chem_prot_zinc.mat','chem_prot');
saves the current workspace variable named "chem_prot" as a
file named "chem_prot_zinc.mat" under the "/home/user/REMAP"
directory. When the
saved file is loaded later, the loaded variable name is the same
as the saved variable name, “chem_prot” here.
10. This process took approximately 40 h on the 2.80 GHz quad-
core machine (using maximum 3 cores). Readers who want a
more thorough grid search (i.e., smaller intervals within para-
meters) may modify the code. For example, replacing
"p6s = 0:0.33:1.0" with "p6s = 0:0.1:1.0" will search for
the chemical-side importance parameter from 0 to 1 with an
increment of 0.1, instead of 0.33. Readers can specify the
output file by modifying the line "fid=fopen('../REMAP_optimization_ZINC_quick.txt','at+');"
in the code. For instance, changing it to
"fid=fopen('/home/user/REMAP/YourFile.txt','at+');" will save
the optimization results in "YourFile.txt" under the
"/home/user/REMAP" directory.
11. We described the score normalization procedures in our man-
uscript [20]. The same normalization procedure can be turned
on by adding another argument. For instance,

>> REMAP(chem_prot_zinc,chem_chem_zinc,prot_prot_zinc,’/
home/user/REMAP/REMAP_prediction.txt’,‘true’,0.7,para);

(one-line command) will turn on the score normalization, and


the results are filtered based on the normalized scores, instead
of the raw prediction scores. While our normalization process is
based on the predicted score distribution of both true positive
and true negative data sets, ZINC data that we use here does
not contain any information about the true negative pairs.
Therefore, we cannot assume that additional processing of
the scores guarantees an improvement.
12. To run COSINE, use “COSINE_Optimization.m” script
under “/home/user/REMAP/BiDD/COSINE” directory.
This script takes the same arguments as “REMAP_optimiza-
tion.m,” but it also makes the final prediction using the opti-
mized parameters. For example,

>> COSINE_Optimization(chem_prot_zinc,chem_chem_zinc,
prot_prot_zinc)
not only optimizes parameters for COSINE but outputs the


predicted pairs as integer indices. The output file can be con-
verted to chemical and protein IDs, similar to step 5 in the
Parameter optimization and prediction section.

5 Conclusion

In this chapter, we covered a state-of-the-art method to predict


drug-target associations from data preprocessing to parameter opti-
mization in detail and used REMAP, a collaborative filtering algo-
rithm, to make predictions. We also introduced COSINE to resolve
the cold start problem, where the prediction based on decision
history becomes difficult and inaccurate for users (drugs) having
no purchase records (known targets) (see Note 12). These methods
can be used to solve other problems, such as gene-side effect
associations, if they can be represented as the network in Fig. 1.
We hope our readers are aware that neither REMAP nor
COSINE is a perfect prediction method. These methods do not
reflect some important molecular-level biochemical details.
Chemical-chemical similarity does not capture activity cliffs, where a
small change in chemical structure leads to dramatic changes in
binding activity. Similarly, protein-protein similarity by global
sequence comparison does not reflect the impact of amino acid
mutations, posttranslational modifications, or conformational
dynamics on ligand binding. It is also possible for two nonhomologous
proteins with similar binding pockets to bind the same ligands
[28, 29]. Pharmacokinetic parameters need to be incorporated to
reflect binding activities that require prolonged drug-target binding
to induce physiological consequences [12]. Also, a drug-
induced physiological effect results from the interplay among com-
plex biological networks within and across different cellular com-
partments. Methods that integrate multi-aspect data sets need to be
developed for more accurate predictions and better understanding
of drug actions. It is possible to combine multiple bipartite graph
representations of biological relations into an integrated multilay-
ered network model and infer multiple biological relations jointly.
We have developed a new computational tool FASCINATE for this
purpose [30].

Acknowledgment

We acknowledge Miriam Cohen, Ph.D., for proofreading the


manuscript.
References

1. Kennedy T (1997) Managing the drug discovery/development interface. Drug Discov Today 2(10):436–444. https://doi.org/10.1016/S1359-6446(97)01099-4
2. Weber A, Casini A, Heine A, Kuhn D, Supuran CT, Scozzafava A, Klebe G (2004) Unexpected nanomolar inhibition of carbonic anhydrase by COX-2-selective celecoxib: new pharmacological opportunities due to related binding site recognition. J Med Chem 47(3):550–557. https://doi.org/10.1021/jm030912m
3. Xie L, Wang J, Bourne PE (2007) In silico elucidation of the molecular mechanism defining the adverse effect of selective estrogen receptor modulators. PLoS Comput Biol 3(11):e217. https://doi.org/10.1371/journal.pcbi.0030217
4. Forrest MJ, Bloomfield D, Briscoe RJ, Brown PN, Cumiskey AM, Ehrhart J, Hershey JC, Keller WJ, Ma X, McPherson HE, Messina E, Peterson LB, Sharif-Rodriguez W, Siegl PK, Sinclair PJ, Sparrow CP, Stevenson AS, Sun SY, Tsai C, Vargas H, Walker M 3rd, West SH, White V, Woltmann RF (2008) Torcetrapib-induced blood pressure elevation is independent of CETP inhibition and is accompanied by increased circulating levels of aldosterone. Br J Pharmacol 154(7):1465–1473. https://doi.org/10.1038/bjp.2008.229
5. Howes LG, Kostner K (2007) The withdrawal of torcetrapib from drug development: implications for the future of drugs that alter HDL metabolism. Expert Opin Investig Drugs 16(10):1509–1516. https://doi.org/10.1517/13543784.16.10.1509
6. Butler D, Callaway E (2016) Scientists in the dark after French clinical trial proves fatal. Nature 529(7586):263–264. https://doi.org/10.1038/nature.2016.19189
7. Xie L, Evangelidis T, Xie L, Bourne PE (2011) Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput Biol 7(4):e1002037. https://doi.org/10.1371/journal.pcbi.1002037
8. Bertolini F, Sukhatme VP, Bouche G (2015) Drug repurposing in oncology--patient and health systems opportunities. Nat Rev Clin Oncol 12(12):732–742. https://doi.org/10.1038/nrclinonc.2015.169
9. Novac N (2013) Challenges and opportunities of drug repositioning. Trends Pharmacol Sci 34(5):267–272. https://doi.org/10.1016/j.tips.2013.03.004
10. Bowes J, Brown AJ, Hamon J, Jarolimek W, Sridhar A, Waldron G, Whitebread S (2012) Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nat Rev Drug Discov 11(12):909–922. https://doi.org/10.1038/nrd3845
11. Hart T, Xie L (2016) Providing data science support for systems pharmacology and its implications to drug discovery. Expert Opin Drug Discov 11(3):241–256. https://doi.org/10.1517/17460441.2016.1135126
12. Xie L, Draizen EJ, Bourne PE (2017) Harnessing big data for systems pharmacology. Annu Rev Pharmacol Toxicol 57:245–262. https://doi.org/10.1146/annurev-pharmtox-010716-104659
13. Xie L, Xie L, Kinnings SL, Bourne PE (2012) Novel computational approaches to polypharmacology as a means to define responses to individual drugs. Annu Rev Pharmacol Toxicol 52:361–379. https://doi.org/10.1146/annurev-pharmtox-010611-134630
14. Xie L, Ge X, Tan H, Xie L, Zhang Y, Hart T, Yang X, Bourne PE (2014) Towards structural systems pharmacology to study complex diseases and personalized medicine. PLoS Comput Biol 10(5):e1003554. https://doi.org/10.1371/journal.pcbi.1003554
15. Koutsoukas A, Lowe R, Kalantarmotamedi Y, Mussa HY, Klaffke W, Mitchell JB, Glen RC, Bender A (2013) In silico target predictions: defining a benchmarking data set and comparison of performance of the multiclass naive Bayes and Parzen-Rosenblatt window. J Chem Inf Model 53(8):1957–1966. https://doi.org/10.1021/ci300435j
16. van Laarhoven T, Nabuurs SB, Marchiori E (2011) Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27(21):3036–3043. https://doi.org/10.1093/bioinformatics/btr500
17. van Laarhoven T, Marchiori E (2013) Predicting drug-target interactions for new drug compounds using a weighted nearest neighbor profile. PLoS One 8(6):e66952. https://doi.org/10.1371/journal.pone.0066952
18. Gonen M (2012) Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28(18):2304–2310. https://doi.org/10.1093/bioinformatics/bts360
19. Rouillard AD, Gundersen GW, Fernandez NF, Wang Z, Monteiro CD, McDermott MG, Ma'ayan A (2016) The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database (Oxford). https://doi.org/10.1093/database/baw100
20. Lim H, Poleksic A, Yao Y, Tong H, He D, Zhuang L, Meng P, Xie L (2016) Large-scale off-target identification using fast and accurate dual regularized one-class collaborative filtering and its application to drug repurposing. PLoS Comput Biol 12(10):e1005135. https://doi.org/10.1371/journal.pcbi.1005135
21. Adomavicius G, Tuzhilin A (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng 17(6):734–749. https://doi.org/10.1109/TKDE.2005.99
22. Bobadilla J, Ortega F, Hernando A, Bernal J (2012) A collaborative filtering approach to mitigate the new user cold start problem. Knowl Based Syst 26:225–238. https://doi.org/10.1016/j.knosys.2011.07.021
23. Lim H, Gray P, Xie L, Poleksic A (2016) Improved genome-scale multi-target virtual screening via a novel collaborative filtering approach to cold-start problem. Sci Rep 6:38860. https://doi.org/10.1038/srep38860
24. Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG (2012) ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 52(7):1757–1768. https://doi.org/10.1021/ci3001277
25. ChemAxon (2015) Screen was used for generating pharmacophore descriptors and screening structures, JChem 15.3.2.0
26. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10(1):1
27. BLAST® Command Line Applications User Manual [Internet] (2008) National Center for Biotechnology Information (US). https://www.ncbi.nlm.nih.gov/books/NBK279690/. 2017
28. Creixell P, Palmeri A, Miller CJ, Lou HJ, Santini CC, Nielsen M, Turk BE, Linding R (2015) Unmasking determinants of specificity in the human kinome. Cell 163(1):187–201. https://doi.org/10.1016/j.cell.2015.08.057
29. Xie L, Bourne PE (2008) Detecting evolutionary relationships across existing fold space, using sequence order-independent profile-profile alignments. Proc Natl Acad Sci U S A 105(14):5441–5446. https://doi.org/10.1073/pnas.0704422105
30. Chen C, Tong H, Xie L, Ying L, He Q (2016) FASCINATE: fast cross-layer dependency inference on multi-layered networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 765–774
Chapter 12

Bioinformatics-Based Tools and Software in Clinical


Research: A New Emerging Area
Parveen Bansal, Malika Arora, Vikas Gupta, and Mukesh Maithani

Abstract
Nowadays, drug discovery is a long process which includes target identification, validation, lead optimiza-
tion, and many other major/minor steps. The huge flow of data has necessitated the need for computa-
tional support for collection, storage, retrieval, analysis, and correlation of data sets of complex information.
At the beginning of the twentieth century, it was cumbersome to elaborate the experimental findings in the
form of clinical outcomes, but current research in the field of bioinformatics clearly shows ongoing
unification of experimental findings and clinical outcomes. Bioinformatics has made it easier for researchers
to overcome various challenges of time-consuming and expensive procedures of evaluation of safety and
efficacy of drugs at a much faster and economic way. In the near future, it may be a major game player and
trendsetter for personalized medicine, drug discovery, drug standardization, as well as food products. Due
to rapidly increasing commercial interest, currently probiotic-based industries are flooding the market with
a range of probiotic products under the banner of dietary supplements, natural health products, food
supplements, or functional foods. Most of the consumers are attracted toward probiotic formulations due
to the rosy picture provided by the media and advertisements about high beneficial claims. These products
are not regulated by pharmaceutical regulatory authorities in different countries of origin and are rather
regulated as per their intended use. Lack of stipulated quality standard is a major challenge for probiotic
industry; hence there would always be a possibility of marketing of ineffective and unsafe products with false
claims. Hence it is very important and pertinent to ensure the safety of probiotic formulations available as
over-the-counter (OTC) products for ignorant society. At the same time, probiotic industry, being in its
initial stages in developing and underdeveloped countries, requires to ensure safe, swift, and successful
usage of probiotics. In the absence of harmonized regulatory guidelines, safety, quality, as well as the efficacy
of the probiotic strain does not remain a mandate but becomes a choice for the manufacturer. Hence there
is an urgent need to screen already marketed probiotic formulations for their safety with respect to specific
strains of probiotic. Various conventional methods used by the manufacturers for the identification of
probiotic microbes create a blurred image about their status as probiotics. The present manuscript focuses
on a bioinformatics-based technique for validation of marketed probiotic formulation using 16s rRNA
sequencing and strain-level identification of bacterial species using Ez Texan and laser gene software. This
technique gives a clear picture about the safety of the product for human use.

Key words Bioinformatics, Probiotics, Ez Texan, Laser gene, Drug discovery, 16s rRNA, Strain-level
identification, Pathogenic strain, Safety and efficacy

1 Introduction

Innovations in the health sciences have resulted in dramatic changes


in the ability to treat disease and improve the quality of life. Expen-
ditures on pharmaceuticals have grown faster than other major
components of the health-care system since the late 1990s
[1]. Consequently, the debates on rising health-care costs and the
development of new medical technologies have focused increas-
ingly on the pharmaceutical industry, which is both a major partici-
pant in the health-care industry as well as a major source of
advances in health-care technologies [2]. There has been a sudden
exponential increase in the number and severity of diseases such as
cancer, hepatitis, and HIV, resulting in high morbidity and mortality
[3]. PwC has projected that 2017 medical costs will continue to rise
at the same rate as in 2016 [4]. Clinical research has played a
significant role in promoting the health and well-being of people
and deals with issues related to the safety and efficacy of medications,
devices, diagnostic products, and treatments intended for human
welfare. It encompasses drug discovery, preclinical studies, intensive
clinical trials, and eventually postmarketing vigilance performed to
establish the safety and efficacy of drugs; hence clinical research
covers an article from its inception in the laboratory to its
introduction into the market for human use, particularly for
alleviating various diseases [5].
Today bioinformatics has emerged as an important tool that
can improve drug discovery with efficient statistical algorithms and
rational approaches for target identification, validation, and
optimization [6]. Computers and software tools greatly help to create
databases, predict the function of proteins, model the structure of
proteins, determine the coding regions of nucleic acid sequences,
find suitable drug compounds from a large pool, and mine, analyze,
and interpret data faster, thereby reducing the time of drug discovery
and eventually the cost involved in it [1]. Software and bioinformatics
tools play a great role not only in drug discovery but also in drug
development [7]. The ultimate aim of the use of bioinformatics tools
is to discover biological insights that will further help to generate a
harmonized perspective on unifying biological principles [8]. A
number of bioinformatics software tools are being developed and
utilized to give a thrust to clinical research; however, there is a great
need to compile the techniques being used by scientists all over the
world and to put them on a single platform for use by fellow
scientists, in the interest of health around the world.
The present manuscript focuses on a technique for validation of
marketed probiotic formulation using 16s rRNA sequencing and
strain-level identification of bacterial species using Ez Texan and
laser gene software.
Due to rapidly increasing commercial interest in the probiotic


products as well as their popularity for health benefits, there has
been an upsurge of active research in the field of probiotics
[9]. The majority of ongoing research aims to understand the
abilities of particular probiotic organisms and to characterize
specific probiotic organisms according to their health benefits [10].
Harmful strains of some probiotic species are likely to enter
probiotic products through manufacturers' negligence or
inadvertently. In the absence of firm regulations, regulatory
oversight of the manufacturing and marketing of probiotic-based
products worldwide is also not very tight. This may be a matter of
serious concern for consumers and may cause adverse drug
reactions or adverse health effects. There is a great need to develop
fast and foolproof methods to identify the probiotic strains used in
products before they are even launched in the market [11]. Hence,
before marketing probiotic-based products, it is of utmost
importance to know the exact taxonomic status of a probiotic
strain, because strains belonging to the same genus or species may
not confer the same effects, owing to variations in their biological
characteristics. Probiotic strains should therefore undergo
polyphasic species and strain identification, which includes
phenotypic characterization by classical microbiological approaches,
biochemical characterization, and genotypic characterization. The
polyphasic technique is recommended for distinguishing closely
related species that occupy the same ecological niches in
phylogenetic trees; in fact, phenotypic characterization alone is
insufficient to provide satisfactory results. Hence strain-level
identification by a polyphasic approach is essential to enable
identification of new strains, phylogenetic analysis, differentiation
of interstrain differences, and discrimination of bacterial species at
the molecular level. It also describes the natural therapeutic
behavior of probiotic bacteria and prevents mislabeling of probiotic
products.

2 Validation of Marketed Probiotic Formulation Using 16s rRNA Sequencing


and Strain-Level Identification of Bacterial Species Using Ez Texan and Laser
Gene Software

In the present manuscript, the protocol of a technique for an already
available marketed formulation has been validated with respect to
identification of the bacterial content using a polyphasic approach
and bioinformatics-based tools. The aims of the present article are to
accomplish:
• Detailed review of the label on the marketed formulation.
• Identification of the bacterial content present in the procured marketed formulation using phenotypic, biochemical, and genotypic techniques, keeping the labeled bacterial species as the standard.
• Analysis of results with respect to the identified bacterial species and the species labeled on the marketed formulation.

3 Methodology

The following methodology was used for the validation of the
marketed formulation, and a flowchart of the same is shown in
Fig. 1.

Fig. 1 Methodology followed for validation of a marketed formulation
(a) Procurement of marketed formulation: A marketed formula-
tion constituted and labeled with a single probiotic strain was
procured from the local market.
(b) Reviewing label: The label of the procured formulation was
critically reviewed with respect to the bacterial genus, species,

and strain stated. The label depicted the presence of Lactobacillus
sporogenes in the said formulation.
(c) Isolation of probiotic bacteria from sample: The bacteria were
isolated from the procured sample using nutrient broth as well
as nutrient agar and De Man, Rogosa, and Sharpe (MRS)
broth and MRS agar. The plates were incubated at
37  2  C for 48 h. After incubation, individual colonies
were selected and transferred into sterile broth mediums.
The isolates were purified selecting colonies with streak plate
technique.
(d) Physiological and biochemical characterization: The colonies
from procured sample and standard Lactobacillus sporogenes
were subjected to various morphological and biochemical
examinations as follows:
• Gram staining: Isolates from the procured culture and stan-
dard culture were examined for striking differences in the
Gram staining pattern. For this, the same samples were
subjected to standard Gram staining technique for size,
shape, arrangement, and Gram’s nature of bacterial sam-
ples. The Gram reaction of the isolates was determined by
light microscopy after Gram staining. Cultures were grown
in an appropriate medium at 37 ± 2 °C for 24 h under anaerobic
conditions. Cells from fresh cultures were used for Gram staining.
An individual colony was picked up aseptically from the agar plate,
stained for 1 min, and rinsed with water (5 s). Decolorizer (95%
ethanol) was then added for 15–30 s and rinsed off with water for
5 s. Finally, safranine was added for 60–80 s and rinsed with water,
followed by examination of the isolates under light microscopy.
Gram-positive cells stain purple, while Gram-negative cells stain
pink/red [12].
• Catalase test: Catalase is an enzyme produced by many
microorganisms that breaks down hydrogen peroxide into water
and oxygen, causing gas bubbles. The formation of gas bubbles
indicates the presence of the catalase enzyme and a positive
result [13]:

2H2O2 → 2H2O + O2

The catalase test was performed with both the marketed formulation
and standard L. sporogenes in order to see their catalase reactions.
For this purpose, fresh liquid cultures in nutrient media were used
for the catalase test by dropping 3% hydrogen peroxide solution
onto 1 ml of cultures grown for 12 h. Isolates unable to generate
gas bubbles are considered catalase-negative [14].
220 Parveen Bansal et al.

• Indole test: 10 ml of tryptophan medium was poured into each
McCartney bottle. Inoculation was performed after autoclaving
the media for 15 min at 15 lbs. pressure and 121 °C. The media
inoculated with the isolate of the marketed formulation and
standard L. sporogenes were incubated for 24 h at 37 ± 2 °C.
E. coli MTCC 4315 was used as a positive control, and a
laboratory isolate of Pseudomonas aeruginosa was used as a
negative control. Kovac's reagent was added to detect the indole.
The amino acid tryptophan can be broken down by the enzyme
tryptophanase to form indole, pyruvic acid, and ammonia as end
products. Tryptophanase differentiates indole-positive enterics,
e.g., E. coli, from closely related indole-negative enterics [15].
• Oxidase test: The enzyme oxidase present in certain bacteria
catalyzes the transport of electrons from donor bacteria to the
redox dye tetramethyl-p-phenylenediamine dihydrochloride. The
dye in the oxidized state has a deep purple color. Inoculated
plates were incubated at 37 ± 2 °C for 76 h. One colony of each
isolate was smeared onto filter paper impregnated with oxidase
reagent (1% tetramethyl-p-phenylenediamine dihydrochloride).
This test determines whether an organism produces cytochrome c
oxidase. A positive result is indicated by the production of a dark
blue color within 7 s [16].
• Methyl red test: 10 ml of methyl red and Voges-Proskauer
(MR-VP) broth was taken in each McCartney bottle and
autoclaved (for 15 min at 15 lbs. pressure and 121 °C) for the
methyl red (MR) and Voges-Proskauer (VP) tests. After this, the
isolate of the marketed formulation and standard L. sporogenes
were inoculated into MR-VP broth with a sterile transfer loop.
The McCartney bottles were then incubated at 37 ± 2 °C. After
48 h of incubation, the MR-VP broth of each bottle was equally
split into two McCartney bottles (one of these bottles was used
for the MR test; the other was used for the VP test). The bottles
containing MR-VP broth subjected to the MR test were then
incubated for another 24 h. After this, five drops of the pH
indicator methyl red were added to each bottle. The bottles were
gently rolled between the palms of the hands to disperse the
methyl red. In this test, E. coli MTCC 4315 was used as a
positive control and Pseudomonas aeruginosa as a negative
control. The methyl red test was used
to identify enteric bacteria based on their pattern of glucose
metabolism. Enterics that subsequently metabolize pyruvic
acid to other acids lower the pH of the medium to 4.2. At
this pH, methyl red turns red. A red color represents a
positive test. Enterics that subsequently metabolize pyruvic


acid to neutral end products lower the pH of the medium
to only 6.0. At this pH, methyl red is yellow. A yellow color
represents a negative test [17].
• Voges-Proskauer (VP) test: As previously mentioned in the
methyl red test, 10 ml of MR-VP broth was taken in each
McCartney bottle. The McCartney bottles inoculated with the
isolate of the marketed formulation and standard L. sporogenes
were then incubated at 37 ± 2 °C for 48 h. After incubation,
40% KOH (Barritt's reagent B) was added. After this, the
bottles were shaken vigorously and allowed to stand for 20 min.
• Citrate utilization test: The isolate of the marketed formulation
and standard L. sporogenes were streaked onto citrate agar slants
and then incubated for 96 h. The citrate test determines the
ability of microorganisms to use citrate as the sole source of
carbon and energy. In this test, E. coli MTCC 4315 was used as
a negative control, and Pseudomonas aeruginosa was used as a
positive control. A chemically defined medium (Simmons citrate
agar) having sodium citrate as the carbon source and NH4+ as
the nitrogen source was used for the test. Bromothymol blue was
used as a pH indicator. Utilization of citrate by microorganisms
results in a rise in pH, indicated by a change in the color of the
indicator from green to blue. This further indicates that the
microorganism under screening can utilize citrate as its only
carbon source [18].
• Sugar fermentation pattern: Broth containing the bacteria
isolated from the marketed formulation and standard
L. sporogenes, at pH 6.5, was taken into each McCartney tube,
and phenol red (0.018 g/l) was added into the tube as a pH
indicator. After autoclaving the medium (at 121 °C for 15 min),
1 ml of each of the different sugar solutions (10% w/v, sterilized
through membrane filtration) was inoculated into the different
tubes. Then 200 μl of a 12-h-old liquid culture was inoculated
into the broth and incubated at 37 ± 2 °C for 24 h. If fermenting
bacteria are grown in a liquid culture medium containing
the carbohydrates, they may produce organic acids as
by-products of the fermentation. These acids are released
into the medium and lower its pH. If a pH indicator such as
phenol red or bromocresol purple is included in the
medium, the acid production will change the medium
from its original color to yellow. Gases produced during
the fermentation process can be detected by using a small,
inverted tube, called a Durham tube (named after Herbert
Edward Durham, bacteriologist, 1866–1945), within the
liquid culture medium. After adding the proper amount of
broth, Durham tubes are inserted into each culture tube.


During autoclaving, the air is expelled from the Durham
tubes, and they become filled with the medium. If gas is
produced, the liquid medium inside the Durham tube will
be displaced, entrapping the gas in the form of a
bubble [19].
(e) Genotypic characterization:
• Genomic DNA isolation: MagNA Pure Bacterial Lysis Buffer
(Roche) (400 μl) was added to a bead tube (SeptiFast Lys Kit
MGRADE, Roche), followed by the addition of 200–800 μl of
sample culture. The mixture was then run for 2 × 45 s at speed
6.5 in the FastPrep instrument (Q-BIOgene), followed by
centrifugation at 12,281 × g for 3 min. The supernatant was
transferred to a 2 ml tube that fits into the MagNA Pure
Compact machine (Roche).
• Quantification of DNA: The isolated DNA was quantified.
For the same, electrophoresis through agarose gel is the
standard method used to separate, identify, and purify the
DNA fragments. The technique is simple, rapid to perform,
and capable of resolving fragments of DNA that cannot be
separated adequately by other procedures, such as density
gradient centrifugation. Furthermore, location of DNA
within the gel can be determined directly by staining with
low concentration of the fluorescent intercalating dye, ethi-
dium bromide; bands containing as little as 10 ng of DNA
can be detected by direct examination of the gel in ultravi-
olet light. Presumed PCR-amplified DNA fragments were
analyzed by electrophoresis on 1% agarose gel. The agarose
gel was prepared after sealing the edges of a clean, dry, glass
plate with autoclave tape so as to form a mold. Mold was
settled on a horizontal section of the bench. Later correct
amount of powdered agarose was added to measured quan-
tity of the electrophoresis buffer in the flask (e.g., for 1% gel
0.4 g of agarose was added to 40 ml of the 1 TAE buffer)
to make a gel. After it, TAE buffer was filled in electropho-
resis tank. The slurry was heated in the microwave until the
agarose dissolves completely. The solution was cooled to
60  C, and ethidium bromide was added to gel to a final
concentration of 0.5 μg/ml. The mixture was mixed thor-
oughly and poured into the mold. DNA was mixed with
desired amount of gel-loading buffer and slowly loaded
into the slots of the submerged gel using a disposable
micropipette. The lid of the gel tank was closed and gel
was run. After completion, gel was observed under UV
light. After getting the bands of desired amplification,
later bands were eluted from the gel.
Table 1
Details of primers used for amplification of 16S rDNA region

Forward
Group A-forward 5′-gt-tTg-atc-mtg-gct-cag-rAc-3′
Group B-forward 5′-gt-tTg-atc-mtg-gct-cag-aKtg-3′
Group C-forward 5′-agt-ttg-atc-mtg-gct-cag-gAt-3′
Reverse
Groups A, B, C-reverse (pD)** 5′-gta-tta-ccg-cgg-ctg-ctg-3′

Table 2
Details of the reaction mixture used for RT-PCR

S. No Components Quantity added


1. SYBR Premix Ex Taq (TaKaRa) 12.5 μl
2. F-primer (10 μM) 2.0 μl
3. R-primer (10 μM) 1.0 μl
4. H2O (PCR grade) 7.5 μl
5. Template 2.0 μl
6. Total volume 25.0 μl

Table 3
Details of PCR conditions used for amplification

S. No. Process Temperature Time (s)



1. Initial enzyme activation 95 °C 10
2. Melting 95 °C 10
3. Annealing 64 °C 15
4. Extension 72 °C 20

• Purification of DNA: DNA extraction and purification were
performed by the DNA elution method, using a DNA elution
kit (Qiagen). The elution volume was 50 μl.
• Amplification of 16S rDNA region: For the amplification of
16S rDNA, the reaction mixture was prepared using group-specific
primers. The real-time PCR reactions were run on a SmartCycler
machine (Cepheid) for 45 cycles.
• The details of the group-specific primers, the reaction mixture,
and the PCR conditions are given in Tables 1, 2, and 3, respectively.
• rRNA purification and sequencing: The 16s rRNA sequence was
eluted out of the gel, and the finally purified sample was
sequenced. The sequencing was done by outsourcing, and a
sequence of 699 bp was obtained for this marketed formulation.

3.1 Bioinformatics Tools-Based Data Analysis

The sequence obtained after sequencing was exploited for analysis of
the data with respect to the following parameters:
1. Percentage homology.
2. Phylogenetic analysis.
• Percentage homology: The percentage homology of the sequence
obtained from the marketed formulation was analyzed against
other available strains by using the Ez-Texan database; observations
for percentage homology, sequence analysis, graphic summary, dot
matrix, identity analysis, and taxonomic hierarchy were recorded as
images (a minimal illustration of the underlying percent-identity
computation is sketched after this list).
• Phylogenetic analysis: The obtained query sequence of the isolate
from the marketed probiotic formulation was closely compared with
other related species with the help of laser gene software. A
significant level of variation in 16s rRNAs between strains of
different species was observed, and hence a phylogenetic tree was
drawn from the observed parameters.
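The idea behind percentage homology can be illustrated with a small, self-contained Python sketch: a global (Needleman-Wunsch) alignment of a query 16S rDNA fragment against a reference sequence, reporting percent identity over the aligned length. Database services such as Ez-Texan perform far more sophisticated searches; the sequences and scoring parameters below are toy assumptions.

def percent_identity(a, b, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]      # global alignment scores
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = S[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            S[i][j] = max(d, S[i-1][j] + gap, S[i][j-1] + gap)
    i, j, ident, length = m, n, 0, 0               # traceback, counting matches
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            ident += a[i-1] == b[j-1]
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i-1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * ident / length

query = "AGAGTTTGATCCTGGCTCAGGACGAACG"      # toy 16S fragment
reference = "AGAGTTTGATCATGGCTCAGGACGAACG"  # toy reference
print(f"{percent_identity(query, reference):.1f}% identity")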

3.2 Observations of Bioinformatics Tools Used in Data Analysis

The sequence obtained by 16s rRNA sequencing was analyzed by the
EzTexan database. Using the EzTexan database, results were captured
in various formats, such as BLAST-based sequence analysis, graphic
summary, dot matrix, identity analysis, and sequence analysis, which
are shown in Figs. 3, 4, 5, 6, and 7. For example, in the study planned
by the authors, it was observed that the sequence isolated from the
marketed formulation showed maximum similarity with the already
existing database sequence of Bacillus coagulans, as shown in Fig. 2.
The graphic summary illustrates how significant matches, or hits, align with the query sequence. For example, in the graphic summary given below (shown in Fig. 3), the query sequence (i.e., the marketed formulation sequence) was 699 bp and the subject sequence of the anonymous strain was 697 bp. The two sequences showed 99.13% similarity with only two gaps.

Fig. 2 16S rRNA sequence analysis

Fig. 3 Graphic summary of nucleotide sequence

A dot matrix, also known as a dot plot, is a graphical method that allows the comparison of two biological sequences and identifies regions of close similarity between them. It compares sequences by organizing one sequence on the x-axis and the other on the y-axis of a plot. The closeness of the sequences in similarity is determined by observing the curves along the diagonal line. For example, in the current study, the dot matrix shows that most of the base pairs are similar, hence representing a straight line as shown in Fig. 4.
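The dot-plot comparison described above is straightforward to sketch in code. The following minimal Python example (the sequences and window size are illustrative, not taken from the study) marks every position where the two sequences share an identical window, so that identical regions appear as an unbroken diagonal:

```python
# Minimal dot-plot sketch: mark (i, j) positions where two sequences
# share an identical window of length w. Sequences are illustrative.
def dot_matrix(seq_x, seq_y, w=4):
    """Return the set of (i, j) pairs where seq_x[i:i+w] == seq_y[j:j+w]."""
    hits = set()
    for i in range(len(seq_x) - w + 1):
        for j in range(len(seq_y) - w + 1):
            if seq_x[i:i + w] == seq_y[j:j + w]:
                hits.add((i, j))
    return hits

x = "AGTTTGATCATGGCTCAG"
y = "AGTTTGATCCTGGCTCAG"
hits = dot_matrix(x, y)
# Highly similar sequences produce a near-continuous main diagonal (i == j).
print(len(hits), "window matches;",
      sum(1 for i, j in hits if i == j), "on the diagonal")
```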
Fig. 4 Representation of dot matrix

The results of the identity analysis showed similarity of the obtained sequence with a variety of Bacillus and Lactobacillus strains, but the initial 12 ranks were of Bacillus species rather than Lactobacillus, as shown in Fig. 5. Supporting details that the isolated strain was Bacillus coagulans (identity analysis and sequence analysis) are shown in Figs. 5 and 6, respectively.

3.3 Phylogenetic Analysis

For the phylogenetic analysis, the obtained query sequence of the isolate from the selected probiotic formulation was closely compared with other related species with the help of the Lasergene software. A significant level of variation in 16S rRNA between strains of different species was evident from the obtained results. The query sequence of 699 bp was compared among three strains of B. coagulans and closely related species of the genus Bacillus. The 16S rDNA of B. coagulans (marketed formulation isolate) was most closely related to two strains of B. coagulans, namely B. coagulans (strain C4) and B. coagulans (LA 204), with similarities of 98.8% and 99.3%, respectively. Further, the phylogenetic tree prepared with the help of the Lasergene software clarified that the bacterial sample isolated from the marketed formulation is indeed Bacillus coagulans: it formed a coherent cluster with the two other strains, B. coagulans C4 and B. coagulans LA 204, the latter being a type strain of Bacillus coagulans.

Fig. 5 Results of identity analysis

Fig. 6 Sequence analysis of sequence derived from marketed formulation

Apart from this, the phylogenetic tree also showed that B. atrophaeus strains share 95.5% homology between them. Similarly, B. clausii strains showed a homology of 88.4–93.6%, and Bacillus coagulans strains 91.3–100%. The strains showed sequence similarities ranging from 91.9 to 98.0% for B. fusiformis, 91.1 to 96.5% for B. lentus, 93.0 to 96.1% for B. megaterium, 93.6 to 94.7% for B. mojavensis, 94.8% for B. pumilus, 80.7 to 94.9% for B. subtilis, 89.1 to 95.1% for B. vallismortis, and 92.1 to 92.6% for G. stearothermophilus. In the phylogenetic analysis, all strains of the same genus formed distinct clusters, as shown in Fig. 7. The query sequence fell within the cluster of B. coagulans in the phylogenetic tree, which was constructed in MEGA 5.0 using the maximum parsimony method; B. vallismortis CG141107 was the outgroup in the tree, as shown in Fig. 7.

Fig. 7 Phylogenetic tree for probiotic strains
In the abovementioned study, a marketed formulation was validated that claimed the presence of lactic acid bacillus, earlier known as Lactobacillus sporogenes, on its label. When the bacterial samples were subjected to physiological, biochemical, and molecular characterization by 16S rDNA sequencing, it was observed that the strain present in the marketed formulation is Bacillus coagulans, as its identity matched 98–100% with strains of Bacillus coagulans.
Today it is difficult to imagine an area of knowledge without the use of computers and informatics. Along similar lines, modern scientific and medical research also makes use of computer technology to organize and analyze large sets of data. The tremendous flow of data produced by new innovations has necessitated computational support for the collection, storage, retrieval, analysis, and correlation of huge data sets of complex information. Data analysis and computer modeling have truly revolutionized the clinical research of various drugs and therapeutic agents at the molecular level. Considerable economy can be achieved with methods designed through computer modeling, which can even pinpoint the exact site of action of a drug, thus helping in devising new targeted drug delivery systems. As a result, bioinformatics has become one of the newest engineering disciplines. The safety and efficacy of various drugs is the key link between advances in medical research technology and improved health care. Moreover, evaluation of the safety, efficacy, and identity of drugs is a highly complex, time-consuming, and expensive affair. With the adoption of mathematical and statistical software, computer modeling, and other computational engineering methods, it has become easier for researchers to overcome various challenges at a fast pace.
The abovementioned compilation presents a “how-to” manual for the application of bioinformatics-based techniques to the identification of bacterial strains used in probiotics. In the above-cited example, a marketed formulation was validated using bioinformatics-based software, which revealed remarkable differences between the bacterial strain mentioned on the marketed product label and the actual identity of the species contained in the formulation. This approach will help in eradicating the inadvertent entry of toxic bacterial strains into marketed products, for the safe use of consumers. This study clearly shows that bioinformatics provides a wide range of drug-related databases and software tools, which can be used for various purposes related to the drug design and development process.
Chapter 13

Text Mining for Drug Discovery


Si Zheng, Shazia Dharssi, Meng Wu, Jiao Li, and Zhiyong Lu

Abstract
Recent advances in technology have led to the exponential growth of scientific literature in biomedical
sciences. This rapid increase in information has surpassed the threshold for manual curation efforts,
necessitating the use of text mining approaches in the field of life sciences. One such application of text
mining is in fostering in silico drug discovery such as drug target screening, pharmacogenomics, adverse
drug event detection, etc. This chapter serves as an introduction to the applications of various text mining
approaches in drug discovery. It is divided into two parts with the first half as an overview of text mining in
the biosciences. The second half of the chapter reviews strategies and methods for four unique applications
of text mining in drug discovery.

Key words Biomedical text mining, Drug discovery, Biomedical literature, Electronic medical
records, Deep learning

1 Introduction

Drug discovery is a complex and costly process averaging


10–17 years in drug development and close to a billion dollars in
research for each drug discovered [1]. The lack of innovative meth-
ods for preclinical testing of drugs has been cited in the FDA’s
“Critical Path Initiative” report as one of the major obstacles to
drug development [2]. Conventional drug discovery efforts often
begin with manual literature searches with the hope of identifying
potential drug targets. These manual curation efforts are costly and
inefficient, as the majority of drug-related information is
disseminated in papers buried deep within scientific literature or
patient health records and is unable to be processed effectively by
traditional techniques. Often this type of textual data is semi-
structured, unstructured, or heterogeneously formed. In this cir-
cumstance, text mining shows immense potential as a useful
computational tool for drug discovery tasks, as it can be used to

Si Zheng, Shazia Dharssi and Meng Wu contributed equally to this work.


process large-scale textual data with variable formats. Hence, it can be utilized as an essential component for building information- and knowledge-based tools in drug discovery [3].

1.1 What Is Text Mining

In a nutshell, text mining (TM) is the process of discovering and capturing knowledge or useful patterns from a large number of unstructured textual data. It is an interdisciplinary field that draws on data mining, machine learning, natural language processing, statistics, and more. Broadly speaking, TM tasks include document summarization, information retrieval, entity recognition, and relationship extraction. The extracted information is often linked via knowledge graphs to form new facts or hypotheses. Take for example the mining of protein-protein interactions from bioscience textual data. Using TM techniques, one can extract articles with individual protein name mentions, acquire related words that occur in each of these articles, and find additional articles containing the same sets of words. The ultimate goal is to find potential protein-protein interactions from the articles obtained.
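As a minimal illustration of this idea, the Python sketch below counts how often pairs of protein names co-occur in the same article; the toy corpus and protein list are placeholders for what would, in practice, be PubMed abstracts and a curated protein lexicon:

```python
from collections import Counter
from itertools import combinations

# Toy corpus; in practice these would be abstracts retrieved from PubMed.
articles = {
    "PMID1": "BRCA1 interacts with RAD51 during repair of DNA damage.",
    "PMID2": "RAD51 and BRCA2 colocalize at sites of double-strand breaks.",
    "PMID3": "TP53 regulates cell cycle arrest.",
}
proteins = {"BRCA1", "BRCA2", "RAD51", "TP53"}

# Count article-level co-occurrence of protein pairs; frequently
# co-mentioned pairs become candidate protein-protein interactions.
pair_counts = Counter()
for text in articles.values():
    mentioned = sorted(p for p in proteins if p in text)
    pair_counts.update(combinations(mentioned, 2))

for pair, n in pair_counts.most_common():
    print(pair, n)
```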

1.2 Text Mining for Facilitating Drug Discovery

The biosciences have become one of the most promising application areas for text mining, as most biomedical scientific findings are presented in the form of textual data in scholarly publications. In recent years, biomedical text mining has generated promising outcomes for drug discovery by utilizing textual databases and advanced information extraction techniques. Hidden information like drug-drug and drug-target interactions can be extracted from textual data [4], which can aid in the identification of novel drugs or the reusability of these approved drugs for other indications. All of these approaches have sought to infer novel relationships among biological entities by combining known elements hidden in scientific text.

1.3 Challenges of Text Mining in Drug Discovery

Although promising, there are many challenges in text mining for drug discovery. For instance, drug and chemical names in scientific textual data are heterogeneous/ambiguous, and documents differ substantially in their length, structure, language/vocabulary, writing style, and information content. As for TM technologies, the lack of interoperability among TM tools impedes the development of more complex systems. Other challenges include limited high-quality training data for the development and evaluation of advanced supervised machine learning methods. For example, deep learning, a new and advanced area of machine learning research, has yet to show its full potential when applied to biomedical text mining tasks. One major obstacle is the lack of sufficient training data in the biomedical domain.

Lastly, the ability to handle large-scale data continues to be a challenge and is necessary for effective text mining applications. Recent computing technologies, such as cloud computing, have been applied to help boost and optimize existing tools.
Fig. 1 Overview of text mining for drug discovery

In this chapter, we present a brief introduction to text mining


strategies, address limitations to text mining in the biosciences, and
review successful applications in drug discovery research (Fig. 1).
This chapter serves as a manual for readers interested in text mining
for drug discovery and directs readers to the necessary resources
currently available.

2 General Principles of Biomedical Text Mining

2.1 Information Retrieval

The first critical step in biomedical text mining is to obtain information relevant to a particular topic from data resources, also called information retrieval. Topics of interest can be represented by various queries, with each query retrieving a list of matching documents. There are two main strategies when crafting queries: (1) obtain all matches that may be relevant to the topic (maximize recall) or (2) find only documents that are truly related to the topic of interest (maximize precision). Methods for generating different types of queries include keywords with controlled vocabulary, Boolean search queries, natural language queries, wildcard queries, and hybrid approaches. Advanced search options are also available but require an exact search target with proper terms as well as additional support from the search interface. Queries in biomedicine are complicated by the ambiguity and variability of terms, often rendering the search results neither comprehensive nor accurate.
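As a concrete example of a Boolean keyword query, the sketch below retrieves matching PMIDs from PubMed through NCBI's public E-utilities interface; the search term itself is illustrative:

```python
import requests

# Boolean keyword query against PubMed via the NCBI E-utilities API.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    # Illustrative Boolean query: drug name AND phrase, excluding reviews.
    "term": '(imatinib AND "drug resistance") NOT review[pt]',
    "retmax": 20,          # cap the number of PMIDs returned
    "retmode": "json",
}
resp = requests.get(ESEARCH, params=params, timeout=30)
pmids = resp.json()["esearchresult"]["idlist"]
print(pmids)
```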
To improve the search accuracy of keyword-based retrieval systems, an understanding of the semantics of user queries is necessary. One such approach was developed by Huang et al., who proposed an unsupervised framework, SIP (semantically similar pattern finder), which extracts biomedical semantic relationships from PubMed queries in an automated fashion [5]. This novel framework aims to understand the user query's semantics in order to provide improved literature retrieval results.
systems are an alternative to keyword searches that allows the user
to query using natural language. Olelo is an example of a question
answering system in biomedicine. It is a fast and intelligent web
search application, which combines biomedical literature and ter-
minologies in a fast in-memory database to enable real-time
responses to researchers’ queries [6].

2.2 Named Entity Recognition

The next step in the text mining pipeline is named entity recognition, the process of locating specific predefined types of entities in text. For biomedical text mining, named entity recognition typically involves extraction of entities such as drugs, diseases, genes/proteins, and mutations from unstructured text. The information extracted can be used to define semantic relationships between entities to allow for further analysis of article topics. Over the years, many biomedical named entity recognition systems have been developed (Table 1).
Common methods for biomedical entity recognition include dictionary look-up, rule-based approaches, machine learning, and hybrid techniques. Dictionary-based methods are simple and practical but are limited by the scale and quality of the dictionary. Resources like the Unified Medical Language System (UMLS: https://www.nlm.nih.gov/research/umls/) [7], the Comparative Toxicogenomics Database (CTD: https://ctdbase.org) [8], DrugBank (https://www.drugbank.ca) [9], PubChem (https://pubchem.ncbi.nlm.nih.gov) [10], RxNorm (https://www.nlm.nih.gov/research/umls/rxnorm/) [11], etc. are often used for the creation of entity dictionaries. Rule-based approaches can be effective when resources such as entity gazetteers or entity-labeled textual training data are missing [12]. In recent studies, machine learning methods such as conditional random fields (CRF), structured support vector machines (SSVMs), and deep neural networks have been widely adopted. For instance, entity recognition tools such as BANNER [13], DNorm [14], and TaggerOne [15] are based on conditional random field (CRF) techniques.
Table 1
Named entity recognition systems

tmChem: an open-source software for identifying chemical names. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmchem/. Recognizes chemical and drug names in biomedical literature.
BANNER-CHEMDNER: a system for identifying chemical and drug mentions. URL: https://bitbucket.org/tsendeemts/banner-chemdner. Recognizes chemical and drug names in free text.
CheNER: a tool for chemical named entity recognition. URL: http://ubio.bioinfo.cnio.es/biotools/CheNER/. Recognizes chemical and drug names in free text.
ChemDataExtractor: a toolkit for extraction of chemical information. URL: http://chemdataextractor.org/. Recognizes drug names, associated properties, measurements, and relationships in biomedical literature.
DNorm: an open-source software tool for identifying disease names. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/dnorm/. Recognizes disease names in biomedical literature.
OSCAR4: an extensible system for annotation of chemistry. URL: https://bitbucket.org/wwmm/oscar4/wiki/Home. Recognizes chemical and drug names in biomedical literature.
GNormPlus: an integrative approach for tagging gene, gene family, and protein domains. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/. Recognizes gene and protein names in biomedical literature.
tmVar: a text mining approach for extracting sequence variants. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Recognizes mutation names in biomedical literature.
Whatizit: a text processing system for identifying molecular biology terms. URL: http://www.ebi.ac.uk/webservices/whatizit/info.jsf. Recognizes gene, chemical, and disease names in free text.
Finally, recent named entity recognition systems apply hybrid methods that combine machine learning with lexical features derived from dictionaries or rules, used mostly at the pre- and post-processing stages.
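A minimal sketch of the dictionary look-up approach is given below. The three entries are illustrative stand-ins for a lexicon that would normally be derived from resources such as DrugBank or RxNorm; matching is case-insensitive and longest-first so that multiword names win over their substrings:

```python
import re

# Toy drug dictionary mapping surface names to identifiers (here,
# DrugBank-style accession numbers used purely for illustration).
drug_dict = {"imatinib": "DB00619", "cetuximab": "DB00002", "aspirin": "DB00945"}

# Longest-first alternation avoids partial matches of shorter names.
pattern = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, drug_dict), key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def tag_drugs(text):
    """Return (mention, character offset, dictionary id) triples."""
    return [(m.group(0), m.start(), drug_dict[m.group(0).lower()])
            for m in pattern.finditer(text)]

print(tag_drugs("Cetuximab was compared with imatinib in this trial."))
```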
There have been a number of studies in which drug and chemical information extraction has been carried out. Swain et al. presented an innovative method that utilized multiple rule-based approaches for interpreting specific document domains for phrase parsing and chemical information extraction [16]. In the 2015 BioCreative V challenge (Critical Assessment of Information Extraction in Biology), our own tmChem, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble, achieved the best results [17].

As for other types of entities, Iyer et al. recognized both drug and event concepts from over 50 million clinical notes from two clinical sites and identified significant drug-drug-event associations [18]. They used 19 biomedical ontologies for building a lexicon and recognizing drug and event concepts in electronic health records. Han et al. proposed a novel committee-based active learning method that supports multi-event extraction tasks [19].

2.3 Relationship Identification

Relationship identification is an important process that detects semantic associations between (extracted) biomedical entities. Current research efforts have focused on recognizing associations between drugs and other entities such as proteins/genes, mutations, diseases, or adverse events. This has led to the development and publication of a considerable number of chemical-biomedical entity relation extraction approaches. Traditional methods for relationship identification are based on co-occurrence, pattern recognition, and/or rule-based approaches; all these models focus on the occurrence distribution of entities in textual data. Nowadays, co-occurrence and rule-based methods are often combined with machine learning techniques and can be used to identify relationships, such as mutations related to a particular disease, from the biomedical literature [20].
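The sketch below illustrates one simple way to combine co-occurrence with a rule: a candidate (drug, disease) pair found in the same sentence is kept only if a trigger word occurs in the text between the two mentions. The lexicons and trigger list are illustrative:

```python
import re

# Illustrative vocabularies; real systems use curated lexicons.
drugs = {"cetuximab", "irinotecan"}
diseases = {"colorectal cancer", "neutropenia"}
triggers = re.compile(r"\b(treats?|induced|associated with)\b", re.IGNORECASE)

def extract_pairs(sentence):
    """Keep same-sentence (drug, disease) pairs with a trigger in between."""
    s = sentence.lower()
    pairs = []
    for drug in drugs:
        for disease in diseases:
            i, j = s.find(drug), s.find(disease)
            if i < 0 or j < 0:
                continue
            if triggers.search(s[min(i, j):max(i, j)]):
                pairs.append((drug, disease))
    return pairs

print(extract_pairs("Cetuximab effectively treats metastatic colorectal cancer."))
```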
For drug-disease relationships hidden in open-access literature, Xu et al. used a structured support vector machine (SSVM) approach to classify both sentence-level and document-level candidate drug-disease pairs with a corpus of 1,500 PubMed abstracts [21]. This approach achieved the best performance on the chemical-induced disease relation extraction subtask in the 2015 BioCreative V Chemical Disease Relation (CDR) track.
In another study, Sohn et al. detected associations between
drugs and adverse side effects. They constructed a rule-based
approach using manually developed patterns and extracted sen-
tences detected by this approach as training data [22]. They estab-
lished “side effect” keywords as features to build a machine
learning-based “adverse effect” sentence classifier.
2.4 Knowledge Graph Construction

Knowledge graph construction is the final step in biomedical text mining. A knowledge graph is a structured semantic knowledge base that represents knowledge in graphical form. Each knowledge graph is comprised of nodes representing entities, attributes associated with each node, and edges demarcating unidirectional or multidirectional relationships between nodes. Knowledge graphs have become the foundation of automatic semantic retrieval in modern web searches. The construction process can be described as a link prediction problem and divided into three layers according to the abstraction level of its input materials: the information extraction layer, the knowledge integration layer, and the knowledge processing layer. Commonly used approaches for knowledge graph construction include analysis of graph extraction, incorporation of ontology constraints and relational patterns, and discovery of statistical relationships within the knowledge graph.

Previous studies explored both molecular network databases and biomedical literature to create a knowledge base and coupled it with approaches such as machine learning, graph mining, and data visualization. Due to recent efforts, linked open data constitutes a large collection of datasets in standard formats and has become a new resource for knowledge discovery. Dalleau et al. integrated a set of linked data associated with pharmacogenomics and defined pharmacogenes in a large Resource Description Framework (RDF) graph by using machine learning models [23]. Three distinct types of entities (gene, phenotype, and drug) along with their corresponding relationships were depicted by the knowledge graph.
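A minimal sketch of such a gene-drug-phenotype graph, built with the Python rdflib library, is shown below; the namespace and the triples themselves are illustrative rather than taken from an actual linked-data source:

```python
from rdflib import Graph, Namespace

# Illustrative namespace; real pharmacogenomics graphs reuse shared vocabularies.
EX = Namespace("http://example.org/pgx/")

# Nodes are entities (gene, drug, phenotype); edges are typed relationships.
g = Graph()
g.add((EX.CYP2D6, EX.metabolizes, EX.codeine))
g.add((EX.codeine, EX.treats, EX.pain))
g.add((EX.CYP2D6, EX.associatedWith, EX.poor_metabolizer_phenotype))

# Link prediction over such a graph asks which unasserted edges are likely;
# here we simply enumerate the asserted triples.
for s, p, o in g:
    print(s.split("/")[-1], p.split("/")[-1], o.split("/")[-1])
```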

3 Linking Text Mining to Drug Discovery

With the rapid accumulation and distribution of new biomedical


publications, databases and clinical records constitute valuable
resources for facilitating and accelerating drug discovery. Com-
bined with the generation and application of text mining techni-
ques, one can derive information like drug targets,
pharmacogenomics relationships, drug-disease or drug-mutation
relationships, and drug side effects from relevant documents.
Although not perfect, text mining provides relatively reliable results
while being able to process large amounts of textual data in a way
that no manual curation efforts could ever hope to manage [24].

3.1 Materials

Many biomedical resources are available for drug discovery using text mining approaches. In recent years, open-access biomedical resources have become increasingly easier and cheaper to acquire, with the majority of them being textual. Often, valuable information is encoded in neither structured nor classified forms, thus necessitating the application of text mining strategies for both the synthesis and analysis of useful information. Generally, we can divide open-access textual data into clinical textual data and biomedical literature. Table 2 illustrates various examples of open-access textual data resources that can be used for drug discovery.

Table 2
Open-access textual data resources for drug discovery

Clinical textual data
  MIMIC-III (https://mimic.physionet.org/): a large, publicly available database of critical care units from the Beth Israel Deaconess Medical Center
  NHS England (https://www.england.nhs.uk/): an executive non-departmental public body (NDPB) of the Department of Health
  PCORnet (http://www.pcornet.org/): a large, highly representative national network of clinical data and health information

Biomedical literature
  MEDLINE (https://medlineplus.gov/): a publicly available bibliographic database of life science and biomedical information
  PMC (https://www.ncbi.nlm.nih.gov/pmc/): a free full-text archive of biomedical and life sciences journal literature
  DOAJ (https://doaj.org/): a community-curated online directory that indexes and provides access to high-quality, open-access, peer-reviewed journals

3.1.1 Clinical Textual Data

With the ongoing acquisition of clinical textual data in medicine, such as practice guidelines, clinical notes, and electronic health records (EHRs), large amounts of data exist as potential resources for drug discovery. For example, mining electronic health records has the potential to establish new patient-stratification principles and to reveal unknown drug correlations [25]. This approach has become increasingly feasible as more and more clinical databases are free of access restrictions.
1. MIMIC-III: Medical Information Mart for Intensive Care III
(MIMIC-III) is a large, freely available database comprising of
patient information from all critical care units at a large tertiary
care hospital [26]. This database contains information such as
clinical notes, vital signs, medication lists, laboratory values,
procedure codes, diagnostic codes, imaging reports, hospital
length of stay, and survival data. To access MIMIC-III one
must formally request access via a process documented on the website, including the completion of a recognized course and signing of a data use agreement.
2. NHS England: The National Health Services (NHS England:
https://www.england.nhs.uk/) is an executive
non-departmental public body (NDPB) of the Department of
Health, developed by one of the largest repositories of data on
human heath in the world. They have published a variety of
source databases with publicly released information, often from
the government or other public organizations, ranging from
patient surveys to public health outcomes.
3. PCORnet: As clinical data from different organizations all over
the world is being acquired, the need for effective use of this
information, including interoperability, has become a critical
problem. Launched in 2013, the National Patient-Centered
Clinical Research Network (PCORnet) aims to address this
problem by integrating data from multiple Clinical Data
Research Networks (CDRNs) and Patient-Powered Research
Networks (PPRNs) [27]. PCORnet collects data routinely
generated in a variety of healthcare settings including hospitals,
outpatient clinics, and urgent care centers. By engaging a vari-
ety of stakeholders—patients, families, providers, and research-
ers—PCORnet empowers individuals and organizations to use
this data to answer practical questions in order to make
informed healthcare decisions. It supports an effective, sustain-
able national research infrastructure that allows for the use of
electronic health data in comparative research.

3.1.2 Biomedical Literature

Biomedical literature is another important textual resource essential for users such as researchers, healthcare professionals, and patients. Various databases exist that compile large amounts of biomedical literature.
1. MEDLINE: Medical Literature Analysis and Retrieval System
Online (MEDLINE or MEDLARS Online: https://
medlineplus.gov/) is a publicly available bibliographic database
of life science and biomedical information, which contains
more than 27 million articles from 1950 to present. It includes
bibliographic information of articles from academic journals
covering medicine, nursing, pharmacy, dentistry, veterinary
medicine, and public health. MEDLINE uses Medical Subject
Headings for information retrieval.
2. PMC: PubMed Central (PMC: https://www.ncbi.nlm.nih.
gov/pmc/) is another free digital repository developed by
the National Library of Medicine (NLM), which archives pub-
licly accessible full-text scholarly articles that have been pub-
lished within biomedical and life science journals. As of June
2017, it contained over 4.3 million articles and continues to grow rapidly each day.
3. DOAJ: The Directory of Open Access Journals (DOAJ: https://doaj.org/) is a community-curated online directory that indexes and provides access to high-quality, open-access, peer-reviewed journals. As of June 2017, it contained over 2.5 million articles from 126 countries. The aim of the DOAJ is to increase the visibility and accessibility of open-access scientific and scholarly journals, thereby promoting their increased usage and overall impact.

3.2 Methods

There are multiple approaches to text mining for drug discovery, each relying on the fundamental principles of information retrieval, named entity extraction, and relationship identification. Extracting disease and gene mutation relationships is important for identifying individual variability in diseases and developing drugs that target each of these mutations. Understanding how drugs behave in different genetic backgrounds can determine either their potential benefit or harm to a specific patient population. This section provides the methodology behind each application of text mining for drug discovery and reviews the most up-to-date research in this field.

3.2.1 Extracting Disease-Mutation Relationships for Precision Medicine

The identification of disease and gene mutation correlations is an important strategy for drug discovery in precision medicine, as it takes into account individual variability in disease prevention and treatment. In the case of a genetic disease like cancer, the genomic diversity and instability of cancer cells become major determinants that enable them to acquire malignant or characteristic traits. Previous studies have shown that tumor response to a drug ultimately depends on the steps of oncogenesis, tumor heterogeneity, and the oncogenic evolution of resistant clones within the tumor. The success of a cancer drug today is fundamentally dependent on the drug company's ability to identify target genes that control tumorigenic pathways [28].
Various mutations have been reported in recent studies and
many databases like ClinVar [29], Online Mendelian Inheritance
in Man [30], COSMIC [31], and GWAS Catalog [32] are available
for locating manually curated disease and gene mutation associa-
tions; however, the majority of associations still remain buried in
unstructured text of biomedical publications or electronic medical
records. In this instance, text mining methods can be employed to
extract such relationships. Many text mining approaches such as
simple co-occurrence, pattern matching, and machine learning are regularly used for disease-mutation extraction. PolySearch
is a search-based text mining tool that infers relationships between
mutations and diseases based on their frequency of co-occurrence
in MEDLINE abstracts [33]. Similarly, Mutation Extraction from


Medline Abstracts (MEMA) uses a word distance metric to select
the correct protein-mutation pairs based on sentence
co-occurrence [34]. Doughty et al. developed the Extractor of
Mutations (EMU) tool that provides a semiautomated approach
to extract disease-related mutations from PubMed’s abstracts and
full text [35]. This tool is a rule-based method that finds mutations
in a given document using regular expression matching, correlates
them to their associated genes, and finally couples them to related
diseases.
The use of machine learning in text mining is emerging as a pivotal technology for disease-mutation relationship extraction. Our lab (Singhal et al.) developed a machine learning classification approach to automatically identify disease-mutation relationships from the biomedical literature [20]. The steps involved in the development of this approach are listed below:
1. Named entity extraction: GNormPlus [36], tmVar [37], and
DNorm [14] were used to extract gene, mutation, and disease
annotations from PubMed abstracts, respectively.
2. Feature construction: six features were used for the machine
learning model including:
(a) Nearness to Target Disease score: denotes a cumulative
score of all mentions, in which the target disease was
closest to the mutation.
(b) Target Disease Frequency score: frequency count of target
disease mentions in the text.
(c) Other Disease Frequency score: frequency count of the
next most frequent disease mention, other than the target
disease.
(d) Same Sentence Disease-Mutation Co-occurrence score
(DMCS): binary score calculated by the co-occurrence
of the mutation and its nearest disease in the same sen-
tence. The DMCS is 1 if both are mentioned within the
same sentence and 0 if not.
(e) Within Text Sentiment score: The text between the mutation and the nearest disease mention is extracted and labeled the “within text.” The sentiment score is based on the polarity of the words contained in the “within text,” with a range from −1 (negative sentiment) to +1 (positive sentiment).
(f) Test Subjectivity score: provides an estimate of the reliability of the sentiment score, with a range from 0 (highly objective) to 1.0 (highly subjective).
3. Training the machine learning classification model: Next, they trained a decision tree classifier with a pre-labeled dataset containing manually curated disease and mutation entities and their associated relationships (a minimal sketch follows below).
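The sketch computes two of the six features described above (the same-sentence co-occurrence score and the target disease frequency) for a toy text, and fits a scikit-learn decision tree on made-up six-dimensional feature vectors; all values and labels are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

def dmcs(sentence, mutation, disease):
    """Same Sentence Disease-Mutation Co-occurrence score (binary)."""
    return 1 if mutation in sentence and disease in sentence else 0

def target_disease_frequency(text, disease):
    """Frequency count of target disease mentions in the text."""
    return text.lower().count(disease.lower())

text = "The V600E mutation is frequent in melanoma. Melanoma patients vary."
print(dmcs(text.split(".")[0], "V600E", "melanoma"),
      target_disease_frequency(text, "melanoma"))

# Train on six-dimensional vectors: [nearness, target disease freq,
# other disease freq, DMCS, sentiment, subjectivity]. Toy data only.
X = [[3.0, 5, 1, 1, 0.4, 0.6],
     [0.5, 1, 4, 0, -0.2, 0.9]]
y = [1, 0]  # 1 = true disease-mutation relationship
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[2.5, 4, 1, 1, 0.3, 0.5]]))
```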
Our results showed that the model obtained F-measures of
0.880 and 0.845 for prostate and breast cancer mutations, respec-
tively. Similarly, Komandur et al. developed and evaluated a text
mining system MutD, which extracts protein-mutation-disease
associations from MEDLINE abstracts by incorporating discourse
level analysis [38]. They first used publicly available terminology
sources, as well as text mining systems like the BioTagger-GM [39],
MutationFinder [40], PubTator [41], and GeNo [42] to imple-
ment entity recognition and normalization. Further graph models
were used for extra-sentential processing to associate entities across
multiple sentences. Overall, the F-measure of MutD in association
detection reaches 0.815. In a similar method, Mahmood et al.
developed a text mining system, DiMeX, which uses information
extraction techniques, in addition to co-occurrence, to capture the
mutation-disease relationship from publication abstracts [43].
DiMeX consists of a series of natural language processing modules
that preprocess input text and apply syntactic and semantic patterns
to extract mutation-disease associations.
An additional factor to consider is the pathway leading up to
the gene mutation (either hereditary or somatic). Let us return to
our example of cancer and review the pathway of tumorigenesis, as
it is important for predicting the effectiveness of a drug for a specific
type of cancer. For instance, knowledge of both embryological
characteristics of the organ site and genomic alterations during
drug administration are useful for the development of an effective
treatment. The drug cetuximab (Erbitux®), an epidermal growth
factor receptor (EGFR) monoclonal antibody used in the treatment
of colorectal cancer (CRC) with the upregulation of EGFR, is
rendered ineffective in the presence of a specific mutation in the
KRAS protein. This protein is a downstream node in the EGFR
pathway of the tumor [44].

3.2.2 Extracting Pharmacogenomics Relationships

Pharmacogenomics (PGx) is the study of how genetic variants may affect drug efficacy and toxicity. This information is important for drug discovery, especially for precision medicine. Substantial research efforts using both manual and computational approaches have been conducted to develop PGx databases. For instance, the Pharmacogenomics Knowledge Base (PharmGKB) is the largest manually curated resource for information regarding the impact of genetic variation on drug response [45]. In recent years, manual curation efforts have been assisted by computational approaches due to high labor costs and time constraints.

Text mining strategies can be used to extract novel pharmacogenomics data using information from publicly available datasets as prior knowledge for computational inference. A myriad of
databases, including PharmGKB [45], DrugBank (over 10,500


drug entries with 4,772 nonredundant protein sequences linked
to each drug) [9], Entrez Gene [46], and dbSNP [47], are available
for locating available drug and genetic data.
For instance, Pakhomov et al. used data from the PharmGKB
database to train a support vector machine learning model for
MEDLINE abstracts, which efficiently identified drug-gene targets
[48]. Let us review their methodology by using the baseline bio-
medical text mining pipeline discussed in Section 2:
1. Named Entity Recognition: PharmGKB was used to extract
822 drugs and 2,247 genes.
2. Relationship identification: manually curated drug and gene
relationships labeled in PharmGKB as either “Related” (i.e.,
“Related” or “Positively Related”) or “Unrelated” (i.e., “Neg-
atively Related,” “Discussed,” or “Not Related”) were
extracted. The resulting database consists of 9,317 instances
of drug-gene pairs and the MEDLINE abstracts in which these
pairs occurred.
3. Feature extraction:
(a) They explored the use of lexical features in a supervised
learning approach to label the drug-gene pairs as either
“Related” or “Unrelated” using a support vector machine. Unigrams (single words) and bigrams (two-word sequences) were used; however, larger n-grams may be appropriate for larger datasets.
(b) Next, all drug and gene names found in the PharmGKB
drug-gene pairs extracted in step 2 were excluded from
modeling, in order to make it context independent. The
goal is to be able to apply the resulting model prospec-
tively to any drug-gene pair.
(c) A conservative frequency cutoff of two was used for both
unigrams and bigrams.
4. Feature selection: Waikato Environment for Knowledge Anal-
ysis (WEKA)’s Information Gain feature [49] was used for
feature selection. The Information Gain feature is used in
machine learning to determine which set of features to use
for categorization. For example, the information gain from a
single word is zero if there is no contribution of this word
(or feature) in determining whether the drug-gene pair is in
fact related or unrelated. Only features with positive informa-
tion gains were used.
(a) Other more sophisticated feature selection methods may
be used for better classification results.
5. Results: A support vector machine classifier trained on MEDLINE abstracts, with single words used as features and PharmGKB relationships used for supervision, achieved an overall sensitivity of 85% and specificity of 69%.
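The following scikit-learn sketch mirrors that setup on toy data: unigram and bigram lexical features over abstracts in which the drug and gene names have already been masked (literally replaced by DRUG and GENE here), with a linear SVM as the classifier. The study's frequency cutoff of two is relaxed to one only because this toy corpus is tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy "abstracts" with drug/gene mentions masked for context independence.
abstracts = [
    "DRUG inhibited GENE expression and reduced tumor growth.",
    "GENE polymorphisms altered the response to DRUG therapy.",
    "DRUG was administered orally; GENE was not discussed further.",
    "No association between DRUG exposure and GENE variants was found.",
]
labels = [1, 1, 0, 0]  # 1 = "Related", 0 = "Unrelated"

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),  # unigram + bigram features
    LinearSVC(),
)
model.fit(abstracts, labels)
print(model.predict(["GENE expression changed after DRUG treatment."]))
```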
Another study that utilized PharmGKB’s database was by Xu
et al., in which they developed a conditional approach to extract
PGx-specific drug-gene pairs from MEDLINE abstracts using
known drug-gene pairs available in PharmGKB as prior knowledge
[50]. They used 20 million MEDLINE abstracts as the text corpus,
along with known drug and gene lexicons to constitute a search
engine. A few known PGx-specific drug-gene pairs were used as
seeds to start the extraction process.
In another study, Hakenberg et al. developed SNPshot, an
information retrieval system that mines textual data from over
180,000 PubMed abstracts for genotype-phenotype associations
[51]. This system utilizes heterogeneous resources and cross-
references associations between drugs, genes, and diseases with
EntrezGene [46], PharmGKB [52], PubChem [10], DrugBank
[9], and dbSNP [47]. This system achieved a performance of
85–92% precision for recognition of entities (gene, drug, and dis-
ease) and 79–83% for relationships between entities.
Ensemble methods that combine both co-occurrence and rule-
based methods with machine learning applications have been
implemented for relationship extraction in pharmacogenomics.
Chang et al. used co-occurrence to identify drug-gene pairs and
then utilized a supervised machine learning algorithm to classify the
relationship between each pair in PharmGKB into one of five classes
defined by researchers [52].
Advances in pharmacogenomics research and NLP have
brought about the development of tools to help revolutionize
precision medicine for clinicians. One such tool, electronic PGx
assistant (ePGA), was developed by Lakiotaki et al. as an integrated
information system that provides personalized drug recommenda-
tions to clinicians based on knowledge of drug-gene-phenotype
associations [53]. This system extracts these relationships by utiliz-
ing various data sources including PharmGKB [45], dbSNP [47],
Affymetrix annotations [54], PubMed, etc. and provides real-time
decision-making support to clinicians.

3.2.3 Mining Drug Targets

Extracting drug targets from publicly available datasets and biomedical literature is another useful approach in text mining for drug discovery. Multiple types of interactions between drugs and their targets have been identified in recent studies. Traditional in silico methods utilize machine learning approaches, such as classification models (Ding et al.) [55] and rule-based inference methods (Cheng et al.) [56], to predict drug-target associations. Similarity-based methods examine associations between drug-drug and target-target pairs and use these relationships for weighting potential associations. This includes similarities between chemical
structures, genomic sequences, ligand-based models, and pharmacological features.
Network analysis has also shown to be a useful strategy in
predicting drug-target associations from textual data. Chen et al.
used semantic methods to assess drug-target associations with the
association score calculated by a statistical model that takes into
consideration the topology and semantics of the neighboring
words between a drug and a target [57]. The following steps
outline their process:
1. Semantic drug-target network building:
(a) Chem2Bio2RDF is a Resource Description Framework
(RDF)-based repository that compiles data from 17 differ-
ent public chemogenomic data sources [58]. Drug-target
interactions, chemical similarity data, target similarity
data, and chemical target interactions were extracted
from the Chem2BioRDF dataset. A total of around
290,000 nodes and 720,000 edges were extracted.
(b) Each node and edge was semantically annotated using the
Chem2Bio2OWL ontology [59]. This framework
describes the semantics of chemical compounds, drugs,
proteins, genes, diseases, pathways, side effects, and their
corresponding relationships.
(c) To give a general idea, similarity relationships were identified by the following:
    • Compound similarity was determined if two compounds shared the same substrates, side effects, or chemical ontology entities.
    • Target similarity was determined if they share the same ligands, gene ontology, or are located within the same functional pathway.
2. Path finding and statistical model:
(a) A heap-based Dijkstra algorithm was employed to deter-
mine the path between two nodes. Path patterns were
identified as paths of nodes and edges that share the
same semantics but have different data.
(b) Only paths with a length of ≤3 edges were utilized.
3. Semantic link association prediction (SLAP) model:
(a) Missing links within the data network were predicted
based on the topology (e.g., similar shortest paths and
certain number of neighbors) by a semantic link associa-
tion prediction model.
(b) An association score was calculated using the distance and
weight of each edge between nodes to determine the
significance of the relationship.
4. Pattern importance:
(a) Low p-values between drug-target pairs suggest a strong
probability of association between the drug and target.
However, this association may not be meaningful biologi-
cally. Each pattern was then assessed as a feature and its
resulting ability to identify other drug-target pairs from
the set. Patterns with high ROC scores were deemed informative, and their respective drug-target pairs were considered likely to interact.
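A minimal sketch of the path-finding step (step 2 above), using the networkx implementation of Dijkstra's algorithm on a toy semantic network, is shown below; the nodes, edge types, and weights are illustrative:

```python
import networkx as nx

# Toy semantic network: typed, weighted edges between drugs, compounds,
# and targets. Real networks are built from repositories like Chem2Bio2RDF.
G = nx.Graph()
G.add_edge("drugA", "compoundX", relation="similar_to", weight=1.0)
G.add_edge("compoundX", "targetP", relation="binds", weight=1.0)
G.add_edge("targetP", "targetQ", relation="same_pathway", weight=1.0)
G.add_edge("drugA", "targetQ", relation="binds", weight=2.0)

# Shortest weighted path between a drug and a candidate target.
path = nx.dijkstra_path(G, "drugA", "targetQ", weight="weight")
if len(path) - 1 <= 3:  # keep only short paths (<= 3 edges)
    pattern = [G.edges[u, v]["relation"] for u, v in zip(path, path[1:])]
    print(path, pattern)
```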

One major challenge of topology-based network analysis is its


inability to take into consideration similarities between nodes of the
same entity (i.e., drug-drug or target-target associations). In
addressing this problem, Zong et al. first utilized similarity-based
methods in network analysis for predicting drug-target associations
[60]. By using deep learning (unsupervised feature learning), they
were able to extract features of vertices in the network and compute
not only the topology between drug nodes and target nodes but also between vertices of the same entity. The resulting similarity mea-
sures between drug-drug and target-target similarities were used as
input for predicting drug-target associations using a “guilt-by-
association” principle [60].

3.2.4 Identifying Drug Adverse Effects

One of the difficulties in the drug discovery process is managing adverse drug effects, which restrict the clinical use of otherwise effective drugs. This has been the leading cause of failure for new drugs in clinical trials. In recent studies, text mining approaches have been used in the identification of drug side effects.

For example, in Xu's 2014 study, an automated learning approach was presented to accurately extract drug-side effect pairs from the vast amount of published biomedical literature [61]:
1. Materials: This study used 119 million MEDLINE sentences
and their corresponding parse trees as the text corpus.
2. Known drug-side effect pairs extraction: 100,000 known drug
side effect pairs were derived from FDA drug labels; this
includes 996 FDA-approved drugs and 4,199 adverse event
terms. These drug-side effect pairs were used as prior knowl-
edge to extract relevant sentences and parse trees.
3. Lexicon building: Disease, side effect (SE), and drug lexicons
were developed. The disease lexicon was built by combining all
disease terms in UMLS [7] and Human Disease Ontology
[62]. Side effect lexicons were based on the Medical Dictionary
for Regulatory Activities (MedDRA) [63]. The drug lexicon
was subsequently downloaded from DrugBank [9].
4. Drug-side effect relationship extraction: The drug-side effect
extraction system consists of four parts: (1) pattern extraction,
(2) pattern ranking and selection, (3) pair extraction, and (4) pair ranking:
(a) Pattern extraction: Syntactical patterns associated with known drug-side effect pairs were extracted.
    • Sentences that contain any known drug-SE pair (step 2) were extracted along with their corresponding parse trees from the MEDLINE sentences obtained in step 1. For the example sentence “irinotecan-induced neutropenia,” the extracted pattern would be “DRUG-induced SE.”
    • An extra requirement was imposed in which both side effect and drug terms must be noun phrases. This was used to exclude partial or incorrect drug-side effect pairs.
(b) Pattern ranking and selection: The extracted patterns were
ranked by the number of their associated known drug-side
effect pairs. Patterns with high recall were ranked higher,
with the resulting patterns manually selected or excluded
to ensure high precision. This manual selection process
took less than 15 min as the pattern ranking algorithm
effectively ranked many drug-SE patterns with both high
recall and precision.
(c) Pair extraction: The resulting patterns were then applied
to the MEDLINE sentences obtained in step 1. Drug-side
effect pairs that were not included in step 2 were deemed
new drug-side effect pairs discovered.
(d) Pair ranking: The extracted drug-SE pairs were then
ranked based on their associated pattern scores and on
their co-occurrence frequencies within the entire MED-
LINE corpus.
5. Results: This model achieved a precision of 0.833, recall of
0.407, and an F1 of 0.545.
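A stripped-down version of the pattern step can be written as a regular expression; the two lexicons below are illustrative stand-ins for the DrugBank- and MedDRA-derived lexicons described above:

```python
import re

# Illustrative lexicons standing in for DrugBank (drugs) and MedDRA (SEs).
drug_lexicon = {"irinotecan", "cetuximab"}
se_lexicon = {"neutropenia", "diarrhea", "rash"}

# The "DRUG-induced SE" syntactic pattern from the text.
pattern = re.compile(r"\b(\w+)-induced (\w+)\b", re.IGNORECASE)

def extract_drug_se(sentence):
    """Return (drug, side effect) pairs matching 'DRUG-induced SE'."""
    pairs = []
    for drug, se in pattern.findall(sentence):
        if drug.lower() in drug_lexicon and se.lower() in se_lexicon:
            pairs.append((drug.lower(), se.lower()))
    return pairs

print(extract_drug_se("Severe irinotecan-induced neutropenia was reported."))
```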
Limtox (Literature Mining for Toxicology) is another text
mining application, which was developed by Cañada et al. as an
online biomedical search tool for adverse drug events affecting the
hepatobiliary system [64]. Here, they integrated multiple methods
including co-occurrence algorithms of chemicals with hepatotoxic-
ity terms and SVM algorithms to train machine learning-based
abstract and sentence classifiers. Through Limtox, one can search
chemical compounds/drugs and genes to predict any potential
adverse side effects.
Apart from publicly available literature, electronic medical
records are another valuable resource for mining drug side effects.
Iqbal et al. developed a natural language processing tool for mining
free-text EHRs containing psychiatry notes and identified instances
of adverse drug events [65]. Wang et al. proposed a systematic


classification model for identifying adverse drug effects from clinical
notes [66]. Briefly, they analyzed millions of clinical notes using
prior known drug usages and drug side effects and developed a
discriminative classifier based on statistical analysis to predict poten-
tial drug adverse events.
The Adverse Event Reporting System (AERS) from the US
Food and Drug Administration (FDA) is another valuable resource
for identifying adverse drug effects. Takarabe et al. developed a new
method to identify unknown drug-target interactions (DTIs) using
an algorithm that predicts similarity scores based on drug side
effects reported [67]. Side effect keywords were retrieved from
the AERS database, in which 2.9 million reports containing over
290,000 drugs were extracted. This study demonstrated that
unknown DTIs could be predicted using similarities between
drug side effect profiles.

4 Conclusion

In summary, in silico drug discovery utilizes text mining


approaches that integrate multiple advanced technologies, includ-
ing natural language processing, machine learning, and semantic
web technologies with the curated knowledge of domain experts.
Various mathematical models and computational tools, along with strategies such as extracting mutations associated with different drugs and understanding how drugs behave in different genetic backgrounds, have been used to develop text mining-based approaches for drug discovery. Compared with traditional drug discovery methods, text mining has shown superiority in handling large amounts of data in a cost-effective and time-efficient manner.
To date, many researchers are continuing to develop efficient tools
and technologies for handling large volumes of drug information,
while maintaining high accuracy rates and stability. Although text
mining techniques show immense potential in the field of drug
discovery, many limitations still exist. The integration, mainte-
nance, and sharing of complex information from different plat-
forms and organizations still proves to be a big challenge.
Ongoing research efforts have been actively addressing these pro-
blems by developing new interoperable databases and platform
systems to allow for more effective sharing of data. As more and
more data becomes publicly available, text mining approaches will
become an integral tool in the discovery of novel drugs.
Acknowledgments

This research was supported by the NIH Intramural Research


Program, National Library of Medicine, and the NIH Medical
Research Scholars Program, a public-private partnership supported
jointly by the NIH and generous contributions to the Foundation
for the NIH from the Doris Duke Charitable Foundation, the
Howard Hughes Medical Institute, the American Association for
Dental Research, the Colgate-Palmolive Company, and other pri-
vate donors. No funds from the Doris Duke Charitable Foundation
were used to support research that used animals. This work was also
supported by the National Natural Science Foundation of China
(Grant No. 81601573), the National Key Research and Develop-
ment Program of China (Grant No. 2016YFC0901901), the
National Population and Health Scientific Data Sharing Program
of China, and the Knowledge Centre for Engineering Sciences and
Technology (Medical Centre) and the Key Laboratory of Knowl-
edge Technology for Medical Integrative Publishing.

References
1. Reichert JM (2003) Trends in development and approval times for new therapeutics in the United States. Nat Rev Drug Discov 2(9):695–702. https://doi.org/10.1038/nrd1178
2. Woodcock J, Woosley R (2008) The FDA critical path initiative and its influence on new drug development. Annu Rev Med 59:1–12. https://doi.org/10.1146/annurev.med.59.090506.155819
3. Claus BL, Underwood DJ (2002) Discovery informatics: its evolving role in drug discovery. Drug Discov Today 7(18):957–966
4. Percha B, Garten Y, Altman RB (2012) Discovery and explanation of drug-drug interactions via text mining. Pac Symp Biocomput:410–421
5. Huang CC, Lu Z (2016) Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation. Database (Oxford) 2016. https://doi.org/10.1093/database/baw025
6. Kraus M, Niedermeier J, Jankrift M, Tietbohl S, Stachewicz T, Folkerts H, Uflacker M, Neves M (2017) Olelo: a web application for intuitive exploration of biomedical literature. Nucleic Acids Res. https://doi.org/10.1093/nar/gkx363
7. Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267–D270. https://doi.org/10.1093/nar/gkh061
8. Mattingly CJ, Colby GT, Forrest JN, Boyer JL (2003) The comparative Toxicogenomics database (CTD). Environ Health Perspect 111(6):793–795
9. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue):D668–D672. https://doi.org/10.1093/nar/gkj067
10. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH (2016) PubChem substance and compound databases. Nucleic Acids Res 44(D1):D1202–D1213. https://doi.org/10.1093/nar/gkv951
11. Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R (2011) Normalized names for clinical drugs: RxNorm at 6 years. J Am Med Inform Assoc 18(4):441–448. https://doi.org/10.1136/amiajnl-2011-000116
12. Krallinger M, Rabal O, Lourenco A, Oyarzabal J, Valencia A (2017) Information retrieval and text mining technologies for chemistry. Chem Rev 117(12):7673–7761. https://doi.org/10.1021/acs.chemrev.6b00851
13. Leaman R, Gonzalez G (2008) BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput:652–663
14. Leaman R, Islamaj Dogan R, Lu Z (2013) DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 29(22):2909–2917. https://doi.org/10.1093/bioinformatics/btt474
15. Leaman R, Lu Z (2016) TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics 32(18):2839–2846. https://doi.org/10.1093/bioinformatics/btw343
16. Swain MC, Cole JM (2016) ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J Chem Inf Model 56(10):1894–1904. https://doi.org/10.1021/acs.jcim.6b00207
17. Leaman R, Wei CH, Lu Z (2015) tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminform 7(Suppl 1):S3. https://doi.org/10.1186/1758-2946-7-S1-S3
18. Iyer SV, Harpaz R, LePendu P, Bauer-Mehren A, Shah NH (2014) Mining clinical text for signals of adverse drug-drug interactions. J Am Med Inform Assoc 21(2):353–362. https://doi.org/10.1136/amiajnl-2013-001612
19. Han X, Kim JJ, Kwoh CK (2016) Active learning for ontological event extraction incorporating named entity recognition and unknown word handling. J Biomed Semantics 7:22. https://doi.org/10.1186/s13326-016-0059-z
20. Singhal A, Simmons M, Lu Z (2016) Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature. J Am Med Inform Assoc 23(4):766–772. https://doi.org/10.1093/jamia/ocw041
21. Xu J, Wu Y, Zhang Y, Wang J, Lee HJ, Xu H (2016) CD-REST: a system for extracting chemical-induced disease relation in literature. Database (Oxford) 2016. https://doi.org/10.1093/database/baw036
22. Sohn S, Kocher JP, Chute CG, Savova GK (2011) Drug side effect extraction from clinical narratives of psychiatry and psychology patients. J Am Med Inform Assoc 18(Suppl 1):i144–i149. https://doi.org/10.1136/amiajnl-2011-000351
23. Dalleau K, Marzougui Y, Da Silva S, Ringot P, Ndiaye NC, Coulet A (2017) Learning from biomedical linked data to suggest valid pharmacogenes. J Biomed Semantics 8(1):16. https://doi.org/10.1186/s13326-017-0125-1
24. Singhal A, Leaman R, Catlett N, Lemberger T, McEntyre J, Polson S, Xenarios I, Arighi C, Lu Z (2016) Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database (Oxford) 2016. https://doi.org/10.1093/database/baw161
25. Jensen PB, Jensen LJ, Brunak S (2012) Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet 13(6):395–405. https://doi.org/10.1038/nrg3208
26. Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG (2016) MIMIC-III, a freely accessible critical care database. Sci Data 3:160035. https://doi.org/10.1038/sdata.2016.35
27. Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS (2014) Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc 21(4):578–582. https://doi.org/10.1136/amiajnl-2014-002747
28. Dey N, Williams C, Leyland-Jones B, De P (2017) Mutation matters in precision medicine: a future to believe in. Cancer Treat Rev 55:136–149. https://doi.org/10.1016/j.ctrv.2017.03.002
29. Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Hoover J, Jang W, Katz K, Ovetsky M, Riley G, Sethi A, Tully R, Villamarin-Salomon R, Rubinstein W, Maglott DR (2016) ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res 44(D1):D862–D868. https://doi.org/10.1093/nar/gkv1222
30. Hamosh A, Scott AF, Amberger J, Valle D, McKusick VA (2000) Online Mendelian inheritance in man (OMIM). Hum Mutat 15(1):57–61. https://doi.org/10.1002/(SICI)1098-1004(200001)15:1<57::AID-HUMU12>3.0.CO;2-G
31. Bamford S, Dawson E, Forbes S, Clements J, Pettett R, Dogan A, Flanagan A, Teague J, Futreal PA, Stratton MR, Wooster R (2004) The COSMIC (catalogue of somatic mutations in cancer) database and website. Br J Cancer 91(2):355–358. https://doi.org/10.1038/sj.bjc.6601894
32. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, Junkins H, McMahon A, Milano A, Morales J, Pendlington ZM, Welter D, Burdett T, Hindorff L, Flicek P, Cunningham F, Parkinson H (2017) The new NHGRI-EBI catalog of published genome-wide association studies (GWAS catalog). Nucleic Acids Res 45(D1):D896–D901. https://doi.org/10.1093/nar/gkw1133
33. Cheng D, Knox C, Young N, Stothard P, Damaraju S, Wishart DS (2008) PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res 36(Web Server issue):W399–W405. https://doi.org/10.1093/nar/gkn296
34. Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R, Casari G, Kirsch H (2004) Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res 32(1):135–142. https://doi.org/10.1093/nar/gkh162
35. Doughty E, Kertesz-Farkas A, Bodenreider O, Thompson G, Adadey A, Peterson T, Kann MG (2011) Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature. Bioinformatics 27(3):408–415. https://doi.org/10.1093/bioinformatics/btq667
36. Wei CH, Kao HY, Lu Z (2015) GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. Biomed Res Int 2015:918710. https://doi.org/10.1155/2015/918710
37. Wei CH, Harris BR, Kao HY, Lu Z (2013) tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 29(11):1433–1439. https://doi.org/10.1093/bioinformatics/btt156
38. Ravikumar KE, Wagholikar KB, Li D, Kocher JP, Liu H (2015) Text mining facilitates database curation - extraction of mutation-disease associations from biomedical literature. BMC Bioinformatics 16:185. https://doi.org/10.1186/s12859-015-0609-x
39. Torii M, Hu Z, Wu CH, Liu H (2009) BioTagger-GM: a gene/protein name recognition system. J Am Med Inform Assoc 16(2):247–255. https://doi.org/10.1197/jamia.M2844
40. Caporaso JG, Baumgartner WA Jr, Randolph DA, Cohen KB, Hunter L (2007) MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 23(14):1862–1865. https://doi.org/10.1093/bioinformatics/btm235
41. Wei CH, Kao HY, Lu Z (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res 41(Web Server issue):W518–W522. https://doi.org/10.1093/nar/gkt441
42. Wermter J, Tomanek K, Hahn U (2009) High-performance gene name normalization with GeNo. Bioinformatics 25(6):815–821. https://doi.org/10.1093/bioinformatics/btp071
43. Mahmood AS, Wu TJ, Mazumder R, Vijay-Shanker K (2016) DiMeX: a text mining system for mutation-disease association extraction. PLoS One 11(4):e0152725. https://doi.org/10.1371/journal.pone.0152725
44. Van Cutsem E, Kohne CH, Hitre E, Zaluski J, Chien CRC, Makhson A, D’Haens G, Pinter T, Lim R, Bodoky G, Roh JK, Folprecht G, Ruff P, Stroh C, Tejpar S, Schlichting M, Nippgen J, Rougier P (2009) Cetuximab and chemotherapy as initial treatment for metastatic colorectal cancer. N Engl J Med 360(14):1408–1417. https://doi.org/10.1056/NEJMoa0805019
45. Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE (2002) PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res 30(1):163–165
46. Maglott D, Ostell J, Pruitt KD, Tatusova T (2011) Entrez gene: gene-centered information at NCBI. Nucleic Acids Res 39:D52–D57. https://doi.org/10.1093/nar/gkq1237
47. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
48. Pakhomov S, McInnes BT, Lamba J, Liu Y, Melton GB, Ghodke Y, Bhise N, Lamba V, Birnbaum AK (2012) Using PharmGKB to train text mining approaches for identifying potential gene targets for pharmacogenomic studies. J Biomed Inform 45(5):862–869. https://doi.org/10.1016/j.jbi.2012.04.007
49. Witten IH, Frank E (2011) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann Publishers, San Francisco
50. Xu R, Wang Q (2013) A semi-supervised approach to extract pharmacogenomics-specific drug-gene pairs from biomedical literature for personalized medicine. J Biomed Inform 46(4):585–593. https://doi.org/10.1016/j.jbi.2013.04.001
51. Hakenberg J, Voronov D, Nguyen VH, Liang S, Anwar S, Lumpkin B, Leaman R, Tari L, Baral C (2012) A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions. J Biomed Inform 45(5):842–850. https://doi.org/10.1016/j.jbi.2012.04.006
52. Chang JT, Altman RB (2004) Extracting and characterizing gene-drug relationships from the literature. Pharmacogenetics 14(9):577–586
53. Lakiotaki K, Kartsaki E, Kanterakis A, Katsila T, Patrinos GP, Potamias G (2016) ePGA: a web-based information system for translational pharmacogenomics. PLoS One 11(9):e0162801. https://doi.org/10.1371/journal.pone.0162801
54. Dalma-Weiszhausz DD, Warrington J, Tanimoto EY, Miyada CG (2006) The Affymetrix GeneChip platform: an overview. Methods Enzymol 410:3–28. https://doi.org/10.1016/S0076-6879(06)10001-4
55. Ding H, Takigawa I, Mamitsuka H, Zhu S (2014) Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Brief Bioinform 15(5):734–747. https://doi.org/10.1093/bib/bbt056
56. Cheng F, Liu C, Jiang J, Lu W, Li W, Liu G, Zhou W, Huang J, Tang Y (2012) Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol 8(5):e1002503. https://doi.org/10.1371/journal.pcbi.1002503
57. Chen B, Ding Y, Wild DJ (2012) Assessing drug target association using semantic linked data. PLoS Comput Biol 8(7):e1002574. https://doi.org/10.1371/journal.pcbi.1002574
58. Chen B, Dong X, Jiao D, Wang H, Zhu Q, Ding Y, Wild DJ (2010) Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 11:255. https://doi.org/10.1186/1471-2105-11-255
59. Chen B, Ding Y, Wild DJ (2012) Improving integrative searching of systems chemical biology data using semantic annotation. J Cheminform 4(1):6. https://doi.org/10.1186/1758-2946-4-6
60. Zong N, Kim H, Ngo V, Harismendy O (2017) Deep mining heterogeneous networks of biomedical linked data to predict novel drug-target associations. Bioinformatics 33(15):2337–2344. https://doi.org/10.1093/bioinformatics/btx160
61. Xu R, Wang Q (2014) Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature. J Biomed Inform 51:191–199. https://doi.org/10.1016/j.jbi.2014.05.013
62. Schriml LM, Arze C, Nadendla S, Chang YW, Mazaitis M, Felix V, Feng G, Kibbe WA (2012) Disease ontology: a backbone for disease semantic integration. Nucleic Acids Res 40(Database issue):D940–D946. https://doi.org/10.1093/nar/gkr972
63. Brown EG, Wood L, Wood S (1999) The medical dictionary for regulatory activities (MedDRA). Drug Saf 20(2):109–117
64. Canada A, Capella-Gutierrez S, Rabal O, Oyarzabal J, Valencia A, Krallinger M (2017) LimTox: a web tool for applied text mining of adverse event and toxicity associations of compounds, drugs and genes. Nucleic Acids Res. https://doi.org/10.1093/nar/gkx462
65. Iqbal E, Mallah R, Jackson RG, Ball M, Ibrahim ZM, Broadbent M, Dzahini O, Stewart R, Johnston C, Dobson RJ (2015) Identification of adverse drug events from free text electronic patient records and information in a large mental health case register. PLoS One 10(8):e0134208. https://doi.org/10.1371/journal.pone.0134208
66. Wang G, Jung K, Winnenburg R, Shah NH (2015) A method for systematic discovery of adverse drug events from clinical notes. J Am Med Inform Assoc 22(6):1196–1204. https://doi.org/10.1093/jamia/ocv102
67. Takarabe M, Kotera M, Nishimura Y, Goto S, Yamanishi Y (2012) Drug target prediction using adverse event report systems: a pharmacogenomic approach. Bioinformatics 28(18):i611–i618. https://doi.org/10.1093/bioinformatics/bts413
Part IV

Clinical Informatics in Drug Discovery


Chapter 14

Big Data Cohort Extraction for Personalized Statin
Treatment and Machine Learning

Terrence J. Adam and Chih-Lin Chi

Abstract
The creation of big clinical data cohorts for machine learning and data analysis requires a number of steps
from the beginning to successful completion. Similar to data set preprocessing in other fields, there is an
initial need to complete data quality evaluation; however, with large heterogeneous clinical data sets, it is
important to standardize the data in order to facilitate dimensionality reduction. This is particularly
important for clinical data sets including medications as a core data component due to the complexity of
coded medication data. Data integration at the individual subject level is essential with medication-related
machine learning applications since it can be difficult to accurately identify drug exposures, therapeutic
effects, and adverse drug events without having high-quality data integration of insurance, medication, and
medical data. Successful data integration and standardization efforts can substantially improve the ability to
identify and replicate personalized treatment pathways to optimize drug therapy.

Key words Medication safety, Clinical data integration, Clinical comorbidity evaluation, Personalized
medication therapy

1 Introduction

Clinical big data cohorts provide data analysts with the capacity to
effectively evaluate clinical care, medication, demographic, and
health system factors which may impact clinical treatment plans
and patient outcomes. Comprehensive clinical databases can pro-
vide data representation on a variety of healthcare components. In
order to develop effective clinical big data cohorts, data integration
is generally required including health insurance coverage informa-
tion and medical and pharmacy claims data in order to provide
sufficient data representation to identify clinical events related to
medication use and personalized medication treatment.
Large clinical data cohorts provide an opportunity for research-
ers to ask meaningful questions which can be answered with the
available data for outcomes analysis and data exploration. As with
any data set, it is important to invest sufficient resources into
understanding the relative data strengths as well as weaknesses to
inform research approaches and subsequent analysis findings. In
this effort, it is important to initially identify any known issues with
a data set from the available published literature as well as from the
data supplier. Applications of a detective’s deductive reasoning skills
are needed to identify and sort out pertinent data issues around the
key outcomes and predictors. Once the unusual and problematic
data elements are resolved, then efforts can focus on developing the
data cohort content with a particular emphasis on data standardiza-
tion to facilitate dimensionality reduction and the interpretation of
the analytic findings.

2 Data Source Identification

The initial efforts in undertaking big data cohort development work
with a medication-focused question are to acquire and manage the
data in a setting supportive of the analytic question of interest. In
addition to the usual safeguards for protected health information
and human subject research requirements, there are additional
issues to address around data acquisition including completing
the needed data licensing and project planning. Depending on
the particular data set utilized, central management of data licens-
ing including a best practice checklist for data use and result report-
ing can help ensure regulatory requirements are met from
institutional and data vendor standpoints and provide a pertinent
resource for staff training. After setting up the analytic environment
and acquiring the data resources, basic quality assessment work will
need to be completed. In addition, the data should be analyzed for
typical data problems.

2.1 Common Data Problems

Clinical data focused on medication-related outcomes vary sub-
stantially in their relative quality and utility for advanced data
analysis. Since medications can be difficult to aggregate, identifica-
tion of the most standardized form of drug information available in
the data set provides a good starting point. In some cases, this may
only be a drug name. In other cases, drug dosing quantity, dosage
form, the clinical setting of use, and other data may be available.
With regard to drug names, initial data quality work should evalu-
ate nomenclature consistency. Since medications can be named by
generic or trade names, it is important to identify if the source data
contains one or both names. Frequently, both the trade and generic
medication names are included and ideally also include coded
medication information using either standard or nonstandard cod-
ing which may require data transformation to account for variations
in dosage forms, dosage quantity, drug class, and/or therapeutic
class.
For large population-level data sets, the available medication
nomenclature can be analyzed for content inclusion and complete-
ness. The initial analysis for medication coverage provides insights
on content completeness as well as potential coverage gaps. The
data can be benchmarked against available population-level data on
the distribution of medication usage. A data query at the name level
with frequency counts is a good starting point to identify abnorm-
alities in drug frequency distributions among well-known high-
frequency prescriptions. Comparisons with available reference
data on drug prescriptions can identify if the drug distributions fit
the “normal” pattern for the population of interest. Typical base-
line assessment can be completed with clinical summary data from
medication formularies for the population of interest or from other
market summaries, such as the top 200 or top 300 [1] drug lists,
similar estimates from ClinCalc [2], or other summary resources
derived from existing prescription data or national summary
data [3].
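As an illustration of this kind of baseline frequency check, the minimal sketch below uses Python with pandas; the file name and drug_name column are illustrative placeholders for the fields in a given claims extract.

```python
# Minimal sketch: name-level frequency counts for a pharmacy claims
# extract. The file and column names are illustrative.
import pandas as pd

claims = pd.read_csv("pharmacy_claims.csv")          # hypothetical extract
name_counts = (claims["drug_name"]
               .str.strip()
               .str.lower()                          # crude normalization
               .value_counts())
# Compare the head of this distribution against a top 200/top 300
# reference list to spot missing or implausibly frequent drugs.
print(name_counts.head(50))
```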

3 Normalization and Standardized Drug Coding for Medication Data Management

Normalization of drug names can address typical issues such as
spelling variation, free text entry errors, data truncation, nonstan-
dard nomenclature, and other medication data issues. Typically,
drug data should include a generic name to identify the medications
in human-readable form. In addition, drug data may also contain a
machine-readable identifier to manage mappings to other members
of the same medication class including the drug class and therapeu-
tic class. The ability to group similar drugs in the same drug class is
pertinent since they will have similar pharmacological action and
may share similar chemical characteristics. For example, the analyst
can expect to find similar effects among statin drugs, which share
the same drug class and act pharmacologically as 3-hydroxy-3-
methylglutaryl-coenzyme A reductase (HMG-CoA reductase)
inhibitors. Therapeutic classes can be more complex, as with the
antihyperlipidemic therapeutic class, where HMG-CoA reductase
inhibitors are one of the included drug classes, but the other
drug classes have different pharmacological and chemical
characteristics.
Computer-readable standard terminology will typically include
National Drug Codes (NDC), RxNorm, or Anatomical Therapeu-
tic Chemical Classification System (ATC) identifiers depending on
the data origin. Having these computer-readable codes can facili-
tate cross-mappings to other terminologies and drug databases.
Standardized coding can also facilitate drug data mapping to the
respective drug and therapeutic classes which is a nontrivial exercise
in a large data set. The available standard coding can also allow
linkage to commercial databases providing the capacity to identify
issues around drug classes, therapeutic classes, drug-drug interac-
tions and drug-disease contraindications, and other applications for
informational retrieval and exposure assessment. Proprietary or
local coding may also be present in a medication data source
requiring a cross-mapping to standardized codes for data manage-
ment and should be obtained as part of the data acquisition process
if it is available from the data provider.
The identification of standardized drug coding is ideally
provided in the data documentation. If unavailable, the data char-
acteristics of the coding should make it apparent. For data mapped
to NDC codes, there will be a ten-digit code with three compo-
nents including the drug labeler or manufacturer, the product, and
trade package size [4]. The drug labeler or manufacturer identifies
the creator of the medication or repackaged product for distribu-
tion and is typically represented as a four- or five-digit code. The
product code identifies the drug strength, dosage form, and for-
mulation and will typically be a three- or four-digit code. The final
part of the NDC is the package code which identifies the type of
packaging and amount of medication in the package. This section
will be a one- or two-digit code. The ten-digit NDCs will generally
be one of the following configurations, 4-4-2, 5-3-2, or 5-4-1, and
are assigned by the manufacturer. The NDC data may also be in
11-digit form, which is converted from one of the three ten-digit
formats by adding a leading zero as needed to derive the 11-digit
(5-4-2) configuration.
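A minimal sketch of this ten- to 11-digit conversion is shown below (Python, assuming hyphenated input; the example codes are illustrative, not actual products).

```python
# Minimal sketch: zero-pad a hyphenated 10-digit NDC (4-4-2, 5-3-2,
# or 5-4-1) into the 11-digit 5-4-2 form. Unhyphenated 10-digit NDCs
# are ambiguous without additional labeler/product metadata.
def ndc10_to_ndc11(ndc: str) -> str:
    labeler, product, package = ndc.split("-")
    if len(labeler) == 4:          # 4-4-2: pad the labeler segment
        labeler = "0" + labeler
    elif len(product) == 3:        # 5-3-2: pad the product segment
        product = "0" + product
    elif len(package) == 1:        # 5-4-1: pad the package segment
        package = "0" + package
    return f"{labeler}-{product}-{package}"

# Illustrative (not real) codes:
assert ndc10_to_ndc11("1234-5678-90") == "01234-5678-90"   # 4-4-2
assert ndc10_to_ndc11("12345-678-90") == "12345-0678-90"   # 5-3-2
assert ndc10_to_ndc11("12345-6789-0") == "12345-6789-00"   # 5-4-1
```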
The required work to normalize the data by coding to the
NDC data standard can provide substantial advantages for certain
types of data analysis. Since each NDC code provides an exact
description of the drug, the drug’s manufacturer, and packaging,
one can develop detailed estimates on drug utilization and medica-
tion costs. For cost analysis, it may be necessary to link the NDC
information to other external data files on cost since NDC data
does not directly provide medication cost-related information. The
manufacturing/labeler data provides a means to aggregate infor-
mation on drug cost for a particular medication supplier for eco-
nomic analysis.
Several issues can arise when working with NDC data. The
NDC codes can in some cases be reused which can create medica-
tion tracking problems over time. Additionally, since NDC coding
is managed at the manufacturer level, there may be changes with
company mergers, buyouts, or dissolution which will change the
name of the manufacturer/labeling entity. Since NDC data repre-
sentation is manufacturer and product focused, it does not readily
allow data aggregation to a more generalized level of analysis such
as drug or therapeutic class since the medications in a common class
such as statin medications are provided by a variety of manufac-
turers with unrelated NDC data. In addition, since the statin
medications are available from multiple manufacturers, even a sin-
gle drug will have a variety of NDCs which may not be similar to
one another. As a result, it will be necessary to have an additional
review or an additional data source to identify medications of a
particular drug class using NDC codes by themselves. NDC lookup
functionality is available from the Food and Drug Administration
to identify codes associated with medications. The FDA public
website provides the capacity for medication lookup using the
name of the manufacturer/labeler, the proprietary or trade name,
the generic drug name, NDC number, or FDA drug approval
application number [4]. In a nonproprietary search of the term
atorvastatin, which is currently among the most utilized statin
drugs, over 500 NDCs are found at the FDA search site. A
more pointed search on the trade name Lipitor generates 84 NDC
entries (accessed July 31, 2017), demonstrating the diversity of
NDC data for even a single trade name of statin medication.
Medication inclusion and exclusion criteria need to be
addressed to work with drug class-level data. For a particular drug
class, a query with class-based aggregation will ideally identify all
medications available. However, this query may identify medica-
tions which may not be desired for the analysis. Several special cases
may be included and may require exclusion from the data analysis.
One example would be combination medications. In the case of
statin-related medications, there are available combination medica-
tions which include non-statin components. Some examples
include statin drugs in combination with hypertension medications
and combinations with other types of cholesterol medications. If
these combination medications are included in the data set, it may
be necessary to account for the additional pharmacologic activity of
the combination medication component. Personalized treatment
paths will be more difficult to assess in the context of combination
medications and generally require their exclusion or separate assess-
ment in the data analysis.
Several other medication types may not be part of the expected
usage patterns among subjects in the analytical data set or may be
difficult to evaluate. Examples include raw drug products used to
make compounded medications for patient use. It is generally
reasonable to exclude these NDCs
from the analytical data set since they may not represent the final
use of the medication but rather a raw product destined for addi-
tional processing prior to use. Similar NDC data tracking issues can
happen with drug repackaging which may result in locally gener-
ated NDCs which may require semantic and text matching for
identification since the NDCs may not be available to cross-map
with typical mapping tools and terminologies. An additional issue
to consider when choosing to exclude NDCs is to identify the
medications which are used for veterinary purposes as these
would not be part of an analysis on human-related clinical events.
Since NDC codes can be cross-referenced to the product type in the
Structured Product Label, this linked data can be used to differen-
tiate human versus veterinary medications, if required. Significantly
more information on the drug product can be provided after link-
ing the NDC to Structured Product Label data available from the
FDA [5].
Another widely used medication terminology standard is the
Anatomical Therapeutic Chemical Classification System (ATC).
The ATC classifies medication active ingredients according to the
organ or system of the body on which the medications act, along
with their pharmacological, therapeutic, and chemical properties.
This system pro-
vides five levels of classification starting at the first level with 1 of
14 high-level groups. The second level of classification focuses on
pharmacological and therapeutic subgroups. The third and fourth
levels are chemical/pharmacological/therapeutic subgroups, and
the fifth level is chemical substance [6].
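As a worked example of the five levels, the sketch below decomposes the ATC code for atorvastatin (C10AA05); the level labels follow the WHO structure described above.

```python
# Minimal sketch: split an ATC code into its five hierarchical levels.
def atc_levels(code: str) -> dict:
    return {
        "level1_anatomical_group":     code[:1],  # C       cardiovascular system
        "level2_therapeutic_subgroup": code[:3],  # C10     lipid modifying agents
        "level3_pharmacological":      code[:4],  # C10A
        "level4_chemical_subgroup":    code[:5],  # C10AA   HMG-CoA reductase inhibitors
        "level5_chemical_substance":   code[:7],  # C10AA05 atorvastatin
    }

print(atc_levels("C10AA05"))
```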
The incorporation of new entities into the ATC is managed by
the World Health Organization Collaborating Center in Oslo Nor-
way. Requests for inclusion come from system users and can include
manufacturers, regulatory bodies, researchers, and other end users.
If there is not a request made by end users, the medication may not
be included in the ATC, creating a potential coverage gap in the
terminology for comprehensive medication projects.
For data analysis purposes, the ATC terminology has both
advantages and disadvantages related to its application compared
with NDCs. A key ATC feature is that it provides a classification for
medication active chemical entities by body system and pharmaco-
logical action. This can impart important clinically relevant mean-
ing to the coding and has the potential to link or associate
medications by their area of systemic action. In the case of statin
medications, this will put most statin medications under the car-
diovascular drug category as it is the primary system of the body
affected by the pharmacological activity. The classification also
conveys information on the therapeutic and drug classes. Statins
will generally be under the lipid-modifying agents as a therapeutic
class and be in the HMG-CoA reductase inhibitor drug class. Work
with the ATC can be facilitated from available public use resources,
and the ATC terminology can be readily accessed using their online
search tool [7].
There are several limitations on the use of the ATC terminology
in providing useful information on medications. The ATC termi-
nology provides a classification of the chemical entity and its phar-
macological and systemic sites of action; however, it does not
provide information on the manufacturer of the medication, the
dosage of the medication, or the product size. Such limits on the
drug information make certain types of data analysis very difficult,
particularly with dose-related toxicity issues, economic analyses,
and the quantification of medication exposure.
Another prominent medication terminology standard is
RxNorm. The RxNorm terminology is managed by the National
Library of Medicine and includes several medication data cross-
walks for use in medication data management. RxNorm includes
content data from several different medication vocabularies includ-
ing First Databank, Micromedex, Medispan, Multum, the Gold
Standard Drug Database, and other source vocabularies. NDC
data is incorporated into RxNorm including NDC content from
the Gold Standard Drug Database, Multum MediSource Lexicon,
FDA Structured Product Labels, First Databank MedKnowledge,
and the Veterans Health Administration National Drug File to
provide an NDC source representing several widely used medica-
tion terminologies [8].
RxNorm also contains ATC and NDC cross-maps to coordi-
nate assessments across terminologies. The ATC data has a single
primary source so it has source data consistency; however, it may
not always map optimally to RxNorm due to some of the limited
data granularity in the ATC terminology. NDC data is also cross-
mapped in RxNorm, but since NDC codes can come from a num-
ber of different data sources, all the potentially available NDCs are
not necessarily contained and mapped in RxNorm. Since all sources
which create, categorize, and maintain NDCs do not have estab-
lished policies to require sharing of their coding data directly with
RxNorm, there can be gaps in coverage; however, several of the
major commercial drug terminology products do participate,
making RxNorm a good alternative to find a relatively high-quality
single source of NDC coding without the high cost of a commercial
NDC database.
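A minimal sketch of programmatic access is shown below, using the public RxNav REST API from the National Library of Medicine; the endpoint paths and response shapes reflect the API as of this writing and should be verified against the current RxNav documentation.

```python
# Minimal sketch: map a prescribable product name to its RxCUI, then
# list the NDCs RxNorm associates with that concept. NDC lists are
# populated at the product level (specific strength and dose form).
import requests

BASE = "https://rxnav.nlm.nih.gov/REST"

resp = requests.get(f"{BASE}/rxcui.json",
                    params={"name": "atorvastatin 40 MG Oral Tablet"}).json()
rxcui = resp.get("idGroup", {}).get("rxnormId", [None])[0]

ndc_resp = requests.get(f"{BASE}/rxcui/{rxcui}/ndcs.json").json()
ndcs = ((ndc_resp.get("ndcGroup") or {}).get("ndcList") or {}).get("ndc") or []
print(rxcui, len(ndcs), ndcs[:5])
```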
There are several commercial databases available which contain
drug data including drug names (trade and generic), NDC infor-
mation, drug class data, therapeutic class data, and a variety of other
information. The other information may include drug-drug inter-
actions, drug-disease contraindications, drug costs per unit, and
other information. In general, commercial databases are fairly com-
prehensive in their content coverage for medications which are used
in the United States; however, they can be expensive to obtain and
may require data preprocessing to ensure the appropriate time frame
of medications is included and that the drug lists are appropriate
for a given data analysis. Hearst’s First Databank, Wolters Kluwer’s
Medispan, and Cerner Multum are some of the commonly available
and frequently used commercial drug databases. Each product will
typically require the purchase of a subscription for database use, and
it is important to plan for the costs of these products in the data
analysis both for the acquisition and potentially yearly updates in
order to use the database products effectively. Since the subscrip-
tion costs can be substantial, use of these products is often outside
typical project budget constraints but may need to be considered
when available public use terminologies are insufficient for planned
data analysis work.
Although each of the medication terminologies has some
advantages and disadvantages, the selection for use is based on
data availability, budget constraints, and the required effort to
incorporate mappings of the medication data to the terminologies.
If the primary data set lacks any structured terminology mappings,
it may require a substantial effort to map the data to drug terminol-
ogies with some risk for misclassification. The mapping process will
generally require clinical expert review to help reduce the misclassi-
fication risk. Keyword and text mining approaches can help resolve
the bulk of available raw drug information to help reduce the
mapping burden. For expert review, having available linked medical
and demographic data may be needed to optimally map drugs to
appropriate therapeutic classes as many medications can have mul-
tiple clinical indications for use as well as off-label usage patterns.

4 Clinical Outcomes Data Identification and Preprocessing

For clinical outcomes analysis, the primary and secondary out-
comes of interest and predictor variables will need typical data
quality assessments, but depending on the data sources, there may
also be a need for data standardization. In large secondary
clinical data sets, the available clinical outcome data elements will
typically be represented in the form of International Classification
of Disease (ICD) codes, either ICD9 (Version 9) or
ICD10 (Version 10), depending on the time frame in which the
encounter data was originally created. Selection of clinical out-
comes of interest can be identified from a mix of prior clinical
literature, available clinical data, and expert consensus [9].
For data sources which include detailed clinical data beyond
typical administrative claims, additional data preprocessing and data
standardization may be needed. If the supplemental clinical data
includes laboratory information, there may be a need to reconcile
data related to the level of data granularity. In laboratory data
extracts, Logical Observation Identifiers Names and Codes [10]
(LOINC) coded data can be used to identify and manage relevant
content of interest. In order to identify needed LOINC standar-
dized data related to clinical content of interest, public use browsers
for LOINC codes can be used. Depending on the laboratory data
extract and its source, there may be issues related to data granular-
ity. For many clinical areas of interest such as with liver dysfunction
for statin medications, the ability to define liver disease is relatively
straightforward with ICD9/ICD10 codes but can be quite com-
plicated due to the large number of specific tests which may be
related to liver function and the potential presence of liver disease.
A given disease or condition may only have a small set of ICD9/
ICD10 codes but may have tens or even hundreds of clinically
pertinent lab codes and normal value ranges which can become
difficult to manage in data preprocessing work. Identifying well-
established lab tests, the pertinent normal ranges and test results
which are definable as clinical conditions can require a substantial
amount of data analysis and expert review. Review of the lab test
data for outliers and a need for inclusion and exclusion criteria may
be required to deal with physiologically implausible results or
apparent errors in data due to data field truncation, measurement
unit differences, and missing or masked data which may have been
assigned dummy values.
Variation in laboratory values reflects both population variation
among tested subjects and differences in laboratory standards for
testing. This creates a data management problem since a “normal”
value from one lab test may be abnormal at another testing site due
to differences in the lab test methods and testing materials. To
account for these differences, normalizing the test units of measure
is needed to have consistent common units. In addition, there is a
need to retain the low-normal and high-normal cutoff values for
each specific lab test since the absolute value as well as the normal or
non-normal status of a particular lab finding may be important in
the later data analysis. Furthermore, laboratory data values them-
selves may have different meaning even for the same test at the same
site. For instance, a particular lab test may have a normal range of
50–75 which allows one to identify if a particular lab value, e.g.,
55, is normal or abnormal. However, in a longitudinal data set, the
normal range cutoffs may change for a particular lab test at a
particular site, and there may be a fairly large variety of normal
value ranges for a particular test potentially resulting in misclassifi-
cation of a particular laboratory result if pre-defined cutoff values
are used. To address this issue, obtaining and retaining the normal
value range associated with each test can help to prevent this
potential misclassification problem. Another approach to manage
lab results is to calculate ratios to create a single value for each
individual laboratory test. This approach can help when a lab test
has substantial variability as with liver function tests for statin
patients. Since liver function tests can at times be slightly abnormal,
but not require clinical action, it can be problematic to use the
low-normal and high-normal values in case classification. The use of
ratios such as the laboratory value divided by the high-normal value
can provide a ratio of test deviation from normal. For common liver
function tests, ratios of 2 or 3 times the upper normal cutoff are
clinically meaningful, and using this approach with laboratory data
can simplify the final data set yet provide important clinical
meaning.
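The ratio approach can be implemented directly once each result carries its own reference range, as in the sketch below (column names and values are illustrative).

```python
# Minimal sketch: express each lab result as a ratio to its own
# high-normal cutoff so values from sites with different reference
# ranges become comparable.
import pandas as pd

labs = pd.DataFrame({
    "test":        ["ALT", "ALT", "AST"],
    "value":       [44.0, 170.0, 95.0],
    "high_normal": [40.0, 55.0, 48.0],        # cutoff reported with each result
})
labs["ratio_uln"] = labs["value"] / labs["high_normal"]
labs["gt_3x_uln"] = labs["ratio_uln"] > 3     # clinically meaningful elevation
print(labs)
```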
In addressing normal versus abnormal values, the identification
of low and high abnormalities is likely to be meaningful. In some
lab tests, the presence of abnormalities which are low can indicate
different disease likelihoods than high-abnormal laboratory values.
As an example, when one looks at a high-frequency test like a
hemoglobin level, a low value may reflect blood loss, impaired
blood formation, or nutritional insufficiencies. High
hemoglobin values may be related to certain blood overproduction
disorders, chronic low oxygenation, and cardiopulmonary disease
compensation, among other disorders. In this case, an abnormal
value higher than normal has a different disease risk than one with a
value below normal. Other laboratory tests reflect more of an
ordered continuum, as with measures of inflammation such
as the erythrocyte sedimentation rate (ESR), where high num-
bers reflect a higher potential level of inflammation. Clinical experts
can help identify key components for laboratory data management
and data preprocessing.

5 Comorbidity Adjustment

In completing data analysis work, we may obtain statistically mean-
ingful findings which can create substantial initial interest. How-
ever, these findings rapidly become much less interesting when we
find the underlying clinical scenario explains our novel result and is
in fact a known manifestation of the underlying clinical condition.
To avoid this problem, the incorporation of clinical comorbidity
adjustments can both improve the quality of our analytics and avoid
expending effort to point out clinically expected patterns in
our data.
Clinical comorbidity adjustment can be a major problem, con-
sidering the breadth of potentially available clinical diagnostic data.
To account for clinical comorbidities as well as address the pro-
blems of high-dimensional clinical data, the use of clinical comor-
bidity scoring can be valuable. Ideally a comorbidity adjustment
tool should be able to identify broad clinical categories, and the
scores should be associated with mortality risk and be validated in a
general population or in our clinical cohort of interest. Several
options are available to use, including a number of published
models which can be supplemented with other available commer-
cial tools if the project budget will allow for it. Using broad mea-
sures with well-defined disease subsets can provide the capacity to
model targeted disease states while retaining the capacity to
account for broader disease categories. A broad measure is often
the best option for general data analysis since it will provide infor-
mation on the risk to patients of overall mortality. Available ICD9
and ICD10 code information describing the medical diagnoses are
usually required to use the available validated measures.
Several options are frequently used for comorbidity adjustment
with broad ICD9-based approaches like the Chronic Condition
Indicator and the Clinical Classifications Software which provide a
way to group the large numbers of ICD9 codes into a smaller
number of clinical categories. There are also more targeted
approaches to comorbidity adjustment which use available pub-
lished approaches such as the Charlson Index and Elixhauser
Index which are validated instruments to account for clinical
comorbidities. There are other comorbidity adjustment tools avail-
able through commercial vendors such as the Johns Hopkins ACG
system, but these will require licensing and a fee associated with
their use, though academic discounts may be available.
The Chronic Condition Indicator is available through the
Agency for Healthcare Research and Quality (AHRQ) which was
developed with the Healthcare Cost and Utilization Project
(HCUP), a Federal-State-Industry partnership sponsored by the
Agency for Healthcare Research and Quality [11]. The Chronic
Condition Indicator translates the ICD codes into chronic or
non-chronic categories and will also associate the code with a
body system. Chronic conditions are those conditions which
would typically last 12 months or longer. The Chronic Condition
Indicator can be used with the Clinical Classifications Software
(CCS) to categorize the ICD codes into a smaller number of
clinically focused categories. The CCS software was developed by
AHRQ and can provide the capacity to substantially reduce the
number of ICD codes to manage by creating a clinically meaningful
“clinical grouper.” The CCS system can reduce the clinical coding
data from approximately 14,000 diagnosis codes and 3900 proce-
dural codes in ICD9 to 285 mutually exclusive single-level cate-
gories [12]. The single-level CCS categories also have a hierarchy
and can be grouped into multilevel CCS categories as well. The
CCS tool set has available help files and software to assist in com-
pleting the grouping of the ICD data, allowing for a high degree of
transparency, and can facilitate a deeper drill down to the specific
ICD codes to better understand study results.
The Charlson Index has a long history of use and has been
modified a number of times since it was originally published
[13]. The original index had 19 disease categories, which subse-
quent research has revised. Some of the
needed change also results from underlying shifts in the mortality
risks for some disease categories which have different outcomes
today than in 1987 when the index was first published. The original
index also assigned weights to each category to create a single
comorbidity score for each subject which was based on the adjusted
risk of 1-year mortality. Subsequently, some of the categories were
reorganized, and some of the weights were changed. A description
of the Charlson Index and its subsequent evolution is summarized
by the University of Manitoba along with lists of the clinical codes
in ICD9 and ICD10 [14] related to the index [15, 16]. The Uni-
versity of Manitoba also provides SAS macros to create the index

scores in ICD9 and ICD10 which are adopted from the work of
Quan [15].
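The scoring logic common to these indices is simple to sketch; the fragment below scores a patient with a small, illustrative subset of Charlson categories and weights, while production work should use the validated code lists (e.g., the Quan adaptations) in full.

```python
# Minimal sketch of Charlson-style scoring: map ICD-10 prefixes to
# comorbidity categories and weights, counting each category once.
# Only an illustrative subset of the categories is shown.
CHARLSON_SUBSET = {
    "I21": ("myocardial_infarction",     1),
    "I50": ("congestive_heart_failure",  1),
    "E11": ("diabetes_uncomplicated",    1),
    "C78": ("metastatic_solid_tumor",    6),
}

def charlson_score(icd10_codes) -> int:
    categories = {}
    for code in icd10_codes:
        hit = CHARLSON_SUBSET.get(code[:3])
        if hit:
            categories[hit[0]] = hit[1]   # one weight per category
    return sum(categories.values())

print(charlson_score(["I21.4", "I21.9", "E11.9", "C78.0"]))  # -> 8
```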
Another comorbidity adjustment approach was developed by
Elixhauser which included 30 comorbidity measures in the initial
development [17]. The Elixhauser Comorbidity Index uses ICD9
and ICD10 codes from administrative data and was developed to
predict hospital resource use and hospital mortality. The method
has changed slightly since its original development and SAS code is
available for its use from the University of Manitoba [18].
The Elixhauser and Charlson Index approaches both have the
appeal of being able to create a single score for use in assessing the
clinical comorbidity of a single subject. Broader grouper software
such as the CCS can represent a substantial amount of clinical
information in a relatively small number of groups of clinically
pertinent codes. In practical application, it can be useful to start
with either Elixhauser or Charlson Index aggregate scores along
with their individual clinical components to assess if the clinical
comorbidity adjustment is relevant in the data analysis. Inclusion
of some method of comorbidity adjustment in any clinical data
analysis work is likely to improve the modeling and acceptance of
the results by the clinical community.
Clinical comorbidity adjustment can be completed with the use
of broad index tools or grouping software. However, with
medication-related data analysis, it may also be useful to use tar-
geted clinical comorbidity assessment related to the target medica-
tion. Targeted clinical comorbidity assessment related to
medication use will focus on factors affecting the absorption, distri-
bution, metabolism, and excretion of medications which may have
an effect on the efficacy of a particular medication. Since medica-
tions are typically metabolized and excreted from the body via the
liver and/or kidneys, it is important to account for systematic
deficiencies in this physiological function in the data sets at the
individual patient level. This can be difficult when the organ sys-
tems can be affected by the target drug as is the case with statin
medications. Since the exposure to the drug can lead to a potential
adverse event, this has the potential for misclassification if it is not
accounted for in the analysis. To attempt to discern drug-induced
adverse events from those resulting from conditions which were
present at drug initiation, the presence of these clinical states need
to be identified prior to the initial medication prescription initia-
tion. To adequately screen for the presence of such clinical states, a
review of a sufficient time period prior to the first drug prescription
is required and can typically be either 1 month, 3 months,
6 months, or even a year prior to the initial prescription similar to
the approach in developing a disease-free or washout time frame
with a predilection to longer time frames.

6 Clinical Cohort Identification

Identification of adverse drug and clinical events is generally com-
pleted via medical claims and/or electronic medical record data. To
identify drug-related events which present as clinical outcomes as in
the case of adverse events associated with drug-drug interactions,
there is a need for overlapping pharmacy and medical benefits in
order to plausibly assess for associations. In this case, any time both
coverages are in place for an individual, there is a chance to establish
the exposures and assess for events. When one or both insurance
coverage elements are not current, this exposure time should be
excluded from the analysis, since missing exposures and events will
create problems of misclassification. A more conservative approach
would entail establishing contiguous blocks of coverage for both
medical and pharmacy coverage as a requirement of inclusion in the
analytic data set with a corresponding reduction in misclassification
risk but also a reduction in population sample size.
A similar approach can be used to identify a washout period or
to set up an exclusion window time frame. For instance, if a cohort
is to be considered exposure naïve, a retrospective review for the
presence of an exposure of interest is required. For example, if the
goal is to identify a cohort which is statin therapy naïve, the expo-
sure time needs to be defined, and the insurance eligibility needs to
be assessed over that time frame. In this case, only pharmacy benefit
eligibility would be needed; however, the pharmacy benefits would
need to be continuous over the look-back time frame. Since medi-
cation prescriptions can last up to 1 year, a plausible look-back time
frame to assess for a medication exposure is 1 year. One can con-
sider a shorter look-back time frame of 30 days or 90 days for
prescription fulfillment, but this would risk including cohort mem-
bers who are inappropriately treatment exposed just outside the
look-back time frame. Since medication prescriptions can often be
filled for a 90-day supply, the minimum look-back time frame
should be 90 days, but using a 6-month or a longer 12-month
look-back time frame is desirable. For typical medication usage
patterns in population data, there may be medication holidays,
dosage changes, and pill splitting by subjects using the medications
which create gaps in prescription filling times which may be longer
than expected. Given these potential issues, it is desirable to con-
sider longer time frames in reviewing washout or exposure-free
time periods prior to new prescription initiation. Even longer
time frames can potentially misclassify exposures. For example, if
one is using a 6-month time frame for a drug washout, it will cover
most subject drug holidays in the context of 3-month prescription
fills but may still miss subjects who are pill splitting and are able to
fill their 3-month prescription and effectively get 6 months of
therapy prior to needing a new prescription refill cycle. Add in
a couple of drug holidays or inadvertent missed doses, and one
could have a greater than 6-month window between prescription
fills while still achieving a high level of day-to-day medication
adherence. This example provides a potential rationale for using a
longer 6 month and ideally 1 year time frames to assess for prior
drug exposures.
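As a sketch of the look-back logic (Python/pandas, with illustrative column names), a subject is treated as statin naïve only when no statin fill appears in the chosen window before the index prescription:

```python
# Minimal sketch: flag a subject as treatment naive when no prior fill
# falls inside the look-back window before the index date.
import pandas as pd

def is_treatment_naive(fills: pd.DataFrame, index_date: pd.Timestamp,
                       lookback_days: int = 365) -> bool:
    window_start = index_date - pd.Timedelta(days=lookback_days)
    prior = fills[(fills["fill_date"] >= window_start) &
                  (fills["fill_date"] < index_date)]
    return prior.empty

fills = pd.DataFrame({"fill_date": pd.to_datetime(["2014-02-10"])})
print(is_treatment_naive(fills, pd.Timestamp("2015-06-01")))   # True
```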
In defining a population cohort, there is often a need to iden-
tify a particular disease state present in a cohort population
[19]. For statin therapy, it is important to identify at-risk groups
such as those with hyperlipidemia or those with known cardiovas-
cular disease. To identify the presence of these disease states in a
subject cohort, diagnostic data is needed to help with cohort
inclusion and for comorbidity adjustments. To identify a clinical
disease state in a cohort, standardized clinical data such as ICD9/
ICD10 codes can be used to identify clinical disease states of
interest. Although the presence of a clinical code provides evidence
for the presence of a clinical disease, the overlap between provi-
sional and confirmed diagnoses in clinical practice can create pro-
blems for cohort identification. If a patient is getting a clinical
evaluation and has some but not all defining characteristics of a
disease, they may be provisionally diagnosed with that disease.
However, in some cases alternative diagnoses may be found at a
later time, and the patient may be deemed to no longer have this
disorder. In the clinical record, this disease data will be changed to
an inactive state; however, this change is not reflected in the
ICD9/ICD10 clinical encounter data. For cohort inclusion, one
should have a higher standard to ensure study subjects indeed meet
the criteria for cohort inclusion.
Potential mechanisms to ensure a higher degree of certainty for
diagnoses used for cohort eligibility include the incorporation of
inpatient and outpatient data criteria to confirm the presence of the
disease of interest. Different inclusion qualifications for inpatient
versus outpatient data have been utilized, typically requiring the
presence of one inpatient diagnosis or two outpatient
diagnoses within a calendar year as confirmatory of the disease
state of interest [20]. Using a higher degree of certainty to establish
the diagnosis can help reduce the risk of identifying clinical states
which may be transient or provisional.
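The one-inpatient-or-two-outpatient rule can be sketched as below (Python/pandas; the claims columns are illustrative):

```python
# Minimal sketch: confirm a diagnosis when a patient has >= 1 inpatient
# claim or >= 2 outpatient claims with the code in one calendar year.
import pandas as pd

def confirmed_patients(claims: pd.DataFrame, icd_prefix: str):
    dx = claims[claims["icd_code"].str.startswith(icd_prefix)].copy()
    dx["year"] = dx["service_date"].dt.year
    counts = (dx.groupby(["patient_id", "year", "setting"])
                .size().unstack(fill_value=0))
    for col in ("inpatient", "outpatient"):
        if col not in counts.columns:
            counts[col] = 0
    ok = (counts["inpatient"] >= 1) | (counts["outpatient"] >= 2)
    return ok[ok].index.get_level_values("patient_id").unique()
```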
The evaluation of clinical diagnostic codes over prolonged time
frames can help to ensure clinical cohort subjects meet inclusion or
exclusion criteria. However, when clinical diagnostic codes are
assessed across longer time frames, these longitudinal data evalua-
tions also need to incorporate a yearly coding consistency check as
there can be changes in the coding of the disease state of interest
from 1 year to the next. This will typically require a review of the
relevant code set for each year in which clinical data is available and
then reconciling the coding to ensure consistency across the longi-
tudinal data set. As part of data preparation, the initial analysis
may show that the number of clinical events unexpectedly increases or decreases
starting at a particular date, providing evidence of a potential
systematic shift in the data which may reflect a change in the source
clinical coding. Coding updates are generally available from clinical
coding publications and professional societies to highlight key
changes from 1 year to the next to help simplify this work.
Larger systematic changes in the data will be noted when a
change in the primary coding occurs as happened in the United
States in October 2015 when clinical diagnostic coding changed
from ICD9 to ICD10. Since the two coding systems were dramati-
cally different, any US-based clinical data set will need to account
for this change which occurred on October 1, 2015. A number of
resources are available through public use files and other software
to help facilitate mapping across the two coding versions using
General Equivalence Mappings (GEMS) files to convert between
ICD9 and ICD10 [21]. In addition, these files are also mapped
from both ICD9 and ICD10 to SNOMED-CT to use for other
mapping work and can serve as a second source to test coding
equivalents.
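
As a sketch of this mapping step, the loader below parses a GEMs text file into a one-to-many lookup. The whitespace-delimited source/target/flags layout and the file name are assumptions based on the public CMS files [21]; production use should validate against the official documentation.

```python
from collections import defaultdict

def load_gems(path: str) -> dict:
    """Parse a General Equivalence Mappings text file into a one-to-many
    lookup {source_code: [target_codes]}. GEMs rows are assumed to hold
    a source code, a target code, and a flags field."""
    mapping = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) >= 2:
                mapping[parts[0]].append(parts[1])
    return mapping

# Hypothetical usage with the 2017 ICD-9-to-ICD-10 GEM file:
# gems = load_gems("2017_I9gem.txt")
# print(gems.get("2724"))  # candidate ICD-10 codes for ICD-9 272.4
```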

7 Personalized Medication Therapy Development

The identification of personalized medication profiles incorporates


data-driven as well as clinical context elements to identify the
potential medication paths for a particular patient. A number of
elements will guide medication therapy including prescribing infor-
mation provided by the drug manufacturer, clinical guidelines
established by clinical societies, and existing clinical practice pat-
terns. Many medications require dosage titration either to mini-
mize medication side effects or to optimize the clinical effect. Some
medications require drug levels to be assessed to determine whether the dose is
sufficient, while others require biomarker data to provide a measure
of the response. However, most medications lack readily available
direct biomarker data on drug levels and are generally managed
according to the clinical response using patient history and exami-
nation findings. As a result, the ability to create a precise evaluation
of optimal dosing is limited to the available data such as the dose of
the drug and frequency of usage as measured by prescription fill
data. Given the limitations around medication therapy information,
the ability to aggregate large population databases including patient
characteristics, insurance data, medication regimen characteristics,
and prescription data can provide a means to create personalized
medication profiles [22]. By using a “patients like me” approach,
one can find patients with similar clinical, demographic, and medi-
cation use patterns to create optimized treatment pathways.
A number of treatment pathways can be defined starting with
the identification of all available medications in the drug class and
the potential dosages available for each medication. Once the path-
ways are defined, the usage patterns can be evaluated for continu-
ous use, irregular use, or discontinuation of therapy. Usage patterns
can be defined around prescription fill data looking at the number
of days of drug supplied versus the number of days in which a
patient is on medication therapy.
For the assessment of statin therapy and other chronic medica-
tions, there may be a drug initiation followed by subsequent
changes in dose over time to improve a patient’s lipid profiles, to
manage changes in the relative risk of cardiovascular disease, and to
reduce adverse drug events which may be associated with statin
medication use. Assessment of the frequency of medication pre-
scription orders in the medical record can provide information on
the changes in therapy and the likelihood of patient use of the
medication. Unfortunately, medication orders or prescriptions are
not always directly related to the actual use of medications, since
many prescribed medications are never filled, which can
create misclassification problems in identifying drug exposures.
Ideally, in this case, the data analyst has data outside the electronic
record to identify prescription fill data in the form of prescription
claims data. If such data is available, it can substantially reduce the
likelihood of exposure misclassification but does add to data prepa-
ration and data integration efforts for medical record focused
projects.
More specific measures of drug usage focusing on medication
adherence can also be considered with some of the widely used
measures available including the proportion of days covered (PDC)
[23], medication possession ratio (MPR), and others. The typical
measures of adherence use the days of medication supplied to a
patient divided by the number of days in the observation time
period. For statin medications, the proportion of days covered
method is supported by several groups as a preferred measure of
medication adherence including the Pharmacy Quality Alliance and
the National Quality Forum. The PDC is also currently used in the
US Centers for Medicare and Medicaid Services star ratings. The
PDC method is calculated by identifying the number of days in the
observation period which are covered by a medication,
dividing by the total number of days in that period, and
converting this proportion to a percentage measure with some
specific case adjustments [24]. For chronic medication therapies
such as statin drugs, a measure of 80% or higher is generally seen as
a high level of adherence though higher numbers are desired for
some other medications where missing doses can be problematic as
in the case of human immunodeficiency virus therapy or
antibiotic therapy.
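
A simplified PDC computation is sketched below: unique covered days divided by the days in the observation window, with overlapping fills counted once. This is a common simplification and omits the specific case adjustments noted above [24].

```python
from datetime import date, timedelta

def pdc(fills, period_start: date, period_end: date) -> float:
    """Proportion of days covered, as a percentage. `fills` is a list of
    (fill_date, days_supply) pairs; overlapping supply days are counted
    once, a simplification of the full specification."""
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if period_start <= day <= period_end:
                covered.add(day)
    days_in_period = (period_end - period_start).days + 1
    return 100.0 * len(covered) / days_in_period

fills = [(date(2016, 1, 1), 90), (date(2016, 4, 15), 90), (date(2016, 8, 1), 90)]
print(round(pdc(fills, date(2016, 1, 1), date(2016, 12, 31)), 1))
# 73.8 -> below the 80% level usually treated as adherent for statins
```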
A number of other factors can also be evaluated in personaliz-
ing medication therapy. However, many of these factors are not
readily available in a form which can be easily integrated with
other available medication data resources. Each of these areas can
be considered when interpreting the findings of the data analysis.
This list is certainly not exhaustive but does provide a starting point
to place analytic results in context.

8 Factors to Consider in Personalizing Medication Therapy

1. Medication activation, metabolism, and clearance factors.
2. Economic: Total drug cost, insurance reimbursement, and out-
of-pocket cost.
3. Policy: Formulary status, medication substitution, mail order
status, and regulatory changes.
4. Medication prescription policies: FDA black box warnings and
drug approval or withdrawal.
5. Clinical evidence: New studies for and against use and clinical
guidelines.
6. Alternative medication availability and substitution.
7. Pharmacogenomics: Including both patient functional data
and drug-specific evidence.
8. Medication experience: Personal beliefs and values.
Each of these factors can be considered in interpreting the
results of our data analysis after we have obtained the data, com-
pleted a quality assessment, standardized the data, identified rele-
vant comorbidities, and generated our clinical cohort data set for
evaluation. If we follow these steps, our data analysis is likely to be
more productive and clinically meaningful, and these additional
factors can serve as a checklist for additional data analysis and
hypothesis testing.

References
1. Jill Kolesar LV (2015) McGraw-Hill’s 2016/2017 top 300 pharmacy drug cards. McGraw-Hill
2. ClinCalc (2017) The Top 200 of 2017. ClinCalc LLC. http://clincalc.com/DrugStats/Top200Drugs.aspx. Accessed 30 July 2017
3. Agency for Healthcare Research and Quality R, MD (2017) Medical expenditure panel survey
4. Food and Drug Administration US (2017) National drug code directory. https://www.fda.gov/drugs/informationondrugs/ucm142438.htm. Accessed 30 July 2017
5. Food and Drug Administration US (2017) Structured product labeling resources. https://www.fda.gov/ForIndustry/DataStandards/StructuredProductLabeling/default.htm. Accessed 31 July 2017
6. WHO Collaborating Centre for Drug Statistics O (2017) ATC: structure and principles. https://www.whocc.no/atc/structure_and_principles/. Accessed 31 July 2017
7. WHO Collaborating Centre for Drug Statistics O (2017) ATC/DDD index 2017. https://www.whocc.no/atc_ddd_index/. Accessed 31 July 2017
8. U.S. National Library of Medicine B (2017) RxNorm technical documentation. U.S. National Library of Medicine. https://www.nlm.nih.gov/research/umls/rxnorm/docs/2017/rxnorm_doco_full_2017-2.html. Accessed 31 July 2017
9. Svensson-Ranallo PA, Adam TJ, Sainfort F (2011) A framework and standardized methodology for developing minimum clinical datasets. AMIA Jt Summits Transl Sci Proc 2011:54–58
10. Regenstrief I (2017) LOINC: the international standard for identifying health measurements, observations, and documents. Regenstrief Institute. https://loinc.org/. Accessed 31 July 2017
11. Agency for Healthcare Research and Quality R, MD (2017) HCUP chronic condition indicator. Healthcare cost and utilization project (HCUP): Chronic condition indicator (CCI) for ICD-9-CM. https://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp. Accessed 31 July 2017
12. Agency for Healthcare Research and Quality R, MD (2012) HCUP CCS fact sheet. Healthcare cost and utilization project (HCUP). https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp. Accessed 31 July 2017
13. Charlson ME, Pompei P, Ales KL, MacKenzie CR (1987) A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis 40(5):373–383
14. Manitoba Centre for Health Policy C (2016) Concept: charlson comorbidity index. http://mchp-appserv.cpe.umanitoba.ca/viewConcept.php?conceptID=1098#a_references. Accessed 31 July 2017
15. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, Saunders LD, Beck CA, Feasby TE, Ghali WA (2005) Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care 43(11):1130–1139
16. Deyo RA, Cherkin DC, Ciol MA (1992) Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol 45(6):613–619
17. Elixhauser A, Steiner C, Harris DR, Coffey RM (1998) Comorbidity measures for use with administrative data. Med Care 36(1):8–27
18. Manitoba Centre for Health Policy C (2016) Concept: elixhauser comorbidity index. http://mchp-appserv.cpe.umanitoba.ca/viewConcept.php?conceptID=1436. Accessed 31 July 2017
19. Chi C-L, Wang J, Clancy TR, Robinson JG, Tonellato PJ, Adam TJ (2017) Big data cohort extraction to facilitate machine learning to improve statin treatment. West J Nurs Res 39(1):42–62. https://doi.org/10.1177/0193945916673059
20. Hebert PL, Geiss LS, Tierney EF, Engelgau MM, Yawn BP, McBean AM (1999) Identifying persons with diabetes using medicare claims data. Am J Med Qual 14(6):270–277. https://doi.org/10.1177/106286069901400607
21. Center for Medicare and Medicaid Services H (2017) 2017 ICD-10-CM and GEMs. https://www.cms.gov/Medicare/Coding/ICD10/2017-ICD-10-CM-and-GEMs.html. Accessed 28 July 2017
22. Olson CH, Dierich M, Adam T, Westra BL (2014) Optimization of decision support tool using medication regimens to assess rehospitalization risks. Appl Clin Inform 5(3):773–788. https://doi.org/10.4338/ACI-2014-04-RA-0040
23. Benner JS, Glynn RJ, Mogun H, Neumann PJ, Weinstein MC, Avorn J (2002) Long-term persistence in use of statin therapy in elderly patients. JAMA 288(4):455–461
24. Nau DP (2017) Proportion of days covered (PDC) as a preferred method of measuring medication adherence. Pharmacy Quality Alliance. http://pqaalliance.org/resources/adherence.asp. Accessed 31 July 2017
Chapter 15

Drug Signature Detection Based on L1000 Genomic and Proteomic Big Data
Wei Chen and Xiaobo Zhou

Abstract
The Library of Integrated Network-Based Cellular Signatures (LINCS) project aims to create a network-
based understanding of biology by cataloging changes in gene expression and signal transduction. The L1000
big datasets provide gene expression profiles induced by over 10,000 compounds, shRNAs, and kinase
inhibitors using the L1000 platform. We developed a systematic compound signature discovery pipeline named
csNMF, which covers everything from raw L1000 data processing to drug screening and mechanism generation. The
discovered compound signatures of breast cancer were consistent with the LINCS KINOMEscan data and
were clinically relevant. In this way, the potential mechanisms of compounds’ efficacy are elucidated by our
computational model.

Key words LINCS, L1000, csNMF, Compound signature, Drug signature, Compound efficacy

1 Introduction

Currently, the fundamental step of drug discovery is compound
profiling, which is defined as the large-scale screening of candidate
compounds for their potential drug-like qualities and toxicity using
high-throughput technologies [1]. A critical need in compound
profiling and drug discovery is to thoroughly examine the impacts
of drugs or compounds on cellular functions using a wide panel of
essential proteins [2]. To address such challenges, the Library of
Integrated Network-Based Cellular Signatures (LINCS) program
(http://www.lincsproject.org/) has initiated an effort to generate
the related biomedical big data [3]. The LINCS program has been
used to systematically explore the pharmacological roles of more
than 3700 potential drug targets in 15 cancer cell lines at the
individual-gene level. Compound profiling using LINCS big data
as the reference library is made possible by the large-scale applica-
tion of the L1000 platform. The L1000 gene expression data are
cataloged for human cancer cells treated with compounds and
genetic reagents. Similar to Connectivity Map (CMap) [4], the
L1000 assay (Luminex-bead detection system) aims to connect
diseases with genes and drugs at low costs. The gene expression
profiles from L1000 data are potentially useful to infer the targets
of compounds. As a novel genome-wide gene expression assay
platform, the L1000 is highly cost-efficient and robotically auto-
mated. It allows the generation of 946,944 profiles of gene expres-
sion data testing 5178 drugs and compounds and perturbations of
3712 genes across 15 different cancer cell types (http://lincscloud.
org/). As an ongoing national data generation consortium, the
LINCS L1000 big data is growing quickly in examined drugs,
compounds, genes, dosing, time points, combinations of treatment
[5] conditions, and cell lines.
Accompanying such a great opportunity are the new challenges
of processing and analyzing data generated from the L1000 plat-
form. In this chapter, we present our project utilizing LINCS
L1000 cell line data for breast cancer. This involves the development
of a comprehensive and complete pipeline for network-based com-
pound signature discovery and drug screening under the target
gene reference library [6]. We proposed a “compound signature”-
based approach to profiling the pharmacological potential of com-
pounds by associating these candidates with known drugs in terms
of the similarity of their possible targets, using the latest LINCS
L1000 data for breast cancer (MCF-7) cell lines. The whole
approach includes three steps. For the first step, we defined a
“compound signature” as a group of small-molecule compounds
sharing similar target genes. Meanwhile, we developed a parallel
data processing pipeline, the fuzzy c-means guided Gaussian mix-
ture model (GMM), to address the L1000 data processing chal-
lenges with superior accuracy and efficiency. For the second step,
we proposed two compound signature discovery approaches using
data produced by the GMM pipeline. One was the Enrichment of
Gene Effects to a Molecule (EGEM) score, which associated a
compound with its potential targets. Another approach was the
constrained sparse nonnegative matrix factorization (csNMF),
which used the EGEM scores of drugs, compounds, and genes to
reliably detect the compound signatures and associate candidate
compounds with known drugs by the shared compound signatures.
The LINCS kinomics data for kinome-wide drug inhibitory effects
were used to validate discovered signatures. Functional analysis and
known mechanisms of the detected signatures further supported
the results of compound signature detection. For the third step, we
constructed the quadruple model training, which correlated a drug
with its targets, the affected downstream transcription factors, and
the transcriptional alterations. Based on the whole three-step
approach, we discovered the compound signatures of breast cancer,
which were consistent with the LINCS KINOMEscan data and
were clinically relevant. In addition, our pipeline provided a novel
and complete tool to expedite signature-based drug discovery
leveraging the LINCS L1000 resources.

2 Materials

LINCS L1000 gene expression data and KINOMEscan data are
adopted. We combined the small-molecule compound and shRNA
data released from the Broad Institute LINCS Data Generation
Center (http://api.lincscloud.org/). Two compound-induced
L1000 gene expression datasets were selected. The two datasets
included data for treatment effects of 728 and 51 compounds on
the MCF-7 breast cancer cell line, respectively. In addition, we
utilized the KINOMEscan data, which measured the interactions
of compounds and more than 450 kinase assays and disease-
relevant mutant variants. Expression patterns after the single-gene
knockdown of 3341 biologically important genes by shRNA treat-
ments were measured on the same cell line. Compounds in the
latter dataset were all kinase inhibitors. Thus, we included the
auxiliary KINOMEscan data of these 51 kinase inhibitors released
from the Harvard Medical School LINCS Data Generation Center
(http://lincs.hms.harvard.edu/db/). This dataset was used to vali-
date the discoveries of compound signatures (see Note 1).

3 Methods

The overall framework of the compound signature discovery pipe-
line (Fig. 1) is composed of three steps:
Step I: Raw L1000 data processing using the GMM pipeline. At
this step, the L1000 raw data were processed, normalized,
cleaned for quality control, and annotated. The GMM pipeline
demonstrated better accuracy and efficiency compared to
another tool using the k-means method (http://lincscloud.
org/exploring-the-data/code-api/, date: 2012/06/27).
Step II: Compound signature detection using the EGEM-based
csNMF model. In this step, the EGEM method was used to
measure the EGEM score for each of the 3341 perturbed
genes, which described the potential of the gene of interest to
be the “target” of a small-molecule compound. The targeting
potentials of such compound-gene pairs were represented by an
EGEM matrix (Fig. 1). Then the novel constrained sparse
nonnegative matrix factorization (csNMF) algorithm was
developed and performed on the EGEM matrix to identify
compounds of similar targets. Each such compound subgroup
is defined as a compound csNMF signature, shares similar
targets, and may show similar pharmaceutical potential.
Fig. 1 Overview of the compound signature discovery framework. This method requires raw L1000 data after
various compounds and gene knockdown treatments. The raw data after the two types of treatments are
preprocessed to yield gene expression data in Step I. In Step II, the EGEM matrix is constructed based on these
gene expression data to measure relationships among compounds and knockdown genes. This matrix is then
decomposed to a weight matrix and a coefficient matrix by the csNMF method. Protein–protein interaction
data are added in consideration of biological connections. Signatures are identified based on strongly
associated genes (i.e., those with larger values in the coefficient matrix)

Step III: csNMF signature analysis and annotation using our devel-
oped quadruple model, which reveals how compounds in each
csNMF signature alter the downstream transcription factors
and causes the differential changes of the apparent gene expres-
sion patterns (see Note 2).

3.1 Step I: Raw Data Preprocessing Pipeline

The goal in Step I was to reliably process, normalize, clean, and
annotate the L1000 raw data. The major challenges in this step
were reliable peak calling, normalization and quality control, and
the computational burdens for processing big raw data. A GMM
peak calling approach was developed for reliable peak calling from
raw L1000 data [6]. The raw data for each sample were deconvo-
luted, and the fluorescent intensity peak corresponding to each
mRNA probe was identified using the GMM model, annotated
with the gene symbol, probe ID, gene description, and the analyte
and L1000 probe set information [7]. This information was then
[Fig. 2 flowchart: Level 1 raw data (.lxb) → peak calling (GMM) → Level 2 raw gene expression (.csv and .gct) → quantile normalization (removing batch effects) → quality control (experiment quality) → log fold changes (against selected control conditions) → Level 3 perturbagen-induced gene expression pattern (.gct)]

Fig. 2 Overview of the data preprocessing framework. The raw Luminex data are transformed into gene
expression data by the GMM peak calling method. Quantile normalization is then performed to reduce the
batch effects, and quality control is executed to filter out poor-quality data

outputted in the GCT format, defined as the Level 2 raw gene
expression data. After the normalization and quality control, each
set of perturbation-induced data was compared with its negative
control. Differential gene expression (DEG) patterns, in the form
of log fold changes (LFCs), were outputted as the Level 3 perturba-
gen-induced gene expression pattern data in the GCT format. The
whole GMM-based pipeline is shown in Fig. 2.
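
To illustrate the peak-calling idea, the sketch below fits a two-component Gaussian mixture to simulated bead intensities for a single analyte. The simulated data and parameters are assumptions; the production GMM pipeline additionally performs peak-to-probe assignment, annotation, and quality control.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def call_peaks(intensities: np.ndarray, n_peaks: int = 2) -> np.ndarray:
    """Fit a Gaussian mixture to one analyte's bead fluorescence values
    and return the sorted component means as the called peaks."""
    model = GaussianMixture(n_components=n_peaks, n_init=5, random_state=0)
    model.fit(np.log2(intensities).reshape(-1, 1))  # log scale stabilizes variance
    return np.sort(model.means_.ravel())

rng = np.random.default_rng(0)
# Simulated analyte carrying two probes at different expression levels.
beads = np.concatenate([rng.lognormal(6.0, 0.3, 60),
                        rng.lognormal(8.0, 0.3, 30)])
print(call_peaks(beads))  # low and high log2 peak locations
```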

3.2 Step II: Compound Signature Discovery by EGEM-Based csNMF Model

3.2.1 EGEM Score and the EGEM Matrix

Enrichment of Gene Effect to a Molecule (EGEM) was developed
to identify proteins closely related to cellular responses to a small-
molecule compound, using the LINCS L1000 landmark gene
expression data. A small-molecule compound affects a cell by
directly or indirectly changing the activities and functions of its
target proteins, which drive downstream biological events, and
finally alter cellular gene expression patterns. We hypothesized
that the knockdown of a gene, which is closely related to the target
proteins of a small-molecule compound, induces similar gene
expression pattern changes. Thus, identification of such genes
could reveal the mechanisms of cellular responses to these com-
pounds and predict their pharmaceutical potentials. We defined the
“target genes” of a compound in a general sense: the
corresponding proteins of such genes could be either the real
drug targets or those downstream or upstream that were closely
related to the real targets. The data for 3000 single-gene knock-
down experiments were used as the target gene reference library,
and the data for compound treatments were profiled against this
reference library to identify possible target genes of corresponding
small-molecule compounds.
We defined the EGEM score to describe the similarity between
the treatments of a compound and a shRNA targeting a gene using
the mutual enrichment of their resultant differential expressed
landmark genes. The EGEM metric was derived from the rank-
based gene set enrichment analysis (GSEA) [8] and the connectivity
analysis [9]. Compound treatments could be taken as “pheno-
types” and the differentially expressed genes (DEGs) of a single-
gene knockdown treatment as a “signature gene set” in the
GSEA terminology. The EGEM metric enabled gene set enrich-
ment analysis against the LINCS target gene reference library. The
construction of the EGEM score is shown in Fig. 3, where a
signature gene set of a target gene is composed of n DEGs after
the knockdown of a target gene. Among them, t1 were upregu-
lated and t2 downregulated. DEGs were detected according to
the LFCs of the L1000 landmark genes using 1.5 × IQR (interquar-
tile range) as the threshold, which was robust against outliers. For a
small-molecule compound, two lists of landmark genes were used
to represent the patterns of the compound-induced L1000 gene
expression changes. One (pup) was sorted in ascending order, while the
other (pdown) was sorted in descending order based on the LFCs. Here,
p1( j) and p2( j) were the positions of the jth up- and downregulated
DEGs, respectively, in their corresponding probe gene lists. The
EGEM score was defined in Eq. 1:
Fig. 3 EGEM score construction. The up- and down-DEGs after a gene knockdown treatment are used as two
feature sets. The locations of up and down feature sets in the ascendant- and descendant-sorted gene list
after a compound treatment are measured by the Kolmogorov-Smirnov statistic. The value is normalized by the
total size of up and down feature sets
 
$$\mathrm{EGEM}=\max_{i=1:t_1}\left[\frac{i}{t_1+t_2}-\frac{p_1(i)-i}{2n-t_1-t_2}\right]+\max_{j=1:t_2}\left[\frac{j}{t_1+t_2}-\frac{p_2(j)-j}{2n-t_1-t_2}\right]\qquad(1)$$

The EGEM score ranges from −1 to 1. The absolute value of
an EGEM score represents the enrichment degree. The positive or
negative sign of an EGEM score indicates that the change of gene
expression pattern due to knocking down the corresponding gene
is similar or reversely similar to that induced by the drug treatment.
The statistical significance of an EGEM score was determined by a t-
test against 100 permutations. The EGEM scores were
kept only if the associated p-values were less than 0.05 and other-
wise were set to zero. We constructed an EGEM matrix by calcu-
lating the pairwise EGEM score between each compound and each
knockdown gene. We assumed that both the positive and negative
EGEM scores followed normal distributions. We also assumed that
the EGEM matrix was sparse by observing the fact that, among the
3000 proteins, a compound usually only targets a limited number
of them. Hence, we kept the EGEM scores with one-sided p-
values less than 0.05. Other scores were set to zero.
We constructed an EGEM matrix $A \in \mathbb{R}^{n \times m}$ involving n driver
genes and m compounds by the pairwise calculation of EGEM
scores. Thus, the impacts of these compounds were delineated
using the 3000-target-gene reference library.
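
A small sketch of the EGEM computation of Eq. 1 follows. It assumes the compound-treated landmark genes arrive as two pre-sorted lists (ascending and descending LFC) and the knockdown DEGs as two sets, and it omits the permutation-based significance test.

```python
def egem(ranked_up, ranked_down, degs_up, degs_down):
    """EGEM score (Eq. 1). `ranked_up`/`ranked_down` are the landmark
    genes sorted ascendingly/descendingly by compound-induced LFC;
    `degs_up`/`degs_down` are the up/down DEGs of one knockdown."""
    n = len(ranked_up)
    t1, t2 = len(degs_up), len(degs_down)

    def ks_term(ranked, degs):
        positions = sorted(ranked.index(g) + 1 for g in degs)  # 1-based p(i)
        return max((i + 1) / (t1 + t2) - (p - (i + 1)) / (2 * n - t1 - t2)
                   for i, p in enumerate(positions))

    return ks_term(ranked_up, degs_up) + ks_term(ranked_down, degs_down)

genes = [f"g{i}" for i in range(20)]
# Perfect mutual enrichment: up-DEGs at the top, down-DEGs at the bottom.
print(round(egem(genes, genes[::-1], {"g0", "g1"}, {"g19"}), 3))  # ~1.0
```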

3.2.2 Compound Signature Discovery by csNMF

A “compound signature” is defined as a group of small-molecule
compounds sharing similar target genes. We developed a novel
method, the constrained sparse nonnegative matrix factorization
(csNMF), an NMF approach regularized by both the protein–pro-
tein interaction constraint and the sparseness constraint, to effec-
tively detect biomedically meaningful compound signatures from the
large EGEM matrix. NMF [10] is a matrix decomposition method
widely used in pattern recognition and has demonstrated its ability
in solving various biclustering problems in bioinformatics, includ-
ing gene pattern recognition, disease module detection, and phe-
notype classification [11]. Canonically, a nonnegative EGEM
matrix A ∈ Rn  m would be decomposed into two nonnegative
matrices W and V, so that A  WV, where W ∈ Rn  k was the
weight matrix of target genes, V ∈ Rk  m was the clustering matrix
of compounds, and k  min (m, n) was the number of co-clusters.
Both weight matrices would be later used to identify the k co-
clusters. We extended the canonical NMF approach to detect bio-
medically meaningful co-modules of both compounds and target
genes, in which drugs showed similar associations with target genes
according to the compound–target EGEM scores. The overall
objective function used to solve the csNMF was shown in Eq. 2,
where the first addition item is regarded as simultaneous clustering,
the second and third ones as sparseness constraints, and the fourth
one as PPI constraint:
$$\min_{W_s,W_r,V} f(W_s,W_r,V;P)=\frac{1}{2}\sum_{i\in\{s,r\}}\|A_i-W_iV\|_F^2+\eta\sum_{i\in\{s,r\}}\|W_i\|_F^2+\beta\sum_{j=1}^{m}\|V(:,j)\|_1^2+\lambda\,\mathrm{tr}\!\left((W_s+W_r)^T(D-P)(W_s+W_r)\right)\qquad(2)$$
The csNMF was optimized using the multiplicative
algorithm [10].
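
For orientation, the plain Lee-Seung multiplicative updates [10] on which csNMF builds are sketched below; the sparseness and PPI penalty terms of Eq. 2 would modify these update rules and are omitted here.

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for A ~ W V under Frobenius loss.
    The regularizers of Eq. 2 would add extra terms to the numerators
    and denominators of these updates."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k))
    V = rng.random((k, m))
    for _ in range(iters):
        V *= (W.T @ A) / (W.T @ W @ V + eps)  # update clustering matrix
        W *= (A @ V.T) / (W @ V @ V.T + eps)  # update weight matrix
    return W, V

A = np.random.default_rng(1).random((30, 10))  # stand-in for an EGEM matrix
W, V = nmf_multiplicative(A, k=3)
print(round(float(np.linalg.norm(A - W @ V)), 3))  # reconstruction error
```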

3.2.3 Simultaneous Clustering of Positive and Negative EGEM Scores

A co-module consisted of both positive and negative EGEM scores
as long as they were significant and consistent across compounds in
the same module, but canonical NMF approaches could only accept
nonnegative values. To simultaneously handle both positive and
negative EGEM scores, from the original EGEM matrix A, we
extracted the positive EGEM scores into the similar EGEM Matrix
AS and the absolute values of the negative EGEM scores into the
reverse EGEM Matrix AR, both of the same dimensions as A. Both
the two EGEM matrices were presented in the overall objective
function above and were simultaneously optimized during iterative
NMF model training. The corresponding weight matrices of posi-
tively and negatively associated target genes, Ws and Wr, respec-
tively, were achieved at each iteration step and were merged after
optimization.

3.2.4 Sparseness Constraint

We introduced a sparseness constraint according to the sparse NMF
(sNMF) method proposed by Kim [12]. In sNMF, the L1 norm
constraint is added to V, and a Frobenius norm term on W is added to balance the accuracy
of the optimization and the sparseness of V. The rationale was that
the elements clustered into the co-modules should be a small
portion of the matrix. The sparseness constraint was necessary
when biclustering a very large EGEM matrix.

3.2.5 PPI Constraint

We introduced protein–protein interaction (PPI) constraints
according to the PPI database [13] to emphasize clusters that
were biologically meaningful and thereby control false discov-
ery. The rationale was that in the cellular regulatory network,
perturbations of some up- and downstream proteins (“peers”) of
a protein targeted by the compound often also showed similar
changes in gene expression patterns. In the PPI constraint compo-
nent in Eq. 2, P was the PPI prior matrix, and D was a diagonal
matrix, with each row as the sum of the corresponding row of P.
The PPI constraint significantly improved both the specificity and
the sensitivity of the NMF approach in compound signature dis-
covery. On one hand, false-positive signature genes were often
sporadically distributed in the PPI network [14], and thus their
weights were downgraded and they were more likely to be excluded. On the other
hand, if in the PPI network a group of “neighbor” genes showed
consistent but only moderate EGEM scores with a compound, they
were more likely to be clustered as signature genes of this com-
pound. Introducing prior knowledge of the PPI network to the
NMF approach thus contributed to more reliable discovery of
compound signatures.

3.3 Step III: Compound Signature Analysis

In order to examine the biomedical relevance and the pharmaceu-
tical potentials of the detected compound signatures, we proposed
quadruple models to reveal the molecular events associated with
compound signatures.
A compound impacts the functions of its target proteins
directly or indirectly, triggers regulatory networks, alters the activ-
ities of downstream transcription factors, and thus changes the
gene expression patterns. To reveal such an underlying mechanism
of signatures, our proposed quadruple model gives the included
compound, its direct and indirect targets, downstream transcrip-
tion factors, and affected genes, which are shown in Fig. 4. In
Fig. 4, transcription factors for each signature were identified from
the signature-associated genes using ChIP enrichment analysis (ChEA),
setting a p-value threshold of less than 0.05
Fig. 4 Quadruple models and Signature 2 in a kinome inhibitor study. A quadruple model simultaneously
includes a compound, its targets, related transcription factors, and the resulting gene expression pattern. This
compound signature discovery method (red) can detect similar quadruples (blue). The similar quadruples
include four compounds with similar target sets. 90 of the 109 related TFs of the quadruples are covered using the
enriched TFs of signature genes

and ratios of the interacting genes to all genes that exceeded 0.1
[15]. The quadruples of compound signatures were thus con-
structed. The biomedical relevance of a typical signature (Signature
2) was validated by comparing the predicted transcription factors
from signature target genes with the enriched transcription factors
derived from the direct measurement of kinase targets of four
kinase inhibitors (ALW-II-38-3, ALW-II-49-7, QL-XI-92, and
CP724714) in this signature [16].
The compound signatures were composed of compounds and
their associated target genes. Compounds in a given signature
shared similar target genes and thus perturbed the cell functions
in similar ways for the corresponding cancer cell line. If some had
already demonstrated effectiveness for this type of cancer, other
compounds in this signature were more likely to be promising drug
candidates. We used FDA-approved chemotherapy drugs for breast
cancers to identify breast cancer-specific compound signatures and
examined the drug potentials of corresponding drugs [17]. Func-
tions of the signature also could be revealed by enrichment of
functions among these target genes. Signatures that demonstrated
anti-oncological functions [18] such as the reduced cell prolifera-
tion, increased cell death, and induced apoptosis were more likely
to be seen in potential drugs. We utilized the DAVID gene func-
tional annotation tool [19] to annotate functions of compound
signatures and identify antitumor signatures.
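
The enrichment filtering described in this section can be illustrated with a hypergeometric test, a standard stand-in for the DAVID/ChEA calculations; the gene counts below and the reading of the 0.1 ratio as overlap-to-signature are illustrative assumptions.

```python
from scipy.stats import hypergeom

def term_enriched(n_universe, n_term, n_signature, n_overlap,
                  p_cutoff=0.05, ratio_cutoff=0.1):
    """Keep a GO term (or TF target set) if the hypergeometric p-value
    is below 0.05 and the overlap ratio exceeds 0.1, mirroring the
    thresholds described in the text (one reading of the 0.1 ratio)."""
    # P(X >= n_overlap) when drawing n_signature genes from the universe
    p_value = hypergeom.sf(n_overlap - 1, n_universe, n_term, n_signature)
    return p_value < p_cutoff and (n_overlap / n_signature) > ratio_cutoff

# Illustrative counts: 9 of 40 signature genes hit a 120-gene term
# within a 3341-gene universe.
print(term_enriched(n_universe=3341, n_term=120, n_signature=40, n_overlap=9))
```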
3.4 Sample Application of Compound Signature Mining

3.4.1 Signatures and Quadruples for Kinase Inhibitors

We used the kinase inhibitor dataset to validate the concept of the
compound signatures discovered by the EGEM-based csNMF
approach. We chose this dataset because some kinase inhibitors
had been experimentally profiled to identify their direct kinase
targets and thus could be used to validate the predictions of the
csNMF modeling. The 51 kinase inhibitors were analyzed against
the 3341-target gene reference library. In all, we detected eight
compound signatures (Signatures 1–8): PD173074, CP724714,
QL-X-138, PLX-4720–AZD1152, A443644–WZ3105,
WZ7043–XMD11, PD0332991, and HGS–SUCLG1.

3.4.2 Validation of Predicted Target Genes Using the Quadruple Model

Compounds that triggered similar molecular cascades might
instead share indirect targets, some of which might not be kinases.
CP724714, whose major target was HER2, did not show similar
kinase targets to the other three kinase inhibitors, but it induced a similar
change in gene expression pattern according to the EGEM matrix.
Previous literature suggests a strong co-occurrence between DDR1
and HER2 [20] in breast cancer. We thus examined whether the
four kinase inhibitors in Signature 2 (CP724714) instead shared similar down-
stream signaling pathways and affected activities of transcription
factors in the same way. The quadruple models of these four inhi-
bitors were constructed according to predicted target genes
(Fig. 4b, red) and were compared to those constructed according
to direct kinase targets from the KINOMEscan results (Fig. 4b,
blue). Among the 108 transcription factors enriched from pre-
dicted targets and the 109 from experimental targets, 90 over-
lapped. Thus, the predicted similarity between CP724714 and the
other five compounds could be explained in the quadruple models,
reflecting shared patterns of downstream transcription factor
activity.

3.4.3 Clinical Relevance of Compound Signatures

We examined the associations of the discovered compound signa-
tures with patient survival and other clinical traits. Clinical features
and gene expression profiles of 2116 breast cancer patients col-
lected from Belgium, England, and Singapore (GEO:GSE45255)
were examined by the gene set enrichment of the eight discovered
breast cancer-related compound signatures. For example, in terms
of distant metastasis-free survival, patients in the Signature 4-Low
category responded poorly to chemotherapy compared with those
in the Signature 4-High category (Fig. 5c). Signature
4 (PLX-4720–AZD1152) was selectively associated with chemo-
therapy but not hormone therapy (tamoxifen). We performed a
univariable and multivariable survival analysis using discovered
compound signatures as well as conventional clinical features
including patient age, tumor size, PAM50 as well as molecular
subtypes, lymph node involvement, the ER status, and the patho-
logical grades. The results suggested that the compound Signatures
4 and 5 (A443644–WZ3105) are strongly associated with poor
Fig. 5 Breast cancer compound signatures. (a) Eight signatures were detected (yellow rectangles). For each
signature, compounds (columns) and genes (rows) corresponding to a red region showed similar gene
expression effects, whereas those corresponding to a green region exhibited reverse effects. (b) Degree of
yellow represents the relative enrichment for the related gene ontology (GO) terms. (c) Associations of
Signature 4 with drug responses and survival in data from 2116 breast cancer patients collected from
Belgium, England, and Singapore (GEO:GSE45255)

prognosis for patients with chemotherapies but not for those with
tamoxifen treatment. The analysis results were consistent with the
drug response survival results shown in Fig. 5. Signatures also
demonstrated associations with breast cancer subtypes (Signature
2: CP724714) and receptor status (Signatures 3 (QL-X-138) and
6 (WZ7043–XMD11) with estrogen receptor status). Such associa-
tion results demonstrate the clinical potential of the compound
signatures discovered in the MCF-7 breast cancer cell line model.
Follow-up investigations could include testing the underlying
mechanisms for the poor prognosis of patients in the Signature
4-Low category, by further studying the predicted target genes
using the established Signature 4 (PLX-4720–AZD1152) quadruple
model (see Note 3).

4 Notes

1. Compound profiling using LINCS big data as the reference
library is made possible by the first large-scale application of the
L1000 platform. In the LINCS project, L1000 gene expression
profiles are collected from human cells treated with compounds
and genetic reagents. We adopt these data to reveal connec-
tions between genes and compounds and the related molecular
pathways for underlining disease states. All the data are from
15 cancer cell lines on 1000 carefully chosen landmark genes,
which can reduce the number of measurements and will not be
biased for a particular cellular model.
2. For the study of network-based compound signature discovery,
our csNMF approach was validated for 51 kinase inhibitors. We
implemented this approach to screen drug candidates for breast
cancer: 728 compounds were studied against the 3341-target-
gene reference library screened for the MCF-7 breast cancer
cell line, and eight signatures were detected. Compounds belonging
to the same signatures were grouped together. In all, eight
compound signatures were identified. To find the signatures
of related compounds that might be beneficial for breast can-
cer, we focused on functions such as induction of apoptosis and
suppression of proliferation. The enrichment of different
biological processes of signatures was investigated by DAVID
[19] according to the gene ontology (GO) terms of signature
target genes. Only terms with a p-value less than 0.05 were
considered. To define similar compound-gene effects, we con-
sidered the terms with positive regulation of cell death and
apoptosis; as to the reverse ones, we considered the negative
regulations (cancer treatment-related GO terms). Finally, Signatures
7 and 8 were found to be enriched for apoptosis.
3. For extracting compound signatures from LINCS L1000 data
for breast cancer (MCF-7) cell lines, we developed the csNMF
approach, a comprehensive and complete pipeline, for
network-based compound signature discovery and drug
screening under the target gene reference library. Predicted
similarities of drug-target genes were validated with the experi-
mental profiling. The csNMF pipeline bridges the gap between
the rich resource of the LINCS signature library and biomedi-
cal and clinical research needs. In addition, we developed
a binary linear programming (BLP) approach to infer the best-
fitting cell-specific signaling pathways from perturbation-
induced topological structures. We believe that BLP can com-
plement standard biochemical drug profiling assays and shed
new light on the discovery of possible mechanisms for drug
effects.

Acknowledgment

The work was supported by the grants of NIH U01HL111560-04
(Zhou) and NIH U01CA166886-03 (Zhou).

References
1. Downey W, Liu C, Hartigan J (2010) Compound profiling: size impact on primary screening libraries. Drug Discovery World pp 81–86
2. Lehmann BD et al (2011) Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest 121(7):2750–2767
3. Duan Q et al (2014) LINCS canvas browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Res 42:W449–W460
4. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313:1929–1935
5. Peng HM, Zhao WL, Tan H, Ji ZW, Li JS, Li K, Zhou XB (2016) Prediction of treatment efficacy for prostate cancer using a mathematical model. Sci Rep 6:21599
6. Liu CL, Su J, Yang F, Wei K, Ma JW, Zhou XB (2015) Compound signature detection on LINCS L1000 big data. Mol BioSyst 11:714–722
7. Ji ZW, Wu D, Zhao WL, Peng HM, Zhao SJ, Huang DS, Zhou XB (2015) Systemic modeling myeloma-osteoclast interactions under normoxic/hypoxic condition using a novel computational approach. Sci Rep 5:13291
8. Subramanian A et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550
9. Lamb J et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935
10. Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems
11. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
12. Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12):1495–1502
13. Mering CV et al (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33(suppl 1):D433–D437
14. You ZH, Li JQ, Gao X, He Z, Zhu L, Lei YK, Ji ZW (2015) Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines. Biomed Res Int 2015:867516
15. Lachmann A et al (2010) ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics 26(19):2438–2444
16. Shao HW, Peng T, Ji ZW, Su J, Zhou XB (2013) Systematically studying kinase inhibitor induced signaling network signatures by integrating both therapeutic and side effects. PLoS One 8(12):e80832
17. Ji ZW, Su J, Wu D, Peng HM, Zhao WL, Zhou XB (2017) Predicting the impact of combined therapies on myeloma cell growth using a hybrid multi-scale agent-based model. Oncotarget 8:7647–7665
18. Gerl R, Vaux DL (2005) Apoptosis in the development and treatment of cancer. Carcinogenesis 26(2):263–270
19. Huang D et al (2007) The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 8(9):R183
20. Siddiqa A et al (2008) Expression of HER-2 in MCF-7 breast cancer cells modulates anti-apoptotic proteins Survivin and Bcl-2 via the extracellular signal-related kinase (ERK) and phosphoinositide-3 kinase (PI3K) signalling pathways. BMC Cancer 8(1):129
Chapter 16

Drug Effect Prediction by Integrating L1000 Genomic and Proteomic Big Data
Wei Chen and Xiaobo Zhou

Abstract
The Library of Integrated Network-Based Cellular Signatures (LINCS) project aims to create a network-
based understanding of biology by cataloging changes in gene expression and signal transduction. Gene
expression and proteomic data in LINCS L1000 are cataloged for human cancer cells treated with
compounds and genetic reagents. For understanding the related cell pathways and facilitating drug
discovery, we developed binary linear programming (BLP) to infer cell-specific pathways and identify
compounds’ effects using L1000 gene expression and phosphoproteomics data. A generic pathway map
for the MCF7 breast cancer cell line was built. From this map, BLP extracted the cell-specific pathways, which
reliably predicted the compounds’ effects. In this way, the potential drug effects are revealed by our models.

Key words LINCS, L1000, Binary linear programming, Drug effect, Cell-specific pathway

1 Introduction

The Library of Integrated Network-based Cellular Signatures
(LINCS) project (http://lincs.hms.harvard.edu/) aims to develop
a library of molecular signatures, based on gene expression and
other cellular changes that describe the response of different types
of cells when exposed to various perturbing agents, including siR-
NAs and small bioactive molecules [1]. Diverse high-throughput
screening approaches are applied in LINCS project to interrogate
the cells, which provide molecular changes and intuitive patterns
(gene or protein profile) of cell response for biologists. The data
acquired from these approaches were collected in a standardized,
integrated, and coordinated manner [2, 3] to promote consistency
and comparison across different cell types. The L1000 assay (Luminex-
bead detection system) aims to connect diseases with genes and
drugs at low costs. The gene expression profiles from L1000 data
are potentially useful to infer the targets of compounds.
Accompanying such a great opportunity are the new challenges
of processing and analyzing data generated from the L1000

platform. In this chapter, we describe the development of a novel approach
to infer cell-specific pathways and identify a compound’s effects
using gene expression and phosphoproteomics data under treat-
ments with different compounds [4]. Our data sources contain
L1000 gene expression profiles and P100 phosphoproteomics
data for MCF7 and PC3 cell lines. We integrated these two types
of data and proposed a binary linear programming (BLP) approach
to predict a compound’s efficacy. In our approach, the candidate
targets of compounds are firstly inferred for the purpose of creating
the generic pathway map. Then, we used BLP to optimize the
generic pathways based on the mid-stage phosphosignaling
response. Finally, we applied BLP to re-optimize the cell-specific
pathways and thus evaluate the effects of compounds. For the
validation of our approach, we applied this approach to the
MCF7 breast cancer cell line and PC3 prostate cancer cell line.
The result shows that the inferred cell-specific pathways are reliable.
Meanwhile, the prediction accuracy of a compound’s effects is high.
Generally, the proposed computational approach can shed light
on the mechanisms of a compound’s efficacy and facilitate the
drug discovery.

2 Materials

In this study, we utilized L1000 gene expression profiles and P100
phosphoproteomics data for MCF7 and PC3 cell lines treated with
15 compounds (compounds 1–15): fulvestrant, pacli-
taxel, doxorubicin, GW-8510, daunorubicin, irinotecan, scriptaid,
anisomycin, valproic acid, digoxin, geldanamycin, trichostatin A,
MS-275, staurosporine, and digoxigenin. The 15 compounds were
viewed as 15 sample conditions in the experiment data. The first
11 compounds were used to optimize the cell-specific pathways.
The remaining four compounds were used to predict treatment
effects. We screened the gene expression profiles for 11 compounds
and 3712 shRNA-perturbations from the L1000 database to infer
potential targets. We also screened two subsets from P100 phos-
phoproteomics data. The first subset with 11 compounds was
employed to optimize the cell-specific pathways via our BLP
approach. The second subset with four compounds was used to
evaluate treatment effects by re-optimizing the inferred cell-specific
pathways via BLP.
L1000 data were downloaded and processed as normalized
log2-fold change value (http://cmap.github.io/l1ktools). The
raw data of P100 (log2 ratio of treatment to control) were con-
verted to binary values (0 or 1) according to the sign of raw data,
where 1 corresponds to the fully activated state and 0 to no activa-
tion. In addition, if the targets or other co-regulators (some key
proteins inhibited or activated after treatment) of some compounds
were already validated in the literature, this prior knowledge was
presented as constraints in our BLP approach (see Note 1).
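
The sign-based binarization of the P100 data described above amounts to a single thresholding step; the array below is a hypothetical phosphosite-by-compound block of log2 ratios.

```python
import numpy as np

# Hypothetical phosphosite x compound block of log2(treatment/control).
log2_ratio = np.array([[0.8, -0.2, 1.5],
                       [-1.1, 0.3, -0.4]])

# 1 = fully activated state, 0 = no activation, thresholded on sign.
binary_state = (log2_ratio > 0).astype(int)
print(binary_state)
```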

3 Methods

The workflow of the whole approach includes three steps, which are
shown in Fig. 1.
In step one, we inferred the potential targets of the compounds
from L1000 gene expression profiles and information from the
literature and then created the corresponding downstream path-
ways of those inferred targets by integrating the PPI, transcriptional
factor, and KEGG pathway database [5]. We also searched the
pathways related to MCF7 and PC3 cell lines in IPA (http://
www.ingenuity.com) and the literature [6–8]. After that, we
integrated these pathways with inferred targets and their down-
stream pathways together to construct a generic pathway map. In

Fig. 1 The flow chart of the proposed approach to infer a cell-type specific pathway map and to identify a
compound’s effects
step two, the optimized cell-specific pathways were obtained by
fitting the P100 data to the generic pathway map with our BLP
approach. Finally in step three, we applied BLP to re-optimize the
cell-specific pathways to identify a compound’s effects by discern-
ing topological alterations in the pathway map (see Note 2).

3.1 Binary Linear Programming (BLP)

Here, we describe how the Boolean model can be reformulated as a
BLP to optimize the cell-specific pathways. Two reports in the
literature [9, 10] used a Boolean model to optimize the generic
pathway map under the stimulation of combined different cyto-
kines. However, their models are designed only for phosphopro-
teomics data in the early stage of signal transduction. For inferring a
cell-specific pathway map using the P100 database with only one
time point (6 h), we assume a virtual time point (t) before 6 h,
which represents most of enzyme’s activities reach to saturation
conditions after phosphorylation at time t. The cell-specific path-
ways inferred by our BLP approach correspond to the topological
structure in the saturation condition in early response. The
observed time point (6 h) could be represented as t + 1, indicating
the mid-stage signaling response. In our BLP approach, we employ
binary variables to describe the phosphorylation states of enzymes
and the reactions (activated or inhibited). We also use binary linear
constraints to model the relationship between the early response at
t and mid-stage response at t + 1. According to the concept of the Hill
function [11], there are three scenarios for the state of enzyme x at
time t and t + 1 in our BLP approach:
(a) Equation 1 suggests that x is activated by its upstream enzyme
in the early stage and reaches a steady state at time t, and its
activity is unchanged until time t + 1.

x(t) = x(t + 1) = 1        (1)

(b) If the state of enzyme x at time t and t + 1 can be presented as
Eq. 2, x is activated in the early stage (its activity reaches steady
state), and then enzyme x is gradually degraded so that its
activity is very low at time t + 1.

x(t) = 1; x(t + 1) = 0        (2)

(c) When treated with a compound, enzyme x is inhibited at time
t, and its activity will be sustained to time t + 1 (Eq. 3).

x(t) = x(t + 1) = 0        (3)
The states of all proteins at time t will completely satisfy the
causal relationships in our constraint set. The change of states for
each measured phosphoprotein is also considered between two
time points. A pathway is defined as a set of reactions {1, 2, . . ., i, . . ., n_r}
and species {1, 2, . . ., j, . . ., n_s}. Each reaction has three
corresponding index sets: signaling molecules R_i, inhibitors I_i,
and products P_i. These sets are subsets of the species index set
(R_i, I_i, P_i ⊆ {1, 2, . . ., n_s}). Our binary linear constraint system
can address four types of linking patterns to represent the relation-
ships between species and reactions, since an actual signaling path-
way has many types of topological structures. Due to the different
mechanisms of compounds, a reaction will take place after treat-
ments with some compounds but not others. The goal of the
proposed formulation is to remove the redundant and inconsistent
reactions which do not occur with any compounds. For a generic
pathway map, a set of experiments (indexed as {1, . . ., n_e}) can be
performed. Each experiment indicates a treatment condition of one
compound on the pathway. In the kth experiment (1 ≤ k ≤ n_e), a
subset of species is activated and another subset is inhibited, sum-
marized by the index sets M^{k,1} and M^{k,0}, respectively. In addition, a
third subset of the species is measured at the phosphorylation
level (M^{k,2}). In our BLP system, a binary variable x_j^k(t) ∈ {0, 1}
indicates whether species j is activated (x_j^k(t) = 1) or not (x_j^k(t) = 0)
at the time point t in the kth experiment. The variable z_i^k indicates whether
reaction i takes place (z_i^k = 1) or not (z_i^k = 0) in the kth
experiment. The species set TS^k denotes the potential targets of
the kth compound.
If the phosphorylation level of species j is measured at time t + 1
in P100 data, the measurement of species j is defined as x_j^k(t + 1)
and its predicted value is denoted x̂_j^k(t + 1). For inferring the cell-
specific pathways, we use our BLP approach to optimize two objec-
tive functions. The first is to minimize the difference between
predicted values and measurements (Eq. 4):
$$\min_{X,Z}\ \sum_{k=1}^{n_e}\ \sum_{j\in M^{k,2}}\left|\hat{x}_j^k(t+1)-x_j^k(t+1)\right|\qquad(4)$$
The second objective, $\min_{Z}\sum_{i=1}^{n_r} z_i^k$, is to minimize the number of
reactions, so that the scale of the inferred cell-specific pathways is
smaller. Equation 4 is equivalent to Eq. 5.
$$\min_{X,Z}\ \sum_{k=1}^{n_e}\ \sum_{j\in M^{k,2}}\left(1-2x_j^k(t+1)\right)\hat{x}_j^k(t+1)\qquad(5)$$

Here, we use a linear solution to simultaneously optimize the
two objectives above:
$$\min_{X,Z}\left[\sum_{k=1}^{n_e}\sum_{j\in M^{k,2}}\left(1-2x_j^k(t+1)\right)\hat{x}_j^k(t+1)+\gamma\sum_{i=1}^{n_r} z_i^k\right]\qquad(6)$$
292 Wei Chen and Xiaobo Zhou

In Eq. 6, the parameter γ indicates the weight between two


objective functions. The selection of values of γ obviously affects
the fitting precision of our model on experimental data. The
detailed constraints are listed in our previous literature [4]. For
the fitting precision (FP), we calculated the percentage of fit (pre-
diction accuracy) as below:
 
 k 
Xns x^ j ðt þ 1Þ  x kj ðt þ 1Þ
FP ¼  k, 2   100% ð7Þ
M 
j ∈M k, 2

Equation 7 indicates the fitting precision between the predicted


values inferred by our BLP approach and the measured values of
species under the treatment of the kth compound, where |Mk,2| is
the number of species in the set Mk,2. The value of fitting precision
(FP) is in the range between 0 and 1.
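
To make the optimization concrete, the following is a minimal sketch of Eq. 6 in Python, assuming the open-source PuLP package; the species names, the single linking constraint, and the value of γ are illustrative placeholders rather than the full constraint set of [4], which was solved with MATLAB's optimization toolbox (see Subheading 3.2.2).

# Minimal sketch of the BLP objective (Eq. 6); toy species, reactions,
# measurements, and gamma. Assumes the PuLP package (pip install pulp).
import pulp

species = ["JNK", "p53", "p21"]
experiments = ["cpd1"]                                # one treatment condition
reactions = [("JNK", "p53"), ("p53", "p21")]          # activating edges
measured = {("cpd1", "p53"): 1, ("cpd1", "p21"): 1}   # x_j^k(t+1), toy values
gamma = 0.1                                           # weight between objectives

prob = pulp.LpProblem("BLP", pulp.LpMinimize)
xhat = pulp.LpVariable.dicts("xhat", (experiments, species), cat="Binary")
z = pulp.LpVariable.dicts("z", range(len(reactions)), cat="Binary")

# Objective (Eq. 6): fit the measured phosphosites; penalize kept reactions.
prob += (pulp.lpSum((1 - 2 * v) * xhat[k][j] for (k, j), v in measured.items())
         + gamma * pulp.lpSum(z[i] for i in range(len(reactions))))

# One simplified linking constraint: a retained reaction (z_i = 1) forces its
# product to be inactive whenever its upstream signaling molecule is inactive.
for k in experiments:
    for i, (src, dst) in enumerate(reactions):
        prob += xhat[k][dst] <= xhat[k][src] + (1 - z[i])

prob.solve(pulp.PULP_CBC_CMD(msg=False))
mismatches = sum(abs(xhat[k][j].value() - v) for (k, j), v in measured.items())
print("objective:", pulp.value(prob.objective), "mismatches:", mismatches)

On the toy data, minimizing the first term drives the predicted states toward the measurements, while the γ term discourages retaining reactions that no experiment requires.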

3.2 Sample Figure 2a shows the generic pathway map composed of some
Application important pathways. The estrogen pathway induces tumor growth
of Identifying in estrogen receptor-positive breast cancers [12]; the PI3K/AKT
Compound Effects pathway is an important player in cell survival [7]; the TNF signal-
ing is anticancer-related pathway [13]; and MEK/ERK pathways
3.2.1 Construction of a are usually associated with proliferation and antiapoptosis; the
Generic Pathway Map NFkB pathway is involved in many cell functions, such as cell
proliferation, cell survival, and cellular stress [14]; the JNK/p53/
p21 pathway may induce cell apoptosis [6]; and HDAC1/

Fig. 2 Boolean network topologies of the generic and inferred cell-specific pathways for the MCF7 cell line. (a)
The MCF7 generic pathway map included some important classic pathways. The edges with green color were
potential downstream pathways of some compounds. (b) After optimization via BLP, the red nodes and gray
dash lines were removed from generic pathway map so that the cell-specific pathways were obtained
Genomic and Proteomic Big Data for Dug Effect Prediction 293

BRCA1/DDB2 is a cell survival pathway [15]. In Fig. 2a, the edges


with green color are our inferred downstream pathways of some
compounds.

3.2.2 Inference of Cell- To infer cell-specific pathways based on the constructed generic
Specific Pathways by BLP pathway map, we then minimized the differences between the
measurements and the simulated values, as well as the complexity
of the signaling pathway structure’s topology. We developed the
BLP approach to optimize such multi-objective functions. The
concept behind BLP is that the states of the proteins (variables)
are normalized to binary numbers (activated state or no activation);
edges between two proteins are also represented as binary numbers
(inhibition or promotion); binary linear constraints are used to
describe the relationship between upstream and downstream pro-
teins; and the optimization is done with binarized values taken by
variables, edges, and constraints. The BLP is solved with the opti-
mization toolbox in MATLAB that guarantees minimal differences
between phosphoproteomics data and predicted data, as well as the
Boolean topology of the generic pathway map. Because P100 data
are captured at the mid-stage of signal transduction, we developed
constraints to simulate the change from early to mid-stage so that
we can still obtain many important causal relationships of phos-
phorylation. The fitting precision of the optimized cell-specific
pathways is 87.66% for MCF7, which proves that our model
works well on mid-stage phosphoproteomics data.
Figure 2b shows the inferred cell-specific pathways of MCF7.
The blue nodes are the measured phosphoproteins, while the red
nodes and gray dotted lines were removed after optimization. After
our BLP system simultaneously optimizes the two objective
functions with the P100 data, we keep those reactions whose edges
exist for at least some compounds and remove those reactions whose
edges exist for no compound. For example, HDAC1 inhibitor-induced
JNK activation in turn activates the downstream pathways p53 and
p21 and eventually results in cell cycle arrest. Another example is
that the reaction between cJUN and p53 (cJUN decreases the
expression of p53) did not appear in the pathways induced by any
compound, so we removed this reaction, although it does exist under
certain conditions [16]. In addition, to keep the proteins at the end
of each pathway as measured phosphoproteins, all other proteins at
the end of each pathway not measured in P100 were removed (e.g.,
the (ERK AND CK2) → ELK-1 reaction in Fig. 2b). Thus, the inferred
cell-specific pathway map is smaller but contains only those elements
that fit the experimental evidence very well. When this approach was
applied to the PC3 cell line, the goodness of fit of the inferred
cell-specific pathways was 90.91%.
3.2.3 Identification of Compounds' Effects

We applied the inferred MCF7-specific pathways to four compounds:
trichostatin A, MS-275, staurosporine, and digoxigenin. The first
two compounds are HDAC inhibitors, and staurosporine promotes
apoptosis. Figure 3 depicts the pathway's topological alterations
for these compounds. Trichostatin A, an HDAC inhibitor, removes
the following branches (subsets of the pathways): some downstream
paths of IGF1R and EGFR, TNFR → TRAF2 → TAK1, AKT → p53,
HDAC1 → BRCA1 → DDB2, CycE/CDK2 → Rb-E2F (phosphosite Thr373),
and CycD/CDK4 → Rb-E2F (phosphosite Thr826). Figure 3a suggests
that HDAC1 and its downstream pathway are blocked, which may cause
cell growth arrest. The p53 signal is up-regulated after the
activation of JNK. In the meantime, p21 was activated by p53, which
then induced cell cycle arrest by inhibiting phosphorylation of the
Rb-E2F complex triggered by CycE/CDK2 and CycD/CDK4 [17]. In
addition, up-regulated Fas activated by p53 will potentially promote
cell apoptosis. MS-275 [18], also a specific HDAC inhibitor, altered
the pathway topology in a pattern similar to trichostatin A. MET was
inhibited by trichostatin A but activated by MS-275 (Fig. 3a, b).
These two compounds induced similar changes in most key proteins.
In Fig. 3b, p21 activation inhibited phosphorylation of the Rb-E2F
complex and blocked the disassociation of this complex. Therefore,
our results suggest that trichostatin A and MS-275 block cell growth
in the MCF7 cell line. With regard to staurosporine, a pro-apoptotic
compound, two opposing effects were detected from the pathway's
topological alterations: p21 clearly blocked the cell cycle by
inhibiting the phosphorylation of the Rb-E2F complex, while
DDB2-induced cell growth also occurred via activated HDAC1 (Fig. 3c).
Figure 3d shows that digoxigenin induces the activation of HDAC1 and
inhibition of p53. Digoxigenin blocked the reaction JNK → p53, so
that p21 was also inactivated. DDB2 was activated by HDAC1 through
BRCA1. The absence of p21 permits phosphorylation of the Rb-E2F
complex, which increases the chance for the disassociated
transcription factor E2F to promote transcription and cell growth.
Therefore, our findings indicate that digoxigenin potentially induces
cell cycle progression and promotes cell growth in the MCF7 cell
line (see Note 3).

4 Notes

1. Compound profiling using LINCS big data as the reference


library is made possible by the first large-scale application of the
L1000 platform. In the LINCS project, L1000 gene expression
profiles are collected from human cells treated with compounds
and genetic reagents. We adopt these data to reveal connec-
tions between genes and compounds and the related molecular
Genomic and Proteomic Big Data for Dug Effect Prediction 295

Fig. 3 The compound-induced topological alterations in the MCF7-specific pathways revealed by BLP. The
treatment effects of four compounds on MCF7 cell line were shown in the figure. Red arrows denote that these
reactions were blocked after treatment with compounds. (a) Compound Trichostatin A; (b) Compound MS-275;
(c) Compound Staurosporine; (d) Compound Digoxigenin

pathways for underlining disease states. All the data are from
15 cancer cell lines on 1000 carefully chosen landmark genes,
which can reduce the number of measurements and will not be
biased for a particular cellular model.
2. We developed a binary linear programming (BLP) approach to
infer the best-fitting cell-specific signaling pathways from
perturbation-induced topological structures. We believe that BLP
can complement standard biochemical drug profiling assays and
shed new light on the discovery of possible mechanisms for drug
effects.
3. For the identification of compound effects in cell-specific
pathways, we developed a binary linear programming (BLP)
approach to optimize generic pathways and identify a compound's
effects on an inferred cell-specific pathway map by integrating
gene expression profiles and phosphoproteomics data collected
from different types of perturbations. To construct the generic
pathways, we combined pathway information from the literature
with the potential targets of compounds inferred from gene
expression profiles under perturbations. In the cell-specific
pathways, we monitored four cases of compound-induced
topological alterations to predict a compound's effects using
BLP. Compared to other phosphoproteomics-based and mass
spectrometry-based target identification approaches, which use
compound affinities measured by in vitro or in vivo assays, our
method uses perturbation-induced gene expression profiles to
infer the potential targets and downstream paths [19, 20]. A
generic pathway map can then be created from pathways in the
literature and the inferred targets. Our BLP approach is then
used to monitor alterations in pathway topologies and to
evaluate a compound's effects.

Acknowledgment

The work was supported by the grants of NIH U01HL111560-04


(Zhou) and NIH U01CA166886-03 (Zhou).

References
1. Hongwei Shao TP, Ji Z, Jing S, Zhou X (2013) binary linear programming. PLoS One 9(7):
Systematically studying kinase inhibitor e102798
induced signaling network signatures by inte- 5. Ogata H, Goto S, Fujibuchi W, Kanehisa M
grating both therapeutic and side effects. PLoS (1998) Computation with the KEGG pathway
One 8(12):e80832 database. Biosystems 47:119–128
2. Saez-Rodriguez J, Goldsipe A, Muhlich J, 6. Perfettini JL, Castedo M, Nardacci R,
Alexopoulos LG, Millard B et al (2008) Flexi- Ciccosanti F, Boya P et al (2005) Essential
ble informatics for linking experimental data to role of p53 phosphorylation by p38 MAPK in
mathematical models via DataRail. Bioinfor- apoptosis induction by the HIV-1 envelope. J
matics 24:840–847 Exp Med 201:279–289
3. Hendriks BS, Espelin CW (2010) DataPflex: a 7. Su JS, Woods SM, Ronen SM (2012) Meta-
MATLAB-based tool for the manipulation and bolic consequences of treatment with AKT
visualization of multidimensional datasets. Bio- inhibitor perifosine in breast cancer cells.
informatics 26:432–433 NMR Biomed 25:379–388
4. Ji ZW, Su J, Liu CL, Wang HY, Huang DS, 8. Xue LY, Chiu SM, Oleinick NL (2003)
Zhou XB (2014) Integrating genomics and Staurosporine-induced death of MCF-7
proteomics data to predict drug effects using human breast cancer cells: a distinction
between caspase-3-dependent steps of
Genomic and Proteomic Big Data for Dug Effect Prediction 297

apoptosis and the critical lethal lesions. Exp 15. Wang GL, Salisbury E, Shi X, Timchenko L,
Cell Res 283:135–145 Medrano EE et al (2008) HDAC1 promotes
9. Mitsos A, Melas IN, Siminelakis P, Chairakaki liver proliferation in young mice via interac-
AD, Saez-Rodriguez J et al (2009) Identifying tions with C/EBPbeta. J Biol Chem
drug effects via pathway alterations using an 283:26179–16187
integer linear programming optimization for- 16. Schreiber M, Kolbus A, Piu F, Szabowski A,
mulation on Phosphoproteomic data. PLoS Mohle-Steinlein U et al (1999) Control of cell
Comput Biol 5:e1000591 cycle progression by c-Jun is p53 dependent.
10. Saez-Rodriguez J, Alexopoulos LG, Genes Dev 13:607–619
Epperlein J, Samaga R, Lauffenburger DA 17. Coulonval K, Bockstaele L, Paternot S, Roger
et al (2009) Discrete logic modelling as a PP (2003) Phosphorylations of cyclin-
means to link protein signalling networks with dependent kinase 2 revisited using
functional analysis of mammalian signal trans- two-dimensional gel electrophoresis. J Biol
duction. Mol Syst Biol 5:331 Chem 278:52052–52060
11. Mather W, Bennett MR, Hasty J, Tsimring LS 18. Rosato RR, Almenara JA, Grant S (2003) The
(2009) Delay-induced degrade-and-fire oscilla- histone deacetylase inhibitor MS-275 pro-
tions in small genetic circuits. Phys Rev Lett motes differentiation or apoptosis in human
102:068105 leukemia cells through a process regulated by
12. Giacinti L, Giacinti C, Gabellini C, Rizzuto E, generation of reactive oxygen species and
Lopez M et al (2012) Scriptaid effects on induction of p21(CIP1/WAF1). Cancer Res
breast cancer cell lines. J Cell Physiol 63:3637–3645
227:3426–3433 19. Opiteck GJ, Scheffler JE (2004) Target class
13. Rodriguez-Berriguete G, Fraile B, Paniagua R, strategies in mass spectrometry-based proteo-
Aller P, Royuela M (2012) Expression of NF- mics. Expert Rev Proteomics 1:57–66
kappaB-related proteins and their modulation 20. Chiara DG, Marcocci ME, Torcia M,
during TNFalpha-provoked apoptosis in pros- Lucibello M, Rosini P et al (2006) Bcl-2 phos-
tate cancer cells. Prostate 72:40–50 phorylation by p38 MAPK - identification of
14. Courtois G, Gilmore TD (2006) Mutations in target sites and biologic consequences. J Biol
the NF-kappa B signaling pathway: implica- Chem 281:21353–21361
tions for human disease. Oncogene
25:6831–6843
Chapter 17

A Bayesian Network Approach to Disease Subtype Discovery

Mei-Sing Ong

Abstract
Human diseases are historically categorized into groups based on the specific organ or tissue affected. Over
the past two decades, advances in high-throughput genomic and proteomic technologies have generated
substantial evidence demonstrating that many diseases are in fact markedly heterogeneous, comprising
multiple clinically and molecularly distinct subtypes that simply share an anatomical location. Here, a
Bayesian network analysis is applied to study comorbidity patterns that define disease subtypes in pediatric
pulmonary hypertension. The analysis relearned established subtypes, thus validating the approach, and
identified rare subtypes that are difficult to discern through clinical observations, providing impetus for
deeper investigation of the disease subtypes that will enrich current disease classifications. Further advances
linking disease subtypes to therapeutic response, disease outcomes, as well as the molecular profiles of
individual subtypes will provide impetus for the development of more effective and targeted therapies.

Key words Bayesian network analysis, Disease subtype, Pulmonary hypertension

1 Introduction

Classification of diseases into groups with similar pathobiology,


prognosis, and therapeutic response is the bedrock of the practice
of medicine. Disease taxonomies not only determine how diagnosis
and treatment choices are made; they form the basis for knowledge
generation and drug discovery. Historically, diseases are categor-
ized into groups based on the specific organ or tissue affected. Over
the past two decades, advances in high-throughput genomic and
proteomic technologies have generated substantial evidence
demonstrating that many diseases are in fact markedly heteroge-
neous, comprising multiple clinically and molecularly distinct sub-
types that simply share an anatomical location. The discovery of
previously unrecognized disease subtypes has led to the advance-
ment of new therapy and enhanced ability to target existing treat-
ments to subsets of patients who are most likely to benefit
from them.

Richard S. Larson and Tudor I. Oprea (eds.), Bioinformatics and Drug Discovery, Methods in Molecular Biology, vol. 1939,
https://doi.org/10.1007/978-1-4939-9089-4_17, © Springer Science+Business Media, LLC, part of Springer Nature 2019

299
300 Mei-Sing Ong

While the increased availability of large clinical datasets presents


an unprecedented opportunity to better delineate disease subtypes,
the volume and complexity of these data pose substantial analytic
challenges. The realization that traditional reductionist
approaches—the practice of reducing biological entities to the
sum of their constituent parts—fall short in accounting for the
complex interactions of biological entities and processes contribut-
ing to a disease state has led to the development of systems biology
approaches that emphasize the study of biological systems as a
whole. A fundamental tenet of systems biology is that cellular and
organismal constituents are interconnected; thus their structure
and dynamics must be examined collectively rather than as isolated
parts [1–3]. Network science—a branch of mathematics and theo-
retical physics—has emerged as the methodological foundation of
systems biology.
This chapter describes the application of a network approach to
drive the discovery of disease subtypes, using pulmonary hyperten-
sion (PH) as an exemplar disease. The following subsections pro-
vide a brief overview of network medicine (Subheading 1.1) and a
description of the clinical problem that motivated the current study
(Subheading 1.2). The remainder of the chapter is organized as
follows: Subheading 2 describes the study dataset; Subheading 3
presents the methodology; and Subheading 4 discusses the research
findings and addresses challenges and opportunities to advance the
field.

1.1 The Emergence The whole is more than the sum of its parts—Aristotle
of Network Medicine Network science has emerged in diverse disciplines as a means of
analyzing complex relational data. Building on graph theory, a
network is a graphical model comprising a set of nodes and con-
nections between them. Network nodes represent random variables
to be modeled; network edges connecting nodes can have many
interpretations ranging from physical interactions to mathematical
relationships. Combinatorial analysis of these relationships using
graph theoretics can uncover structure and patterns, including both
local properties and global phenomena, providing insights into the
characteristics of complex systems (Box 1).
In the context of biological networks, network nodes represent
biological entities (e.g., genes, proteins, diseases), and edges
between nodes represent biological relationships (e.g., gene corre-
lations, protein-protein interactions, functional associations). Pub-
lished studies have shown that networks operating in biological
systems are not random, but are characterized by a set of organizing
principles such that network topology and connectivity can convey
biologically important information about the network entities
[4–10]. For example, analysis of gene co-expression networks,
whereby genes are modeled as nodes and edges between nodes
represent gene correlations, has allowed us to address some
fundamental properties of disease genes. Studies have shown that
genes that display prominent connectivity patterns (known as "hub
genes") tend to play biologically influential or regulatory roles in
disease-related processes [11–15], and the deletion of hub genes
leads to a larger number of phenotypic outcomes than deletion of
other genes [16]. Highly co-expressed genes are also more likely to
be co-regulated, and collections of highly connected genes have been
shown to cause the same or physiologically similar diseases [17–21].
Networks of genes can therefore be exploited to elicit new disease
genes, by identifying nodes that have not been known to affect a
particular disease but are tightly connected to disease-causing
genes. Indeed, published studies on the successful application of
network approaches in identifying novel disease genes abound. Other
types of biological networks include protein-protein interaction
networks, metabolic networks, and phenotype networks.

Box 1: Elements of a Network
Nodes: random variables.
Edges: relationships between random variables.
Cluster: a group of tightly connected nodes forming a subgraph.
Hub: a highly connected node in the network.
While network-based approaches are widely used to model molecular
networks, few studies have applied them to study comorbidity
patterns that define disease subtypes in complex diseases. Disease-
causing defects often initiate cascades of failure that lead to the
co-occurrence of multiple diseases in a patient, such as heart dis-
ease, diabetes, and obesity. Indeed, many genetic disorders are
characterized by syndromes comprising a collection of disease
traits. Thus, discovering disease patterns can potentially open new
avenues for understanding disease mechanisms. In one study,
comorbidity networks modeling disease co-occurrence have been
shown to capture phenotypic differences between patients with dif-
ferent demographic backgrounds, disease progression, and mortality
[22]. This chapter discusses the feasibility of applying a network-based
approach to derive clinically meaningful disease subtypes, with an
emphasis on childhood-onset pulmonary hypertension.

1.2 Pulmonary Pulmonary hypertension (PH)—a condition defined by an elevated


Hypertension: A mean pulmonary arterial pressure of >¼25 mmHG at rest)—is an
Multifactorial Disease exemplar disease driven by multifactorial causes. The disease is
progressive, often with a fatal course; the estimated 5-year survival
for PH is just 53.6% [23]. Survival outcome is dependent on early
diagnosis and treatment, before pathologic changes are advanced
and irreversible. Accurate diagnosis of the underlying etiology is
also critical, as available treatments and responses to treatments
differ substantially.
The goal to optimize the treatment of PH based on its patho-
physiology has led to the development of a formal taxonomy of PH
by the World Health Organization (WHO). The taxonomy (WHO
Classification of PH) defines five distinct PH subtypes (Table 1):
pulmonary arterial hypertension (PAH) (Group 1 PH), PH due to
left heart diseases (Group 2 PH), PH due to chronic lung diseases
and hypoxia (Group 3 PH), chronic thromboembolic PH
(CTEPH) (Group 4 PH), and PH due to other multifactorial
mechanisms (Group 5 PH) [24]. This classification is widely used
in the clinical management of PH and by the US Food and Drug
Administration for the labeling of new drugs approved for
PH. Optimal treatment for each subtype differs substantially
(Table 2): targeted pharmacological therapies are currently avail-
able only for PAH; for PH due to left heart diseases or lung

Table 1
Major PH subtypes defined in the WHO classification of PH

Group 1: Pulmonary arterial hypertension (PAH)


● Idiopathic
● Heritable
● Drug and toxin-induced
● Associated with connective tissue disease, HIV, portal hypertension,
congenital heart disease, schistosomiasis, and chronic hemolytic
anemia
● Pulmonary veno-occlusive disease and/or pulmonary capillary
hemangiomatosis
● Persistent pulmonary hypertension of the newborn
Group 2: PH due to left heart diseases
Group 3: PH due to lung disease and/or hypoxia
Group 4: Chronic thromboembolic PH (CTEPH)
Group 5: PH with unclear multifactorial mechanisms
A Bayesian Network Approach to Disease Subtype Discovery 303

Table 2
Recommended treatment for each PH subtype [27]

WHO PH subtype Recommended therapy


PAH PAH-specific therapy based on disease severity, PAH
subtypes, and vasoreactivity
PH due to left heart Treatment of the underlying left heart disease
diseases
PH due to lung Treatment of the underlying lung condition, long-
diseases term oxygen therapy in patients with chronic
hypoxemia
CTEPH Surgical pulmonary endarterectomy
Unknown Unknown
mechanisms

diseases, treatment of the underlying condition is recommended;


surgical pulmonary endarterectomy is the recommended treatment
for patients with CTEPH; and the optimal treatment for Group
5 PH remains unknown. Misuse of therapies not only represents
missed opportunities to provide optimal treatment; it also exposes
patients to the potential side effects of therapies without the sup-
posed benefits. For instance, therapies that are efficacious for PAH
(e.g., pulmonary vasodilator medications) may actually worsen
pulmonary hemodynamics in PH associated with left heart disease,
the most common cause of PH [25, 26].
More recently, the realization that childhood-onset PH may
have unique etiologies and associations not often observed in adults
further prompted the development of a new taxonomy for
childhood-onset PH [28–30]. The taxonomy, known as the Pan-
ama classification (Table 3), highlighted a number of features more
prominent in pediatric PH, including fetal and developmental ori-
gins of vascular disease, and chromosomal anomalies associated
with PH [29].
While the current PH disease classifications provide a frame-
work for the diagnosis and treatment of PH, gaps remain in our
understanding of the disease, especially among children. Due to the
rarity of PH in children, comprehensive analysis of its clinical man-
ifestations is challenging. To date, published data on pediatric PH
have been limited to several registry-based and small cohort studies
[31–35]. While these studies have greatly advanced our under-
standing of the disease, they may be subject to referral bias and
may not represent the full spectrum of pediatric PH cases
[36, 37]. Furthermore, since knowledge generated from these
studies formed the bedrock of the expert consensus classifications,
current taxonomies may not capture the full spectrum of the diverse
manifestations of PH.
Table 3
Panama classification of pediatric pulmonary hypertensive vascular disease

Group 1: Prenatal or developmental pulmonary hypertensive vascular disease
Group 2: Perinatal pulmonary vascular maladaptation
Group 3: Pediatric cardiovascular disease
Group 4: Bronchopulmonary dysplasia
Group 5: Isolated pediatric pulmonary hypertensive vascular disease (isolated pediatric PAH)
Group 6: Multifactorial pulmonary hypertensive vascular disease in congenital malformation syndromes
Group 7: Pediatric lung disease
Group 8: Pediatric thromboembolic disease
Group 9: Pediatric hypobaric hypoxic exposure
Group 10: Pediatric pulmonary vascular disease associated with other system disorders

This chapter discusses a network-based approach to enrich and


extend the current classifications of childhood-onset PH, with the
goal of facilitating improved recognition of clinically relevant pat-
terns of disease manifestation that can result in meaningful
improvement in the timely diagnosis of PH in children.

2 Materials

The study data source comprises an administrative claims dataset


from a major, nationwide employer-provided health insurance plan
in the United States between January 2010 and May 2013. The
database systematically captures all the healthcare encounters of
beneficiaries, including inpatient and outpatient visits and proce-
dures. Medical diagnoses associated with healthcare visits were
coded with International Classification of Diseases, ninth revision
(ICD-9) codes; procedures were coded with the Current Proce-
dural Terminology (CPT).
The data source was chosen for several reasons. First, the
availability of a large number of patients makes it possible to iden-
tify rare but significant relationships, which may not have been
observable in traditional studies involving chart reviews or surveys.
This is particularly important in the study of rare conditions such as
childhood-onset PH. Second, administrative claims data are sys-
tematically collected and provide longitudinal information that
crosses facilities, geographical locations, and population demo-
graphics, thereby enhancing the generalizability of the research
and limiting selection biases inherent to analyses that are based on
a single institution or registry.
There are, however, limitations to the dataset. Most impor-
tantly, disease diagnoses captured in administrative claims are
based on the International Classification of Diseases (ICD) diag-
nostic codes, the controlled vocabulary employed by healthcare
providers to bill for their services. Some codes are clearly disease
states, while others may represent diagnostic workup for a condi-
tion that is later ruled out or may be coding misclassifications. To
minimize this limitation of claims-based analyses, case identification
algorithms that combine multiple diagnostic and procedure codes
are often applied to identify a disease state. For example, one study
showed that patients with systemic hypertension can be reliably
identified with high accuracy using a case identification algorithm
defined as having two or more healthcare encounters associated
with the diagnostic codes for systemic hypertension (specificity,
95%; sensitivity, 73%) [38]. Similar algorithms have been derived
and validated for a wide range of diseases to facilitate claims-based
population studies [39–43].
The same approach was applied in the current study to define
the subjects of interest. Accordingly, the study cohort comprised
patients with PH, defined as having two or more healthcare visits
associated with PH (ICD-9 416.0, 416.8, 416.9) during the study
period. The study subjects were drawn from 6,943,263 children
(<18 years of age) enrolled in the healthcare plan. A total of 1583
children met the criteria for PH (Fig. 1).

Fig. 1 Subject selection criteria


306 Mei-Sing Ong
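
As an illustration, the two-or-more-visits case definition just described can be expressed in a few lines of pandas; the table and column names below are hypothetical stand-ins for the claims data.

# Sketch of the case-identification rule; `claims` is a toy stand-in for the
# administrative claims table (column names are hypothetical).
import pandas as pd

claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "age_years":  [4, 4, 9, 15, 15, 15],
    "icd9":       ["416.0", "416.8", "416.9", "493.0", "416.0", "416.0"],
})
PH_CODES = {"416.0", "416.8", "416.9"}

children = claims[claims["age_years"] < 18]
ph_visits = children[children["icd9"].isin(PH_CODES)]
counts = ph_visits.groupby("patient_id").size()
ph_cohort = counts[counts >= 2].index.tolist()   # >= 2 PH-coded encounters
print(ph_cohort)                                 # -> [1, 3]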

3 Methods

Leveraging the diagnoses data captured in the claims database, a


Bayesian network approach was applied to discern disease subtypes
in childhood-onset pulmonary hypertension. Figure 2 provides an
overview of the approach, described in detail in the following
subsections.

3.1 Defining Comorbidity Profiles of Patients with PH

The first step involved extracting the comorbidity profile of the
study subjects. The presence of a comorbidity was defined as having
two or more healthcare encounters related to the condition, iden-
tified using ICD-9 codes. To select conditions that were most
relevant to PH, the analysis was confined to conditions that were
significantly associated with PH compared with the general popu-
lation of children without PH. Comorbidities significantly asso-
ciated with PH were systematically identified from the data source
using the chi-square statistic to compare their prevalence among
children with and without PH. The α level of 0.05 was used to

Fig. 2 Overview of study approach


A Bayesian Network Approach to Disease Subtype Discovery 307

declare statistical significance, and Bonferroni correction was


applied to control for multiple comparisons. Here, a total of
767 comorbidities were examined; thus the adjusted α level was
0.000065.
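
A sketch of this screening step, assuming scipy; the 2x2 counts are toy values, and the real analysis ran one such test for each of the 767 candidate conditions.

# Chi-square test of one candidate comorbidity with Bonferroni correction.
from scipy.stats import chi2_contingency

alpha_adj = 0.05 / 767                 # Bonferroni-adjusted alpha = 0.000065

# Rows: children with PH vs. without PH; columns: comorbidity present/absent.
table = [[40, 1543],                   # toy counts, PH cohort
         [800, 6941680]]               # toy counts, general population
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.3g}, significant={p < alpha_adj}")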

3.2 Constructing The next step involved constructing a Bayesian network to model
Bayesian Comorbidity the interdependencies of the comorbidities in children with
Network PH. The following subsections provide a brief overview of
Bayesian network and the specific techniques applied in the current
study.

3.2.1 Bayesian Network A Bayesian network is a probabilistic graphical model that repre-
Overview sents a joint probability distribution of a set of random variables
[44]. The network structure consists of nodes representing ran-
dom variables and directed edges between nodes representing
probabilistic dependencies (Box 2). The absence of a link
between two nodes signifies conditional independence. The
edge strength indicates the relative magnitude of the depen-
dency between two variables. The edge directionality, when
present, is not intended to imply causation. Rather, an edge
from node xi to xj can be interpreted as the presence of xi
“influences” the occurrence of xj.

Box 2: A Formal Definition of Bayesian Network


Consider the set of variables X ¼ {x1,. . .,xn} over which we would like to model the
joint distribution. A Bayesian network uses a directed acyclic graph, G ¼ (V, E), and a
set of conditional probability distributions, P, to represent the joint probability distri-
bution over X. Each node vi∈V corresponds to xi∈X. The edges of the graph, E,
encode dependencies among the variables and can be used to infer conditional inde-
pendence among variables.
The parents (i.e., immediate predecessors) of node xi, denoted as π i, are the set of
vi∈V such that (vi, vj)∈E. For each node in the network, there is a conditional
probability function that relates the node to its immediate parents, denoted as P(xi| π i).

The independent relationships represented by the structure of a Bayesian network


are given by the Markov condition: any node is conditionally independent of its
non-descendants, given its parents. The Markov condition permits the factorization
of a joint probability distribution on model variables X into the following product:

P ðx; . . . ; x n Þ ¼ ∏ni¼1 P ðx i jπ i Þ
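
As a toy illustration of this factorization, consider a three-node chain A → B → C with hand-set conditional probability tables (values hypothetical):

# Markov factorization of Box 2: P(a, b, c) = P(a) * P(b|a) * P(c|b).
P_A = {1: 0.3, 0: 0.7}
P_B_given_A = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.8}  # (b, a)
P_C_given_B = {(1, 1): 0.8, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.9}  # (c, b)

def joint(a, b, c):
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

print(joint(1, 1, 1))   # 0.3 * 0.9 * 0.8 = 0.216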
In the current study, the variables of interest were the comor-
bidities found to be significantly associated with PH, depicted as
network nodes; edges between nodes represent probabilistic
dependencies between comorbidities.

3.2.2 Bayesian Network The problem of learning a Bayesian network can be stated as
Learning follows: given a dataset, find a network that best represents the
dataset. The construction of a network entails a two-step process:
structure learning and parameter estimation. Structure learning
involves determining the network structure that most accurately
reflects the observed data. There are two main approaches to struc-
ture learning [45–49]:
1. Constraint-based approach applies statistical tests to establish
conditional dependence and independence relationships
among the variables in a model; these relationships form a set
of edge constraints; the algorithm then finds the best directed
acyclic graph (DAG) that satisfies the constraints.
2. Score-based approach defines a network scoring function to
evaluate the goodness of fit of candidate DAGs with respect to
the data; the method then searches over the space of DAGs for
a structure that maximizes the score.
Hybrid algorithms that combined both approaches have also
been developed, whereby constraint-based algorithms are used to
reduce the space of candidate DAGs and network scores are used to
identify the optimal DAG [50, 51].
In the current study, a score-based structure learning algorithm
was applied to search for the best-fit model. Specifically, the Bayes-
ian Dirichlet equivalent score (with equivalent sample size of ten)
[52, 53] was calculated for each candidate network to measure its
goodness of fit, and the network with the highest score was selected
(Box 3).
To search for high-scoring structures, standard heuristic search
techniques can be applied. Local heuristic search strategies are the
most commonly applied method: starting from an initial feasible
solution, an iterative search is performed to successively improve
the solution through a series of local modifications (i.e., edge
addition, deletion, and reversal) until an optimum is reached.
Such an approach often only finds local optima, which are not
necessarily the best possible solution (i.e., the global optimum). A
range of algorithms have been developed to overcome this limita-
tion, one of which is “tabu search” [54]—the algorithm used in the
current study. To explore solution space beyond local optimality,
the tabu search learning process applies an iterative local search
procedure but maintains a “tabu list” of previously visited solutions
that had been found suboptimal; solutions in the “tabu list” are
prohibited in future moves, thus permitting the search procedure
to escape local optima.
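
Off-the-shelf implementations of this search exist; the sketch below assumes the pgmpy package, whose hill-climbing estimator accepts a tabu list and a BDe score with an equivalent sample size of ten (class and argument names vary across pgmpy releases and should be treated as assumptions, not a record of how the study was run).

# Score-based structure learning with a tabu list, assuming pgmpy (in some
# releases the score class is named BDeu rather than BDeuScore).
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BDeuScore

# Toy binary comorbidity matrix: one row per patient, one column per condition.
data = pd.DataFrame({"JIA": [1, 1, 0, 0, 0, 1],
                     "HLH": [1, 0, 0, 0, 1, 1],
                     "PH":  [1, 1, 1, 0, 0, 1]})

score = BDeuScore(data, equivalent_sample_size=10)   # BDe score, ESS = 10
best_dag = HillClimbSearch(data).estimate(scoring_method=score, tabu_length=100)
print(best_dag.edges())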
Box 3: Bayesian Dirichlet Scoring Function

Given a dataset D with variables X = {x_1, ..., x_n}, the goal of
Bayesian network structure learning is to find a graph G that
maximizes a score function.

The probability of the data D conditional on the graph structure G
can be expressed as:

P(D \mid G) = \int P(D \mid G, \Theta_G) \, P(\Theta_G \mid G) \, d\Theta_G

where Θ_G denotes the graph parameters, P(D | G, Θ_G) is the
probability of the data given the network structure and parameters,
and P(Θ_G | G) is the prior probability of the parameters, assumed
to be a Dirichlet distribution with hyperparameters α (a vector of
positive real values).

It has been shown that the Bayesian Dirichlet score of a candidate
network can be quantified by summing the scores of all local
nodes [52, 53]:

Score(G) = \sum_{i=1}^{n} Score(x_i \mid \pi_i)

Score(x_i \mid \pi_i) = \sum_{j=1}^{r_{\pi_i}} \left[ \log \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right]

where i ∈ {1, ..., n}, j ∈ {1, ..., r_{π_i}}, and k ∈ {1, ..., r_i};
r_i is the number of categories of x_i, and π_i denotes the parents
of x_i; n_ijk is the count of elements in D containing both x_ik and
π_ij, with n_ij = Σ_k n_ijk; α = (α_ijk) for all i, j, k are the
Dirichlet hyperparameters, where α_ij = Σ_k α_ijk.

The Bayesian Dirichlet equivalent score assumes α_ijk = α* ·
P(Θ_ijk | G), where α* is the equivalent sample size.
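
For concreteness, the local score above can be computed directly for binary variables; the following self-contained sketch uses uniform hyperparameters (the BDe "equivalent" choice) with α* = 10:

# Direct computation of the Box 3 local score for one binary node.
import numpy as np
from scipy.special import gammaln

def bde_local_score(child, parents, ess=10.0):
    """child: 1-D 0/1 array; parents: 2-D 0/1 array, one column per parent."""
    r_i = 2                                              # categories of x_i
    q_i = 2 ** parents.shape[1] if parents.size else 1   # parent configurations
    a_ijk, a_ij = ess / (r_i * q_i), ess / q_i           # uniform Dirichlet prior
    codes = (parents.dot(2 ** np.arange(parents.shape[1]))
             if parents.size else np.zeros(len(child), int))
    score = 0.0
    for j in range(q_i):                                 # parent configuration j
        mask = codes == j
        score += gammaln(a_ij) - gammaln(a_ij + mask.sum())
        for k in (0, 1):                                 # child state k
            n_ijk = np.sum(child[mask] == k)
            score += gammaln(a_ijk + n_ijk) - gammaln(a_ijk)
    return score

rng = np.random.default_rng(0)
pa = rng.integers(0, 2, size=(100, 2))        # two hypothetical parent nodes
ch = (pa.sum(axis=1) > 0).astype(int)         # child depends on its parents
print(bde_local_score(ch, pa))                # higher (less negative) = better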

Once the best-fit network structure has been selected, the next
step in the Bayesian network construction process—parameter esti-
mation—involves finding a set of probability distribution para-
meters for the learned network that best explains the observed
data using Bayes estimation [55]. Given the learned conditional
distribution parameters, the strength of the relations between node
pairs can be estimated by assuming a noisy-OR model, whereby the
influence of each parent on a node is independent of other parents.
Accordingly, the weight assigned to a directed edge from node i to
j was quantified by calculating the conditional probability of
j given i.
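
As a toy illustration (variable names hypothetical), the weight of an edge from i to j is just the empirical conditional probability of j among patients with i:

# Edge weight as P(j = 1 | i = 1) over binary comorbidity indicators.
import pandas as pd

df = pd.DataFrame({"i": [1, 1, 1, 0, 0], "j": [1, 1, 0, 0, 1]})  # toy data
weight = df.loc[df["i"] == 1, "j"].mean()   # P(j=1 | i=1)
print(weight)                               # -> 0.666...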

3.2.3 Model Averaging In finding the “best-fit” model, over-fitting can occur when the
Technique resultant model describes random error or noise instead of the
underlying distribution of the data. To improve the statistical
robustness of the analysis, the following strategies were applied.
First, instead of building a single model, a model averaging tech-
nique [56] was used where multiple best-scored networks were
developed using 1000 subsamples of the dataset generated through
bootstrap resampling; the final model was estimated by averaging
over the highest scoring networks, such that only network edges
(i.e., comorbidity relations) that were statistically significant were
selected for inclusion [57], as described in Box 4. The parameters of
the selected edges were then estimated using the full dataset in the
final network. This technique allows the identification of network
features that are robust to perturbations of the observations
[56, 57]. Second, prior knowledge about the biology of PH
informed construction of the set of comorbidities used for devel-
oping the network. For example, it is well-established that right and
left heart diseases have distinct disease pathophysiology. Thus,
diagnosis codes pertaining to left heart disease were grouped into
a single category and diagnosis codes pertaining to right heart
disease into a separate category. Reducing the number of para-
meters relative to the number of observations also has the effect
of restricting the degrees of freedom during learning, thus resulting
in a more robust model. Third, to further reduce the complexity of
the model, the analysis considered only comorbidities that were
found, in bivariate analyses, to be significantly associated with PH,
compared with the general population without PH, and selected
those for which the lower 95% confidence interval bound for the
odds ratio was greater than five. Comorbidities that occurred in
fewer than four PH patients were also excluded, since a small
number of observations would not suffice to distinguish between
true and spurious correlations.

Box 4: Algorithm for the Identification of Statistically


Significant Features in the Comorbidity Network
Step 1: bootstrap resampling
For b ¼ 1,2,. . .m:
1. Randomly sample a new dataset Xb from the original data X.

(continued)
A Bayesian Network Approach to Disease Subtype Discovery 311

Box 4: (continued)
2. Learn the structure of the graphical model G ¼ (V, Eb) from
Xb.
In the current study, the number of iterations m was set to
1000.
Step 2: model averaging
For each edge ei learned through the bootstrap resam-
pling process, estimate the probability that it is present in the
true network structure G0 ¼ (V, E0) as:
1 X m
P^ ðe i Þ ¼ sb
m b¼1

1 if e i ∈E b
sb ¼
0 otherwise
The empirical probabilities P^ ðe i Þ are known as edge inten-
sities or arc strengths and can be interpreted as the degree of
confidence that ei is present in the network structure G0
describing the true dependence structure of the dataset.
Step 3: selection of significant edges.
Significant edges were identified by defining a threshold t,
such that only edges with edge intensity greater than t were
included in the final model. Thus:
e i ∈E b if pðiÞ > F 1
pð:Þ ðt Þ

where F 1
pð:Þ ðt Þ is the quantile function:
n o
F 1
pð:Þ ðt Þ ¼ inf x∈ℝ F p ð:Þ
ðx Þ  t

The threshold t was estimated by applying L1 norm to


approximate the ideal asymptomatic empirical F pð:Þ ðt Þ:
  Z  
 
L 1 t; pðiÞ ¼ F pð:Þ ðx Þ  F pð:Þ ðx; t Þdx

In the current study, edges with a threshold t greater than


0.50 were considered significant. A more detailed description
of the method is provided in [47].
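
A compact sketch of Steps 1-3, assuming pandas and numpy; corr_learner below is a toy stand-in for the tabu-search structure learner of Subheading 3.2.2:

# Bootstrap model averaging (Box 4): learn a structure on each resample and
# keep edges whose empirical frequency (edge intensity) exceeds t = 0.5.
import numpy as np
import pandas as pd
from collections import Counter

def corr_learner(df, thresh=0.3):
    # Toy learner: keep node pairs whose absolute correlation exceeds thresh.
    c = df.corr().abs()
    return [(a, b) for a in df.columns for b in df.columns
            if a < b and c.loc[a, b] > thresh]

def bootstrap_edges(data, learner, m=1000, t=0.5, seed=0):
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(m):
        resample = data.sample(n=len(data), replace=True, random_state=rng)
        counts.update(learner(resample))
    intensities = {e: c / m for e, c in counts.items()}   # P_hat(e_i)
    return {e for e, p in intensities.items() if p > t}   # significant edges

data = pd.DataFrame(np.random.default_rng(1).integers(0, 2, size=(200, 4)),
                    columns=list("ABCD"))
print(bootstrap_edges(data, corr_learner, m=100))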
3.3 Defining Subtypes Through Network Clustering Analysis

To define PH subtypes, the network was partitioned into subgraphs
comprising highly connected comorbidities using a strict
partitioning rule, whereby each comorbidity belongs to exactly one
cluster. The graph partitioning process involves merging nodes agglomera-
tively. Specifically, a random walk clustering approach was applied
to identify the pathways that were closest to each node in the
network [58]. The process involves a random walk on a network
for t number of steps: a walker at node i and step t randomly selects
one of its neighbors to which it hops at step t + 1; the probability of
walking from node i to node j is quantified by the weight of the
edge divided by the total weight of the edges incident to i.
Similarity between two nodes is measured by the L2 distance
between their respective transition probabilities, and cluster analysis
involves merging nodes such that the mean of the squared distances
between each node and its cluster is minimized. A more detailed
description of this approach is provided in Box 5. The length of
t must be sufficiently long to gather enough information about the
topology of the graph, but short enough to detect clusters. To
guide the choice of t, a commonly used measure known as “mod-
ularity” was applied to quantify the strength of a network division.
A positive value of modularity is indicative of the potential presence
of community structure [59]. A t that maximized modularity was
chosen; in the current analysis, t of 3 was found to be optimal. In
the cluster analysis, edges with a weight of less than 0.2 were
excluded, in order to capture the strongest relations.

Box 5: Network Clustering Algorithm


A network can be represented as an adjacency matrix A, where
nodes i and j. The degree of node i can be
Aij is the weight ofP
defined as: d ði Þ ¼ A ij .
j
A random walk process on the network is performed: at
each step t, a walker moves from one node to another chosen
randomly and uniformly among its neighbors. The transition
A
probability from node i and j can be defined as: P ij ¼ d ðijiÞ.
The probability of going from i to j through a random
walk of length t is P ijt , and the probabilityX of going from a
1
cluster C to node j in t steps is: P Ct j ¼ Pt.
jC j i∈C ij
The cluster discovery process involves the following steps:
1. Form an initial partition P1 ¼ {{v}, v ∈ V} of the graph into
n clusters, where each cluster contains a single node. Com-
pute the distance between all adjacent nodes based on the
transition probabilities of the random walk process.

(continued)
A Bayesian Network Approach to Disease Subtype Discovery 313

Box 5: (continued)
Let i and j be two nodes. The distance between the nodes
is quantified as follows:
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
u  2
u n
uX P ikt  P tjk
r ij ¼ t
k¼1
d ðkÞ

The equation can be generalized to describe the distance


between two clusters C1,C2:
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
u  2
u n
uX P Ct 1 k  P Ct 2 k
r C1C 2 ¼ t
k¼1
d ðkÞ

For each step k:


2. Choose two clusters to merge according to Ward’s method,
whereby the mean of the squared distances between each
node and its cluster is minimized:

1 X X 2
σk ¼ r :
n C∈P i∈C iC
k

3. Merge the two clusters into a new cluster C3 ¼ C1 \ C2, and


create the new partition: Pk+1 ¼ (Pk\{C1, C2}) \ {C3}.
4. Update the distances between clusters.
5. Stop algorithm after n  1 steps.
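
The procedure in Box 5 corresponds to the Walktrap algorithm of Pons and Latapy, for which off-the-shelf implementations exist; a minimal sketch, assuming python-igraph and a toy weighted network:

# Random-walk clustering via igraph's community_walktrap (Pons-Latapy).
import igraph as ig

# Toy weighted comorbidity network; edges below the 0.2 weight cutoff are
# assumed to have been excluded upstream, as in the text.
g = ig.Graph(edges=[(0, 1), (1, 2), (3, 4), (4, 5)],
             edge_attrs={"weight": [0.9, 0.8, 0.7, 0.6]})
dendrogram = g.community_walktrap(weights="weight", steps=3)   # t = 3
clusters = dendrogram.as_clustering()    # cut the dendrogram at max modularity
print(clusters.membership, clusters.modularity)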

3.4 Evaluation A review by experts evaluated the study approach by checking for
of Network-Derived the identification of established PH subtypes. Accordingly, each
Subtypes comorbidity cluster was assigned a WHO and Panama classification
subtype that best describes the cluster. For example, a comorbidity
cluster comprising portal hypertension and the associated condi-
tions would be classified as WHO Group 1 (PAH associated with
portal hypertension) and Panama Group 10 (pediatric pulmonary
vascular disease associated with other system disorders). Classifica-
tion was first performed by one researcher and then validated by
two pediatric PH experts; inter-rater agreement was quantified
using Cohen’s kappa statistic, and discrepancies were resolved
through consensus. The WHO and Panama classifications comprise
5 and 10 subtypes, respectively; the WHO classification further
categorizes some of the 5 major subtypes into 28 “minor” subtypes
(Fig. 1c). A literature review was conducted to evaluate network-
derived clusters not described in either WHO or Panama classifica-
tions to assess if published evidence supported the co-occurrence of
PH with conditions captured by each cluster.
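
As a minimal sketch of the agreement check, assuming scikit-learn and toy subtype labels from two raters:

# Cohen's kappa between two raters' subtype assignments (toy labels).
from sklearn.metrics import cohen_kappa_score

rater1 = ["WHO1", "WHO1", "WHO2", "WHO3", "WHO5"]
rater2 = ["WHO1", "WHO1", "WHO2", "WHO3", "WHO4"]
print(cohen_kappa_score(rater1, rater2))   # 1.0 would be perfect agreement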

4 Notes

Figure 3 depicts the Bayesian comorbidity network learned from
the dataset. The inferred network comprised 365 relations.

4.1 Detection Cluster analysis of the comorbidity network identified all five major
of Well-Established subtypes (kappa score, 100%) and 21 of 28 minor subtypes (kappa
Subtypes score 96%) defined in the WHO classification (Table 1), and 9 of
10 subtypes defined in the Panama classification (kappa score, 90%)
(Table 2), with a few anticipated exceptions. For example, in the
absence of pedigree and genetic data, the analysis was unable to
discern the various forms of heritable PAH, with the exception of
PH in association with hereditary hemorrhagic telangiectasia
(HHT), a condition known to associate with PAH and linked to
pathogenic variants in ALK1 and ENG genes [30]. The identifica-
tion of pathogenic drugs and toxins associated with PAH is beyond
the scope of this study. The analysis was also unable to detect PH
caused by chronic exposure to high altitude: the diagnostic code for
the condition is non-specific (ICD-9 993.2: “other and unspecified
effects of high altitude”) and was not assigned to any of the patients
in the study dataset. The imprecision of ICD codes also precluded
differentiation between left ventricular systolic and diastolic dys-
functions. Finally, subtypes associated with HIV infection or schis-
tosomiasis were not captured, since none of the PH patients in the
study data source had billing codes for these conditions.

4.2 Detection of Rare The analysis detected known rare associations with PH (Table 3).
Subtypes An example is the co-occurrence of PH with juvenile idiopathic
arthritis (JIA) and hemophagocytic syndrome. The clustering
occurrence is not surprising given that PAH has been reported in
several patients with systemic-onset JIA, particularly in association
with macrophage activation syndrome (MAS) [60, 61]. It has been
hypothesized that PAH may be caused by exposures to IL-1 and
IL-6 inhibitors used for treating systemic-onset JIA and MAS [62];
however, the underlying biology of this association remains
unknown. In the study dataset, of 18 patients with both JIA and
hemophagocytic syndrome, four patients developed PH, signifying
the potential importance of this comorbidity pattern.
The cluster comprising glycogen storage disease (GSD), hered-
itary muscular dystrophy, and cardiomyopathy typifies type 2 GSD.
While type 1 GSD has been linked to PH, the relationship between
PH and type 2 GSD is less studied. A case report noted the
development of PH resulting from respiratory muscular atrophy
and alveolar hypoventilation caused by type 2 GSD [63]. Another
report documented severe pulmonary veno-occlusive disease
(PVOD) in a patient with type 2 GSD [64].
Fig. 3 A Bayesian network eliciting comorbidity patterns in pediatric pulmonary hypertension. The figure
legend enumerates the 186 numbered comorbidity nodes, grouped into clusters under six headings: PAH;
PH due to left heart disease; PH due to chronic lung disease and/or hypoxia; CTEPH; PH due to other
mechanisms; and unclassified (comorbidities that do not belong to a cluster and do not fit a WHO subtype)
Another cluster captures the characteristic features of hetero-
taxy syndrome, including situs inversus and congenital spleen
anomaly. Of the 31 patients in the dataset diagnosed with these
conditions, four developed PH. Several limited case reports docu-
mented the disease pattern in the setting of cardiac defects and
pulmonary complications [65, 66].
The analysis identified several rare genetic disorders among the
PH population, including Cri-du-chat, Turner, and Prader-Willi
syndromes. Case reports have documented the co-occurrence of
PH in children with Cri-du-chat [67, 68] and Turner syndrome
[69, 70]. PH in these patients may be caused by underlying con-
genital heart disease. In patients with Prader-Willi syndrome,
obstructive sleep apnea and other obesity-related comorbidities
may have contributed to the development of PH. In the study
dataset, six (1.8%) of 329 patients with Prader-Willi syndrome
developed PH. However, the literature review yielded only one
case report documenting a sudden death secondary to PH in a
child with Prader-Willi syndrome [71], indicating that the risk of
PH in these patients may be under-recognized.
While some clusters clearly describe syndromes with highly specific and/or rare comorbidities, other clusters contain unusual combinations of relatively more common conditions, which may represent unrecognized syndromes and generate new hypotheses. For example, the cluster comprising adrenogenital disorder, microcephaly, and adrenal hypofunction may represent Smith-Lemli-Opitz syndrome (SLOS), a rare condition caused by deficiency of 7-dehydrocholesterol (7-DHC) reductase. Two potential etiologies would support an association between SLOS and PH. First, persistent pulmonary hypertension of the newborn (PPHN) in SLOS has been documented in a patient with altered expression of caveolin-1 (CAV1) [72], suggesting that caveolae-dependent signaling may be responsible for the pathogenesis of PH. This hypothesis was further strengthened by a recent study demonstrating an association between mutations in CAV1 and PAH through whole exome sequencing [73]. Second, cardiorespiratory problems can occur in individuals with SLOS, secondary to malformations of the heart or respiratory tract [74]; these conditions may contribute to the development of PH in patients with SLOS.

4.3 Discovery of Unknown Subtypes

Several network-derived comorbidity clusters do not fall into any of the categories in the WHO and Panama classifications. Of note, a number of these clusters are linked to neurological defects not commonly thought to be associated with PH, including encephalocele, hydrocephalus, microcephaly, periventricular leukomalacia, and congenital brain reduction deformities. It is well established that children with severe neurological impairments are predisposed to respiratory problems that occur as a direct consequence of the underlying disability. For example, oropharyngeal motor problems
associated with neurological dysfunctions can lead to recurrent aspiration and pneumonia [75]. Chiari malformation associated with hydrocephalus can cause both maldevelopment of the brain stem respiratory control centers and central sleep apnea [76]. Neurological impairments are also common pathways for children requiring mechanical ventilation at the intensive care unit; PH in these patients may be secondary to mechanical ventilation management. While the causes of PH in children with neurological defects observed in the study cohort cannot be ascertained, the analysis suggests that the association of neurological defects with PH may be under-recognized and deserves further characterization.

4.4 Feasibility and Challenges of Data-Driven Discovery of Disease Subtypes

The study demonstrated that comorbidity patterns of patients with PH captured in a Bayesian network can be stratified into subtypes that are biologically and clinically informative. The algorithmic method automatically relearned most of the major PH subtypes with known etiological basis defined by the WHO classification. The similarity of the derived network structure to the current taxonomy of PH provides face validity to the approach. Furthermore, the network approach enriches the current classification of PH. First, it captured several subtypes documented in only a few case studies, for which evidence of a systematic association is still lacking. This both validates the approach and provides impetus for deeper investigation of these disease subtypes. The analysis also identified rare subtypes with findings consistent with several well-described genetic syndromes. In the same way that novel genetic associations in PH stimulate new avenues of research, so too may novel phenotypic associations prompt important discoveries related to disease susceptibility.
To construct the comorbidity network, a Bayesian model averaging technique was applied to find the network model that best fits the underlying data. The approach is uniquely suited to accommodating the inherent uncertainties of biological processes and to minimizing the effects of noise in the data. In maximizing specificity, however, other subtypes may have been missed. Furthermore, in defining comorbidity clusters, a strict partitioning rule was applied, whereby each comorbidity belongs to exactly one cluster. While this approach produces a model that is easier to interpret, the full expression of subtypes may not have been captured. As shown in Fig. 3, many comorbidities are linked to comorbidities belonging to another cluster. Shared features among multiple clusters may also reflect the overlapping phenotypes of PH, an increasingly recognized phenomenon [30]. Future research should explore methods that would facilitate the delineation of subtypes with overlapping manifestations and etiologies.
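A minimal sketch of these two steps is shown below, assuming a binary patient-by-comorbidity matrix. Bootstrap resampling stands in for model averaging (retaining only edges that recur across resampled networks, in the spirit of refs. [56, 57]), and modularity-based community detection [59] imposes the strict partitioning rule. A Chow-Liu-style maximum spanning tree over pairwise mutual information is used here as a simple surrogate for the full score-based Bayesian structure search; the file name, column layout, and 30% confidence cutoff are illustrative assumptions, not the study's actual settings.

```python
import pandas as pd
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.metrics import mutual_info_score

# Hypothetical input: one row per patient, one binary (0/1) column per comorbidity.
data = pd.read_csv("comorbidities.csv")

def learn_skeleton(df):
    """Chow-Liu-style surrogate for score-based structure search: a maximum
    spanning tree over pairwise mutual information between comorbidities."""
    g = nx.Graph()
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            g.add_edge(a, b, weight=mutual_info_score(df[a], df[b]))
    return nx.maximum_spanning_tree(g)

# Bootstrap approximation to model averaging: relearn the structure on
# resampled data and keep only edges that recur frequently.
n_boot = 100
edge_counts = {}
for _ in range(n_boot):
    boot = data.sample(frac=1.0, replace=True)
    for edge in learn_skeleton(boot).edges():
        key = frozenset(edge)
        edge_counts[key] = edge_counts.get(key, 0) + 1

threshold = 0.3  # illustrative confidence cutoff, not the study's value
network = nx.Graph([tuple(e) for e, c in edge_counts.items()
                    if c / n_boot >= threshold])

# Strict partitioning rule: modularity-based communities assign each
# comorbidity to exactly one cluster.
for k, members in enumerate(greedy_modularity_communities(network), start=1):
    print(f"Cluster {k}: {sorted(members)}")
```

Because each comorbidity is assigned to a single community, any edge that crosses a community boundary in the consensus network corresponds to the cross-cluster links visible in Fig. 3.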
A further limitation of the study is the coding inaccuracies inherent to administrative claims datasets. The diagnoses coded for billing purposes may not reflect actual comorbidities in the patients. To improve the specificity of case identification, the analysis included only diagnoses recorded at two or more encounter visits, an algorithm that has been validated in previous studies [38–40]. While it may not be possible to fully address diagnostic coding inaccuracies in administrative claims, the analysis was able to discern comorbidity relations and derive subtypes that are biologically meaningful, thus lending support to the validity of the study approach. A further strategy used to reduce uncertainty in the analysis was the exclusion of comorbidities that occurred in fewer than four patients and of network relations with low probabilistic strengths. In doing so, the model may not have captured all the comorbidities and subtypes present in the study cohort. An area that is ripe for further research is the integration of multiple complementary data sources, including administrative claims, electronic health records, and registry data, to validate and enrich disease classifications. Several studies have begun to explore the use of network-based approaches to aggregate diverse data sources [76, 77].
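As a concrete illustration of these filtering rules, the sketch below applies the two-or-more-encounters case definition and the minimum-support exclusion to a hypothetical long-format claims extract; the file and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format claims extract: one row per diagnosis per encounter.
claims = pd.read_csv("claims.csv")  # columns: patient_id, encounter_id, dx_code

# Case-identification rule: count a diagnosis for a patient only if it was
# coded at two or more distinct encounters (validated in refs. 38-40).
enc_counts = (
    claims.groupby(["patient_id", "dx_code"])["encounter_id"]
    .nunique()
    .reset_index(name="n_encounters")
)
confirmed = enc_counts[enc_counts["n_encounters"] >= 2]

# Minimum-support rule: drop comorbidities seen in fewer than four patients.
support = confirmed.groupby("dx_code")["patient_id"].nunique()
kept_codes = support[support >= 4].index
confirmed = confirmed[confirmed["dx_code"].isin(kept_codes)]

# Binary patient-by-comorbidity matrix for downstream network learning.
matrix = (
    confirmed.assign(present=1)
    .pivot_table(index="patient_id", columns="dx_code",
                 values="present", aggfunc="max", fill_value=0)
)
print(matrix.shape)
```

Requiring two distinct encounters trades sensitivity for specificity: transient or rule-out codes are discarded, at the cost of missing true diagnoses coded only once.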

4.5 Summary

This chapter describes the application of the Bayesian network approach to routinely collected clinical data to discern disease subtypes in childhood-onset PH. With the increased availability of large clinical datasets, such data-driven approaches can facilitate and expedite scientific discovery of the causes and treatments of complex diseases. Further advances linking disease subtypes to therapeutic response, disease outcomes, and the molecular profiles of individual subtypes will provide impetus for the development of more effective and targeted therapies.

Acknowledgment

The study described in this book chapter has been previously reported in Circulation Research [78].

References

1. Kitano H (2002) Systems biology: a brief overview. Science 295(5560):1662–1664
2. Aderem A (2005) Systems biology: its practice and challenges. Cell 121(4):511–513
3. Hood L, Heath JR, Phelps ME, Lin B (2004) Systems biology and new technologies enable predictive and preventative medicine. Science 306(5696):640–643
4. Westerhoff HV, Palsson BO (2004) The evolution of molecular biology into systems biology. Nat Biotechnol 22(10):1249–1252
5. Barabási AL, Gulbahce N, Loscalzo J (2011) Network medicine: a network-based approach to human disease. Nat Rev Genet 12(1):56–68
6. Barabási AL, Oltvai ZN (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5(2):101–113
7. Chan SY, Loscalzo J (2012) The emerging paradigm of network medicine in the study of human disease. Circ Res 111(3):359–374
8. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
9. Bhalla US, Iyengar R (1999) Emergent properties of networks of biological signaling pathways. Science 283(5400):381–387
10. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL (2000) The large-scale organization of metabolic networks. Nature 407(6804):651–654
11. Stuart JM, Segal E, Koller D, Kim SK (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302(5643):249–255
12. Horvath S, Dong J (2008) Geometric interpretation of gene coexpression network analysis. PLoS Comput Biol 4(8):e1000117
13. Doering TA, Crawford A, Angelosanto JM, Paley MA, Ziegler CG, Wherry EJ (2012) Network analysis reveals centrally connected genes and pathways involved in CD8+ T cell exhaustion versus memory. Immunity 37(6):1130–1144
14. McDermott JE, Taylor RC, Yoon H, Heffron F (2009) Bottlenecks and hubs in inferred networks are important for virulence in Salmonella typhimurium. J Comput Biol 16(2):169–180
15. Tan N, Chung MK, Smith JD, Hsu J, Serre D, Newton DW, Castel L, Soltesz E, Pettersson G, Gillinov AM, Van Wagoner DR, Barnard J (2013) A weighted gene co-expression network analysis of human left atrial tissue identifies gene modules associated with atrial fibrillation. Circ Cardiovasc Genet 6(4):362–371
16. Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, Hao T, Rual JF, Dricot A, Vazquez A, Murray RR, Simon C, Tardivo L, Tam S, Svrzikapa N, Fan C, de Smet AS, Motyl A, Hudson ME, Park J, Xin X, Cusick ME, Moore T, Boone C, Snyder M, Roth FP, Barabási AL, Tavernier J, Hill DE, Vidal M (2008) High-quality binary protein interaction map of the yeast interactome network. Science 322:104–110
17. Erlich Y, Edvardson S, Hodges E, Zenvirt S, Thekkat P, Shaag A, Dor T, Hannon GJ, Elpeleg O (2011) Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis. Genome Res 21(5):658–664
18. Aggarwal A, Guo DL, Hoshida Y, Yuen ST, Chu KM, So S, Boussioutas A, Chen X, Bowtell D, Aburatani H, Leung SY, Tan P (2006) Topological and functional discovery in a gene coexpression meta-network of gastric cancer. Cancer Res 66(1):232–241
19. Yang Y, Han L, Yuan Y, Li J, Hei N, Liang H (2014) Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types. Nat Commun 5:3231
20. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, Mouy M, Steinthorsdottir V, Eiriksdottir GH, Bjornsdottir G, Reynisdottir I, Gudbjartsson D, Helgadottir A, Jonasdottir A, Jonasdottir A, Styrkarsdottir U, Gretarsdottir S, Magnusson KP, Stefansson H, Fossdal R, Kristjansson K, Gislason HG, Stefansson T, Leifsson BG, Thorsteinsdottir U, Lamb JR, Gulcher JR, Reitman ML, Kong A, Schadt EE, Stefansson K (2008) Genetics of gene expression and its effect on disease. Nature 452(7186):423–428
21. Min JL, Nicholson G, Halgrimsdottir I, Almstrup K, Petri A, Barrett A, Travers M, Rayner NW, Mägi R, Pettersson FH, Broxholme J, Neville MJ, Wills QF, Cheeseman J, GIANT Consortium, MolPAGE Consortium, Allen M, Holmes CC, Spector TD, Fleckner J, McCarthy MI, Karpe F, Lindgren CM, Zondervan KT (2012) Coexpression network analysis in abdominal and gluteal adipose tissue reveals regulatory genetic loci for metabolic syndrome and related phenotypes. PLoS Genet 8(2):e1002505
22. Hidalgo CA, Blumm N, Barabási AL, Christakis NA (2009) A dynamic network approach for the study of human phenotypes. PLoS Comput Biol 5(4):e1000353
23. Gall H, Felix JF, Schneck FK, Milger K, Sommer N, Voswinckel R, Franco OH, Hofman A, Schermuly RT, Weissmann N, Grimminger F, Seeger W, Ghofrani HA (2017) The Giessen pulmonary hypertension registry: survival in pulmonary hypertension subgroups. J Heart Lung Transplant 36(9):957–967
24. Simonneau G, Gatzoulis MA, Adatia I, Celermajer D, Denton C, Ghofrani A, Gomez Sanchez MA, Krishna Kumar R, Landzberg M, Machado RF, Olschewski H, Robbins IM, Souza R (2013) Updated clinical classification of pulmonary hypertension. J Am Coll Cardiol 62:D34–D41
25. Packer M, McMurray J, Massie BM, Caspi A, Charlon V, Cohen-Solal A, Kiowski W, Kostuk W, Krum H, Levine B, Rizzon P, Soler J, Swedberg K, Anderson S, Demets DL (2005) Clinical effects of endothelin receptor antagonism with bosentan in patients with severe chronic heart failure: results of a pilot study. J Card Fail 11(1):12–20
26. Haywood GA, Sneddon JF, Bashir Y, Jennison SH, Gray HH, McKenna WJ (1992) Adenosine infusion for the reversal of pulmonary vasoconstriction in biventricular failure. A good test but a poor therapy. Circulation 86(3):896–902
27. Galiè N, Humbert M, Vachiery JL et al (2016) 2016 ESC/ERS guidelines for the diagnosis and treatment of pulmonary hypertension: the Joint Task Force for the Diagnosis and Treatment of Pulmonary Hypertension of the European Society of Cardiology (ESC) and the European Respiratory Society (ERS): endorsed by: Association for European Paediatric and Congenital Cardiology (AEPC), International Society for Heart and Lung Transplantation (ISHLT). Eur Heart J 37(1):67–119
28. Barst RG, Ertel SI, Beghetti M, Ivy DD (2011) Pulmonary arterial hypertension: a comparison between children and adults. Eur Respir J 37:665–677
29. Ivy DD, Abman SH, Barst RJ, Berger RM, Bonnet D, Fleming TR, Haworth SG, Raj JU, Rosenzweig EB, Schulze Neick I, Steinhorn RH, Beghetti M (2013) Pediatric pulmonary hypertension. J Am Coll Cardiol 62:D117–D126
30. Cerro MR, Abman S, Diaz G, Freudenthal AH, Freudenthal F, Harikrishnan S, Haworth SG, Ivy D, Lopes AA, Raj JU, Sandoval J, Stenmark K, Adatia I (2011) A consensus approach to the classification of pediatric pulmonary hypertensive vascular disease: report from the PVRI pediatric taskforce, Panama 2011. Pulm Circ 1(2):286–298
31. McGoon MD, Benza RL, Escribano-Subias P, Jiang X, Miller DP, Peacock AJ, Pepke-Zaba J, Pulido T, Rich S, Rosenkranz S, Suissa S, Humbert M (2013) Pulmonary arterial hypertension: epidemiology and registries. J Am Coll Cardiol 62(25 Suppl):D51–D59
32. Berger RM et al (2012) Clinical features of paediatric pulmonary hypertension: a registry study. Lancet 379:537–546
33. Badesch DB, Raskob GE, Elliott CG, Krichman AM, Farber HW, Frost AE, Barst RJ, Benza RL, Liou TG, Turner M, Giles S, Feldkircher K, Miller DP, McGoon MD (2010) Pulmonary arterial hypertension: baseline characteristics from the REVEAL registry. Chest 137(2):376–387
34. van Loon RL, Roofthooft MT, Hillege HL, ten Harkel AD, van Osch-Gevers M, Delhaas T, Kapusta L, Strengers JL, Rammeloo L, Clur SA, Mulder BJ, Berger RM (2011) Pediatric pulmonary hypertension in the Netherlands: epidemiology and characterization during the period 1991 to 2005. Circulation 124(16):1755–1764
35. Moledina S, Hislop AA, Foster H, Schulze-Neick I, Haworth SG (2010) Childhood idiopathic pulmonary arterial hypertension: a national cohort study. Heart 96(17):1401–1406
36. Fraisse A, Jais X, Schleich JM, di Filippo S, Maragnès P, Beghetti M, Gressin V, Voisin M, Dauphin C, Clerson P, Godart F, Bonnet D (2010) Characteristics and prospective 2-year follow-up of children with pulmonary arterial hypertension in France. Arch Cardiovasc Dis 103(2):66–74
37. Miller DP, Gomberg-Maitland M, Humbert M (2012) Survivor bias and risk assessment. Eur Respir J 40:530–532
38. Quan H, Khan N, Hemmelgarn BR et al (2009) Validation of a case definition to define hypertension using administrative data. Hypertension 54(6):1423–1428
39. Lix L, Yogendran M, Burchill C, Metge C, McKeen N, Moore D, Bond R (2006) Defining and validating chronic diseases: an administrative data approach. Manitoba Centre for Health Policy, Winnipeg
40. Rector TS, Wickstrom SL, Shah M et al (2004) Specificity and sensitivity of claims-based algorithms for identifying members of Medicare+Choice health plans that have chronic medical conditions. Health Serv Res 39(6 Pt 1):1839–1857
41. Funch D, Ross D, Gardstein BM, Norman HS, Sanders LA, Major-Pedersen A, Gydesen H, Dore DD (2017) Performance of claims-based algorithms for identifying incident thyroid cancer in commercial health plan enrollees receiving antidiabetic drug therapies. BMC Health Serv Res 17(1):330
42. Reid AY, St Germaine-Smith C, Liu M, Sadiq S, Quan H, Wiebe S, Faris P, Dean S, Jetté N (2012) Development and validation of a case definition for epilepsy for use with administrative health data. Epilepsy Res 102(3):173–179
43. Solberg LI, Engebretson KI, Sperl-Hillen JM, Hroscikoski MC, O'Connor PJ (2006) Are claims data accurate enough to identify patients for performance measures or quality improvement? The case of diabetes, heart disease, and depression. Am J Med Qual 21(4):238–245
44. Pearl J (2000) Causality: models, reasoning and inference, vol 29. MIT Press, Cambridge
45. Heckerman D (1995) A Bayesian approach to learning causal networks. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence (UAI'95), pp 285–295
46. Friedman N, Nachman I, Pe'er D (1999) Learning Bayesian network structure from massive datasets: the sparse candidate algorithm. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence (UAI'99), pp 206–215
47. Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347
48. de Campos CP, Zeng Z, Ji Q (2009) Structure learning of Bayesian networks using constraints. In: Proceedings of the 26th international conference on machine learning, Montreal, Canada
49. Friedman N, Koller D (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1–2):95–125
50. Perrier E, Imoto S, Miyano S (2008) Finding optimal Bayesian network given a super-structure. J Mach Learn Res 9:2251–2286
51. Acid S, de Campos LM (2001) A hybrid methodology for learning belief networks: BENEDICT. Int J Approx Reason 27(3):235–262
52. Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243
53. de Campos CP, Ji Q (2010) Properties of Bayesian Dirichlet scores to learn Bayesian network structures. In: Proceedings of the 24th AAAI conference on artificial intelligence
54. Glover F, Laguna M (2013) Tabu search. In: Handbook of combinatorial optimization. Springer, pp 3261–3362
55. Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge, pp 733–749
56. Friedman N, Goldszmidt M, Wyner A (1999) Data analysis with Bayesian networks: a bootstrap approach. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence (UAI'99), pp 196–205
57. Scutari M, Nagarajan R (2013) Identifying significant edges in graphical models of molecular networks. Artif Intell Med 57(3):207–217
58. Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Computer and information sciences – ISCIS 2005. Springer, Berlin Heidelberg, pp 284–293
59. Newman ME (2006) Modularity and community structure in networks. Proc Natl Acad Sci U S A 103(23):8577–8582
60. Kimura Y, Weiss JE, Haroldson KL, Lee T, Punaro M, Oliveira S et al (2013) Pulmonary hypertension and other potentially fatal pulmonary complications in systemic juvenile idiopathic arthritis. Arthritis Care Res 65(5):745–752
61. Li EK, Tam LS (1999) Pulmonary hypertension in systemic lupus erythematosus: clinical association and survival in 18 patients. J Rheumatol 26(9):1923–1929
62. Humbert M, Monti G, Brenot F, Sitbon O, Portier A, Grangeot-Keros L et al (1995) Increased interleukin-1 and interleukin-6 serum concentrations in severe primary pulmonary hypertension. Am J Respir Crit Care Med 151(5):1628–1631
63. Inoue S, Nakamura T, Hasegawa K, Tadaoka S, Samukawa M, Nezuo S et al (1989) Pulmonary hypertension due to glycogen storage disease type II (Pompe's disease): a case report. J Cardiol 19(1):323–332
64. Kobayashi H, Shimada Y, Ikegami M, Kawai T, Sakurai K, Urashima T et al (2010) Prognostic factors for the late onset Pompe disease with enzyme replacement therapy: from our experience of 4 cases including an autopsy case. Mol Genet Metab 100(1):14–19
65. Brandenburg VM, Krueger S, Haage P, Mertens P, Riehl J (2002) Heterotaxy syndrome with severe pulmonary hypertension in an adult patient. South Med J 95(5):536–538
66. Yousuf T, Kramer J, Jones B, Keshmiri H, Dia M (2016) Pulmonary hypertension in a patient with congenital heart defects and heterotaxy syndrome. Ochsner J 16(3):309–311
67. Hills C, Moller JH, Finkelstein M, Lohr J, Schimmenti L (2006) Cri du chat syndrome and congenital heart disease: a review of previously reported cases and presentation of an additional 21 cases from the Pediatric Cardiac Care Consortium. Pediatrics 117(5):e924–e927
68. Levy B, Dunn TM, Kern JH, Hirschhorn K, Kardon NB (2002) Delineation of the dup5q phenotype by molecular cytogenetic analysis in a patient with dup5q/del 5p (cri du chat). Am J Med Genet 108(3):192–197
69. Bechtold SM, Dalla Pozza R, Becker A, Meidert A, Döhlemann C, Schwarz HP (2004) Partial anomalous pulmonary vein connection: an underestimated cardiovascular defect in Ullrich-Turner syndrome. Eur J Pediatr 163(3):158–162
70. Tinker A, Schofield UJ (1989) Severe pulmonary hypertension in Turner syndrome. Br Heart J 62:74–77
71. Bakker B, Maneatis T, Lippe B (2007) Sudden death in Prader-Willi syndrome: brief review of five additional cases. Horm Res 67:203–204
72. Katheria AC, Masliah E, Benirschke K, Jones KL, Kim JH (2010) Idiopathic persistent pulmonary hypertension in an infant with Smith-Lemli-Opitz syndrome. Fetal Pediatr Pathol 29(6):373–379
73. Austin ED, Ma L, LeDuc C, Berman Rosenzweig E, Borczuk A, Phillips JA 3rd et al (2012) Whole exome sequencing to identify a novel gene (caveolin-1) associated with human pulmonary arterial hypertension. Circ Cardiovasc Genet 5:336–343
74. Nowaczyk M (2013) Smith-Lemli-Opitz syndrome. GeneReviews
75. Seddon PC, Khan Y (2003) Respiratory problems in children with neurological impairment. Arch Dis Child 88(1):75–78
76. Dauvilliers Y, Stal V, Abril B, Coubes P, Bobin S, Touchon J, Escourrou P, Parker F, Bourgin P (2007) Chiari malformation and sleep related breathing disorders. J Neurol Neurosurg Psychiatry 78(12):1344–1348
77. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 11(3):333–337
78. Ong MS, Mullen MP, Austin ED, Szolovits P, Natter MD, Geva A, Cai T, Kong SW, Mandl KD (2017) Learning a comorbidity-driven taxonomy of pediatric pulmonary hypertension. Circ Res 121(4):341–353