Bioinformatics and Drug Discovery
Third Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Third Edition
Edited by
Richard S. Larson
Department of Pathology, University of New Mexico, Albuquerque, NM, USA
Tudor I. Oprea
Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
Editors
Richard S. Larson
Department of Pathology, University of New Mexico, Albuquerque, NM, USA
Tudor I. Oprea
Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
A remarkable number of novel bioinformatics methods and techniques have become avail-
able in recent years, enabling us to more rapidly identify new molecular and cellular
therapeutic targets. It is safe to say that bioinformatics has now taken its place as an essential
tool in the process of rational drug discovery.
The first (2005), second (2012), and now third editions of Bioinformatics and Drug
Discovery offer many examples that illustrate the dramatic improvement in our ability to
understand the requirements for manipulating proteins and genes toward desired therapeu-
tic and clinical effects.
This is partly due to our growing ability to modulate protein and gene functions, which
has been facilitated by the emergence of novel technologies and their seamless digital
integration. To address the rapidly changing landscape of bioinformatics methods and
technologies, this edition has been updated to include four major topics: (1) Translational
Bioinformatics in Drug Discovery; (2) Informatics in Drug Discovery; (3) Clinical Research
Informatics in Drug Discovery; and (4) Clinical Informatics in Drug Discovery. The topics
covered include new technologies in target identification, genomic analysis, cheminformatics
and chemical mixture informatics, protein analysis, text mining, network and pathway
analyses, and drug repurposing.
It is virtually impossible for an individual investigator to be familiar with all these
techniques, so we have adopted a slightly different chapter format from other titles published
in the Methods in Molecular Biology series. Each chapter introduces the theory and application of the
technology, followed by practical procedures derived from these technologies and software.
Meanwhile, the pipeline of methodologies and the biologic analyses that they perform have
grown over time.
Bioinformatics and Drug Discovery is intended for those interested in the different
aspects of drug design, including academicians (biologists, informaticists, chemists, and
biochemists), clinicians, and scientists at pharmaceutical companies. This edition’s chapters
have been written by well-established investigators who regularly employ the methods they
discuss. The editors hope this book will provide readers with insight into key topics,
accompanied by reliable step-by-step directions for reproducing the techniques described.
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contributors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Contributors
TIM MIERZWA National Center for Advancing Translational Science, Rockville, MD, USA
MEI-SING ONG Department of Population Medicine, Harvard Medical School and
Harvard Pilgrim Health Care Institute, Boston, MA, USA; Computational Health
Informatics Program, Boston Children's Hospital, Boston, MA, USA
TUDOR I. OPREA Division of Translational Informatics, Department of Internal Medicine,
University of New Mexico School of Medicine, Albuquerque, NM, USA
SHAWN T. PHILLIPS Department of Chemistry, Center for Innovative Technology,
Vanderbilt-Ingram Cancer Center, Vanderbilt Institute of Chemical Biology, Vanderbilt
Institute for Integrative Biosystems Research and Education, Vanderbilt University,
Nashville, TN, USA
THEODORE SAKELLAROPOULOS Office of Clinical Pharmacology, Center for Drug
Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA;
Department of Pathology, New York University School of Medicine, New York, USA
STEPHAN SCHÜRER Department of Molecular and Cellular Pharmacology, Miller School
of Medicine, University of Miami, Miami, FL, USA; Center for Computational Science,
University of Miami, Coral Gables, FL, USA
PAUL SHINN National Center for Advancing Translational Science, Rockville, MD, USA
CRAIG THOMAS National Center for Advancing Translational Science, Rockville, MD, USA
OLEG URSU Merck Research Laboratories, Boston, MA, USA; Division of Translational
Informatics, Department of Internal Medicine, University of New Mexico School of
Medicine, Albuquerque, NM, USA
UBBO VISSER Department of Computer Science, University of Miami, Coral Gables, FL, USA
KELLI WILSON National Center for Advancing Translational Science, Rockville, MD, USA
MENG WU Institute of Medical Information and Library, Chinese Academy of Medical
Sciences and Peking Union Medical College, Beijing, China
LEI XIE The Ph.D. Program in Biochemistry, The Graduate Center, The City University
of New York, New York, NY, USA; Department of Computer Science, Hunter College,
The City University of New York, New York, NY, USA
SI ZHENG Institute of Medical Information and Library, Chinese Academy of Medical
Sciences and Peking Union Medical College, Beijing, China
XIAOBO ZHOU Department of Radiology, Wake Forest University Medical School, Winston-
Salem, NC, USA; School of Biomedical Informatics, The University of Texas, Health Science
Center at Houston, Houston, TX, USA
Part I
Abstract
Drugs may have synergistic or antagonistic interactions when combined. Checkerboard assays, where two
drugs are combined in many doses, allow sensitive measurement of drug interactions. Here, we describe a
protocol to measure the pairwise interactions among three antibiotics, in duplicate, in 5 days, using only
two 96-well microplates and standard laboratory equipment.
1 Introduction
2 Materials
2.1 Preparation of Bacterial Culture
1. Aliquots of Escherichia coli in 25% glycerol (see Note 1).
2. LB Broth Powder.
3. 15 ml breathable cell culture tube.
4. Pipette pump.
5. 5 ml cell culture serological pipette.
6. Manual pipette.
7. 200 μl tips.
8. Incubator.
9. Tube rotator.
10. 1.5 ml semi-micro cuvette.
11. Spectrophotometer.
3 Methods
3.1 Day 1: Start Bacterial Culture
1. Take one aliquot of Escherichia coli from −80 °C.
2. Add 100 μl of bacterial culture to 5 ml of growth media in a
culture tube.
3. Leave to grow overnight on a tube rotator in a 37 °C incubator.
3.2 Day 2: Serial Dilution Dose Response
1. Take one aliquot of drugs X, Y, and Z from −20 °C, leave them
at room temperature for 10 min, and prepare for serial dilution
of these drugs.
2. Prepare LB-10% sol by mixing LB media and solvent (DMSO)
in a 9:1 ratio.
3. Prepare LB-10% drug X by mixing LB media and drug X in a
9:1 ratio.
4. Vortex and add 20 μl of the 1:10 diluted solvent (LB-10% sol)
to 10 wells in a 96-well plate.
5. Vortex and add 20 μl of the 1:10 diluted drug X (LB-10% drug
X) into the first well.
6. Take 20 μl of content from the first well and add it to the second well.
Dilute the drug concentration serially in each well by adding
20 μl of content to its bottom adjacent well until the ninth well (see
Fig. 1a).
7. Discard the last 20 μl of content from the ninth well (the last well of
the column is used as a no-drug control).
8. Repeat steps 3–7 for the drugs Y and Z (see Note 3).
9. Measure the OD600 of the 1:10 dilution of the culture started
in Day 1.
10. Dilute the cells in growth media to an OD of 0.01 (see Note 4).
11. Add 80 μl of cells onto the drug serial dilutions prepared in step 8. The
final drug concentration in each well is shown in Fig. 1a.
12. Seal plate to avoid evaporation.
13. Leave the plate for 12 h at 37 °C in a shaker at 150 rpm.
14. Start new bacterial culture to use in Day 3 (repeat Subheading
3.1).
3.3 Day 3: Linear Dilution Dose Response
1. Measure OD600 absorbance for the serial dilution dose-response
plate from Day 2 (see Fig. 1b).
2. Normalize growth by dividing the growth in each well by the
growth in the no-drug control. For each drug, choose 1× as the
dose which is twice the minimum concentration that results in
no growth (a short calculation sketch is given after this list).
3. For each drug, prepare LB-10% drug by mixing LB media and
drug in a 9:1 ratio, where the drug's concentration is 50× of what
was chosen at step 2. Similarly, prepare LB-10% sol by mixing LB
media and solvent (DMSO) in a 9:1 ratio.
4. Prepare linearly increasing doses of drugs X, Y, and Z in ten
concentrations, by mixing LB-10% drug and LB-10% sol in
volumes shown in Fig. 2a (see Note 3).
5. Measure the OD600 of the 1:10 dilution of the culture started
in Day 2.
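The normalization and 1× dose selection in step 2 can be scripted. The following is a minimal sketch in Python; the well layout, growth cutoff, and all numeric values are illustrative assumptions rather than values taken from this protocol.

```python
# Minimal sketch: normalize growth and pick the 1x dose for one drug.
# Assumes od[i] is the blank-corrected OD600 of the i-th serial-dilution well
# (highest concentration first) and od_control is the no-drug control well.

def pick_1x_dose(od, concentrations, od_control, growth_cutoff=0.1):
    """Return (normalized_growth, chosen_1x_dose).

    normalized_growth: growth in each well divided by the no-drug control.
    chosen_1x_dose: twice the minimum concentration showing no growth, where
    "no growth" is approximated here as normalized growth below growth_cutoff
    (an illustrative threshold, not part of the protocol).
    """
    normalized = [reading / od_control for reading in od]
    no_growth_concs = [c for c, g in zip(concentrations, normalized)
                       if g < growth_cutoff]
    if not no_growth_concs:
        raise ValueError("No well fully inhibited; extend the dilution series.")
    return normalized, 2 * min(no_growth_concs)

# Example with made-up numbers (two-fold dilutions of drug X):
concs = [100, 50, 25, 12.5, 6.25, 3.125, 1.56, 0.78, 0.39]  # arbitrary units
od600 = [0.02, 0.03, 0.03, 0.05, 0.21, 0.38, 0.45, 0.48, 0.50]
growth, dose_1x = pick_1x_dose(od600, concs, od_control=0.52)
print(dose_1x)  # 25.0 in this fabricated example
```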
Fig. 1 Serial dilution dose-response experiment. (a) Preparation of serial dilution dose response for one drug
and corresponding final concentrations of the drug. (b) Normalized growth in serial dilution of drugs X, Y, and
Z. Each rectangle here represents a well of 96-well plate. Concentrations of each drug chosen for the next
experiment are shown in orange
Fig. 2 Linear dilution dose-response experiment. (a) Preparation of linear dilution dose response for each drug
and corresponding final concentrations of the drug. (b) Normalized growth in linear dilution of drugs X, Y, and
Z. Each rectangle here represents a well of 96-well plate. Concentrations of each drug chosen for the next
experiment are shown in orange
Fig. 3 Miniaturized checkerboard assay. (a) Preparation of drug mixes and placement of each drug in 96-well
plate for 4 × 4 checkerboard. Each rectangle here represents a well of 96-well plate. Drug X and drug Y pairs
are used as an example for preparation. (b) Interpretation of drug pairs results in 4 × 4 checkerboard assay as
additive, synergistic, or antagonistic
4 Notes
References
Chapter 2
Abstract
The identification of drug combinations as alternatives to single-agent therapeutics has traditionally been a
slow, largely manual process. In the last 10 years, high-throughput screening platforms have been devel-
oped that enable routine screening of thousands of drug pairs in an in vitro setting. In this chapter, we
describe the workflow involved in screening a single agent versus a library of mechanistically annotated,
investigational, and approved drugs in a full dose-response matrix scheme with viability as the readout.
We provide details of the automation required to run the screen, the informatics required to process data
from the screening robot, and the subsequent analysis and visualization of the datasets.
Key words Drug combination screening, Acoustic dispensing, Automation, Compound manage-
ment, Synergy
1 Introduction
2 Materials
The DMSO stock solutions are stored at −20 °C, but all other
operations occur at room temperature.
3 Methods
Fig. 1 An overview of the process to prepare an acoustic source plate and related files
Fig. 2 An example of the matrix screening request form in Microsoft Excel format with the required information
filled in. Each drug combination should be listed by row. Drug A should be listed with the name, starting
concentration in the assay in nanomolar (nM), dilution factor of the drug in the screen, and internal compound
ID. Drug B in the combination should be listed next with the same information. The researcher should also
specify the size of the matrix block as well as information regarding the number of replicates needed and
assay types used
3.3 Preparation of All Files Needed for Creation of the Acoustic Source Plate
The MSPG application will create the appropriate files needed for
each critical instrument used in creation of the acoustic source
plate. MSPG will take the submitted requestor form and create
the JANUS worklist file to prepare the mother plates, which are
then transferred to acoustic source plates. It will also generate
transfer script files that are used on the ATS-100 for acoustic
dispensing, a worklist file used to schedule the movement of plates
on the HRB, and a plate map of the assay plate which is used for
data analysis (see Note 2).
1. Open the MSPG application, and click on Matrix Order which
will open the “Import Compound Combinations Wizard”
(Figs. 1c and 3).
2. Click Next to begin the Wizard. On the second window, select
the “Browse” button, and select the Matrix screening request
form (Fig. 2) containing all the drug combinations you wish to
process. After selecting the file, a preview of the combinations
will appear in the window as seen in Fig. 4. Click “Next” to
advance to the next screen.
3. Use the drop-down menu to select the Excel worksheet tab
that contains the compound pairs. For each Compound A and
Fig. 3 Select the “Matrix Order” menu item to initiate the wizard that will walk you through the use of the tool,
which will open the import compound combination wizards
Fig. 4 Browse to and select the request form to display a preview of the desired matrix combinations
Fig. 5 Data fields from the matrix request form are associated with the variable fields in the MSPG
Fig. 6 The review window provides verification that the columns from the spreadsheet were mapped correctly
Fig. 7 (a) Input parameters appropriate for the matrix screen are entered; (b) the compound source plate map
is pasted into the MSPG; (c) successful generation of a “Master” Excel file is visible in the MSPG
“Spreadsheets” window
rack plate map from Excel and Paste into the Plate Map win-
dow of the MSPG, and click “Submit” to create the Master file
(Fig. 7b). The various tabs of the Master file will be visible in
the MSPG “Spreadsheets” tab (Fig. 7c). Save the Master file
from the MSPG as an Excel file in a folder where you want all
future MPSG files deposited (see Note 2).
5. Click on the “Script Files” tab within MSPG, and select the
“Source Plate” type, “Destination Plate” type, and “Matrix
Size” pertaining to the parameters in the Matrix order request
form. Click on “Load Picklist” and browse and open the Mas-
ter file. Click “Create Scripts” to generate the ATS-100 scripts,
HRB worklist, and assay plate maps which will be automatically
saved to a sub-folder of the project folder where the Master file
resides (Figs. 8 and 9).
6. Close the MSPG.
7. Open the Master file in Excel, and separately save each of the
“Janus,” “Intermediate,” and “DMSO” tabs as comma-
separated value (CSV) files in the same folder location as the
Master file (see Note 3). Close the Master file.
Fig. 8 After the script, datamap, and worklist files are successfully created, you can preview the compound
dispense patterns from source to destination in the MSPG
Fig. 9 The JANUS worklist files define the source to destination compound transfers from compound source
rack to mother plate. The HRB worklist directs the Cellario scheduler to move plates through the HRB, pairing
the source (acoustic source plate) and destination (assay plate) plates at the ATS-100 to perform the acoustic
liquid transfer. The acoustic transfer script (ATS) file instructs the ATS-100 to dispense compound from a
specific source well location to a specific destination well location. *Continued—The datamap file is loaded to
the data analysis system to identify the compound pair IDs, concentrations, and volumes dispensed to each
well
3.4 384-Well Mother Plate and Acoustic Source Plate Preparation
1. Prime the JANUS tubing with water until all air bubbles are
flushed out (see Note 4). Remove the caps from the sample
tubes using the automated decapper.
2. Scan and associate the compound source rack barcodes to the
JANUS deck positions using the attached handheld barcode
scanner, and then place the compound source racks into their
assigned deck locations. Supply P25 tip boxes, the necessary
number of empty mother plates as determined by the JANUS
worklist file, and a filled DMSO reservoir to the JANUS deck
(see Fig. 10 and Note 5).
3. Start the JANUS protocol, and when prompted, import the
“DMSO” worklist file to begin the transfer. This worklist file
directs the JANUS to transfer the required amount of DMSO
Fig. 11 The "10 × 10 matrix serial dilution" protocol performs a twofold serial dilution from columns 1 to
9 and 11 to 19 of each mother plate. Columns 10 and 20 remain DMSO only
3.5 Acoustic Dispensing from Acoustic Source Plate to Assay Plate on the HRB System
1. In Cellario 2.0 on the HRB computer, create a new order using
Cellario's Cherry Pick Wizard (Figs. 12a–d and 13, Note 11).
Copy the acoustic transfer scripts from the network folder to
the "EDC_Scripts" folder on the HRB computer.
2. Spin down the acoustic source plates for 60 s at 234 × g.
Manually remove the foil seals, and apply compound plate lids
Fig. 12 Using the Cellario Cherry Pick Wizard to create an order. (a). Import the HRB worklist to Cellario using
the Cherry Pick Wizard. (b). Define the source and (c). destination plate types as well as the starting location
for the first plate of each type in the Ambistore. (d). Verify that the physical location of the source and
destination plates matches what is virtually represented in Cellario
Fig. 13 The acoustic transfer protocol has separately defined steps (threads) for
each source and destination (Dest) plates. The ACell robot shuttles the source
and Dest plates according to the plate order in the HRB worklist from device to
device as each step in their respective threads is executed. The source and Dest
plates intersect at the Cherry Pick step where the Cellario scheduling software
instructs the ATS-100 to dispense a specific script file from the appropriate
acoustic source plate to the designated assay plate
3.6 Cell Plating
1. Visually inspect cells in the flask to ensure the cell morphology
appears uniform and that there are not large numbers of
detached cells. Ensure that media is not cloudy or oddly col-
ored which can be an indication of fungal or bacterial contami-
nation. Move flasks into a sterile biosafety cabinet.
2. Aspirate all media from the flask and discard. Add 7 mL of
trypsin, and place the flask into a standard tissue culture incu-
bator for 5 min. Visually inspect flasks periodically until all cells
are detached.
3. Add 7 mL of complete culture media to the flask to neutralize
the trypsin. Transfer all contents of the flask to a 15 mL conical
tube and put on the lid. Place conical tube into a tabletop
centrifuge along with an equally weighted balance tube, and
spin at 233 × g for 5 min to pellet cells.
4. Remove the tube from the centrifuge and verify that a cell pellet
is present. Aspirate as much volume from the tube as possible
and discard. Do not aspirate the cell pellet.
5. Add complete media to the cell pellet, and use a pipettor to
resuspend the pellet in media. Ensure the cells are homoge-
nously distributed in the media (see Note 14).
6. Count the cells using the desired method, and determine the
number of cells per mL. Determine the total number of cells
needed for plating by multiplying the number of cells per well
by the total number of plates by 1536. Calculate the total
volume of media for plating by multiplying the volume per
well by the total number of plates by 1536. Add 20% additional
volume to account for dead volume, priming volume, and extra
plates needed for imaging and growth rate calculations (see
Note 15; a short worked sketch of this calculation follows this list).
7. Add the total number of cells needed into the total volume of
media needed to fill all assay plates. Ensure the solution is well
mixed and without clumps to ensure homogenous distribution
of cells. Ensure the vessel used for diluting the cells has a wide
enough opening to accommodate the Multidrop cassette
tubing.
8. Load the cassette onto the Multidrop. Clean the cassette by
priming 10 mL of deionized water followed by 10 mL of 70%
ethanol followed by 10 mL of deionized water. While cleaning,
ensure that dispenser streams are linear and continuous.
9. Program the Multidrop to dispense 5 μL per well into 1536-well
standard-height plates, filling columns 2–48 for dispensing.
Column 1 is left empty as a background control.
10. Place the cassette tubing into the cell solution, and prime for
10 s to ensure all tubing is filled with the cell solution. Cover
the opening of the cell dilution vessel with its lid to ensure no
large objects drop into the vessel. Remove all plastic lids from
the assay plates.
11. Place the first plate on the Multidrop in the correct orientation;
well A01 should be in the upper-left-hand corner of the plat-
form. Press start to begin dispensing cells. This can be done in a
biosafety cabinet or in an open laboratory environment (see
Note 16).
12. Once the 1536-well assay plate is filled, remove the 1536-well
assay plate from the Multidrop, and ensure that all wells are
uniform in appearance. If the liquid level is uniform, apply an
assay lid to the 1536-well assay plate (see Note 17). Repeat
until all assay plates are complete.
13. All lidded 1536-well assay plates should then be moved to the
incubator on the primary screening system with standard con-
ditions, 37 °C, 95% humidity, and 5% CO2, and the odd
numbered plate barcode facing outward.
14. Following plating, the cassette should be cleaned using the
same protocol in step 3. If any tips appear to be clogged, use
the syringe tool that comes with the cassette to clear any
blockages. After cleaning is complete, empty all liquid from
the tubing using the empty button on the Multidrop. Store the
cassette in the box provided.
15. All vessels that contained cells should be discarded in a biohaz-
ard disposal box.
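The arithmetic in step 6 is simple but easy to get wrong at scale. The sketch below walks through it in Python; the cell density, cells per well, and plate count are made-up example values, not recommendations from this chapter.

```python
# Minimal sketch of the step 6 arithmetic for 1536-well plates.
# All inputs below are illustrative placeholders.

cells_per_well = 500          # desired cells per well (example value)
volume_per_well_ul = 5.0      # 5 uL per well, as dispensed by the Multidrop
n_plates = 20                 # number of assay plates (example value)
wells_per_plate = 1536
overage = 1.20                # 20% extra for dead volume, priming, extra plates

total_cells = cells_per_well * wells_per_plate * n_plates * overage
total_volume_ml = volume_per_well_ul * wells_per_plate * n_plates * overage / 1000.0

# With a counted stock of, e.g., 1.0e6 cells/mL, the volume of stock to add is:
stock_density_cells_per_ml = 1.0e6
stock_volume_ml = total_cells / stock_density_cells_per_ml
media_volume_ml = total_volume_ml - stock_volume_ml

print(f"{total_cells:.2e} cells in {total_volume_ml:.1f} mL "
      f"({stock_volume_ml:.1f} mL stock + {media_volume_ml:.1f} mL media)")
```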
3.7 Reading Assay Plates on the Primary Screening System
1. Scan and inventory the assay plate barcodes in the primary
screening system's assay plate incubator (see Note 18).
2. Incubate the assay plates for 48 h.
3. Set up to run the read protocol on the primary screening
system (Fig. 15). Using the barcodes in the inventory database,
create an assay file in CSV format containing the assay plate
barcodes which were originally loaded. The assay plates will be
handled in the order the barcodes are listed in the CSV file.
4. Create a method file containing the steps of the read protocol
using the Method Editor software on the Dispatcher computer
for the robotic system (see Note 19).
5. Compress the ViewLux data files, 1536-well assay datamaps,
ATS-100 dispense error logs, and assay file with cell line names
to a ZIP file, and save to a shared network folder for retrieval
from informatics for data analysis.
3.8 Data Processing
1. Plate data files, along with the plate maps which specify the
compound/control composition and concentrations, are
loaded into an Oracle database using an in-house software
Fig. 15 A typical ViewLux read of a 1536-well assay plate tested in pairwise drug
combination screening
4 Notes
7. FX deck layout
plate. To prevent this, ensure all tips are clean and unclogged
prior to cell plating and monitor Multidrop tips while plating
to ensure streams are linear and consistent. If a tip becomes
clogged while plating, attempt to unblock the tip before start-
ing a new plate.
18. An inventory is done on the system where the barcode of each
plate is read and added into the database to track the location of
each plate in the incubator. It is important that the barcodes of
these plates are in the inventory database; this way when other
processes are running on the system there is no possibility that
the slots will be accessed causing a crash. A Keyence barcode
reader attached to the gripper on the robotic arm allows the
barcodes to be read in an automated fashion.
19. The Method File typically consists of the following:
(a) Dispense 2 μL CellTiter-Glo into each of the 32 rows of
the 1536-well assay plates, containing cells which have
incubated with compound, using four tips in parallel on
a solenoid valve dispenser.
(b) Incubate at room temperature in an auxiliary hotel on the
system for 15 min.
(c) Read luminescence on the ViewLux using a 2 s exposure
time with slow speed, medium gain, and 2× binning. The
units are relative luminescence units (Fig. 15).
(d) Plate is returned to the home location after the read has
finished.
20. For an inhibition assay, where the positive control has a lower signal
than the negative control (e.g., a viability or toxicity assay), sample
wells are normalized using the formula 100 × (c − n)/(n − p) + 100,
where c, n, and p are the sample well value, the median of the
negative control well values, and the median of the positive
control well values, respectively. For an activation assay, where the
positive control has a higher signal, sample wells are
normalized using the formula 100 × (c − n)/(n − p). In some assays, the
positive control may not be available or dispensed properly.
This issue can be overcome by inclusion of an empty column
with no cells, particularly for an inhibition assay; we then
employ the median value of the no-cell wells as a proxy for the
positive control, assuming that a cytotoxic compound leads to
maximum cell killing.
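A minimal Python sketch of the Note 20 normalization is given below, assuming raw plate readings and control wells are available as plain lists; the function names and example values are illustrative only.

```python
import statistics

def normalize_inhibition(sample, neg_controls, pos_controls):
    """Note 20 normalization for an inhibition/viability readout.

    sample: raw well value (c); neg_controls/pos_controls: lists of raw
    control values whose medians give n and p. Returns ~100 for wells that
    look like the negative control and ~0 for full inhibition.
    """
    n = statistics.median(neg_controls)
    p = statistics.median(pos_controls)
    return 100.0 * (sample - n) / (n - p) + 100.0

def normalize_activation(sample, neg_controls, pos_controls):
    """Note 20 normalization for an activation readout (no +100 offset)."""
    n = statistics.median(neg_controls)
    p = statistics.median(pos_controls)
    return 100.0 * (sample - n) / (n - p)

# Example with fabricated luminescence values:
neg = [1000, 980, 1020]   # untreated wells
pos = [100, 120, 90]      # fully inhibited (or no-cell proxy) wells
print(normalize_inhibition(500, neg, pos))   # ~44% viability in this example
```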
21. When measuring synergy, the absolute response to drug
treatment, not always the EC50, is crucial to the accurate estima-
tion of synergy. However, the responses within a plate may
drift due to unexpected artifacts (edge effects, variability in
liquid dispensing, signal crosstalk, etc.). To correct for minor
artifacts, we take all the intraplate replicates (e.g., controls, no
Acknowledgments
References
Abstract
Bioactivity data is a valuable scientific data type that needs to be findable, accessible, interoperable, and
reusable (FAIR) (Wilkinson et al. Sci Data 3:160018, 2016). However, results from bioassay experiments
often exist in formats that are difficult to interoperate across and reuse in follow-up research, especially when
attempting to combine experimental records from many different sources. This chapter details common
issues associated with the processing of large bioactivity data and methods for handling these issues in a
post-processing scenario. Specifically described are observations from a recent effort (Harris,
http://www.scrubchem.org, 2017) to post-process massive amounts of bioactivity data from the NIH's PubChem
Bioassay repository (Wang et al., Nucleic Acids Res 42:1075–1082, 2014).
Key words Bioactivity, Bioassay, ScrubChem, PubChem, Hit-calls, Big data, Data integration
1 Introduction
This chapter will explain an approach that was used in a recent data
integration project (http://www.ScrubChem.org) [1] to efficiently
process billions of data points from a large public bioassay reposi-
tory known as PubChem [2]. The approach is generalizable and
applicable to other data types and sources. It is also designed to be
efficient in cost, time, and ease of development. For example,
ScrubChem is the largest curation effort of public bioactivity
data, containing millions of bioassay records; yet it rebuilds entirely
in less than a day, and it requires only limited scientific program-
ming and data science skills to develop. The goal of this chapter is to
provide a how-to-guide for other technically inclined researchers
who are subject matter experts in their respective fields and in need
of post-processing their research data.
There are many research uses for aggregating and repurposing
legacy bioactivity data. The simplest use is referencing of experi-
mental protocols from previous efforts in order to inform and
accelerate the design of a current research goal. Prior results
can also be used to create predictive models from
for comparing and aggregating data into hit-calls will vary and
depend on an intended reuse case for the data. Simple quality
metrics about hit-calls can also be used to interrogate the reproduc-
ibility of data, such as the number of records or evidences used to
support the hit-call and also the total agreement between those
evidences.
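As a concrete illustration of these quality metrics, the sketch below aggregates per-record outcomes for one chemical into a consensus hit-call with the number of evidences (n) and the agreement ratio (r). The data structure and the simple majority rule are assumptions for illustration, not the exact ScrubChem implementation.

```python
from collections import Counter

def build_hit_call(outcomes):
    """Combine replicate outcomes (e.g., ["active", "active", "inactive"])
    into a consensus hit-call.

    Returns (consensus, n, r) where n is the number of evidences and r is
    the fraction of evidences agreeing with the consensus. The majority rule
    used here is only one possible aggregation strategy.
    """
    counts = Counter(outcomes)
    consensus, agreeing = counts.most_common(1)[0]
    n = len(outcomes)
    r = agreeing / n
    return consensus, n, r

# Example: a chemical tested three times against the same target/modality.
print(build_hit_call(["active", "active", "active"]))    # ('active', 3, 1.0)
print(build_hit_call(["active", "inactive", "active"]))  # ('active', 3, 0.666...)
```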
A case study involving these introductory concepts is presented
in Subheading 3. The Methods explain how to choose hardware
and software for parsing large amounts of data quickly and afford-
ably. Also explained are general approaches to download, parse,
annotate/curate, add metadata, and set up a scalable database.
This chapter concludes with a conceptual example of how to
query a database using Justifications as a filter and aggregate this
data into summary hit-calls.
2 Materials
3 Methods
3.3 Downloading Primary Data (Disk 1)
Bulk data is usually made available via an FTP site (e.g.,
ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/XML). There is often a
schema available (e.g., ftp://ftp.ncbi.nlm.nih.gov/pubchem/
specifications/pubchem.xsd) to use as a template for creating clas-
ses to store and process the data. Data records are not guaranteed to
follow expected formats, and try-catch [16] statements are useful to
implement during processing in order to handle those errors
cleanly. Many records do not change often, and a change log
(e.g., ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/assay.ftpdump.history)
can be used to save bandwidth or processing
time if implemented. After decompression, the PubChem bioassays
fit on 557GB of disk space.
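The download-and-parse pattern described above can be sketched as follows. The FTP directory matches the one given in the text, but the archive name, internal file layout, and error handling are simplified placeholders; the actual ScrubChem pipeline is built on the Microsoft stack rather than Python.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET
import zipfile

FTP_DIR = "ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/XML/"

def parse_bioassay_archive(archive_name):
    """Download one bioassay archive and yield (member name, parsed XML root).

    Members that do not follow the expected format are skipped with a message,
    mirroring the try-catch approach recommended above. The archive name and
    internal layout are assumptions; list the FTP directory for real names.
    """
    local_path, _ = urllib.request.urlretrieve(FTP_DIR + archive_name, archive_name)
    with zipfile.ZipFile(local_path) as archive:
        for member in archive.namelist():
            data = archive.read(member)
            if member.endswith(".gz"):          # some members may be gzipped
                data = gzip.decompress(data)
            try:
                yield member, ET.fromstring(data)
            except ET.ParseError as err:
                print(f"Skipping {member}: malformed XML ({err})")
```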
3.4 Parsing Primary Data and Annotating (Disk 2)
3.4.1 Parser
A Parser simply identifies a class of data (e.g., assay title, assay
target, source) and extracts it. The Parser works on a single bioassay
record at a time, but it can be split up to work in parallel across
many processors. Each processor should have dedicated output files
for writing all data onto storage disk #2. In the case of Scrub-
Chem, outputs can be separated by processor into a result (e.g.,
output_results_processor_1.txt) and a description (e.g., output_de-
scriptions_processor_1.txt) file, which later can be more easily
imported into a normalized database schema. These output files
are a reflection of the database tables, and appropriate keys (e.g.,
assay ID, test ID, panel ID) should be maintained for joining them.
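A minimal sketch of this per-processor output pattern, using Python's multiprocessing module in place of the chapter's actual implementation; the output file names follow the examples above, but the record structure and extracted fields are assumptions.

```python
import csv
import multiprocessing as mp

def parse_one_record(record):
    """Extract a (description_row, result_rows) pair from one bioassay record.

    `record` is assumed to be dict-like with 'aid', 'title', 'target',
    'source', and 'results' keys; adapt this to the real schema.
    """
    description = [record["aid"], record["title"], record["target"], record["source"]]
    results = [[record["aid"], r["sid"], r["outcome"], r["value"]]
               for r in record["results"]]
    return description, results

def worker(worker_id, records):
    """Each worker writes to its own result/description files on disk #2."""
    with open(f"output_results_processor_{worker_id}.txt", "w", newline="") as res_f, \
         open(f"output_descriptions_processor_{worker_id}.txt", "w", newline="") as desc_f:
        res_writer, desc_writer = csv.writer(res_f), csv.writer(desc_f)
        for record in records:
            try:
                description, results = parse_one_record(record)
            except (KeyError, TypeError) as err:
                print(f"Worker {worker_id}: skipping malformed record ({err})")
                continue
            desc_writer.writerow(description)
            res_writer.writerows(results)

def run_parser(all_records, n_workers=4):
    # Split the records across workers; each process writes its own files.
    chunks = [all_records[i::n_workers] for i in range(n_workers)]
    procs = [mp.Process(target=worker, args=(i + 1, chunk))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```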
3.4.2 Annotator
The Annotator makes changes to the data and is incrementally
constructed over the course of a project. Building up the functions
of an Annotator involves querying of the database after each itera-
tion of building it in order to identify quality issues involving
missing, unresolved, or incorrect data/concepts. The Annotator
is then updated to incorporate logical and programmatic fixes to
address the identified issues. The database is then rebuilt to accom-
modate the desired changes. This approach is faster than
performing case-by-case updates on a large relational database,
especially after it has been indexed.
The complexity of the Annotator will depend on the quality of
original data and a user’s intended applications for the data. Mini-
mum information, as described in the Introduction, is needed in
order to derive basic hit-calls. Particularly important is a Justifica-
tion (measurement) and Modality (action) which are used in an
experimental design to test the perturbation of a molecular target.
These concepts are further discussed in Subheading 4.
3.5 Enriching Primary Data with External Linkages to Metadata (Disk 3)
Primary data is usually basic in detail, and therefore additional
metadata is often needed to enrich the descriptions of an experi-
ment. In the case of ScrubChem, other NCBI databases such as
GenBank and PubChem Compound are accessed through API
services to gather additional information for targets or chemicals.
This allows integrating in useful features such as additional identi-
fiers, taxonomies, synonyms, physical properties, sequences, etc.
The number of API calls can become very large (nearly
300,000,000 pages in the case of ScrubChem), and due to the
iterative process of building up an Annotator, it is useful to cache
many of these pages on a local disk (storage disk #3). The simple
syntax of most API URLs allows keeping a local copy of the page
using a transformation of the URL in regular file structures for
reference before making a call to the web version. The local cache of
pages can be cleaned up over time to bring in updated data as
needed.
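The local page cache can be implemented with a simple URL-to-path transformation. The sketch below illustrates one way this might look; the directory layout, hashing scheme, and example E-utilities call are assumptions, not the ScrubChem code itself.

```python
import hashlib
import os
import urllib.request

CACHE_DIR = "api_cache"   # storage disk #3 in the chapter's layout

def cached_get(url, refresh=False):
    """Return the body of `url`, reading from the local cache when possible.

    The URL is transformed into a file path (here via an MD5 digest) so the
    same request is only fetched from the web once; pass refresh=True to
    force an update, mirroring the periodic cache clean-up described above.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest() + ".txt")
    if not refresh and os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    with urllib.request.urlopen(url) as response:
        body = response.read()
    with open(path, "wb") as f:
        f.write(body)
    return body

# Example (hypothetical E-utilities call for a protein record):
# xml = cached_get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
#                  "efetch.fcgi?db=protein&id=NP_000537&rettype=gp&retmode=xml")
```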
3.6 Bulk Uploading Parsed Data into a Database (Disk 4)
This step involves building a loader function to bulk upload output
data from the Parser's output on storage disk #2 into a database on
storage disk #4. This bulk reading and writing is faster if the read
and write disks are separated. The schema for a database needs to be
described within SSIS when setting it up for loading. Since parsing
is done in parallel and output records are stored across separate files
for each processor, primary keys need to be generated at the same
time as loading. SSIS allows for bulk uploading separate files into a
database and also the building of primary keys with an auto incre-
ment. Simple transformations and checks on the data can also be
managed from within an SSIS workflow. If an alternative database
engine is used, it is recommended to identify the bulk loading
features of that engine. Since loading uses an auto increment to
build primary keys, this procedure has to be done serially. However,
the description and result tables can be loaded independently,
taking 30 min and 2 h, respectively.
3.7 Building, Indexing, and Querying the Database
Microsoft's SSMS can be used to initialize the database tables and
fields. It is also useful for adding indexes and querying the database.
It provides a typical workbench environment for managing and
exploring the database. Field data types (e.g., int, char, varchar)
should be assigned carefully in order to optimize disk space utiliza-
tion. For assigning appropriate lengths to fields, it is useful to keep a
report of the longest records during parsing. Long varchars slow
down loading but can be useful in avoiding truncated data. MS
SQL does a decent job at compressing (shrinking) the database
after loading. Indexes should be built only for the fields most often
used in queries. SSIS allows for profiling queries to see which fields
add the most time and are good candidates for indexing. For
example, CID (chemical ID) is a ScrubChem data type that is
indexed because it is often used to access data points for specific
chemicals. The descriptions table is relatively small (approximately
seven million rows) and has a fairly low cost to add indexes (about
5 min compute time/index). The results table is relatively large
(approximately 1.5 billion rows) and therefore should have very few
indexes outside of its primary key. Once indexed the database can
be quickly queried on specific tables or joined across tables. An
example query might conceptually be described as “select all che-
micals, outcomes, and their Justifications for the androgen receptor
protein target and the modality of inhibition.” The specific fields
used in the SQL syntax of this query will depend on the structure of
the database.
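Issued from Python against an MS SQL database, the conceptual query above might look like the following; every table, column, and connection detail here is a placeholder for illustration, since, as noted, the real field names depend on the structure of the database.

```python
import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

# Conceptual query: all chemicals, outcomes, and Justifications for the
# androgen receptor target with the modality "inhibition". Table and column
# names below are placeholders, not the actual ScrubChem schema.
SQL = """
SELECT r.cid, r.outcome, r.justification
FROM results AS r
JOIN descriptions AS d ON d.assay_id = r.assay_id
WHERE d.target_name = ? AND d.modality = ?
"""

conn = pyodbc.connect("DSN=ScrubChem;Trusted_Connection=yes")  # placeholder DSN
rows = conn.cursor().execute(SQL, ("androgen receptor", "inhibition")).fetchall()
for cid, outcome, justification in rows:
    print(cid, outcome, justification)
```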
3.8 Building a Hit-Call
The previously described query in Subheading 3.7 of the Methods
section would return records with a single Justification for each
outcome of a chemical tested for inhibition against the human
androgen receptor. In many cases, there are multiple records for
the same chemical that are tests from separate bioassay experiments.
These can be treated as replicates, and a hit-call is needed to
combine all available replicates into a single consensus result. For
example, a chemical such as estradiol may have been tested three
separate times all with an outcome of active with respect to inhibi-
tion. These three evidences for active outcomes can be combined
into a single consensus hit-call of active (n = 3, where n is the
number of evidences, and r = 3/3 or 1, where r is the ratio of
agreement between all evidences). If desired, this hit-call can be
4 Notes
References
1. Harris J (2017) ScrubChem. http://www.scrubchem.org
2. Wang Y et al (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:1075–1082
3. Bento AP et al (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:1083–1090
4. Wishart DS et al (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34:D668–D672
5. Gilson MK et al (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 44:D1045–D1053
6. Toxicology in the 21st Century
7. Dix DJ et al (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95:5–12
8. Davis AP et al (2017) The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res 45:D972–D978
9. Nguyen DT et al (2017) Pharos: collating protein information to shed light on the druggable genome. Nucleic Acids Res 45:D995–D1002
10. Pilarczyk M, Medvedovic M, Fazel Najafabadi M, Naim M, Michal K, Nicholas C, Shana W, Mark B, Wen N, John R, Juozas V, Jarek M, Mario M (2016) iLINCS: web-platform for analysis of LINCS data and signatures, ilincs.org. https://doi.org/10.5281/zenodo.167472
11. Wilkinson MD et al (2016) The FAIR guiding principles for scientific data management and stewardship. Sci Data 3:160018
12. Visser U et al (2011) BioAssay Ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinformatics 12:257
13. Orchard S et al (2011) Minimum information about a bioactive entity (MIABE). Nat Rev Drug Discov 10:661–669
14. jsontoclass. http://www.jsontoclass.com
15. xmltocsharp. http://xmltocsharp.azurewebsites.net
16. try-catch (C# Reference). https://msdn.microsoft.com/en-us/library/6dekhbbc.aspx
Chapter 4
Abstract
Technological advancements in many fields have led to huge increases in data production, including data
volume, diversity, and the speed at which new data is becoming available. In accordance with this, there is a
lack of conformity in the ways data is interpreted. This era of “big data” provides unprecedented oppor-
tunities for data-driven research and “big picture” models. However, in-depth analyses—making use of
various data types and data sources and extracting knowledge—have become a more daunting task. This is
especially the case in life sciences where simplification and flattening of diverse data types often lead to
incorrect predictions. Effective applications of big data approaches in life sciences require better,
knowledge-based, semantic models that are suitable as a framework for big data integration, while avoiding
oversimplifications, such as reducing various biological data types to the gene level. A huge hurdle in
developing such semantic knowledge models, or ontologies, is the knowledge acquisition bottleneck.
Automated methods are still very limited, and significant human expertise is required. In this chapter, we
describe a methodology to systematize this knowledge acquisition and representation challenge, termed
KNowledge Acquisition and Representation Methodology (KNARM). We then describe application of the
methodology while implementing the Drug Target Ontology (DTO). We aimed to create an approach,
involving domain experts and knowledge engineers, to build useful, comprehensive, consistent ontologies
that will enable big data approaches in the domain of drug discovery, without the currently common
simplifications.
Key words Knowledge acquisition, Ontology, Drug target ontology, Semantic web, Big data, Seman-
tic model, KNARM
1 Introduction
2 Methods
Fig. 1 Steps of KNowledge Acquisition and Representation Methodology (KNARM). This figure shows the nine
steps and flow of KNARM. Following agile principles, there are feedback loops present before finalizing
ontologies. The circular flow also represents that the ontology-building process is a continuous effort, allowing
ontology engineers to iteratively add more concepts and knowledge
2. I have assay X. What are the other assays that have the same
design or technology but different targets?
3. What assay technologies have been used against my kinase of
interest? Which cell lines?
After identifying units of information and patterns followed by
listing some possible use cases, the ontology engineers can intro-
duce the domain experts to their preliminary analysis or continue to
work with them toward the next steps of the methodology.
2.2 In-House Unstructured Interview
After identification of the key concepts and units of information
during sub-language analysis, we perform an interview with the
domain experts who work in the same team. This step can be
performed separately after the sub-language analysis or in a hybrid
fashion with the previous step. The unstructured interview is aimed
at understanding the data and their purposes better with the help of
the domain experts. It can be performed in a more directed fashion
by using the previously identified knowledge units or could be
treated as a separate process. Together with the previous step, this
step also helps identify the knowledge units and key concepts of
the data.
2.3 Sub-language Recycling
Following the identification of knowledge units through the textual
data of the domain, literature, and unstructured interview with the
domain experts, we perform a search on the existing ontologies and
databases. The aim of the search on the databases and ontologies is
to ascertain the already formalized knowledge units that are identi-
fied. We perform and encourage reuse of existing—relevant and
well-maintained—ontologies, aligning them with ontology in
development and using cross-references (annotated as Xref in the
ontology) to the various databases that contain the same knowl-
edge units and concepts that we determined to formalize. By
recycling the sub-language, not only we save time and effort but
also reuse widely accepted conceptualization of knowledge. In this
way, we also aim to help life scientists by sparing them the painful
data alignment practices and by helping them avoid redundant
and/or irrelevant data available in different data resources.
2.4 Metadata Creation and Knowledge Modeling
In this step, we combine the knowledge units and essential concepts
identified with those recycled from the existing databases and
ontologies to create the metadata describing the domain of the
data to be modeled. The metadata creation can be a cumbersome
task that could be performed in different levels by defining subsets
of metadata on various details of the data. For example, with our
systematically deepening approach of formalization (i.e., systemati-
cally deepening modeling approach (SDM)), we started with the
metadata for proteins and genes, followed by metadata for diseases,
tissues, and small molecules. The SDM approach allows us to focus
Fig. 2 Conceptual modeling example, showing modeling of an example kinase (ABL1) and how some of its
axioms relate to the different ontologies created using KNARM
This step can be performed within the team first and then can
be discussed with the collaborators and other scientists. Alterna-
tively, a bigger initiative can be set up to agree on the metadata,
axioms, and knowledge models (examples include OBO Foundry
ontologies [22]).
2.6 Knowledge Acquisition Validation
This step could be considered the first feedback. The aim in this
step is to identify any knowledge that is missed or misinterpreted.
At this step, the sub-language that has been identified and recycled,
the metadata that has been created, and the data dissected based on
the metadata are presented to domain experts by knowledge
engineers. It could also be
presented to a small group of users based on use cases. If missed or
misinterpreted knowledge exists, we recommend starting from the
first step and reiterating the steps listed above.
2.7 Database Formation
After validating that the knowledge acquired is correct and consistent,
we start building the backbone for the representation of the knowl-
edge. The first step is to create a database to collect the data in a
schema that will facilitate the knowledge engineering. Typically,
this will be a relational database. The domain experts may prefer
to use different means of handling and editing their data, such as a
set of flat files, but we recommend using a database as the main data
feed to the ontology that will be created as the final product. The
details of the database are designed based on the acquired metadata
and data types collected and their relations (Fig. 3 shows an exam-
ple database schema). Ideally, the databases should contain the
metadata as well as the knowledge units and the key concepts
identified in the knowledge acquisition steps. Information that
the database may not hold directly includes specific relationships
or axioms involving the different knowledge units and key concepts
that are identified during the knowledge acquisition. We placed the
relationships among the pieces of data in the next step during the
ontology building process.
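As a toy illustration of this step, the sketch below creates one such table using field names taken from the example schema in Fig. 3; SQLite is used only to keep the example self-contained, and the types, keys, and example row are illustrative choices rather than the project's actual setup.

```python
import sqlite3

# Minimal sketch of a table from the Fig. 3 example schema (GPCR_Table).
# Field names follow the figure; types, keys, and values are illustrative.
conn = sqlite3.connect("dto_staging.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS GPCR_Table (
    id              INTEGER PRIMARY KEY,
    uniprot         TEXT,
    gene            TEXT,
    tdl             TEXT,
    IDG_family      TEXT,
    GPCR_Class      TEXT,
    GPCR_Group      TEXT,
    GPCR_Family     TEXT,
    GPCR_Subfamily  TEXT,
    proteinName     TEXT,
    ligand_type     TEXT
)
""")
conn.execute(
    "INSERT INTO GPCR_Table (uniprot, gene, tdl, IDG_family) VALUES (?, ?, ?, ?)",
    ("P35348", "ADRA1A", "Tclin", "GPCR"),   # example row; values illustrative
)
conn.commit()
```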
2.8 Semiautomated Ontology Building
After placing the data dissected based on the metadata, as well as
the metadata itself, into the database, we convert the data to a more
Fig. 3 An example database schema: tables such as IC_Rel_Table, IC_Table, and GPCR_Table hold fields including id, uniprot, gene, tcrd, tdl, IDG_family, family/subfamily classifications, proteinName, ionType, channelActivity, topology, and ligand_type
2.9 Ontology Validation
The final step in the proposed workflow is the ontology validation.
The domain experts as well as the knowledge engineer perform
different tests in order to find out if the information in the ontology
is accurate. In addition, different reasoners can be run on the
ontology to check its consistency. Additional software can be imple-
mented to test the different aspects of the ontology (e.g., Java
programs that compare the database with the ontology classes,
object properties, data properties, etc.) Finally, queries for the
different use cases can be run to check if the ontology implementa-
tion answers questions it was meant to answer. If there are any
inconsistencies or inaccuracies in the ontology, the knowledge
engineer and the domain expert should try to go back to the
ontology building step. If the inconsistencies are fundamental, we
recommend starting from the first step and retracing the steps that
lead to the inconsistent knowledge. Domain experts and ontology
engineers can also choose to go back to “Metadata Creation and
Knowledge Modeling” or “Sub-language Recycling” step.
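The validation scripts described in this chapter are written in Java; purely as an illustration of the same idea, the following Python sketch loads an ontology, runs a reasoner to check consistency, and compares class labels against a staging database. The file names, table, and column are assumptions for illustration only.

```python
import sqlite3
from owlready2 import get_ontology, sync_reasoner

# Load the ontology and run a reasoner; an inconsistent ontology raises an error.
onto = get_ontology("file://dto_complete.owl").load()   # path is a placeholder
with onto:
    sync_reasoner()   # uses the bundled HermiT reasoner by default

# Compare ontology class labels with the staging database (assumed schema).
db = sqlite3.connect("dto_staging.db")
db_names = {row[0] for row in db.execute("SELECT proteinName FROM GPCR_Table")}
onto_labels = {str(label) for cls in onto.classes() for label in cls.label}

missing_from_ontology = db_names - onto_labels
print(f"{len(missing_from_ontology)} database entries have no matching class label")
```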
2.10 Implementation of the Drug Target Ontology (DTO) Using KNARM
As a part of the Illuminating the Druggable Genome (IDG) [41]
project, we designed and implemented the Drug Target Ontology
(DTO) [29]. The long-term goal of the IDG project [41] is to
catalyze the development of novel therapeutics that act on novel
drug targets, which are currently poorly understood and poorly
annotated, but are likely targetable. The project puts particular
emphasis on the most common drug target protein families, G-
protein-coupled receptors (GPCRs), nuclear receptors, ion chan-
nels, and protein kinases. Therefore, we focused initially on for-
mally classifying, annotating, and modeling these specific protein
families in their role as drug targets, and DTO is focused on
proteins known as putative drug targets including many aspects to
describe the relevant properties of these proteins in their role as a
drug target. While creating DTO, we further advanced the meth-
odology and ontology architecture that we used for the BAO [26]
and other application ontologies from the LINCS project (e.g.
LIFE ontology) [20]. A longer-term goal for DTO is to integrate
it with the assays (formally described in BAO) that are used to
identify and characterize small molecules that modulate these tar-
gets. This will result in an integrated drug discovery knowledge
framework.
2.10.1 Sub-language Analysis and In-House Unstructured Interview for DTO
The initial interviews and sub-language analysis steps involved
determining the different classifications of the drug targets and
their properties. The IDG project defined drug target
[24, 29, 42] as "A material entity, such as native (gene product)
2.10.2 Sub-language Recycling for DTO
While designing the ontology, we decided to add the UniProt IDs
for the proteins and the Entrez IDs [30] for the genes as cross-
references. In addition to this, we wanted to include the textual
definitions for the genes and the proteins. We also cross-referenced
the synonymous names and symbols for the molecules that already
exist in different databases.
We aimed at creating the Drug Target Ontology (DTO) as a
comprehensive resource by importing existing information about
the biological and chemical molecules that DTO contains. In this
way, we aim to help the life scientists’ query and retrieve informa-
tion about the different drug targets that they are working on. To
do that, we wrote various scripts using Java to retrieve information
from different databases. These databases include UniProt and
NCBI databases for Entrez IDs for the genes.
In addition to several publicly available databases and data,
including the DISEASES and TISSUES databases [44, 47], we
also used the collaborators' TCRD databases [42] in order to retrieve
information about proteins, genes, and their related target devel-
opment levels (TDLs), as well as the tissue and disease information.
The DISEASES and TISSUES databases were developed in the
Jensen group from several resources, including advanced text
mining. They include a scoring system to provide a consensus of the
various integrated data sources. We retrieved the proteins with their
tissue and disease relationships and the confidence scores that are
given for the relationships. This data was loaded into our database
and later used to create the ontology axioms that refer to the
probabilistic values of the relationships.
In addition to the larger-scale information derived from the
databases mentioned above, a vast amount of manual curation for
the proteins and genes was performed in the team by the curators
and domain experts. Most significantly improved were the drug
target classifications for kinases, ion channels, nuclear receptors,
and GPCRs. For most protein kinases, we followed the phyloge-
netic tree classification originally proposed by Sugen and the Salk
Institute [48]. Protein kinases not covered by this resource were
manually curated and classified mainly based on information in
UniProt [49] and also the literature. Non-protein kinases were
curated and classified based on their substrate chemotypes. We
also added pseudokinases, which are becoming more recognized
and relevant drug targets. We continue updating manual annota-
tions and classifications as new data becomes available. Nuclear
receptors were organized following the IUPHAR classification.
GPCRs were classified based on information from several sources
primarily using GPCRDB (http://www.gpcr.org/7tm/) and
IUPHAR as we have previously implemented in our GPCR ontol-
ogy [50]. However, not all GPCRs were covered, and we are
aligning GPCR ontology with other resources to complete classifi-
cation for several understudied receptors. We are also incorporating
ligand chemotype-based classification. A basic classification of ion
channels is available in IUPHAR [51]. Manual classification is in
progress for 342 ion channels in order to provide better classifica-
tions as required including for domain functions, subunit topology,
and heteromer and homomer formation.
Protein domains were annotated using the Pfam web service.
The domain sequences and domain annotations were extracted
using custom scripts. Several of the kinase domains were manually
curated based on their descriptions. For nuclear receptors, we
identified and annotated the ligand-binding domains, which are
most relevant as drug targets. For GPCRs, we identified 7TM
domains for majority (780 out of 827) of GPCRs. Ion channel
domains were annotated, and transmembrane domains were iden-
tified; additional ion channel characteristics—such as regulatory
and, gating mechanism, transported ion—were curated for ion
channel drug targets. Additional subclassification and annotation
are in progress and will further improve that module.
In addition to the curated drug target family function-specific
domain annotations, we generated comprehensive Pfam domain
annotations for the kinase module [42]. The domain sequences
were compared to the PDB chain sequences by BLAST, and
e-values were calculated. For significant hits, domain identities
were computed using the EMBOSS software suite. These results
were used to align and identify critical selectivity residues, such as
2.10.4 Structured Interview for DTO

Based on the metadata created, we interviewed researchers inside and outside of the group. This step confirms that
the interpretation of the text data is correct and accurate. Addition-
ally, this step can be used in combination with other methods in
order to decide on a concept’s proper name. In this case, we chose
to use existing names in well-known databases such as
UniProt [49].
The aim of this step is to finalize the names and types of the concepts used in the metadata and to make sure that the ontology engineer and the domain experts share the same understanding before the axioms are written. Therefore, this step can be combined
with the next step, i.e., knowledge acquisition validation.
2.10.5 Knowledge Acquisition Validation (KA Validation) for DTO

In this case, after the metadata creation, various interviews, and reviews of the data, an ontology engineer runs several scripts to check the consistency of the data. In addition, a domain expert
performs a thorough manual expert review of the extracted data.
Before the database formation, metadata is also reviewed. Domain
experts use metadata for grouping the extracted data. Modeling of
the knowledge is confirmed with ontology engineer and domain
expert reviews. Structured data is then shared with the research
scientists inside and outside of the team, especially with the scien-
tists in the IDG project to make sure that the information
2.10.7 Semiautomated Ontology Building for DTO

The ontology was then built from this database in an automated way using a Java application, OntoJog [24], which will be released and described separately soon. This process builds all vocabulary
files, modules, and the axioms that can be automatically con-
structed given information in the database. In addition, all the
external modules were built. The various vocabularies and modules
are organized hierarchically via direct and indirect imports leading
to DTO_core. DTO_core is then imported, along with the expert-
asserted axioms and the external modules, to DTO_complete
(Fig. 4).
2.11 Knowledge Modeling of the Drug Target Ontology

In BAO, the formal descriptions of assays were manually axiomatized. DTO, which was created for the IDG project, focuses on the biomolecules and their binding partners such as the specific ions
for ion-channeling proteins or small molecule ligands for GPCRs,
as well as their relationships to the specific diseases and tissues.
We used several tools, including Java, OWL API, and Jena to
build the ontology in a semiautomated way leveraging our local
database and implementing a new modularization architecture
given in detail below.
2.12 A New Modular Architecture for the Drug Target Ontology

The modular design of the DTO adds an additional layer on top of our previously reported modular architecture developed for BAO [26]. Specifically, we separate the module with auto-generated
simple axioms, which are created using native-DTO concepts
Fig. 4 Modular architecture of DTO showing the core principles and levels of DTO’s architecture with direct
and indirect imports
2.12.1 Ontology Validation for DTO

Several checkpoints for data validation are performed throughout the methodology. For example, after the data is extracted in the Sublanguage Recycling step, the ontology engineer runs several scripts
to check the consistency of the data. In addition, the domain expert
performs a thorough manual expert review of the extracted data.
The second check point in the methodology is during the Database
Formation step. Several scripts are used to check whether the extracted data is properly imported into the database under the appropriate
metadata categories as a unit of information along with the meta-
data. Once the Semiautomated Ontology Building step is com-
plete, the ontology engineer runs available reasoners to check the
consistency of information. Furthermore, several SPARQL queries
are run to flag any discrepancies. If there are any issues in the
ontology, the ontology engineer and domain experts could decide
to step back and repeat the previous steps in the methodology.
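For illustration, a consistency check of the kind described above can be expressed as a SPARQL query run with a library such as rdflib; the sketch below flags classes that lack an rdfs:label. The file name and the specific query are hypothetical examples under stated assumptions, not the actual DTO validation scripts.

```python
# Minimal sketch of a SPARQL-based consistency check (not the actual DTO scripts).
# Requires: pip install rdflib. The file name "dto_core.owl" is a hypothetical example,
# assumed to be an RDF/XML serialization as typically exported by Protege.
from rdflib import Graph
from rdflib.namespace import OWL, RDFS

g = Graph()
g.parse("dto_core.owl", format="xml")

# Flag every named class that has no rdfs:label, a common curation discrepancy.
query = """
SELECT ?cls WHERE {
    ?cls a owl:Class .
    FILTER isIRI(?cls)
    FILTER NOT EXISTS { ?cls rdfs:label ?label }
}
"""
for row in g.query(query, initNs={"owl": OWL, "rdfs": RDFS}):
    print("Missing label:", row.cls)
```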
Another ontology validation script for DTO is designed to read
DTO vocabulary and module files and compare them to the
3 Notes
Complex life sciences data fits the big data description due to its large volume (terabytes and larger), complexity (interconnected with over 25 highly accessed databases [18] and over 600 ontologies [23]), variety (many technologies generating different data types, such as gene sequencing, RNA-Seq gene expression, and microscopy imaging data), and dynamic nature (growing exponentially and changing fast [18, 25]). New tools are required to store,
manage, integrate, and analyze such data while avoiding oversim-
plification. It is a challenge to design applications involving such big
data sets aimed at advancing human knowledge. One approach is to
develop a knowledge-based integrative semantic framework, such
as an ontology that formalizes how the different data types fit
together given the current understanding of the domain of investi-
gation. Building ontologies is time-consuming and limited by the
knowledge acquisition process, which is typically performed manually by domain experts and knowledge engineers.
In this chapter, we described a methodology, KNowledge
Acquisition and Representation Methodology (KNARM), as a
guided approach, involving domain experts and knowledge engi-
neers, to build useful, comprehensive, consistent ontologies that
will enable big data approaches in the domain of drug discovery,
without the currently common simplifications. It is designed to
help with the challenge of acquiring and representing knowledge
in a systematic, semiautomated way. We applied this methodology
in the implementation of the Drug Target Ontology (DTO).
While technological innovations continue to drive the increase
of data generation in the biomedical domains across all dimensions
of big data, novel bioinformatics and computational methodologies
will facilitate better integration and modeling of complex data and
knowledge.
Although the above-described methodology is still a work in
progress, it provided a systematic process for building concordant
ontologies such as BioAssay Ontology (BAO) and Drug Target
References
1. Gruber TR (1993) Towards principles for the design of ontologies used for knowledge sharing. Int J Hum Comput Stud 43(5–6):907–928
2. CommonKADS. http://commonkads.org/
3. Schreiber G, Wielinga B, de Hoog R, Akkermans H, Van de Velde W (1994) CommonKADS: a comprehensive methodology for KBS development. IEEE Expert 9(6):28–37
4. Barnes JC (2002) Conceptual biology: a semantic issue and more. Nature 417(6889):587–588
5. Blagosklonny MV, Pardee AB (2002) Conceptual biology: unearthing the gems. Nature 416(6879):373
6. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL (2000) GenBank. Nucleic Acids Res 28(1):15–18
7. Heflin J, Hendler J (2000) Semantic interoperability on the web. Department of Computer Science, University of Maryland, College Park, MD
8. Noy NF, Fergerson RW, Musen MA (2000) The knowledge model of Protege-2000: combining interoperability and flexibility. In: Knowledge engineering and knowledge management: methods, models, and tools. Springer, New York, pp 17–32
9. Stevens R, Goble CA, Bechhofer S (2000) Ontology-based knowledge representation for bioinformatics. Brief Bioinform 1(4):398–414
10. Wache H, Voegele T, Visser U, Stuckenschmidt H, Schuster G, Neumann H, Hübner S (2001) Ontology-based integration of information - a survey of existing approaches. In: IJCAI-01 workshop: ontologies and information sharing. Citeseer, New Jersey, pp 108–117
11. Yeh I, Karp PD, Noy NF, Altman RB (2003) Knowledge acquisition, consistency checking and concurrency control for gene ontology (GO). Bioinformatics 19(2):241–248
12. Degtyarenko K, De Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M (2008) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 36(suppl 1):D344–D350
13. Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF (2010) The description logic handbook: theory, implementation and applications, 2nd edn. Cambridge University Press, New York, NY
14. Buchanan BG, Barstow D, Bechtal R, Bennett J, Clancey W, Kulikowski C, Mitchell T, Waterman DA (1983) Constructing an expert system. Build Exper Sys 50:127–167
15. Natale DA, Arighi CN, Blake JA, Bona J, Chen C, Chen S-C, Christie KR, Cowart J, D'Eustachio P, Diehl AD, Drabkin HJ, Duncan WD, Huang H, Ren J, Ross K, Ruttenberg A, Shamovsky V, Smith B, Wang Q, Zhang J, El-Sayed A, Wu CH (2011) The representation of protein complexes in the protein ontology (PRO). BMC Bioinformatics 12(1):1
16. Clark AM, Litterman NK, Kranz JE, Gund P, Gregory K, Bunin BA, Cao L (2016) BioAssay templates for the semantic web data science: challenges and directions. PeerJ Comput Sci 2(8):e61
17. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J (2008) Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform 41(5):706–716
18. Cook CE, Bergman MT, Finn RD, Cochrane G, Birney E, Apweiler R (2015) The European Bioinformatics Institute in 2016: data growth and integration. Nucleic Acids Res 44(D1):D20–D26
19. Hitzler P, Krötzsch M, Rudolph S (2009) Foundations of semantic web technologies. Chapman and Hall (CRC), Florida
20. Küçük-McGinty H, Metha S, Lin Y, Nabizadeh N, Stathias V, Vidovic D, Koleti A, Mader C, Duan J, Visser U, Schurer S (2016) IT405: building concordant ontologies for drug discovery. In: International conference on biomedical ontology and BioCreative (ICBO BioCreative 2016), Oregon
21. Schurer SC, Vempati U, Smith R, Southern M, Lemmon V (2011) BioAssay ontology annotations facilitate cross-analysis of diverse high-throughput screening data sets. J Biomol Screen 16(4):415–426
22. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ et al (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11):1251–1255
23. Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, Musen MA (2011) BioPortal: enhanced functionality via new web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res 39(2):W541–W545
24. Lin Y, Mehta S, Küçük-McGinty H, Turner JP, Vidovic D, Forlin M, Koleti A, Nguyen D-T, Jensen LJ, Guha R, Mathias SL, Ursu O, Stathias V, Duan J, Nabizadeh N, Chung C, Mader C, Visser U, Yang JJ, Bologa CG, Oprea TI, Schürer SC (2017) Drug target ontology to classify and integrate drug discovery data. J Biomed Semantics 8(1):50
25. Ma'ayan A (2017) Complex systems biology. J R Soc Interface 14(134):1742–5689
26. Abeyruwan S, Vempati UD, Küçük-McGinty H, Visser U, Koleti A, Mir A, Sakurai K, Chung C, Bittker JA, Clemons PA, Brudz S, Siripala A, Morales AJ, Romacker M, Twomey D, Bureeva S, Lemmon V, Schürer SC (2014) Evolving BioAssay ontology (BAO): modularization, integration and applications. J Biomed Semantics 5(Suppl 1):S5
27. BAOSearch. http://baosearch.ccs.miami.edu/
28. Visser U, Abeyruwan S, Vempati U, Smith R, Lemmon V, Schurer S (2011) BioAssay ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinformatics 12(1):257
29. Drug Target Ontology. http://drugtargetontology.org/
30. Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA, Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J (2010) Modeling biomedical experimental processes with OBI. J Biomed Semantics 1(Suppl 1):S7
31. Callahan A, Cruz-Toledo J, Dumontier M (2013) Ontology-based querying with Bio2RDF's linked open data. J Biomed Semantics 4(Suppl 1):S1
32. Ceusters W, Smith B (2006) A realism-based approach to the evolution of biomedical ontologies. AMIA Annu Symp Proc:121–125
33. Consortium TGO (2015) Gene ontology consortium: going forward. Nucleic Acids Res 43(D1):D1049–D1056. https://doi.org/10.1093/nar/gku1179
34. Decker S, Erdmann M, Fensel D, Studer R (1999) Ontobroker: ontology based access to distributed and semi-structured information. In: Database semantics. Springer, New York, pp 351–369
35. Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5(2):199–220
36. Köhler J, Philippi S, Lange M (2003) SEMEDA: ontology based semantic integration of biological databases. Bioinformatics 19(18):2420–2427
37. Basic Formal Ontology (BFO) Project. http://www.ifomis.org/bfo
38. Pease A, Niles I, Li J (2002) The suggested upper merged ontology: a large ontology for the semantic web and its applications. In: Working notes of the AAAI-2002 workshop on ontologies and the semantic web
39. Sure Y, Erdmann M, Angele J, Staab S, Studer R, Wenke D (2002) OntoEdit: collaborative ontology development for the semantic web. Springer, New York
40. Welty CA, Fikes R (2006) A reusable ontology for fluents in OWL. In: Formal ontology in information systems. Frontiers in artificial intelligence and applications. IOS, pp 226–236
41. NIH Illuminating the Druggable Genome | NIH Common Fund. https://commonfund.nih.gov/idg/index
42. TCRD Database. http://habanero.health.unm.edu/tcrd/
43. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33:D514–D517
44. Pletscher-Frankild S, Pallejà A, Tsafou K, Binder JX, Jensen LJ (2015) DISEASES: text mining and data integration of disease-gene associations. Methods 74:83–89
45. NCBI (2017) https://www.ncbi.nlm.nih.gov/gene/about-generif
46. Kiermer V (2008) Antibodypedia. Nat Methods 5(10):860
47. Santos A, Tsafou K, Stolte C, Pletscher-Frankild S, O'Donoghue SI, Jensen LJ (2015) Comprehensive comparison of large-scale tissue expression datasets. PeerJ 3:e1054
48. Sugen and the Salk Institute (2012) http://kinase.com/human/kinome/phylogeny.html
49. Consortium TU (2015) UniProt: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212. https://doi.org/10.1093/nar/gku989
50. Przydzial MJ, Bhhatarai B, Koleti A, Vempati U, Schürer SC (2013) GPCR ontology: development and application of a G protein-coupled receptor pharmacology knowledge framework. Bioinformatics 29(24):3211–3219
51. Pawson AJ, Sharman JL, Benson HE, Faccenda E, Alexander SP, Buneman OP, Davenport AP, McGrath JC, Peters JA, Southan C, Spedding M, Yu W, Harmar AJ, NC-IUPHAR (2013) The IUPHAR/BPS guide to PHARMACOLOGY: an expert-driven knowledgebase of drug targets and their ligands. Nucleic Acids Res 42(D1):D1098–D1106
52. Vidović D, Koleti A, Schürer SC (2014) Large-scale integration of small molecule-induced genome-wide transcriptional responses, kinome-wide binding affinities and cell-growth inhibition profiles reveal global trends characterizing systems-level drug action. Front Genet 5:342
53. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). IEEE Computer Society, Washington, DC, pp 1–10
Part II
Abstract
PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per
year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety,
and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text
mining provides a means to automatically read this corpus and to extract the relations found therein as
structured information. Having data in a structured format is a huge boon for computational efforts to
access, cross-reference, and mine the data stored therein. This is increasingly useful as biological research is
becoming more focused on systems and multi-omics integration. This chapter provides an overview of the
steps that are required for text mining: tokenization, named entity recognition, normalization, event
extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail
on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based
approach and provides the text mining evidence for STRING and several other databases.
Key words Automated text processing, Dictionary-based approach, Named entity recognition,
PubMed, Structured information, Text mining, Text normalization
1 Introduction
database [20] and gives good recall and precision also on other
domains [21]. In the next section, we walk through the process of
text mining in general and briefly introduce the different
approaches that can be taken along with their respective merits.
The third section focuses on the data that will need to be prepared
in order to apply a dictionary-based system, using the specific
example of the JensenLab tagger. We will then discuss some limita-
tions of text mining in general and this system in particular as well
as ways to circumvent them.
2.2 Tokenization

Any text mining approach must first prepare the text to be read by
the software by dividing the text into individual words. This process
of identifying the word boundaries is called tokenization, and each
word is referred to as a token. Word boundaries are for the most
part obvious for text in English and other languages in which words
are separated by spaces; however, contractions and hyphenated
words can make tokenization less straightforward. For most text
mining methods, text must also be segmented into sentences.
Here, abbreviations that are terminated with a period are the
main challenge, since in all other cases, a period will clearly denote
the end of the sentence. Documents that have been converted from PDF and that contain columns can introduce additional complications, where text segment boundaries have not been split correctly and
words can run together. Layout-aware PDF conversion software
may help mitigate this problem.
Many text mining packages use the Stanford parser [43], which
provides tokenization along with some other useful features such as
co-reference resolution [44]. The Python natural language toolkit
[45] is another popular option.
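As a minimal illustration of sentence segmentation and tokenization (a sketch using NLTK rather than the tagger itself, and assuming the "punkt" sentence model has been downloaded):

```python
# Minimal tokenization sketch using NLTK (pip install nltk).
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # sentence segmentation model

text = ("Dr. Smith et al. measured IL-6 in S. cerevisiae. "
        "The effect wasn't significant.")

for sentence in sent_tokenize(text):   # abbreviations like "et al." are the hard part
    print(word_tokenize(sentence))     # hyphenated terms and contractions are split here
```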
2.3 NER and Normalization

Named entity recognition (NER) refers to the task of locating any terms of interest in the corpus. Under a dictionary approach, this
involves scanning the corpus for all terms that appear in the dictio-
nary. Once a term has been identified, it can then be mapped onto a
single corresponding identifier in a predefined ontology or taxo-
nomic resource, for example, a taxonomic identifier for species [46]
or a database identifier for a protein [2]. This process is called
normalization. With a dictionary, normalization is essentially auto-
matic (modulo any disambiguation that needs to take place) since
the dictionary will be originally constructed from an ontology,
where each name is already tied to its normalized form. Conversely,
under machine learning methods, normalization is a separate task
that will be learned after observing sufficient training examples.
The JensenLab tagger and LINNAEUS are both dictionary-
based systems that perform joint NER and normalization and that
are broadly applicable to any domain [19, 47]. Alternatively, Tag-
gerOne, which uses semi-Markov models, requires training data
and can be applied to any domain [48]. NERsuite takes a third
approach and performs part-of-speech tagging (to determine which
words are nouns, verbs, adjectives, and so forth), lemmatization
(to reduce nouns to their singular forms and verbs to their infinitive
forms), and chunking (to identify noun phrases) prior to named
entity recognition and normalization [49]. All the software men-
tioned here is open source.
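To make the idea concrete, the sketch below shows a deliberately simplified dictionary-based NER and normalization pass in Python: it matches the longest dictionary term at each position and maps it to its identifier. The names and identifiers are illustrative only; real systems such as the tagger use far more efficient matching plus the disambiguation steps described later.

```python
# Simplified dictionary-based NER + normalization (illustrative only).
# Real taggers use efficient multi-pattern matching and orthographic variants.
dictionary = {
    "cyclin-dependent kinase 1": "ENSP00000378699",  # illustrative identifiers
    "cdk1": "ENSP00000378699",
    "homo sapiens": "NCBITaxon:9606",
    "human": "NCBITaxon:9606",
}

def tag(text, dictionary, max_words=5):
    tokens = text.lower().split()
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):  # longest match first
            term = " ".join(tokens[i:i + n])
            if term in dictionary:
                matches.append((term, dictionary[term]))  # NER + normalization in one step
                i += n
                break
        else:
            i += 1
    return matches

print(tag("Human cyclin-dependent kinase 1 (CDK1) drives mitosis.", dictionary))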
2.4 Event Extraction

Once entities are located in the text, the next step is to find relations
between them. This process is referred to as event extraction.
Following NER, the JensenLab tagger uses co-occurrence counts
to determine relationships between entities, but other statistical
methods can be used to determine co-occurrence relationships
[50]. A method similar to the tagger's co-occurrence scoring is term frequency-inverse document frequency (tf-idf) weighting [51]. This gives more weight to terms that are globally rare but
that are used frequently within a small set of documents. Frequen-
cies of co-occurring words can be captured with N-grams, which
are lists of all sequences of N words that occur in sequence in a
document [52]. N-grams can be compared to identify documents
on similar topics or rare phrases within a document. These methods
tend to require a large corpus to generate accurate statistics, and
they do not use the meaning of the words in the document to
influence the statistics.
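As a small illustration of the tf-idf weighting mentioned above (a generic sketch, not the tagger's own scoring), the weight of a term can be computed from its frequency within a document and its rarity across the corpus:

```python
# Generic tf-idf sketch: weight terms that are frequent in a document but rare overall.
import math
from collections import Counter

docs = [
    "cdk1 promotes mitosis and cell cycle progression".split(),
    "cdk1 phosphorylates substrates during mitosis".split(),
    "the detergent sds denatures proteins".split(),
]

df = Counter()                       # document frequency of each term
for doc in docs:
    df.update(set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    idf = math.log(len(docs) / df[term])
    return tf * idf

print(round(tf_idf("sds", docs[2]), 3))     # rare term -> higher weight
print(round(tf_idf("mitosis", docs[0]), 3)) # more common term -> lower weight
```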
Whereas co-occurrence relies on statistical relationships
between word occurrences, grammar-based approaches use part-
2.5 Benchmarking

There are two ways that the results of text mining can be evaluated.
First, the entities that are identified can be compared against a
manually annotated corpus to evaluate how well the system per-
forms compared to trained humans. Since humans too are prone to
error, and texts are often truly ambiguous, annotations made by
different human annotators will not always agree. To reliably assess
the results of text mining, there should be good agreement
between annotators (e.g., above 90%) to ensure that the corpus
has been annotated as consistently as possible. Both the inter-
annotator agreement and the text mining system’s agreement
with the consensus of the human annotators are generally reported as an F-score (the harmonic mean of precision and recall) or as Cohen's kappa if the annotations fall into a fixed number of classes. To
create annotations, browser-based tools such as tagtog [59] and
brat [60] provide simple interfaces to highlight entities in text and
normalize them. PubAnnotation is a repository of hand-annotated
corpora that is used by the biomedical text mining community to
distribute annotations [61]. Creators of new corpora are encour-
aged to upload their work here so that it may be found and reused.
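For reference, the F-score can be computed directly from the counts of true positives, false positives, and false negatives; a minimal sketch with invented counts:

```python
# F-score as the harmonic mean of precision and recall,
# computed from true positive, false positive, and false negative counts.
def f_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 90 entities found correctly, 10 spurious, 20 missed.
print(round(f_score(tp=90, fp=10, fn=20), 3))  # 0.857
```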
The second benchmarking strategy is to evaluate the
co-occurrence results against a gold standard. Here, we look
directly at the end result, e.g., co-occurrences, instead of the preci-
sion and recall of specific matches. In some domains, well-defined
gold standard sets are available, for example, disease-protein inter-
actions from OMIM [62] or drug-protein interactions from Drug-
Bank [63]. The resulting scores can be compared against known
values, and a probability or a confidence can be assigned to the text
This section details the input formats that are required to use the
JensenLab tagger. Although we focus on a particular implementa-
tion here, the same information will be needed as input to other
dictionary-based text mining pipelines. The tagger internally toke-
nizes the text, performs named entity recognition and normaliza-
tion, and can optionally calculate co-occurrences between user-
provided pairs of item types. However, benchmarking of the results
must be done by the user of the software.
The tagger is made available as a Docker image [65], which can be
downloaded from http://hub.docker.com/r/larsjuhljensen/tag
ger/. Alternatively, the source code is available at http://
bitbucket.org/larsjuhljensen/tagger, and instructions are available
in the readme on how to install this on a Mac or Linux system.
The tagger interface can be used directly from the command
line, or it can be called with a Python wrapper. The tagger reads all
documents bytewise and matches bytes from the dictionary against
bytes of the corpus. This means that words containing Unicode
characters will be matched correctly, but the tagger will report all
positions in byte coordinates (as opposed to character coordinates).
If you use UTF-8 text in the dictionary or the corpus, the Python
wrapper can be used to convert positions between bytes and
characters so that the reported positions are correct. The tagger will
process the tens of millions of documents in PubMed in under 2 h
on most servers [19].
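Because the tagger reports match positions as byte offsets, a UTF-8 corpus needs a conversion step whenever character offsets are required. The Python wrapper handles this; the sketch below only illustrates the underlying idea and is not the wrapper's actual code.

```python
# Converting a byte offset (as reported by the tagger) into a character offset
# for UTF-8 text. Illustrative only; the tagger's Python wrapper does this for you.
def byte_to_char_offset(text, byte_offset):
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8", errors="ignore"))

text = "β-catenin binds TCF"      # "β" is 2 bytes in UTF-8 but 1 character
byte_start = text.encode("utf-8").find("TCF".encode("utf-8"))
print(byte_start)                              # 17 (byte coordinate)
print(byte_to_char_offset(text, byte_start))   # 16 (character coordinate)
```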
3.1 Dictionary Creation: Entities and Names

When performing text mining on corpora that cover a broad range of topics, the recall of dictionary approaches is limited due to the vast multitude of entities that need to be covered by the dictionary.
However, in the biomedical space, the terms of interest are gener-
ally better defined and limited in scope. Thus, it is feasible and
straightforward to make dictionaries that comprehensively cover
all the terms of a given type.
The dictionary, also sometimes called a lexicon, thesaurus, or
gazetteer, is simply a list of all the entities of interest and all of their
possible synonyms with common spelling, or orthographic, varia-
tions. The tagging software will then match each of these terms
against the text and return the locations at which they match.
Additionally, the results are filtered through a stopword list,
which will reject matches of any words that are known to give
many false positives.
It can be a significant investment of work to create a new
dictionary from scratch, so most dictionaries are built either from
existing databases or ontologies. For example, the dictionary used
to text mine the DISEASES database [4] was created from the Disease Ontology [66]. Ontologies define the vocabularies, properties, and
relationships for the entities in a given subject. Many ontologies
have already been created for a range of biomedical domains, which
may provide a starting point for a new dictionary. The Open
Biomedical Ontologies Foundry [67] is a collection of ontologies
that are intended to be interoperable and standardized. BioPortal
hosts a library of more than 260 different ontologies, along with
related resources and tools that can be used to aid dictionary
creation [68].
What makes an ontology useful for text mining is primarily
having a complete list of entities and all of their possible synonyms.
This is not necessarily the same as what makes the ontology useful
for answering questions about the relationships between entities, a
more traditional use of ontologies. For text mining, having a loose
hierarchy of terms is sufficient in terms of structure, and an exten-
sive set of interrelationships between entities is not needed.
Since it is essential to have a good list of synonyms to have good
recall, it is worth putting some care into assembling the dictionary
from as many diverse sources as possible to cover lexical variants and
names that are commonly used in different disciplines. For exam-
ple, if you are using Ensembl identifiers for proteins, it would be
worthwhile to import their UniProt synonyms as well. Be aware
that terms that are used in a controlled vocabulary such as an
ontology may become out of date and incomplete as new terms
are generated too quickly to be included. If the ontology does not
tagger. The proteins for a species will only be tagged if a name for
the species corresponding to the protein type has also been tagged
in the document. This helps prevent false positives caused by gene
names from organisms that the text is not about and further helps
reduce the ambiguity caused by many organisms sharing gene
names, which will be discussed later.
Expanding the list of synonyms with all possible plurals and
variants is done to gain recall. In some cases, this will lead to
generating terms that have a different meaning or that should not
be tagged. These false positives will be filtered out in the next step,
where we apply a stopword list to increase precision.
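A sketch of this kind of variant expansion is shown below; the specific rules (naive plurals, hyphen/space/fused forms) are simple illustrations, and the over-generated names they produce are exactly what the stopword list in the next step is meant to catch.

```python
# Illustrative expansion of a synonym with simple orthographic variants.
# Over-generated names that cause false positives are later stopworded.
def expand(name):
    variants = {name, name.lower(), name.upper()}
    if "-" in name:
        variants.add(name.replace("-", " "))   # hyphen -> space
        variants.add(name.replace("-", ""))    # hyphen removed
    for v in list(variants):
        if not v.endswith("s"):
            variants.add(v + "s")              # naive plural
    return variants

print(sorted(expand("pseudo-kinase")))
```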
Pre-made dictionaries for proteins, species, GO terms, tissues,
diseases, environments, chemicals, and phenotypes are linked from
the readme file in the bitbucket repository. Stopwords and groups
files for these dictionaries are also available from the same location.
3.2 Stopwords

After generating the dictionaries, the tagger should be run over the
intended corpus, and the output should be inspected for false
positives. Depending on the size of the corpus, it will be impossible
to look at all the results, so priority should be given to the terms
that result in the most hits, since these will affect the overall quality
the most. Any terms that should not be tagged should be added to
the stopword list so that this specific term will not be tagged.
Whereas the dictionary is case insensitive, the stopwords are case
sensitive, which allows for specific case variants to be stopworded
while still tagging other variants that are not on the stopword list.
This includes hyphens, so "cm-1" can be stopworded (likely meaning per centimeter, cm-1), while a differently cased variant of "cm-1" (a gene found in Arabidopsis) would still be tagged if it were present in the dictionary. Similarly, "RAN" is the symbol for a human gene and is tagged, but "ran" is stopworded, as it otherwise would result in a plethora of false positives.
The format of the stopwords file is two columns separated by a
tab, the first containing the term and the second containing “t” if
the term is a stopword and “f” if it is not a stopword.
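A minimal sketch of parsing such a stopword file is given below; the file name and entries are hypothetical examples.

```python
# Reading a tab-separated stopword file: term <TAB> "t" (stopword) or "f" (kept).
# File name and contents are hypothetical examples.
stopwords = set()
with open("stopwords.tsv", encoding="utf-8") as handle:
    for line in handle:
        term, flag = line.rstrip("\n").split("\t")
        if flag == "t":
            stopwords.add(term)   # matching is case sensitive, as described above

print("ran" in stopwords)   # True if the lowercase form was stopworded
```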
The groups file uses the serial numbers defined in the entities file to map
children to parents. In case of a tree structure or directed acyclic
graph, all relationships must be specified in the groups file, i.e., a
child term must be explicitly linked not only to its parents but also
to its grandparents. Because relationships are not inferred through
indirect relationships, it is also possible to represent group member-
ships that do not fit a tree structure.
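Because indirect ancestors must be listed explicitly, it is convenient to compute the transitive closure of the child-to-parent links before writing the groups file. The sketch below does this for a small hypothetical hierarchy keyed by serial numbers; consult the tagger readme for the exact file format.

```python
# Computing the transitive closure of child -> parent links so that every child
# is also linked to its grandparents, as the groups file requires.
# The serial numbers below are hypothetical examples.
parents = {
    3: {2},      # e.g., a specific kinase -> its kinase family
    2: {1},      # kinase family -> protein kinases
}

def ancestors(serial, parents):
    seen = set()
    stack = list(parents.get(serial, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen

for child in parents:
    for ancestor in sorted(ancestors(child, parents)):
        # print one explicit child/ancestor pair per line (format is illustrative)
        print(f"{child}\t{ancestor}")
```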
3.4 Disambiguation

The same name is often used to refer to multiple entities; such
names are referred to as homonyms or ambiguous names. We try to
disambiguate them at several stages of the text mining process.
During the dictionary build, if a gene name is the prescribed
standard gene name for one gene, and is also a rarely used synonym
for another gene, it is most likely that any term in the corpus refers
to the former, and so the term will only be added as a name for
it. Names may also conflict across types of entities, for example,
“ECD” can be used to refer to the recommended human gene
symbol for ecdysoneless homolog, a regulator of p53 stability and
function, or to refer to the abbreviation for endocardial cushion
defect, a congenital heart condition. In such cases, the dictionary
build should only allow this name to appear in one of the diction-
aries. There is no strict rule that dictionaries must not contain the
same name for different types, and the tagger will tag them all, but
it is better to disambiguate them prior to doing named entity
recognition so that the results are much easier to work with.
If protein names conflict across organisms, these are all added
to the dictionary and are disambiguated at tagging time. For exam-
ple, the gene cyclin-dependent kinase 1, or CDK1, which promotes
progression through the cell cycle has the same name in both
human and mouse. In order to not tag all instances of this protein
as both mouse and human CDK1 when only one was intended by
the author, the tagger will require the species to be tagged in the
text first. The only exception to this rule is for human proteins,
which can be tagged without requiring that the species be explicitly
mentioned in the text, because very often the species is implied to
be human in biomedical abstracts. This differs from the tagging of
other types, which have no interdependencies.
Using a groups file will further help disambiguate terms.
Groups files for proteins are generated from eggNOG orthology
relationships between proteins, which are hierarchical as of egg-
NOG 4.5 [71]. Often protein names refer to the same gene in
multiple organisms, as with CDK1, but this is not always the case.
For example, the gene symbol CDC2 refers to different genes in
S. cerevisiae and Sz. pombe, and these should be treated differently
than matches in the CDK1 example. If the text refers to a term that
is the name of a protein for two different species, and the names of
both species are present in the document, and the proteins are in
the same orthology group, then the protein will be tagged as
3.5 Co-occurrences

The tagger will optionally score co-occurrences between terms of
the same or different types, as specified in the type-pairs file. To
score co-occurrences, the tagger will count terms that occur in the
same sentence, same paragraph, and same document with decreas-
ing levels of weight. The score is then adjusted for the fact that
some terms occur very frequently in the corpus, and so they have a
high prior probability of co-occurring with any other protein just
by their abundance.
$$c_{i,j} = \sum \left( \delta_s(i,j)\,w_s + \delta_p(i,j)\,w_p + \delta_d(i,j)\,w_d \right) \qquad (1)$$

where the sum is over all sentences, paragraphs, and documents, and $\delta_s(i,j)$ evaluates to 1 if terms $i$ and $j$ occur in the same sentence and 0 otherwise, and similarly for paragraphs ($\delta_p$) and documents ($\delta_d$). The weighted count is then converted into a score

$$s(i,j) = c_{i,j}^{\,\alpha} \left( \frac{c_{i,j}\; c_{\cdot,\cdot}}{c_{i,\cdot}\; c_{\cdot,j}} \right)^{1-\alpha} \qquad (2)$$

where $c_{i,\cdot}$, $c_{\cdot,j}$, and $c_{\cdot,\cdot}$ are the sums of $c$ over all entities $l$ of the same type. These results should
then be benchmarked, as previously discussed.
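A compact Python sketch of this scoring scheme is given below, intended only to make Eqs. (1) and (2) concrete. The weights, the exponent alpha, and the toy corpus are illustrative placeholders rather than the tagger's actual parameter values, and the marginal counts are simplified to unordered pairs.

```python
# Sketch of the co-occurrence scoring in Eqs. (1) and (2); weights and alpha
# below are illustrative placeholders, not the tagger's actual parameter values.
from collections import Counter
from itertools import combinations

# Each document is a list of paragraphs; each paragraph is a list of sentences;
# each sentence is the set of normalized entities tagged in it.
documents = [
    [[{"CDK1", "mitosis"}, {"CDK1"}], [{"mitosis", "apoptosis"}]],
    [[{"CDK1", "apoptosis"}]],
]

w_s, w_p, w_d = 2.0, 1.0, 0.5   # sentence, paragraph, document weights (assumed)
alpha = 0.6                      # assumed

def weighted_counts(documents):
    c = Counter()
    for doc in documents:
        doc_entities = set().union(*(s for p in doc for s in p))
        for p in doc:
            p_entities = set().union(*p)
            for s in p:
                for pair in combinations(sorted(s), 2):
                    c[pair] += w_s                      # same sentence
            for pair in combinations(sorted(p_entities), 2):
                c[pair] += w_p                          # same paragraph
        for pair in combinations(sorted(doc_entities), 2):
            c[pair] += w_d                              # same document
    return c

c = weighted_counts(documents)
c_total = sum(c.values())                # simplified c_.,.
c_margin = Counter()                     # simplified c_i,. marginals
for (i, j), v in c.items():
    c_margin[i] += v
    c_margin[j] += v

def score(i, j):
    cij = c[tuple(sorted((i, j)))]
    return cij**alpha * (cij * c_total / (c_margin[i] * c_margin[j]))**(1 - alpha)

print(round(score("CDK1", "mitosis"), 3))
```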
The type-pairs file can be used to request that only a subset of the possible pairs of types is scored for co-occurrences; otherwise
the full cross product of nonprotein types will be scored. Cross-
species interactions between proteins can also be specified explicitly
in the type-pairs file. Specifying relationships between the entities
via a groups file will not impact the scoring.
This chapter has so far described the steps to approach text mining,
following the specific example of the JensenLab tagger. Here, we
would like to discuss some of the common errors that are generated
by dictionary systems and possible ways that they can be mitigated.
The main source of false positives comes from the fact that
dictionary-based methods will faithfully identify all instances of a
word, including those where the word refers to a different concept.
A dictionary approach is not able to distinguish between homo-
nyms such as SDS, which is commonly used in the biomedical
literature to refer to both a human protein and a detergent. These
language used [76] can pose challenges for text mining, if the
changes in use are not considered or are not part of the training
data (if applicable).
References
1. Lu Z (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database 2011:1–13. https://doi.org/10.1093/database/baq036
2. The UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212. https://doi.org/10.1093/nar/gku989
3. Attwood T, Agit B, Ellis L (2015) Longevity of biological databases. EMBnet.journal 21. http://journal.embnet.org/index.php/embnetjournal/article/view/803
4. Pletscher-Frankild S et al (2015) DISEASES: text mining and data integration of disease-gene associations. Methods 74:83–89. https://doi.org/10.1016/j.ymeth.2014.11.020
5. Junge A et al (2017) RAIN: RNA-protein association and interaction networks. Database baw167:1–9
6. Binder JX et al (2014) COMPARTMENTS: unification and visualization of protein subcellular localization evidence. Database 2014:1–9. https://doi.org/10.1093/database/bau012
7. Santos A et al (2015) Comprehensive comparison of large-scale tissue expression datasets. PeerJ 3:e1054. https://doi.org/10.7717/peerj.1054
8. Meaney C et al (2016) Text mining describes the use of statistical and epidemiological methods in published medical research. J Clin Epidemiol 74:124–132. https://doi.org/10.1016/j.jclinepi.2015.10.020
9. IDG Knowledge Management Center (2016) Unexplored opportunities in the druggable genome. Nat Rev Drug Discov. http://www.nature.com/nrd/posters/druggablegenome/index.html
10. Swanson DR (1986) Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med 30:7–18
11. Swanson DR, Smalheiser NR (1996) Undiscovered public knowledge: a ten-year update. KDD-96 Proceedings 56(2):103–118. https://doi.org/10.2307/4307965
12. Swanson DR (1988) Migraine and magnesium: eleven neglected connections. Perspect Biol Med
13. Russo F et al (2018) miRandola 2017: a curated knowledge base of non-invasive biomarkers. Nucleic Acids Res 46:D354–D359. https://doi.org/10.1093/nar/gkx854
14. Orchard S et al (2014) The MIntAct project - IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 42:358–363. https://doi.org/10.1093/nar/gkt1115
15. Xenarios I et al (2002) DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30(1):303–305. https://doi.org/10.1093/nar/30.1.303
16. Bader GD, Betel D, Hogue CWV (2003) BIND: the biomolecular interaction network database. Nucleic Acids Res 31(1):248–250. https://doi.org/10.1093/nar/gkg056
17. Rodriguez-Esteban R (2009) Biomedical text mining and its applications. PLoS Comput Biol 5(12):1–5. https://doi.org/10.1371/journal.pcbi.1000597
18. Pafilis E et al (2009) Reflect: augmented browsing for the life scientist. Nat Biotechnol 27(6):508–510. https://doi.org/10.1038/nbt0609-508
19. Pafilis E et al (2013) The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLoS ONE 8(6):2–7. https://doi.org/10.1371/journal.pone.0065390
20. Szklarczyk D et al (2016) The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res 45(D1):D362–D368. https://doi.org/10.1093/nar/gkw937
21. Cook H, Pafilis E, Jensen L (2016) A dictionary- and rule-based system for identification of bacteria and habitats in text. In: Proceedings of the 4th BioNLP shared task workshop, pp 50–55
22. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7(2):119–129. https://doi.org/10.1038/nrg1768
23. Arighi CN et al (2014) BioCreative-IV virtual issue. Database 2014:1–6. https://doi.org/10.1093/database/bau039
24. Deléger L et al (2016) Overview of the bacteria biotope task at BioNLP shared task 2016. In: Proceedings of the 4th BioNLP shared task workshop, pp 12–22
25. Huang CC, Zhiyong L (2016) Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 17(1):132–144. https://doi.org/10.1093/bib/bbv024
26. Yepes AJ, Verspoor K (2014) Literature mining of genetic variants for curation: quantifying the importance of supplementary material. Database 2014:bau003. https://doi.org/10.1093/database/bau003
27. Roque FS et al (2011) Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol 7(8):e1002141. https://doi.org/10.1371/journal.pcbi.1002141
28. Ford E et al (2016) Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc 23(5):1007–1015. https://doi.org/10.1093/jamia/ocv180
29. Thomas CE et al (2014) Negation scope and spelling variation for text-mining of Danish electronic patient records. In: Proceedings of the 5th international workshop on health text mining and information analysis, pp 64–68
30. Kuhn M et al (2016) The SIDER database of drugs and side effects. Nucleic Acids Res 44(D1):D1075–D1079. https://doi.org/10.1093/nar/gkv1075
31. Pafilis E et al (2015) ENVIRONMENTS and EOL: identification of environment ontology terms in text and the annotation of the encyclopedia of life. Bioinformatics 31(11):1872–1874. https://doi.org/10.1093/bioinformatics/btv045
32. Yang Y et al (2017) Exploiting sequence-based features for predicting enhancer-promoter interactions. Bioinformatics 33(14):i252–i260. https://doi.org/10.1093/bioinformatics/btx257
33. Sayers E (2010) A general introduction to the E-utilities. National Center for Biotechnology Information (US), Bethesda, MD, pp 1–10
34. Westergaard D et al (2017) Text mining of 15 million full-text scientific articles. bioRxiv. https://doi.org/10.1101/162099
35. Eysenbach G (2006) Citation advantage of open access articles. PLoS Biol 4(5):692–698. https://doi.org/10.1371/journal.pbio.0040157
36. Handke C, Guibault L, Vallbé JJ (2015) Is Europe falling behind in data mining? Copyright's impact on data mining in academic research. In: New avenues for electronic publishing in the age of infinite collections and citizen science: scale, openness and trust - Proceedings of the 19th international conference on electronic publishing, Elpub 2015, pp 120–130. https://doi.org/10.3233/978-1-61499-562-3-120
37. Noonburg D. XpdfReader. http://www.xpdfreader.com/
38. Ramakrishnan C et al (2012) Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol Med 7:7. https://doi.org/10.1186/1751-0473-7-7
39. Kim D, Hong Y (2011) Figure text extraction in biomedical literature. PLoS ONE 6(1):1–11. https://doi.org/10.1371/journal.pone.0015338
40. Free Software Foundation. iconv. http://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.15/iconv.1.html
41. Moolenaar B. Vim. https://vim.sourceforge.io/
42. Przybyla P et al (2016) Text mining resources for the life sciences. Database 2016:1–30. https://doi.org/10.1093/database/baw145
43. Chen D, Manning CD (2014) A fast and accurate dependency parser using neural networks. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 740–750
Abstract
The surge of public disease and drug-related data availability has facilitated the application of computational
methodologies to transform drug discovery. In the current chapter, we outline and detail the various
resources and tools one can leverage in order to perform such analyses. We further describe in depth the
in silico workflows of two recent studies that have identified possible novel indications of existing drugs.
Lastly, we delve into the caveats and considerations of this process to enable other researchers to perform
rigorous computational drug discovery experiments of their own.
Key words Systems pharmacology, Drug discovery, Big data, Electronic medical records, Clinical
informatics, Bioinformatics, Drug repurposing, Drug repositioning, Gene expression data,
Pharmacogenomics
1 Introduction
One study evaluated the diversity of race, ethnicity, age, and sex in
participants of cancer trials [5]. They found large differences in
representation in all these realms: for instance, lower enrollment
fractions in Hispanic and African-American participants compared
to Caucasian participants ( p < 0.001 for both comparisons). They
also found an inverse relationship between age and enrollment
fraction across all racial and ethnic groups and significant differ-
ences for men and women depending on the disease (e.g., men had
higher enrollment fractions for lung cancer; p < 0.001). These
differences in representation also have serious implications in prac-
tice when, for instance, a treatment studied in one population is
given to another. Making clinical decisions based on studies with
these issues can lead to expensive, sub-optimal treatment rates or
missed opportunities at best and harmful events at worst. There are
plenty of examples of randomized controlled trials that were judged
to be beneficial but shown to be harmful (e.g., fluoride treatment
for osteoporosis) [6]. Another weakness of clinical trial design is the relaxing of inclusion criteria for the disease group in order to bolster study numbers, which is especially problematic in heterogeneous
diseases.
These issues, along with the vast resources required for these
studies and overall low success rates, necessitate complementary
approaches. The recent surge of biomedical information pertaining
to molecular characterizations of diseases and chemical compounds
has facilitated a new age of drug discovery through computational
predictions [7, 8]. By analyzing FDA-approved compounds to
discover novel indications, one can leverage the massive amount
of research and effort that have already been completed and bypass
many steps in the traditional drug development framework. While
there are many innovative strategies to reduce cost and improve
success rates during the traditional drug discovery process [9, 10],
drug repurposing is a viable and growing discipline with documen-
ted advantages: for instance, traditional drug discovery pipelines
take on average from 12 to 16 years from inception to market and
cost on average one to two billion dollars, while successful drug
repurposing can be done in less than half the time (6 years on
average) at a quarter of the cost (~$300 million) [11, 12]. Incorpor-
ating drug repurposing strategies into workflows can drastically
increase productivity for biopharmaceutical companies [13], and
there are growing numbers of successful examples that prove its
worth [14–17].
Thalidomide, for instance, was originally developed in Ger-
many and England as a sedative and was prescribed to treat morn-
ing sickness in pregnant women. It soon became apparent that this
treatment caused severe and devastating skeletal birth defects in
thousands of babies born to mothers taking it during the first
trimester of pregnancy. Years later, after the drug was banned for
this purpose, it was serendipitously found to be effective for the
2 Materials
2.1 Software and Computational Resources

Drug discovery using big data resources is accomplished computationally. For the individual researcher, a modern computer is really the only hardware requirement. For software, essentially any pro-
gramming language whether open-source (e.g., R [21], Python
[22]) or closed and commercially licensed (e.g., SAS [SAS Institute
Inc., Cary, NC]) can perform data organization, statistical analyses,
and figure generation with the inclusion of freely available packages
(e.g., SciPy [23] for Python) when needed. There exist numerous resources (e.g., edX; https://www.edx.org) that provide an introduction to using these programming languages for data science. For infrastructure, cloud computing (e.g., Amazon Web Services Cloud) enables managing large datasets and performing computationally intensive tasks without the need to build in-house clusters.

Fig. 1 Full generic workflow for enabling drug discovery through big data resources. We start by illustrating the mechanism behind disease and drug gene expression signatures and highlighting a few public repositories where they can be found. Based on research focus, it is often recommended to harmonize these disparate sources of data through use of ontologies. Multiple signatures per disease or drug can be integrated through rigorous meta-analysis procedures. Chemogenomic association testing assesses similarity between drug and disease signatures and can be performed using procedures like the Kolmogorov-Smirnov (KS) test. These drug signatures can then be ranked according to their correlation, or anticorrelation, to the disease signature of interest. Drug signatures that are highly anticorrelated to the disease signature are potential treatment candidates. Drug candidates that are selected for follow-up need further validation, in the form of in silico (e.g., other external datasets or electronic medical records), in vitro, and/or in vivo experiments
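To make the ranking step illustrated in Fig. 1 concrete, the sketch below scores a set of drug gene expression signatures against a disease signature by Spearman correlation over shared genes and ranks the most anticorrelated drugs first. The gene symbols and fold-change values are invented for illustration, and published workflows typically use KS-based enrichment statistics and rigorous meta-analysis rather than this simplified scoring.

```python
# Illustrative ranking of drug signatures by anticorrelation to a disease signature.
# Gene symbols and fold changes are made up; real analyses use curated signatures.
from scipy.stats import spearmanr

disease = {"EGFR": 2.1, "MYC": 1.8, "TP53": -1.2, "CDK1": 1.5, "BCL2": 0.9}

drugs = {
    "drug_A": {"EGFR": -1.9, "MYC": -1.1, "TP53": 0.8, "CDK1": -1.4, "BCL2": -0.5},
    "drug_B": {"EGFR": 1.2, "MYC": 0.9, "TP53": -0.7, "CDK1": 1.1, "BCL2": 0.4},
}

def anticorrelation_score(disease_sig, drug_sig):
    shared = sorted(set(disease_sig) & set(drug_sig))
    rho, _ = spearmanr([disease_sig[g] for g in shared],
                       [drug_sig[g] for g in shared])
    return rho   # strongly negative rho suggests a reversal of the disease signature

ranking = sorted(drugs, key=lambda d: anticorrelation_score(disease, drugs[d]))
for drug in ranking:
    print(drug, round(anticorrelation_score(disease, drugs[drug]), 2))
```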
2.2 Ontologies and Reference Databases

The landscape of big data in biomedicine is expansive yet unsystematic, encompassing a tremendous number of data points across a multitude of modalities. As such, there are clear hurdles in precise
characterization and proper integration procedures. To address
these challenges, researchers have created tools and ontologies to
map and normalize these disparate data types in order to facilitate
data harmonization and reproducible methodologies. This is espe-
cially important for the purposes of leveraging big data for drug
discovery, as many different data types have to be seamlessly and
reproducibly integrated along the entire drug discovery pipeline. A
more comprehensive list of these various resources can be found in
other related reviews [24].
The exponential growth of data entities with heterogeneous
data types from multiple resources calls for developing ontologies
to define entities and centralized reference databases. Meta-
thesauruses, like the Unified Medical Language System (UMLS)
[25], have been developed to organize, classify, standardize, and
distribute key terminology in biomedical information systems and
are invaluable for computational biology research. Essentially, each
medical term (e.g., disease) has a Concept Unique Identifier (CUI)
code that then can be referenced from other related ontologies.
The Systematized Nomenclature of Medicine-Clinical Terms
(SNOMED-CT; http://www.ihtsdo.org/snomed-ct/), for
instance, is one of the largest healthcare-related ontologies and
has over 300,000 medical concepts ranging from body structure
to clinical findings. There are specific ontologies that aim to char-
acterize the continually evolving representations of the phenotypic
space. Many clinical datasets encode diseases using International
2.3 Data Resources that Can Be Leveraged for Drug Discovery

A growing priority of increasing data accessibility through open-access efforts has considerably boosted computational-based drug discovery capabilities. In fact, Greene et al. created the Research
Parasite Award, which honors researchers who perform rigorous
secondary analyses on existing, open-access data to make novel
insights independent from the original investigators [32]. These
data exist in many forms, including clinical (i.e., disease comorbid-
ities), genomic (e.g., genetic, transcriptomic), and proteomic.
Drug discovery is multifaceted at its core, at the very least
involving some combination of chemical data in addition to those
mentioned above. This is even more relevant for computational-
based drug discovery, where complex intersections of many dispa-
rate data types are required. Fortunately, there are many public
repositories that contain a plethora of data around these spheres,
which can be connected utilizing the aforementioned ontologies
and reference databases. We outline the various data sources in the
current section and present two examples of successful drug discov-
ery utilizations of these datasets in detail in Subheading 3.
2.3.1 Hospitals and Academic Health Centers

Due to regulatory requirements, almost all American medical centers and health systems store patient data collected during outpatient and hospital visits using software platforms such as those
provided by EPIC (Epic Systems Corporation, Madison, WI) and
Cerner (Cerner Corporation, Kansas City, MO). These electronic
medical records (EMRs), or alternately electronic health records
(EHRs), are composed of a number of data types such as disease
diagnoses, medication prescriptions, lab test results, surgical pro-
cedures, and physician notes. Generally, data warehouse adminis-
tration takes responsibility for housing and creating an anonymized
or de-identified version of the EMR. Affiliated faculty and research-
ers then apply for access to this resource through their Institutional
Review Board (IRB). While the primary role of EMR is for institu-
tional or administrative purposes, one important benefit of the
digitization of health records is that they can be more easily adapted
for powerful healthcare research purposes. For drug discovery,
these data can be used for genotype-phenotype relationship discov-
ery that could drive target selection or to analyze medication
efficacy and side effects in a real-world context [33]. The implica-
tions and impact of precise electronic phenotyping procedures for
these types of analyses will be discussed in Subheading 4.
There are many hospital systems and affiliated medical centers
that have successfully used their in-house EMR systems for impor-
tant scientific and clinical discoveries [34–38]. The Mount Sinai
Hospital and the Icahn School of Medicine at Mount Sinai organize
and protect their EMR data (beginning in 2003) within the Mount
Sinai Data Warehouse, which comprises over 7.5 million patients and 2 billion data points, including disease diagnoses
and lab test results. The University of California, San Francisco
(UCSF) is leading a massive effort to coordinate EMR data from
five UC medical centers. The University of California Research
eXchange (UC ReX; https://myresearch.ucsf.edu/uc-rex) pro-
vides a framework for UC-affiliated researchers to query
de-identified clinical and demographic data, providing a natural
cross-validation opportunity to compare findings across these sites.
Data analyses using EMRs can be enhanced with the inclusion
of genetic data from biobanks or repositories of biological samples
(e.g., blood) of recruited participants generally from a hospital
setting. Many institutions have frameworks set up that facilitate
this type of research, such as The Charles Bronfman Institute of
Personalized Medicine BioMe biobank within the Mount Sinai
Hospital system, BioVU from the Vanderbilt University Medical
Center, and DiscovEHR, which is a collaboration between Regen-
eron Genetics Center and the Geisinger Health System. Coupling
clinical and genetic data has led to important findings in the area of
drug and target discovery. Using the DiscovEHR resource, for
instance, researchers recently characterized the distribution and
clinical impact of rare, functional variants (i.e., deleterious) in
whole-exome sequences for over 50,000 individuals [39]. The
associations they identified add insight to the current understand-
ing of therapeutic targets and clinically actionable genes.
A limitation of these biobanks and EMR systems is that they are
often restricted to researchers affiliated with the associated institu-
tion. There are a number of initiatives and resources that allow researchers to apply for access to these types of data. For instance,
the Centers for Medicare & Medicaid Services offer relevant
healthcare-related data dumps (https://data.medicare.gov/) for
sites that accept Medicare including hospitals, nursing home, and
hospices with information on various factors and outcomes (e.g.,
2.3.2 Genomic Experimental Data Repositories

Each and every research study is crucial for furthering scientific knowledge, but sharing of the experimental data collected can additionally benefit other researchers in the field directly. The pooling of several sources of data can allow for meta-analyses and other types of research not possible in isolation, thereby facilitating fur-
ther discoveries. There have been outstanding efforts to provide a
framework for researchers to deposit and share data, with many
high-impact journals even requiring it for publication. There are
other reviews that go into detail about these resources, but we will
describe a few here.
The Gene Expression Omnibus (GEO) is a public repository
from NCBI that allows researchers to upload high-quality func-
tional genomics data (e.g., microarray) that meet their guidelines
[42]. GEO organizes, stores, and freely distributes these data to the
research community. As of 2017, GEO contained over two
million samples. ArrayExpress (http://www.ebi.ac.uk/
arrayexpress/) [43] is an archive of functional genomics data
(some overlap with GEO) comprising over 70,000 experiments,
2.2 million assays, and 45 terabytes of data as of July 2017. The
Immunology Database and Analysis Portal (ImmPort; www.
immport.org) is a related resource focused on data from immunol-
ogy studies of various types, focuses, and species that provides a
framework to share and use genomic data of clinical samples.
The Cancer Genome Atlas (TCGA; https://cancergenome.
nih.gov/) from the NIH is a comprehensive resource that collects
data relating to the genomic aspects of over 33 types of cancer.
TCGA harmonizes clinical, sequencing, transcriptomic, and other
data types from various studies in a user-friendly Genomic Data
Commons Data Portal (https://portal.gdc.cancer.gov/). As of
Data Release 6.0 (May 9, 2017), TCGA encompasses around
250,000 files from almost 15,000 patients with cancer in 29 pri-
mary sites (e.g., kidney). As the volume of the data from TCGA is
both immense and complex, Firehose GDAC (https://gdac.bro
adinstitute.org/) was developed to systematize analyses and pipe-
lines using TCGA in order to facilitate smooth research efforts. The
Cancer Cell Line Encyclopedia (CCLE; http://portals.bro
adinstitute.org/ccle/) is a public compendium of over 1000 cell
lines that aims to characterize the genetic and pharmacologic
aspects of human cancer models. The Genotype-Tissue Expression
(GTEx; https://gtexportal.org/home/) portal is a collection of
genetic and gene expression data from the Broad Institute, with
data for 544 healthy individuals in 53 tissues as of the V6p release.
Using this resource, researchers have been able to delineate cis and
trans expression quantitative trait loci (eQTL), a unique capability
from having access to both genotype and gene expression data.
3 Methods
Fig. 2 Step-by-step process for performing in silico drug discovery starting from a disease of interest.
Researchers can go to the GEO online portal and search for their disease of interest, which will result in
experiments and microarray datasets that other scientists have generated and uploaded. Having identified all
datasets of interest, the data can be downloaded individually or via a tool, such as GEOquery [80] for R. Case
and control data will have to be identified either manually or through regular expressions from the name
strings. Next, disease signatures can be derived through various tools that perform differential expression
analysis, such as edgeR [81] for RNA-Seq and RankProd and SAM for microarray data [82]. Drug libraries,
such as CMap and LINCS, provide RNA expression data for a large number of compounds across different cell
lines. Next, a correlation analysis can be performed comparing the disease signature and all drug signatures
producing a reversal score for each drug. Significant scores can be ranked from the top most correlated
signatures to the most anticorrelated signatures. These hits can then be used to select a candidate drug based
on study goals, such as prioritizing candidates that would reverse disease signature (i.e., top anticorrelated
hits)
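The correlation step at the heart of this workflow can be sketched in a few lines of Python. The snippet below is a minimal, illustrative sketch and not the exact scoring used by CMap/LINCS or by the studies discussed in this chapter; the disease_signature and drug_signatures inputs are hypothetical dictionaries mapping gene symbols to differential-expression scores.

```python
# Minimal sketch of a signature-reversal ranking, assuming hypothetical inputs:
# disease_signature and drug_signatures map gene symbols to scores
# (e.g., log2 fold changes). Not the exact CMap/LINCS scoring method.
from scipy.stats import spearmanr

def reversal_scores(disease_signature, drug_signatures):
    """Rank drugs by Spearman correlation with the disease signature.

    The most negative (anticorrelated) scores are candidate reversers.
    """
    scores = {}
    for drug, drug_signature in drug_signatures.items():
        # Restrict to genes measured in both the disease and drug profiles
        shared = sorted(set(disease_signature) & set(drug_signature))
        if len(shared) < 3:  # skip drugs with too little gene overlap
            continue
        rho, pval = spearmanr([disease_signature[g] for g in shared],
                              [drug_signature[g] for g in shared])
        scores[drug] = (rho, pval)
    # Most anticorrelated (strongest candidate reversers) first
    return sorted(scores.items(), key=lambda kv: kv[1][0])
```

In practice, the ranked output would then be filtered for statistical significance and inspected for overlap across drug libraries, as described below.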
3.1 Disease Signatures

A disease signature is the unique molecular state (i.e., RNA expression profile) of a phenotype (e.g., type 2 diabetes) that is altered from "wild-type" or healthy. Typically, these signatures can be
characterized by multimodal biological data, including gene
expression in tissues and protein composition in the microbiome.
In the Li and Greene et al. study, they obtained kidney transplant
microarray datasets from GEO in addition to an in-house dataset.
They first restricted the possible datasets to those in humans with
biopsy or peripheral blood samples, leaving five that were eligible.
As is typical in these types of analyses, the phenotypic descriptors
from the datasets did not directly coincide. As such, for each study,
they selected data from specific biopsies that met their criteria for
suitable comparison (i.e., "moderate" and "severe" IF/TA from one study, IF/TA "II" and "III" from another, etc.), including both
case and control conditions. To rectify any possible conflicting
probe annotations due to platform or versioning, they
re-annotated all probes to the latest gene identifiers using AILUN
[50]. In addition to ensuring consistent annotations, it is also impor-
tant to account for potential discrepancies in expression measure-
ments across studies and platforms. Accordingly, they standardized
each dataset using quantile-quantile normalization to allow for more
precise integration. The integrated dataset comprised 275 samples in
two tissues from three different microarray platforms.
With the datasets normalized and integrated, a meta-analysis
can be performed to create a consensus disease signature across
multiple studies. There are many differing methodologies to per-
form meta-analysis on microarray data that we discuss in Subhead-
ing 4. Li and Greene et al. performed two meta-analysis techniques
and included genes that had robust expression profiles significant in
both, thus maximizing the methodological strengths of each. The
first method involved evaluating the differential expression effect size of each gene in each study. Specifically, these effect sizes were combined using a
fixed-effect inverse-variance model by which the effect size from
each study is weighted by the inverse of the intra-study variance
resulting in a meta-effect size. From this approach, they identified 996 FDR-significant genes that were measured in at least four studies and concordantly expressed in the same direction (FDR < 0.05).
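The fixed-effect inverse-variance combination can be written down compactly. The following is a minimal sketch, assuming hypothetical per-study effect sizes and variances for a single gene; it is not the exact implementation used by Li and Greene et al.

```python
# Minimal sketch of a fixed-effect inverse-variance meta-analysis for one gene,
# assuming hypothetical per-study effect sizes and their variances.
import numpy as np
from scipy.stats import norm

def fixed_effect_meta(effects, variances):
    """Combine per-study effect sizes, weighting each by 1/variance."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    meta_effect = np.sum(weights * effects) / np.sum(weights)
    meta_se = np.sqrt(1.0 / np.sum(weights))
    z = meta_effect / meta_se
    p = 2 * norm.sf(abs(z))  # two-sided p-value
    return meta_effect, meta_se, p

# Example: one gene measured in four studies
effect, se, p = fixed_effect_meta([0.8, 1.1, 0.6, 0.9], [0.10, 0.15, 0.08, 0.12])
```

Per-gene p-values obtained this way would then be corrected for multiple testing (e.g., Benjamini-Hochberg FDR) across all genes.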
For the second method, they then utilized results from the
significance analysis of microarrays (SAM) [51] of each study, clas-
sifying significant differentially expressed genes between IF/TA
and non-IF/TA groups using a q < 0.1 threshold. For the meta-analysis, they performed a Fisher's exact test comparing the number
of studies in which each gene was significantly differentially
3.2 Drug Signatures

Drug signatures are similar in nature to disease signatures in that
they represent the global perturbation of gene expression com-
pared to vehicle control. In the case of drug signatures, the pertur-
bation is treatment exposure instead of disease state. As mentioned
in Subheading 2, there are many databases that contain gene
expression data for a variety of drugs across many different cell
lines, tissues and organisms. In the Li and Greene et al. study,
they collected all drug-induced transcriptional profiles for all
drugs in CMap (n = 1309) across all experiments (n = 6100).
They then created a consensus, representative profile for each drug,
merging data from the associated studies using the prototype
ranked list (PRL) method [53]. The PRL method works through
a hierarchical majority-voting scheme over all ranked gene expression lists for a single compound, in which consistently up- or downregulated genes are weighted toward the top of their respective extremes.
The various merging methodologies and their considerations will
be discussed in the Notes.
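To illustrate the general idea of merging replicate profiles into one consensus signature, the sketch below uses a simple average-rank consensus. This is a deliberate simplification for illustration only; the published PRL method [53] merges ranked lists hierarchically with a Borda-style majority vote rather than by averaging.

```python
# Simplified consensus signature by average rank across replicate profiles of
# one drug. Illustration only; the published PRL method [53] uses a
# hierarchical, Borda-style majority-voting merge instead of a plain average.
import numpy as np

def consensus_ranking(ranked_lists):
    """ranked_lists: gene lists, each ordered from most up- to most
    downregulated, for replicate instances of the same drug."""
    genes = set().union(*ranked_lists)
    avg_rank = {}
    for gene in genes:
        ranks = [lst.index(gene) for lst in ranked_lists if gene in lst]
        avg_rank[gene] = np.mean(ranks)
    # Genes consistently near the top (upregulated) come first
    return sorted(genes, key=lambda g: avg_rank[g])

# consensus = consensus_ranking([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]])
```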
Chen and Wei et al. utilized both CMap and LINCS data to
generate drug signatures. For each drug of all CMap, they per-
formed an initial quality control step, keeping only instances (i.e., cell line experiment data) whose profile correlated (p < 0.05) with at least one other profile of the same drug, leaving 1329 high-quality instances. They gathered a high-quality list of data from LINCS by only including landmark genes (n = 978), restrict-
ing instances to HepG2 and Huh7 cell lines, and removing poor
quality perturbations, resulting in 2816 profiles. As the authors
were interested in repurposing an already approved compound,
they intersected drugs from both data sources with DrugBank, leav-
ing 380 instances of 249 drugs. Out of these 249 drugs, 83 were
common between the two sources.
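The instance-level quality control step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: `instances` is a hypothetical dictionary mapping each drug to a list of expression vectors measured over one shared gene ordering.

```python
# Minimal sketch of the instance-level quality control described above:
# keep an instance only if it positively correlates (p < 0.05) with at least
# one other instance of the same drug. `instances` is a hypothetical dict
# mapping each drug name to a list of expression vectors (shared gene order).
from scipy.stats import spearmanr

def filter_instances(instances, alpha=0.05):
    kept = {}
    for drug, profiles in instances.items():
        good = []
        for i, profile in enumerate(profiles):
            for j, other in enumerate(profiles):
                if i == j:
                    continue
                rho, p = spearmanr(profile, other)
                if p < alpha and rho > 0:
                    good.append(profile)
                    break
        if good:
            kept[drug] = good
    return kept
```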
there were 302 drugs from CMap and 39 drugs from LINCS that
were anticorrelated at this threshold, with 16 overlapping. Out of
this high-confidence intersected set, the top hit from ranking across
both libraries was niclosamide. This drug had previously established
antitumor properties in other cancers, but had not been assessed in
HCC animal models. Therefore, it was selected as a candidate drug
to undergo subsequent validation experiments.
Like the previous study, Chen and Wei et al. developed an
innovative, additional in silico approach to evaluate the global
performance of the predictions from their pipeline. They hypothe-
sized that their top predictions (i.e., anticorrelated drugs) should
be enriched for gold standard, or established, treatments of HCC.
Accordingly, they extracted data from clinicaltrials.gov querying
the terms “hepatocellular carcinoma” and “liver cancer” and filter-
ing from the results trials that studied tumors and cancer in general.
From this list of 960 trials, they extracted from the "interventions" column 76 drugs that appeared in more than one trial as
their list of gold standard HCC drugs. When mapped to the che-
mogenomic databases, there were 7 found in CMap and 16 in
LINCS. Using ssGSEA from the gene set enrichment package GSVA [56] with permutation testing (n = 10,000), they found that these gold standard drugs were more likely to reverse the HCC gene signature in both CMap (p = 0.012) and LINCS (p = 0.018),
increasing the confidence of testing other hits in the lab.
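As a simplified stand-in for that enrichment test, one can permutation-test whether the gold standard drugs sit closer to the top of the reversal ranking than expected by chance. The sketch below is illustrative only; the study itself used ssGSEA from the GSVA R package [56].

```python
# Simplified permutation test of whether gold standard drugs are enriched
# among the top (most anticorrelated) predictions. Illustration only; the
# study used ssGSEA from the GSVA R package [56].
import numpy as np

def enrichment_pvalue(ranked_drugs, gold_standard, n_perm=10000, seed=0):
    """ranked_drugs: drug names ordered from most to least reversing.
    gold_standard: set of established treatments present in ranked_drugs."""
    rng = np.random.default_rng(seed)
    positions = {d: i for i, d in enumerate(ranked_drugs)}
    observed = np.mean([positions[d] for d in gold_standard])
    k = len(gold_standard)
    null = np.array([rng.choice(len(ranked_drugs), size=k, replace=False).mean()
                     for _ in range(n_perm)])
    # A lower mean rank than chance indicates enrichment near the top
    return (np.sum(null <= observed) + 1) / (n_perm + 1)
```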
3.4 Experimental Validation of Predictions

While the prediction methodology outlined above is powerful in its own right, it is often necessary to validate these findings in an experimental setting. Convincing validation of these predictions
can be achieved in a multitude of ways in both in vitro and in vivo
models. The considerations and criteria for this decision are dis-
cussed in Subheading 4. To investigate the utility of kaempferol and
esculetin for fibrosis, Li and Greene et al. utilized the human kidney 2 (HK-2) cell line in vitro to determine perturbations in cellular
pathways following drug exposure, specifically targeting biological
aspects from their in silico-based hypotheses. As such, they showed
kaempferol significantly reduced TGF-β1-mediated expression of SNAI1 (p = 0.014 for 15 μM exposure) and reversed CDH1 downregulation (p = 0.045). They also found that esculetin treatment inhibits Wnt/β-catenin signaling in renal tubular cells: esculetin-treated cells showed a significant decrease in CCND1 protein levels after Wnt agonist stimulation (p = 0.0054 for 60 μM exposure).
In addition to the encouraging in vitro results, they performed
an additional experiment to assess the effects of kaempferol and
esculetin on renal interstitial fibrosis in vivo. Specifically, they used a
unilateral ureteric obstruction (UUO) mouse model to study the gene expression, histological, and immunohistochemical (IHC) effects of
treatment on renal fibrogenesis. Mice (Balb/c mice from Jackson
4 Notes
4.1 Computational Considerations

As software is continually updated and new versions of tools and packages are released, reproducibility has become a major issue in research, often due to incompatibility of computing environments
[57]. This often leads to inconsistent results, even when using the
same computational protocols. Beaulieu-Jones and Greene have
recently proposed a system that, if adopted, could avoid these
potential incompatibility issues [58]. They outline a pipeline that
combines Docker, a container technology that isolates the computing environment from the native operating system, with software that continu-
ally reruns the pipeline whenever new updates are released for the
data or underlying packages. We generally recommend using open-
source programming frameworks and distributable notebooks,
such as Jupyter Notebook (http://jupyter.org/) and RMarkDown
(http://rmarkdown.rstudio.com/), and following the emerging
best practices for reproducible research whenever possible in
order to most easily facilitate data sharing and research reproduc-
tion. With this said, the most important step toward enabling
reproducible research is a willingness to share the data and code
in a public repository. GitHub (http://github.com/) and Bit-
bucket (http://bitbucket.org/) are two commonly used code repositories, and
it is often encouraged to deposit large datasets in repositories like
Synapse (http://www.synapse.org/).
Another issue plaguing biomedical research is the ever-
changing nature of genome builds and identifiers, which can affect
delineation of reference alleles, genomic coordinates, and probe
annotations, among many others. As an early solution to this problem, Chen et al. built AILUN, a fully automated online tool that re-annotates microarray datasets to the latest anno-
tations, allowing for compatible analyses across differing
versions [50].
4.3 Drug Profile Variation

Many related studies, including the highlighted one by Li and Greene et al., performed a meta-analysis to integrate multiple
drug profiles into a consensus signature. Not surprisingly, there
are caveats of this process pertaining to varying biological contexts
that can affect its validity and utility [64]. Researchers integrated
gene expression data for 11,000 drugs from LINCS with their
4.4 Selection of Validation Model

With a candidate drug selected for a disease of interest, it is imperative to perform a series of coordinated experiments (e.g., PK/PD/toxicity studies), in addition to legal considerations (e.g., IP pro-
tection) to gauge its eligibility. Further, these decisions may be
particularly critical due to time and financial limitations, where
setbacks, mistakes, or poor choices could halt future progress.
Drug candidates often fail to translate successfully into the clinic for two main reasons: inadequate safety and insufficient efficacy, both possibly a direct result of poor early target validation and
unreliable preclinical models [66].
In terms of preclinical testing, the large number of avenues and
models can be overwhelming, especially in light of variable reliabil-
ity. While access to massive quantities of biological big
data has been transformative in the field of drug discovery, not all
experiments are of equal utility. In studying HCC, for example,
researchers recently found that half of public HCC cell lines do not
resemble actual HCC tumors in terms of gene expression patterns
[67]. Interestingly, another group found that rarely used ovarian
cancer cell lines actually have higher genetic similarity to ovarian
tumors than more commonly used ones [68]. Due to phenomena
like these, we recommend performing rigorous examinations like
those above and leveraging knowledge from existing evaluations of
4.5 Drug Discovery Using EMR Data

EMR systems contain a great deal of disease- and phenotype-related information that can be leveraged for drug discovery as "real-world data." The multifaceted data that are collected in
EMR facilitate flexibility in devising research questions that may
be beyond the scope or focus of the original experiment. As such,
unanticipated connections may be formed to biological aspects that
would not be collected in traditional prospective study designs.
Additionally, the large sample sizes and natural collection of longi-
tudinal, follow-up information relating to patient outcomes from
treatment are invaluable advantages over traditional randomized
clinical trials [73].
As mentioned, EMR frameworks are primarily designed for
infrastructural support and to facilitate billing. Like other research
datasets, the raw data is often messy, incomplete, and subject to
biases. Therefore, certain aspects may not be as reliable as unbiased
collection measures. Even with more refined structuring, ICD code-based disease classification is often insufficient to accurately capture disease status [74]. In one notable example, clinicians
manually reviewed the charts of 325 patients that had been
recorded with the ICD-9 code for chronic kidney disease stage
3 (585.3). They found that 47% of these patients did not have
any clinical indicators for the disease. Including other types of
data, like medications, in phenotyping criteria often leads to better
accuracy. For instance, researchers found that electronic phenotyp-
ing algorithms that require at least two ICD-9 diagnoses, prescrip-
tion of antirheumatic medication, and participation of a
rheumatologist resulted in the highest positive predictive value to
identify rheumatoid arthritis [75]. In fact, a recent study analyzed
ten common diseases (e.g., Parkinson’s disease) in an EMR system
and compared the accuracy of classification through ICD codes,
medications (i.e., those primarily prescribed for disease treatment),
or clinician notes (i.e., if the disease was mentioned in the visit
report) [76]. The “true” disease classification was determined by a
physician's manual chart review. It is clear that certain diseases are better classified by certain modalities, such as clinician notes for atrial fibrillation or ICD codes for Parkinson's disease, but combinations of all three generally lead to the highest predictive power and the lowest error rates. Fortunately, notable
efforts by groups such as the electronic medical records and geno-
mics (eMERGE) consortium (https://emerge.mc.vanderbilt.edu/)
have led to rigorous, standardized electronic phenotyping algo-
rithms that identify case and control cohorts utilizing multiple
dimensions of EMR data to establish inclusion and exclusion criteria.
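A multi-modality phenotyping rule of the kind described above can be expressed in a few lines of pandas. The sketch below follows the spirit of the rheumatoid arthritis example (at least two ICD-9 diagnoses, an antirheumatic prescription, and rheumatologist involvement); the table and column names (patient_id, icd9_code, drug_class, provider_specialty) are hypothetical, not those of any specific EMR system or eMERGE algorithm.

```python
# Minimal sketch of a multi-modality electronic phenotyping rule in the spirit
# of the rheumatoid arthritis example above. Table/column names are
# hypothetical; ICD-9 714.x is assumed here to denote rheumatoid arthritis.
import pandas as pd

def rheumatoid_arthritis_cases(diagnoses, prescriptions, encounters):
    """Return patient IDs meeting all three criteria: >= 2 RA ICD-9 codes,
    >= 1 antirheumatic prescription, and >= 1 rheumatology encounter."""
    ra_dx = diagnoses[diagnoses["icd9_code"].str.startswith("714")]
    code_counts = ra_dx.groupby("patient_id").size()
    two_codes = set(code_counts[code_counts >= 2].index)

    dmards = prescriptions[prescriptions["drug_class"] == "antirheumatic"]
    on_dmard = set(dmards["patient_id"])

    rheum = encounters[encounters["provider_specialty"] == "rheumatology"]
    saw_rheum = set(rheum["patient_id"])

    return two_codes & on_dmard & saw_rheum
```

Control cohorts would typically be defined with analogous exclusion criteria, as the eMERGE algorithms do.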
Acknowledgments
References
1. Eder J, Sedrani R, Wiesmann C (2014) The 15. Pessetto ZY, Chen B, Alturkmani H, Hyter S,
discovery of first-in-class drugs: origins and Flynn CA, Baltezor M, Ma Y, Rosenthal HG,
evolution. Nat Rev Drug Discov 13 Neville KA, Weir SJ et al (2017) In silico and
(8):577–587 in vitro drug screening identifies new therapeu-
2. Mullard A (2016) Parsing clinical success rates. tic approaches for Ewing sarcoma. Oncotarget
Nat Rev Drug Discov 15(7):447 8(3):4079–4095
3. Every-Palmer S, Howick J (2014) How 16. Dudley JT, Sirota M, Shenoy M, Pai RK,
evidence-based medicine is failing due to Roedder S, Chiang AP, Morgan AA, Sarwal
biased trials and selective publication. J Eval MM, Pasricha PJ, Butte AJ (2011) Computa-
Clin Pract 20(6):908–914 tional repositioning of the anticonvulsant
4. Rothwell PM (2006) Factors that can affect the topiramate for inflammatory bowel disease.
external validity of randomised controlled Sci Transl Med 3(96):96ra76
trials. PLoS Clin Trials 1(1):e9 17. Sirota M, Dudley JT, Kim J, Chiang AP, Mor-
5. Murthy VH, Krumholz HM, Gross CP (2004) gan AA, Sweet-Cordero A, Sage J, Butte AJ
Participation in cancer clinical trials: race-, sex-, (2011) Discovery and preclinical validation of
and age-based disparities. JAMA 291 drug indications using compendia of public
(22):2720–2726 gene expression data. Sci Transl Med 3
(96):96ra77
6. Rothwell PM (2005) External validity of ran-
domised controlled trials: “to whom do the 18. Stephens T, Brynner R (2009) Dark remedy:
results of this trial apply?”. Lancet 365 the impact of thalidomide and its revival as a
(9453):82–93 vital medicine. Basic Books
7. Hodos RA, Kidd BA, Shameer K, Readhead 19. Attal M, Harousseau JL, Leyvraz S, Doyen C,
BP, Dudley JT (2016) In silico methods for Hulin C, Benboubker L, Yakoub Agha I, Bour-
drug repurposing and pharmacology. Wiley his JH, Garderet L, Pegourie B et al (2006)
Interdiscip Rev Syst Biol Med 8(3):186–210 Maintenance therapy with thalidomide
improves survival in patients with multiple
8. Paik H, Chen B, Sirota M, Hadley D, Butte AJ myeloma. Blood 108(10):3289–3294
(2016) Integrating clinical phenotype and gene
expression data to prioritize novel drug uses. 20. From nightmare drug to celgene blockbuster,
CPT Pharmacometrics Syst Pharmacol 5 thalidomide is back bloomberg. https://www.
(11):599–607 bloomberg.com/news/articles/2016-08-22/
from-nightmare-drug-to-celgene-blockbuster-
9. Paul SM, Mytelka DS, Dunwiddie CT, Per- thalidomide-is-back
singer CC, Munos BH, Lindborg SR, Schacht
AL (2010) How to improve R&D productiv- 21. R Core Team (2014) R: A language and envi-
ity: the pharmaceutical industry’s grand chal- ronment for statistical computing. R Founda-
lenge. Nat Rev Drug Discov 9(3):203–214 tion for Statistical Computing, Vienna, Austria
In. 2014
10. Caskey CT (2007) The drug development cri-
sis: efficiency and safety. Annu Rev Med 22. Van Rossum G, Drake FL: Python language
58:1–16 reference manual: network theory; 2003
11. Nosengo N (2016) Can you teach old drugs 23. Jones E, Oliphant T, Peterson P (2014) SciPy:
new tricks? Nature 534(7607):314–316 open source scientific tools for Python
12. Scannell JW, Blanckley A, Boldon H, Warring- 24. Chen B, Wang H, Ding Y, Wild D (2014)
ton B (2012) Diagnosing the decline in phar- Semantic breakthrough in drug discovery. Syn-
maceutical R&D efficiency. Nat Rev Drug thesis Lectures on the Semantic Web: Theory
Discov 11(3):191–200 and Technology 4(2):1–142
13. Ashburn TT, Thor KB (2004) Drug reposi- 25. Bodenreider O (2004) The unified medical
tioning: identifying and developing new uses language system (UMLS): integrating biomed-
for existing drugs. Nat Rev Drug Discov 3 ical terminology. Nucleic Acids Res 32(Data-
(8):673–683 base issue):D267–D270
14. Jahchan NS, Dudley JT, Mazur PK, Flores N, 26. Liu S, Ma W, Moore R, Ganesan V, Nelson S
Yang D, Palmerton A, Zmoos AF, Vaka D, (2005) RxNorm: prescription for electronic
Tran KQ, Zhou M et al (2013) A drug reposi- drug information exchange. IT professional 7
tioning approach identifies tricyclic antidepres- (5):17–23
sants as inhibitors of small cell lung cancer and 27. Kuhn M, Letunic I, Jensen LJ, Bork P (2016)
other neuroendocrine tumors. Cancer Discov The SIDER database of drugs and side effects.
3(12):1364–1377 Nucleic Acids Res 44(D1):D1075–D1079
28. Tatonetti NP, Ye PP, Daneshjou R, Altman RB linked to an electronic health record. Pharma-
(2012) Data-driven prediction of drug effects cogenomics 13(4):407–418
and interactions. Sci Transl Med 4 39. Dewey FE, Murray MF, Overton JD,
(125):125ra131 Habegger L, Leader JB, Fetterolf SN,
29. Wishart DS, Knox C, Guo AC, Shrivastava S, O’Dushlaine C, Van Hout CV, Staples J,
Hassanali M, Stothard P, Chang Z, Woolsey J Gonzaga-Jauregui C et al (2016) Distribution
(2006) DrugBank: a comprehensive resource and clinical impact of functional variants in
for in silico drug discovery and exploration. 50,726 whole-exome sequences from the Dis-
Nucleic Acids Res 34(Database issue): covEHR study. Science 354(6319)
D668–D672 40. Yuille M, Dixon K, Platt A, Pullum S, Lewis D,
30. Shameer K, Glicksberg BS, Hodos R, Johnson Hall A, Ollier W (2010) The UK DNA banking
KW, Badgeley MA, Readhead B, Tomlinson network: a "fair access" biobank. Cell Tissue
MS, O’Connor T, Miotto R, Kidd BA et al Bank 11(3):241–251
(2017) Systematic analyses of drugs and disease 41. Wain LV, Shrine N, Artigas MS, Erzurumluo-
indications in RepurposeDB reveal pharmaco- glu AM, Noyvert B, Bossini-Castillo L,
logical, biological and epidemiological factors Obeidat M, Henry AP, Portelli MA, Hall RJ
influencing drug repositioning. Brief et al (2017) Genome-wide association analyses
Bioinform for lung function and chronic obstructive pul-
31. Geifman N, Bollyky J, Bhattacharya S, Butte AJ monary disease identify new loci and potential
(2015) Opening clinical trial data: are the vol- druggable targets. Nat Genet 49(3):416–425
untary data-sharing portals enough? BMC 42. Edgar R, Domrachev M, Lash AE (2002) Gene
Med 13:280 expression omnibus: NCBI gene expression
32. Greene CS, Garmire LX, Gilbert JA, Ritchie and hybridization array data repository.
MD, Hunter LE (2017) Celebrating parasites. Nucleic Acids Res 30(1):207–210
Nat Genet 49(4):483–484 43. Kolesnikov N, Hastings E, Keays M,
33. Yao L, Zhang Y, Li Y, Sanseau P, Agarwal P Melnichuk O, Tang YA, Williams E, Dylag M,
(2011) Electronic health records: implications Kurbatova N, Brandizi M, Burdett T et al
for drug discovery. Drug Discov Today 16 (2015) ArrayExpress update--simplifying data
(13–14):594–599 submissions. Nucleic Acids Res 43(Database
34. Wang G, Jung K, Winnenburg R, Shah NH issue):D1113–D1116
(2015) A method for systematic discovery of 44. Wickham H (2016) ggplot2: elegant graphics
adverse drug events from clinical notes. J Am for data analysis, 2nd edn. Springer
Med Inform Assoc 22(6):1196–1204 45. Hunter JD (2007) Matplotlib: a 2D graphics
35. Crosslin DR, Robertson PD, Carrell DS, Gor- environment. Comput Sci Eng 9(3):90–95
don AS, Hanna DS, Burt A, Fullerton SM, 46. Shannon P, Markiel A, Ozier O, Baliga NS,
Scrol A, Ralston J, Leppig K et al (2015) Pro- Wang JT, Ramage D, Amin N,
spective participant selection and ranking to Schwikowski B, Ideker T (2003) Cytoscape: a
maximize actionable pharmacogenetic variants software environment for integrated models of
and discovery in the eMERGE network. biomolecular interaction networks. Genome
Genome Med 7(1):67 Res 13(11):2498–2504
36. Xu H, Aldrich MC, Chen Q, Liu H, Peterson 47. Bastian M, Heymann S, Jacomy M (2009)
NB, Dai Q, Levy M, Shah A, Han X, Ruan X Gephi: an open source software for exploring
et al (2015) Validating drug repurposing sig- and manipulating networks. Icwsm 8:361–362
nals using electronic health records: a case 48. Li L, Greene I, Readhead B, Menon MC, Kidd
study of metformin associated with reduced BA, Uzilov AV, Wei C, Philippe N,
cancer mortality. J Am Med Inform Assoc 22 Schroppel B, He JC et al (2017) Novel thera-
(1):179–191 peutics identification for fibrosis in renal allo-
37. Kirkendall ES, Kouril M, Minich T, Spooner graft using integrative informatics approach.
SA (2014) Analysis of electronic medication Sci Rep 7:39487
orders with large overdoses: opportunities for 49. Chen B, Wei W, Ma L, Yang B, Gill RM, Chua
mitigating dosing errors. Appl Clin Inform 5 MS, Butte AJ, So S (2017) Computational
(1):25–45 discovery of niclosamide ethanolamine, a
38. Ramirez AH, Shi Y, Schildcrout JS, Delaney repurposed drug candidate that reduces
JT, Xu H, Oetjens MT, Zuvich RL, Basford growth of hepatocellular carcinoma cells
MA, Bowton E, Jiang M et al (2012) Predict- in vitro and in mice by inhibiting cell division
ing warfarin dosage in European-Americans cycle 37 signaling. Gastroenterology 152
and African-Americans using DNA samples (8):2022–2036
50. Chen R, Li L, Butte AJ (2007) AILUN: rean- gene expression correlates with drug efficacy
notating gene expression data automatically. and reveals therapeutic targets. Nat Commun
Nat Methods 4(11):879 (In Press)
51. Tusher VG, Tibshirani R, Chu G (2001) Sig- 65. Chen B, Greenside P, Paik H, Sirota M,
nificance analysis of microarrays applied to the Hadley D, Butte AJ (2015) Relating chemical
ionizing radiation response. Proc Natl Acad Sci structure to cellular response: an integrative
U S A 98(9):5116–5121 analysis of gene expression, bioactivity, and
52. Anders S, Huber W (2010) Differential expres- structural data across 11,000 compounds.
sion analysis for sequence count data. Genome CPT Pharmacometrics Syst Pharmacol 4
Biol 11(10):R106 (10):576–584
53. Iorio F, Bosotti R, Scacheri E, Belcastro V, 66. Smith C (2003) Drug target validation: hitting
Mithbaokar P, Ferriero R, Murino L, the target. Nature 422(6929). 341, 343,
Tagliaferri R, Brunetti-Pierri N, Isacchi A et al 345 passim
(2010) Discovery of drug mode of action and 67. Chen B, Sirota M, Fan-Minogue H, Hadley D,
drug repositioning from transcriptional Butte AJ (2015) Relating hepatocellular carci-
responses. Proc Natl Acad Sci U S A 107 noma tumor samples and cell lines using gene
(33):14621–14626 expression data in translational research. BMC
54. Lamb J, Crawford ED, Peck D, Modell JW, Med Genet 8(Suppl 2):S5
Blat IC, Wrobel MJ, Lerner J, Brunet JP, 68. Domcke S, Sinha R, Levine DA, Sander C,
Subramanian A, Ross KN et al (2006) The Schultz N (2013) Evaluating cell lines as
connectivity map: using gene-expression signa- tumour models by comparison of genomic
tures to connect small molecules, genes, and profiles. Nat Commun 4:2126
disease. Science 313(5795):1929–1935 69. Hefti FF (2008) Requirements for a lead com-
55. Kidd BA, Wroblewska A, Boland MR, Agudo J, pound to become a clinical candidate. BMC
Merad M, Tatonetti NP, Brown BD, Dudley JT Neurosci 9(Suppl 3):S7
(2016) Mapping the effects of drugs on the 70. Empfield JR, Leeson PD (2010) Lessons
immune system. Nat Biotechnol 34(1):47–54 learned from candidate drug attrition. IDrugs
56. Hanzelmann S, Castelo R, Guinney J (2013) 13(12):869–873
GSVA: gene set variation analysis for microar- 71. Hughes JP, Rees S, Kalindjian SB, Philpott KL
ray and RNA-seq data. BMC Bioinformatics (2011) Principles of early drug discovery. Br J
14:7 Pharmacol 162(6):1239–1249
57. Dudley JT, Butte AJ (2010) In silico research 72. Meanwell NA (2011) Improving drug candi-
in the era of cloud computing. Nat Biotechnol dates by design: a focus on physicochemical
28(11):1181–1185 properties as a means of improving compound
58. Beaulieu-Jones BK, Greene CS (2017) Repro- disposition and safety. Chem Res Toxicol 24
ducibility of computational workflows is auto- (9):1420–1456
mated using continuous analysis. Nat 73. Bate A, Juniper J, Lawton AM, Thwaites RM
Biotechnol 35(4):342–346 (2016) Designing and incorporating a real
59. Ramasamy A, Mondry A, Holmes CC, Altman world data approach to international drug
DG (2008) Key issues in conducting a meta- development and use: what the UK offers.
analysis of gene expression microarray datasets. Drug Discov Today 21(3):400–405
PLoS Med 5(9):e184 74. Cipparone CW, Withiam-Leitch M, Kimminau
60. Klebanov L, Yakovlev A (2006) Treating KS, Fox CH, Singh R, Kahn L (2015) Inaccu-
expression levels of different genes as a sample racy of ICD-9 codes for chronic kidney disease:
in microarray data analysis: is it worth a risk? a study from two practice-based research net-
Stat Appl Genet Molec Biol 5(1):1–9 works (PBRNs). J Am Board Fam Med 28
61. Leek JT, Storey JD (2007) Capturing hetero- (5):678–682
geneity in gene expression studies by surrogate 75. Chung CP, Rohan P, Krishnaswami S, McPhe-
variable analysis. PLoS Genet 3(9):1724–1735 eters ML (2013) A systematic review of vali-
62. Dudley JT, Tibshirani R, Deshpande T, Butte dated methods for identifying patients with
AJ (2009) Disease signatures are robust across rheumatoid arthritis using administrative or
tissues and experiments. Mol Syst Biol 5:307 claims data. Vaccine 31(Suppl 10):K41–K61
63. Campain A, Yang YH (2010) Comparison 76. Wei WQ, Teixeira PL, Mo H, Cronin RM,
study of microarray meta-analysis methods. Warner JL, Denny JC (2016) Combining bill-
BMC Bioinformatics 11:408 ing codes, clinical notes, and medications from
64. Chen B, Ma L, Paik H, Sirota M, Wei W, Chua electronic health records provides superior
MS, So S, Butte AJ (2017) Reversal of cancer
phenotyping performance. J Am Med Inform biomedical, health care and wellness data
Assoc 23(e1):e20–e27 streams. Brief Bioinform 18(1):105–124
77. Yoon D, Ahn EK, Park MY, Cho SY, Ryan P, 80. Davis S, Meltzer PS (2007) GEOquery: a
Schuemie MJ, Shin D, Park H, Park RW bridge between the gene expression omnibus
(2016) Conversion and data quality assessment (GEO) and BioConductor. Bioinformatics 23
of electronic health record data at a Korean (14):1846–1847
tertiary teaching hospital to a common data 81. Robinson MD, McCarthy DJ, Smyth GK
model for distributed network research. (2010) edgeR: a Bioconductor package for dif-
Healthc Inform Res 22(1):54–58 ferential expression analysis of digital gene
78. Barrows RC Jr, Clayton PD (1996) Privacy, expression data. Bioinformatics 26
confidentiality, and electronic medical records. (1):139–140
J Am Med Inform Assoc 3(2):139–148 82. Hong F, Breitling R, McEntee CW, Wittner
79. Shameer K, Badgeley MA, Miotto R, Glicks- BS, Nemhauser JL, Chory J (2006) RankProd:
berg BS, Morgan JW, Dudley JT (2017) Trans- a bioconductor package for detecting differen-
lational bioinformatics in the era of real-time tially expressed genes in meta-analysis. Bioin-
formatics 22(22):2825–2827
Chapter 7

How to Prepare a Compound Collection Prior to Virtual Screening
Abstract
Virtual screening is a well-established technique that has proven to be successful in the identification of
novel biologically active molecules, including drug repurposing. Whether for ligand-based or for structure-
based virtual screening, a chemical collection needs to be properly processed prior to in silico evaluation.
Here we describe our step-by-step procedure for handling very large collections (up to billions) of
compounds prior to virtual screening.
Key words Cheminformatics, Drug discovery, Online services, Property filtering, Unwanted struc-
tures, Virtual screening
1 Introduction
2 Materials
3 Methods
Table 1
Sources for chemical databases available for evaluation
Aldrich Market Select
  Web address: https://www.sigmaaldrich.com/chemistry/chemistry-services/aldrich-market-select.html
  Number of compounds: 14 million in-stock products (no "virtual chemistry"); over eight million unique structures
  Description: More than 80 of the most reliable suppliers in the compound and building block market

MCULE
  Web address: https://mcule.com/database/
  Number of compounds: Over 5.6 million unique in-stock compounds
  Description: An integrated drug discovery platform providing IT infrastructure, drug discovery tools, a high-quality compound database, and professional compound delivery

BIOVIA Available Chemical Directory
  Web address: http://accelrys.com/products/collaborative-science/databases/sourcing-databases/biovia-available-chemicals-directory.html
  Number of compounds: Over 25 million products; over ten million unique chemicals, including 3D models
  Description: More than 900 suppliers; access fee required

ZINC
  Web address: http://zinc.docking.org/browse/subsets/
  Number of compounds: Over 120 million potentially purchasable compounds; more than 12 million in stock
  Description: Extracted from over 200 catalogs of purchasable compounds; the entire data collection is available for free

ChemSpider
  Web address: http://www.chemspider.com/
  Number of compounds: 63 million chemical structures
  Description: Almost 300 data sources; not known how many chemicals are commercially available; the entire set is NOT available for free download (only 1000 structures/day)

eMolecules
  Web address: http://www.emolecules.com/
  Number of compounds: Over seven million screening compounds
  Description: 20 "reliable" chemical suppliers; structures available for free, access fee required for vendor and price information

PubChem
  Web address: http://pubchem.ncbi.nlm.nih.gov/
  Number of compounds: Over 230 million substances; over 90 million unique chemicals
  Description: 580 data sources (300 chemical vendors); not all chemicals are commercially available; the entire collection is available for free
Fig. 2 Examples of chemical substructures that may cause interference with biochemical assays [25] under
HTS conditions. Modified from Ref. [40] with permission
3.2.1 Remove Garbage from the Collection

Split covalently bound salt counterions, remove small fragments (salts), and normalize charges. This is clearly an instance where the user is confronted with multiple choices. For typical pharmaceutical
screening, it is advisable to remove unwanted structures, such as
those depicted in Fig. 2. One should always consider "unwanted" structures in context; for example, a large number of antineoplastic agents would be considered "reactive species" according to
Fig. 2. Furthermore, many flavor compounds are monofunctional
aldehydes. Thus, when seeking actives in oncology or in flavor
science, certain substructure-based filters need to be individually
evaluated and, quite likely, excluded.
3.2.3 Generate Unique, Normalized SMILES

Once canonical SMILES are derived, one should store just the unique SMILES by verifying structure identity while ignoring compound IDs or chemical names. If the Virtuals or Tangibles are compiled
from a large number of compound vendors, there is a good chance
that this will clean up 50% or more of the starting collection. At this
step, it is advisable to use a list of "preferred" or "trusted" vendors
first. Such lists are developed with time—so first-time users must
take some risks in this step. Whenever the budget is limited, a script
to rank low-price structures first could be used. Besides price,
minimum required purity and maximum guaranteed shipping win-
dow are other important criteria to be considered when selecting
from multiple vendors that sell the same compounds.
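The salt-stripping and deduplication steps described in Subheadings 3.2.1 and 3.2.3 can be sketched with the open-source RDKit toolkit, one of several equally valid choices (commercial toolkits listed in Subheading 2 offer equivalents). Charge normalization and tautomer handling would be additional steps; the helper name unique_parent_smiles is ours, not part of RDKit.

```python
# Minimal sketch of salt stripping and duplicate removal using the open-source
# RDKit toolkit (one possible choice among the toolkits discussed here).
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

def unique_parent_smiles(smiles_list):
    """Strip common salt counterions and keep unique canonical SMILES."""
    remover = SaltRemover()            # uses RDKit's default salt definitions
    seen, unique = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                # unparsable structure: treat as garbage
            continue
        mol = remover.StripMol(mol, dontRemoveEverything=True)
        can = Chem.MolToSmiles(mol)    # canonical SMILES
        if can not in seen:
            seen.add(can)
            unique.append(can)
    return unique

# unique = unique_parent_smiles(["CCO.Cl", "OCC", "c1ccccc1C(=O)[O-].[Na+]"])
```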
3.2.4 Other Unwanted Substructures

Based on cumulative experience with high-throughput screening using the MLSMR collection (available in PubChem), we have identified a set of 100 scaffolds that have a significantly higher-than-average tendency to exhibit bioactivity in primary
assays. While some of these scaffolds may be legitimately present
in hits under certain bioassay conditions, we have implemented a
SMARTS-based removal procedure for such promiscuous patterns
at our open-access website [84]. In addition to Badapple (promis-
cuous pattern removal) [85], we offer three different FILTER-
compliant tools, as well as a fragment-based drug-like filter
[41]. These tools, summarized in Table 2, can be used free of
charge for the purpose of academic research.
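SMARTS-based substructure removal of the kind implemented in these web tools can be sketched as follows with RDKit. The two SMARTS patterns shown (a nitro group and an acyl halide) are illustrative placeholders only; they are not the promiscuous scaffolds flagged by Badapple [85] or the FILTER rule sets.

```python
# Minimal sketch of a SMARTS-based substructure filter. The two patterns are
# illustrative placeholders, not the Badapple [85] or FILTER rule sets.
from rdkit import Chem

UNWANTED_SMARTS = {
    "nitro": "[N+](=O)[O-]",
    "acyl_halide": "C(=O)[Cl,Br,I]",
}
UNWANTED = {name: Chem.MolFromSmarts(s) for name, s in UNWANTED_SMARTS.items()}

def flag_unwanted(smiles):
    """Return the names of unwanted substructures found in one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ["unparsable"]
    return [name for name, patt in UNWANTED.items()
            if mol.HasSubstructMatch(patt)]

# passed = [s for s in collection if not flag_unwanted(s)]  # 'collection' is hypothetical
```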
3.3 FILTER for Lead-Likeness

After cleanup, the collection can be processed to remove compounds that do not have certain 2D-based properties, for example, those that define leads [25, 43, 44]. Compounds that pass the
filters can be prioritized for virtual screening. When examining a
collection of natural products, however, lead-like filters may be
unsuitable [42]. When a set of lead-like structures is desired, we
Table 2
UNM biocomputing public web applications (available at http://datascience.unm.edu/public-biocomputing-apps)
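The 2D property filtering described in Subheading 3.3 can be sketched with RDKit descriptors. The thresholds below are illustrative approximations only, not the exact lead-like criteria of refs [25, 43, 44], and should be adjusted to the project at hand (e.g., relaxed for natural products).

```python
# Minimal sketch of a 2D property filter for lead-likeness using RDKit
# descriptors. Thresholds are illustrative; see refs [25, 43, 44] for the
# criteria actually recommended.
from rdkit import Chem
from rdkit.Chem import Descriptors

def is_leadlike(smiles, max_mw=450.0, max_logp=4.0, max_rotb=8,
                max_hbd=5, max_hba=8):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) <= max_mw
            and Descriptors.MolLogP(mol) <= max_logp
            and Descriptors.NumRotatableBonds(mol) <= max_rotb
            and Descriptors.NumHDonors(mol) <= max_hbd
            and Descriptors.NumHAcceptors(mol) <= max_hba)

# leadlike_subset = [s for s in unique_smiles if is_leadlike(s)]  # hypothetical list
```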
3.4 Search for Similarity If Known Active Molecules Are Available

Whenever high-activity molecules are available from the literature, patents, or from in-house data, the user is advised to perform a similarity search on the entire Virtuals or Tangibles for similar molecules (see Subheading 3.7) and to actively seek to include
them in the virtual screening deck, even though they might have
been removed during the previous steps. These molecules should
serve as positive controls, i.e., they should be retrieved at the end of
the virtual or high-throughput screen as "hits," if the similarity
principle holds (see also Note 2).
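A 2D fingerprint similarity search of this kind can be sketched with RDKit. The fingerprint type (Morgan, radius 2) and the Tanimoto cutoff of 0.6 are illustrative choices, not prescribed values.

```python
# Minimal sketch of a 2D similarity search with RDKit: retrieve collection
# members above a Tanimoto cutoff to a known active ("positive control").
# Fingerprint type and cutoff are illustrative choices.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similar_to_active(active_smiles, collection_smiles, cutoff=0.6):
    query = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(active_smiles), 2, nBits=2048)
    hits = []
    for smi in collection_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        sim = DataStructs.TanimotoSimilarity(query, fp)
        if sim >= cutoff:
            hits.append((smi, sim))
    return sorted(hits, key=lambda x: -x[1])
```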
3.5 Explore Alternative Structures

The user should seek alternate structures by modifying [87] the canonical isomeric SMILES, since these may occur in solution or at the ligand-receptor interface:
- Tautomerism, which shifts one hydrogen along a path of alter-
nating single/double bonds, mostly involving nitrogen and
oxygen (e.g., imidazole); the reader is encouraged to consult
the tautomers issue of Journal of Computer-Aided Molecular
Design [88].
- Acid/base equilibria, which explore different protonation states
by assigning formal charges to those chemical moieties that are
likely to be charged (e.g., phosphate or guanidine) and by
assigning charges to some of those moieties that are likely to
be charged under different microenvironmental conditions
("chargeable" moieties such as tetrazole and aliphatic amine).
- Exploration of alternate structures whenever chiral centers are not specified (OpenEye's omega2 flipper option, ChemAxon's Stereoisomer Generator Plugin), since 3D structure conversion from SMILES in some cases does not "explode" all possible states. Other examples include pseudo-chiral centers such as pyramidal ("flappy") nitrogen inversions that explore non-charged, nonaromatic, pseudo-chiral nitrogens (three sub-
stituents), since these are easily interconverted in three
dimensions.
Exploring alternate structures is advisable prior to processing
any collection with computational means, e.g., for diversity analy-
sis (see also Note 3). The results will influence any virtual screen.
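Two of the enumerations listed above (tautomers and unassigned stereocenters) can be sketched with RDKit, as one open-source alternative to the dedicated tools named in the list; protonation-state enumeration typically requires additional tooling. The helper name alternative_structures is ours, not part of RDKit.

```python
# Minimal sketch of enumerating alternative structures with RDKit: tautomers
# via MolStandardize and unassigned stereocenters via EnumerateStereoisomers.
# Protonation states are not covered here and need separate tools.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit.Chem.EnumerateStereoisomers import (EnumerateStereoisomers,
                                               StereoEnumerationOptions)

def alternative_structures(smiles, max_stereoisomers=32):
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    # Tautomers (e.g., imidazole NH shifts)
    enumerator = rdMolStandardize.TautomerEnumerator()
    for taut in enumerator.Enumerate(mol):
        variants.add(Chem.MolToSmiles(taut))
    # Stereoisomers for unassigned chiral centers / double bonds
    opts = StereoEnumerationOptions(onlyUnassigned=True,
                                    maxIsomers=max_stereoisomers)
    for iso in EnumerateStereoisomers(mol, options=opts):
        variants.add(Chem.MolToSmiles(iso))
    return sorted(variants)

# alternative_structures("c1cc[nH]c1C=CC(=O)O")
```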
3.6 Generate 3D Structures

The effort of exploring one or more conformers per molecule is quite relevant for virtual screening and for other 3D methods (see also Note 4). For example, one or multiple conformers per molecule are evaluated during shape (ligand-based) and target
(structure-based) virtual screening. Some docking programs
require a separate 3D conversion step [89], e.g., using CORINA
[73], Catalyst [49], OMEGA [74], or Molinspiration [68]. A web-
site that discusses 3D conformer generation software and provides
links to other tools is available [90].
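As a free alternative to the 3D conversion tools named above, SMILES-to-3D conversion with multi-conformer generation can be sketched with RDKit's ETKDG embedder; the conformer count and force-field cleanup step are illustrative choices.

```python
# Minimal sketch of SMILES-to-3D conversion and multi-conformer generation
# with RDKit's ETKDG embedder, shown as a free alternative to the tools
# (CORINA, Catalyst, OMEGA, Molinspiration) mentioned above.
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_conformers(smiles, n_confs=10, seed=42):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))      # explicit hydrogens
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)            # quick force-field cleanup
    return mol, list(conf_ids)

# mol3d, ids = generate_conformers("CC(=O)Oc1ccccc1C(=O)O")
# Chem.MolToMolBlock(mol3d, confId=ids[0])            # export the first conformer
```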
3.7 Select Chemical Structure Representatives

Screening compounds that are similar to known actives increases the likelihood of finding new active compounds, but it may not lead to different chemotypes, which are highly desirable in the industrial context (see also Note 5). The severity of this situation is
increased if the original actives are covered by third-party patents
or if the lead scaffold is toxic. Sometimes, the processed collection
may simply be too large to be evaluated in detail or even to be
submitted to a virtual screen. In such cases, a strategy based on
clustering and perhaps on statistical molecular design (SMD) is a
better alternative, compared to random selection. Clustering meth-
ods aim at grouping molecules into "families" (clusters) of related structures that are perceived, at a given resolution, to be different
from other chemical families. With clustering, the end user has the
ability to select one or more representatives from each family. SMD
methods aim at sampling various areas of chemical space and select-
ing representatives from each area. Some software is designed to
select compounds from multidimensional spaces, but the outcome
depends on several factors, as discussed below.
3.7.1 Chemical Descriptors

Chemical descriptors are used to encode chemical structures and properties of compounds: 2D/3D binary fingerprints, counts of different substructural features, or perhaps (computed) physicochemical properties, e.g., MW, ClogP, HDO, and HAC, as well as
other types of steric, electronic, electrostatic, topologic, or
hydrogen-bonding descriptors. The choice of what descriptors to
use, and in what context, depends on the size of collection, on the
software and hardware available, as well as on the time constraints
given for a particular selection process.
3.7.3 Clustering Algorithms

These algorithms can be classified using many criteria and implemented in different ways; see Subheading 2, item 4 for a short list of clustering software. Hierarchical clustering methods
have been traditionally used to a greater extent—in part due to
computational simplicity. More recently, chemical structure classifications have increasingly used nonhierarchical methods. In practice, the choice of the different factors (descriptors, similarity measure, clustering algorithm) also depends on the hardware and software resources available, the size and diversity of the collection that must be clustered, and, not least, on the user's experience in producing a useful classification that has the ability to predict property values. We prefer Mesa's clustering method [70] for its
ability to provide asymmetric clustering and to deal with "false singletons" (borderline compounds that are often assigned to one
of at least two equally distant chemical families).
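To make the clustering step concrete, the sketch below uses fingerprint-based Butina clustering as implemented in RDKit, shown only as one open-source option; Mesa's asymmetric clustering [70] and the other packages listed in Subheading 2 are alternatives with different trade-offs. The distance cutoff of 0.35 is an illustrative choice.

```python
# Minimal sketch of fingerprint-based compound clustering with the Butina
# algorithm in RDKit (one open-source option; Mesa's asymmetric clustering [70]
# and other packages are alternatives). Assumes all SMILES are parsable,
# i.e., the cleanup steps above have already been run.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def cluster_collection(smiles_list, cutoff=0.35):
    """Cluster molecules; `cutoff` is a Tanimoto *distance* threshold."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    # Lower-triangle distance matrix expected by Butina.ClusterData
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)
    # Each cluster is a tuple of indices; the first index is the centroid
    return clusters

# representatives = [smiles[c[0]] for c in cluster_collection(smiles)]  # 'smiles' is hypothetical
```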
3.7.5 Randomness

Finally, the "unexpected," that component which invites chance, as discussed by Taleb [103, 104], justifies the random inclusion of a particular subset of molecules in the virtual screening deck. These molecules should not be subject to any processing (other than
correct structural representation, normalization, and tautomer/
protomer representation), i.e., they should be entirely random.
We cannot document whether randomness is more successful than rational methods, nor do we suggest that the criteria for rational selection should be taken lightly. However, serendipity plays a
major role in drug discovery [105]. Therefore, we should allow a
certain degree of randomness in the final selection. If randomly
selected compounds are included, the final list of compounds
should be verified, once more, for uniqueness—to avoid duplicates.
4 Notes
5 Conclusions
Acknowledgment
References
1. Avorn J (2015) The $2.6 billion pill — meth- 4. Walters WP, Stahl MT, Murcko MA (1998)
odologic and policy considerations. N Engl J Virtual screening–an overview. Drug Discov
Med 372:1877–1879 Today 3:160–178
2. Sukuru SCK, Jenkins JL, Beckwith REH, 5. Fara DC, Oprea TI, Prossnitz ER, Bologa
Scheiber J, Bender A, Mikhailov D, Davies CG, Edwards BS, Sklar LA (2006) Integra-
JW, Glick M (2009) Plate-based diversity tion of virtual and physical screening. Drug
selection based on empirical HTS data to Discov Today Technol 3:377–385
enhance the number of hits and their chemical 6. Oprea TI, Matter H (2004) Integrating vir-
diversity. J Biomol Screen 14:690–699 tual screening in lead discovery. Curr Opin
3. Horvath D (1997) A virtual screening Chem Biol 8:349–358
approach applied to the search for trypa- 7. The PubChem service is hosted by the
nothione reductase inhibitors. J Med Chem National Center for Biotechnology Informa-
40:2412–2423 tion at NIH. http://pubchem.ncbi.nlm.nih.
gov/
Waldmann H, Sklar LA (2009) A crowdsour- 45. Oprea TI, Allu TK, Fara DC, Rad RF,
cing evaluation of the NIH chemical probes. Ostopovici L, Bologa CG (2007) Lead-like,
Nat Chem Biol 5:441–447 drug-like or “Pub-like”: how different are
31. Collins FS (2010) Research agenda. Oppor- they? J Comput Aided Mol Des 21:113–119
tunities for research and NIH. Science 46. See the OpenEye Scientific Software, Santa
327:36–37 Fe, NM. http://www.eyesopen.com/
32. Boguski MS, Mandl KD, Sukhatme VP 47. See the mesa analytics & computing, Santa Fe,
(2009) Repurposing with a difference. Sci- NM. http://www.mesaac.com/
ence 324:1394–1395 48. See the ChemAxon kft, Budapest, Hungary.
33. Toney JH, Fasick JI, Singh S, Beyrer C, Sulli- https://www.chemaxon.com/
van DJ Jr (2009) Purposeful learning with 49. Accelrys Inc., San Diego, CA. http://www.
drug repurposing. Science 325:1139–1140 accelrys.com/
34. Chong CR, Sullivan DJ Jr (2007) New uses 50. See the Chemical Computing Group. http://
for old drugs. Nature 448:645–646 www.chemcomp.com/
35. Campillos M, Kuhn M, Gavin AC, Jensen LJ, 51. Certara, Princeton, NJ. https://www.certara.
Bork P (2008) Drug target identification com/
using side-effect similarity. Science 52. Ambure P, Aher RB, Roy K (2014) Recent
321:263–266 advances in the open access cheminformatics
36. Keiser MJ, Setola V, Irwin JJ, Laggner C, toolkits, software tools, workflow environ-
Abbas AI, Hufeisen SJ, Jensen NH, Kuijer ments, and databases. In: Zhang W
MB, Matos RC, Tran TB, Whaley R, Glennon (ed) Computer-aided drug discovery. Meth-
RA, Hert J, Thomas KLH, Edwards DD, ods in pharmacology and toxicology. Humana
Shoichet BK, Roth BL (2009) Predicting Press, New York, NY
new molecular targets for known drugs. 53. Weininger D (1988) SMILES, a chemical lan-
Nature 462:175–181 guage and information system. 1. Introduc-
37. Ashburn TT, Thor KB (2004) Drug reposi- tion to methodology and encoding rules. J
tioning: Identifying and developing new uses Chem Inf Comput Sci 28:31–36
for existing drugs. Nat Rev Drug Discov 54. The International Chemical Identifier, InChI,
3:673–683 was a IUPAC project. http://www.iupac.org/
38. CTSA. http://www.ncrr.nih.gov/clinical_ inchi/
research_resources/clinical_and_transla 55. OEChem Toolkit, Openeye Scientific Soft-
tional_science_awards/ ware, Santa Fe, NM. http://www.eyesopen.
39. Lipinski CA, Lombardo F, Dominy BW, Fee- com/oechem-tk
ney PJ (1997) Experimental and computa- 56. Open Babel. http://openbabel.sourceforge.
tional approaches to estimate solubility and net/
permeability in drug discovery and develop-
ment settings. Adv Drug Deliv Rev 23:3–25 57. raphSim TK Openeye Scientific Software,
Santa Fe, NM. http://www.eyesopen.com/
40. Oprea TI (2000) Property distribution of graphsim-tk
drug-related chemical databases. J Comput
Aided Mol Des 14:251–264 58. MACCSKeys320Generator, Mesa analytics
and computing LLC, Santa Fe, NM. http://
41. Ursu O, Oprea TI (2010) Model-free drug- www.mesaac.com/
likeness from fragments. J Chem Inf Model
50:1387–1394 59. Durant JL, Leland BA, Henry DR, Nourse JG
(2002) Reoptimization of MDL keys for use
42. Wester MJ, Pollock SN, Coutsias EA, Allu in drug discovery. J Chem Inf Comput Sci
TK, Muresan S, Oprea TI (2008) Scaffold 42:1273–1280
topologies. 2. Analysis of chemical databases.
J Chem Inf Model 48:1311–1324 60. MOE: The molecular operating environment
from chemical computing group Inc., Mon-
43. Teague SJ, Davis AM, Leeson PD, Oprea TI treal, QC. http://www.chemcomp.com/
(1999) The design of leadlike combinatorial
libraries. Angew Chem Int Ed 38:3743–3748 61. Open Babel: the open source chemistry tool-
German version: Angew. Chem. 111, 3962-- box. http://openbabel.org/wiki/Main_Page
3967 62. CDK is a Java library for structural chemo-
44. Hann MM, Oprea TI (2004) Pursuing the and bioinformatics. http://cdk.sf.net/
leadlikeness concept in pharmaceutical 63. Leo A (1993) Estimating LogPoct from struc-
research. Curr Opin Chem Biol 8:255–263 tures. Chem Rev 5:1281–1306
64. CLOGP is available from BioByte Corpora- organic small molecules in the chemical uni-
tion, Claremont, CA. http://www.biobyte. verse database GDB-17. J Chem Inf Model
com/ 52:2864–2875
65. EPI Suite v4.11, U.S. Environmental Protec- 78. Ursu O, Holmes J, Knockel J, Bologa CG,
tion Agency. https://www.epa.gov/tsca- Yang JJ, Mathias SL, Nelson SJ, Oprea TI
screening-tools/epi-suitetm-estimation-pro (2017) DrugCentral: online drug compen-
gram-interface dium. Nucleic Acids Res 45:D932–D939
66. Tetko IV, Tanchuk VY (2002) Application of 79. FILTER is available from OpenEye Scientific
associative neural networks for prediction of Software Inc., Santa Fe, NM. http://www.
lipophilicity in ALOGPS 2.1 program. J eyesopen.com/products/applications/filter.
Chem Inf Comput Sci 42:1136–1145 html
http://vcclab.org/lab/alogps/index.html 80. Olah M, Mracec M, Ostopovici L, Rad R,
67. The virtual computational chemistry labora- Bora A, Hadaruga N, Olah I, Banda M,
tory (VCCLAB) as a number of on-line soft- Simon Z, Mracec M, Oprea TI (2004) WOM-
ware modules. Available at http://vcclab. BAT: world of molecular bioactivity. In:
org/ Oprea TI (ed) Cheminformatics in drug dis-
68. Molinspiration has a number of property cal- covery. Wiley-VCH, New York, NY (in press)
culators, including 3D conformer generation. 81. Coats EA (1998) The CoMFA steroids as a
http://molinspiration.com/ benchmark dataset for development of
69. Yap CW (2011) PaDEL-descriptor: An open 3D-QSAR methods. In: Kubinyi H,
source software to calculate molecular Folkers G, Martin YC (eds) 3D QSAR in
descriptors and fingerprints. J Comput drug design. Volume 3. Recent advances.
Chem 32:1466–1474 Kluwer/ESCOM, Dordrecht, The Nether-
70. Measures, mesa analytics and computing lands, pp 199–213
LLC, Santa Fe, NM. http://www.mesaac. 82. Oprea TI, Olah M, Ostopovici L, Rad R, Mra-
com/ cec M (2003) On the propagation of errors in
71. ChemoMine plc, Cambridge, UK. http:// the QSAR literature. In: Ford M,
www.chemomine.co.uk/ Livingstone D, Dearden J, Van de Water-
beemd H (eds) EuroQSAR 2002–Designing
72. MacCuish JD, MacCuish NE (2010) Chap- drugs and crop protectants: processes, pro-
man & Hall/CRC mathematical & computa- blems and solutions. Blackwell Publishing,
tional biology. In: Clustering in New York, NY, pp 314–315
bioinformatics and drug discovery, vol 40.
CRC press, Boca Raton, FL, p 244 83. Chemical Database Management Software,
TimTec Inc. http://software.timtec.net/
73. Gasteiger J, Rudolph C, Sadowski J (1990) ched.htm
Automatic generation of 3D-atomic coordi-
nates for organic molecules. Tetrahedron 84. Public web applications from UNM Biocom-
Comput Methodol 3:537–547 CORINA is puting are available at http://pasilla.health.
available from Molecular Networks GmbH unm.edu
and Altamira LLC; https://www.mn-am. 85. Yang JJ, Ursu O, Lipinski CA, Sklar LA,
com/ Oprea TI, Bologa CG (2016) Badapple: pro-
74. Hawkins PCD, Skillman AG, Warren GL, miscuity patterns from noisy evidence. J
Ellingson BA, Stahl MT (2010) Conformer Chem 8:29
generation with OMEGA: Algorithm and val- 86. Johnston PA (2011) Redox cycling com-
idation using high quality structures from the pounds generate H2O2 in HTS buffers con-
Protein Databank and Cambridge Structural taining strong reducing reagents-real hits or
Database. J Chem Inf Model 50:572–584 promiscuous artifacts? Curr Opin Chem Biol
OpenEye Scientific Software Inc., Santa Fe, 15:174–182
NM; http://www.eyesopen.com/ 87. Kenny PW, Sadowski J (2004) Structure mod-
75. MODDE is available from Umetrics, a divi- ification in chemical databases. In: Oprea TI
sion of Sartorius Stedim biotech. https:// (ed) Cheminformatics in drug discovery.
webshop.umetrics.com/ Wiley-VCH, New York, NY (in press)
76. The MLSMR collection can be datamined 88. Martin YC (2010) Perspectives in drug dis-
using the PubChem interface (keyword, covery and design: tautomers and tautomer-
MLSMR). http://pubchem.ncbi.nlm.nih. ism. J Comput Aided Mol Design
gov/ 24:473–638
77. Ruddigkeit L, van Deursen R, Blum LC, Rey- 89. Sadowski J, Gasteiger J (1993) From atoms
mond JL (2012) Enumeration of 166 billion and bonds to three-dimensional atomic
138 Cristian G. Bologa et al.
coordinates: automatic model builders. Chem 103. Taleb NN (2005) Fooled by randomness: the
Rev 93:2567–2581 hidden role of chance in the markets and life.
90. See the Metabolomics Fiehn Lab site: http:// Random House, New York
fiehnlab.ucdavis.edu/staff/kind/ 104. Taleb NN (2007) The Black Swan. The
ChemoInformatics/Concepts/3D- impact of the highly improbable. Random
conformer/ House, New York
91. Johnson MA, Maggiora GM (1990) Con- 105. Sneader W (2005) Drug discovery: a history.
cepts and applications of molecular similarity. Wiley, New York
Wiley-VCH, New York, NY 106. Boström J, Norrby P-O, Liljefors T (1998)
92. Maggiora GM (2006) On outliers and activity Conformational energy penalties of protein-
cliffsWhy QSAR often disappoints. J Chem bound ligands. J Comput-Aided Mol Design
Inf Model 46:1535 12:383–396
93. Oprea TI (2002) Chemical space navigation 107. Prossnitz ER, Arterburn JB, Edwards BS,
in lead discovery. Cur Opin Chem Biol Sklar LA, Oprea TI (2006) Steroid-binding
6:384–389 GPCRs: new drug discovery targets for old
94. Todeschini R, Consonni V (2008) Handbook ligands. Expert Opin Drug Discov
of molecular descriptors, 2nd edn. Wiley- 1:137–150
VCH, Weinheim, Germany 108. Papadatos G, Davies M, Dedman N,
95. Tanimoto TT (1961) Non-linear model for a Chambers J, Gaulton A, Siddle J, Koks R,
computer assisted medical diagnostic proce- Irvine SA, Pettersson J, Goncharoff N,
dure. Trans NY Acad Sci Ser 2(23):576–580 Hersey A, Overington JP (2016) Sure-
96. Tversky A (1977) Features of similarity. ChEMBL: a large-scale, chemically annotated
Psychol Rev 84:327–352 patent document database. Nucleic Acids Res
44:D1220–D1228 Available at https://www.
97. Willett P (1987) Similarity and clustering surechembl.org/search/
techniques in chemical information systems.
In: Research Studies Press. Letchworth, 109. Antolin AA, Tym JE, Komianou A, Collins I,
England Workman P, Al-Lazikani B (2017) Objective,
quantitative, data-driven assessment of chem-
98. Willett P (2000) Chemoinformatics–similar- ical probes. Cell Chem Biol 25(2):P194–205.
ity and diversity in chemical libraries. Curr Op E5 in press. Available at http://probeminer.
Biotech 11:85–88 icr.ac.uk/#/
99. Lewis RA, Pickett SD, Clark DE (2000) 110. Harding SD, Sharman JL, Faccenda E,
Computer-aided molecular diversity analysis Southan C, Pawson AJ, Ireland S, Gray AJG,
and combinatorial library design. Rev Com- Bruce L, Alexander SPH, Anderton S,
put Chem 16:1–51 Bryant C, Davenport AP, Doerig C,
100. Martin YC (2001) Diverse viewpoints on Fabbro D, Levi-Schaffer F, Spedding M,
computational aspects of molecular diversity. Davies JA, NC-IUPHAR (2018) The
J Comb Chem 3:231–250 IUPHAR/BPS Guide to PHARMACOL-
101. Linusson A, Gottfries J, Lindgren F, Wold S OGY in 2018: updates and expansion to
(2000) Statistical molecular design of build- encompass the new guide to IMMUNO-
ing blocks for combinatorial chemistry. J Med PHARMACOLOGY. Nucleic Acids Res 46:
Chem 43:1320–1328 D1091–D1106 Available at http://
102. Eriksson L, Johansson E, Kettaneh-Wold N, guidetopharmacology.org/
Wikström C, Wold S (2000) Design of experi- 111. Berman HM, Westbrook J, Feng Z,
ments: principles and applications. Umetrics Gilliland G, Bhat TN, Weissig H, Shindyalov
Academy, Umeå, Sweden IN, Bourne PE The protein data bank.
Nucleic Acids Res 28:235–242 Available at
http://www.rcsb.org/
Chapter 8
Building a QSPR Model
Robert D. Clark and Pankaj R. Daga
Abstract
Knowing the physicochemical and general biochemical properties of a compound is critical to understand-
ing how it behaves in different biological environments and to anticipating what is likely to happen in
situations where that behavior cannot be measured directly. Quantitative structure-property relationship
(QSPR) models provide a way to predict those properties even before a compound has been synthesized
simply by knowing what its structure would be. This chapter describes a general workflow for compiling the
data upon which a useful QSPR model is built, curating it, evaluating that model’s performance, and then
analyzing the predictive errors with an eye toward identifying systematic errors in the input data. The focus
here is on models for the absorption, distribution, metabolism, and excretion (ADME) properties of drugs
and toxins, but the considerations explored are general and applicable to any QSPR.
1 Introduction
2 Materials
2.1 Structural Data
Classical two-dimensional line drawings are very flexible and effective for conveying connectivity (“2D”) information. Adoption of a
few conventions makes it possible to convey a sense of depth as well.
Indeed, such “2.5 D” depictions can be quite beautiful and elegant.
Most existing computer programs are unable to process drawings
directly, however,1 which led to the adoption of standardized line
notations that specify the connectivity of a molecule. The most
widely used line notations are SMILES and InChI. The former
conveys enough molecular connectivity and bond type information
to specify a specific 2D structure, whereas the latter is a hierarchical
decomposition that can describe more than one molecular struc-
tural form of a compound.
Different software systems need to exchange structural infor-
mation, and several standard file formats have been developed to
facilitate that exchange. Files of SMILES strings can serve this
purpose, but their lack of coordinates and associated data can be a
serious limitation. MOL and SD files are more flexible; they store
explicit connectivity tables that specify atomic coordinates and
provide more molecular context. SD files have the added virtues
of being able to store multiple structures in a single file and to
associate nonstructural data with each structure.
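As a concrete illustration of these interchange formats, here is a minimal sketch using the open-source RDKit toolkit (one option among many; it is not the software package this chapter's workflow is built on). The SMILES string, property name, and file name are arbitrary examples.

```python
from rdkit import Chem

# Parse a SMILES string (aspirin, as an arbitrary example) into a molecule object
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol.SetProp("_Name", "aspirin")
mol.SetProp("logS", "-2.1")  # hypothetical associated data field

# Write the structure and its associated data to an SD file
writer = Chem.SDWriter("example.sdf")
writer.write(mol)
writer.close()

# Read the record back; the nonstructural data are recovered as properties
for m in Chem.SDMolSupplier("example.sdf"):
    print(Chem.MolToSmiles(m), m.GetProp("logS"))
```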
Protein Data Bank (PDB) files [4] are primarily designed to
store structural data for macromolecular crystal structures but can
be an important source of 3D structural information for bound
ligands. They require considerable preprocessing for use in building
2D QSPRs, in part because they lack explicit bond information for
ligands; however, once processed, they should be stored in SMILES
or SD format.
1. The simplified molecular-input line-entry system (SMILES)
was created at Daylight Chemical Information Systems, Inc.
[5]. Aromatic atoms are represented by their atomic symbols in
lowercase, and nonaromatic atoms are in uppercase. Numbers
are used to indicate nonadjacent atoms that are bonded to each
other (i.e., typically for ring closures), and parentheses set off branches.
1 Chemical structures can also be extracted from text documents by applying optical structure recognition (OSR). Those that the authors have worked with are not yet reliable enough to be applied without careful manual review.
2 Bridging hydrogen atoms that take part in three-center bonds are treated as part of the core structure.
2.3 Software
A very wide range of software is available for carrying out the
manipulations described here. The minimal functionality required
includes being able to:
1. Read in and write out structures in SMILES and/or SD format
as well as the ability to associate data with each structure.
2. Display 2D structures and allow them to be modified readily.
2.4 Descriptors
The quality and generality of a QSPR model depend to a considerable extent on the descriptors used in its construction. A very wide
range of descriptors is available for generating QSPRs and QSARs
[21]. Different research groups and software vendors tend to favor
their own particular sets, but each can be broadly categorized as
falling into one of a few classes.
1. Constitutional Descriptors
Constitutional descriptors capture various aspects of molecular
size (molecular weight, number of heavy atoms, number of
3 Methods
3.1 Compiling a Data Set
Collecting good data is key to building a robust and reliable QSPR model. It can come from internal corporate databases or the pri-
mary literature as well as from commercial or publicly accessible
data compilations (see Note 2).
The primary literature is the best source of property data, but
building a large enough data set to be useful requires a considerable
investment in time and effort, since each publication needs to be
considered on its own merits. Published compilations are a much
more convenient source of data, but these should also be handled
with care unless they were compiled by experts in the field and
include primary literature references for all entries. Relying on an
incompletely annotated compilation means complete reliance on
the domain knowledge of the authors, which is—in our experi-
ence—quite prone to error. Worse, there is often no way to track
down what the error was, which means that the data point or points
affected must be discarded altogether. Note, too, that if you do not
know the primary reference for a data point you suspect is bad, you
will not be able to identify associated observations—from the same
paper or from the same group—that are also in error.
Commercial databases and handbooks are also valuable data
sources; however, both can be quite expensive, and handbooks
often require generating structures from compound names (see
below). A ChEMBL search is usually the best way to start collecting
data, in part because it returns structures in electronic form. It also
represents a good compromise between ready accessibility to rela-
tively large amounts of data and reliability of the property values
retrieved. Combining several large data sets (excluding those that
are compilations) from such a search and building a preliminary
model can provide a sense of how feasible building a global model
for the property of interest is going to be. Once the likely quality of
the QSPR can be assessed, it may be possible to justify adding data
from more resource-intensive sources.
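For readers who want to script such a search, the sketch below pulls bioactivity records from the ChEMBL web services with the requests library. The endpoint path, filter parameters, and the target and activity-type identifiers shown are assumptions based on the public REST API rather than part of this chapter's protocol; keeping the document identifier with each record lets values be traced back to their primary reference.

```python
import requests

# Query the ChEMBL web services for bioactivity records (endpoint and
# parameter names are assumptions based on the public REST API).
url = "https://www.ebi.ac.uk/chembl/api/data/activity.json"
params = {
    "target_chembl_id": "CHEMBL3622",  # hypothetical target of interest
    "standard_type": "CL",             # hypothetical activity type (e.g., clearance)
    "limit": 100,
}
records = requests.get(url, params=params).json().get("activities", [])

# Keep the fields needed to seed a QSPR data set: structure, value, units,
# and the document each value came from.
rows = [
    (r.get("canonical_smiles"), r.get("standard_value"),
     r.get("standard_units"), r.get("document_chembl_id"))
    for r in records
]
for smiles, value, units, doc in rows[:5]:
    print(smiles, value, units, doc)
```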
3.3 Curating Property Data
It is possible for a model to be better than the raw data, but only insofar as it has been carefully curated. Reported error rates in the
literature are quite high [25], which agrees with the authors’ per-
sonal experience. Unfortunately, many of the most common kinds
of error can confer undue leverage on the affected data points,
Fig. 1 Tiled display (“wallpaper”) from ADMET Predictor 8.5 showing structures for a rat liver microsome
clearance data set taken from ChEMBL
3.5 Model Building
Many factors contribute to how well the model-building process
works for any particular program. As a result, how each stage is best
carried out depends on the detailed nature of the other stages.
Moreover, most people will not have access to equally sophisticated
versions of more than two or three fundamentally different
approaches and know how to apply them properly. Rather than
attempt a thorough exploration of the pluses and minuses of the
various methodologies here, we will describe in general terms the
workflow most commonly used in constructing QSPR models for
ADMET Predictor. That said, the basic elements of the workflow given are applicable to most other programs as well (a generic sketch in code follows the list):
1. The list of potential descriptors is filtered to remove any with
low variance and that show unacceptably high pairwise or
multilinear correlation with other descriptors.
2. The descriptors that remain are sorted in descending order of
sensitivity—i.e., the extent to which they themselves can
account for variation in the target property across the training
pool.3
3. An initial number of descriptors and a number of hidden-layer
neurons are selected for the first artificial neural network
ensemble architecture to be trained. Later, models with more
descriptors and/or more neurons in the hidden layer are
trained.
4. A set of 165 artificial neural networks (ANNs) are trained, each
with its own randomly assigned training and verification sets.
3 Where necessary, promising groups of descriptors are generated using a genetic algorithm (GA) [28].
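The generic sketch referred to above follows. It is not ADMET Predictor's implementation: it mimics the listed stages with scikit-learn, uses absolute correlation with the target as a crude stand-in for the sensitivity ranking, and trains a handful of multilayer perceptrons rather than a 165-network ensemble.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def build_ensemble(X, y, n_models=10, n_descriptors=20, hidden=8, seed=0):
    # 1. Remove near-constant descriptors, then prune highly correlated ones.
    X = VarianceThreshold(threshold=1e-4).fit_transform(X)
    corr = np.corrcoef(X, rowvar=False)
    keep = [i for i in range(X.shape[1])
            if not any(abs(corr[i, j]) > 0.95 for j in range(i))]
    X = X[:, keep]

    # 2. Rank survivors by a simple sensitivity proxy (|correlation with y|)
    #    and keep the top few for the first architecture to be trained.
    sens = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    top = np.argsort(sens)[::-1][:n_descriptors]
    X = X[:, top]

    # 3-4. Train an ensemble of small networks, each on its own random
    #      training/verification split; keep the verification score.
    models = []
    for k in range(n_models):
        X_tr, X_ver, y_tr, y_ver = train_test_split(
            X, y, test_size=0.2, random_state=seed + k)
        net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000,
                           random_state=seed + k).fit(X_tr, y_tr)
        models.append((net, net.score(X_ver, y_ver)))
    return models, top
```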
3.6 Performance Assessment
Model performance is assessed based on how well property values predicted for the external test set compare to the observed property
values. Aggregated performance statistics are important, but exam-
ining individual predictions is more apt to improve a model.
1. Pearson’s correlation coefficient (r2) between observed and
predicted values is widely used as a measure of a regression
model’s overall performance. It is a good measure of correla-
tion between variables but is suboptimal as a measure of pre-
dictive performance in that it can be relatively high even when
all predictions are off by the same amount or in the same
proportion. The root mean square error (RMSE) is generally
a more appropriate statistic to use:
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
2
y^ i y i
RMSE ¼
n
where yi is an observed value, ŷi is the corresponding predicted
value, and n is the number of observations in the test set. The
mean absolute error (MAE) is an analogous nonparametric
statistic that is less sensitive to distortion by extreme outliers.
2. Classification performance is sometimes measured by “accu-
racy,” which is the fraction of all predictions that are correct;
however, most QSPR data sets are unbalanced, in that the
positive class is typically underrepresented. In such cases, it
makes sense to evaluate performance on each class separately.
Doing so yields four statistics (a sketch of computing such per-class metrics follows).
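A minimal sketch of both kinds of statistics follows. The regression half implements the RMSE and MAE definitions above; for the classification half, sensitivity, specificity, and the two predictive values are shown as plausible per-class statistics, since the chapter's own list of four is not reproduced in this excerpt.

```python
import numpy as np

def regression_stats(y_obs, y_pred):
    """RMSE and MAE for a regression test set."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_pred - y_obs) ** 2))
    mae = np.mean(np.abs(y_pred - y_obs))
    return rmse, mae

def classwise_stats(y_true, y_hat):
    """Per-class statistics for a binary classifier (1 = positive class)."""
    y_true, y_hat = np.asarray(y_true, int), np.asarray(y_hat, int)
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    return {
        "sensitivity": tp / (tp + fn),   # fraction of positives recovered
        "specificity": tn / (tn + fp),   # fraction of negatives recovered
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }
```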
3.7 Analyzing Outliers
The process outlined above usually does not produce a finished QSPR model in the first round; rather, it provides a way to identify
more subtle kinds of error in the input data.
1. Overall performance statistics like RMSE are important for
comparing the performance of different regression models,
but examining plots of predicted vs observed property values
is much more informative and should always be done. Such a
visual examination highlights anomalously high or low predic-
tions. Outliers in the test set usually indicate a weakness in the
model, but outliers in the training set may be cases where the
experimental value is in error. Finding the latter kind of outlier
is good, since it indicates that the model is robust enough to
make an accurate prediction despite being given wrong infor-
mation. Identifying all of the data points in such a plot that
come from the same literature source as an outlier can indicate
a whole block of observations that are also mispredicted, but to a lesser degree because the model was able to partially offset the aggregate bias arising from the error. Correcting all of the errors almost always improves (decreases) the overall RMSE (a plotting sketch after this list illustrates highlighting points by source).
Fig. 2 Identifying systematic bias due to errors in reported units by examining plots of predicted versus observed CYP rat liver microsomal intrinsic clearance (RLM CLint). (LEFT) Highlighted points represent uncorrected values from Hibi et al. [30]. (RIGHT) Highlighted points represent uncorrected values from Röver et al. [31]
2. Highlighting blocks of observations from a single source can
identify data sources that generate no obvious outliers. Figure 2
shows results from an interim model that revealed two blocks
of flawed rat liver microsome clearance data. The compounds
in the lower block were too low by a factor of 1000. ChEMBL
provided the wrong units for the upper set of outliers—μL/
min/mg of protein were reported instead of mL/min per gm
of liver. The (erroneous) information that the first block of
compounds contributed to the model evidently overlapped
enough with the information from other compounds that the
model was unable to reconcile the discrepancy. The structures
in the second block were unique enough that the model was
able to accommodate the errors. Nonetheless, correcting the
error improved the overall fit and predictivity.
3. Classification models are not amenable to the analysis described
above, but the confidence estimates described above can serve a
similar function. It often happens that there are a handful of
predictions that every single network in the ensemble gets
wrong. This sometimes reflects experimental problems (e.g.,
in HTS data) that cannot be resolved, but it often reflects the
same kind of systematic errors that come out of the regression
analyses described above—i.e., cases where it is the input data
that are wrong, not the predictions.
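The plotting sketch mentioned in step 1 is shown below. It uses pandas and matplotlib on a purely hypothetical data set to flag large residuals and then highlight every compound that shares a literature source with a flagged outlier.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical curated data set: one row per compound, with the literature
# source ("doc_id"), the observed value, and the model's prediction.
df = pd.DataFrame({
    "doc_id":    ["paperA", "paperA", "paperB", "paperB", "paperC"],
    "observed":  [1.2, 0.8, 2.5, 2.9, 1.7],
    "predicted": [1.1, 0.9, 1.4, 1.8, 1.6],
})

# Flag compounds whose residual exceeds an arbitrary threshold, then mark
# every point that comes from the same source as a flagged outlier.
df["residual"] = df["predicted"] - df["observed"]
flagged_docs = df.loc[df["residual"].abs() > 1.0, "doc_id"].unique()
suspect = df["doc_id"].isin(flagged_docs)

plt.scatter(df.loc[~suspect, "observed"], df.loc[~suspect, "predicted"],
            label="other sources")
plt.scatter(df.loc[suspect, "observed"], df.loc[suspect, "predicted"],
            label="source shared with an outlier")
plt.xlabel("observed")
plt.ylabel("predicted")
plt.legend()
plt.show()
```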
3.8 Iterate
Once you have used your initial model to identify second-order
errors in the data set, you will be ready to repeat the process by
building a refined model. A second or third iteration may be
required before all the addressable errors have been found, but
the effort invested in improving the data will ultimately pay off in
increased accuracy and robustness of the resulting QSPR model.
4 Notes
Fig. 3 Structures retrieved from ChEMBL for a paper by Pevarello et al. [36]. The
image at the right is taken from Table 3 in that paper (adapted with permission.
Copyright 2005 American Chemical Society), which lays out the heterocyclic
para-substituents for a series of 2-phenylpropionamide analogs. The dotted line
shows where the background from the substituent for compound 22 overlaps
the structure for the substituent for compound 21, obscuring its carbonyl
oxygen. A check of the synthesis details section confirmed that 21 is, in fact,
a succinimide
Acknowledgments
References
1. Pragyan P, Kesharwani SS, Nandekar PP, Rathod V, Sangamwa AT (2014) Predicting drug metabolism by CYP1A1, CYP1A2, and CYP1B1: insights from MetaSite, molecular docking and quantum chemical calculations. Mol Divers 18(4):865–878
2. Houston JB, Kenworthy KE (2000) In vitro-in vivo scaling of CYP kinetic data not consistent with the classical Michaelis-Menten Model. Drug Metab Dispos 28(3):246–254
3. ADMET Predictor™. Simulations Plus Inc., Lancaster, CA, USA
4. RCSB Protein Data Bank Royal Society of Chemistry. https://www.rcsb.org/pdb/home/home.do
5. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36. https://doi.org/10.1021/ci00057a005
6. SMILES–A simplified chemical language. Daylight Chemical Information Systems, Inc. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
7. The IUPAC international chemical identifier (InChI). International union of pure and applied chemistry. https://iupac.org/who-we-are/divisions/division-details/inchi/
8. Stephen R Heller AM, Pletnev I, Stein S, Tchekhovskoi D (2015) InChI, the IUPAC international chemical identifier. J Cheminform 7:23
9. MedChem Designer™: Chemical structure drawing and property prediction. Simulations Plus, Inc. http://www.simulations-plus.com/software/medchem-designer/
10. ChEMBL. EMBL-EBI. https://www.ebi.ac.uk/chembl/
11. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Krüger FA, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington J (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:1083–1090. https://doi.org/10.1093/nar/gkt1031
12. Wang YL, Bryant SH, Cheng TJ, Wang JY, Gindulyte A, Shoemaker BA, Thiessen PA, He SQ, Zhang J (2017) PubChem BioAssay: 2017 update. Nucleic Acids Res 45(D1):D955–D963. https://doi.org/10.1093/nar/gkw1118
13. Waldman M, Fraczkiewicz R, Clark RD (2015) Tales from the war on error: the art and science of curating QSAR data. J Comput Aided Mol Des 29:897
14. AID 1996: Aqueous Solubility from MLSMR Stock Solutions (2009) Available via National Center for Biotechnology Information. https://pubchem.ncbi.nlm.nih.gov/bioassay/1996. Accessed Nov 2017
15. ChemSpider: Search and share chemistry. Royal Society of Chemistry. http://www.chemspider.com/
16. What is R? The R Foundation. https://www.r-project.org/about.html
17. Willighagen EL, Mayfield JW, Alvarsson J, Arvid Berg LC, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G, Evelo CT, Guha R, Steinbeck C (2017) The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform 9:33
18. About KNIME. KNIME. https://www.knime.com/. Accessed 17 Nov 2017
19. BIOVIA Pipeline Pilot. Dessault Systemes. http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/
20. Tosco P, Stiefl N, Landrum G (2014) Bringing the MMFF force field to the RDKit: implementation and validation. J Cheminform 6:37
21. Todeschini R, Consonni V (2000) Handbook of molecular descriptors. Wiley-VCH, Weinheim; New York
22. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–752. https://doi.org/10.1021/ci100050t
23. Hall LH, Kier LB (1995) Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. J Chem Inf Comput Sci 35:1039–1045
24. Yan A, Gasteiger J (2003) Prediction of aqueous solubility of organic compounds by topological descriptors. QSAR Comb Sci 22:821–829. https://doi.org/10.1002/qsar.200330822
25. Tiikkainen P, Bellis L, Light Y, Franke L (2013) Estimating error rates in bioactivity databases. J Chem Inf Model 53(10):2499–2505. https://doi.org/10.1021/ci400099q
26. May R, Maier H, GC D (2010) Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Netw 23:283–294
27. Clark RD (2003) Boosted leave-many-out cross-validation: the effect of training set and test set diversity on PLS statistics. J Comput Aided Mol Des 17:265–275
28. Žuvela P, Liu JJ, Macur K, Bączek T (2015) Molecular descriptor subset selection in theoretical peptide quantitative structure–retention relationship model development using nature-inspired optimization algorithms. Anal Chem 87(19):9876–9883. https://doi.org/10.1021/acs.analchem.5b02349
29. Clark RD, Liang W, Lee AC, Lawless MS, Fraczkiewicz R, Waldman M (2014) Using beta binomials to estimate classification uncertainty for ensemble models. J Cheminform 6(1):34
30. Hibi S, Ueno K, Nagato S, Kawano K, Ito K, Norimine Y, Takenaka O, Hanada T, Yonaga M (2012) Discovery of 2-(2-Oxo-1-phenyl-5-pyridin-2-yl-1,2-dihydropyridin-3-yl)benzonitrile (Perampanel): a novel, noncompetitive α-amino-3-hydroxy-5-methyl-4-isoxazolepropanoic Acid (AMPA) receptor antagonist. J Med Chem 55(23):10584–10600. https://doi.org/10.1021/jm301268u
31. Röver S, Andjelkovic M, Bénardeau A, Chaput E, Guba W, Hebeisen P, Mohr S, Nettekoven M, Obst U, Richter WF, Ullmer C, Waldmeier P, Wright MB (2013) 6-Alkoxy-5-aryl-3-pyridinecarboxamides, a new series of bioavailable cannabinoid receptor type 1 (CB1) antagonists including peripherally selective compounds. J Med Chem 56
Chapter 9
Abstract
This chapter provides a broad overview of ion mobility-mass spectrometry (IM-MS) and its applications in
separation science, with a focus on pharmaceutical applications. A general overview of fundamental ion
mobility (IM) theory is provided with descriptions of several contemporary instrument platforms which are
available commercially (i.e., drift tube and traveling wave IM). Recent applications of IM-MS toward the
evaluation of structural isomers are highlighted and placed in the context of both a separation and
characterization perspective. We conclude this chapter with a guided reference protocol for obtaining
routine IM-MS spectra on a commercially available uniform-field IM-MS.
Key words Isomers, Drugs, Conformation, Ion mobility spectrometry, Ion mobility-mass spectrom-
etry, IM-MS
1 Introduction
1.1 Isomers
Isomers are defined as compounds having the same molecular
formula but differing in their overall chemical structure [13]. Iso-
meric species are further subdivided into categories that reflect their
structural variations, which may include covalent bond rearrange-
ments (constitutional isomers), stereochemical variations (stereoi-
somers), or rotational isomers, commonly referred to as rotamers.
As constitutional isomers vary in skeletal structure between constit-
uent atoms, these isomers possess a broad scope of biological
activity based upon their particular structural arrangements. For
example, the molecular formula C8H9NO2 is reported to have
33 isomers by the PubChem database [14], and many of these
isomers have unique chemical behavior and physiological function
(Fig. 1). In one case, paracetamol (more commonly known as
acetaminophen) is a well-known analgesic, yet its constitutional
isomer, methyl anthranilate, functions as a bird repellent and a
flavor additive in drinks [15]. The structural makeup of constitu-
tional isomers can also vary widely depending on their biological
class. For example, lipid isomers typically vary in alkyl chain position
and cis/trans double bond positioning [16], while peptides tend to
Fig. 1 Structures related to four constitutional isomers of chemical formula C8H9NO2 and corresponding
function or typical use
Fig. 2 (a) Block diagram of a typical IM-MS instrument. Ions are separated in the presence of a neutral drift
gas by (b) a linear electric field along a series of ring electrodes (DTIMS) or (c) by a pulse wave generated by
applied sequential voltage along a series of ring electrodes (TWIMS)
1.2.1 Drift Tube Ion Mobility
In drift tube ion mobility (DTIMS), the IM region consists of a series of ring electrodes contained within a neutral drift gas (typi-
cally helium or nitrogen) (Fig. 2b) [38, 39]. DTIMS is operated at
one of two pressure regimes: low (1–10 Torr) and elevated
(ca. 760 Torr) pressures. Ion transmission is typically more efficient at reduced pressure, yet this also results in somewhat reduced IM resolving power owing to greater diffusion. As ions are introduced
into the drift tube, they are drawn through the drift region as a
result of an applied electric field along the ring electrodes. During
ion drift, the ions interact with the buffer gas at low energy, and
molecules with smaller rotationally averaged surface area (smaller
CCS) traverse the region faster as a result of fewer collisions.
Mathematically, the CCS of the analyte ion can be calculated
using the Mason-Schamp equation [27, 32] where K0 is the
measured mobility of the ion, z is the charge of the ion, T is the
temperature of the drift gas, and N0 is the number density of the
drift gas at standard temperature and pressure. The terms e and kB
are the elementary charge and Boltzmann’s constant, respectively:
$$\mathrm{CCS} = \frac{3ze}{16\,N_0}\left(\frac{2\pi}{\mu k_B T}\right)^{1/2}\frac{1}{K_0} \qquad (1)$$
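A minimal sketch of Eq. 1 in code is shown below, taking μ as the reduced mass of the ion-drift gas pair and N0 as the gas number density at standard temperature and pressure. The numerical inputs in the example call are illustrative, and real single-field measurements include corrections handled by the instrument software.

```python
import math

def ccs_from_mobility(K0_cm2_Vs, z, T_K, ion_mass_da, gas_mass_da):
    """Collision cross section (in square angstroms) from reduced mobility."""
    e = 1.602176634e-19    # elementary charge, C
    kB = 1.380649e-23      # Boltzmann constant, J/K
    N0 = 2.6867811e25      # gas number density at STP, m^-3
    da = 1.66053907e-27    # kg per dalton
    mu = ion_mass_da * gas_mass_da / (ion_mass_da + gas_mass_da) * da
    K0 = K0_cm2_Vs * 1e-4  # cm^2 V^-1 s^-1  ->  m^2 V^-1 s^-1
    ccs_m2 = (3 * z * e) / (16 * N0) * math.sqrt(2 * math.pi / (mu * kB * T_K)) / K0
    return ccs_m2 * 1e20   # m^2 -> square angstroms

# Illustrative call: a singly charged ~250 Da ion in nitrogen drift gas
print(round(ccs_from_mobility(K0_cm2_Vs=1.1, z=1, T_K=298.0,
                              ion_mass_da=250.0, gas_mass_da=28.0), 1))
```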
1.2.2 Traveling Wave Ion Mobility
Similar to DTIMS, a traveling wave ion mobility drift cell uses an inert buffer gas and a series of ring electrodes to move ions through
the drift region (Fig. 2c) [40, 41]. In TWIMS, ion pulses are
mobility separated by sequentially applying a direct current voltage
to the rings in a series along the drift cell to create a migrating
potential along the length of the cell. These sequential low-voltage
pulses generate waves of electric potential that push ions through
the drift region. As the wave propels the ions forward through the
device, low-energy elastic collisions occur between the analyte ions
and the buffer gas. Smaller ions experience fewer collisions with the
buffer gas and, as a result, traverse the drift region faster than larger
ions, resulting in shorter drift times. This mechanism is almost
identical to what is experienced in DTIMS, but with the exception
that larger ions are slower in TWIMS as a result of “falling over” the
wave pulses during their transit through the cell. The drift times are
converted to collision cross sections through a calibration proce-
dure which takes into account the drift times and CCS of known
internal standards [42, 43]. The ion mobility spectra obtained from
both TWIMS and DTIMS are qualitatively similar.
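One widely used calibration strategy fits a power law between calibrant drift times and their reference CCS values; the sketch below shows that idea in its simplest form, omitting the charge-state and reduced-mass corrections applied in published protocols [42, 43]. The calibrant numbers are placeholders, not real measurements.

```python
import numpy as np

# Placeholder calibrant drift times (ms) and reference CCS values (square angstroms)
cal_drift_ms = np.array([2.1, 3.4, 4.8, 6.5])
cal_ccs = np.array([120.0, 154.0, 181.0, 212.0])

# Fit ln(CCS) = ln(A) + B * ln(td), i.e., CCS ~ A * td**B
B, lnA = np.polyfit(np.log(cal_drift_ms), np.log(cal_ccs), 1)
A = np.exp(lnA)

def twims_ccs(drift_ms):
    """Convert an analyte drift time into a calibrated CCS estimate."""
    return A * drift_ms ** B

print(round(twims_ccs(4.0), 1))
```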
1.3 Current Work in Isomer Structural Separations
Historically, there are many reported studies where DTIMS and TWIMS have been used to observe both constitutional and conformational structures of large molecules, where structural differences
are significant and readily measured [44–46]. In this section, we
will present some recent examples of the use of TWIMS and
DTIMS to separate constitutional and conformational isomers in
small molecule systems, which have been given less attention in the
literature, but nonetheless are important avenues for developing
IM-MS for the separation and characterization of drug and drug-
like small molecule isomer systems.
[Fig. 3 image: panel (a) lists experimental CCS values of 132.4 ± 0.3 Å² (L-tert-leucine, 2S), 132.5 ± 0.2 Å² (D-tert-leucine, 2R), 132.9 ± 0.2 Å² (L-allo-isoleucine, 2S,3R), and 133.5 ± 0.3 Å² (L-isoleucine, 2S,3S); panel (b) shows relative abundance (%) overlays of the IM spectra.]
Fig. 3 (a) Leucine/isoleucine isomers with chemical formula C6H13NO2 examined in this study. Experimental
cross sections with respective standard deviations are shown at the right with corresponding stereochemistry.
(b) (I) Experimental IM spectrum overlays for all isomer compounds (standard error bars omitted for clarity).
(II) Overlay of the IM spectra corresponding to N,N-dimethylglycine ethyl ester, L-tert-leucine, and L-norleucine
and the IM spectrum corresponding to the mixture (black). (III) Overlays of L-isoleucine and L-leucine in
addition to the equal ratio mixture. (IV and V) Overlays of diastereomers and enantiomers, respectively.
Adapted with permission from ref. 24, Copyright 2017 American Chemical Society
1.3.2 Separation of Conformational Isomers
Another biological class where IM-MS has been utilized to facilitate the separation of isomers is carbohydrates. Carbohydrates, or sac-
charides, are a class of compounds which includes sugars, starches,
and cellulose. Carbohydrates are challenging systems to study with
most analytical techniques as they commonly exist as complex
mixtures in nature with variations in skeletal structure, bond coor-
dination, and stereochemistry (Fig. 4). Studies are further
Fig. 5 Drift time profiles of (R) and (S) thalidomide enantiomers with corresponding CCS observed in both
positive and negative mode (a and b, respectively) for nitrogen drift gas. Structures are illustrated at the
bottom for chiral reference
2 Materials
3 Methods
The method described below is for new users to the Agilent 6560
IM-MS instrument who have general knowledge of traditional
mass spectrometry operation. It is intended to provide the novice
user with the basic steps necessary to obtain routine IM-MS spec-
tra. It is not intended to provide an exhaustive description of all
instrument settings and their uses. The reader should consult their
service manual for extended instructions and a comprehensive
description of settings. Explicit notes are added following the
Subheading 3.
3.1 Preparing the Instrument for Direct Infusion
1. To ensure drift time and collision cross section reproducibility, check instrument pressures in each of the following instrument compartments: the high-pressure funnel region should be set
Fig. 6 Agilent MassHunter Workstation Data Acquisition Program tune page with callouts for various
commands
3.2 Data Acquisition (Consult Fig. 7)
1. Load the preexisting low mass method in the drag-down box in method editor, and click “Apply” (Fig. 7a). If the user is
interested in collecting collision cross sections, the method
should include a voltage gradient in the drift tube region in
order to subtract out the non-mobility flight time. This voltage
gradient can be accessed under the “Advanced Parameters” tab
of the method editor screen (Fig. 7b) (see Note 5).
2. Load the sample in the syringe and syringe pump for direct
infusion. Ensure that the syringe pump is set with the correct
syringe diameter in order to output the correct flow rates.
Although flow rate will vary based on specific instrument
source settings and sensitivity, the flow rate used here is
1–10 μL/min (see Note 6). Turn on the syringe pump. Once
sample reaches the instrument, analyte ion peaks begin to
appear in the mass window.
3. To set up the data acquisition, select the “Sample Run” tab at
the bottom of the method editor screen (Fig. 7c). In the
sample run mode, name the file (Fig. 7d), and select the path
directory (Fig. 7e) for your file.
4. To acquire data, click the forward arrow (Fig. 7f) in the sample
run screen.
Fig. 7 Sample run dialogue box including file directory information and sample name
3.3 Data Workup
1. After data acquisition is complete, load the “Agilent MassHun-
ter IM-MS Browser” software, and open the desired file.
Once the file has opened, select “Condense File” under the
“Actions” tab (Fig. 8a). Condensing files will compress the
data from each experimental sequence into a single frame,
which is convenient for viewing multiple segment runs (e.g.,
CCS experiments) or long infusion experiments. Note that
condensing the file is an optional step.
2. In IM-MS Browser, you can view the resulting mass spectrum
(Fig. 8b), the drift spectrum data (Fig. 8c), and the 2-D plot of
mass-to-charge vs drift time (Fig. 8d) for all ions in the sample.
In the example spectrum of thalidomide (Fig. 8e), we will focus
on the mass-to-charge ion at 259.0708, [M+H]+.
3. Expand the Counts vs. Mass-To-Charge window by right-
clicking and holding on the mass axis and dragging over the
desired mass range. (To expand or contract any axis, right-click
hold any axis, and move the mouse right or left.) To move the
peak right or left, left click and hold the axis and drag accord-
ingly (Fig. 8b).
4. To obtain a drift time spectrum for a specific mass-to-charge
region, right-click and drag over the desired ion in the drift
time vs m/z region. This produces a box around the desired ion
Fig. 8 Agilent MassHunter IM-MS Browser interface. (a) Mass spectrum of thalidomide, (b) drift spectra window with expanded window for thalidomide drift profile, and (c) 2-D IM-MS window with callout of the thalidomide [M+H]+ ion species
4 Notes
Acknowledgments
This work was supported in part using the resources of the Center
for Innovative Technology at Vanderbilt University. Financial sup-
port for aspects of this research was provided by The National
Institutes of Health (NIH Grant R01GM092218) and under Assis-
tance Agreement No. 83573601 awarded by the US Environmen-
tal Protection Agency (EPA). This work has not been formally
reviewed by the EPA, and the EPA does not endorse any products
or commercial services mentioned in this publication. Further-
more, the content is solely the responsibility of the authors and
should not be interpreted as representing the official views and
policies, either expressed or implied, of the funding agencies and
organizations.
References
1. Chail H (2008) DNA sequencing technologies determination of inorganic impurities in drugs
key to the human genome project. Nature and pharmaceuticals. J Pharm Biomed Anal
Education 1:219 43:1–13
2. Pareek CS, Smoczynski R, Tretyn A (2011) 9. Kauppila TJ, Wiseman JM, Ketola RA,
Sequencing technologies and genome Kotiaho T, Cooks RG, Kostiainen R (2006)
sequencing. J Appl Genet 52:413–435 Desorption electrospray ionization mass spec-
3. Zhang JH, Chung TD, Oldenburg KR (1999) trometry for the analysis of pharmaceuticals
A simple statistical parameter for use in evalua- and metabolites. Rapid Commun Mass Spec-
tion and validation of high throughput screen- trom 20:387–392
ing assays. J Biomol Screen 4:67–73 10. Cooks RG (1995) Special feature: historical.
4. Takenaka T (2001) Classical vs reverse pharma- Collision-induced dissociation: readings and
cology in drug discovery. BJU Int 88:7–10 commentary. J Mass Spectrom 30:1215–1221
5. Harvey AL, Edrada-Ebel R, Quinn RJ (2015) 11. Wells JM, McLuckey SA (2005) Collision-
The re-emergence of natural products for drug induced dissociation (CID) of peptides and
discovery in the genomics era. Nat Rev Drug proteins. Methods Enzymol 402:148–185
Discov 14:111–129 12. Nguyen LA, He H, Pham-Huy C (2006) Chi-
6. Vaidya ADB (2014) Reverse pharmacology-a ral drugs: an overview. Int J Biomed Sci
paradigm shift for drug discovery and develop- 2:85–100
ment. Curr Res Drug Discov 1:39–44 13. McMurry J (2008) Organic chemistry, 7th
7. Roses AD (2008) Pharmacogenetics in drug edn. Cengage Learning, Stamford, CT
discovery and development: a translational per- 14. NCBI.NLM.NIH.Gov. Search terms
spective. Nat Rev Drug Discov 7:807–817 “C8H9NO2.” Accessed 18 Apr 2017
8. Nageswara Rao R, Talluri MV (2007) An over- 15. Ferreres F, Giner JM, Tomás-Barberán FA
view of recent applications of inductively cou- (1994) A comparative study of hesperetin and
pled plasma-mass spectrometry (ICP-MS) in methyl anthranilate as markers of the floral
Isomeric and Conformational Analysis of Small Drug and Drug-Like Molecules. . . 177
origin of citrus honey. J Sci Food Agric collision cross sections measured on a proto-
65:371–372 type high resolution drift tube ion mobility-
16. Groessl M, Graf S, Knochenmuss R (2015) mass spectrometer. Anal Chem 86:2107–2116
High resolution ion mobility-mass spectrome- 28. Web of Science. Thomson Reuters. Search
try for separation and identification of isomeric terms “Ion Mobility” AND “Mass Spectrome-
lipids. Analyst 140:6904–6911 try.” Articles from 2002 to 2017. Accessed
17. Xiao Y, Vecchi MM, Wen D (2016) Distin- 15 May 2017
guishing between leucine and isoleucine by 29. Paglia G, Astarita G (2017) Metabolomics and
integrated LC-MS analysis using Orbitrap lipidomics using traveling-wave ion mobility
fusion mass spectrometer. Anal Chem mass spectrometry. Nat Protoc 12:797–813
88:10757–10766 30. Stow SM, Lareau NM, Hines KM, McNees
18. Takayama K, Kilburn JO (1989) Inhibition of CR, Goodwin CR, Bachmann BO, McLean
synthesis of arabinogalactan by ethambutol in JA (2014) In: Havlı́ček V, Spı́žek J (eds) Natu-
mycobacterium smegmatis. Antimicrob Agents ral products analysis: instrumentation, meth-
Chemother 33:1493–1499 ods, and applications. John Wiley & Sons,
19. Chatterjee VK, Buchanan DR, Friedmann AI, Inc., Hoboken, NJ, pp 397–432
Green M (1986) Ocular toxicity following eth- 31. Sundarapandian S, May JC, McLean JA (2010)
ambutol in standard dosage. Br J Dis Chest Dual source ion mobility mass-spectrometer
80:288–291 for direct comparison of ESI and MALDI col-
20. Carey R (1996) Organic chemistry, 3rd edn. lision cross section measurements. Anal Chem
McGraw Hill, New York, pp 89–92 82:3247–3254
21. Kothiwale S, Mendenhall JL, Meiler J (2015) 32. Mason EA, McDaniel EW (1988) Transport
BCL::CONF: small molecule conformational properties of ions in gases. John Wiley and
sampling using a knowledge based rotamer Sons, Indianapolis, IN
library. J Cheminform 7:47 33. Glaskin RS, Valentine SJ, Clemmer DE (2010)
22. Paglia G, Williams JP, Menikarachchi L, A scanning frequency mode for ion cyclotron
Thompson JW, Tyldesley-Worster R, mobility spectrometry. Anal Chem
Halldórsson S, Rolfsson O, Moseley A, 82:8266–8271
Grant D, Langridge J, Palsson BO, Astarita G 34. Cumeras R, Figueras E, Davis CE, Baumbach
(2014) Ion mobility derived collision cross sec- JI, Grácia I (2015) Review on ion mobility
tions to support metabolomics applications. spectrometry. Part 1: current instrumentation.
Anal Chem 86:3985–3993 Analyst 140:1376–1390
23. Enders JR, McLean JA (2009) Chiral and 35. Cumeras R, Figueras E, Davis CE, Baumbach
structural analysis of biomolecules using mass JI, Grácia I (2015) Review on ion mobility
spectrometry and ion mobility –mass spec- spectrometry. Part 2: hyphenated methods
trometry. Chirality 21:253–264 and effects of experimental parameters. Analyst
24. Dodds JN, May JC, McLean JA (2017) Inves- 140:1391–1410
tigation of the complete suite of the leucine 36. Adamov A, Mauriala T, Teplov V, Laakia J,
and isoleucine isomers: toward prediction of Pedersen CS, Kotiaho T, Sysoev AA (2010)
ion mobility separation capabilities. Anal Characterization of a high resolution drift
Chem 89:952–959 tube ion mobility spectrometer with a multi-
25. Pringle SD, Giles K, Wildgoose JL, Williams ion source platform. Int J Mass Spectrom
JP, Slade SE, Thalassinos K, Bateman RH, 298:24–29
Bowers MT, Scrivens JH (2007) An investiga- 37. Kanu AB, Dwivedi P, Tam M, Matz L, Hill HH
tion of the mobility separation of some peptide Jr (2008) Ion mobility–mass spectrometry. J
and protein ions using a new hybrid quadru- Mass Spectrom 43:1–22
pole/travelling wave IMS/oa-ToF instrument. 38. Jurneczko E, Kalapothakis J, Campuzano ID,
Int J Mass Spectrom 261:1–12 Morris M, Barran PE (2012) Effects of drift gas
26. May JC, McLean JA (2015) Ion mobility-mass on collision cross sections of a protein standard
spectrometry: time-dispersive instrumentation. in linear drift tube and traveling wave ion
Anal Chem 87:1422–1436 mobility mass spectrometry. Anal Chem
27. May JC, Goodwin CR, Lareau NM, Leaptrot 84:8524–8531
KL, Morris CB, Ruwam T, Kurulugama RT, 39. Ujma J, Giles K, Morris M, Barran PE (2016)
Mordehai A, Klein C, Barry W, Darland E, New high resolution ion mobility mass spec-
Overney G, Imatani K, Stafford GC, Fjeldsted trometer capable of measurements of collision
JC, McLean JA (2014) Conformational order- cross sections from 150 to 520 K. Anal Chem
ing of biomolecules in the gas phase: nitrogen 88:9469–9478
178 Shawn T. Phillips et al.
40. Giles K, Williams JP, Campuzano I (2011) ion mobility-mass spectrometry. Biochimica et
Enhancements in travelling wave ion mobility Biophusica Acta (BBA)-Molecular and Cell
resolution. Rapid Commun Mass Spectrom Biology of Lipids 1811:935–945
25:1559–1566 48. Gaye MM, Nagy G, Clemmer DE, Pohl NL
41. Shvartsburg AA, Smith RD (2008) Fundamen- (2016) Multidimensional analysis of 16 glucose
tals of traveling wave ion mobility spectrome- isomers by ion mobility spectrometry. Anal
try. Anal Chem 80:9689–9699 Chem 88:2335–2344
42. Bush MF, Campuzano ID, Robinson CV 49. Fenn LS, McLean JA (2013) Structural separa-
(2012) Ion mobility mass spectrometry of pep- tions by ion mobility-MS for glycomics and
tide ions: effects of drift gas and calibration glycoproteomics. Methods Mol Biol
strategies. Anal Chem 84:7124–7130 951:171–194
43. Hines KM, May JC, McLean JA, Xu L (2016) 50. Lalli PM, Corilo YE, Rowland SM, Marshall
Evaluation of collision cross section calibrants AG, Rodgers RP (2015) Isomeric separation
for structural analysis of lipids by traveling wave and structural characterization of acids in
ion mobility-mass spectrometry. Anal Chem petroleum by ion mobility mass spectrometry.
88:7329–7336 Energy Fuel 29:3626–3633
44. Lanucara F, Holman SW, Gray CJ, Eyers CE 51. Barnett DA, Ells B, Guevremont R, Purves RW
(2014) The power of ion mobility-mass spec- (1999) Separation of leucine and isoleucine by
trometry for structural characterization and the electrospray ionization-high field asymmetric
study of conformational dynamics. Nat Chem waveform ion mobility spectrometry-mass
6:281–294 spectrometry. J Am Chem Soc 10:1279–1284
45. May JC, McLean JA (2015) A uniform field ion 52. Knapman TW, Berryman JT, Campuzano I,
mobility of melittin and implications of Harris SA, Ashcroft AE (2010) Considerations
low-field mobility for resolving fine cross- in experimental and theoretical collision cross-
sectional detail in peptide and protein experi- section measurements of small molecules using
ments. Proteomics 15:2862–2871 travelling wave ion mobility spectrometry-mass
46. Shvartsburg AA, Tang K, Smith RD (2009) spectrometry. Int J Mass Spectrom 298:17–23
Two-dimensional ion mobility analyses of pro- 53. Li H, Bendiak B, Siems WF, Gang DR, Hill
teins and peptides. Methods Mol Biol HH Jr (2013) Ion mobility mass spectrometry
492:417–445 analysis of isomeric disaccharide precursor,
47. Kilman M, May JC, McLean JA (2011) Lipid product and cluster ions. Rapid Commun
analysis and lipidomics by structually selective Mass Spectrom 27:2699–2709
Part III
Chapter 10
Abstract
In the era of big data and informatics, computational integration of data across the hierarchical structures of human biology enables the discovery of new druggable targets of disease and new modes of action of a drug. We present herein a computational framework and guide for integrating drug targets, gene expression data, transcription factors, and prior knowledge of protein interactions to computationally construct the signaling network (mode of action) of a drug. In a similar manner, a disease network is constructed using its disease targets. Drug candidates are then prioritized by computationally ranking the closeness between the disease network and each drug’s signaling network. Furthermore, we describe the use of the most perturbed HLA genes to assess the safety risk for immune-mediated adverse reactions such as Stevens-Johnson syndrome/toxic epidermal necrolysis.
Key words Gene expression, Integer linear programming, Drug targets, Network protein interac-
tions, HLA genes
1 Introduction
[Fig. 1 panel text: Data compilation (drug list, drug targets); Precision medicine (prioritizing the drug list using a distance metric between a drug’s and the disease’s networks, illustrated with anthrax disease); Personalized risk assessment (subjecting an individual patient’s cells to a drug for transcriptomic profiling, identifying each patient’s HLA genes significantly perturbed by the drug, and referencing the literature for HLA genes associated with severe/fatal adverse drug reactions to assess the risk for individual patients).]
Fig. 1 The framework of a computational platform and guide for precision medicine in which the steps consist of (a) compilation of the needed data from databases, (b) computational construction of a drug’s mode of action and a disease network, (c) prioritizing drug candidates, and (d) individualized assessment of treatment-related severe/fatal adverse reactions
2 Materials
2.2 Genome-Wide Transcriptional Expression Data
There are multiple large-scale biological databases available [11] with some of them dedicated to transcriptional expression data. ArrayExpress [12] and Gene Expression Omnibus (GEO) [13]
organized by the National Center for Biotechnology Information
(NCBI) are the two main sources of publicly available gene expres-
sion data. As of today, ArrayExpress contains 69,728 experiments
performed using 2,201,027 different assays and 45.30 terabytes of
184 Ioannis N. Melas et al.
Table 1
Databases of protein interactions and transcription factors described in Subheading 2
2.4 Transcription Factors
Though transcription factor (TF) databases are not as abundant as PPI ones, several are available, as summarized in Table 1: DBD and JASPAR include reviewed as well as predicted TFs, while more conservative databases such as AnimalTFDB, Swiss-Prot, and TRANSFAC host manually curated lists of TFs. Once the appropriate database is selected and the data are parsed into the appropriate format, the TF-gene interactions can be added to the prior knowledge network as regular edges.
3 Methods
Table 2
List of computational algorithms and the needed optimizer for constructing a drug’s mode of action

Name | Nature of computational algorithm | URL link for tool or source code
Integer linear programming | Models the mechanics of signal transduction from one node to the next through the signaling network | http://pubs.rsc.org/en/content/articlelanding/ib/2015/C4ib00294f#!divAbstract
IBM ILOG CPLEX | A mathematical optimizer for ILP | http://www-03.ibm.com/software/products/en/ibmilogcpleoptistud
CellNOptR | Genetic algorithm to reconstruct signaling topologies | Implementation in Python: https://github.com/cellnopt/cellnopt or Cytoscape: http://www.cellnopt.org/cytocopter/
X2K | An enrichment-based approach to identifying upstream regulators of differentially expressed genes | http://www.maayanlab.net/X2K
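To make the ILP idea in Table 2 concrete, the toy sketch below uses the open-source PuLP package (with its bundled CBC solver rather than CPLEX) to choose node signs that propagate from a perturbed drug target along signed edges while disagreeing as little as possible with an observed expression signature. The miniature network, signature, and formulation are purely illustrative and are not the published algorithm.

```python
import pulp

# Toy signed network and observed differential-expression signs (illustrative)
edges = {("drugTarget", "TF1"): +1, ("TF1", "geneA"): +1, ("TF1", "geneB"): -1}
observed = {"geneA": +1, "geneB": +1}
nodes = {n for edge in edges for n in edge}

prob = pulp.LpProblem("toy_signaling_ilp", pulp.LpMinimize)

# Node activity encoded by two binaries, so up[n] - down[n] is in {-1, 0, +1}
up = pulp.LpVariable.dicts("up", nodes, cat="Binary")
down = pulp.LpVariable.dicts("down", nodes, cat="Binary")
for n in nodes:
    prob += up[n] + down[n] <= 1            # each node is up, down, or inactive

prob += down["drugTarget"] == 1             # the drug inhibits its target

# Signal only flows downstream (single-parent toy case): a node may take a
# sign only if its parent drives it through a consistently signed edge.
for (u, v), w in edges.items():
    if w > 0:
        prob += up[v] <= up[u]
        prob += down[v] <= down[u]
    else:
        prob += up[v] <= down[u]
        prob += down[v] <= up[u]

# Binary mismatch indicators between predicted and observed gene signs
mismatch = pulp.LpVariable.dicts("mismatch", observed, cat="Binary")
for g, sign in observed.items():
    pred = up[g] - down[g]
    prob += pred - sign <= 2 * mismatch[g]
    prob += sign - pred <= 2 * mismatch[g]

prob += pulp.lpSum(mismatch.values())       # minimize total disagreement
prob.solve()
print({n: int(pulp.value(up[n]) - pulp.value(down[n])) for n in nodes})
```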
3.1.3 Applying ILP to Construct the Network of a Druggable Target for a Disease
As described above, the ILP algorithm requires a gene expression signature and (optionally) a set of initially perturbed nodes. Drugs offer an ideal application for this since most drugs have known targets and their gene expression signature can be retrieved from
databases like CMap. However, other types of external perturba-
tion, such as infections, can be analyzed in the same way. In our
recent publication [21], we used the same technique to analyze the
Table 3
List of web-based tools for identifying druggable targets for a disease
Table 4
The 2 × 2 contingency table
3.1.5 Computing and Prioritizing the Drug Candidates Using the Shortest Distance or Scoring Their Ability to Reverse the Disease Network for Drug Repurposing
Having calculated the mode of action of candidate drugs based on their transcriptomic profile, their fitness to a specific indication may be prioritized and predicted by calculating the overlap of their modes of action with known disease mechanisms. In the work by Lamb et al. [36], the authors demonstrated that compounds which reverse the transcriptomic profile of a disease may have a therapeutic effect. Melas et al. [3] advanced this idea further and showed that compounds whose modes of action disrupt the key signaling
pathways deregulated in a disease may also have a therapeutic effect
for treating that disease. Assuming the signaling pathways in a
disease are known (see Note 7), the overlap with a drug’s mode of
action is quantified via Fisher’s exact test on the included gene sets.
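A minimal sketch of that overlap test follows, using scipy.stats.fisher_exact on the kind of 2 × 2 contingency table summarized in Table 4; the gene sets and background size are invented for illustration.

```python
from scipy import stats

background = 20000                                    # genes in the prior knowledge network
drug_genes = {"TP53", "MYC", "EGFR", "STAT3", "JUN"}  # drug mode-of-action nodes (invented)
disease_genes = {"EGFR", "STAT3", "AKT1", "MTOR"}     # disease network nodes (invented)

overlap = len(drug_genes & disease_genes)
drug_only = len(drug_genes - disease_genes)
disease_only = len(disease_genes - drug_genes)
neither = background - overlap - drug_only - disease_only

# 2 x 2 contingency table: rows = in drug set or not, columns = in disease set or not
table = [[overlap, drug_only], [disease_only, neither]]
odds_ratio, p_value = stats.fisher_exact(table, alternative="greater")
print(odds_ratio, p_value)
```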
One approach would be to take into account the directionality
of the signature [21]. In particular, instead of identifying the drug
whose mode of action had the highest overlap with the disease, they
identified the drug that most closely resembled the “reversed”
disease network. The ILP algorithm produces a qualitative descrip-
tion of the signaling effects where proteins assume discrete states,
so the “reverse” network is just a network where the signs of the
nodes are flipped. In this context, the distance of two networks is
equal to the Euclidean distance between the node signatures of the
two networks (drug and disease).
Other distance metrics are also an option. For example the
Manhattan and Hamming distances were also considered as candi-
dates. Since the compared vectors are defined in the discrete space
of {−1, 0, 1}^N, where N is the number of nodes, different distance metrics amount to different trade-offs between predicting opposite effects (−1 vs 1) and predicting non-effects (−1 or 1 vs 0). In
Table 5 are listed the misclassification costs for the different
metrics.
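The ranking step can be sketched in a few lines of NumPy: flip the disease signature and sort candidate drugs by their Euclidean distance to it. The signatures below are invented; Manhattan or Hamming distances can be substituted as discussed above.

```python
import numpy as np

# Node signatures over a common node set, each value in {-1, 0, +1} (invented data)
disease = np.array([1, -1, 0, 1, -1])
drugs = {
    "drugA": np.array([-1, 1, 0, -1, 1]),
    "drugB": np.array([1, -1, 0, 0, -1]),
}

# A drug that reverses the disease network should lie close to the flipped signature
target = -disease
ranking = sorted((np.linalg.norm(sig - target), name) for name, sig in drugs.items())
for dist, name in ranking:
    print(name, dist)

# Alternatives: np.sum(np.abs(sig - target)) (Manhattan) or np.mean(sig != target) (Hamming)
```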
In a recent work [21], we opted for the Euclidean distance in
order to maximize the pharmacological effect of a drug on the
network of the disease at hand or minimize the synergies between
Table 5
Misclassification costs for the different metrics
4 Notes
Acknowledgments
References
1. Hur J, Guo AY, Loh WY, Feldman EL, Bai JP (2014) Integrated systems pharmacology analysis of clinical drug-induced peripheral neuropathy. CPT Pharmacometrics Syst Pharmacol 3:e114. https://doi.org/10.1038/psp.2014.11
2. Hur J, Zhao C, Bai JP (2015) Systems pharmacological analysis of drugs inducing Stevens-Johnson syndrome and toxic epidermal necrolysis. Chem Res Toxicol 28(5):927–934. https://doi.org/10.1021/tx5005248
3. Melas IN, Sakellaropoulos T, Iorio F, Alexopoulos LG, Loh WY, Lauffenburger DA, Saez-Rodriguez J, Bai JP (2015) Identification of drug-specific pathways based on gene expression data: application to drug induced lung injury. Integr Biol (Camb) 7(8):904–920. https://doi.org/10.1039/c4ib00294f
4. Hur J, Liu Z, Tong W, Laaksonen R, Bai JP (2014) Drug-induced rhabdomyolysis: from systems pharmacology analysis to biochemical flux. Chem Res Toxicol 27(3):421–432. https://doi.org/10.1021/tx400409c
5. Melas IN, Samaga R, Alexopoulos LG, Klamt S (2013) Detecting and removing inconsistencies between experimental data and signaling network topologies using integer linear programming on interaction graphs. PLoS Comput Biol 9(9):e1003204. https://doi.org/10.1371/journal.pcbi.1003204
6. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue):D668–D672. https://doi.org/10.1093/nar/gkj067
7. Ursu O, Holmes J, Knockel J, Bologa CG, Yang JJ, Mathias SL, Nelson SJ, Oprea TI (2017) DrugCentral: online drug compendium. Nucleic Acids Res 45(D1):D932–D939. https://doi.org/10.1093/nar/gkw993
8. Santos R, Ursu O, Gaulton A, Bento AP, Donadi RS, Bologa CG, Karlsson A, Al-Lazikani B, Hersey A, Oprea TI, Overington JP (2017) A comprehensive map of molecular drug targets. Nat Rev Drug Discov 16(1):19–34. https://doi.org/10.1038/nrd.2016.230
9. Yang H, Qin C, Li YH, Tao L, Zhou J, Yu CY, Xu F, Chen Z, Zhu F, Chen YZ (2016) Therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information. Nucleic Acids Res 44(D1):D1069–D1074. https://doi.org/10.1093/nar/gkv1230
10. Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M (2016) STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res 44(D1):D380–D384. https://doi.org/10.1093/nar/gkv1277
11. Zou D, Ma L, Yu J, Zhang Z (2015) Biological databases for human research. Genomics Proteomics Bioinformatics 13(1):55–63. https://doi.org/10.1016/j.gpb.2015.01.006
12. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, Kurbatova N, Lukk M, Malone J, Mani R, Pilicheva E, Rustici G, Sharma A, Williams E, Adamusiak T, Brandizi M, Sklyar N, Brazma A (2011) ArrayExpress update – an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 39(Database issue):D1002–D1004. https://doi.org/10.1093/nar/gkq1040
13. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A (2013) NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res 41(Database issue):D991–D995. https://doi.org/10.1093/nar/gks1193
14. DtoxS – drug toxicity signature generation center – data & resources. https://martip03.u.hpc.mssm.edu/about.php. Accessed Feb 2017
15. Lamb J (2007) The connectivity map: a new tool for biomedical research. Nat Rev Cancer 7(1):54–60. https://doi.org/10.1038/nrc2044
16. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, Murino L, Tagliaferri R, Brunetti-Pierri N, Isacchi A, di Bernardo D (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S A 107(33):14621–14626. https://doi.org/10.1073/pnas.1000138107
17. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 57(1):289–300
18. Terfve C, Cokelaer T, Henriques D, MacNamara A, Goncalves E, Morris MK, van Iersel M, Lauffenburger DA, Saez-Rodriguez J (2012) CellNOptR: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms. BMC Syst Biol 6:133. https://doi.org/10.1186/1752-0509-6-133
19. Chen EY, Xu H, Gordonov S, Lim MP, Perkins MH, Ma'ayan A (2012) Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers. Bioinformatics 28(1):105–111. https://doi.org/10.1093/bioinformatics/btr625
20. Wu G, Feng X, Stein L (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biol 11(5):R53. https://doi.org/10.1186/gb-2010-11-5-r53
21. Bai JP, Sakellaropoulos T, Alexopoulos LG (2017) A biologically-based computational approach to drug repurposing for anthrax infection. Toxins 9(3):E99
22. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30
23. Vastrik I, D'Eustachio P, Schmidt E, Joshi-Tope G, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol 8(3):R39. https://doi.org/10.1186/gb-2007-8-3-r39
24. Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P (2015) The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst 1(6):417–425. https://doi.org/10.1016/j.cels.2015.12.004
25. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550. https://doi.org/10.1073/pnas.0506580102
26. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34(3):267–273. https://doi.org/10.1038/ng1180
27. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA (2003) DAVID: database for annotation, visualization, and integrated discovery. Genome Biol 4(5):3
28. Huang da W, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57. https://doi.org/10.1038/nprot.2008.211
29. Sartor MA, Mahavisno V, Keshamouni VG, Cavalcoli J, Wright Z, Karnovsky A, Kuick R, Jagadish HV, Mirel B, Weymouth T, Athey B, Omenn GS (2010) ConceptGen: a gene set enrichment and gene set relation mapping tool. Bioinformatics 26(4):456–463. https://doi.org/10.1093/bioinformatics/btp683
30. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, Clark NR, Ma'ayan A (2013) Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14:128. https://doi.org/10.1186/1471-2105-14-128
31. Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, Thomas PD (2017) PANTHER version 11: expanded annotation data from gene ontology and reactome pathways, and data analysis tool enhancements. Nucleic Acids Res 45(D1):D183–D189. https://doi.org/10.1093/nar/gkw1138
32. Ghiassian SD, Menche J, Barabasi AL (2015) A DIseAse MOdule detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS Comput Biol 11(4):e1004120. https://doi.org/10.1371/journal.pcbi.1004120
33. Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761–764. https://doi.org/10.1038/nature09182
34. Newman M (2010) Networks: an introduction. OUP, Oxford
35. Scardoni G, Petterlini M, Laudanna C (2009) Analyzing biological network parameters with CentiScaPe. Bioinformatics 25(21):2857–2859. https://doi.org/10.1093/bioinformatics/btp517
36. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935. https://doi.org/10.1126/science.1132939
37. Chen P, Lin JJ, Lu CS, Ong CT, Hsieh PF, Yang CC, Tai CT, Wu SL, Lu CH, Hsu YC, Yu HY, Ro LS, Lu CT, Chu CC, Tsai JJ, Su YH, Lan SH, Sung SF, Lin SY, Chuang HP, Huang LC, Chen YJ, Tsai PJ, Liao HT, Lin YH, Chen CH, Chung WH, Hung SI, Wu JY, Chang CF, Chen L, Chen YT, Shen CY, Taiwan SJSC (2011) Carbamazepine-induced toxic effects and HLA-B*1502 screening in Taiwan. N Engl J Med 364(12):1126–1133. https://doi.org/10.1056/NEJMoa1009717
38. Pavlos R, Mallal S, Phillips E (2012) HLA and pharmacogenetics of drug hypersensitivity. Pharmacogenomics 13(11):1285–1306. https://doi.org/10.2217/pgs.12.108
39. Daly AK, Donaldson PT, Bhatnagar P, Shen Y, Pe'er I, Floratos A, Daly MJ, Goldstein DB, John S, Nelson MR, Graham J, Park BK, Dillon JF, Bernal W, Cordell HJ, Pirmohamed M, Aithal GP, Day CP, Study D, International SAEC (2009) HLA-B*5701 genotype is a major determinant of drug-induced liver injury due to flucloxacillin. Nat Genet 41(7):816–819. https://doi.org/10.1038/ng.379
Chapter 11
Abstract
Systems pharmacology aims to understand drug actions across multiple scales, from the atomic details of drug-target interactions to the emergent properties of biological networks, and to rationally design drugs that target an interacting network instead of a single gene. Multifaceted data-driven studies, including machine learning-based predictions, play a key role in systems pharmacology. In such work, the integration of multiple omics data sets is the key initial step, followed by optimization and prediction. Here, we describe the overall procedure for drug-target association prediction using REMAP, a large-scale off-target prediction tool. The method introduced here can be applied to other relation inference problems in systems pharmacology.
Key words Big data, Machine learning, Collaborative filtering, Drug repurposing, Off-target
identification
1 Introduction
2 Materials
3 Methods
Fig. 1 Drug-target network and its matrix representation. (a) Drug-target associations can be represented as a network containing three types of connections. A bipartite graph connects drugs and targets. The solid pink lines and the blue dashed lines represent the known drug-target associations and the unknown off-targets, respectively. The sky-blue lines and the green lines represent the chemical-chemical similarity and protein-protein similarity networks, respectively. For conciseness, only six drugs and five targets are drawn here. (b) The matrix representation of the network in (a). The matrices D, T, and R represent the drug-drug similarity, target-target similarity, and known drug-target association networks, respectively. Color-filled entries of the matrices represent the connections in (a). The values in D and T are nonnegative numbers with a maximum of 1 for identity. The values in R are either 0 or 1, where 1 represents a known association. Note that the entries for the unknown off-targets in the matrix R (blue-filled) start at 0, and the goal is to predict these unknown associations using the known associations and the two similarity matrices
Since there are 12,384 unique chemicals and 3500 unique proteins in the ZINC data set, the chemical-protein association matrix must be 12,384 by 3500, the chemical-chemical similarity matrix must be 12,384 by 12,384, and the protein-protein similarity matrix must be 3500 by 3500. For more details, refer to Lim et al. [20].
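As an illustration of the expected shapes only (a sketch with toy index data standing in for the parsed ZINC files, not the chapter's actual Matlab workflow), the three inputs can be held as sparse matrices whose dimensions follow directly from the numbers of unique chemicals and proteins:

import numpy as np
from scipy.sparse import coo_matrix

n_chem, n_prot = 12384, 3500

# toy index pairs standing in for the parsed association and similarity files (hypothetical values)
chem_idx, prot_idx = np.array([0, 1, 2]), np.array([10, 20, 30])
R = coo_matrix((np.ones(3), (chem_idx, prot_idx)), shape=(n_chem, n_prot))   # known associations (0/1)

ci, cj, csim = np.array([0, 1]), np.array([1, 0]), np.array([0.8, 0.8])
D = coo_matrix((csim, (ci, cj)), shape=(n_chem, n_chem))                      # chemical-chemical similarity

pi, pj, psim = np.array([10, 20]), np.array([20, 10]), np.array([0.6, 0.6])
T = coo_matrix((psim, (pi, pj)), shape=(n_prot, n_prot))                      # protein-protein similarity

print(R.shape, D.shape, T.shape)   # (12384, 3500) (12384, 12384) (3500, 3500)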
3.2 Command-Line Notations
We will use both bash commands and Matlab commands. To minimize confusion, the commands are set in Courier New font with gray shading. Bash commands start with a dollar sign, and Matlab commands start with two greater-than signs. We will not use multiple commands in one sentence. In cases where a one-line command is written on multiple lines of the page, we will specify that the command is a one-line command. For example, $ echo This is a bash command! means to type "echo This is a bash command!" on a terminal screen, and >> disp('This is a Matlab command!'); disp('Big Data in Drug Discovery'); means to type "disp('This is a Matlab command!'); disp('Big Data in Drug Discovery');" on the Matlab command line. A long Matlab command can be separated at each semicolon. The Matlab command above can be entered either as a one-line command or as two lines, such as >> disp('This is a Matlab command!'); followed by >> disp('Big Data in Drug Discovery'); without changing the results. The leading dollar signs and the leading greater-than signs are not to be typed by the reader. We expect that readers are familiar with basic bash commands, such as changing directories, viewing and editing text files, and moving and copying files (see Note 1).
The result file should be in the “list” directory. The first two
columns are the protein indices, and the third column is the
sequence-based similarity score for the protein pair. The
protein-protein matrix may not be symmetric. In other
words, the similarity score from query protein 1 against target
protein 2 may be different from that of query protein 2 against
target protein 1. However, the self-query similarity score is
always 1. The output file does not contain self-query scores as
they can be easily added in Matlab.
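A minimal sketch of this conversion (assuming a comma-separated file named prot_prot_sim_Idx.csv with 1-based protein indices in the first two columns and the similarity score in the third; the chapter itself does this step in Matlab):

import numpy as np

n_prot = 3500
sim = np.zeros((n_prot, n_prot))

# each row: query protein index, target protein index, sequence-based similarity score
pairs = np.loadtxt("prot_prot_sim_Idx.csv", delimiter=",")
rows = pairs[:, 0].astype(int) - 1
cols = pairs[:, 1].astype(int) - 1
sim[rows, cols] = pairs[:, 2]

np.fill_diagonal(sim, 1.0)   # self-query similarity is always 1 but is not written to the file
print(sim.shape)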
4. Comma-separated files can easily be read into Matlab to generate a numerical matrix. Thus, we generate the chemical-chemical similarity file in the same format as the protein-protein similarity file. Under the "list" directory, run
The resulting file should contain only the SMILES string for
each chemical, without the header line. This file is the input for
ChemAxon JChem software (see Note 4).
5. Assuming ChemAxon JChem is installed under the "/home/user/" directory, run
$ /home/user/ChemAxon/JChem/bin/screenmd ZINC_chemical_
structures.txt ZINC_chemical_structures.txt -k ECFP -g -c
-M Tanimoto > ZINC_tanimoto.dat
See Note 5 for more information. This process will take some
time (approximately 9 min on a 3.40 GHz quad-core desktop
machine). The output file is a matrix of Tanimoto distances
between chemicals based on their 2D structures. Open the
output file by running
$ less -S ZINC_tanimoto.dat
on the terminal and confirm that the diagonal entries are zero
since the scores are distances, not similarities.
6. We need to convert the distances to similarity scores by subtracting each distance from 1. Also, as stated in Lim et al. [20], chemical-chemical similarity scores lower than 0.5 are filtered out. Under the "script" directory, run
(see Note 6). The output file format is the same as the "prot_prot_sim_Idx.csv" file, but the indices are chemical indices instead of protein indices.
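The distance-to-similarity conversion and the 0.5 cutoff can be sketched as follows (illustrative only: ZINC_tanimoto.dat is read here as a plain numeric matrix and the output file name chem_chem_sim_Idx.csv is hypothetical; the chapter's own script may parse the screenmd output differently):

import numpy as np

dist = np.loadtxt("ZINC_tanimoto.dat")                 # Tanimoto distance matrix (assumed plain numeric)
sim = 1.0 - dist                                       # similarity = 1 - distance
rows, cols = np.nonzero(sim >= 0.5)                    # keep only similarities of at least 0.5
with open("chem_chem_sim_Idx.csv", "w") as out:
    for i, j in zip(rows, cols):
        out.write(f"{i + 1},{j + 1},{sim[i, j]:.4f}\n")  # 1-based chemical indices, as in the protein file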
7. Finally, we need to convert the text-based drug-target associa-
tions into index-based associations. Under “script” directory,
run
The output file contains only two columns: chemical index and
protein index. There is no need to output the third column as
the chemical-protein association matrix is binary.
8. Next, we need to create Matlab-readable matrices from the comma-separated files. Open Matlab and move to the "list" directory. Run
>> save('/home/user/REMAP/chem_prot_zinc.mat','chem_prot');
>> load('/home/user/REMAP/BiDD/matrix/chem_prot_zinc.mat');
2. Once the three matrices are loaded, check the names of the loaded variables. The command
>> who
shows the names of the loaded matrices. These names are the variable names that were used to save the matrices into the files. The three names that appear should be "chem_prot_zinc, chem_chem_zinc, prot_prot_zinc".
3. Run
>> addpath('/home/user/REMAP/matlab_codes/');
>> REMAP_optimization(chem_prot_zinc,chem_chem_zinc,prot_prot_zinc)
(see Note 10). Once the grid search is done, the best parameters will be shown in the Matlab command window. For detailed performance comparisons, open the output file outside the Matlab command window. The output file should be located under the "BiDD" directory (one level higher than the current working directory). Based on the grid search, the best parameters were rank = 100, p6 = 0.99, and p7 = 0.33, and they yielded an average AUC of 0.9661. Using these parameters, we can continue to get the prediction scores for each chemical-protein pair.
4. Now that we have the optimal parameters, the actual prediction step simply uses all known associations to get the score matrix P. In other words, we do not divide our data into training and test sets. We will use the "REMAP.m" script, which takes the same inputs plus the user-defined parameters. First, we define the parameters. In the Matlab command window, run
>> para=[0.1,0.1,0.01,100,100,0.99,0.33];
followed by
>> REMAP(chem_prot_zinc,chem_chem_zinc,prot_prot_zinc,'/home/user/REMAP/REMAP_prediction.txt','false',0.7,para);
4 Notes
>> REMAP(chem_prot_zinc,chem_chem_zinc,prot_prot_zinc,'/home/user/REMAP/REMAP_prediction.txt','true',0.7,para);
>> COSINE_Optimization(chem_prot_zinc,chem_chem_zinc,prot_prot_zinc)
5 Conclusion
Acknowledgment
References
1. Kennedy T (1997) Managing the drug discovery/development interface. Drug Discov Today 2(10):436–444. https://doi.org/10.1016/S1359-6446(97)01099-4
2. Weber A, Casini A, Heine A, Kuhn D, Supuran CT, Scozzafava A, Klebe G (2004) Unexpected nanomolar inhibition of carbonic anhydrase by COX-2-selective celecoxib: new pharmacological opportunities due to related binding site recognition. J Med Chem 47(3):550–557. https://doi.org/10.1021/jm030912m
3. Xie L, Wang J, Bourne PE (2007) In silico elucidation of the molecular mechanism defining the adverse effect of selective estrogen receptor modulators. PLoS Comput Biol 3(11):e217. https://doi.org/10.1371/journal.pcbi.0030217
4. Forrest MJ, Bloomfield D, Briscoe RJ, Brown PN, Cumiskey AM, Ehrhart J, Hershey JC, Keller WJ, Ma X, McPherson HE, Messina E, Peterson LB, Sharif-Rodriguez W, Siegl PK, Sinclair PJ, Sparrow CP, Stevenson AS, Sun SY, Tsai C, Vargas H, Walker M 3rd, West SH, White V, Woltmann RF (2008) Torcetrapib-induced blood pressure elevation is independent of CETP inhibition and is accompanied by increased circulating levels of aldosterone. Br J Pharmacol 154(7):1465–1473. https://doi.org/10.1038/bjp.2008.229
5. Howes LG, Kostner K (2007) The withdrawal of torcetrapib from drug development: implications for the future of drugs that alter HDL metabolism. Expert Opin Investig Drugs 16(10):1509–1516. https://doi.org/10.1517/13543784.16.10.1509
6. Butler D, Callaway E (2016) Scientists in the dark after French clinical trial proves fatal. Nature 529(7586):263–264. https://doi.org/10.1038/nature.2016.19189
7. Xie L, Evangelidis T, Xie L, Bourne PE (2011) Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput Biol 7(4):e1002037. https://doi.org/10.1371/journal.pcbi.1002037
8. Bertolini F, Sukhatme VP, Bouche G (2015) Drug repurposing in oncology – patient and health systems opportunities. Nat Rev Clin Oncol 12(12):732–742. https://doi.org/10.1038/nrclinonc.2015.169
9. Novac N (2013) Challenges and opportunities of drug repositioning. Trends Pharmacol Sci 34(5):267–272. https://doi.org/10.1016/j.tips.2013.03.004
10. Bowes J, Brown AJ, Hamon J, Jarolimek W, Sridhar A, Waldron G, Whitebread S (2012) Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nat Rev Drug Discov 11(12):909–922. https://doi.org/10.1038/nrd3845
11. Hart T, Xie L (2016) Providing data science support for systems pharmacology and its implications to drug discovery. Expert Opin Drug Discov 11(3):241–256. https://doi.org/10.1517/17460441.2016.1135126
12. Xie L, Draizen EJ, Bourne PE (2017) Harnessing big data for systems pharmacology. Annu Rev Pharmacol Toxicol 57:245–262. https://doi.org/10.1146/annurev-pharmtox-010716-104659
13. Xie L, Xie L, Kinnings SL, Bourne PE (2012) Novel computational approaches to polypharmacology as a means to define responses to individual drugs. Annu Rev Pharmacol Toxicol 52:361–379. https://doi.org/10.1146/annurev-pharmtox-010611-134630
14. Xie L, Ge X, Tan H, Xie L, Zhang Y, Hart T, Yang X, Bourne PE (2014) Towards structural systems pharmacology to study complex diseases and personalized medicine. PLoS Comput Biol 10(5):e1003554. https://doi.org/10.1371/journal.pcbi.1003554
15. Koutsoukas A, Lowe R, Kalantarmotamedi Y, Mussa HY, Klaffke W, Mitchell JB, Glen RC, Bender A (2013) In silico target predictions: defining a benchmarking data set and comparison of performance of the multiclass naive Bayes and Parzen-Rosenblatt window. J Chem Inf Model 53(8):1957–1966. https://doi.org/10.1021/ci300435j
16. van Laarhoven T, Nabuurs SB, Marchiori E (2011) Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27(21):3036–3043. https://doi.org/10.1093/bioinformatics/btr500
17. van Laarhoven T, Marchiori E (2013) Predicting drug-target interactions for new drug compounds using a weighted nearest neighbor profile. PLoS One 8(6):e66952. https://doi.org/10.1371/journal.pone.0066952
18. Gonen M (2012) Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28(18):2304–2310. https://doi.org/10.1093/bioinformatics/bts360
19. Rouillard AD, Gundersen GW, Fernandez NF, Wang Z, Monteiro CD, McDermott MG, Ma'ayan A (2016) The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database (Oxford) 2016
Abstract
Nowadays, drug discovery is a long process that includes target identification, validation, lead optimization, and many other major and minor steps. The huge flow of data has necessitated computational support for the collection, storage, retrieval, analysis, and correlation of complex data sets. At the beginning of the twentieth century, it was cumbersome to translate experimental findings into clinical outcomes, but current research in the field of bioinformatics clearly shows an ongoing unification of experimental findings and clinical outcomes. Bioinformatics has made it easier for researchers to overcome the challenges of time-consuming and expensive procedures for evaluating the safety and efficacy of drugs in a much faster and more economical way. In the near future, it may become a major player and trendsetter for personalized medicine, drug discovery, drug standardization, and food products. Owing to rapidly increasing commercial interest, probiotic-based industries are currently flooding the market with a range of probiotic products under the banner of dietary supplements, natural health products, food supplements, or functional foods. Most consumers are attracted to probiotic formulations by the rosy picture of highly beneficial claims presented in the media and in advertisements. These products are not regulated by the pharmaceutical regulatory authorities in their different countries of origin but are rather regulated according to their intended use. The lack of stipulated quality standards is a major challenge for the probiotic industry; hence there is always a possibility that ineffective and unsafe products with false claims will be marketed. It is therefore very important to ensure the safety of probiotic formulations available as over-the-counter (OTC) products for an unsuspecting public. At the same time, the probiotic industry, being in its initial stages in developing and underdeveloped countries, needs to ensure the safe, swift, and successful use of probiotics. In the absence of harmonized regulatory guidelines, the safety, quality, and efficacy of a probiotic strain are not a mandate but rather a choice for the manufacturer. Hence there is an urgent need to screen already marketed probiotic formulations for their safety with respect to the specific probiotic strains they contain. The conventional methods used by manufacturers for the identification of probiotic microbes create a blurred image of their status as probiotics. The present chapter focuses on a bioinformatics-based technique for the validation of a marketed probiotic formulation using 16S rRNA sequencing and strain-level identification of bacterial species using the EzTaxon database and Lasergene software. This technique gives a clear picture of the safety of the product for human use.
Key words Bioinformatics, Probiotics, EzTaxon, Lasergene, Drug discovery, 16S rRNA, Strain-level identification, Pathogenic strain, Safety and efficacy
1 Introduction
3 Methodology
(Workflow: characterization of the sample by Gram staining and microscopic examination and by catalase, indole, oxidase, methyl red, VP, citrate utilization, and sugar fermentation tests; genomic DNA isolation; amplification of the 16S rDNA region; rRNA purification and sequencing; sequence determination of the 16S rDNA region; and percentage homology and phylogenetic analysis of the data.)
Table 1
Details of primers used for amplification of the 16S rDNA region
Forward
Group A-forward: 5'-gt-tTg-atc-mtg-gct-cag-rAc-3'
Group B-forward: 5'-gt-tTg-atc-mtg-gct-cag-aKtg-3'
Group C-forward: 5'-agt-ttg-atc-mtg-gct-cag-gAt-3'
Reverse
Groups A, B, C-reverse (pD)**: 5'-gta-tta-ccg-cgg-ctg-ctg-3'
Table 2
Details of the reaction mixture used for RT-PCR
Table 3
Details of PCR conditions used for amplification
3.1 Bioinformatics Tools Based Data Analysis
The sequence obtained after sequencing was used for data analysis with respect to the following parameters:
1. Percentage homology.
2. Phylogenetic analysis.
- Percentage homology: The percentage homology of the sequence obtained from the marketed formulation with other available strains was analyzed using the EzTaxon database, and observations for percentage homology, sequence analysis, graphic summary, dot matrix, identity analysis, and taxonomic hierarchy were recorded as images.
- Phylogenetic analysis: The query sequence of the isolate from the marketed probiotic formulation was closely compared with other related species with the help of Lasergene software. A significant level of variation in 16S rRNA between strains of different species was observed, and a phylogenetic tree was drawn from the observed parameters.
For example, in the current study, the dot matrix shows that most of the base pairs are similar, hence representing a straight line as shown in Fig. 4.
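As a minimal illustration of what a percentage homology (percent identity) figure represents (a toy computation on two short, already-aligned, hypothetical fragments, not the EzTaxon workflow itself):

def percent_identity(aligned_a, aligned_b):
    """Percent identity between two equal-length aligned sequences (gap positions are not counted as matches)."""
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)

seq1 = "AGAGTTTGATCCTGGCTCAG"   # hypothetical 16S fragment from the isolate
seq2 = "AGAGTTTGATCATGGCTCAG"   # hypothetical 16S fragment from a reference strain
print(round(percent_identity(seq1, seq2), 1))   # 95.0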
3.3 Phylogenetic Analysis
For the phylogenetic analysis, the obtained query sequence of the isolate from the selected probiotic formulation was closely compared with other related species with the help of Lasergene software. A significant level of variation in 16S rRNA between strains of different species was evident from the obtained results. The query sequence of 699 bp was compared between the three strains of B. coagulans and closely related species of the genus Bacillus. The 16S rDNA sequence from B. coagulans (the marketed formulation isolate) was closely related to two strains of B. coagulans, namely B. coagulans (strain C4) and B. coagulans (LA 204), with similarities of 98.8% and 99.3%, respectively. Furthermore, the phylogenetic tree prepared with the help of Lasergene software clarified that the bacterial sample isolated from the marketed formulation is indeed Bacillus coagulans, as it formed a coherent cluster with two other strains of B. coagulans, namely B. coagulans C4 and B. coagulans L204, which is a type
Chapter 13
Abstract
Recent advances in technology have led to the exponential growth of the scientific literature in the biomedical sciences. This rapid increase in information has surpassed the threshold for manual curation efforts, necessitating the use of text mining approaches in the life sciences. One such application of text mining is in fostering in silico drug discovery, including drug target screening, pharmacogenomics, and adverse drug event detection. This chapter serves as an introduction to the applications of various text mining approaches in drug discovery. It is divided into two parts, with the first half providing an overview of text mining in the biosciences. The second half of the chapter reviews strategies and methods for four unique applications of text mining in drug discovery.
Key words Biomedical text mining, Drug discovery, Biomedical literature, Electronic medical
records, Deep learning
1 Introduction
1.1 What Is Text Mining
In a nutshell, text mining (TM) is the process of discovering and capturing knowledge or useful patterns from large amounts of unstructured textual data. It is an interdisciplinary field that draws on data mining, machine learning, natural language processing, statistics, and more. Broadly speaking, TM tasks include document summarization, information retrieval, entity recognition, and relationship extraction. The extracted information is often linked via knowledge graphs to form new facts or hypotheses. Take, for example, the mining of protein-protein interactions from bioscience textual data. Using TM techniques, one can extract articles with individual protein name mentions, acquire related words that occur in each of these articles, and find additional articles containing the same sets of words. The ultimate goal is to find potential protein-protein interactions from the articles obtained.
1.2 Text Mining for Facilitating Drug Discovery
The biosciences have become one of the most promising application areas for text mining, as most biomedical scientific findings are presented as textual data in scholarly publications. In recent years, biomedical text mining has generated promising outcomes for drug discovery by utilizing textual databases and advanced information extraction techniques. Hidden information such as drug-drug and drug-target interactions can be extracted from textual data [4], which can aid in the identification of novel drugs or the repurposing of approved drugs for other indications. All of these approaches seek to infer novel relationships among biological entities by combining known elements hidden in scientific text.
1.3 Challenges of Text Mining in Drug Discovery
Although promising, there are many challenges in text mining for drug discovery. For instance, drug and chemical names in scientific textual data are heterogeneous and ambiguous, and documents differ substantially in their length, structure, language and vocabulary, writing style, and information content. As for TM technologies, the lack of interoperability among TM tools impedes the development of more complex systems. Other challenges include the limited high-quality training data available for the development and evaluation of advanced supervised machine learning methods. For example, deep learning, a new and advanced area of machine learning research, has yet to show its full potential when applied to biomedical text mining tasks; one major obstacle is the lack of sufficient training data in the biomedical domain.
Lastly, the ability to handle large-scale data continues to be a challenge and is necessary for effective text mining applications. Recent computing technologies, such as cloud computing, have been applied to help boost and optimize existing tools.
2.1 Information Retrieval
The first critical step in biomedical text mining is to obtain information relevant to a particular topic from data resources, also called information retrieval. Topics of interest can be represented by various queries, with each query retrieving a list of matching documents. There are two main strategies when crafting queries: (1) obtain all matches that may be relevant to the topic (maximize recall) or (2) find only documents that are truly related to the topic of interest (maximize precision). Methods for generating different types of queries include keywords with controlled vocabulary, Boolean search queries, natural language queries, wildcard
2.2 Named Entity Recognition
The next step in the text mining pipeline is named entity recognition, the process of locating specific predefined types of entities in text. For biomedical text mining, named entity recognition typically involves the extraction of entities such as drugs, diseases, genes/proteins, and mutations from unstructured text. The information extracted can be used to define semantic relationships between entities to allow for further analysis of article topics. Over the years, many biomedical named entity recognition systems have been developed (Table 1).
Common methods for biomedical entity recognition include dictionary look-up, rule-based approaches, machine learning, and hybrid techniques. Dictionary-based methods are simple and practical but are limited by the scale and quality of the dictionary. Resources like the Unified Medical Language System (UMLS: https://www.nlm.nih.gov/research/umls/) [7], the Comparative Toxicogenomics Database (CTD: https://ctdbase.org) [8], DrugBank (https://www.drugbank.ca) [9], PubChem (https://pubchem.ncbi.nlm.nih.gov) [10], RxNorm (https://www.nlm.nih.gov/research/umls/rxnorm/) [11], etc. are often used for the creation of entity dictionaries. Rule-based approaches can be effective when resources such as entity gazetteers or entity-labeled textual training data are missing [12]. In recent studies, machine learning methods such as conditional random fields (CRF), structured support vector machines (SSVMs), and deep neural networks have been widely adopted. For instance, entity recognition tools such as BANNER [13], DNorm [14], and TaggerOne [15] are based on conditional random field (CRF) techniques. Finally,
Table 1
Named entity recognition systems

tmChem: an open-source software for identifying chemical names. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmchem/. Recognizes chemical and drug names in biomedical literature.
BANNER-CHEMDNER: a system for identifying chemical and drug mentions. URL: https://bitbucket.org/tsendeemts/banner-chemdner. Recognizes chemical and drug names in free text.
CheNER: a tool for chemical named entity recognition. URL: http://ubio.bioinfo.cnio.es/biotools/CheNER/. Recognizes chemical and drug names in free text.
ChemDataExtractor: a toolkit for extraction of chemical information. URL: http://chemdataextractor.org/. Recognizes drug names, associated properties, measurements, and relationships in biomedical literature.
DNorm: an open-source software tool for identifying disease names. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/dnorm/. Recognizes disease names in biomedical literature.
OSCAR4: an extensible system for annotation of chemistry. URL: https://bitbucket.org/wwmm/oscar4/wiki/Home. Recognizes chemical and drug names in biomedical literature.
GNormPlus: an integrative approach for tagging gene, gene family, and protein domains. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/. Recognizes gene and protein names in biomedical literature.
tmVar: a text mining approach for extracting sequence variants. URL: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Recognizes mutation names in biomedical literature.
Whatizit: a text processing system for identifying molecular biology terms. URL: http://www.ebi.ac.uk/webservices/whatizit/info.jsf. Recognizes gene, chemical, and disease names in free text.
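As a minimal sketch of the dictionary look-up strategy described above (a toy dictionary and sentence, not one of the systems listed in Table 1):

import re

drug_dictionary = {"aspirin", "warfarin", "clopidogrel"}   # toy drug lexicon (hypothetical entries)

def find_drug_mentions(text):
    """Return (drug, start, end) spans for dictionary terms found in the text."""
    mentions = []
    for drug in drug_dictionary:
        for match in re.finditer(r"\b" + re.escape(drug) + r"\b", text, flags=re.IGNORECASE):
            mentions.append((drug, match.start(), match.end()))
    return sorted(mentions, key=lambda m: m[1])

sentence = "Co-administration of aspirin and warfarin increased bleeding risk."
print(find_drug_mentions(sentence))

In practice, such a matcher is only as good as its lexicon, which is why the curated resources above (UMLS, DrugBank, RxNorm, and others) are typically used to build the dictionary.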
2.4 Knowledge Graph Construction
Knowledge graph construction is the final step in biomedical text mining. A knowledge graph is a structured semantic knowledge base that represents knowledge in graphical form. Each knowledge graph is comprised of nodes representing entities, attributes associated with each node, and edges demarcating unidirectional or multidirectional relationships between nodes. Knowledge graphs have become the foundation of automatic semantic retrieval in modern web searches. The construction process can be described as a link prediction problem and divided into three layers according to the abstraction level of the input materials: the information extraction layer, the knowledge integration layer, and the knowledge processing layer. Commonly used approaches for knowledge graph construction include analysis of graph extraction, incorporation of ontology constraints and relational patterns, and discovery of statistical relationships within the knowledge graph.
Previous studies have explored both molecular network databases and the biomedical literature to create a knowledge base and have coupled it with approaches such as machine learning, graph mining, and data visualization. Due to recent efforts, linked open data constitutes a large collection of datasets in standard formats and has become a new resource for knowledge discovery. Dalleau et al. integrated a set of linked data associated with pharmacogenomics and defined pharmacogenes in a large Resource Description Framework (RDF) graph by using machine learning models [23]. Three distinct types of entities (gene, phenotype, and drug), along with their corresponding relationships, were depicted by the knowledge graph.
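A minimal sketch of the node-attribute-edge structure described above, using the three entity types from the Dalleau et al. example (the specific gene, drug, and phenotype names here are placeholders and are not taken from that study):

import networkx as nx

kg = nx.MultiDiGraph()

# nodes carry an entity-type attribute; edges carry a relation label
kg.add_node("CYP2C9", type="gene")
kg.add_node("warfarin", type="drug")
kg.add_node("bleeding", type="phenotype")

kg.add_edge("CYP2C9", "warfarin", relation="affects_metabolism_of")
kg.add_edge("warfarin", "bleeding", relation="associated_with")

for head, tail, data in kg.edges(data=True):
    print(head, data["relation"], tail)   # prints the (subject, predicate, object) triples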
3.1 Materials
Many biomedical resources are available for drug discovery using text mining approaches. In recent years, open-access biomedical resources have become increasingly easier and cheaper to acquire, with the majority of them being textual. Often, valuable information is encoded in neither structured nor classified forms, thus
Table 2
Open-access textual data resources for drug discovery

Clinical textual data:
MIMIC-III (https://mimic.physionet.org/): a large, publicly available database of critical care units from the Beth Israel Deaconess Medical Center.
NHS England (https://www.england.nhs.uk/): an executive non-departmental public body (NDPB) of the Department of Health.
PCORnet (http://www.pcornet.org/): a large, highly representative national network of clinical data and health information.

Biomedical literature:
MEDLINE (https://medlineplus.gov/): a publicly available bibliographic database of life science and biomedical information.
PMC (https://www.ncbi.nlm.nih.gov/pmc/): a free full-text archive of biomedical and life sciences journal literature.
DOAJ (https://doaj.org/): a community-curated online directory that indexes and provides access to high-quality, open-access, peer-reviewed journals.
3.1.1 Clinical Textual Data
With the ongoing acquisition of clinical textual data in medicine, such as practice guidelines, clinical notes, and electronic health records (EHRs), large amounts of data exist as potential resources for drug discovery. For example, mining electronic health records (EHRs) has the potential to establish new patient-stratification principles and to reveal unknown drug correlations [25]. This method has become increasingly feasible as more and more clinical databases are free of access restrictions.
1. MIMIC-III: Medical Information Mart for Intensive Care III (MIMIC-III) is a large, freely available database comprising patient information from all critical care units at a large tertiary care hospital [26]. This database contains information such as clinical notes, vital signs, medication lists, laboratory values, procedure codes, diagnostic codes, imaging reports, hospital length of stay, and survival data. To access MIMIC-III, one must formally request access via a process documented on the
3.2 Methods
There are multiple approaches to text mining for drug discovery, each relying on the fundamental principles of information retrieval, named entity extraction, and relationship identification. Extracting disease and gene mutation relationships is important for identifying individual variability in diseases and for developing drugs that target each of these mutations. Understanding how drugs behave in different genetic backgrounds can determine either their potential benefit or harm to a specific patient population. This section provides the methodology behind each application of text mining for drug discovery and reviews the most up-to-date research in this field.
3.2.1 Extracting Disease-Mutation Relationships for Precision Medicine
The identification of disease and gene mutation correlations is an important strategy for drug discovery in precision medicine, as it takes into account individual variability in disease prevention and treatment. In the case of a genetic disease like cancer, the genomic diversity and instability of cancer cells become major determinants that enable them to acquire malignant or characteristic traits. Previous studies have shown that tumor response to a drug ultimately depends on the steps of oncogenesis, tumor heterogeneity, and the oncogenic evolution of resistant clones within the tumor. The success of a cancer drug today is fundamentally dependent on the drug company's ability to identify target genes that control tumorigenic pathways [28].
Various mutations have been reported in recent studies, and many databases like ClinVar [29], Online Mendelian Inheritance in Man [30], COSMIC [31], and the GWAS Catalog [32] are available for locating manually curated disease and gene mutation associations; however, the majority of associations still remain buried in the unstructured text of biomedical publications or electronic medical records. In this instance, text mining methods can be employed to extract such relationships. Text mining approaches such as simple co-occurrence, pattern matching, and machine learning are regularly used for disease-mutation extraction. PolySearch is a search-based text mining tool that infers relationships between mutations and diseases based on their frequency of co-occurrence
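The simple co-occurrence idea can be sketched as follows (toy sentences and toy lexicons, not PolySearch itself): count how often a disease term and a mutation term appear in the same sentence, and rank pairs by that frequency.

from collections import Counter
from itertools import product

sentences = [
    "The V600E mutation in BRAF is frequently observed in melanoma.",
    "Melanoma patients harboring V600E respond to BRAF inhibitors.",
    "The T790M mutation confers resistance in lung cancer.",
]
diseases = {"melanoma", "lung cancer"}   # toy disease lexicon
mutations = {"V600E", "T790M"}           # toy mutation lexicon

cooccurrence = Counter()
for sentence in sentences:
    lowered = sentence.lower()
    found_diseases = {d for d in diseases if d in lowered}
    found_mutations = {m for m in mutations if m.lower() in lowered}
    cooccurrence.update(product(found_diseases, found_mutations))   # one count per pair per sentence

for (disease, mutation), count in cooccurrence.most_common():
    print(disease, mutation, count)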
3.2.2 Extracting Pharmacogenomics Relationships
Pharmacogenomics (PGx) is the study of how genetic variants may affect drug efficacy and toxicity. This information is important for drug discovery, especially for precision medicine. Substantial research efforts using both manual and computational approaches have been conducted to develop these databases. For instance, the Pharmacogenomics Knowledge Base (PharmGKB) is the largest manually curated resource for information regarding the impact of genetic variation on drug response [45]. In recent years, manual curation efforts have been assisted by computational approaches due to high labor costs and time constraints.
Text mining strategies can be used to extract novel pharmacogenomics data using information from publicly available datasets as prior knowledge for computational inference. A myriad of
3.2.3 Mining Drug Targets
Extracting drug targets from publicly available datasets and the biomedical literature is another useful text mining approach for drug discovery. Multiple types of interactions between drugs and their targets have been identified in recent studies. Traditional in silico methods utilize machine learning approaches, such as classification models (Ding et al.) [55] and rule-based inference methods (Cheng et al.) [56], to predict drug-target associations. Similarity-based methods examine associations between drug-drug and target-target pairs and use these relationships for weighting potential associations. This includes similarities between chemical
4. Pattern importance (see the sketch after this list):
(a) Low p-values between drug-target pairs suggest a strong probability of association between the drug and target. However, this association may not be meaningful biologically. Each pattern was therefore assessed as a feature based on its ability to identify other drug-target pairs from the set. Patterns with high ROC scores were deemed informative, and their respective drug-target pairs are likely to interact.
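The pattern-importance step can be sketched as follows (toy labels and toy pattern matches, not the study's actual data): treat each pattern as a binary feature over a set of candidate drug-target pairs and score it by how well that feature separates interacting from non-interacting pairs.

from sklearn.metrics import roc_auc_score

# 1 = known interacting drug-target pair, 0 = non-interacting (toy labels)
known_interaction = [1, 1, 1, 0, 0, 0, 1, 0]
# 1 = the candidate pattern matched a sentence mentioning this pair (toy feature)
pattern_matched   = [1, 1, 0, 0, 0, 1, 1, 0]

auc = roc_auc_score(known_interaction, pattern_matched)
print(f"Pattern ROC AUC: {auc:.2f}")   # patterns with high AUC are treated as informative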
3.2.4 Identifying Drug Adverse Effects
One of the difficulties of the drug discovery process is managing adverse drug effects, which restrict the clinical use of otherwise effective drugs and have been the leading cause of failure for new drugs in clinical trials. In recent studies, text mining approaches have been used to identify drug side effects.
For example, Xu's 2014 study presented an automated learning approach to accurately extract drug-side effect pairs from the vast amount of published biomedical literature [61]:
1. Materials: This study used 119 million MEDLINE sentences and their corresponding parse trees as the text corpus.
2. Known drug-side effect pair extraction: 100,000 known drug-side effect pairs were derived from FDA drug labels, covering 996 FDA-approved drugs and 4,199 adverse event terms. These drug-side effect pairs were used as prior knowledge to extract relevant sentences and parse trees.
3. Lexicon building: Disease, side effect (SE), and drug lexicons were developed. The disease lexicon was built by combining all disease terms in UMLS [7] and the Human Disease Ontology [62]. The side effect lexicon was based on the Medical Dictionary for Regulatory Activities (MedDRA) [63]. The drug lexicon was downloaded from DrugBank [9].
4. Drug-side effect relationship extraction: The drug-side effect extraction system consists of four parts: (1) pattern extraction,
4 Conclusion
Acknowledgments
References
1. Reichert JM (2003) Trends in development and approval times for new therapeutics in the United States. Nat Rev Drug Discov 2(9):695–702. https://doi.org/10.1038/nrd1178
2. Woodcock J, Woosley R (2008) The FDA critical path initiative and its influence on new drug development. Annu Rev Med 59:1–12. https://doi.org/10.1146/annurev.med.59.090506.155819
3. Claus BL, Underwood DJ (2002) Discovery informatics: its evolving role in drug discovery. Drug Discov Today 7(18):957–966
4. Percha B, Garten Y, Altman RB (2012) Discovery and explanation of drug-drug interactions via text mining. Pac Symp Biocomput:410–421
5. Huang CC, Lu Z (2016) Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation. Database (Oxford) 2016. https://doi.org/10.1093/database/baw025
6. Kraus M, Niedermeier J, Jankrift M, Tietbohl S, Stachewicz T, Folkerts H, Uflacker M, Neves M (2017) Olelo: a web application for intuitive exploration of biomedical literature. Nucleic Acids Res. https://doi.org/10.1093/nar/gkx363
7. Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267–D270. https://doi.org/10.1093/nar/gkh061
8. Mattingly CJ, Colby GT, Forrest JN, Boyer JL (2003) The comparative Toxicogenomics database (CTD). Environ Health Perspect 111(6):793–795
9. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue):D668–D672. https://doi.org/10.1093/nar/gkj067
10. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH (2016) PubChem substance and compound databases. Nucleic Acids Res 44(D1):D1202–D1213. https://doi.org/10.1093/nar/gkv951
11. Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R (2011) Normalized names for clinical drugs: RxNorm at 6 years. J Am Med Inform Assoc 18(4):441–448. https://doi.org/10.1136/amiajnl-2011-000116
12. Krallinger M, Rabal O, Lourenco A, Oyarzabal J, Valencia A (2017) Information retrieval and text mining technologies for chemistry. Chem Rev 117(12):7673–7761. https://doi.org/10.1021/acs.chemrev.6b00851
13. Leaman R, Gonzalez G (2008) BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput:652–663
32. MacArthur J et al (2017) The new NHGRI-EBI catalog of published genome-wide association studies (GWAS catalog). Nucleic Acids Res 45(D1):D896–D901. https://doi.org/10.1093/nar/gkw1133
33. Cheng D, Knox C, Young N, Stothard P, Damaraju S, Wishart DS (2008) PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res 36(Web Server issue):W399–W405. https://doi.org/10.1093/nar/gkn296
34. Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R, Casari G, Kirsch H (2004) Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res 32(1):135–142. https://doi.org/10.1093/nar/gkh162
35. Doughty E, Kertesz-Farkas A, Bodenreider O, Thompson G, Adadey A, Peterson T, Kann MG (2011) Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature. Bioinformatics 27(3):408–415. https://doi.org/10.1093/bioinformatics/btq667
36. Wei CH, Kao HY, Lu Z (2015) GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. Biomed Res Int 2015:918710. https://doi.org/10.1155/2015/918710
37. Wei CH, Harris BR, Kao HY, Lu Z (2013) tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 29(11):1433–1439. https://doi.org/10.1093/bioinformatics/btt156
38. Ravikumar KE, Wagholikar KB, Li D, Kocher JP, Liu H (2015) Text mining facilitates database curation – extraction of mutation-disease associations from bio-medical literature. BMC Bioinformatics 16:185. https://doi.org/10.1186/s12859-015-0609-x
39. Torii M, Hu Z, Wu CH, Liu H (2009) BioTagger-GM: a gene/protein name recognition system. J Am Med Inform Assoc 16(2):247–255. https://doi.org/10.1197/jamia.M2844
40. Caporaso JG, Baumgartner WA Jr, Randolph DA, Cohen KB, Hunter L (2007) MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 23(14):1862–1865. https://doi.org/10.1093/bioinformatics/btm235
41. Wei CH, Kao HY, Lu Z (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res 41(Web Server issue):W518–W522. https://doi.org/10.1093/nar/gkt441
42. Wermter J, Tomanek K, Hahn U (2009) High-performance gene name normalization with GeNo. Bioinformatics 25(6):815–821. https://doi.org/10.1093/bioinformatics/btp071
43. Mahmood AS, Wu TJ, Mazumder R, Vijay-Shanker K (2016) DiMeX: a text mining system for mutation-disease association extraction. PLoS One 11(4):e0152725. https://doi.org/10.1371/journal.pone.0152725
44. Van Cutsem E, Kohne CH, Hitre E, Zaluski J, Chien CRC, Makhson A, D'Haens G, Pinter T, Lim R, Bodoky G, Roh JK, Folprecht G, Ruff P, Stroh C, Tejpar S, Schlichting M, Nippgen J, Rougier P (2009) Cetuximab and chemotherapy as initial treatment for metastatic colorectal cancer. New Engl J Med 360(14):1408–1417. https://doi.org/10.1056/Nejmoa0805019
45. Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE (2002) PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res 30(1):163–165
46. Maglott D, Ostell J, Pruitt KD, Tatusova T (2011) Entrez gene: gene-centered information at NCBI. Nucleic Acids Res 39:D52–D57. https://doi.org/10.1093/nar/gkq1237
47. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
48. Pakhomov S, McInnes BT, Lamba J, Liu Y, Melton GB, Ghodke Y, Bhise N, Lamba V, Birnbaum AK (2012) Using PharmGKB to train text mining approaches for identifying potential gene targets for pharmacogenomic studies. J Biomed Inform 45(5):862–869. https://doi.org/10.1016/j.jpi.2012.04.007
49. Ian H, Witten EF (2011) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann Publishers, San Francisco
50. Xu R, Wang Q (2013) A semi-supervised approach to extract pharmacogenomics-specific drug-gene pairs from biomedical literature for personalized medicine. J Biomed Inform 46(4):585–593. https://doi.org/10.1016/j.jbi.2013.04.001
51. Hakenberg J, Voronov D, Nguyen VH, Liang S, Anwar S, Lumpkin B, Leaman R, Tari L, Baral C (2012) A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions. J Biomed Inform 45(5):842–850. https://doi.org/10.1016/j.jbi.2012.04.006
52. Chang JT, Altman RB (2004) Extracting and characterizing gene-drug relationships from the literature. Pharmacogenetics 14(9):577–586
53. Lakiotaki K, Kartsaki E, Kanterakis A, Katsila T, Patrinos GP, Potamias G (2016) ePGA: a web-based information system for translational pharmacogenomics. PLoS One 11(9):e0162801. https://doi.org/10.1371/journal.pone.0162801
54. Dalma-Weiszhausz DD, Warrington J, Tanimoto EY, Miyada CG (2006) The affymetrix GeneChip platform: an overview. Methods Enzymol 410:3–28. https://doi.org/10.1016/S0076-6879(06)10001-4
55. Ding H, Takigawa I, Mamitsuka H, Zhu S (2014) Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Brief Bioinform 15(5):734–747. https://doi.org/10.1093/bib/bbt056
56. Cheng F, Liu C, Jiang J, Lu W, Li W, Liu G, Zhou W, Huang J, Tang Y (2012) Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol 8(5):e1002503. https://doi.org/10.1371/journal.pcbi.1002503
57. Chen B, Ding Y, Wild DJ (2012) Assessing drug target association using semantic linked data. PLoS Comput Biol 8(7):e1002574. https://doi.org/10.1371/journal.pcbi.1002574
58. Chen B, Dong X, Jiao D, Wang H, Zhu Q, Ding Y, Wild DJ (2010) Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 11:255. https://doi.org/10.1186/1471-2105-11-255
59. Chen B, Ding Y, Wild DJ (2012) Improving integrative searching of systems chemical biology data using semantic annotation. J Chem 4(1):6. https://doi.org/10.1186/1758-2946-4-6
60. Zong N, Kim H, Ngo V, Harismendy O (2017) Deep mining heterogeneous networks of biomedical linked data to predict novel drug-target associations. Bioinformatics 33(15):2337–2344. https://doi.org/10.1093/bioinformatics/btx160
61. Xu R, Wang Q (2014) Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature. J Biomed Inform 51:191–199. https://doi.org/10.1016/j.jbi.2014.05.013
62. Schriml LM, Arze C, Nadendla S, Chang YW, Mazaitis M, Felix V, Feng G, Kibbe WA (2012) Disease ontology: a backbone for disease semantic integration. Nucleic Acids Res 40(Database issue):D940–D946. https://doi.org/10.1093/nar/gkr972
63. Brown EG, Wood L, Wood S (1999) The medical dictionary for regulatory activities (MedDRA). Drug Saf 20(2):109–117
64. Canada A, Capella-Gutierrez S, Rabal O, Oyarzabal J, Valencia A, Krallinger M (2017) LimTox: a web tool for applied text mining of adverse event and toxicity associations of compounds, drugs and genes. Nucleic Acids Res. https://doi.org/10.1093/nar/gkx462
65. Iqbal E, Mallah R, Jackson RG, Ball M, Ibrahim ZM, Broadbent M, Dzahini O, Stewart R, Johnston C, Dobson RJ (2015) Identification of adverse drug events from free text electronic patient records and information in a large mental health case register. PLoS One 10(8):e0134208. https://doi.org/10.1371/journal.pone.0134208
66. Wang G, Jung K, Winnenburg R, Shah NH (2015) A method for systematic discovery of adverse drug events from clinical notes. J Am Med Inform Assoc 22(6):1196–1204. https://doi.org/10.1093/jamia/ocv102
67. Takarabe M, Kotera M, Nishimura Y, Goto S, Yamanishi Y (2012) Drug target prediction using adverse event report systems: a pharmacogenomic approach. Bioinformatics 28(18):I611–I618. https://doi.org/10.1093/bioin
formatics/bts413
Part IV
Abstract
The creation of big clinical data cohorts for machine learning and data analysis requires a number of steps
from the beginning to successful completion. Similar to data set preprocessing in other fields, there is an
initial need to complete data quality evaluation; however, with large heterogeneous clinical data sets, it is
important to standardize the data in order to facilitate dimensionality reduction. This is particularly
important for clinical data sets including medications as a core data component due to the complexity of
coded medication data. Data integration at the individual subject level is essential with medication-related
machine learning applications since it can be difficult to accurately identify drug exposures, therapeutic
effects, and adverse drug events without having high-quality data integration of insurance, medication, and
medical data. Successful data integration and standardization efforts can substantially improve the ability to
identify and replicate personalized treatment pathways to optimize drug therapy.
Key words Medication safety, Clinical data integration, Clinical comorbidity evaluation, Personalized
medication therapy
1 Introduction
Clinical big data cohorts provide data analysts with the capacity to
effectively evaluate clinical care, medication, demographic, and
health system factors which may impact clinical treatment plans
and patient outcomes. Comprehensive clinical databases can pro-
vide data representation on a variety of healthcare components. In
order to develop effective clinical big data cohorts, data integration is generally required, including health insurance coverage information and medical and pharmacy claims data, to provide sufficient data representation for identifying clinical events related to medication use and personalized medication treatment.
Large clinical data cohorts provide an opportunity for research-
ers to ask meaningful questions which can be answered with the
available data for outcomes analysis and data exploration. As with
any data set, it is important to invest sufficient resources into
2.1 Common Data Problems

Clinical data focused on medication-related outcomes vary substantially in their relative quality and utility for advanced data analysis. Since medications can be difficult to aggregate, identification of the most standardized form of drug information available in the data set provides a good starting point. In some cases, this may only be a drug name. In other cases, drug dosing quantity, dosage form, the clinical setting of use, and other data may be available. With regard to drug names, initial data quality work should evaluate nomenclature consistency. Since medications can be named by generic or trade names, it is important to identify whether the source data contain one or both names. Frequently, both the trade and generic medication names are included, ideally together with coded medication information using either standard or nonstandard coding, which may require data transformation to account for variations in dosage forms, dosage quantity, drug class, and/or therapeutic class.
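The drug-name standardization described above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration: the lookup table and the RxNorm-style codes in it are placeholders, and a real pipeline would derive the mapping from a standard terminology service such as RxNorm rather than a hard-coded dictionary.

```python
# Minimal, hypothetical sketch of normalizing trade and generic drug names to a
# single coded concept. The lookup table is illustrative only; a real pipeline
# would build it from a standard terminology such as RxNorm.
from typing import Optional

NAME_TO_CODE = {
    "atorvastatin": "RXCUI:83367",  # generic name (code shown for illustration)
    "lipitor": "RXCUI:83367",       # trade name mapped to the same concept
    "simvastatin": "RXCUI:36567",
    "zocor": "RXCUI:36567",
}

def normalize_drug_name(raw_name: str) -> Optional[str]:
    """Return a standardized drug code for a raw name, or None if unmapped."""
    return NAME_TO_CODE.get(raw_name.strip().lower())

for name in ["Lipitor", "atorvastatin", "Zocor", "aspirin"]:
    print(name, "->", normalize_drug_name(name))
```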
lab tests, the presence of abnormally low values can indicate different disease likelihoods than abnormally high laboratory values. As an example, when one looks at a high-frequency test such as a hemoglobin level, a low value may reflect blood loss, impaired blood cell formation, or nutritional insufficiency. High hemoglobin values may be related to blood overproduction disorders, chronic low oxygenation, and cardiopulmonary disease compensation, among other disorders. In this case, an abnormal value higher than normal carries a different disease risk than a value below normal. Other laboratory tests reflect more of an ordered continuum, such as measures of inflammation like the erythrocyte sedimentation rate (ESR), where higher numbers reflect a higher potential level of inflammation. Clinical experts can help identify key components for laboratory data management and data preprocessing.
5 Comorbidity Adjustment
scores in ICD9 and ICD10 which are adopted from the work of
Quan [15].
Another comorbidity adjustment approach was developed by
Elixhauser which included 30 comorbidity measures in the initial
development [17]. The Elixhauser Comorbidity Index uses ICD9
and ICD10 codes from administrative data and was developed to
predict hospital resource use and hospital mortality. The method
has changed slightly since its original development and SAS code is
available for its use from the University of Manitoba [18].
The Elixhauser and Charlson Index approaches both have the
appeal of being able to create a single score for use in assessing the
clinical comorbidity of a single subject. Broader grouper software
such as the CCS can represent a substantial amount of clinical
information in a relatively small number of groups of clinically
pertinent codes. In practical application, it can be useful to start
with either Elixhauser or Charlson Index aggregate scores along
with their individual clinical components to assess if the clinical
comorbidity adjustment is relevant in the data analysis. Inclusion
of some method of comorbidity adjustment in any clinical data
analysis work is likely to improve the modeling and acceptance of
the results by the clinical community.
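As a rough illustration of how an Elixhauser- or Charlson-style score can be assembled from administrative data, the sketch below flags comorbidity categories from ICD code prefixes and sums category weights. The prefixes and weights are abbreviated placeholders, not the validated code lists of Quan [15] or the University of Manitoba SAS macros [14, 18].

```python
# Illustrative sketch of a Charlson/Elixhauser-style comorbidity score:
# map a patient's diagnosis codes to comorbidity categories and sum weights.
# Category prefixes and weights below are abbreviated placeholders only.

COMORBIDITY_PREFIXES = {
    "congestive_heart_failure": ("I50",),   # ICD-10 prefix (placeholder)
    "diabetes_uncomplicated": ("E119",),
    "renal_disease": ("N18",),
}
WEIGHTS = {"congestive_heart_failure": 1, "diabetes_uncomplicated": 1, "renal_disease": 2}

def comorbidity_score(dx_codes):
    """Return (score, categories) for one patient's list of ICD codes."""
    flagged = set()
    for code in dx_codes:
        code = code.replace(".", "").upper()
        for category, prefixes in COMORBIDITY_PREFIXES.items():
            if code.startswith(prefixes):
                flagged.add(category)
    return sum(WEIGHTS[c] for c in flagged), flagged

print(comorbidity_score(["I50.9", "E11.9", "N18.3"]))  # -> (4, three flagged categories)
```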
Clinical comorbidity adjustment can be completed with the use of broad index tools or grouping software. However, with medication-related data analysis, it may also be useful to apply targeted clinical comorbidity assessment related to the target medication. Such an assessment focuses on factors affecting the absorption, distribution, metabolism, and excretion of medications, which may affect the efficacy of a particular medication. Since medications are typically metabolized and excreted via the liver and/or kidneys, it is important to account for systematic deficiencies in these physiological functions in the data sets at the individual patient level. This can be difficult when the organ systems can themselves be affected by the target drug, as is the case with statin medications. Since exposure to the drug can lead to a potential adverse event, there is a risk of misclassification if this is not accounted for in the analysis. To discern drug-induced adverse events from those resulting from conditions present at drug initiation, the presence of these clinical states needs to be identified prior to the initial medication prescription. Adequate screening requires reviewing a sufficient time period before the first drug prescription, typically 1 month, 3 months, 6 months, or even a year prior to the initial prescription, similar to the approach used to establish a disease-free or washout time frame, with a preference for longer time frames.
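A minimal sketch of the washout screen described above is shown below, assuming prescription fill and diagnosis dates are already available as Python date objects; the 180-day lookback window and the variable names are illustrative choices.

```python
# Sketch: flag whether a condition was already present during a lookback
# ("washout") window before the first prescription of the target drug.
from datetime import date, timedelta

def preexisting(first_fill: date, dx_dates, lookback_days: int = 180) -> bool:
    """True if any diagnosis date falls in the lookback window before first_fill."""
    window_start = first_fill - timedelta(days=lookback_days)
    return any(window_start <= d < first_fill for d in dx_dates)

first_statin_fill = date(2016, 6, 1)
myopathy_dx_dates = [date(2016, 3, 15), date(2017, 1, 2)]
print(preexisting(first_statin_fill, myopathy_dx_dates))  # True: condition predates the drug
```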
the potential dosages available for each medication. Once the pathways are defined, the usage patterns can be evaluated for continuous use, irregular use, or discontinuation of therapy. Usage patterns can be defined around prescription fill data, comparing the number of days of drug supplied with the number of days on which a patient is on medication therapy.

For the assessment of statin therapy and other chronic medications, drug initiation may be followed by subsequent changes in dose over time to improve a patient's lipid profile, to manage changes in the relative risk of cardiovascular disease, and to reduce adverse drug events which may be associated with statin medication use. Assessment of the frequency of medication prescription orders in the medical record can provide information on changes in therapy and on the likelihood that the patient is actually using the medication. Unfortunately, medication orders or prescriptions are not always directly related to actual medication use, since many prescribed medications are never filled, which can create misclassification problems in identifying drug exposures. Ideally, the data analyst has data outside the electronic record, in the form of prescription claims data, to identify prescription fills. If such data are available, they can substantially reduce the likelihood of exposure misclassification, but they do add to data preparation and data integration efforts for medical record-focused projects.
More specific measures of drug usage focusing on medication adherence can also be considered, with widely used measures including the proportion of days covered (PDC) [23] and the medication possession ratio (MPR), among others. Typical adherence measures use a ratio of the days of medication supplied to a patient divided by the number of days in the observation time period. For statin medications, the proportion of days covered method is supported by several groups, including the Pharmacy Quality Alliance and the National Quality Forum, as a preferred measure of medication adherence. The PDC is also currently used in the US Centers for Medicare and Medicaid Services star ratings. The PDC is calculated by identifying the number of days in the observation period that are covered by a medication, dividing by the number of days in the observation period, and converting this proportion to a percentage, with some specific case adjustments [24]. For chronic medication therapies such as statin drugs, a PDC of 80% or higher is generally seen as a high level of adherence, though higher values are desired for some other medications where missed doses can be particularly problematic, as with therapy for human immunodeficiency virus or with antibiotic therapy.
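The PDC calculation described above can be sketched as follows. This is a simplified illustration that counts each calendar day at most once and omits the specific case adjustments of the full PQA specification [24].

```python
# Sketch of a proportion-of-days-covered (PDC) calculation from pharmacy fills.
# Overlapping fills count a day only once; special-case adjustments are omitted.
from datetime import date, timedelta

def pdc(fills, obs_start: date, obs_end: date) -> float:
    """fills: list of (fill_date, days_supplied); returns PDC as a percentage."""
    covered = set()
    for fill_date, days_supplied in fills:
        for offset in range(days_supplied):
            day = fill_date + timedelta(days=offset)
            if obs_start <= day <= obs_end:
                covered.add(day)
    observation_days = (obs_end - obs_start).days + 1
    return 100.0 * len(covered) / observation_days

fills = [(date(2017, 1, 1), 30), (date(2017, 2, 5), 30), (date(2017, 3, 20), 30)]
print(round(pdc(fills, date(2017, 1, 1), date(2017, 3, 31)), 1))  # 80.0: meets the >=80% threshold
```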
A number of other factors can also be evaluated in personalizing medication therapy. However, many of these factors are not available in a form which can be readily integrated with
References

1. Jill Kolesar LV (2015) McGraw-Hill's 2016/2017 top 300 pharmacy drug cards. McGraw-Hill
2. ClinCalc (2017) The Top 200 of 2017. ClinCalc LLC. http://clincalc.com/DrugStats/Top200Drugs.aspx. Accessed 30 July 2017
3. Agency for Healthcare Research and Quality R, MD (2017) Medical expenditure panel survey
4. Food and Drug Administration US (2017) National drug code directory. https://www.fda.gov/drugs/informationondrugs/ucm142438.htm. Accessed 30 July 2017
5. Food and Drug Administration US (2017) Structured product labeling resources. https://www.fda.gov/ForIndustry/DataStandards/StructuredProductLabeling/default.htm. Accessed 31 July 2017
6. WHO Collaborating Centre for Drug Statistics O (2017) ATC: structure and principles. https://www.whocc.no/atc/structure_and_principles/. Accessed 31 July 2017
7. WHO Collaborating Centre for Drug Statistics O (2017) ATC/DDD index 2017. https://www.whocc.no/atc_ddd_index/. Accessed 31 July 2017
8. U.S. National Library of Medicine B (2017) RxNorm technical documentation. U.S. National Library of Medicine. https://www.nlm.nih.gov/research/umls/rxnorm/docs/2017/rxnorm_doco_full_2017-2.html. Accessed 31 July 2017
9. Svensson-Ranallo PA, Adam TJ, Sainfort F (2011) A framework and standardized methodology for developing minimum clinical datasets. AMIA Jt Summits Transl Sci Proc 2011:54–58
10. Regenstrief I (2017) LOINC: the international standard for identifying health measurements, observations, and documents. Regenstrief Institute. https://loinc.org/. Accessed 31 July 2017
11. Agency for Healthcare Research and Quality R, MD (2017) HCUP chronic condition indicator. Healthcare cost and utilization project (HCUP): chronic condition indicator (CCI) for ICD-9-CM. https://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp. Accessed 31 July 2017
12. Agency for Healthcare Research and Quality R, MD (2012) HCUP CCS fact sheet. Healthcare cost and utilization project (HCUP). https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp. Accessed 31 July 2017
13. Charlson ME, Pompei P, Ales KL, MacKenzie CR (1987) A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis 40(5):373–383
14. Manitoba Centre for Health Policy C (2016) Concept: Charlson comorbidity index. http://mchp-appserv.cpe.umanitoba.ca/viewConcept.php?conceptID=1098#a_references. Accessed 31 July 2017
15. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, Saunders LD, Beck CA, Feasby TE, Ghali WA (2005) Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care 43(11):1130–1139
16. Deyo RA, Cherkin DC, Ciol MA (1992) Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol 45(6):613–619
17. Elixhauser A, Steiner C, Harris DR, Coffey RM (1998) Comorbidity measures for use with administrative data. Med Care 36(1):8–27
18. Manitoba Centre for Health Policy C (2016) Concept: Elixhauser comorbidity index. http://mchp-appserv.cpe.umanitoba.ca/viewConcept.php?conceptID=1436. Accessed 31 July 2017
19. Chi C-L, Wang J, Clancy TR, Robinson JG, Tonellato PJ, Adam TJ (2017) Big data cohort extraction to facilitate machine learning to improve statin treatment. West J Nurs Res 39(1):42–62. https://doi.org/10.1177/0193945916673059
20. Hebert PL, Geiss LS, Tierney EF, Engelgau MM, Yawn BP, McBean AM (1999) Identifying persons with diabetes using medicare claims data. Am J Med Qual 14(6):270–277. https://doi.org/10.1177/106286069901400607
21. Center for Medicare and Medicaid Services H (2017) 2017 ICD-10-CM and GEMs. https://www.cms.gov/Medicare/Coding/ICD10/2017-ICD-10-CM-and-GEMs.html. Accessed 28 July 2017
22. Olson CH, Dierich M, Adam T, Westra BL (2014) Optimization of decision support tool using medication regimens to assess rehospitalization risks. Appl Clin Inform 5(3):773–788. https://doi.org/10.4338/ACI-2014-04-RA-0040
23. Benner JS, Glynn RJ, Mogun H, Neumann PJ, Weinstein MC, Avorn J (2002) Long-term persistence in use of statin therapy in elderly patients. JAMA 288(4):455–461
24. Nau DP (2017) Proportion of days covered (PDC) as a preferred method of measuring medication adherence. Pharmacy Quality Alliance. http://pqaalliance.org/resources/adherence.asp. Accessed 31 July 2017
Chapter 15
Abstract
The Library of Integrated Network-Based Cellular Signatures (LINCS) project aims to create a network-based understanding of biology by cataloging changes in gene expression and signal transduction. LINCS L1000 big datasets provide gene expression profiles induced by over 10,000 compounds, shRNAs, and kinase inhibitors using the L1000 platform. We developed a systematic compound signature discovery pipeline named csNMF, which covers everything from raw L1000 data processing to drug screening and mechanism generation. The discovered compound signatures of breast cancer were consistent with the LINCS KINOMEscan data and were clinically relevant. In this way, the potential mechanisms of compounds' efficacy are elucidated by our computational model.
Key words LINCS, L1000, csNMF, Compound signature, Drug signature, Compound efficacy
1 Introduction
2 Materials
3 Methods
Fig. 1 Overview of the compound signature discovery framework. This method requires raw L1000 data after
various compounds and gene knockdown treatments. The raw data after the two types of treatments are
preprocessed to yield gene expression data in Step I. In Step II, the EGEM matrix is constructed based on these
gene expression data to measure relationships among compounds and knockdown genes. This matrix is then
decomposed to a weight matrix and a coefficient matrix by the csNMF method. Protein–protein interaction
data are added in consideration of biological connections. Signatures are identified based on strongly
associated genes (i.e., those with larger values in the coefficient matrix)
Step III: csNMF signature analysis and annotation using our developed quadruple model, which reveals how the compounds in each csNMF signature alter the downstream transcription factors and cause the differential changes in the apparent gene expression patterns (see Note 2).
3.1 Step I: Raw Data Preprocessing Pipeline

The goal in Step I was to reliably process, normalize, clean, and annotate the L1000 raw data. The major challenges in this step
were reliable peak calling, normalization and quality control, and
the computational burdens for processing big raw data. A GMM
peak calling approach was developed for reliable peak calling from
raw L1000 data [6]. The raw data for each sample were deconvo-
luted, and the fluorescent intensity peak corresponding to each
mRNA probe was identified using the GMM model, annotated
with the gene symbol, probe ID, gene description, and the analyte
and L1000 probe set information [7]. This information was then
[Fig. 2, panel A (flowchart): Level 1 raw data (.lxb) → peak calling (GMM) → Level 2 raw gene expression (.csv and .gct) → quantile normalization (removing batch effects) → quality control (experiment quality) → Level 3 perturbagen-induced gene expression pattern (.gct)]
Fig. 2 Overview of the data preprocessing framework. The raw Luminex data are transformed into gene
expression data by the GMM peak calling method. Quantile normalization is then performed to reduce the
batch effects, and quality control is executed to filter out poor-quality data
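The GMM peak-calling idea can be illustrated with a small simulation: each L1000 analyte measures two genes mixed at an unequal ratio, so fitting a two-component Gaussian mixture to the bead fluorescence intensities separates the two expression peaks. This is a conceptual sketch using scikit-learn, not the authors' production pipeline.

```python
# Sketch of GMM-based peak calling: fit a two-component Gaussian mixture to
# simulated bead log-intensities for one analyte and report the two peaks.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated log-intensities: two genes with distinct peaks and a 2:1 bead ratio.
intensities = np.concatenate([rng.normal(7.0, 0.3, 200), rng.normal(9.5, 0.3, 100)])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(intensities.reshape(-1, 1))

order = np.argsort(gmm.means_.ravel())
print("peak means:", gmm.means_.ravel()[order])       # expression peaks of the two genes
print("peak proportions:", gmm.weights_[order])       # should approximate the mixing ratio
```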
3.2 Step II: Compound Signature Discovery by EGEM-Based csNMF Model

3.2.1 EGEM Score and the EGEM Matrix

Enrichment of Gene Effect to a Molecule (EGEM) was developed to identify proteins closely related to cellular responses to a small-molecule compound, using the LINCS L1000 landmark gene expression data. A small-molecule compound affects a cell by directly or indirectly changing the activities and functions of its target proteins, which drive downstream biological events and finally alter cellular gene expression patterns. We hypothesized that the knockdown of a gene that is closely related to the target proteins of a small-molecule compound induces similar gene expression pattern changes. Thus, identification of such genes could reveal the mechanisms of cellular responses to these compounds and predict their pharmaceutical potential. We defined the "target genes" of a compound in a general sense: the corresponding proteins of such genes could be either the real drug targets or upstream or downstream proteins closely related to the real targets. The data for 3000 single-gene knockdown experiments were used as the target gene reference library, and the data for compound treatments were profiled against this reference library to identify possible target genes of the corresponding small-molecule compounds.
We defined the EGEM score to describe the similarity between the treatments of a compound and of an shRNA targeting a gene, using the mutual enrichment of their resultant differentially expressed landmark genes. The EGEM metric was derived from the rank-based gene set enrichment analysis (GSEA) [8] and the connectivity analysis [9]. Compound treatments can be taken as "phenotypes" and the differentially expressed genes (DEGs) of a single-gene knockdown treatment as a "signature gene set" in the GSEA terminology. The EGEM metric enabled gene set enrichment analysis against the LINCS target gene reference library. The construction of the EGEM score is shown in Fig. 3, where the signature gene set of a target gene is composed of the n DEGs observed after the knockdown of that gene. Among them, t_up (t_1 in Eq. 1) were upregulated and t_down (t_2) downregulated. DEGs were detected according to the log fold changes (LFCs) of the L1000 landmark genes using 1.5 IQR (interquartile range) as the threshold, which is robust against outliers. For a small-molecule compound, two lists of landmark genes were used to represent the patterns of the compound-induced L1000 gene expression changes: one (p_up) was sorted in ascending order and the other (p_down) in descending order of the LFCs. Here p_1(i) and p_2(j) are the positions of the ith up- and jth downregulated DEGs, respectively, in their corresponding probe gene lists. The EGEM score was defined in Eq. 1:
Fig. 3 EGEM score construction. The up- and down-DEGs after a gene knockdown treatment are used as two
feature sets. The locations of up and down feature sets in the ascendant- and descendant-sorted gene list
after a compound treatment are measured by the Kolmogorov-Smirnov statistic. The value is normalized by the
total size of up and down feature sets
EGEM = \max_{i=1:t_1} \left[ \frac{i}{t_1 + t_2} - \frac{p_1(i) - i}{2n - t_1 - t_2} \right] + \max_{j=1:t_2} \left[ \frac{j}{t_1 + t_2} - \frac{p_2(j) - j}{2n - t_1 - t_2} \right]    (1)
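A direct transcription of Eq. 1 is sketched below. It assumes that p_1 and p_2 hold the 1-based positions of the knockdown signature's up- and downregulated DEGs in the ascending- and descending-sorted compound gene lists, and that n is the length of the ranked landmark gene list; these interpretations follow the description above but are stated here as assumptions.

```python
# Direct transcription of the EGEM score in Eq. 1 (positions are 1-based).
def egem_score(p1, p2, n):
    """p1, p2: position lists (lengths t1, t2); n: length of the ranked gene list."""
    t1, t2 = len(p1), len(p2)
    up = max(i / (t1 + t2) - (p1[i - 1] - i) / (2 * n - t1 - t2) for i in range(1, t1 + 1))
    down = max(j / (t1 + t2) - (p2[j - 1] - j) / (2 * n - t1 - t2) for j in range(1, t2 + 1))
    return up + down

# Toy example: 978 landmark genes, 5 up- and 4 down-DEGs found near the extremes
# of the compound-induced ranking (strong mutual enrichment -> score near 1).
print(egem_score([1, 2, 3, 5, 8], [1, 2, 4, 6], n=978))
```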
3.2.3 Simultaneous Clustering of Positive and Negative EGEM Scores

A co-module may consist of both positive and negative EGEM scores, as long as they are significant and consistent across the compounds in the same module, but canonical NMF approaches can only accept nonnegative values. To simultaneously handle both positive and negative EGEM scores, from the original EGEM matrix A we extracted the positive EGEM scores into the similar EGEM matrix A_S and the absolute values of the negative EGEM scores into the reverse EGEM matrix A_R, both of the same dimensions as A. Both EGEM matrices appear in the overall objective function above and were simultaneously optimized during the iterative
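The splitting of the EGEM matrix into A_S and A_R can be illustrated with standard nonnegative matrix factorization. The sketch below is a simplified stand-in that factorizes the two parts independently with scikit-learn's NMF; it is not the joint csNMF optimization described in this chapter.

```python
# Split an EGEM-like matrix into positive (A_S) and |negative| (A_R) parts and
# factorize each with standard NMF. A simplified stand-in for csNMF.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
A = rng.normal(0, 0.3, size=(20, 15))        # toy EGEM matrix (compounds x genes)

A_S = np.clip(A, 0, None)                    # positive scores
A_R = np.clip(-A, 0, None)                   # absolute values of negative scores

model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W_S = model.fit_transform(A_S)               # weight matrix for A_S
H_S = model.components_                      # coefficient matrix for A_S
W_R = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500).fit_transform(A_R)

print(W_S.shape, H_S.shape, W_R.shape)       # (20, 3) (3, 15) (20, 3)
```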
3.3 Step III: Compound Signature Analysis

In order to examine the biomedical relevance and the pharmaceutical potential of the detected compound signatures, we proposed quadruple models to reveal the molecular events associated with compound signatures.

A compound impacts the functions of its target proteins directly or indirectly, triggers regulatory networks, alters the activities of downstream transcription factors, and thus changes the gene expression patterns. To reveal such an underlying mechanism of signatures, our proposed quadruple model gives the included compound, its direct and indirect targets, the downstream transcription factors, and the affected genes, as shown in Fig. 4. In Fig. 4, transcription factors for each signature were identified by enrichment analysis of the signature-associated genes using ChIP enrichment analysis, setting a p-value of less than 0.05
Fig. 4 Quadruple models and Signature 2 in a kinome inhibitor study. A quadruple model simultaneously
includes a compound, its targets, related transcription factors, and the resulting gene expression pattern. This
compound signature discovery method (red) can detect similar quadruples (blue). The similar quadruples
include four compounds with similar target sets. Of the 109 related TFs of the quadruples, 90 are covered by the enriched TFs of the signature genes
and ratios of the interacting genes to all genes that exceeded 0.1
[15]. The quadruples of compound signatures were thus con-
structed. The biomedical relevance of a typical signature (Signature
2) was validated by comparing the predicted transcription factors
from signature target genes with the enriched transcription factors
derived from the direct measurement of kinase targets of four
kinase inhibitors (ALW-II-38-3, ALW-II-49-7, QL-XI-92, and
CP724714) in this signature [16].
The compound signatures were composed of compounds and their associated target genes. Compounds in a given signature shared similar target genes and thus perturbed cell functions in similar ways for the corresponding cancer cell line. If some compounds had already demonstrated effectiveness for this type of cancer, the other compounds in the signature were more likely to be promising drug candidates. We used FDA-approved chemotherapy drugs for breast cancer to identify breast cancer-specific compound signatures and examined the drug potential of the corresponding drugs [17]. Functions of a signature could also be revealed by functional enrichment among its target genes. Signatures that demonstrated anti-oncological functions [18], such as reduced cell proliferation, increased cell death, and induced apoptosis, were more likely to point to potential drugs. We utilized the DAVID gene functional annotation tool [19] to annotate the functions of compound signatures and identify antitumor signatures.
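The functional annotation step relies on overlap-enrichment statistics of the kind computed by ChEA and DAVID. The sketch below shows a generic hypergeometric test of the overlap between a signature's target genes and an annotated gene set; the gene counts are arbitrary illustrative numbers, and the real tools apply additional corrections.

```python
# Generic hypergeometric enrichment test: how unlikely is the observed overlap
# between a signature's target genes and a functional gene set under a random
# sampling null? Counts below are illustrative only.
from scipy.stats import hypergeom

total_genes = 978          # landmark gene universe (illustrative)
signature_genes = 60       # genes associated with one compound signature
annotation_set = 45        # genes annotated to a TF or GO term
overlap = 12               # genes in both sets

# P(overlap >= 12) under sampling without replacement.
p_value = hypergeom.sf(overlap - 1, total_genes, annotation_set, signature_genes)
print(f"enrichment p-value: {p_value:.3g}")
```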
3.4 Sample Application of Compound Signature Mining

3.4.1 Signatures and Quadruples for Kinase Inhibitors

We used the kinase inhibitor dataset to validate the concept of the compound signatures discovered by the EGEM-based csNMF approach. We chose this dataset because some kinase inhibitors had been experimentally profiled to identify their direct kinase targets and thus could be used to validate the predictions of the csNMF modeling. The 51 kinase inhibitors were analyzed against the 3341-target gene reference library. In all, we detected eight compound signatures, Signatures 1–8: PD173074, CP724714, QL-X-138, PLX-4720AZD1152, A443644WZ3105, WZ7043XMD11, PD0332991, and HGSSUCLG1.
3.4.3 Clinical Relevance of Compound Signatures

We examined the associations of the discovered compound signatures with patient survival and other clinical traits. Clinical features
and gene expression profiles of 2116 breast cancer patients col-
lected from Belgium, England, and Singapore (GEO:GSE45255)
were examined by the gene set enrichment of the eight discovered
breast cancer-related compound signatures. For example, in terms
of distant metastasis-free survival, patients in the Signature 4Low
category responded poorly to chemotherapy compared with those
in the Signature 4High category (Fig. 5c). Signature
4 (PLX-4720AZD1152) was selectively associated with chemo-
therapy but not hormone therapy (tamoxifen). We performed a
univariable and multivariable survival analysis using discovered
compound signatures as well as conventional clinical features
including patient age, tumor size, PAM50 as well as molecular
subtypes, lymph node involvement, the ER status, and the patho-
logical grades. The results suggested that the compound Signatures
4 and 5 (A443644WZ3105) are strongly associated with poor
Fig. 5 Breast cancer compound signatures. (a) Eight signatures were detected (yellow rectangles). For each
signature, compounds (columns) and genes (rows) corresponding to a red region showed similar gene
expression effects, whereas those corresponding to a green region exhibited reverse effects. (b) Degree of
yellow represents the relative enrichment for the related gene ontology (GO) terms. (c) Associations of
Signature 4 with drug responses and survival in data from 2116 breast cancer patients collected from
Belgium, England, and Singapore (GEO:GSE45255)
prognosis for patients with chemotherapies but not for those with
tamoxifen treatment. The analysis results were consistent with the
drug response and survival results shown in Fig. 5. Signatures also
demonstrated associations with breast cancer subtypes (Signature
2: CP724714) and receptor status (Signatures 3 (QL-X-138) and
6 (WZ7043XMD11) with estrogen receptor status). Such associa-
tion results demonstrate the clinical potential of the compound
signatures discovered in the MCF-7 breast cancer cell line model.
Follow-up investigations could include testing the underlying
mechanisms for the poor prognosis of patients in the Signature
4Low category, by further studying the predicted target genes
using the established Signature 4 (PLX-4720AZD1152) quadruple
model (see Note 3).
4 Notes
Acknowledgment
References

1. Downey W, Liu C, Hartigan J (2010) Compound profiling: size impact on primary screening libraries. Drug Discovery World, pp 81–86
2. Lehmann BD et al (2011) Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest 121(7):2750–2767
3. Duan Q et al (2014) LINCS canvas browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Res 42:W449–W460
4. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313:1929–1935
5. Peng HM, Zhao WL, Tan H, Ji ZW, Li JS, Li K, Zhou XB (2016) Prediction of treatment efficacy for prostate cancer using a mathematical model. Sci Rep 6:21599
6. Liu CL, Su J, Yang F, Wei K, Ma JW, Zhou XB (2015) Compound signature detection on LINCS L1000 big data. Mol BioSyst 11:714–722
7. Ji ZW, Wu D, Zhao WL, Peng HM, Zhao SJ, Huang DS, Zhou XB (2015) Systemic modeling myeloma-osteoclast interactions under normoxic/hypoxic condition using a novel computational approach. Sci Rep 5:13291
8. Subramanian A et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550
9. Lamb J et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935
10. Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems
11. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
12. Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12):1495–1502
13. Mering CV et al (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33(suppl 1):D433–D437
14. You ZH, Li JQ, Gao X, He Z, Zhu L, Lei YK, Ji ZW (2015) Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines. Biomed Res Int 2015:867516
15. Lachmann A et al (2010) ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics 26(19):2438–2444
16. Shao HW, Peng T, Ji ZW, Su J, Zhou XB (2013) Systematically studying kinase inhibitor induced signaling network signatures by integrating both therapeutic and side effects. PLoS One 8(12):e80832
17. Ji ZW, Su J, Wu D, Peng HM, Zhao WL, Zhou XB (2017) Predicting the impact of combined therapies on myeloma cell growth using a hybrid multi-scale agent-based model. Oncotarget 8:7647–7665
18. Gerl R, Vaux DL (2005) Apoptosis in the development and treatment of cancer. Carcinogenesis 26(2):263–270
19. Huang D et al (2007) The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 8(9):R183
20. Siddiqa A et al (2008) Expression of HER-2 in MCF-7 breast cancer cells modulates anti-apoptotic proteins Survivin and Bcl-2 via the extracellular signal-related kinase (ERK) and phosphoinositide-3 kinase (PI3K) signalling pathways. BMC Cancer 8(1):129
Chapter 16
Abstract
The Library of Integrated Network-Based Cellular Signatures (LINCS) project aims to create a network-based understanding of biology by cataloging changes in gene expression and signal transduction. Gene expression and proteomic data in LINCS L1000 are cataloged for human cancer cells treated with compounds and genetic reagents. To understand the related cell pathways and facilitate drug discovery, we developed binary linear programming (BLP) to infer cell-specific pathways and identify compounds' effects using L1000 gene expression and phosphoproteomics data. A generic pathway map for the MCF7 breast cancer cell line was built. From this map, BLP extracted the cell-specific pathways, which reliably predicted the compounds' effects. In this way, the potential drug effects are revealed by our models.
Key words LINCS, L1000, Binary linear programming, Drug effect, Cell-specific pathway
1 Introduction
2 Materials
3 Methods
The workflow of the whole approach includes three steps, which are
shown in Fig. 1.
In step one, we inferred the potential targets of the compounds from L1000 gene expression profiles and information from the literature and then created the corresponding downstream pathways of those inferred targets by integrating the PPI, transcription factor, and KEGG pathway databases [5]. We also searched the pathways related to the MCF7 and PC3 cell lines in IPA (http://www.ingenuity.com) and the literature [6–8]. After that, we integrated these pathways, the inferred targets, and their downstream pathways together to construct a generic pathway map.
Fig. 1 The flow chart of the proposed approach to infer a cell-type specific pathway map and to identify a
compound’s effects
3.1 Binary Linear Programming (BLP)

Here, we describe how the Boolean model can be reformulated as a BLP to optimize the cell-specific pathways. Two reports in the literature [9, 10] used a Boolean model to optimize the generic pathway map under stimulation with different combined cytokines. However, their models are designed only for phosphoproteomics data in the early stage of signal transduction. To infer a cell-specific pathway map using the P100 database with only one time point (6 h), we assume a virtual time point (t) before 6 h at which most enzyme activities have reached saturation after phosphorylation. The cell-specific pathways inferred by our BLP approach correspond to the topological structure in this saturated condition of the early response. The observed time point (6 h) can be represented as t + 1, indicating the mid-stage signaling response. In our BLP approach, we employ binary variables to describe the phosphorylation states of enzymes and of the reactions (activated or inhibited). We also use binary linear constraints to model the relationship between the early response at t and the mid-stage response at t + 1. According to the concept of the Hill function [11], there are three scenarios for the state of enzyme x at times t and t + 1 in our BLP approach:

(a) Equation 1 suggests that x is activated by its upstream enzyme in the early stage and reaches a steady state at time t, and its activity is unchanged until time t + 1.

x(t) = x(t + 1) = 1    (1)

x(t) = 1, x(t + 1) = 0    (2)

x(t) = x(t + 1) = 0    (3)
The states of all proteins at time t will completely satisfy the
causal relationships in our constraint set. The change of states for
each measured phosphoprotein is also considered between two
3.2 Sample Application of Identifying Compound Effects

3.2.1 Construction of a Generic Pathway Map

Figure 2a shows the generic pathway map, which is composed of several important pathways. The estrogen pathway induces tumor growth in estrogen receptor-positive breast cancers [12]; the PI3K/AKT pathway is an important player in cell survival [7]; TNF signaling is an anticancer-related pathway [13]; MEK/ERK pathways are usually associated with proliferation and antiapoptosis; the NFkB pathway is involved in many cell functions, such as cell proliferation, cell survival, and cellular stress [14]; the JNK/p53/p21 pathway may induce cell apoptosis [6]; and HDAC1/
Fig. 2 Boolean network topologies of the generic and inferred cell-specific pathways for the MCF7 cell line. (a)
The MCF7 generic pathway map included some important classic pathways. The edges with green color were
potential downstream pathways of some compounds. (b) After optimization via BLP, the red nodes and gray
dash lines were removed from generic pathway map so that the cell-specific pathways were obtained
3.2.2 Inference of Cell-Specific Pathways by BLP

To infer cell-specific pathways based on the constructed generic pathway map, we then minimized the differences between the
measurements and the simulated values, as well as the complexity
of the signaling pathway structure’s topology. We developed the
BLP approach to optimize such multi-objective functions. The
concept behind BLP is that the states of the proteins (variables)
are normalized to binary numbers (activated state or no activation);
edges between two proteins are also represented as binary numbers
(inhibition or promotion); binary linear constraints are used to
describe the relationship between upstream and downstream pro-
teins; and the optimization is done with binarized values taken by
variables, edges, and constraints. The BLP is solved with the opti-
mization toolbox in MATLAB that guarantees minimal differences
between phosphoproteomics data and predicted data, as well as the
Boolean topology of the generic pathway map. Because P100 data
are captured at the mid-stage of signal transduction, we developed
constraints to simulate the change from early to mid-stage so that
we can still obtain many important causal relationships of phos-
phorylation. The fitting precision of the optimized cell-specific
pathways is 87.66% for MCF7, which proves that our model
works well on mid-stage phosphoproteomics data.
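To make the BLP idea concrete, the sketch below sets up a toy problem with the PuLP modeling library: binary variables encode node states and edge inclusion, linear constraints tie a node's activity to at least one retained edge from an active upstream node, and the objective trades off data mismatch against topology size. This is a conceptual illustration under simplified assumptions, not the MATLAB formulation used in the study.

```python
# Toy binary linear program in the spirit of the BLP described above: choose
# which pathway edges to keep so that predicted binary node states match the
# measured phosphoprotein states, with a small penalty on retained edges.
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum, PULP_CBC_CMD

nodes = ["STIM", "A", "B"]
edges = [("STIM", "A"), ("A", "B"), ("STIM", "B")]
measured = {"A": 1, "B": 0}          # observed binary phospho-states (illustrative)

prob = LpProblem("toy_pathway_blp", LpMinimize)
x = {v: LpVariable(f"x_{v}", cat=LpBinary) for v in nodes}            # node active?
y = {e: LpVariable(f"y_{e[0]}_{e[1]}", cat=LpBinary) for e in edges}  # edge kept?
z = {e: LpVariable(f"z_{e[0]}_{e[1]}", cat=LpBinary) for e in edges}  # kept AND source active

prob += x["STIM"] == 1               # the stimulus node is active
for (u, v) in edges:                 # z = y AND x[u] (standard linearization)
    prob += z[(u, v)] <= y[(u, v)]
    prob += z[(u, v)] <= x[u]
    prob += z[(u, v)] >= y[(u, v)] + x[u] - 1
for v in ["A", "B"]:                 # a node can be active only via some kept edge
    prob += x[v] <= lpSum(z[(u, w)] for (u, w) in edges if w == v)

mismatch = lpSum((1 - x[v]) if m == 1 else x[v] for v, m in measured.items())
prob += mismatch + 0.1 * lpSum(y.values())    # fit the data, prefer a sparse topology

prob.solve(PULP_CBC_CMD(msg=False))
print({e: int(y[e].value()) for e in edges})  # expected: keep STIM->A only
```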
Figure 2b shows the inferred cell-specific pathways of MCF7. The blue nodes are the measured phosphoproteins, while the red nodes and gray dotted lines were removed after optimization. After our BLP system simultaneously optimized the two objective functions with the P100 data, we kept those reactions whose edges exist for some compounds and removed those reactions whose edges do not exist for any compound. For example, HDAC1 inhibitor-induced JNK activation in turn activates the downstream pathways p53 and p21 and eventually results in cell cycle arrest. Another example is that the reaction between cJUN and p53 (cJUN decreases the expression of p53) did not appear in the pathways induced by any compound, so we removed this reaction, although it does exist under certain conditions [16]. In addition, to keep the proteins at the end of each pathway as measured phosphoproteins, all other proteins at the end of each pathway that were not measured in P100 were removed (e.g., the (ERK AND CK2) → ELK-1 reaction in Fig. 2b). Thus, the inferred cell-specific pathway map was smaller but contained only those elements that fit the experimental evidence very well. When this approach was applied to the PC3 cell line, the goodness of data-fitting on the inferred cell-specific pathways was 90.91%.
4 Notes
Fig. 3 The compound-induced topological alterations in the MCF7-specific pathways revealed by BLP. The
treatment effects of four compounds on MCF7 cell line were shown in the figure. Red arrows denote that these
reactions were blocked after treatment with compounds. (a) Compound Trichostatin A; (b) Compound MS-275;
(c) Compound Staurosporine; (d) Compound Digoxigenin
pathways for underlying disease states. All the data are from 15 cancer cell lines on 1000 carefully chosen landmark genes, which can reduce the number of measurements and will not be biased toward a particular cellular model.
2. We developed a binary linear programming (BLP) approach to infer the best-fitting cell-specific signaling pathways from perturbation-induced topological structures. We believe that BLP can complement standard biochemical drug profiling assays
Acknowledgment
References

1. Hongwei Shao TP, Ji Z, Jing S, Zhou X (2013) Systematically studying kinase inhibitor induced signaling network signatures by integrating both therapeutic and side effects. PLoS One 8(12):e80832
2. Saez-Rodriguez J, Goldsipe A, Muhlich J, Alexopoulos LG, Millard B et al (2008) Flexible informatics for linking experimental data to mathematical models via DataRail. Bioinformatics 24:840–847
3. Hendriks BS, Espelin CW (2010) DataPflex: a MATLAB-based tool for the manipulation and visualization of multidimensional datasets. Bioinformatics 26:432–433
4. Ji ZW, Su J, Liu CL, Wang HY, Huang DS, Zhou XB (2014) Integrating genomics and proteomics data to predict drug effects using binary linear programming. PLoS One 9(7):e102798
5. Ogata H, Goto S, Fujibuchi W, Kanehisa M (1998) Computation with the KEGG pathway database. Biosystems 47:119–128
6. Perfettini JL, Castedo M, Nardacci R, Ciccosanti F, Boya P et al (2005) Essential role of p53 phosphorylation by p38 MAPK in apoptosis induction by the HIV-1 envelope. J Exp Med 201:279–289
7. Su JS, Woods SM, Ronen SM (2012) Metabolic consequences of treatment with AKT inhibitor perifosine in breast cancer cells. NMR Biomed 25:379–388
8. Xue LY, Chiu SM, Oleinick NL (2003) Staurosporine-induced death of MCF-7 human breast cancer cells: a distinction between caspase-3-dependent steps of apoptosis and the critical lethal lesions. Exp Cell Res 283:135–145
9. Mitsos A, Melas IN, Siminelakis P, Chairakaki AD, Saez-Rodriguez J et al (2009) Identifying drug effects via pathway alterations using an integer linear programming optimization formulation on phosphoproteomic data. PLoS Comput Biol 5:e1000591
10. Saez-Rodriguez J, Alexopoulos LG, Epperlein J, Samaga R, Lauffenburger DA et al (2009) Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol Syst Biol 5:331
11. Mather W, Bennett MR, Hasty J, Tsimring LS (2009) Delay-induced degrade-and-fire oscillations in small genetic circuits. Phys Rev Lett 102:068105
12. Giacinti L, Giacinti C, Gabellini C, Rizzuto E, Lopez M et al (2012) Scriptaid effects on breast cancer cell lines. J Cell Physiol 227:3426–3433
13. Rodriguez-Berriguete G, Fraile B, Paniagua R, Aller P, Royuela M (2012) Expression of NF-kappaB-related proteins and their modulation during TNFalpha-provoked apoptosis in prostate cancer cells. Prostate 72:40–50
14. Courtois G, Gilmore TD (2006) Mutations in the NF-kappa B signaling pathway: implications for human disease. Oncogene 25:6831–6843
15. Wang GL, Salisbury E, Shi X, Timchenko L, Medrano EE et al (2008) HDAC1 promotes liver proliferation in young mice via interactions with C/EBPbeta. J Biol Chem 283:26179–26187
16. Schreiber M, Kolbus A, Piu F, Szabowski A, Mohle-Steinlein U et al (1999) Control of cell cycle progression by c-Jun is p53 dependent. Genes Dev 13:607–619
17. Coulonval K, Bockstaele L, Paternot S, Roger PP (2003) Phosphorylations of cyclin-dependent kinase 2 revisited using two-dimensional gel electrophoresis. J Biol Chem 278:52052–52060
18. Rosato RR, Almenara JA, Grant S (2003) The histone deacetylase inhibitor MS-275 promotes differentiation or apoptosis in human leukemia cells through a process regulated by generation of reactive oxygen species and induction of p21(CIP1/WAF1). Cancer Res 63:3637–3645
19. Opiteck GJ, Scheffler JE (2004) Target class strategies in mass spectrometry-based proteomics. Expert Rev Proteomics 1:57–66
20. Chiara DG, Marcocci ME, Torcia M, Lucibello M, Rosini P et al (2006) Bcl-2 phosphorylation by p38 MAPK - identification of target sites and biologic consequences. J Biol Chem 281:21353–21361
Chapter 17
Abstract
Human diseases are historically categorized into groups based on the specific organ or tissue affected. Over
the past two decades, advances in high-throughput genomic and proteomic technologies have generated
substantial evidence demonstrating that many diseases are in fact markedly heterogeneous, comprising
multiple clinically and molecularly distinct subtypes that simply share an anatomical location. Here, a
Bayesian network analysis is applied to study comorbidity patterns that define disease subtypes in pediatric
pulmonary hypertension. The analysis relearned established subtypes, thus validating the approach, and
identified rare subtypes that are difficult to discern through clinical observations, providing impetus for
deeper investigation of the disease subtypes that will enrich current disease classifications. Further advances
linking disease subtypes to therapeutic response, disease outcomes, and the molecular profiles of individual subtypes will drive the development of more effective and targeted therapies.
1 Introduction
1.1 The Emergence of Network Medicine

The whole is more than the sum of its parts. (Aristotle)

Network science has emerged in diverse disciplines as a means of
analyzing complex relational data. Building on graph theory, a
network is a graphical model comprising a set of nodes and con-
nections between them. Network nodes represent random variables
to be modeled; network edges connecting nodes can have many
interpretations ranging from physical interactions to mathematical
relationships. Combinatorial analysis of these relationships using
graph theoretics can uncover structure and patterns, including both
local properties and global phenomena, providing insights into the
characteristics of complex systems (Box 1).
In the context of biological networks, network nodes represent
biological entities (e.g., genes, proteins, diseases), and edges
between nodes represent biological relationships (e.g., gene corre-
lations, protein-protein interactions, functional associations). Pub-
lished studies have shown that networks operating in biological
systems are not random, but are characterized by a set of organizing
principles such that network topology and connectivity can convey
biologically important information about the network entities
[4–10]. For example, analysis of gene co-expression networks,
whereby genes are modeled as nodes and edges between nodes
Table 1
Major PH subtypes defined in the WHO classification of PH
Table 2
Recommended treatment for each PH subtype [27]
Table 3
Panama classification of pediatric pulmonary hypertensive vascular
disease
2 Materials
3 Methods
3.1 Defining Comorbidity Profiles of Patients with PH

The first step involved extracting the comorbidity profile of the study subjects. The presence of a comorbidity was defined as having two or more healthcare encounters related to the condition, iden-
tified using ICD-9 codes. To select conditions that were most
relevant to PH, the analysis was confined to conditions that were
significantly associated with PH compared with the general popu-
lation of children without PH. Comorbidities significantly asso-
ciated with PH were systematically identified from the data source
using the chi-square statistic to compare their prevalence among
children with and without PH. The α level of 0.05 was used to
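A sketch of this screening step is shown below: a chi-square test on a 2 x 2 table of condition presence versus PH status, with the odds ratio computed from the same table. The counts are illustrative only.

```python
# Chi-square screen of one candidate comorbidity against PH status.
import numpy as np
from scipy.stats import chi2_contingency

#                 condition+   condition-
table = np.array([[40,          60],        # children with PH
                  [200,       9800]])       # children without PH

chi2, p_value, dof, _ = chi2_contingency(table)
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print(f"chi2={chi2:.1f}, p={p_value:.2e}, OR={odds_ratio:.1f}")
# Conditions with p < 0.05 (and, later in the analysis, a lower 95% CI bound of
# the odds ratio above five) would be retained as candidate comorbidities.
```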
3.2 Constructing Bayesian Comorbidity Network

The next step involved constructing a Bayesian network to model the interdependencies of the comorbidities in children with PH. The following subsections provide a brief overview of
Bayesian network and the specific techniques applied in the current
study.
3.2.1 Bayesian Network Overview

A Bayesian network is a probabilistic graphical model that represents a joint probability distribution of a set of random variables
[44]. The network structure consists of nodes representing ran-
dom variables and directed edges between nodes representing
probabilistic dependencies (Box 2). The absence of a link
between two nodes signifies conditional independence. The
edge strength indicates the relative magnitude of the depen-
dency between two variables. The edge directionality, when
present, is not intended to imply causation. Rather, an edge
from node xi to xj can be interpreted as the presence of xi
“influences” the occurrence of xj.
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \pi_i)
3.2.2 Bayesian Network Learning

The problem of learning a Bayesian network can be stated as follows: given a dataset, find a network that best represents the
dataset. The construction of a network entails a two-step process:
structure learning and parameter estimation. Structure learning
involves determining the network structure that most accurately
reflects the observed data. There are two main approaches to struc-
ture learning [45–49]:
1. Constraint-based approach applies statistical tests to establish
conditional dependence and independence relationships
among the variables in a model; these relationships form a set
of edge constraints; the algorithm then finds the best directed
acyclic graph (DAG) that satisfies the constraints.
2. Score-based approach defines a network scoring function to
evaluate the goodness of fit of candidate DAGs with respect to
the data; the method then searches over the space of DAGs for
a structure that maximizes the score.
Hybrid algorithms that combined both approaches have also
been developed, whereby constraint-based algorithms are used to
reduce the space of candidate DAGs and network scores are used to
identify the optimal DAG [50, 51].
In the current study, a score-based structure learning algorithm
was applied to search for the best-fit model. Specifically, the Bayes-
ian Dirichlet equivalent score (with equivalent sample size of ten)
[52, 53] was calculated for each candidate network to measure its
goodness of fit, and the network with the highest score was selected
(Box 3).
To search for high-scoring structures, standard heuristic search
techniques can be applied. Local heuristic search strategies are the
most commonly applied method: starting from an initial feasible
solution, an iterative search is performed to successively improve
the solution through a series of local modifications (i.e., edge
addition, deletion, and reversal) until an optimum is reached.
Such an approach often only finds local optima, which are not
necessarily the best possible solution (i.e., the global optimum). A
range of algorithms have been developed to overcome this limita-
tion, one of which is “tabu search” [54]—the algorithm used in the
current study. To explore solution space beyond local optimality,
the tabu search learning process applies an iterative local search
procedure but maintains a “tabu list” of previously visited solutions
that had been found suboptimal; solutions in the “tabu list” are
prohibited in future moves, thus permitting the search procedure
to escape local optima.
where
i ∈ {1, ..., n}, j ∈ {1, ..., r_{π_i}}, k ∈ {1, ..., r_i};
r_i = number of categories of x_i, and π_i = parents of x_i;
n_ijk is the count of elements in D containing both x_ik and π_ij, and n_ij = Σ_k n_ijk;
α = (α_ijk) for all i, j, k are the Dirichlet hyperparameters, where α_ij = Σ_k α_ijk.
The Bayesian Dirichlet equivalent score assumes α_ijk = α* · P(Θ_ijk | G), where α* is the equivalent sample size.
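The closed form of this score is the standard Bayesian Dirichlet marginal likelihood, a product over parent configurations j of Γ(α_ij)/Γ(α_ij + n_ij) · Π_k Γ(α_ijk + n_ijk)/Γ(α_ijk). The sketch below computes the corresponding local log-score for a single node from a table of counts n_ijk, assuming the common BDeu choice of uniform hyperparameters α_ijk = α*/(r_{π_i} · r_i); it is a conceptual illustration, not the implementation used in the study.

```python
# Local log-score of the Bayesian Dirichlet equivalent (uniform) metric for one
# node, computed from counts n_ijk with uniform Dirichlet hyperparameters.
import numpy as np
from scipy.special import gammaln

def bdeu_local_score(counts: np.ndarray, ess: float = 10.0) -> float:
    """counts[j, k] = n_ijk for one node: parent configuration j, node state k."""
    q_i, r_i = counts.shape                    # parent configurations, node categories
    a_ijk = ess / (q_i * r_i)                  # uniform Dirichlet hyperparameters
    a_ij = ess / q_i
    n_ij = counts.sum(axis=1)
    score = np.sum(gammaln(a_ij) - gammaln(a_ij + n_ij))
    score += np.sum(gammaln(a_ijk + counts) - gammaln(a_ijk))
    return float(score)                        # log marginal likelihood contribution

# Toy counts for a binary node with one binary parent (2 configurations x 2 states).
print(bdeu_local_score(np.array([[30, 5], [4, 25]])))
```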
Once the best-fit network structure has been selected, the next
step in the Bayesian network construction process—parameter esti-
mation—involves finding a set of probability distribution para-
meters for the learned network that best explains the observed
data using Bayes estimation [55]. Given the learned conditional
distribution parameters, the strength of the relations between node
3.2.3 Model Averaging Technique

In finding the “best-fit” model, over-fitting can occur when the resultant model describes random error or noise instead of the
underlying distribution of the data. To improve the statistical
robustness of the analysis, the following strategies were applied.
First, instead of building a single model, a model averaging tech-
nique [56] was used where multiple best-scored networks were
developed using 1000 subsamples of the dataset generated through
bootstrap resampling; the final model was estimated by averaging
over the highest scoring networks, such that only network edges
(i.e., comorbidity relations) that were statistically significant were
selected for inclusion [57], as described in Box 4. The parameters of
the selected edges were then estimated using the full dataset in the
final network. This technique allows the identification of network
features that are robust to perturbations of the observations
[56, 57]. Second, prior knowledge about the biology of PH
informed construction of the set of comorbidities used for devel-
oping the network. For example, it is well-established that right and
left heart diseases have distinct disease pathophysiology. Thus,
diagnosis codes pertaining to left heart disease were grouped into
a single category and diagnosis codes pertaining to right heart
disease into a separate category. Reducing the number of para-
meters relative to the number of observations also has the effect
of restricting the degrees of freedom during learning, thus resulting
in a more robust model. Third, to further reduce the complexity of
the model, the analysis considered only comorbidities that were
found, in bivariate analyses, to be significantly associated with PH,
compared with the general population without PH, and selected
those for which the lower 95% confidence interval bound for the
odds ratio was greater than five. Comorbidities that occurred in
fewer than four PH patients were also excluded, since a small
number of observations would not suffice to distinguish between
true and spurious correlations.
Box 4: (continued)
2. Learn the structure of the graphical model $G = (V, E_b)$ from $X_b$.

In the current study, the number of iterations $m$ was set to 1000.

Step 2: model averaging

For each edge $e_i$ learned through the bootstrap resampling process, estimate the probability that it is present in the true network structure $G_0 = (V, E_0)$ as:

$$\hat{P}(e_i) = \frac{1}{m} \sum_{b=1}^{m} s_b, \qquad s_b = \begin{cases} 1 & \text{if } e_i \in E_b \\ 0 & \text{otherwise} \end{cases}$$

The empirical probabilities $\hat{P}(e_i)$ are known as edge intensities or arc strengths and can be interpreted as the degree of confidence that $e_i$ is present in the network structure $G_0$ describing the true dependence structure of the dataset.

Step 3: selection of significant edges

Significant edges were identified by defining a threshold $t$, such that only edges with edge intensity greater than $t$ were included in the final model. Thus:

$$e_i \in E_0 \quad \text{if} \quad \hat{P}(e_i) > F^{-1}_{\hat{P}(\cdot)}(t)$$

where $F^{-1}_{\hat{P}(\cdot)}(t)$ is the quantile function:

$$F^{-1}_{\hat{P}(\cdot)}(t) = \inf\left\{x \in \mathbb{R} : F_{\hat{P}(\cdot)}(x) \geq t\right\}$$
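The procedure in Box 4 can be sketched as follows (illustrative only; `learn_structure` is a placeholder name for any structure learning routine, such as the tabu search sketched earlier, and the fixed `threshold` argument stands in for the quantile-based threshold selection of Scutari and Nagarajan [57]).

```python
import random
from collections import Counter

def edge_intensities(data, learn_structure, m=1000, seed=0):
    """Estimate the edge intensity P(e_i) over m bootstrap replicates (Box 4, step 2)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(m):
        # bootstrap resample: draw len(data) records with replacement
        sample = [rng.choice(data) for _ in range(len(data))]
        for edge in learn_structure(sample):   # assumed to return a set of (a, b) edges
            counts[edge] += 1
    return {edge: c / m for edge, c in counts.items()}

def significant_edges(intensities, threshold):
    """Keep edges whose intensity exceeds the chosen threshold (Box 4, step 3)."""
    return {e for e, p in intensities.items() if p > threshold}
```

With m = 1000 replicates, an edge intensity of 0.5 simply means that the edge appeared in half of the bootstrap-learned networks.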
3.3 Defining Subtypes Through Network Clustering Analysis

To define PH subtypes, the network was partitioned into subgraphs comprising highly connected comorbidities using a strict partitioning rule, whereby each comorbidity belongs to exactly one cluster. The graph partitioning process involves merging nodes agglomera-
tively. Specifically, a random walk clustering approach was applied
to identify the pathways that were closest to each node in the
network [58]. The process involves a random walk on a network
for t number of steps: a walker at node i and step t randomly selects
one of its neighbors to which it hops at step t + 1; the probability of
walking from node i to node j is quantified by the weight of the
edge divided by the total number of nodes directly linked to i.
Similarity between two nodes is measured by the L2 distance
between their respective transition probabilities, and cluster analysis
involves merging nodes such that the mean of the squared distances
between each node and its cluster is minimized. A more detailed
description of this approach is provided in Box 5. The length of
t must be sufficiently long to gather enough information about the
topology of the graph, but short enough to detect clusters. To
guide the choice of t, a commonly used measure known as “mod-
ularity” was applied to quantify the strength of a network division.
A positive value of modularity is indicative of the potential presence
of community structure [59]. A t that maximized modularity was
chosen; in the current analysis, t of 3 was found to be optimal. In
the cluster analysis, edges with a weight of less than 0.2 were
excluded, in order to capture the strongest relations.
Box 5: (continued)
Let $i$ and $j$ be two nodes. The distance between the nodes is quantified as follows:

$$r_{ij} = \sqrt{\sum_{k=1}^{n} \frac{\left(P^{t}_{ik} - P^{t}_{jk}\right)^{2}}{d(k)}}$$

At each step of the agglomerative process, clusters are merged so as to minimize the mean of the squared distances between each node and its cluster:

$$\sigma_k = \frac{1}{n} \sum_{C \in \mathcal{P}_k} \sum_{i \in C} r^{2}_{iC}$$
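The quantities in Box 5 can be computed directly from the weighted adjacency matrix, as in the sketch below (illustrative only; it assumes an undirected graph in which every node has at least one edge, with edge weights given, for example, by the learned arc strengths).

```python
import numpy as np

def random_walk_distances(adj, t=3):
    """Pairwise node distances r_ij based on t-step random walk probabilities.

    adj : (n, n) symmetric weighted adjacency matrix (every node has degree > 0)
    t   : length of the random walk (3 in the study)
    """
    degree = adj.sum(axis=1)
    P = adj / degree[:, None]               # one-step transition matrix
    Pt = np.linalg.matrix_power(P, t)       # t-step transition probabilities

    n = adj.shape[0]
    r = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # r_ij = sqrt( sum_k (P^t_ik - P^t_jk)^2 / d(k) )
            r[i, j] = np.sqrt(((Pt[i] - Pt[j]) ** 2 / degree).sum())
    return r
```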
3.4 Evaluation of Network-Derived Subtypes

The study approach was evaluated through expert review, checking whether it identified established PH subtypes. Accordingly, each comorbidity cluster was assigned a WHO and Panama classification
subtype that best describes the cluster. For example, a comorbidity
cluster comprising portal hypertension and the associated condi-
tions would be classified as WHO Group 1 (PAH associated with
portal hypertension) and Panama Group 10 (pediatric pulmonary
vascular disease associated with other system disorders). Classifica-
tion was first performed by one researcher and then validated by
two pediatric PH experts; inter-rater agreement was quantified
using Cohen’s kappa statistic, and discrepancies were resolved
through consensus. The WHO and Panama classifications comprise
5 and 10 subtypes, respectively; the WHO classification further
categorizes some of the 5 major subtypes into 28 “minor” subtypes
(Fig. 1c). A literature review was conducted to evaluate network-
derived clusters not described in either WHO or Panama classifica-
tions to assess if published evidence supported the co-occurrence of
PH with conditions captured by each cluster.
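Inter-rater agreement of this kind can be quantified with Cohen's kappa statistic, as in the following generic sketch (not the study's code; the two input sequences are assumed to hold the subtype labels assigned by the two raters to the same set of clusters).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    return (observed - expected) / (1 - expected)
```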
4 Notes
4.1 Detection of Well-Established Subtypes

Cluster analysis of the comorbidity network identified all five major subtypes (kappa score, 100%) and 21 of 28 minor subtypes (kappa score, 96%) defined in the WHO classification (Table 1), and 9 of
10 subtypes defined in the Panama classification (kappa score, 90%)
(Table 2), with a few anticipated exceptions. For example, in the
absence of pedigree and genetic data, the analysis was unable to
discern the various forms of heritable PAH, with the exception of
PH in association with hereditary hemorrhagic telangiectasia
(HHT), a condition known to associate with PAH and linked to
pathogenic variants in ALK1 and ENG genes [30]. The identifica-
tion of pathogenic drugs and toxins associated with PAH is beyond
the scope of this study. The analysis was also unable to detect PH
caused by chronic exposure to high altitude: the diagnostic code for
the condition is non-specific (ICD-9 993.2: “other and unspecified
effects of high altitude”) and was not assigned to any of the patients
in the study dataset. The imprecision of ICD codes also precluded
differentiation between left ventricular systolic and diastolic dys-
functions. Finally, subtypes associated with HIV infection or schis-
tosomiasis were not captured, since none of the PH patients in the
study data source had billing codes for these conditions.
4.2 Detection of Rare Subtypes

The analysis detected known rare associations with PH (Table 3). An example is the co-occurrence of PH with juvenile idiopathic
arthritis (JIA) and hemophagocytic syndrome. The clustering
occurrence is not surprising given that PAH has been reported in
several patients with systemic-onset JIA, particularly in association
with macrophage activation syndrome (MAS) [60, 61]. It has been
hypothesized that PAH may be caused by exposures to IL-1 and
IL-6 inhibitors used for treating systemic-onset JIA and MAS [62];
however, the underlying biology of this association remains
unknown. In the study dataset, of 18 patients with both JIA and
hemophagocytic syndrome, four patients developed PH, signifying
the potential importance of this comorbidity pattern.
The cluster comprising glycogen storage disease (GSD), hered-
itary muscular dystrophy, and cardiomyopathy typifies type 2 GSD.
While type 1 GSD has been linked to PH, the relationship between
PH and type 2 GSD is less studied. A case report noted the
development of PH resulting from respiratory muscular atrophy
and alveolar hypoventilation caused by type 2 GSD [63]. Another
report documented severe pulmonary veno-occlusive disease
(PVOD) in a patient with type 2 GSD [64].
4.3 Discovery of Unknown Subtypes

Several network-derived comorbidity clusters do not fall into any of the categories in the WHO and Panama classifications. Of note, a
number of these clusters are linked to neurological defects not
commonly thought to be associated with PH, including encepha-
locele, hydrocephalus, microcephaly, periventricular leukomalacia,
and congenital brain reduction deformities. It is well-established
that children with severe neurological impairments are predisposed
to respiratory problems that occur as a direct consequence of the
underlying disability. For example, oropharyngeal motor problems
4.4 Feasibility and Challenges of Data-Driven Discovery of Disease Subtypes

The study demonstrated that comorbidity patterns of patients with PH captured in a Bayesian network can be stratified into subtypes that are biologically and clinically informative. The algorithmic method automatically relearned most of the major PH subtypes with known etiological basis defined by the WHO classification.
The similarity of the derived network structure to current taxon-
omy of PH provides face validity to the approach. Furthermore, the
network approach enriches the current classification of PH. First, it
captured several subtypes documented in only a few case studies for
which evidence for systematic association remains lacking. This
both validates the approach and provides impetus for deeper inves-
tigation of the disease subtypes. The analysis also identified rare
subtypes with findings consistent with several well-described
genetic syndromes. In the same way in which novel genetic associa-
tions in PH stimulate new avenues of research, so too may novel
phenotypic associations prompt important discoveries related to
disease susceptibility.
To construct the comorbidity network, the Bayesian model
averaging technique was applied to find a network model that
best fits the underlying data. The approach is uniquely suited to
accommodate the inherent uncertainties of biological processes and
to minimize the effects of noise in the data. In maximizing specific-
ity, however, other subtypes may have been missed. Furthermore,
in defining comorbidity clusters, a strict partitioning rule was
applied, whereby each comorbidity belongs exactly to one cluster.
While this approach produces a model that is easier to interpret, the
full expression of subtypes may not have been captured. As shown
in Fig. 3, many comorbidities are linked to comorbidities belonging
to another cluster. Shared features among multiple clusters may
also reflect the overlapping phenotypes of PH, an increasingly
recognized phenomenon [30]. Future research should explore
methods that would facilitate the delineation of subtypes with
overlapping manifestations and etiologies.
A further limitation of the study is the coding inaccuracies inherent to administrative claims datasets. The diagnoses coded for
billing purposes may not reflect actual comorbidities in the patients.
4.5 Summary

This chapter describes the application of the Bayesian network
approach to routinely collected clinical data to discern disease sub-
types in childhood-onset PH. With the increased availability of
large clinical datasets, such data-driven approaches can facilitate
and expedite scientific discovery of the causes and treatments of
complex diseases. Further advances linking disease subtypes to
therapeutic response, disease outcomes, as well as the molecular
profiles of individual subtypes will provide impetus for the develop-
ment of more effective and targeted therapies.
Acknowledgment
References
1. Kitano H (2002) Systems biology: a brief overview. Science 295(5560):1662–1664
2. Aderem A (2005) Systems biology: its practice and challenges. Cell 121(4):511–513
3. Hood L, Heath JR, Phelps ME, Lin B (2004) Systems biology and new technologies enable predictive and preventative medicine. Science 306(5696):640–643
4. Westerhoff HV, Palsson BO (2004) The evolution of molecular biology into systems biology. Nat Biotechnol 22(10):1249–1252
5. Barabási AL, Gulbahce N, Loscalzo J (2011) Network medicine: a network-based approach to human disease. Nat Rev Genet 12(1):56–68
6. Barabási AL, Oltvai ZN (2004) Network biology: understanding the cell’s functional organization. Nat Rev Genet 5(2):101–113
7. Chan SY, Loscalzo J (2012) The emerging paradigm of network medicine in the study of human disease. Circ Res 111(3):359–374
8. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
9. Bhalla US, Iyengar R (1999) Emergent properties of networks of biological signaling pathways. Science 283(5400):381–387
10. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL (2000) The large-scale organization of metabolic networks. Nature 407(6804):651–654
46. … conference on Uncertainty in artificial intelligence, p 206–215
47. Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347
48. de Campos CP, Zeng Z, Ji Q (2009) Structure learning of Bayesian networks using constraints. In: Proceedings of the 26th International conference on machine learning, Montreal, Canada
49. Friedman N, Koller D (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1–2):95–125
50. Perrier E, Imoto S, Miyano S (2008) Finding optimal Bayesian network given a super-structure. J Mach Learn Res 9:2251–2286
51. Acid S, de Campos LM (2001) A hybrid methodology for learning belief networks: BENEDICT. Int J Approx Reason 27(3):235–262
52. Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243
53. de Campos CP, Ji Q. Properties of Bayesian Dirichlet scores to learn Bayesian network structures. In: Proceedings of the 24th AAAI conference on artificial intelligence
54. Glover F, Laguna M (2013) Tabu search. In: Handbook of combinatorial optimization, p 3261–3362
55. Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, p 733–749
56. Friedman N, Goldszmidt M, Wyner A. Data analysis with Bayesian networks: a bootstrap approach. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence (UAI’99), p 196–205
57. Scutari M, Nagarajan R (2013) Identifying significant edges in graphical models of molecular networks. Artif Intell Med:207–217
58. Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Computer and information sciences – ISCIS 2005. Springer, Berlin Heidelberg, p 284–293
59. Newman ME (2006) Modularity and community structure in networks. Proc Natl Acad Sci U S A 103(23):8577–8582
60. Kimura Y, Weiss JE, Haroldson KL, Lee T, Punaro M, Oliveira S et al (2013) Pulmonary hypertension and other potentially fatal pulmonary complications in systemic juvenile idiopathic arthritis. Arthritis Care Res 65(5):745–752
61. Li EK, Tam LS (1999) Pulmonary hypertension in systemic lupus erythematosus: clinical association and survival in 18 patients. J Rheumatol 26(9):1923–1929
62. Humbert M, Monti G, Brenot F, Sitbon O, Portier A, Grangeot-Keros L et al (1995) Increased interleukin-1 and interleukin-6 serum concentrations in severe primary pulmonary hypertension. Am J Respir Crit Care Med 151(5):1628–1631
63. Inoue S, Nakamura T, Hasegawa K, Tadaoka S, Samukawa M, Nezuo S et al (1989) Pulmonary hypertension due to glycogen storage disease type II (Pompe’s disease): a case report. J Cardiol 19(1):323–332
64. Kobayashi H, Shimada Y, Ikegami M, Kawai T, Sakurai K, Urashima T et al (2010) Prognostic factors for the late onset Pompe disease with enzyme replacement therapy: from our experience of 4 cases including an autopsy case. Mol Genet Metab 100(1):14–19
65. Brandenburg VM, Krueger S, Haage P, Mertens P, Riehl J (2002) Heterotaxy syndrome with severe pulmonary hypertension in an adult patient. South Med J 95(5):536–538
66. Yousuf T, Kramer J, Jones B, Keshmiri H, Dia M (2016) Pulmonary hypertension in a patient with congenital heart defects and heterotaxy syndrome. Ochsner J 16(3):309–311
67. Hills C, Moller JH, Finkelstein M, Lohr J, Schimmenti L (2006) Cri du chat syndrome and congenital heart disease: a review of previously reported cases and presentation of an additional 21 cases from the Pediatric Cardiac Care Consortium. Pediatrics 117(5):e924–e927
68. Levy B, Dunn TM, Kern JH, Hirschhorn K, Kardon NB (2002) Delineation of the dup5q phenotype by molecular cytogenetic analysis in a patient with dup5q/del 5p (cri du chat). Am J Med Genet 108(3):192–197
69. Bechtold SM, Dalla Pozza R, Becker A, Meidert A, Döhlemann C, Schwarz HP (2004) Partial anomalous pulmonary vein connection: an underestimated cardiovascular defect in Ullrich-Turner syndrome. Eur J Pediatr 163(3):158–162
70. Tinker A, Schofield UJ (1989) Severe pulmonary hypertension in Turner syndrome. Br Heart J 62:74–77
71. Bakker B, Maneatis T, Lippe B (2007) Sudden death in Prader-Willi syndrome: brief review of five additional cases. Horm Res 67:203–204
72. Katheria AC, Masliah E, Benirschke K, Jones KL, Kim JH (2010) Idiopathic persistent pulmonary hypertension in an infant with Smith-Lemli-Opitz syndrome. Fetal Pediatr Pathol 29(6):373–379
73. Austin ED, Ma L, LeDuc C, Berman Rosenzweig E, Borczuk A, Phillips JA 3rd et al (2012) Whole exome sequencing to identify a novel gene (caveolin-1) associated with human pulmonary arterial hypertension. Circ Cardiovasc Genet 5:336–343
74. Nowaczyk M (2013) Smith-Lemli-Opitz syndrome. GeneReviews
75. Seddon PC, Khan Y (2003) Respiratory problems in children with neurological impairment. Arch Dis Child 88(1):75–78
76. Dauvilliers Y, Stal V, Abril B, Coubes P, Bobin S, Touchon J, Escourrou P, Parker F, Bourgin P (2007) Chiari malformation and sleep related breathing disorders. J Neurol Neurosurg Psychiatry 78(12):1344–1348
77. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 11(3):333–337
78. Ong MS, Mullen MP, Austin ED, Szolovits P, Natter MD, Geva A, Cai T, Kong SW, Mandl KD (2017) Circ Res 121(4):341–353