
Article

Cite This: J. Chem. Inf. Model. 2019, 59, 117−126 pubs.acs.org/jcim

Dissecting Machine-Learning Prediction of Molecular Activity: Is an Applicability Domain Needed for Quantitative Structure−Activity Relationship Models Based on Deep Neural Networks?

Ruifeng Liu,*,† Hao Wang,† Kyle P. Glover,§ Michael G. Feasel,§ and Anders Wallqvist*,†

†Department of Defense, Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, United States
§U.S. Army−Edgewood Chemical Biological Center, Aberdeen Proving Ground, Maryland 21010, United States

*S Supporting Information
ABSTRACT: Deep neural networks (DNNs) are the major drivers of recent progress in artificial intelligence. They have emerged as the machine-learning method of choice in solving image and speech recognition problems, and their potential has raised the expectation of similar breakthroughs in other fields of study. In this work, we compared three machine-learning methods (DNN; random forest, a popular conventional method; and variable nearest neighbor, arguably the simplest method) in their ability to predict the molecular activities of 21 in vivo and in vitro data sets. Surprisingly, the overall performance of the three methods was similar. For molecules with structurally close near neighbors in the training sets, all methods gave reliable predictions, whereas for molecules increasingly dissimilar to the training molecules, all three methods gave progressively poorer predictions. For molecules sharing little to no structural similarity with the training molecules, all three methods gave a nearly constant value, approximately the average activity of all training molecules, as their predictions. The results confirm conclusions deduced from analyzing molecular applicability domains for accurate predictions, i.e., the most important determinant of the accuracy of predicting a molecule is its similarity to the training samples. This highlights the fact that even in the age of deep learning, developing a truly high-quality model relies less on the choice of machine-learning approach and more on the availability of experimental efforts to generate sufficient training data of structurally diverse compounds. The results also indicate that the distance to training molecules offers a natural and intuitive basis for defining applicability domains to flag reliable and unreliable quantitative structure−activity relationship predictions.

■ INTRODUCTION

In recent years, deep neural networks (DNNs) have emerged as the machine-learning method of choice in image1 and speech recognition,2 and their versatility has led to unprecedented progress in artificial intelligence.3 This success has spurred applications of DNNs in many other fields, including quantitative structure−activity relationship (QSAR) prediction of molecular activities.4 In a Kaggle competition sponsored by Merck in 2012 to examine the ability of modern machine-learning methods to solve QSAR problems in pharmacology and drug discovery, DNNs were among the winning entries. Merck researchers followed this up in a detailed study that specifically compared the performance of DNN models to that of random forest (RF) models and showed that DNN models could routinely make better prospective predictions on a series of large, diverse QSAR data sets generated as part of Merck's drug discovery efforts.5 Since then, many other studies have been published comparing DNNs and conventional machine-learning methods in terms of their ability to predict molecular activities.6−11 The bulk of these studies showed better performance with DNNs than with other machine-learning methods, raising the hope that DNNs may help overcome key modeling challenges in drug discovery, one of which is to provide guidance for efficient exploration of new chemical spaces and the discovery and evaluation of structurally novel drugs. Interestingly, with Merck Challenge data, Winkler and Le showed that a single-hidden-layer shallow neural network performed similarly to the DNNs.6 Their results may look surprising, but they are consistent with the Universal Approximation Theorem, which states that a feedforward network with a single hidden layer is sufficient to represent any function.12,13

Most published studies evaluating the performance of DNNs and conventional machine-learning methods have relied on global performance metrics, such as the correlation coefficient (R2) or the root mean squared error (RMSE) between the predicted and experimental results of all molecules, because of the large number of data sets they examined (a few tens to more than a thousand).

Received: June 3, 2018
Published: November 9, 2018

© 2018 American Chemical Society. DOI: 10.1021/acs.jcim.8b00348
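For reference, the two global metrics relied on in these studies, R2 and RMSE, can be computed as in the following minimal sketch. The code and function names are our illustration, not the authors'; R2 is taken here as the squared Pearson correlation coefficient, the convention common in QSAR work.

```python
import math

def r_squared(y_true, y_pred):
    """Squared Pearson correlation coefficient between experimental and
    predicted activities (the global R2 reported in QSAR studies)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)

def rmse(y_true, y_pred):
    """Root mean squared error between experimental and predicted activities."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
```

Note that a model whose predictions are shifted by a constant still scores R2 = 1 while its RMSE is large, which is one reason the two metrics are usually reported together.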
Although such global metrics are appealing because they conveniently provide a single number to interpret, they may miss important prediction details for individual molecules or groups of molecules in different activity ranges. In a recent study, we examined the performance of DNN, RF, and variable nearest neighbor (v-NN) methods with in vivo chemical toxicity and in vitro molecular activity data sets. Judged by R2 and RMSE, DNN performance improved with increasing data set size and outperformed RF and v-NN models for large data sets, consistent with previously published studies. However, closer examination revealed that all machine-learning methods gave good predictions for molecules with marginal activity and markedly poorer predictions for highly active and highly inactive molecules.14 Because one of the main objectives of predictive toxicology and drug discovery is to identify highly active molecules, these results suggest that the potential for machine learning to advance predictive toxicology and drug discovery might be substantially lower than the global performance metrics R2 and RMSE indicate. They also suggest that, when evaluating the performance of machine-learning methods, one should not only rely on global metrics but also examine detailed prediction performance to better understand the strengths and weaknesses of the machine-learning methods.

In the present study, we analyzed details of machine-learning predictions of molecular activities with the aim of understanding if DNNs can truly learn new relationships and provide more reliable predictions than conventional machine-learning methods for molecules whose structures are not very similar to training samples. This is one of the most challenging issues facing conventional machine-learning methods, as analyses of applicability domains of conventional machine-learning models indicate that the most important determinant for error of predicting molecular properties is not the machine-learning method but the similarity of the molecules to the training set molecules.15,16

■ MATERIALS AND METHODS

Data Sets. We derived seven in vivo acute chemical toxicity data sets from the Leadscope Toxicity Database (http://www.leadscope.com/toxicity_database/). After removing entries not suitable for QSAR modeling, we collected data sets of 1745, 2191, 4115, 10 363, 11 716, 21 776, and 29 476 compounds for rabbit skin, rat subcutaneous, mouse subcutaneous, rat oral, mouse intravenous, mouse oral, and mouse intraperitoneal toxicity, respectively. Each compound has an experimentally derived LD50 value in milligrams per kilogram body weight. We converted the LD50s into log(millimoles per kilogram) before modeling. Details of our data cleaning procedure can be found elsewhere.14

We also used 14 of the 15 in vitro data sets in the Merck Molecular Activity Challenge. The LogD data set was excluded in this study mainly because of computational cost due to its large size (50 000 compounds, the largest of the Merck Challenge data sets) and also because it is a relatively straightforward property for QSAR modeling, as indicated by the good performance of multilinear regression for LogP predictions,17 where LogD is LogP under a specific pH condition.

Molecular Descriptors. For the in vivo toxicity data sets, we used extended connectivity fingerprints with a diameter of four chemical bonds (ECFP_4)18 as input molecular features. The ECFP_4 fingerprints were directly calculated from molecular structures. For the in vitro molecular activity data sets, the Merck Challenge only provided molecular activities and atom-pair descriptor values, and the molecular structures were not disclosed. Therefore, we used the provided atom-pair descriptor values as input features.

Machine-Learning Methods. Deep Neural Networks. For the in vivo toxicity data sets, we used a fully connected feed-forward network architecture of dimensions 2048:300:300:30:1, where the first number represents 2048 ECFP_4 fingerprint features as inputs for all data sets, followed by 300, 300, and 30 neurons in the first, second, and third hidden layers, respectively, and a single neuron in the output layer. We built seven single-task DNNs, each for an individual toxicity end point. For the DNN calculations, we used the open source Python library Keras (https://keras.io/) on top of the Theano19 backend, the ReLU activation function for the input and hidden layers, the Adam optimizer, a kernel initializer with a normal distribution, and a dropout rate of 30% on all input and hidden layers. We reported our hyperparameter selection and DNN performance in a recent paper.14 For each data set, we selected the 2048 ECFP_4 fingerprint features used as input for the DNNs according to the following procedure:

(1) Identify all unique fingerprint features present in the whole data set.

(2) Calculate the frequency of each fingerprint feature appearing in the molecules in the data set.

(3) Select the fingerprint features appearing in 50% of the molecules and those closest to 50% of the molecules, until the total number of selected features reaches 2048. This selection process excludes the least important fingerprints, because it deselects fingerprint features that appear in all or nearly none of the molecules.

For the in vitro data sets, we first preprocessed the Merck data sets using Merck-provided Python code downloaded from GitHub, and then implemented the Merck DNN models, again using Merck-provided Python code downloaded from GitHub (https://github.com/RuwanT/merck). The Merck DNN models consisted of a variable number of input features, ranging from 2796 to 6559 depending on the data set; 4000, 2000, 1000, and 1000 hidden neurons in the first, second, third, and fourth hidden layers, respectively; and a single output for each model. Because we could not calculate any molecular descriptors given that Merck did not disclose molecular structure information, we used the Merck-provided atom-pair descriptors as input features for the DNN calculations for the in vitro data sets.

Random Forests. We used the Pipeline Pilot implementation of the random forest (RF) algorithm called Forest of Random Trees (http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/) to develop the RF models. The RF model for each data set consisted of 500 decision trees. For the 7 in vivo toxicity data sets, we used ECFP_4 fingerprint features as molecular descriptors. For the 14 Merck in vitro molecular activity data sets, we used the Merck-provided atom-pair descriptors as input features. For both the in vivo and in vitro data sets, the maximum tree depth was 50, and a third of all molecular descriptors were tested as split criteria within each tree.

Variable Nearest Neighbor. The v-NN method is based on the principle that similar structures have similar activity. Its prediction is a distance-weighted average of all qualified nearest neighbors in the training set:
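The three-step ECFP_4 feature-selection procedure described under Machine-Learning Methods above can be sketched as follows. This is our illustrative reconstruction, not the authors' code; fingerprints are assumed to be given as sets of integer feature IDs, one set per molecule.

```python
from collections import Counter

def select_fingerprint_features(molecule_fingerprints, n_features=2048):
    """Select the fingerprint features whose occurrence frequency across the
    data set is closest to 50%.

    molecule_fingerprints: list of sets, one per molecule, each holding the
    integer IDs of the fingerprint features present in that molecule.
    Returns up to n_features feature IDs.
    """
    n_mol = len(molecule_fingerprints)
    # Steps 1-2: collect the unique features and count how many molecules
    # each feature appears in.
    counts = Counter()
    for fp in molecule_fingerprints:
        counts.update(fp)
    # Step 3: rank features by |frequency - 50%|, so features present in
    # all or almost none of the molecules (least informative) drop out first.
    ranked = sorted(counts, key=lambda f: abs(counts[f] / n_mol - 0.5))
    return ranked[:n_features]
```

In this sketch, a feature present in every molecule or in almost none carries little discriminating information, which is exactly what the paper's 50%-frequency criterion filters out.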
Figure 1. Impact of Tanimoto distance threshold (d0) on the ability of the variable nearest neighbor (v-NN) method to predict the experimental
log(LD50) values of a rat oral toxicity data set. In this and all subsequent figures plotting the predicted value against the experimental value, a data
point on the diagonal line indicates that these values are identical. The predictions were made via 10-fold cross validation.
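Throughout the paper, structural similarity is measured via the Tanimoto distance over binary fingerprints. As an illustrative sketch (ours, not the authors' code), with each fingerprint represented as the set of its "on" bit positions:

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance between two binary fingerprints.

    Each fingerprint is given as a set of the integer positions of its 'on'
    bits (e.g., the ECFP_4 features present in a molecule). The distance is
    1 - |intersection| / |union|: identical fingerprints give 0.0, and
    fingerprints sharing no bits give 1.0.
    """
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as identical
    return 1.0 - len(fp_a & fp_b) / union
```

For example, fingerprints {1, 2, 3} and {2, 3, 4} share two of four distinct bits, giving a Tanimoto distance of 0.5.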

y = ∑_{i=1}^{v} yi e^{-(di/h)^2} / ∑_{i=1}^{v} e^{-(di/h)^2}    (1)

In this equation, yi is the activity of the ith nearest neighbor in the training set, di is the distance between the ith nearest neighbor and the molecule for which v-NN is making a prediction, h is a smoothing factor that modulates the distance penalty, and v is the count of all nearest neighbors in the training set that satisfy the condition di ≤ d0, where d0 is a distance threshold that ensures the validity of the similar structure−similar activity principle. We consider all training set near neighbors that meet the condition di ≤ d0 qualified training compounds. d0 and h are the only model parameters to be determined from the training data.

In previous studies, we found that v-NN performance depends strongly on d0.20 If d0 is small, then information on only structurally very similar compounds is used in making predictions and the results are more likely to be reliable. However, with a small d0, the number of training set molecules meeting the Tanimoto distance threshold is small, and therefore, the number of molecules for which the method can make predictions is low. When we used the Tanimoto distance calculated from ECFP_4 molecular fingerprints, we found that the combination of d0 = 0.6 and h = 0.3 worked well for predicting molecular activities, with both reasonable reliability and acceptable coverage (percentage of molecules for which v-NN predictions could be made).14 For example, Figure 1 shows the results of 10-fold cross validation for the rat oral toxicity data set, calculated using ECFP_4 fingerprints with d0 values of 0.6 and 1.0. With d0 = 1.0, the RMSE of prediction was higher than that obtained with d0 = 0.6, although it allowed predictions for 100% of the compounds, compared to 86% of the compounds with d0 = 0.6.

Figure 1 shows that with d0 = 0.6, the data points are more symmetrically distributed around the diagonal identity line than with d0 = 1.0, with the latter condition leading to underestimation of toxicity for highly toxic compounds and overestimation of toxicity for nontoxic compounds. Because v-NN predictions with d0 = 1.0 represent a weighted average toxicity of all training samples, whereas predictions with d0 = 0.6 represent a weighted average toxicity of only training samples that met this Tanimoto distance threshold, these results indicate that within the v-NN approach, predictions based on information from all compounds are no better than predictions based on information from qualified compounds only. Thus, information from unqualified compounds may make the predictions worse. On the basis of this observation, we decided to adopt a layered v-NN prediction approach. That is, for a given test compound, we segregate the chemical space, with the test compound at the center, into ten partitions. The first partition is a sphere with a radius of d0 = 0.1. The other partitions represent shells of space, defined by Tanimoto distances between 0.1 and 0.2, 0.2 and 0.3, ... up to 0.9 and 1.0 (Figure 2). We give a v-NN prediction for a test compound by using information on training set compounds in the closest partition only. For example, if the test compound has neighbors in the training set in layer 1 (di ≤ 0.1), then only information for these compounds is used to make a prediction, regardless of whether or not training samples exist in the remaining layers.

Figure 2. Scheme illustrating the layered v-NN approach. To make a prediction for molecule m, the entire chemical space is segregated into 10 spherical layers of equal depth (a Tanimoto distance of 0.1) with m at the center. The training molecules are then distributed among the layers by their Tanimoto distance to m, and only those in the layer closest to m are used in eq 1 for v-NN predictions. For comparison, we made RF and DNN predictions of m using models trained with all training samples. We then grouped the predictions into layers by the distance between m and the training samples.
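Eq 1 and the layered selection scheme of Figure 2 can be sketched as follows. The functions are our illustration, not the authors' implementation, and the handling of layer boundaries at exact multiples of 0.1 is deliberately simplified.

```python
import math

def vnn_predict(distances, activities, d0=0.6, h=0.3):
    """Eq 1: Gaussian distance-weighted average activity of the qualified
    near neighbors (those with Tanimoto distance di <= d0).

    distances[i] is the Tanimoto distance between the query molecule and the
    i-th training molecule; activities[i] is that molecule's activity.
    Returns None when no training molecule qualifies (no prediction made).
    """
    pairs = [(d, y) for d, y in zip(distances, activities) if d <= d0]
    if not pairs:
        return None
    weights = [math.exp(-((d / h) ** 2)) for d, _ in pairs]
    return sum(w * y for w, (_, y) in zip(weights, pairs)) / sum(weights)

def layered_vnn_predict(distances, activities, h=0.3):
    """Layered v-NN: apply eq 1 using only the neighbors that fall in the
    closest occupied 0.1-wide distance layer (floating-point edge cases at
    exact multiples of 0.1 are ignored in this sketch)."""
    layer = min(int(min(distances) * 10), 9)  # 0.0-0.1 -> layer 0, ..., 0.9-1.0 -> layer 9
    in_layer = [(d, y) for d, y in zip(distances, activities)
                if min(int(d * 10), 9) == layer]
    return vnn_predict([d for d, _ in in_layer], [y for _, y in in_layer],
                       d0=1.0, h=h)
```

With d0 = 1.0, vnn_predict reduces to a weighted average over essentially all training samples, which is why layer-9 predictions collapse toward the training set mean.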
Figure 3. Plots of predicted versus experimental log(LD50) values of the rat oral toxicity data set. The predictions were made by the variable nearest neighbor (v-NN) model via 10-fold cross validation. The predicted results were grouped, based on the shortest Tanimoto distance to the training molecules, into the layers described in Figure 2.

If the test compound has no neighbors in layer 1, then only information on training samples in layer 2 is used to make predictions. We refer to such v-NN predictions as layered predictions. For the acute toxicity data sets, we define the layers based on Tanimoto distances calculated using ECFP_4 fingerprints; for the Merck in vitro molecular activity data sets, the layers were defined by Tanimoto distances calculated using the atom-pair fingerprints derived from Merck-provided descriptor values, by stripping the counts of atom-pairs in a molecule and retaining only information on their presence or absence.

■ RESULTS

Layered v-NN Predictions for in Vivo Toxicity Data Sets. To evaluate the performance of layered v-NN predictions, we performed 10-fold cross validation calculations for the in vivo toxicity data sets. Thus, we first split each data set randomly into 10 equal-sized groups and then used 9 of them as the training set to predict the toxicities of the compounds in the left-out group. This process was repeated nine times so that each and every group was left out once and used as a test set. We used a smoothing factor of 0.3 and Tanimoto distances calculated with ECFP_4 fingerprints in performing the v-NN calculations. For all data sets, the number of compounds with layer 9 predictions, i.e., those without training set compounds within a Tanimoto distance of 0.8, was very small, and an even smaller number of compounds were present with layer 10 predictions. In analyzing the data, we combined the predictions for layers 9 and 10 as layer 9 predictions.

For the rat oral toxicity data set, we compared the predicted toxicities in different layers with the experimental results (Figure 3). Because the results for the other six data sets were similar, we have included them in the Supporting Information (Figure S1). In the Supporting Information (Table S1), we present the squared correlation coefficient, R2, between the predicted and experimental log(LD50) values for each layer of the seven in vivo data sets.
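The random 10-fold split described above can be sketched as follows (our illustration; the paper does not publish its splitting code):

```python
import random

def ten_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for 10-fold cross validation.

    The sample indices are shuffled once, dealt into 10 (nearly) equal
    groups, and each group serves as the held-out test set exactly once
    while the remaining 9 groups form the training set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Every compound therefore receives exactly one out-of-sample prediction, which is what allows the per-layer analysis below to cover the whole data set.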
Figure 4. Root mean squared error (RMSE) between the predicted and experimental log(LD50) values of compounds in different layers of the
seven in vivo acute toxicity data sets. The predictions were made via 10-fold cross validation, using the variable nearest neighbor (v-NN), random
forest (RF), and deep neural network (DNN) models. The layers were described in Figure 2.

Both the table and figures show that the predictions were highly reliable when the test molecules had close neighbors in the training set but less reliable when they had no such neighbors. This became increasingly apparent for predictions of compounds in layers 7 to 9. The R2 for these compounds was smaller than 0.1, indicating no correlation between the predicted and experimental results. A notable feature for compounds in layer 9 is that the predicted toxicity was roughly constant, regardless of molecular structure and experimentally measured log(LD50) value. This is not surprising for layered v-NN predictions, because, according to eq 1, a layer 9 or 10 v-NN prediction is simply the average toxicity of almost all training samples. Because RF and DNN are based on more intricate algorithms, it is interesting to assess how they perform under the same circumstances.

RF and DNN Performance for the in Vivo Toxicity Data Sets. Unlike the layered v-NN approach, which uses information on only qualified neighbors in a training set to make predictions, the RF and DNN methods use information on all training samples to first build the models, which they then use to make predictions. To assess model performance for the RF and DNN methods in a manner similar to that for the layered v-NN approach, we first made RF and DNN predictions for all compounds in 10-fold cross validation, and then calculated Tanimoto distances between the test and training compounds. Subsequently, we segregated the compounds by their shortest Tanimoto distance to the training samples into groups similar to those of the layered v-NN approach. To our surprise, the distributions of RF- and DNN-predicted versus experimental log(LD50) values of the rat oral toxicity data set are remarkably similar to that shown in Figure 3, even though both RF and DNN are much more sophisticated machine-learning methods. Because they are so similar, we presented them in Figures S2 and S3 of the Supporting Information. Similar results were also observed for the other in vivo toxicity data sets and are presented in Figures S2 and S3. The R2 values between the predicted and experimental log(LD50) values for all seven in vivo data sets are presented in Table S1. Figure 4 plots the RMSE of compounds within each layer of predictions for all seven in vivo data sets.

Although the RF and DNN models are more complex than the layered v-NN models, the plots in Figures 3 and S1−S3 (Supporting Information) show similar results for all models, especially for predictions of the lower layers for molecules with structurally close near neighbors in the training sets. For the six larger data sets (i.e., those excluding the rabbit skin toxicity data set, which contains too few molecules), all three methods show highly reliable predictions for compounds within a Tanimoto distance of 0.4 to any training sample (layers 1−3). For compounds without any training samples within a Tanimoto distance of 0.4, but with training samples within a Tanimoto distance of 0.7 (layers 4−6), all models gave inferior but acceptable predictions that were clearly correlated with the experimental results (Table S1 and Figure 4).
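The per-layer error analysis plotted in Figure 4 amounts to grouping test molecules by their shortest Tanimoto distance to the training set and computing the RMSE within each group. A minimal sketch (ours, with the layer assignment simplified to integer division by the layer width):

```python
import math
from collections import defaultdict

def per_layer_rmse(shortest_distances, y_true, y_pred, layer_width=0.1):
    """RMSE of predictions grouped by each test molecule's shortest
    Tanimoto distance to the training set.

    Layer k roughly covers distances [k*layer_width, (k+1)*layer_width);
    everything beyond layer 9 is folded into layer 9, mirroring how the
    paper merges its sparsely populated outermost layers.
    """
    groups = defaultdict(list)
    for d, yt, yp in zip(shortest_distances, y_true, y_pred):
        layer = min(int(d / layer_width), 9)
        groups[layer].append((yt - yp) ** 2)
    return {k: math.sqrt(sum(v) / len(v)) for k, v in sorted(groups.items())}
```

Applied to the cross-validation predictions, this yields one RMSE value per occupied layer, i.e., one point per layer in a Figure 4-style plot.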
Figure 5. Plots of predicted versus experimental log(LD50) values of the CB1 data set. The predictions were made by a deep neural network (DNN) model using the time-split training and test data provided by the Merck Challenge (https://github.com/RuwanT/merck). The predicted results were grouped into layers by the Tanimoto distance to the training molecules, as described in Figure 2.

However, for test compounds with a Tanimoto distance of at least 0.7 from any training sample, none of the models gave acceptable predictions (i.e., there was simply no correlation between the predicted and experimental results). For the rabbit skin toxicity data set (the smallest acute toxicity data set), all machine-learning methods showed poorer performance, as judged by the RMSE (Figure 4) and R2 (Table S1 in the Supporting Information) of the different prediction layers. Because DNNs employ a large number of model parameters, they require a large data set to develop a good model. These results show that all machine-learning methods performed better the greater the amount of training data.

Interestingly, the plots in Figures 3 and S1−S3 reveal a common trend regardless of the machine-learning method: the distribution of data points rotated clockwise from roughly 45° for data points in layer 1 to approximately 0° for those in layer 9. Thus, for compounds with close near neighbors in the training sets, the data points distribute symmetrically around the diagonal identity lines. Going from the lowest to the highest layer, the models increasingly underestimated the toxicity of highly toxic compounds and overestimated that of the least toxic compounds, resulting in a nearly horizontal distribution of data points in layer 9. Because v-NN predictions for compounds in layer 9 are simply the average toxicity of all training samples, the horizontal distributions of layer-9 data points suggest that the RF and DNN predictions for these compounds were also close to the average toxicity of all training samples. Thus, when a compound is too far away from the training compounds, its predicted activity is close to the average activity of the training molecules regardless of the machine-learning method.

Results for in Vitro Molecular Activity Data Sets. For the in vitro data sets, each data set was provided in the form of a training set and a test set consisting of 75% and 25% of the compounds, respectively. The compounds were split into a training set and a test set based on the dates they were evaluated. This approach is intended to capture a set of training compounds representing chemistries that were synthesized and evaluated before the newer test set compounds were synthesized and made available for evaluation. This time-split test provides a more realistic estimate of model performance for new compounds, because by design, chemical and pharmaceutical research constantly explores new chemical spaces that differ from the space of the training set.21 They are ideal test sets for assessing how well a machine-learning method can learn from a known region of chemical space and make reliable predictions for compounds of a previously unexplored region. Conventional machine-learning methods do not perform well in this respect: they give poor predictions for molecules whose structures are not well represented by the training set. To remedy this issue, various applicability domains have been defined for flagging compounds for which reliable predictions cannot be made.22,23

Although deep learning is currently considered the most powerful machine-learning method,9 an interesting question is whether it represents an incremental improvement over traditional machine-learning methods or a fundamental change (i.e., whether it learns something new and infers relationships heretofore unobserved in the training data).
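The time-split described above can be sketched as follows (our illustration; records are assumed to carry an assay date as their first element, and the Merck Challenge's own split is simply provided as files):

```python
def time_split(records, train_fraction=0.75):
    """Split records into a training set of the earliest-measured compounds
    and a test set of the most recent ones.

    Each record is a (date, payload) pair (or any tuple whose first element
    sorts chronologically). The earliest train_fraction of the records form
    the training set; the remainder, the test set.
    """
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Unlike a random split, the test compounds here come from chemistry explored after the training compounds, which is what makes the evaluation a harder and more realistic test of extrapolation.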
Figure 6. Root mean squared error (RMSE) between the predicted and experimental log(LD50) values of compounds in different layers of the in
vitro molecular activity data sets. The predictions were made with the time-split training and test data provided by the Merck Challenge, using the
variable nearest neighbor (v-NN), random forest (RF), and deep neural network (DNN) models. The layers were as described in Figure 2. Assay
abbreviations: 3A4, Cyp 3A4 inhibition (pIC50); OX2, Orexin 2 receptor inhibition (pKi); PPB, human plasma protein binding [log(bound/
unbound)]; PGD, transport by p-glycoprotein [log(BSA/AB)]; OX1, Orexin 1 receptor inhibition (pKi); THROMBIN, human thrombin
inhibition (pIC50); TDI, time-dependent Cyp3A4 inhibition [log(IC50 without NADPH/IC50 with NADPH)]; HIVINT, inhibition of HIV
integrase in a cell based assay (pIC50); METAB, percent remaining after 30 min microsomal incubation; CB1, binding to cannabinoid receptor 1
(pIC50); HIVPROT, inhibition of HIV protease (pIC50); NK1, inhibition of neurokinin 1 receptor binding (pIC50); RAT_F, log(rat
bioavailability) at 2 mg/kg; DPP4, inhibition of dipeptidyl peptidase 4 (pIC50).

heretofore unobserved in the training data). Therefore, we used the Merck-provided time-split training and test sets to evaluate performance on the in vitro data sets.

We calculated Tanimoto distances between a test molecule and the training molecules, using the presence or absence of Merck-defined atom-pair features (ignoring the counts of such features) in a molecule (see Materials and Methods). The information content of the atom-pair features is much lower than that of ECFP_4 fingerprints. As a result, we could reasonably expect the v-NN results to be inferior. This was corroborated by the observation that only a small fraction of molecules had a Tanimoto distance of 0.6 or more to the training molecules in the in vitro data sets. For RF and DNN predictions, we used the atom-pair descriptors (including counts of the atom pairs) as provided in the Merck Challenge data sets. We then grouped the predicted results into different prediction layers by their Tanimoto distance to the training samples. Because most data sets contained only a small fraction of compounds in layers 7 and 8, and no compounds in the outermost layers, we grouped the compounds in layers 7 and 8 into a terminal layer 6.

As an example, a scatterplot of DNN-predicted activity as a function of experimentally measured activity for the molecules in the CB1 data set is presented in Figure 5. The plots of the other data sets show the same general trend and are presented in Figures S4−S6 (Supporting Information). Although we used markedly different input descriptors and DNN architectures to model the in vitro and in vivo data sets, the general trend of predicted activity plotted against measured activity for the in vitro data sets was similar to that for the in vivo data sets. That is, when a test molecule has close neighbors in the training set, the predicted values are generally more reliable, as indicated by the greater number of data points distributed along the diagonal identity line in each plot. However, for molecules in increasingly higher layers (structurally further from the training sets), the distribution of the data points rotated clockwise, from around 45° for data points of the first layer to approximately 0° for data points of the last layer. Thus, we found in the in vitro data sets the same trend that we had observed in the in vivo data sets: regardless of the machine-learning method used, the prediction for a molecule became increasingly unreliable as its distance from the training compounds increased.

Figure 6 shows the RMSE values between the predicted and experimental activities of the compounds in different layers of the in vitro data sets (the R2 data are presented in Table S2 of the Supporting Information). For most data sets, the RF and DNN results are similar, whereas the RMSE values for v-NN are notably higher than the corresponding values for RF or DNN. This is evident in how the scatter of data points in Figure S4 (Supporting Information) is broader than that in Figures S5 and S6. This differs from the plots in Figure 4, in which the RMSE values of all three methods are similar for all data sets. We believe this is most likely due to information loss in the atom-pair fingerprint used in the v-NN calculations. Although we used atom-pair counts for the RF and DNN calculations, we could not do so for the v-NN calculations owing to the way the Tanimoto distance was defined.

DISCUSSION

Numerous published studies have shown that DNNs can outperform other techniques for many machine-learning tasks, including QSAR modeling of molecular activities. DNNs may thus provide the potential means to overcome many computational and modeling challenges in drug discovery and healthcare based on advanced data analysis. Here, we examined machine-learning predictions for QSAR modeling of molecular activity in detail, using 7 in vivo acute toxicity and 14 in vitro molecular activity data sets. For these data sets, we showed that current DNN implementations may represent an incremental improvement over the other machine-learning methods examined, although their performance is largely in line with those methods. Like the other machine-learning methods, DNNs are able to accurately predict the activity of a molecule with close near neighbors in the training set. However, for a molecule representing a hitherto unexamined chemical series, and therefore having no near neighbors in the training set, DNNs assign predictions close to the average of all training molecule activities, much like the other machine-learning methods. Thus, current DNN implementations for QSAR modeling of molecular activities lack the ability to learn beyond the training set, have limited potential for guiding the exploration of new chemical space or the discovery of structurally novel drugs, and still need an applicability domain for estimating the reliability of predictions. A leading applicability domain metric is the ensemble variance metric, defined as the standard deviation of the predictions given by an ensemble of prediction models.15 However, this metric requires the development of an ensemble of prediction models, which is not easily applicable to DNNs because of the amount of work required to create these models. On the other hand, our results show that for molecules within 0.3 Tanimoto distance of a training molecule, all machine-learning predictions are reasonably reliable. Therefore, experimental measurements for these molecules can safely be replaced by machine-learning predictions, so that precious resources can be redirected to where they can have the highest impact.24,25 We would like to point out that, for estimating prediction errors of individual molecules, studies have shown that the closest similarity to training set compounds does not perform as well as ensemble variance.15 However, we believe this is due to inappropriate use of similarity to training compounds: only the similarity to the nearest neighbor in the training set was used, and similarities to the other training molecules were ignored. In a recent study, we defined a new similarity-based applicability domain metric as a sum of distance-weighted contributions of all training molecules, which performs as well as, if not better than, the ensemble variance metric.26

A likely caveat of our study is that we did not examine the performance of multitask DNNs, i.e., those using a single neural network to learn and predict multiple end points. Several recently published studies have found that multitask DNNs can perform slightly better than their single-task counterparts on classification problems.9,27,28 For regression problems, Ma et al. compared multitask and single-task DNNs on the Merck Challenge data sets.5 They found that multitask DNNs performed slightly better for some data sets, whereas they performed similarly to or slightly worse than the corresponding single-task DNNs for others. When averaged across all Merck Challenge data sets, multitask DNNs performed slightly better, with smaller data sets benefiting more at the expense of worse performance on the largest data sets. The seemingly better performance of multitask DNNs generated some excitement, as one of the plausible explanations is that the relationships (weights) linking the input features to nodes in the neural network are related due to an inherent biological response similarity (i.e., what can be "learned" from one assay can be transferred to another assay end point). Interestingly, Xu et al. performed detailed data analyses on the Merck Challenge data sets and found that the apparently better performance of multitask DNNs was in large part due to assay relatedness (i.e., test molecules for one assay were in the training set of a correlated assay).29 Thus, current implementations suggest that the transferability of a learned task is limited for molecular activities.
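The quantitative ideas discussed above — the nearest-neighbor Tanimoto distance, the grouping of test molecules into prediction layers, and the ensemble variance applicability-domain metric — can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes binary fingerprints represented as Python sets, and the 0.1 layer width is an assumption chosen to match the layer scheme described in the text (layers 7 and 8, beginning at a distance of 0.6, merged into a terminal layer 6).

```python
import statistics

def tanimoto_distance(fp_a: set, fp_b: set) -> float:
    """Tanimoto distance on binary (presence/absence) fingerprints:
    1 - |intersection| / |union|."""
    if not fp_a and not fp_b:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(fp_a | fp_b)

def nearest_training_distance(fp: set, training_fps) -> float:
    """Distance from a test molecule to its closest training molecule."""
    return min(tanimoto_distance(fp, t) for t in training_fps)

def prediction_layer(distance: float, width: float = 0.1, last_layer: int = 6) -> int:
    """Assign a nearest-neighbor distance to a prediction layer.
    Layer 1 covers [0, width), layer 2 [width, 2*width), and so on;
    sparsely populated outer layers are merged into a terminal layer,
    as done for layers 7 and 8 in the text. The width is an assumption."""
    return min(int(distance / width) + 1, last_layer)

def ensemble_variance(predictions) -> float:
    """Applicability-domain metric: standard deviation of the predictions
    given by an ensemble of models for the same molecule."""
    return statistics.stdev(predictions)
```

For example, a test molecule whose nearest training neighbor is at Tanimoto distance 0.25 falls in layer 3, inside the region (distance below 0.3) where the text reports reasonably reliable predictions for all three methods.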

Our study noted that machine-learning methods ranging from the deepest (multiple hidden layers, >700 000 weight parameters) to the shallowest (v-NN, with two parameters) have similar prediction accuracy for molecules ranging from the closest to the farthest from the training sets. Thus, the most important determinant of the prediction accuracy of molecular activities is not the machine-learning method, but the distance to the training molecules. This corroborates and supports the observations of Sheridan et al.30 and Tetko et al.,15 inferred from conventional machine-learning methods, that the error of predicting a molecule does not depend on the descriptors or machine-learning method used, but rather on the similarity to the training set molecules. These results indicate two paths forward for using machine-learning methods to improve predictions of molecular activities: (1) coupling focused data generation with strictly defined prediction errors and (2) developing machine-learning methods that truly learn transferable biological relationships from sparse data.

With respect to the first path, the results of this study indicate that, irrespective of current machine-learning methods, if assay data generation can be tailored for modeling (e.g., by examining as many diverse chemical scaffolds as possible rather than focusing on any selected chemical series), model accuracy and applicability will be optimized. Importantly, by coupling such tailored data generation with a strictly defined and validated applicability domain, it will be possible to provide a tool that can be used with confidence to prospectively gauge the broadest possible set of chemicals. It will also provide guidance on which chemicals are not part of the model's applicability domain and suggestions on where additional assay experiments are needed. In this sense, the ability to distinguish a reliable from an unreliable prediction is paramount, as chemical space is vast and is sparsely and unevenly populated by the existing chemicals used to parametrize models.

The fundamental basis of QSAR predictions is that similar molecules have similar properties. Thus, the ability to learn this similarity provides a model with the means to predict molecular properties. This enormously successful principle has been exploited using different techniques, ranging from regression analysis to machine learning. If we are to take advantage of current developments in artificial intelligence and go beyond the similarity principle, machine-learning methods will need to truly learn not only similarities, but biological and chemical principles as well.

The situation in the drug development field, in which data are sparse and major efforts are required to generate data points for chemical and biological assays, is opposite to that of other fields, such as image and speech recognition, where data are in ample supply. Although most chemical fingerprints should provide sufficient chemical characterization, we lack sufficient chemical and biological data to enable standard machine-learning methods to learn the underlying processes governing assay outcomes.

Typical assay data range from physicochemical properties, single-enzyme assay data, and cell-based screening data to animal in vivo outcomes, representing a wealth of chemical and biological processes. The diversity of processes reflected by these data suggests that, to apply artificial intelligence in learning how to make predictions for these end points, the DNN constructs need to be trained on mechanisms rather than on outcomes. These mechanisms or DNN models could then be integrated and interrogated depending on the modeled assay end point, similar to an adverse outcome pathway framework in predictive toxicology.

ASSOCIATED CONTENT

Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.8b00348.
Correlation coefficients (R2) between predicted and measured molecular activities (Tables S1 and S2) and scatterplots of predicted versus measured molecular activities (Figures S1−S6) (PDF)

AUTHOR INFORMATION

Corresponding Authors
*E-mail: [email protected] (R.L.).
*E-mail: [email protected] (A.W.).
ORCID
Ruifeng Liu: 0000-0001-7582-9217
Michael G. Feasel: 0000-0001-7029-2764
Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the assistance of Dr. Tatsuya Oyama in editing the manuscript. The research was supported by the U.S. Army Medical Research and Materiel Command (Ft. Detrick, MD) as part of the U.S. Army's Network Science Initiative and by the Defense Threat Reduction Agency grant CBCall14-CBS-05-2-0007. The opinions and assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the U.S. Army or of the U.S. Department of Defense. This paper has been approved for public release with unlimited distribution.

REFERENCES

(1) Rawat, W.; Wang, Z. Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Computation 2017, 29, 2352−2449.
(2) Deng, L.; Li, X. Machine Learning Paradigms for Speech Recognition: An Overview. IEEE Trans. Audio, Speech, Lang. Process. 2013, 21, 1060−1089.
(3) Brown, N.; Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 2018, 359, 418−424.
(4) Goh, G. B.; Hodas, N. O.; Vishnu, A. Deep learning for computational chemistry. J. Comput. Chem. 2017, 38, 1291−1307.
(5) Ma, J.; Sheridan, R. P.; Liaw, A.; Dahl, G. E.; Svetnik, V. Deep neural nets as a method for quantitative structure-activity relationships. J. Chem. Inf. Model. 2015, 55, 263−274.
(6) Winkler, D. A.; Le, T. C. Performance of Deep and Shallow Neural Networks, the Universal Approximation Theorem, Activity Cliffs, and QSAR. Mol. Inf. 2017, 36, 1781141.


(7) Koutsoukas, A.; Monaghan, K. J.; Li, X.; Huan, J. Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J. Cheminf. 2017, 9, 42.
(8) Ramsundar, B.; Liu, B.; Wu, Z.; Verras, A.; Tudor, M.; Sheridan, R. P.; Pande, V. Is Multitask Deep Learning Practical for Pharma? J. Chem. Inf. Model. 2017, 57, 2068−2076.
(9) Lenselink, E. B.; Ten Dijke, N.; Bongers, B.; Papadatos, G.; van Vlijmen, H. W. T.; Kowalczyk, W.; IJzerman, A. P.; van Westen, G. J. P. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J. Cheminf. 2017, 9, 45.
(10) Zhang, L.; Tan, J.; Han, D.; Zhu, H. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discovery Today 2017, 22, 1680−1685.
(11) Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T. The rise of deep learning in drug discovery. Drug Discovery Today 2018, 23, 1241−1250.
(12) Cybenko, G. Approximation by superposition of sigmoidal functions. Math. Control Signals Syst. 1989, 2, 303−314.
(13) Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 1991, 4, 251−257.
(14) Liu, R.; Madore, M.; Glover, K. P.; Feasel, M. G.; Wallqvist, A. Assessing deep and shallow learning methods for quantitative prediction of acute chemical toxicity. Toxicol. Sci. 2018, 164, 512−526.
(15) Tetko, I. V.; Sushko, I.; Pandey, A. K.; Zhu, H.; Tropsha, A.; Papa, E.; Oberg, T.; Todeschini, R.; Fourches, D.; Varnek, A. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. J. Chem. Inf. Model. 2008, 48, 1733−1746.
(16) Sushko, I.; Novotarskyi, S.; Korner, R.; Pandey, A. K.; Cherkasov, A.; Li, J.; Gramatica, P.; Hansen, K.; Schroeter, T.; Muller, K. R.; Xi, L.; Liu, H.; Yao, X.; Oberg, T.; Hormozdiari, F.; Dao, P.; Sahinalp, C.; Todeschini, R.; Polishchuk, P.; Artemenko, A.; Kuz'min, V.; Martin, T. M.; Young, D. M.; Fourches, D.; Muratov, E.; Tropsha, A.; Baskin, I.; Horvath, D.; Marcou, G.; Muller, C.; Varnek, A.; Prokopenko, V. V.; Tetko, I. V. Applicability domains for classification problems: Benchmarking of distance to models for Ames mutagenicity set. J. Chem. Inf. Model. 2010, 50, 2094−2111.
(17) Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition Coefficients as a Measure of Hydrophobicity. J. Comput. Chem. 1986, 7, 565−577.
(18) Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(19) Al-Rfou, R.; Alain, G.; Almahairi, A.; Angermueller, C.; Bahdanau, D.; Ballas, N.; Bastien, F.; Bayer, J.; Belikov, A.; Belopolsky, A.; Bengio, Y.; Bergeron, A.; Bergstra, J.; Bisson, V.; Snyder, J. B.; Bouchard, N. Theano: A Python framework for fast computation of mathematical expressions. arXiv.org 2016, 1605.02688.
(20) Liu, R.; Tawa, G.; Wallqvist, A. Locally weighted learning methods for predicting dose-dependent toxicity with application to the human maximum recommended daily dose. Chem. Res. Toxicol. 2012, 25, 2216−2226.
(21) Sheridan, R. P. Time-split cross-validation as a method for estimating the goodness of prospective prediction. J. Chem. Inf. Model. 2013, 53, 783−790.
(22) Weaver, S.; Gleeson, M. P. The importance of the domain of applicability in QSAR modeling. J. Mol. Graphics Modell. 2008, 26, 1315−1326.
(23) Schwaighofer, A.; Schroeter, T.; Mika, S.; Blanchard, G. How wrong can we get? A review of machine learning approaches and error bars. Comb. Chem. High Throughput Screening 2009, 12, 453−468.
(24) Tetko, I. V.; Poda, G. I.; Ostermann, C.; Mannhold, R. Accurate in silico logP predictions: One can't embrace the unembraceable. QSAR Comb. Sci. 2009, 28, 845−849.
(25) Tetko, I. V.; Poda, G. I.; Ostermann, C.; Mannhold, R. Large-scale evaluation of log P predictors: local corrections may compensate insufficient accuracy and need of experimentally testing every other compound. Chem. Biodiversity 2009, 6, 1837−1844.
(26) Liu, R.; Glover, K. P.; Feasel, M. G.; Wallqvist, A. General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity. J. Chem. Inf. Model. 2018, 58, 1561−1575.
(27) Kearnes, S.; Goldman, B.; Pande, V. Modeling industrial ADMET data with multitask networks. arXiv.org 2017, 1606.08793.
(28) Dahl, G. E.; Jaitly, N.; Salakhutdinov, R. Multi-task neural networks for QSAR predictions. arXiv.org 2014, 1406.1231.
(29) Xu, Y.; Ma, J.; Liaw, A.; Sheridan, R. P.; Svetnik, V. Demystifying Multitask Deep Neural Networks for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2017, 57, 2490−2504.
(30) Sheridan, R. P.; Feuston, B. P.; Maiorov, V. N.; Kearsley, S. K. Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J. Chem. Inf. Comput. Sci. 2004, 44, 1912−1928.

