
Molecular machine learning with conformer ensembles

Simon Axelrod¹,² and Rafael Gómez-Bombarelli²

¹ Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138
² Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02142

[email protected], [email protected]

arXiv:2012.08452v2 [cs.LG]
February 22, 2021

Abstract
Virtual screening can accelerate drug discovery by identifying promising candidates
for experimental evaluation. Machine learning is a powerful method for screening,
as it can learn complex structure-property relationships from experimental data
and make rapid predictions over virtual libraries. Molecules inherently exist as a
three-dimensional ensemble and their biological action typically occurs through
supramolecular recognition. However, most deep learning approaches to molecular
property prediction use a 2D graph representation as input, and in some cases a
single 3D conformation. Here we investigate how the 3D information of multiple
conformers, traditionally known as 4D information in the cheminformatics
community, can improve molecular property prediction in deep learning models.
We introduce multiple deep learning models that expand upon key architectures
such as ChemProp and SchNet, adding elements such as multiple-conformer inputs
and conformer attention. We then benchmark the performance trade-offs of these
models on 2D, 3D and 4D representations in the prediction of drug activity using
a large training set of geometrically resolved molecules. The new architectures
perform significantly better than 2D models, but their performance is often just as
strong with a single conformer as with many. We also find that 4D deep learning
models learn interpretable attention weights for each conformer.

1 Introduction
Drug development is a long and costly process. Bringing a new drug to market takes an average
of seven years [1] and costs $2.9 billion (USD, 2013) [2]. Because so many drugs fail in late stage
trials, it is critical to generate a variety of leads to increase the chance of success. Leads are often
discovered by screening large chemical libraries, but these libraries are expensive to manage, and their
chemistry is both homogeneous and poorly reflective of actual drug chemistry [3]. Computational
screening can improve this process by exploring a much larger space of compounds and identifying
the top candidates for experimental testing. Such methods range from physics-based simulations,
such as computational docking [4, 5] and molecular dynamics [6], to data-driven regressions, such as
machine learning (ML) [7–9]. Combinations of both have also produced fruitful results [10]. While
docking accuracy is limited by the scoring function and force field [4], the accuracy of ML methods
is mostly limited by the amount of available data, and ML inference is orders of magnitude faster.
Given enough data, neural networks used in ML can in principle learn any function, including the
mapping from molecule to property. However, there is typically a scarcity of data for successful drugs,

as the vast majority of tested molecules do not bind the target protein. It is therefore necessary to
optimize the neural network architecture to best leverage the limited data. Advances in deep learning
have improved ML performance by training networks directly on molecular graphs [11–23]. These
message-passing neural networks (MPNNs) use graphs to generate learnable, continuous fingerprints,
which are then used as input to a neural network that outputs predicted properties. This representation
encodes a task-specific molecular similarity, as fingerprints predicted to have similar properties are
themselves similar.
The state of the art in deep learning uses 2D molecular graphs [24] (or in some cases a single 3D
molecular structure [11]) to generate these fingerprints. One way to further improve the representation
is to use 3D ensemble information. A molecule is neither a 2D structure nor a single 3D structure,
but rather an ensemble of continuously inter-converting 3D structures. The process of drug binding is
a 3D recognition event between the drug and the binding pocket, and so depends critically on the
3D structures accessible to the molecule. Ensembles have been used to generate fixed descriptors
in computational drug discovery [25], but they have not been used in MPNNs. Nevertheless, the
ensemble information may still be learned implicitly by 2D MPNNs, as the set of conformers is
in principle a function of the graph only. The extent to which 2D models can implicitly learn
this information, and thus match the performance of single- or multi-conformer 3D models, is not
currently understood.
Here we investigate whether 3D information of one or more conformers can improve computational
drug discovery with ML (Fig. 1). As an example we screen molecules that can inhibit the novel
SARS coronavirus 2 (SARS-CoV-2), which causes COVID-19 [26]. We find that models based on
a single conformer can better identify held out CoV-2 inhibitors than models based on 2D graphs.
However, using multiple conformers does not further improve the results. We also test a transfer
learning strategy to leverage the large amount of data for SARS-CoV, which causes SARS, to better
predict SARS-CoV-2 inhibition. We find that transfer learning with 3D and 4D models can be better
than training 2D models from scratch, but that the improvement is small.

2 Previous work
We distinguish two broad approaches to computational drug discovery. The first, which we refer
to as QSAR or classical QSAR (quantitative structure-activity relationship), has long been applied
by the computational chemistry community. The second, which we refer to as deep learning, has
seen growing interest in recent years driven by advances in representation learning, such as graph
convolutional neural networks. Both approaches predict molecular properties by applying a readout
function to a molecule’s features. Readout methods include linear, inherently regularized functions
like partial least-squares [27], and nonlinear, flexible functions like k-nearest neighbors [28] and
neural networks. However, MPNNs differ from classical QSAR approaches, including those with
neural network readouts, in that they learn the features directly from the molecules in the training set.
Both the cheminformatics and deep learning communities have focused on the improvement of
so-called featurization or representation learning. Researchers in computational drug discovery have
developed features that better reflect the molecule-protein binding process (see below). Machine
learning researchers have developed models that better “train themselves” to learn powerful fea-
turizations. MPNNs can generate strong molecular representations that outperform hand-crafted
descriptors, but require vast amounts of data to do so. In practice their performance is often improved
by incorporating features developed by experts [24]. Here we review various 3D and 4D methods
that have been extensively developed by the drug discovery community and successfully applied to
many targets [29–49]. We then briefly review different 2D and 3D approaches in deep learning and
discuss how the principles of 4D QSAR can be applied to MPNNs.

2.1 3D QSAR

Classical QSAR methods generate features based on physical principles. One such principle is that
binding affinity and specificity are determined by the forces between ligands and receptors. For
example, the popular CoMFA method [50, 51] aligns molecules in a 3D grid, computes steric and
electrostatic fields at different positions around the molecules, combines these values into features,
and correlates the features with biological activity using partial least-squares [27]. The GRID method
[52–54] also incorporates hydrophobic and hydrogen bonding interactions.

Figure 1: Molecular representations of the latanoprost molecule. Top: SMILES string. Left: stereochemical
formula with edge features, including wedges for in- and out-of-plane bonds, and a double line for cis isomerism.
Right: overlay of conformers; higher transparency corresponds to lower statistical weight.

Other methods use the principle that ligands with similar shapes have similar binding properties. For
example, the MSA method uses the difference in steric volume between a sampled structure and
a reference structure to predict the sample’s activity [55]. The reference structure is chosen as the
geometry in the training set that maximizes the quality of the fit. Other methods, such as ROCS
[40, 56] and Phase Shape [57], follow a similar principle but use different methods for computing the
volume overlap.
In all 3D methods one must choose a molecular conformation for each species. Molecules exist as
an ensemble of conformers, but the most important is the one with the strongest binding affinity to
the target. This is known as the bioactive conformation. Many 3D methods assume that the lowest
energy conformer is the bioactive conformation [27, 58], but this is often not the case. Others, such
as Compass [59], generate a model and an estimate of the bioactive pose together.

2.2 4D QSAR

4D methods use multiple conformations for each molecule. For example, the method of Ref. [60]
positions each conformer in a 3D grid, assigns atom features to occupied grid sites, and averages over
all conformers. Refs. [61, 62] use a similar approach but with frames generated by MD simulations.
As in 3D models, one can identify the bioactive pose of a new species by selecting the conformer
with the highest predicted activity. Even though the binding pocket is not used during training,
the interactions between the predicted bioactive pose and the binding site tend to be energetically
favorable [25].
Ref. [63] used molecular dynamics trajectories of the ligand-protein complex to generate an ensemble
of WHIM shape descriptors [64], and used the mean and standard deviation of the descriptors as
additional features. WHIM descriptors are invariant to translation and rotation, while grid-based
and volume overlap methods depend on the orientation of the molecule. Therefore, the alignment
of each molecule to a reference structure, which is often ambiguous and can significantly alter
model predictions [27, 65, 66], is avoided. Further, explicitly identifying a bioactive conforma-
tion was not necessary because the MD simulations were performed with the full ligand-receptor
complex. The MD descriptors were shown to strongly distinguish between the most active and the
moderately/weakly active and inactive ERK2 kinase inhibitors. They were also shown to provide
information absent in 2D descriptors, and even in 3D descriptors from binding poses generated
with computational docking. However, since the approach used the protein-ligand complex in MD
simulations, it is not clear whether the results would also apply to ligand-only calculations.

2.3 MPNNs

Many QSAR methods generate features based on physical principles that are relevant to the target
property. A conceptually different approach is to generate fingerprints that encode the maximum
amount of molecular information and can be applied in any context. The current state of the art in
deep learning uses MPNNs [11] to learn molecular representations directly from 2D or 3D molecular
graphs [12–20].

2.4 QSAR and MPNNs

Extensive work has been done in the deep learning community to develop and improve MPNNs that
act on 2D or single 3D molecular graphs [11–20], but no work has been done on representations with
ensembles. Moreover, unlike in QSAR, the issue of bioactive conformation has not been addressed
in deep learning, as the lowest energy conformer is the only structure used, and this may not be the
bioactive conformation.
Another issue for 3D and 4D models is the generation of conformers. Conformer generation is a
difficult task (see Subsection 3.6 below), and the ranking of conformers by energy with classical force
fields is highly inaccurate [67]. MD simulations may be able to better sample low-energy regions
of phase space than stochastic conformer generators, but without advanced sampling methods can
become stuck in local minima. Moreover, the quality of the classical MD conformers is still limited
by the accuracy of the force field. This problem affects all models that use single or multiple 3D
structures, whether the featurization is hand-crafted or learned.
In this work we combine the physical picture of a bioactive pose, used extensively in QSAR, with
tools developed in deep learning. First we develop a number of new 3D MPNN models by combining
force field architectures with 2D property predictors. Next we show how the MPNNs can be trained
on multiple conformers or on a single “effective conformer” that represents the ensemble. The former
uses the deep learning concept of attention [68–71] to learn the bioactive pose concurrently with the
model training. We also address the issue of inaccurate conformer generation, common to all 3D and
4D models, by using the high-quality conformers in the GEOM dataset [72]. The GEOM conformers
were generated with the CREST method [73], which uses semi-empirical DFT and advanced sampling
to thoroughly and accurately sample the low-energy regions of phase space. Our work thus expands
upon methods used in 2D and 3D deep learning, addresses conformer issues common to all 3D and
4D methods, and provides a number of ways to add 4D information to MPNNs.

3 Methods
3.1 Neural network architectures

Here we discuss and extend various 2D and 3D message-passing architectures, and describe several
methods for applying them to conformer ensembles.

3.1.1 Message passing


Molecules can be represented as graphs. These graphs consist of a set of nodes (atoms) connected to
each other by a set of edges. Both the nodes and edges have features. The atoms, for example, can be
characterized by their atomic number and partial charge. The edges can be characterized by bond
type or by interatomic distance. In the message passing phase, message-passing neural networks
(MPNNs) aggregate node and edge features to create a learned fingerprint. The readout phase uses
the fingerprint as input to a regressor that predicts a property [24].
The message passing phase consists of T steps, or convolutions. In what follows, superscripts denote
the convolution number. The node features of the v-th node are $\vec{x}_v$, and the edge features between
nodes v and w are $\vec{e}_{vw}$. The atom features $\vec{x}_v$ are first mapped to so-called hidden states $\vec{h}_v^0$. A
message $\vec{m}_v^{t+1}$ is created in the t-th convolution, which combines $\vec{h}_v^t$ and $\vec{h}_w^t$ for each pair of nodes
v and w with edge features $\vec{e}_{vw}$ [24, 74]:

$$\vec{m}_v^{t+1} = \sum_{w \in N(v)} M_t(\vec{h}_v^t, \vec{h}_w^t, \vec{e}_{vw}), \qquad (1)$$

where N(v) is the set of neighbors of v in graph G, and $M_t$ is a message function. The hidden states
are updated using a vertex update function $U_t$:

$$\vec{h}_v^{t+1} = U_t(\vec{h}_v^t, \vec{m}_v^{t+1}). \qquad (2)$$

The readout phase then uses a function R to map the final hidden states to a property y, through

$$\hat{y} = R(\{\vec{h}_v^T \mid v \in G\}). \qquad (3)$$
In this work we also use the directed message-passing ChemProp model [24], which achieves state-of-the-art performance on a wide range of prediction tasks [24]. In this implementation, hidden states $\vec{h}_{vw}^t$ and messages $\vec{m}_{vw}^t$ are associated with directed edges, rather than the node-based states $\vec{h}_v^t$ and messages $\vec{m}_v^t$. Hidden states are initialized with

$$\vec{h}_{vw}^0 = \tau(W_i [\vec{x}_v \| \vec{e}_{vw}]), \qquad (4)$$

where $W_i \in \mathbb{R}^{F \times F_0}$ is a learned matrix, F is the hidden state dimension, $F_0$ is the dimension of $[\vec{x}_v \| \vec{e}_{vw}]$, which is the concatenation of the atom features $\vec{x}_v$ for atom v and the bond features $\vec{e}_{vw}$ for bond vw, and $\tau$ is an activation function. The message passing function is simply $M_t(\vec{x}_v, \vec{x}_w, \vec{h}_{vw}^t) = \vec{h}_{vw}^t$. The edge update function is the same neural network at each step:

$$U_t(\vec{h}_{vw}^t, \vec{m}_{vw}^{t+1}) = U(\vec{h}_{vw}^t, \vec{m}_{vw}^{t+1}) = \tau(\vec{h}_{vw}^0 + W_m \vec{m}_{vw}^{t+1}), \qquad (5)$$

where $W_m \in \mathbb{R}^{F \times F}$ is a learned matrix. Each message-passing phase is then

$$\vec{m}_{vw}^{t+1} = \sum_{k \in N(v) \setminus w} \vec{h}_{kv}^t, \qquad \vec{h}_{vw}^{t+1} = \tau(\vec{h}_{vw}^0 + W_m \vec{m}_{vw}^{t+1}), \qquad (6)$$

for $t \in \{1, ..., T\}$. After the final convolution, the atom representation of the molecule is recovered through

$$\vec{m}_v = \sum_{w \in N(v)} \vec{h}_{vw}^T, \qquad \vec{h}_v = \tau(W_a [\vec{x}_v \| \vec{m}_v]). \qquad (7)$$

The hidden states are then summed to give a feature vector for the molecule: $\vec{h} = \sum_{v \in G} \vec{h}_v$.
Properties are predicted through $\hat{y} = f(\vec{h})$, where f is a feed-forward neural network. In ChemProp
the atom features are atom type, number of bonds, formal charge, chirality, number of bonded
hydrogen atoms, hybridization, aromaticity, and atomic mass. The bond features are the bond type
(single, double, triple, or aromatic), whether the bond is conjugated, whether it is part of a ring, and
whether it contains stereochemistry (none, any, E/Z, or cis/trans). All features are one-hot encodings.
The ChemProp code was accessed through [75].
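The directed-edge updates of Eqs. (4)-(7) can be sketched as follows. This is a simplified illustration, assuming precomputed directed-edge index arrays; the actual ChemProp implementation [75] handles batching, dropout, and other details differently.

```python
import torch
import torch.nn as nn

class DirectedEdgeMPNN(nn.Module):
    """Sketch of the directed-edge message passing of Eqs. (4)-(7)."""

    def __init__(self, node_dim, edge_dim, F, T=3):
        super().__init__()
        self.W_i = nn.Linear(node_dim + edge_dim, F, bias=False)   # Eq. (4)
        self.W_m = nn.Linear(F, F, bias=False)                     # Eqs. (5)-(6)
        self.W_a = nn.Linear(node_dim + F, F, bias=False)          # Eq. (7)
        self.tau = nn.ReLU()
        self.T = T

    def forward(self, x, e, src, dst, rev):
        # x: [n_atoms, node_dim]; e: [n_dir_edges, edge_dim];
        # src, dst: endpoints of each directed edge v->w; rev: index of edge w->v
        h0 = self.tau(self.W_i(torch.cat([x[src], e], dim=-1)))   # h^0_vw, Eq. (4)
        h = h0
        for _ in range(self.T):
            # m^{t+1}_vw = sum_{k in N(v)\w} h^t_kv: aggregate all directed
            # edges arriving at v, then exclude the reverse edge w->v
            node_sum = h.new_zeros(x.size(0), h.size(1)).index_add_(0, dst, h)
            m = node_sum[src] - h[rev]
            h = self.tau(h0 + self.W_m(m))                         # Eq. (6)
        # Eq. (7): atom representations from incident edge states
        m_v = h.new_zeros(x.size(0), h.size(1)).index_add_(0, src, h)
        h_v = self.tau(self.W_a(torch.cat([x, m_v], dim=-1)))
        return h_v.sum(dim=0)   # molecular fingerprint h
```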

3.1.2 Learning with 3D features


A variety of graph convolutional models have been proposed for learning force fields, which map a
set of 3D atomic positions of a molecular entity to an energy. Architectures designed for force fields
typically do not incorporate information about the covalent connectivity [17, 18, 76, 77] since these
bonds are broken and formed during chemical reactions and may not be clearly defined. This differs
from architectures for property prediction, which are typically based on 2D graphs [12, 24, 78] but
can also use 3D information [74].
Here we combine graph-based models with 3D models in a number of ways. Our first such model is
called SchNetFeatures, as it uses the SchNet force field architecture [17, 18] (code adapted from
[79]), but adds additional graph-based features based on the covalent connectivity. Our second
model is called ChemProp3D, as it uses the ChemProp property prediction architecture, but adds
additional 3D (distance-based) edge features between atoms within a distance threshold. We have
also explored a reduced ChemProp3D model, in which bond states are updated only based on other
covalently-bonded neighbors, and not on non-bonded neighbors. In this case the updated bond
features are simply concatenated with the original distance features at each step. This model is called
CP3D-NDU, for ChemProp3D with no distance updates. We also abbreviate this as CND.
In the typical SchNet model the feature vector of each atom is initialized with an embedding function.
This embedding generates a random vector that is unique to every atom with a given atomic number,
and is also learnable. The edge features at each step t are generated through a so-called filter network
V t . The filter network generates an edge vector ~evw by expanding the distance ||~rv − ~rw || in a
basis of Gaussian functions. The expansion coefficients are then transformed into ~evw through linear
and non-linear operations. Because only the distance between two atoms is used to create ~evw , the
features produced are invariant to rotations and translations.

In each convolution t + 1, the new messages and hidden vectors are given by

$$\vec{m}_v^{t+1} = \sum_{w} M_t(\vec{h}_v^t, \vec{h}_w^t, \vec{e}_{vw}) = \sum_{w} \vec{h}_w^t \circ \vec{e}_{vw}, \qquad \vec{h}_v^{t+1} = U_t(\vec{h}_v^t, \vec{m}_v^{t+1}) = \vec{h}_v^t + I^t\left(\sum_{w \in N(v)} J^t(\vec{h}_w^t) \circ \vec{e}_{vw}\right), \qquad (8)$$

where the edge features are

$$\vec{e}_{vw} = V^t(r_{vw}). \qquad (9)$$

Here $\circ$ denotes element-wise multiplication, and $I^t$ and $J^t$ are a series of linear and non-linear
operations applied to the atomic features. The neighbors N(v) of atom v are those within a pre-set
cutoff distance $r_{\text{cut}}$. For SchNetFeatures we replace the initial atomic feature embedding with the
graph-based atom features described above for ChemProp. We also combine distance features with
bond features to create the $\vec{e}_{vw}$ of Eq. (8), through

$$\vec{e}_{vw} \to [\vec{e}_{vw}^{\,\text{dist}} \| \vec{e}_{vw}^{\,\text{bond}}] = [V^t(r_{vw}) \| \vec{e}_{vw}^{\,\text{bond}}]. \qquad (10)$$

Here $\vec{e}_{vw}^{\,\text{dist}}$ are the edge features generated by the SchNet distance filter network. The $\vec{e}_{vw}^{\,\text{bond}}$ are
hidden bond vectors, obtained from a nonlinear operation applied to the bond features, and are set to
0 for non-bonded atom pairs. Note that applying $I^t$ in Eq. (8) mixes the distance and bond features
before they are used to update the $\vec{h}_v^t$.
In the original SchNet implementation the readout layer converted each atomic feature vector into a
single number, and the numbers were summed to give an energy. Consistent with the notion of
property prediction, we here instead convert the node features into a molecular fingerprint by adding
the features of each node. The readout function is then applied to this fingerprint.
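A minimal sketch of one SchNet-style convolution (Eqs. (8)-(9)) with a Gaussian distance expansion is given below. The layer widths, the Softplus activation, and the use of a shared filter network per step are illustrative assumptions, not the exact SchNet hyperparameters.

```python
import torch
import torch.nn as nn

def gaussian_expansion(r, n_gauss=10, r_cut=5.0):
    """Expand distances in a basis of Gaussians spaced over [0, r_cut]."""
    centers = torch.linspace(0.0, r_cut, n_gauss)
    width = centers[1] - centers[0]
    return torch.exp(-((r.unsqueeze(-1) - centers) ** 2) / width ** 2)

class CFConvSketch(nn.Module):
    """One SchNet-style convolution, Eq. (8), with filter network V^t, Eq. (9)."""

    def __init__(self, F, n_gauss=10):
        super().__init__()
        self.filter_net = nn.Sequential(          # V^t: distance -> edge features
            nn.Linear(n_gauss, F), nn.Softplus(), nn.Linear(F, F))
        self.J = nn.Linear(F, F)                  # J^t, applied to neighbor features
        self.I = nn.Sequential(nn.Linear(F, F), nn.Softplus(), nn.Linear(F, F))  # I^t

    def forward(self, h, r, nbr_index):
        v, w = nbr_index                          # atom pairs within r_cut
        e = self.filter_net(gaussian_expansion(r))  # e_vw = V^t(r_vw)
        msg = self.J(h[w]) * e                    # J^t(h_w) o e_vw
        agg = torch.zeros_like(h).index_add_(0, v, msg)  # sum over neighbors
        return h + self.I(agg)                    # h_v^{t+1}, Eq. (8)
```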
The usual ChemProp model creates and updates edge embeddings for each bond. For ChemProp3D
we create an edge embedding for each pair of atoms separated by less than $r_{\text{cut}}$. In particular, we
use one initialization matrix for the distances, $W_i^{\text{dist}}$, and one for the bond features, $W_i^{\text{bond}}$. Hidden
states are then initialized through

$$\vec{h}_{vw}^{0,\text{dist}} = \tau(W_i^{\text{dist}} [\vec{x}_v \| \vec{e}_{vw}^{\,\text{dist}}]), \qquad \vec{h}_{vw}^{0,\text{bond}} = \tau(W_i^{\text{bond}} [\vec{x}_v \| \vec{e}_{vw}^{\,\text{bond}}]), \qquad \vec{h}_{vw}^0 = [\vec{h}_{vw}^{0,\text{dist}} \| \vec{h}_{vw}^{0,\text{bond}}]. \qquad (11)$$

The distance features $\vec{e}_{vw}^{\,\text{dist}}$ are the result of a SchNet filter network applied to the distances $r_{vw}$. The
bond features for non-bonded pairs are again set to 0, and the remainder of the ChemProp architecture
is unchanged (apart from the neighbors now including all atom pairs within $r_{\text{cut}}$ of each other).
The CND model differs from ChemProp3D in that only the edge features of bonded pairs are updated.
These updates do not use any distance information, meaning that they are equivalent to the updates in
the 2D ChemProp model. Distance information is instead incorporated in the following way. After all
T convolutions are complete, a set of distance features $\vec{e}_{vw}^{\,\text{dist}}$ is generated for all pairs vw within $r_{\text{cut}}$
of each other. The $\vec{e}_{vw}^{\,\text{dist}}$ of bonded pairs are concatenated with the final bonded edge features $\vec{m}_{vw}^T$,
while those of non-bonded pairs are concatenated with zeros. These concatenated vectors are then
summed as in Eq. (7) to give the node features $\vec{m}_v$, which are in turn summed to give the molecular
features $\vec{h}$.
For all 3D models we use 10 Gaussian functions for each distance and a cutoff distance of 5 Å. We
use three convolutions for SchNetFeatures, but use two convolutions for the ChemProp3D models
because of higher memory costs. The SchNet force field uses Gaussians spaced 0.1 Å apart [17, 18],
which is necessary to resolve movements that lead to appreciable changes in energy. Here we use a
much larger spacing (0.3-0.5 Å) because we are interested in larger-scale features of the molecule.
Resolving the structure at a finer scale would not be expected to improve performance, and could
more easily lead to overfitting.
The above discussion applies to molecules associated with one geometry. However, multiple con-
formers correspond to a single stereochemical formula and can be considered at the same time, each

one having a different statistical weight p(n) . It is not immediately clear how to pool the fingerprints
of the different conformers. A simple pooling scheme would be to multiply the fingerprint of each
conformer by p(n) and add the results. However, this assumes that the resultant property is a statis-
tical average of individual conformer properties, and this is not always the case. For example, the
contribution of a conformer to the binding process is not determined by its statistical weight. Rather,
it is determined by the affinity of the conformer for the target, but this affinity is not known a priori,
and large differences in binding energy may compensate for small differences in conformational
energy.

3.2 Attention model

An ideal pooling scheme would be adaptable, learning fingerprint weights that are best suited to the
task at hand. For this reason we propose a pooling mechanism based on attention [68–71]. Attention
combines a learned vector with two feature vectors to determine the importance of one set of features
to another. The resulting coefficients are then normalized to give weights for each feature vector.
Similarly to other applications of attention [71], we have found that it is useful to include multiple
attention heads in the pooling. These different heads can learn different features to focus on in the
pooling.
The notion of attention can be applied to conformer fingerprints in one of two ways. The first is to
let the attention mechanism learn the importance of one conformer’s features to another, giving a
set of attention coefficients αnm for n, m ∈ N , where N is the number of conformers. The second,
which we call linear attention, is to simply learn the conformer’s overall importance, rather than its
importance with respect to another conformer. This gives a set of coefficients αn for each conformer.
The choice of attention or linear attention is treated as a hyperparameter to be optimized.

3.2.1 Incorporating statistical weights


The attention mechanism should be able to use the statistical weight of each conformer if it improves
predictions. However, the fingerprint of the n-th conformer, $\vec{h}^{(n)}$, does not take into account $p^{(n)}$. To
add this information we embed the weight as a vector $\vec{d}^{(n)}$, through

$$\vec{d}^{(n)} = \mathrm{SoftMax}(D p^{(n)} + \vec{b}), \qquad (12)$$

where $D \in \mathbb{R}^{S \times 1}$ is a learned matrix, S is the dimension of $\vec{d}^{(n)}$, and $\vec{b}$ is a learned bias. The softmax
activation means that $d_k^{(n)}$ can be interpreted as the projection of $p^{(n)}$ onto the k-th probability bin.
Here we choose S = 10 so that the probabilities are divided into 10 bins. A linear layer is then
applied to the concatenation of $\vec{d}^{(n)}$ with $\vec{h}^{(n)}$, yielding the final conformer fingerprint $\vec{q}^{(n)}$:

$$\vec{q}^{(n)} = H [\vec{h}^{(n)} \| \vec{d}^{(n)}] + \vec{b}. \qquad (13)$$

Here, $H \in \mathbb{R}^{F \times (F+S)}$ is a learned matrix, and $\vec{b}$ is again a learned bias.

3.2.2 Computing attention coefficients


For the pairwise attention method we then compute the attention coefficients $c_{nm}$ between conformers
n and m as

$$c_{nm} = \vec{a} \cdot [A\vec{q}^{(n)} \| A\vec{q}^{(m)}], \qquad (14)$$

where $\vec{a} \in \mathbb{R}^{2F}$ is a learned vector, $A \in \mathbb{R}^{F \times F}$ is a learned matrix, and $\cdot$ is the dot product. For ease
of notation we have suppressed the dependence on the attention head index. The attention paid by
conformer n to conformer m is then

$$\alpha_{nm} = \frac{\exp(\mathrm{LeakyReLU}(c_{nm}))}{\sum_l \exp(\mathrm{LeakyReLU}(c_{nl}))}. \qquad (15)$$

The pooled fingerprint $\vec{Q}$ is a weighted sum over each fingerprint followed by an activation function $\tau$, such that

$$\vec{Q} = \tau\left(\frac{1}{N} \sum_{nm} \alpha_{nm} A\vec{q}^{(m)}\right). \qquad (16)$$

Figure 2: Schematic of the architecture in this work. (a) A neural network uses the 3D structure of the
n-th conformer, along with its graph information, to create a molecular fingerprint $\vec{h}^{(n)}$. (b) Each fingerprint
is combined with an embedding of the conformer's statistical weight $p^{(n)}$. The new fingerprints $\vec{q}^{(n)}$ are
aggregated through an attention mechanism to yield the final fingerprint $\vec{Q}$.

Here N is the number of conformers, and the division by N ensures that $\sum_{nm} \alpha_{nm}/N = 1$. For
the linear attention mechanism only one conformer is used to determine each coefficient, through
$c_n = \vec{a} \cdot (A\vec{q}^{(n)})$, where $\vec{a} \in \mathbb{R}^F$. The coefficients $\alpha_n$ are normalized by summing over one index
only, and the pooled conformers are

$$\vec{Q}_{\text{linear}} = \tau\left(\sum_n \alpha_n A\vec{q}^{(n)}\right). \qquad (17)$$

Adding a superscript to denote fingerprints from different heads, the final fingerprint resulting from
K linear or pair-wise attention heads is then

$$\vec{Q} = [\vec{Q}^{(1)} \| \vec{Q}^{(2)} \| \cdots \| \vec{Q}^{(K)}]. \qquad (18)$$
A schematic of the fingerprinting architecture is shown in Fig. 2.
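A minimal sketch of the linear-attention variant (Eqs. (12), (13), and (17)) for a single attention head is given below; the pairwise variant of Eqs. (14)-(16) and the multi-head concatenation of Eq. (18) follow the same pattern. The activation choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearConformerAttention(nn.Module):
    """Single-head linear attention over conformer fingerprints,
    Eqs. (12), (13), and (17)."""

    def __init__(self, F, S=10):
        super().__init__()
        self.D = nn.Linear(1, S)                 # weight embedding, Eq. (12)
        self.H = nn.Linear(F + S, F)             # Eq. (13)
        self.A = nn.Linear(F, F, bias=False)
        self.a = nn.Parameter(torch.randn(F))    # attention vector
        self.tau = nn.Tanh()

    def forward(self, h, p):
        # h: [N, F] conformer fingerprints; p: [N] statistical weights
        d = torch.softmax(self.D(p.unsqueeze(-1)), dim=-1)   # d^(n), Eq. (12)
        q = self.H(torch.cat([h, d], dim=-1))                # q^(n), Eq. (13)
        c = (self.A(q) * self.a).sum(dim=-1)                 # c_n = a . (A q^(n))
        alpha = torch.softmax(nn.functional.leaky_relu(c), dim=0)
        return self.tau((alpha.unsqueeze(-1) * self.A(q)).sum(dim=0))  # Eq. (17)
```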

3.3 Average distance model

As an alternative to the attention-based model we also propose an average-distance or average-neighbors
("avg nbrs") model. In this approach, message-passing occurs over a single effective
conformer, with interatomic distances given by the distances averaged over all conformers. That is,
the effective distance $\bar{d}_{ij}$ between atoms i and j is given by

$$\bar{d}_{ij} = \sum_n w_n d_{ij}^{(n)}, \qquad (19)$$

where the sum is over all conformers, $w_n$ is the statistical weight of the n-th conformer, and $d_{ij}^{(n)}$ is the
distance between atoms i and j in the n-th conformer. The benefit of this approach is the enormous
reduction in computational cost compared to using multiple conformers. Since only one effective
graph is needed per species, the computational cost is reduced by a factor of $n_{\text{confs}}$, the average
number of conformers per species. The drawback, however, is that we are forced to use non-learnable
weights, as learnable weights would require each conformer to be stored in the computation graph.
Here we have used statistical weights, but it would also be possible to use any other weights that are
determined before training. For example, one could first train an attention model on non-learnable
fingerprints, and then use the resulting weights in Eq. (19).
It is important to notice that Eq. (19) is not equivalent to a single geometry whose atomic positions
are averaged over all conformers. The reason is that $d_{ij}$ is a nonlinear function of the positions $\vec{r}_i$
and $\vec{r}_j$, since $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$. Hence the average distance does not
equal the distance between average positions. Equation (19) then contains information of the position
distribution up to at least its second order moment. Note that Eq. (19) is invariant to translations or
rotations of any one conformer, since each $d_{ij}^{(n)}$ is itself invariant.
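Equation (19) reduces to a single weighted contraction over per-conformer distance matrices, as in the following NumPy sketch:

```python
import numpy as np

def average_distance_matrix(coords, weights):
    """Eq. (19): conformer-weighted average of interatomic distances.

    coords:  [n_conf, n_atoms, 3] array of conformer geometries
    weights: [n_conf] statistical weights summing to 1
    """
    diff = coords[:, :, None, :] - coords[:, None, :, :]  # pairwise displacements
    d = np.linalg.norm(diff, axis=-1)                     # d_ij^(n) per conformer
    return np.einsum("n,nij->ij", weights, d)             # sum_n w_n d_ij^(n)
```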

3.4 WHIM model

To connect with recent 4D approaches in the drug discovery literature, we have also followed Ref.
[63] and used the mean and standard deviation of conformer WHIM vectors for training. For each
species we computed the WHIM vector of each conformer using RDKit, and computed the mean
and standard deviation using the conformers' statistical weights. These features were then combined
with learnable fingerprints generated by ChemProp. As is the default for external features supplied to
ChemProp, each of the WHIM features was scaled to have zero mean and unit variance. Our approach
differs somewhat from Ref. [63] in that our conformers are generated in vacuum rather than sampled
from ligand-receptor MD, and also in that we concatenate the WHIM features with those generated
from message-passing.
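A sketch of the WHIM featurization, assuming an RDKit Mol with embedded conformers and precomputed statistical weights (rdMolDescriptors.CalcWHIM is RDKit's WHIM implementation; the weighted-variance convention here is an illustrative assumption):

```python
import numpy as np
from rdkit.Chem import rdMolDescriptors

def whim_mean_std(mol, weights):
    """Weighted mean and standard deviation of per-conformer WHIM vectors.

    mol:     RDKit Mol with embedded 3D conformers
    weights: statistical weights of the conformers, summing to 1
    """
    whims = np.array([rdMolDescriptors.CalcWHIM(mol, confId=conf.GetId())
                      for conf in mol.GetConformers()])
    w = np.asarray(weights)
    mean = w @ whims                          # weighted mean over conformers
    std = np.sqrt(w @ (whims - mean) ** 2)    # weighted standard deviation
    return np.concatenate([mean, std])        # concatenated feature vector
```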

3.5 Baseline models

We compared the 3D models to the 2D-based ChemProp and random forest models, with the random
forest trained on Morgan fingerprints. We experimented with using atom-pair fingerprints [80] for the
random forest training, but did not find any difference in performance. We also experimented with
a variant of the Compass approach to finding bioactive conformations [59], training feed-forward
networks on WHIM vectors instead of on ligand-query volume differences. We then used the
predicted bioactive conformers as input to single conformer models. However, we discarded this
approach because, for some targets, the model performance decreased as the predicted poses were
updated. This indicated that the quality of the poses became worse, and hence that the iterative
procedure was non-convergent.

3.6 Obtaining conformers

To train a model on 3D ensemble information, one must first reduce the infinite set of 3D structures
to a finite set of conformers, the set of thermally accessible structures at local minima on the
potential energy surface. However, generating accurate conformers is challenging. A variety of
exhaustive, stochastic, and Bayesian methods have been developed to generate conformers [81–
88]. The stochastic conformer generation methods used in cheminformatics packages [87] are not
exhaustive and may miss low-energy structures, and exhaustive sampling has prohibitive exponential
scaling with the number of rotatable bonds. Further, the classical force fields used are generally not
accurate enough to assign conformer orderings based on energy [67].
Significant progress in conformer generation has been made with the CREST program [73], which
uses semi-empirical quantum chemistry and advanced sampling methods to generate reliable geome-
tries. Conformer accuracy is crucial for all models that use 3D geometries. For example, in the QSAR
Compass approach one considers only conformers with energy below 2.0 kcal/mol as candidates
for the bioactive conformation [59]. However, because classical force fields cannot reliably assign
energies in this narrow window [67], it is easy to include high-energy conformers and miss low-energy
ones. The CREST conformer orderings are more reliable because the energies are generated with
semi-empirical DFT, which is a good approximation to high-accuracy ab initio methods [72]. Further,
stochastic methods can miss conformers, and MD simulations can become stuck in local minima
even with long simulations. Indeed, the latter was suggested as one reason for the poor performance
of the per-MD-frame WHIM vectors used in Ref. [89]. These issues are avoided in CREST through
the use of multiple rounds of metadynamics, high-temperature MD and genetic methods [73].
We recently used CREST to create the GEOM dataset [72, 90, 91], which contains conformers
annotated by energy for over 300,000 drug-like molecules with experimental binding data and
130,000 combinatorially generated small molecules [92]. The 300,000 drug-like molecules contain
ligands with in-vitro data for the inhibition of SARS-CoV 3CL protease, SARS-CoV PL protease,
SARS-CoV-2 3CL protease, SARS-CoV-2, E. coli, and Pseudomonas aeruginosa [9, 93–100]. We
use this dataset to train the 3D conformer-based models introduced in this work.

CoV 3CL CoV-2 3CL CoV-2
Train 275 (167,255) 50 (485) 53 (3,294)
Validation 81 (55,750) 15 (157) 17 (1,096)
Test 70 (55,756) 11 (162) 22 (1,086)
Total 426 (278,758) 76 (804) 92 (5,476)

Table 1: Number of hits (total number of species in parentheses) for each split of each dataset.

3.7 Training

We trained different models on experimental data for the specific inhibition of the SARS-CoV-2 3CL
protease (“CoV-2 3CL”) [95], the general inhibition of SARS-CoV-2, measured in vitro in human
cells [93, 94], and the specific inhibition of the SARS-CoV 3CL protease (“CoV 3CL”). The CoV
3CL protease is found in the virus causing SARS, and has 96% sequence similarity to CoV-2 3CL
[101]. Because there is 1000 times more data for CoV 3CL than for CoV-2 3CL (Table 1), and
because the sequences are so similar, we also explored using transfer learning (TL) from CoV 3CL to
CoV-2 3CL. Performance was measured using the receiver operating characteristic area under the
curve (ROC-AUC), the precision-recall area under the curve (PRC-AUC), and the ROC enrichment
(ROCE) [102] at 0.5%, 1%, 2% and 5% false positive rates. Further discussion of these metrics can be found in the
Supplementary Material (SM).
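For reference, the ROCE at a false positive rate x is the true positive rate at that FPR divided by x [102]. A sketch of the computation is given below; the interpolation convention is an assumption, and our exact evaluation procedure is described in the SM.

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_enrichment(y_true, y_score, fpr_level=0.01):
    """ROC enrichment (ROCE): true positive rate divided by false positive
    rate, evaluated at a fixed FPR (e.g. 0.01 for 1%)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    tpr_at_level = np.interp(fpr_level, fpr, tpr)  # TPR at the target FPR
    return tpr_at_level / fpr_level
```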
We used the conformers from the GEOM dataset as inputs to different 3D/4D models (SchNetFeatures,
ChemProp3D, ChemProp3D-NDU). The training was performed with one conformer, with up to 200
conformers, and with a single effective conformer. We also trained a ChemProp + WHIM model,
which contains non-learnable conformer ensemble information. For comparison we also trained
models with only graph information (ChemProp, and random forest over circular fingerprints). These
models were first trained on the CoV 3CL task and the two CoV-2 tasks. The CoV 3CL models
were then used to generate fixed features for molecules in the CoV-2 3CL dataset for TL. Details of
the data preprocessing, hyperparameter optimization, and uncertainty quantification of the model
scores can be found in the SM.

4 Results

ROCE scores of various models are shown in Fig. 3, and ROC and PRC scores are given in Table 2.
For visualization purposes we have also included thin horizontal bars for each ROCE. The thin bars
of identical ROCE scores are shifted down to avoid overlap. Bars stacked vertically with no spaces
between them all have the value of the highest bar in the stack. For example, thin bars at negative
y-values are always part of a stack for which the highest value is ≥ 0.

4.1 CoV-2 3CL

We begin by analyzing performance on CoV-2 3CL, the task with the smallest amount of data, and
then proceed to datasets of increasing size (CoV-2 followed by CoV 3CL). Figure 3(a) shows that
the 2D models generally have the best ROCE scores, while the 3D and 4D models are generally no
better than the 2D models. The only exception is CND with one conformer (“CND-1”) at 2% FPR,
where its ROCE of 22.3 is slightly higher than ChemProp’s ROCE of 20.8. The ChemProp model
with WHIM features averaged over all conformers (“CP WHIM”) performs similarly to ChemProp
and random forest. All other 3D and 4D models are significantly outperformed by 2D models.
We also see that using MPNNs in the 2D models is largely unnecessary, as a simple random forest has
comparable performance to the ChemProp models. For example, at FPRs of 0.5%, 1%, 2%, and 5%,
the scores and rankings of random forest and ChemProp are, respectively, 20.6/first and 14.4/third;
28.1/first and 27.8/second; 18.9/third and 20.8/second; and 10.5/fourth and 11.4/first. The random
forest is the best model at 0.5% and 1% FPR, and is only slightly outperformed by ChemProp at higher FPR.
Figure 3: ROCE scores of various models for different targets. The color of each bar indicates the false positive
rate (0.5%, 1%, 2%, or 5%). (a) CoV-2 3CL protease, (b) CoV-2 3CL protease with transfer-learned
models, (c) CoV-2, and (d) CoV 3CL protease.

The ROCE scores are consistent with the ROC and PRC scores in Table 2. Here we see that ChemProp,
random forest, CND-1 and CP WHIM all have high ROC and PRC scores. In fact, the CND-1 model
has the highest ROC score while the CP WHIM model has the highest PRC. The other CND models
also perform well, with CND-200 and CND avg achieving better ROC scores than ChemProp, and
reasonably high PRC scores.
This task has the smallest amount of data (485 training species), which may explain the results: with
so little data, it is difficult for the neural networks to learn meaningful representations. The learned
representations are likely no more powerful than the Morgan fingerprints used in random forest; this
is often the case in low-data regimes [103]. This is especially true of the 3D models with distance
updates, as they have an abundance of input information with few labels to guide the representation
learning. This may make the 3D models especially prone to overfitting.

4.2 CoV-2

The results are quite different for the CoV-2 task. As seen in Fig. 3(c), the ChemProp and random
forest models are outperformed by a number of 3D models. The CND-1 model in particular has the
highest ROCE scores at 0.5%, 1%, and 2% FPR, and is essentially tied with CND avg and CP3D-200
for the highest score at 5% FPR. The CND-1 model outperforms ChemProp, the best 2D model,
by a significant margin. At 0.5% FPR, CND-1 has an ROCE of 16.1, nearly 40 times higher than
ChemProp’s ROCE of 0.4. At higher FPR values the improvement is smaller but still quite significant,
with CND-1 outperforming ChemProp by factors of 26, 11, and 4 at FPRs of 1%, 2%, and 5%,
respectively.
Other 3D and 4D models also perform better than ChemProp, though none achieves the success
of CND-1. For example, SchNetFeat-1, CND-200, and CND avg all achieve high ROCE scores,
improving on the ChemProp scores by factors of 4-20 at various FPR values. WHIM fingerprints also
improve the performance of ChemProp, increasing ROCE scores by factors between 3 and 9. This
should be contrasted with the CoV-2 3CL task, in which adding WHIM fingerprints did not improve

                                          CoV-2 3CL   CoV-2      CoV 3CL
ROC
  2D   ChemProp                           0.89(1)     0.62(4)    0.754(7)
  2D   Random Forest                      0.92(1)     0.61(1)    0.69(1)
  3D   SchNetFeatures (1 conf)            0.907(5)    0.676(4)   0.718(7)
  3D   ChemProp3D (1 conf)                0.78(5)     0.50(1)    0.65(2)
  3D   CP3D-NDU (1 conf)                  0.929(3)    0.67(0)    0.72(3)
  4D   SchNetFeatures (200 confs)         0.86(2)     0.63(2)    0.72(1)
  4D   ChemProp3D (200 confs)             0.66(7)     0.53(0)    —
  4D   CP3D-NDU (200 confs)               0.901(4)    0.663(8)   0.759(2)
  4D   SchNetFeatures (avg nbrs)          0.84(3)     0.61(1)    0.74(2)
  4D   ChemProp3D (avg nbrs)              0.73(0)     0.56(2)    0.70(3)
  4D   CP3D-NDU (avg nbrs)                0.916(5)    0.647(3)   0.71(0)
  4D   ChemProp + WHIM                    0.89(2)     0.66(5)    0.77(7)

PRC
  2D   ChemProp                           0.56(6)     0.028(3)   0.05(1)
  2D   Random Forest                      0.55(3)     0.028(2)   0.079(2)
  3D   SchNetFeatures (1 conf)            0.333(0)    0.051(3)   0.021(6)
  3D   ChemProp3D (1 conf)                0.33(6)     0.021(0)   0.04(3)
  3D   CP3D-NDU (1 conf)                  0.533(1)    0.124(0)   0.07(2)
  4D   SchNetFeatures (200 confs)         0.26(5)     0.037(3)   0.03(2)
  4D   ChemProp3D (200 confs)             0.20(1)     0.032(0)   —
  4D   CP3D-NDU (200 confs)               0.413(3)    0.06(1)    0.04(3)
  4D   SchNetFeatures (avg nbrs)          0.29(6)     0.027(0)   0.07(3)
  4D   ChemProp3D (avg nbrs)              0.31(0)     0.10(6)    0.07(1)
  4D   CP3D-NDU (avg nbrs)                0.467(5)    0.058(4)   0.063(1)
  4D   ChemProp + WHIM                    0.57(6)     0.04(1)    0.07(2)

Table 2: Performance of various models on the three classification tasks, evaluated with the ROC and PRC
scores. Uncertainty on the last digit is given in parentheses, and is the standard deviation of test
scores. Uncertainties less than 10⁻³ are recorded as zero. Due to computational constraints we did not train
ChemProp3D (200 confs) on the CoV 3CL task.

performance. We also note that the random forest model performs quite poorly, achieving zero
ROCE for 0.5% and 1% FPR, and an ROCE of 0.22 at 2% FPR (three times lower than ChemProp).
Evidently message-passing is far more helpful here than in the CoV-2 3CL task.
These trends are reflected in the PRC scores, and to a lesser extent in the ROC scores. Table 2
shows that SchNetFeat-1, CND-1, CND-200, and CND-avg all have some of the highest ROC scores.
However, the same is true of CP WHIM, which has a rather low ROCE score. Moreover, ChemProp
and random forest, which have the lowest ROCE scores, have ROC values only slightly below those of
the best models (approximately 8%). Clearly the ROC score is insufficient for evaluating the models.
By contrast, the PRC scores show that ChemProp and Random Forest are significantly outperformed
by SchNetFeat-1, CND avg and CP WHIM, which in turn are significantly outperformed by CND-1.
Indeed, the CND-1 score is four times that of ChemProp and double that of SchNetFeat-1, CND
avg and CP WHIM. The CP3D avg model also has a very high PRC, though interestingly its ROC
is rather low. We also note that while CND-1 performance is quite impressive, the CND-1 model
with the best validation PRC also had the best validation ROC, so the uncertainty for each of its
scores is zero (see SM). It is therefore possible that the extremely high PRC and ROCE scores are
somewhat lucky. However, even if we ignored the CND-1 model, the next best 3D models would still
be significant improvements over the 2D models.
For this task it is clear that 3D information can significantly improve performance. Two factors
distinguish this task from the CoV-2 3CL problem. First, there is substantially more data (see Table
1). Second, the hits may have completely different mechanisms of action from each other, as they
inhibit the in-vitro growth of SARS-CoV-2 in human cells, while the CoV-2 3CL data is for ligands
that specifically inhibit the CoV-2 3CL protease. Indeed, the ROCE scores are far lower than the
best ROCE scores in the CoV-2 3CL task, indicating that this problem is more difficult. Below we

show that MPNN-based 3D and 4D models also perform well for the CoV 3CL classification, but that
they are only slightly better than 2D models. Therefore, it is possible that 3D information is more
important when the mechanisms of action are different, and/or when there is an intermediate amount
of training data (not too much or too little). Further investigation is required to test these hypotheses.
Another important observation is that the 4D models are no better than the 3D models. For example,
the 4D models with the best ROCE scores are CND-200 and CND avg, but both are outperformed by
CND-1 at FPR values below 5% and matched at 5%. Similarly, SchnetFeat-200 is outperformed by
SchNetFeat-1 at every FPR. The exception to this trend is CP3D avg, which is more accurate than
CP3D-1. However, it is still far worse than the other 3D models. These trends can also be seen in the
ROC and PRC scores.
Since all information contained in a single conformer representation is also available with 200
conformers, it seems that the model training is hampered by the extra information. This is particularly
intriguing because the attention mechanism learns meaningful information about the conformers (see
below), focusing on geometries that are similar to those of other hits. The model may be overfitting to
the extra information, despite the hyperparameter optimization to determine reasonable dropout rates.

4.3 CoV 3CL

The final target is CoV 3CL. It contains by far the largest training set, and, like CoV-2 3CL, contains
experimental data for ligands that bind to a specific protein. According to the ROCE scores in Fig.
3(d), the models that perform best on average are CND avg, CP WHIM, CND-1, CND-200, and
ChemProp. The CND avg is best at FPRs of 0.5% and 1% (ROCE=47.0 and 28.6), CP WHIM is best
at 2% (ROCE=17.1), and ChemProp is best at 5% (ROCE=8.6).
The 3D/4D models are generally better than the 2D models, but only by a small margin. For example,
the ratios of top 3D/4D score to top 2D score are 1.18, 1.05, 1.09, and 0.95 at FPRs of 0.5%, 1%, 2%
and 5%, respectively. Interestingly, this is the only task in which 4D models consistently outperform
3D models, but the margin is again rather small. Specifically, the ratios of top 4D score to top 3D
score are 1.84, 1.11, 1.20, and 1.14 at FPRs of 0.5%, 1%, 2% and 5%, respectively.
We also see that the random forest is outperformed by MPNN models, though to a lesser extent than
in the CoV-2 task. The superiority of MPNNs is to be expected for this task because the dataset is
quite large. This provides ample information from which to learn an ideal molecular representation.
The size of the dataset also likely explains the fact that the ROCE scores are higher than in any other
task.
The ROCE trends are only somewhat similar to the PRC and ROC trends. The random forest model
has the highest PRC score, and the CND-1, SchNetFeat avg, CP3D avg, CND avg and CP WHIM
models are close behind. The ChemProp PRC is noticeably lower than each of these models, while
CND-200 score is lower still. While the CND-1, CND avg and CP WHIM models also have high
ROCEs, the CP3D avg and random forest models do not. In fact, according to the ROCE metric, the
random forest is one of the worst models. Moreover, ChemProp and CND-200 have high ROCE
scores but low PRCs. These results again emphasize the importance of going beyond the ROC and
PRC scores when comparing models.

4.4 Transfer learning

We also analyzed how TL could improve CoV-2 3CL scores. To do so we used the pre-trained CoV
3CL models to generate fingerprints, and used the fingerprints as input to a standard feed-forward
neural network. We did this with and without additional message passing from ChemProp. When
additional message passing was used, the fixed fingerprints were concatenated with the learned
ChemProp fingerprints. For each architecture we used two models, one with the best CoV 3CL AUC
score and one with the best CoV 3CL PRC score. Each of the two pre-trained models was then used
to produce molecular fingerprints, which were then used to train an ensemble of ten different CoV-2
3CL models. The final scores are thus an average over 20 different models. Hyperparameters were
again optimized separately for each model.
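Schematically, this TL setup corresponds to freezing a pre-trained fingerprinting network and training only a new feed-forward readout, optionally alongside a parallel trainable ChemProp branch, as in the following sketch. The module names and dimensions are hypothetical placeholders, not our exact implementation.

```python
import torch
import torch.nn as nn

class TransferLearningHead(nn.Module):
    """Frozen pre-trained fingerprinter + optional trainable ChemProp branch,
    followed by a feed-forward readout (names/dimensions hypothetical)."""

    def __init__(self, pretrained_net, chemprop_net=None, F_fixed=300, F_learn=300):
        super().__init__()
        self.fixed = pretrained_net              # trained on CoV 3CL, then frozen
        for param in self.fixed.parameters():
            param.requires_grad = False
        self.learned = chemprop_net              # optional trainable 2D branch
        in_dim = F_fixed + (F_learn if chemprop_net is not None else 0)
        self.head = nn.Sequential(
            nn.Linear(in_dim, 300), nn.ReLU(), nn.Linear(300, 1))

    def forward(self, batch):
        with torch.no_grad():
            fp = self.fixed(batch)               # fixed CoV 3CL fingerprint
        if self.learned is not None:
            fp = torch.cat([fp, self.learned(batch)], dim=-1)
        return self.head(fp)                     # CoV-2 3CL prediction
```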
Transfer learning ROC and PRC scores are shown in Table 3, and can be compared with the ChemProp
and random forest baselines in Table 2. The ROCE scores are shown in Fig. 3(b), and correspond
to averages over 40 models (20 with additional message-passing and 20 without). The ROC and

CoV 3CL model for fingerprinting     CoV-2 3CL message passing (Y/N)   ROC       PRC
ChemProp (2D)                        Y                                 0.78(3)   0.36(8)
ChemProp (2D)                        N                                 0.71(6)   0.35(7)
SchNetFeatures (1 conf)              Y                                 0.76(8)   0.3(1)
SchNetFeatures (1 conf)              N                                 0.77(7)   0.3(2)
ChemProp3D (1 conf)                  Y                                 0.75(5)   0.2(2)
ChemProp3D (1 conf)                  N                                 0.69(6)   0.2(1)
CP3D-NDU (1 conf)                    Y                                 0.85(3)   0.43(9)
CP3D-NDU (1 conf)                    N                                 0.81(4)   0.47(9)
SchNetFeatures (200 confs)           Y                                 0.86(6)   0.42(9)
SchNetFeatures (200 confs)           N                                 0.87(5)   0.50(9)
CP3D-NDU (200 confs)                 Y                                 0.88(5)   0.47(9)
CP3D-NDU (200 confs)                 N                                 0.87(5)   0.4(1)
SchNetFeatures (avg nbrs)            Y                                 0.79(6)   0.3(1)
SchNetFeatures (avg nbrs)            N                                 0.83(6)   0.4(1)
ChemProp3D (avg nbrs)                Y                                 0.6(2)    0.2(1)
ChemProp3D (avg nbrs)                N                                 0.74(6)   0.29(9)
CP3D-NDU (avg nbrs)                  Y                                 0.88(2)   0.48(5)
CP3D-NDU (avg nbrs)                  N                                 0.87(3)   0.41(5)
ChemProp + WHIM                      Y                                 0.77(3)   0.42(4)
ChemProp + WHIM                      N                                 0.81(2)   0.44(3)

Table 3: Transfer learning results from CoV 3CL to CoV-2 3CL for different models, with and without additional
message passing from ChemProp.

PRC scores show no evidence that TL improves performance, as the best TL scores are lower than
those of ChemProp and Random Forest without pre-training. However, the ROCE scores show that
TL helps at low FPR. In particular, SchNetFeat-200 TL at 0.5% FPR has an ROCE of 25.7, which
is 1.25 times greater than that of random forest (ROCE=20.6), and 1.78 times greater than that of
ChemProp (ROCE=14.4). CND-1 TL also shows strong performance, as its ROCE of 22.3 is 1.08
times larger than random forest and 1.54 larger than ChemProp. The strong performance of these
models is consistent with their high ROC and PRC scores, though the latter never exceed those of
non-TL models. At FPR values above 0.5% there is no advantage to using transfer learning, as the
highest scores are always those of non-TL models.
It is interesting that 3D/4D TL models perform better than the 2D TL model. This is reflected
in the ROCE, ROC, and PRC scores. It seems that the representations learned by the 3D models
are more easily generalized to new targets than the 2D representations. It is also noteworthy that
SchNetFeat-200, a 4D model, provides the best TL results. However, by all metrics SchNetFeat-200
is one of the worst models at predicting CoV 3CL inhibition, while CND-1 is one of the best. It is
therefore difficult to explain why they provide some of the best TL results at low FPR.
The ROCE improvements at 0.5% FPR are certainly interesting, but they are not enough to justify
pre-training a 3D or 4D model with hundreds of thousands of species. The non-TL ChemProp model
performs better at all other FPR values, and a random forest with Morgan fingerprints has competitive
accuracy. Further work is needed to evaluate the usefulness of TL in drug discovery.

5 Discussion

5.1 Accuracy

Adding 3D information can improve prediction accuracy when using the right model. Of all the
models introduced, the new CND model shows the strongest and most consistent performance, either
matching or beating 2D baselines in most tasks. CND (1-C) in particular shows the most consistent
performance of all models, even on the CoV-2 3CL task with little data. The CND model is similar

to PotentialNet [21], in that both contain a separation between distance updates and graph updates.
PotentialNet uses two sets of gated graph neural networks (GGNNs) [13], the first of which acts on
the 2D graph alone. The output of this network is a set of node features hTv that have been updated
T times. The second GGNN updates the node features hTv through convolutions over the 2D graph
and the 3D structure. Using the hTv as input to the second network means that graph information is
already present in the nodes once 3D information is added. Hence the second network can fine-tune
the graph-based features using 3D information, rather than generating all features at once. The CND
approach also separates the aggregation of graph and distance information (see Methods above).
While it does not include distance-based convolutions, its use of edge states means that distance
information is included for all pairs within rcut of each other. This is roughly equivalent to a single
3D convolution in a node-based model.
While the CND model matches or improves upon 2D baselines, many 3D models do not improve
performance, and some can actually hurt it. For example, ChemProp3D is outperformed by 2D
models in every task, while SchNetFeatures improves results in CoV-2 but hurts them in CoV 3CL. It
is also clear that adding extra conformers hurts performance in most tasks (CoV-2 3CL and CoV-2),
and only leads to small improvements in others (CoV 3CL). Given the enormous computational cost
of training the 200-C models (see below), it is a far better use of resources to train single conformer
models. Transfer learning can somewhat improve accuracy, but the effect is small and absent in most
models. Much of the improvement can be matched with a simple 2D baseline like random forest.
The results of MPNN-based 3D models and 3D models with non-learnable features are also mixed.
For CoV-2 3CL it is best to use a 2D model or a 3D model with non-learnable WHIM features. For
CoV 3CL it is best to use a 3D or 4D model with learnable features, but using WHIM features gives
comparable performance. By contrast, for CoV-2 it is far better to use MPNN-based 3D models.
This is another example of a trend that is different for CoV-2 than for the other targets. It therefore
reinforces the need for further investigation into tasks without a well-defined protein target.

5.2 Conformer importance

Using 4D models leads to minor improvement over 3D models for CoV 3CL, but hurts performance
for other targets. Moreover, at each FPR value the best 4D model is CND avg or CP WHIM, neither
of which uses attention to determine the most important conformation. It is therefore important to
understand how conformers are pooled to see if the models are learning meaningful information from
them. To this end we analyzed the similarity of high-attention conformers in different species. For
each comparison we randomly selected two species, and for each species we selected the conformer
with the highest attention weight among all attention heads. We then computed the fingerprints of the
two conformers using the E3FP method [104], which is an extension of the extended connectivity
fingerprint to 3D structures. The cosine similarity metric was then applied to the fingerprints. This
was repeated for 5,000 random pairs in the test set to obtain the average similarity. We applied this
analysis separately to pairs of species that were both hits, and to pairs of species that contained one hit
and one miss. We also repeated the analysis with a random selection of conformers for comparison.
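Once conformers are selected and fingerprinted, the comparison itself is an average of cosine similarities over random cross-pairs. A sketch is given below, assuming the E3FP fingerprints have already been converted to dense vectors; conformer selection and fingerprint generation are omitted.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two dense fingerprint vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def mean_pair_similarity(fps_x, fps_y, n_pairs=5000, seed=0):
    """Average similarity over random cross-pairs (e.g. hit/hit or hit/miss);
    fps_x and fps_y are lists of fingerprint vectors, one per species."""
    rng = np.random.default_rng(seed)
    sims = np.array([cosine_similarity(fps_x[rng.integers(len(fps_x))],
                                       fps_y[rng.integers(len(fps_y))])
                     for _ in range(n_pairs)])
    return sims.mean(), sims.std() / np.sqrt(n_pairs)  # mean and SEM
```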
Results of this analysis are shown in Table 4.¹ Random selection of conformers yields hits that are
more similar to each other than to misses. Choosing conformers based on attention increases the
similarity of hits to each other, while leaving the similarity of hits to misses virtually unchanged. The
difference between hit/hit and hit/miss scores, denoted by ∆, is on average 40% larger for conformers
chosen by attention than conformers chosen randomly. This means that it is easier to distinguish hits
from misses using attention conformers than using random conformers. Interestingly, unlike the other
models, CND (200-C) did not learn meaningful attention weights for the CoV 3CL task (it assigned
equal weights to nearly every conformer), but had the strongest performance of any attention-based
4D model. This result, along with the strong performance of the attention-free CP WHIM and CND
avg models, suggests that the best results may be achieved without attention. However, the clear evidence
that attention weights contain meaningful conformer information indicates that further investigation is
required.
¹ In the table we have reported the uncertainty as the standard error of the mean (SEM). The standard deviation
would indicate the expected range of scores for any single pair of species, and thus the likelihood of the attention
model beating the random model for any one pair. By contrast, the SEM measures our confidence that the
average difference between the models is statistically significant. The SEM values in Table 4 suggest that the
difference between the models is not random.

            CoV                            CoV-2                          CoV-2 3CL
            hit/hit   hit/miss   ∆         hit/hit   hit/miss  ∆         hit/hit   hit/miss  ∆
Attention   0.325(1)  0.3071(9)  0.018(1)  0.340(2)  0.291(1)  0.049(2)  0.378(3)  0.276(1)  0.102(3)
Random      0.321(1)  0.3054(9)  0.015(1)  0.323(1)  0.293(1)  0.031(2)  0.344(2)  0.273(1)  0.071(2)

Table 4: Average cosine similarity between E3FP fingerprints of different species. A single conformer was
selected for each species, either randomly or based on the highest attention weight from a SchNetFeatures model
with the lowest validation loss. "hit/hit" means similarity among hits, "hit/miss" means similarity between
hits and misses, and ∆ = (hit/hit) − (hit/miss). For each task the higher ∆ is obtained with attention-based
selection. 5,000 comparisons were made for each category. The uncertainty on the last digit is given in
parentheses and corresponds to the standard error of the mean.
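As a concrete illustration, the pairing analysis above can be sketched as follows. The E3FP fingerprinting step is abstracted behind pre-computed arrays (in practice the e3fp package [104] would supply the bit vectors), and per-head attention weights are assumed to be available for each conformer; this is an illustrative sketch, not our exact analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two fingerprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_conformer(fps, attn):
    """Pick the conformer with the highest weight over all attention heads.
    fps: (n_confs, fp_dim) fingerprints; attn: (n_heads, n_confs) weights."""
    conf_idx = np.unravel_index(np.argmax(attn), attn.shape)[1]
    return fps[conf_idx]

def mean_similarity(group_a, group_b, n_pairs=5000):
    """Average similarity (and its SEM) over random cross-group pairs.
    Each group is a list of (fps, attn) tuples, one entry per species."""
    sims = np.empty(n_pairs)
    for k in range(n_pairs):
        fp1 = best_conformer(*group_a[rng.integers(len(group_a))])
        fp2 = best_conformer(*group_b[rng.integers(len(group_b))])
        sims[k] = cosine(fp1, fp2)
    return sims.mean(), sims.std(ddof=1) / np.sqrt(n_pairs)

# Delta as in Table 4:
# mean_similarity(hits, hits)[0] - mean_similarity(hits, misses)[0]
```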

5.3 Computational cost

The 4D attention-based models are quite computationally demanding. 3D models are generally more
expensive than 2D models, as they incorporate non-bonded edges in addition to bonded ones. More
importantly, conformer-based models are n_conf times more expensive than single-conformer models,
where n_conf is the average number of conformers. Here we used a maximum of 200 conformers per
species, corresponding to an average of 76 conformers per species. Hence our models are hundreds of times
slower than 2D models, and training on the CoV 3CL dataset took several days on 32 GPUs.
It is not beneficial to use conformer models in their current state, as they are expensive and can hurt
performance. However, given the sheer volume of extra information contained in conformers, it is
helpful to think of other ways to use the ensemble. Let us consider how model complexity can be
reduced, as this would reduce computational cost and could also improve performance. A simple
method is to bin similar conformers together, reducing the number of effective conformers; the
statistical weight of each bin would then be the sum of its constituent conformer weights (see the
sketch below). This would lead to models that act over far fewer conformers for each species, thereby
reducing complexity and cost. One might also consider expanding upon the average-distance models
introduced here, which have the computational cost of single-conformer models but still contain
ensemble information. Performance
could be improved by using pre-set weights that reflect binding affinity rather than statistical weight.
These could perhaps be found with computational docking or with attention models acting over
non-learnable 3D fingerprints, which are far less expensive. One might also use the docking results to
simply select the best single conformer for each species, and train a 3D model on one conformer only
for each species.
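To illustrate the binning idea mentioned above, one could cluster conformers by pairwise RMSD and assign each representative the summed statistical weight of its bin. The sketch below uses RDKit's Butina clustering; the 1.0 Å threshold is an arbitrary illustrative choice, and conformer IDs are assumed to run from 0 to n − 1.

```python
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def bin_conformers(mol, weights, rmsd_thresh=1.0):
    """Cluster a molecule's conformers by RMSD; return (centroid_id, summed
    statistical weight) for each bin. `weights` is indexed by conformer ID."""
    n = mol.GetNumConformers()
    # condensed lower-triangle distance list in the order Butina expects:
    # d(1,0), d(2,0), d(2,1), d(3,0), ...
    dists = [AllChem.GetConformerRMS(mol, i, j, prealigned=False)
             for i in range(1, n) for j in range(i)]
    clusters = Butina.ClusterData(dists, n, rmsd_thresh, isDistData=True)
    # the first index in each cluster is its centroid; sum member weights
    return [(c[0], sum(weights[i] for i in c)) for c in clusters]
```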

6 Conclusion
We have introduced a new method for predicting properties from conformer ensembles, along with
several new 3D-based models. These models achieved significant improvements over 2D models using a single
conformer, but we found that adding more conformers did not improve the predictions. With access to the
GEOM dataset and the models introduced in this work, the community will be able to use our models
and refine our conformer approach to improve virtual screening.

7 Data and Software Availability


All code and data are publicly available. The GEOM dataset can be accessed at [90], with instructions
for loading the data in [91]. Code for training the models is available in the Neural Force Field
repository at [105]. Training datasets, trained models, log files, and model details can be found at
[106].

8 Acknowledgements
The authors thank the XSEDE COVID-19 HPC Consortium, project CHE200039, for compute time.
NASA Advanced Supercomputing (NAS) Division and LBNL National Energy Research Scientific
Computing Center (NERSC), MIT Engaging cluster, Harvard Cannon cluster, and MIT Lincoln
Lab Supercloud clusters are gratefully acknowledged for computational resources and support. The
authors also thank Christopher E. Henze (NASA) and Shane Canon and Laurie Stephey (NERSC) for
technical discussions and computational support, MIT AI Cures (https://www.aicures.mit.edu/) for
molecular datasets and Wujie Wang, Daniel Schwalbe Koda, Shi Jun Ang (MIT DMSE) for scientific
discussions and access to computer code. Financial support from DARPA (Award HR00111920025)
and MIT-IBM Watson AI Lab is acknowledged.

Supplementary Material for Molecular machine learning with conformer ensembles

A Saved datasets and models


The code for creating datasets and training models is available in the Neural Force Field repository
[105], which contains a variety of 3D-based models [17–19] to predict either molecular properties
or atomic forces. The force field can handle periodic structures, generate coarse-grained forces
[107, 108], and run molecular dynamics through the Atomic Simulation Environment [109]. The
property predictor extends SchNet [17, 18] and ChemProp [24] to predict properties based on one or
more conformers, and contains scripts with extensive documentation for dataset generation, training,
hyperparameter optimization, and transfer learning.
The trained models and datasets themselves can be found at [106]. Both the model and dataset folders
have sub-folders for each prediction task, with the model folder further split by network architecture.
Each network folder contains log files with training and validation scores for each epoch, models
saved at every epoch, the parameters used to create and train the model, fingerprints and attention
weights generated for the test set, and test set predictions. There are also folders with ChemProp
models trained from scratch, ChemProp models used for transfer learning, and parameter scores from
hyperparameter optimization. Further details of the data layout can be found in the accompanying
README file.

B Data pre-processing
Because the conformers were generated with CREST, which allows for reactivity such as keto-enol
tautomerism, some of the species had conformers with different molecular graphs from each other. A
small number of conformers also had bond lengths greater than the cutoff distance. To avoid these
issues we removed all such problematic species, which accounted for about 4% of the total. This is
why there are fewer species here than in the original sources, or in the associated datasets found in
[110]. The data used are available at [106].
Due to computational constraints we limited the dataset to the 200 highest-probability conformers of
molecules with 100 atoms or fewer. The probabilities were renormalized to account for the missing
conformer probabilities. The constraint on the number of atoms excluded only 0.2% of molecules.
The conformer constraint was more restrictive, but still reasonable. 85% of the molecules had 200
conformers or fewer. On average, a limit of 200 conformers recovered 94% of the total number of
conformers of a species.
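A minimal sketch of this truncation and renormalization step, assuming parallel lists of conformers and normalized statistical weights:

```python
def truncate_and_renormalize(conformers, weights, max_confs=200):
    """Keep the highest-probability conformers and renormalize the remaining
    statistical weights so they sum to one."""
    order = sorted(range(len(weights)), key=weights.__getitem__, reverse=True)
    keep = order[:max_confs]
    total = sum(weights[i] for i in keep)
    return ([conformers[i] for i in keep],
            [weights[i] / total for i in keep])
```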

C Training

In all cases we used a 60/20/20 train/validation/test split with species split by scaffold, as described
in Ref. [24]. This approach groups molecules with the same scaffolds into the same split. This makes
the classification task harder than a random split, as the model cannot simply identify a scaffold from
the training set and apply it to the validation and test sets.
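The scaffold grouping underlying such a split can be sketched with RDKit's Bemis-Murcko scaffolds (the exact implementation in Ref. [24] may differ in its details):

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_groups(smiles_list):
    """Group molecule indices by their Bemis-Murcko scaffold SMILES."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(idx)
    return groups
```

Whole scaffold groups would then be assigned to the train, validation, and test sets until the 60/20/20 ratios are met.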
The training process was very computationally demanding. First, the dataset contained 30 million
geometries. Second, the computational graph of each species contained up to 200 separate graphs,
one for each conformer fingerprint. Since the fingerprints of a species were pooled and used as input to the readout
layer, none of the computational graphs could be broken until all fingerprints had been created and
pooled. Third, the ChemProp3D models had to store and update edge features. For a molecule with
M atoms and an average of m neighbors per atom, this means that M · m directed edge features
had to be updated, whereas a node-based MPNN would only update M node features. The average
atom had m = 11 neighbors. Moreover, the neighbors of each edge also had to be aggregated, which
further increased computational cost.
To address these challenges we parallelized training over 32 Nvidia Volta V100 GPUs (32 GB of
memory each), and for ChemProp3D models performed batching over conformers. Each batch
consisted of 4-7 conformers for one species, which was the largest value that could fit in GPU
memory. This value is rather small, as memory also had to be allocated for the computational graph
of all conformers. Fingerprints were generated for each batch in succession. The fingerprints were
pooled once all batches were finished. The pooled fingerprints were used as input for the readout layer,
which generated a prediction. The prediction was used to update the loss gradient. The computational
graph was erased, freeing memory to repeat the process. This process was repeated twice on each
GPU, generating a loss gradient from 64 species. The gradient was then used for a training step.
3D models were trained with an initial learning rate of 10⁻⁴. The learning rate was reduced by a factor
of two if the validation loss had not decreased in 10 epochs. The training continued until the learning
rate reached 10⁻⁶. The exception was CoV 3CL models, which were only trained for 25-50 epochs
because of time constraints. The model with the best score on the validation set was evaluated on the
test set. To account for the binding/non-binding class imbalance we used restrictive over-sampling.
In this approach the under-represented class is sampled at an equal rate to the over-represented class.
The average batch then contains half positive binders and half negative binders. One epoch samples
N molecules, where N is the size of the dataset, but only about half of them are unique: the other
half contains positives that are continually resampled. It therefore takes the model two epochs to see
the entire dataset. In ChemProp an epoch contains all positives sampled once, with an equal number
of negatives. It then takes the model n_neg/n_pos epochs to see all the data. Since ChemProp usually
requires 30 epochs for convergence, we trained all ChemProp models for ≈ 30 n_neg/n_pos epochs,
though the models usually converged far earlier.
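The balanced over-sampling and learning-rate schedule described above can be sketched with standard PyTorch utilities; the label tensor and model below are toy stand-ins rather than our actual training code:

```python
import torch
from torch import nn
from torch.utils.data import WeightedRandomSampler

# toy stand-ins; in practice these come from the real dataset and network
labels = torch.tensor([0, 0, 0, 0, 1])        # 0 = miss, 1 = hit
model = nn.Linear(16, 1)

class_counts = torch.bincount(labels)
weights = 1.0 / class_counts[labels].float()  # rarer class drawn more often
sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=10)       # halve LR after 10 flat epochs
# per epoch: scheduler.step(val_loss); stop once the LR reaches 1e-6
```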

D Hyperparameter optimization

Hyperparameter optimization was performed for each model type and each individual task using
Bayesian optimization [111]. In all cases we scored models by their best validation scores rather than
test scores, to avoid biasing the models toward molecules in the test set. Models were then trained on
full datasets.
Hyperparameters for ChemProp models were optimized using the defaults in the repository [75],
which vary the dimension of the hidden state (300 to 2400), number of convolutions (2 to 6), dropout
rate (0 to 0.4), and number of readout layers (1 to 3). For single conformer 3D models we optimized
the dropout rate in the convolution layers (0 to 0.4) and readout layers (0 to 0.4), sampling the
logarithm of the dropout rate uniformly. For multiple-conformer models we also optimized the type
of attention (linear or pair-wise), the dropout rate in the attention layer, and the number of attention
heads. We used Bayesian hyperparameter optimization for each model [111], with 20 hyperparameter
combinations for faster models and 5-10 for slower ones. For average neighbor models we used the
hyperparameters that were optimized for single-conformer models with the same architecture. For
example, for the CND average model we used the hyperparameters optimized for the CND (1-C)
model.
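A sketch of the Bayesian search with hyperopt [111] is given below. The strictly positive lower bound on the dropout rate is an assumption required for log-uniform sampling, and train_and_validate is a hypothetical helper that trains a model and returns its best validation score:

```python
import numpy as np
from hyperopt import fmin, tpe, hp

space = {
    "conv_dropout": hp.loguniform("conv_dropout", np.log(1e-3), np.log(0.4)),
    "readout_dropout": hp.loguniform("readout_dropout", np.log(1e-3), np.log(0.4)),
}

def objective(params):
    # hyperopt minimizes, so return the negated validation PRC-AUC
    return -train_and_validate(**params)

best = fmin(objective, space, algo=tpe.suggest, max_evals=20)
```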

To choose the best model we evaluated the score on the validation set (not the test set). In most cases
the PRC-AUC metric was used, for the following reason. Both the PRC and ROC involve recall (or
true positive rate), the proportion of total hits correctly identified by the model. The PRC’s second
variable is precision, the proportion of identified hits that are actually hits, whereas the ROC’s second
variable is the false positive rate, the proportion of misses incorrectly identified as hits. The precision
is more relevant to virtual screening, which is focused on maximizing the ratio of hits to misses in a
small sample, rather than minimizing the number of misses that were mislabeled as hits. An exception was
made for CoV-2 3CL and CoV-2 ChemProp3D/ChemProp3D-NDU. Here the models learned slowly,
and early epochs contained high PRC values, but also ROC values below 0.5 (worse than a random
model). Hence the ROC was a more informative metric for hyperparameter optimization.
Random forest models were trained as regressors rather than classifiers, as the ROC and PRC scores
do not make sense when the predictions themselves are binary. Any predictions greater than 1 were
automatically set to 1, and any predictions less than 0 to 0, though all model predictions were already
in the range (0, 1). Hyperparameters were also optimized with Bayesian methods [111]. We used
the ROC score on the validation set to evaluate hyperparameters, because it consistently gave the
most robust models (i.e. those with both high ROC and PRC scores). Ten models were trained with
different initial seeds, and balanced sampling was used during training. Details of the hyperparameter
ranges can be found in [105], and further details about the optimization are in [106].
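As an illustration, the regressor setup with clipped predictions might look as follows; the random features, targets, and hyperparameter values are placeholders rather than those used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score, average_precision_score

# placeholder fingerprints and binary labels
X_train, y_train = np.random.rand(100, 32), np.random.randint(0, 2, 100)
X_val, y_val = np.random.rand(50, 32), np.random.randint(0, 2, 50)

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)
preds = np.clip(model.predict(X_val), 0.0, 1.0)  # force predictions into [0, 1]
print(roc_auc_score(y_val, preds), average_precision_score(y_val, preds))
```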

E Scoring metrics

In the ML community, the standard metric for evaluating binary classifiers is the receiver operating
characteristic area under the curve (ROC-AUC) [24, 112, 113]. The ROC curve plots the true positive
rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. The plot is created by
considering different thresholds for which a predicted probability is considered a hit. For example, a
threshold of 0 gives an FPR of 1 and a TPR of 1, while a threshold of 1 gives an FPR of 0 and a TPR
of 0. By varying the threshold one arrives at a set of (FPR, TPR) pairs. The ROC-AUC is the area
under the FPR-TPR curve; it is 0.5 for a random model and 1.0 for a perfect model.
An alternative to the ROC curve is the precision-recall curve (PRC), which plots the precision on
the y-axis (the proportion of predicted hits that are actually hits) against recall on the x-axis (same
as TPR). For a perfect model the PRC-AUC is equal to 1.0, while for a random model it is equal to
the proportion of hits in the dataset. The dependence on the makeup of the dataset is a drawback,
since, unlike the ROC-AUC, the PRC-AUC is not transferable among models tested on different
datasets. However, its use of precision makes it more applicable to the early retrieval problem in
drug discovery. Moreover, when the dataset is imbalanced, which is almost always the case in drug
discovery, the PRC-AUC is strictly more informative than the ROC-AUC [114].
A significant problem with both metrics is that they compute the area under the entire curve. This
is problematic because high x values correspond to low thresholds, i.e. thresholds for which most
species are considered hits. The purpose of virtual screening is to select the top few candidates
for experimental testing. One would never use a threshold for which a majority of candidates are
considered hits, because then one would have to test a large portion of the sample. If one could
test a large portion then, presumably, one could test all candidates, and so virtual screening would
be unnecessary. While this point has long been understood in the computational drug discovery
community [102], it has largely been overlooked in the deep learning literature (with the notable
exception of Ref. [21]).
An alternative is the enrichment factor, defined as TP_model/TP_rand. Here TP_model is the number of
true hits among species in the top few percent of predictions, and TP_rand is the number that would
be retrieved by a random model [102]. However, like the PRC-AUC it too depends on the ratio of
hits to misses in the dataset [115]. To address this issue, and to standardize the definition of “top few
percent”, Ref. [102] recommends using the so-called ROC enrichment (ROCE), defined as the ratio
of the y value to the x value on an ROC plot, at the x values of 0.5%, 1%, 2%, and 5%. In this work
we reported the ROCE values in addition to the ROC-AUC and PRC-AUC scores, and suggest that
future works in deep learning for drug discovery do the same.
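For reference, the ROCE can be computed directly from an ROC curve by interpolating the TPR at the target FPR values, as in the following sketch:

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_enrichment(y_true, y_score, fpr_targets=(0.005, 0.01, 0.02, 0.05)):
    """ROCE: the TPR/FPR ratio at fixed early-retrieval FPR values [102]."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # roc_curve returns FPR sorted ascending, so interpolation is valid
    return {x: float(np.interp(x, fpr, tpr)) / x for x in fpr_targets}
```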

F Uncertainty quantification
For less intensive models like ChemProp with small amounts of data, test score uncertainty was
determined by training ten different models and calculating the standard deviation in their scores.
For evaluating the test set PRC, the model with the best validation PRC was used, and similarly for
the ROC. For the ROCE scores the average was taken over all 20 models (ten models with the best
validation PRC and ten with the best validation ROC).
The ensemble method of uncertainty quantification was not feasible for any of the 3D models, or for
ChemProp trained on the CoV 3CL data. In these cases we used two different models from different
epochs in the same training progression. Each model had the best validation score according to either
the PRC or ROC metric. Averages and standard deviations were then calculated from two scores, one
from each of the two models evaluated on the test set. This approach is justified because, in many
cases, the model with the best validation PRC/ROC was not the one with the best test PRC/ROC.
Note, however, that this method gives zero uncertainty whenever the epoch with the best ROC also
has the best PRC. Further details of the training process can be found in the SM, and exact parameters
used for each of the different models can be found in [106].

References
[1] Kenneth I Kaitin. Deconstructing the drug development process: the new face of innovation. Clinical
Pharmacology & Therapeutics, 87(3):356–361, 2010.
[2] Joseph A DiMasi, Henry G Grabowski, and Ronald W Hansen. Innovation in the pharmaceutical industry:
new estimates of R&D costs. Journal of health economics, 47:20–33, 2016.
[3] Dean G Brown, Tricia L May-Dracka, Moriah M Gagnon, and Ruben Tommasi. Trends and exceptions
of physical properties on antibacterial activity for Gram-positive and Gram-negative pathogens. Journal
of medicinal chemistry, 57(23):10144–10161, 2014.
[4] Oleg Trott and Arthur J Olson. AutoDock Vina: improving the speed and accuracy of docking with a new
scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2):
455–461, 2010.
[5] Honglin Li, Zhenting Gao, Ling Kang, Hailei Zhang, Kun Yang, Kunqian Yu, Xiaomin Luo, Weiliang
Zhu, Kaixian Chen, Jianhua Shen, et al. TarFisDock: a web server for identifying drug targets with
docking approach. Nucleic acids research, 34(suppl_2):W219–W224, 2006.
[6] Hernan Alonso, Andrey A Bliznyuk, and Jill E Gready. Combining docking and molecular dynamic
simulations in drug design. Medicinal research reviews, 26(5):531–568, 2006.
[7] Robert Burbidge, Matthew Trotter, B Buxton, and Sl Holden. Drug design by machine learning: support
vector machines for pharmaceutical data analysis. Computers & chemistry, 26(1):5–14, 2001.
[8] Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin
Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug
discovery and development. Nature Reviews Drug Discovery, 18(6):463–477, 2019.
[9] Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M Donghia,
Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar Bloom-Ackerman, et al. A deep learning
approach to antibiotic discovery. Cell, 180(4):688–702, 2020.
[10] Chao Shen, Junjie Ding, Zhe Wang, Dongsheng Cao, Xiaoqin Ding, and Tingjun Hou. From machine
learning to deep learning: Advances in scoring functions for protein–ligand docking. Wiley Interdisci-
plinary Reviews: Computational Molecular Science, 10(1):e1429, 2020.
[11] Albert P Bartók, Risi Kondor, and Gábor Csányi. On representing chemical environments. Physical
Review B, 87(18):184115, 2013.
[12] David K Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy
Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular
fingerprints. In Advances in Neural Information Processing Systems, pages 2215–2223, 2015.
[13] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks.
arXiv preprint arXiv:1511.05493, 2015.
[14] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for
learning about objects, relations and physics. Advances in neural information processing systems, 29:
4502–4510, 2016.
[15] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph
convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608,
2016.

[16] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko.
Quantum-chemical insights from deep tensor neural networks. Nature communications, 8(1):1–8, 2017.
[17] Kristof T Schütt, Huziel E Sauceda, P-J Kindermans, Alexandre Tkatchenko, and K-R Müller. SchNet–A
deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24):241722,
2018.
[18] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre
Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for
modeling quantum interactions. In Advances in neural information processing systems, pages 991–1001,
2017.
[19] Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular
graphs. In International Conference on Learning Representations, 2019.
[20] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley.
Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv
preprint arXiv:1802.08219, 2018.
[21] Evan N Feinberg, Debnil Sur, Zhenqin Wu, Brooke E Husic, Huanghao Mai, Yang Li, Saisai Sun, Jianyi
Yang, Bharath Ramsundar, and Vijay S Pande. Potentialnet for molecular property prediction. ACS
central science, 4(11):1520–1530, 2018.
[22] Oliver T Unke and Markus Meuwly. Physnet: A neural network for predicting energies, forces, dipole
moments, and partial charges. Journal of chemical theory and computation, 15(6):3678–3693, 2019.
[23] Ziteng Liu, Liqiang Lin, Qingqing Jia, Zheng Cheng, Yanyan Jiang, Yanwen Guo, and Jing Ma. Trans-
ferable multi-level attention neural network for accurate prediction of quantum chemistry properties via
multi-task learning. ChemRxiv, 12588170:v1, 2020.
[24] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez,
Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for
property prediction. Journal of chemical information and modeling, 59(8):3370–3388, 2019.
[25] Carolina H Andrade, Kerly FM Pasqualoto, Elizabeth I Ferreira, and Anton J Hopfinger. Rational design
and 3d-pharmacophore mapping of 5’-thiourea-substituted α-thymidine analogues as mycobacterial tmpk
inhibitors. Journal of chemical information and modeling, 49(4):1070–1078, 2009.
[26] Xiaobo Yang, Yuan Yu, Jiqian Xu, Huaqing Shu, Hong Liu, Yongran Wu, Lu Zhang, Zhui Yu, Minghao
Fang, Ting Yu, et al. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia
in Wuhan, China: a single-centered, retrospective, observational study. The Lancet Respiratory Medicine,
2020.
[27] Jitender Verma, Vijay M Khedkar, and Evans C Coutinho. 3d-qsar in drug design-a review. Current
topics in medicinal chemistry, 10(1):95–115, 2010.
[28] Subhash Ajmani, Kamalakar Jadhav, and Sudhir A Kulkarni. Three-dimensional qsar using the k-nearest
neighbor method and its interpretation. Journal of chemical information and modeling, 46(1):24–31,
2006.
[29] KH Kim. List of comfa references, 1998.
[30] Francesco Ortuso, Thierry Langer, and Stefano Alcaro. Gbpm: Grid-based pharmacophore model:
concept and application studies to protein–protein recognition. Bioinformatics, 22(12):1449–1455, 2006.
[31] Ulrike Holzgrabe and Anton J Hopfinger. Conformational analysis, molecular shape comparison, and
pharmacophore identification of different allosteric modulators of muscarinic receptors. Journal of
chemical information and computer sciences, 36(5):1018–1024, 1996.
[32] K-B Rhyu, HC Patel, and Anton J. Hopfinger. A 3d-qsar study of anticoccidial triazines using molecular
shape analysis. Journal of chemical information and computer sciences, 35(4):771–778, 1995.
[33] John S Tokarski and Anton J Hopfinger. Three-dimensional molecular shape analysis-quantitative
structure-activity relationship of a series of cholecystokinin-a receptor antagonists. Journal of medicinal
chemistry, 37(21):3639–3654, 1994.
[34] MG Cardozo, Y Iimura, H Sugimoto, Y Yamanishi, and AJ Hopfinger. Qsar analyses of the sub-
stituted indanone and benzylpiperidine rings of a series of indanone-benzylpiperidine inhibitors of
acetylcholinesterase. Journal of medicinal chemistry, 35(3):584–589, 1992.
[35] Tomasz Magdziarz, Pawel Mazur, and Jaroslaw Polanski. Receptor independent and receptor dependent
comsa modeling with ive-pls: application to cbg benchmark steroids and reductase activators. Journal of
molecular modeling, 15(1):41–51, 2009.
[36] Rafal Gieleciak and Jaroslaw Polanski. Modeling robust qsar. 2. iterative variable elimination schemes
for comsa: Application for modeling benzoic acid p k a values. Journal of chemical information and
modeling, 47(2):547–556, 2007.

[37] Tomasz Magdziarz, Bozena Łozowicka, Rafał Gieleciak, Andrzej Bąk, Jarosław Polański, and Zdzisław
Chilmonczyk. 3d qsar study of hypolipidemic asarones by comparative molecular surface analysis.
Bioorganic & medicinal chemistry, 14(5):1630–1643, 2006.
[38] Halina Niedbala, Jaroslaw Polanski, Rafal Gieleciak, Robert Musiol, Dominik Tabak, Barbara Podeszwa,
Andrzej Bak, Anna Palka, Jean-Francois Mouscadet, Johann Gasteiger, et al. Comparative molecular
surface analysis (comsa) for virtual combinatorial library screening of styrylquinoline hiv-1 blocking
agents. Combinatorial chemistry & high throughput screening, 9(10):753–770, 2006.
[39] Balázs Jójárt, Tamás A Martinek, and Árpád Márki. The 3d structure of the binding pocket of the human
oxytocin receptor for benzoxazine antagonists, determined by molecular docking, scoring functions and
3d-qsar methods. Journal of computer-aided molecular design, 19(5):341–356, 2005.
[40] Thomas S Rush, J Andrew Grant, Lidia Mosyak, and Anthony Nicholls. A shape-based 3-d scaffold
hopping method and its application to a bacterial protein- protein interaction. Journal of medicinal
chemistry, 48(5):1489–1495, 2005.
[41] Craig L Senese, J Duca, Dahua Pan, Anton J Hopfinger, and Yufeng J Tseng. 4d-fingerprints, universal
qsar and qspr descriptors. Journal of chemical information and computer sciences, 44(5):1526–1539,
2004.
[42] Manisha Iyer and Anton J Hopfinger. Treating chemical diversity in qsar analysis: modeling diverse
hiv-1 integrase inhibitors using 4d fingerprints. Journal of chemical information and modeling, 47(5):
1945–1960, 2007.
[43] Nelilma Correia Romeiro, Magaly Girão Albuquerque, Ricardo Bicca de Alencastro, Malini Ravi, and
Anton J Hopfinger. Construction of 4d-qsar models for use in the design of novel p38-mapk inhibitors.
Journal of computer-aided molecular design, 19(6):385–400, 2005.
[44] Jianzhong Liu, Dahua Pan, Yufeng Tseng, and Anton J Hopfinger. 4d-qsar analysis of a series of antifungal
p450 inhibitors and 3d-pharmacophore comparisons as a function of alignment. Journal of chemical
information and computer sciences, 43(6):2170–2179, 2003.
[45] Kerly FM Pasqualoto, Elizabeth I Ferreira, Osvaldo A Santos-Filho, and Anton J Hopfinger. Rational
design of new antituberculosis agents: receptor-independent four-dimensional quantitative structure-
activity relationship analysis of a set of isoniazid derivatives. Journal of medicinal chemistry, 47(15):
3755–3764, 2004.
[46] Xuan Hong and Anton J Hopfinger. 3d-pharmacophores of flavonoid binding at the benzodiazepine
gabaa receptor site using 4d-qsar analysis. Journal of chemical information and computer sciences, 43(1):
324–336, 2003.
[47] Matthew D Krasowski, Xuan Hong, AJ Hopfinger, and Neil L Harrison. 4d-qsar analysis of a set of
propofol analogues: mapping binding sites for an anesthetic phenol on the gabaa receptor. Journal of
medicinal chemistry, 45(15):3210–3221, 2002.
[48] Poonsiri Thipnate, Jianzhong Liu, Supa Hannongbua, and Anton J Hopfinger. 3d pharmacophore mapping
using 4d qsar analysis for the cytotoxicity of lamellarins against human hormone-dependent t47d breast
cancer cells. Journal of chemical information and modeling, 49(10):2312–2322, 2009.
[49] Malini Ravi, Anton J Hopfinger, Robert E Hormann, and Laurence Dinan. 4d-qsar analysis of a set of
ecdysteroids and a comparison to comfa modeling. Journal of Chemical Information and Computer
Sciences, 41(6):1587–1604, 2001.
[50] Richard D Cramer, David E Patterson, and Jeffrey D Bunce. Comparative molecular field analysis (comfa).
1. effect of shape on binding of steroids to carrier proteins. Journal of the American Chemical Society,
110(18):5959–5967, 1988.
[51] Brent L Podlogar and David M Ferguson. Qsar and comfa: a perspective on the practical application to
drug discovery. Drug design and discovery (Print), 17(1):4–12, 2000.
[52] Peter J Goodford. A computational procedure for determining energetically favorable binding sites on
biologically important macromolecules. Journal of medicinal chemistry, 28(7):849–857, 1985.
[53] Ki Hwan Kim, Giovanni Greco, and Ettore Novellino. A critical review of recent comfa applications. In
3D QSAR in drug design, pages 257–315. Springer, 1998.
[54] Ki H Kim. Thermodynamic aspects of hydrophobicity and biological qsar. Journal of computer-aided
molecular design, 15(4):367–380, 2001.
[55] Anton J Hopfinger. A qsar investigation of dihydrofolate reductase inhibition by baker triazines based
upon molecular shape analysis. Journal of the American Chemical Society, 102(24):7196–7206, 1980.
[56] Paul CD Hawkins, A Geoffrey Skillman, and Anthony Nicholls. Comparison of shape-matching and
docking as virtual screening tools. Journal of medicinal chemistry, 50(1):74–82, 2007.

[57] G Madhavi Sastry, Steven L Dixon, and Woody Sherman. Rapid shape-based ligand alignment and
virtual screening method based on atom/feature-pair similarities and volume overlap scoring. Journal of
chemical information and modeling, 51(10):2455–2466, 2011.
[58] Tudor I Oprea. 3d qsar modeling in drug design. In Computational Medicinal Chemistry for Drug
Discovery, pages 597–648. CRC Press, 2003.
[59] Ajay N Jain, Kimberle Koile, and David Chapman. Compass: predicting biological activities from
molecular surface properties. performance comparisons on a steroid benchmark. Journal of Medicinal
Chemistry, 37(15):2315–2327, 1994.
[60] AJ Hopfinger, Shen Wang, John S Tokarski, Baiqiang Jin, Magaly Albuquerque, Prakash J Madhav, and
Chaya Duraiswami. Construction of 3d-qsar models using the 4d-qsar analysis formalism. Journal of the
American Chemical Society, 119(43):10509–10524, 1997.
[61] Magaly G Albuquerque, Anton J Hopfinger, EJ Barreiro, and Ricardo B de Alencastro. Four-dimensional
quantitative structure- activity relationship analysis of a series of interphenylene 7-oxabicycloheptane
oxazole thromboxane a2 receptor antagonists. Journal of chemical information and computer sciences,
38(5):925–938, 1998.
[62] M Albuquerque, M Brito, E Cunha, R Alencastro, O Antunes, H Castro, and C Rodrigues.
Multidimensional-qsar: Beyond the third-dimension in drug design. Current Methods in Medicinal
Chemistry and Biological Physics, 1:91–100, 2007.
[63] Jeremy Ash and Denis Fourches. Characterizing the chemical space of erk2 kinase inhibitors using
descriptors computed from molecular dynamics trajectories. Journal of chemical information and
modeling, 57(6):1286–1299, 2017.
[64] Roberto Todeschini and Viviana Consonni. Descriptors from molecular geometry. Handbook of Chemoin-
formatics: From Data to Knowledge in 4 Volumes, pages 1004–1033, 2003.
[65] Miki Akamatsu. Current state and perspectives of 3d-qsar. Current topics in medicinal chemistry, 2(12):
1381–1394, 2002.
[66] KH Kim. Comparative molecular field analysis (comfa). In Molecular similarity in drug design, pages
291–331. Springer, 1995.
[67] Ilana Y. Kanal, John A. Keith, and Geoffrey R. Hutchison. A sobering assessment of small-molecule
force field methods for low energy conformer predictions. International Journal of Quantum Chemistry,
118(5):e25512, 2018.
[68] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning
to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[69] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. Structured attention networks. arXiv
preprint arXiv:1702.00887, 2017.
[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing
systems, pages 5998–6008, 2017.
[71] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio.
Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[72] Simon Axelrod and Rafael Gomez-Bombarelli. GEOM: Energy-annotated molecular conformations for
property prediction and molecular generation. arXiv preprint arXiv:2006.05531, 2020.
[73] Stefan Grimme. Exploration of chemical compound, conformer, and reaction space with meta-dynamics
simulations based on tight-binding quantum chemical calculations. Journal of chemical theory and
computation, 15(5):2847–2862, 2019.
[74] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message
passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 1263–1272. JMLR. org, 2017.
[75] Chemprop Machine Learning for Molecular Property Prediction. https://github.com/chemprop/chemprop. Accessed: 2020-03-31.
[76] Justin S. Smith, Olexandr Isayev, and Adrian E. Roitberg. ANI-1, A data set of 20 million calculated
off-equilibrium conformations for organic molecules. Scientific Data, 4:170193, 2017. ISSN 2052-4463.
doi: 10.1038/sdata.2017.193.
[77] J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1: an extensible neural network potential with DFT
accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017. ISSN 2041-6520.
doi: 10.1039/C6SC05720A.
[78] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured
data. In International conference on machine learning, pages 2702–2711, 2016.

24
[79] SchNetPack - Deep Neural Networks for Atomistic Systems. https://github.com/atomistic-machine-learning/schnetpack. Accessed: 2018-10-23.
[80] Raymond E Carhart, Dennis H Smith, and R Venkataraghavan. Atom pairs as molecular features in
structure-activity studies: definition and applications. Journal of Chemical Information and Computer
Sciences, 25(2):64–73, 1985.
[81] Mikko J Vainio and Mark S Johnson. Generating conformer ensembles using a multiobjective genetic
algorithm. Journal of chemical information and modeling, 47(6):2462–2474, 2007.
[82] J Santeri Puranen, Mikko J Vainio, and Mark S Johnson. Accurate conformation-dependent molecular
electrostatic potentials for high-throughput in silico drug discovery. Journal of computational chemistry,
31(8):1722–1732, 2010.
[83] Noel M O’Boyle, Tim Vandermeersch, Christopher J Flynn, Anita R Maguire, and Geoffrey R Hutchison.
Confab-Systematic generation of diverse low-energy conformers. Journal of cheminformatics, 3(1):1–9,
2011.
[84] Maria A Miteva, Frederic Guyon, and Tufféry Pierre. Frog2: Efficient 3D conformation ensemble
generator for small compounds. Nucleic acids research, 38:W622–W627, 2010.
[85] Chemical Computing Group ULC. CCG: Molecular Operating Environment (MOE). http://www.chemcomp.com/, 2018.
[86] Paul CD Hawkins, A Geoffrey Skillman, Gregory L Warren, Benjamin A Ellingson, and Matthew T Stahl.
Conformer generation with OMEGA: algorithm and validation using high quality structures from the
Protein Databank and Cambridge Structural Database. Journal of chemical information and modeling, 50
(4):572–584, 2010.
[87] RDKit: Open-source cheminformatics. http://www.rdkit.org.
[88] Lucian Chan, Geoffrey R Hutchison, and Garrett M Morris. Bayesian optimization for conformer
generation. Journal of cheminformatics, 11(1):1–11, 2019.
[89] Phyo Phyo Kyaw Zin, Alexandre Borrel, and Denis Fourches. Benchmarking 2d/3d/md-qsar models for
imatinib derivatives: How far can we predict? Journal of Chemical Information and Modeling, 60(7):
3342–3360, 2020.
[90] Simon Axelrod and Rafael Gomez-Bombarelli. GEOM. https://doi.org/10.7910/DVN/JNGTDF, 2021.
[91] GEOM: Energy-annotated molecular conformations. https://github.com/learningmatter-mit/geom.
[92] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum
chemistry structures and properties of 134 kilo molecules. Scientific data, 1:140022, 2014. ISSN
2052-4463. doi: 10.1038/sdata.2014.22.
[93] Bernhard Ellinger, Denisa Bojkova, Andrea Zaliani, Jindrich Cinatl, Carsten Claussen, Sandra Westhaus,
Jeanette Reinshagen, Maria Kuzikov, Markus Wolf, Gerd Geisslinger, et al. Identification of inhibitors
of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing
collection. 2020.
[94] Franck Touret, Magali Gilles, Karine Barral, Antoine Nougairède, Etienne Decroly, Xavier de Lamballerie,
and Bruno Coutard. In vitro screening of a FDA approved chemical library reveals potential inhibitors of
SARS-CoV-2 replication. BioRxiv, 2020.
[95] Main protease structure and XChem fragment screen. https://www.diamond.ac.uk/covid-19/for-scientists/Main-protease-structure-and-XChem.html.
[96] Valerie Tokars and Andrew Mesecar. QFRET-based primary biochemical high throughput screening
assay to identify inhibitors of the SARS coronavirus 3C-like Protease (3CLPro). https://pubchem.ncbi.nlm.nih.gov/bioassay/1706.
[97] Daniel Engel. qHTS of Yeast-based Assay for SARS-CoV PLP. https://pubchem.ncbi.nlm.nih.gov/bioassay/485353.
[98] Daniel Engel. qHTS of Yeast-based Assay for SARS-CoV PLP: Hit Validation. https://pubchem.ncbi.nlm.nih.gov/bioassay/652038.
[99] Mattia Zampieri, Michael Zimmermann, Manfred Claassen, and Uwe Sauer. Nontargeted metabolomics
reveals the multilevel response to antibiotic perturbations. Cell reports, 19(6):1214–1228, 2017.
[100] https://www.aicures.mit.edu/data, 2020. Accessed: 2020-05-22.
[101] Zhijian Xu, Cheng Peng, Yulong Shi, Zhengdan Zhu, Kaijie Mu, Xiaoyu Wang, and Weiliang Zhu.
Nelfinavir was predicted to be a potential inhibitor of 2019-nCov main protease by an integrative approach
combining homology modelling, molecular docking and binding free energy calculation. BioRxiv, 2020.

[102] Ajay N Jain and Anthony Nicholls. Recommendations for evaluation of computational methods. Journal
of computer-aided molecular design, 22(3-4):133–139, 2008.
[103] Aneesh Pappu and Brooks Paige. Making Graph Neural Networks Worth It for Low-Data Molecular
Machine Learning. arXiv preprint arXiv:2011.12203, 2020.
[104] Seth D Axen, Xi-Ping Huang, Elena L Cáceres, Leo Gendelev, Bryan L Roth, and Michael J Keiser. A
simple representation of three-dimensional molecular structure. Journal of medicinal chemistry, 60(17):
7393–7409, 2017.
[105] Neural Force Field. https://github.com/learningmatter-mit/NeuralForceField.
[106] Simon Axelrod and Rafael Gomez-Bombarelli. Conformer models and training datasets. https://doi.org/10.7910/DVN/N4VLQL, 2021.
[107] Jurgis Ruza, Wujie Wang, Daniel Schwalbe-Koda, Simon Axelrod, William H Harris, and Rafael Gómez-
Bombarelli. Temperature-transferable coarse-graining of ionic liquids with dual graph convolutional
neural networks. The Journal of Chemical Physics, 153(16):164501, 2020.
[108] Wujie Wang and Rafael Gómez-Bombarelli. Coarse-graining auto-encoders for molecular dynamics. npj
Computational Materials, 5(1):1–9, 2019.
[109] Ask Hjorth Larsen, Jens Jørgen Mortensen, Jakob Blomqvist, Ivano E Castelli, Rune Christensen, Marcin
Dułak, Jesper Friis, Michael N Groves, Bjørk Hammer, Cory Hargus, et al. The atomic simulation
environment—a Python library for working with atoms. Journal of Physics: Condensed Matter, 29(27):
273002, 2017.
[110] Home | AI Cures. URL https://www.aicures.mit.edu/.
[111] Distributed Asynchronous Hyperparameter Optimization in Python. https://github.com/hyperopt/hyperopt. Accessed: 2020-09-03.
[112] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu,
Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci.,
9(2):513–530, 2018.
[113] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, Marvin Steijaert, Jörg K Wegner, Hugo Ceule-
mans, Djork-Arné Clevert, and Sepp Hochreiter. Large-scale comparison of machine learning methods
for drug target prediction on chembl. Chemical science, 9(24):5441–5451, 2018.
[114] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot
when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3), 2015.
[115] Anthony Nicholls. What do we know and when do we know it? Journal of computer-aided molecular
design, 22(3-4):239–255, 2008.
