Reinforcement Learning For Molecular Design Guided by Quantum Mechanics
3.2. Actor

Action Representation We model the position of the atom to be placed in internal coordinates—a commonly used representation for molecular structures in computational chemistry—relative to previously placed atoms. If the canvas is initially empty, C0 = ∅, the agent selects an element e0 from β0 and places it at the origin, i.e. a0 = (e0, [0, 0, 0]ᵀ). Once the canvas Ct contains at least one atom,² the agent first decides on a focal atom f ∈ {1, . . . , n(Ct)}, where n(Ct) denotes the number of atoms in Ct. This focal atom represents a local reference point close to which the next atom is going to be placed (see Fig. 3). The agent then models the position x ∈ R³ with respect to f in internal coordinates (d, α, ψ), where

• d ∈ R is the Euclidean distance between x and the position xf of the focal atom;

• α ∈ [0, π] is the angle between the two lines defined by (x, xf) and (x, xn1), where xn1 is the position of the atom closest to f; if less than two atoms are on the canvas, α is undefined/unused;

• ψ ∈ [−π, π] is the dihedral angle between the two intersecting planes spanned by (x, xf, xn1) and (xf, xn1, xn2), where xn2 is the position of the atom that is second³ closest to the focal atom; if less than three atoms are on the canvas, ψ is undefined/unused.

As shown in Fig. 3 (right), these internal coordinates can then be mapped back to Cartesian coordinates x; a sketch of this mapping is given below.

Model This action representation suggests a natural generative process: first choose next to which focal atom the new atom is placed, then select its element, and finally decide where to place the atom relative to the focal atom. Therefore, we assume that the policy factorizes as

    πθ(ψ, α, d, e, f | s) = p(ψ, α, d | e, f, s) p(e | f, s) p(f | s).    (3)

We model the distributions over f and e as categorical, Cat(h), where hf ∈ R^n(C) and he ∈ R^Emax are the logits predicted by separate MLPs, and Emax is the largest atomic number that can be selected. Further, p(ψ, α, d | e, f, s) is factored into a product of univariate Gaussian distributions N(µ, σ²), where the means µd, µα and µψ are given by an MLP and the standard deviations σd, σα and σψ are global parameters. Formally,

    hf = MLPf(s̃),    (4)
    he = MLPe(s̃f),    (5)
    (µd, µα, µψ) = MLPcont(s̃f, 1(e)),    (6)

where s̃f = [C̃f, β̃] is the state embedding of the focal atom f ∼ Cat(f; hf), 1(e) is a one-hot vector representation of element e ∼ Cat(e; he), and d ∼ N(d; µd, σd²), α ∼ N(α; µα, σα²), and ψ ∼ N(ψ; µψ, σψ²) are sampled from their respective distributions. The model is shown in Fig. 4.

Maintaining Valid Actions As the agent places atoms onto the canvas during a rollout, the number of possible focal atoms f increases and the number of elements e to choose from decreases. To guarantee that the agent only chooses valid actions, i.e. f ∈ {1, . . . , n(C)} and e ∈ β, we mask out invalid focal atoms and elements by setting their probabilities to zero and re-normalizing the categorical distributions (a minimal sketch is given below). Neither the agent nor the environment makes use of ad-hoc concepts like valence or bond connectivity—any atom on the canvas can potentially be chosen.

² We omit the time index when it is clear from the context.
³ In the unlikely event that two atoms are exactly equally far from the focal atom, a random order for xn1 and xn2 is chosen.
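The text does not spell out the mapping from (d, α, ψ) back to Cartesian coordinates. The sketch below uses the standard Z-matrix (NeRF) construction, which is one natural reading of the definitions above; the function name is ours, and the convention that α is the bond angle at the focal atom is an assumption rather than a detail fixed by this excerpt.

```python
import numpy as np

def place_atom(x_f, x_n1, x_n2, d, alpha, psi):
    """Map internal coordinates (d, alpha, psi), defined relative to the
    focal atom x_f and its two nearest neighbours x_n1 and x_n2, to a
    Cartesian position (standard Z-matrix/NeRF construction)."""
    # Unit vector along the x_n1 -> x_f axis (the dihedral rotation axis).
    b1 = x_f - x_n1
    b1 = b1 / np.linalg.norm(b1)
    # Normal of the (x_n2, x_n1, x_f) plane: the dihedral reference plane.
    b0 = x_n1 - x_n2
    n = np.cross(b0, b1)
    n = n / np.linalg.norm(n)
    m = np.cross(n, b1)  # completes the right-handed local frame (b1, m, n)
    # Spherical-to-Cartesian in the local frame.
    local = d * np.array([
        -np.cos(alpha),               # component along the bond axis
        np.sin(alpha) * np.cos(psi),  # in-plane component
        np.sin(alpha) * np.sin(psi),  # out-of-plane component (sign of psi)
    ])
    return x_f + local[0] * b1 + local[1] * m + local[2] * n
```

With fewer than two (three) atoms on the canvas, α (ψ) is simply dropped, matching the undefined/unused cases in the list above.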
Figure 4. Illustration of the state embedding, actor and critic network. The canvas C and the bag of atoms β are fed to the state embedding network to obtain a translation and rotation invariant state representation s̃. The actor network then selects 1) a focal atom f, 2) an element e, and 3) internal coordinates (d, α, ψ). The critic takes the bag and the sum across all atoms on the canvas to compute a value V.
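The masking described under "Maintaining Valid Actions" can be implemented by sending invalid logits to −∞ before building the categorical distribution, which is equivalent to zeroing their probabilities and renormalizing. A minimal sketch; the padding scheme and tensor shapes are illustrative assumptions:

```python
import torch

def masked_categorical(logits, valid_mask):
    """Categorical distribution over focal atoms or elements in which
    invalid entries (atoms not yet on the canvas, elements no longer in
    the bag) receive zero probability."""
    masked_logits = logits.masked_fill(valid_mask == 0, float('-inf'))
    return torch.distributions.Categorical(logits=masked_logits)

# Hypothetical usage: 3 atoms on a canvas padded to size 5.
logits = torch.randn(5)
mask = torch.tensor([1, 1, 1, 0, 0])
f = masked_categorical(logits, mask).sample()  # always in {0, 1, 2}
```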
Learning the Dihedral Angle The sign of the dihedral angle ψ depends on the two nearest neighbors of the focal atom and is difficult to learn, especially if the two atoms are nearly equally close to the focal atom. In practice, we therefore learn the absolute value |ψ| ∈ [0, π] instead of ψ, as well as the sign κ ∈ {+1, −1}, such that ψ = κ|ψ|. To estimate κ, we exploit the fact that the transition dynamics are deterministic. We generate embeddings of both possible next states (for κ = +1 and κ = −1) and select the embedding of the atom just added, which we denote by s̃+ and s̃−. We then choose κ = +1 over κ = −1 with probability

    p+ = exp(u+) / (exp(u+) + exp(u−)),    (7)

such that p(κ | |ψ|, α, d, e, f, s) = Ber(κ; p+), where u± = MLPκ(s̃±); we further motivate this choice in the Appendix. Thus, the full policy is given by

    πθ(κ, |ψ|, α, d, e, f | s) = p(κ | |ψ|, α, d, e, f, s) p(|ψ|, α, d | e, f, s) p(e | f, s) p(f | s).
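A sketch of the sign-sampling step in Eq. (7). The helper that simulates the two candidate next states and returns the embedding of the atom just added is an illustrative stand-in (the excerpt does not name it), as is the exact interface of MLPκ:

```python
import torch

def sample_dihedral_sign(mlp_kappa, embed_added_atom, abs_psi):
    """Sample kappa following Eq. (7): simulate both candidate next states
    (dihedral +|psi| and -|psi|), score the embedding of the newly added
    atom in each, and draw the sign from Ber(kappa; p_plus)."""
    s_plus = embed_added_atom(+abs_psi)    # embedding for kappa = +1
    s_minus = embed_added_atom(-abs_psi)   # embedding for kappa = -1
    u = torch.stack([mlp_kappa(s_plus), mlp_kappa(s_minus)]).squeeze()
    p_plus = torch.softmax(u, dim=0)[0]    # exp(u+) / (exp(u+) + exp(u-))
    bern = torch.distributions.Bernoulli(probs=p_plus)
    kappa = 2.0 * bern.sample() - 1.0      # maps {0, 1} to {-1, +1}
    return kappa, p_plus
```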
3.3. Critic

The critic needs to compute a value for the entire state s. Since the canvas is growing as more atoms are taken from the bag and placed onto the canvas, a pooling operation is required. Here, we compute the sum over all atomic embeddings C̃i. Thus, the critic is given by

    Vφ(s) = MLPφ([ Σ_{i=1}^{n(C)} C̃i, β̃ ]),    (8)

where MLPφ is an MLP that computes the value V (see Fig. 4).

3.4. Optimization

We employ PPO (Schulman et al., 2017) to learn the parameters (θ, φ) of the actor πθ and the critic Vφ, respectively. While most RL algorithms can only deal with either continuous or discrete action spaces and thus require additional modifications to handle both (Masson et al., 2016; Wei et al., 2018; Xiong et al., 2018), PPO can be applied directly as is. To help maintain sufficient exploration throughout learning, we include an entropy regularization term over the policy. However, note that the entropies of the continuous and categorical distributions often have different magnitudes; further, in this setting the entropies over the categorical distributions vary significantly throughout a rollout: as the agent places more atoms, the support of the distribution over valid focal atoms f increases and the support of the distribution over valid elements e decreases. To mitigate this issue, we only apply entropy regularization to the categorical distributions, which we find to be sufficient in practice.
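Restricting the entropy bonus to the categorical heads is straightforward to express; the signature below is an illustrative assumption, with the coefficient named after c2 in the hyperparameter tables of the Appendix:

```python
def entropy_bonus(dist_f, dist_e, c2=0.01):
    """Entropy regularization over the categorical heads only (torch
    distribution objects). The Gaussian head over (d, alpha, psi) is
    deliberately excluded: its entropy has a different magnitude, whereas
    the categorical entropies naturally grow (over focal atoms f) and
    shrink (over elements e) during a rollout."""
    return c2 * (dist_f.entropy() + dist_e.entropy())

# In a PPO update this bonus is added to the clipped surrogate objective,
# i.e. subtracted from the loss:
#   loss = policy_loss + c1 * value_loss - entropy_bonus(dist_f, dist_e)
```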
4. Related Work

Deep Generative Models A prevalent strategy for molecular design based on machine learning is to employ deep generative models. These approaches first learn a latent representation of the molecules and then perform a search in latent space (e.g., through gradient descent) to discover new molecules with sought chemical properties. For example, Gómez-Bombarelli et al. (2018); Kusner et al. (2017); Blaschke et al. (2018); Lim et al. (2018); Dai et al. (2018) utilized VAEs to perform search or optimization in a latent space to find new molecules. Segler et al. (2018) used RNNs to design molecular libraries. The aforementioned approaches generate SMILES strings, a linear string notation, to describe molecules (Weininger, 1988). Further, there exist a plethora of generative models that work with graph representations of molecules (e.g., Jin et al. (2017); Bradshaw et al. (2019a); Li et al. (2018a;b); Liu et al. (2018); De Cao & Kipf (2018); Bradshaw et al. (2019b)). In these methods, atoms and bonds are represented by nodes and edges, respectively. Brown et al. (2019) developed a benchmark suite for graph-based generative models, showing that generative models outperform classical approaches for molecular design. While the generated molecules are shown to be valid (De Cao & Kipf, 2018; Liu et al., 2018) and synthesizable (Bradshaw et al., 2019b), the generative model is restricted to a (small) region of chemical space for which the graph representation is valid, e.g. single organic molecules.

3D Point Cloud Generation Another downside of string- and graph-based approaches is their neglect of the information encoded in the interatomic distances. To this end, Gebauer et al. (2018; 2019) proposed a generative neural network for sequentially placing atoms in Cartesian coordinates. While their model respects local symmetries by construction, atoms are placed on a 3D grid. Further, similar to the aforementioned approaches, this model depends on the existence of a dataset that covers the particular class of molecules for which one seeks to generate new molecules.

Reinforcement Learning Olivecrona et al. (2017), Guimaraes et al. (2018), Putin et al. (2018), Neil et al. (2018) and Popova et al. (2018) presented RL approaches based on string representations of molecules. They successfully generated molecules with given desirable properties but, similar to other generative models using SMILES strings, struggled with chemical validity. You et al. (2018) proposed a graph convolutional policy network based on graph representations of molecules, where the reward function is based on empirical properties such as drug-likeness. While this approach was able to consistently produce valid molecules, its performance still depends on a dataset required for pre-training. Considering the large diversity of chemical structures, the generation of a dataset that covers the whole chemical space is hopeless. To address this limitation, Zhou et al. (2019) proposed an agent that learned to generate molecules from scratch using a Deep Q-Network (DQN) (Mnih et al., 2015). However, such graph-based RL approaches are still restricted to the generation of single organic molecules for which this representation was originally designed. Further, graph representations prohibit the use of reward functions based on fundamental physical laws, and one has to resort to heuristics instead. Finally, geometric constraints cannot be imposed on the design process. Jørgensen et al. (2019) introduced an atomistic structure learning algorithm, called ASLA, that utilizes a convolutional neural network to build 2D structures and planar compounds atom by atom.

5. Experiments

We perform experiments to evaluate the performance of the policy introduced in Section 3. While prior work has focused on building molecules using molecular graph representations, we are interested in designing molecules in Cartesian coordinates. To this end, we introduce a new RL environment called MolGym in Section 5.1. It comprises a set of molecular design tasks, for which we provide baselines using quantum-chemical calculations. See the Appendix for details on how the baselines are determined.⁴ We use MolGym to answer the following questions: 1) can our agent learn to construct single molecules in Cartesian coordinates from scratch, 2) does our approach allow building molecules across multiple bags simultaneously, 3) are we able to scale to larger molecules, and 4) can our agent construct systems comprising multiple molecules?

5.1. Tasks

We propose three different tasks for molecular design in Cartesian coordinates, which are instances of the MDP formulation introduced in Section 2.2: single-bag, multi-bag, and solvation. More formally, the tasks are as follows:

Single-bag Given a bag, learn to design stable molecules. This task assesses an agent's ability to build single stable molecules. The reward function is given by r(st, at) = −∆E(st, at), see Eq. (1). If the reward is below a threshold of −0.6, the molecule is deemed invalid and the episode terminates prematurely with the reward clipped at −0.6.⁵

Multi-bag Given multiple bags, with one of them being randomly selected before each episode, learn to design stable molecules. This task focuses on the agent's ability to learn to build molecules of different composition and size at the same time. The same reward function as in the single-bag task is used. Offline performance is evaluated in terms of the average return across bags. Similarly, the baseline is given by the average optimal return over all bags.

Solvation The task is to learn to place water molecules around an existing molecule (i.e. C0 is non-empty). This task assesses an agent's ability to distinguish intra- and inter-molecular interactions, i.e. the atomic interactions within a molecule and those between molecules. These interactions are paramount for an accurate description of chemistry in the liquid phase. In this task, we deviate from the protocol used in the previous experiments as follows. Initially, the agent is provided with an H2O bag. Once the bag is empty, the environment will refill it and the episode continues. The episode terminates once n ∈ N+ bags of H2O have been placed on the canvas. By refilling the H2O bag n − 1 times instead of providing a single H2nOn bag, the agent is guided towards building H2O molecules.⁶ The reward function is augmented with a penalty term for placing atoms far away from the center, i.e. r(st, at) = −∆E − ρ‖x‖2, where ρ is a hyper-parameter. This corresponds to a soft constraint on the radius at which the atoms should be placed. This is a task a graph-based RL approach could not solve. A sketch of the reward bookkeeping for these tasks is given below.

⁴ Source code of the agent and environment is available at https://github.com/gncs/molgym.
⁵ ∆E is on the order of magnitude of −0.1 Hartree, resulting in a reward of around 0.25 for a well-placed atom.
⁶ A comparison of the two protocols is given in the Appendix.
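A minimal sketch of the reward bookkeeping for these tasks; the PM6 energy difference ∆E is abstracted away here (how it is computed is described in the Appendix), and the exact form of Eq. (1) is taken on the paper's word rather than reproduced:

```python
import numpy as np

REWARD_FLOOR = -0.6  # below this, single-/multi-bag episodes terminate

def single_bag_reward(delta_e):
    """r(s_t, a_t) = -Delta E; rewards below -0.6 are clipped and the
    episode ends prematurely (the structure is deemed invalid)."""
    reward = -delta_e
    if reward < REWARD_FLOOR:
        return REWARD_FLOOR, True   # clipped reward, terminate episode
    return reward, False

def solvation_reward(delta_e, x, rho=0.01):
    """r(s_t, a_t) = -Delta E - rho * ||x||_2, where x is the position of
    the newly placed atom and rho is the distance penalty hyper-parameter
    (0.01 in the experiments below)."""
    return -delta_e - rho * float(np.linalg.norm(x))
```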
5.2. Results

In this section, we use the tasks specified in Section 5.1 to evaluate our proposed policy. We further assess the chemical validity, diversity and stability of the generated structures. Experiments were run on a 16-core Intel Xeon Skylake 6142 CPU at 2.6 GHz with 96 GB RAM. Details on the model architecture and hyperparameters are in the Appendix.

Figure 5. (a) Average offline performance on the single-bag task for bags CH3NO, CH4O and C2H2O2 across 10 seeds. Dashed lines denote optimal returns for each bag, respectively. Error bars show two standard deviations. (b) Generated molecular structures at different terminal states over time, showing the agent's learning progress.
Learning to Construct Single Molecules In this toy experiment, we train the agent on the single-bag task for the bags CH3NO, CH4O and C2H2O2, respectively. Fig. 5 shows that the agent was able to learn the rules of chemical bonding and interatomic distances from scratch. While on average the agent reaches 90% of the optimal return after only 12 000 steps, the snapshots in Fig. 5 (b) highlight that the last 10% determine chemical validity. As shown in Fig. 5 (b), the model first learns the atomic distances d, followed by the angles α and the dihedral angles ψ.

Table 1. QM9 bags used in the experiments.

Experiment          QM9 Bags Used
Single-bag          C2H2O2, CH3NO, CH4O
Multi-bag           H2O, CHN, C2N2, H3N, C2H2, CH2O, C2HNO, N4O, C3HN, CH4, CF4
Single-bag (large)  C3H5NO3, C4H7N, C3H8O

Learning across Multiple Bags We train the agent on the multi-bag task using all formulas contained in the QM9 dataset (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014) with up to 5 atoms, resulting in 11 bags (see Table 1). Despite their small size, the molecules feature a diverse set of bonds (single, double, and triple) and geometries (linear, trigonal planar, and tetrahedral). From the performance and from visual inspection of the generated molecular structures shown in Fig. 6, it can be seen that a single policy is able to build different molecular structures across multiple bags. For example, it learned that a carbon atom can have a varying number and type of neighboring atoms, leading to specific bond distances, angles, and dihedral angles.

Figure 6. Average offline performance on the multi-bag task, using 11 bags consisting of up to five atoms across 10 seeds. The dashed line denotes the optimal average return. Error bars show two standard deviations. The molecular structures shown are the terminal states at the end of training from one seed.

Scaling to Larger Molecules To study our agent's ability to construct larger molecules, we let it solve the single-bag task with the bags C3H5NO3, C3H8O, and C4H7N. Results are shown in Fig. 7. After 154 000 steps, the agent achieved an average return of 2.60 on C3H5NO3 (maximum across seeds at 2.72, optimum at 2.79), 2.17 on C4H7N (2.21, 2.27), and 1.98 on C3H8O (2.04, 2.07). While the agent did not always find the most stable configurations, it was able to explore a diverse set of chemically valid structures (including bimolecular structures, see Appendix).
Figure 7. (a) Average offline performance on the single-bag task for bags C3H5NO3, C3H8O and C4H7N across 10 seeds. Dashed lines denote the optimal return for each bag, respectively. Error bars show two standard deviations. (b) Selection of molecular structures generated by trained models for the bag C3H5NO3. For the bags C3H8O and C4H7N, see the Appendix.
Figure 8. Average offline performance on the solvation task with 5 H2O molecules across 10 seeds. Error bars show two standard errors. The plot is smoothed across five evaluations for better readability. The dashed line denotes the optimal return. A selection of molecular clusters generated by trained models is shown in solid circles; for comparison, a stable configuration obtained through structure optimization is depicted in a dashed circle.

Table 2. Validity, median RMSD (in Å), and diversity of the structures generated in each experiment.

Task                Experiment     Validity  RMSD  Diversity
Single-bag          C2H2O2         0.90      0.32  3
                    CH3NO          0.70      0.20  3
                    CH4O           0.80      0.11  1
Multi-bag           -              0.78      0.05  22
Single-bag (large)  C3H5NO3        0.70      0.39  40
                    C4H7N          0.80      0.29  20
                    C3H8O          0.90      0.47  4
                    C7H8N2O2       0.60      0.61  61
Solvation           Formaldehyde   0.80      1.03  1
                    Acetonitrile   0.90      1.06  1
                    Ethanol        0.90      0.92  1
Constructing Molecular Clusters We task the agent to place 5 water molecules around a formaldehyde molecule, i.e. C0 = CH2O and n = 5. The distance penalty parameter ρ is set to 0.01.⁷ From Fig. 8, we observe that the agent is able to learn to construct H2O molecules and place them in the vicinity of the solute. A good placement also allows for hydrogen bonds to be formed between water molecules themselves and between water molecules and the solute (see Fig. 8, dashed circle). In most cases, our agent arranges the H2O molecules such that these bonds can be formed (see Fig. 8, solid circles). The lack of hydrogen bonds in some structures could be attributed to the approximate nature of the quantum-chemical method used in the reward function. Overall, this experiment showcases that our agent is able to learn both intra- and intermolecular interactions, going beyond what graph-based agents can learn.

⁷ Further experiments on the solvation task are in the Appendix.

Quality Assessment of Generated Molecules In the spirit of the GuacaMol benchmark (Brown et al., 2019), we assess the molecular structures generated by the agent with respect to chemical validity, diversity and structural stability for each experiment. To enable a comparison with existing approaches, we additionally ran experiments with the bag C7H8N2O2, the stoichiometry of which is taken from the GuacaMol benchmark (Brown et al., 2019).

The results are shown in Table 2. To determine the validity and stability of the generated structures, we first took the terminal states of the last iteration for a particular experiment. Structures are considered valid if they can be successfully parsed by RDKit (Landrum, 2019). However, those consisting of multiple molecules were not considered valid (except in the solvation task). The validity reported in Table 2 is the ratio of valid molecules over 10 seeds.
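A sketch of this validity and diversity bookkeeping. The paper used RDKit 2019.09.3; the version below is written against a recent RDKit (MolFromXYZBlock and DetermineBonds) and should be read as one plausible implementation, not the original pipeline:

```python
from rdkit import Chem
from rdkit.Chem import rdDetermineBonds

def is_valid(xyz_block, allow_fragments=False):
    """A structure counts as valid if RDKit can parse it; structures made
    up of multiple molecules are rejected except in the solvation task
    (allow_fragments=True)."""
    mol = Chem.MolFromXYZBlock(xyz_block)
    if mol is None:
        return False
    try:
        rdDetermineBonds.DetermineBonds(mol, charge=0)
    except ValueError:
        return False
    return allow_fragments or len(Chem.GetMolFrags(mol)) == 1

def diversity(valid_mols):
    """Two molecules are considered identical if their canonical SMILES
    strings agree; diversity is the number of unique valid structures."""
    return len({Chem.MolToSmiles(mol) for mol in valid_mols})
```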
All valid generated structures underwent a structure optimization using the PM6 method (see Appendix for more details). Then, the RMSD (in Å) between the original and the optimized structure was computed. In Table 2, the median RMSD over all generated structures is given per experiment. In the approach by Gebauer et al. (2019), an average RMSD of ≈ 0.25 Å is reported. Due to significant differences in approach, application, and training procedure, we forego a direct comparison of the methods.

Further, two molecules are considered identical if the SMILES strings generated by RDKit are the same. The diversity reported in Table 2 is the total number of unique and valid structures generated through training over 10 seeds.

6. Discussion

This work is a first step towards general molecular design through RL in Cartesian coordinates. One limitation of the current formulation is that we need to provide bags for which we know good solutions exist when placed completely. While being able to provide such prior knowledge can be beneficial, we are currently restricted to designing molecules of known formulas. A possible solution is to provide bags that are larger than necessary, e.g. generated randomly or according to some fixed budget for each element, and enable the agent to stop before the bag is empty.

Compared to graph-based approaches, constructing molecules by sequentially placing atoms in Cartesian coordinates greatly increases the flexibility in terms of the type of molecular structures that can be built. However, it also makes the exploration problem more challenging: whereas in graph-based approaches a molecule can be expanded by adding a node and an edge, here, the agent has to learn to precisely position an atom in Cartesian coordinates from scratch. As a result, the molecules we generate are still considerably smaller. Several approaches exist to mitigate the exploration problem and improve scalability, including: 1) hierarchical RL, where molecular fragments or entire molecules are used as high-level actions; 2) imitation learning, in which known molecules are converted into expert trajectories; and 3) curriculum learning, where the complexity of the molecules to be built increases over time.

7. Conclusion

We have presented a novel RL formulation for molecular design in Cartesian coordinates, in which the reward function is based on quantum-mechanical properties such as the energy. We further proposed an actor-critic neural network architecture based on a translation and rotation invariant state-action representation. Finally, we demonstrated that our model can efficiently solve a range of molecular design tasks from our MolGym RL environment from scratch.

In future work, we plan to increase the scalability of our approach and enable the agent to stop before a given bag is empty. Moreover, we are interested in combining the reward with other properties such as drug-likeness and applying our approach to other classes of molecules, e.g. transition-metal catalysts.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable feedback. We further thank Austin Tripp and Vincent Stimper for useful discussions and feedback. GNCS acknowledges funding through an Early Postdoc.Mobility fellowship by the Swiss National Science Foundation (P2EZP2 181616). RP receives funding from iCASE grant #1950384 with support from Nokia.

References

Blaschke, T., Olivecrona, M., Engkvist, O., Bajorath, J., and Chen, H. Application of Generative Autoencoder in De Novo Molecular Design. Mol. Inf., 37(1-2):1700123, 2018.

Bosia, F., Husch, T., Vaucher, A. C., and Reiher, M. qcscine/sparrow: Release 1.0.0, 2019. URL https://doi.org/10.5281/zenodo.3244106.

Bradshaw, J., Kusner, M. J., Paige, B., Segler, M. H. S., and Hernández-Lobato, J. M. A generative model for electron paths. In International Conference on Learning Representations, 2019a.

Bradshaw, J., Paige, B., Kusner, M. J., Segler, M., and Hernández-Lobato, J. M. A Model to Search for Synthesizable Molecules. In Advances in Neural Information Processing Systems, pp. 7935–7947, 2019b.

Brown, N., Fiscato, M., Segler, M. H., and Vaucher, A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. J. Chem. Inf. Model., 59(3):1096–1108, 2019. doi: 10.1021/acs.jcim.8b00839.

Dai, H., Tian, Y., Dai, B., Skiena, S., and Song, L. Syntax-directed variational autoencoder for structured data. In International Conference on Learning Representations, 2018.

De Cao, N. and Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

Gebauer, N. W. A., Gastegger, M., and Schütt, K. T. Generating equilibrium molecules with deep neural networks. arXiv preprint arXiv:1810.11347, 2018.

Gebauer, N. W. A., Gastegger, M., and Schütt, K. T. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. In Advances in Neural Information Processing Systems, pp. 7564–7576, 2019.

Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci., 4(2):268–276, 2018.

Guimaraes, G. L., Sanchez-Lengeling, B., Outeiral, C., Farias, P. L. C., and Aspuru-Guzik, A. Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. arXiv preprint arXiv:1705.10843, 2018.

Husch, T. and Reiher, M. Comprehensive Analysis of the Neglect of Diatomic Differential Overlap Approximation. J. Chem. Theory Comput., 14(10):5169–5179, 2018.

Husch, T., Vaucher, A. C., and Reiher, M. Semiempirical molecular orbital models based on the neglect of diatomic differential overlap approximation. Int. J. Quantum Chem., 118(24):e25799, 2018.

Jin, W., Coley, C., Barzilay, R., and Jaakkola, T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. In Advances in Neural Information Processing Systems, pp. 2607–2616, 2017.

Jørgensen, M. S., Mortensen, H. L., Meldgaard, S. A., Kolsbjerg, E. L., Jacobsen, T. L., Sørensen, K. H., and Hammer, B. Atomistic structure learning. J. Chem. Phys., 151(5):054111, 2019. doi: 10.1063/1.5108871.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014, 2000.

Kusner, M. J., Paige, B., and Hernández-Lobato, J. M. Grammar Variational Autoencoder. In Precup, D. and Teh, Y. W. (eds.), International Conference on Machine Learning, volume 70, pp. 1945–1954. PMLR, 2017.

Landrum, G. RDKit 2019.09.3. http://www.rdkit.org/, 2019. (Accessed: 22. January 2019).

Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning Deep Generative Models of Graphs. arXiv preprint arXiv:1803.03324, 2018a.

Li, Y., Zhang, L., and Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminf., 10(1):33, 2018b.

Lim, J., Ryu, S., Kim, J. W., and Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J. Cheminf., 10:31, 2018.

Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained Graph Variational Autoencoders for Molecule Design. In Advances in Neural Information Processing Systems, pp. 7795–7804, 2018.

Masson, W., Ranchod, P., and Konidaris, G. Reinforcement learning with parameterized actions. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Neil, D., Segler, M., Guasch, L., Ahmed, M., Plumbley, D., Sellwood, M., and Brown, N. Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design. OpenReview, 2018. URL https://openreview.net/forum?id=HkcTe-bR-.

Olivecrona, M., Blaschke, T., Engkvist, O., and Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminf., 9(1):48, 2017.

Polishchuk, P. G., Madzhidov, T. I., and Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput.-Aided Mol. Des., 27(8):675–679, 2013.

Popova, M., Isayev, O., and Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv., 4(7):eaap7885, 2018.

Putin, E., Asadulaev, A., Ivanenkov, Y., Aladinskiy, V., Sanchez-Lengeling, B., Aspuru-Guzik, A., and Zhavoronkov, A. Reinforced Adversarial Neural Computer for de Novo Molecular Design. J. Chem. Inf. Model., 58(6):1194–1204, 2018.

Ramakrishnan, R., Dral, P. O., Rupp, M., and von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data, 1:140022, 2014.

Ruddigkeit, L., van Deursen, R., Blum, L. C., and Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model., 52(11):2864–2875, 2012.

Schneider, P., Walters, W. P., Plowright, A. T., Sieroka, N., Listgarten, J., Goodnow, R. A., Fisher, J., Jansen, J. M., Duca, J. S., Rush, T. S., Zentgraf, M., Hill, J. E., Krutoholow, E., Kohler, M., Blaney, J., Funatsu, K., Luebkemann, C., and Schneider, G. Rethinking drug design in the artificial intelligence era. Nat. Rev. Drug Discovery, pp. 1–12, 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Schütt, K., Kindermans, P.-J., Sauceda Felix, H. E., Chmiela, S., Tkatchenko, A., and Müller, K.-R. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pp. 991–1001, 2017.

Xiong, J., Wang, Q., Yang, Z., Sun, P., Han, L., Zheng, Y., Fu, H., Zhang, T., Liu, J., and Liu, H. Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:1810.06394, 2018.

You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. In Advances in Neural Information Processing Systems, pp. 6410–6421, 2018.

Zhavoronkov, A., Ivanenkov, Y. A., Aliper, A., Veselov, M. S., Aladinskiy, V. A., Aladinskaya, A. V., Terentiev, V. A., Polykovskiy, D. A., Kuznetsov, M. D., Asadulaev, A., Volkov, Y., Zholus, A., Shayakhmetov, R. R., Zhebrak, A., Minaeva, L. I., Zagribelnyy, B. A., Lee, L. H., Soll, R., Madge, D., Xing, L., Guo, T., and Aspuru-Guzik, A. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol., 37(9):1038–1040, 2019.

Zhou, Z., Kearnes, S., Li, L., Zare, R. N., and Riley, P. Optimization of Molecules via Deep Reinforcement Learning. Sci. Rep., 9(1):1–10, 2019.
A. Quantum-Chemical Calculations

For the calculation of the energy E we use the fast semi-empirical Parametrized Method 6 (PM6) (Stewart, 2007). In particular, we use the implementation from the software package Sparrow (Husch et al., 2018; Bosia et al., 2019). For each calculation, a molecular charge of zero and the lowest possible spin multiplicity are chosen. All calculations are spin-unrestricted.

Limitations of semi-empirical methods are highlighted in, for example, recent work by Husch & Reiher (2018). More accurate methods such as approximate density functionals need to be employed especially for systems containing transition metals.

For the quantum-chemical calculations to converge reliably, we ensured that atoms are not placed too close (< 0.6 Å) nor too far away from each other (> 2.0 Å). If the agent places an atom outside these boundaries, the minimum reward of −0.6 is awarded and the episode terminates.

B. Learning the Dihedral Angle

We experimentally validate the benefits of learning |ψ| ∈ [0, π] and κ ∈ {−1, 1} instead of ψ ∈ [−π, π] by comparing the two models on the single-bag task with bag CH4 (methane). Methane is one of the simplest molecules that requires the model to learn a dihedral angle. As shown in Fig. 9, learning the sign of the dihedral angle separately (with κ) speeds up learning significantly. In fact, the ablated model (without κ) fails to converge to the optimal return even after 100 000 steps (not shown).

C. Experimental Details

C.1. Model Architecture

The model architecture is summarized in Table 3. We initialize the biases of each MLP with 0 and each weight matrix as a (semi-)orthogonal matrix. After each hidden layer, a ReLU non-linearity is used. The output activations are shown in Table 3. As explained in the main text, both MLPf and MLPe use a masked softmax activation function to guarantee that only valid actions are chosen. Further, we rescale the continuous actions (µd, µα, µψ) ∈ [−1, 1]³ predicted by MLPcont to ensure that µd ∈ [dmin, dmax], µα ∈ [0, π] and µψ ∈ [0, π]; a sketch of this rescaling is given below. For more details on the SchNet, see the original work (Schütt et al., 2018b).

Table 3. Model architecture for actor and critic networks.

Operation  Dimensionality                    Activation
SchNet     n(C) × 4, ∗, n(C) × 64            ∗ (cf. Table 7)
MLPβ       emax, 128, 32                     linear
tile       32, n(C) × 32                     —
concat     n(C) × (64, 32), n(C) × 96        —
MLPf       n(C) × 96, n(C) × 128, n(C) × 1   softmax
select     n(C) × 96, 96                     —
MLPe       96, 128, emax                     softmax
concat     (96, emax), 96 + emax             —
MLPcont    96 + emax, 128, 3                 tanh
MLPκ       2 × 96, 2 × 128, 2 × 1            softmax
pooling    n(C) × 96, 96                     —
MLPφ       96, 128, 128, 1                   linear
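The rescaling of MLPcont's tanh outputs onto the target ranges is not spelled out; an affine map is the natural choice and is sketched below as one possible reading, not necessarily the original implementation:

```python
import numpy as np

def rescale_continuous(raw, d_min=0.95, d_max=1.80):
    """Map the tanh outputs (mu_d, mu_alpha, mu_psi) in [-1, 1]^3 onto
    mu_d in [d_min, d_max], mu_alpha in [0, pi], and mu_psi in [0, pi].
    Default distance range taken from Table 4 (single-bag task)."""
    mu_d = d_min + 0.5 * (raw[0] + 1.0) * (d_max - d_min)
    mu_alpha = 0.5 * (raw[1] + 1.0) * np.pi
    mu_psi = 0.5 * (raw[2] + 1.0) * np.pi
    return mu_d, mu_alpha, mu_psi
```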
Figure 9. Average offline performance on the single-bag task for the bag CH4 across 10 seeds. Estimating κ and |ψ| separately (with κ) significantly speeds up learning compared to estimating ψ directly (without κ). Error bars show two standard deviations. The dashed line denotes the optimal return.

C.2. Hyperparameters

We manually performed an initial hyperparameter search on a single holdout validation seed. The considered hyperparameters and the selected values are listed in Table 4 (single-bag), Table 5 (multi-bag) and Table 6 (solvation). The hyperparameters used for SchNet are shown in Table 7.

D. Baselines

Below, we report how the baselines for the single-bag and multi-bag tasks were derived. First, we took all molecular structures for a given chemical formula (i.e. bag) from the QM9 dataset (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014). Subsequently, we performed a structure optimization using the PM6 method (as described in Section A) on the structures. This was necessary as the structures in this dataset were optimized with a different quantum-chemical method. Then, the most stable structure was selected and considered optimal for this chemical formula; the remaining structures were discarded. Since the undiscounted return is path independent, we determined the return R(s) directly from the final optimized structure; a sketch of this computation is given below.
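Path independence follows from the telescoping sum of per-step rewards. The sketch below assumes that ∆E in Eq. (1) (not reproduced in this excerpt) is the energy change of adding one atom, E(C_{t+1}) − E(C_t) − E(e_t); treat this as a plausible reading rather than the paper's exact computation.

```python
def baseline_return(e_structure, e_isolated_atoms):
    """Undiscounted return under r_t = -Delta E_t: summing
    -[E(C_{t+1}) - E(C_t) - E(e_t)] over an episode telescopes to
    -(E(final structure) - sum of isolated-atom energies), independent
    of the order in which the atoms were placed.
    e_structure: PM6 energy of the optimized structure (Hartree);
    e_isolated_atoms: PM6 energies of the atoms in the bag."""
    return -(e_structure - sum(e_isolated_atoms))
```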
Table 4. Hyperparameters for the single-bag task. Adapted values for the scalability (large) experiment are in parentheses.

Hyperparameter            Search set            Value (large)
Range [dmin, dmax] (Å)    —                     [0.95, 1.80]
Max. atomic number emax   —                     10
Workers                   —                     16
Clipping                  —                     0.2
Gradient clipping         —                     0.5
GAE parameter λ           —                     0.95
VF coefficient c1         —                     1
Entropy coefficient c2    {0.00, 0.01, 0.03}    0.01
Training epochs           {5, 10}               5
Adam stepsize             {10⁻⁴, 3 × 10⁻⁴}      3 × 10⁻⁴
Discount γ                {0.99, 1.00}          0.99
Time horizon T            {192, 256}            192 (256)
Minibatch size            {24, 32}              24 (32)

Table 6. Hyperparameters for the solvation task.

Hyperparameter            Search set            Value
Range [dmin, dmax] (Å)    —                     [0.90, 2.80]
Max. atomic number emax   —                     10
Distance penalty ρ        —                     0.01
Workers                   —                     16
Clipping                  —                     0.2
Gradient clipping         —                     0.5
GAE parameter λ           —                     0.95
VF coefficient c1         —                     1
Entropy coefficient c2    {0.00, 0.01, 0.03}    0.01
Training epochs           {5, 10}               5
Adam stepsize             {10⁻⁴, 3 × 10⁻⁴}      3 × 10⁻⁴
Discount γ                {0.99, 1.00}          0.99
Time horizon T            {384, 512}            384
Minibatch size            {48, 64}              48
Figure 11. Bimolecular structure generated by a trained model for the bag C3H8O in the single-bag task.

Figure 13. Average offline performances across 10 seeds on the solvation task with n = 5 and the initial states being acetonitrile and ethanol. Error bars show two standard errors. The plot is smoothed across five evaluations for better readability. The dashed lines denote the optimal returns. A selection of molecular clusters generated by trained models are shown in circles.

Figure 15. Average offline performance for agents A/B: trained on bags A of size 6 and tested on bags B of size 8; B/B: trained and tested on B; and A → B/B: trained on A for 96 000 steps, then fine-tuned and tested on B. See main text for more details. Error bars show two standard deviations. The dashed line denotes the optimal average return.
We compare the average offline performance of two agents placing in total 10 H and 5 O atoms around a formaldehyde molecule. One agent is given 5 H2O bags consecutively, following the protocol of the solvation task as described in the main text; the other is given a single H10O5 bag. Their average offline performances are shown in Fig. 14.

To assess the generalization capabilities of our agent when faced with previously unseen bags, we train an agent on bags A = {C2H2O2, C2H3N, C3H2O, C3N2O, CH3NO, CH4O} of size 6 and test on bags B = {C3H2O3, C3H4O, C4H2O2, CH4N2O, C4N2O2, C5H2O} of size 8. As shown in Fig. 15, the agent A/B achieves an average return of 1.79, which is approximately 88% of the optimal return. In comparison, an agent trained and tested on B (B/B) reaches an average return of 1.96 (or 97% of the optimal return). We additionally train an agent on A for 96 000 steps, and then fine-tune and test on B. The agent A → B/B reaches the same performance as if trained from scratch within 20 000 steps of fine-tuning, showing successful transfer. We anticipate that training on more bags and incorporating best practices from multi-task learning would further improve performance.