
Reinforcement Learning for Molecular Design Guided by Quantum Mechanics

Gregor N. C. Simm*1  Robert Pinsler*1  José Miguel Hernández-Lobato1

arXiv:2002.07717v2 [stat.ML] 29 Jun 2020

Abstract

Automating molecular design using deep reinforcement learning (RL) holds the promise of accelerating the discovery of new chemical compounds. Existing approaches work with molecular graphs and thus ignore the location of atoms in space, which restricts them to 1) generating single organic molecules and 2) heuristic reward functions. To address this, we present a novel RL formulation for molecular design in Cartesian coordinates, thereby extending the class of molecules that can be built. Our reward function is directly based on fundamental physical properties such as the energy, which we approximate via fast quantum-chemical methods. To enable progress towards de-novo molecular design, we introduce MolGym, an RL environment comprising several challenging molecular design tasks along with baselines. In our experiments, we show that our agent can efficiently learn to solve these tasks from scratch by working in a translation and rotation invariant state-action space.

Figure 1. Visualization of the molecular design process presented in this work. The RL agent (depicted by a robot arm) sequentially places atoms onto a canvas. By working directly in Cartesian coordinates, the agent learns to build structures from a very general class of molecules. Learning is guided by a reward that encodes fundamental physical properties. Bonds are only for illustration.

1. Introduction

Finding new chemical compounds with desired properties is a challenging task with important applications such as de novo drug design and materials discovery (Schneider et al., 2019). The diversity of synthetically feasible chemicals that can be considered as potential drug-like molecules was estimated to be between 10^30 and 10^60 (Polishchuk et al., 2013), making exhaustive search hopeless.

Recent applications of machine learning have accelerated the search for new molecules with specific desired properties. Generative models such as variational autoencoders (VAEs) (Gómez-Bombarelli et al., 2018), recurrent neural networks (RNNs) (Segler et al., 2018), and generative adversarial networks (GANs) (De Cao & Kipf, 2018) have been successfully applied to propose potential drug candidates. Despite recent advances in generating valid structures, proposing truly novel molecules beyond the training data distribution remains a challenging task. This issue is exacerbated for many classes of molecules (e.g. transition metals), where such a representative dataset is not even available.

An alternative strategy is to employ RL, in which an agent builds new molecules in a step-wise fashion (e.g., Olivecrona et al. (2017), Guimaraes et al. (2018), Zhou et al. (2019), Zhavoronkov et al. (2019)). Training an RL agent only requires samples from a reward function, alleviating the need for an existing dataset of molecules. However, the choice of state representation in current models still severely limits the class of molecules that can be generated. In particular, molecules are commonly described by graphs, where atoms and bonds are represented by nodes and edges, respectively. Since a graph is a simplified model of the physical representation of molecules in the real world, one is limited to the generation of single organic molecules. Other types of molecules cannot be appropriately described as this representation lacks important three-dimensional (3D) information, i.e. the relative position of atoms in space. For example, systems consisting of multiple molecules cannot be generated for this reason. Furthermore, it prohibits the use of reward functions based on fundamental physical laws; instead, one has to resort to heuristic physicochemical parameters, e.g. the Wildman-Crippen partition coefficient (Wildman & Crippen, 1999). Lastly, it is not possible to impose geometric constraints on the design process, e.g. those given by the binding pocket of a protein which the generated molecule is supposed to target.

*Equal contribution. 1Department of Engineering, University of Cambridge, Cambridge, UK. Correspondence to: Gregor N. C. Simm <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
In this work, we introduce a novel RL formulation for molecular design in which an agent places atoms from a given bag of atoms onto a 3D canvas (see Fig. 1). As the reward function is based on fundamental physical properties such as energy, this formulation is not restricted to the generation of molecules of a particular type. We thus encourage the agent to implicitly learn the laws of atomic interaction from scratch to build molecules that go beyond what can be represented with graph-based RL methods. To enable progress towards designing such molecules, we introduce a new RL environment called MolGym. It comprises a suite of tasks in which both single molecules and molecule clusters need to be constructed. For all of these tasks, we provide baselines using quantum-chemical calculations. Finally, we propose a novel policy network architecture that can efficiently learn to solve these tasks by working in a translation and rotation invariant state-action space.

Figure 2. Rollout of an episode with bag β0 = CH2O. The agent constructs a molecule by sequentially placing atoms from the bag onto the 3D canvas until the bag is empty.

In summary, our contributions are as follows:

• we propose a novel RL formulation for general molecular design in Cartesian coordinates (Section 2.2);
• we design a reward function based on the electronic energy, which we approximate via fast quantum-chemical calculations (Section 2.3);
• we present a translation and rotation invariant policy network architecture for molecular design (Section 3);
• we introduce MolGym, an RL environment comprising several molecular design tasks along with baselines based on quantum-chemical calculations (Section 5.1);
• we perform experiments to evaluate the performance of our proposed policy network using standard policy gradient methods (Section 5.2).

2. Reinforcement Learning for Molecular Design Guided by Quantum Mechanics

In this section, we provide a brief introduction to RL and present our novel RL formulation for molecular design in Cartesian coordinates.

2.1. Background: Reinforcement Learning

In the standard RL framework, an agent interacts with the environment in order to maximize some reward. We consider a fully observable environment with deterministic dynamics. Such an environment is formally described by a Markov decision process (MDP) M = (S, A, T, µ0, γ, T, r) with state space S, action space A, transition function T : S × A → S, initial state distribution µ0, discount factor γ ∈ (0, 1], time horizon T, and reward function r : S × A → R. The value function V^π(s_t) is defined as the expected discounted return when starting from state s_t and following policy π thereafter, i.e. V^π(s_t) = E_π[Σ_{t'=t}^T γ^{t'−t} r(s_{t'}, a_{t'}) | s_t]. The goal is to learn a stochastic policy π(a_t|s_t) that maximizes the expected discounted return J(θ) = E_{s0∼µ0}[V^π(s0)].

Policy Gradient Algorithms Policy gradient methods are well-suited for RL in continuous action spaces. These methods learn a parametrized policy πθ by performing gradient ascent in order to maximize J(θ). More recent algorithms (Schulman et al., 2015; 2017) improve the stability during learning by constraining the policy updates. For example, proximal policy optimization (PPO) (Schulman et al., 2017) employs a clipped surrogate objective. Denoting the probability ratio between the updated and the old policy as r_t(θ) = πθ(a_t|s_t) / πθ_old(a_t|s_t), the clipped objective J^CL is given by

J^CL(θ) = E[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],

where Â_t is an estimator of the advantage function, and ε is a hyperparameter that controls the interval beyond which r_t(θ) gets clipped. To further reduce the variance of the gradient estimator, actor-critic approaches (Konda & Tsitsiklis, 2000) are often employed. The idea is to use the value function (i.e. the critic) to assist learning the policy (i.e. the actor). If the actor and critic share parameters, the objective becomes

J^AC(θ) = E[ J^CL(θ) − c1 J^V + c2 H[πθ | s_t] ],

where c1, c2 are coefficients, J^V = (Vφ^π(s_t) − V^target)^2 is a squared-error loss, and H is an entropy regularization term to encourage sufficient exploration.

2.2. Setup

We design molecules by sequentially drawing atoms from a given bag and placing them onto a 3D canvas. This task
can be formulated as a sequential decision-making problem in an MDP with deterministic transition dynamics, where

• state st = (Ct, βt) contains the canvas Ct = C0 ∪ {(ei, xi)}_{i=0}^{t−1}, i.e. a set of atoms with chemical element ei ∈ {H, C, N, O, ...} and position xi ∈ R^3 placed until time t − 1, as well as a bag βt = {(e, m(e))} of atoms still to be placed; C0 can either be empty, C0 = ∅, or contain a set of atoms, i.e. C0 = {(ei, xi)} for some i ∈ Z−; m(e) is the multiplicity of the element e;
• action at = (et, xt) contains the element et ∈ βt and position xt ∈ R^3 of the next atom to be placed;
• deterministic transition function T(st, at) places an atom through action at in state st, returning the next state st+1 = (Ct+1, βt+1) with βt+1 = βt \ et;
• reward function r(st, at) quantifies how applying action at in state st alters properties of the molecule, e.g. the stability of the molecule as measured in terms of its quantum-chemical energy.

Fig. 2 depicts the rollout of an episode. The initial state (C0, β0) ∼ µ0(s0) of the episode comprises the initial content C0 of the canvas and a bag of atoms β0 to be placed, e.g. C0 = ∅ and β0 = CH2O,1 sampled uniformly from a given set of bags. The agent then sequentially draws atoms from the bag without replacement and places them onto the canvas until the bag is empty.

1 Shorthand for {(C, 1), (H, 2), (O, 1)}.

2.3. Reward Function

One advantage of designing molecules in Cartesian coordinates is that we can evaluate states in terms of quantum-mechanical properties, such as the energy or dipole moment. In this paper, we focus on designing stable molecules, i.e. molecules with low energy E ∈ R; however, linear combinations of multiple desirable properties are possible as well (see Section 5.1 for an example). We define the reward r(st, at) = −∆E(st, at) as the negative difference in energy between the resulting molecule described by Ct+1 and the sum of energies of the current molecule Ct and a new atom of element et, i.e.

∆E(st, at) = E(Ct+1) − [E(Ct) + E(et)],  (1)

where E(e) := E({(e, [0, 0, 0]^T)}). Intuitively, the agent is rewarded for placing atoms so that the energy of the resulting molecules is low. Importantly, with this formulation the undiscounted return for building a molecule is independent of the order in which atoms are placed. If the reward only consisted of E(Ct+1), one would double-count interatomic interactions. As a result, the formulation in Eq. (1) prevents the agent from learning to greedily choose atoms of high atomic number first, as they have low intrinsic energy.

Quantum-chemical methods, such as the ones based on density functional theory (DFT), can be employed to compute the energy E for a given C. Since such methods are computationally demanding in general, we instead choose to evaluate the energy using the semi-empirical Parametrized Method 6 (PM6) (Stewart, 2007) as implemented in the software package Sparrow (Husch et al., 2018; Bosia et al., 2019); see the Appendix for details. PM6 is significantly faster than more accurate methods based on DFT and sufficiently accurate for the scope of this study. For example, the energy E of systems containing 10 atoms can be computed within hundreds of milliseconds with PM6; with DFT, this would take minutes. We note that more accurate methods can be used as well if the computational budget is available.

3. Policy

Building molecules in Cartesian coordinates allows us to 1) extend molecular design through deep RL to a much broader class of molecules compared to graph-based approaches, and 2) employ reward functions based on fundamental physical properties such as the energy. However, working directly in Cartesian coordinates introduces several additional challenges for policy learning.

Firstly, it would be highly inefficient to naively learn to place atoms directly in Cartesian coordinates since molecular properties are invariant under symmetry operations such as translation and rotation. For instance, the energy of a molecule, and thus the reward, does not change if the molecule gets rotated, yet an agent that is not taking this into account would need to learn these solutions separately. Therefore, we require an agent that is covariant to translation and rotation, i.e., if the canvas is rotated or translated, the position xt of the atom to be placed should be rotated and translated as well. To achieve this, our agent first models the atom's position in internal coordinates, which are invariant under translation and rotation. Then, by mapping from internal to Cartesian coordinates, we obtain a position xt that features the required covariance. The agent's internal representations for states and actions are introduced in Sections 3.1 and 3.2, respectively.

Secondly, the action space contains both discrete (i.e. element et) and continuous actions (i.e. position xt). This is in contrast to most RL algorithms, which assume that the action space is either discrete or continuous. Due to the continuous actions, policy exploration becomes much more challenging compared to graph-based approaches. Further, not all discrete actions are valid in every state, e.g. the element et has to be contained in the bag βt. These issues are addressed in Section 3.2, where we propose a novel actor-critic neural network architecture for efficiently constructing molecules in Cartesian coordinates.
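To make the internal-to-Cartesian mapping mentioned above concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the focal atom x_f, its two nearest neighbours x_n1 and x_n2, and the internal coordinates (d, α, ψ) defined precisely in Section 3.2; sign conventions for the dihedral ψ vary between implementations.

```python
import numpy as np

def to_cartesian(x_f, x_n1, x_n2, d, alpha, psi):
    """Map internal coordinates (d, alpha, psi), defined relative to the
    focal atom x_f and its two nearest neighbours x_n1 and x_n2, to a
    Cartesian position (a standard Z-matrix-style construction)."""
    b = x_f - x_n1
    b = b / np.linalg.norm(b)            # unit vector x_n1 -> x_f
    n = np.cross(x_n1 - x_n2, b)
    n = n / np.linalg.norm(n)            # normal of the (x_n2, x_n1, x_f) plane
    m = np.cross(n, b)                   # completes a right-handed orthonormal frame
    # Place the atom at distance d from x_f, angle alpha at the focal atom,
    # and dihedral psi about the x_n1 -> x_f axis.
    return x_f + d * (-np.cos(alpha) * b
                      + np.sin(alpha) * np.cos(psi) * m
                      + np.sin(alpha) * np.sin(psi) * n)

# Example: focal atom at (1, 0, 0), neighbours at the origin and (0, 1, 0);
# d = 1, alpha = 90 degrees, psi = 0 gives the in-plane position (1, 1, 0).
x = to_cartesian(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]),
                 np.array([0.0, 1.0, 0.0]), 1.0, np.pi / 2, 0.0)
```

Because (b, m, n) is orthonormal, the returned point is always at distance d from x_f, and rotating or translating all three reference atoms rotates and translates x identically, which is exactly the covariance the agent requires.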
3.1. State Representation

Given that our agent models the position of the atom to be placed in internal coordinates, we require a representation for each atom on the canvas C that is invariant under translation and rotation of the canvas.2 To achieve this, we employ SchNet (Schütt et al., 2017; 2018b), a deep learning architecture consisting of continuous-filter convolutional layers that works directly on atoms placed in Cartesian coordinates. SchNet(C) produces an embedding of each atom in C that captures information about its local atomic environment. As shown in Fig. 4 (left), we combine this embedding C̃ with a latent representation β̃ of the bag, yielding a state embedding s̃, i.e.

s̃ = [C̃, β̃],  C̃ = SchNet(C),  β̃ = MLPβ(β),  (2)

where MLPβ is a multi-layer perceptron (MLP).

2 We omit the time index when it is clear from the context.

Figure 3. Construction of a molecule using an action-space representation that is invariant under translation and rotation. Left: Current state st with canvas Ct and remaining bag βt. Center: Action at adds an atom from the bag (highlighted in orange) relative to the focus f (highlighted in blue). The relative coordinates (d, α, ψ) uniquely determine its absolute position. Right: Resulting state st+1 after applying action at in state st.

3.2. Actor

Action Representation We model the position of the atom to be placed in internal coordinates (a representation commonly used for molecular structures in computational chemistry) relative to previously placed atoms. If the canvas is initially empty, C0 = ∅, the agent selects an element e0 from β0 and places it at the origin, i.e. a0 = (e0, [0, 0, 0]^T). Once the canvas Ct contains at least one atom, the agent first decides on a focal atom, f ∈ {1, ..., n(Ct)}, where n(Ct) denotes the number of atoms in Ct. This focal atom represents a local reference point close to which the next atom is going to be placed (see Fig. 3). The agent then models the position x ∈ R^3 with respect to f in internal coordinates (d, α, ψ), where

• d ∈ R is the Euclidean distance between x and the position xf of the focal atom;
• α ∈ [0, π] is the angle between the two lines defined by (x, xf) and (xf, xn1), where xn1 is the position of the atom closest to f; if fewer than two atoms are on the canvas, α is undefined/unused;
• ψ ∈ [−π, π] is the dihedral angle between the two intersecting planes spanned by (x, xf, xn1) and (xf, xn1, xn2), where xn2 is the position of the atom that is the second3 closest to the focal atom; if fewer than three atoms are on the canvas, ψ is undefined/unused.

As shown in Fig. 3 (right), these internal coordinates can then be mapped back to Cartesian coordinates x.

3 In the unlikely event that two atoms are exactly equally far from the focal atom, a random order for xn1 and xn2 is chosen.

Model This action representation suggests a natural generative process: first choose next to which focal atom the new atom is placed, then select its element, and finally decide where to place the atom relative to the focal atom. Therefore, we assume that the policy factorizes as

πθ(ψ, α, d, e, f | s) = p(ψ, α, d | e, f, s) p(e | f, s) p(f | s).  (3)

We model the distributions over f and e as categorical, Cat(h), where hf ∈ R^{n(C)} and he ∈ R^{Emax} are the logits predicted by separate MLPs, and Emax is the largest atomic number that can be selected. Further, p(ψ, α, d | e, f, s) is factored into a product of univariate Gaussian distributions N(µ, σ^2), where the means µd, µα and µψ are given by an MLP and the standard deviations σd, σα and σψ are global parameters. Formally,

hf = MLPf(s̃),  (4)
he = MLPe(s̃f),  (5)
µd, µα, µψ = MLPcont(s̃f, 1(e)),  (6)

where s̃f = [C̃f, β̃] is the state embedding of the focal atom f ∼ Cat(f; hf), 1(e) is a one-hot vector representation of element e ∼ Cat(e; he), and d ∼ N(d; µd, σd^2), α ∼ N(α; µα, σα^2), and ψ ∼ N(ψ; µψ, σψ^2) are sampled from their respective distributions. The model is shown in Fig. 4.

Maintaining Valid Actions As the agent places atoms onto the canvas during a rollout, the number of possible focal atoms f increases and the number of elements e to choose from decreases. To guarantee that the agent only chooses valid actions, i.e. f ∈ {1, ..., n(C)} and e ∈ β, we mask out invalid focal atoms and elements by setting their probabilities to zero and re-normalizing the categorical distributions. Neither the agent nor the environment makes use of ad-hoc concepts like valence or bond connectivity; any atom on the canvas can potentially be chosen.

Learning the Dihedral Angle The sign of the dihedral angle ψ depends on the two nearest neighbors of the focal atom and is difficult to learn, especially if the two atoms are nearly equally close to the focal atom. In practice, we therefore learn the absolute value |ψ| ∈ [0, π] instead of ψ, as well as the sign κ ∈ {+1, −1}, such that ψ = κ|ψ|. To estimate κ, we exploit the fact that the transition dynamics are deterministic. We generate embeddings of both possible next states (for κ = +1 and κ = −1) and select the embedding of the atom just added, which we denote by s̃+ and s̃−. We then choose κ = +1 over κ = −1 with probability

p+ = exp(u+) / (exp(u+) + exp(u−)),  (7)

such that p(κ | |ψ|, α, d, e, f, s) = Ber(κ; p+), where u± = MLPκ(s̃±); we further motivate this choice in the Appendix. Thus, the policy is given by πθ(κ, |ψ|, α, d, e, f | s) = p(κ | |ψ|, α, d, e, f, s) p(|ψ|, α, d | e, f, s) p(e | f, s) p(f | s).

Figure 4. Illustration of the state embedding, actor and critic network. The canvas C and the bag of atoms β are fed to the state embedding network to obtain a translation and rotation invariant state representation s̃. The actor network then selects 1) a focal atom f, 2) an element e, and 3) internal coordinates (d, α, ψ). The critic takes the bag and the sum across all atoms on the canvas to compute a value V.

3.3. Critic

The critic needs to compute a value for the entire state s. Since the canvas grows as more atoms are taken from the bag and placed onto the canvas, a pooling operation is required. Here, we compute the sum over all atomic embeddings C̃i. Thus, the critic is given by

Vφ(s) = MLPφ([ Σ_{i=1}^{n(C)} C̃i, β̃ ]),  (8)

where MLPφ is an MLP that computes the value V (see Fig. 4).

3.4. Optimization

We employ PPO (Schulman et al., 2017) to learn the parameters (θ, φ) of the actor πθ and critic Vφ, respectively. While most RL algorithms can only deal with either continuous or discrete action spaces and thus require additional modifications to handle both (Masson et al., 2016; Wei et al., 2018; Xiong et al., 2018), PPO can be applied directly as is. To help maintain sufficient exploration throughout learning, we include an entropy regularization term over the policy. However, note that the entropies of the continuous and categorical distributions often have different magnitudes; further, in this setting the entropies over the categorical distributions vary significantly throughout a rollout: as the agent places more atoms, the support of the distribution over valid focal atoms f increases and the support of the distribution over valid elements e decreases. To mitigate this issue, we only apply entropy regularization to the categorical distributions, which we find to be sufficient in practice.

4. Related Work

Deep Generative Models A prevalent strategy for molecular design based on machine learning is to employ deep generative models. These approaches first learn a latent representation of the molecules and then perform a search in latent space (e.g., through gradient descent) to discover new molecules with sought chemical properties. For example, Gómez-Bombarelli et al. (2018); Kusner et al. (2017); Blaschke et al. (2018); Lim et al. (2018); Dai et al. (2018) utilized VAEs to perform search or optimization in a latent space to find new molecules. Segler et al. (2018) used RNNs to design molecular libraries. The aforementioned approaches generate SMILES strings, a linear string notation, to describe molecules (Weininger, 1988). Further, there exist a plethora of generative models that work with graph representations of molecules (e.g., Jin et al. (2017); Bradshaw et al. (2019a); Li et al. (2018a;b); Liu et al. (2018); De Cao & Kipf (2018); Bradshaw et al. (2019b)). In these methods, atoms and bonds are represented by nodes and edges, respectively. Brown et al. (2019) developed a benchmark suite for graph-based generative models, showing that generative models outperform classical approaches for molecular design. While the generated molecules are shown to be valid (De Cao & Kipf, 2018; Liu et al., 2018) and synthesizable (Bradshaw et al., 2019b), the generative model is restricted to a (small) region of chemical space for which the graph representation is valid, e.g. single organic molecules.
3D Point Cloud Generation Another downside of string- our agent learn to construct single molecules in Cartesian
and graph-based approaches is their neglect of information coordinates from scratch, 2) does our approach allow build-
encoded in the interatomic distances. To this end, Gebauer ing molecules across multiple bags simultaneously, 3) are
et al. (2018; 2019) proposed a generative neural network we able to scale to larger molecules, and 4) can our agent
for sequentially placing atoms in Cartesian coordinates. construct systems comprising multiple molecules?
While their model respects local symmetries by construction,
atoms are placed on a 3D grid. Further, similar to aforemen- 5.1. Tasks
tioned approaches, this model depends on a dataset to exist
that covers the particular class of molecules for which one We propose three different tasks for molecular design in
seeks to generate new molecules. Cartesian coordinates, which are instances of the MDP for-
mulation introduced in Section 2.2: single-bag, multi-bag,
Reinforcement Learning Olivecrona et al. (2017), and solvation. More formally, the tasks are as follows:
Guimaraes et al. (2018), Putin et al. (2018), Neil et al. (2018)
and Popova et al. (2018) presented RL approaches based on Single-bag Given a bag, learn to design stable molecules.
string representations of molecules. They successfully gen- This task assesses an agent’s ability to build single stable
erated molecules with given desirable properties but, similar molecules. The reward function is given by r(st , at ) =
to other generative models using SMILES strings, struggled −∆E (st , at ), see Eq. (1). If the reward is below a threshold
with chemical validity. You et al. (2018) proposed a graph of −0.6, the molecule is deemed invalid and the episode
convolutional policy network based on graph representa- terminates prematurely with the reward clipped at −0.6.5
tions of molecules, where the reward function is based on Multi-bag Given multiple bags with one of them being
empirical properties such as the drug-likeliness. While this randomly selected before each episode, learn to design sta-
approach was able to consistently produce valid molecules, ble molecules. This task focuses on the agent’s capabilities
its performance still depends on a dataset required for pre- to learn to build different molecules of different composition
training. Considering the large diversity of chemical struc- and size at the same time. The same reward function as in
tures, the generation of a dataset that covers the whole chem- the single-bag task is used. Offline performance is evaluated
ical space is hopeless. To address this limitation, Zhou et al. in terms of the average return across bags. Similarly, the
(2019) proposed an agent that learned to generate molecules baseline is given by the average optimal return over all bags.
from scratch using a Deep Q-Network (DQN) (Mnih et al.,
2015). However, such graph-based RL approaches are still Solvation The task is to learn to place water molecules
restricted to the generation of single organic molecules for around an existing molecule (i.e. C0 is non-empty). This
which this representation was originally designed. Further, task assesses an agent’s ability to distinguish intra- and inter-
graph representations prohibit the use of reward functions molecular interactions, i.e. the atomic interactions within a
based on fundamental physical laws, and one has to resort molecule and those between molecules. These interactions
to heuristics instead. Finally, geometric constraints cannot are paramount for the accurate description of chemistry in
be imposed on the design process. Jørgensen et al. (2019) the liquid phase. In this task, we deviate from the protocol
introduced an atomistic structure learning algorithm, called used in the previous experiments as follows. Initially, the
ALSA, that utilizes a convolutional neural network to build agent is provided with an H2 O bag. Once the bag is empty,
2D structures and planar compounds atom by atom. the environment will refill it and the episode continues. The
episode terminates once n ∈ N+ bags of H2 O have been
placed on the canvas. By refilling the H2 O bag n − 1 times
5. Experiments instead of providing a single H2n On bag, the agent is guided
We perform experiments to evaluate the performance of towards building H2 O molecules. 6 The reward function is
the policy introduced in Section 3. While prior work has augmented with a penalty term for placing atoms far away
focused on building molecules using molecular graph rep- from the center, i.e. r(st , at ) = −∆E − ρkxk2 , where ρ is
resentations, we are interested in designing molecules in a hyper-parameter. This corresponds to a soft constraint on
Cartesian coordinates. To this end, we introduce a new the radius at which the atoms should be placed. This is a
RL environment called M OL G YM in Section 5.1. It com- task a graph-based RL approach could not solve.
prises a set of molecular design tasks, for which we provide
baselines using quantum-chemical calculations. See the 5.2. Results
Appendix for details on how the baselines are determined. 4 In this section, we use the tasks specified in Section 5.1 to
We use M OL G YM to answer the following questions: 1) can evaluate our proposed policy. We further assess the chemical
5
4
Source code of the agent and environment is available at ∆E is on the order of magnitude of −0.1 Hartree, resulting
https://github.com/gncs/molgym. in a reward of around 0.25 for a well placed atom.
6
A comparison of the two protocols is given in the Appendix.
1.5
Average Return

1.0

0.5

0.0
C2 H2 O2
CH3NO

0.5 CH4O

0 10 20 30 40 50 60 70 80
Steps x 1000

Figure 5. (a) Average offline performance on the single-bag task for bags CH3 NO, CH4 O and C2 H2 O2 across 10 seeds. Dashed lines
denote optimal returns for each bag, respectively. Error bars show two standard deviations. (b) Generated molecular structures at different
terminal states over time, showing the agent’s learning progress.

0.8
Table 1. QM9 bags used in the experiments.
Experiment QM9 Bags Used
Average Return

0.6
Single-bag C2 H2 O2 , CH3 NO, CH4 O
Table 1. Bags used in the multi-bag and single-bag (large) tasks.

Task               | Bags
Multi-bag          | H2O, CHN, C2N2, H3N, C2H2, CH2O, C2HNO, N4O, C3HN, CH4, CF4
Single-bag (large) | C3H5NO3, C4H7N, C3H8O

Figure 6. Average offline performance on the multi-bag task, using 11 bags consisting of up to five atoms across 10 seeds. The dashed line denotes the optimal average return. Error bars show two standard deviations. The molecular structures shown are the terminal states at the end of training from one seed.

validity, diversity and stability of the generated structures. Experiments were run on a 16-core Intel Xeon Skylake 6142 CPU with 2.6 GHz and 96 GB RAM. Details on the model architecture and hyperparameters are in the Appendix.

Learning to Construct Single Molecules  In this toy experiment, we train the agent on the single-bag task for the bags CH3NO, CH4O, and C2H2O2, respectively. Fig. 5 shows that the agent was able to learn the rules of chemical bonding and interatomic distances from scratch. While on average the agent reaches 90% of the optimal return after only 12 000 steps, the snapshots in Fig. 5 (b) highlight that the last 10% determine chemical validity. As shown in Fig. 5 (b), the model first learns the atomic distances d, followed by the angles α and the dihedral angles ψ.

Learning across Multiple Bags  We train the agent on the multi-bag task using all formulas contained in the QM9 dataset (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014) with up to 5 atoms, resulting in 11 bags (see Table 1). Despite their small size, the molecules feature a diverse set of bonds (single, double, and triple) and geometries (linear, trigonal planar, and tetrahedral). From the performance and from visual inspection of the generated molecular structures shown in Fig. 6, it can be seen that a single policy is able to build different molecular structures across multiple bags. For example, it learned that a carbon atom can have a varying number and type of neighboring atoms, leading to specific bond distances, angles, and dihedral angles.

Scaling to Larger Molecules  To study our agent's ability to construct larger molecules, we let it solve the single-bag task with the bags C3H5NO3, C3H8O, and C4H7N. Results are shown in Fig. 7. After 154 000 steps, the agent achieved an average return of 2.60 on C3H5NO3 (maximum across seeds at 2.72, optimum at 2.79), 2.17 on C4H7N (2.21, 2.27), and 1.98 on C3H8O (2.04, 2.07). While the agent did not always find the most stable configurations, it was able to explore a diverse set of chemically valid structures (including bimolecular structures, see Appendix).

Constructing Molecular Clusters  We task the agent to place 5 water molecules around a formaldehyde molecule, i.e. C0 = CH2O and n = 5. The distance penalty parameter
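The bag formulation used in these tasks is easy to sketch. The following is a minimal, hypothetical skeleton of such an episode loop — `run_episode` and `toy_policy` are illustrative names, not the paper's actual MolGym interface, and the toy policy merely stands in for the learned actor:

```python
from collections import Counter

# A "bag" is a multiset of atoms still to be placed; the canvas is the
# list of atoms already positioned in Cartesian space.
def run_episode(bag_formula, policy):
    bag = Counter(bag_formula)          # e.g. {"C": 1, "H": 4} for CH4
    canvas = []                         # list of (element, (x, y, z))
    while sum(bag.values()) > 0:
        element, position = policy(canvas, bag)  # agent picks atom + location
        assert bag[element] > 0, "cannot place an atom that is not in the bag"
        bag[element] -= 1
        canvas.append((element, position))
    return canvas

# Toy stand-in for the learned policy: place the next available atom
# along the x-axis, 1.0 Å apart.
def toy_policy(canvas, bag):
    element = next(e for e, n in bag.items() if n > 0)
    return element, (float(len(canvas)), 0.0, 0.0)

methane = run_episode({"C": 1, "H": 4}, toy_policy)
print(len(methane))  # 5
```

In the actual method, the policy outputs a focal atom plus internal coordinates (distance d, angle α, dihedral ψ) rather than raw Cartesian positions, and the reward is quantum-chemical rather than hand-crafted.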
Figure 7. (a) Average offline performance on the single-bag task for bags C3H5NO3, C3H8O and C4H7N across 10 seeds. Dashed lines denote the optimal return for each bag, respectively. Error bars show two standard deviations. (b) Selection of molecular structures generated by trained models for the bag C3H5NO3. For the bags C3H8O and C4H7N, see the Appendix.
ρ is set to 0.01. From Fig. 8, we observe that the agent is able to learn to construct H2O molecules and place them in the vicinity of the solute. A good placement also allows for hydrogen bonds to be formed between water molecules themselves and between water molecules and the solute (see Fig. 8, dashed circle). In most cases, our agent arranges H2O molecules such that these bonds can be formed (see Fig. 8, solid circles). The lack of hydrogen bonds in some structures could be attributed to the approximate nature of the quantum-chemical method used in the reward function. Overall, this experiment showcases that our agent is able to learn both intra- and intermolecular interactions, going beyond what graph-based agents can learn. Further experiments on the solvation task are in the Appendix.

Figure 8. Average offline performance on the solvation task with 5 H2O molecules across 10 seeds. Error bars show two standard errors. The plot is smoothed across five evaluations for better readability. The dashed line denotes the optimal return. A selection of molecular clusters generated by trained models is shown in solid circles; for comparison, a stable configuration obtained through structure optimization is depicted in a dashed circle.

Table 2. Assessment of generated structures in different experiments by chemical validity, RMSD (in Å), and diversity.

Task               | Experiment   | Validity | RMSD | Diversity
Single-bag         | C2H2O2       | 0.90     | 0.32 | 3
Single-bag         | CH3NO        | 0.70     | 0.20 | 3
Single-bag         | CH4O         | 0.80     | 0.11 | 1
Multi-bag          | -            | 0.78     | 0.05 | 22
Single-bag (large) | C3H5NO3      | 0.70     | 0.39 | 40
Single-bag (large) | C4H7N        | 0.80     | 0.29 | 20
Single-bag (large) | C3H8O        | 0.90     | 0.47 | 4
Single-bag (large) | C7H8N2O2     | 0.60     | 0.61 | 61
Solvation          | Formaldehyde | 0.80     | 1.03 | 1
Solvation          | Acetonitrile | 0.90     | 1.06 | 1
Solvation          | Ethanol      | 0.90     | 0.92 | 1

Quality Assessment of Generated Molecules  In the spirit of the GuacaMol benchmark (Brown et al., 2019), we assess the molecular structures generated by the agent with respect to chemical validity, diversity, and structural stability for each experiment. To enable a comparison with existing approaches, we additionally ran experiments with the bag C7H8N2O2, the stoichiometry of which is taken from the GuacaMol benchmark (Brown et al., 2019).

The results are shown in Table 2. To determine the validity and stability of the generated structures, we first took the terminal states of the last iteration for a particular experiment. Structures are considered valid if they can be successfully parsed by RDKit (Landrum, 2019). However, those consisting of multiple molecules were not considered valid (except in the solvation task). The validity reported in Table 2 is the ratio of valid molecules over 10 seeds.
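The RMSD metric reported in Table 2 compares a generated structure with its PM6-optimized counterpart. The paper does not spell out the alignment protocol; the sketch below assumes identical atom ordering and uses Kabsch superposition to remove rotation and translation before computing the RMSD:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD (same units as the inputs, e.g. Å) between two conformations
    P and Q of shape (N, 3) with matching atom order, after optimal
    rigid-body superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)                    # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                               # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                        # optimal rotation
    diff = (R @ P.T).T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# A rotated copy of a structure has RMSD ~0 after superposition.
water = np.array([[0.00, 0.00, 0.00], [0.96, 0.00, 0.00], [-0.24, 0.93, 0.00]])
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
print(round(kabsch_rmsd(water @ Rz.T, water), 6))  # 0.0
```

With this convention, a large RMSD means the agent's geometry relaxed substantially under PM6, i.e. it was far from a local minimum.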
All valid generated structures underwent a structure optimization using the PM6 method (see Appendix for more details). Then, the RMSD (in Å) between the original and the optimized structure was computed. In Table 2, the median RMSD over all generated structures is given per experiment. In the approach by Gebauer et al. (2019), an average RMSD of ≈ 0.25 Å is reported. Due to significant differences in approach, application, and training procedure, we forego a direct comparison of the methods.

Further, two molecules are considered identical if the SMILES strings generated by RDKit are the same. The diversity reported in Table 2 is the total number of unique and valid structures generated through training over 10 seeds.

6. Discussion

This work is a first step towards general molecular design through RL in Cartesian coordinates. One limitation of the current formulation is that we need to provide bags for which we know good solutions exist when placed completely. While being able to provide such prior knowledge can be beneficial, we are currently restricted to designing molecules of known formulas. A possible solution is to provide bags that are larger than necessary, e.g. generated randomly or according to some fixed budget for each element, and enable the agent to stop before the bag is empty.

Compared to graph-based approaches, constructing molecules by sequentially placing atoms in Cartesian coordinates greatly increases the flexibility in terms of the type of molecular structures that can be built. However, it also makes the exploration problem more challenging: whereas in graph-based approaches a molecule can be expanded by adding a node and an edge, here, the agent has to learn to precisely position an atom in Cartesian coordinates from scratch. As a result, the molecules we generate are still considerably smaller. Several approaches exist to mitigate the exploration problem and improve scalability, including: 1) hierarchical RL, where molecular fragments or entire molecules are used as high-level actions; 2) imitation learning, in which known molecules are converted into expert trajectories; and 3) curriculum learning, where the complexity of the molecules to be built increases over time.

7. Conclusion

We have presented a novel RL formulation for molecular design in Cartesian coordinates, in which the reward function is based on quantum-mechanical properties such as the energy. We further proposed an actor-critic neural network architecture based on a translation and rotation invariant state-action representation. Finally, we demonstrated that our model can efficiently solve a range of molecular design tasks from our MolGym RL environment from scratch.

In future work, we plan to increase the scalability of our approach and enable the agent to stop before a given bag is empty. Moreover, we are interested in combining the reward with other properties such as drug-likeness, and in applying our approach to other classes of molecules, e.g. transition-metal catalysts.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable feedback. We further thank Austin Tripp and Vincent Stimper for useful discussions and feedback. GNCS acknowledges funding through an Early Postdoc.Mobility fellowship by the Swiss National Science Foundation (P2EZP2 181616). RP receives funding from iCASE grant #1950384 with support from Nokia.

References

Blaschke, T., Olivecrona, M., Engkvist, O., Bajorath, J., and Chen, H. Application of Generative Autoencoder in De Novo Molecular Design. Mol. Inf., 37(1-2):1700123, 2018.

Bosia, F., Husch, T., Vaucher, A. C., and Reiher, M. qcscine/sparrow: Release 1.0.0, 2019. URL https://doi.org/10.5281/zenodo.3244106.

Bradshaw, J., Kusner, M. J., Paige, B., Segler, M. H. S., and Hernández-Lobato, J. M. A generative model for electron paths. In International Conference on Learning Representations, 2019a.

Bradshaw, J., Paige, B., Kusner, M. J., Segler, M., and Hernández-Lobato, J. M. A Model to Search for Synthesizable Molecules. In Advances in Neural Information Processing Systems, pp. 7935–7947, 2019b.

Brown, N., Fiscato, M., Segler, M. H., and Vaucher, A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. J. Chem. Inf. Model., 59(3):1096–1108, 2019. doi: 10.1021/acs.jcim.8b00839.

Dai, H., Tian, Y., Dai, B., Skiena, S., and Song, L. Syntax-directed variational autoencoder for structured data. In International Conference on Learning Representations, 2018.

De Cao, N. and Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

Gebauer, N. W. A., Gastegger, M., and Schütt, K. T. Generating equilibrium molecules with deep neural networks. arXiv preprint arXiv:1810.11347, 2018.

Gebauer, N. W. A., Gastegger, M., and Schütt, K. T. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. In Advances in Neural Information Processing Systems, pp. 7564–7576, 2019.

Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci., 4(2):268–276, 2018.

Guimaraes, G. L., Sanchez-Lengeling, B., Outeiral, C., Farias, P. L. C., and Aspuru-Guzik, A. Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. arXiv preprint arXiv:1705.10843, 2018.

Husch, T. and Reiher, M. Comprehensive Analysis of the Neglect of Diatomic Differential Overlap Approximation. J. Chem. Theory Comput., 14(10):5169–5179, 2018.

Husch, T., Vaucher, A. C., and Reiher, M. Semiempirical molecular orbital models based on the neglect of diatomic differential overlap approximation. Int. J. Quantum Chem., 118(24):e25799, 2018.

Jin, W., Coley, C., Barzilay, R., and Jaakkola, T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. In Advances in Neural Information Processing Systems, pp. 2607–2616, 2017.

Jørgensen, M. S., Mortensen, H. L., Meldgaard, S. A., Kolsbjerg, E. L., Jacobsen, T. L., Sørensen, K. H., and Hammer, B. Atomistic structure learning. J. Chem. Phys., 151(5):054111, 2019. doi: 10.1063/1.5108871.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014, 2000.

Kusner, M. J., Paige, B., and Hernández-Lobato, J. M. Grammar Variational Autoencoder. In Precup, D. and Teh, Y. W. (eds.), International Conference on Machine Learning, volume 70, pp. 1945–1954. PMLR, 2017.

Landrum, G. RDKit 2019.09.3. http://www.rdkit.org/, 2019. (Accessed: 22. January 2019).

Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning Deep Generative Models of Graphs. arXiv preprint arXiv:1803.03324, 2018a.

Li, Y., Zhang, L., and Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminf., 10(1):33, 2018b.

Lim, J., Ryu, S., Kim, J. W., and Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J. Cheminf., 10:31, 2018.

Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained Graph Variational Autoencoders for Molecule Design. In Advances in Neural Information Processing Systems, pp. 7795–7804, 2018.

Masson, W., Ranchod, P., and Konidaris, G. Reinforcement learning with parameterized actions. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Neil, D., Segler, M., Guasch, L., Ahmed, M., Plumbley, D., Sellwood, M., and Brown, N. Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design. OpenReview, 2018. URL https://openreview.net/forum?id=HkcTe-bR-.

Olivecrona, M., Blaschke, T., Engkvist, O., and Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminf., 9(1):48, 2017.

Polishchuk, P. G., Madzhidov, T. I., and Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput.-Aided Mol. Des., 27(8):675–679, 2013.

Popova, M., Isayev, O., and Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv., 4(7):eaap7885, 2018.

Putin, E., Asadulaev, A., Ivanenkov, Y., Aladinskiy, V., Sanchez-Lengeling, B., Aspuru-Guzik, A., and Zhavoronkov, A. Reinforced Adversarial Neural Computer for de Novo Molecular Design. J. Chem. Inf. Model., 58(6):1194–1204, 2018.

Ramakrishnan, R., Dral, P. O., Rupp, M., and von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data, 1:140022, 2014.

Ruddigkeit, L., van Deursen, R., Blum, L. C., and Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model., 52(11):2864–2875, 2012.

Schneider, P., Walters, W. P., Plowright, A. T., Sieroka, N., Listgarten, J., Goodnow, R. A., Fisher, J., Jansen, J. M., Duca, J. S., Rush, T. S., Zentgraf, M., Hill, J. E., Krutoholow, E., Kohler, M., Blaney, J., Funatsu, K., Luebkemann, C., and Schneider, G. Rethinking drug design in the artificial intelligence era. Nat. Rev. Drug Discovery, pp. 1–12, 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Schütt, K., Kindermans, P.-J., Sauceda Felix, H. E., Chmiela, S., Tkatchenko, A., and Müller, K.-R. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pp. 991–1001, 2017.

Schütt, K. T., Kessel, P., Gastegger, M., Nicoli, K. A., Tkatchenko, A., and Müller, K.-R. SchNetPack: A Deep Learning Toolbox For Atomistic Systems. J. Chem. Theory Comput., 2018a.

Schütt, K. T., Sauceda, H. E., Kindermans, P. J., Tkatchenko, A., and Müller, K. R. SchNet–A deep learning architecture for molecules and materials. J. Chem. Phys., 148(24), 2018b.

Segler, M. H. S., Kogej, T., Tyrchan, C., and Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent. Sci., 4(1):120–131, 2018.

Stewart, J. J. P. Optimization of parameters for semiempirical methods V: Modification of NDDO approximations and application to 70 elements. J. Mol. Model., 13(12):1173–1213, 2007.

Wei, E., Wicke, D., and Luke, S. Hierarchical approaches for reinforcement learning in parameterized action space. In 2018 AAAI Spring Symposium Series, 2018.

Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci., 28(1):31–36, 1988.

Wildman, S. A. and Crippen, G. M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5):868–873, 1999.

Xiong, J., Wang, Q., Yang, Z., Sun, P., Han, L., Zheng, Y., Fu, H., Zhang, T., Liu, J., and Liu, H. Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:1810.06394, 2018.

You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. In Advances in Neural Information Processing Systems, pp. 6410–6421, 2018.

Zhavoronkov, A., Ivanenkov, Y. A., Aliper, A., Veselov, M. S., Aladinskiy, V. A., Aladinskaya, A. V., Terentiev, V. A., Polykovskiy, D. A., Kuznetsov, M. D., Asadulaev, A., Volkov, Y., Zholus, A., Shayakhmetov, R. R., Zhebrak, A., Minaeva, L. I., Zagribelnyy, B. A., Lee, L. H., Soll, R., Madge, D., Xing, L., Guo, T., and Aspuru-Guzik, A. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol., 37(9):1038–1040, 2019.

Zhou, Z., Kearnes, S., Li, L., Zare, R. N., and Riley, P. Optimization of Molecules via Deep Reinforcement Learning. Sci. Rep., 9(1):1–10, 2019.
A. Quantum-Chemical Calculations

For the calculation of the energy E we use the fast semi-empirical Parametrized Method 6 (PM6) (Stewart, 2007). In particular, we use the implementation from the software package SPARROW (Husch et al., 2018; Bosia et al., 2019). For each calculation, a molecular charge of zero and the lowest possible spin multiplicity are chosen. All calculations are spin-unrestricted.

Limitations of semi-empirical methods are highlighted in, for example, recent work by Husch & Reiher (2018). More accurate methods such as approximate density functionals need to be employed, especially for systems containing transition metals.

For the quantum-chemical calculations to converge reliably, we ensured that atoms are not placed too close (< 0.6 Å) nor too far away from each other (> 2.0 Å). If the agent places an atom outside these boundaries, the minimum reward of −0.6 is awarded and the episode terminates.

B. Learning the Dihedral Angle

We experimentally validate the benefits of learning |ψ| ∈ [0, π] and κ ∈ {−1, 1} instead of ψ ∈ [−π, π] by comparing the two models on the single-bag task with bag CH4 (methane). Methane is one of the simplest molecules that requires the model to learn a dihedral angle. As shown in Fig. 9, learning the sign of the dihedral angle separately (with κ) speeds up learning significantly. In fact, the ablated model (without κ) fails to converge to the optimal return even after 100 000 steps (not shown).

Figure 9. Average offline performance on the single-bag task for the bag CH4 across 10 seeds. Estimating κ and |ψ| separately (with κ) significantly speeds up learning compared to estimating ψ directly (without κ). Error bars show two standard deviations. The dashed line denotes the optimal return.

C. Experimental Details

C.1. Model Architecture

The model architecture is summarized in Table 3. We initialize the biases of each MLP with 0 and each weight matrix as a (semi-)orthogonal matrix. After each hidden layer, a ReLU non-linearity is used. The output activations are shown in Table 3. As explained in the main text, both MLPf and MLPe use a masked softmax activation function to guarantee that only valid actions are chosen. Further, we rescale the continuous actions (µd, µα, µψ) ∈ [−1, 1]^3 predicted by MLPcont to ensure that µd ∈ [dmin, dmax], µα ∈ [0, π] and µψ ∈ [0, π]. For more details on the SchNet, see the original work (Schütt et al., 2018b).

Table 3. Model architecture for actor and critic networks.

Operation | Dimensionality                   | Activation
SchNet    | n(C) × 4, ∗, n(C) × 64           | ∗ (cf. Table 7)
MLPβ      | emax, 128, 32                    | linear
tile      | 32, n(C) × 32                    | —
concat    | n(C) × (64, 32), n(C) × 96       | —
MLPf      | n(C) × 96, n(C) × 128, n(C) × 1  | softmax
select    | n(C) × 96, 96                    | —
MLPe      | 96, 128, emax                    | softmax
concat    | (96, emax), 96 + emax            | —
MLPcont   | 96 + emax, 128, 3                | tanh
MLPκ      | 2 × 96, 2 × 128, 2 × 1           | softmax
pooling   | n(C) × 96, 96                    | —
MLPφ      | 96, 128, 128, 1                  | linear

C.2. Hyperparameters

We manually performed an initial hyperparameter search on a single holdout validation seed. The considered hyperparameters and the selected values are listed in Table 4 (single-bag), Table 5 (multi-bag) and Table 6 (solvation). The hyperparameters used for SchNet are shown in Table 7.

Table 4. Hyperparameters for the single-bag task. Adapted values for the scalability (large) experiment are in parentheses.

Hyperparameter          | Search set              | Value
Range [dmin, dmax] (Å)  | —                       | [0.95, 1.80]
Max. atomic number emax | —                       | 10
Workers                 | —                       | 16
Clipping ε              | —                       | 0.2
Gradient clipping       | —                       | 0.5
GAE parameter λ         | —                       | 0.95
VF coefficient c1       | —                       | 1
Entropy coefficient c2  | {0.00, 0.01, 0.03}      | 0.01
Training epochs         | {5, 10}                 | 5
Adam stepsize           | {10^−4, 3 × 10^−4}      | 3 × 10^−4
Discount γ              | {0.99, 1.00}            | 0.99
Time horizon T          | {192, 256}              | 192 (256)
Minibatch size          | {24, 32}                | 24 (32)

Table 5. Hyperparameters for the multi-bag task.

Hyperparameter          | Search set              | Value
Range [dmin, dmax] (Å)  | —                       | [0.95, 1.80]
Max. atomic number emax | —                       | 10
Workers                 | —                       | 16
Clipping ε              | —                       | 0.2
Gradient clipping       | —                       | 0.5
GAE parameter λ         | —                       | 0.95
VF coefficient c1       | —                       | 1
Entropy coefficient c2  | {0.00, 0.01, 0.03}      | 0.01
Training epochs         | {5, 10}                 | 5
Adam stepsize           | {10^−4, 3 × 10^−4}      | 3 × 10^−4
Discount γ              | {0.99, 1.00}            | 0.99
Time horizon T          | {384, 512}              | 384
Minibatch size          | {48, 64}                | 48

Table 6. Hyperparameters for the solvation task.

Hyperparameter          | Search set              | Value
Range [dmin, dmax] (Å)  | —                       | [0.90, 2.80]
Max. atomic number emax | —                       | 10
Distance penalty ρ      | —                       | 0.01
Workers                 | —                       | 16
Clipping ε              | —                       | 0.2
Gradient clipping       | —                       | 0.5
GAE parameter λ         | —                       | 0.95
VF coefficient c1       | —                       | 1
Entropy coefficient c2  | {0.00, 0.01, 0.03}      | 0.01
Training epochs         | {5, 10}                 | 5
Adam stepsize           | {10^−4, 3 × 10^−4}      | 3 × 10^−4
Discount γ              | {0.99, 1.00}            | 0.99
Time horizon T          | {384, 512}              | 384
Minibatch size          | {48, 64}                | 48

Table 7. Hyperparameters for SchNet (Schütt et al., 2018a) used in all experiments.

Hyperparameter                   | Search set     | Value
Number of interactions           | —              | 3
Cutoff distance (Å)              | —              | 5.0
Number of filters                | —              | 128
Number of atomic basis functions | {32, 64, 128}  | 64

D. Baselines

Below, we report how the baselines for the single-bag and multi-bag tasks were derived. First, we took all molecular structures for a given chemical formula (i.e. bag) from the QM9 dataset (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014). Subsequently, we performed a structure optimization using the PM6 method (as described in Section A) on the structures. This was necessary as the structures in this dataset were optimized with a different quantum-chemical method. Then, the most stable structure was selected and considered optimal for this chemical formula; the remaining structures were discarded. Since the undiscounted return is path independent, we determined the return R(s) by computing the total interaction energy in the canvas C, i.e.

    R(s) = E(C) − Σ_{i=1}^{N} E(e_i),    (9)

where N is the number of atoms placed on the canvas.

The baseline for the solvation task was determined in the following way. 12 molecular clusters were generated by randomly placing n H2O molecules around the solute molecule (in the main text, n = 5). Subsequently, the structure of these clusters was optimized with the PM6 method (as described in Section A). Similar to Eq. (9), the undiscounted return of each cluster can be computed:

    R(s) = E(C) − E(C_0) − Σ_{i=1}^{N} { E(e_i) + ρ ‖x_i‖_2 },    (10)

where the distance penalty ρ = 0.01. Finally, the maximum return over the optimized clusters was determined.

E. Additional Results

E.1. Single-bag Task

In Fig. 10, we show a selection of molecular structures generated by trained models for the bags C4H7N and C3H8O. Further, since the agent is agnostic to the concept of molecular bonds, it is able to build multiple molecules if this results in a higher return. An example of a bimolecular structure generated by a trained model for the bag C3H8O is shown in Fig. 11. Finally, in Fig. 12, we showcase a set of generated molecular structures that are not chemically valid.

Figure 10. Selection of molecular structures generated by trained models for the bags C4H7N (a) and C3H8O (b).

Figure 11. Bimolecular structure generated by a trained model for the bag C3H8O in the single-bag task.

Figure 12. Selection of chemically invalid molecular structures generated by trained models for the bags C3H8O (a), C3H5NO3 (b), and C4H7N (c).

E.2. Solvation Task

In Fig. 13, we report the average offline performances of agents placing 5 H2O molecules around the solutes (i.e., C0) acetonitrile and ethanol. As can be seen, the agents are able to accurately place water molecules such that they interact with the solute. However, we stress that more accurate quantum-chemical methods for computing the reward are required to describe hydrogen bonds to chemical accuracy.

Figure 13. Average offline performances across 10 seeds on the solvation task with n = 5 and the initial states being acetonitrile and ethanol. Error bars show two standard errors. The plot is smoothed across five evaluations for better readability. The dashed lines denote the optimal returns. A selection of molecular clusters generated by trained models is shown in circles.

In Fig. 14, we compare the average offline performance of two agents placing in total 10 H and 5 O atoms around a formaldehyde molecule. One agent is given 5 H2O bags consecutively, following the protocol of the solvation task as described in the main text; another is given a single H10O5 bag. Their average offline performances are shown in Fig. 14 in blue and red, respectively. It can be seen that giving the agent 5 H2O bags one at a time instead of a single H10O5 bag improves performance.

Figure 14. Average offline performance for the solvation task with n = 5 (blue) and placing atoms from a single H10O5 bag (red). In both experiments, C0 is formaldehyde. Error bars show two standard errors. The plot is smoothed across five evaluations for better readability. The dashed line denotes the optimal return. A selection of molecular clusters generated by models trained on the H10O5 bag is shown in red solid circles; for comparison, a stable configuration obtained through structure optimization is depicted in a black dashed circle.

E.3. Generalization and Transfer Learning

To assess the generalization capabilities of our agent when faced with previously unseen bags, we train an agent on bags A = {C2H2O2, C2H3N, C3H2O, C3N2O, CH3NO, CH4O} of size 6 and test on bags B = {C3H2O3, C3H4O, C4H2O2, CH4N2O, C4N2O2, C5H2O} of size 8. As shown in Fig. 15, the agent A/B achieves an average return of 1.79, which is approximately 88% of the optimal return. In comparison, an agent trained and tested on B (B/B) reaches an average return of 1.96 (or 97% of the optimal return). We additionally train an agent on A for 96 000 steps, and then fine-tune and test on B. The agent A → B/B reaches the same performance as if trained from scratch within 20 000 steps of fine-tuning, showing successful transfer. We anticipate that training on more bags and incorporating best practices from multi-task learning would further improve performance.

Figure 15. Average offline performance for agents A/B: trained on bags A of size 6 and tested on bags B of size 8; B/B: trained and tested on B; and A → B/B: trained on A for 96 000 steps, then fine-tuned and tested on B. See main text for more details. Error bars show two standard deviations. The dashed line denotes the optimal average return.
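The baseline returns in Eqs. (9) and (10) reduce to simple energy bookkeeping. The sketch below uses a stand-in `toy_energy` for illustration and testing; the actual reward uses PM6 energies computed with SPARROW, and all names here are illustrative rather than the paper's code:

```python
import math

# Stand-in for the PM6 energy of a molecular fragment (a list of atoms).
# Purely hypothetical; chosen so that combined fragments are "more stable".
def toy_energy(fragment):
    return -float(len(fragment)) ** 2

def return_single_bag(energy, canvas, atoms):
    # Eq. (9): R(s) = E(C) - sum_i E(e_i)
    return energy(canvas) - sum(energy([a]) for a in atoms)

def return_solvation(energy, canvas, solute, atoms, positions, rho=0.01):
    # Eq. (10): R(s) = E(C) - E(C0) - sum_i { E(e_i) + rho * ||x_i||_2 }
    penalty = sum(energy([a]) + rho * math.dist(p, (0.0, 0.0, 0.0))
                  for a, p in zip(atoms, positions))
    return energy(canvas) - energy(solute) - penalty

print(return_single_bag(toy_energy, ["O", "H", "H"], ["O", "H", "H"]))  # -6.0
```

Because the undiscounted return is path independent, evaluating these expressions on the final canvas is equivalent to summing the per-step rewards along any construction order.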
