
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

NeVAE: A Deep Generative Model for Molecular Graphs

Bidisha Samanta∗ (IIT Kharagpur, [email protected]), Abir De (MPI-SWS, [email protected]), Gourhari Jana (IIT Kharagpur, [email protected]), Pratim Kumar Chattaraj (IIT Kharagpur, [email protected]), Niloy Ganguly (IIT Kharagpur, [email protected]), Manuel Gomez Rodriguez (MPI-SWS, [email protected])

∗This work was partially done during B. Samanta's internship at MPI-SWS.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Deep generative models have been praised for their ability to learn smooth latent representations of images, text, and audio, which can then be used to generate new, plausible data. However, current generative models are unable to work with molecular graphs due to their unique characteristics—their underlying structure is not Euclidean or grid-like, they remain isomorphic under permutation of the node labels, and they come with a different number of nodes and edges. In this paper, we propose NeVAE, a novel variational autoencoder for molecular graphs, whose encoder and decoder are specially designed to account for the above properties by means of several technical innovations. In addition, by using masking, the decoder is able to guarantee a set of valid properties in the generated molecules. Experiments reveal that our model can discover plausible, diverse and novel molecules more effectively than several state of the art methods. Moreover, by utilizing Bayesian optimization over the continuous latent representation of molecules our model finds, we can also find molecules that maximize certain desirable properties more effectively than alternatives.

Introduction

Drug design aims to identify (new) molecules with a set of specified properties, which in turn results in a therapeutic benefit to a group of patients. However, drug design is still a lengthy, expensive, difficult, and inefficient process with a low rate of new therapeutic discovery (Paul et al. 2010), in which candidate molecules are produced through chemical synthesis or biological processes. In the context of computer-aided drug design (Merz, Ringe, and Reynolds 2010), there is great interest in developing automated, machine learning techniques to discover sizeable numbers of plausible, diverse and novel candidate molecules in the vast (10^23–10^60) and unstructured molecular space (Polishchuk, Madzhidov, and Varnek 2013). In recent years, there has been a flurry of work devoted to developing deep generative models for automatic molecule design (Dai et al. 2018; Kusner et al. 2017; Gómez-Bombarelli et al. 2016; Simonovsky and Komodakis 2018; Jin, Barzilay, and Jaakkola 2018), which has predominantly followed two strategies. The first strategy (Dai et al. 2018; Kusner et al. 2017; Gómez-Bombarelli et al. 2016) consists of representing molecules using a domain specific textual representation—SMILES strings—and then leveraging deep generative models for text generation for molecule design. Unfortunately, SMILES strings do not capture the structural similarity between molecules and, moreover, a molecule can have multiple SMILES representations. As a consequence, the generated molecules lack in terms of diversity and validity, as shown in Tables 1–2 and Figure 3. The second strategy (Simonovsky and Komodakis 2018; Jin, Barzilay, and Jaakkola 2018) consists of representing molecules using molecular graphs, rather than SMILES representations, and then developing deep generative models for molecular graphs, in which atoms correspond to nodes and bonds correspond to edges. However, current generative models for molecular graphs share one or more of the following limitations, which preclude them from realizing all their potential: (i) they can only generate (and be trained on) molecules with the same number of atoms while, in practice, molecules having similar properties often come with a different number of atoms and bonds; (ii) they are not invariant to permutations of their node labels, even though graphs remain isomorphic under permutation of their node labels; (iii) their training procedure suffers from a quadratic complexity with respect to the number of nodes in the graph, which makes it difficult to leverage a sizeable number of large molecules during training; and, (iv) they generate molecular graphs by combining a small set of molecular graphlets (or subgraphs). The above shortcomings constrain the diversity of the generated molecules, as shown in Table 1 and Figure 3.

In this paper, we develop NeVAE, a deep generative model for molecular graphs based on variational autoencoders that overcomes the above shortcomings. To do so, it relies on several technical innovations, which distinguish us from previous work (Dai et al. 2018; Kusner et al. 2017; Gómez-Bombarelli et al. 2016; Simonovsky and Komodakis 2018; Jin, Barzilay, and Jaakkola 2018):

(i) Our probabilistic encoder learns to aggregate information (e.g., atom and bond features) from a different number of hops away from a given atom and then map this aggregate information into a continuous latent space, as in inductive graph representation learning (Hamilton,
Ying, and Leskovec 2017; Lei et al. 2017). However, in contrast with inductive graph representation learning, the aggregator functions are learned via variational inference so that the resulting aggregator functions are especially well suited to enable the probabilistic decoder to generate new molecules rather than other downstream machine learning tasks such as, e.g., link prediction. Moreover, by using (symmetric) aggregator functions, it is invariant to permutations of the node labels and can encode graphs with a variable number of atoms, as opposed to existing graph generative models, with the notable exception of those based on GCNs (Kipf and Welling 2016b).

(ii) Our probabilistic decoder jointly represents all edges as an unnormalized log probability vector (or 'logit'), which then feeds a single multinomial edge distribution. Such a scheme allows for an efficient inference algorithm with O(l) complexity, where l is the number of true edges in the molecule, which is also invariant to permutations of the node labels. In contrast, previous work typically models the presence and absence of each potential edge using a Bernoulli distribution, and this leads to inference algorithms with O(n^2) complexity, where n is the number of nodes, which are not permutation invariant.

(iii) Our probabilistic decoder is able to guarantee a set of local structural and functional properties in the generated graphs by using a mask in the edge distribution definition, which can prevent the generation of certain undesirable edges during the decoding process. While masking has been increasingly used to account for prior (expert) knowledge in generative models (Gómez-Bombarelli et al. 2016; Kusner et al. 2017) based on SMILES, its use in generative models for molecular graphs has been lacking.

We evaluate our model using molecules from two publicly available datasets, ZINC (Irwin et al. 2012) and QM9 (Ramakrishnan et al. 2014), and show that our model beats the state of the art in terms of several relevant quality metrics, i.e., validity, novelty and uniqueness.

We also observe that the resulting latent space representation of molecules exhibits powerful semantics—we can smoothly interpolate between molecules—and generalization ability—we can generate (valid) molecules that are larger than any of the molecules in the datasets. Finally, by utilizing Bayesian optimization over the latent representation, we can also identify molecules that maximize certain desirable properties more effectively than alternatives. We are releasing an open source implementation of our model in Tensorflow.¹

¹https://github.com/Networks-Learning/nevae

Background on Variational Autoencoders

Variational autoencoders (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) are characterized by a probabilistic generative model pθ(x|z) of the observed variables x ∈ R^N given the latent variables z ∈ R^M, a prior distribution over the latent variables p(z), and an approximate probabilistic inference model qφ(z|x). In this characterization, pθ and qφ are arbitrary distributions parametrized by two (deep) neural networks θ and φ, and one can think of the generative model as a probabilistic decoder, which decodes latent variables into observed variables, and the inference model as a probabilistic encoder, which encodes observed variables into latent variables.

Ideally, if we use the maximum likelihood principle to train a variational autoencoder, we should optimize the marginal log-likelihood of the observed data, i.e., E_D[log pθ(x)], where p_D is the data distribution. Unfortunately, computing log pθ(x) requires marginalization with respect to the latent variable z, which is typically intractable. Therefore, one resorts to maximizing a variational lower bound or evidence lower bound (ELBO) of the log-likelihood of the observed data, i.e.,

max_θ max_φ E_D[ −KL(qφ(z|x) || p(z)) + E_{qφ(z|x)}[log pθ(x|z)] ].

Finally, note that the quality of this variational lower bound depends on the expressive ability of the approximate inference model qφ(z|x), which is typically assumed to be a normal distribution whose mean and variance are parametrized by a neural network φ with the observed data x as an input.
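Since qφ(z|x) is assumed to be a diagonal Gaussian and the prior p(z) a standard normal, the KL term in the ELBO above has a well-known closed form; this is a textbook identity included here for completeness rather than something specific to this paper:

```latex
% KL divergence between a diagonal Gaussian posterior and a standard normal prior,
% summed over the M latent dimensions (standard identity used when training VAEs).
\mathrm{KL}\big(\mathcal{N}(\mu,\operatorname{diag}(\sigma^{2}))\,\|\,\mathcal{N}(0,I)\big)
  \;=\; \tfrac{1}{2}\sum_{j=1}^{M}\left(\sigma_j^{2} + \mu_j^{2} - 1 - \log \sigma_j^{2}\right)
```

The remaining expectation E_{qφ(z|x)}[log pθ(x|z)] is typically estimated with Monte Carlo samples obtained via the reparameterization trick, z = μ + σ ⊙ ε with ε ∼ N(0, I), which keeps the objective differentiable with respect to φ.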
NeVAE: A Variational Autoencoder for Molecular Graphs

In this section, we first give a high-level overview of the design of NeVAE, our variational autoencoder for molecular graphs, starting from the data it is designed for. Then, we describe more in-depth the key technical aspects of its individual components. Finally, we elaborate on the training procedure, scalability and implementation details.

High-level overview. We observe a collection of N molecular graphs {G_i = (V_i, E_i)}_{i∈[N]}, where V_i and E_i denote the corresponding set of nodes (atoms) and edges (bonds), respectively, and this collection may contain graphs with a different number of nodes and edges. Moreover, for each molecular graph G = (V, E), we also observe a set of node features F = {f_u}_{u∈V} and edge weights Y = {y_uv}_{(u,v)∈E}. More specifically, the node features f_u are one-hot representations of the type of the atoms (i.e., C, H, N or O), and the edge weights y_uv are the bond types (i.e., single, double, triple). Our goal is then to design a variational autoencoder for molecular graphs that, once trained on this collection of graphs, has the ability of creating new plausible molecular graphs, including node features and edge weights. In doing so, it will also provide a latent representation of any graph in the collection (or elsewhere) with meaningful semantics.

Following the above background on variational autoencoders, we characterize NeVAE by means of:
— Prior: p(z_1, ..., z_n), where |V| = |F| = n ∼ Poisson(λ_n)
— Inference model (encoder): qφ(z_1, ..., z_n | V, E, F, Y)
— Generative model (decoder): pθ(E, F, Y | z_1, ..., z_n)

In the above characterization, note that we define one latent variable per node, i.e., we have a node-based latent representation, and the number of nodes is a random variable and, as a consequence, both the latent representation as well as the graph can vary in size. Next, we formally define the functional form of the inference model, the generative model, and the prior.
[Figure 1: schematic of the encoder, showing the per-node embeddings c_v(1), ..., c_v(K), the weight matrices W_1, ..., W_K, and the networks (W_μ, b_μ), (W_σ, b_σ) that produce the posterior parameters μ_u, σ_u and the latent codes z_u.]

Figure 1: The encoder of our variational autoencoder for molecular graphs. From left to right, given a molecular graph G with a set of node features F and edge weights Y, the encoder aggregates information from a different number of hops j ≤ K away for each node v ∈ G into an embedding vector c_v(j). These embeddings are fed into a differentiable function φ_enc which parameterizes the posterior distribution qφ, from which the latent representation of each node in the input graph is sampled.

Inference model (probabilistic encoder). Given a graph G = (V, E) with node features F and edge weights Y, our inference model qφ defines a probabilistic encoding for each node in the graph by aggregating information from different distances. More formally, for each node u, the inference model is defined as follows:

qφ(z_u | V, E, F, Y) ∼ N(μ_u, diag(σ_u))    (1)

where z_u is the latent variable associated to node u, [μ_u, diag(σ_u)] = φ_enc(c_u(1), ..., c_u(K)), and c_u(k) aggregates information from k hops away from u, i.e.,

c_u(k) = r(W_k f_u)                                             if k = 1,
c_u(k) = r(W_k (f_u ⊙ Λ(∪_{v∈N(u)} y_uv g(c_v(k−1)))))          if k > 1.    (2)

In the above, W_k are trainable weight matrices, which propagate information between different search depths, Λ(·) is a (possibly nonlinear) symmetric aggregator function of its arguments, g(·) and r(·) are (possibly nonlinear) differentiable functions, φ_enc is a neural network, and ⊙ denotes the pairwise product. Figure 1 describes our encoder architecture.

The above node embeddings, defined by Eq. 2, are very similar to the ones used in several graph representation learning algorithms such as GraphSAGE (Hamilton, Ying, and Leskovec 2017), column networks (Pham et al. 2017), and GCNs (Kipf and Welling 2016a); the main difference with our work is the way we train the weight matrices W_k. Here, we will use variational inference so that the resulting embeddings are especially well suited to enable our probabilistic decoder to generate new, plausible molecular graphs. In contrast, the above algorithms use non-variational approaches to compute general purpose embeddings to feed downstream machine learning tasks.
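To make the aggregation in Eq. 2 concrete, the following is a minimal NumPy sketch of the K-hop embedding computation for one graph. The specific choices here—r as a ReLU, g as the identity, Λ as an elementwise sum, and all layers kept at the feature dimension so that the pairwise product is well defined—are illustrative assumptions, since the paper only requires r and g to be differentiable and Λ to be symmetric.

```python
import numpy as np

def node_embeddings(adj, weights, feats, W, K):
    """Compute the embeddings c_u(1..K) of Eq. 2 for every node u.

    adj     : dict u -> list of neighbor nodes v
    weights : dict (u, v) -> scalar bond weight y_uv
    feats   : (n, d) array of one-hot atom features f_u
    W       : list of K (d, d) weight matrices W_1..W_K (trainable in the real model)
    """
    n, d = feats.shape
    relu = lambda x: np.maximum(x, 0.0)                       # r(.)
    c = {1: relu(feats @ W[0].T)}                             # c_u(1) = r(W_1 f_u)
    for k in range(2, K + 1):
        prev, rows = c[k - 1], []
        for u in range(n):
            # Lambda = weighted elementwise sum over neighbors (symmetric in its arguments)
            agg = sum((weights[(u, v)] * prev[v] for v in adj[u]), np.zeros(d))
            rows.append(relu(W[k - 1] @ (feats[u] * agg)))    # pairwise product f_u ⊙ Λ(...)
        c[k] = np.vstack(rows)
    return c  # dict k -> (n, d) embeddings, later fed to φ_enc to obtain (μ_u, σ_u)

# Toy usage: a 3-atom chain with bond weights 1 and 2.
rng = np.random.default_rng(0)
adj = {0: [1], 1: [0, 2], 2: [1]}
weights = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}
feats = np.eye(4)[[0, 1, 0]]                                  # one-hot atom types, d = 4
W = [rng.normal(size=(4, 4)) for _ in range(3)]
embeddings = node_embeddings(adj, weights, feats, W, K=3)
```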
The following proposition highlights several desirable theoretical properties of our probabilistic encoder (details in the arXiv version),² which distinguish our design from most existing generative models of graphs (Jin, Barzilay, and Jaakkola 2018; Simonovsky and Komodakis 2018):

Proposition 1 The probabilistic encoder defined by Eqs. 1 and 2 has the following properties:
(i) For each node u, its corresponding embedding c_u(k) is invariant to permutations of the node labels of its neighbors.
(ii) The weight matrices W_1, ..., W_K do not depend on the number of nodes and edges in the graph and thus a single encoder allows for graphs with a variable number of nodes and edges.

²https://arxiv.org/abs/1802.05283

Generative model (probabilistic decoder). Given a set of n nodes with latent variables Z = {z_u}_{u∈[n]}, our generative model pθ is defined as follows:

pθ(E, Y, F | Z) = pθ(F | Z) pθ(E, Y | Z),    (3)

with

pθ(F | Z) = ∏_{u∈V} pθ(f_u | Z),
pθ(E, Y | Z) = pθ(l | Z) · pθ(E, Y | Z, l),
pθ(E, Y | Z, l) = ∏_{k∈[l]} pθ(e_k | E_{k−1}, F, Z) pθ(y_{u_k v_k} | Y_{k−1}, F, Z),

where the ordering of the edges and edge weights is independent of the node labels and hence permutation invariant, e_k and y_{u_k v_k} denote the k-th edge and edge weight under the chosen order, and E_{k−1} = {e_1, ..., e_{k−1}} and Y_{k−1} = {y_{u_1 v_1}, ..., y_{u_{k−1} v_{k−1}}} denote the k − 1 previously generated edges and edge weights, respectively.

Moreover, the model characterizes the conditional probabilities in the above formulation as follows. For each node, it represents all potential node feature values f_u = q as an unnormalized log probability vector (or 'logits'), feeds this logit into a softmax distribution and samples the node features. Then, it represents the average number of edges as a logit, feeds this logit into a Poisson distribution and samples the number of edges. Finally, it represents all potential edges as logits and, for each edge, all potential edge weights as another logit, and it feeds the former vector into a single softmax distribution and the latter vectors each into a different softmax distribution. Moreover, the edge distribution and the corresponding edge weight distributions depend on a set of binary masks, which may depend on the sampled node features and also get updated every time a new edge and edge weight are sampled. By doing so, it prevents the generation of certain undesirable edges and edge weights, allowing the generated graph to fulfill a set of predefined local structural and functional properties.
Figure 2: The decoder of our variational autoencoder for molecular graphs. From left to right, the decoder first samples the number of nodes n = |V| from a Poisson distribution p_n(λ_n) and samples a latent vector z_u per node u ∈ V from N(0, I). Then, for each node u, it represents all potential node feature values as an unnormalized log probability vector (or 'logits'), where each entry is given by a nonlinearity θ_γ^dec of the corresponding latent representation z_u, feeds this logit into a softmax distribution and samples the node features. Next, it feeds all latent vectors Z into a nonlinear log intensity function θ_β^dec(Z), which is used to sample the number of edges. Thereafter, on the top row, it constructs a logit for all potential edges (u, v), where each entry is given by a nonlinearity θ_α^dec of the corresponding latent representations (z_u, z_v). Then, it samples the edges one by one from a softmax distribution depending on the logit and a mask x_e(E_{k−1}), which gets updated every time it samples a new edge e_k. On the bottom row, it constructs a logit per edge (u, v) for all potential edge weight values m, where each entry is given by a nonlinearity θ_ξ^dec of the latent representations of the edge and edge weight value (z_u, z_v, m). Then, every time it samples an edge, it samples the edge weight value from a softmax distribution depending on the corresponding logit and mask x_m(u, v), which gets updated every time it samples a new y_{u_k v_k}.

More formally, the distributions of each node feature, the number of edges, each edge and edge weight are given by:

pθ(f_u = q | Z) = exp(θ_γ^dec(z_u, q)) / Σ_{q'} exp(θ_γ^dec(z_u, q')),        pθ(l | Z) = p_l(exp(θ_β^dec(Z))),

pθ(e = (u, v) | E_{k−1}, Z) = x_e exp(θ_α^dec(z_u, z_v)) / Σ_{e'=(u',v') ∉ E_{k−1}} x_{e'} exp(θ_α^dec(z_{u'}, z_{v'})),

pθ(y_{uv} = m | Y_{k−1}, Z) = x_m(u, v) exp(θ_ξ^dec(z_u, z_v, m)) / Σ_{m' ≠ m} x_{m'}(u, v) exp(θ_ξ^dec(z_u, z_v, m')),

where p_l denotes a Poisson distribution, x_e is the binary mask for edge e, x_m(u, v) is the binary mask for edge weight value m, and θ_•^dec are neural networks. Note that the parameters of the neural networks do not depend on the number of nodes or edges in the molecular graph, and the dependency of the binary masks x_e and x_m(u, v) on the node features and the previously generated edges E_{k−1} and edge weights Y_{k−1} is deterministic and domain dependent. Figure 2 summarizes our decoder architecture.

Note that, by using a softmax distribution, it is only necessary to account for the presence of an edge, not its absence, and this, in combination with negative sampling, will allow for efficient training and decoding, as will become clear later in this section. This is in contrast with previous generative models for graphs (Kipf and Welling 2016b; Simonovsky and Komodakis 2018), which need to model both the presence and absence of each potential edge. Moreover, we would like to acknowledge that, while masking may be useful to account for prior (expert) knowledge, it may be costly to check for some local (or global) structural and functional properties on-the-fly.
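As an illustration of how the masked softmax distributions above can be sampled in practice, here is a minimal NumPy sketch of one decoding step. The dot-product logit functions are stand-ins for the neural networks θ_α^dec and θ_ξ^dec, and the valence-based weight mask mirrors the one used later in the experiments; all of these details are assumptions for the sake of the example rather than the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_softmax_sample(logits, mask):
    """Sample an index from softmax(logits) restricted to entries where mask == 1.
    Assumes at least one entry is unmasked (the paper rejects the edge otherwise)."""
    probs = np.where(mask > 0, np.exp(logits - logits.max()), 0.0)
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

def decode_one_edge(Z, existing_edges, valence, max_valence, n_bond_types=3):
    """One decoder step: pick an edge, then a bond type under a valence mask.

    Z              : (n, d) latent vectors z_u
    existing_edges : set of already generated edges (u, v) with u < v
    valence        : current valence n_k(u) per node; max_valence: m_max(u) per node
    """
    n = Z.shape[0]
    pairs = [(u, v) for u in range(n) for v in range(u + 1, n)]
    # Hypothetical edge logits theta_alpha(z_u, z_v); a dot product stands in for the network.
    edge_logits = np.array([Z[u] @ Z[v] for u, v in pairs])
    edge_mask = np.array([(u, v) not in existing_edges for u, v in pairs], dtype=float)
    u, v = pairs[masked_softmax_sample(edge_logits, edge_mask)]
    # Hypothetical weight logits theta_xi(z_u, z_v, m) for bond orders m = 1..n_bond_types.
    weight_logits = np.array([Z[u] @ Z[v] + m for m in range(1, n_bond_types + 1)])
    # Valence mask: bond order m is allowed only if it keeps both endpoints within max valence.
    weight_mask = np.array([
        float(valence[u] + m <= max_valence[u] and valence[v] + m <= max_valence[v])
        for m in range(1, n_bond_types + 1)])
    m = masked_softmax_sample(weight_logits, weight_mask) + 1
    return (u, v), m
```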
Prior. Given a set of n nodes with latent variables Z = {z_u}_{u∈[n]}, p_z(Z) ∼ N(0, I).

Training. Given a collection of N molecular graphs {G_i = (V_i, E_i)}_{i∈[N]}, each with n_i nodes, a set of node features F_i and a set of edge weights Y_i, we train our variational autoencoder for graphs by maximizing the evidence lower bound (ELBO), as described in the previous section, plus the log-likelihood of the Poisson distribution p_{λ_n} modeling the number of nodes in each graph. Hence we aim to solve:

maximize_{φ, θ, λ_n}  (1/N) Σ_{i∈[N]} [ E_{qφ(Z_i | V_i, E_i, F_i, Y_i)} log pθ(E_i, Y_i, F_i | Z_i) − KL(qφ || p_z) + log p_{λ_n}(n_i) ]    (4)

Note that, in the above objective, the computation of E_{qφ} log pθ(E_i, Y_i, F_i | Z_i) requires specifying an order for the edges present in the graph G_i. To determine this order, we use breadth-first traversals (BFS) with randomized tie breaking during the child-selection step. Such a tie breaking method makes the edge order independent of all node labels except for the source node label. Therefore, to make it completely permutation invariant, for each graph, we sample the source nodes from an arbitrary distribution. More formally, we replace log pθ(E_i, Y_i, F_i | Z_i) with log E_{s∼ζ(V_i)} pθ(E_i, Y_i, F_i | Z_i) for each graph G_i, where s is the randomly sampled source node for the BFS, and ζ is the sampling distribution for s. Note that the logarithm of a marginalized likelihood is difficult to compute. Fortunately, by using Jensen's inequality, we can obtain a lower bound on the actual likelihood:

log E_{s∼ζ(V_i)} pθ(E_i, Y_i, F_i | Z_i) ≥ E_{s∼ζ(V_i)} log pθ(E_i, Y_i, F_i | Z_i)
Therefore, to train our model, we maximize

(1/N) Σ_{i∈[N]} [ E_{qφ(Z_i | V_i, E_i, F_i, Y_i), s∼ζ(V_i)} log pθ(E_i, Y_i, F_i | Z_i) − KL(qφ || p_z) + log p_{λ_n}(n_i) ]    (5)

The following theorem points out the key property of our objective function (proven in the arXiv version).³

Theorem 2 If the source distribution ζ does not depend on the node labels, then the parameters learned by maximizing the objective in Eq. 5 are invariant to the permutations of the node labels.

³https://arxiv.org/abs/1802.05283
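To illustrate the edge ordering used in Eq. 5, the following is a small sketch of a breadth-first traversal with randomized tie breaking and a randomly drawn source node; the exact implementation details (uniform source distribution, how non-tree edges are appended, connected input graph) are illustrative assumptions, since the paper does not spell them out at this level. By Theorem 2, averaging the objective over such random sources keeps the learned parameters permutation invariant.

```python
import random
from collections import deque

def bfs_edge_order(adj, seed=None):
    """Return one edge ordering of a connected undirected graph via BFS with random tie breaking.

    adj : dict node -> set of neighbor nodes
    The source node is drawn uniformly at random (the distribution ζ in the paper).
    """
    rng = random.Random(seed)
    source = rng.choice(sorted(adj))            # s ~ ζ(V), here uniform over nodes
    order, visited, queue = [], {source}, deque([source])
    while queue:
        u = queue.popleft()
        children = list(adj[u] - visited)
        rng.shuffle(children)                   # randomized tie breaking among children
        for v in children:
            order.append((u, v))                # edge e_k = (u, v) in the chosen order
            visited.add(v)
            queue.append(v)
    # Remaining (cycle-closing) edges between already-visited nodes are appended afterwards.
    all_edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    tree_edges = {tuple(sorted(e)) for e in order}
    order.extend(sorted(all_edges - tree_edges))
    return order

# Example: a 4-cycle; repeated calls with different seeds give different but valid edge orders.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(bfs_edge_order(adj, seed=1))
```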
Scalability and implementation details. In terms of scalability, the major bottleneck is computing the gradient of the first term in Eq. 5 during training, rather than encoding and decoding graphs once the model is trained. More specifically, given a source node for a network without masks, an exact computation of the per-edge partition function of the log-likelihood of the edges, i.e., Σ_{e'=(u',v') ∉ E_{k−1}} exp(θ_α^dec(z_{u'}, z_{v'})), requires O(|V|²) computations, similarly as in most inference algorithms for existing generative models of graphs, and hence is costly to compute even for medium networks. Fortunately, in practice, we can approximate such a partition function using negative sampling (Mikolov et al. 2013), which reduces the likelihood computation to O(l), where l = |E| is the number of (true) edges in the graph. Therefore, for S samples of source nodes, the complexity becomes O(Sl). Here, note that most real-world graphs are sparse and thus l ≪ |V|².
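The following sketch shows one way the O(|V|²) partition function above could be approximated with negative sampling: the true edges are scored exactly and the mass of the non-edges is estimated from a small uniform subsample of node pairs. The dot-product score, the number of negatives, and this particular estimator are assumptions made for illustration, not the paper's exact recipe.

```python
import numpy as np

def exact_log_partition(Z, candidate_pairs):
    """Exact log of the partition function over all candidate edges: O(|V|^2) terms."""
    return np.log(sum(np.exp(Z[u] @ Z[v]) for u, v in candidate_pairs))

def sampled_log_partition(Z, true_edges, n_nodes, n_negative=20, rng=None):
    """Negative-sampling estimate with cost O(l + n_negative) instead of O(|V|^2)."""
    rng = rng or np.random.default_rng(0)
    true_set = {tuple(sorted(e)) for e in true_edges}
    pos = sum(np.exp(Z[u] @ Z[v]) for u, v in true_set)       # exact contribution of true edges
    neg = []
    while len(neg) < n_negative:                               # uniform sample of non-edges
        u, v = sorted(rng.choice(n_nodes, size=2, replace=False))
        if (u, v) not in true_set:
            neg.append(np.exp(Z[u] @ Z[v]))
    n_non_edges = n_nodes * (n_nodes - 1) // 2 - len(true_set)
    # Monte Carlo estimate of the non-edge mass: (#non-edges) * mean over the subsample.
    return np.log(pos + n_non_edges * np.mean(neg))
```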
Experiments on Real Data

In this section, we first show that our model beats several state of the art machine learning models for molecule design (Dai et al. 2018; Gómez-Bombarelli et al. 2016; Kusner et al. 2017; Simonovsky and Komodakis 2018; Jin, Barzilay, and Jaakkola 2018; Liu et al. 2018) in terms of several relevant quality metrics, i.e., validity, novelty and uniqueness. Then, by applying Bayesian optimization over the latent space of molecules provided by our encoder, we also show that our model can find a greater number of molecules that maximize certain desirable properties. Finally, we show that the continuous latent representations of molecules that our model finds are smooth.

Experimental setup. We sample ∼10,000 drug-like commercially available molecules from the ZINC dataset (Irwin et al. 2012) with E[n] = 44 atoms and ∼10,000 molecules from the QM9 dataset (Ramakrishnan et al. 2014; Ruddigkeit et al. 2012) with E[n] = 21 atoms. For each molecule, we construct a molecular graph, where nodes are the atoms, the node features are the types of the atoms, i.e., f_u ∈ {C, H, N, O}, edges are the bonds between two atoms, and the weight associated to an edge is the type of bond (single, double or triple).⁴ Then, for each dataset, we train our variational autoencoder for molecular graphs using batches comprised of molecules with the same number of nodes.⁵ Finally, we sample 10⁶ molecular graphs from each of the (two) trained variational autoencoders using: (i) G ∼ pθ(G|Z), where Z ∼ p(Z), and (ii) Z ∼ pθ(Z|G = GT), where GT is a molecular graph from the corresponding (training) dataset. In the above procedure, we only use masking on the weight (i.e., type of bond) distributions, both during training and sampling, to ensure that the valence of the nodes at both ends is valid at all times, i.e., x_m(u, v) = I(m + n_k(u) ≤ m_max(u) ∧ m + n_k(v) ≤ m_max(v)), where n_k(u) is the current valence of node u and m_max(u) is the maximum valence of node u, which depends on its type f_u. Moreover, during sampling, if there is no valid weight value for a sampled edge, we reject it. To assess to what extent masking helps, we also train and sample from our model without masking. Here, we would like to highlight that, while using masking during test does not lead to a significant increase in the time it takes to generate a graph, using masking during training does lead to an increase of 5% in training time.

We compare the quality of the molecules generated by our trained models and the molecules generated by several state of the art competing methods: (i) GraphVAE (Simonovsky and Komodakis 2018), (ii) GrammarVAE (Kusner et al. 2017), (iii) CVAE (Gómez-Bombarelli et al. 2016), (iv) SDVAE (Dai et al. 2018), (v) JTVAE (Jin, Barzilay, and Jaakkola 2018) and (vi) CGVAE (Liu et al. 2018). Among them, GraphVAE, JTVAE and CGVAE use molecular graphs; the rest of the methods use SMILES strings, a domain specific textual representation of molecules. We use the following evaluation metrics for performance comparison (a sketch of how they could be computed follows below):

(i) Novelty: we use this metric to evaluate to which degree a method generates novel molecules, i.e., molecules which were not present in the (training) dataset, i.e., Novelty = 1 − |C_s ∩ D|/|C_s|, where C_s is the set of generated molecules which are chemically valid, D is the training dataset, and Novelty ∈ [0, 1].

(ii) Uniqueness: we use this metric to evaluate to what extent a method generates unique chemically valid molecules. We define Uniqueness = |set(C_s)|/n_s, where n_s is the number of generated molecules, and Uniqueness ∈ [0, 1].

(iii) Validity: we use this metric to evaluate to which degree a method generates chemically valid molecules.⁶ That is, Validity = |C_s|/n_s, where n_s is the number of generated molecules, C_s is the set of generated molecules which are chemically valid, and note that Validity ∈ [0, 1].

⁴We have not selected any molecule whose bond types are other than these three.
⁵We batch graphs with respect to the number of nodes for efficiency reasons since, every time that the number of nodes changes, we need to change the size of the computational graph in Tensorflow.
⁶We used the open source cheminformatics suite RDKit (http://www.rdkit.org) to check the validity of a generated molecule.
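As a concrete illustration, the three metrics above could be computed with RDKit roughly as follows, assuming the generated molecules and the training set are available as SMILES strings; using canonical SMILES as the molecule identity is an implementation assumption rather than something the paper specifies.

```python
from rdkit import Chem

def evaluate(generated_smiles, training_smiles):
    """Compute Validity, Novelty and Uniqueness as defined above."""
    n_s = len(generated_smiles)
    # C_s: generated molecules that RDKit can parse and sanitize (chemically valid),
    # represented by their canonical SMILES.
    valid = [Chem.MolToSmiles(m) for m in
             (Chem.MolFromSmiles(s) for s in generated_smiles) if m is not None]
    # Training molecules are assumed to be valid SMILES.
    train = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    c_s = set(valid)
    validity = len(valid) / n_s
    novelty = 1.0 - len(c_s & train) / len(c_s) if c_s else 0.0
    uniqueness = len(c_s) / n_s
    return validity, novelty, uniqueness
```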
Novelty
Dataset   NeVAE   NeVAE∗   GraphVAE   GrammarVAE   CVAE    SDVAE   JTVAE   CGVAE
ZINC      1.000   1.000    -          1.000        0.980   1.000   0.999   1.000
QM9       1.000   1.000    0.661      1.000        0.902   -       1.000   0.943

Uniqueness
Dataset   NeVAE   NeVAE∗   GraphVAE   GrammarVAE   CVAE    SDVAE   JTVAE   CGVAE
ZINC      0.999   0.588    -          0.273        0.021   1.000   0.991   0.998
QM9       0.998   0.676    0.305      0.197        0.031   -       0.371   0.986

Table 1: Novelty and Uniqueness of the molecules generated using NeVAE and all baselines. The sign ∗ indicates no masking. For both datasets, we report Novelty (Uniqueness) over the valid (10⁶) sampled molecules.

Validity
Dataset   Sampling type    NeVAE   NeVAE∗   GraphVAE   GrammarVAE   CVAE    SDVAE   JTVAE   CGVAE
ZINC      Z ∼ P(Z)         1.000   0.590    0.135      0.440        0.021   0.432   1.000   1.000
ZINC      Z ∼ P(Z|GT)      1.000   0.580    -          0.381        0.175   -       1.000   -
QM9       Z ∼ P(Z)         0.999   0.682    0.458      0.200        0.031   -       0.997   1.000
QM9       Z ∼ P(Z|GT)      0.999   0.660    -          0.301        0.100   -       0.965   -

Table 2: Validity of the molecules generated using NeVAE and all baselines. The sign ∗ indicates no masking. For both datasets, we report the numbers over 10⁶ sampled molecules.

Quality of the generated molecules. Tables 1–2 compare our trained models to the state of the art methods above in terms of novelty, uniqueness, and validity. For GraphVAE and CGVAE we report the results reported in the respective papers and, for SDVAE, since there was no public domain implementation of these methods at the time of writing, we have used the sampled molecules from the prior provided by the authors for the ZINC dataset. For CVAE, GrammarVAE and JTVAE, we run their public domain implementations on the same set of molecules that we used. We find that, in terms of novelty, both our trained models and all competing methods except for GraphVAE, which assumes a fixed number of nodes, are able to (almost) always generate novel molecules. However, we would also like to note that novelty is only defined over chemically valid molecules. Therefore, despite having (almost) perfect novelty scores, all baselines except JTVAE generate significantly fewer novel molecules than our method. In terms of uniqueness, which is defined over the set of sampled molecules, we observe that all baseline methods, except CGVAE (for ZINC and QM9) and JTVAE (for ZINC), perform very poorly in both datasets in comparison with NeVAE. In terms of validity, our trained model significantly outperforms four competing methods—GraphVAE, GrammarVAE, CVAE and SDVAE—even without the use of masking, and achieves a comparable performance to JTVAE and CGVAE. In contrast to our model, GrammarVAE, CVAE and SDVAE use SMILES, a domain specific string based representation, and thus they may be constrained by its limited expressiveness. Among them, GrammarVAE and SDVAE achieve better performance by using a grammar to favor valid molecules. GraphVAE generates molecular graphs, as our model does; however, its performance is inferior to our method because it assumes a fixed number of nodes, it samples edges independently from a Bernoulli distribution, and it is not permutation invariant.

Objective                       NeVAE   GrammarVAE   CVAE    JTVAE
LL                              -1.45   -1.75        -2.29   -1.54
RMSE                             1.23    1.38         1.80    1.25
Fraction of valid molecules      1.00    0.77         0.53    1.00
Fraction of unique molecules     0.58    0.29         0.41    0.32

Table 3: Property prediction performance (LL and RMSE) using Sparse Gaussian processes (SGPs) and property maximization using Bayesian Optimization (BO).

[Figure 3: two panels comparing GrammarVAE, CVAE, JTVAE and NeVAE; panel (a) plots the number of unique molecules against the number of BO iterations, panel (b) plots the scores y(m) of the i-th best molecule.]

Figure 3: Property maximization using Bayesian optimization. Each plot shows the values of y(m) in decreasing order for unique molecules m. Panel (a) shows the variation of Uniqueness with the number of BO iterations. Panel (b) shows the values of y(m) sorted in decreasing order.

[Figure 4: the three best molecules found, with scores y(m) = 2.826 (1st), y(m) = 2.477 (2nd) and y(m) = 2.299 (3rd).]

Figure 4: Best molecules found by Bayesian Optimization (BO) using our model.

Bayesian optimization. Here, we leverage our model to discover novel molecules with desirable properties. Similarly as in previous work (Gómez-Bombarelli et al. 2016; Kusner et al. 2017; Jin, Barzilay, and Jaakkola 2018), we use Bayesian optimization (BO) to identify novel molecules with a high value of the octanol-water partition coefficient (logP) y(m), penalized by the synthetic accessibility (SA) score and the number of long cycles. More specifically, we first sample 3,000 molecules from our ZINC dataset, which we split into training (90%) and test (10%) sets.
[Figure 5: grids of sampled molecules for (a) the ZINC dataset and (b) the QM9 dataset, each molecule annotated with its (molecular weight, synthetic accessibility score) pair.]

Figure 5: Molecules sampled using the probabilistic decoder G ∼ pθ(G|Z), where Z = {z_i + a_i z_i | z_i ∈ Z_0, a_i ≥ 0} and the a_i are given parameters. In each row, we use the same molecule, set a_i > 0 for a single arbitrary node i (denoted as •) and set a_j = 0, j ≠ i, for the remaining nodes. Under each molecule we report its molecular weight and synthetic accessibility score.
Then, for our model and each competing model with public domain implementations, we train a sparse Gaussian process (SGP) (Snelson and Ghahramani 2006) with the latent representations and y(m) values of 100 inducing points sampled from the training set. The SGPs allow us to make predictions for the property values of new molecules in the latent spaces. Then, we run 5 iterations of batch Bayesian optimization (BO) using the expected improvement (EI) heuristic (Jones, Schonlau, and Welch 1998), with 50 (new) latent vectors (molecules) per iteration. Here, we compare the performance of all models using several quality measures: (a) the predictive performance of the trained SGPs in terms of log-likelihood (LL) and root mean square error (RMSE) on the test set, and (b) the average value E[y(m)], the fraction of valid molecules, and the fraction of good molecules, i.e., y(m) > 0, among the molecules found using EI.

Table 3, Figure 3 and Figure 4 summarize the results. In terms of log-likelihood and RMSE, the SGP trained using the latent representations provided by our model outperforms all baselines. In terms of the property values E[y(m)] of the discovered molecules and the fraction of valid and good molecules, BO under NeVAE also outperforms all baselines. Here, we would like to highlight that, while BO under JTVAE is able to find a few molecules with larger property value than BO under NeVAE, it is unable to discover a sizeable set of unique molecules with high property values.
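For readers unfamiliar with this procedure, below is a minimal, self-contained sketch of expected-improvement Bayesian optimization over a continuous latent space. The RBF kernel, the noise level, the random candidate pool and the toy objective are stand-ins for the sparse GP, the decoder and the penalized logP score used in the paper, so the sketch illustrates the loop rather than reproducing the actual pipeline.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xq, noise=1e-3):
    """GP regression posterior mean/std at query points Xq (plays the role of the SGP)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Kq = rbf(Xq, X)
    mu = Kq @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Kq, np.linalg.solve(K, Kq.T))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def y_toy(Z):   # toy stand-in for the penalized logP of the decoded molecules
    return -np.sum((Z - 0.5) ** 2, axis=1)

d = 8                                              # latent dimensionality
Z_train = rng.normal(size=(100, d))                # latent codes of "training" molecules
y_train = y_toy(Z_train)
for it in range(5):                                # 5 BO iterations, 50 new latent vectors each
    candidates = rng.normal(size=(2000, d))
    mu, sigma = gp_posterior(Z_train, y_train, candidates)
    ei = expected_improvement(mu, sigma, y_train.max())
    batch = candidates[np.argsort(-ei)[:50]]       # batch EI: take the 50 most promising codes
    Z_train = np.vstack([Z_train, batch])          # decode + evaluate, then refit the GP
    y_train = np.concatenate([y_train, y_toy(batch)])
print("best property value found:", y_train.max())
```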
[Figure 6: a set of molecules sampled from the decoder for a fixed latent representation Z of one ZINC molecule.]

Figure 6: Molecules sampled using the probabilistic decoder, i.e., G_i ∼ pθ(G|Z), given the (sampled) latent representation Z of a given molecule G from the ZINC dataset. The sampled molecules are topologically similar to each other as well as to the original.

Smooth latent space of molecules. In this section, we first demonstrate (qualitatively) that the latent space of molecules inferred by our model is smooth. Given a molecule, along with its associated graph G, node features F and edge weights Y, we first sample its latent representation Z using our probabilistic encoder, i.e., Z ∼ qφ(Z|G, F, Y). Then, given this latent representation, we generate various molecular graphs by sampling from our probabilistic decoder, i.e., G_i ∼ pθ(G|Z). Figure 6 summarizes the results for one molecule from the ZINC dataset, which show that the sampled molecules are topologically similar to the given molecule.

Next, we show that our encoder, once trained, creates a latent space representation of molecules with powerful semantics. In particular, since each node in a molecule has a latent representation, we can make fine-grained changes to the structure of a molecule by perturbing the latent representation of single nodes. To this aim, we proceed by first selecting one molecule with n nodes from the ZINC dataset. Given its corresponding graph, node features and edge weights, G, F and Y, we sample its latent representation Z_0. Then, we sample new molecular graphs G from the probabilistic decoder G ∼ pθ(G|Z), where Z = {z_i + a_i z_i | z_i ∈ Z_0, a_i ≥ 0} and the a_i are given parameters. Figure 5 provides several examples across both datasets, which show that the latent space representation is smooth and, as the distance from the initial molecule increases in the latent space, the resulting molecule differs more from the original. Here, note that the interpolation is smooth both in terms of graph structure and relevant chemical properties, e.g., synthetic accessibility score and molecular weight.
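A small sketch of the per-node perturbation described above, assuming the latent codes of the selected molecule are available as a NumPy array; the node index and the scaling factors a_i are hypothetical values chosen for illustration.

```python
import numpy as np

def perturb_single_node(Z0, node, a):
    """Scale the latent vector of one node by (1 + a), leaving all other nodes untouched,
    i.e., z_i <- z_i + a * z_i for the chosen node i and a_j = 0 for every j != i."""
    Z = Z0.copy()
    Z[node] = Z0[node] + a * Z0[node]
    return Z

Z0 = np.random.default_rng(0).normal(size=(12, 16))   # 12 atoms, 16-dimensional latent codes
interpolants = [perturb_single_node(Z0, node=3, a=a) for a in (0.0, 0.5, 1.0, 2.0)]
# Each entry would then be fed to the decoder, G ~ p_theta(G | Z), to produce Figure 5-style rows.
```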
Conclusions

In this work, we have introduced a variational autoencoder for molecular graphs that is permutation invariant to the node labels of the graphs it is trained with and allows for graphs with a different number of nodes and edges. Moreover, the decoder is able to guarantee a set of local structural and functional properties in the generated graphs through masking. Finally, we have shown that our variational autoencoder can also be used to discover valid and diverse molecules with certain desirable properties more effectively than several state of the art methods.

Our work also opens many interesting venues for future work. For example, in the design of our variational autoencoder, we have assumed graphs to be static; however, it would be interesting to augment our design to dynamic graphs by, e.g., incorporating a recurrent neural network. We have performed experiments on a single real-world application, i.e., automatic chemical design; however, it would be interesting to explore other applications, e.g., an end-to-end generative modeling of molecules with specified properties.

Acknowledgements. B. Samanta was supported by a Google India Ph.D. Fellowship and the "Learning Representations from Network Data" project sponsored by Intel. P. K. Chattaraj would like to thank DST, New Delhi for the J. C. Bose National Fellowship.

References

Dai, H.; Tian, Y.; Dai, B.; Skiena, S.; and Song, L. 2018. Syntax-directed variational autoencoder for structured data. In ICLR.

Gómez-Bombarelli, R.; Duvenaud, D.; Hernández-Lobato, J. M.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2016. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415.

Hamilton, W.; Ying, R.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NIPS.

Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; and Coleman, R. G. 2012. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 52(7):1757–1768.

Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364.

Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4):455–492.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kipf, T. N., and Welling, M. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kipf, T. N., and Welling, M. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Kusner, M. J.; Paige, B.; and Hernández-Lobato, J. M. 2017. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925.

Lei, T.; Jin, W.; Barzilay, R.; and Jaakkola, T. 2017. Deriving neural architectures from sequence and graph kernels. In ICML.

Liu, Q.; Allamanis, M.; Brockschmidt, M.; and Gaunt, A. L. 2018. Constrained graph variational autoencoders for molecule design. arXiv preprint arXiv:1805.09076.

Merz, K. M.; Ringe, D.; and Reynolds, C. H. 2010. Drug Design: Structure- and Ligand-Based Approaches. Cambridge University Press.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, C. C.; Munos, B. H.; Lindborg, S. R.; and Schacht, A. L. 2010. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery 9(3):203.

Pham, T.; Tran, T.; Phung, D. Q.; and Venkatesh, S. 2017. Column networks for collective classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2485–2491.

Polishchuk, P. G.; Madzhidov, T. I.; and Varnek, A. 2013. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design 27(8):675–679.

Ramakrishnan, R.; Dral, P. O.; Rupp, M.; and von Lilienfeld, O. A. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1:140022.

Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; and Reymond, J.-L. 2012. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling 52(11):2864–2875.

Simonovsky, M., and Komodakis, N. 2018. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480.

Snelson, E., and Ghahramani, Z. 2006. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems.
