Ying, and Leskovec 2017; Lei et al. 2017). However, in contrast with inductive graph representation learning, the aggregator functions are learned via variational inference so that the resulting aggregator functions are especially well suited to enable the probabilistic decoder to generate new molecules, rather than other downstream machine learning tasks such as, e.g., link prediction. Moreover, by using (symmetric) aggregator functions, it is invariant to permutations of the node labels and can encode graphs with a variable number of atoms, as opposed to existing graph generative models, with the notable exception of those based on GCNs (Kipf and Welling 2016b).

(ii) Our probabilistic decoder jointly represents all edges as an unnormalized log probability vector (or 'logit'), which then feeds a single multinomial edge distribution. Such a scheme allows for an efficient inference algorithm with O(l) complexity, where l is the number of true edges in the molecules, and it is also invariant to permutations of the node labels. In contrast, previous work typically models the presence and absence of each potential edge using a Bernoulli distribution, which leads to inference algorithms with O(n^2) complexity, where n is the number of nodes, and which are not permutation invariant.

(iii) Our probabilistic decoder is able to guarantee a set of local structural and functional properties in the generated graphs by using a mask in the edge distribution definition, which can prevent the generation of certain undesirable edges during the decoding process. While masks have been increasingly used to account for prior (expert) knowledge in generative models based on SMILES (Gómez-Bombarelli et al. 2016; Kusner et al. 2017), their use in generative models for molecular graphs has been lacking.

We evaluate our model using molecules from two publicly available datasets, ZINC (Irwin et al. 2012) and QM9 (Ramakrishnan et al. 2014), and show that our model beats the state of the art in terms of several relevant quality metrics, i.e., validity, novelty and uniqueness. We also observe that the resulting latent space representation of molecules exhibits powerful semantics—we can smoothly interpolate between molecules—and generalization ability—we can generate (valid) molecules that are larger than any of the molecules in the datasets. Finally, by utilizing Bayesian optimization over the latent representation, we can also identify molecules that maximize certain desirable properties more effectively than alternatives. We are releasing an open source implementation of our model in Tensorflow.[1]

[1] https://github.com/Networks-Learning/nevae

Background on Variational Autoencoders

Variational autoencoders (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) are characterized by a probabilistic generative model pθ(x|z) of the observed variables x ∈ R^N given the latent variables z ∈ R^M, a prior distribution over the latent variables p(z), and an approximate probabilistic inference model qφ(z|x). In this characterization, pθ and qφ are arbitrary distributions parametrized by two (deep) neural networks θ and φ, and one can think of the generative model as a probabilistic decoder, which decodes latent variables into observed variables, and of the inference model as a probabilistic encoder, which encodes observed variables into latent variables.

Ideally, if we use the maximum likelihood principle to train a variational autoencoder, we should optimize the marginal log-likelihood of the observed data, i.e., E_D[log pθ(x)], where p_D is the data distribution. Unfortunately, computing log pθ(x) requires marginalization with respect to the latent variable z, which is typically intractable. Therefore, one resorts to maximizing a variational lower bound or evidence lower bound (ELBO) of the log-likelihood of the observed data, i.e.,

\[
\max_{\theta} \max_{\phi} \; \mathbb{E}_{D}\Big[ -\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z|x)} \log p_\theta(x|z) \Big].
\]

Finally, note that the quality of this variational lower bound depends on the expressive ability of the approximate inference model qφ(z|x), which is typically assumed to be a normal distribution whose mean and variance are parametrized by a neural network φ with the observed data x as an input.

NeVAE: A Variational Autoencoder for Molecular Graphs

In this section, we first give a high-level overview of the design of NeVAE, our variational autoencoder for molecular graphs, starting from the data it is designed for. Then, we describe more in depth the key technical aspects of its individual components. Finally, we elaborate on the training procedure, scalability and implementation details.

High-level overview. We observe a collection of N molecular graphs {Gi = (Vi, Ei)}i∈[N], where Vi and Ei denote the corresponding set of nodes (atoms) and edges (bonds), respectively, and this collection may contain graphs with a different number of nodes and edges. Moreover, for each molecular graph G = (V, E), we also observe a set of node features F = {fu}u∈V and edge weights Y = {yuv}(u,v)∈E. More specifically, the node features fu are one-hot representations of the types of the atoms (i.e., C, H, N or O), and the edge weights yuv are the bond types (i.e., single, double, triple). Our goal is then to design a variational autoencoder for molecular graphs that, once trained on this collection of graphs, has the ability to create new plausible molecular graphs, including node features and edge weights. In doing so, it will also provide a latent representation of any graph in the collection (or elsewhere) with meaningful semantics.

Following the above background on variational autoencoders, we characterize NeVAE by means of:
— Prior: p(z1, ..., zn), where |V| = |F| = n ∼ Poisson(λn)
— Inference model (encoder): qφ(z1, ..., zn | V, E, F, Y)
— Generative model (decoder): pθ(E, F, Y | z1, ..., zn)
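Put together, these three components define a simple generative story: sample the number of atoms from the Poisson prior, sample one latent vector per atom from the standard normal prior, and decode node features, edges and edge weights. The sketch below is purely illustrative; `decode_features`, `decode_edges`, `lambda_n` and `latent_dim` are placeholder names for this sketch, not part of the released Tensorflow implementation.

```python
import numpy as np

def generate_molecule(lambda_n, latent_dim, decode_features, decode_edges, rng=np.random):
    """Schematic generative story of NeVAE; decode_* are placeholder callables."""
    n = max(1, rng.poisson(lambda_n))      # number of atoms, n ~ Poisson(lambda_n)
    Z = rng.normal(size=(n, latent_dim))   # one latent vector per node, z_u ~ N(0, I)
    F = decode_features(Z)                 # node features (atom types), p_theta(F | Z)
    E, Y = decode_edges(Z, F)              # edges and edge weights (bonds), p_theta(E, Y | Z)
    return F, E, Y
```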
Figure 1: The encoder of our variational autoencoder for molecular graphs. From left to right, given a molecular graph G with a set of node features F and edge weights Y, the encoder aggregates information from a different number of hops j ≤ K away for each node v ∈ G into an embedding vector cv(j). These embeddings are fed into a differentiable function φenc, which parameterizes the posterior distribution qφ, from which the latent representation of each node in the input graph is sampled.
In the above characterization, note that we define one latent variable per node, i.e., we have a node-based latent representation, and the number of nodes is a random variable; as a consequence, both the latent representation as well as the graph can vary in size. Next, we formally define the functional form of the inference model, the generative model, and the prior.

Inference model (probabilistic encoder). Given a graph G = (V, E) with node features F and edge weights Y, our inference model qφ defines a probabilistic encoding for each node in the graph by aggregating information from different distances. More formally, for each node u, the inference model is defined as follows:

\[
q_\phi(z_u \mid V, E, F, Y) \sim \mathcal{N}\big(\mu_u, \mathrm{diag}(\sigma_u)\big), \qquad (1)
\]

where zu is the latent variable associated to node u, [µu, diag(σu)] = φenc(cu(1), ..., cu(K)), and cu(k) aggregates information from k hops away from u, i.e.,

\[
c_u(k) =
\begin{cases}
r\big(W_k f_u\big) & \text{if } k = 1, \\
r\Big(W_k f_u \odot \Lambda\big(\cup_{v \in \mathcal{N}(u)} y_{uv}\, g(c_v(k-1))\big)\Big) & \text{if } k > 1.
\end{cases}
\qquad (2)
\]

In the above, Wk are trainable weight matrices, which propagate information between different search depths, Λ(·) is a (possibly nonlinear) symmetric aggregator function in its arguments, g(·) and r(·) are (possibly nonlinear) differentiable functions, φenc is a neural network, and ⊙ denotes pairwise product. Figure 1 describes our encoder architecture.

The above node embeddings, defined by Eq. 2, are very similar to the ones used in several graph representation learning algorithms such as GraphSAGE (Hamilton, Ying, and Leskovec 2017), column networks (Pham et al. 2017), and GCNs (Kipf and Welling 2016a); the main difference with our work is the way we will train the weight matrices Wk. Here, we will use variational inference so that the resulting embeddings are especially well suited to enable our probabilistic decoder to generate new, plausible molecular graphs. In contrast, the above algorithms use non-variational approaches to compute general purpose embeddings to feed downstream machine learning tasks.

The following proposition highlights several desirable theoretical properties of our probabilistic encoder (details in the arXiv version),[2] which distinguish our design from most existing generative models of graphs (Jin, Barzilay, and Jaakkola 2018; Simonovsky and Komodakis 2018):

Proposition 1 The probabilistic encoder defined by Eqs. 1 and 2 has the following properties:
(i) For each node u, its corresponding embedding cu(k) is invariant to permutations of the node labels of its neighbors.
(ii) The weight matrices W1, ..., WK do not depend on the number of nodes and edges in the graph and thus a single encoder allows for graphs with a variable number of nodes and edges.

Generative model (probabilistic decoder). Given a set of n nodes with latent variables Z = {zu}u∈[n], our generative model pθ is defined as follows:

\[
p_\theta(E, Y, F \mid Z) = p_\theta(F \mid Z)\, p_\theta(E, Y \mid Z), \qquad (3)
\]

with

\[
p_\theta(F \mid Z) = \prod_{u \in V} p_\theta(f_u \mid Z), \qquad
p_\theta(E, Y \mid Z) = p_\theta(l \mid Z)\, p_\theta(E, Y \mid Z, l),
\]
\[
p_\theta(E, Y \mid Z, l) = \prod_{k \in [l]} p_\theta(e_k \mid E_{k-1}, F, Z)\, p_\theta(y_{u_k v_k} \mid Y_{k-1}, F, Z),
\]

where the ordering of the edges and edge weights is independent of the node labels and hence permutation invariant, ek and y_{uk vk} denote the k-th edge and edge weight under the chosen order, and Ek−1 = {e1, ..., ek−1} and Yk−1 = {y_{u1 v1}, ..., y_{uk−1 vk−1}} denote the k−1 previously generated edges and edge weights, respectively.

Moreover, the model characterizes the conditional probabilities in the above formulation as follows. For each node, it represents all potential node feature values fu = q as an unnormalized log probability vector (or 'logits'), feeds this logit into a softmax distribution and samples the node features. Then, it represents the average number of edges as a logit, feeds this logit into a Poisson distribution and samples the number of edges. Finally, it represents all potential edges as logits and, for each edge, all potential edge weights as another logit, and it feeds the former vector into a single softmax distribution and the latter vectors each into a different softmax distribution. Moreover, the edge distribution and the corresponding edge weight distributions depend on a set of binary masks, which may depend on the sampled node features and also get updated every time a new edge and edge weight are sampled. By doing so, it prevents the generation of certain undesirable edges and edge weights, allowing the generated graph to fulfill a set of predefined local structural and functional properties.

[2] https://arxiv.org/abs/1802.05283
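For concreteness, a minimal NumPy sketch of the aggregation in Eq. 2 is given below. It assumes a sum aggregator for Λ and ReLU nonlinearities for r and g; these are illustrative choices, since the model leaves the exact functional forms open, and the sketch is not the released Tensorflow implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def node_embeddings(features, neighbors, weights, W, K):
    """Compute c_u(1..K) as in Eq. 2 (sketch: Lambda = sum, r = g = ReLU).

    features : dict u -> one-hot feature vector f_u
    neighbors: dict u -> list of neighbor nodes of u
    weights  : dict (u, v) -> bond type y_uv (symmetric: both (u, v) and (v, u) present)
    W        : list of K weight matrices; W[k] maps feature dim -> embedding dim
    """
    c = {u: {} for u in features}
    for u in features:
        c[u][1] = relu(W[0] @ features[u])                      # k = 1: r(W_1 f_u)
    for k in range(2, K + 1):
        for u in features:
            # Lambda = sum over neighbors of y_uv * g(c_v(k-1))
            agg = sum(weights[(u, v)] * relu(c[v][k - 1]) for v in neighbors[u])
            # elementwise product with W_k f_u, then r(.)
            c[u][k] = relu((W[k - 1] @ features[u]) * agg)
    return c
```

The concatenation (cu(1), ..., cu(K)) would then be fed to φenc to produce µu and σu in Eq. 1.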
Figure 2: The decoder of our variational autoencoder for molecular graphs. From left to right, the decoder first samples the number of nodes n = |V| from a Poisson distribution pn(λn) and samples a latent vector zu per node u ∈ V from N(0, I). Then, for each node u, it represents all potential node feature values as an unnormalized log probability vector (or 'logits'), where each entry is given by a nonlinearity θγ^dec of the corresponding latent representation zu, feeds this logit into a softmax distribution and samples the node features. Next, it feeds all latent vectors Z into a nonlinear log intensity function θβ^dec(Z), which is used to sample the number of edges. Thereafter, on the top row, it constructs a logit for all potential edges (u, v), where each entry is given by a nonlinearity θα^dec of the corresponding latent representations (zu, zv). Then, it samples the edges one by one from a softmax distribution depending on the logit and a mask xe(Ek−1), which gets updated every time it samples a new edge ek. On the bottom row, it constructs a logit per edge (u, v) for all potential edge weight values m, where each entry is given by a nonlinearity θξ^dec of the latent representations of the edge and the edge weight value (zu, zv, m). Then, every time it samples an edge, it samples the edge weight value from a softmax distribution depending on the corresponding logit and mask xm(u, v), which gets updated every time it samples a new yuk vk.
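The sampling procedure depicted in Figure 2 can be summarized by the following sketch. The score functions theta_gamma, theta_beta, theta_alpha and theta_xi stand in for the decoder's neural networks, and update_edge_mask / update_weight_mask stand in for the domain-dependent mask updates; all of them are placeholders of this sketch rather than the actual implementation.

```python
import numpy as np

def softmax_sample(logits, mask, rng=np.random):
    """Sample an index from a masked softmax distribution (assumes some entry is unmasked)."""
    scores = np.where(mask > 0, np.exp(logits), 0.0)
    return rng.choice(len(logits), p=scores / scores.sum())

def decode(Z, theta_gamma, theta_beta, theta_alpha, theta_xi,
           update_edge_mask, update_weight_mask,
           n_feature_values, n_weight_values, rng=np.random):
    """Schematic NeVAE decoder (placeholder score and mask functions)."""
    n = len(Z)
    # Node features: one softmax per node over feature values q.
    F = [softmax_sample(np.array([theta_gamma(Z[u], q) for q in range(n_feature_values)]),
                        np.ones(n_feature_values), rng) for u in range(n)]
    # Number of edges from a Poisson with log-intensity theta_beta(Z).
    l = rng.poisson(np.exp(theta_beta(Z)))
    candidates = [(u, v) for u in range(n) for v in range(u + 1, n)]
    edge_mask = np.ones(len(candidates))
    E, Y = [], {}
    for _ in range(l):
        # One multinomial over all remaining candidate edges.
        logits = np.array([theta_alpha(Z[u], Z[v]) for (u, v) in candidates])
        u, v = candidates[softmax_sample(logits, edge_mask, rng)]
        E.append((u, v))
        # Edge weight from a masked softmax over weight values m.
        w_logits = np.array([theta_xi(Z[u], Z[v], m) for m in range(n_weight_values)])
        w_mask = update_weight_mask(u, v, F, E, Y, n_weight_values)
        Y[(u, v)] = softmax_sample(w_logits, w_mask, rng)
        edge_mask = update_edge_mask(edge_mask, candidates, F, E, Y)
    return F, E, Y
```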
More formally, the distributions of each node feature, the number of edges, each edge and each edge weight are given by:

\[
p_\theta(f_u = q \mid Z) = \frac{e^{\theta^{dec}_\gamma(z_u, q)}}{\sum_{q'} e^{\theta^{dec}_\gamma(z_u, q')}}, \qquad
p_\theta(l \mid Z) = p_l\big(e^{\theta^{dec}_\beta(Z)}\big),
\]
\[
p_\theta\big(e = (u, v) \mid E_{k-1}, Z\big) = \frac{x_e\, e^{\theta^{dec}_\alpha(z_u, z_v)}}{\sum_{e' = (u', v') \notin E_{k-1}} x_{e'}\, e^{\theta^{dec}_\alpha(z_{u'}, z_{v'})}},
\]
\[
p_\theta\big(y_{uv} = m \mid Y_{k-1}, Z\big) = \frac{x_m(u, v)\, e^{\theta^{dec}_\xi(z_u, z_v, m)}}{\sum_{m' \neq m} x_{m'}(u, v)\, e^{\theta^{dec}_\xi(z_u, z_v, m')}},
\]

where pl denotes a Poisson distribution, xe is the binary mask for edge e, xm(u, v) is the binary mask for edge weight value m, and θ•^dec are neural networks. Note that the parameters of the neural networks do not depend on the number of nodes or edges in the molecular graph, and the dependency of the binary masks xe and xm(u, v) on the node features and the previously generated edges Ek−1 and edge weights Yk−1 is deterministic and domain dependent. Figure 2 summarizes our decoder architecture.

Note that, by using a softmax distribution, it is only necessary to account for the presence of an edge, not its absence, and this, in combination with negative sampling, will allow for efficient training and decoding, as will become clear later in this section. This is in contrast with previous generative models for graphs (Kipf and Welling 2016b; Simonovsky and Komodakis 2018), which need to model both the presence and absence of each potential edge. Moreover, we would like to acknowledge that, while masking may be useful to account for prior (expert) knowledge, it may be costly to check for some local (or global) structural and functional properties on-the-fly.

Prior. Given a set of n nodes with latent variables Z = {zu}u∈[n], pz(Z) ∼ N(0, I).

Training. Given a collection of N molecular graphs {Gi = (Vi, Ei)}i∈[N], each with ni nodes, a set of node features Fi and a set of edge weights Yi, we train our variational autoencoder for graphs by maximizing the evidence lower bound (ELBO), as described in the previous section, plus the log-likelihood of the Poisson distribution pλn modeling the number of nodes in each graph. Hence we aim to solve:

\[
\underset{\phi, \theta, \lambda_n}{\text{maximize}} \;\; \frac{1}{N} \sum_{i \in [N]} \Big[ \mathbb{E}_{q_\phi(Z_i \mid V_i, E_i, F_i, Y_i)} \log p_\theta(E_i, Y_i, F_i \mid Z_i) - \mathrm{KL}(q_\phi \,\|\, p_z) + \log p_{\lambda_n}(n_i) \Big] \qquad (4)
\]

Note that, in the above objective, computation of Eqφ log pθ(Ei, Yi, Fi|Zi) requires specifying an order for the edges present in the graph Gi. To determine this order, we use breadth-first traversals (BFS) with randomized tie breaking during the child-selection step. Such a tie breaking method makes the edge order independent of all node labels except for the source node label. Therefore, to make it completely permutation invariant, for each graph, we sample the source nodes from an arbitrary distribution. More formally, we replace log pθ(Ei, Yi, Fi|Zi) with log Es∼ζ(Vi) pθ(Ei, Yi, Fi|Zi) for each graph Gi, where s is the randomly sampled source node for the BFS and ζ is the sampling distribution for s. Note that the logarithm of a marginalized likelihood is difficult to compute. Fortunately, by using Jensen's inequality, we can obtain a lower bound of the actual likelihood:

\[
\log \mathbb{E}_{s \sim \zeta(V_i)}\, p_\theta(E_i, Y_i, F_i \mid Z_i) \;\ge\; \mathbb{E}_{s \sim \zeta(V_i)} \log p_\theta(E_i, Y_i, F_i \mid Z_i).
\]
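To make the edge ordering concrete, the sketch below runs a breadth-first traversal from a randomly sampled source node and shuffles the children of each visited node, so that ties are broken uniformly at random; it is only a schematic reading of the procedure described above, with a uniform source distribution assumed for ζ.

```python
import random
from collections import deque

def bfs_edge_order(nodes, adjacency, rng=random):
    """Order the edges of a connected graph by BFS with randomized tie breaking.

    adjacency: dict u -> list of neighbors of u.
    Returns the undirected edges (u, v) in the order the traversal reaches them.
    """
    source = rng.choice(sorted(nodes))          # s ~ zeta(V), here uniform over nodes
    visited, seen_edges = {source}, set()
    order, queue = [], deque([source])
    while queue:
        u = queue.popleft()
        children = list(adjacency[u])
        rng.shuffle(children)                   # randomized tie breaking among children
        for v in children:
            e = frozenset((u, v))
            if e not in seen_edges:             # record each undirected edge once
                seen_edges.add(e)
                order.append((u, v))
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return order
```

Averaging log pθ(Ei, Yi, Fi|Zi) over several such source samples then gives a Monte Carlo estimate of the expectation over s ∼ ζ(Vi).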
Therefore, to train our model, we maximize

\[
\frac{1}{N} \sum_{i \in [N]} \Big[ \mathbb{E}_{q_\phi(Z_i \mid V_i, E_i, F_i, Y_i),\, s \sim \zeta(V_i)} \log p_\theta(E_i, Y_i, F_i \mid Z_i) - \mathrm{KL}(q_\phi \,\|\, p_z) + \log p_{\lambda_n}(n_i) \Big]. \qquad (5)
\]

The following theorem points out the key property of our objective function (proven in the arXiv version).[3]

Theorem 2 If the source distribution ζ does not depend on the node labels, then the parameters learned by maximizing the objective in Eq. 5 are invariant to the permutations of the node labels.

Scalability and implementation details. In terms of scalability, the major bottleneck is computing the gradient of the first term in Eq. 5 during training, rather than encoding and decoding graphs once the model is trained. More specifically, given a source node for a network without masks, an exact computation of the per-edge partition function of the log-likelihood of the edges, i.e., \(\sum_{e' = (u', v') \notin E_{k-1}} \exp(\theta^{dec}_\alpha(z_{u'}, z_{v'}))\), requires O(|V|^2) computations, similarly as in most inference algorithms for existing generative models of graphs, and hence is costly to compute even for medium networks. Fortunately, in practice, we can approximate such a partition function using negative sampling (Mikolov et al. 2013), which reduces the likelihood computation to O(l), where l = |E| is the number of (true) edges in the graph. Therefore, for S samples of source nodes, the complexity becomes O(Sl). Here, note that most real-world graphs are sparse and thus l ≪ |V|^2.

Experiments on Real Data

In this section, we first show that our model beats several state of the art machine learning models for molecule design (Dai et al. 2018; Gómez-Bombarelli et al. 2016; Kusner et al. 2017; Simonovsky and Komodakis 2018; Jin, Barzilay, and Jaakkola 2018; Liu et al. 2018) in terms of several relevant quality metrics, i.e., validity, novelty and uniqueness. Then, by applying Bayesian optimization over the latent space of molecules provided by our encoder, we also show that our model can find a greater number of molecules that maximize certain desirable properties. Finally, we show that the continuous latent representations of molecules that our model finds are smooth.

Experimental setup. We sample ∼10,000 drug-like commercially available molecules from the ZINC dataset (Irwin et al. 2012) with E[n] = 44 atoms and ∼10,000 molecules from the QM9 dataset (Ramakrishnan et al. 2014; Ruddigkeit et al. 2012) with E[n] = 21 atoms. For each molecule, we construct a molecular graph, where nodes are the atoms, the node features are the types of the atoms, i.e., fu ∈ {C, H, N, O}, edges are the bonds between two atoms, and the weight associated to an edge is the type of bond (single, double or triple).[4] Then, for each dataset, we train our variational autoencoder for molecular graphs using batches comprised of molecules with the same number of nodes.[5] Finally, we sample 10^6 molecular graphs from each of the (two) trained variational autoencoders using: (i) G ∼ pθ(G|Z), where Z ∼ p(Z), and (ii) Z ∼ pθ(Z|G = GT), where GT is a molecular graph from the corresponding (training) dataset. In the above procedure, we only use masking on the weight (i.e., type of bond) distributions, both during training and sampling, to ensure that the valences of the nodes at both ends are valid at all times, i.e., xm(u, v) = I(m + nk(u) ≤ mmax(u) ∧ m + nk(v) ≤ mmax(v)), where nk(u) is the current valence of node u and mmax(u) is the maximum valence of node u, which depends on its type fu. Moreover, during sampling, if there is no valid weight value for a sampled edge, we reject it. To assess to which extent masking helps, we also train and sample from our model without masking. Here, we would like to highlight that, while using masking during test does not lead to a significant increase in the time it takes to generate a graph, using masking during training does lead to an increase of 5% in training time.

We compare the quality of the molecules generated by our trained models with the molecules generated by several state of the art competing methods: (i) GraphVAE (Simonovsky and Komodakis 2018), (ii) GrammarVAE (Kusner et al. 2017), (iii) CVAE (Gómez-Bombarelli et al. 2016), (iv) SDVAE (Dai et al. 2018), (v) JTVAE (Jin, Barzilay, and Jaakkola 2018) and (vi) CGVAE (Liu et al. 2018). Among them, GraphVAE, JTVAE and CGVAE use molecular graphs; the rest of the methods use SMILES strings, a domain specific textual representation of molecules. We use the following evaluation metrics for performance comparison:

(i) Novelty: we use this metric to evaluate to which degree a method generates novel molecules, i.e., molecules which were not present in the (training) dataset. That is, Novelty = 1 − |Cs ∩ D|/|Cs|, where Cs is the set of generated molecules which are chemically valid, D is the training dataset, and Novelty ∈ [0, 1].
(ii) Uniqueness: we use this metric to evaluate to what extent a method generates unique chemically valid molecules. We define Uniqueness = |set(Cs)|/ns, where ns is the number of generated molecules, and Uniqueness ∈ [0, 1].
(iii) Validity: we use this metric to evaluate to which degree a method generates chemically valid molecules.[6] That is, Validity = |Cs|/ns, where ns is the number of generated molecules, Cs is the set of generated molecules which are chemically valid, and Validity ∈ [0, 1].

Quality of the generated molecules. Tables 1–2 compare our trained models to the state of the art methods above in terms of novelty, uniqueness, and validity. For GraphVAE and CGVAE we report the results reported in the paper and, for SDVAE, since there is no public domain implementation

[3] https://arxiv.org/abs/1802.05283
[4] We have not selected any molecule whose bond types are other than these three.
[5] We batch graphs with respect to the number of nodes for efficiency reasons since, every time that the number of nodes changes, we need to change the size of the computational graph in Tensorflow.
[6] We used the open source cheminformatics suite RDKit (http://www.rdkit.org) to check the validity of a generated molecule.
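To make the three metrics above concrete, the following sketch computes them from SMILES strings of the generated and training molecules; it assumes that validity is checked by whether RDKit can parse (sanitize) the molecule, in line with footnote [6], but it is not necessarily the exact evaluation script used in the paper.

```python
from rdkit import Chem

def evaluate(generated_smiles, training_smiles):
    """Compute Validity, Uniqueness and Novelty of generated molecules (sketch)."""
    n_s = len(generated_smiles)
    # C_s: generated molecules that RDKit accepts as chemically valid,
    # stored as canonical SMILES so duplicates can be detected.
    valid = [Chem.MolToSmiles(m)
             for m in (Chem.MolFromSmiles(s) for s in generated_smiles)
             if m is not None]
    # Training set as canonical SMILES (assumed to be valid molecules).
    training = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    c_s = set(valid)
    validity = len(valid) / n_s                                  # |C_s| / n_s
    uniqueness = len(c_s) / n_s                                  # |set(C_s)| / n_s
    novelty = 1.0 - len(c_s & training) / len(c_s) if c_s else 0.0  # 1 - |C_s ∩ D| / |C_s|
    return validity, uniqueness, novelty
```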
Novelty
Dataset   NeVAE   NeVAE*   GraphVAE   GrammarVAE   CVAE    SDVAE   JTVAE   CGVAE
ZINC      1.000   1.000    -          1.000        0.980   1.000   0.999   1.000
QM9       1.000   1.000    0.661      1.000        0.902   -       1.000   0.943

Uniqueness
Dataset   NeVAE   NeVAE*   GraphVAE   GrammarVAE   CVAE    SDVAE   JTVAE   CGVAE
ZINC      0.999   0.588    -          0.273        0.021   1.000   0.991   0.998
QM9       0.998   0.676    0.305      0.197        0.031   -       0.371   0.986

Table 1: Novelty and Uniqueness of the molecules generated using NeVAE and all baselines. The sign * indicates no masking. For both datasets, we report Novelty (Uniqueness) over valid (10^6) sampled molecules.

Validity
Dataset   Sampling type   NeVAE   NeVAE*   GraphVAE   GrammarVAE   CVAE    SDVAE   JTVAE   CGVAE
ZINC      Z ~ P(Z)        1.000   0.590    0.135      0.440        0.021   0.432   1.000   1.000
ZINC      Z ~ P(Z|GT)     1.000   0.580    -          0.381        0.175   -       1.000   -
QM9       Z ~ P(Z)        0.999   0.682    0.458      0.200        0.031   -       0.997   1.000
QM9       Z ~ P(Z|GT)     0.999   0.660    -          0.301        0.100   -       0.965   -

Table 2: Validity of the molecules generated using NeVAE and all baselines. The sign * indicates no masking. For both datasets, we report the numbers over 10^6 sampled molecules.
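The only mask used for the NeVAE columns of these tables (and switched off for NeVAE*) is the valence mask from the experimental setup, xm(u, v) = I(m + nk(u) ≤ mmax(u) ∧ m + nk(v) ≤ mmax(v)). A direct transcription might read as follows; the valence table for the four atom types is an assumption of this sketch, and, as described above, an edge is rejected if no weight value remains valid.

```python
# Maximum valences assumed for the four atom types used in the experiments.
MAX_VALENCE = {"C": 4, "H": 1, "N": 3, "O": 2}

def weight_mask(m, u, v, atom_type, current_valence):
    """Valence mask x_m(u, v): 1 if adding a bond of order m keeps both atoms valid."""
    return int(current_valence[u] + m <= MAX_VALENCE[atom_type[u]] and
               current_valence[v] + m <= MAX_VALENCE[atom_type[v]])
```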
Conclusions

In this work, we have introduced a variational autoencoder for molecular graphs that is invariant to permutations of the node labels of the graphs it is trained with and allows for graphs with a different number of nodes and edges. Moreover, the decoder is able to guarantee a set of local structural and functional properties in the generated graphs through masking. Finally, we have shown that our variational autoencoder can also be used to discover valid and diverse molecules with certain desirable properties more effectively than several state of the art methods.

Our work also opens many interesting avenues for future work. For example, in the design of our variational autoencoder, we have assumed graphs to be static; however, it would be interesting to augment our design to dynamic graphs by, e.g., incorporating a recurrent neural network. We have performed experiments on a single real-world application, i.e., automatic chemical design; however, it would be interesting to explore other applications, e.g., end-to-end generative modeling of molecules with specified properties.

Acknowledgements. B. Samanta was supported by a Google India Ph.D. Fellowship and the "Learning Representations from Network Data" project sponsored by Intel. P. K. Chattaraj would like to thank DST, New Delhi for the J. C. Bose National Fellowship.

References

Dai, H.; Tian, Y.; Dai, B.; Skiena, S.; and Song, L. 2018. Syntax-directed variational autoencoder for structured data. In ICLR.

Gómez-Bombarelli, R.; Duvenaud, D.; Hernández-Lobato, J. M.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2016. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415.

Hamilton, W.; Ying, R.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NIPS.

Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; and Coleman, R. G. 2012. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 52(7):1757–1768.

Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364.

Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4):455–492.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kipf, T. N., and Welling, M. 2016a. Semi-supervised classification with graph convolutional networks.

Kipf, T. N., and Welling, M. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Kusner, M. J.; Paige, B.; and Hernández-Lobato, J. M. 2017. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925.

Lei, T.; Jin, W.; Barzilay, R.; and Jaakkola, T. 2017. Deriving neural architectures from sequence and graph kernels. In ICML.

Liu, Q.; Allamanis, M.; Brockschmidt, M.; and Gaunt, A. L. 2018. Constrained graph variational autoencoders for molecule design. arXiv preprint arXiv:1805.09076.

Merz, K. M.; Ringe, D.; and Reynolds, C. H. 2010. Drug Design: Structure- and Ligand-Based Approaches. Cambridge University Press.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, C. C.; Munos, B. H.; Lindborg, S. R.; and Schacht, A. L. 2010. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery 9(3):203.

Pham, T.; Tran, T.; Phung, D. Q.; and Venkatesh, S. 2017. Column networks for collective classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2485–2491.

Polishchuk, P. G.; Madzhidov, T. I.; and Varnek, A. 2013. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design 27(8):675–679.

Ramakrishnan, R.; Dral, P. O.; Rupp, M.; and Von Lilienfeld, O. A. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1:140022.

Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

Ruddigkeit, L.; Van Deursen, R.; Blum, L. C.; and Reymond, J.-L. 2012. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling 52(11):2864–2875.

Simonovsky, M., and Komodakis, N. 2018. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480.

Snelson, E., and Ghahramani, Z. 2006. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems.