GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders

Université Paris Est & École des Ponts ParisTech, Champs sur Marne, France. Correspondence to: Martin Simonovsky <[email protected]>.

Abstract

Deep learning on graphs has become a popular research topic with many applications. However, past work has concentrated on learning graph embedding tasks, which is in contrast with advances in generative models for images and text. Is it possible to transfer this progress to the domain of graphs? We propose to sidestep hurdles associated with linearization of such discrete structures by having a decoder output a probabilistic fully-connected graph of a predefined maximum size directly at once. Our method is formulated as a variational autoencoder. We evaluate on the challenging task of molecule generation.

1. Introduction

Deep learning on graphs has very recently become a popular research topic (Bronstein et al., 2017), with useful applications across fields such as chemistry (Gilmer et al., 2017), medicine (Ktena et al., 2017), or computer vision (Simonovsky & Komodakis, 2017). Past work has concentrated on learning graph embedding tasks so far, i.e. encoding an input graph into a vector representation. This is in stark contrast with fast-paced advances in generative models for images and text, which have seen a massive rise in the quality of generated samples. Hence, it is an intriguing question how one can transfer this progress to the domain of graphs, i.e. their decoding from a vector representation. Moreover, the desire for such a method has been mentioned in the past by Gómez-Bombarelli et al. (2016).

However, learning to generate graphs is a difficult problem for methods based on gradient optimization, as graphs are discrete structures. Unlike sequence (text) generation, graphs can have arbitrary connectivity and there is no clear best way to linearize their construction in a sequence of steps. On the other hand, learning the order for incremental construction involves discrete decisions, which are not differentiable.

In this work, we propose to sidestep these hurdles by having the decoder output a probabilistic fully-connected graph of a predefined maximum size directly at once. In a probabilistic graph, the existence of nodes and edges, as well as their attributes, are modeled as independent random variables. The method is formulated in the framework of variational autoencoders (VAE) by Kingma & Welling (2013).

We demonstrate our method, coined GraphVAE, in cheminformatics on the task of molecule generation. Molecular datasets are a challenging but convenient testbed for our generative model, as they easily allow for both qualitative and quantitative tests of decoded samples. While our method is applicable to generating smaller graphs only and its performance leaves space for improvement, we believe our work is an important initial step towards powerful and efficient graph decoders.

2. Related work

Graph Decoders. Graph generation has been largely unexplored in deep learning. The closest work to ours is by Johnson (2017), who incrementally constructs a probabilistic (multi)graph as a world representation according to a sequence of input sentences to answer a query. While our model also outputs a probabilistic graph, we do not assume having a prescribed order of construction transformations available, and we formulate the learning problem as an autoencoder.

Xu et al. (2017) learn to produce a scene graph from an input image. They construct a graph from a set of object proposals, provide initial embeddings to each node and edge, and use message passing to obtain a consistent prediction. In contrast, our method is a generative model which produces a probabilistic graph from a single opaque vector, without specifying the number of nodes or the structure explicitly.

Related work pre-dating deep learning includes random graphs (Erdos & Rényi, 1960; Barabási & Albert, 1999), stochastic blockmodels (Snijders & Nowicki, 1997), or state transition matrix learning (Gong & Xiang, 2003).
Figure 1. Illustration of the proposed variational graph autoencoder. Starting from a discrete attributed graph $G = (A, E, F)$ on $n$ nodes (e.g. a representation of propylene oxide), the stochastic graph encoder $q_\phi(z|G)$ embeds the graph into a continuous representation $z$. Given a point in the latent space, our novel graph decoder $p_\theta(G|z)$ outputs a probabilistic fully-connected graph $\tilde{G} = (\tilde{A}, \tilde{E}, \tilde{F})$ on a predefined number of $k \geq n$ nodes, from which discrete samples may be drawn. The process can be conditioned on a label $y$ for controlled sampling at test time. The reconstruction ability of the autoencoder is facilitated by approximate graph matching for aligning $G$ with $\tilde{G}$.
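Equation 1, referenced below, is the usual variational autoencoder bound consisting of a reconstruction term and a KL term. The following is a minimal sketch of how it could be assembled, assuming the encoder outputs the mean and log-variance of a Gaussian posterior $q_\phi(z|G)$ and that `log_likelihood` is the graph log-likelihood $\log p_\theta(G|z)$ derived in Section 3.3; the function and variable names are illustrative, not taken from the paper's code.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def negative_elbo(mu, logvar, log_likelihood):
    """L(phi, theta; G) = -E_q[log p_theta(G|z)] + KL(q_phi(z|G) || N(0, I))."""
    # Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return (-log_likelihood + kl).mean()
```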
The first term of L, the reconstruction loss, enforces high similarity of sampled generated graphs to the input graph G. The second term, the KL-divergence, regularizes the code space to allow for sampling of z directly from p(z) instead of from $q_\phi(z|G)$ later. The dimensionality of z is usually fairly small, so that the autoencoder is encouraged to learn a high-level compression of the input instead of learning to simply copy any given input. While the regularization is independent of the input space, the reconstruction loss must be specifically designed for each input modality. In the following, we introduce our graph decoder together with an appropriate reconstruction loss.

3.2. Probabilistic Graph Decoder

Graphs are discrete objects, ultimately. While this does not pose a challenge for encoding, as demonstrated by the recent developments in graph convolution networks (Gilmer et al., 2017), graph generation has been an open problem so far. In the related task of text sequence generation, the currently dominant approach is character-wise or word-wise prediction (Bowman et al., 2016). However, graphs can have arbitrary connectivity and there is no clear way to linearize their construction in a sequence of steps¹. On the other hand, iterative construction of discrete structures during training without step-wise supervision involves discrete decisions, which are not differentiable and therefore problematic for back-propagation.

¹ While algorithms for canonical graph orderings are available (McKay & Piperno, 2014), Vinyals et al. (2015) empirically found that the linearization order matters when learning on sets.

Fortunately, the task can become much simpler if we restrict the domain to the set of all graphs on at most k nodes, where k is fairly small (in practice up to the order of tens). Under this assumption, handling dense graph representations is still computationally tractable. We propose to make the decoder output a probabilistic fully-connected graph $\tilde{G} = (\tilde{A}, \tilde{E}, \tilde{F})$ on k nodes at once. This effectively sidesteps both problems mentioned above.

In probabilistic graphs, the existence of nodes and edges is modeled as Bernoulli variables, whereas node and edge attributes are multinomial variables. While not discussed in this work, continuous attributes could easily be modeled as Gaussian variables represented by their mean and variance. We assume all variables to be independent.

Each tensor of the representation of $\tilde{G}$ thus has a probabilistic interpretation. Specifically, the predicted adjacency matrix $\tilde{A} \in [0,1]^{k \times k}$ contains both node probabilities $\tilde{A}_{a,a}$ and edge probabilities $\tilde{A}_{a,b}$ for nodes $a \neq b$. The edge attribute tensor $\tilde{E} \in \mathbb{R}^{k \times k \times d_e}$ indicates class probabilities for edges and, similarly, the node attribute matrix $\tilde{F} \in \mathbb{R}^{k \times d_n}$ contains class probabilities for nodes.

The decoder itself is deterministic. Its architecture is a simple multi-layer perceptron (MLP) with three outputs in its last layer. A sigmoid activation function is used to compute $\tilde{A}$, whereas edge- and node-wise softmax is applied to obtain $\tilde{E}$ and $\tilde{F}$, respectively. At test time, we are often interested in a (discrete) point estimate of $\tilde{G}$, which can be obtained by taking edge- and node-wise argmax in $\tilde{A}$, $\tilde{E}$, and $\tilde{F}$. Note that this can result in a discrete graph on fewer than k nodes.

3.3. Reconstruction Loss

Given a particular instance of a discrete input graph G on $n \leq k$ nodes and its probabilistic reconstruction $\tilde{G}$ on k nodes, evaluation of Equation 1 requires computation of the likelihood $p_\theta(G|z) = P(G|\tilde{G})$.

Since no particular ordering of nodes is imposed in either $\tilde{G}$ or G, and the matrix representation of graphs is not invariant to permutations of nodes, comparison of two graphs is hard. However, approximate graph matching, described further in Subsection 3.4, can obtain a binary assignment matrix $X \in \{0,1\}^{k \times n}$, where $X_{a,i} = 1$ only if node $a \in \tilde{G}$ is assigned to $i \in G$ and $X_{a,i} = 0$ otherwise.

Knowledge of X allows mapping information between both graphs. Specifically, the input adjacency matrix is mapped to the predicted graph as $A' = X A X^T$, whereas the predicted node attribute matrix and slices of the edge attribute tensor are transferred to the input graph as $\tilde{F}' = X^T \tilde{F}$ and $\tilde{E}'_{\cdot,\cdot,l} = X^T \tilde{E}_{\cdot,\cdot,l} X$. The maximum likelihood estimates, i.e. cross-entropies, of the respective variables are as follows:

$$\log p(A'|z) = \frac{1}{k} \sum_a A'_{a,a} \log \tilde{A}_{a,a} + (1 - A'_{a,a}) \log(1 - \tilde{A}_{a,a}) \;+\; \frac{1}{k(k-1)} \sum_{a \neq b} A'_{a,b} \log \tilde{A}_{a,b} + (1 - A'_{a,b}) \log(1 - \tilde{A}_{a,b})$$

$$\log p(F|z) = \frac{1}{n} \sum_i \log F_{i,\cdot}^T \tilde{F}'_{i,\cdot}$$

$$\log p(E|z) = \frac{1}{\lVert A \rVert_1 - n} \sum_{i \neq j} \log E_{i,j,\cdot}^T \tilde{E}'_{i,j,\cdot} \qquad (2)$$

where we assumed that F and E are encoded in one-hot notation. The formulation considers the existence of both matched and unmatched nodes and edges, but attributes of only the matched ones. Furthermore, averaging over nodes and edges separately has proven beneficial in training, as otherwise the edges dominate the likelihood. The overall reconstruction loss is a weighted sum of the previous terms:

$$-\log p(G|z) = -\lambda_A \log p(A'|z) - \lambda_F \log p(F|z) - \lambda_E \log p(E|z) \qquad (3)$$
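A compact sketch of Equations 2 and 3, assuming a binary assignment X obtained from the matching of Subsection 3.4, one-hot input tensors E and F, and an input adjacency A with ones on its diagonal; the variable and function names are ours, not from the authors' implementation.

```python
import torch

def graph_reconstruction_loss(A, E, F, A_hat, E_hat, F_hat, X,
                              lambda_A=1.0, lambda_E=1.0, lambda_F=1.0, eps=1e-10):
    """-log p(G|z) of Eq. (3). A (n x n, ones on the diagonal), E (n x n x d_e, one-hot),
    F (n x d_n, one-hot) describe the input graph; A_hat (k x k), E_hat (k x k x d_e),
    F_hat (k x d_n) the probabilistic graph; X (k x n) is the binary assignment matrix."""
    k, n = X.shape
    # Map the input adjacency onto the predicted graph: A' = X A X^T.
    A_prime = X @ A @ X.t()
    # Transfer predictions onto the input graph: F~' = X^T F~ and E~'_l = X^T E~_l X.
    F_prime = X.t() @ F_hat
    E_prime = torch.einsum('ai,abl,bj->ijl', X, E_hat, X)

    diag = torch.arange(k)
    off = 1.0 - torch.eye(k)
    # Eq. (2), adjacency: node (diagonal) and edge (off-diagonal) terms averaged separately.
    log_pA = (A_prime[diag, diag] * torch.log(A_hat[diag, diag] + eps)
              + (1 - A_prime[diag, diag]) * torch.log(1 - A_hat[diag, diag] + eps)).sum() / k
    log_pA = log_pA + ((A_prime * torch.log(A_hat + eps)
                        + (1 - A_prime) * torch.log(1 - A_hat + eps)) * off).sum() / (k * (k - 1))
    # Eq. (2), node attributes: cross-entropy of matched nodes only, averaged over n.
    log_pF = torch.log((F * F_prime).sum(dim=1) + eps).sum() / n
    # Eq. (2), edge attributes: matched existing edges only, averaged over ||A||_1 - n entries.
    edge_mask = A * (1.0 - torch.eye(n))
    log_pE = (edge_mask * torch.log((E * E_prime).sum(dim=2) + eps)).sum() / (A.sum() - n)

    return -(lambda_A * log_pA + lambda_F * log_pF + lambda_E * log_pE)
```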
Fortunately, we found the graph representation of molecules, as undirected graphs with atoms as nodes and bonds as edges, to be a convenient testbed for generative models. On one hand, generated graphs can be easily visualized in standardized structural diagrams. On the other hand, chemical validity of graphs, as well as many further properties a molecule can fulfill, can be checked using software packages (SanitizeMol in RDKit) or simulations. This makes both qualitative and quantitative tests possible.

Chemical constraints on compatible types of bonds and atom valences make the space of valid graphs complicated and molecule generation challenging. In fact, a single addition or removal of an edge, or a change in atom or bond type, can make a molecule chemically invalid. Comparably, flipping a single pixel in an MNIST-like number generation problem is of no issue.

To help the network in this application, we introduce three remedies. First, we make the decoder output symmetric $\tilde{A}$ and $\tilde{E}$ by predicting only their (upper) triangular parts, as undirected graphs are a sufficient representation for molecules. Second, we use the prior knowledge that molecules are connected and, at test time only, construct a maximum spanning tree on the set of probable nodes $\{a : \tilde{A}_{a,a} \geq 0.5\}$ in order to include its edges $(a, b)$ in the discrete pointwise estimate of the graph even if $\tilde{A}_{a,b} < 0.5$ originally. Third, we do not generate hydrogen explicitly and let it be added as "padding" during the chemical validity check.

Setup. … and ReLU; followed by a parallel triplet of FCLs to output the graph tensors. We set c = 40, $\lambda_A = \lambda_F = \lambda_E = 1$, batch size 32, 75 MPM iterations, and train for 25 epochs with Adam with learning rate 1e-3 and $\beta_1 = 0.5$.

Embedding Visualization. To visually judge the quality and smoothness of the learned embedding z of our model, we may traverse it in two ways: along a slice and along a line. For the former, we randomly choose two c-dimensional orthonormal vectors and sample z in a regular grid pattern over the induced 2D plane. For the latter, we randomly choose two molecules $G^{(1)}, G^{(2)}$ of the same label from the test set and interpolate between their embeddings $\mu(G^{(1)}), \mu(G^{(2)})$. This also evaluates the encoder, and therefore benefits from low reconstruction error.

We plot two planes in Figure 2, for a frequent label (left) and a less frequent label in QM9 (right). Both images show a varied and fairly smooth mix of molecules. The left image has many valid samples broadly distributed across the plane, as presumably the autoencoder had to fit a large portion of the database into this space. The right exhibits a stronger effect of regularization, as valid molecules tend to appear only around the center.

An example of several interpolations is shown in Figure 3. We can find both meaningful (1st, 2nd and 4th row) and less meaningful transitions, though many samples on the lines do not form chemically valid compounds.
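A small sketch of the two traversal schemes just described, assuming hypothetical `encode`/`decode` wrappers around the trained model (these names are ours, not the paper's API):

```python
import numpy as np

def plane_grid(c, extent=5.0, steps=9):
    """Latent codes on a random 2D plane through the origin (slice traversal)."""
    u = np.random.randn(c)
    u /= np.linalg.norm(u)
    v = np.random.randn(c)
    v -= (v @ u) * u          # Gram-Schmidt step to make the pair orthonormal
    v /= np.linalg.norm(v)
    ticks = np.linspace(-extent, extent, steps)
    return np.stack([a * u + b * v for a in ticks for b in ticks])

def line_interpolation(mu1, mu2, steps=9):
    """Linear interpolation between the embeddings of two molecules (line traversal)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * mu1[None, :] + t * mu2[None, :]

# Hypothetical usage: decode every latent code, conditioned on a label y, and
# take the discrete point estimate of each resulting probabilistic graph.
# graphs = [decode(z, label=y) for z in plane_grid(c=40)]
```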
Figure 2. Decodings of latent space points of a conditional model sampled over a random 2D plane in z-space of c = 40 (within 5 units
from center of coordinates). Left: Samples conditioned on 7x Carbon, 1x Nitrogen, 1x Oxygen (12% QM9). Right: Samples conditioned
on 5x Carbon, 1x Nitrogen, 3x Oxygen (2.6% QM9). Color legend as in Figure 3.
… model, as the decoder is forced less to rely on actual labels. The ratio of Valid samples shows less clear behavior, likely because the discrete performance is not directly optimized for. For all models, it is remarkable that about 60% of generated molecules are out of the dataset, i.e. the network has never seen them during training.

Looking at the baselines, CVAE can output only very few valid samples as expected, while GVAE generates the highest number of valid samples (60%) but of very low variance (less than 10%). Additionally, we investigate the importance of graph matching by using an identity assignment X instead, thus learning to reproduce particular node permutations in the training set, which correspond to the canonical ordering of SMILES strings from rdkit. This ablated model (denoted as NoGM in Table 1) produces many valid samples of lower variety and, surprisingly, outperforms GVAE in this regard. In comparison, our model can achieve good performance in both metrics at the same time.

Likelihood. Besides the application-specific metrics introduced above, we also report the evidence lower bound (ELBO) commonly used in VAE literature, which corresponds to −L(φ, θ; G) in our notation. In Table 1, we state mean bounds over the test set, using a single z sample per graph. We observe that both the reconstruction loss and the KL-divergence decrease due to larger c providing more freedom. However, there seems to be no strong correlation between ELBO and Valid, which makes model selection somewhat difficult.

Implicit Node Probabilities. Our decoder assumes independence of node and edge probabilities, which allows for isolated nodes or edges. Making further use of the fact that molecules are connected graphs, here we investigate the effect of making node probabilities a function of edge probabilities. Specifically, we consider the probability for node a as that of its most probable edge: $\tilde{A}_{a,a} = \max_b \tilde{A}_{a,b}$.

The evaluation on QM9 in Table 2 shows a clear improvement in Valid, Accurate, and Novel metrics in both the conditional and unconditional setting. However, this is paid for by lower variability and higher reconstruction loss. This indicates that while the new constraint is useful, the model cannot fully cope with it.

4.3. ZINC Dataset

The ZINC dataset (Irwin et al., 2012) contains about 250k drug-like organic molecules of up to 38 heavy atoms with 9 distinct atomic numbers and 4 bond types. We set k = 38, $d_e = 4$ and $d_n = 9$, and use the same split strategy as with QM9. We investigate the degree of scalability of an unconditional generative model.

Setup. The setup is equivalent to that for QM9 but with a wider encoder (64, 128, 256 channels).
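Referring back to the implicit node probabilities above, a one-line realization of the constraint $\tilde{A}_{a,a} = \max_b \tilde{A}_{a,b}$ could look as follows (a sketch; `A_hat` stands for the k × k probabilistic adjacency, and the helper name is ours):

```python
import torch

def implicit_node_probabilities(A_hat):
    """Replace diagonal (node) probabilities with the largest incident edge probability."""
    k = A_hat.size(0)
    eye = torch.eye(k, dtype=torch.bool)
    node_prob = A_hat.masked_fill(eye, float('-inf')).max(dim=1).values  # max over b != a
    return A_hat.masked_fill(eye, 0.0) + torch.diag(node_prob)
```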
Figure 3. Linear interpolation between row-wise pairs of randomly chosen molecules in z-space of c = 40 in a conditional model. Color
legend: encoder inputs (green), chemically invalid graphs (red), valid graphs with wrong label (blue), valid and correct (white).
Decoder Quality Metrics. Our best model with c = 40 has achieved Valid = 0.135, which is clearly worse than for QM9. Using implicit node probabilities brought no improvement. For comparison, CVAE failed to generate any valid sample, while GVAE achieved Valid = 0.357 (models provided by Kusner et al. (2017), c = 56).

We attribute such low performance to a generally much higher chance of producing a chemically-relevant inconsistency (the number of possible edges grows quadratically). To confirm the relationship between performance and graph size k, we kept only graphs not larger than k = 20 nodes, corresponding to 21% of ZINC, and obtained Valid = 0.341 (and Valid = 0.185 for k = 30 nodes, 92% of ZINC). To verify that the problem is likely not caused by our proposed graph matching loss, we evaluate it synthetically in the following.

Matching Robustness. Robust behavior of graph matching using our similarity function S is important for good performance of GraphVAE. Here we study graph matching in isolation to investigate its scalability. To that end, we add Gaussian noise $N(0, \epsilon_A)$, $N(0, \epsilon_E)$, $N(0, \epsilon_F)$ to each tensor of the input graph G, truncating and renormalizing to keep their probabilistic interpretation, to create its noisy version $G_N$. We are interested in the quality of matching between self, $P[G, G]$, using the noisy assignment matrix X between G and $G_N$. The advantage over naively checking X for identity is the invariance to permutations of equivalent nodes.

In Table 3 we vary k and the noise level for each tensor separately and report mean accuracies (computed in the same fashion as the losses in Equation 3) over 100 random samples from ZINC with size up to k nodes. While we observe an expected fall of accuracy with stronger noise, the behavior is fairly robust with respect to increasing k at a fixed noise level, the most sensitive being the adjacency matrix. Note that accuracies are not comparable across tables due to the different dimensionalities of the random variables. We may conclude that the quality of the matching process is not a major hurdle to scalability.
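The perturbation used in this benchmark can be sketched as follows (our own helper, not the authors' code): noise is added to each tensor, values are truncated to [0, 1], and categorical axes are renormalized so they remain probability distributions.

```python
import numpy as np

def perturb(tensor, sigma, categorical_axis=None):
    """Add N(0, sigma) noise, truncate to [0, 1], renormalize categorical axes."""
    noisy = tensor + np.random.normal(0.0, sigma, size=tensor.shape)
    noisy = np.clip(noisy, 0.0, 1.0)
    if categorical_axis is not None:
        noisy /= noisy.sum(axis=categorical_axis, keepdims=True) + 1e-10
    return noisy

# Noisy version G_N of a graph G = (A, E, F):
# A_n, E_n, F_n = perturb(A, eps_A), perturb(E, eps_E, 2), perturb(F, eps_F, 1)
```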
Table 1. Performance of conditional and unconditional QM9 models evaluated by mean test-time reconstruction log-likelihood ($\log p_\theta(G|z)$), mean test-time evidence lower bound (ELBO), and decoding quality metrics (Section 4.2). Baselines CVAE (Gómez-Bombarelli et al., 2016) and GVAE (Kusner et al., 2017) are listed only for the embedding size with the highest Valid.

Table 2. Performance of conditional and unconditional QM9 models with implicit node probabilities. Improvement with respect to Table 1 is emphasized in italics.

Table 3. Mean accuracy of matching ZINC graphs to their noisy counterparts in a synthetic benchmark as a function of maximum graph size k.

Noise                  k = 15   k = 20   k = 25   k = 30   k = 35   k = 40
ε_A = ε_E = ε_F = 0     99.55    99.52    99.45    99.4     99.47    99.46
ε_A = 0.4               90.95    89.55    86.64    87.25    87.07    86.78
ε_A = 0.8               82.14    81.01    79.62    79.67    79.07    78.69
ε_E = 0.4               97.11    96.42    95.65    95.90    95.69    95.69
ε_E = 0.8               92.03    90.76    89.76    89.70    88.34    89.40
ε_F = 0.4               98.32    98.23    97.64    98.28    98.24    97.90
ε_F = 0.8               97.26    97.00    96.60    96.91    96.56    97.17

5. Conclusion

In this work we addressed the problem of generating graphs from a continuous embedding in the context of variational autoencoders. We evaluated our method on two molecular datasets of different maximum graph size. While we managed to learn embeddings of reasonable quality on small molecules, our decoder had a hard time capturing complex chemical interactions for larger molecules. Nevertheless, we believe our method is an important initial step towards more powerful decoders and will spark interest in the community.

There are many avenues to follow for future work. Besides the obvious desire to improve the current method (for example, by incorporating a more powerful prior distribution or adding a recurrent mechanism for correcting mistakes), we would like to extend it beyond a proof of concept by applying it to real problems in chemistry, such as optimization of certain properties or predicting chemical reactions. An advantage of a graph-based decoder compared to a SMILES-based decoder is the possibility to predict detailed attributes of atoms and bonds in addition to the base structure, which might be useful in these tasks. Our autoencoder might also be used to pre-train graph encoders for fine-tuning on small datasets (Goh et al., 2017).

ACKNOWLEDGMENTS

We thank Shell Xu Hu for discussions on variational methods, Shinjae Yoo for project motivation, and anonymous reviewers for their comments.

References

Barabási, Albert-László and Albert, Réka. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

Bengio, Samy, Vinyals, Oriol, Jaitly, Navdeep, and Shazeer, Noam. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pp. 1171–1179, 2015.

Bowman, Samuel R., Vilnis, Luke, Vinyals, Oriol, Dai, Andrew M., Józefowicz, Rafal, and Bengio, Samy. Generating sentences from a continuous space. In CoNLL, pp. 10–21, 2016.

Bronstein, Michael M, Bruna, Joan, LeCun, Yann, Szlam, Arthur, and Vandergheynst, Pierre. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
Cho, Minsu, Sun, Jian, Duchenne, Olivier, and Ponce, Jean. Finding matches in a haystack: A max-pooling strategy for graph matching in the presence of outliers. In CVPR, pp. 2091–2098, 2014.

Date, Ketan and Nagi, Rakesh. GPU-accelerated Hungarian algorithms for the linear assignment problem. Parallel Computing, 57:52–72, 2016.

Erdos, Paul and Rényi, Alfréd. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

Gilmer, Justin, Schoenholz, Samuel S., Riley, Patrick F., Vinyals, Oriol, and Dahl, George E. Neural message passing for quantum chemistry. In ICML, pp. 1263–1272, 2017.

Goh, Garrett B., Siegel, Charles, Vishnu, Abhinav, and Hodas, Nathan O. ChemNet: A transferable and generalizable deep neural network for small-molecule property prediction. arXiv preprint arXiv:1712.02734, 2017.

Gómez-Bombarelli, Rafael, Duvenaud, David K., Hernández-Lobato, José Miguel, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Automatic chemical design using a data-driven continuous representation of molecules. CoRR, abs/1610.02415, 2016.

Gong, Shaogang and Xiang, Tao. Recognition of group activities using dynamic probabilistic networks. In ICCV, pp. 742–749, 2003.

Irwin, John J., Sterling, Teague, Mysinger, Michael M., Bolstad, Erin S., and Coleman, Ryan G. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.

Jang, Eric, Gu, Shixiang, and Poole, Ben. Categorical reparameterization with gumbel-softmax. CoRR, abs/1611.01144, 2016.

Johnson, Daniel D. Learning graphical state transitions. In ICLR, 2017.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.

Ktena, Sofia Ira, Parisot, Sarah, Ferrante, Enzo, Rajchl, Martin, Lee, Matthew C. H., Glocker, Ben, and Rueckert, Daniel. Distance metric learning using graph convolutional networks: Application to functional brain networks. In MICCAI, 2017.

Kusner, Matt J. and Hernández-Lobato, José Miguel. GANs for sequences of discrete elements with the gumbel-softmax distribution. CoRR, abs/1611.04051, 2016.

Kusner, Matt J., Paige, Brooks, and Hernández-Lobato, José Miguel. Grammar variational autoencoder. In ICML, pp. 1945–1954, 2017.

Landrum, Greg. RDKit: Open-source cheminformatics. URL http://www.rdkit.org.

Li, Yujia, Swersky, Kevin, and Zemel, Richard S. Generative moment matching networks. In ICML, pp. 1718–1727, 2015a.

Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard S. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2015b.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian J. Adversarial autoencoders. CoRR, abs/1511.05644, 2015.

McKay, Brendan D. and Piperno, Adolfo. Practical graph isomorphism, II. Journal of Symbolic Computation, 60(0):94–112, 2014. ISSN 0747-7171.

Olivecrona, Marcus, Blaschke, Thomas, Engkvist, Ola, and Chen, Hongming. Molecular de novo design through deep reinforcement learning. CoRR, abs/1704.07555, 2017.

Ramakrishnan, Raghunathan, Dral, Pavlo O, Rupp, Matthias, and von Lilienfeld, O Anatole. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.

Segler, Marwin H. S., Kogej, Thierry, Tyrchan, Christian, and Waller, Mark P. Generating focussed molecule libraries for drug discovery with recurrent neural networks. CoRR, abs/1701.01329, 2017.

Simonovsky, Martin and Komodakis, Nikos. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.

Snijders, Tom A.B. and Nowicki, Krzysztof. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, Jan 1997.

Sohn, Kihyuk, Lee, Honglak, and Yan, Xinchen. Learning structured output representation using deep conditional generative models. In NIPS, pp. 3483–3491, 2015.

Stewart, Russell, Andriluka, Mykhaylo, and Ng, Andrew Y. End-to-end people detection in crowded scenes. In CVPR, pp. 2325–2333, 2016.
Theis, Lucas, van den Oord, Aäron, and Bethge, Matthias. A note on the evaluation of generative models. CoRR, abs/1511.01844, 2015.

Vinyals, Oriol, Bengio, Samy, and Kudlur, Manjunath. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

Williams, Ronald J. and Zipser, David. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

Xu, Danfei, Zhu, Yuke, Choy, Christopher Bongsoo, and Fei-Fei, Li. Scene graph generation by iterative message passing. In CVPR, 2017.

Yu, Lantao, Zhang, Weinan, Wang, Jun, and Yu, Yong. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
Appendix

A. Max-Pooling Matching

In this section we briefly review the max-pooling matching algorithm of Cho et al. (2014). In its relaxed form, a continuous correspondence matrix $X^* \in [0,1]^{k \times n}$ between nodes of graphs $G$ and $\tilde{G}$ is determined based on similarities of node pairs $i, j \in G$ and $a, b \in \tilde{G}$, represented as matrix elements $S_{ia;jb} \in \mathbb{R}^+$.

Let $x^*$ denote the column-wise replica of $X^*$. The relaxed graph matching problem is expressed as the quadratic programming task $x^* = \arg\max_x x^T S x$ such that $\sum_{i=1}^{n} x_{ia} \leq 1$, $\sum_{a=1}^{k} x_{ia} \leq 1$, and $x \in [0,1]^{kn}$. The optimization strategy of choice is derived to be equivalent to the power method with the iterative update rule $x^{(t+1)} = S x^{(t)} / \lVert S x^{(t)} \rVert_2$. The starting correspondences $x^{(0)}$ are initialized as uniform and the rule is iterated until convergence; in our use case we run it for a fixed number of iterations.

In the context of graph matching, the matrix-vector product $Sx$ can be interpreted as sum-pooling over match candidates: $x_{ia} \leftarrow x_{ia} S_{ia;ia} + \sum_{j \in N_i} \sum_{b \in N_a} x_{jb} S_{ia;jb}$, where $N_i$ and $N_a$ denote the sets of neighbors of nodes $i$ and $a$. The authors argue that this formulation is strongly influenced by uninformative or irrelevant elements and propose a more robust max-pooling version, which considers only the best pairwise similarity from each neighbor: $x_{ia} \leftarrow x_{ia} S_{ia;ia} + \sum_{j \in N_i} \max_{b \in N_a} x_{jb} S_{ia;jb}$.
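A direct, unoptimized transcription of the max-pooling update above, as a sketch under the assumption that the affinities are available as a dense array `S[a, i, b, j]` corresponding to $S_{ia;jb}$ and that neighbor lists are precomputed; in practice the relaxed result is further rounded to a discrete assignment (e.g. with the Hungarian algorithm).

```python
import numpy as np

def max_pool_match(S, neighbors_G, neighbors_Gt, k, n, iterations=75):
    """Relaxed max-pooling matching (Cho et al., 2014) between G (n nodes) and G~ (k nodes)."""
    x = np.full((k, n), 1.0 / (k * n))          # uniform initial correspondences
    for _ in range(iterations):
        new_x = np.empty_like(x)
        for a in range(k):
            for i in range(n):
                value = x[a, i] * S[a, i, a, i]
                # Keep only the best pairwise similarity contributed by each neighbor j of i.
                for j in neighbors_G[i]:
                    candidates = [x[b, j] * S[a, i, b, j] for b in neighbors_Gt[a]]
                    value += max(candidates, default=0.0)
                new_x[a, i] = value
        x = new_x / np.linalg.norm(new_x)       # power-iteration style normalization
    return x
```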
B. Unregularized Autoencoder
The regularization in VAE works against achieving perfect reconstruction of the training data, especially for small embedding sizes. To understand the reconstruction ability of our architecture, we train it as unregularized in this section, i.e. with a deterministic encoder and without the KL-divergence term in Equation 1.

Unconditional models for QM9 achieve a mean test log-likelihood $\log p_\theta(G|z)$ of roughly −0.37 (about −0.50 for the implicit node probability model) for all $c \in \{20, 40, 60, 80\}$. While these log-likelihoods are significantly higher than in Tables 1 and 2, our architecture cannot achieve perfect reconstruction of inputs. We were able to increase the training log-likelihood to zero only on fixed small training sets of hundreds of examples, where the network could overfit. This indicates that the network has problems finding generally valid rules for the assembly of output tensors.