
GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders

Martin Simonovsky¹  Nikos Komodakis¹

arXiv:1802.03480v1 [cs.LG] 9 Feb 2018

Abstract

Deep learning on graphs has become a popular research topic with many applications. However, past work has concentrated on learning graph embedding tasks, which is in contrast with advances in generative models for images and text. Is it possible to transfer this progress to the domain of graphs? We propose to sidestep hurdles associated with linearization of such discrete structures by having a decoder output a probabilistic fully-connected graph of a predefined maximum size directly at once. Our method is formulated as a variational autoencoder. We evaluate on the challenging task of molecule generation.

1. Introduction

Deep learning on graphs has very recently become a popular research topic (Bronstein et al., 2017), with useful applications across fields such as chemistry (Gilmer et al., 2017), medicine (Ktena et al., 2017), or computer vision (Simonovsky & Komodakis, 2017). Past work has concentrated on learning graph embedding tasks so far, i.e. encoding an input graph into a vector representation. This is in stark contrast with fast-paced advances in generative models for images and text, which have seen a massive rise in the quality of generated samples. Hence, it is an intriguing question how one can transfer this progress to the domain of graphs, i.e. their decoding from a vector representation. Moreover, the desire for such a method has been mentioned in the past by Gómez-Bombarelli et al. (2016).

However, learning to generate graphs is a difficult problem for methods based on gradient optimization, as graphs are discrete structures. Unlike sequence (text) generation, graphs can have arbitrary connectivity and there is no clear best way to linearize their construction in a sequence of steps. On the other hand, learning the order for incremental construction involves discrete decisions, which are not differentiable.

In this work, we propose to sidestep these hurdles by having the decoder output a probabilistic fully-connected graph of a predefined maximum size directly at once. In a probabilistic graph, the existence of nodes and edges, as well as their attributes, are modeled as independent random variables. The method is formulated in the framework of variational autoencoders (VAE) by Kingma & Welling (2013).

We demonstrate our method, coined GraphVAE, in cheminformatics on the task of molecule generation. Molecular datasets are a challenging but convenient testbed for our generative model, as they easily allow for both qualitative and quantitative tests of decoded samples. While our method is applicable to generating smaller graphs only and its performance leaves space for improvement, we believe our work is an important initial step towards powerful and efficient graph decoders.

2. Related work

Graph Decoders. Graph generation has been largely unexplored in deep learning. The closest work to ours is by Johnson (2017), who incrementally constructs a probabilistic (multi)graph as a world representation according to a sequence of input sentences to answer a query. While our model also outputs a probabilistic graph, we do not assume having a prescribed order of construction transformations available and we formulate the learning problem as an autoencoder.

Xu et al. (2017) learn to produce a scene graph from an input image. They construct a graph from a set of object proposals, provide initial embeddings to each node and edge, and use message passing to obtain a consistent prediction. In contrast, our method is a generative model which produces a probabilistic graph from a single opaque vector, without specifying the number of nodes or the structure explicitly.

Related work pre-dating deep learning includes random graphs (Erdos & Rényi, 1960; Barabási & Albert, 1999), stochastic blockmodels (Snijders & Nowicki, 1997), or state transition matrix learning (Gong & Xiang, 2003).

¹Université Paris Est & École des Ponts ParisTech, Champs sur Marne, France. Correspondence to: Martin Simonovsky <[email protected]>.

Figure 1. Illustration of the proposed variational graph autoencoder. Starting from a discrete attributed graph G = (A, E, F) on n nodes (e.g. a representation of propylene oxide), the stochastic graph encoder qφ(z|G) embeds the graph into a continuous representation z. Given a point in the latent space, our novel graph decoder pθ(G|z) outputs a probabilistic fully-connected graph G̃ = (Ã, Ẽ, F̃) on predefined k ≥ n nodes, from which discrete samples may be drawn. The process can be conditioned on label y for controlled sampling at test time. Reconstruction ability of the autoencoder is facilitated by approximate graph matching for aligning G with G̃.

Discrete Data Decoders. Text is the most common discrete representation. Generative models there are usually trained in a maximum likelihood fashion by teacher forcing (Williams & Zipser, 1989), which avoids the need to backpropagate through output discretization by feeding the ground truth instead of the past sample at each step. Bengio et al. (2015) argued this may lead to exposure bias, i.e. a possibly reduced ability to recover from own mistakes. Recently, efforts have been made to overcome this problem, notably by computing a differentiable approximation using the Gumbel distribution (Kusner & Hernández-Lobato, 2016) or by bypassing the problem by learning a stochastic policy in reinforcement learning (Yu et al., 2017). Our work also circumvents the non-differentiability problem, namely by formulating the loss on a probabilistic graph.

Molecule Decoders. Generative models may become promising for de novo design of molecules fulfilling certain criteria by being able to search for them over a continuous embedding space (Olivecrona et al., 2017). With that in mind, we propose a conditional version of our model. While molecules have an intuitive representation as graphs, the field has had to resort to textual representations with fixed syntax, e.g. so-called SMILES strings, to exploit recent progress made in text generation with RNNs (Olivecrona et al., 2017; Segler et al., 2017; Gómez-Bombarelli et al., 2016). As their syntax is brittle, many invalid strings tend to be generated, which has been recently addressed by Kusner et al. (2017) by incorporating grammar rules into decoding. While encouraging, their approach does not guarantee semantic (chemical) validity, similarly to our method.

3. Method

We approach the task of graph generation by devising a neural network able to translate vectors in a continuous code space to graphs. Our main idea is to output a probabilistic fully-connected graph and use a standard graph matching algorithm to align it to the ground truth. The proposed method is formulated in the framework of variational autoencoders (VAE) by Kingma & Welling (2013), although other forms of regularized autoencoders would be equally suitable (Makhzani et al., 2015; Li et al., 2015a). We briefly recapitulate VAE below and continue with introducing our novel graph decoder together with an appropriate objective.

3.1. Variational Autoencoder

Let G = (A, E, F) be a graph specified with its adjacency matrix A, edge attribute tensor E, and node attribute matrix F. We wish to learn an encoder and a decoder to map between the space of graphs G and their continuous embedding z ∈ R^c, see Figure 1. In the probabilistic setting of a VAE, the encoder is defined by a variational posterior qφ(z|G) and the decoder by a generative distribution pθ(G|z), where φ and θ are learned parameters. Furthermore, there is a prior distribution p(z) imposed on the latent code representation as a regularization; we use a simplistic isotropic Gaussian prior p(z) = N(0, I). The whole model is trained by minimizing the upper bound on the negative log-likelihood −log pθ(G) (Kingma & Welling, 2013):

L(φ, θ; G) = E_{qφ(z|G)}[−log pθ(G|z)] + KL[qφ(z|G) || p(z)]        (1)
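To make the objective concrete, the following is a minimal PyTorch-style sketch of one evaluation of Equation 1; the encoder, decoder and reconstruction-loss callables are placeholders of our own, not the exact modules described later in Section 3.5.

```python
import torch

def vae_loss(encoder, decoder, G, recon_loss_fn):
    """One evaluation of Equation 1 for a single input graph G.

    encoder(G) is assumed to return mean and log-variance of q_phi(z|G);
    decoder(z) returns the probabilistic graph (A~, E~, F~);
    recon_loss_fn implements -log p_theta(G|z), i.e. Equation 3 below.
    """
    mu, logvar = encoder(G)
    # Re-parameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    G_tilde = decoder(z)
    recon = recon_loss_fn(G, G_tilde)
    # Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```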

The first term of L, the reconstruction loss, enforces high similarity of sampled generated graphs to the input graph G. The second term, the KL-divergence, regularizes the code space to allow for sampling z directly from p(z) instead of from qφ(z|G) later. The dimensionality of z is usually fairly small so that the autoencoder is encouraged to learn a high-level compression of the input instead of learning to simply copy any given input. While the regularization is independent of the input space, the reconstruction loss must be specifically designed for each input modality. In the following, we introduce our graph decoder together with an appropriate reconstruction loss.

3.2. Probabilistic Graph Decoder

Graphs are discrete objects, ultimately. While this does not pose a challenge for encoding, as demonstrated by the recent developments in graph convolution networks (Gilmer et al., 2017), graph generation has been an open problem so far. In the related task of text sequence generation, the currently dominant approach is character-wise or word-wise prediction (Bowman et al., 2016). However, graphs can have arbitrary connectivity and there is no clear way to linearize their construction in a sequence of steps¹. On the other hand, iterative construction of discrete structures during training without step-wise supervision involves discrete decisions, which are not differentiable and therefore problematic for back-propagation.

¹While algorithms for canonical graph orderings are available (McKay & Piperno, 2014), Vinyals et al. (2015) empirically found that the linearization order matters when learning on sets.

Fortunately, the task can become much simpler if we restrict the domain to the set of all graphs on at most k nodes, where k is fairly small (in practice up to the order of tens). Under this assumption, handling dense graph representations is still computationally tractable. We propose to make the decoder output a probabilistic fully-connected graph G̃ = (Ã, Ẽ, F̃) on k nodes at once. This effectively sidesteps both problems mentioned above.

In probabilistic graphs, the existence of nodes and edges is modeled as Bernoulli variables, whereas node and edge attributes are multinomial variables. While not discussed in this work, continuous attributes could be easily modeled as Gaussian variables represented by their mean and variance. We assume all variables to be independent.

Each tensor of the representation of G̃ thus has a probabilistic interpretation. Specifically, the predicted adjacency matrix Ã ∈ [0, 1]^{k×k} contains both node probabilities Ã_{a,a} and edge probabilities Ã_{a,b} for nodes a ≠ b. The edge attribute tensor Ẽ ∈ R^{k×k×d_e} indicates class probabilities for edges and, similarly, the node attribute matrix F̃ ∈ R^{k×d_n} contains class probabilities for nodes.

The decoder itself is deterministic. Its architecture is a simple multi-layer perceptron (MLP) with three outputs in its last layer. A sigmoid activation function is used to compute Ã, whereas edge- and node-wise softmax is applied to obtain Ẽ and F̃, respectively. At test time, we are often interested in a (discrete) point estimate of G̃, which can be obtained by taking edge- and node-wise argmax in Ã, Ẽ, and F̃. Note that this can result in a discrete graph on fewer than k nodes.

3.3. Reconstruction Loss

Given a particular instance of a discrete input graph G on n ≤ k nodes and its probabilistic reconstruction G̃ on k nodes, evaluation of Equation 1 requires computation of the likelihood pθ(G|z) = P(G|G̃).

Since no particular ordering of nodes is imposed in either G̃ or G, and the matrix representation of graphs is not invariant to permutations of nodes, comparison of two graphs is hard. However, the approximate graph matching described further in Subsection 3.4 can obtain a binary assignment matrix X ∈ {0, 1}^{k×n}, where X_{a,i} = 1 only if node a ∈ G̃ is assigned to i ∈ G and X_{a,i} = 0 otherwise.

Knowledge of X allows us to map information between both graphs. Specifically, the input adjacency matrix is mapped to the predicted graph as A' = XAX^T, whereas the predicted node attribute matrix and slices of the edge attribute tensor are transferred to the input graph as F̃' = X^T F̃ and Ẽ'_{·,·,l} = X^T Ẽ_{·,·,l} X. The maximum likelihood estimates, i.e. cross-entropies, of the respective variables are as follows:

log p(A'|z) = 1/k ∑_a [A'_{a,a} log Ã_{a,a} + (1 − A'_{a,a}) log(1 − Ã_{a,a})]
            + 1/(k(k−1)) ∑_{a≠b} [A'_{a,b} log Ã_{a,b} + (1 − A'_{a,b}) log(1 − Ã_{a,b})]

log p(F|z) = 1/n ∑_i log F_{i,·}^T F̃'_{i,·}

log p(E|z) = 1/(||A||_1 − n) ∑_{i≠j} log E_{i,j,·}^T Ẽ'_{i,j,·}        (2)

where we assumed that F and E are encoded in one-hot notation. The formulation considers the existence of both matched and unmatched nodes and edges but the attributes of only the matched ones. Furthermore, averaging over nodes and edges separately has shown to be beneficial in training, as otherwise the edges dominate the likelihood. The overall reconstruction loss is a weighted sum of the previous terms:

−log p(G|z) = −λ_A log p(A'|z) − λ_F log p(F|z) − λ_E log p(E|z)        (3)
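A minimal PyTorch sketch of Equations 2 and 3, assuming the binary assignment matrix X from Subsection 3.4 is already available (as a float 0/1 tensor) and that F and E are one-hot; the small epsilon for numerical stability and all variable names are ours.

```python
import torch

def matched_recon_loss(A, E, F, A_t, E_t, F_t, X, lambdas=(1.0, 1.0, 1.0), eps=1e-10):
    """Sketch of Equations 2 and 3.

    A: (n, n) input adjacency, E: (n, n, d_e) one-hot edge attributes,
    F: (n, d_n) one-hot node attributes; A_t, E_t, F_t: predicted probabilistic
    graph on k >= n nodes; X: (k, n) assignment matrix.
    """
    k, n = X.shape

    # Map the input adjacency into the predicted node ordering: A' = X A X^T.
    A_p = X @ A @ X.t()
    # Map predictions back into the input node ordering: F~' = X^T F~, E~' slices.
    F_tp = X.t() @ F_t
    E_tp = torch.einsum('ka,klc,lb->abc', X, E_t, X)

    def bce(target, prob):
        return target * torch.log(prob + eps) + (1 - target) * torch.log(1 - prob + eps)

    # Node and edge existence terms, averaged separately as in Equation 2.
    off_k = ~torch.eye(k, dtype=torch.bool)
    log_pA = bce(torch.diagonal(A_p), torch.diagonal(A_t)).sum() / k \
        + bce(A_p[off_k], A_t[off_k]).sum() / (k * (k - 1))

    # Attribute terms for matched nodes and edges only.
    log_pF = torch.log((F * F_tp).sum(dim=1) + eps).sum() / n
    off_n = ~torch.eye(n, dtype=torch.bool)
    log_pE = torch.log((E * E_tp).sum(dim=2)[off_n] + eps).sum() / (A.abs().sum() - n)

    lam_A, lam_F, lam_E = lambdas
    return -(lam_A * log_pA + lam_F * log_pF + lam_E * log_pE)
```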
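For completeness, here is a minimal sketch of the deterministic decoder head described in Subsection 3.2. The hidden layer sizes follow the QM9 setup reported later in Section 4.2, but the module is illustrative only and omits the triangular-prediction symmetrization used for molecules.

```python
import torch
import torch.nn as nn

class GraphDecoder(nn.Module):
    """MLP mapping a latent code z to a probabilistic graph on k nodes."""

    def __init__(self, c, k, d_e, d_n, hidden=(128, 256, 512)):
        super().__init__()
        layers, prev = [], c
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.ReLU()]
            prev = h
        self.mlp = nn.Sequential(*layers)
        self.k, self.d_e, self.d_n = k, d_e, d_n
        # Three parallel output layers producing A~, E~ and F~.
        self.to_A = nn.Linear(prev, k * k)
        self.to_E = nn.Linear(prev, k * k * d_e)
        self.to_F = nn.Linear(prev, k * d_n)

    def forward(self, z):
        h = self.mlp(z)
        # Sigmoid for node/edge existence, softmax over attribute classes.
        A = torch.sigmoid(self.to_A(h)).view(-1, self.k, self.k)
        E = torch.softmax(self.to_E(h).view(-1, self.k, self.k, self.d_e), dim=-1)
        F = torch.softmax(self.to_F(h).view(-1, self.k, self.d_n), dim=-1)
        return A, E, F
```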

3.4. Graph Matching

The goal of (second-order) graph matching is to find correspondences X ∈ {0, 1}^{k×n} between nodes of graphs G and G̃ based on the similarities of their node pairs S : (i, j) × (a, b) → R^+ for i, j ∈ G and a, b ∈ G̃. It can be expressed as an integer quadratic programming problem of similarity maximization over X and is typically approximated by relaxation of X into the continuous domain: X* ∈ [0, 1]^{k×n} (Cho et al., 2014). For our use case, the similarity function is defined as follows:

S((i, j), (a, b)) = (E_{i,j,·}^T Ẽ_{a,b,·}) A_{i,j} Ã_{a,b} Ã_{a,a} Ã_{b,b} [i ≠ j ∧ a ≠ b]
                  + (F_{i,·}^T F̃_{a,·}) Ã_{a,a} [i = j ∧ a = b]        (4)

The first term evaluates the similarity between edge pairs and the second term between node pairs, [·] being the Iverson bracket. Note that the scores consider both feature compatibility (F̃ and Ẽ) and existential compatibility (Ã), which has empirically led to more stable assignments during training. To summarize the motivation behind both Equations 3 and 4, our method aims to find the best graph matching and then further improve on it by gradient descent on the loss. Given the stochastic way of training deep networks, we argue that solving the matching step only approximately is sufficient. This is conceptually similar to the approach for learning to output unordered sets by Vinyals et al. (2015), where the closest ordering of the training data is sought.

In practice, we are looking for a graph matching algorithm robust to noisy correspondences which can be easily implemented on GPU in batch mode. Max-pooling matching (MPM) by Cho et al. (2014) is a simple but effective algorithm following the iterative scheme of power methods, see Appendix A for details. It can be used in batch mode if similarity tensors are zero-padded, i.e. S((i, j), (a, b)) = 0 for n < i, j ≤ k, and the number of iterations is fixed.

Max-pooling matching outputs a continuous assignment matrix X*. Unfortunately, attempts to directly use X* instead of X in Equation 3 performed badly, as did experiments with direct maximization of X* or soft discretization with softmax or straight-through Gumbel softmax (Jang et al., 2016). We therefore discretize X* to X using the Hungarian algorithm to obtain a strict one-to-one mapping². While this operation is non-differentiable, gradient can still flow to the decoder directly through the loss function and training convergence proceeds without problems. Note that this approach is often taken in works on object detection, e.g. Stewart et al. (2016), where a set of detections needs to be matched to a set of ground truth bounding boxes and treated as fixed before computing a differentiable loss.

²Some predicted nodes are not assigned for n < k. Our current implementation performs this step on CPU, although a GPU version has been published (Date & Nagi, 2016).

3.5. Further Details

Encoder. A feed-forward network with edge-conditioned graph convolutions (ECC) (Simonovsky & Komodakis, 2017) is used as the encoder, although any other graph embedding method is applicable. As our edge attributes are categorical, a single linear layer for the filter-generating network in ECC is sufficient. Due to the small graph sizes, no pooling is used in the encoder except for the global one, for which we employ gated pooling by Li et al. (2015b). As usual in VAE, we formulate the encoder as probabilistic and enforce a Gaussian distribution of qφ(z|G) by having the last encoder layer output 2c features interpreted as mean and variance, allowing us to sample z_l ∼ N(µ_l(G), σ_l(G)) for l ∈ 1, .., c using the re-parameterization trick (Kingma & Welling, 2013).

Disentangled Embedding. In practice, rather than random drawing of graphs, one often desires more control over the properties of the generated graphs. In such a case, we follow Sohn et al. (2015) and condition both encoder and decoder on a label vector y associated with each input graph G. The decoder pθ(G|z, y) is fed a concatenation of z and y, while in the encoder qφ(z|G, y), y is concatenated to every node's features just before the graph pooling layer. If the size of the latent space c is small, the decoder is encouraged to exploit information in the label.

Limitations. The proposed model is expected to be useful only for generating small graphs. This is due to the growth of GPU memory requirements and the number of parameters (O(k²)) as well as matching complexity (O(k⁴)), with a small decrease in quality for high values of k. In Section 4 we demonstrate results for up to k = 38. Nevertheless, for many applications even generation of small graphs is still very useful.

4. Evaluation

We demonstrate our method on the task of molecule generation by evaluating on two large public datasets of organic molecules, QM9 and ZINC.

4.1. Application in Cheminformatics

Quantitative evaluation of generative models of images and texts has been troublesome (Theis et al., 2015), as it is very difficult to measure the realism of generated samples in an automated and objective way. Thus, researchers frequently resort to qualitative evaluation and embedding plots. However, qualitative evaluation of graphs can be very unintuitive for humans to judge unless the graphs are planar and fairly simple.
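Before turning to the experiments, a naive reference implementation of the similarity function in Equation 4 may help make the matching input from Subsection 3.4 concrete; it loops over all node pairs for clarity, whereas the practical implementation is batched on GPU.

```python
import torch

def similarity_tensor(A, E, F, A_t, E_t, F_t):
    """Sketch of Equation 4 for an input graph (A, E, F) on n nodes and a
    predicted graph (A_t, E_t, F_t) on k nodes.

    Returns S as a (k, n, k, n) tensor indexed S[a, i, b, j] = S((i, j), (a, b)).
    """
    n, k = A.shape[0], A_t.shape[0]
    S = torch.zeros(k, n, k, n)
    for i in range(n):
        for j in range(n):
            for a in range(k):
                for b in range(k):
                    if i != j and a != b:
                        # Edge-pair term: attribute affinity times edge existence.
                        S[a, i, b, j] = (E[i, j] @ E_t[a, b]) * A[i, j] \
                            * A_t[a, b] * A_t[a, a] * A_t[b, b]
                    elif i == j and a == b:
                        # Node-pair term.
                        S[a, i, b, j] = (F[i] @ F_t[a]) * A_t[a, a]
    return S
```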

Fortunately, we found the graph representation of molecules, as undirected graphs with atoms as nodes and bonds as edges, to be a convenient testbed for generative models. On one hand, generated graphs can be easily visualized in standardized structural diagrams. On the other hand, chemical validity of graphs, as well as many further properties a molecule can fulfill, can be checked using software packages (SanitizeMol in RDKit) or simulations. This makes both qualitative and quantitative tests possible.

Chemical constraints on compatible types of bonds and atom valences make the space of valid graphs complicated and molecule generation challenging. In fact, a single addition or removal of an edge or a change in atom or bond type can make a molecule chemically invalid. Comparably, flipping a single pixel in an MNIST-like digit generation problem is of no issue.

To help the network in this application, we introduce three remedies. First, we make the decoder output symmetric Ã and Ẽ by predicting their (upper) triangular parts only, as undirected graphs are a sufficient representation for molecules. Second, we use the prior knowledge that molecules are connected and, at test time only, construct a maximum spanning tree on the set of probable nodes {a : Ã_{a,a} ≥ 0.5} in order to include its edges (a, b) in the discrete pointwise estimate of the graph even if Ã_{a,b} < 0.5 originally. Third, we do not generate Hydrogen explicitly and let it be added as "padding" during the chemical validity check.

4.2. QM9 Dataset

The QM9 dataset (Ramakrishnan et al., 2014) contains about 134k organic molecules of up to 9 heavy (non-Hydrogen) atoms with 4 distinct atomic numbers and 4 bond types; we set k = 9, d_e = 4 and d_n = 4. We set aside 10k samples for testing and 10k for validation (model selection).

We compare our unconditional model to the character-based generator of Gómez-Bombarelli et al. (2016) (CVAE) and the grammar-based generator of Kusner et al. (2017) (GVAE). We used the code and architecture of Kusner et al. (2017) for both baselines, adapting the maximum input length to the smallest possible. In addition, we demonstrate a conditional generative model for an artificial task of generating molecules given a histogram of heavy atoms as a 4-dimensional label y, the success of which can be easily validated.

Setup. The encoder has two graph convolutional layers (32 and 64 channels) with identity connection, batchnorm, and ReLU, followed by the graph-level output formulation in Equation 7 of Li et al. (2015b) with auxiliary networks being a single fully connected layer (FCL) with 128 output channels, finalized by a FCL outputting (µ, σ). The decoder has 3 FCLs (128, 256, and 512 channels) with batchnorm and ReLU, followed by a parallel triplet of FCLs to output the graph tensors. We set c = 40, λ_A = λ_F = λ_E = 1, batch size 32, 75 MPM iterations, and train for 25 epochs with Adam with learning rate 1e-3 and β_1 = 0.5.

Embedding Visualization. To visually judge the quality and smoothness of the learned embedding z of our model, we may traverse it in two ways: along a slice and along a line. For the former, we randomly choose two c-dimensional orthonormal vectors and sample z in a regular grid pattern over the induced 2D plane. For the latter, we randomly choose two molecules G(1), G(2) of the same label from the test set and interpolate between their embeddings µ(G(1)), µ(G(2)). This also evaluates the encoder, and therefore benefits from low reconstruction error.

We plot two planes in Figure 2, for a frequent label (left) and a less frequent label in QM9 (right). Both images show a varied and fairly smooth mix of molecules. The left image has many valid samples broadly distributed across the plane, as presumably the autoencoder had to fit a large portion of the database into this space. The right one exhibits a stronger effect of regularization, as valid molecules tend to lie only around the center.

An example of several interpolations is shown in Figure 3. We can find both meaningful (1st, 2nd and 4th row) and less meaningful transitions, though many samples on the lines do not form chemically valid compounds.

Decoder Quality Metrics. The quality of a conditional decoder can be evaluated by the validity and variety of generated graphs. For a given label y^(l), we draw n_s = 10^4 samples z^(l,s) ∼ p(z) and compute the discrete point estimate of their decodings Ĝ^(l,s) = arg max pθ(G|z^(l,s), y^(l)).

Let V^(l) be the list of chemically valid molecules from Ĝ^(l,s) and C^(l) be the list of chemically valid molecules with atom histograms equal to y^(l). We are interested in the ratios Valid^(l) = |V^(l)|/n_s and Accurate^(l) = |C^(l)|/n_s. Furthermore, let Unique^(l) = |set(C^(l))|/|C^(l)| be the fraction of unique correct graphs and Novel^(l) = 1 − |set(C^(l)) ∩ QM9|/|set(C^(l))| the fraction of novel out-of-dataset graphs; we define Unique^(l) = 0 and Novel^(l) = 0 if |C^(l)| = 0. Finally, the introduced metrics are aggregated by frequencies of labels in QM9, e.g. Valid = ∑_l Valid^(l) freq(y^(l)). Unconditional decoders are evaluated by assuming there is just a single label, therefore Valid = Accurate.
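A minimal sketch of how these ratios can be computed, assuming decoded graphs have already been converted to SMILES strings (None for failed conversions), that the dataset SMILES are available as a set of canonical strings, and that atom_histogram is a user-supplied helper; the function names are ours.

```python
from rdkit import Chem

def decoder_metrics(sampled_smiles, target_histogram, dataset_smiles, atom_histogram):
    """Compute Valid, Accurate, Unique and Novel for one label y.

    sampled_smiles: list of n_s SMILES strings (or None for invalid decodings).
    target_histogram: heavy-atom histogram the decoder was conditioned on.
    dataset_smiles: set of canonical SMILES of the training set (e.g. QM9).
    atom_histogram: callable mapping an RDKit Mol to its heavy-atom histogram.
    """
    n_s = len(sampled_smiles)
    valid = [s for s in sampled_smiles
             if s is not None and Chem.MolFromSmiles(s) is not None]
    correct = [s for s in valid
               if atom_histogram(Chem.MolFromSmiles(s)) == target_histogram]
    # Canonicalize so that duplicates and dataset membership are well defined.
    unique_correct = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in correct}

    return {
        'Valid': len(valid) / n_s,
        'Accurate': len(correct) / n_s,
        'Unique': len(unique_correct) / len(correct) if correct else 0.0,
        'Novel': (1 - len(unique_correct & dataset_smiles) / len(unique_correct))
                 if unique_correct else 0.0,
    }
```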

Figure 2. Decodings of latent space points of a conditional model sampled over a random 2D plane in z-space of c = 40 (within 5 units from the center of coordinates). Left: samples conditioned on 7x Carbon, 1x Nitrogen, 1x Oxygen (12% of QM9). Right: samples conditioned on 5x Carbon, 1x Nitrogen, 3x Oxygen (2.6% of QM9). Color legend as in Figure 3.

In Table 1, we can see that on average 50% of generated molecules are chemically valid and, in the case of conditional models, about 40% have the correct label which the decoder was conditioned on. Larger embedding sizes c are less regularized, demonstrated by a higher number of Unique samples and by lower accuracy of the conditional model, as the decoder is forced less to rely on actual labels. The ratio of Valid samples shows less clear behavior, likely because the discrete performance is not directly optimized for. For all models, it is remarkable that about 60% of generated molecules are out of the dataset, i.e. the network has never seen them during training.

Looking at the baselines, CVAE can output only very few valid samples as expected, while GVAE generates the highest number of valid samples (60%) but of very low variance (less than 10%). Additionally, we investigate the importance of graph matching by using the identity assignment X instead and thus learning to reproduce particular node permutations in the training set, which correspond to the canonical ordering of SMILES strings from rdkit. This ablated model (denoted as NoGM in Table 1) produces many valid samples of lower variety and, surprisingly, outperforms GVAE in this regard. In comparison, our model can achieve good performance in both metrics at the same time.

Likelihood. Besides the application-specific metrics introduced above, we also report the evidence lower bound (ELBO) commonly used in VAE literature, which corresponds to −L(φ, θ; G) in our notation. In Table 1, we state mean bounds over the test set, using a single z sample per graph. We observe that both reconstruction loss and KL-divergence decrease due to larger c providing more freedom. However, there seems to be no strong correlation between ELBO and Valid, which makes model selection somewhat difficult.

Implicit Node Probabilities. Our decoder assumes independence of node and edge probabilities, which allows for isolated nodes or edges. Making further use of the fact that molecules are connected graphs, here we investigate the effect of making node probabilities a function of edge probabilities. Specifically, we consider the probability for node a as that of its most probable edge: Ã_{a,a} = max_b Ã_{a,b}.

The evaluation on QM9 in Table 2 shows a clear improvement in Valid, Accurate, and Novel metrics in both the conditional and unconditional setting. However, this is paid for by lower variability and higher reconstruction loss. This indicates that while the new constraint is useful, the model cannot fully cope with it.

4.3. ZINC Dataset

The ZINC dataset (Irwin et al., 2012) contains about 250k drug-like organic molecules of up to 38 heavy atoms with 9 distinct atomic numbers and 4 bond types; we set k = 38, d_e = 4 and d_n = 9 and use the same split strategy as with QM9. We investigate the degree of scalability of an unconditional generative model.

Setup. The setup is equivalent to that for QM9 but with a wider encoder (64, 128, 256 channels).
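Returning to the implicit node probability variant described above for QM9, a one-function sketch of the modification, assuming a batched adjacency tensor with node probabilities on the diagonal:

```python
import torch

def implicit_node_probs(A_t):
    """Replace each diagonal entry A~[a, a] with max_b A~[a, b] over b != a.

    A_t: (batch, k, k) predicted adjacency with edge probabilities off-diagonal.
    """
    k = A_t.shape[-1]
    eye = torch.eye(k, dtype=torch.bool, device=A_t.device)
    node_probs = A_t.masked_fill(eye, 0.0).max(dim=-1).values    # (batch, k)
    # Swap the original diagonal for the per-node maximum edge probability.
    return A_t - torch.diag_embed(torch.diagonal(A_t, dim1=-2, dim2=-1)) \
               + torch.diag_embed(node_probs)
```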

Figure 3. Linear interpolation between row-wise pairs of randomly chosen molecules in z-space of c = 40 in a conditional model. Color legend: encoder inputs (green), chemically invalid graphs (red), valid graphs with wrong label (blue), valid and correct (white).

Decoder Quality Metrics. Our best model with c = 40 has achieved Valid = 0.135, which is clearly worse than for QM9. Using implicit node probabilities brought no improvement. For comparison, CVAE failed to generate any valid sample, while GVAE achieved Valid = 0.357 (models provided by Kusner et al. (2017), c = 56).

We attribute such low performance to a generally much higher chance of producing a chemically relevant inconsistency (the number of possible edges growing quadratically). To confirm the relationship between performance and graph size k, we kept only graphs not larger than k = 20 nodes, corresponding to 21% of ZINC, and obtained Valid = 0.341 (and Valid = 0.185 for k = 30 nodes, 92% of ZINC). To verify that the problem is likely not caused by our proposed graph matching loss, we evaluate it synthetically in the following.

Matching Robustness. Robust behavior of graph matching using our similarity function S is important for good performance of GraphVAE. Here we study graph matching in isolation to investigate its scalability. To that end, we add Gaussian noise N(0, ε_A), N(0, ε_E), N(0, ε_F) to each tensor of an input graph G, truncating and renormalizing to keep their probabilistic interpretation, to create its noisy version G_N. We are interested in the quality of matching between self, P[G, G], using the noisy assignment matrix X between G and G_N. The advantage over naively checking X for identity is the invariance to permutation of equivalent nodes.

In Table 3 we vary k and ε for each tensor separately and report mean accuracies (computed in the same fashion as the losses in Equation 3) over 100 random samples from ZINC with size up to k nodes. While we observe an expected fall of accuracy with stronger noise, the behavior is fairly robust with respect to increasing k at a fixed noise level, the most sensitive quantity being the adjacency matrix. Note that accuracies are not comparable across tables due to different dimensionalities of the random variables. We may conclude that the quality of the matching process is not a major hurdle to scalability.

5. Conclusion

In this work we addressed the problem of generating graphs from a continuous embedding in the context of variational autoencoders. We evaluated our method on two molecular datasets of different maximum graph size. While we managed to learn embeddings of reasonable quality on small molecules, our decoder had a hard time capturing complex chemical interactions for larger molecules. Nevertheless, we believe our method is an important initial step towards more powerful decoders and will spark interest in the community.

There are many avenues to follow for future work. Besides the obvious desire to improve the current method (for example, by incorporating a more powerful prior distribution or adding a recurrent mechanism for correcting mistakes), we would like to extend it beyond a proof of concept by applying it to real problems in chemistry, such as optimization of certain properties or predicting chemical reactions. An advantage of a graph-based decoder compared to a SMILES-based decoder is the possibility to predict detailed attributes of atoms and bonds in addition to the base structure, which might be useful in these tasks. Our autoencoder might also be used to pre-train graph encoders for fine-tuning on small datasets (Goh et al., 2017).

Table 1. Performance on conditional and unconditional QM9 models evaluated by mean test-time reconstruction log-likelihood (log pθ(G|z)), mean test-time evidence lower bound (ELBO), and decoding quality metrics (Section 4.2). Baselines CVAE (Gómez-Bombarelli et al., 2016) and GVAE (Kusner et al., 2017) are listed only for the embedding size with the highest Valid.

                        log pθ(G|z)   ELBO     Valid   Accurate   Unique   Novel
Cond.    Ours c = 20      -0.578     -0.722    0.565    0.467     0.314    0.598
         Ours c = 40      -0.504     -0.617    0.511    0.416     0.484    0.635
         Ours c = 60      -0.492     -0.585    0.520    0.406     0.583    0.613
         Ours c = 80      -0.475     -0.557    0.458    0.353     0.666    0.661
Uncond.  Ours c = 20      -0.660     -0.916    0.485    0.485     0.457    0.575
         Ours c = 40      -0.537     -0.744    0.542    0.542     0.618    0.617
         Ours c = 60      -0.486     -0.656    0.517    0.517     0.695    0.570
         Ours c = 80      -0.482     -0.628    0.557    0.557     0.760    0.616
         NoGM c = 80      -2.388     -2.553    0.810    0.810     0.241    0.610
         CVAE c = 60         –          –      0.103    0.103     0.675    0.900
         GVAE c = 20         –          –      0.602    0.602     0.093    0.809

Table 2. Performance on conditional and unconditional QM9 models with implicit node probabilities. Improvement with respect to Table 1 is emphasized in italics.

                             log pθ(G|z)   ELBO     Valid   Accurate   Unique   Novel
Cond.    Ours/imp c = 20       -0.784     -0.919    0.572    0.482     0.238    0.718
         Ours/imp c = 40       -0.671     -0.776    0.611    0.518     0.307    0.665
         Ours/imp c = 60       -0.618     -0.714    0.566    0.448     0.416    0.710
         Ours/imp c = 80       -0.627     -0.713    0.583    0.451     0.475    0.681
Uncond.  Ours/imp c = 20       -0.857     -1.091    0.533    0.533     0.228    0.610
         Ours/imp c = 40       -0.737     -0.932    0.562    0.562     0.420    0.758
         Ours/imp c = 60       -0.634     -0.797    0.587    0.587     0.459    0.730
         Ours/imp c = 80       -0.642     -0.777    0.571    0.571     0.520    0.719

Table 3. Mean accuracy of matching ZINC graphs to their noisy counterparts in a synthetic benchmark as a function of maximum graph size k.

Noise              k = 15   k = 20   k = 25   k = 30   k = 35   k = 40
ε_A,E,F = 0         99.55    99.52    99.45    99.40    99.47    99.46
ε_A = 0.4           90.95    89.55    86.64    87.25    87.07    86.78
ε_A = 0.8           82.14    81.01    79.62    79.67    79.07    78.69
ε_E = 0.4           97.11    96.42    95.65    95.90    95.69    95.69
ε_E = 0.8           92.03    90.76    89.76    89.70    88.34    89.40
ε_F = 0.4           98.32    98.23    97.64    98.28    98.24    97.90
ε_F = 0.8           97.26    97.00    96.60    96.91    96.56    97.17

Acknowledgments

We thank Shell Xu Hu for discussions on variational methods, Shinjae Yoo for project motivation, and anonymous reviewers for their comments.

References

Barabási, Albert-László and Albert, Réka. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

Bengio, Samy, Vinyals, Oriol, Jaitly, Navdeep, and Shazeer, Noam. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pp. 1171–1179, 2015.

Bowman, Samuel R., Vilnis, Luke, Vinyals, Oriol, Dai, Andrew M., Józefowicz, Rafal, and Bengio, Samy. Generating sentences from a continuous space. In CoNLL, pp. 10–21, 2016.

Bronstein, Michael M, Bruna, Joan, LeCun, Yann, Szlam, Arthur, and Vandergheynst, Pierre. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Cho, Minsu, Sun, Jian, Duchenne, Olivier, and Ponce, Jean. Finding matches in a haystack: A max-pooling strategy for graph matching in the presence of outliers. In CVPR, pp. 2091–2098, 2014.

Date, Ketan and Nagi, Rakesh. GPU-accelerated Hungarian algorithms for the linear assignment problem. Parallel Computing, 57:52–72, 2016.

Erdos, Paul and Rényi, Alfréd. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

Gilmer, Justin, Schoenholz, Samuel S., Riley, Patrick F., Vinyals, Oriol, and Dahl, George E. Neural message passing for quantum chemistry. In ICML, pp. 1263–1272, 2017.

Goh, Garrett B., Siegel, Charles, Vishnu, Abhinav, and Hodas, Nathan O. ChemNet: A transferable and generalizable deep neural network for small-molecule property prediction. arXiv preprint arXiv:1712.02734, 2017.

Gómez-Bombarelli, Rafael, Duvenaud, David K., Hernández-Lobato, José Miguel, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Automatic chemical design using a data-driven continuous representation of molecules. CoRR, abs/1610.02415, 2016.

Gong, Shaogang and Xiang, Tao. Recognition of group activities using dynamic probabilistic networks. In ICCV, pp. 742–749, 2003.

Irwin, John J., Sterling, Teague, Mysinger, Michael M., Bolstad, Erin S., and Coleman, Ryan G. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.

Jang, Eric, Gu, Shixiang, and Poole, Ben. Categorical reparameterization with gumbel-softmax. CoRR, abs/1611.01144, 2016.

Johnson, Daniel D. Learning graphical state transitions. In ICLR, 2017.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.

Ktena, Sofia Ira, Parisot, Sarah, Ferrante, Enzo, Rajchl, Martin, Lee, Matthew C. H., Glocker, Ben, and Rueckert, Daniel. Distance metric learning using graph convolutional networks: Application to functional brain networks. In MICCAI, 2017.

Kusner, Matt J. and Hernández-Lobato, José Miguel. GANS for sequences of discrete elements with the gumbel-softmax distribution. CoRR, abs/1611.04051, 2016.

Kusner, Matt J., Paige, Brooks, and Hernández-Lobato, José Miguel. Grammar variational autoencoder. In ICML, pp. 1945–1954, 2017.

Landrum, Greg. RDKit: Open-source cheminformatics. URL http://www.rdkit.org.

Li, Yujia, Swersky, Kevin, and Zemel, Richard S. Generative moment matching networks. In ICML, pp. 1718–1727, 2015a.

Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard S. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2015b.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian J. Adversarial autoencoders. CoRR, abs/1511.05644, 2015.

McKay, Brendan D. and Piperno, Adolfo. Practical graph isomorphism, II. Journal of Symbolic Computation, 60(0):94–112, 2014. ISSN 0747-7171.

Olivecrona, Marcus, Blaschke, Thomas, Engkvist, Ola, and Chen, Hongming. Molecular de novo design through deep reinforcement learning. CoRR, abs/1704.07555, 2017.

Ramakrishnan, Raghunathan, Dral, Pavlo O, Rupp, Matthias, and von Lilienfeld, O Anatole. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.

Segler, Marwin H. S., Kogej, Thierry, Tyrchan, Christian, and Waller, Mark P. Generating focussed molecule libraries for drug discovery with recurrent neural networks. CoRR, abs/1701.01329, 2017.

Simonovsky, Martin and Komodakis, Nikos. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.

Snijders, Tom A.B. and Nowicki, Krzysztof. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, Jan 1997.

Sohn, Kihyuk, Lee, Honglak, and Yan, Xinchen. Learning structured output representation using deep conditional generative models. In NIPS, pp. 3483–3491, 2015.

Stewart, Russell, Andriluka, Mykhaylo, and Ng, Andrew Y. End-to-end people detection in crowded scenes. In CVPR, pp. 2325–2333, 2016.

Theis, Lucas, van den Oord, Aäron, and Bethge, Matthias. A note on the evaluation of generative models. CoRR, abs/1511.01844, 2015.

Vinyals, Oriol, Bengio, Samy, and Kudlur, Manjunath. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

Williams, Ronald J. and Zipser, David. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

Xu, Danfei, Zhu, Yuke, Choy, Christopher Bongsoo, and Fei-Fei, Li. Scene graph generation by iterative message passing. In CVPR, 2017.

Yu, Lantao, Zhang, Weinan, Wang, Jun, and Yu, Yong. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.

Appendix
A. Max-Pooling Matching
In this section we briefly review the max-pooling matching algorithm of Cho et al. (2014). In its relaxed form, a continuous correspondence matrix X* ∈ [0, 1]^{k×n} between nodes of graphs G and G̃ is determined based on similarities of node pairs i, j ∈ G and a, b ∈ G̃ represented as matrix elements S_{ia;jb} ∈ R^+.

Let x* denote the column-wise replica of X*. The relaxed graph matching problem is expressed as the quadratic programming task x* = arg max_x x^T S x such that ∑_{i=1}^n x_{ia} ≤ 1, ∑_{a=1}^k x_{ia} ≤ 1, and x ∈ [0, 1]^{kn}. The optimization strategy of choice is derived to be equivalent to the power method with the iterative update rule x^{(t+1)} = S x^{(t)} / ||S x^{(t)}||_2. The starting correspondences x^{(0)} are initialized as uniform and the rule is iterated until convergence; in our use case we run it for a fixed number of iterations.

In the context of graph matching, the matrix-vector product Sx can be interpreted as sum-pooling over match candidates: x_{ia} ← x_{ia} S_{ia;ia} + ∑_{j∈N_i} ∑_{b∈N_a} x_{jb} S_{ia;jb}, where N_i and N_a denote the sets of neighbors of nodes i and a. The authors argue that this formulation is strongly influenced by uninformative or irrelevant elements and propose a more robust max-pooling version, which considers only the best pairwise similarity from each neighbor: x_{ia} ← x_{ia} S_{ia;ia} + ∑_{j∈N_i} max_{b∈N_a} x_{jb} S_{ia;jb}.
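A dense NumPy sketch of this scheme, together with the Hungarian discretization from Subsection 3.4; sums and maxima are taken over all nodes, which is equivalent to restricting them to neighborhoods whenever S is non-negative and zero for non-neighboring pairs (as with Equation 4).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_pool_match(S, n_iter=75):
    """Max-pooling matching (Cho et al., 2014), dense reference sketch.

    S: (k, n, k, n) similarity tensor with S[a, i, b, j] = S_{ia;jb}.
    Returns the relaxed (k, n) correspondence matrix X* and its discretization X.
    """
    k, n = S.shape[0], S.shape[1]
    x = np.full((k, n), 1.0 / (k * n))                 # uniform initialization
    # Self-similarities S_{ia;ia} gathered into a (k, n) matrix.
    diag = S[np.arange(k)[:, None], np.arange(n)[None, :],
             np.arange(k)[:, None], np.arange(n)[None, :]]
    for _ in range(n_iter):
        # Max over candidate matches b of each other node j, then sum over j.
        pooled = (x[None, None, :, :] * S).max(axis=2).sum(axis=2)
        x = x * diag + pooled
        x /= np.linalg.norm(x)                         # power-method normalization
    # Hungarian algorithm: strict one-to-one assignment maximizing the scores.
    rows, cols = linear_sum_assignment(-x.T)           # rows: input nodes i
    X = np.zeros((k, n))
    X[cols, rows] = 1.0
    return x, X
```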

B. Unregularized Autoencoder

The regularization in VAE works against achieving perfect reconstruction of training data, especially for small embedding sizes. To understand the reconstruction ability of our architecture, we train it as unregularized in this section, i.e. with a deterministic encoder and without the KL-divergence term in Equation 1.

Unconditional models for QM9 achieve a mean test log-likelihood log pθ(G|z) of roughly −0.37 (about −0.50 for the implicit node probability model) for all c ∈ {20, 40, 60, 80}. While these log-likelihoods are significantly higher than in Tables 1 and 2, our architecture cannot achieve perfect reconstruction of inputs. We were able to increase training log-likelihood to zero only on fixed small training sets of hundreds of examples, where the network could overfit. This indicates that the network has problems finding generally valid rules for assembly of the output tensors.
