0% found this document useful (0 votes)

39 views18 pages

G AF: F - A M M G G: Raph A LOW Based Utoregressive Odel For Olecular Raph Eneration

1) GraphAF is a flow-based autoregressive model for molecular graph generation that combines the advantages of autoregressive and flow-based approaches. 2) It has a high model capacity for density estimation and an efficient parallel training process. 3) The sampling process is autoregressive and iterative, allowing incorporation of chemical rules to generate 100% chemically valid molecules.

Uploaded by

ericchen859

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

39 views18 pages

G AF: F - A M M G G: Raph A LOW Based Utoregressive Odel For Olecular Raph Eneration

Uploaded by

ericchen859

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 18

Published as a conference paper at ICLR 2020

G RAPH AF: A F LOW- BASED AUTOREGRESSIVE

M ODEL FOR M OLECULAR G RAPH G ENERATION
Chence Shi*1 , Minkai Xu*2 , Zhaocheng Zhu3,4 , Weinan Zhang2 , Ming Zhang1 , Jian Tang3,5,6
1
Department of Computer Science, Peking University, China
2
Shanghai Jiao Tong University, China
3
Mila - Québec AI Institute, Canada
4
Université de Montréal, Canada
5
HEC Montréal, Canada
6
CIFAR AI Research Chair
arXiv:2001.09382v2 [cs.LG] 27 Feb 2020

{chenceshi,mzhang cs}@pku.edu.cn
{mkxu,wnzhang}@apex.sjtu.edu.cn
[email protected]
[email protected]

A BSTRACT
Molecular graph generation is a fundamental problem for drug discovery and
has been attracting growing attention. The problem is challenging since it re-
quires not only generating chemically valid molecular structures but also optimiz-
ing their chemical properties in the meantime. Inspired by the recent progress
in deep generative models, in this paper we propose a flow-based autoregressive
model for graph generation called GraphAF. GraphAF combines the advantages
of both autoregressive and flow-based approaches and enjoys: (1) high model
flexibility for data density estimation; (2) efficient parallel computation for train-
ing; (3) an iterative sampling process, which allows leveraging chemical domain
knowledge for valency checking. Experimental results show that GraphAF is able
to generate 68% chemically valid molecules even without chemical knowledge
rules and 100% valid molecules with chemical rules. The training process of
GraphAF is two times faster than the existing state-of-the-art approach GCPN.
After fine-tuning the model for goal-directed property optimization with reinforce-
ment learning, GraphAF achieves state-of-the-art performance on both chemical
property optimization and constrained property optimization.1

1 I NTRODUCTION
Designing novel molecular structures with desired properties is a fundamental problem in a variety
of applications such as drug discovery and material science. The problem is very challenging, since
the chemical space is discrete by nature, and the entire search space is huge, which is believed to be
as large as 1033 (Polishchuk et al., 2013). Machine learning techniques have seen a big opportunity
in molecular design thanks to the large amount of data in these domains. Recently, there are increas-
ing efforts in developing machine learning algorithms that can automatically generate chemically
valid molecular structures and meanwhile optimize their properties.
Specifically, significant progress has been achieved by representing molecular structures as graphs
and generating graph structures with deep generative models, e.g., Variational Autoencoders
(VAEs) (Kingma & Welling, 2013), Generative Adversarial Networks (GANs) (Goodfellow et al.,
2014) and Autoregressive Models (Van Oord et al., 2016). For example, Jin et al. (2018) proposed a
Junction Tree VAE (JT-VAE) for molecular structure encoding and decoding. De Cao & Kipf (2018)
studied how to use GANs for molecular graph generation. You et al. (2018a) proposed an approach
called Graph Convolutional Policy Network (GCPN), which formulated molecular graph genera-
tion as a sequential decision process and dynamically generates the nodes and edges based on the
∗
Equal contribution, with order determined by flipping a coin. Work was done during internship at Mila.
1
Code is available at https://github.com/DeepGraphLearning/GraphAF

1
Published as a conference paper at ICLR 2020

Table 1: Previous state-of-the-art algorithms for molecular graph generation. The comparison of training is
only conducted between autoregressive models.
Generative Model Sampling Process Training Process
Name
VAE GAN RNN Flow One-shot Iterative Sequential Parallel
JT-VAE X - - - - X - -
RVAE X - - - X - - -
GCPN - X - - - X X -
MRNN - - X - - X X -
GraphNVP - - - X X - - -
GraphAF - - - X - X - X

existing graph substructures. They used reinforcement learning to optimize the properties of gener-
ated graph structures. Recently, another very related work called MolecularRNN (MRNN) (Popova
et al., 2019) proposed to use an autoregressive model for molecular graph generation. The autore-
gressive based approaches including both GCPN and MRNN have demonstrated very competitive
performance in a variety of tasks on molecular graph generation.
Recently, besides the aforementioned three types of generative models, normalizing flows have made
significant progress and have been successfully applied to a variety of tasks including density esti-
mation (Dinh et al., 2016; Papamakarios et al., 2017), variational inference (Kingma et al., 2016;
Louizos & Welling, 2017; Rezende & Mohamed, 2015), and image generation (Kingma & Dhariwal,
2018). Flow-based approaches define invertible transformations between a latent base distribution
(e.g. Gaussian distribution) and real-world high-dimensional data (e.g. images and speech). Such an
invertible mapping allows the calculation of the exact data likelihood. Meanwhile, by using multiple
layers of non-linear transformation between the hidden space and observation space, flows have a
high capacity to model the data density. Moreover, different architectures can be designed to pro-
mote fast training (Papamakarios et al., 2017) or fast sampling (Kingma et al., 2016) depending on
the requirement of different applications.
Inspired by existing work on autoregressive models and recent progress of deep generative mod-
els with normalizing flows, we propose a flow-based autoregressive model called GraphAF for
molecular graph generation. GraphAF effectively combines the advantages of autoregressive and
flow-based approaches. It has a high model capacity and hence is capable of modeling the density
of real-world molecule data. The sampling process of GraphAF is designed as an autoregressive
model, which dynamically generates the nodes and edges based on existing sub-graph structures.
Similar to existing models such as GCPN and MRNN, such a sequential generation process allows
leveraging chemical domain knowledge and valency checking in each generation step, which guar-
antees the validity of generated molecular structures. Meanwhile, different from GCPN and MRNN
as an autoregressive model during training, GraphAF defines a feedforward neural network from
molecular graph structures to the base distribution and is therefore able to compute the exact data
likelihood in parallel. As a result, the training process of GraphAF is very efficient.
We conduct extensive experiments on the standard ZINC (Irwin et al., 2012) dataset. Results show
that the training of GraphAF is significantly efficient, which is two times faster than the state-of-the-
art model GCPN. The generated molecules are 100% valid by incorporating the chemical rules dur-
ing generation. We are also surprised to find that even without using the chemical rules for valency
checking during generation, the percentage of valid molecules generated by GraphAF can be still
as high as 68%, which is significantly higher than existing state-of-the-art GCPN. This shows that
GraphAF indeed has the high model capability to learn the data distribution of molecule structures.
We further fine-tune the generation process with reinforcement learning to optimize the chemical
properties of generated molecules. Results show that GraphAF significantly outperforms previous
state-of-the-art GCPN on both property optimization and constrained property optimization tasks.

2 R ELATED W ORK

A variety of deep generative models have been proposed for molecular graph generation re-
cently (Segler et al., 2017; Olivecrona et al., 2017; Samanta et al., 2018; Neil et al., 2018). The
RVAE model (Ma et al., 2018) used a variational autoencoder for molecule generation, and pro-
posed a novel regularization framework to ensure semantic validity. Jin et al. (2018) proposed to

2
Published as a conference paper at ICLR 2020

represent a molecule as a junction tree of chemical scaffolds and proposed the JT-VAE model for
molecule generation. For the VAE-based approaches, the optimization of chemical properties is
usually done by searching in the latent space with Bayesian Optimization (Jin et al., 2018). De Cao
& Kipf (2018) used Generative Adversarial Networks for molecule generation. The state-of-the-art
models are built on autoregressive based approaches (You et al., 2018a; Popova et al., 2019). (You
et al., 2018a) formulated the problem as a sequential decision process by dynamically adding new
nodes and edges based on current sub-graph structures, and the generation policy network is trained
by a reinforcement learning framework. Recently, Popova et al. (2019) proposed an autoregressive
model called MolecularRNN to generate new nodes and edges based on the generated nodes and
edge sequences. The iterative nature of autoregressive model allows effectively leveraging chemical
rules for valency checking during generation and hence the proportion of valid molecules generated
by these models is very high. However, due to the sequential generation nature, the training process
is usually slow. Our GraphAF approach enjoys the advantage of iterative generation process like au-
toregressive models (the mapping from latent space to observation space) and meanwhile calculates
the exact likelihood corresponding to a feedforward neural network (the mapping from observation
space to latent space), which can be implemented efficiently through parallel computation.
Two recent work—Graph Normalizing Flows (GNF) (Liu et al., 2019) and GraphNVP (Madhawa
et al., 2019)—are also flow-based approaches for graph generation. However, our work is funda-
mentally different from their work. GNF defines a normalizing flow from a base distribution to the
hidden node representations of a pretrained Graph Autoencoders. The generation scheme is done
through two separate stages by first generating the node embeddings with the normalizing flow and
then generate the graphs based on the generated node embeddings in the first stage. By contrast, in
GraphAF, we define an autoregressive flow from a base distribution directly to the molecular graph
structures, which can be trained end-to-end. GraphNVP also defines a normalizing flow from a base
distribution to the molecular graph structures. However, the generation process of GraphNVP is
one-shot, which cannot effectively capture graph structures and also cannot guarantee the validity of
generated molecules. In our GraphAF, we formulate the generation process as a sequential decision
process and effectively capture the sub-graph structures via graph neural networks, based on which
we define a policy function to generate the nodes and edges. The sequential generation process also
allows incorporating the chemical rules. As a result, the validity of the generated molecules can be
guaranteed. We summarize existing approaches in Table 1.

3 P RELIMINARIES
3.1 AUTOREGRESSIVE F LOW

A normalizing flow (Kobyzev et al., 2019) defines a parameterized invertible deterministic trans-
formation from a base distribution E (the latent space, e.g., Gaussian distribution) to real-world
observational space Z (e.g. images and speech). Let f : E → Z be an invertible transformation
where ∼ pE () is the base distribution, then we can compute the density function of real-world
data z, i.e., pZ (z), via the change-of-variables formula:
∂f −1 (z)
pZ (z) = pE fθ−1 (z) det θ

. (1)
∂z

Now considering two key processes of normalizing flows as a generative model: (1) Calculating
Data Likelihood: given a datapoint z, the exact density pZ (z) can be calculated by inverting the
transformation f , = fθ−1 (z); (2) Sampling: z can be sampled from the distribution pZ (z) by
first sample ∼ pE () and then perform the feedforward transformation z = fθ (). To efficiently
perform the above mentioned operations, fθ is required to be invertible with an easily computable
Jacobian determinant. Autoregressive flows (AF), originally proposed in Papamakarios et al. (2017),
is a variant that satisfies these criteria, which holds a triangular Jacobian matrix, and the determinant
can be computed linearly. Formally, given z ∈ RD (D is the dimension of observation data), the
autoregressive conditional probabilities can be parameterized as Gaussian distributions:
p(zd |z1:d−1 ) = N (zd |µd , (αd )2 ), where µd = gµ (z1:d−1 ; θ), αd = gα (z1:d−1 ; θ), (2)
where gµ and gα are unconstrained and positive scalar functions of z1:d−1 respectively to compute
the mean and deviation. In practice, these functions can be implemented as neural networks. The

3
Published as a conference paper at ICLR 2020

affine transformation of AF can be written as:

zd − µd
fθ (d ) = zd = µd + αd · d ; fθ−1 (zd ) = d = . (3)
αd
∂zi
The Jacobian matrix in AF is triangular, since ∂ is non-zero only for j ≤ i. Therefore, the determi-
QD j
nant can be efficiently computed through d=1 αd . Specifically, to perform density estimation, we
can apply all individual scalar affine transformations in parallel to compute the base density, each of
which depends on previous variables z1:d−1 ; to sample z, we can first sample ∈ RD and compute
z1 through the affine transformation, and then each subsequent zd can be computed sequentially
based on previously observed z1:d−1 .

3.2 G RAPH R EPRESENTATION L EARNING

Following existing work, we also represent a molecule as a graph G = (A, X), where A is the
adjacency tensor and X is the node feature matrix. Assuming there are n nodes in the graph, d
and b are the number of different types of nodes and edges respectively, then A ∈ {0, 1}n×n×b and
X ∈ {0, 1}n×d . Aijk = 1 if there exists a bond with type k between ith and j th nodes.
Graph Convolutional Networks (GCN) (Duvenaud et al., 2015; Gilmer et al., 2017; Kearnes et al.,
2016; Kipf & Welling, 2016; Schütt et al., 2017) are a family of neural network architectures
for learning representations of graphs. In this paper, we use a variant of Relational GCN (R-
GCN) (Schlichtkrull et al., 2018) to learn the node representations (i.e., atoms) of graphs with cat-
egorical edge types. Let k denote the embedding dimension. We compute the node embeddings
H l ∈ Rn×k at the lth layer of R-GCN by aggregating messages from different edge types:
−1 −1

H l = Agg ReLU {D̃i 2 Ẽi D̃i 2 H l−1 Wil } i ∈ (1, . . . , b) , (4)

where Ei = A[:,:,i] denotes the ith slice of edge-conditioned adjacency tensor, Ẽi = Ei + I, and
(l)
is a trainable weight matrix for the ith edge type. Agg(·) denotes an
P
D̃i = k Ẽi [j, k]. Wi
aggregation function such as mean pooling or summation. The initial hidden node representation
H 0 is set as the original node feature matrix X. After L message passing layers, we use the the final
hidden representation H L as the node representations. Meanwhile, the whole graph representations
can be defined by aggregating the whole node representations using a readout function (Hamilton
et al., 2017), e.g., summation.

4 P ROPOSED M ETHOD

4.1 G RAPH AF F RAMEWORK

Similar to existing works like GCPN (You et al., 2018a) and MolecularRNN (Popova et al., 2019),
we formalize the problem of molecular graph generation as a sequential decision process. Let G =
(A, X) denote a molecular graph structure. Starting from an empty graph G1 , in each step a new
node Xi is generated based on the current sub-graph structure Gi , i.e., p(Xi |Gi ). Afterwards, the
edges between this new node and existing nodes are sequentially generated according to the current
graph structure, i.e., p(Aij |Gi , Xi , Ai,1:j−1 ). This process is repeated until all the nodes and edges
are generated. An illustrative example is given in Fig. 1(a).
GraphAF is aimed at defining an invertible transformation from a base distribution (e.g. multivariate
Gaussian) to a molecular graph structure G = (A, X). Note that we add one additional type of edge
between two nodes, which corresponds to no edge between two nodes, i.e., A ∈ {0, 1}n×n×(b+1) .
Since both the node type Xi and the edge type Aij are discrete, which do not fit into a flow-based
model, a standard approach is to use Dequantization technique (Dinh et al., 2016; Kingma & Dhari-
wal, 2018) to convert discrete data into continuous data by adding real-valued noise. We follow this
approach to preprocess a discrete graph G = (A, X) into continuous data z = (z A , z X ):

ziX = Xi + u, u ∼ U [0, 1)d ; zij

A
= Aij + u, u ∼ U [0, 1)b+1 . (5)

4
Published as a conference paper at ICLR 2020

𝜀𝜀1 𝜀𝜀2 𝜀𝜀3 𝜀𝜀4

1
O O O
𝑋𝑋1 𝜀𝜀1
C C C C
1
1 2 C C C
𝑋𝑋2 𝜀𝜀2
𝜀𝜀21 3 𝜀𝜀31 𝜀𝜀41 1
𝐴𝐴21 𝜀𝜀21
O O O

Noise from 𝑵𝑵(𝟎𝟎, 𝑰𝑰) C C C … …

C C C
C Node: Atom Parallel Training
𝜀𝜀32 𝜀𝜀42
Edge: Single bond 1
O O 𝑋𝑋1 𝜀𝜀1
Edge: Double bond C C 2
C C C 𝑋𝑋2 𝜀𝜀2
Edge: No bond
𝜀𝜀43 𝐴𝐴21
3
𝜀𝜀21
Affine Transformation for Node Generation
O
Affine Transformation for Edge Generation C
… …
1 Sampling / Training Order C C
Sequential Sampling
(a) Sampling Phases (b) Framework
Figure 1: Overview of the proposed GraphAF model. (a) Illustration of the generative procedure. New nodes
or edges are marked in red. Starting from an empty graph and iteratively sample random variables to map them
to atom/bond features. The numbered first three steps correspond to the maps in the bottom figure of Fig. 1(b).
(b) Computation graph of GraphAF. The left side are the nodes and edges and the right are latent variables.

We present further discussions on dequantization techniques in Appendix A. Formally, we define

the conditional distributions for the generation as:

p(ziX |Gi ) =N (µX X 2

i , (αi ) ), (6)
where µX
i = gµX (Gi ), αiX = gαX (Gi ),
A
p(zij |Gi , Xi , Ai,1:j−1 ) = N (µA A 2
ij , (αij ) ), j ∈ {1, 2, . . . , i − 1}, (7)
where µA A
ij = gµA (Gi , Xi , Ai,1:j−1 ), αij = gαA (Gi , Xi , Ai,1:j−1 ),

where gµX , gµA and gαX , gαA are parameterized neural networks for defining the mean and standard
deviation of a Gaussian distribution. More specifically, given the current sub-graph structure Gi , we
use a L-layer of Relational GCN (defined in Section 3.2) to learn the node embeddings HiL ∈ Rn×k ,
and the embedding of entire sub-graph h̃i ∈ Rk , based on which we define the mean and standard
deviations of Gaussian distributions to generate the nodes and edges respectively:

R-GCN: HiL = R-GCN(Gi ), h̃i = sum(HiL );

Node-MLPs: gµX = mµX (h̃i ), gαX = mαX (h̃i ); (8)
L L L L
Edge-MLPs: gµA = mµA (h̃i , Hi,i , Hi,j ), gαA = mαA (h̃i , Hi,i , Hi,j ),
L
where sum denotes the sum-pooling operation, and Hi,j ∈ Rk denotes the embedding of the j-th
node in the embeddings HiL . mµX , mαX are Multi-Layer Perceptrons (MLP) that predict the node
types according to the current sub-graph embedding. and mµA , mαA are MLPs that predict the types
of edges according to the current sub-graph embedding and node embeddings.
To generate a new node Xi and its edges connected to existing nodes, we just sample random
variables i and ij from the base Gaussian distribution and convert it to discrete features. More
specifically,
ziX = i αiX + µX i , i ∈ R ;
d

A A
(9)
zij = ij αij + µAij , j ∈ {1, 2, . . . , i − 1}, ij ∈ R
b+1
,
where is the element-wise multiplication. In practice, a real molecular graph is generated by
d b+1
taking the argmax of generated continuous vectors, i.e., Xi = vargmax(z X ) and Aij = v
argmax(z A )
,
i ij
where vqp denotes a p dimensional one-hot vector with q th dimension equal to 1.
Let = {1 , 2 , 21 , 3 , 31 , 32 , . . . , n , n1 , . . . , n,n−1 }, where n is the number of atoms in the
given molecule, GraphAF defines an invertible mapping between the base Gaussian distribution

5
Published as a conference paper at ICLR 2020

and the molecule structures z = (z A , z X ). According to Eq. 9, the inverse process from z =
(z A , z X ) to can be easily calculated as:
1 1
i = ziX − µX A
− µA

i X
; ij = zij ij A
, j ∈ {1, 2, . . . , i − 1}, (10)
αi αij
1 1
where αX
and αA
denote element-wise reciprocals of αiX and αij
A
respectively.
i ij

4.2 E FFICIENT PARALLEL T RAINING

In GraphAF, since f : E → Z is autoregressive, the Jacobian matrix of the inverse process f −1 :

Z → E is a triangular matrix, and its determinant can be calculated very efficiently. Given a mini-
batch of training data G, the exact density of each molecule under a given order can be efficiently
computed by the change-of-variables formula in Eq. 1. Our objective is to maximize the likelihood
of training data.
During training, we are able to perform parallel computation by defining a feedforward neural
network between the input molecule graph G and the output latent variable by using masking.
The mask drops out some connections from inputs to ensure that R-GCN is only connected to the
sub-graph Gi when inferring the hidden variable of node i, i.e., i , and connected to sub-graph
Gi , Xi , Ai,1:j−1 when inferring the hidden variable of edge Aij , i.e., ij . This is similar to the ap-
proaches used in MADE (Germain et al., 2015) and MAF (Papamakarios et al., 2017). With the
masking technique, GraphAF satisfies the autoregressive property, and at the same time p(G) can
be efficiently calculated in just one forward pass by computing all the conditionals in parallel.
To further accelerate the training process, the nodes and edges of a training graph are re-ordered
according to the breadth-first search (BFS) order, which is widely adopted by existing approaches
for graph generation (You et al., 2018b; Popova et al., 2019). Due to the nature of BFS, bonds can
only be present between nodes within the same or consecutive BFS depths. Therefore, the maximum
dependency distance between nodes is bounded by the largest number of nodes in a single BFS
depth. In our data sets, any single BFS depth contains no more than 12 nodes, which means we only
need to model the edges between current atom and the latest generated 12 atoms.
Due to space limitation, we summarize the detailed training algorithm into Appendix B.

4.3 VALIDITY C ONSTRAINED S AMPLING

In chemistry, there exist many chemical rules, which can help to generate valid molecules. Thanks
to the sequential generation process, GraphAF can leverage these rules in each generation step.
Specifically, we can explicitly apply a valency constraint during sampling to check whether current
bonds have exceeded the allowed valency, which has been widely adopted in previous models (You
et al., 2018a; Popova et al., 2019). Let |Aij | denote the order of the chemical bond Aij . In each
edge generation step of Aij , we check the following valency constraint for the ith and j th atoms:
X X
|Aij | ≤ Valency(Xi ) and |Aij | ≤ Valency(Xj ). (11)
j i

If the newly added bond breaks the valency constraint, we just reject the bond Aij , sample a new
ij in the latent space and generate another new bond type. The generation process will terminate if
one of the following conditions is satisfied: 1) the graph size reaches the max-size n, 2) no bond is
generated between the newly generated atom and previous sub-graph. Finally, hydrogens are added
to the atoms that have not filled up their valencies.

4.4 G OAL - DIRECTED M OLECULE G ENERATION WITH R EINFORCEMENT L EARNING

So far, we have introduced how to use GraphAF to model the data density of molecular graph
structures and generate valid molecules. Nonetheless, for drug discovery, we also need to optimize
the chemical properties of generated molecules. In this part, we introduce how to fine-tune our
generation process with reinforcement learning to optimize the properties of generated molecules.
State and Policy Network. The state is the current sub-graph, and the initial state is an empty
graph. The policy network is the same as the autoregressive model defined in Section 4.1, which

6
Published as a conference paper at ICLR 2020

includes the process of generating a new atom based on the current sub-graph and generating the
edges between the new atom and existing atoms, i.e., p (Xi |Gi ) and p (Aij |Gi , Xi , Ai,1:j−1 ). The
policy network itself defines a distribution pθ of molecular graphs G. If there are no edges between
the newly generated atom and current sub-graph, the generation process terminates. For the state
transition dynamics, we also incorporate the valency check constraint.
Reward design. Similar to GCPN You et al. (2018a), we also incorporate both intermediate and final
rewards for training the policy network. A small penalization will be introduced as the intermediate
reward if the edge predictions violate the valency check. The final rewards include both the score
of targeted-properties of generated molecules such as octanol-water partition coefficient (logP) or
drug-likeness (QED) (Bickerton et al., 2012) and the chemical validity reward such as penalties for
molecules with excessive steric strain and or functional groups that violate ZINC functional group
filters (Irwin et al., 2012). The final reward is distributed to all intermediate steps with a discounting
factor to stabilize the training.
In practice, we adopt Proximal Policy Optimization (PPO) (Schulman et al., 2017), an advanced pol-
icy gradient algorithm to train GraphAF in the above defined environment. Let Gij be the shorthand
notation of sub-graph Gi ∪ Xi ∪ Ai,1:j−1 . Formally,
n h
L(θ) = − EG∼pθ Ei min ri (θ)V (Gi , Xi ), clip (ri (θ), 1 − , 1 + ) V (Gi , Xi )
io (12)
+ Ej min rij (θ)V (Gij , Aij ), clip (rij (θ), 1 − , 1 + ) V (Gij , Aij ) ,

p (A |G )
where ri (θ) = pθpθ (X(Xi |G i)
i |Gi )
and rij (θ) = pθ θ (Aijij |Gijij ) are ratios of probabilities output by old
old old
and new policies, and V (state, action) is the estimated advantage function with a moving average
baseline to reduce the variance. More specifically, we treat generating a node and all its edges
with existing nodes as one step and maintain a moving average baseline for each step. The clipped
surrogate objective prevents the policy from being updated to collapse for some extreme rewards.

5 E XPERIMENTS

5.1 E XPERIMENT S ETUP

Evaluation Tasks. Following existing works on molecule generation (Jin et al., 2018; You et al.,
2018a; Popova et al., 2019), we conduct experiments by comparing with the state-of-the-art ap-
proaches on three standard tasks. Density Modeling and Generation evaluates the model’s capacity
to learn the data distribution and generate realistic and diverse molecules. Property Optimization
concentrates on generating novel molecules with optimized chemical properties. For this task, we
fine-tune our network pretrained from the density modeling task to maximize the desired proper-
ties. Constrained Property Optimization is first proposed in Jin et al. (2018), which is aimed at
modifying the given molecule to improve desired properties while satisfying a similarity constraint.
Data. We use the ZINC250k molecular dataset (Irwin et al., 2012) for training. The dataset contains
250, 000 drug-like molecules with a maximum atom number of 38. It has 9 atom types and 3 edge
types. We use the open-source chemical software RDkit (Landrum, 2016) to preprocess molecules.
All molecules are presented in kekulized form with hydrogen removed.
Baselines. We compare GraphAF with the following state-of-the-art approaches for molecule gener-
ation. JT-VAE (Jin et al., 2018) is a VAE-based model which generates molecules by first decoding
a tree structure of scaffolds and then assembling them into molecules. JT-VAE has been shown
to outperform other previous VAE-based models (Kusner et al., 2017; Gómez-Bombarelli et al.,
2018; Simonovsky & Komodakis, 2018). GCPN is a state-of-the-art approach which combines re-
inforcement learning and graph representation learning methods to explore the vast chemical space.
MolecularRNN (MRNN), another autoregressive model, uses RNN to generate molecules in a se-
quential manner. We also compare our model with GraphNVP (Madhawa et al., 2019), a recently
proposed flow-based model. Results of baselines are taken from original papers unless stated.
Implementation Details. GraphAF is implemented in PyTorch (Paszke et al., 2017). The R-GCN is
implemented with 3 layers, and the embedding dimension is set as 128. The max graph size is set as
48 empirically. For density modeling, we train our model for 10 epochs with a batch size of 32 and a

7
Published as a conference paper at ICLR 2020

Table 2: Comparison of different models on density modeling and generation. Reconstruction is only evaluated
on latent variable models. Validity w/o check is only evaluated on models with valency constraints. Result with
† is obtained by running GCPN’s open-source code. Results with ‡ are taken from Popova et al. (2019).

Method Validity Validity w/o check Uniqueness Novelty Reconstruction

‡ ‡
JT-VAE 100% — 100% 100% 76.7%
GCPN 100% 20%† 99.97%‡ 100%‡ —
MRNN 100% 65% 99.89% 100% —
GraphNVP 42.60% — 94.80% 100% 100%
GraphAF 100% 68% 99.10% 100% 100%

Table 3: Results of density modeling and generation on three different datasets.

Method Validity Validity w/o check Uniqueness Novelty Reconstruction

ZINC250k 100% 68% 99.10% 100% 100%
QM9 100% 67% 94.51% 88.83% 100%
MOSES 100% 71% 99.99% 100% 100%

learning rate of 0.001. For property optimization, we perform a grid search on the hyperparameters
and select the best setting according to the chemical scoring performance. We use Adam (Kingma
& Ba, 2014) to optimize our model. Full training details can be found in Appendix C.

5.2 N UMERICAL R ESULTS

Density Modeling and Generation. We evaluate the ability of the proposed method to model
real molecules by utilizing the widely-used metrics: Validity is the percentage of valid molecules
among all the generated graphs. Uniqueness is the percentage of unique molecules among all the
generated molecules. Novelty is the percentage of generated molecules not appearing in training set.
Reconstruction is the percentage of the molecules that can be reconstructed from latent vectors. We
calculate the above metrics from 10,000 randomly generated molecules.
Table 2 shows that GraphAF achieves competitive results on all four metrics. As a flow-based model,
GraphAF holds perfect reconstruction ability compared with VAE approaches. Our model also
achieves a 100% validity rate since we can leverage the valency check during sequential generation.
By contrast, the validity rate of another flow-based approach GraphNVP is only 42.60% due to
its one-shot sampling process. An interesting result is that even without the valency check during
generation, GraphAF can still achieve a validity rate as high as 68%, while previous state-of-the-art
approach GCPN only achieves 20%. This indicates the strong flexibility of GraphAF to model the
data density and capture the domain knowledge from unsupervised training on the large chemical
dataset. We also compare the efficiency of different methods on the same computation environment,
a machine with 1 Tesla V100 GPU and 32 CPU cores. To achieve the results in Table 2, JT-VAE and
GCPN take around 24 and 8 hours, respectively, while GraphAF only takes 4 hours.
To show that GraphAF is not overfitted to the specific dataset ZINC250k, we also conduct experi-
ments on two other molecule datasets, QM9 (Ramakrishnan et al., 2014) and MOSES (Polykovskiy
et al., 2018). QM9 contains 134k molecules with 9 heavy atoms, and MOSES is much larger and
more challenging, which contains 1.9M molecules with up to 30 heavy atoms. Table 3 shows that
GraphAF can always generate valid and novel molecules even on the more complicated dataset.
Furthermore, though GraphAF is originally designed for molecular graph generation, it is actually
very general and can be used to model different types of graphs by simply modifying the node and
edge generating functions Edge-MLPs and Node-MLPs (Eq. 8). Following the experimental setup of
Graph Normalizing Flows (GNF) (Liu et al., 2019), we test GraphAF on two generic graph datasets:
COMMUNITY-SMALL, which is a synthetic data set containing 100 2-community graphs, and
EGO-SMALL, which is a set of graphs extracted from Citeseer dataset (Sen et al., 2008). In prac-
tice, we use one-hot indicator vectors as node features for R-GCN. We borrow open source scripts
from GraphRNN (You et al., 2018b) to generate datasets and evaluate different models. For evalua-
tion, we report the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) between generated

8
Published as a conference paper at ICLR 2020

Table 4: Comparison between different graph generative models on general graphs with MMD metrics. We
follow the evaluation scheme of GNF (Liu et al., 2019). Results of baselines are also taken from GNF.

COMMUNITY-SMALL EGO-SMALL
Method
Degree Cluster Orbit Degree Cluster Orbit
GraphVAE 0.35 0.98 0.54 0.13 0.17 0.05
DEEPGMG 0.22 0.95 0.4 0.04 0.10 0.02
GraphRNN 0.08 0.12 0.04 0.09 0.22 0.003
GNF 0.20 0.20 0.11 0.03 0.10 0.001
GraphAF 0.18 0.20 0.02 0.03 0.11 0.001
GraphRNN(1024) 0.03 0.01 0.01 0.04 0.05 0.06
GNF(1024) 0.12 0.15 0.02 0.01 0.03 0.0008
GraphAF(1024) 0.06 0.10 0.015 0.04 0.04 0.008

Table 5: Comparison of the top 3 property scores of generated molecules.

Penalized logP QED

Method
1st 2nd 3rd Validity 1st 2nd 3rd Validity
ZINC (Dataset) 4.52 4.30 4.23 100.0% 0.948 0.948 0.948 100.0%
JT-VAE (Jin et al., 2018) 5.30 4.93 4.49 100.0% 0.925 0.911 0.910 100.0%
GCPN (You et al., 2018a) 7.98 7.85 7.80 100.0% 0.948 0.947 0.946 100.0%
MRNN1 (Popova et al., 2019) 8.63 6.08 4.73 100.0% 0.844 0.796 0.736 100.0%
GraphAF 12.23 11.29 11.05 100.0% 0.948 0.948 0.947 100.0%

OH Cl
O

O NH
NH O N O NH O
O N
S
N NH2+ NH
O N NH NH
N

12.23 11.29 0.948 0.948 -8.25 -0.11

O O
Br NH NH
O S NH NH
N N N N
NH NH
O O

N O Br O

11.05 10.83 0.947 0.946 -8.66 -1.10

(a) Penalized logP optimization (b) QED optimization (c) Constrained property optimization
Figure 2: Molecules generated in property optimization and constrained property optimization tasks. (a)
Molecules with high penalized logP scores. (b) Molecules with high QED scores. (c) Two pairs of molecules
in constrained property optimization for penalized logP with similarity 0.71(top) and 0.64(bottom).

and training graphs using some specific metrics on graphs proposed by You et al. (2018b). The
results in Table 4 demonstrate that when applied to generic graphs, GraphAF can still consistently
yield comparable or better results compared with GraphRNN and GNF. We give the visualization of
generated generic graphs in Appendix D.

Property Optimization. In this task, we aim at generating molecules with desired properties.
Specifically, we choose penalized logP and QED as our target property. The former score is logP
score penalized by ring size and synthetic accessibility, while the latter one measures the drug-
likeness of the molecules. Note that both scores are calculated using empirical prediction models
and we adopt the script used in (You et al., 2018a) to make results comparable. To perform this task,
we pretrain the GraphAF network for 300 epochs for likelihood modeling, and then apply the RL
process described in section 4.4 to fine-tune the network towards desired chemical properties. De-
tailed reward design and hyper-parameters setting can be found in Appendix C. Following existing
works, we report the top-3 scores found by each model.

1
The scores reported here are recalculated based on top 3 molecules presented in the original paper (Popova
et al., 2019) using GCPN’s script.

9
Published as a conference paper at ICLR 2020

Table 6: Comparison of results on constrained property optimization.

JT-VAE GCPN GraphAF
δ
Improvement Similarity Success Improvement Similarity Success Improvement Similarity Success
0.0 1.91 ± 2.04 0.28 ± 0.15 97.5% 4.20 ± 1.28 0.32 ± 0.12 100% 13.13 ± 6.89 0.29 ± 0.15 100%
0.2 1.68 ± 1.85 0.33 ± 0.13 97.1% 4.12 ± 1.19 0.34 ± 0.11 100% 11.90 ± 6.86 0.33 ± 0.12 100%
0.4 0.84 ± 1.45 0.51 ± 0.10 83.6% 2.49 ± 1.30 0.47 ± 0.08 100% 8.21 ± 6.51 0.49 ± 0.09 99.88%
0.6 0.21 ± 0.71 0.69 ± 0.06 46.4% 0.79 ± 0.63 0.68 ± 0.08 100% 4.98 ± 6.49 0.66 ± 0.05 96.88%

As shown in Table 5, GraphAF outperforms all baselines by a large margin for penalized logP
score and achieves comparable results for QED. This phenomenon indicates that combined with
RL process, GraphAF successfully captures the distribution of desired molecules. Note that we
re-evaluate the properties of the top-3 molecules found by MolecularRNN, which turn out to be
lower than the results reported in the original paper. Figure 2(a) and 2(b) show the molecules with
the highest score discovered by our model. More realistic molecules generated by GraphAF with
penalized logP score ranging from 5 to 10 are presented in Figure 6 in Appendix E.
One should note that, as defined in Sec 4.4, our RL process is close to the one used in previous work
GCPN (You et al., 2018a). Therefore, the good property optimization performance is believed to
come from the flexibility of flow. Compared with the GAN model used in GCPN, which is known
to suffer from the mode collapse problem, flow is flexible at modeling complex distribution and
generating diverse data (as shown in Table 2 and Table 3). This allows GraphAF to explore a variety
of molecule structures in the RL process for molecule properties optimization.

Constrained Property Optimization. The goal of the last task is to modify the given molecule to
improve specified property with the constraint that the similarity between the original and modified
molecule is above a threshold δ. Following Jin et al. (2018) and You et al. (2018a), we choose to
optimize penalized logP for 800 molecules in ZINC250k with the lowest scores and adopt Tanimoto
similarity with Morgan fingerprint (Rogers & Hahn, 2010) as the similarity metric.
Similar to the property optimization task, we pretrain GraphAF via density modeling and then fine-
tune the model with RL. During generation, we set the initial states as sub-graphs randomly sampled
from 800 molecules to be optimized. For evaluation, we report the mean and standard deviation
of the highest improvement and the corresponding similarity between the original and modified
molecules in Table 6. Experiment results show that GraphAF significantly outperforms all previous
approaches and almost always succeeds in improving the target property. Figure 2(c) visualizes two
optimization examples, showing that our model is able to improve the penalized logP score by a
large margin while maintaining a high similarity between the original and modified molecule.

6 C ONCLUSION
We proposed GraphAF, the first flow-based autoregressive model for generating realistic and diverse
molecular graphs. GraphAF is capable to model the complex molecular distribution thanks to the
flexibility of normalizing flow, as well as generate novel and 100% valid molecules in empirical
experiments. Moreover, the training of GraphAF is very efficient. To optimize the properties of gen-
erated molecules, we fine-tuned the generative process with reinforcement learning. Experimental
results show that GraphAF outperforms all previous state-of-the-art baselines on the standard tasks.
In the future, we plan to train our GraphAF model on larger datasets and also extend it to generate
other types of graph structures (e.g., social networks).

ACKNOWLEDGEMENT
We would like to thank Min Lin, Meng Qu, Andreea Deac, Laurent Dinh, Louis-Pascal A. C.
Xhonneux and Vikas Verma for the extremely helpful discussions and comments. This project is
supported by the Natural Sciences and Engineering Research Council of Canada, the Canada CI-
FAR AI Chair Program, and a collaboration grant between Microsoft Research and Mila. Ming
Zhang is supported by National Key Research and Development Program of China with Grant No.
2018AAA0101900/2018AAA0101902 as well as Beijing Municipal Commission of Science and
Technology under Grant No. Z181100008918005. Weinan Zhang is supported by National Natural
Science Foundation of China (61702327, 61772333, 61632017, 81771937) and Shanghai Sailing
Program (17YF1428200).

10
Published as a conference paper at ICLR 2020

R EFERENCES
G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins.
Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90, 2012.
Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs.
arXiv preprint arXiv:1805.11973, 2018.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv
preprint arXiv:1605.08803, 2016.
David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán
Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular
fingerprints. In Advances in neural information processing systems, pp. 2224–2232, 2015.
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for
distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In Proceedings of the 34th International Conference on
Machine Learning-Volume 70, pp. 1263–1272. JMLR. org, 2017.
Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato,
Benjamı́n Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel,
Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven contin-
uous representation of molecules. ACS central science, 4(2):268–276, 2018.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural infor-
mation processing systems, pp. 2672–2680, 2014.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola.
A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.
In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improv-
ing flow-based generative models with variational dequantization and architecture design. In
Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019. URL
http://proceedings.mlr.press/v97/ho19a.html.
John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. Zinc:
a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52
(7):1757–1768, 2012.
Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for
molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph
convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):
595–608, 2016.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3nd Interna-
tional Conference on Learning Representations, 2014.
Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International
Conference on Learning Representations, 2013.
Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In
Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Im-
proved variational inference with inverse autoregressive flow. In Advances in neural information
processing systems, pp. 4743–4751, 2016.

11
Published as a conference paper at ICLR 2020

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional net-
works. arXiv preprint arXiv:1609.02907, 2016.
Ivan Kobyzev, Simon Prince, and Marcus A. Brubaker. Normalizing flows: Introduction and ideas.
arXiv preprint arXiv:1908.09257, 2019.
Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoen-
coder. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.
1945–1954. JMLR. org, 2017.
Greg Landrum. Rdkit: Open-source cheminformatics software. 2016.
Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, and Kevin Swersky. Graph normalizing flows.
arXiv preprint arXiv:1905.13177, 2019.
Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural
networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70,
pp. 2218–2227. JMLR. org, 2017.
Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via
regularizing variational autoencoders. In Advances in Neural Information Processing Systems,
pp. 7113–7124, 2018.
Kaushalya Madhawa, Katushiko Ishiguro, Kosuke Nakago, and Motoki Abe. Graphnvp: An invert-
ible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600, 2019.
Daniel Neil, Marwin H. S. Segler, Laura Guasch, Mohamed Ahmed, Dean Plumbley, Matthew
Sellwood, and Nathan Brown. Exploring deep recurrent models with reinforcement learning for
molecule design. In 6th International Conference on Learning Representations, Workshop Track
Proceedings, 2018.
Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo
design through deep reinforcement learning. Journal of cheminformatics, 9(1):48, 2017.
George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density
estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
pytorch. In NIPS-W, 2017.
P. G. Polishchuk, T. I. Madzhidov, and A. Varnek. Estimation of the size of drug-like chemical space
based on gdb-17 data. Journal of Computer-Aided Molecular Design, 27(8):675–679, Aug 2013.
Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai
Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark
Veselov, Artur Kadurin, Sergey Nikolenko, Alan Aspuru-Guzik, and Alex Zhavoronkov. Molec-
ular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. arXiv preprint
arXiv:1811.12823, 2018.
Mariya Popova, Mykhailo Shvets, Junier Oliva, and Olexandr Isayev. Molecularrnn: Generating
realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372, 2019.
Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum
chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv
preprint arXiv:1505.05770, 2015.
David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical informa-
tion and modeling, 50:742–754, 2010.
Bidisha Samanta, Abir De, Niloy Ganguly, and Manuel Gomez-Rodriguez. Designing random
graph models using variational autoencoders with applications to chemical design. arXiv preprint
arXiv:1802.05283, 2018.

12
Published as a conference paper at ICLR 2020

Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max
Welling. Modeling relational data with graph convolutional networks. In European Semantic Web
Conference, pp. 593–607. Springer, 2018.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko.
Quantum-chemical insights from deep tensor neural networks. Nature communications, 8:13890,
2017.
Marwin HS Segler, Thierry Kogej, Christian Tyrchan, and Mark P Waller. Generating focused
molecule libraries for drug discovery with recurrent neural networks. ACS central science, 4(1):
120–131, 2017.
Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina
Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008. URL
http://www.cs.iit.edu/˜ml/pdfs/sen-aimag08.pdf.
Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using
variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–
422. Springer, 2018.
L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In
International Conference on Learning Representations, 2016. URL http://arxiv.org/
abs/1511.01844. arXiv:1511.01844.
Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In
International Conference on Machine Learning, pp. 1747–1756, 2016.
Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional pol-
icy network for goal-directed molecular graph generation. In Advances in Neural Information
Processing Systems, pp. 6410–6421, 2018a.
Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. Graphrnn: Generating
realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773, 2018b.

13
Published as a conference paper at ICLR 2020

Appendix
A D ISCCUSIONS ON D EQUANTIZATION TECHNIQUES

The dequantization techniques allow mapping the discrete data into the continuous one by adding
a small noise to each dimension. By adding noise from U [0, 1), we can ensure that the range of
different categories will not overlap. For example, after dequantization, the value of 1-entry in the
one-hot vector lies in [1, 2) while the 0-entry lies in [0, 1). Therefore, we can map the dequantized
continuous data back to the discrete one-hot data by easily performing the argmax operation in
the generation process. Theoretically, as shown in Theis et al. (2016); Ho et al. (2019), training a
continuous density model on uniform dequantized data can be interpreted as maximizing a lower
bound on the log-likelihood for the original discrete data. Mathematically, this statement holds for
both image data and binary/categorical data.
Furthermore, as suggested in Ho et al. (2019), instead of adding random uniform noise to each
discrete data for dequantization, a more advanced dequantization technique is to treat the noise as a
hidden variable and use variational inference to infer the optimum noise added to each discrete data,
which we would like to explore in our future work.

B PARALLEL T RAINING A LGORITHM

Algorithm 1 Parallel Training Algorithm of GraphAF

Input: η learning rate, M batch size, P maximum dependency distance in BFS, Adam hyperparam-
eters β1 , β2 , use Prod(·) as the product of elements across dimensions of a tensor
Initial: Parameters θ of GraphAF (R-GCN, Node-MLP and Edge-MLP)
1: while θ is not converged do
2: for m = 1, ..., M do
3: Sample a molecule mol from dataset and get the graph size N
4: Convert mol to G = (A, X) with BFS re-ordering
5: for i = 1, ..., N do
6: ziX = Xi + u, u ∼ U [0, 1)d
7: µXi = gµX Gi , αiX = gαX Gi
1
8: i = ziX − µX i αX
i
1
9: LXi = − log(Prod(pE (i ))) − log(Prod( αX ))
i
10: for j = max{1, i − P }, ..., i − 1 do
A
11: zij = Aij + u, u ∼ U [0, 1)b+1
A
A
12: µij = gµA Gi , Xi , Ai,1:j−1 , αij = gαA Gi , Xi , Ai,1:j−1
1
A

13: ij = zij − µAij αA ij
1
14: LA
ij = − log(Prod(pE (ij ))) − log(Prod( αA ))
ij
15: end for
16: end forP
n PP
17: LGm =
X
i=1 Li + j=1 Lij
A

18: end for Pm

1 G
19: θ ← ADAM( M m=1 Lm , θ, η, β1 , β2 )
20: end while

C E XPERIMENT DETAILS

Network architecture. The network architecture is fixed among all three tasks. More specifically,
the R-GCN is implemented with 3 layers and the embedding dimension is set as 128. We use
batch normalization before graph pooling to accelerate the convergence and choose sum-pooling as

14
Published as a conference paper at ICLR 2020

the readout function for graph representations. Both node MLPs and edge MLPs have two fully-
connected layers equipped with tanh non-linearity.

Density Modeling and Generation. To achieve the results in Table 2, we train GraphAF on
ZINC250K with a batch size of 32 on 1 Tesla V100 GPU and 32 CPU cores for 10 epochs. We
optimize our model with Adam with a fixed learning rate of 0.001.

Property Optimization. For both property optimization and constrained property optimization,
we first pretrain a GraphAF network via the density modeling task for 300 epochs, and then fine-
tune the network toward desired molecular distribution through RL process. Following are details
about the reward design for property optimization. The reward of each step consists of step-wise
validity rewards and the final rewards discounted by a fixed factor γ. The step-wise validity penalty
is fixed as -1. The final reward of a molecule m includes both property-targeted reward and chemical
validation reward. We adopt the same chemical validation rewards as GCPN. We define property-
targeted reward as follows:
r(m) = t1 · QED(m)

logPpen (mol) (13)
r(m) = exp
t2
γ is set to 0.97 for QED optimization and 0.9 for penalized logP optimization respectively. We
fine-tune the pretrained model for 200 iterations with a fixed batch size of 64 using Adam optimizer.
We also adopt a linear learning rate warm-up to stabilize the training. We perform the grid search to
determine the optimal hyperparameters according to the chemical scoring performance. The search
space is summarised in Table 7.

Table 7: Tuned-parameters for policy gradient and their search space.

PARAM Description Search space
lr Learning rate {0.001, 0.0005, 0.0001}
t1 Coefficient for QED score {2, 3, 4, 5}
t2 Temperature for exponential function {3, 4, 5}
wm Number of warm up iterations {12, 24, 36}

Constrained Property Optimization. We first introduce the way we sample sub-graphs from 800
ZINC molecules. Given a molecule, we first randomly sample a BFS order and then drop the last
m nodes in BFS order as well as edges induced by these nodes, where m is randomly chosen from
{0, 1, 2, 3, 4, 5} each time. Finally, we reconstruct the sub-graph from the remaining nodes in the
BFS sequence. Note that the final sub-graph is connected due to the nature of BFS order. For reward
design, we set it as the improvement of the target score. We fine-tune the pretrained model for 200
iterations with a batch size of 64. We also use Adam with a learning rate of 0.0001 to optimize the
model. Finally, each molecule is optimized for 200 times by the tuned model.

D V ISUALIZATION OF GENERATED GENERIC GRAPHS

We present visualizations of graphs from both the training set and generated graphs by GraphAF
in Figure 3 and Figure 4. The visualizations demonstrate that GraphAF has strong ability to model
different graph structures in the generic graph datasets.

E M ORE MOLECULE SAMPLES

We present more molecule samples generated by GraphAF in the following pages. Figure 5 presents
50 molecules randomly sampled from multivariate Gaussian, which justify the ability of our model
to generate novel, realistic and unique molecules. From Figure 6 we can see that our model is
able to generate molecules with high and diverse penalized logP scores ranging from 5 to 10. For
constrained property optimization of penalized logP score, as shown by Figure 7, our model can
either reduce the ring size, remove the big ring or grow carbon chains from the original molecule,
improving the penalized logP score by a large margin.

15
Published as a conference paper at ICLR 2020

(a) Graphs from training set (b) Graphs generated by GraphAF

Figure 3: Visualizations of training graphs and generated graphs of EGO-SMALL.

(a) Graphs from training set (b) Graphs generated by GraphAF

Figure 4: Visualizations of training graphs and generated graphs of COMMUNITY-SMALL.

16
Published as a conference paper at ICLR 2020

O O O
F

Cl N
NH N
SH
NH NH
NH O
N
O
N H2 N
Cl
N NH
OH
N

HO N

NH
N
S

H2 N N
S
Cl N
Br
NH O

Br
HO

Cl
Cl

NH2
Br
O F
Cl O

S N F
O
O
O
O
N
N
Br

O
I

O
O
OH
O Cl NH
S
HO NH
N H2 N

N O O
N

NH
O

I N
N N
NH
N
N
NH NH NH
O N

O
O

NH2

NH
Cl

NH
Br O N
NH Br O Cl
Cl N N
N
O
N N
O
S
H2 N N O
F Cl NH2

N Br
NH
H2 N
N

O Cl

N O NH F

O F
N
N O
O F
HO

Cl
NH N

O
O
S
O N S
S S NH
N

NH N
O NH F
OH
NH NH
O

NH2
F

O
NH

N
NH O N
N N
NH
N

NH OH

O
N
O
NH

NH2

Cl
I

O N
N
N N

S N Cl
NH
O N
O NH O
N
NH S Cl
O

NH2

Figure 5: 50 molecules sampled from prior.

17
Published as a conference paper at ICLR 2020

N
P

10.75 10.21 9.79 9.18

Cl Cl Cl

Cl Cl

8.29 8.08 7.75 7.64

S Cl

S I

N O Cl
I
O

Br
N
F Cl
F

6.27 6.09 5.48 5.28

Figure 6: Molecule samples with high penalized logp score generated by GraphAF.

NH O
NH
O N OH
N O
NH
O N NH2 +
NH+
NH+
N N O
O
N

0.61 19.06 0.64 11.86 0.63 9.01 0.71 2.10

NH N
NH
O OH
O NH2 +
N O NH+
NH
O N
N N
NH
O

O
NH2
O
N O OH
NH+ NH N S
O
O NH+
N NH+ NH+
NH O

0.50 9.47 0.53 9.38 0.46 7.84 0.50 7.13

H2 N
N OH O
N O O
NH
NH N N N N
O
N S

O O

Figure 7: More results on constrained property optimization for penalized logP score. Numbers beside the
arrow denote similarity and improvement of the given molecule pair respectively.

Advanced Deep Learning Ghosal
No ratings yet
Advanced Deep Learning Ghosal
9 pages
The Mostly Complete Chart of Neural Networks
100% (1)
The Mostly Complete Chart of Neural Networks
19 pages
Neural Network Unsupervised Machine Learning: What Are Autoencoders?
No ratings yet
Neural Network Unsupervised Machine Learning: What Are Autoencoders?
22 pages
Zen AInew
No ratings yet
Zen AInew
29 pages
Deep Stock Representation Learning From Candlestick
No ratings yet
Deep Stock Representation Learning From Candlestick
10 pages
The Functional API
No ratings yet
The Functional API
18 pages
Noe 2020 Machine Learning For Molecular Simu
No ratings yet
Noe 2020 Machine Learning For Molecular Simu
32 pages
Term Paper: Dept of CSE, GMRIT
No ratings yet
Term Paper: Dept of CSE, GMRIT
16 pages
Deepfake Video Detection Challenges and Opportunities
No ratings yet
Deepfake Video Detection Challenges and Opportunities
48 pages
Machine Learning-Based Breast Cancer Detection
No ratings yet
Machine Learning-Based Breast Cancer Detection
82 pages
Unsupervised Learning of Video Representations Using Lstms
No ratings yet
Unsupervised Learning of Video Representations Using Lstms
12 pages
ML Material
No ratings yet
ML Material
38 pages
Deep Learning PPT Full Notes
No ratings yet
Deep Learning PPT Full Notes
105 pages
Tensorflow
No ratings yet
Tensorflow
9 pages
Pgrev Stud 954409 3 1 2 L
No ratings yet
Pgrev Stud 954409 3 1 2 L
33 pages
Neural Network Message Passing For Quantum Chemistry New Paper
No ratings yet
Neural Network Message Passing For Quantum Chemistry New Paper
14 pages
Coarse-to-Fine - A Hierarchical Diffusion Model For Molecule Generation in 3D
No ratings yet
Coarse-to-Fine - A Hierarchical Diffusion Model For Molecule Generation in 3D
23 pages
M - O G N N C - A M D: Ixed Integer Ptimisation of Raph Eural Etworks FOR Omputer Ided Olecular Esign
No ratings yet
M - O G N N C - A M D: Ixed Integer Ptimisation of Raph Eural Etworks FOR Omputer Ided Olecular Esign
27 pages
A Deep Learning Approach To Network Intrusion Detection - IEEE TETCI v2n1 201802 - Shone, Ngoc, Phai, Shi
No ratings yet
A Deep Learning Approach To Network Intrusion Detection - IEEE TETCI v2n1 201802 - Shone, Ngoc, Phai, Shi
10 pages
A Survey of Multi-View Representation Learning: Yingming Li, Ming Yang, Zhongfei (Mark) Zhang, Senior Member, IEEE
No ratings yet
A Survey of Multi-View Representation Learning: Yingming Li, Ming Yang, Zhongfei (Mark) Zhang, Senior Member, IEEE
20 pages
Artificiall Intelligence Paper 4
No ratings yet
Artificiall Intelligence Paper 4
15 pages
From Big Data To Artificial Intelligence Chemoinformatics Meets New Challenges
No ratings yet
From Big Data To Artificial Intelligence Chemoinformatics Meets New Challenges
3 pages
SSRN 3335536
No ratings yet
SSRN 3335536
35 pages
RP 1
No ratings yet
RP 1
13 pages
Representing Molecules As Random Walks Over Interpretable Grammars
No ratings yet
Representing Molecules As Random Walks Over Interpretable Grammars
30 pages
GNN ML For Materials ACS 2019
No ratings yet
GNN ML For Materials ACS 2019
31 pages
Chapter17 Autoencoders
No ratings yet
Chapter17 Autoencoders
23 pages
Schnet: A Continuous-Filter Convolutional Neural Network For Modeling Quantum Interactions
No ratings yet
Schnet: A Continuous-Filter Convolutional Neural Network For Modeling Quantum Interactions
11 pages
Molecular Design With Automated Quantum Computing-Based Deep Learning and Optimization
No ratings yet
Molecular Design With Automated Quantum Computing-Based Deep Learning and Optimization
14 pages
2.Prior-Based Regularization
No ratings yet
2.Prior-Based Regularization
10 pages
Text To Video
No ratings yet
Text To Video
11 pages
JT VAE JTVAE Junction Tree
No ratings yet
JT VAE JTVAE Junction Tree
17 pages
Generative Learning For Compound Design ACS
No ratings yet
Generative Learning For Compound Design ACS
19 pages
Molecular Informatics - 2022 - Schietgat - Automated Detection of Toxicophores and Prediction of Mutag
No ratings yet
Molecular Informatics - 2022 - Schietgat - Automated Detection of Toxicophores and Prediction of Mutag
12 pages
Atomistic Graph Networks For Experimental Materials Property Prediction
No ratings yet
Atomistic Graph Networks For Experimental Materials Property Prediction
22 pages
42 LBDD CH
No ratings yet
42 LBDD CH
9 pages
Structure To Property: Chemical Element Embeddings and A Deep Learning Approach For Accurate Prediction of Chemical Properties
No ratings yet
Structure To Property: Chemical Element Embeddings and A Deep Learning Approach For Accurate Prediction of Chemical Properties
11 pages
Bilodeau-Generative Models For Molecular Discovery-Recent Advances and challenges-article-2022-NA
No ratings yet
Bilodeau-Generative Models For Molecular Discovery-Recent Advances and challenges-article-2022-NA
17 pages
Audio GAN
No ratings yet
Audio GAN
2 pages
Predicting Ground-State Molecular Properties Using The Quantum-Mechanical Dataset QM7-X
No ratings yet
Predicting Ground-State Molecular Properties Using The Quantum-Mechanical Dataset QM7-X
8 pages
Deep Density-Based Image Clustering
No ratings yet
Deep Density-Based Image Clustering
8 pages
A Self-Attention Based Message Passing Neural Netw
No ratings yet
A Self-Attention Based Message Passing Neural Netw
10 pages
Predicting Drug Solubilty Wtih Deep Learning
No ratings yet
Predicting Drug Solubilty Wtih Deep Learning
9 pages
Molecular Classification With Graph ConvolutionalNetworks: Exploring The MUTAG Dataset For Mutagenicity Prediction
No ratings yet
Molecular Classification With Graph ConvolutionalNetworks: Exploring The MUTAG Dataset For Mutagenicity Prediction
6 pages
Shinkle Et Al 2024 Thermodynamic Transferability in Coarse Grained Force Fields Using Graph Neural Networks
No ratings yet
Shinkle Et Al 2024 Thermodynamic Transferability in Coarse Grained Force Fields Using Graph Neural Networks
16 pages
VAE Molecular Graphs Niloy AAAI19
No ratings yet
VAE Molecular Graphs Niloy AAAI19
8 pages
Constrained Graph Variational Autoencoders For Molecule Design
No ratings yet
Constrained Graph Variational Autoencoders For Molecule Design
13 pages
Graph-Based Molecular Representation Learning
No ratings yet
Graph-Based Molecular Representation Learning
9 pages
Amabilino Et Al. - 2020 - Guidelines For Recurrent Neural Network Transfer L
No ratings yet
Amabilino Et Al. - 2020 - Guidelines For Recurrent Neural Network Transfer L
15 pages
Molgan: An Implicit Generative Model For Small Molecular Graphs
No ratings yet
Molgan: An Implicit Generative Model For Small Molecular Graphs
13 pages
2105-Attention-Based Generative Models For de Novo Molecular Design
No ratings yet
2105-Attention-Based Generative Models For de Novo Molecular Design
11 pages
Geometric Deep Learning On Molecular Representations
No ratings yet
Geometric Deep Learning On Molecular Representations
16 pages
1152 CS F425 Comprehensive Exam Question Paper DL
No ratings yet
1152 CS F425 Comprehensive Exam Question Paper DL
2 pages
Junction Tree Variational Autoencoder For Molecular Graph Generation
No ratings yet
Junction Tree Variational Autoencoder For Molecular Graph Generation
10 pages
ML Techniques in Molecular Modeling Seminar All Exaggerated
No ratings yet
ML Techniques in Molecular Modeling Seminar All Exaggerated
7 pages
Augmenting Optimization Based Molecular Design Wi 2024 Computers Chemical
No ratings yet
Augmenting Optimization Based Molecular Design Wi 2024 Computers Chemical
12 pages
Final Chemistry Project Report 2024
No ratings yet
Final Chemistry Project Report 2024
10 pages
Deep Learning Methods For Molecular Representa
No ratings yet
Deep Learning Methods For Molecular Representa
13 pages
2110-Transformers For Molecular Graph Generation-Conf
No ratings yet
2110-Transformers For Molecular Graph Generation-Conf
6 pages
Graph Neural Network Framework For Web Based Predi c6fvzbn6
No ratings yet
Graph Neural Network Framework For Web Based Predi c6fvzbn6
11 pages
Moflow
No ratings yet
Moflow
10 pages
ArXiv-2022-YuyangWang-0-Molecular Contrastive Learning of Representations Via Graph Neural Networks
No ratings yet
ArXiv-2022-YuyangWang-0-Molecular Contrastive Learning of Representations Via Graph Neural Networks
19 pages
Chem Prop
No ratings yet
Chem Prop
9 pages
ArXiv 2020 Stanislaw 0 Molecule Attention Transformer
No ratings yet
ArXiv 2020 Stanislaw 0 Molecule Attention Transformer
16 pages
Augmented Memory Sample Efficient Generative Molecular Design With Reinforcement Learning
No ratings yet
Augmented Memory Sample Efficient Generative Molecular Design With Reinforcement Learning
13 pages
VJTNN
No ratings yet
VJTNN
13 pages
3DMolNet：分子结构的生成网络
No ratings yet
3DMolNet：分子结构的生成网络
9 pages
A Spatiotemporal Deep Learning Approach For Unsupervised Anomaly Detection in CL
No ratings yet
A Spatiotemporal Deep Learning Approach For Unsupervised Anomaly Detection in CL
15 pages
ML Team 13 B Div
No ratings yet
ML Team 13 B Div
6 pages
Artificial Intelligence For Prediction of Biological 2vipcq6k
No ratings yet
Artificial Intelligence For Prediction of Biological 2vipcq6k
22 pages
Summary of Research Papers
No ratings yet
Summary of Research Papers
8 pages
Few-Shot Learning With Transformers Via Graph Embeddings For Molecular
No ratings yet
Few-Shot Learning With Transformers Via Graph Embeddings For Molecular
13 pages
Moler
No ratings yet
Moler
22 pages
CS601 - Machine Learning - Unit 2 New
No ratings yet
CS601 - Machine Learning - Unit 2 New
56 pages
EEML Submission
No ratings yet
EEML Submission
3 pages
2309 09355v3
No ratings yet
2309 09355v3
16 pages
N Evaluation of Unconditional Molecular Generation Methods
No ratings yet
N Evaluation of Unconditional Molecular Generation Methods
16 pages
Syllabus - ML Lab
No ratings yet
Syllabus - ML Lab
3 pages
Molecule Generation
No ratings yet
Molecule Generation
34 pages
A Collaborative Constrained Graph Diffusion Model
No ratings yet
A Collaborative Constrained Graph Diffusion Model
28 pages
GCPN Neurips18
No ratings yet
GCPN Neurips18
12 pages
Hybrid DNNBiLSTMaided Intrusion Detection and Trustclustering and Routingbased Intrusion Prevention System in VANET 2025 Taylor and Francis LTD
No ratings yet
Hybrid DNNBiLSTMaided Intrusion Detection and Trustclustering and Routingbased Intrusion Prevention System in VANET 2025 Taylor and Francis LTD
19 pages
Nnadili Et Al 2024 Surfactant Specific Ai Driven Molecular Design Integrating Generative Models Predictive Modeling and
No ratings yet
Nnadili Et Al 2024 Surfactant Specific Ai Driven Molecular Design Integrating Generative Models Predictive Modeling and
12 pages
Graph Neural Networks For Materials Science and Chemistry
No ratings yet
Graph Neural Networks For Materials Science and Chemistry
37 pages
Navigating Chemical Space With Latent Flows: Guanghao Wei Yining Huang
No ratings yet
Navigating Chemical Space With Latent Flows: Guanghao Wei Yining Huang
28 pages
Preprints202504 0512 v1
No ratings yet
Preprints202504 0512 v1
28 pages
ML Unit-Ii
No ratings yet
ML Unit-Ii
38 pages
Master Thesis Cortecchia Tommaso
No ratings yet
Master Thesis Cortecchia Tommaso
84 pages
25549-Article Text-29612-1-2-20230626
No ratings yet
25549-Article Text-29612-1-2-20230626
10 pages
PyTorch Foundations and Applications: Definitive Reference for Developers and Engineers
From Everand
PyTorch Foundations and Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

G AF: F - A M M G G: Raph A LOW Based Utoregressive Odel For Olecular Raph Eneration

Uploaded by

G AF: F - A M M G G: Raph A LOW Based Utoregressive Odel For Olecular Raph Eneration

Uploaded by

Published as a conference paper at ICLR 2020

G RAPH AF: A F LOW- BASED AUTOREGRESSIVE

affine transformation of AF can be written as:

3.2 G RAPH R EPRESENTATION L EARNING

4.1 G RAPH AF F RAMEWORK

ziX = Xi + u, u ∼ U [0, 1)d ; zij

𝜀𝜀1 𝜀𝜀2 𝜀𝜀3 𝜀𝜀4

Noise from 𝑵𝑵(𝟎𝟎, 𝑰𝑰) C C C … …

We present further discussions on dequantization techniques in Appendix A. Formally, we define

p(ziX |Gi ) =N (µX X 2

R-GCN: HiL = R-GCN(Gi ), h̃i = sum(HiL );

4.2 E FFICIENT PARALLEL T RAINING

In GraphAF, since f : E → Z is autoregressive, the Jacobian matrix of the inverse process f −1 :

4.3 VALIDITY C ONSTRAINED S AMPLING

4.4 G OAL - DIRECTED M OLECULE G ENERATION WITH R EINFORCEMENT L EARNING

5.1 E XPERIMENT S ETUP

Method Validity Validity w/o check Uniqueness Novelty Reconstruction

Table 3: Results of density modeling and generation on three different datasets.

Method Validity Validity w/o check Uniqueness Novelty Reconstruction

5.2 N UMERICAL R ESULTS

Table 5: Comparison of the top 3 property scores of generated molecules.

Penalized logP QED

12.23 11.29 0.948 0.948 -8.25 -0.11

11.05 10.83 0.947 0.946 -8.66 -1.10

Table 6: Comparison of results on constrained property optimization.

B PARALLEL T RAINING A LGORITHM

Algorithm 1 Parallel Training Algorithm of GraphAF

18: end for Pm

Table 7: Tuned-parameters for policy gradient and their search space.

D V ISUALIZATION OF GENERATED GENERIC GRAPHS

E M ORE MOLECULE SAMPLES

(a) Graphs from training set (b) Graphs generated by GraphAF

(a) Graphs from training set (b) Graphs generated by GraphAF

Figure 5: 50 molecules sampled from prior.

10.75 10.21 9.79 9.18

8.29 8.08 7.75 7.64

6.27 6.09 5.48 5.28

0.61 19.06 0.64 11.86 0.63 9.01 0.71 2.10

0.50 9.47 0.53 9.38 0.46 7.84 0.50 7.13

You might also like