
MoFlow: An Invertible Flow Model for Generating Molecular Graphs

Chengxi Zang and Fei Wang
Department of Population Health Sciences, Weill Cornell Medicine
[email protected] [email protected]
ABSTRACT
Generating molecular graphs with desired chemical properties driven by deep graph generative models provides a very promising way to accelerate the drug discovery process. Such graph generative models usually consist of two steps: learning latent representations and generating molecular graphs. However, generating novel and chemically valid molecular graphs from latent representations is very challenging because of the chemical constraints and combinatorial complexity of molecular graphs. In this paper, we propose MoFlow, a flow-based graph generative model that learns invertible mappings between molecular graphs and their latent representations. To generate molecular graphs, our MoFlow first generates bonds (edges) through a Glow-based model, then generates atoms (nodes) given bonds by a novel graph conditional flow, and finally assembles them into a chemically valid molecular graph with a post-hoc validity correction. Our MoFlow has merits including exact and tractable likelihood training, efficient one-pass embedding and generation, chemical validity guarantees, 100% reconstruction of training data, and good generalization ability. We validate our model on four tasks: molecular graph generation and reconstruction, visualization of the continuous latent space, property optimization, and constrained property optimization. Our MoFlow achieves state-of-the-art performance, which implies its potential efficiency and effectiveness in exploring the large chemical space for drug discovery.

CCS CONCEPTS
• Mathematics of computing → Graph algorithms; • Theory of computation → Generating random combinatorial structures; • Computing methodologies → Unsupervised learning; Neural networks; Maximum likelihood modeling.

KEYWORDS
Graph Generative Model; Graph Normalizing Flow; Graph Conditional Flow; Deep Generative Model; De novo Drug Design; Molecular Graph Generation; Molecular Graph Optimization

ACM Reference Format:
Chengxi Zang and Fei Wang. 2020. MoFlow: An Invertible Flow Model for Generating Molecular Graphs. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23-27, 2020, Virtual Event, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3394486.3403104

1 INTRODUCTION
Drug discovery aims at finding candidate molecules with desired chemical properties for clinical trials, which is a long (10-20 years) and costly ($0.5-$2.6 billion) process with a high failure rate [1, 24]. Recently, deep graph generative models have demonstrated their big potential to accelerate the drug discovery process by exploring the large chemical space in a data-driven manner [12, 34]. These models usually first learn a continuous latent space by encoding¹ the training molecular graphs and then generate novel and optimized ones through decoding from the learned latent space guided by targeted properties [9, 12]. However, it is still very challenging to generate novel and chemically valid molecular graphs with desired properties since: a) the scale of the chemical space of drug-like compounds is 10^60 [21] but the scale of molecular graphs that existing methods can possibly generate is much smaller, and b) generating molecular graphs that have both multi-type nodes and multi-type edges and that follow bond-valence constraints is a hard combinatorial task.

¹In this paper, we use inference, embedding or encoding interchangeably to refer to the transformation from molecular graphs to the learned latent space, and we use decoding or generation for the reverse transformation.

Prior works leverage different deep generative frameworks for generating molecular SMILES codes [32] or molecular graphs, including variational autoencoder (VAE)-based models [4, 5, 12, 15, 18, 19, 30], generative adversarial network (GAN)-based models [6, 33], and autoregressive (AR)-based models [25, 33]. In this paper, we explore a different deep generative framework, namely the normalizing flow [7, 13, 20], to generate molecular graphs. Compared with the above three frameworks, the flow-based models are the only ones which can memorize and exactly reconstruct all the input data, and at the same time have the potential to generate more novel, unique and valid molecules, which implies their potential capability for deeper exploration of the huge chemical space. To our best knowledge, there have been three flow-based models proposed for molecular graph generation. The GraphAF [29] model is an autoregressive flow-based model that achieves state-of-the-art performance in molecular graph generation. GraphAF generates molecules in a sequential manner by adding each new atom or bond followed by a validity check. GraphNVP [20] and GRF [10] are proposed for molecular graph generation in a one-shot manner. However, they cannot guarantee chemical validity and thus show poor performance in generating valid and novel molecules.


In this paper, we propose a novel deep graph generative model named MoFlow to generate molecular graphs. Our MoFlow is the first of its kind which not only generates molecular graphs efficiently by invertible mapping at one shot, but also has a chemical validity guarantee. More specifically, to capture the combinatorial atom-and-bond structures of molecular graphs, we propose a variant of the Glow model [13] to generate bonds (multi-type edges, e.g., single, double and triple bonds), a novel graph conditional flow to generate atoms (multi-type nodes, e.g., C, N etc.) given bonds by leveraging graph convolutions, and finally assemble atoms and bonds into a valid molecular graph which follows the bond-valence constraints. We illustrate our modelling framework in Figure 1. Our MoFlow is trained by exact and tractable likelihood estimation, and its one-pass inference and generation can be efficiently utilized for molecular graph optimization.

We validate our MoFlow through a wide range of experiments from molecular graph generation, reconstruction and visualization to optimization. As baselines, we compare the state-of-the-art VAE-based model [12], autoregressive-based models [25, 33], and all three flow-based models [10, 20, 29]. As for memorizing input data, MoFlow achieves a 100% reconstruction rate. As for exploring the unknown chemical space, MoFlow outperforms the above models by generating more novel, unique and valid molecules (as demonstrated by the N.U.V. scores in Tables 2 and 3). MoFlow generates 100% chemically valid molecules when sampling from the prior distributions. Furthermore, even without validity correction, MoFlow still generates far more valid molecules than existing models (validity-without-check scores in Tables 2 and 3). For example, the state-of-the-art autoregressive-flow-based model GraphAF [29] achieves 67% and 68% validity-without-check scores for the two datasets while MoFlow achieves 96% and 82% respectively, thanks to its capability of capturing the chemical structures in a holistic way. As for chemical property optimization, MoFlow can find many more novel molecules with top drug-likeness scores than existing models (Table 4 and Figure 5). As for constrained property optimization, MoFlow finds novel and optimized molecules with the best similarity scores and the second best property improvement (Table 5).

It is worthwhile to highlight our contributions as follows:
• Novel MoFlow model: our MoFlow is one of the first flow-based graph generative models which not only generates molecular graphs at one shot by invertible mapping but also has a validity guarantee. To capture the combinatorial atom-and-bond structures of molecular graphs, we propose a variant of the Glow model for bonds (edges) and a novel graph conditional flow for atoms (nodes) given bonds, and then assemble them into valid molecular graphs.
• State-of-the-art performance: our MoFlow achieves many state-of-the-art results w.r.t. molecular graph generation, reconstruction, optimization, etc., and at the same time our one-shot inference and generation are very efficient, which implies its potential for deep exploration of the huge chemical space for drug discovery.

The outline of this paper is: survey (Sec. 2), proposed method (Sec. 3 and 4), experiments (Sec. 5), and conclusions (Sec. 6). In order to promote reproducibility, our codes and datasets are open-sourced at https://github.com/calvin-zcx/moflow.

2 RELATED WORK
Molecular Generation. Different deep generative frameworks have been proposed for generating molecular SMILES or molecular graphs. Among the variational autoencoder (VAE)-based models [4, 5, 12, 15, 18, 19, 30], the JT-VAE [12] generates valid tree-structured molecules by first generating a tree-structured scaffold of chemical substructures and then assembling substructures according to the generated scaffold. The MolGAN [6] is a generative adversarial network (GAN)-based model but shows very limited performance in generating valid and unique molecules. The autoregressive-based models generate molecules in a sequential manner with a validity check at each generation step. For example, the MolecularRNN [25] sequentially generates each character of a SMILES string and the GCPN [33] sequentially generates each atom/bond in a molecular graph. In this paper, we explore a different deep generative framework, namely the normalizing flow models [7, 13, 20], for molecular graph generation, which have the potential to memorize and reconstruct all the training data and generalize to generating more valid, novel and unique molecules.

Flow-based Models. The (normalizing) flow-based models try to learn mappings between complex distributions and simple prior distributions through invertible neural networks, and such a framework has the merits of exact and tractable likelihood estimation for training, efficient one-pass inference and sampling, invertible mapping and thus reconstruction of all the training data, etc. Examples include NICE [7], RealNVP [8], Glow [13] and GNF [17], which show promising results in generating images or even graphs [17]. See the latest reviews in [14, 22] and more technical details in Section 3.

To our best knowledge, there are three flow-based models for molecular graph generation. The GraphAF [29] is an autoregressive flow-based model which achieves state-of-the-art performance in molecular graph generation. The GraphAF generates molecular graphs in a sequential manner with a validity check when adding any new atom or bond. The GraphNVP [20] and GRF [10] are proposed for molecular graph generation in a one-shot manner. However, they have no guarantee of chemical validity and thus show very limited performance in generating valid and novel molecular graphs. Our MoFlow is the first of its kind which not only generates molecular graphs efficiently by invertible mapping at one shot but also has a validity guarantee. In order to capture the atom-and-bond composition of molecules, we propose a variant of the Glow [13] model for bonds and a novel graph conditional flow for atoms given bonds, and then combine them with a post-hoc validity correction. Our MoFlow achieves many state-of-the-art results thanks to capturing the chemical structures in a holistic way, and our one-shot inference and generation are more efficient than sequential models.

3 MODEL PRELIMINARY
The flow framework. The flow-based models aim to learn a sequence of invertible transformations f_Θ = f_L ∘ ... ∘ f_1 between complex high-dimensional data X ∼ P_X(X) and Z ∼ P_Z(Z) in a latent space with the same number of dimensions, where the latent distribution P_Z(Z) is easy to model (e.g., strong independence assumptions hold in such a latent space).


The potentially complex data in the original space can be modelled by the change of variable formula, where Z = f_Θ(X) and:

$P_X(X) = P_Z(Z)\,\left|\det\left(\frac{\partial Z}{\partial X}\right)\right|$  (1)

Sampling X̃ ∼ P_X(X) is achieved by sampling Z̃ ∼ P_Z(Z) and then transforming X̃ = f_Θ^{-1}(Z̃) by the reverse mapping of f_Θ.

Let Z = f_Θ(X) = f_L ∘ ... ∘ f_1(X) and H_l = f_l(H_{l-1}), where the f_l (l = 1, ..., L ∈ N+) are invertible mappings, H_0 = X, H_L = Z, and P_Z(Z) follows a standard isotropic Gaussian with independent dimensions. Then we get the log-likelihood of X by the change of variable formula as follows:

$\log P_X(X) = \log P_Z(Z) + \log\left|\det\left(\frac{\partial Z}{\partial X}\right)\right| = \sum_i \log P_{Z_i}(Z_i) + \sum_{l=1}^{L} \log\left|\det\left(\frac{\partial f_l}{\partial H_{l-1}}\right)\right|$  (2)

where P_{Z_i}(Z_i) is the probability of the i-th dimension of Z and f_Θ = f_L ∘ ... ∘ f_1 is an invertible deep neural network to be learnt. Thus, the exact-likelihood-based training is tractable.
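To make Eq. (2) concrete, here is a minimal sketch (in PyTorch, which the paper's released code also uses) of exact log-likelihood computation through a stack of invertible layers; the layer interface returning (output, log-det) is our illustrative assumption, not the authors' implementation:

```python
import math
import torch

def flow_log_likelihood(x, layers):
    """Exact log-likelihood via the change-of-variables formula, Eq. (2).
    `layers` is an assumed list of invertible maps, each returning
    (output, log|det Jacobian|) for a batch of flattened inputs."""
    z = x.flatten(start_dim=1)
    total_log_det = torch.zeros(z.shape[0])
    for layer in layers:                      # z = f_L ∘ ... ∘ f_1(x)
        z, log_det = layer(z)
        total_log_det += log_det              # accumulate log|det(∂f_l/∂H_{l-1})|
    # standard isotropic Gaussian prior with independent dimensions
    log_pz = (-0.5 * (z.pow(2) + math.log(2 * math.pi))).sum(dim=1)
    return log_pz + total_log_det             # log P_X(X)
```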
Invertible affine coupling layers. However, how to design a) an invertible function f_Θ with b) expressive structures and c) efficient computation of the Jacobian determinant is nontrivial. NICE [7] and RealNVP [8] design an affine coupling transformation Z = f_Θ(X): R^n ↦ R^n:

$Z_{1:d} = X_{1:d}, \qquad Z_{d+1:n} = X_{d+1:n} \odot e^{S_\Theta(X_{1:d})} + T_\Theta(X_{1:d})$  (3)

by splitting X into two partitions X = (X_{1:d}, X_{d+1:n}). Thus, a) the invertibility is guaranteed by:

$X_{1:d} = Z_{1:d}, \qquad X_{d+1:n} = \left(Z_{d+1:n} - T_\Theta(Z_{1:d})\right) / e^{S_\Theta(Z_{1:d})}$  (4)

b) the expressive power depends on arbitrary neural structures of the Scale function S_Θ: R^d ↦ R^{n-d} and the Transformation function T_Θ: R^d ↦ R^{n-d} in the affine transformation of X_{d+1:n}, and c) the Jacobian determinant can be computed efficiently by $\det(\frac{\partial Z}{\partial X}) = \exp\left(\sum_j S_\Theta(X_{1:d})_j\right)$.
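For illustration only (MoFlow's own coupling layers replace the exponential scale with a sigmoid, Sec. 4), here is a minimal RealNVP-style coupling layer implementing Eqs. (3)-(4); the split point d and the small two-layer nets are assumptions:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Sketch of a NICE/RealNVP affine coupling layer (Eqs. 3-4)."""
    def __init__(self, n, d, hidden=64):
        super().__init__()
        self.d = d
        self.s_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n - d))
        self.t_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n - d))

    def forward(self, x):                          # Eq. (3)
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s = self.s_net(x1)
        z2 = x2 * torch.exp(s) + self.t_net(x1)
        log_det = s.sum(dim=1)                     # log det(∂Z/∂X) = Σ_j S_Θ(X_{1:d})_j
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z):                          # Eq. (4)
        z1, z2 = z[:, :self.d], z[:, self.d:]
        x2 = (z2 - self.t_net(z1)) * torch.exp(-self.s_net(z1))
        return torch.cat([z1, x2], dim=1)
```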
Splitting dimensions. The flow-based models, e.g., RealNVP [8] and Glow [13], adopt a squeeze operation which compresses the spatial dimensions X ∈ R^{c×n×n} into R^{(c·h·h)×(n/h)×(n/h)} to make more channels and then split the channels into two halves for the coupling layer. A deep flow model at a specific layer transforms the dimensions left unchanged by the previous layer, so that all the dimensions get transformed. In order to learn an optimal partition of X, the Glow [13] model introduces an invertible 1×1 convolution: R^{c×n×n} × R^{c×c} ↦ R^{c×n×n} with a learnable convolution kernel W ∈ R^{c×c} which is initialized as a random rotation matrix. After the transformation Y = invertible-1×1-convolution(X, W), a fixed partition Y = (Y_{1:c/2,:,:}, Y_{c/2+1:c,:,:}) over the channel dimension c is used for the affine coupling layers.

Numerical stability by actnorm. In order to ensure the numerical stability of the flow-based models, the actnorm layer is introduced in Glow [13], which normalizes the dimensions in each channel over a batch by an affine transformation with learnable scale and bias. The scale and the bias are initialized as the mean and the inverse of the standard deviation of the dimensions in each channel over the batch.

[Figure 1: The outline of our MoFlow. A molecular graph M (e.g., Metformin) is represented by a feature matrix A for atoms and adjacency tensors B for bonds. Inference: the graph conditional flow (GCF) f_{A|B} for atoms (Sec. 4.2) transforms A given B into the conditional latent vector Z_{A|B}, and the Glow f_B for bonds (Sec. 4.3) transforms B into the latent vector Z_B. The latent space follows a spherical Gaussian distribution. Generation: the generation process is the reverse transformation of the previous operations, followed by a validity correction procedure (Sec. 4.4) which ensures chemical validity. We summarize MoFlow in Sec. 4.5. Regression and optimization: the mapping y(Z) between the latent space and molecular properties is used for molecular graph optimization and property prediction (Sec. 5.3, Sec. 5.4).]

4 PROPOSED MOFLOW MODEL
In this section, we first define the problem and then introduce our Molecular Flow (MoFlow) model in detail. We show the outline of our MoFlow in Figure 1 as a roadmap for this section.

4.1 Problem Definition: Learning a Probability Model of Molecular Graphs
Let M = A × B ⊂ R^{n×k} × R^{c×n×n} denote the set of molecules, which is the Cartesian product of the atom set A with at most n ∈ N+ atoms belonging to k ∈ N+ atom types and the bond set B with c ∈ N+ bond types. A molecule M = (A, B) ∈ A × B is a pair of an atom matrix A ∈ R^{n×k} and a bond tensor B ∈ R^{c×n×n}. We use one-hot encoding for the empirical molecule data, where A(i, k) = 1 represents that atom i has atom type k, and B(c, i, j) = B(c, j, i) = 1 represents a type-c bond between atom i and atom j. Thus, a molecule M can be viewed as an undirected graph with multi-type nodes and multi-type edges.
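To ground this notation, here is a sketch (assuming RDKit and illustrative type tables, not the paper's preprocessing code) that builds the one-hot atom matrix A and bond tensor B from a SMILES string:

```python
import numpy as np
from rdkit import Chem

# Illustrative type tables; the paper's vocabularies differ per dataset (Sec. 5).
ATOM_TYPES = {'C': 0, 'N': 1, 'O': 2, 'F': 3}
BOND_TYPES = {Chem.BondType.SINGLE: 0, Chem.BondType.DOUBLE: 1, Chem.BondType.TRIPLE: 2}

def encode_molecule(smiles, n=9, k=4, c=3):
    """One-hot atom matrix A (n x k) and symmetric bond tensor B (c x n x n)."""
    mol = Chem.MolFromSmiles(smiles)
    Chem.Kekulize(mol, clearAromaticFlags=True)     # the paper kekulizes with RDKit [16]
    A, B = np.zeros((n, k)), np.zeros((c, n, n))
    for atom in mol.GetAtoms():
        A[atom.GetIdx(), ATOM_TYPES[atom.GetSymbol()]] = 1
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        t = BOND_TYPES[bond.GetBondType()]
        B[t, i, j] = B[t, j, i] = 1                 # undirected graph: B(c,i,j) = B(c,j,i)
    return A, B

A, B = encode_molecule('CN(C)C(=N)NC(=N)N')        # Metformin, the Figure 1 example
```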
Our primary goal is to learn a molecule generative model P_M(M), which is the probability of sampling any molecule M from P_M. In order to capture the combinatorial atom-and-bond structures of molecular graphs, we decompose P_M(M) into two parts:

$P_{\mathcal{M}}(M) = P_{\mathcal{M}}((A,B)) \approx P_{A|B}(A \mid B;\, \theta_{A|B})\, P_B(B;\, \theta_B)$  (5)

where P_M is the distribution of molecular graphs, P_B is the distribution of bonds (edges) in analogy to modelling multi-channel images, and P_{A|B} is the conditional distribution of atoms (nodes) given the bonds, obtained by leveraging graph convolution operations. The θ_B and θ_{A|B} are learnable modelling parameters.


In contrast with VAE- or GAN-based frameworks, we can learn the parameters in the exact maximum likelihood estimation (MLE) framework by maximizing:

$\arg\max_{\theta_B,\,\theta_{A|B}} \; \mathbb{E}_{M=(A,B)\sim p_{\mathcal{M}\text{-}data}} \left[\log P_{A|B}(A \mid B;\, \theta_{A|B}) + \log P_B(B;\, \theta_B)\right]$  (6)

Our model thus consists of two parts, namely a graph conditional flow for atoms to learn the atom matrix conditional on the bond tensors, and a flow for bonds to learn bond tensors. We further learn a mapping between the learned latent vectors and molecular properties to regress the graph-based molecular properties, and to guide the generation of optimized molecular graphs.

4.2 Graph Conditional Flow for Atoms
Given a bond tensor B ∈ B ⊂ R^{c×n×n}, the goal of our atom flow is to generate the right atom-type matrix A ∈ A ⊂ R^{n×k} to assemble valid molecules M = (A, B) ∈ M ⊂ R^{n×k+c×n×n}. We first define the B-conditional flow and the graph conditional flow f_{A|B} to transform A given B into the conditional latent variable Z_{A|B} = f_{A|B}(A|B), which follows an isotropic Gaussian P_{Z_{A|B}}. We can get the conditional probability of atom features given the bond graphs, P_{A|B}, by a conditional version of the change of variable formula.

4.2.1 B-Conditional Flow and Graph Conditional Flow.

Definition 4.1. B-conditional flow: A B-conditional flow Z_{A|B}|B = f_{A|B}(A|B) is an invertible and dimension-kept mapping, and there exists a reverse transformation f_{A|B}^{-1}(Z_{A|B}|B) = A|B, where f_{A|B} and f_{A|B}^{-1}: A × B ↦ A × B.

The condition B ∈ B keeps fixed during the transformation. Under the independence assumption of A and B, the Jacobian of f_{A|B} is:

$\frac{\partial f_{A|B}}{\partial (A,B)} = \begin{bmatrix} \frac{\partial f_{A|B}}{\partial A} & \frac{\partial f_{A|B}}{\partial B} \\ 0 & \mathbb{1}_B \end{bmatrix}$  (7)

the determinant of this Jacobian is $\det\frac{\partial f_{A|B}}{\partial (A,B)} = \det\frac{\partial f_{A|B}}{\partial A}$, and thus the conditional version of the change of variable formula in the form of log-likelihood is:

$\log P_{A|B}(A \mid B) = \log P_{Z_{A|B}}(Z_{A|B}) + \log\left|\det\frac{\partial f_{A|B}}{\partial A}\right|$  (8)

Definition 4.2. Graph conditional flow: A graph conditional flow is a B-conditional flow Z_{A|B}|B = f_{A|B}(A|B) where B ∈ B ⊂ R^{c×n×n} is the adjacency tensor for edges with c types and A ∈ A ⊂ R^{n×k} is the feature matrix of the corresponding n nodes.

[Figure 2: Graph conditional flow f_{A|B} for the atom matrix. We show the details of one invertible graph coupling layer and a multiscale structure consisting of a cascade of L such graph coupling layers. The graphnorm is computed only once.]

4.2.2 Graph coupling layer. We construct the aforementioned invertible mappings f_{A|B} and f_{A|B}^{-1} by the scheme of the affine coupling layer. Different from the traditional affine coupling layer, our coupling transformation relies on graph convolution [31] and thus we name such a coupling transformation a graph coupling layer.

For each graph coupling layer, we split the input A ∈ R^{n×k} into two parts A = (A_1, A_2) along the n row dimension, and we get the output Z_{A|B} = (Z_{A_1|B}, Z_{A_2|B}) = f_{A|B}(A|B) as follows:

$Z_{A_1|B} = A_1, \qquad Z_{A_2|B} = A_2 \odot \mathrm{Sigmoid}(S_\Theta(A_1 \mid B)) + T_\Theta(A_1 \mid B)$  (9)

where ⊙ is the element-wise product. We design the scale function S_Θ and the transformation function T_Θ in each graph coupling layer by incorporating graph convolution structures. The bond tensor B ∈ R^{c×n×n} keeps a fixed value during the transformation of the atom matrix A. We also apply the masked convolution idea from [8] to the graph convolution in the graph coupling layer. Here, we adopt Relational Graph Convolutional Networks (R-GCN) [28] to build the graph convolution layer graphconv as follows:

$\mathrm{graphconv}(A_1) = \sum_{i=1}^{c} \hat{B}_i (M \odot A) W_i + (M \odot A) W_0$  (10)

where $\hat{B}_i = D^{-1}B_i$ is the normalized adjacency matrix at channel i, $D = \sum_{c,i} B_{c,i,j}$ is the sum of the in-degrees over all the channels for each node, and M ∈ {0,1}^{n×k} is a binary mask that selects the partition A_1 from A. Because the bond graph is fixed during the graph coupling layer, the graph normalization, denoted as graphnorm, is computed only once.

We use multiple stacked graphconv->BatchNorm1d->ReLu layers with a multi-layer perceptron (MLP) output layer to build the graph scale function S_Θ and the graph transformation function T_Θ. What's more, instead of using the exponential function for S_Θ as discussed in Sec. 3, we adopt the Sigmoid function for the sake of the numerical stability of cascading multiple flow layers. The reverse mapping of the graph coupling layer f_{A|B}^{-1} is:

$A_1 = Z_{A_1|B}, \qquad A_2 = \left(Z_{A_2|B} - T_\Theta(Z_{A_1|B} \mid B)\right) / \mathrm{Sigmoid}(S_\Theta(Z_{A_1|B} \mid B))$  (11)

The logarithm of the Jacobian determinant of each graph coupling layer can be efficiently computed by:

$\log\left|\det\left(\frac{\partial f_{A|B}}{\partial A}\right)\right| = \sum_j \log \mathrm{Sigmoid}(S_\Theta(A_1 \mid B))_j$  (12)

where j iterates over each element. In principle, we can use arbitrarily complex graph convolution structures for S_Θ and T_Θ, since computing the above Jacobian determinant of f_{A|B} does not involve computing the Jacobian of S_Θ or T_Θ.
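The following sketch shows how Eqs. (9), (10) and (12) fit together for a single molecule; the shapes, the single R-GCN layer and the small linear heads are simplifying assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class GraphCoupling(nn.Module):
    """Sketch of one graph coupling layer (Eqs. 9, 10, 12) for a single molecule."""
    def __init__(self, n, k, c, hidden=64):
        super().__init__()
        mask = torch.zeros(n, 1)
        mask[: n // 2] = 1                                   # binary mask M selecting A1
        self.register_buffer('mask', mask)
        self.rel_w = nn.Parameter(torch.randn(c, k, hidden) * 0.01)   # W_i per bond channel
        self.self_w = nn.Parameter(torch.randn(k, hidden) * 0.01)     # W_0 self-connection
        self.scale_head = nn.Linear(hidden, k)
        self.shift_head = nn.Linear(hidden, k)

    def graphconv(self, A, B_hat):                           # Eq. (10), B_hat = D^{-1} B
        masked = self.mask * A                               # M ⊙ A
        h = masked @ self.self_w
        for i in range(B_hat.shape[0]):                      # sum over bond channels
            h = h + B_hat[i] @ masked @ self.rel_w[i]
        return torch.relu(h)

    def forward(self, A, B_hat):
        h = self.graphconv(A, B_hat)
        s = torch.sigmoid(self.scale_head(h))                # sigmoid scale for stability
        t = self.shift_head(h)
        Z = self.mask * A + (1 - self.mask) * (A * s + t)    # Eq. (9): A1 passes through
        log_det = ((1 - self.mask) * torch.log(s)).sum()     # Eq. (12)
        return Z, log_det
```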
4.2.3 Actnorm for 2-dimensional matrices. For the sake of numerical stability, we design a variant of the invertible actnorm layer [13] for the 2-dimensional atom matrix, denoted as actnorm2D (activation normalization for 2D matrices), to normalize each row, namely the feature dimension of each node, over a batch of 2-dimensional atom matrices.


Given the mean µ ∈ R^{n×1} and the variance σ² ∈ R^{n×1} for each row dimension, the normalized input follows Â = (A − µ)/√(σ² + ϵ) where ϵ is a small constant, the reverse transformation is A = Â ∗ √(σ² + ϵ) + µ, and the logarithmic Jacobian determinant is:

$\log\left|\det\frac{\partial\,\mathrm{actnorm2D}}{\partial X}\right| = \frac{k}{2}\sum_i \left|\log(\sigma_i^2 + \epsilon)\right|$  (13)
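A sketch of actnorm2D follows; the data-independent initialization here is a simplification (Glow initializes scale and bias from the first batch's statistics), and the signed log-det below is the change-of-variables quantity whose magnitude appears in Eq. (13):

```python
import torch
import torch.nn as nn

class ActNorm2D(nn.Module):
    """Row-wise activation normalization for (batch, n, k) atom matrices (Sec. 4.2.3)."""
    def __init__(self, n, eps=1e-6):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n, 1))     # per-row mean (bias)
        self.var = nn.Parameter(torch.ones(n, 1))     # per-row variance (scale)
        self.eps = eps

    def forward(self, A):
        A_hat = (A - self.mu) / torch.sqrt(self.var + self.eps)
        k = A.shape[-1]
        # diagonal Jacobian: log-det is -(k/2) Σ_i log(σ_i² + ε)
        log_det = -0.5 * k * torch.log(self.var + self.eps).sum()
        return A_hat, log_det

    def inverse(self, A_hat):
        return A_hat * torch.sqrt(self.var + self.eps) + self.mu
```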
4.2.4 Deep architectures. We summarize our deep graph conditional flow in Figure 2. We stack multiple graph coupling layers to form the graph conditional flow. We alternate different partitions of A = (A_1, A_2) in each layer to transform the part left unchanged by the previous layer.

4.3 Glow for Bonds
The bond flow aims to learn an invertible mapping f_B: B ⊂ R^{c×n×n} ↦ B ⊂ R^{c×n×n} where the transformed latent variable Z_B = f_B(B) follows an isotropic Gaussian. According to the change of variable formula, we can get the logarithmic probability of bonds by $\log P_B(B) = \log P_{Z_B}(Z_B) + \log|\det(\frac{\partial f_B}{\partial B})|$ and generate the bond tensor by reversing the mapping, B̃ = f_B^{-1}(Z̃) where Z̃ ∼ P_Z(Z). We can use an arbitrary flow model for the bond tensor, and we build our bond flow f_B on a variant of the Glow [13] framework.

We also follow the scheme of the affine coupling layer to build invertible mappings. For each affine coupling layer, we split the input B ∈ R^{c×n×n} into two parts B = (B_1, B_2) along the channel dimension c, and we get the output Z_B = (Z_{B_1}, Z_{B_2}) as follows:

$Z_{B_1} = B_1, \qquad Z_{B_2} = B_2 \odot \mathrm{Sigmoid}(S_\Theta(B_1)) + T_\Theta(B_1)$  (14)

And thus the reverse mapping f_B^{-1} is:

$B_1 = Z_{B_1}, \qquad B_2 = \left(Z_{B_2} - T_\Theta(Z_{B_1})\right) / \mathrm{Sigmoid}(S_\Theta(Z_{B_1}))$  (15)

Instead of using the exponential function as the scale function, we use the Sigmoid function with range (0, 1) to ensure numerical stability when stacking many layers. We find that an exponential scale function leads to a large reconstruction error when the number of affine coupling layers increases. The scale function S_Θ and the transformation function T_Θ in each affine coupling layer can have arbitrary structures. We use multiple 3×3 conv2d->BatchNorm2d->ReLu layers to build them. The logarithm of the Jacobian determinant of each affine coupling is:

$\log\left|\det\left(\frac{\partial Z_B}{\partial B}\right)\right| = \sum_j \log \mathrm{Sigmoid}(S_\Theta(B_1))_j$  (16)

[Figure 3: A variant of Glow f_B for the bonds' adjacency tensors.]

In order to learn an optimal partition and to ensure the model's stability and learning rate, we also use the invertible 1×1 convolution layer and the actnorm layer adopted in Glow. In order to get more channels for masking and transformation, we squeeze the spatial size of B from R^{c×n×n} to R^{(c·h·h)×(n/h)×(n/h)} by a factor h and apply the affine coupling transformation to the squeezed data. The reverse unsqueeze operation is applied to the output. We summarize our bond flow in Figure 3.
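A sketch of the squeeze/unsqueeze reshapes (the standard RealNVP/Glow trick; the factor h=2 is an example choice):

```python
import torch

def squeeze(B, h=2):
    """(c, n, n) -> (c*h*h, n/h, n/h): trade spatial size for channels."""
    c, n, _ = B.shape
    B = B.reshape(c, n // h, h, n // h, h)
    return B.permute(0, 2, 4, 1, 3).reshape(c * h * h, n // h, n // h)

def unsqueeze(B, h=2):
    """Inverse reshape back to (c, n, n); unsqueeze(squeeze(B)) == B."""
    chh, m, _ = B.shape
    c = chh // (h * h)
    return B.reshape(c, h, h, m, m).permute(0, 3, 1, 4, 2).reshape(c, m * h, m * h)
```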
4.4 Validity Correction
Molecules must follow the valency constraints for each atom, but assembling a molecule from a generated bond tensor and atom matrix may lead to chemically invalid ones. Here we define the valency constraint for the i-th atom as:

$\sum_{c,j} c \times B(c,i,j) \le \mathrm{Valency}(\mathrm{Atom}_i) + Ch$  (17)

where B ∈ {0,1}^{c×n×n} is the one-hot bond tensor over the c ∈ {1,2,3} orders of chemical bonds (single, double, triple) and Ch ∈ N represents the formal charge. Different from the existing valency constraints defined in [25, 33], we consider the effect of formal charge, which may introduce extra bonds for the charged atoms. For example, ammonium [NH4]+ may have 4 bonds for N instead of 3. Similarly, S+ and O+ may have 3 bonds instead of 2. Here we only consider Ch = 1 for N+, S+ and O+, and set Ch = 0 for the other atoms. In contrast with the existing reject-sampling-based validity checks adopted in the autoregressive models [25, 33], we introduce a new post-hoc validity correction procedure after generating a molecule M at once: 1) check the valency constraints of M; 2) if all the atoms of M follow the valency constraints, return the largest connected component of the molecule M and end the procedure; 3) if there exists an invalid atom i, namely $\sum_{c,j} c \times B(c,i,j) > \mathrm{Valency}(\mathrm{Atom}_i) + Ch$, sort the bonds of i by their order and delete 1 order from the bond with the largest order; 4) go to step 1). Our validity correction procedure tries to make a minimum change to the existing molecule and to keep the largest connected component as large as possible.
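A sketch of this correction loop (the valency table and formal-charge vector are passed in as assumptions; extracting the largest connected component, e.g., with RDKit's GetMolFrags, is left to the caller):

```python
import numpy as np

def valency_load(B):
    """Per-atom bond-order load: Σ_{c,j} c · B(c,i,j) from Eq. (17)."""
    orders = np.arange(1, B.shape[0] + 1)              # bond orders 1, 2, 3
    return (orders[:, None, None] * B).sum(axis=(0, 2))

def validity_correction(B, valency, charge):
    """Reduce the largest-order bond of an offending atom until Eq. (17) holds.
    B is the one-hot bond tensor without the virtual no-bond channel."""
    load = valency_load(B)
    while (load > valency + charge).any():
        i = int(np.argmax(load - (valency + charge)))  # worst offender
        c, j = max(((c, j) for c in range(B.shape[0]) for j in range(B.shape[2])
                    if B[c, i, j] > 0), key=lambda cj: cj[0])
        B[c, i, j] = B[c, j, i] = 0                    # delete one order:
        if c > 0:                                      # triple->double, double->single,
            B[c - 1, i, j] = B[c - 1, j, i] = 1        # single->no bond
        load = valency_load(B)
    return B
```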
4.5 Summary
We summarize the overall modelling framework of our MoFlow in Figure 1, which includes the inference (encoding), generation (decoding), regression and optimization of molecular graphs. Refer to the appendix for the detailed inference (Algorithm 1) and generation (Algorithm 2) algorithms.

5 EXPERIMENTS
Following previous works [12, 29], we validate our MoFlow by answering the following questions:
• Molecular graph generation and reconstruction (Sec. 5.1): Can our MoFlow memorize and reconstruct all the training molecule datasets? Can our MoFlow generalize to generate as many novel, unique and valid molecules as possible?
• Visualizing continuous latent space (Sec. 5.2): Can our MoFlow embed molecular graphs into a continuous latent space with reasonable chemical similarity?
• Property optimization (Sec. 5.3): Can our MoFlow generate novel molecular graphs with optimized properties?
• Constrained property optimization (Sec. 5.4): Can our MoFlow generate novel molecular graphs with optimized properties while keeping as much chemical similarity as possible?


Baselines. We compare our MoFlow with: a) the state-of-the-art VAE-based method JT-VAE [12], which captures chemical validity by encoding and decoding a tree-structured scaffold of molecular graphs; b) the state-of-the-art autoregressive models GCPN [33] and MolecularRNN (MRNN) [25] with reinforcement learning for property optimization, which generate molecules in a sequential manner; and c) the flow-based methods GraphNVP [20] and GRF [10], which generate molecules at one shot, and the state-of-the-art autoregressive-flow-based model GraphAF [29], which generates molecules in a sequential way.

Datasets. We use two datasets, QM9 [26] and ZINC250K [11], for our experiments and summarize them in Table 1. QM9 contains 133,885 molecules with at most 9 atoms of 4 different types, and ZINC250K has 249,455 drug-like molecules with at most 38 atoms of 9 different types. The molecules are kekulized by the chemical software RDKit [16] and the hydrogen atoms are removed. There are three types of edges, namely single, double, and triple bonds, for all molecules. Following the pre-processing procedure in [20], we encode each atom and bond by one-hot encoding, pad the molecules which have fewer than the maximum number of atoms with a virtual atom, augment the adjacency tensor of each molecule by a virtual edge channel representing no bonds between atoms, and dequantize [8, 20] the discrete one-hot-encoded data by adding uniform random noise U[0, 0.6] to each dimension, leading to an atom matrix A ∈ R^{9×5} and a bond tensor B ∈ R^{4×9×9} for QM9, and A ∈ R^{38×10} and B ∈ R^{4×38×38} for ZINC250K.
A ∈ R38×10 and B ∈ R4×38×38 for ZINC250k. their latent representations. Besides, because the novelty score also
accounts for the potentially duplicated novel molecules, we propose
Table 1: Statistics of the datasets. a new metric N.U.V. which is the percentage of Novel, Unique, and
Valid molecules in all the generated molecules. We also compare
#Mol. Max. #Node #Edge the validity of ablation models if not using validity check or validity
Graphs #Nodes Types Types
correction, denoted as Validity w/o check in [29].
QM9 133,885 9 4+1 3+1
ZINC250K 249,455 38 9+1 3+1 The prior distribution of latent space follows a spherical multi-
variate Gaussian distribution N (0, (tσ )2 I) where σ is the learned
standard deviation and the hyper-parameter t is the temperature
MoFlow Setup. To be comparable with the one-shot-flow baseline GraphNVP [20], for ZINC250K we adopt 10 coupling layers and 38 graph coupling layers for the bonds' Glow and the atoms' graph conditional flow respectively. We use two 3×3 convolution layers with 512, 512 hidden dimensions in each coupling layer. For each graph coupling layer, we set one relational graph convolution layer with 256 dimensions followed by a two-layer multilayer perceptron with 512, 64 hidden dimensions. As for QM9, we adopt 10 coupling layers and 27 graph coupling layers for the bonds' Glow and the atoms' graph conditional flow respectively. There are two 3×3 convolution layers with 128, 128 hidden dimensions in each coupling layer, and one graph convolution layer with 64 dimensions followed by a two-layer multilayer perceptron with 128, 64 hidden dimensions in each graph coupling layer. As for the optimization experiments, we further train a regression model to map the latent embeddings to different property scalars (discussed in Sec. 5.3 and 5.4) by a multi-layer perceptron with an 18-dim linear layer -> ReLu -> 1-dim linear layer structure. For each dataset, we use the same trained model for all the following experiments.

Empirical Running Time. Following the above setup, we implemented our MoFlow in PyTorch 1.3.1 and trained it with the Adam optimizer with learning rate 0.001, batch size 256, and 200 epochs for both datasets on 1 GeForce RTX 2080 Ti GPU and 16 CPU cores. Our MoFlow finished 200-epoch training within 22 hours (6.6 minutes/epoch) for ZINC250K and 3.3 hours (0.99 minutes/epoch) for QM9. Thanks to efficient one-pass inference/embedding, our MoFlow takes a negligible 7 minutes to learn an additional regression layer trained for 3 epochs for the optimization experiments on ZINC250K. In comparison, GraphNVP [20] costs 38.4 hours (11.5 minutes/epoch) in our PyTorch implementation for training on ZINC250K with the same configurations, and the estimated total running time of GraphAF [29] is 124 hours (24 minutes/epoch), which consists of the reported 4 hours for a generation model trained for 10 epochs and an estimated 120 hours for another optimization model trained for 300 epochs. The reported running time of JT-VAE [12] is roughly 24 hours in [33].

5.1 Generation and Reconstruction
Setup. In this task, we evaluate our MoFlow's capability of generating novel, unique and valid molecular graphs, and whether our MoFlow can reconstruct input molecular graphs from their latent representations. We adopt the widely-used metrics, including: Validity, which is the percentage of chemically valid molecules among all the generated molecules; Uniqueness, which is the percentage of unique valid molecules among all the generated molecules; Novelty, which is the percentage of generated valid molecules which are not in the training dataset; and Reconstruction rate, which is the percentage of molecules in the input dataset which can be reconstructed from their latent representations. Besides, because the novelty score also counts potentially duplicated novel molecules, we propose a new metric, N.U.V., which is the percentage of Novel, Unique, and Valid molecules among all the generated molecules. We also compare the validity of the ablation models without the validity check or validity correction, denoted as Validity w/o check in [29].

The prior distribution of the latent space follows a spherical multivariate Gaussian distribution N(0, (tσ)²I), where σ is the learned standard deviation and the hyper-parameter t is the temperature of the reduced-temperature generative model [13, 20, 23]. We use t = 0.85 in generation for both the QM9 and ZINC250K datasets, and t = 0.6 for the ablation study without validity correction. To be comparable with the state-of-the-art baseline GraphAF [29], we generate 10,000 molecules, i.e., sample 10,000 latent vectors from the prior and then decode them by the reverse transformation of our MoFlow. We report the mean and standard deviation of the results over 5 runs. As for reconstruction, we encode all the molecules from the training dataset into latent vectors by the encoding transformation of our MoFlow and then reconstruct the input molecules from these latent vectors by the reverse transformation of MoFlow.
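A sketch of reduced-temperature sampling and one-pass decoding; `moflow.inverse` and the learned `sigma` are assumed interfaces, not the released API:

```python
import torch

@torch.no_grad()
def sample_molecules(moflow, sigma, num=10000, t=0.85):
    """Draw z ~ N(0, (t·σ)²I) and decode each z in one reverse pass."""
    z = t * sigma * torch.randn(num, sigma.shape[0])   # reduced-temperature prior
    return [moflow.inverse(z_i) for z_i in z]          # decoding + validity correction
```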
Table 2: Generation and reconstruction performance on the QM9 dataset.

Method         % Validity      % Validity w/o check   % Uniqueness    % Novelty       % N.U.V.        % Reconstruct
GraphNVP [20]  83.1 ± 0.5      n/a                    99.2 ± 0.3      58.2 ± 1.9      47.97           100
GRF [10]       84.5 ± 0.70     n/a                    66.0 ± 1.15     58.6 ± 0.82     32.68           100
GraphAF [29]   100             67                     94.51           88.83           83.95           100
MoFlow         100.00 ± 0.00   96.17 ± 0.18           99.20 ± 0.12    98.03 ± 0.14    97.24 ± 0.21    100.00 ± 0.00

Table 3: Generation and reconstruction performance on the ZINC250K dataset.

Method         % Validity      % Validity w/o check   % Uniqueness    % Novelty       % N.U.V.        % Reconstruct
JT-VAE [12]    100             n/a                    100             100             100             76.7
GCPN [33]      100             20                     99.97           100             99.97           n/a
MRNN [25]      100             65                     99.89           100             99.89           n/a
GraphNVP [20]  42.6 ± 1.6      n/a                    94.8 ± 0.6      100             40.38           100
GRF [10]       73.4 ± 0.62     n/a                    53.7 ± 2.13     100             39.42           100
GraphAF [29]   100             68                     99.10           100             99.10           100
MoFlow         100.00 ± 0.00   81.76 ± 0.21           99.99 ± 0.01    100.00 ± 0.00   99.99 ± 0.01    100.00 ± 0.00

Results. Tables 2 and 3 show that our MoFlow outperforms the state-of-the-art models on all six metrics for both the QM9 and ZINC250K datasets. Thanks to the invertible characteristic of flow-based models, our MoFlow builds a one-to-one mapping from the input molecule M to its corresponding latent vector Z, enabling a 100% reconstruction rate, as shown in Tables 2 and 3. In contrast, the VAE-based method JT-VAE and the autoregressive-based methods GCPN and MRNN can't reconstruct all the input molecules. Compared with the one-shot flow-based models GraphNVP and GRF, by incorporating the validity correction mechanism, our MoFlow achieves 100% validity, leading to significant improvements of the validity score and the N.U.V. score for both datasets. Specifically, the N.U.V. scores of MoFlow are 2 and 3 times as large as the N.U.V. scores of GraphNVP and GRF respectively in Table 2. Even without validity correction, our MoFlow still outperforms the validity scores of GraphNVP and GRF by a large margin. Compared with the autoregressive flow-based model GraphAF, we find our MoFlow outperforms GraphAF by an additional 16% and 0.8% with respect to the N.U.V. scores for QM9 and ZINC respectively, indicating that our MoFlow generates more novel, unique and valid molecules. Indeed, MoFlow achieves better uniqueness and novelty scores compared with GraphAF for both datasets. What's more, our MoFlow without validity correction still outperforms GraphAF without the validity check by a large margin w.r.t. the validity score (validity w/o check in Tables 2 and 3) for both datasets, implying the superiority of capturing the molecular structures in a holistic way by our MoFlow over the autoregressive ones in a sequential way.

In conclusion, our MoFlow not only memorizes and reconstructs all the training molecular graphs, but also generates more novel, unique and valid molecular graphs than existing models, indicating that our MoFlow learns a strict superset of the training data and explores the unknown chemical space better.

5.2 Visualizing Continuous Latent Space
Setup. We examine the learned latent space of our MoFlow, denoted as f, by visualizing the decoded molecular graphs from a neighborhood of a latent vector in the latent space. Similar to [12, 15], we encode a seed molecule M into Z = f(M) and then search two random orthogonal directions with unit vectors X and Y based on Z; we then get a new latent vector by Z' = Z + λ_X ∗ X + λ_Y ∗ Y where λ_X and λ_Y are the searching steps. Different from VAE-based models, our MoFlow gets decoded molecules efficiently by the one-pass inverse transformation M' = f^{-1}(Z'). In contrast, VAE-based models such as JT-VAE need to decode each latent vector 10-100 times, and autoregressive-based models like GCPN, MRNN and GraphAF need to generate a molecule sequentially. Furthermore, we measure the chemical similarity between each neighboring molecule and the centering molecule. We choose the Tanimoto index [2] as the chemical similarity metric and indicate the similarity values by a heatmap. We further visualize a linear interpolation between two molecules to show their changing trajectory, similar to the interpolation case between images [13].
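A sketch of this grid visualization; the encode/inverse interfaces and the step scale are assumptions:

```python
import torch

@torch.no_grad()
def latent_grid(moflow, M_seed, steps=(-2, -1, 0, 1, 2), scale=0.5):
    """Decode the grid neighbors Z' = Z + λx·X + λy·Y of a seed molecule."""
    z = moflow.encode(M_seed)
    x = torch.randn_like(z); x /= x.norm()
    y = torch.randn_like(z)
    y -= (y @ x) * x; y /= y.norm()        # orthogonalize Y against X
    return [[moflow.inverse(z + scale * lx * x + scale * ly * y)
             for lx in steps] for ly in steps]
```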
Results. We show the visualization of the latent space in Figure 4. We find that the latent space is very smooth and the interpolations between two latent points only change a molecular graph a little bit. Quantitatively, we find that the chemical similarity between molecules largely corresponds to the Euclidean distance between their latent vectors, implying that our MoFlow embeds similar molecular graph structures into similar latent embeddings. Searching in such a continuous latent space learnt by our MoFlow is the basis for the molecular property optimization and constrained optimization discussed in the following sections.

5.3 Property Optimization
Setup. The property optimization task aims at generating novel molecules with the best Quantitative Estimate of Druglikeness (QED) scores [3], which measure the drug-likeness of the generated molecules. Following the previous works [25, 33], we report the best property scores of novel molecules discovered by each method.

We use the pre-trained MoFlow, denoted as f, from the generation experiment to encode a molecule M and get the molecular embedding Z = f(M), and further train a multilayer perceptron to regress the embedding Z of the molecules to their property values y. We then search for the best molecules by the gradient ascent method, namely Z' = Z + λ ∗ dy/dZ, where λ is the length of the search step. We run the above gradient ascent method for K steps. We decode the new embedding Z' in the latent space to the discovered molecule by the reverse mapping M' = f^{-1}(Z'). The molecule M' is novel if M' doesn't exist in the training dataset.
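A sketch of this latent-space gradient ascent; `regressor` stands for the property MLP, and the step size and count are placeholders:

```python
import torch

def optimize_property(z, regressor, steps=100, lam=0.1):
    """Gradient ascent in latent space: Z' = Z + λ · dy/dZ, repeated for K steps.
    Decode the result with the reverse mapping M' = f^{-1}(Z')."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(steps):
        y = regressor(z).sum()                   # predicted property value(s)
        (grad,) = torch.autograd.grad(y, z)
        z = (z + lam * grad).detach().requires_grad_(True)
    return z.detach()
```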
Results. We report the discovered novel molecules sorted by their QED scores in Table 4. We find that previous methods can only find very few molecules with the best QED score (= 0.948). In contrast, our MoFlow finds many more novel molecules with the best QED values than all the baselines. We show more molecular structures with top QED values in Figure 5.


[Figure 4: Visualization of the learned latent space by our MoFlow. Top: visualization of the grid neighbors of a seed molecule in the center, which serves as the baseline for measuring similarity. Bottom: interpolation between two seed molecular graphs; the left one is the baseline molecule for measuring similarity. Seed molecules are highlighted in red boxes and are randomly selected from ZINC250K.]

Table 4: Discovered novel molecules with the best QED scores. Our MoFlow finds more molecules with the best QED scores. More results in Figure 5.

Method          1st     2nd     3rd     4th
ZINC (Dataset)  0.948   0.948   0.948   0.948
JT-VAE          0.925   0.911   0.910   -
GCPN            0.948   0.947   0.946   -
MRNN            0.948   0.948   0.947   -
GraphAF         0.948   0.948   0.947   0.946
MoFlow          0.948   0.948   0.948   0.948
as the similarity metrics, the penalized logP (plogp) as the target
property, and M from the 800 molecules with the lowest plogp
OH

O F
NH2 F
NH
S
S
N NH
O
O S
NH
N Cl N
N
N

scores in the training dataset of ZINC250K. We use similar gradient


Br Cl N N
O O
O

0.948 0.948 0.948 0.948 0.948


HO F
ascend method as discussed in the previous subsetion to search for
Cl

optimized molecules. An optimization succeeds if we find a novel


O
O
N
S
S
NH
N NH O
N
N NH
Cl

molecule M ′ which is different from M and y(M ′ ) − y(M) ≥ 0 and


NH N O S O
O N
O O O
O

0.948 0.948 0.948 0.948 0.948


sim(M, M ′ ) ≥ δ within K steps where δ is the smallest similarity
threshold to screen the optimized molecules.
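A sketch of the similarity metric and success criterion using RDKit Morgan fingerprints; the radius and bit count are common defaults assumed here, and molecules are compared as canonical SMILES strings:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto similarity of Morgan fingerprints [27]."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def is_success(m, m_new, y, delta):
    """Constrained-optimization success: novel, no property loss, similar enough."""
    return m_new != m and y(m_new) - y(m) >= 0 and tanimoto(m, m_new) >= delta
```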
Results. The results are summarized in Table 5. We find that our MoFlow finds the most similar new molecules while achieving very good plogp improvement. Compared with the state-of-the-art VAE model JT-VAE, our MoFlow achieves a much higher similarity score and property improvement, implying that our model is good at interpolation and at learning a continuous molecular embedding. Compared with the state-of-the-art reinforcement learning based methods GCPN and GraphAF, which are good at generating molecules step-by-step with targeted property rewards, our model MoFlow achieves the best similarity scores and the second best property improvements. We illustrate one optimization example in Figure 6 with very similar structures but a large improvement w.r.t. the penalized logP.


Table 5: Constrained optimization on penalized logP.

JT-VAE
δ     Improvement    Similarity     Success
0.0   1.91 ± 2.04    0.28 ± 0.15    97.5%
0.2   1.68 ± 1.85    0.33 ± 0.13    97.1%
0.4   0.84 ± 1.45    0.51 ± 0.10    83.6%
0.6   0.21 ± 0.71    0.69 ± 0.06    46.4%

GCPN
δ     Improvement    Similarity     Success
0.0   4.20 ± 1.28    0.32 ± 0.12    100%
0.2   4.12 ± 1.19    0.34 ± 0.11    100%
0.4   2.49 ± 1.30    0.48 ± 0.08    100%
0.6   0.79 ± 0.63    0.68 ± 0.08    100%

GraphAF
δ     Improvement    Similarity     Success
0.0   13.13 ± 6.89   0.29 ± 0.15    100%
0.2   11.90 ± 6.86   0.33 ± 0.12    100%
0.4   8.21 ± 6.51    0.49 ± 0.09    99.88%
0.6   4.98 ± 6.49    0.66 ± 0.05    96.88%

MoFlow
δ     Improvement    Similarity     Success
0.0   8.61 ± 5.44    0.30 ± 0.20    98.88%
0.2   7.06 ± 5.04    0.43 ± 0.20    96.75%
0.4   4.71 ± 4.55    0.61 ± 0.18    85.75%
0.6   2.10 ± 2.86    0.79 ± 0.14    58.25%

[Figure 6: An illustration of the constrained optimization of a molecule leading to an improvement of +16.48 w.r.t. the penalized logP and with Tanimoto similarity 0.624. The modified part is highlighted.]

6 CONCLUSION
In this paper, we propose a novel deep graph generative model named MoFlow for molecular graph generation. Our MoFlow is one of the first flow-based models which not only generates molecular graphs at one shot by invertible mappings but also has a validity guarantee. Our MoFlow consists of a variant of the Glow model for bonds and a novel graph conditional flow for atoms given bonds, and then combines them with a post-hoc validity correction. Our MoFlow achieves state-of-the-art performance on molecular generation, reconstruction and optimization. For future work, we plan to combine the advantages of both sequential generative models and one-shot generative models to generate chemically feasible molecular graphs. Codes and datasets are open-sourced at https://github.com/calvin-zcx/moflow.

ACKNOWLEDGEMENT
This work is supported by NSF IIS 1716432, 1750326, ONR N00014-18-1-2585, an Amazon Web Services (AWS) Machine Learning for Research Award and a Google Faculty Research Award.
REFERENCES
[1] Jerry Avorn. 2015. The $2.6 billion pill—methodologic and policy considerations. New England Journal of Medicine 372, 20 (2015), 1877-1879.
[2] Dávid Bajusz, Anita Rácz, and Károly Héberger. 2015. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics 7, 1 (2015), 20.
[3] G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. 2012. Quantifying the chemical beauty of drugs. Nature Chemistry 4, 2 (2012), 90.
[4] Xavier Bresson and Thomas Laurent. 2019. A Two-Step Graph Convolutional Decoder for Molecule Generation. arXiv preprint arXiv:1906.03412 (2019).
[5] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. 2018. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786 (2018).
[6] Nicola De Cao and Thomas Kipf. 2018. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973 (2018).
[7] Laurent Dinh, David Krueger, and Yoshua Bengio. 2014. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014).
[8] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2016. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 (2016).
[9] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science 4, 2 (2018), 268-276.
[10] Shion Honda, Hirotaka Akita, Katsuhiko Ishiguro, Toshiki Nakanishi, and Kenta Oono. 2019. Graph residual flow for molecular graph generation. arXiv preprint arXiv:1909.13521 (2019).
[11] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. 2012. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 52, 7 (2012), 1757-1768.
[12] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2018. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364 (2018).
[13] Durk P Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems. 10215-10224.
[14] Ivan Kobyzev, Simon Prince, and Marcus A Brubaker. 2019. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257 (2019).
[15] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. 2017. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 1945-1954.
[16] Greg Landrum et al. 2006. RDKit: Open-source cheminformatics.
[17] Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, and Kevin Swersky. 2019. Graph normalizing flows. In Advances in Neural Information Processing Systems. 13556-13566.
[18] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. 2018. Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems. 7795-7804.
[19] Tengfei Ma, Jie Chen, and Cao Xiao. 2018. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems. 7113-7124.
[20] Kaushalya Madhawa, Katushiko Ishiguro, Kosuke Nakago, and Motoki Abe. 2019. GraphNVP: An Invertible Flow Model for Generating Molecular Graphs. arXiv preprint arXiv:1905.11600 (2019).
[21] Asher Mullard. 2017. The drug-maker's guide to the galaxy. Nature News 549, 7673 (2017), 445.
[22] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2019. Normalizing Flows for Probabilistic Modeling and Inference. arXiv preprint arXiv:1912.02762 (2019).
[23] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. arXiv preprint arXiv:1802.05751 (2018).
[24] Steven M Paul, Daniel S Mytelka, Christopher T Dunwiddie, Charles C Persinger, Bernard H Munos, Stacy R Lindborg, and Aaron L Schacht. 2010. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery 9, 3 (2010), 203.
[25] Mariya Popova, Mykhailo Shvets, Junier Oliva, and Olexandr Isayev. 2019. MolecularRNN: Generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372 (2019).
[26] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1 (2014), 140022.
[27] David Rogers and Mathew Hahn. 2010. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling 50, 5 (2010), 742-754.
[28] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593-607.
[29] Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. 2020. GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation. In ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
[30] Martin Simonovsky and Nikos Komodakis. 2018. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks. Springer, 412-422.
[31] Mengying Sun, Sendong Zhao, Coryandar Gilvary, Olivier Elemento, Jiayu Zhou, and Fei Wang. 2019. Graph convolutional networks for computational drug development and discovery. Briefings in Bioinformatics (2019).
[32] David Weininger, Arthur Weininger, and Joseph L Weininger. 1989. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Sciences 29, 2 (1989), 97-101.
[33] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. 2018. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems. 6410-6421.
[34] Alex Zhavoronkov, Yan A Ivanenkov, Alex Aliper, Mark S Veselov, Vladimir A Aladinskiy, Anastasiya V Aladinskaya, Victor A Terentiev, Daniil A Polykovskiy, Maksim D Kuznetsov, et al. 2019. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology 37, 9 (2019), 1038-1040.


APPENDIX
A INFERENCE AND GENERATION
We summarize the inference (encoding) and generation (decoding) of molecular graphs by our MoFlow in Algorithm 1 and Algorithm 2 respectively. We visualize the overall framework in Figure 1. As shown in the algorithms, our MoFlow has the merits of exact likelihood estimation/training, one-pass inference, invertible and one-pass generation, and a chemical validity guarantee.

Algorithm 1: Exact Likelihood Inference (Encoding) of Molecular Graphs by MoFlow
Input: f_{A|B}: graph conditional flow for atoms; f_B: Glow for bonds; A: atom matrix; B: bond tensor; P_{Z*}: isotropic Gaussian distributions.
Output: Z_M: latent representation of molecule M; log P_M(M): logarithmic likelihood of molecule M.
  Z_B = f_B(B)
  log P_B(B) = log P_{Z_B}(Z_B) + log |det(∂f_B/∂B)|
  B̂ = graphnorm(B)
  Z_{A|B} = f_{A|B}(A | B̂)
  log P_{A|B}(A|B) = log P_{Z_{A|B}}(Z_{A|B}) + log |det(∂f_{A|B}/∂A)|
  Z_M = (Z_{A|B}, Z_B)
  log P_M(M) = log P_B(B) + log P_{A|B}(A|B)
  Return: Z_M, log P_M(M)

Algorithm 2: Molecular Graph Generation (Decoding) by the Reverse Transformation of MoFlow
Input: f_{A|B}: graph conditional flow for atoms; f_B: Glow for bonds; Z_M: latent representation of molecule M or a sample from the prior Gaussian; validity-correction: validity correction rules.
Output: M: a molecule.
  (Z_{A|B}, Z_B) = Z_M
  B = f_B^{-1}(Z_B)
  B̂ = graphnorm(B)
  A = f_{A|B}^{-1}(Z_{A|B} | B̂)
  M = validity-correction(A, B)
  Return: M
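The two algorithms transliterate to a few lines of Python; this sketch assumes component flows that expose forward/inverse calls with log-det accounting, which is an interface assumption rather than the released code:

```python
def encode(f_ab, f_b, graphnorm, log_prior, A, B):
    """Algorithm 1 sketch: one-pass, exact-likelihood inference."""
    z_b, log_det_b = f_b(B)
    z_a, log_det_a = f_ab(A, graphnorm(B))
    log_p = (log_prior(z_b) + log_det_b) + (log_prior(z_a) + log_det_a)
    return (z_a, z_b), log_p

def decode(f_ab, f_b, graphnorm, validity_correction, z_m):
    """Algorithm 2 sketch: one-pass generation with validity correction."""
    z_a, z_b = z_m
    B = f_b.inverse(z_b)
    A = f_ab.inverse(z_a, graphnorm(B))
    return validity_correction(A, B)
```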
