
Representing Molecules as Random Walks Over Interpretable Grammars

Michael Sun 1 Minghao Guo 1 Weize Yuan 2 Veronika Thost 3 Crystal Elaine Owens 1
Aristotle Franklin Grosz 4 Sharvaa Selvan 5 Katelyn Zhou 6 Hassan Mohiuddin 5 Benjamin J Pedretti 4
Zachary P Smith 4 Jie Chen 3 Wojciech Matusik 1

arXiv:2403.08147v2 [cs.LG] 29 May 2024

Abstract

Recent research in molecular discovery has primarily been devoted to small, drug-like molecules, leaving many similarly important applications in material design without adequate technology. These applications often rely on more complex molecular structures with fewer examples that are carefully designed using known substructures. We propose a data-efficient and interpretable model for representing and reasoning over such molecules in terms of graph grammars that explicitly describe the hierarchical design space featuring motifs to be the design basis. We present a novel representation in the form of random walks over the design space, which facilitates both molecule generation and property prediction. We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability. Code is available at https://github.com/shiningsunnyday/polymer_walk.

1. Introduction

Property-driven molecular discovery represents a challenging application with great potential benefits for society, and this is reflected in the large amount of research conducted in the machine learning community on this topic in recent years (Sawlani, 2024). Yet, most of the research focuses on small, drug-like molecules, while many classes of more complex molecules have been largely neglected. Materials designed for applications such as gas-separation membranes or photovoltaics, which are critical for a sustainable future, often have specific distributions of molecular structure that differ significantly from typical drug-like molecules. In addition, the specificity of the designs and use cases, and the considerable cost of practical experiments, often make this a scenario that is scarce in both data and labels; for example, datasets of ≈300 molecules or fewer are not uncommon (Wang et al., 2018; Lopez et al., 2016; Helma et al., 2001). As a consequence, materials science has not yet fully exploited the potential of machine learning methods (Karande et al., 2022; Wang & Wu, 2023). We focus on such challenging datasets, which feature complex molecules containing functional groups and structural motifs and which arise in multiple diverse, real-world application scenarios.

Our goal is to represent and reason about molecules in a data-efficient and interpretable way. Domain-specific datasets typically exhibit distinct motifs and functional groups, which serve as structural priors in our molecular representation. Previous works show that structural priors are highly advantageous for applications that require data efficiency (Rogers & Hahn, 2010; Xia et al., 2023a; Shui & Karypis, 2020; Jiang et al., 2022; Yang et al., 2022). We propose a novel approach to molecular discovery that is tailored to more complex molecules and low-data scenarios and builds upon the above insights. The idea is to start from a set of expert-defined motifs¹ and learn a context-sensitive grammar over the space of motifs. The novelty of this work lies in our representation and learning of this grammar.

We define a motif graph – a hierarchical abstraction of the molecular design space induced by the given data – where each node is a motif and each edge represents a possible attachment between a pair of motifs. Our main technical contribution is an efficient and interpretable parameterization over the context-sensitive grammar induced by the design space, and the description of a molecule as a random walk of context-sensitive transition rules. Our representation of molecules combines the quality of representation learning with the interpretability of a rule-based grammar.

In terms of quality, we demonstrate that our grammar representation suits applications characterized by designer molecules. We select datasets that reflect real-world settings of experimentally curated molecular designs with complex, modular substructures characterized by functional groups known or hypothesized to yield high target properties.

In terms of interpretability, our grammar representation is special in two ways. As an indirect consequence of supervised learning, our model produces visually discernible clusters according to distinctive structural features within the dataset. More importantly, our compact, context-sensitive grammar allows for discovering design rules that reveal the design principles used during the creation of the dataset.

• Our method largely outperforms pretrained and traditional methods for molecular property prediction. It is competitive with a state-of-the-art graph grammar system for chemistry (Guo et al., 2023b) in terms of quality while being an order of magnitude more runtime efficient.
• Our method's interpretable representations reveal deeper insights into relationships implicit in the data, explain the model's reasoning, and lead to novel scientific insights.
• Our method produces promising molecule generations, in particular diverse designs that are synthesizable at a significantly higher rate than those of the state-of-the-art data-efficient generative model, DEG (Guo et al., 2023a).
• Finally, made possible by our method's interpretability, our approach enables close collaboration with domain experts. In particular, we devised and executed feasible, practical, and semi-automated workflows with experts for fragmenting molecules, constructing the design space, and interpreting the results.

2. Related Works

Motif-based molecular property prediction. ECFP embeddings (Rogers & Hahn, 2010), which capture relevant ego-graphs present in a molecule as bit vectors, represent a motif-based encoding. ECFP embeddings in combination with simple predictors (e.g., XGBoost) have been competitive on small datasets (Xia et al., 2023a). In our evaluation, we show that our model is similarly data-efficient but delivers better predictive performance, owing to the use of graph-based representations. In light of the good performance of ECFPs, it is not surprising that the recently developed subgraph graph neural networks (GNNs) report competitive performance in molecular property prediction when using ego-graphs as subgraphs (Frasca et al., 2022); we consider ESAN (Bevilacqua et al., 2022) in our evaluation. However, existing models usually apply subgraphs rooted at all individual nodes rather than a set of more coarse-grained, potentially complex, domain-specific subgraphs. Other recent work that integrates motifs to improve out-of-distribution detection similarly lacks this dimension of modeling (Yang et al., 2022).

A few closely related works have recently proposed molecular graph representations in which the relations between motifs are explicitly represented, together with corresponding models (Shui & Karypis, 2020; Jiang et al., 2022). Our work differs from theirs in two aspects. First, we show that commonly used automatic approaches for motif extraction are not sufficient for property prediction over several kinds of more complex molecules, and that custom motifs given by domain experts yield better performance. This allows for biasing the model towards known structure-activity relationships or the expert's hypotheses (e.g., fragments known or assumed to be critical for the property under consideration). Second, to the best of our knowledge, their motif graph representations do not model context sensitivity explicitly (e.g., HM-GNN's motif graph (Shui & Karypis, 2020) connects two motifs based only on co-occurrence in a molecule).

Molecule representation by grammars. Recent work has shown that such grammars represent a data-efficient way of representing molecules and yield SOTA results (Guo et al., 2023a;b). In a nutshell, this is achieved by explicitly representing the training data's design space in terms of learnt motifs, in the form of a graph grammar. Grammars naturally allow for generating novel molecules in the given design space. Yet, obtaining production rules involves either manual definition (Krenn et al., 2020; Guo et al., 2022; Nigam et al., 2021) or significant complexity to learn automatically (Guo et al., 2023a; Kajino, 2019), where the training times for downstream tasks are considerable (see Figure 5). Further, the learnt substructures sometimes lack a chemical interpretation, and grammar derivations often produce chemically invalid structures (Guo et al., 2023a), so the natural potential of symbolic methods for interpretability and validity is lost, although such elements are critical for expert validation and for gaining scientific insights. We propose a novel way of representing and learning such context-sensitive grammars over a design space informed by chemical motifs. This approach results in order-of-magnitude differences in runtime and enhances chemical interpretability.

Other works for molecular representation learning. There are various other non-motif-based approaches that we compare to in our evaluation, including (pre-trained) GNNs (Hu et al., 2020; Xia et al., 2023b), motif-based pre-training approaches designed for semi- or unsupervised learning (Xia et al., 2023b), and molecular few-shot learning, including the SOTA, which relies on modeling the domain expert's reasoning in terms of related molecule contexts using associative memories (Schimunek et al., 2023). Central to our method is the connection between random walks and graph diffusion, established methods that have been particularly effective for modeling graph structures through physics-inspired processes (Thanou et al., 2017). Other related works and more detailed discussions can be found in Appendix C.

¹ Note that our method works with any given set of motifs (e.g., we can apply one of the simpler algorithms used in existing works), but our evaluation shows that certain applications benefit from high-quality domain knowledge.

¹MIT CSAIL, ²MIT Chemistry, ³MIT-IBM Watson AI Lab, IBM Research, ⁴MIT Chemical Engineering, ⁵MIT, ⁶Wellesley. Correspondence to: Michael Sun <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).


3. An Interpretable, Grammar-based Molecule Representation and Efficient Learning

Our method employs a graph grammar, which is composed of a set of predefined molecular motifs and a set of transition rules. The motifs are devised either through automatic generation or manual curation and are interconnected following the transition rules to assemble into a complete molecular structure. Following (Guo et al., 2023a;b), a grammar G = (N, Σ, P, X) contains a set N of non-terminal nodes, a set Σ of terminal nodes representing chemical atoms, and a starting node X. The generation of molecular graphs is described using a set of production rules, P = {p_i | i = 1, ..., k}. Each rule, p_i, is defined by a transformation from a left-hand side (LHS) to a right-hand side (RHS), with both sides being graphs. The process starts from an initial empty graph X, and a molecule is constructed by iteratively applying a rule from P, where the LHS of the selected rule matches a subgraph within the current graph. This selected subgraph is then replaced by the corresponding RHS of the rule.

Random Walk Grammar. We introduce the random walk grammar, characterized by the condition that the LHS of each rule differs from its RHS by exactly one motif. Such a design ensures that the generation of a molecule is a progressive process, where in each step a new subgraph is attached to the existing graph. We implement the grammar using a compact motif graph G (Fig. 1 (b)); the nodes are the motifs, and each edge describes the application of a transition rule.

We highlight two novelties of this work:

1. Molecules are represented as random walks over connected subgraphs of G (Fig. 1 (a)). This representation is explicit, compact, and interpretable.
2. The context-sensitive grammar over G is learnable from a given training dataset by optimizing parameters that determine the prior and adjusted edge weights of G. These weights parameterize the transition probabilities, thereby influencing the molecular representation and facilitating the learning of context-sensitive rules, which we elucidate in our analysis section.

We demonstrate the utility of our grammar-based representation for both molecular generation and property prediction tasks. The main steps of our workflow are as follows.

Motif-based Molecule Fragmentation. Our method builds upon a given molecule fragmentation. More specifically, given a dataset D = {M^(i) := (V_{M^(i)}, E_{M^(i)})}_{i=1}^{|D|}, a fragmentation of M^(i) is a collection of disjoint molecular graphs {F_j^(i)} := {(V_j^(i), E_j^(i))} such that ⊔_j V_j^(i) = V_{M^(i)}. Letting g(v) denote the node-induced subgraph of g by v, F_j^(i) is the subgraph of M^(i) induced by V_j^(i). When F_j^(i) is a chemical motif, it is essential to know the possible contexts within which F_j^(i) occurs, because the behavior of one substructure is often influenced by neighboring structures². Specifically, given neighboring fragments j_1, j_2, i.e., ∃ e ∈ E_{M^(i)} s.t. e ∉ E_{j_1}^(i), e ∉ E_{j_2}^(i), and e ∈ M^(i)(V_{j_1}^(i) ∪ V_{j_2}^(i)), we can use automatic rules R_D to infer the "context" of j_1: c_{j_2}^{(j_1)} := R_D(V_{j_1}^(i), V_{j_2}^(i)) s.t. c_{j_2}^{(j_1)} ⊆ V_{j_2}^(i) and M^(i)(V_{j_1}^(i) ⊔ c_{j_2}^{(j_1)}) is connected. The same rule is applied in reverse to obtain c_{j_1}^{(j_2)}. Descriptions and examples of dataset-specific rules are given in Appendix A.1.

There are various automated methods to obtain such a fragmentation (e.g., ChemAxon; Degen et al., 2008; Jin et al., 2020); some are integrated in the commonly used RDKit package (Landrum, 2016). Nevertheless, we found that complex molecular datasets often benefit from fragmentations and rules tailored to the application domain, in the sense that they may better capture known domain knowledge and provide a strong structural prior. For this reason, we also designed and executed feasible, practical workflows for annotating molecules and extracting the motifs.

Motif Graph Construction. Given a set of motifs, V, we describe our hierarchical abstraction over V. G = (V, E) is a directed multigraph. Each v ∈ V contains both the motif graph g_v and {v_{r_l}}, denoting the possible "contexts" for attaching g_v to another motif; that is, ∀l, v_{r_l} ⊆ N(g_v), with N(g) denoting the set of atom nodes of graph g. We set v_R := ∪_l v_{r_l} and ∅ ≠ v_B := N(g_v) \ v_R. Denoting by ∼ the isomorphism relation, we construct E by matching every pair of motifs u, v and their contexts (l_1, l_2), finding corresponding subgraphs in u_B and v_B that match u_{r_{l_1}} and v_{r_{l_2}}, as shown in Fig. 1 (c). Specifically, (u, v, e_{l_1,l_2}) ∈ E ⟺ ∃ b_2 ⊆ u_B, b_1 ⊆ v_B such that:

    g_u(u_{r_{l_1}}) ∼ g_v(b_2)                         (1)
    g_v(v_{r_{l_2}}) ∼ g_u(b_1)                         (2)
    g_u(u_{r_{l_1}} ∪ b_1) ∼ g_v(b_2 ∪ v_{r_{l_2}})     (3)
    g_u(u_{r_{l_1}} ∪ b_1) is connected                 (4)

The edge e_{l_1,l_2} is attributed with u_{r_{l_1}}, v_{r_{l_2}}, b_1, b_2.

The construction of the motif graph G is very efficient in practice. For example, for the datasets we study, it completes in under a minute when parallelized across 100 CPU cores. Details are given in Appendix C.

² For materials applications that rely in particular on electrophilicity, polarity, and extended aromaticity, longer-range combinations and patterns of motifs are often more influential than any individual one.
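To make the context definition above concrete, the sketch below implements one possible automatic rule R_D on a toy 4-atom graph: the context that fragment j_2 contributes to fragment j_1 is taken to be the atoms of j_2 incident to a bond crossing into j_1, which guarantees the union stays connected. The specific rule is an illustrative assumption, not the paper's dataset-specific rules from Appendix A.1.

```python
def crossing_edges(edges, frag_a, frag_b):
    """Bonds of the molecule that connect fragment frag_a to fragment frag_b."""
    return [(u, v) for (u, v) in edges
            if (u in frag_a and v in frag_b) or (u in frag_b and v in frag_a)]

def infer_context(edges, frag_j1, frag_j2):
    """Toy rule R_D: the context c_{j2}^{(j1)} is the set of atoms of j2 that
    touch a bond crossing into j1, so j1 plus the context is connected."""
    context = set()
    for u, v in crossing_edges(edges, frag_j1, frag_j2):
        context.add(u if u in frag_j2 else v)
    return context

# Tiny molecular graph: atoms 0-1 form fragment j1, atoms 2-3 fragment j2,
# with a single bond (1, 2) joining them.
edges = [(0, 1), (1, 2), (2, 3)]
j1, j2 = {0, 1}, {2, 3}
print(infer_context(edges, j1, j2))  # → {2}
print(infer_context(edges, j2, j1))  # → {1}, the rule applied in reverse
```

Applying the same function with the fragments swapped yields c_{j_1}^{(j_2)}, mirroring the reverse application described above.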


Figure 1. Illustration of our random walk representation: (a) (top) molecule M, number 33; (middle) H_M as a connected subgraph of G; (bottom) Ĥ_M as a random walk over H_M. (b) The motif graph G: each node is a motif v that contains both the molecular fragment v_B (black molecule sections) and the contexts for attachment (v_R, red molecule sections); each gray line indicates a possible attachment between nodes. Directed edges of Ĥ_M use the same color as the dashed border of the corresponding figure of M. (c) (top) Demonstration of the motif matching criteria, eqs. 1-4 (183 ↔ 5); another example is in Fig. 11. (bottom) Two more examples of H_M.
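The walk shown in panel (a) traverses H_M and backtracks when a branch is exhausted. Under the simplifying assumption that H_M's spanning structure is a rooted tree, the linearization formalized in Sec. 3.2 — a pre-order traversal that appends a reversed duplicate of the sub-trajectory when the DFS stack contracts — can be sketched as follows (the motif IDs are toy stand-ins):

```python
def linearize(tree, root):
    """Pre-order traversal of a rooted tree (adjacency dict) that emits a
    backtracking step whenever the DFS stack contracts, so every tree edge
    is used exactly once in each direction: an Euler path over doubled edges."""
    walk = [root]
    def dfs(node, parent):
        for child in tree.get(node, []):
            if child == parent:
                continue
            walk.append(child)   # forward step: attach a new motif
            dfs(child, node)
            walk.append(node)    # backtrack: reversed duplicate of the edge
    dfs(root, None)
    return walk

# Toy motif tree: motif 56 connects to 9; 9 connects to 71 and 70.
tree = {56: [9], 9: [56, 71, 70], 71: [9], 70: [9]}
print(linearize(tree, 56))  # → [56, 9, 71, 9, 70, 9, 56]
```

The returned sequence visits every node and ends back at the root, matching the "each edge used exactly once" Euler-path property of Ĥ_M on a tree-shaped H_M.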

3.1. The Molecular Design Space as Derivations of a Context-Sensitive Grammar Over the Motif Graph

We now define our context-sensitive grammar over G. We use the notation defined in the previous section to enumerate the set of production rules, P, in our grammar. There is one initial rule p_v ∈ P for each motif v in G, where the LHS is X and the RHS is the molecular graph g_v, with u_B being the base atoms and {(u_{r_l})} being the red atom sets that become "options" for attachment. Then, there is exactly one production rule p_{u,v,l_1,l_2} ∈ P for each edge (u, v, e_{l_1,l_2}) ∈ G. This edge was attributed with (u_{r_{l_1}}, v_{r_{l_2}}, b_1, b_2) during the construction of G. The application of the production rule then equates to attaching the fragment of v to the fragment of u, at the attachment options keyed by l_1, l_2. In the language of graph grammars, the context of this production rule is hence the molecular graph g_u(u_B ∪ u_{r_{l_1}}), with the requirement that the matched atoms for u_{r_{l_1}} are red. Applying this production rule replaces the matched atoms for u_{r_{l_1}} within the LHS by g_v(N(g_v) \ v_{r_{l_2}}), where the red atom sets {v_{r_l} | v_{r_l} ∩ v_{r_{l_2}} ≠ ∅} in v are introduced as new options for attachment in the RHS. The random walk characterization arises from the fact that if the LHS molecule contains the context g_u(u_B ∪ u_{r_{l_1}}), any edge (u, v, e_{l_1,l_2}) ∈ E can be traversed, possibly including self-loops and parallel edges, since G is a directed multigraph.

3.2. Molecules as Random Walks in the Design Space

Intuitively, our representation of a molecule M captures a derivation in the above-defined context-sensitive grammar. While prior work has modeled such derivations as large and complex tree structures (e.g., with auxiliary nodes for partial derivations) (Guo et al., 2023a;b), we model a derivation compactly in terms of a random walk over the bidirectionally connected subgraph H_M = (V_M, E_M) of G given by the fragmentation of M³; see Fig. 1 (a). Observe that G is a strong prior for constraining the design space and is sufficient for describing the molecular structure of M, but H_M misses the global distribution of which it is a sample.

Our learnable component models this distribution and, at the same time, captures the features that characterize a specific molecule in terms of a random walk. More specifically, our final molecule representation is a directed acyclic multigraph Ĥ_M = (V_M, Ê_M, w_M) that linearizes H_M into a random walk such that (1) Ê_M ⊆ E_M, (2) Ĥ_M remains connected, and (3) there is an Euler path⁴ (i.e., each edge is used exactly once) v_0, v_1, ..., v_ℓ over (V_M, Ê_M) with Ê_M := ∪_i {(v_i, v_{i+1})}; this path can be generated via a pre-order traversal that adds a reversed duplicate of the sub-trajectory when the stack contracts. The last component, w_M, is the sequence w_M := p_0, p_1, ..., p_{ℓ−1} of probabilities given by the random walk; that is, p_i represents the probability with which the edge between v_i and v_{i+1} was traversed in the presence of all nodes visited thus far, as shown in Fig. 2. w_M is parameterized by our learnable grammar, and Ĥ_M is explainable as a random walk of context-sensitive grammar rule applications.

³ Refer to Appendix B.3 for how and why we augment G with duplicates of the same motif.
⁴ In the case of monomers, the Euler path needs to be closed, as monomers have the property of self-loops.

3.3. Learning Motif-based, Context-sensitive Grammars

3.3.1. Parameter Estimation

For parameter estimation, we formulate the random walk process as a graph heat diffusion process,

    dx_t/dt = L(Φ, t) x_t,    (5)

where x_t ∈ R^{|V|} represents the probability distribution of sampling motifs and L(Φ, t) ∈ R^{|V|×|V|} is a time-dependent graph Laplacian parameterized by Φ. Here the initial condition of the diffusion process, x_0, is a one-hot vector with a one at the root of Ĥ_M. At every time step, the ground-truth x_t follows the transition state of a random walk. In our implementation, L(Φ, t) is calculated as

    L(Φ, t) = D − Ŵ(t),    Ŵ(t) = W + h(c_t; ϕ),    (6)

where D ∈ R^{|V|×|V|} is the in-degree matrix of G, h(·; ϕ) is a memory-sensitive adjustment layer, and c_t is a set-based memory of all nodes visited thus far. If p^(t) is the current state of the random walk, the set-based memory, c^(t+1), is updated as follows: c^(t+1) ← t/(t+1) · c^(t) + 1/(t+1) · p^(t). This set-based memory mechanism has precedents in the graph theory literature. The learnable parameters are Φ = (W, ϕ). Further motivation for the set-based memory mechanism is given in Appendix C.3, and the full training algorithm can be found in Appendix D.

3.3.2. Training for a Downstream Task

Property Prediction. Our grammar-induced molecular graph representation Ĥ_i allows for applying an off-the-shelf graph neural network F_Θ to solve a given prediction task; in our evaluation, we used GIN (Xu et al., 2019). Given a property value y^(i) ∈ R for each molecule M^(i), we apply a linear head f_θ and an application-specific loss function L (e.g., MSE for regression or cross-entropy for classification).

End-to-End Training. Our grammar-based representation can further be optimized via end-to-end training of Φ. Typically, we first train Φ to convergence under our MC-based objective, then train Θ to convergence under eq. 7 on the representations induced by Φ. Finally, we freeze Θ and finetune Φ to convergence. Alternatively, we train Φ and Θ together, end to end, by using the following differentiable objective,

    L̃(D; Θ, θ, Φ) = E_{Ĥ_M(·;Φ)}[L(f_θ(F_Θ(Ĥ_M)), y)]           (7)
                  = (1/|D|) Σ_{i=1}^{|D|} L(f_θ(F_Θ(Ĥ_M^(i))), y^(i)),    (8)

where we estimate the expectation using the samples from the training dataset D.

3.3.3. Molecular Generation

To generate a molecule M, we apply the learned grammar forward, sampling edges to traverse during the random walk, as depicted in Fig. 2. Each sampled edge either attaches a new motif to the current M or backtracks to a previous motif. Details on the algorithm are given in Appendix F.

4. Results & Analysis

Our experiments quantitatively answer the following questions: 1) How well does our method perform on property prediction for our setting of interest? 2) How well does our representation work for the generation of novel molecules, compared with both SOTA symbolic and deep molecular generative models? We also include three ablations to answer: 3) How important are expert motifs, compared to heuristic-based motifs? 4) How data-efficient and runtime-efficient is our method? 5) How does our method compare with other motif-based predictors? Our qualitative analysis answers the following questions: 6) How interpretable is our learnt grammar to an expert? 7) How can we analyze the model's learnt representations?

4.1. Datasets and Baselines

Table 1. Average size of our hierarchical representation H_M over each dataset, with expert vs. heuristic motifs.

    Dataset        GC                    HOPV                  PTC
    Expert         Yes        No         Yes        No         Yes        No
    Avg. |H_M|     7.3 ± 2.8  3.8 ± 2.2  5.4 ± 1.9  6.5 ± 2.9  3.6 ± 2.4  2.1 ± 1.4

Group Contribution (GC) (Wang et al., 2018; Park & Paul, 1997; Wu et al., 2021). 114 molecules, characterized in terms of gas separation. Their functional groups contribute significantly to maintaining the structures and properties of 3D scaffold building blocks in polymer self-assembly, which in turn play a significant role in gas separation processes important in the gas and oil industry. We used existing monomers (Wang et al., 2018) and compilations of groups (Park & Paul, 1997; Wu et al., 2021) for inferring the fragmentations.

The Harvard organic photovoltaic dataset (HOPV) (Lopez et al., 2016). 316 molecules, applied to aid in the


Figure 2. Illustration of our generation procedure: (t=1) our learnable grammar, parameterized by Φ, samples a state transition 56 → 9; (t=2) with the memory of having visited {56}, our grammar samples a state transition → 71; (t=10) (bottom) our grammar samples a final transition, 5, which determines the molecular structure (top); our program's notation is 56 → 9 → 71[→ 70 → 5] → 70 : 1 → 5 : 1
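The memory-conditioned sampling that Figure 2 illustrates alternates between adjusting edge weights with the set-based memory (eq. 6) and normalizing the adjusted weights into transition probabilities. The sketch below mimics one such step; the 3-motif weight matrix and the subtractive form of the adjustment h are toy assumptions, not the learned parameters.

```python
# One memory-conditioned transition step (cf. eq. 6 and Fig. 2).
# W is a stand-in prior weight matrix over 3 motifs (no self-loops here).
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]

def adjusted_weights(W, memory, penalty=0.5):
    """Toy Ŵ = W + h(c): here h subtracts penalty * memory[j] from column j,
    damping transitions into motifs already covered by the memory."""
    n = len(W)
    return [[W[i][j] - penalty * memory[j] for j in range(n)] for i in range(n)]

def transition_probs(W_hat, state):
    """Clip negatives and normalize the outgoing adjusted weights of `state`."""
    row = [max(w, 0.0) for w in W_hat[state]]
    total = sum(row)
    return [w / total for w in row]

def update_memory(memory, p, t):
    """Set-based memory update: c(t+1) = t/(t+1) * c(t) + 1/(t+1) * p(t)."""
    return [t / (t + 1) * c + 1.0 / (t + 1) * pi for c, pi in zip(memory, p)]

memory = [1.0, 0.0, 0.0]  # the walk started at motif 0
probs = transition_probs(adjusted_weights(W, memory), state=0)
print(probs)              # → [0.0, 0.5, 0.5]: motif 0 damped, 1 and 2 even
memory = update_memory(memory, probs, t=1)
print(memory)             # → [0.5, 0.25, 0.25]
```

Repeating this step — sample an edge, update the memory, re-adjust the weights — yields the diverse, memory-conditioned derivations described in Sec. 3.3.3.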

design of organic solar cells, with detailed information pertinent to organic photovoltaic performance metrics. The molecules contain motifs which are among the most significant functional groups for conducting/electroactive materials (Swager, 2017) and photovoltaic properties (Yuan et al., 2022). We extracted motifs important for high HOMO values and enhanced electron delocalization, which are critical for photovoltaic efficiency; see Appendix G for details.

Predictive Toxicology Challenge (PTC) (Helma et al., 2001). 344 small chemical compounds characterized by very distinct functional groups known for their carcinogenic properties or liver toxicity (Miller et al., 1949), with reported values for rats. We specifically segmented it into the functional groups that contribute most to the compounds' toxicity (Hughes et al., 2015). Examples from each dataset are shown in Figure 3.

Figure 3. Example molecules from GC, HOPV, and PTC. These datasets are characterized by modular substructures that correspond to meaningful chemical functional groups.

Baselines. To address question 1), we compare with pretrained GNNs (PN (Stanley et al., 2021) and pre-trained GIN (Hu et al., 2020)), a specialized GNN model for property prediction (wD-MPNN (Aldeghi & Coley, 2022)), two SOTA pretrained models for molecular representation learning (MolCLR (Wang et al., 2022) and UniMol (Zhou et al., 2023)), and two SOTA subgraph-based methods (ESAN (Bevilacqua et al., 2022) and HM-GNN (Shui & Karypis, 2020)). To address question 2), we compare with both Geo-DEG, the SOTA on small-dataset property prediction, and its generative variant, DEG, for molecular generation.

4.2. Results

We report the mean absolute error (MAE) and coefficient of determination (R²) over normalized prediction values for GC and HOPV, and the accuracy and AUC for PTC. For each (dataset, property) pair, we perform an 80-20 train-test split over 3 random seeds and report the mean and standard deviation. For molecular generation, we report commonly used metrics (Polykovskiy et al., 2020; Guo et al., 2023a): Validity/Uniqueness/Novelty: percentage of chemically valid/unique/novel molecules; Diversity: average pairwise molecular distance among generated molecules; Retro* Score (RS): success rate of Retro* (Chen et al., 2020), which was trained to find a retrosynthesis path to build a molecule from a list of commercially available ones. We add the metric of Membership, which tests whether certain motif(s) characteristic of membership in the chemical class are present, primarily as a sanity check. Our method, by design, can achieve 100% if the random walk initializes at the characteristic motif. See A.1 for further discussion.

4.2.1. Property Prediction

To answer question 1), we see in Table 2 that our method, with expert motifs, achieves the best R² by a wide margin of 0.10 and 0.06 over the second-best method on the regression datasets GC and HOPV, respectively, and the highest accuracy on PTC. With heuristic motifs, our method remains competitive with Geo-DEG, achieving higher R² on both regression datasets and accuracy within one standard deviation on PTC. Interestingly, using heuristic-based motifs on HOPV achieves significantly (27%) lower MAE than expert-based motifs and Geo-DEG. To answer question 3), the ablation suggests that expert motifs are generally better but may be more sensitive to outliers than heuristic-based motifs. We observe that experts are generally better at identifying special cases that heuristics are unaware of, but heuristics are more consistent. This reflects how R² is generally more sensitive to outliers than MAE. We describe our experts' annotation criteria in Appendix A and conduct an in-depth case study on HOPV in Appendix G.

4.2.2. Molecular Generation

To answer question 2), we see in Table 3 that our method produces molecules comparably diverse to, or more diverse than, the training dataset (+0.03 on HOPV, −0.01 on PTC) and significantly more synthesizable molecules than the previous SOTA, DEG (+39% on HOPV, +22% on PTC). On HOPV, our retrosynthetic planner finds synthesis paths at a 14% higher rate on our novel molecules than on the original dataset, a careful collation of experimental photovoltaic designs (Lopez et al., 2016). This is encouraging to experimentalists whose work is contingent on the designs' feasibility for synthesis.

We also compare our method with established VAE-based molecular generative models such as (Jin et al., 2018) and its follow-up work (Jin et al., 2020), which includes larger structural motifs. Furthermore, we modified the implementation of Hier-VAE to incorporate our expert motifs. For all three cases, we follow the default settings, train until convergence, and use the checkpoint with the lowest loss to sample 1000 molecules. We observe that both VAE-based methods struggle to generate sufficiently unique molecules, with only 11%-43% (HOPV) and 8%-28% (PTC) of the 1000 generated molecules being unique. This is despite sampling from a Gaussian noise distribution. Meanwhile, our model generates 100% unique and novel molecules, while ensuring a high internal diversity second only to DEG. For reference, (Jin et al., 2018; 2020) trained and evaluated their methods on 250K molecules extracted from ZINC (Sterling & Irwin, 2015) and a polymer dataset containing 86K polymers. Meanwhile, our datasets contain only 100-300 molecules and, in the case of HOPV, feature much larger molecules. Rather than using an encoder-decoder setup, which requires significantly more data to learn the mapping to and from a latent space, our generative model explicitly captures the transition probabilities over traversing the symbolic space of structural motifs. Our grammar derivation can easily be conditioned by a set-based memory to apply a diverse set of transition rules. This leads to more unique, diverse, and, most importantly, synthesizable structures.

Figure 4. Visualization of our motif graph G; black edges indicate matched motif pairs, and the thickness of red edges corresponds to the number of H_M that traverse them.

4.3. Ablations

4.3.1. Ablation: Varying the Training Dataset Size

To answer question 4), we conduct an ablation study in Figure 5 over the training split size to study how data- and runtime-efficient our method is in comparison with Geo-DEG. Our method performs strictly better on MAE as the training set is reduced from 70% to 10%. This is in addition to the method running an order of magnitude faster, highlighting gains in both data efficiency and runtime efficiency.

Figure 5. Varying the training dataset size from 10-70%.

7
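The uniqueness, novelty, and internal diversity numbers reported in this comparison can be computed with a few set operations. Below is a minimal, stdlib-only sketch; `fingerprint` is a hypothetical stand-in for the Morgan fingerprints one would compute with RDKit, and a real evaluation would first canonicalize the SMILES strings:

```python
from itertools import combinations

def generation_metrics(generated, train, fingerprint):
    """Compute uniqueness, novelty, and internal diversity of a sample.

    generated / train: lists of canonical molecule strings.
    fingerprint: function mapping a molecule to a set of substructure keys
    (a stand-in for RDKit Morgan fingerprints).
    """
    unique = set(generated)
    uniqueness = len(unique) / len(generated)
    novelty = len(unique - set(train)) / len(unique)

    def tanimoto(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 1.0

    # Internal diversity: 1 - mean pairwise Tanimoto similarity.
    fps = [fingerprint(m) for m in unique]
    pairs = list(combinations(fps, 2))
    mean_sim = sum(tanimoto(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return uniqueness, novelty, 1.0 - mean_sim
```

With four sampled molecules of which three are distinct and one appears in the training set, uniqueness is 3/4 and novelty 2/3.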
Representing Molecules as Random Walks Over Interpretable Grammars

Table 2. Results on property prediction (best result bolded, second-best underlined). The datasets we include have expert-annotated motifs. We also report Ours (w/o expert) as an ablation without expert motifs.

Dataset | Metric | wD-MPNN | ESAN | HM-GNN | PN | MolCLR (finetuned) | Pre-trained GIN (finetuned) | Unimol | Geo-DEG | Ours | Ours (w/o expert)
Group | MAE ↓ | 0.47 ± 0.09 | 0.51 ± 0.06 | 0.34 ± 0.12 | 0.76 ± 0.30 | 0.68 ± 0.05 | 0.26 ± 0.10 | 0.38 ± 0.13 | 0.26 ± 0.11 | 0.25 ± 0.09 | 0.27 ± 0.08
Group | R2 ↑ | 0.41 ± 0.12 | −0.39 ± 0.62 | 0.56 ± 0.20 | −7.56 ± 7.71 | 0.19 ± 0.09 | 0.68 ± 0.20 | 0.47 ± 0.25 | 0.70 ± 0.20 | 0.80 ± 0.15 | 0.74 ± 0.15
HOPV | MAE ↓ | 0.36 ± 0.03 | 0.37 ± 0.02 | 0.40 ± 0.02 | 0.42 ± 0.02 | 0.38 ± 0.02 | 0.34 ± 0.03 | 0.31 ± 0.03 | 0.30 ± 0.02 | 0.30 ± 0.05 | 0.22 ± 0.15
HOPV | R2 ↑ | 0.69 ± 0.04 | 0.66 ± 0.06 | 0.65 ± 0.05 | 0.65 ± 0.04 | 0.66 ± 0.03 | 0.68 ± 0.03 | 0.70 ± 0.02 | 0.74 ± 0.03 | 0.80 ± 0.06 | 0.77 ± 0.12
PTC | Acc ↑ | 0.67 ± 0.06 | 0.64 ± 0.08 | 0.66 ± 0.07 | 0.61 ± 0.08 | 0.62 ± 0.09 | 0.60 ± 0.03 | 0.57 ± 0.05 | 0.69 ± 0.07 | 0.70 ± 0.01 | 0.67 ± 0.02
PTC | AUC ↑ | 0.70 ± 0.05 | 0.68 ± 0.06 | 0.69 ± 0.06 | 0.65 ± 0.07 | 0.66 ± 0.07 | 0.66 ± 0.05 | 0.67 ± 0.06 | 0.71 ± 0.07 | 0.69 ± 0.03 | 0.66 ± 0.05

Table 3. Results on molecular generation for HOPV (top) and PTC (bottom); for both datasets, we generate 1000 novel molecules. Refer to Appendix A.1 for more details on Membership.

Dataset | Method | Valid | Unique | Novel | Diversity | RS | Memb.
HOPV | Train Data | 100% | 100% | N/A | 0.86 | 51% | 100%
HOPV | DEG | 100% | 98% | 99% | 0.93 | 19% | 46%
HOPV | JT-VAE | 100% | 11% | 100% | 0.77 | 99% | 84%
HOPV | Hier-VAE | 100% | 43% | 96% | 0.87 | 79% | 76%
HOPV | Hier-VAE (+expert) | 100% | 29% | 92% | 0.86 | 84% | 82%
HOPV | Ours | 100% | 100% | 100% | 0.89 | 58% | 71%
PTC | Train Data | 100% | 100% | N/A | 0.94 | 87% | 30%
PTC | DEG | 100% | 88% | 87% | 0.95 | 38% | 27%
PTC | JT-VAE | 100% | 8% | 80% | 0.83 | 96% | 27%
PTC | Hier-VAE | 100% | 20% | 85% | 0.91 | 92% | 25%
PTC | Hier-VAE (+expert) | 100% | 28% | 75% | 0.93 | 90% | 17%
PTC | Ours | 100% | 100% | 100% | 0.93 | 60% | 22%

4.3.2. Comparison with Motif-Based Baselines

To answer question 5), we compare with two baselines. The first, Bag-of-Motifs, ablates our hierarchical information and retains only the motif co-occurrence information. For each molecule, we obtain a feature vector that concatenates a) the occurrence counts of all motifs and b) the Morgan fingerprint of the molecule. We train an XGBoost regressor/classifier on top of these features (details in Appendix E). As shown in Table 4, this baseline has enough capacity to overfit the training data but fails to generalize. This allows us to conclude that, in the absence of a proper representation, motif occurrence information is not sufficient for generalization. Interestingly, expert-level motifs are not superior to heuristic-based motifs in this featurization. This suggests that the quality of the motifs is not relevant in the absence of a hierarchical representation that incorporates the fine-grained features of each individual motif. The second, HM-GNN (Shui & Karypis, 2020), is a SOTA motif-based property predictor that explicitly models motif-molecule and motif-motif relationships using a heterogeneous graph. Furthermore, we endowed the method with our expert motifs, since the vanilla version only considers bonds and rings. On both regression datasets, HM-GNN avoids overfitting but does not catch up to our method's generalization capability. Endowing HM-GNN with our expert motifs enables better fitting of the training data but further hinders generalization. On PTC, HM-GNN is competitive with our method in accuracy but shows a discrepancy in terms of AUC. This is concerning, as a lower AUC may imply higher sensitivity to class imbalance (in PTC, there are 45% more negatives than positives) and classification thresholds. Meanwhile, our method can both 1) completely fit the training data (> 0.99 R2, > 99% Acc/AUC) and 2) generalize to the test data, with further regularization potentially leading to even better results. We believe our motif-based representation carries better inductive biases, integrates better with expert motifs, and demonstrates stronger empirical performance.

4.4. Analysis

4.4.1. Rule Extraction from Grammar

To answer question 6), we extract context-sensitive grammar rules from our trained model. We perform best-first search over random walk trajectories, beginning with base trajectories corresponding to each group v ∈ G. We only expand trajectories with transition probabilities above a minimum threshold. Each trajectory that reaches a transition with probability of 1 is extracted as a "hard" context-sensitive rule. We depict two such rules in Figure 6, with a more exhaustive compilation in Appendix F.1.

Figure 6. We visualize two hard context-sensitive rules on PTC that correspond to design principles of the addition of halogen groups to further improve molecular toxicity.

Figure 6 shows how our model recovers a set of design principles used to facilitate the synthesis of functional molecules and grounded in the structure-property relationship of PTC. Consider the transformation of the triple benzene derivative molecule (labeled as ['G333', 'G393']) with the addition of a bromide moiety (labeled as ['G333', 'G393', 'G333:1']). In this instance, the central moiety, G393, is characterized by two symmetrical ketone groups and two bromides adjoined to the aromatic ring. This configuration markedly enhances the molecule's toxicity. Moreover, by strategically positioning additional binding sites on the aromatic ring, the software augments the molecule with two extra bromide
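The best-first extraction described above can be sketched with a priority queue over partial trajectories. The transition table, threshold, and group names below are illustrative stand-ins, not the paper's learned model:

```python
import heapq

def extract_hard_rules(groups, trans_prob, min_prob, max_len=4):
    """Best-first search over random-walk trajectories.

    trans_prob: dict mapping a trajectory tuple (the context) to a dict
    {next_group: probability}. A trajectory whose next transition has
    probability 1.0 yields a "hard" context-sensitive rule.
    """
    rules = []
    # Max-heap ordered by trajectory probability (negated for heapq).
    heap = [(-1.0, (g,)) for g in groups]
    heapq.heapify(heap)
    while heap:
        neg_p, traj = heapq.heappop(heap)
        for nxt, p in trans_prob.get(traj, {}).items():
            if p >= 1.0:
                rules.append((traj, nxt))  # hard rule: context -> next motif
            elif p >= min_prob and len(traj) < max_len:
                # Expand only trajectories above the minimum threshold.
                heapq.heappush(heap, (neg_p * p, traj + (nxt,)))
    return rules
```

For example, with a context ('G333', 'G393') whose only continuation has probability 1, the search returns that single hard rule.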


Table 4. Ablation study on overfitting and generalization vs. other motif-based baselines, with and without expert motifs. Best result is bolded.

Method | HOPV Train MAE ↓ | HOPV Train R2 ↑ | HOPV Test MAE ↓ | HOPV Test R2 ↑ | PTC Train Acc ↑ | PTC Train AUC ↑ | PTC Test Acc ↑ | PTC Test AUC ↑ | GC Train MAE ↓ | GC Train R2 ↑ | GC Test MAE ↓ | GC Test R2 ↑
Bag-of-Motifs | 0.014 ± 0.002 | 0.997 ± 0.001 | 0.486 ± 0.025 | 0.489 ± 0.062 | 0.996 ± 0.000 | 1.000 ± 0.000 | 0.529 ± 0.031 | 0.609 ± 0.031 | 0.000 ± 0.000 | 1.000 ± 0.000 | 0.481 ± 0.174 | 0.257 ± 0.453
Bag-of-Motifs (+expert) | 0.011 ± 0.004 | 1.000 ± 0.000 | 0.521 ± 0.031 | 0.446 ± 0.125 | 0.996 ± 0.000 | 1.000 ± 0.000 | 0.581 ± 0.018 | 0.612 ± 0.029 | 0.000 ± 0.000 | 1.000 ± 0.000 | 0.493 ± 0.143 | 0.214 ± 0.404
HM-GNN | 0.366 ± 0.035 | 0.686 ± 0.066 | 0.473 ± 0.019 | 0.441 ± 0.065 | 0.915 ± 0.033 | 0.966 ± 0.016 | 0.710 ± 0.023 | 0.678 ± 0.040 | 0.281 ± 0.064 | 0.717 ± 0.137 | 0.362 ± 0.113 | 0.592 ± 0.202
HM-GNN (+expert) | 0.201 ± 0.009 | 0.895 ± 0.019 | 0.451 ± 0.025 | 0.408 ± 0.095 | 0.999 ± 0.002 | 1.000 ± 0.000 | 0.681 ± 0.024 | 0.587 ± 0.075 | 0.185 ± 0.016 | 0.926 ± 0.039 | 0.345 ± 0.149 | 0.547 ± 0.295
Ours (-expert) | 0.075 ± 0.003 | 0.990 ± 0.001 | 0.288 ± 0.048 | 0.765 ± 0.146 | 0.994 ± 0.001 | 0.999 ± 0.000 | 0.671 ± 0.020 | 0.659 ± 0.047 | 0.044 ± 0.015 | 0.995 ± 0.004 | 0.268 ± 0.084 | 0.738 ± 0.148
Ours | 0.045 ± 0.003 | 0.996 ± 0.001 | 0.295 ± 0.049 | 0.796 ± 0.105 | 0.996 ± 0.000 | 1.000 ± 0.000 | 0.705 ± 0.007 | 0.711 ± 0.018 | 0.028 ± 0.007 | 0.998 ± 0.002 | 0.222 ± 0.079 | 0.819 ± 0.137

groups, G333, thereby exacerbating its hepatotoxicity. In another example, the molecule with an ammonia group (labeled as ['G466']) transforms with an additional ketone group (labeled as ['G466', 'G231']). Here, the presence of a C=O double bond within an acetamide group is a key contributor to hepatotoxicity.

4.4.2. Two-Dimensional t-SNE on HM versus Pretrained Representations

Figure 7. Final layer representations from: a) Our method b) Our method (-expert) c) Pre-trained GIN d) HM-GNN. We apply a grayscale coloring map using the normalized value of the desired property (the darker the dot, the higher the HOMO).

To answer question 7), we analyze the 2D t-SNE embeddings of various methods' final layer representations of 64 test set molecules on HOPV. As shown in Figure 7, our method is unique in extracting visually meaningful representations. High HOMO molecules were identified from the visual clusters for structural analysis. As illustrated in Figure 8, molecules in the upper cluster often have structures promoting electron delocalization, like carbonohydrazonoyl dicyanide, while those in the lower cluster have electron-donating groups or structures increasing steric hindrance to boost HOMO values. These two structural features correspond to the two primary ways to design molecules with high HOMO values. These findings aid the search for novel molecules with desirable photovoltaic properties. As 2D t-SNE is not a universal way to analyze representations, we also visualize the agreement between embedding similarity and structural similarity using a 64 × 64 grid. This can be found in Appendix G.3, as part of an in-depth case study on HOPV.

Figure 8. Examples of top HOMO value compounds with group (a) from the top cluster and group (b) from the bottom cluster.

4.5. Conclusion & Future Work

We represent molecules as random walks over an interpretable context-sensitive grammar over the motif graph, a hierarchical abstraction over the design space. We devise and execute a practical workflow that invites experts in the loop to enhance our design basis and representations by fragmenting molecules into well-established functional groups, creating a synergy between expert feedback and the quality of our representations. Our evaluation on downstream property prediction and molecular generation tasks shows that our representation combines quantitative advantages in performance and efficiency with qualitative advantages of simplicity and enhanced interpretability. One promising avenue of future research is improving the autonomous extraction of interpretable grammar rules through learnable and/or human-guided approaches with Large Language Models.


Acknowledgements

This work is supported by the MIT-IBM Watson AI Lab and its member companies Evonik and Shell.

Impact Statement

This paper presents work whose goal is to concurrently and conjointly advance the fields of Machine Learning and Chemical Discovery. The application of our method can have consequences for real-world discovery workflows. There are no ethical aspects which we foresee and feel must be discussed here.

References

Aldeghi, M. and Coley, C. W. A graph representation of molecular ensembles for polymer property prediction. Chemical Science, 13(35):10486–10498, 2022.

Bevilacqua, B., Frasca, F., Lim, D., Srinivasan, B., Cai, C., Balamurugan, G., Bronstein, M., and Maron, H. Equivariant subgraph aggregation networks. ICLR, 2022.

Blunk, D., Bierganns, P., Bongartz, N., Tessendorf, R., and Stubenrauch, C. New speciality surfactants with natural structural motifs. New J. Chem., 30:1705–1717, 2006. doi: 10.1039/B610045G.

Bronstein, H., Nielsen, C. B., Schroeder, B. C., and McCulloch, I. The role of chemical design in the performance of organic semiconductors. Nature Reviews Chemistry, 4(2):66–77, Jan 2020. ISSN 2397-3358. doi: 10.1038/s41570-019-0152-9.

Cai, C., Wang, D., and Wang, Y. Graph coarsening with neural networks. ICLR, 2021.

ChemAxon. Fragmenter. URL http://www.chemaxon.com/.

Chen, B., Li, C., Dai, H., and Song, L. Retro*: Learning retrosynthetic planning with neural guided A* search. 2020.

Chen, J., Saad, Y., and Zhang, Z. Graph coarsening: from scientific computing to machine learning. SeMA Journal, 79(1):187–223, 2022.

Chen, Y., Yao, R., Yang, Y., and Chen, J. A Gromov–Wasserstein geometric view of spectrum-preserving graph coarsening. ICML, 2023.

Degen, J., Wegscheid-Gerlach, C., Zaliani, A., and Rarey, M. On the art of compiling and using 'drug-like' chemical fragment spaces. ChemMedChem, 2008.

Fang, G., Samorodnitsky, G., and Xu, Z. A cover time study of a non-Markovian algorithm. arXiv preprint arXiv:2306.04902, 2023.

Frasca, F., Bevilacqua, B., Bronstein, M. M., and Maron, H. Understanding and extending subgraph GNNs by rethinking their symmetries. NeurIPS, 2022.

Gasieniec, L. and Radzik, T. Memory efficient anonymous graph exploration. In Graph-Theoretic Concepts in Computer Science: 34th International Workshop, WG 2008, Durham, UK, June 30–July 2, 2008. Revised Papers 34, pp. 14–29. Springer, 2008.

Guo, M., Shou, W., Makatura, L., Erps, T., Foshey, M., and Matusik, W. PolyGrammar: Grammar for digital polymer representation and generation. Advanced Science, 9(23):2101864, 2022. doi: 10.1002/advs.202101864. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/advs.202101864.

Guo, M., Thost, V., Song, S., Balachandran, A., Das, P., Chen, J., and Matusik, W. Grammar-induced geometry for data-efficient molecular property prediction. 2023a.

Guo, M., Thost, V., Song, S. W., Balachandran, A., Das, P., Chen, J., and Matusik, W. Hierarchical grammar-induced geometry for data-efficient molecular property prediction. ICML, 2023b.

Helma, C., King, R. D., Kramer, S., and Srinivasan, A. The Predictive Toxicology Challenge 2000–2001. Bioinformatics, 17(1):107–108, 2001.

Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., and Leskovec, J. Strategies for pre-training graph neural networks. ICLR, 2020.

Hughes, T. B., Miller, G. P., and Swamidass, S. J. Modeling epoxidation of drug-like molecules with a deep machine learning network. ACS Central Science, 1(4):168–180, 2015.

IUPAC. Compendium of Chemical Terminology. 1997.

Jiang, J., Zhang, R., Zhao, Z., Ma, J., Liu, Y., Yuan, Y., and Niu, B. Multigran-SMILES: multi-granularity SMILES learning for molecular property prediction. Bioinformatics, 38(19):4573–4580, 2022.

Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pp. 2323–2332. PMLR, 2018.

Jin, W., Barzilay, R., and Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. ICML, 2020.

Kajino, H. Molecular hypergraph grammar with its application to molecular optimization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3183–3191. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/kajino19a.html.

Karande, P., Gallagher, B., and Han, T. Y.-J. A strategic approach to machine learning for material science: How to tackle real-world challenges and avoid pitfalls. Chemistry of Materials, 2022.

Krenn, M., Häse, F., Nigam, A., Friederich, P., and Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, Oct 2020. doi: 10.1088/2632-2153/aba947. URL https://dx.doi.org/10.1088/2632-2153/aba947.

Landrum, G. RDKit: Open-source cheminformatics software. 2016.

Lee, W. J., Kwak, H. S., Lee, D.-r., Oh, C., Yum, E. K., An, Y., Halls, M. D., and Lee, C.-W. Design and synthesis of novel oxime ester photoinitiators augmented by automated machine learning. Chemistry of Materials, 34(1):116–127, Jan 2022. ISSN 0897-4756. doi: 10.1021/acs.chemmater.1c02871.

Leung, L., Kalgutkar, A. S., and Obach, R. S. Metabolic activation in drug-induced liver injury. Drug Metabolism Reviews, 44(1):18–33, 2012.

Li, J., Wang, J., Zhao, Y., Zhou, P., Carter, J., Li, Z., Waigh, T. A., Lu, J. R., and Xu, H. Surfactant-like peptides: From molecular design to controllable self-assembly with applications. Coordination Chemistry Reviews, 421:213418, 2020. ISSN 0010-8545. doi: 10.1016/j.ccr.2020.213418.

Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. Advances in Neural Information Processing Systems, 31, 2018.

Lopez, S. A., Pyzer-Knapp, E. O., Simm, G. N., Lutzow, T., Li, K., Seress, L. R., Hachmann, J., and Aspuru-Guzik, A. The Harvard organic photovoltaic dataset. Sci Data, 3, 2016.

Ma, T. and Chen, J. Unsupervised learning of graph hierarchical abstractions with differentiable coarsening and optimal transport. AAAI, 2021.

Masuda, N., Porter, M. A., and Lambiotte, R. Random walks and diffusion on networks. Physics Reports, 716:1–58, 2017.

Miller, J. A., Sapp, R. W., and Miller, E. C. The Carcinogenic Activities of Certain Halogen Derivatives of 4-Dimethylaminoazobenzene in the Rat. Cancer Research, 9(11):652–660, 1949.

Nigam, A., Pollice, R., Krenn, M., Gomes, G. D. P., and Aspuru-Guzik, A. Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. Chem. Sci., 12(20):7079–7090, April 2021.

Park, J. and Paul, D. R. Correlation and prediction of gas permeability in glassy polymer membrane materials via a modified free volume based group contribution method. Journal of Membrane Science, 125(1):23–39, 1997.

Pemantle, R. A. Random processes with reinforcement. PhD thesis, Massachusetts Institute of Technology, Dept. of Mathematics, 1988.

Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. Molecular Sets (MOSES): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 2020.

Rogers, D. and Hahn, M. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 2010.

Rosvall, M., Esquivel, A. V., Lancichinetti, A., West, J. D., and Lambiotte, R. Memory in network flows and its effects on spreading dynamics and community detection. Nature Communications, 5(1):4630, 2014.

Sawlani, N. Drug discovery informatics market set to surge at 10.9%. Transparency Market Research, Inc, 2024.

Schimunek, J., Seidl, P., Friedrich, L., Kuhn, D., Rippmann, F., Hochreiter, S., and Klambauer, G. Context-enriched molecule representations improve few-shot drug discovery. 2023. URL https://openreview.net/pdf?id=XrMWUuEevr.

Shui, Z. and Karypis, G. Heterogeneous molecular graph neural networks for predicting molecule properties. ICDM, 2020.

Stanley, M., Bronskill, J. F., Maziarz, K., Misztela, H., Lanini, J., Segler, M., Schneider, N., and Brockschmidt, M. FS-Mol: A few-shot learning dataset of molecules. NeurIPS, 2021.

Sterling, T. and Irwin, J. J. ZINC 15–ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337, 2015.

Swager, T. M. 50th anniversary perspective: Conducting/semiconducting conjugated polymers. A personal perspective on the past and the future. Macromolecules, 50(13):4867–4886, 2017.

Thanou, D., Dong, X., Kressner, D., and Frossard, P. Learning heat diffusion graphs. IEEE Transactions on Signal and Information Processing over Networks, 3(3):484–499, 2017.

Türel, T., Dağlar, Ö., Eisenreich, F., and Tomović, Ž. Epoxy thermosets designed for chemical recycling. Chemistry – An Asian Journal, 18(15), Aug 2023. ISSN 1861-4728. doi: 10.1002/asia.202300373.

Wang, S. and Wu, X. The mechanical performance prediction of steel materials based on random forest. Frontiers in Computing and Intelligent Systems, 2023.

Wang, S., Shi, K., Tripathi, A., Chakraborty, U., Parsons, G. N., and Khan, S. A. Designing intrinsically microporous polymer (PIM-1) microfibers with tunable morphology and porosity via controlling solvent/nonsolvent/polymer interactions. ACS Applied Polymer Materials, 2(6):2434–2443, 2020.

Wang, Y., Ma, X., Ghanem, B., Alghunaimi, F., Pinnau, I., and Han, Y. Polymers of intrinsic microporosity for energy-intensive membrane-based gas separations. Materials Today Nano, 3:69–95, 2018.

Wang, Y., Wang, J., Cao, Z., and Farimani, A. B. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence, 2022.

Wu, A. X., Lin, S., Rodriguez, K. M., Benedetti, F. M., Joo, T., Grosz, A. F., Storme, K. R., Roy, N., Syar, D., and Smith, Z. P. Revisiting group contribution theory for estimating fractional free volume of microporous polymer membranes. Journal of Membrane Science, 636, 2021.

Xia, J., Zhang, L., Liu, Y., Gao, Z., Hu, B., Tan, C., Zheng, J., Li, S., and Li, S. Z. Understanding the limitations of deep models for molecular property prediction: Insights and solutions. NeurIPS, 2023a.

Xia, J., Zhu, Y., Du, Y., Liu, Y., and Li, S. A systematic survey of chemical pre-trained models. IJCAI, 2023b.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ICLR, 2019.

Yang, N., Zeng, K., Wu, Q., Jia, X., and Yan, J. Learning substructure invariance for out-of-distribution molecular representations. NeurIPS, 2022.

You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems, 31, 2018.

Yuan, W., Vijayamohanan, H., Luo, S.-X. L., Husted, K., Johnson, J. A., and Swager, T. M. Dynamic polypyrrole core–shell chemomechanical actuators. Chemistry of Materials, 34(7):3013–3019, 2022.

Zhou, G., Gao, Z., Ding, Q., Zheng, H., Xu, H., Wei, Z., Zhang, L., and Ke, G. Uni-Mol: A universal 3D molecular representation learning framework. ICLR, 2023.

A. Motif Collection Strategy


Motifs are used to construct our motif graph G, which forms the design basis for both our generative and predictive methods. The complexity of our grammar, as conveyed by the size of the motif graph G under the different motif collection strategies we tried, is summarized in Table 5. For the remainder of this section, we describe the Expert Annotation strategy, which is our primary workflow. The strategies for obtaining motifs from literature and from heuristic-based fragmentation are described in Appendices B.1 and F.1, respectively.

Table 5. Number of nodes and edges (|V|, |E|) of the motif graph G constructed using different annotation strategies.

Strategy | HOPV | PTC | GC
Literature | N/A | N/A | (96, 3656)
Expert | (329, 37273) | (407, 23145) | N/A
Heuristic | (208, 16880) | (279, 37968) | (90, 4095)
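The (|V|, |E|) counts in Table 5 come from precomputing all valid pairwise attachments between motifs. A simplified sketch, where `compatible` is a hypothetical stand-in for the red-group isomorphism check performed in the real pipeline:

```python
def motif_graph_size(motifs, compatible):
    """Count nodes and directed edges of a motif graph.

    motifs: list of motif objects (any hashable ids here).
    compatible(a, b): True if motif a can attach to motif b via a valid
    pairwise red-group attachment. Self-attachments are excluded in this
    simplified version.
    """
    edges = sum(1 for a in motifs for b in motifs
                if a != b and compatible(a, b))
    return len(motifs), edges
```

Because the compatibility check runs over all ordered motif pairs, keeping red groups small (Step 2 below in Appendix A.1) directly speeds up this precomputation.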

A.1. Expert Annotation Workflow


The workflow consists of two steps: molecule segmentation, and extracting the red groups for pairwise attachments. Step 1 involves cooperation from an expert, and we detail our polished workflow to facilitate that process, which we have attempted with multiple experts. Step 2 can be automated after the expert identifies 1) governing rules for a particular dataset and 2) important exceptions to those rules. On average, each dataset takes less than one working day for one expert to fully annotate and process. The annotated datasets for Group Contribution, HOPV, and PTC will be released upon publication.

Figure 9. Example segmentations for four molecules in the HOPV dataset. Segmentation locations are marked by the dark blue (teal) line.


Step 1: Expert Segmentation


First, experts view figures of the molecules, and indicate the bonds to break in order to segment the molecule into coherently
chosen sub-fragments, shown in Figure 9. We provide a brief description of the example datasets we show here and elaborate
on the rationale behind the experts’ segmentation strategy:

Table 6. Segmentation of the molecules (a) to (d) in Figure 9. "Bonds to break" indicates the chemical bonds to cut to create black fragments; the black groups and red groups listed for each molecule correspond to one another, respectively.

(a) Bonds to break: (12,13) (16,17) (19,20) (27,28) (34,35)
    Black groups: (1,2,3,4,5,6,7,8,9,10,11,12,44,45) (13,14,15,16,43) (17,18,19,25,26,27,34,41,42) (20,21,22,23,24) (28,29,30,31,32,33) (35,36,37,38,39,40)
    Red groups: (13) (12,17) (16,20,28,35) (19) (27) (34)

(b) Bonds to break: (10,11) (6,7) (25,26) (29,30)
    Black groups: (11,12,13,14,15) (7,8,9,10,16,17,18,19,20) (1,2,3,4,5,6,21,22,23,24,25,40,41) (26,27,28,29,35,36,37,38,39) (30,31,32,33,34)
    Red groups: (10) (11,6) (7,26) (25,30) (29)

(c) Bonds to break: (1,2) (2,3) (11,10) (12,13) (27,28) (31,32)
    Black groups: (3,4,5,6,7,8,9,10,24,25,27,36,37) (28,29,30,31,34,35) (32,33) (13,14,15,16,17,18,19,20,21,22,23,24)
    Red groups: (3) (2,11) (11,12) (27,32) (10,13) (31) (12)

(d) Bonds to break: (15,16) (19,20) (26,27)
    Black groups: (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,38,39) (16,17,18,19,37) (20,21,22,23,24,25,26,32,33,34,35,36) (27,28,29,30,31)
    Red groups: (16) (15,20) (19,27) (26)

Predictive Toxicology Challenge (PTC) (Helma et al., 2001) The small molecules are characterized by distinct functional groups known for their carcinogenic properties or liver toxicity (Miller et al., 1949; Helma et al., 2001). These groups comprise a rich variety of elements such as halides, alkylating agents, epoxides, and furan rings (Figure 3). Therefore, we specifically segmented the molecules into the functional groups and sub-structures that contribute most to the toxicity of the compounds (Hughes et al., 2015).
The Harvard organic photovoltaic dataset (HOPV) (Lopez et al., 2016) The process of segmenting the Harvard Organic
Photovoltaic Dataset (HOPV15) demonstrates a methodical and efficient approach to categorizing photovoltaic data. This
dataset contains a comprehensive collection of experimental photovoltaic data from literature coupled with quantum-
chemical calculations across various conformers. The criteria for the extraction of the black group are clearly defined and
systematically applied. Functional groups like vinyl, alcohol, ketone, aldehyde, amine, ester, and amide are separated as
individual black fragments. Similarly, distinct black fragments are used for individual rings including benzene, pyrrole, and
thiophene, in acknowledgement of their Pi-orbital electron delocalization. Complex structures with multiple consecutive
rings, known for their distinctive HOMO-LUMO bandgaps and electrochemical properties, such as thieno[3,4-b]pyrazine,
carbazole, and 2,5-dimethyl-3,6-dioxo-2,3,5,6-tetrahydropyrrolo[3,4-c]pyrrole, are also segmented as individual entities.
Moreover, for groups of 2-3 consecutive symmetrical thiophene or pyrrole units, the methodology captures the significance
of maintaining them as a complete black group because these consecutive groups sustain the electron cloud delocalization
between repeating units, strongly influencing optical and electronic properties not limited to light absorption, charge
transport, and luminescent properties in photovoltaic applications. Meanwhile, this method of segmentation enhances utility
and understanding of the results by clearly basing predictions on existing important structures.

Defining Membership. The Membership metric is reported in Section 4.2.2 after further consultation with experts, who identify the presence of thiophene as a proxy for membership in HOPV, and the presence of chloride/bromide halides (a key indicator of toxicity) for PTC. For both datasets, the Membership metric is only a sanity check that the method can produce a non-trivial number of characteristic compounds. Our justification for the criteria on each dataset:
1. A chloroalkane (Cl-C) is the most common motif in the PTC dataset. Yet, it is still not present in a majority of
structures, making the broader class of alkyl halides (Cl-C, Br-C-C) the best choice for a membership criterion. Their
prevalence is attributed to their reactivity and ability to undergo metabolic activation (Leung et al., 2012), leading to


the formation of highly reactive intermediates that can interact with DNA and other cellular components, potentially
initiating carcinogenic processes. Although not all carcinogenic compounds will necessarily contain this class of motifs,
their presence strongly increases the likelihood.
2. Thiophene, a 5-member ring with one sulfur group, is the most common motif in the HOPV dataset, making it the best
choice for a single-motif membership criterion for HOPV. More broadly, thiophene and its derivatives are arguably the
most common chemical substructure in photovoltaics due to their ability to donate electrons, resulting in particularly high
highest occupied molecular orbital (HOMO) levels, along with stability, tunability of energy levels, and compatibility
with film forming techniques. While not every suitable organic photovoltaic compound will contain it, the vast majority
will.
In both cases, our method can easily achieve 100% membership with a slight modification to the sampling procedure:
instead of iterating through every possible starting motif node, we always initialize our random walk at the membership
motif. We choose not to modify our sampling procedure, and instead include this metric in Table 3 for completeness, since it
is still a good sanity check for other methods to show they generate a non-trivial fraction of candidates with those motif(s).
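The Membership check itself reduces to scanning each generated candidate for a characteristic motif. A sketch, with motif names as illustrative stand-ins for the SMARTS matching one would do with RDKit:

```python
def membership_rate(generated_walks, characteristic_motifs):
    """Fraction of generated molecules (as motif walks) containing at
    least one characteristic motif, e.g. thiophene for HOPV or an
    alkyl-halide motif for PTC."""
    hits = sum(1 for walk in generated_walks
               if any(m in characteristic_motifs for m in walk))
    return hits / len(generated_walks)
```

The 100%-membership variant described above corresponds to simply seeding every random walk at the membership motif instead of iterating over all possible starting nodes.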

Step 2: Extracting Red Groups


Key to the definition of our motif graph is the specification of red groups (vR ⊂ v) that define the possible pairwise attachments between motifs. There are no hard rules, but generally red groups should be minimally necessary. They should be: 1) consistent, enabling more attachments and hence making the motif graph denser; 2) small, enabling fast isomorphism checking during the precomputation of the motif graph; and 3) necessary, ensuring only valid attachments. Failure to follow 3) can generate chemically disallowed structures.

Table 7. Context determination rules on the Group Contribution, HOPV, and PTC datasets (the example column of the original table consists of structure figures).

Group Contribution: We directly use the released groups in (Park & Paul, 1997; Wu et al., 2021).
HOPV: For groups of a single atom, pick the ring of the neighbor fragment if possible; for groups of multiple atoms, pick only the connected atoms in the neighbor fragment.
PTC: Same as HOPV.

B. Representing Existing Molecules as Walks on This Graph


B.1. Extracting Walks from Segmentation (HOPV, PTC)
During segmentation, we use the workflow in Appendix A.1 to segment a molecule into fragments. In doing so, we
also obtain the molecule’s representation HM as a directed subgraph over the motif graph. The pseudocode is found in
Algorithm 1.
Algorithm 2 linearizes the molecule into a directed acyclic graph (DAG). This procedure begins by finding the longest path,


Algorithm 1: function extract_walk(D, B)

Input: D = [M_i | i = 1, ..., |D|]  // dataset of molecules
       B = [B_i | i = 1, ..., |D|]  // annotation, i.e. bonds to break, for each molecule
D_G = []; V = {}; H = []
for i in range(len(D)):
    F_i ← break_bonds(M_i, B_i)      // break bonds and form fragments
    G_i ← form_graph(F_i, B_i)       // graph of motifs, with edges preserving connections
    for f1 in F_i:
        for f2 in N_{F_i}(f1):
            b ← G_i.edges[(f1, f2)]  // bond(s) connecting f1, f2
            rule ← apply_rule(f1, f2, b)
            V.add(rule)
    D_G.append(G_i)
for G_i in D_G:
    H_i ← traverse_dag(G_i, G)
    H.append(H_i)
Output: H, G

Algorithm 2: function traverse dag(Gi , G)


Input: Gi , G, NGi
// graph of fragments, motif graph, neighbor iterator
1 paths ← all pairs shortest paths(Gi );
2 path len = 0;
3 for src in paths do
4 for dest in paths[src] do
5 if len(paths[src][dest]) > path len then
6 path len ← paths[src][dest];
7 longest path ← paths[src][dest];

8 visited ← {};
9 visited[src] ← True;
10 root ← Node(src, main = (src in longest path));
11 Q ← queue([(root,src)]);
12 while !Q.empty() do
13 prev node, prev ← Q.dequeue();
14 for cur in NGi (prev) do
15 if visited[cur] then
16 continue
17 e ← Gi .edges[(prev, cur)];
18 e index ← find edge(e, G.edges[(prev,cur)]);
19 cur node ← Node(cur, main = (cur in longest path));
20 prev node.add child(cur node, e index);
21 visited[cur] ← True;
22 Q.enqueue((cur node, cur))
23 Out: root

and choosing a consistent ordering over neighbors (NGi ) to determine the random walk sequence. We elaborate on the
reasoning behind this canonicalization in Appendix B.4. The DAG constraint enables our graph diffusion process to become
a generator of new molecules (as will be discussed in Appendix D), in addition to capturing the distribution of existing ones.


Thus, we specifically ask experts to create segmentations that are acyclic, which they naturally do in nearly all cases anyway.
In the case of monomers, this canonicalization is consistent with the IUPAC nomenclature (IUPAC, 1997) of linearizing a
monomer via its longest (main) chain, where NGi should iterate over neighbor fragments that descend side chains before the
consecutive fragment on the backbone of the main chain. More specifically, src and dest in Algorithm 2 correspond to the
first and last group of the main chain.
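As a sketch of this linearization, the following Python (using networkx; the helper names and the toy fragment graph are ours, not the released implementation) finds the main chain as the longest shortest path and performs the breadth-first traversal of Algorithm 2, flagging main-chain fragments:

```python
import networkx as nx

def longest_shortest_path(G):
    """Return the longest among all-pairs shortest paths (the 'main chain')."""
    best = []
    for src, table in nx.all_pairs_shortest_path(G):
        for dest, path in table.items():
            if len(path) > len(best):
                best = path
    return best

def traverse_dag(G, root):
    """Breadth-first traversal of the fragment graph (cf. Algorithm 2),
    flagging whether each fragment lies on the main chain."""
    main = set(longest_shortest_path(G))
    order, visited, queue = [], {root}, [root]
    while queue:
        prev = queue.pop(0)
        order.append((prev, prev in main))
        for cur in G.neighbors(prev):
            if cur not in visited:
                visited.add(cur)
                queue.append(cur)
    return order

# Toy monomer: backbone A-B-C with a side-chain fragment D hanging off B.
G = nx.Graph([("A", "B"), ("B", "C"), ("B", "D")])
print(traverse_dag(G, root="A"))  # [('A', True), ('B', True), ('C', True), ('D', False)]
```

Side-chain fragment D is visited but flagged as off the main chain, matching how Algorithm 2 marks `main` nodes.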

B.2. Extracting Walks From Literature (Case Study of Group Contribution)

The Group Contribution dataset includes a compilation of motifs characterized for gas separation, including common
organic chemical functional groups as well as important scaffold functional groups such as Triptycene and its deriva-
tives, dioxin and its derivatives, and N-methylphthalimide and PIM-1 and its derivatives (Wang et al., 2018; 2020).


Figure 10. Following the same presentation as the earlier random-walk figures, we illustrate our random walk representation over three monomers from the literature of gas-separation membranes; in this setting, the random walk must be an Euler circuit, beginning and ending on the motif that is listed first according to IUPAC notation.

These functional groups contribute significantly to maintaining the structures and properties of 3D scaffold building blocks in polymer self-assembly, which in turn play a significant role in gas separation processes, e.g., the separation of H2 (H2/N2), O2 (O2/N2), and CO2 (CO2/CH4), which are common separation tasks important in the gas and oil industry. The steps we take to compile this dataset of segmented monomers are as follows:
1. We obtain an established compilation of groups (Park & Paul, 1997; Wu et al., 2021) for microporous polymers.
2. We visually segment the monomers in (Wang et al., 2018) into random walks over the groups identified in Step 1.
3. We collect experimental permeability and separation performances for 114 of the monomers identified in Step 2.
In addition to the motifs used here, the concept of such segmentation arises naturally across other application domains in
chemical design. Within synthetic organic chemistry, molecular design plays a governing role in advancing new technology
(Bronstein et al., 2020). Understanding of the behavior of a molecule or polymer in an application is commonly described by
experts using the function of key subparts, particularly key functional groups, scaffold structures, and backbone architectures
within a molecule or monomer, and their arrangement relative to each other, rather than considering atom-by-atom or
a molecule as a whole. In chemical design, new molecules can be complex and, when designed by hand in traditional
ways, are built from these relatively modular subcomponents. This approach naturally takes advantage of the physical
laws by which molecules are built by synthesis reactions, where a discrete set of additions and substitutions are allowed to
finally construct a desired target structure. Such methods of chemical design find broader application in drug discovery for
pharmaceuticals, surfactant and detergent design (Blunk et al., 2006; Li et al., 2020), organic semiconductors (Bronstein
et al., 2020), photoinitiators (Lee et al., 2022), and more recyclable plastics (Türel et al., 2023), among other uses, in each of
which chemists fine-tune properties of such components by adjusting the selection and arrangement of these sub-structures,
or otherwise use them as a guide for understanding performance.
By utilizing groups from existing structures, as well as discovering novel structures, researchers can predict performance, find new uses for existing molecules, discover new molecules, and further optimize structures for better performance. Machine learning models have been shown to drastically decrease the time and cost of such methods while simultaneously improving throughput, by creating and screening novel structures in a single step and providing researchers with predictions of target molecules that have higher potential for success, which are then verified by experts. As presented by
(Wu et al., 2021), different structural elements and functional groups present in effective drug molecules can be identified
and recombined in new architectures. These novel structures can then be tested using computer models to benchmark likely
efficacy given new targets or modifications to binding sites.


B.3. Graph Augmentation


The motif graph is the directed, multi-edge graph G = (V, E). When traversing to a previously seen motif v, there is ambiguity between the random walk forming a cycle and it appending a new copy of the motif to the trajectory. To remove this ambiguity, the random walk traverses a duplicate node, vk, in the latter case. A dataset of molecules and their representations, D := {(M, HM)}, thus induces an augmented version of G, denoted G′. For each v ∈ V, let Kv = maxM count(v, HM). We create duplicates of v and of the in/out-edges of v:
V′ ← V ∪ ⋃v∈V {vk | k = 0, . . . , Kv − 2}    (9)

E′ ← E ∪ ⋃v∈V {(vk , v′, e) | (v, v′, e) ∈ E, ∀k = 0, . . . , Kv − 2}    (10)

E′ ← E′ ∪ ⋃v∈V {(v′, vk , e) | (v′, v, e) ∈ E, ∀k = 0, . . . , Kv − 2}    (11)

A molecule M = (VM , EM ) is then represented as a rooted subgraph of G′. To simplify notation, in the main text we use G to refer to its augmented version.
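Equations (9)-(11) can be realized concretely as follows; this is our illustrative sketch with a networkx MultiDiGraph, using ":k" suffixes for duplicates as in the notation of Appendix F:

```python
import networkx as nx

def augment_motif_graph(G, walks):
    """Create duplicates v:1, v:2, ... of motif v (Eqs. 9-11), so a walk that
    reuses a motif is distinguished from a walk that cycles back to it."""
    Gp = G.copy()
    for v in list(G.nodes):
        K_v = max((walk.count(v) for walk in walks), default=1)
        for k in range(K_v - 1):
            dup = f"{v}:{k + 1}"
            Gp.add_node(dup)
            for _, w, e in G.out_edges(v, keys=True):   # copy out-edges of v
                Gp.add_edge(dup, w, key=e)
            for u, _, e in G.in_edges(v, keys=True):    # copy in-edges of v
                Gp.add_edge(u, dup, key=e)
    return Gp

G = nx.MultiDiGraph([("G81", "G82"), ("G82", "G274"), ("G274", "G82")])
walks = [["G81", "G82", "G274", "G82"]]   # G82 appears twice -> one duplicate
Gp = augment_motif_graph(G, walks)
print("G82:1" in Gp.nodes)   # True
```

The duplicate "G82:1" inherits both the in-edges and out-edges of "G82", so a walk revisiting the motif has an unambiguous target node.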

B.4. Data Augmentation


Like the Simplified Molecular-Input Line-Entry System (SMILES), our description, ĤM , of a molecule is not unique. We experimented, to varying extents, with balancing canonicalization of the description against applying data augmentation during the grammar training phase.
As described in Algorithm 2, we linearize a molecule by first setting its “main chain” – the longest shortest path of HM . If
this happens to be part of a cycle, we disregard one edge. If there are multiple longest shortest paths, we choose the one
whose first differing node comes first in our canonical ordering over the nodes of G.
We tried two types of data augmentation:
1. Reversing the direction of the main chain.
2. For each node, trying every permutation over the side chains descending from it.
However, we noticed no practical improvements in training loss or downstream task performance when either of the two
types of augmentation were applied. We believe that, given our parameter estimation procedure, the consistently applied
canonicalization over the nodes of G improves data-efficiency by significantly reducing the hypothesis space.

C. Building the Motif Graph & More Related Works


Expanding on Section 3.1, we apply a subgraph-matching algorithm (pseudocode in Algorithm 3) over all pairs of motifs v1 , v2 . This algorithm is embarrassingly parallel and runtime-efficient, as the subgraph vRl is, unless specified otherwise, only a few atoms or a ring. RDKit (Landrum, 2016) provides out-of-the-box implementations of subgraph matching optimized for molecular sub-fragments like rings, enabling a significant speedup in runtime.
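As an illustration of the pairwise check, the sketch below uses networkx's GraphMatcher in place of RDKit's optimized substructure matcher; the toy motifs and the `elem` node attribute are our own stand-ins for atom-typed molecular graphs:

```python
import networkx as nx
from networkx.algorithms import isomorphism

def red_group_matches(motif, red_atoms, other):
    """Occurrences of motif's red group (a node-induced subgraph) inside
    another motif, returned as sets of the other motif's atoms."""
    pattern = motif.subgraph(red_atoms)
    gm = isomorphism.GraphMatcher(
        other, pattern,
        node_match=lambda a, b: a["elem"] == b["elem"])
    return [set(m) for m in gm.subgraph_isomorphisms_iter()]

# Two toy motifs; the 'elem' attribute stands in for atom types.
v1 = nx.Graph()
v1.add_nodes_from([(1, {"elem": "C"}), (2, {"elem": "O"})])
v1.add_edge(1, 2)
v2 = nx.Graph()
v2.add_nodes_from([(1, {"elem": "C"}), (2, {"elem": "O"}), (3, {"elem": "C"})])
v2.add_edges_from([(1, 2), (2, 3)])

matches = red_group_matches(v1, [1, 2], v2)
print(matches)   # the C-O red group occurs twice in v2
```

Each match is a candidate "certificate" set (the b1/b2 of Algorithm 3); the connectivity and isomorphism checks are then applied on top of these candidates.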

C.1. Connection to Dual Graph of Geo-DEG’s Meta-Grammar


Our proposed directed multigraph can also be conceptualized as the dual graph of the Geo-DEG meta-grammar. The essence of the Geo-DEG meta-grammar lies in its completeness, a characteristic inherited by our proposed digraph. A significant advantage of our approach is the substantial reduction in complexity. To elucidate this process, consider the construction of our multigraph from the Geo-DEG meta-grammar, denoted as Gg = (Vg , Eg ). The initial step involves replacing each node in Vg , which represents a junction tree, with all feasible molecular structures derived from motifs that maintain the same junction-tree structure. Subsequently, we augment Eg with fully connected edges between these sets of molecular structures. The dual graph, Ggd , is then derived from Gg , where each node from Vg is transformed into an edge, and each edge from Eg becomes a node. This dual graph not only preserves the completeness of the original graph but also provides an intuitive representation of molecular assembly. Each node in the dual graph symbolizes a motif, and traversing this graph illustrates the process of assembling a molecule by adding motifs. To refine this representation, we eliminate duplicate nodes in the dual graph, ensuring each node's uniqueness.


Algorithm 3: function build motif graph(V)


Input: V
// motifs
1 G = graph(V);
2 for v1 in V do
3 for v2 in V do
4 for l1 in v1R do
5 sub 1r ← extract subgraph(v1 , v1Rl1 ); // red group l1 of v1
6 b2 sub ← substruct matches(v2 , sub 1r ); // occurrences of v1 's red group in v2
7 b2 all ← isomorphisms iter(v2 , b2 sub);
8 for l2 in v2R do
9 sub 2r ← extract subgraph(v2 , v2Rl2 );
10 b1 sub ← substruct matches(v1 , sub 2r ); // occurrences of v2 's red group in v1
11 b1 all ← isomorphisms iter(v1 , b1 sub);
12 conn b1 = [];
13 for b1 in b1 all do
14 if connected(v1 (v1Rl1 + b1)) then
15 conn b1.append(b1);
16 conn b2 = [];
17 for b2 in b2 all do
18 if connected(v2 (v2Rl2 + b2)) then
19 conn b2.append(b2);
20 for b1 in conn b1 do
21 for b2 in conn b2 do
22 sub 1 ← v1 (v1Rl1 + b1);
23 sub 2 ← v2 (v2Rl2 + b2);
24 if isomorphic(sub 1, sub 2) then
25 el1 ,l2 ← (v1 , v2 , r grp 1: v1Rl1 , r grp 2: v2Rl2 , b1 : b1 , b2 : b2 );
26 G.add edge(el1 ,l2 );

27 Out: G

The representation’s completeness is maintained because every possible molecule structure derivable from the motifs
is accounted for in the dual graph. Each pathway through the graph represents a unique assembly sequence of motifs,
translating into a distinct molecular structure. The reduction in complexity arises from the transformation process. By
converting the original graph into its dual form, we reduce the granularity of representation. Instead of representing every
possible molecular structure as a separate node, we represent them as pathways through the dual graph. This approach
significantly decreases the number of nodes and edges required, leading to a more manageable yet complete representation
of the molecular structures.

C.2. Connection to Graph Coarsening


Mathematically, the motif graph advocated in this work is the quotient graph of the molecular graph, under the equivalence
relation defined as u ≡ v if nodes u and v belong to the same motif. As our motifs do not overlap and jointly cover all nodes
of the molecular graph, they define a partitioning of the graph. In scientific computing, collapsing each partition into a
single node and retaining edges crossing partitions is called graph coarsening, which is a commonly used technique to solve
large-scale problems, notably solving sparse linear systems of equations (Chen et al., 2022). Working on the coarsened
version of the graph (i.e., the quotient graph) is computationally attractive as the graph size is much smaller. Moreover,
when applied to machine learning problems such as graph classification, it is demonstrated that the representation learned
from the quotient graph can be as predictive as that learned from the original graph (Chen et al., 2023; Ma & Chen, 2021;
Cai et al., 2021). Favorably, a unique scenario of this work is that all concerned (molecular) graphs share the same set of motifs, which brings in the potential benefit of learning better molecule representations based on motif representations that form the basis of all molecules.

Figure 11. Following the notation of the main text, {v1 }r1 = {1, 9}, {v1 }r2 = {1}, {v1 }r3 = {9}, {v1 }r4 = {5}, {v2 }r1 = {1, 2, 3, 4, 5, 6}, {v2 }r2 = {12, 13, 14, 15, 16, 17}. We annotate e1,1 , where b1 = {6, 4, 3, 2, 8, 7} and b2 = {10, 7} provide the "certificate" of a successful match.
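The quotient-graph view can be reproduced directly with networkx (a sketch on a toy molecule of our own; `quotient_graph` collapses each motif block into a single node and keeps edges that cross blocks):

```python
import networkx as nx

# Atom-level graph of a toy molecule: a six-membered ring (atoms 0-5)
# attached to a two-atom functional group (atoms 6-7).
mol = nx.cycle_graph(6)
mol.add_edges_from([(0, 6), (6, 7)])

# Motifs partition the atoms; the motif graph is exactly the quotient graph.
motifs = [{0, 1, 2, 3, 4, 5}, {6, 7}]
coarse = nx.quotient_graph(mol, motifs)

print(coarse.number_of_nodes(), coarse.number_of_edges())  # 2 1
```

The coarsened graph has one node per motif and one edge for the bond crossing the two partitions, mirroring the graph-coarsening construction described above.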

C.3. Connection to Random Walk Literature


Our parameterization of the random walk is by learning a graph heat diffusion process over the motif graph G. The
relationship between graph heat diffusion and random walk has been studied before (Masuda et al., 2017), but we integrate
two new ideas: 1) making the Laplacian (edge weights) learnable and dynamically adjustable, and 2) conditioning the
adjustment on an order-invariant memory. The justification as to why we don’t just use autoregressive models is part
of a larger discussion on the respective merits of autoregressive models vs grammar-based approaches. In data-efficient
settings, previous works (Guo et al., 2023a;b) show grammars (especially context-free grammars) work well due to the relatively small number (tens to hundreds) of examples needed to learn valid rules and derivation sequences. Meanwhile, the number
of possible hidden states that autoregressive models (Li et al., 2018; You et al., 2018; Liu et al., 2018) are parameterized
to learn is exponential (to the length of the sequence), and learning a good parameterization is difficult (Jin et al., 2018;
2020). We take a middle ground, combining the data-efficient advantages of context-free grammar and the expressivity of
autoregressive models, by introducing a context-sensitive grammar which utilizes a set-based memory during the random
walk. The set-based memory mechanism c(t) keeps an order-invariant memory of the nodes visited so far. Without the
memory mechanism, our model becomes an order-1 Markov process. Prior literature shows that higher-order random walks are required to capture temporal correlations in edge activations (Rosvall et al., 2014; Masuda et al., 2017), at the cost of added complexity and reduced practicality. In the design of complex and modular structures, the order-1 Markov assumption is
not sufficient (see footnote 2 in the paper). Meanwhile, higher-order models make it difficult to scale our grammar to larger
motif graphs. We take a middle ground by introducing a set-based memory state, replacing the entire visit history with a
summary of node visit counts. In particular, prior works study how memory mechanisms in random walks affect exploration
efficiency (Fang et al., 2023; Gasieniec & Radzik, 2008) and enable negative/positive feedback (Fang et al., 2023; Pemantle,
1988). Our results in Section 4.2.2 demonstrate the efficacy of this approach.
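A minimal numpy sketch of one step of this memory-conditioned diffusion follows. The linear form of f and the choice of D as the diagonal of column sums are our illustrative assumptions; in the actual model, E and the parameters of f are learned (Algorithm 6):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                  # number of motifs |G|
E = rng.random((n, n))                 # learnable prior edge weights
A = rng.random((n * n, n)) * 0.1       # parameters of the adjustment f

def f(c):
    # One simple linear choice for the memory-conditioned adjustment.
    return (A @ c).reshape(n, n)

def diffusion_step(x, c, t):
    c = (t * c + x) / (t + 1)          # order-invariant visit frequencies
    W = E + f(c)                       # memory-adjusted edge weights
    D = np.diag(W.sum(axis=0))         # degree matrix (column sums)
    x = x + (D - W) @ x                # x^(t+1) = x^(t) + (D - W^(t+1)) x^(t)
    return x, c

x, c = np.eye(n)[0], np.zeros(n)       # particle starts at motif 0
for t in range(3):
    x, c = diffusion_step(x, c, t)
print(round(x.sum(), 6), round(c.sum(), 6))   # mass is conserved: 1.0 1.0
```

Because W depends on the running visit counts c, the transition at time t is no longer a function of the current state alone, which is exactly the context-sensitivity discussed above.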


D. Grammar Learning
D.1. Graph Diffusion Strategy
Our strategy is to encode the dataset of walk trajectories by training the parameters of our graph diffusion process to recover
the ground-truth state of a particle being diffused over the motif graph. We use stochastic gradient descent and choose
between a “forcing” approach (where a single particle transitions from one state to another) and a “split” approach (where a
single particle splits its mass equally along the out-edges of its current state). See the pseudocode in Algorithm 6.
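The two target constructions can be sketched as follows (our simplified numpy version on a toy walk DAG; the real implementation operates over the motif-graph states of Algorithm 6):

```python
import numpy as np

def next_target(p, traj, t, adj, strategy):
    """Ground-truth particle state p^(t+1) under the two options:
    'forcing' pins unit mass to the extracted walk; 'split' divides each
    node's mass equally along its out-edges (d_j = out-degree of j)."""
    if strategy == "forcing":
        q = np.zeros_like(p)
        q[traj[(t + 1) % len(traj)]] = 1.0
        return q
    deg = adj.sum(axis=1)                      # out-degrees d_j
    q = np.zeros_like(p)
    for j in range(len(p)):
        if deg[j] > 0:
            q += adj[j] * (p[j] / deg[j])      # p_i^(t+1) = sum_j p_j^(t)/d_j
    return q

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)       # toy walk DAG over 3 motifs
p0 = np.array([1.0, 0.0, 0.0])
print(next_target(p0, [0, 1, 2], 0, adj, "forcing"))  # [0. 1. 0.]
print(next_target(p0, [0, 1, 2], 0, adj, "split"))    # [0.  0.5 0.5]
```

"Forcing" supervises the diffusion toward a single trajectory, while "split" supervises it toward the uniform mass division along out-edges.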

Algorithm 4: function re order(childs)


Input: childs // children
1 ordered childs ← sorted(childs, key = λ c: (c.main, c.id));
2 Out: ordered childs // re-ordered children, with side-chain descendants first

Algorithm 5: function dfs walk(cur, traj)


Input: cur, traj // current node, trajectory so far
1 traj.append(cur);
2 childs ← re order(cur.children);
3 for c in childs do
4 cur len ← len(traj);
5 dfs walk(c, traj);
6 if !c.main then
7 traj ← traj + reverse(traj[cur len:]);

D.2. Visualizing Learning Process


In Figure 12 and Figure 13, we see our grammar’s capacity to estimate the prior edge weights, E, through training, as
well as correct the edge weights via a memory-sensitive adjustment during the random walk. Weights of edges that are
commonly traversed after G14 will be amplified during training, and weights of edges that visit G14 from another state will
be diminished.

E. Property Prediction
E.1. Graph Neural Network Design Choices
We apply a Graph Isomorphism Network (Xu et al., 2019) with hyperparameters in Table 8. For molecule M with
representation HM , the node-level features include: a) the Morgan fingerprint of the motif vi (dimension 2048), b) the
memory-free weights of its out-edges (dimension |G|), i.e. E[i]. We also concatenate the Morgan fingerprint of M.
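The feature assembly for one motif node can be sketched as follows; the fingerprint vectors are stubbed here (in practice they are RDKit Morgan fingerprints), and the |G| value is illustrative:

```python
import numpy as np

def node_features(motif_fp, out_edge_weights, mol_fp):
    """Per-motif input features: motif Morgan fingerprint (2048 bits),
    memory-free out-edge weights E[i] (|G| dims), and the molecule-level
    Morgan fingerprint (2048 bits)."""
    return np.concatenate([motif_fp, out_edge_weights, mol_fp])

G_size = 400                            # |G|, size of the motif graph (stub)
motif_fp = np.zeros(2048); motif_fp[[3, 17]] = 1.0     # stub fingerprint
mol_fp = np.zeros(2048);   mol_fp[[3, 17, 99]] = 1.0   # stub fingerprint
E_row = np.random.default_rng(0).random(G_size)        # stub E[i]

feats = node_features(motif_fp, E_row, mol_fp)
print(feats.shape)   # (4496,) = 2048 + 400 + 2048
```

The resulting dimensionality matches the 2048 + 2048 + |G| input feature dimension in Table 8.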

E.2. Bag-of-Motifs Design Choices


We obtain a |G|-dimensional motif-occurrence feature vector for each M . Similar to Ours, we concatenate the Morgan fingerprint of M to it. We use XGBoost with 16 estimators (boosting rounds) and a maximum tree depth of 10.
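The motif-occurrence vector can be sketched as below; treating duplicate-suffixed motifs like "G82:1" as their base motif is our assumption, and the toy motif index is illustrative:

```python
import numpy as np

def bag_of_motifs(walk, motif_index):
    """|G|-dimensional motif-occurrence vector for one molecule's walk."""
    v = np.zeros(len(motif_index))
    for motif in walk:
        v[motif_index[motif.split(":")[0]]] += 1   # 'G82:1' counts as 'G82'
    return v

motif_index = {"G81": 0, "G82": 1, "G274": 2}
walk = ["G81", "G82", "G274", "G82:1"]
print(bag_of_motifs(walk, motif_index))   # [1. 2. 1.]
```

This vector, concatenated with the molecule's Morgan fingerprint, would then feed the gradient-boosted model, e.g. `xgboost.XGBRegressor(n_estimators=16, max_depth=10)`.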

E.3. Optimization Design Choices


We apply the Adam optimizer for stochastic gradient-based training. To mitigate noisy training dynamics, we report the mean and standard deviation over 3 runs, corresponding to 3 random seeds during data splitting. We initialize weights using a Gaussian distribution.


Algorithm 6: function algo-diffusion


Input: T, G, alpha, strategy // number of time-steps, motif graph, learning rate, either ’split’
or ’forcing’
1 E ← rand(|G| × |G|);
2 W ← rand(|G|, |G| ∗ |G|);
3 b ← zeros(|G|);
4 for (HM , EM ) in D do
5 c(0) ← [0 for v in G];
6 x(0) ← [1 if v==HM .root else 0 for v in G];
7 p(0) ← [1 if v==HM .root else 0 for v in G];
8 if strategy == ’forcing’ then
9 traj ← [];
10 dfs walk(HM .root, traj);
11 for t = 0, . . . , T − 1 do
12 c^(t+1) = (t/(t+1)) · c^(t) + (1/(t+1)) · p^(t);
13 W^(t+1) = E + f(c^(t+1));
14 x^(t+1) = x^(t) + (D − W^(t+1)) x^(t);
15 if strategy == ’forcing’ then
16 p^(t+1) ← [1 if v == traj[(t+1)%len(traj)] else 0 for v in G];
17 else
18 for i in G do
19 p_i^(t+1) ← Σ_{(j,i)∈EM} p_j^(t) / d_j;

20 Loss ← MSE(x^(T), p^(T));
21 E ← E − α · dLoss/dE;
22 W ← W − α · dLoss/dW;
23 b ← b − α · dLoss/db;

24 Out = E,W,b

Table 8. Hyperparameter settings for property prediction


Hyperparameter Value
Number of layers 5
Activation ReLU
Hidden dimension 16
Motif featurization Morgan fingerprint
Motif feature dimension 2048
Input feature dimension⁵ 2048 + 2048 + |G|
Batch Size 1
Learning Rate 1e-3

F. Generating Novel Random Walks


We illustrate the generation of the random walk with notation G81 → G82 → G274 → G82 : 1 in Figure 14. Our graph resolves any ambiguity between revisiting G82 and attaching a new copy of G82 to the molecule by appending a colon and an index to each newly attached motif that has a naming conflict. This is possible after augmenting G with duplicates of the motif (see Appendix B.3), which, in practice, causes a negligible increase in the complexity of G.


Figure 12. (Left) The raw data of Group Contribution. The edge thickness is proportional to the number of monomers whose random walk representations traverse the edge. (Right) The learned parameter matrix E after training converges. The grammar both retains essential nodes and edges and smooths the distribution of edge weights.

Figure 13. We show the weight evolution of the edges incidental to G14 on HOPV. (Left) After processing the raw dataset into random
walks, we visualize the empirical distribution of edge traversals. (Middle) After learning our context-sensitive grammar, we plot the prior
edge weights, i.e. the memory-free parameter E. (Right) We plot the transition probabilities starting at G14 during the random walk
generation process.

Our implementation in Algorithm 7 handles the distinction between revisiting a previous node and adding a new duplicate of the same motif as a previous node through mask attach (new nodes which can be attached) versus mask return (the node the random walker can backtrack to). This distinction is made by creating duplicates of nodes for each revisit (see Appendix B.3). We guarantee 100% validity, since we can explicitly check which motifs can be attached to M at each step. When there are neither new motifs to attach nor existing motifs to return to, the generation terminates with the current M as the final output. Please refer to the GitHub repository for details of the implementation.

Figure 14. Generation of a random walk G81 → G82 → G274 → G82 : 1; the possible transitions from G81, G82, and G274 are in red, green, and blue (with thickness proportional to probability).
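The masking and termination logic can be sketched as follows (our simplified numpy version of Algorithm 7, lines 14-17; the toy masks and state vector are illustrative):

```python
import numpy as np

def masked_sample(x, mask_attach, mask_return, rng):
    """Restrict the diffusion state to valid next steps: either attach a
    new motif or backtrack; returning None terminates the generation."""
    mask = mask_attach | mask_return
    if not mask.any():
        return None
    probs = np.where(mask, x, 0.0)
    probs = probs / probs.sum()
    return int(rng.choice(len(x), p=probs))

rng = np.random.default_rng(0)
x = np.array([0.1, 0.6, 0.3])                  # diffusion state over 3 motifs
mask_attach = np.array([False, True, False])   # motif 1 can be attached
mask_return = np.array([True, False, False])   # can backtrack to motif 0
print(masked_sample(x, mask_attach, mask_return, rng))   # 0 or 1, never 2
```

Validity follows directly from the mask: chemically impossible attachments receive zero probability, and an all-zero mask ends the walk.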

Figure 15. Generation process of novel random walks on HOPV: (Left) G262 → G305 → G181 and (Right) G239 → G297 → G202;
The possible transitions from the first and second visited nodes are in Red and Green.

As shown in Figure 15, applying our generation method produces artifacts of learning that invite further scrutiny: “rules”
of consecutive motifs. The second example in Figure 15 shows there are only two possible motifs (green) that can be
attached to the G297 end of a molecule with the G239 and G297 functional groups (ignoring the return back to G239,
which transitions the state but does not attach a new motif). In the first example, the distribution of possible new motifs to
attach to the G305 end of a molecule with G262 and G305 appears more uniform.

F.1. Extracting Context-Sensitive Grammar Rules


One side product of our generation and verification procedure is the ability to extract “hard” rules. A hard rule is when a
certain edge must be traversed (probability of 1) under a certain memory and at a certain state. Although our memory is
invariant to the order of visited nodes thus far, we search for hard rules by using a best-first algorithm to store all promising
trajectories. Table 9 is a compilation of hard rules learned by our model on the PTC dataset.
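The rule-extraction idea can be sketched as below; the exhaustive scan and the toy transition table are our simplifications of the best-first search over promising trajectories:

```python
def find_hard_rules(transition, contexts, threshold=1.0 - 1e-9):
    """Scan (memory, state) contexts for transitions taken with probability 1.
    `transition(memory, state)` returns next-state probabilities; this simple
    exhaustive scan stands in for the best-first trajectory search."""
    rules = []
    for memory, state in contexts:
        for nxt, p in transition(memory, state).items():
            if p >= threshold:
                rules.append((memory, state, nxt))
    return rules

# Toy transition table: after visiting {G4}, state G4 must go to G2.
table = {(frozenset({"G4"}), "G4"): {"G2": 1.0},
         (frozenset({"G2", "G4"}), "G2"): {"G4": 0.7, "G6": 0.3}}
rules = find_hard_rules(lambda m, s: table[(m, s)], list(table))
print(rules)   # [(frozenset({'G4'}), 'G4', 'G2')]
```

Each extracted triple (memory, state, next) corresponds to one row of Table 9: a context under which the grammar admits exactly one continuation.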
⁵ We concatenate the molecule’s 2048-dimensional Morgan fingerprint to the input features; we also concatenate the edge-weighted adjacency matrix to the input features.


Algorithm 7: function generate


Input: G // motif graph
1 root M ∼ V; // can sample according to prior
2 loop back; // whether to loop back (applies for monomers)
3 root ← Node(root M);
4 H ← root;
5 M ← molecule(root M); // initialize the molecule
6 t ← 0;
7 c(t) ← [0 for v in V];
8 terminate ← False;
9 while !terminate do
10 p^(t) ← [1 if v == H.val else 0 for v in V];
11 c^(t+1) ← (t/(t+1)) · c^(t) + (1/(t+1)) · p^(t);
12 W^(t+1) ← E + f(c^(t+1));
13 x^(t+1) ← x^(t) + (D − W^(t+1)) x^(t);
14 mask attach, mask return ← mask possible(M,G,H);
15 mask ← mask attach | mask return;
16 x^(t+1) ← (mask ∗ x^(t+1)) / (mask ∗ x^(t+1)).sum();
17 cur ← sample(x^(t+1));
18 if cur is not None then
19 if loop back and cur == root M then
20 terminate ← True;
21 else
22 if mask attach[cur] then
23 M ← attach(M, molecule(cur));
24 H.child ← Node(cur);
25 H ← H.child;
26 else
27 H ← H.parent;
28 else
29 if loop back then
30 return M, root, False;
31 else
32 break;

33 return M, root, True;


34 Out: molecule, representation of molecule, boolean indicating validity

G. Detailed Case Study: Harvard Organic Photovoltaic Dataset


The Harvard Organic Photovoltaic Dataset (HOPV15) is a comprehensive collection that bridges experimental photovoltaic
data with quantum-chemical calculations, serving as a crucial resource in the field of organic photovoltaics. This dataset
includes experimental results from literature and corresponding quantum chemical data for a wide range of molecular
conformers. These are analyzed using various density functionals and basis sets, including both generalized-gradient
approximation and hybrid designs. A key feature of HOPV15 is its utility in calibrating quantum chemical results with
experimental observations, aiding in the development of new semi-empirical methods, and benchmarking model chemistries
for organic electronic applications. The dataset employs the Scharber model to compute the maximum percent conversion
efficiencies for 350 studied molecules, focusing on their HOMO (Highest Occupied Molecular Orbital) values.


Table 9. Under our string-based implementation, A[→B] encodes a random walk trajectory of A→B→A. All rules shown are valid, as
verified to correspond to valid molecules that can be constructed following the random walk trajectory.
Trajectory A ⇒ Trajectory B
[’G4’] [’G4’, ’G2’]
[’G27’] [’G27’, ’G6’]
[’G115’] [’G115’, ’G6’]
[’G218’] [’G218’, ’G6’]
[’G283’] [’G283’, ’G6’]
[’G290’] [’G290’, ’G6’]
[’G301’] [’G301’, ’G6’]
[’G335’] [’G335’, ’G6’]
[’G368’] [’G368’, ’G6’]
[’G466’] [’G466’, ’G231’]
[’G272’] [’G272’, ’G271’]
[’G362’] [’G362’, ’G361’]
[’G205’] [’G205’, ’G202’]
[’G435’] [’G435’, ’G434’]
[’G167’] [’G167’, ’G166’]
[’G436’] [’G436’, ’G166’]
[’G224’] [’G224’, ’G225’]
[’G2’, ’G4’] [’G2[->G4]’]
[’G202’, ’G205’] [’G202[->G205]’]
[’G434’, ’G435’] [’G434[->G435]’]
[’G361’, ’G362’] [’G361[->G362]’]
[’G333’, ’G393’] [’G333’, ’G393’, ’G333:1’]
[’G224’, ’G225’, ’G224:1’] [’G224’, ’G225[->G224:1]’]

G.1. Segmentation Strategy


Our segmentation approach involved systematically categorizing molecules based on their functional groups and ring
structures. We separated standard functional groups (e.g., vinyl, alcohol) and individual rings (e.g., benzene, thiophene,
pyrrole) to understand their unique contributions to photovoltaic properties. Additionally, we paid special attention to
complex structures with consecutive rings, acknowledging their impact on the optical and electronic characteristics of the
materials. These parameters impact the molecule’s HOMO value, which is essential for calculating the open-circuit potential and short-circuit current density, leading to an understanding of percent conversion efficiency.
The segmentation strategy is particularly focused on the differentiation and categorization of molecular structures based on
their photovoltaic properties and electronic configurations. This includes the separation of standard functional groups such
as vinyl, alcohol, ketone, aldehyde, amine, ester, and amide, each identified as individual black fragments. This separation is
critical in analyzing their distinct contributions to photovoltaic efficiency and electronic properties.
Moreover, the dataset and segmentation emphasize the unique characteristics of individual rings like benzene, pyrrole, and
thiophene by treating them as separate black fragments. This distinction is vital due to their specific Pi-orbital electron
delocalization, which plays a crucial role in the photovoltaic properties of the molecules. The segmentation method goes
a step further in dealing with complex structures possessing multiple consecutive rings, such as thieno[3,4-b]pyrazine,
carbazole, and 2,5-dimethyl-3,6-dioxo-2,3,5,6-tetrahydropyrrolo[3,4-c]pyrrole. These structures are treated as individual
entities to accurately reflect their unique HOMO-LUMO bandgaps and electrochemical characteristics, which are central to
their functionality in organic photovoltaics.
The segmentation strategy also pays special attention to groups of 2-3 consecutive symmetrical thiophene or pyrrole units,
maintaining these as a single black group. This decision is based on the understanding that the electron cloud delocalization
across these repeating units significantly influences the optical and electronic properties of the molecules, impacting factors
such as light absorption, charge transport, and luminescence. Such an approach is essential for advancing the understanding of
molecular alignment and stability, thereby optimizing the functional properties of photovoltaic materials.


Meanwhile, each red group accompanying a segmented black group is chosen to be either a single atom or, if the black group is too small (just one or two atoms), the closest conjugated ring. This method helps reduce the redundancy and computational cost of the red groups.

G.2. Heuristic Based Fragmentation


We adopted a heuristic-based, deterministic algorithm to segment molecules across all datasets for our ablation study. Below, we analyze its segmentation quality on the HOPV dataset. We cleave any bond that satisfies either of these conditions:
1. Bond connecting two rings.
2. Bond connecting a ring and an atom with degree greater than 1.
This algorithm works for molecules with rings but tends not to capture functional groups consistently. It either fails to sufficiently segment groups attached to a ring, like A2 in Figure 16, or cleaves on every ring bond even when the rings should be kept together, like B1 in Figure 16.
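The two cleavage conditions can be prototyped on an atom-level graph; the sketch below uses networkx ring detection as a stand-in for RDKit's ring info, and the toy "biphenyl with an -OH tail" molecule is our own:

```python
import networkx as nx

def heuristic_cleave_bonds(mol):
    """Bonds to cleave under the two heuristics above: (1) bonds joining two
    different rings, (2) bonds joining a ring atom to a non-ring atom of
    degree > 1."""
    cycles = nx.cycle_basis(mol)
    ring_atoms = {a for cycle in cycles for a in cycle}
    cuts = []
    for u, v in mol.edges():
        in_ring = {u in ring_atoms, v in ring_atoms}
        same_ring = any(u in c and v in c for c in cycles)
        if in_ring == {True} and not same_ring:          # rule 1
            cuts.append((u, v))
        elif in_ring == {True, False}:                   # rule 2
            chain_atom = v if u in ring_atoms else u
            if mol.degree(chain_atom) > 1:
                cuts.append((u, v))
    return cuts

# Toy biphenyl-like molecule: two rings joined by a bond, plus an -O-H tail.
mol = nx.cycle_graph(6)                                           # ring 0-5
mol.add_edges_from([(i + 6, (i + 1) % 6 + 6) for i in range(6)])  # ring 6-11
mol.add_edge(0, 6)                        # ring-ring bond -> cleaved (rule 1)
mol.add_edges_from([(3, 12), (12, 13)])   # ring-O and O-H; O has degree 2
print(sorted(heuristic_cleave_bonds(mol)))   # [(0, 6), (3, 12)]
```

Note that the heuristic cleaves the bond joining the two rings, illustrating the B1-style failure mode above: consecutive rings that an expert would keep as one fragment are always separated.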

G.3. Analysis of Learnt Representations


In this section, we perform an alternate and more accepted means of analysis than the 2D t-SNE analysis done in Section
4.4.2. We seek to understand the agreement between our property predictor’s learnt representation and the structural
similarity over HOPV’s test set molecules. Since the final layer embedding is used for prediction, we expect molecules with
similar properties to have similar embeddings.
In Figure 17, several groups of trends stand out. Particularly, highlighted in green are cases where the embedding similarity
is high despite dissimilar HOMO property values; blue marks cases where the embedding similarity is low, and red marks
sections that are similar in property, structure, and embedding. We detail each of these, basing comparison against molecule
50 for illustration:
• For the topmost green section, molecules in the range 1-4 have similar components as those with higher HOMO values,
though are much smaller in size and relatively disordered. For instance, molecules 3 and 4 each share key subcomponents
(thiophene groups) with molecule 50, despite having quite different overall structures. The embedding similarities between (50, 4) and (50, 3) are thus medium-low and medium-high, respectively.
• For the red sections along the diagonal, molecules in the ranges 14-16 and 17-26 cluster together. These tend to have an
over-representation of electron-withdrawing groups in non-symmetric locations in the structure, particularly methoxy,
cyano, and carbonyl groups, without sufficient electron donating groups. Molecules 15 and 20 are shown as examples, and
their embedding similarity is high. Blue outlines mark similar sub-groups between 15 and 20.
• For the second-from-top green section, we again consider molecules in the range 18-26, where they show high similarity
to the highest band in the range 47-63. These share many component structures, for instance thiophenes groups and
derivatives. Molecule 23 is shown as an example, and has a barbituric acid core on one side, an electron withdrawing group,
with methoxy groups on benzene rings on the other side, with a nitrogen atom between benzene rings, contributing to
electron delocalization. The most likely explanation is that similar high-sterics groups have developed similar embeddings
in this case.
• For the bottom-right red section, molecules in the range 47-63 generally cluster together, reflecting the model’s ability
to agree on both structural and property similarity. They tend to have an alternating pattern of electron-donating and
electron-withdrawing groups which can increase the HOMO and provide a more direct pathway for charge transport.
Yellow outlines mark matching and similar groups with molecule 50. In these cases, more than just the thiophene groups share similar or identical structure. The embedding similarities between (50, 52), (50, 57), and (52, 57) are all medium-high.
These insights show how complex molecule structure affects the measured property in this application, and how both
structure and property are captured in the embedding. We hope the analysis provides more insight into how the structural priors in our representation facilitate learning and generalization.
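Figure 17's grid can be reproduced from the final-layer embeddings as follows (a numpy sketch with random stand-in embeddings; the quantile bucketing mirrors the four similarity levels named in the caption):

```python
import numpy as np

def similarity_grid(emb, n_quantiles=4):
    """Pairwise cosine distances between final-layer embeddings, bucketed
    into quantiles (low / medium-low / medium-high / high similarity)."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - e @ e.T                            # cosine distance grid
    edges = np.quantile(dist, np.linspace(0, 1, n_quantiles + 1)[1:-1])
    return dist, np.digitize(dist, edges)           # bucket 0 = most similar

rng = np.random.default_rng(0)
emb = rng.normal(size=(64, 16))                     # 64 stand-in embeddings
dist, buckets = similarity_grid(emb)
print(buckets.shape, buckets.min(), buckets.max())  # (64, 64) 0 3
```

The diagonal is always in the most-similar bucket (each molecule has zero distance to itself), while the quantile edges guarantee all four similarity levels appear in the grid.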


Figure 16. We detail the difference between the expert and heuristic segmentations, highlighting how the heuristics are not sufficiently
capable. For example, the expert segmentation keeps the 3 thiophene rings together in A1, while the heuristic breaks them up. Similarly,
in B1, the expert treats the consecutive rings as one fragment, whereas the heuristic cleaves on bonds connecting them.


Figure 17. There are 64 molecules in this test set indexed from lowest to highest HOMO value. The above grid visualizes the distance
between each pair of molecules as a cosine distance between the final layer embeddings of our model, with darker color representing
lower distance (higher similarity). We use 4 quantiles, and refer to their ranges as low, medium-low, medium-high, and high similarity.

