Representing Molecules As Random Walks Over Interpretable Grammars
Michael Sun 1 Minghao Guo 1 Weize Yuan 2 Veronika Thost 3 Crystal Elaine Owens 1
Aristotle Franklin Grosz 4 Sharvaa Selvan 5 Katelyn Zhou 6 Hassan Mohiuddin 5 Benjamin J Pedretti 4
Zachary P Smith 4 Jie Chen 3 Wojciech Matusik 1
¹MIT CSAIL ²MIT Chemistry ³MIT-IBM Watson AI Lab, IBM Research ⁴MIT Chemical Engineering ⁵MIT ⁶Wellesley. Correspondence to: Michael Sun <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Recent research in molecular discovery has primarily been devoted to small, drug-like molecules, leaving many similarly important applications in material design without adequate technology. These applications often rely on more complex molecular structures with fewer examples that are carefully designed using known substructures. We propose a data-efficient and interpretable model for representing and reasoning over such molecules in terms of graph grammars that explicitly describe the hierarchical design space featuring motifs to be the design basis. We present a novel representation in the form of random walks over the design space, which facilitates both molecule generation and property prediction. We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method’s chemical interpretability. Code is available at https://github.com/shiningsunnyday/polymer_walk.

1. Introduction

Property-driven molecular discovery represents a challenging application with great potential benefits for society, and this is reflected in the large amount of research conducted in the machine learning community on this topic in recent years (Sawlani, 2024). Yet, most of the research focuses on small, drug-like molecules, while many classes of more complex molecules have been largely neglected. Materials designed for applications such as gas-separation membranes or photovoltaics, which are critical for a sustainable future, often have specific distributions of molecule structure that [...] In addition, the specificity of the designs and use cases, and the considerable cost of practical experiments, make it often a scenario that is scarce in both data and labels; for example, datasets of ≈ 300 molecules or fewer are not uncommon (Wang et al., 2018; Lopez et al., 2016; Helma et al., 2001). As a consequence, materials science has not yet fully exploited the potential of machine learning methods (Karande et al., 2022; Wang & Wu, 2023). We focus on such challenging datasets that feature complex molecules containing functional groups and structural motifs which are applied in multiple diverse, real-world application scenarios.

Our goal is to represent and reason about molecules in a data-efficient and interpretable way. Domain-specific datasets typically exhibit distinct motifs and functional groups, which serve as structural priors in our molecular representation. Previous works show that structural priors are highly advantageous for applications that require data efficiency (Rogers & Hahn, 2010; Xia et al., 2023a; Shui & Karypis, 2020; Jiang et al., 2022; Yang et al., 2022). We propose a novel approach to molecular discovery that is tailored to more complex molecules and low-data scenarios and builds upon the above insights. The idea is to start from a set of expert-defined motifs¹ and learn a context-sensitive grammar over the space of motifs. The novelty of this work lies in our representation and learning of this grammar.

¹ Note that our method works with any given set of motifs (e.g., we can apply one of the more simple algorithms used in existing works), but our evaluation shows that certain applications benefit from high-quality domain knowledge.

We define a motif graph – a hierarchical abstraction of the molecular design space induced by the given data, where each node is a motif and each edge represents a possible attachment between a pair of motifs. Our main technical contribution is an efficient and interpretable parameterization over the context-sensitive grammar induced by the design space, and the description of a molecule as a random walk of context-sensitive transition rules. Our representation of molecules combines the quality of representation learning with the interpretability of a rule-based grammar.

In terms of quality, we demonstrate our grammar representation suits applications characterized by designer molecules. We select datasets that reflect real-world settings of experimentally curated designs of molecules with complex, modular sub-structures characterized by functional groups known or hypothesized to yield high target properties.

In terms of interpretability, our grammar representation is special in two ways. As an indirect consequence of supervised learning, our model produces visually discernible clusters according to distinctive structural features within the dataset. More importantly, our compact, context-sensitive grammar allows for discovering design rules that reveal the design principles used during the creation of the dataset.

• Our method largely outperforms pretrained and traditional methods for molecular property prediction. It is competitive with a state-of-the-art graph grammar system for chemistry (Guo et al., 2023b) in terms of quality while being an order of magnitude more runtime efficient.
• Our method’s interpretable representations reveal deeper insights into relationships implicit in the data, explain the model’s reasoning, and lead to novel scientific insights.
• Our method produces promising molecule generations, in particular, producing diverse designs that are synthesizable at a significantly higher rate than the state-of-the-art data-efficient generative model, DEG (Guo et al., 2023a).
• Finally, made possible by our method’s interpretability, our approach enables close collaboration with domain experts. In particular, we devised and executed feasible, practical, and semi-automated workflows with experts for fragmenting molecules, constructing the design space, and interpreting the results.

2. Related Works

Motif-based molecular property prediction. ECFP embeddings (Rogers & Hahn, 2010), which capture relevant ego-graphs present in a molecule in bit vectors, represent a motif-based encoding. ECFP embeddings in combination with simple predictors (e.g., XGBoost) have been competitive on small datasets (Xia et al., 2023a). In our evaluation, we show that our model is similarly data-efficient but delivers a better predictive performance, owing to the use of graph-based representations. In light of the good performance of ECFPs, it is not surprising that the recently developed subgraph graph neural networks (GNNs) report competitive performance in molecular property prediction when using ego-graphs as subgraphs (Frasca et al., 2022); we consider ESAN (Bevilacqua et al., 2022) in our evaluation. However, existing models usually apply subgraphs rooted at all individual nodes rather than a set of more coarse-grained, potentially complex, domain-specific subgraphs. Other recent work that integrates motifs to improve out-of-distribution detection similarly lacks this dimension of modeling (Yang et al., 2022).

A few closely related works have recently proposed molecular graph representations where the relations between motifs are explicitly represented, together with corresponding models (Shui & Karypis, 2020; Jiang et al., 2022). Our work is different from theirs in two aspects. First, we show that commonly used automatic approaches for motif extraction are not sufficient for property prediction over several kinds of more complex molecules, and that custom motifs given by domain experts yield better performance. It allows for biasing the model towards known structure-activity relationships or the expert’s hypotheses (e.g., fragments known or assumed to be critical for the property under consideration). Second, to the best of our knowledge, their motif graph representations do not model the context sensitivity explicitly (e.g., HM-GNN’s motif graph (Shui & Karypis, 2020) connects two motifs based on co-occurrence in a molecule only).

Molecule representation by grammars. Recent work has shown that such grammars represent a data-efficient way for representing molecules and yield SOTA results (Guo et al., 2023a;b). In a nutshell, this is achieved by explicitly representing the training data’s design space in terms of learnt motifs, in the form of a graph grammar. Grammars naturally allow for generating novel molecules in the given design space. Yet, obtaining production rules involves either manual definition (Krenn et al., 2020; Guo et al., 2022; Nigam et al., 2021) or a significant complexity to automatically learn (Guo et al., 2023a; Kajino, 2019), where the training times for downstream tasks are considerable (see Figure 5). Further, the learnt substructures sometimes lack a chemical interpretation, and grammar derivations often produce chemically invalid structures (Guo et al., 2023a), so the natural potential of symbolic methods for interpretability and validity is lost, although such elements are critical for expert validation and for gaining scientific insights. We propose a novel way for representing and learning such context-sensitive grammars, over a design space informed by chemical motifs. This approach results in order-of-magnitude differences in runtime and enhances chemical interpretability.

Other works for molecular representation learning. There are various other non-motif based approaches that we compare to in our evaluation, including (pre-trained) GNNs (Hu et al., 2020; Xia et al., 2023b), motif-based pre-training approaches designed for semi- or unsupervised learning (Xia et al., 2023b), and molecular few-shot learning including the SOTA, which relies on modeling the domain expert’s reasoning in terms of related molecule contexts using associative memories (Schimunek et al., 2023). Central to our method is the connection between random walks and graph diffusion, established methods that have been particularly effective to model graph structures through physics-inspired processes (Thanou et al., 2017). Other related works and more detailed discussions can be found in Appendix C.
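As a rough illustration of the connection between random walks and graph diffusion mentioned in the related work, the sketch below builds the row-stochastic transition matrix P = D⁻¹A of a small undirected graph and propagates a probability distribution with x_{t+1} = x_t P. The three-node path graph and the function names `transition_matrix` and `diffuse` are assumptions made for illustration only; they are not taken from the paper's model.

```python
def transition_matrix(adj):
    """Row-normalize an adjacency matrix: P[i][j] = A[i][j] / degree(i)."""
    P = []
    for row in adj:
        deg = sum(row)
        P.append([a / deg if deg else 0.0 for a in row])
    return P


def diffuse(x, P, steps):
    """Propagate a distribution x over the nodes for a number of random-walk steps."""
    n = len(P)
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x


# Toy path graph 0 - 1 - 2 (an assumption, not a motif graph from the paper).
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
P = transition_matrix(adj)
x1 = diffuse([1.0, 0.0, 0.0], P, 1)  # all mass moves to node 1
x2 = diffuse([1.0, 0.0, 0.0], P, 2)  # mass splits evenly between nodes 0 and 2
print(x1, x2)
```

Each step of `diffuse` is one step of a random walker whose position is described by a distribution, which is the discrete-time view of diffusion on the graph.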
Figure 1. Illustration of our random walk representation: (a) (top) molecule M , number 33 (middle) HM as a connected subgraph of G
(bottom) ĤM as a random walk over HM ; (b) the motif graph G, each node is a motif v that contains both the molecular fragment vB
(black molecule sections) and the contexts for attachment (vR , red molecule sections), each gray line indicates a possible attachment
between nodes. Directed edges of ĤM use the same color as the dashed border of the corresponding figure of M ; (c) (top) demonstration
of motif matching criteria eq 1-4 (183 ↔ 5), another example is in Fig. 11 (bottom) two more examples of HM .
3.1. The Molecular Design Space as Derivations of a Context-Sensitive Grammar Over Motif Graph

We now define our context-sensitive grammar over $G$. We use the notations defined in the previous section to enumerate the set of production rules, $P$, in our grammar. There is one initial rule $p_v \in P$ for each motif $v$ in $G$, where the LHS is $X$, and the RHS is the molecular graph $g_v$ with $u_B$ being the base atoms and $\{(u_{r_l})\}$ being the red atom sets that become “options” for attachment. Then, there is exactly one production rule $p_{u,v,l_1,l_2} \in P$ for each edge $(u, v, e_{l_1,l_2}) \in G$. This edge was attributed with $(u_{r_{l_1}}, v_{r_{l_2}}, b_1, b_2)$ during the construction of $G$. The application of the production rule then equates to attaching the fragment of $v$ to the fragment of $u$, at the attachment options keyed with $l_1, l_2$. In the language of graph grammars, the context of this production rule is hence the molecular graph $g_u(u_B \cup u_{r_{l_1}})$, with the requirement that the matched atoms for $u_{r_{l_1}}$ are red. Applying this production rule replaces the matched atoms for $u_{r_{l_1}}$ within the LHS by $g_v(N(g_v) \setminus v_{r_{l_2}})$, where the red atom sets $\{v_{r_l} \mid v_{r_l} \cap v_{r_{l_2}} \neq \emptyset\}$ in $v$ are introduced as new options for attachment in the RHS. The random walk characterization arises out of the fact that if the LHS molecule contains the context $g_u(u_B \cup u_{r_{l_1}})$, any edge $(u, v, e_{l_1,l_2}) \in E$ can be traversed, possibly including self-loops and parallel edges since $G$ is a directed multidigraph.

3.2. Molecules as Random Walks in the Design Space

Intuitively, our representation of a molecule $M$ captures a derivation in the above-defined context-sensitive grammar. While prior work has modeled such derivations in large and complex tree structures (e.g., with auxiliary nodes for partial derivations) (Guo et al., 2023a;b), we model it compactly in terms of a random walk over the bidirectionally connected subgraph $H_M = (V_M, E_M)$ of $G$ given by the fragmentation of $M$³; see Fig. 1 (a). Observe that $G$ is a strong prior for constraining the design space and sufficient for describing the molecular structure of $M$, but $H_M$ misses the global distribution of which it is a sample.

Our learnable component models this distribution and, at the same time, captures the features that characterize a specific molecule in terms of a random walk. More specifically, our final molecule representation is a directed-acyclic multigraph $\hat{H}_M = (V_M, \hat{E}_M, w_M)$ that linearizes $H_M$ into a random walk such that (1) $\hat{E}_M \subseteq E_M$, (2) $\hat{H}_M$ remains connected, and (3) there is an Euler path⁴ (i.e., each edge is used exactly once) $v_0, v_1, \ldots, v_\ell$ over $(V_M, \hat{E}_M)$ with $\hat{E}_M := \cup_i \{(v_i, v_{i+1})\}$; this path can be generated via a pre-order traversal that adds a reversed duplicate of the sub-trajectory when the stack contracts. The last compo-

³ Refer to Appendix B.3 for how and why we augment $G$ with duplicates of the same motif.
⁴ In the case of monomers, the Euler path needs to be closed as monomers have the property of self-loops.
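The linearization in Section 3.2 (a pre-order traversal that adds a reversed duplicate of the sub-trajectory when the stack contracts) can be sketched as below. This is one possible reading under stated assumptions, not the authors' implementation: the subgraph of a molecule is given as a plain adjacency dictionary, the node labels are arbitrary integers, and the function name `linearize` is hypothetical. The sketch always walks back to the start node, so it produces a closed walk.

```python
def linearize(adj, start):
    """Return a walk v_0, ..., v_l over the graph `adj` (node -> list of neighbours).

    Forward steps follow a pre-order depth-first traversal; each time the
    recursion unwinds, the reversed edge is appended, so every directed edge
    in the resulting walk is used exactly once.
    """
    walk = [start]
    visited = {start}

    def visit(u):
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                walk.append(v)   # forward edge (u, v)
                visit(v)
                walk.append(u)   # reversed duplicate (v, u) on backtracking

    visit(start)
    return walk


# Hypothetical subgraph: node 0 attached to 1 and 3, node 1 attached to 2.
adj = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}
walk = linearize(adj, 0)
edges = list(zip(walk, walk[1:]))
print(walk)  # [0, 1, 2, 1, 0, 3, 0]
```

Because only edges of the input graph are ever traversed, the edge set of the walk is a subset of the original edge set, and the walk touches every node reachable from the start. Dropping the trailing backtracking steps would turn the closed walk into an open one.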
Figure 2. Illustration of our generation procedure: (t=1) our learnable grammar parameterized by Φ samples a state transition 56 → 9;
(t=2) with the memory of having visited {56}, our grammar samples a state transition → 71; (t=10) (bottom) our grammar samples a final
transition 5, which determines the molecular structure (top); our program’s notation is 56 → 9 → 71[→ 70 → 5] → 70 : 1 → 5 : 1
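The generation steps in Figure 2 can be loosely sketched as a random walk whose transition probabilities depend on the set of motifs already visited. How Φ parameterizes the transitions is not specified in this excerpt, so the sketch uses a stand-in: hypothetical fixed edge weights, multiplied by a hypothetical `revisit_penalty` for motifs that are already in memory. The graph, the weights, and the function name `sample_walk` are all assumptions for illustration.

```python
import random


def sample_walk(edges, start, max_steps, revisit_penalty=0.1, seed=0):
    """Sample a walk over a motif graph given as {motif: {next_motif: weight}}.

    Transitions are restricted to existing edges. Already visited motifs are
    down-weighted by `revisit_penalty`, which must stay positive so that the
    weights never sum to zero.
    """
    rng = random.Random(seed)
    walk = [start]
    visited = {start}
    for _ in range(max_steps):
        options = edges.get(walk[-1], {})
        if not options:
            break  # no attachment possible from the current motif
        targets = list(options)
        weights = [
            options[v] * (revisit_penalty if v in visited else 1.0)
            for v in targets
        ]
        nxt = rng.choices(targets, weights=weights, k=1)[0]
        walk.append(nxt)
        visited.add(nxt)
    return walk


# Hypothetical motif graph with three motifs and made-up weights.
motif_edges = {
    "A": {"B": 1.0, "C": 2.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0},
}
walk = sample_walk(motif_edges, "A", max_steps=5)
print(walk)
```

In a learned model the weights would be produced by the parameters rather than fixed, but the structure of the loop (restrict to outgoing edges, condition on memory, sample, update memory) stays the same.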
design of organic solar cells, with detailed information pertinent to organic photovoltaic performance metrics. The molecules contain motifs which are among the most significant functional groups for conducting/electroactive materials (Swager, 2017) and photovoltaic properties (Yuan et al., 2022). We extracted motifs important for high HOMO values and enhanced electron delocalization, which are critical for photovoltaic efficiency; see Appendix G for details.

Predictive Toxicology Challenge (PTC) (Helma et al., 2001). 344 small chemical compounds characterized by very distinct functional groups known for their carcinogenic properties or liver toxicity (Miller et al., 1949), with reported values for rats. We specifically segmented it into functional groups that contribute most to the compounds’ toxicity (Hughes et al., 2015).

Examples from each dataset are shown in Figure 3.

Baselines. To address question 1), we compare with pretrained GNNs (PN (Stanley et al., 2021) and Pre-trained GIN (Hu et al., 2020)), a specialized GNN model for property prediction (wD-MPNN (Aldeghi & Coley, 2022)), two SOTA pretrained models for molecular representation learning (MolCLR (Wang et al., 2022) and UniMol (Zhou et al., 2023)) and two SOTA subgraph-based methods (ESAN (Bevilacqua et al., 2022) and HM-GNN (Shui & Karypis, 2020)). To address question 2), we compare with both Geo-DEG, the SOTA on small dataset property prediction, and its generative variant, DEG, for molecular generation.

4.2. Results

We report the mean absolute error (MAE) and coefficient of determination (R2) over normalized prediction values for GC and HOPV, and the accuracy and AUC for PTC. For each (dataset, property) pair, we perform an 80-20 train-test split over 3 random seeds and report the mean and standard deviation. For molecular generation, we report commonly used metrics (Polykovskiy et al., 2020; Guo et al., 2023a): Validity/Uniqueness/Novelty: Percentage of chemically valid/unique/novel molecules; Diversity: Average pairwise molecular distance among generated molecules; Retro* Score (RS): Success rate of Retro* (Chen et al., 2020) which was trained to find a retrosynthesis path to build a molecule from a list of commercially available ones. We add the metric of Membership, which tests whether certain motif(s) characteristic of membership to the chemical class are present, primarily as a sanity check. Our method, by design, can achieve 100% if the random walk initializes at the characteristic motif. See A.1 for further discussion.

4.2.1. Property Prediction

To answer question 1), we see in Table 2 that our method, with expert motifs, achieves the best R2 by a wide margin of 0.10 and 0.06 over the second best method on regression datasets GC and HOPV and the highest accuracy on PTC. With heuristic motifs, our method remains competitive with Geo-DEG, achieving higher R2 on both regression datasets and accuracy within standard deviation on PTC. In-
Table 2. Results on property prediction (best result bolded, second-best underlined). The datasets we include have expert-annotated
motifs. We also report Ours (w/o expert) as an ablation without expert motifs.
Methods PN Pre-trained GIN Ours
wD-MPNN ESAN HM-GNN MolCLR Unimol Geo-DEG Ours
Datasets (finetuned) (finetuned) (w/o expert)
MAE ↓ 0.47 ± 0.09 0.51 ± 0.06 0.34 ± 0.12 0.76 ± 0.30 0.68 ± 0.05 0.26 ± 0.10 0.38 ± 0.13 0.26 ± 0.11 0.25 ± 0.09 0.27 ± 0.08
Group
R2 ↑ 0.41 ± 0.12 −0.39 ± 0.62 0.56 ± 0.20 −7.56 ± −7.71 0.19 ± 0.09 0.68 ± 0.20 0.47 ± 0.25 0.70 ± 0.20 0.80 ± 0.15 0.74 ± 0.15
MAE ↓ 0.36 ± 0.03 0.37 ± 0.02 0.40 ± 0.02 0.42 ± 0.02 0.38 ± 0.02 0.34 ± 0.03 0.31 ± 0.03 0.30 ± 0.02 0.30 ± 0.05 0.22 ± 0.15
HOPV
R2 ↑ 0.69 ± 0.04 0.66 ± 0.06 0.65 ± 0.05 0.65 ± 0.04 0.66 ± 0.03 0.68 ± 0.03 0.70 ± 0.02 0.74 ± 0.03 0.80 ± 0.06 0.77 ± 0.12
Acc ↑ 0.67 ± 0.06 0.64 ± 0.08 0.66 ± 0.07 0.61 ± 0.08 0.62 ± 0.09 0.60 ± 0.03 0.57 ± 0.05 0.69 ± 0.07 0.70 ± 0.01 0.67 ± 0.02
PTC
AUC ↑ 0.70 ± 0.05 0.68 ± 0.06 0.69 ± 0.06 0.65 ± 0.07 0.66 ± 0.07 0.66 ± 0.05 0.67 ± 0.06 0.71 ± 0.07 0.69 ± 0.03 0.66 ± 0.05
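For readers who want to reproduce the regression metrics reported in Table 2, the following is a minimal sketch of MAE and the coefficient of determination R2, written in plain Python. The toy values are made up for illustration and are not taken from any dataset in the paper; the normalization applied to the prediction values is not specified in this excerpt and is therefore left out.

```python
def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and two made-up sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
good = [1.5, 2.0, 2.5, 4.0]
bad = [4.0, 3.0, 2.0, 1.0]

print(mae(y_true, good), r2(y_true, good))  # 0.25 and approximately 0.9
print(r2(y_true, bad))                      # -3.0
```

The second call shows why some baselines in Table 2 have negative R2: the score drops below zero whenever a model's squared error is larger than that of always predicting the mean of the targets.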
Table 4. Ablation study on overfitting and generalization, vs other motif-based baselines, with and w/o expert motifs. Best result is bolded.
HOPV
Ablation                   Train MAE ↓      Train R2 ↑       Test MAE ↓       Test R2 ↑
Bag-of-Motifs              0.014 ± 0.002    0.997 ± 0.001    0.486 ± 0.025    0.489 ± 0.062
Bag-of-Motifs (+expert)    0.011 ± 0.004    1.000 ± 0.000    0.521 ± 0.031    0.446 ± 0.125
HM-GNN                     0.366 ± 0.035    0.686 ± 0.066    0.473 ± 0.019    0.441 ± 0.065
HM-GNN (+expert)           0.201 ± 0.009    0.895 ± 0.019    0.451 ± 0.025    0.408 ± 0.095
Ours (-expert)             0.075 ± 0.003    0.990 ± 0.001    0.288 ± 0.048    0.765 ± 0.146
Ours                       0.045 ± 0.003    0.996 ± 0.001    0.295 ± 0.049    0.796 ± 0.105

PTC
Ablation                   Train Acc ↑      Train AUC ↑      Test Acc ↑       Test AUC ↑
Bag-of-Motifs              0.996 ± 0.000    1.000 ± 0.000    0.529 ± 0.031    0.609 ± 0.031
Bag-of-Motifs (+expert)    0.996 ± 0.000    1.000 ± 0.000    0.581 ± 0.018    0.612 ± 0.029
HM-GNN                     0.915 ± 0.033    0.966 ± 0.016    0.710 ± 0.023    0.678 ± 0.040
HM-GNN (+expert)           0.999 ± 0.002    1.000 ± 0.000    0.681 ± 0.024    0.587 ± 0.075
Ours (-expert)             0.994 ± 0.001    0.999 ± 0.000    0.671 ± 0.020    0.659 ± 0.047
Ours                       0.996 ± 0.000    1.000 ± 0.000    0.705 ± 0.007    0.711 ± 0.018

Group Contribution
Ablation                   Train MAE ↓      Train R2 ↑       Test MAE ↓       Test R2 ↑
Bag-of-Motifs              0.000 ± 0.000    1.000 ± 0.000    0.481 ± 0.174    0.257 ± 0.453
Bag-of-Motifs (+expert)    0.000 ± 0.000    1.000 ± 0.000    0.493 ± 0.143    0.214 ± 0.404
HM-GNN                     0.281 ± 0.064    0.717 ± 0.137    0.362 ± 0.113    0.592 ± 0.202
HM-GNN (+expert)           0.185 ± 0.016    0.926 ± 0.039    0.345 ± 0.149    0.547 ± 0.295
Ours (-expert)             0.044 ± 0.015    0.995 ± 0.004    0.268 ± 0.084    0.738 ± 0.148
Ours                       0.028 ± 0.007    0.998 ± 0.002    0.222 ± 0.079    0.819 ± 0.137
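One simple way to read Table 4 is through the gap between train and test R2, where a larger gap suggests stronger overfitting. The short sketch below computes that gap for three HOPV rows, using the mean values copied from the table. The gap itself is an illustrative quantity chosen here, not a metric the paper reports.

```python
# Mean (train R2, test R2) on HOPV, copied from Table 4.
hopv_r2 = {
    "Bag-of-Motifs": (0.997, 0.489),
    "HM-GNN": (0.686, 0.441),
    "Ours": (0.996, 0.796),
}

# Train-minus-test gap per method (illustrative overfitting indicator).
gaps = {name: train - test for name, (train, test) in hopv_r2.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test R2 gap = {gap:.3f}")
```

With these numbers the gap is roughly 0.200 for Ours, 0.245 for HM-GNN, and 0.508 for Bag-of-Motifs, in line with the table's focus on overfitting and generalization.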
groups, G333, thereby exacerbating its hepatotoxicity. In another example, the molecule with an ammonia group (labeled as [‘G466’]) transforms with an additional ketone group (labeled as [‘G466’, ‘G231’]). Here, the presence of a C=O double bond within an acetamide group is a key contributor to hepatotoxicity.

aid the search for novel molecules with desirable photovoltaic properties. As 2D t-SNE is not a universal way to analyze representations, we also visualize the agreement between embedding similarity and structural similarity using a 64 × 64 grid. This can be found in Appendix G.3, as part of an in-depth case study on HOPV.
Kajino, H. Molecular hypergraph grammar with its application to molecular optimization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3183–3191. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/kajino19a.html.

Karande, P., Gallagher, B., and Han, T. Y.-J. A strategic approach to machine learning for material science: How to tackle real-world challenges and avoid pitfalls. Chemistry of Materials, 2022.

Krenn, M., Häse, F., Nigam, A., Friederich, P., and Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, oct 2020. doi: 10.1088/2632-2153/aba947. URL https://dx.doi.org/10.1088/2632-2153/aba947.

Landrum, G. Rdkit: Open-source cheminformatics software. 2016.

Lee, W. J., Kwak, H. S., Lee, D.-r., Oh, C., Yum, E. K., An, Y., Halls, M. D., and Lee, C.-W. Design and synthesis of novel oxime ester photoinitiators augmented by automated machine learning. Chemistry of Materials, 34(1):116–127, jan 2022. ISSN 0897-4756. doi: 10.1021/acs.chemmater.1c02871.

Leung, L., Kalgutkar, A. S., and Obach, R. S. Metabolic activation in drug-induced liver injury. Drug metabolism reviews, 44(1):18–33, 2012.

Li, J., Wang, J., Zhao, Y., Zhou, P., Carter, J., Li, Z., Waigh, T. A., Lu, J. R., and Xu, H. Surfactant-like peptides: From molecular design to controllable self-assembly with applications. Coordination Chemistry Reviews, 421:213418, 2020. ISSN 0010-8545. doi: https://doi.org/10.1016/j.ccr.2020.213418.

Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. Advances in neural information processing systems, 31, 2018.

Lopez, S. A., Pyzer-Knapp, E. O., Simm, G. N., Lutzow, T., Li, K., Seress, L. R., Hachmann, J., and Aspuru-Guzik, A. The harvard organic photovoltaic dataset. Sci Data, 3, 2016.

Ma and Chen. Unsupervised learning of graph hierarchical abstractions with differentiable coarsening and optimal transport. AAAI, 2021.

Masuda, N., Porter, M. A., and Lambiotte, R. Random walks and diffusion on networks. Physics reports, 716:1–58, 2017.

Miller, J. A., Sapp, R. W., and Miller, E. C. The Carcinogenic Activities of Certain Halogen Derivatives of 4-Dimethylaminoazobenzene in the Rat*. Cancer Research, 9(11):652–660, 1949.

Nigam, A., Pollice, R., Krenn, M., Gomes, G. D. P., and Aspuru-Guzik, A. Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. Chem. Sci., 12(20):7079–7090, April 2021.

Park, J. and Paul, D. R. Correlation and prediction of gas permeability in glassy polymer membrane materials via a modified free volume based group contribution method. Journal of Membrane Science, 125(1):23–39, 1997.

Pemantle, R. A. Random processes with reinforcement. PhD thesis, Massachusetts Institute of Technology, Dept. of Mathematics, 1988.

Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. Molecular sets (moses): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 2020.

Rogers, D. and Hahn, M. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 2010.

Rosvall, M., Esquivel, A. V., Lancichinetti, A., West, J. D., and Lambiotte, R. Memory in network flows and its effects on spreading dynamics and community detection. Nature communications, 5(1):4630, 2014.

Sawlani, N. Drug discovery informatics market set to surge at 10.9 Transparency Market Research, Inc, 2024.

Schimunek, J., Seidl, P., Friedrich, L., Kuhn, D., Rippmann, F., Hochreiter, S., and Klambauer, G. Context-enriched molecule representations improve few-shot drug discovery. 2023. URL https://openreview.net/pdf?id=XrMWUuEevr.

Shui, Z. and Karypis, G. Heterogeneous molecular graph neural networks for predicting molecule properties. ICDM, 2020.

Stanley, M., Bronskill, J. F., Krzysztof Maziarz, H. M., Lanini, J., Segler, M., Schneider, N., and Brockschmidt, M. Fs-mol: A few-shot learning dataset of molecules. NeurIPS, 2021.
Sterling, T. and Irwin, J. J. Zinc 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11):2324–2337, 2015.

Swager, T. M. 50th anniversary perspective: Conducting/semiconducting conjugated polymers. a personal perspective on the past and the future. Macromolecules, 50(13):4867–4886, 2017.

Thanou, D., Dong, X., Kressner, D., and Frossard, P. Learning heat diffusion graphs. IEEE Transactions on Signal and Information Processing over Networks, 3(3):484–499, 2017.

Türel, T., Dağlar, Ö., Eisenreich, F., and Tomović, Ž. Epoxy thermosets designed for chemical recycling. Chemistry – An Asian Journal, 18(15), aug 2023. ISSN 1861-4728. doi: 10.1002/asia.202300373.

Wang, S. and Wu, X. The mechanical performance prediction of steel materials based on random forest. Frontiers in Computing and Intelligent Systems, 2023.

Wang, S., Shi, K., Tripathi, A., Chakraborty, U., Parsons, G. N., and Khan, S. A. Designing intrinsically microporous polymer (pim-1) microfibers with tunable morphology and porosity via controlling solvent/nonsolvent/polymer interactions. ACS Applied Polymer Materials, 2(6):2434–2443, 2020.

Wang, Y., Ma, X., Ghanem, B., Alghunaimi, F., Pinnau, I., and Han, Y. Polymers of intrinsic microporosity for energy-intensive membrane-based gas separations. Materials Today Nano, 3:69–95, 2018.

Wang, Y., Wang, J., Cao, Z., and Farimani, A. B. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence, 2022.

Wu, A. X., Lin, S., Rodriguez, K. M., Benedetti, F. M., Joo, T., Grosz, A. F., Storme, K. R., Roy, N., Syar, D., and Smith, Z. P. Revisiting group contribution theory for estimating fractional free volume of microporous polymer membranes. Journal of Membrane Science, 636, 2021.

Xia, J., Zhang, L., Liu, Y., Gao, Z., Hu, B., Tan, C., Zheng, J., Li, S., and Li, S. Z. Understanding the limitations of deep models for molecular property prediction: Insights and solutions. NeurIPS, 2023a.

Xia, J., Zhu, Y., Du, Y., Liu, Y., and Li, S. A systematic survey of chemical pre-trained models. IJCAI, 2023b.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ICLR, 2019.

Yang, N., Zeng, K., Qitian Wu, X. J., and Yan, J. Learning substructure invariance for out-of-distribution molecular representations. NeurIPS, 2022.

You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in neural information processing systems, 31, 2018.

Yuan, W., Vijayamohanan, H., Luo, S.-X. L., Husted, K., Johnson, J. A., and Swager, T. M. Dynamic polypyrrole core–shell chemomechanical actuators. Chemistry of Materials, 34(7):3013–3019, 2022.

Zhou, G., Gao, Z., Ding, Q., Zheng, H., Xu, H., Wei, Z., Zhang, L., and Ke, G. Uni-mol: A universal 3d molecular representation learning framework. ICLR, 2023.
Table 5. Number of nodes and edges of motif graph G constructed using different annotation strategies
Figure 9. Example segmentations for four molecules in the HOPV dataset. Segmentation locations are marked by the dark blue (teal) line.
Table 6. Segmentation of the molecules (a) to (d) in Figure 9. Bonds to break indicates the chemical bonds to cut to create black fragments, while the black groups and red groups listed for each molecule correspond to one another, respectively.
Structure (a)
Bonds to Break: (12,13) (16,17) (19,20) (27,28) (34,35)
Black Groups                              Red Groups
(1,2,3,4,5,6,7,8,9,10,11,12,44,45)        (13)
(13,14,15,16,43)                          (12,17)
(17,18,19,25,26,27,34,41,42)              (16,20,28,35)
(20,21,22,23,24)                          (19)
(28,29,30,31,32,33)                       (27)
(35,36,37,38,39,40)                       (34)

Structure (b)
Bonds to Break: (10,11) (6,7) (25,26) (29,30)
Black Groups                              Red Groups
(11,12,13,14,15)                          (10)
(7,8,9,10,16,17,18,19,20)                 (11,6)
(1,2,3,4,5,6,21,22,23,24,25,40,41)        (7,26)
(26,27,28,29,35,36,37,38,39)              (25,30)
(30,31,32,33,34)                          (29)

Structure (c)
Bonds to Break: (2,3) (11,10) (12,13) (27,28) (31,32)
Black Groups                              Red Groups
(1,2)                                     (3)
(3,4,5,6,7,8,9,10,24,25,27,36,37)         (2,11)
(28,29,30,31,34,35)                       (27,32)
(11,12)                                   (10,13)
(32,33)                                   (31)
(13,14,15,16,17,18,19,20,21,22,23,24)     (12)

Structure (d)
Bonds to Break: (15,16) (19,20) (26,27)
Black Groups                                   Red Groups
(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,38,39)    (16)
(16,17,18,19,37)                               (15,20)
(20,21,22,23,24,25,26,32,33,34,35,36)          (19,27)
(27,28,29,30,31)                               (26)
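The relationship between the three columns of Table 6 can be sketched in code: cutting the listed bonds splits the molecule into connected fragments (black groups), and the atoms on the far side of each cut bond form the matching red group. The sketch below is an illustration under assumptions, not the authors' annotation tool: the molecule is a made-up six-atom chain, bonds are plain index pairs, and the function name `segment` is hypothetical.

```python
from collections import defaultdict


def segment(bonds, cuts):
    """Split a molecular graph at `cuts`; return (black_atoms, red_atoms) pairs.

    Black atoms are the connected components left after removing the cut
    bonds. Red atoms are the neighbours reached across a cut bond. The sketch
    assumes every cut bond actually separates two fragments.
    """
    cut_set = {frozenset(bond) for bond in cuts}
    adj = defaultdict(set)
    atoms = set()
    for a, b in bonds:
        atoms.update((a, b))
        if frozenset((a, b)) not in cut_set:
            adj[a].add(b)
            adj[b].add(a)

    groups, seen = [], set()
    for start in sorted(atoms):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search over uncut bonds
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adj[u] - component)
        seen |= component
        red = set()
        for a, b in cuts:
            if a in component:
                red.add(b)
            if b in component:
                red.add(a)
        groups.append((sorted(component), sorted(red)))
    return groups


# Made-up linear chain 1-2-3-4-5-6, cut between atoms 2-3 and 4-5.
chain = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
print(segment(chain, cuts=[(2, 3), (4, 5)]))
# [([1, 2], [3]), ([3, 4], [2, 5]), ([5, 6], [4])]
```

The middle fragment ends up with two red atoms, one per cut bond, which mirrors rows of Table 6 such as black group (16,17,18,19,37) paired with red group (15,20).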
Predictive Toxicology Challenge (PTC) (Helma et al., 2001). The small molecules are characterized by distinct functional groups known for their carcinogenic properties or liver toxicity (Miller et al., 1949; Helma et al., 2001). These groups comprise a rich variety of elements such as halides, alkylating agents, epoxides, and furan rings (Figure 3). Therefore, we specifically segmented the dataset into functional groups and sub-structures that contribute most to the toxicity of the compounds (Hughes et al., 2015).
The Harvard organic photovoltaic dataset (HOPV) (Lopez et al., 2016). This dataset (HOPV15) contains a comprehensive collection of experimental photovoltaic data from the literature, coupled with quantum-chemical calculations across various conformers. We segment it with clearly defined, systematically applied criteria for extracting the black groups. Functional groups like vinyl, alcohol, ketone, aldehyde, amine, ester, and amide are separated as individual black fragments. Similarly, distinct black fragments are used for individual rings including benzene, pyrrole, and thiophene, in acknowledgement of their π-orbital electron delocalization. Complex structures with multiple consecutive rings, known for their distinctive HOMO-LUMO bandgaps and electrochemical properties, such as thieno[3,4-b]pyrazine, carbazole, and 2,5-dimethyl-3,6-dioxo-2,3,5,6-tetrahydropyrrolo[3,4-c]pyrrole, are also segmented as individual entities. Moreover, groups of 2-3 consecutive symmetrical thiophene or pyrrole units are kept as a single black group, because these consecutive units sustain the electron cloud delocalization between repeating units, which strongly influences optical and electronic properties including, but not limited to, light absorption, charge transport, and luminescence in photovoltaic applications. This segmentation also makes the results easier to use and to interpret, because predictions are explicitly based on structures already known to be important.
Defining Membership. The Membership metric is reported in Section 4.2.2 after further consultation with experts, who identify
the presence of thiophene as a proxy for Membership to HOPV, and the presence of chloride/bromide halides (a key
indicator of toxicity) for PTC. For both datasets, the Membership metric is only a sanity check that the method can
produce a non-trivial number of characteristic compounds. We justify the criterion for each dataset as follows:
1. A chloroalkane (Cl-C) is the most common motif in the PTC dataset. Yet, it is still not present in a majority of
structures, making the broader class of alkyl halides (Cl-C, Br-C-C) the best choice for a membership criterion. Their
prevalence is attributed to their reactivity and ability to undergo metabolic activation (Leung et al., 2012), leading to
the formation of highly reactive intermediates that can interact with DNA and other cellular components, potentially
initiating carcinogenic processes. Although not all carcinogenic compounds will necessarily contain this class of motifs,
their presence makes carcinogenicity considerably more likely.
2. Thiophene, a five-membered ring containing one sulfur atom, is the most common motif in the HOPV dataset, making it the
best choice for a single-motif membership criterion for HOPV. More broadly, thiophene and its derivatives are arguably the
most common chemical substructure in photovoltaics due to their ability to donate electrons, resulting in particularly high
highest occupied molecular orbital (HOMO) levels, along with stability, tunability of energy levels, and compatibility
with film-forming techniques. While not every suitable organic photovoltaic compound will contain it, the vast majority
will.
In both cases, our method can easily achieve 100% membership with a slight modification to the sampling procedure:
instead of iterating through every possible starting motif node, we always initialize our random walk at the membership
motif. We choose not to modify our sampling procedure, and instead include this metric in Table 3 for completeness, since it
is still a good sanity check for other methods to show they generate a non-trivial fraction of candidates with those motif(s).
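As a minimal sketch of the Membership sanity check described above, the following Python snippet computes the fraction of generated candidates that contain at least one membership motif. It assumes each candidate is already available as a list of motif labels; the function name `membership_rate` and the labels are hypothetical and not taken from our released code. In practice, detecting a motif inside a larger fragment would require substructure matching rather than label comparison.

```python
from typing import Iterable, Sequence, Set


def membership_rate(candidates: Sequence[Iterable[str]],
                    membership_motifs: Set[str]) -> float:
    """Fraction of candidates containing at least one membership motif."""
    if not candidates:
        return 0.0
    hits = sum(1 for motifs in candidates if membership_motifs & set(motifs))
    return hits / len(candidates)


# Hypothetical motif labels, for illustration only.
generated = [
    ["thiophene", "benzene"],
    ["benzene", "ester"],
    ["thiophene"],
    ["pyrrole"],
]
rate = membership_rate(generated, {"thiophene"})
print(rate)  # 0.5: two of the four candidates contain thiophene
```

Initializing every walk at the membership motif would drive this value to 1.0 by construction, which is why the metric is only informative when the sampling procedure is left unmodified.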
Table 7. Context determination rules and examples on datasets Group Contribution, HOPV and PTC.
Dataset Rule Example
visited ← {};
visited[src] ← True;
root ← Node(src, main = (src in longest_path));
Q ← queue([(root, src)]);
while !Q.empty() do
    prev_node, prev ← Q.dequeue();
    for cur in NGi(prev) do
        if visited[cur] then
            continue
        e ← Gi.edges[(prev, cur)];
        e_index ← find_edge(e, G.edges[(prev, cur)]);
        cur_node ← Node(cur, main = (cur in longest_path));
        prev_node.add_child(cur_node, e_index);
        visited[cur] ← True;
        Q.enqueue((cur_node, cur))
Out: root
and choosing a consistent ordering over neighbors (NGi ) to determine the random walk sequence. We elaborate on the
reasoning behind this canonicalization in Appendix B.4. The DAG constraint enables our graph diffusion process to become
a generator of new molecules (as will be discussed in Appendix D), in addition to capturing the distribution of existing ones.
Thus, we specifically ask experts to create segmentations that are acyclic, which they naturally do in nearly all cases anyway.
In the case of monomers, this canonicalization is consistent with the IUPAC nomenclature (IUPAC, 1997) of linearizing a
monomer via its longest (main) chain, where NGi should iterate over neighbor fragments that descend side chains before the
consecutive fragment on the backbone of the main chain. More specifically, src and dest in Algorithm 2 correspond to the
first and last group of the main chain.
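The breadth-first traversal of Algorithm 2 can be sketched in Python as follows. This is an illustrative simplification under stated assumptions: the `Node` class, the `build_tree` function, and the adjacency-list input are hypothetical names, and the edge-index lookup (`find_edge`) is omitted. The order of each neighbor list plays the role of NGi, so listing side-chain fragments before the next backbone fragment yields the canonical order described above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Set


@dataclass
class Node:
    motif: Hashable
    main: bool  # True if the motif lies on the longest (main) chain
    children: List["Node"] = field(default_factory=list)


def build_tree(neighbors: Dict[Hashable, List[Hashable]], src: Hashable,
               longest_path: Set[Hashable]) -> Node:
    """Breadth-first traversal from src; neighbor order fixes the canonical form."""
    visited = {src}
    root = Node(src, src in longest_path)
    queue = deque([(root, src)])
    while queue:
        prev_node, prev = queue.popleft()
        for cur in neighbors[prev]:
            if cur in visited:
                continue
            cur_node = Node(cur, cur in longest_path)
            prev_node.children.append(cur_node)
            visited.add(cur)
            queue.append((cur_node, cur))
    return root


# Toy monomer: main chain A-B-C with a side chain S attached to A.
# The side chain is listed before the backbone neighbor B.
neighbors = {"A": ["S", "B"], "S": ["A"], "B": ["A", "C"], "C": ["B"]}
root = build_tree(neighbors, "A", {"A", "B", "C"})
print([child.motif for child in root.children])  # ['S', 'B']
```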
The Group Contribution dataset includes a compilation of motifs characterized for gas separation, including common
organic chemical functional groups as well as important scaffold functional groups such as Triptycene and its deriva-
tives, dioxin and its derivatives, and N-methylphthalimide and PIM-1 and its derivatives (Wang et al., 2018; 2020).
Figure 10. Following the same structure as 10, we illustrate our random walk representation over three monomers from the literature of
gas-separation membranes; in this setting, the random walk must be an Euler circuit, beginning and ending on the motif that is listed first
according to IUPAC notation.
These functional groups contribute significantly to maintaining the structures and properties of 3D scaffold building
blocks in polymer self-assembly, which in turn play a significant role in gas separation processes, i.e., the separation of
H2, H2/N2, O2, O2/N2, CO2, and CO2/CH4, which are common separation tasks in the oil and gas industry. The steps
we take for compiling this dataset of segmented monomers are as follows:
1. We obtain an established compilation of groups (Park & Paul, 1997; Wu et al., 2021) for microporous polymers.
2. We visually segment the monomers in (Wang et al., 2018) into random walks over the groups identified in Step 1.
3. We collect experimental permeability and separation performances for 114 of the monomers identified in Step 2.
In addition to the motifs used here, the concept of such segmentation arises naturally across other application domains in
chemical design. Within synthetic organic chemistry, molecular design plays a governing role in advancing new technology
(Bronstein et al., 2020). Experts commonly describe the behavior of a molecule or polymer in an application in terms of the
function of key subparts, particularly key functional groups, scaffold structures, and backbone architectures
within a molecule or monomer, and their arrangement relative to each other, rather than atom by atom or in terms of
the molecule as a whole. In chemical design, new molecules can be complex and, when designed by hand in traditional
ways, are built from these relatively modular subcomponents. This approach naturally takes advantage of the physical
laws by which molecules are built by synthesis reactions, where a discrete set of additions and substitutions are allowed to
finally construct a desired target structure. Such methods of chemical design find broader application in drug discovery for
pharmaceuticals, surfactant and detergent design (Blunk et al., 2006; Li et al., 2020), organic semiconductors (Bronstein
et al., 2020), photoinitiators (Lee et al., 2022), and more recyclable plastics (Türel et al., 2023), among other uses, in each of
which chemists fine-tune properties of such components by adjusting the selection and arrangement of these sub-structures,
or otherwise use them as a guide for understanding performance.
By using groups from existing structures and discovering novel ones, researchers can predict performance,
find new uses for existing molecules, discover new molecules, and further optimize structures for better performance.
Machine learning models have been shown to drastically decrease the time and cost of such methods while simultaneously
improving throughput by creating and screening novel structures in a single step and providing researchers with
predictions of target molecules that have higher potential for success, which are then verified by experts. As presented by
Wu et al. (2021), different structural elements and functional groups present in effective drug molecules can be identified
and recombined in new architectures. These novel structures can then be tested using computer models to benchmark likely
efficacy given new targets or modifications to binding sites.
Molecule M = (VM , EM ) is then a rooted subgraph of G′ . In the main text, we refer to G as its augmented version, to
simplify the notation.
25 Out: G
The representation’s completeness is maintained because every possible molecule structure derivable from the motifs
is accounted for in the dual graph. Each pathway through the graph represents a unique assembly sequence of motifs,
translating into a distinct molecular structure. The reduction in complexity arises from the transformation process. By
converting the original graph into its dual form, we reduce the granularity of representation. Instead of representing every
possible molecular structure as a separate node, we represent them as pathways through the dual graph. This approach
significantly decreases the number of nodes and edges required, leading to a more manageable yet complete representation
of the molecular structures.
Figure 11. Following notations of main text, $\{v_1\}_{r_1} = \{1, 9\}$, $\{v_1\}_{r_2} = \{1\}$, $\{v_1\}_{r_3} = \{9\}$, $\{v_1\}_{r_4} = \{5\}$.
$\{v_2\}_{r_1} = \{1, 2, 3, 4, 5, 6\}$, $\{v_2\}_{r_2} = \{12, 13, 14, 15, 16, 17\}$. We annotate $e_{1,1}$, where $b_1 = \{6, 4, 3, 2, 8, 7\}$, $b_2 = \{10, 7\}$
provides the “certificate” of a successful match.
motifs, which brings in the potential benefit of learning better molecule representations based on motif representations that
form the basis of all molecules.
D. Grammar Learning
D.1. Graph Diffusion Strategy
Our strategy is to encode the dataset of walk trajectories by training the parameters of our graph diffusion process to recover
the ground-truth state of a particle being diffused over the motif graph. We use stochastic gradient descent and choose
between a “forcing” approach (where a single particle transitions from one state to another) and a “split” approach (where a
single particle splits its mass equally along the out-edges of its current state). See the pseudocode in Algorithm 6.
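To make the two options concrete, the sketch below contrasts how the particle state could be updated under each. It is an interpretation under stated assumptions rather than the procedure of Algorithm 6: we assume the state is a probability vector over motif-graph nodes, that "forcing" yields a one-hot vector on the next state, and that "split" spreads mass uniformly over out-edges. The function names and the handling of nodes without out-edges are hypothetical.

```python
from typing import List


def forcing_step(num_states: int, next_state: int) -> List[float]:
    """'Forcing': the whole particle moves to a single next state (one-hot)."""
    state = [0.0] * num_states
    state[next_state] = 1.0
    return state


def split_step(adjacency: List[List[int]], current: int) -> List[float]:
    """'Split': the particle's mass is divided equally among out-edges of current."""
    out_neighbors = [j for j, edge in enumerate(adjacency[current]) if edge]
    state = [0.0] * len(adjacency)
    if not out_neighbors:
        # Assumption: with no out-edges, the mass stays where it is.
        state[current] = 1.0
        return state
    for j in out_neighbors:
        state[j] = 1.0 / len(out_neighbors)
    return state


def mse(x: List[float], p: List[float]) -> float:
    """Mean squared error between two state vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, p)) / len(x)


# Toy motif graph with edges 0->1, 0->2 and 1->2.
adjacency = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
forced = forcing_step(3, 1)      # [0.0, 1.0, 0.0]
split = split_step(adjacency, 0)  # [0.0, 0.5, 0.5]
print(forced, split, mse(forced, split))
```

Under this reading, forcing follows one observed trajectory exactly, while split exposes the loss to every out-edge of the current state at once.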
E. Property Prediction
E.1. Graph Neural Network Design Choices
We apply a Graph Isomorphism Network (Xu et al., 2019) with hyperparameters in Table 8. For molecule M with
representation HM , the node-level features include: a) the Morgan fingerprint of the motif vi (dimension 2048), b) the
memory-free weights of its out-edges (dimension |G|), i.e. E[i]. We also concatenate the Morgan fingerprint of M.
Loss ← MSE(xt, pt);
E ← E − dLoss/dE;
W ← W − α ∗ dLoss/dW;
b ← b − α ∗ dLoss/db;
Out: E, W, b
Figure 12. (Left) The raw data of Group Contribution. The edge thickness is proportional to the number of monomers whose random walk
representations traverse the edge. (Right) The learned parameter matrix E after training converges. The grammar both retains essential
nodes and edges and smooths the distribution of edge weights.
Figure 13. We show the weight evolution of the edges incident to G14 on HOPV. (Left) After processing the raw dataset into random
walks, we visualize the empirical distribution of edge traversals. (Middle) After learning our context-sensitive grammar, we plot the prior
edge weights, i.e. the memory-free parameter E. (Right) We plot the transition probabilities starting at G14 during the random walk
generation process.
Figure 14. Generation of a random walk G81 → G82 → G274 → G82:1. The possible transitions from G81, G82 and G274 are in
Red, Green, and Blue (with thickness proportional to probability).
Our implementation in Algorithm 7 distinguishes between revisiting a previous node and adding a new duplicate of
the same motif as a previous node through mask_attach (new nodes which can be attached) and mask_return (the node which
the random walker can backtrack to). This distinction is made by creating duplicates of nodes for each revisit (see B.3).
We guarantee 100% validity since we can explicitly check the possible motifs which can be attached to M at each step.
When there are neither new motifs to attach nor existing motifs to return to, the generation terminates with the current M
being the final generation output. Please refer to the GitHub repository for details of the implementation.
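The interplay between the attach mask, the return mask, and the termination condition can be sketched as follows. This is a simplified illustration and not Algorithm 7: the `generate_walk` function, the `can_attach` validity callback, the fixed positive `return_weight`, and the `max_steps` cap are all hypothetical, and node duplication for revisits is modeled simply by giving every attached motif its own index.

```python
import random
from typing import Callable, Dict, List, Optional

CanAttach = Callable[[List[str], List[Optional[int]], int, str], bool]


def generate_walk(weights: Dict[str, Dict[str, float]], start: str,
                  can_attach: CanAttach, return_weight: float = 1.0,
                  max_steps: int = 20, seed: Optional[int] = None) -> List[str]:
    """Sample a walk; each attached motif becomes a new node instance."""
    rng = random.Random(seed)
    motifs: List[str] = [start]            # motif of each node instance
    parents: List[Optional[int]] = [None]  # instance each node was attached to
    current = 0
    trace = [start]
    for _ in range(max_steps):
        here = motifs[current]
        # Attach mask: new motifs that pass the explicit validity check.
        attach = [m for m, w in weights.get(here, {}).items()
                  if w > 0 and can_attach(motifs, parents, current, m)]
        # Return mask: the single existing node the walker may backtrack to.
        back = [] if parents[current] is None else [parents[current]]
        options = [("attach", m) for m in attach] + [("return", i) for i in back]
        if not options:
            break  # nothing to attach and nowhere to return: generation ends
        probs = [weights[here][m] for m in attach] + [return_weight] * len(back)
        kind, value = rng.choices(options, weights=probs, k=1)[0]
        if kind == "attach":
            motifs.append(value)
            parents.append(current)
            current = len(motifs) - 1
        else:
            current = value
        trace.append(motifs[current])
    return trace


def one_child_only(motifs, parents, current, candidate):
    """Toy validity rule: a node may receive at most one attached motif."""
    return current not in parents


walk = generate_walk({"A": {"B": 1.0}}, "A", one_child_only, seed=0)
print(walk)  # ['A', 'B', 'A']
```

In the toy run, B is attached to A, the walker returns to A because B has no outgoing transitions, and generation then stops because A can neither attach a new motif nor return anywhere.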
Figure 15. Generation process of novel random walks on HOPV: (Left) G262 → G305 → G181 and (Right) G239 → G297 → G202;
The possible transitions from the first and second visited nodes are in Red and Green.
As shown in Figure 15, applying our generation method produces artifacts of learning that invite further scrutiny: “rules”
of consecutive motifs. The second example in Figure 15 shows there are only two possible motifs (green) that can be
attached to the G297 end of a molecule with the G239 and G297 functional groups (ignoring the return back to G239,
which transitions the state but does not attach a new motif). In the first example, the distribution of possible new motifs to
attach to the G305 end of a molecule with G262 and G305 appears more uniform.
Table 9. Under our string-based implementation, A[→B] encodes a random walk trajectory of A→B→A. All rules shown were
verified to correspond to valid molecules that can be constructed following the random walk trajectory.
Trajectory A ⇒ Trajectory B
['G4'] ⇒ ['G4', 'G2']
['G27'] ⇒ ['G27', 'G6']
['G115'] ⇒ ['G115', 'G6']
['G218'] ⇒ ['G218', 'G6']
['G283'] ⇒ ['G283', 'G6']
['G290'] ⇒ ['G290', 'G6']
['G301'] ⇒ ['G301', 'G6']
['G335'] ⇒ ['G335', 'G6']
['G368'] ⇒ ['G368', 'G6']
['G466'] ⇒ ['G466', 'G231']
['G272'] ⇒ ['G272', 'G271']
['G362'] ⇒ ['G362', 'G361']
['G205'] ⇒ ['G205', 'G202']
['G435'] ⇒ ['G435', 'G434']
['G167'] ⇒ ['G167', 'G166']
['G436'] ⇒ ['G436', 'G166']
['G224'] ⇒ ['G224', 'G225']
['G2', 'G4'] ⇒ ['G2[->G4]']
['G202', 'G205'] ⇒ ['G202[->G205]']
['G434', 'G435'] ⇒ ['G434[->G435]']
['G361', 'G362'] ⇒ ['G361[->G362]']
['G333', 'G393'] ⇒ ['G333', 'G393', 'G333:1']
['G224', 'G225', 'G224:1'] ⇒ ['G224', 'G225[->G224:1]']
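The bracket notation in Table 9 can be expanded back into an explicit walk with a few lines of Python. The sketch below assumes each token carries at most one non-nested branch of the form `A[->B]`; the function name `expand_trajectory` is hypothetical and the parser is an illustration, not our released implementation.

```python
import re
from typing import List

# Matches tokens such as 'G2[->G4]' (a single, non-nested branch is assumed).
_BRANCH = re.compile(r"^(?P<head>[^\[\]]+)\[->(?P<branch>[^\[\]]+)\]$")


def expand_trajectory(tokens: List[str]) -> List[str]:
    """Expand every token 'A[->B]' into the walk A, B, A; keep other tokens."""
    walk: List[str] = []
    for token in tokens:
        match = _BRANCH.match(token)
        if match:
            head, branch = match.group("head"), match.group("branch")
            walk.extend([head, branch, head])
        else:
            walk.append(token)
    return walk


print(expand_trajectory(["G2[->G4]"]))
# ['G2', 'G4', 'G2']
print(expand_trajectory(["G224", "G225[->G224:1]"]))
# ['G224', 'G225', 'G224:1', 'G225']
```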
Meanwhile, each red group accompanying a segmented black group is chosen to be either a single atom or, if the black group
is too small (just one or two atoms), the closest conjugated ring. This choice helps reduce redundancy among the red groups
and the computational resources they require.
Figure 16. We detail the difference between the expert and heuristic segmentations, highlighting how the heuristics are not sufficiently
capable. For example, the expert segmentation keeps the 3 thiophene rings together in A1, while the heuristic breaks them up. Similarly,
in B1, the expert treats the consecutive rings as one fragment, whereas the heuristic cleaves on bonds connecting them.
Figure 17. There are 64 molecules in this test set indexed from lowest to highest HOMO value. The above grid visualizes the distance
between each pair of molecules as a cosine distance between the final layer embeddings of our model, with darker color representing
lower distance (higher similarity). We use 4 quantiles, and refer to their ranges as low, medium-low, medium-high, and high similarity.
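The pairwise comparison behind Figure 17 can be sketched as follows: compute the cosine distance between two embedding vectors, then bin distances into four quantile groups. The helper names and the specific way quartile boundaries are chosen are assumptions made for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def similarity_labels(distances: Sequence[float]) -> List[str]:
    """Bin distances into quartiles; the lowest distances are 'high' similarity."""
    labels = ["high", "medium-high", "medium-low", "low"]
    ranked = sorted(distances)
    n = len(ranked)
    # Assumed boundary rule: the value at rank k*n/4 starts the next quartile.
    cuts = [ranked[min(n - 1, (k * n) // 4)] for k in (1, 2, 3)]
    return [labels[sum(d >= c for c in cuts)] for d in distances]


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 for orthogonal vectors
print(similarity_labels([0.1, 0.2, 0.3, 0.4]))
# ['high', 'medium-high', 'medium-low', 'low']
```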