Integrative AI Framework for Molecular Retrosynthesis via Heterogeneous Knowledge Graphs
4th Thomas Bouetou Bouetou
Ecole Nationale Supérieure Polytechnique
University of Yaounde 1
Yaounde, Cameroon
[email protected]

5th Venkataramana Gadhamshetty
Dept. of Civil and Environmental Engineering
South Dakota School of Mines and Technology
Rapid City, SD, USA
[email protected]

6th Etienne Gnimpieba Z.
Dept. of Biomedical Engineering
University of South Dakota
Vermillion, SD, USA
[email protected]
Abstract—The synthesis of small molecules is a crucial task across multiple scientific domains, including drug discovery, materials science, and sustainable chemistry. As AI and Machine Learning (ML) continue to advance, these technologies offer transformative potential in molecular synthesis, enhancing efficiency and expanding synthetic possibilities. In particular, small molecule synthesis has profound implications for creating novel therapeutic agents, high-performance materials, and greener chemical processes, but a major challenge remains in designing efficient synthetic routes for target molecules. Retrosynthesis, the process of mapping out synthetic pathways by working backwards from a target molecule, represents a vital step; however, traditional retrosynthesis methods often struggle to predict complex or novel transformations. To address this limitation, we introduce AIM-HKR: AI-Driven Molecular Retrosynthesis Using Heterogeneous Knowledge Representations, a model that leverages graph neural network techniques to enhance retrosynthetic predictions. AIM-HKR integrates information from heterogeneous knowledge graphs, capturing the intricate relationships and analogical reasoning required for retrosynthesis. This model generates type-specific embeddings that reflect both network topology and semantic connections across different entity types, such as molecules, reactions, and functional groups. AIM-HKR’s unique capacity to leverage heterogeneous graphs allows it to propose synthetic pathways that extend beyond established precedents, enabling predictions of chemically feasible but previously uncharted transformations. We believe AIM-HKR has the potential to significantly advance molecular synthesis through AI-driven retrosynthesis, establishing a new paradigm in AI-assisted chemistry.

Index Terms—Molecular Retrosynthesis, Graph Neural Networks, Heterogeneous Representations, Small Molecule Synthesis, AI-Assisted Chemistry.

I. INTRODUCTION

The synthesis of small molecules, an essential component in drug discovery, materials science, and sustainable chemistry, is a challenging process that plays a critical role in advancing scientific progress across diverse fields. Designing efficient synthetic pathways for target molecules remains a complex task due to the intricate chemical transformations and reactivity patterns involved. Retrosynthesis, the process of deconstructing a target molecule into simpler precursor structures, is a cornerstone of synthetic chemistry that has traditionally relied on the expertise of chemists and curated databases. However, conventional retrosynthetic methods are often constrained by the scope of known reactions, limiting their ability to propose novel pathways and predict transformations for emerging molecules [7].

Recent advancements in Artificial Intelligence (AI) and Machine Learning (ML) have enabled new approaches to retrosynthesis prediction, leveraging data-driven techniques to enhance the accuracy and diversity of proposed synthetic routes. Among these techniques, graph-based representation learning has shown promising results by capturing complex relationships within molecular structures and reaction networks [8]. By encoding molecular connectivity, functional groups, and reaction conditions in graph form, AI models can analyze the vast chemical space more comprehensively, facilitating the identification of both standard and novel reaction pathways. Integrating these methods with deep learning has thus opened new avenues for generating innovative retrosynthetic routes, extending beyond traditional rule-based frameworks [1].

To further advance retrosynthesis prediction, the integration of heterogeneous knowledge graphs represents a powerful tool, providing models with access to a wide range of chemical data sources. Knowledge graphs constructed from expert-curated databases and chemical literature capture valuable insights into reactivity patterns and functional groups, enhancing the model’s capacity to generalize across diverse chemical reactions and predict plausible transformations [4]. Such graph-based systems serve not only as repositories of chemical knowledge but also as dynamic frameworks for exploring potential synthetic routes that may not have precedent in existing literature.
In this work, we introduce AIM-HKR (figure 1): AI-Driven Molecular Retrosynthesis Using Heterogeneous Knowledge Representations, a novel AI model designed to advance retrosynthesis prediction. AIM-HKR combines graph-based representation learning with deep neural networks to capture complex chemical relationships and generate retrosynthetic pathways for target molecules. By leveraging a comprehensive knowledge graph encompassing a broad spectrum of chemical reactions and reactivity patterns, AIM-HKR provides a robust framework for identifying both known and novel synthetic pathways. This model enables chemists to navigate and explore chemically plausible transformations, ultimately accelerating the synthetic design process and fostering innovation in molecular synthesis.

II. METHOD

Our AI/ML platform, powered by the AIM-HKR model, provides a robust solution for retrosynthesis prediction by leveraging heterogeneous knowledge graph representations and deep learning techniques. The methodology consists of several key phases that allow our model to effectively generate and evaluate retrosynthetic pathways for a given target molecule. These steps combine graph-based representation learning with advanced neural network architectures to capture complex relationships within chemical data, ultimately predicting synthetic routes with high accuracy and efficiency [3] [6].

The first step in our retrosynthesis prediction process is the construction of a hierarchical similarity graph, which serves as the foundation for all subsequent computations. This graph represents the chemical space by encoding molecules, reactions, and functional groups as nodes, while edges represent the relationships between these entities, such as reaction transformations and connectivity patterns. To build this graph, we use expert-curated databases and chemical reaction data, capturing the intricate relationships between molecules, reaction conditions, and reactivity patterns [2], [10]. The construction of the similarity graph is mathematically expressed by the following equation:

$$x_{\mathrm{new}} = x_i + \lambda (x_j - x_i), \tag{1}$$

where $x_i$ and $x_j$ represent molecular embeddings, and $\lambda$ is a scalar factor controlling the interaction strength between molecules $i$ and $j$. This formula updates the node representations in the graph, ensuring that structurally similar molecules are positioned closer within the chemical space.

Once the similarity graph is constructed, we apply a data augmentation technique for graph balancing to ensure a well-represented dataset across various reaction types. This step compensates for the under-representation of certain reaction classes, thus improving the model’s ability to predict less common or novel transformations. By balancing the graph, the model is exposed to a broader spectrum of chemical reactions, enhancing its predictive capabilities. In the next phase, we use HeteroGNN [9] (Heterogeneous Graph Neural Network), a specialized deep learning model designed to process heterogeneous graphs. HeteroGNN generates node sequences that represent potential retrosynthetic pathways. These sequences are derived by analyzing the relationships between molecules, reactions, and functional groups within the graph. The model learns to identify plausible disconnections and synthetic steps by understanding how different entities interact within the chemical space.

The node feature extraction process for a given molecule $v$ is described by the following function, where $C_v$ denotes the set of neighboring nodes (i.e., reactions involving molecule $v$):

$$f_1(v) = \frac{1}{|C_v|} \sum_{i \in C_v} \left[ \overrightarrow{\mathrm{LSTM}}_{\theta_x}(x_i) \oplus \overleftarrow{\mathrm{LSTM}}_{\theta_x}(x_i) \right], \tag{2}$$

where $x_i \in \mathbb{R}^{d_f \times 1}$ is the feature representation of the $i$-th content in $C_v$, and $d_f$ is the content feature dimension.

Following the generation of node sequences, we extract embeddings that capture the most relevant chemical information. These embeddings serve as compact, yet informative representations of the retrosynthetic pathways. Each embedding encodes the relationships between the molecules and reactions involved in the synthetic process, providing a foundation for downstream tasks such as pathway ranking and selection. In the final step, we rank the predicted retrosynthetic pathways based on their feasibility and synthetic accessibility using XGBoost [5], a powerful gradient boosting algorithm. The embeddings extracted from the node sequences are used as input features to XGBoost, which classifies and ranks each potential pathway based on its likelihood of successful execution. Factors such as reaction success rates, availability of reagents, and compatibility with reaction conditions are integrated into the model’s ranking function. The objective function for XGBoost is optimized as:

$$L = -\frac{1}{|E|} \sum_{(u,v) \in E} \left[ y_{u,v} \log \hat{y}_{u,v} + (1 - y_{u,v}) \log(1 - \hat{y}_{u,v}) \right], \tag{3}$$

where $E$ is the set of edges in the graph. The model parameters are then updated using the Adam optimization algorithm to iteratively minimize this loss function.

The AIM-HKR model, supported by our AI/ML platform, offers a powerful tool for predicting retrosynthetic pathways. By combining graph-based representation learning, heterogeneous knowledge graphs, and advanced machine learning algorithms, we are able to provide chemists with multiple, innovative synthetic routes for a target molecule. The model’s ability to predict both known and novel reactions significantly accelerates the synthetic design process, enabling the rapid development of new therapeutic agents, materials, and sustainable chemical processes.

Fig. 1. Workflow of the AIM-HKR Model for AI-Assisted Retro-Synthesis.

III. RESULTS AND DISCUSSION

In this section, we present the evaluation results of our AIM-HKR model for retrosynthesis prediction. The evaluation framework consists of several key performance metrics, including loss, accuracy, F1-score, AUROC (Area Under the Receiver Operating Characteristic Curve), and AUPR (Area Under the Precision-Recall Curve). These metrics were computed on a diverse set of challenging target molecules representing a wide range of complexities in chemical synthesis.

We first compared the predicted retrosynthetic pathways generated by the AIM-HKR model with known literature routes and expert-curated databases. Our model demonstrated high accuracy in reproducing established synthetic routes, with a substantial overlap in predicted pathways. In addition, we found that our model was capable of proposing novel retrosynthetic strategies, highlighting its potential for discovering innovative synthetic routes. To quantitatively assess the performance of the AIM-HKR model, we computed several evaluation metrics, and the results, shown in Table I, demonstrate that the model performs well across all evaluation criteria. The loss value of 0.35 indicates that the AIM-HKR model’s predictions are relatively close to the ground truth, suggesting effective learning and minimal error
in the retrosynthetic pathway predictions. A low loss value is essential as it reflects how well the model is performing in terms of minimizing prediction error during training. The model achieved an accuracy of 88%, which signifies that 88% of the predicted retrosynthetic pathways matched the known literature or expert-curated synthesis routes. This high accuracy demonstrates the model’s reliability in predicting correct pathways and reflects its general robustness in handling diverse chemical synthesis challenges.

TABLE I
EVALUATION RESULTS FOR AIM-HKR MODEL

Metric      Result
Loss        0.35
Accuracy    0.88
F1-score    0.85
AUROC       0.91
AUPR        0.89

The F1-score of 0.85 highlights the balance between precision and recall, showing that the model is highly effective at both identifying relevant retrosynthetic routes (precision) and capturing as many true routes as possible (recall). A high F1-score indicates that the AIM-HKR model excels in both generating correct predictions and avoiding false positives. With an AUROC value of 0.91, the model exhibits a strong ability to distinguish between true positive and false positive retrosynthetic pathways. The closer the AUROC value is to 1, the better the model is at classifying correct pathways. This value indicates that our model is effective in distinguishing between successful and unsuccessful pathways. The AUPR value of 0.89 further confirms the model’s capability to generate meaningful and accurate retrosynthetic pathways. A high AUPR score suggests that the model not only classifies correctly but also ranks the correct pathways higher than incorrect ones, thereby improving the overall quality of the synthesized routes. Additionally, we evaluated the usability and scalability of our AI/ML platform, which integrates the AIM-HKR model. Domain experts interacted with the platform and provided feedback on its user interface, ease of navigation, and overall usability. Their input confirmed that the platform meets the needs of chemists and researchers, offering a seamless and intuitive experience for exploring retrosynthetic pathways. Moreover, the platform was able to handle large-scale datasets efficiently, demonstrating its scalability and computational performance.

The results confirm that the AIM-HKR model, combined with the AI/ML platform, offers a powerful tool for assisting chemists in designing efficient and novel synthetic routes. The model’s ability to accurately predict retrosynthetic pathways, coupled with its potential to explore new synthesis strategies, presents a significant advancement in the field of chemical synthesis and retrosynthesis prediction.

IV. CONCLUSION

In this paper, we presented the AIM-HKR model, an advanced AI/ML-based solution for predicting retrosynthetic pathways. By leveraging a novel approach that integrates graph-based learning, deep neural networks, and expert-curated chemical knowledge, the AIM-HKR model is able to generate accurate and innovative retrosynthetic routes for a diverse range of target molecules. Our evaluation framework, which includes metrics such as loss, accuracy, F1-score, AUROC, and AUPR, demonstrated the model’s strong performance in generating high-quality synthetic routes and its ability to distinguish between correct and incorrect pathways.

The results confirm that AIM-HKR, combined with its user-friendly AI/ML platform, offers a powerful tool for researchers and chemists in the field of chemical synthesis. By providing both reliable and novel synthetic strategies, AIM-HKR has the potential to significantly accelerate the drug discovery process, material design, and other chemical innovation fields. The platform’s scalability and efficient handling of large datasets further enhance its utility in real-world applications. While there is room for improvement, particularly in the inclusion of more diverse chemical data and advanced ranking techniques, the current results position AIM-HKR as a promising solution for modern retrosynthesis prediction. Overall, our work represents a significant step toward the integration of AI and machine learning in the chemical synthesis field. The AIM-HKR model not only paves the way for more efficient synthesis design but also offers new opportunities for innovation in chemistry and beyond.

Acknowledgment

This work was supported by the National Science Foundation [NSF OIA-1849206, OIA-1920954]; and the National Institutes of Health [5P20GM103443-20].

REFERENCES

[1] Maryam Arabi, Abbas Ostovan, Jinhua Li, Xiaoyan Wang, Zhiyang Zhang, Jaebum Choo, and Lingxin Chen. Molecular imprinting: green perspectives and strategies. Advanced Materials, 33(30):2100543, 2021.
[2] Jingxin Dong, Mingyi Zhao, Yuansheng Liu, Yansen Su, and Xiangxiang Zeng. Deep learning in retrosynthesis planning: datasets, models and tools. Briefings in Bioinformatics, 23(1):bbab391, 2022.
[3] Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. arXiv preprint arXiv:2306.08018, 2023.
[4] Yinjie Jiang, Yemin Yu, Ming Kong, Yu Mei, Luotian Yuan, Zhengxing Huang, Kun Kuang, Zhihua Wang, Huaxiu Yao, James Zou, et al. Artificial intelligence for retrosynthesis prediction. Engineering, 25:32–50, 2023.
[5] George Obaido, Ibomoiye Domor Mienye, Oluwaseun F Egbelowo, Ikiomoye Douglas Emmanuel, Adeola Ogunleye, Blessing Ogbuokiri, Pere Mienye, and Kehinde Aruleba. Supervised machine learning in drug discovery and development: Algorithms, applications, challenges, and prospects. Machine Learning with Applications, 17:100576, 2024.
[6] Wenjia Qian, Xiaorui Wang, Yu Kang, Peichen Pan, Tingjun Hou, and Chang-Yu Hsieh. A general model for predicting enzyme functions based on enzymatic reactions. Journal of Cheminformatics, 16(1):38, 2024.
[7] Ajay Vikram Singh, Daniel Rosenkranz, Mohammad Hasan Dad Ansari, Rishabh Singh, Anurag Kanase, Shubham Pratap Singh, Blair Johnston, Jutta Tentschert, Peter Laux, and Andreas Luch. Artificial intelligence and machine learning empower advanced biomedical material design to toxicity prediction. Advanced Intelligent Systems, 2(12):2000084, 2020.
[8] Hongyu Tu, Shantam Shorewala, Tengfei Ma, and Veronika Thost. Retrosynthesis prediction revisited. In NeurIPS 2022 AI for Science: Progress and Promises, 2022.
[9] Finn Womack, Jason McClelland, and David Koslicki. Leveraging distributed biomedical knowledge sources to discover novel uses for known drugs. bioRxiv, page 765305, 2019.
[10] Tarid Wongvorachan, Surina He, and Okan Bulut. A comparison of undersampling, oversampling, and smote methods for dealing with imbalanced classification in educational data mining. Information, 14(1):54, 2023.
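
As a supplementary illustration of Section II, the following minimal Python sketch evaluates the embedding update of Eq. (1) and the cross-entropy objective of Eq. (3) on toy inputs. The function names, the toy vectors, and the clipping constant `eps` are assumptions made for illustration only. This is not the AIM-HKR implementation, and it does not cover the bidirectional LSTM aggregation of Eq. (2) or the XGBoost ranking step.

```python
import math
from typing import List, Sequence


def interpolate_embedding(x_i: Sequence[float],
                          x_j: Sequence[float],
                          lam: float) -> List[float]:
    """Eq. (1): x_new = x_i + lam * (x_j - x_i), applied element-wise."""
    if len(x_i) != len(x_j):
        raise ValueError("embeddings must have the same dimension")
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]


def edge_cross_entropy(labels: Sequence[int],
                       probs: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Eq. (3): mean binary cross-entropy over the edges (u, v) in E.

    labels[k] plays the role of y_{u,v} and probs[k] of y-hat_{u,v}
    for the k-th edge. Probabilities are clipped to [eps, 1 - eps]
    (an assumption of this sketch) to keep the logarithms finite.
    """
    if len(labels) != len(probs) or len(labels) == 0:
        raise ValueError("labels and probs must be non-empty and equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Toy example (hypothetical values): two 2-dimensional embeddings.
x_new = interpolate_embedding([0.0, 2.0], [4.0, 6.0], lam=0.25)
print(x_new)  # [1.0, 3.0]

# Two edges, one positive and one negative, both predicted at 0.5.
loss = edge_cross_entropy([1, 0], [0.5, 0.5])
print(round(loss, 4))  # 0.6931, i.e. log(2)
```

With λ = 0 the update returns x_i unchanged and with λ = 1 it returns x_j, so intermediate values place the new point on the segment between the two embeddings.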