Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map

Andronov, Mikhail; Andronova, Natalia; Wand, Michael; Schmidhuber, Jürgen; Clevert, Djork-Arné

doi:10.1007/978-3-031-72381-0_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14894)

Included in the following conference series:

International Workshop on AI in Drug Discovery

Abstract

The increasing use of machine learning and artificial intelligence in chemical reaction studies demands high-quality reaction data, which in turn requires specialized tools for understanding and curating data. Our work introduces a novel methodology for reaction data examination centered on reagents, essential molecules in reactions that do not contribute atoms to products. We propose an intuitive tool for creating interactive reagent space maps using distributed vector representations, akin to word2vec in Natural Language Processing, that capture the statistics of reagent usage within datasets. Our approach enables swift assessment of reagent action patterns and identification of erroneous reagent entries, which we demonstrate using the USPTO dataset. Our contributions include an open-source web application for visual reagent pattern analysis and a table cataloging around six hundred of the most frequent reagents in USPTO annotated with detailed roles. Our method aims to support organic chemists and cheminformatics experts in routine reaction data curation.

1 Introduction

Over the many years chemical science has existed, chemists have amassed a vast body of knowledge and records about organic chemical reactions. This wealth of information encapsulates hundreds of distinct reaction types determined by a general transformation scheme and the required reagents [13]. As machine learning is cementing its place among widely used approaches to various reaction modeling problems [8, 16], chemists are becoming increasingly concerned with understanding and curating their reaction data.

The most widely known reaction data collections are Reaxys [3], CASREACT [1], the open dataset of reactions from US patents (USPTO) [15], Pistachio [2], and a recent Open Reaction Database (ORD) [10]. Different reaction datasets may have unique particularities and biases [24], and comprehending those in one’s dataset of choice is an advisable prerequisite for any study relying on that dataset. One of the possible sources of reaction data imperfections is reagent information.

A chemical reaction scheme (Fig. 1) typically involves reactants, products, and reagents. A reactant, as defined by IUPAC, is a substance consumed in the course of a chemical reaction. Consequently, reagents are other molecules that enable a reaction but do not contribute atoms to the products. Reagents are commonly written above or below the arrow in a reaction scheme. For example, catalysts and solvents are reagents. However, in practice, for convenience, substances with other roles, such as reducing and oxidizing agents, may also be considered reagents. Reagents may be integral to the mechanism of a reaction or merely improve the reaction rate. Reactions that use the same reagents may often correspond to the same reaction type.

Before building and testing machine learning models on some reaction dataset, it may be beneficial to pay special attention to reagent information within reagent records: verify whether reactant-reagent separation is adequate, or if there are redundant records of the same reagent, or if information about detailed roles of reagents is available. While the creators of popular reaction databases try to address all those issues, the data often needs additional curation. For example, reactant-reagent separation in USPTO is based on atom-atom mapping provided by the Indigo toolkit. The imperfection of this mapping may lead to imperfect separation. Additionally, reagent roles may not be detailed enough: both Reaxys and USPTO feature only three roles (“catalysts”, “solvents”, or “reagents”). When building reagent prediction models, their detailed performance analysis may require access to richer reagent role attribution [4].

In this paper, we describe a simple visual tool that helps to curate reagents in chemical reaction data. This tool is an interactive reagent space map based on distributed vector representations of reagents and served in a web application. We obtain reagent representations using an algorithm equivalent to the well-known word2vec [18] algorithm from Natural Language Processing. These representations reflect the statistics of reagent co-occurrences and interaction in a given reaction data corpus: representations of reagents of the same role tend to cluster together. We demonstrate the application of our tool to the USPTO dataset [15]. Using our tool, we label around six hundred of the most common reagents in USPTO into ten detailed roles, detect reactants erroneously listed as reagents, and ensure that each unique reagent is represented by a single SMILES string.

Our reagent space mapping and the web application work with any reaction dataset, and we are confident that they will benefit organic chemists and cheminformatics specialists working with their own reaction data.

The codebase is open-source and available at https://github.com/Academich/reagent_emb_vis.

2 Results

2.1 Interactive Application

We have built a lightweight web application for the interactive exploration of the USPTO reagent space. Figure 2 demonstrates the application’s appearance. The application displays an interactive UMAP [17] projection of reagent embeddings that capture the statistics of reagent co-occurrences within the dataset. Various filters are available in the application: it is possible to display only reagents with desired indices or roles, and one can also filter reagents by SMARTS patterns. When the user hovers over a point on the map, the corresponding reagent structure and its SMILES appear on the screen. Two versions of the map are available: a flat 2D map and a map on the surface of a sphere.

Users can explore reagents in their own reaction data in the application after carrying out necessary preprocessing described in the repository with the code.

2.2 Properties of Reagent Embeddings

If two reagents tend to occur in similar contexts, i.e., with similar other reagents, then these two reagents are likely alternatives, and they likely receive embeddings that are close to each other. For example, we do not expect two different palladium catalysts for Suzuki coupling to occur in one reaction, but one can use the same bases and solvents with both of them. Therefore, we can expect the map of reagent embeddings to feature role clusters, e.g., a cluster of specific catalysts, ligands, or other reagents. The reagent embeddings are obtained by factorizing the point-wise mutual information (PMI) scores matrix with singular value decomposition (SVD). We can easily derive the table of PMI scores from the table of reagent counts by Eq. 1.

$$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)} \tag{1}$$

where x and y are reagents, P(x, y) is a relative frequency of a reagent pair among all reagent pairs, and P(x) and P(y) are relative frequencies of individual reagents. Reagent vector representations obtained in this manner lie close in the vector space for entities with similar "meanings", which are determined by the "companions" of those entities. This method of obtaining reagent embeddings is equivalent to the word2vec algorithm (see 3.1) if we treat reagent molecules as words and reagents in one reaction as one context. One can freely select reagent embedding dimensionality, and we choose it to be 50 to achieve information compression that forces reagent embeddings into role clusters but with enough degrees of freedom. Other dimensionalities lead to different shapes of the map but do not tend to affect the observed clusters.
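As a rough illustration of Eq. 1, the sketch below computes PMI scores from a few reagent lists in plain Python. The function name `pmi_scores` and the toy reactions are made up for this example and are not taken from the repository. Pairs that never co-occur are simply left out here, which is an assumption of the sketch; how such pairs are handled before factorization is not specified in the text.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pmi_scores(reactions):
    """Return {(x, y): PMI} for every unordered reagent pair seen together."""
    single = Counter()  # occurrences of individual reagents
    pair = Counter()    # occurrences of unordered reagent pairs
    for reagents in reactions:
        unique = sorted(set(reagents))
        single.update(unique)
        pair.update(combinations(unique, 2))
    n_single = sum(single.values())
    n_pair = sum(pair.values())
    scores = {}
    for (x, y), count in pair.items():
        p_xy = count / n_pair        # relative frequency among all pairs
        p_x = single[x] / n_single   # relative frequency of reagent x
        p_y = single[y] / n_single   # relative frequency of reagent y
        scores[(x, y)] = log2(p_xy / (p_x * p_y))
    return scores


# Toy reagent lists (hypothetical, not taken from USPTO).
toy_reactions = [
    ["Pd(PPh3)4", "K2CO3", "dioxane"],
    ["Pd(dppf)Cl2", "K2CO3", "dioxane"],
    ["NaBH4", "MeOH"],
    ["LiAlH4", "THF"],
]
scores = pmi_scores(toy_reactions)
```

In this toy data the two palladium catalysts never appear together, so they share no PMI entry, yet both co-occur with the same base and solvent. This is the kind of shared context that the text expects to pull alternatives close together.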

2.3 Reagent Data Curation

When reagents are displayed in this interactive map (Fig. 2), it is much easier to curate them than when working with just table data. We showcase two use cases when the map facilitates reagent curation: labeling reagents into detailed roles and finding redundancies in reagent SMILES entries.

Reagent Number. To study the reagents in USPTO, we obtain the entire dataset using the rxnutils [9] Python package and carry out relevant preprocessing (see 3.2). We rely on the atom-atom mapping (AAM) provided in USPTO to extract reagents from reactions. After filtering, we are left with 1,128,297 reactions that feature at least one reagent. In these reactions, we recognize 40,556 unique molecules as reagents using AAM. Among them, two-thirds (27,100) occur only once. We disregard them, as their attribution to reagents is mostly the result of erroneous atom-atom mapping, and, in any case, our method works best with reagents that occur in the data several times and desirably come together with various other reagent species. Furthermore, for demonstration we decided to limit our study to the reagents that occur at least 100 times in the filtered dataset, and we are left with 626 unique reagent SMILES, which we sort by occurrence frequency in descending order.

Reagent Roles. Reagent role information in the USPTO is rather limited and only differentiates between catalysts, solvents, and everything else. We decided to manually categorize every reagent in our subset of 626 USPTO reagents into one of the following ten roles:

Acids: Acidic compounds typically used as catalysts, e.g., HCl, $\text {H}_2\text {SO}_4$.

Bases: Basic compounds typically used as catalysts, e.g., NaOH, n-butyllithium, or Hünig’s base.

Lewis Acids: Catalysts that are Lewis acids, e.g., $\text {AlCl}_3$.

Catalysts: Other catalysts, mostly metal-based. For example, those would comprise homogeneous palladium-based catalysts for cross-coupling reactions, such as Suzuki coupling, or heterogeneous catalysts for hydrogenation, such as metal nickel.

Ligands: Compounds with the purpose of forming coordination complexes with metal ions, e.g., phosphorus-based ligands for homogeneous palladium catalysts or chelating agents.

Oxidizing Agents: Various oxidizers, including halogenating agents. For example, $\text {KMnO}_4$, $\text {CrO}_3$, $\text {SOCl}_2$.

Reducing Agents: Various reducing agents, e.g., $\text {H}_2$, $\text {SnCl}_2$, $\text {LiAlH}_4$.

Activators: Reagents that facilitate an overall reaction but are consumed in the process. For example, we call activators the agents that allow the formation of active intermediates enabling a reaction, such as active esters for peptide coupling reactions. A comprehensive review by El-Faham and Albericio [5] systematizes a number of such reagents, many of which are present in USPTO. Examples of activators are 1-hydroxybenzotriazole and N,N’-dicyclohexylcarbodiimide for peptide coupling or diethyl azodicarboxylate for the Mitsunobu reaction.

Ambience: Other reagents that do not fit any described role and serve some auxiliary purpose, such as radical reaction inhibitors or nitrogen for inert atmosphere.

Reactants: Molecules that, in fact, contribute atoms to the product and are mistakenly classified as reagents.

The interactivity of the application and the clustering tendency of reagents with the same role on the map allow for faster decision-making about reagent roles compared to using only tabular reagent data. After reagent labeling, the map in the web application looks as in Fig. 3. The reagents are now colored according to the roles we assigned to them.

The tendency toward role cluster formation is visible in the map, although it is not perfect.

Questionable Reagent SMILES. The properties of the reagent embedding map enable easy detection of different SMILES that represent the same reagent. Alternative SMILES of the same reagent are naturally assigned similar embeddings by our method. For example, Fig. 4 demonstrates a zoom-in on a map region occupied by strong bases.

By exploring this region on the interactive map, we can immediately discover that there are sometimes several different SMILES representations for the same reagent in USPTO. We see two SMILES representations for n-butyllithium, two for lithium diisopropylamide, and three for lithium bis(trimethylsilyl)amide. In other regions of the map in Fig. 3, there are sometimes mixtures of reagents determined in our preprocessing as one reagent (see 3.2), e.g., two solvents together as a unique reagent. We want to ensure that each unique reagent is represented by a unique SMILES string in our reaction data. We revise our reagent map once more, standardizing all redundant SMILES we find this way and reducing the number of unique reagent entries in our map from 626 to 559.
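The visual search for redundant SMILES described above could be complemented by a simple numeric pre-screen. The sketch below flags pairs of entries whose embeddings are nearly parallel. Cosine similarity, the 0.95 threshold, the function names, and the three-dimensional toy vectors are all assumptions made for illustration; the paper itself relies on inspecting the map.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def candidate_duplicates(embeddings, threshold=0.95):
    """Return SMILES pairs whose embeddings have cosine similarity >= threshold."""
    pairs = []
    for (s1, v1), (s2, v2) in combinations(sorted(embeddings.items()), 2):
        if cosine(v1, v2) >= threshold:
            pairs.append((s1, s2))
    return pairs


# Hypothetical embeddings: two notations for n-butyllithium and methanol.
toy_embeddings = {
    "[Li]CCCC": [0.90, 0.10, 0.05],
    "[Li+].CCC[CH2-]": [0.88, 0.12, 0.04],
    "CO": [0.05, 0.20, 0.95],
}
flagged = candidate_duplicates(toy_embeddings)
```

Flagged pairs are only candidates for manual review: genuinely different reagents that act as alternatives are also expected to lie close together, so a chemist still has to decide whether two entries denote the same species.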

2.4 Analysis

After labeling our reagents with roles and standardizing all reagent SMILES, we generate reagent embeddings again. The embedding map we thus obtain has a slightly different shape but preserves the same clusters united by reagent action.

Contiguous Regions of Reagent Action. Figure 5 demonstrates the Voronoi diagram for the reagent embedding map we obtain for the final 559 reagents in our application. A Voronoi diagram for a set of two-dimensional points is a partition of a plane into regions drawn around every point in the set. In this case, the points are called seeds, and the regions are called Voronoi cells. In every cell, the points of the plane are closer to the seed forming the cell than to any other seed. In our case, UMAP projections of reagent embeddings are seeds.
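The defining property of a Voronoi cell, that every point belongs to its nearest seed, can be sketched in a few lines of Python. The seed coordinates and the helper names below are hypothetical and stand in for real UMAP projections; the role labels follow the categories introduced in Sect. 2.3.

```python
from math import dist


def voronoi_cell(point, seeds):
    """Return the name of the seed whose Voronoi cell contains `point`."""
    return min(seeds, key=lambda name: dist(point, seeds[name]))


def region_role(point, seeds, roles):
    """Return the role of the cell containing `point`."""
    return roles[voronoi_cell(point, seeds)]


# Hypothetical 2D projections of three reagents.
toy_seeds = {
    "NaBH4": (0.0, 0.0),
    "LiAlH4": (1.0, 0.0),
    "SOCl2": (0.0, 3.0),
}
toy_roles = {
    "NaBH4": "reducing agent",
    "LiAlH4": "reducing agent",
    "SOCl2": "oxidizing agent",
}

cell_a = voronoi_cell((0.2, 0.1), toy_seeds)
cell_b = voronoi_cell((0.9, 0.5), toy_seeds)
```

The points `(0.2, 0.1)` and `(0.9, 0.5)` fall into different cells, yet `region_role` returns the same role for both. Coloring cells by role therefore makes neighboring same-role cells read as one contiguous region.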

The cells formed by reagent embedding projections are colored by the roles of the corresponding reagents, and the touching cells of the same color merge. A Voronoi diagram makes it easy to see regions formed by reagents of the same purpose. We highlight nine example regions in the diagram. The region labeled with 1 comprises various reagents enabling peptide coupling. Among them are HOBt, HOAt, DCC, and their alternatives, many of which are described in the corresponding review [5]. Region 2 unites organophosphorus ligands for homogeneous metal catalysts, e.g., 1,3-bis(diphenylphosphino)propane (dppp), CyJohnPhos and BrettPhos. Region 3 is defined by homogeneous palladium catalysts, such as $\text {Pd}_2\text {(dba)}_3$ or $\text {Pd(PPh}_3\text {)}_4$. Region 4 is the region of chelating agents, e.g., 8-hydroxyquinoline or phenanthroline. Region 5 features two clusters: one cluster is defined by hydrogenation catalysts, such as palladium on carbon and other group 10 metals; the other cluster is formed by catalytic compounds of $\text {Cu}^\text {I}$ and $\text {Cu}^\text {II}$. Region 6 unites the reagents for Mitsunobu reactions, namely diethyl azodicarboxylate (DEAD) and structurally similar TMAD, DIAD, and DtBAD. Region 7 comprises chlorinating agents such as $\text {SOCl}_2$, $\text {PCl}_5$, or cyanuric chloride. Region 8 is the region of Grignard reagents. Region 9 features borohydrides that serve as reducing agents, e.g., $\text {NaBH}_4$ or $\text {NaBH(OAc)}_3$. These nine regions are just some examples, and the map contains many more regions uniting reagents of the same action. We invite the readers to explore the interactive map of USPTO reagents themselves.

Reagent Role Distribution. Figure 6 demonstrates the distribution of reagent roles among our subset of 559 reagents.

One can see that nineteen percent of the reagents are actually reactants. This suggests that the atom mapping tool used in USPTO often fails to resolve the noisy reactions for which this dataset is notorious.

We make the list of USPTO reagents and their roles available alongside our codebase. We hope that researchers will find this information useful for their own work involving USPTO.

Reagent Counts. Figure 7 displays the logarithm of the number of occurrences for every reagent. The roughly 50 most common reagents dominate the dataset. However, about 35 percent of all reactions use reagents other than those 50 most common ones. The less common reagents form a fat tail of the occurrence frequency distribution: the relative occurrence frequency of reagents starting from the 100th falls like $\frac{1}{x^2}$, where $x$ is the frequency rank of the reagent.
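One way to check a decay of the form $\frac{1}{x^2}$ is to fit a straight line to the logarithm of frequency against the logarithm of rank; a slope near $-2$ supports the claim. The sketch below does this with ordinary least squares on synthetic frequencies that follow the law exactly. The data and the function name `loglog_slope` are assumptions made for illustration and are not the USPTO counts.

```python
from math import log


def loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) against log(rank)."""
    xs = [log(r) for r in ranks]
    ys = [log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic tail: ranks 100..559 with frequencies exactly proportional to 1/x^2.
ranks = list(range(100, 560))
synthetic_freqs = [1.0 / r ** 2 for r in ranks]
slope = loglog_slope(ranks, synthetic_freqs)
```

On real counts the fitted slope would only approximate $-2$, and the result depends on the rank at which the tail is taken to begin.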

3 Methods

3.1 Theory

The original word2vec [18] is a machine learning algorithm, whose success popularized machine learning for Natural Language Processing. The goal of the algorithm is to obtain learned distributed vector representations of words of the natural language for subsequent usage in downstream tasks, such as text classification [14]. It has been also shown that embeddings similar to word2vec enable very effective text compression [21]. The idea behind word2vec is the distributional hypothesis: the words occurring in similar contexts have similar meanings and, therefore, must get similar vector representations - embeddings. The algorithm iteratively trains those initially random embeddings by solving a classification task - what words are in the context for a given word. The context of a word consists of other words within a window centered on that word, and the data pairs for training are obtained using a sliding window over a large text corpus.
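The sliding-window construction of training pairs can be sketched as follows. The function name `context_pairs`, the window size, and the example sentence are chosen for illustration only.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the catalyst speeds up the reaction".split()
pairs = context_pairs(sentence, window=1)
```

For reagents, Sect. 2.2 treats all reagents of one reaction as a single context, which corresponds to a window that spans the whole reagent list.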

Word2vec has gained widespread adoption in the fields of cheminformatics and bioinformatics. For example, researchers have adapted it for the construction of universal feature vectors for small molecules [7, 23]. Also, it has been employed to create meaningful representations of nucleic acids for phylogenetic analysis [20], predicting drug-miRNA associations [6] and RNA degradation prediction [11]. Additionally, word2vec embeddings have been utilized for proteins in tasks such as drug-target interaction [26], drug-target affinity [28], protein-protein interaction [27], and others. A survey [19] offers a broad overview of the diverse applications of word2vec in bioinformatics and cheminformatics.

While word2vec is an iterative algorithm, Levy and Goldberg [12] derived a proof that a formulation of word2vec called SGNS (skipgram with negative sampling) is equivalent to factorising the matrix of point-wise mutual information (PMI) scores with singular value decomposition (SVD). PMI scores can be obtained from co-occurrence counts (Eq. 1). Although learning word embeddings is better approached with the iterative algorithm, as there are hundreds of thousands of unique words and their co-occurrence matrix would be too large, we can resort to PMI matrix factorization when building reagent embeddings, as there are only several hundred of the latter.

Singular value decomposition is a way of factorising a matrix in linear algebra. It is a generalization of matrix eigendecomposition to non-square matrices. SVD decomposes a real-valued matrix M of size $m \times n$ into three factors according to Eq. 2:

$$M = U \varSigma V^{T} \tag{2}$$

where U and V are orthogonal matrices with sizes $m \times m$ and $n \times n$, respectively, and $\varSigma $ is an $m \times n$ diagonal matrix with non-negative real numbers on the diagonal called singular values. The number of non-zero singular values is equal to the rank of M, and we can truncate $\varSigma $ to the desired number of singular values, which controls the sizes of U and V. We perform SVD in Python using the SciPy library [25] and use U as the matrix of reagent embeddings, in which every row is a reagent embedding with the dimensionality equal to the chosen number of singular values.
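A minimal sketch of the truncation step follows. It uses `numpy.linalg.svd` rather than the SciPy routine mentioned above, purely to keep the example small. The matrix values, the function name `truncated_svd_embeddings`, and the choice `k=2` are made up for illustration.

```python
import numpy as np


def truncated_svd_embeddings(matrix, k):
    """Return the first k left singular vectors (one row per item) and singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # NumPy returns singular values in descending order,
    # so the leading k columns of U belong to the k largest ones.
    return u[:, :k], s[:k]


# Small symmetric toy matrix standing in for a PMI table (hypothetical values).
toy = np.array([
    [0.0, 2.0, 0.5, 0.0],
    [2.0, 0.0, 0.4, 0.1],
    [0.5, 0.4, 0.0, 1.5],
    [0.0, 0.1, 1.5, 0.0],
])
embeddings, singular_values = truncated_svd_embeddings(toy, k=2)
```

Each row of `embeddings` is then a two-dimensional vector for one item. With a real PMI table the number of retained columns would be the chosen embedding dimensionality.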

3.2 Data

Preprocessing. We use the entire USPTO dataset for our study. We download the USPTO reactions as SMILES using the rxnutils Python package [9]. The raw data before preprocessing consists of 3,748,191 reactions. We use rxnutils to build the pipeline for the initial preprocessing. First, we canonicalize all SMILES strings and remove the reactions that contain molecules that RDKit cannot canonicalize. Then, we remove stereochemical information from all molecules and drop duplicate reactions. Finally, we keep only those reactions in which the number of precursors (reactants and reagents together) is less than 10. This initial preprocessing reduces the data volume to 1,393,677 reactions. We do not remove the CXSMILES information because it is necessary to assemble scattered reagent fragments like ions or ligands that belong together into one species.

The subsequent preprocessing does not rely on rxnutils and consists of the following steps:

1. Reagent extraction: We determine reagents for every reaction using the atom-atom mapping information available in USPTO. We use the CXSMILES information provided in reactions to assemble disjoint reagent fragments (e.g., ions or ligands) into whole reagent species.

2. Reaction filtering: We drop reaction records involving more than ten reactants and reagents, more than five reagents, or zero reagents. We also delete trivial reactions, i.e., the ones in which the product is found among reactants or reagents.

3. Reagent filtering: We remove bound water from reagents such as $\text {Na}_{2}\text {SO}_{4} \cdot 10\,\text {H}_{2}\text {O}$ and delete bare ions and other reagent species with unbalanced charge.
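The reaction filtering step above can be sketched as a single predicate over one reaction record. The function name `keep_reaction`, its argument layout, and the example SMILES are hypothetical. The sketch assumes that all SMILES are already canonical, so that plain string comparison is enough to detect a product among the precursors.

```python
def keep_reaction(reactants, reagents, products,
                  max_precursors=10, max_reagents=5):
    """Return True if a reaction record passes the count and triviality filters."""
    # Drop reactions with zero reagents or more than `max_reagents` reagents.
    if len(reagents) == 0 or len(reagents) > max_reagents:
        return False
    # Drop reactions with more than `max_precursors` reactants and reagents.
    if len(reactants) + len(reagents) > max_precursors:
        return False
    # Drop trivial reactions: a product already appears among the precursors.
    precursors = set(reactants) | set(reagents)
    if any(p in precursors for p in products):
        return False
    return True


# Esterification with sulfuric acid as the only reagent (illustrative record).
ok = keep_reaction(["CCO", "CC(=O)O"], ["O=S(=O)(O)O"], ["CCOC(C)=O"])
```

The reagent filtering step (bound water, unbalanced charge) needs a cheminformatics toolkit and is not covered by this sketch.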

Eventually, we remove reactants and products from every reaction and obtain a text file in which every row contains reagent SMILES for a reaction separated by semicolons. In this file, we count all unique reagents and remove those that appear less than 100 times. This leaves us with 626 unique reagents (559 after ensuring one-to-one correspondence between unique reagents and unique SMILES in our data, see 2.3).
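Counting reagents in such a semicolon-separated file and applying the frequency cut-off could look roughly like this. The function name `frequent_reagents` and the toy rows are assumptions made for illustration; the actual script in the repository may differ.

```python
from collections import Counter


def frequent_reagents(rows, min_count=100):
    """Count reagents over semicolon-separated rows and keep the frequent ones."""
    counts = Counter()
    for row in rows:
        counts.update(r for r in row.strip().split(";") if r)
    # most_common() sorts by occurrence count in descending order.
    return [(r, c) for r, c in counts.most_common() if c >= min_count]


# Toy rows; a lower threshold is used because the sample is tiny.
toy_rows = ["CO;[Na+].[OH-]", "CO", "ClCCl;CO", "ClCCl"]
kept = frequent_reagents(toy_rows, min_count=2)
```

Using a semicolon as the separator keeps multi-fragment species such as `[Na+].[OH-]` intact, since the dot is already part of the SMILES syntax.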

We then use this file to derive a table of reagent co-occurrence counts and point-wise mutual information scores. Finally, we factorize this table using singular value decomposition and use one of the factors as a matrix of reagent embeddings.

UMAP Details. We use the default parameters of the UMAP class (15 neighbors, Euclidean metric) in the UMAP Python package [17] in the web application.

Alternative Reagent Determination. As an alternative to relying on AAM to determine reagents, we also try the fingerprint-based procedure described by Schneider et al. [22] and available in RDKit. It does not depend on AAM and is therefore more universal, even though it may occasionally fail, determining all reactants as reagents. In such cases, we fall back to the AAM-based reagent extraction. With this preprocessing, we obtain 558 unique reagents that appear at least 100 times in the dataset. Among those, 25 reagents, 14 of which we assign the “reactant” role, are not among our 559 reagents determined using AAM. At the same time, 26 reagents (19 “reactants”) determined by AAM are not in the set of reagents determined by the fingerprint procedure. However, the reagent embedding space maps in both cases do not differ significantly, and the map for reagents obtained by the fingerprint procedure contains the same regions as in Fig. 5. Therefore, we conclude that both reagent determination procedures are alternatives for the user to choose from depending on the user’s confidence in the AAM reliability in their data.
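The reported set sizes are mutually consistent, which can be verified with a short calculation. Writing $R_{\text{AAM}}$ and $R_{\text{FP}}$ for the reagent sets obtained with atom-atom mapping and with the fingerprint procedure (notation introduced here for convenience), the numbers above give

$$\begin{aligned} |R_{\text{FP}}| &= |R_{\text{AAM}}| - |R_{\text{AAM}} \setminus R_{\text{FP}}| + |R_{\text{FP}} \setminus R_{\text{AAM}}| = 559 - 26 + 25 = 558, \\ |R_{\text{AAM}} \cap R_{\text{FP}}| &= 559 - 26 = 558 - 25 = 533. \end{aligned}$$

That is, 533 reagents are shared by both procedures, which is in line with the observation that the two maps contain the same regions.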

4 Conclusion

Our paper introduces a novel approach to facilitate chemical reaction data curation with a focus on reagents. By counting unique reagents in a reaction dataset, turning the table of their pairwise counts into point-wise mutual information scores, and factorizing that table with singular value decomposition, we effectively apply a word2vec algorithm to reagents and obtain their distributed vector representations that capture reagent co-occurrence statistics. Projecting the obtained reagent representations on the plane with UMAP, we construct a reagent space map demonstrating intriguing clustering patterns among reagents, highlighting that reagents united by common purpose lie close together and partition into distinct clusters. Based on this map, we present an interactive web application providing a user-friendly platform for researchers to navigate and explore reagent patterns within reaction datasets. We demonstrate the use of the application with the USPTO dataset.

Additionally, we systematize and catalog several hundred of the most common reagents used in USPTO and label them into detailed roles. We believe that such information will be valuable for reagent prediction models trained on USPTO. For example, it can be used to estimate the performance of a reagent prediction model not by the often interchangeable individual molecules, but by the correctness of predicted roles. The code and data are available at https://github.com/Academich/reagent_emb_vis.

References

1. CASREACT website. Accessed 23 March 2024. https://www.cas.org/cas-data
2. NextMove Software. Pistachio. Accessed 23 March 2023. http://www.nextmovesoftware.com/pistachio.html
3. Reaxys database. Accessed 23 March 2024. https://www.reaxys.com
4. Andronov, M., Voinarovska, V., Andronova, N., Wand, M., Clevert, D.A., Schmidhuber, J.: Reagent prediction with a molecular transformer improves reaction data quality. Chem. Sci. 14(12), 3235–3246 (2023)
5. El-Faham, A., Albericio, F.: Peptide coupling reagents, more than a letter soup. Chem. Rev. 111(11), 6557–6602 (2011)
6. Guan, Y.J., et al.: MFIDMA: a multiple information integration model for the prediction of drug-miRNA associations. Biology 12(1), 41 (2022)
7. Jaeger, S., Fulle, S., Turk, S.: Mol2vec: unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58(1), 27–35 (2018)
8. Johansson, S., et al.: AI-assisted synthesis prediction. Drug Discov. Today Technol. 32, 65–72 (2019)
9. Kannas, C., Genheden, S.: Rxnutils–a cheminformatics python library for manipulating chemical reaction data (2022)
10. Kearnes, S.M., et al.: The open reaction database. J. Am. Chem. Soc. 143(45), 18820–18826 (2021)
11. Krishna, U.V., Premjith, B., Soman, K.: A comparative study of pre-trained gene embeddings for COVID-19 mRNA vaccine degradation prediction. In: Proceedings of the Seventh International Conference on Mathematics and Computing: ICMC 2021, pp. 301–308. Springer (2022)
12. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. Adv. Neural Inf. Proc. Syst. 27 (2014)
13. Li, J.J.: Name Reactions, 3rd edn. A Collection of Detailed Reaction Mechanisms. Springer-Verlag, Berlin Heidelberg (2006). https://doi.org/10.1007/978-3-030-50865-4
14. Lilleberg, J., Zhu, Y., Zhang, Y.: Support vector machines and word2vec for text classification with semantic features. In: 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 136–140. IEEE (2015)
15. Lowe, D.M.: Extraction of Chemical Structures and Reactions from the Literature. Ph.D. Dissertation, University of Cambridge, Cambridge, UK (2012). https://doi.org/10.17863/CAM.16293
16. Madzhidov, T.I., et al.: Machine learning modelling of chemical reaction characteristics: yesterday, today, tomorrow. Mendeleev Commun. 31(6), 769–780 (2021)
17. McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Proc. Syst. 26 (2013)
19. Öztürk, H., Özgür, A., Schwaller, P., Laino, T., Ozkirimli, E.: Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov. Today 25(4), 689–705 (2020)
20. Ren, R., Yin, C., Yau, S.S.-T.: kmer2vec: A novel method for comparing DNA sequences by word2vec embedding. J. Comput. Biol. 29(9), 1001–1021 (2022)
21. Schmidhuber, J., Heil, S.: Sequential neural text compression. IEEE Trans. Neural Networks 7(1), 142–146 (1996)
22. Schneider, N., Stiefl, N., Landrum, G.A.: What’s what: the (nearly) definitive guide to reaction role assignment. J. Chem. Inf. Model. 56(12), 2336–2346 (2016). https://doi.org/10.1021/acs.jcim.6b00564
23. Shao, J., Gong, Q., Yin, Z., Pan, W., Pandiyan, S., Wang, L.: S2DV: converting SMILES to a drug vector for predicting the activity of anti-HBV small molecules. Brief. Bioinform. 23(2) (2022)
24. Thakkar, A., Kogej, T., Reymond, J.L., Engkvist, O., Bjerrum, E.J.: Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11(1), 154–168 (2020)
25. Virtanen, P., et al.: SciPy 1.0 Contributors: SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat Methods 17, 261–272 (2020). https://doi.org/10.1038/s41592-019-0686-2
26. Wang, L., Zhou, Y., Chen, Q.: AMMVF-DTI: a novel model predicting drug-target interactions based on attention mechanism and multi-view fusion. Int. J. Mol. Sci. 24(18), 14142 (2023)
27. Wang, Y., You, Z.H., Yang, S., Li, X., Jiang, T.H., Zhou, X.: A high efficient biological language model for predicting protein-protein interactions. Cells 8(2), 122 (2019)
28. Xia, M., Hu, J., Zhang, X., Lin, X.: Drug-target binding affinity prediction based on graph neural networks and word2vec. In: International Conference on Intelligent Computing, pp. 496–506. Springer (2022). https://doi.org/10.1007/978-3-031-13829-4_43

Acknowledgments

This study was partially funded by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Innovative Training Network European Industrial Doctorate grant agreement No. 956832 “Advanced machine learning for Innovative Drug Discovery”, and also by the Horizon Europe funding programme under the Marie Skłodowska-Curie Actions Doctoral Networks grant agreement “Explainable AI for Molecules - AiChemist” No. 101120466.

Author information

Authors and Affiliations

IDSIA, USI, SUPSI, 6900, Lugano, Switzerland
Mikhail Andronov, Michael Wand & Jürgen Schmidhuber
Machine Learning Research, Pfizer Research and Development, Friedrichstr. 110, 10117, Berlin, Germany
Mikhail Andronov & Djork-Arné Clevert
Institute for Digital Technologies for Personalized Healthcare, SUPSI, 6900, Lugano, Switzerland
Michael Wand
AI Initiative, KAUST, 23955, Thuwal, Saudi Arabia
Jürgen Schmidhuber
Independent Researcher, Berlin, Germany
Natalia Andronova

Corresponding author

Correspondence to Mikhail Andronov.

Editor information

Editors and Affiliations

Machine Learning Research, Pfizer (Germany), Berlin, Germany
Djork-Arné Clevert
IDSIA USI-SUPSI, Viganello, Switzerland
Michael Wand
Comenius University, Bratislava, Slovakia
Kristína Malinovská
IDSIA USI-SUPSI, Viganello, Switzerland
Jürgen Schmidhuber
Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Zentrum München, Munich, Germany
Igor V. Tetko

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare.

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Andronov, M., Andronova, N., Wand, M., Schmidhuber, J., Clevert, DA. (2025). Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map. In: Clevert, DA., Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) AI in Drug Discovery. AIDD 2024. Lecture Notes in Computer Science, vol 14894. Springer, Cham. https://doi.org/10.1007/978-3-031-72381-0_3

DOI: https://doi.org/10.1007/978-3-031-72381-0_3
Published: 20 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72380-3
Online ISBN: 978-3-031-72381-0
eBook Packages: Computer Science, Computer Science (R0)
