
1 Introduction

Over the many years chemical science has existed, chemists have amassed a vast body of knowledge and records about organic chemical reactions. This wealth of information encapsulates hundreds of distinct reaction types determined by a general transformation scheme and the required reagents [13]. As machine learning is cementing its place among widely used approaches to various reaction modeling problems [8, 16], chemists are becoming increasingly concerned with understanding and curating their reaction data.

The most widely known reaction data collections are Reaxys [3], CASREACT [1], the open dataset of reactions from US patents (USPTO) [15], Pistachio [2], and a recent Open Reaction Database (ORD) [10]. Different reaction datasets may have unique particularities and biases [24], and comprehending those in one’s dataset of choice is an advisable prerequisite for any study relying on that dataset. One of the possible sources of reaction data imperfections is reagent information.

A chemical reaction scheme (Fig. 1) typically involves reactants, products, and reagents. A reactant, as defined by IUPAC, is a substance consumed in the course of a chemical reaction. Consequently, reagents are other molecules that enable a reaction but do not contribute atoms to the products. Reagents are commonly written above or below the arrow in a reaction scheme. For example, catalysts and solvents are reagents. However, in practice, for convenience, substances with other roles, such as reducing and oxidizing agents, may also be considered reagents. Reagents may be integral to the mechanism of a reaction or merely improve the reaction rate. Reactions that use the same reagents may often correspond to the same reaction type.

Fig. 1. An instance of the Suzuki coupling reaction. Reagents are normally written above or below the arrow. Various reaction types are often enabled by specific reagents.

Before building and testing machine learning models on a reaction dataset, it may be beneficial to pay special attention to the reagent information within reaction records: verify whether reactant-reagent separation is adequate, whether there are redundant records of the same reagent, and whether information about detailed reagent roles is available. While the creators of popular reaction databases try to address all those issues, the data often needs additional curation. For example, reactant-reagent separation in USPTO is based on atom-atom mapping provided by the Indigo toolkit, and imperfections in this mapping may lead to imperfect separation. Additionally, reagent roles may not be detailed enough: both Reaxys and USPTO feature only three roles (“catalysts”, “solvents”, or “reagents”). When building reagent prediction models, a detailed analysis of their performance may require access to richer reagent role attribution [4].

In this paper, we describe a simple visual tool that helps to curate reagents in chemical reaction data. This tool is an interactive reagent space map based on distributed vector representations of reagents and served in a web application. We obtain reagent representations using an algorithm equivalent to the well-known word2vec [18] algorithm from Natural Language Processing. The representations reflect the statistics of reagent co-occurrence and interaction in a given reaction data corpus: representations of reagents of the same role tend to cluster together. We demonstrate the application of our tool to the USPTO dataset [15]. Using our tool, we label around six hundred of the most common reagents in USPTO into ten detailed roles, detect reactants erroneously listed as reagents, and ensure that every unique reagent is represented by a single SMILES string.

Our reagent space mapping and the web application work with any reaction dataset, and we are confident that they will benefit organic chemists and cheminformatics specialists working with their own reaction data.

The codebase is open-source and available at https://github.com/Academich/reagent_emb_vis.

2 Results

2.1 Interactive Application

We have built a lightweight web application for the interactive exploration of the USPTO reagent space. Figure 2 demonstrates the application’s appearance. The application displays an interactive UMAP [17] projection of reagent embeddings that capture the statistics of reagent co-occurrences within the dataset. Various filters are available in the application: it is possible to display only reagents with desired indices or roles, and one can also filter reagents by SMARTS patterns. When the user hovers over a point on the map, the corresponding reagent structure and its SMILES appear on the screen. Two versions of the map are available: a flat 2D map and a map on the surface of a sphere.

Fig. 2. The appearance of the web application for the exploration of reagent space. Reagents in our demonstration come from USPTO. On the right, there is an interactive UMAP projection of reagent embeddings. Every point corresponds to a unique reagent. Upon hovering over a point, the corresponding reagent is displayed on the left. Various filters for the map are available. The embeddings are obtained by factorizing the matrix of point-wise mutual information scores between pairs of reagents with singular value decomposition (SVD). The original embedding dimensionality is 50.

Users can explore reagents in their own reaction data in the application after carrying out necessary preprocessing described in the repository with the code.

2.2 Properties of Reagent Embeddings

If two reagents tend to occur in similar contexts, i.e., with similar other reagents, then these two reagents are likely alternatives to each other, and they likely get embeddings that are close to each other. For example, we do not expect two different palladium catalysts for Suzuki coupling to occur in one reaction, but one can use the same bases and solvents with both of them. Therefore, we can expect the map of reagent embeddings to feature role clusters, e.g., a cluster of specific catalysts, ligands, or other reagents. The reagent embeddings are obtained by factorizing the matrix of point-wise mutual information (PMI) scores with singular value decomposition (SVD). We can easily derive the table of PMI scores from the table of reagent counts by Eq. 1.

$$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \tag{1}$$

where x and y are reagents, P(x, y) is the relative frequency of a reagent pair among all reagent pairs, and P(x) and P(y) are the relative frequencies of the individual reagents. Reagent vector representations obtained in this manner lie close in the vector space for entities with similar "meanings", which are determined by the "companions" of those entities. This method of obtaining reagent embeddings is equivalent to the word2vec algorithm (see 3.1) if we treat reagent molecules as words and the reagents in one reaction as one context. One can freely select the reagent embedding dimensionality. We choose it to be 50 to achieve information compression that forces reagent embeddings into role clusters while leaving enough degrees of freedom. Other dimensionalities lead to different shapes of the map but do not tend to affect the observed clusters.
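As an illustration of Eq. 1, the following minimal sketch computes PMI scores from a list of reactions, each given as a list of reagent identifiers. The function name `pmi_scores` and the toy reagent labels are hypothetical and not taken from the repository. Pairs that never co-occur are simply omitted here, which is an assumption of this sketch rather than a documented choice of the authors.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(reactions):
    """Return {(x, y): PMI(x, y)} for every reagent pair that co-occurs at least once."""
    single = Counter()  # occurrences of individual reagents
    pair = Counter()    # occurrences of unordered reagent pairs
    for reagents in reactions:
        unique = sorted(set(reagents))  # sort so each pair has one canonical order
        single.update(unique)
        pair.update(combinations(unique, 2))

    n_single = sum(single.values())
    n_pair = sum(pair.values())
    scores = {}
    for (x, y), count in pair.items():
        p_xy = count / n_pair        # relative frequency among all pairs
        p_x = single[x] / n_single   # relative frequency among all reagents
        p_y = single[y] / n_single
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus with made-up labels instead of SMILES
toy = [["Pd", "K2CO3", "THF"], ["Pd", "K2CO3", "dioxane"], ["NaBH4", "MeOH"]]
for pair_key, value in sorted(pmi_scores(toy).items()):
    print(pair_key, round(value, 3))
```

Arranging these scores in a symmetric reagent-by-reagent table gives the matrix that is subsequently factorized.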

2.3 Reagent Data Curation

When reagents are displayed in this interactive map (Fig. 2), it is much easier to curate them than when working with just table data. We showcase two use cases when the map facilitates reagent curation: labeling reagents into detailed roles and finding redundancies in reagent SMILES entries.

Reagent Number. To study the reagents in USPTO, we obtain the entire dataset using the rxnutils [9] Python package and carry out relevant preprocessing (see 3.2). We rely on the atom-atom mapping (AAM) provided in USPTO to extract reagents from reactions. After filtering, we are left with 1,128,297 reactions that feature at least one reagent. In these reactions, we recognize 40,556 unique molecules as reagents using AAM. Among them, two-thirds (27,100) occur only once. We disregard them, as their attribution to reagents is mostly the result of erroneous atom-atom mapping, and, in any case, our method works best with reagents that occur in the data several times and desirably come together with various other reagent species. Furthermore, for demonstration we decided to limit our study to the reagents that occur at least 100 times in the filtered dataset, and we are left with 626 unique reagent SMILES, which we sort by occurrence frequency in descending order.

Reagent Roles. Reagent role information in the USPTO is rather limited and only differentiates between catalysts, solvents, and everything else. We decided to manually categorize every reagent in our subset of 626 USPTO reagents into one of the following ten roles:

  • Acids

    Acidic compounds typically used as catalysts, e.g., HCl, \(\text {H}_2\text {SO}_4\).

  • Bases

    Basic compounds typically used as catalysts, e.g., NaOH, n-butyllithium, or Hünig’s base.

  • Lewis Acids

    Catalysts that are Lewis acids, e.g., \(\text {AlCl}_3\).

  • Catalysts

    Other catalysts, mostly metal-based. For example, those would comprise homogeneous palladium-based catalysts for cross-coupling reactions, such as Suzuki coupling, or heterogeneous catalysts for hydrogenation, such as metallic nickel.

  • Ligands

    Compounds with the purpose of forming coordination complexes with metal ions, e.g., phosphorus-based ligands for homogeneous palladium catalysts or chelating agents.

  • Oxidizing Agents

    Various oxidizers, including halogenating agents. For example, \(\text {KMnO}_4\), \(\text {CrO}_3\), \(\text {SOCl}_2\).

  • Reducing Agents

    Various reducing agents, e.g. \(\text {H}_2\), \(\text {SnCl}_2\), \(\text {LiAlH}_4\).

  • Activators

    Reagents that facilitate an overall reaction but are consumed in the process. For example, we call activators the agents that allow the formation of active intermediates enabling a reaction, such as active esters for peptide coupling reactions. A comprehensive review by El-Faham and Albericio [5] systematizes a number of such reagents, many of which are present in USPTO. The examples of activators are 1-hydroxybenzotriazole and N,N’-dicyclohexylcarbodiimide for peptide coupling or diethyl azodicarboxylate for Mitsunobu reaction.

  • Ambience

    Other reagents that do not fit any described role and serve some auxiliary purpose, such as radical reaction inhibitors or nitrogen for inert atmosphere.

  • Reactants

    Molecules that, in fact, contribute atoms to the product and are mistakenly classified as reagents.

The interactivity of the application and the clustering tendency of reagents with the same role on the map allow for faster decision-making about reagent roles compared to using only tabular reagent data. After reagent labeling, the map in the web application looks as in Fig. 3. The reagents are now colored according to the roles we assigned to them.

Fig. 3. The map of embeddings for the subset of 626 most common USPTO reagents colored according to the detailed reagent roles assigned manually.

The tendency toward role cluster formation is visible in the map, although it is not perfect.

Questionable Reagent SMILES. The properties of the reagent embedding map enable easy detection of different SMILES that represent the same reagent. Alternative SMILES of the same reagent are naturally assigned similar embeddings by our method. For example, Fig. 4 demonstrates a zoom-in on a map region occupied by strong bases.

Fig. 4. A region of the reagent embedding map that reveals unique reagents represented by several different SMILES. In this region of strong bases, the SMILES representations of n-butyllithium, lithium diisopropylamide, and lithium bis(trimethylsilyl)amide require standardization.

By exploring this region on the interactive map, we can immediately discover that there are sometimes several different SMILES representations for the same reagent in USPTO. We see two SMILES representations for n-butyllithium, two for lithium diisopropylamide, and three for lithium bis(trimethylsilyl)amide. In other regions of the map in Fig. 3, there are sometimes mixtures of reagents determined in our preprocessing as one reagent (see 3.2), e.g., two solvents together as a unique reagent. We want to ensure that each unique reagent is represented by a unique SMILES string in our reaction data. We revise our reagent map once more, standardizing all redundant SMILES we find this way and reducing the number of unique reagent entries in our map from 626 to 559.

2.4 Analysis

After labeling our reagents with roles and standardizing all reagent SMILES, we generate reagent embeddings again. The embedding map we thus obtain has a slightly different shape but preserves the same clusters united by reagent action.

Contiguous Regions of Reagent Action. Figure 5 demonstrates the Voronoi diagram for the reagent embedding map we obtain for the final 559 reagents in our application. A Voronoi diagram for a set of two-dimensional points is a partition of a plane into regions drawn around every point in the set. In this case, the points are called seeds, and the regions are called Voronoi cells. In every cell, the points of the plane are closer to the seed forming the cell than to any other seed. In our case, UMAP projections of reagent embeddings are seeds.

The cells formed by reagent embedding projections are colored by the roles of the corresponding reagents, and the touching cells of the same color merge. A Voronoi diagram makes it easy to see regions formed by reagents of the same purpose. We highlight nine example regions in the diagram. The region labeled with 1 comprises various reagents enabling peptide coupling. Among them are HOBt, HOAt, DCC, and their alternatives, many of which are described in the corresponding review [5]. Region 2 unites organophosphorus ligands for homogeneous metal catalysts, e.g., 1,3-bis(diphenylphosphino)propane (dppp), CyJohnPhos, and BrettPhos. Region 3 is defined by homogeneous palladium catalysts, such as \(\text{Pd}_2\text{(dba)}_3\) or \(\text{Pd(PPh}_3\text{)}_4\). Region 4 is the region of chelating agents, e.g., 8-hydroxyquinoline or phenanthroline. Region 5 features two clusters: one cluster is defined by hydrogenation catalysts, such as palladium on carbon and other group 10 metals; the other cluster is formed by catalytic compounds of \(\text{Cu}^\text{I}\) and \(\text{Cu}^\text{II}\). Region 6 unites the reagents for Mitsunobu reactions, namely diethyl azodicarboxylate (DEAD) and structurally similar TMAD, DIAD, and DtBAD. Region 7 comprises chlorinating agents such as \(\text{SOCl}_2\), \(\text{PCl}_5\), or cyanuric chloride. Region 8 is the region of Grignard reagents. Region 9 features borohydrides that serve as reducing agents, e.g., \(\text{NaBH}_4\) or \(\text{NaBH(OAc)}_3\). These nine regions are just some examples, and the map contains many more regions uniting reagents of the same action. We invite the readers to explore the interactive map of USPTO reagents themselves.
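The construction described above, Voronoi cells around projected embeddings with touching same-role cells merged, can be sketched as follows. This is an illustrative sketch and not the code used for Fig. 5: the function name `merge_same_role_cells` is hypothetical, the points are random stand-ins for UMAP coordinates, and the roles are assigned arbitrarily. It relies on `scipy.spatial.Voronoi`, whose `ridge_points` attribute lists the pairs of seeds whose cells share a boundary.

```python
import numpy as np
from scipy.spatial import Voronoi


def merge_same_role_cells(points, roles):
    """Give every seed a region id; touching cells with equal roles share an id."""
    vor = Voronoi(points)
    parent = list(range(len(points)))  # union-find forest over seeds

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Each ridge separates the cells of exactly two seeds
    for a, b in vor.ridge_points:
        a, b = int(a), int(b)
        if roles[a] == roles[b]:
            parent[find(a)] = find(b)
    return [find(i) for i in range(len(points))]


# Synthetic stand-in for a 2D projection of reagent embeddings
rng = np.random.default_rng(0)
coords = rng.random((30, 2))
toy_roles = ["base" if x < 0.5 else "solvent" for x, _ in coords]
region_ids = merge_same_role_cells(coords, toy_roles)
print(len(set(region_ids)), "merged regions")
```

Seeds that end up with the same region id belong to one contiguous patch of a single color in a diagram of this kind.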

Reagent Role Distribution. Figure 6 demonstrates the distribution of reagent roles among our subset of 559 reagents.

One can see that nineteen percent of the reagents are actually reactants. This suggests that the atom mapping tool used in USPTO often fails to resolve the noisy reactions this dataset is notorious for.

We make the list of USPTO reagents and their roles available alongside our codebase. We hope that researchers will find this information useful for their own work involving USPTO.

Reagent Counts. Figure 7 displays the logarithm of the number of occurrences for every reagent. Around 50 most common reagents dominate the dataset. However, \(\sim \) 35 percent of all reactions use reagents other than those 50 most common ones. The less common reagents form a fat tail of the occurrence frequency distribution: the relative occurrence frequency of reagents starting from the 100th falls like \(\frac{1}{x^2}\).

Fig. 5. Voronoi diagram of the UMAP projection of reagent embeddings. Reagents of the same role tend to form contiguous regions corresponding to the same type of reagent action. Numbers highlight 9 example regions that unite reagents of the same purpose.

Fig. 6. Distribution of 559 most common USPTO reagents by detailed roles.

Fig. 7. Decimal logarithm of the number of occurrences of unique reagents in the USPTO dataset. We consider only the reagents that occur at least 100 times. The dataset is dominated by approximately 50 most common reagents, and the others are relatively rare. Unique reagent indices are sorted by occurrence frequency.

3 Methods

3.1 Theory

The original word2vec [18] is a machine learning algorithm whose success popularized machine learning for Natural Language Processing. The goal of the algorithm is to obtain learned distributed vector representations of natural language words for subsequent use in downstream tasks, such as text classification [14]. It has also been shown that embeddings similar to word2vec enable very effective text compression [21]. The idea behind word2vec is the distributional hypothesis: words occurring in similar contexts have similar meanings and, therefore, must get similar vector representations (embeddings). The algorithm iteratively trains those initially random embeddings by solving a classification task: which words are in the context of a given word. The context of a word consists of other words within a window centered on that word, and the data pairs for training are obtained using a sliding window over a large text corpus.
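To make the notion of context concrete, the sketch below generates (word, context word) training pairs with a sliding window and, by analogy, (reagent, context reagent) pairs where all reagents of one reaction form a single context. Both helper names and the toy inputs are hypothetical. The sketch only illustrates pair generation and does not train any embeddings.

```python
from itertools import permutations


def window_pairs(tokens, window=2):
    """(center, context) pairs from a sliding window over a token sequence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def reaction_pairs(reagents):
    """(reagent, context reagent) pairs: the whole reaction is one context."""
    return list(permutations(reagents, 2))


print(window_pairs(["the", "base", "deprotonates", "phenol"], window=1))
print(reaction_pairs(["Pd", "K2CO3", "THF"]))  # made-up labels instead of SMILES
```

Unlike words in a sentence, reagents in a reaction have no meaningful order, which is why the second helper pairs every reagent with every other reagent of the same reaction.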

Word2vec has gained widespread adoption in the fields of cheminformatics and bioinformatics. For example, researchers have adapted it for the construction of universal feature vectors for small molecules [7, 23]. Also, it has been employed to create meaningful representations of nucleic acids for phylogenetic analysis [20], predicting drug-miRNA associations [6] and RNA degradation prediction [11]. Additionally, word2vec embeddings have been utilized for proteins in tasks such as drug-target interaction [26], drug-target affinity [28], protein-protein interaction [27], and others. A survey [19] offers a broad overview of the diverse applications of word2vec in bioinformatics and cheminformatics.

While word2vec is an iterative algorithm, Levy and Goldberg [12] derived a proof that a formulation of word2vec called SGNS (skip-gram with negative sampling) is equivalent to factorizing the matrix of point-wise mutual information (PMI) scores with singular value decomposition (SVD). PMI scores can be obtained from co-occurrence counts (Eq. 1). Learning word embeddings is better approached with the iterative algorithm, as there are hundreds of thousands of unique words and their co-occurrence matrix would be too large. In contrast, we can resort to PMI matrix factorization when building reagent embeddings, as there are only several hundred unique reagents.

Singular value decomposition is a way of factorising a matrix in linear algebra. It is a generalization of matrix eigendecomposition to non-square matrices. SVD decomposes a real-valued matrix M of size \(m \times n\) into three factors according to Eq. 2:

$$M = U \Sigma V^{T} \tag{2}$$

where U and V are orthogonal matrices with sizes \(m \times m\) and \(n \times n\), respectively, and \(\Sigma\) is an \(m \times n\) diagonal matrix with non-negative real numbers on the diagonal called singular values. The number of non-zero singular values is equal to the rank of M. We can truncate \(\Sigma\) to the desired number of largest singular values, which truncates U and V accordingly. We perform SVD in Python using the SciPy library [25] and use U as the matrix of reagent embeddings, in which every row is a reagent embedding with the dimensionality equal to the chosen number of singular values.
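The truncation described above can be sketched as follows. The paper uses SciPy for this step; this sketch uses `numpy.linalg.svd` for brevity, which is a substitution on our part, and `scipy.linalg.svd` offers an equivalent call. The function name `truncated_svd_embeddings` and the random symmetric input matrix are hypothetical stand-ins for the actual PMI table.

```python
import numpy as np


def truncated_svd_embeddings(matrix, n_components):
    """Rows of the truncated U factor serve as embeddings (one row per reagent)."""
    # Singular values come back sorted in descending order
    u, singular_values, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :n_components], singular_values[:n_components]


# Random symmetric matrix as a placeholder for a reagent-by-reagent PMI table
rng = np.random.default_rng(0)
raw = rng.normal(size=(8, 8))
toy_pmi = (raw + raw.T) / 2

embeddings, top_values = truncated_svd_embeddings(toy_pmi, n_components=3)
print(embeddings.shape)  # one 3-dimensional embedding per row of the input
```

With the 50 dimensions chosen in the paper, `n_components` would be 50 and the input would have one row and one column per reagent.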

3.2 Data

Preprocessing. We use the entire USPTO dataset for our study. We download the USPTO reactions as SMILES using the rxnutils Python package [9]. The raw data before preprocessing consists of 3,748,191 reactions. We use rxnutils to build the pipeline for the initial preprocessing. First, we canonicalize all SMILES strings and remove the reactions that contain molecules not canonicalizeable by RDKit. Then, we remove stereochemical information from all molecules and drop duplicate reactions. Finally, we keep only those reactions in which the number of precursors (reactants and reagents together) is less than 10. This initial preprocessing reduces the data volume to 1,393,677 reactions. We do not remove the CXSMILES information because it is necessary to assemble scattered reagent fragments like ions or ligands that belong together into one species.

The subsequent preprocessing does not rely on rxnutils and consists of the following steps:

  1. Reagent extraction:

    We determine reagents for every reaction using the atom-atom mapping information available in USPTO. We use the CXSMILES information provided in reactions to assemble disjoint reagent fragments (e.g., ions or ligands) into whole reagent species.

  2. Reaction filtering:

    We drop reaction records involving more than ten reactants and reagents, more than five reagents, or zero reagents. We also delete trivial reactions - the ones in which the product is found among reactants or reagents.

  3. Reagent filtering:

    We remove bound water from reagents such as \(\text{Na}_2\text{SO}_4 \cdot 10\,\text{H}_2\text{O}\) and delete bare ions and other reagent species with unbalanced charge.
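The first two steps above can be sketched in plain Python under simplifying assumptions. A precursor is treated as a reactant if it shares at least one atom-map number with the product, and as a reagent otherwise. The helper names are hypothetical. The sketch deliberately skips the CXSMILES-based fragment grouping, the trivial-reaction check, and the reagent filtering of step 3, all of which need a cheminformatics toolkit, so it should not be read as the repository's implementation.

```python
import re

ATOM_MAP = re.compile(r":(\d+)\]")  # matches the map number in e.g. "[CH3:1]"


def map_numbers(smiles):
    """Collect the atom-map numbers present in a SMILES string."""
    return {int(n) for n in ATOM_MAP.findall(smiles)}


def split_by_atom_mapping(reaction_smiles):
    """Split precursors into reactants (mapped into the product) and reagents."""
    core = reaction_smiles.split()[0]  # drop a trailing CXSMILES block, if any
    left, middle, right = core.split(">")
    precursors = [m for m in f"{left}.{middle}".split(".") if m]
    product_maps = map_numbers(right)
    reactants, reagents = [], []
    for mol in precursors:
        target = reactants if map_numbers(mol) & product_maps else reagents
        target.append(mol)
    return reactants, reagents


def keep_reaction(reactants, reagents, max_precursors=10, max_reagents=5):
    """Apply the count-based filters from step 2."""
    n_reagents = len(reagents)
    if n_reagents == 0 or n_reagents > max_reagents:
        return False
    return len(reactants) + n_reagents <= max_precursors
```

For an atom-mapped esterification written as `[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]>OS(=O)(=O)O.O>[CH3:1][C:2](=[O:3])[O:6][CH3:5]`, `split_by_atom_mapping` returns the acid and the alcohol as reactants and the two unmapped species as reagents, and `keep_reaction` accepts the record.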

Finally, we remove reactants and products from every reaction and obtain a text file in which every row contains the reagent SMILES for one reaction separated by semicolons. In this file, we count all unique reagents and remove those that appear fewer than 100 times. This leaves us with 626 unique reagents (559 after ensuring one-to-one correspondence between unique reagents and unique SMILES in our data, see 2.3).
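A minimal sketch of this counting step is given below, assuming one reaction per line with reagent SMILES separated by semicolons as described above. The function name `frequent_reagent_sets` is hypothetical, and counting a reagent at most once per reaction is an assumption of the sketch.

```python
from collections import Counter


def frequent_reagent_sets(lines, min_count=100):
    """Parse semicolon-separated reagent rows and keep only frequent reagents."""
    rows = [[s for s in line.strip().split(";") if s] for line in lines]
    # Count each reagent once per reaction (assumption of this sketch)
    counts = Counter(s for row in rows for s in set(row))
    kept = {s for s, c in counts.items() if c >= min_count}
    filtered = [[s for s in row if s in kept] for row in rows]
    return [row for row in filtered if row], counts


# Tiny made-up example with a low threshold
toy_lines = ["CO;[Na+].[OH-]\n", "CO\n", "CO;C1CCOC1\n"]
rows, counts = frequent_reagent_sets(toy_lines, min_count=2)
print(rows, counts.most_common(1))
```

The returned rows can then be used to build the table of reagent co-occurrence counts.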

We then use this file to derive a table of reagent co-occurrence counts and point-wise mutual information scores. Finally, we factorize this table using singular value decomposition and use one of the factors as a matrix of reagent embeddings.

UMAP Details. We use the default parameters of the UMAP class (15 neighbors, Euclidean metric) in the UMAP Python package [17] in the web application.

Alternative Reagent Determination. As an alternative to relying on AAM to determine reagents, we also try the fingerprint-based procedure described by Schneider et al. [22] and available in RDKit. It does not depend on AAM and is therefore more universal, even though it may occasionally fail, determining all reactants as reagents. In such cases, we fall back to the AAM-based reagent extraction. With this preprocessing, we obtain 558 unique reagents that appear at least 100 times in the dataset. Among those, 25 reagents, 14 of which we assign the “reactant” role, are not among our 559 reagents determined using AAM. At the same time, 26 reagents (19 “reactants”) determined by AAM are not in the set of reagents determined by the fingerprint procedure. However, the reagent embedding space maps in both cases do not differ significantly, and the map for reagents obtained by the fingerprint procedure contains the same regions as in Fig. 5. Therefore, we conclude that both reagent determination procedures are alternatives for the user to choose from depending on the user’s confidence in the AAM reliability in their data.

4 Conclusion

Our paper introduces a novel approach to facilitate chemical reaction data curation with a focus on reagents. By counting unique reagents in a reaction dataset, turning the table of their pairwise counts into point-wise mutual information scores, and factorizing that table with singular value decomposition, we effectively apply a word2vec algorithm to reagents and obtain their distributed vector representations that capture reagent co-occurrence statistics. Projecting the obtained reagent representations on the plane with UMAP, we construct a reagent space map demonstrating intriguing clustering patterns among reagents, highlighting that reagents united by common purpose lie close together and partition into distinct clusters. Based on this map, we present an interactive web application providing a user-friendly platform for researchers to navigate and explore reagent patterns within reaction datasets. We demonstrate the use of the application with the USPTO dataset.

Additionally, we systematize and catalog several hundreds of the most common reagents used in USPTO and label them into detailed roles. We believe that such information will be valuable for reagent prediction models trained on USPTO. For example, it can be used to estimate the performance of a reagent prediction model not by the often interchangeable individual molecules, but by the correctness of predicted roles. The code and data are available at https://github.com/Academich/reagent_emb_vis.