
Engler Hart et al. Journal of Cheminformatics (2024) 16:105
https://doi.org/10.1186/s13321-024-00899-w

RESEARCH — Open Access

Evaluating the generalizability of graph neural networks for predicting collision cross section

Chloe Engler Hart1†, António José Preto1†, Shaurya Chanana1†, David Healey1, Tobias Kind1 and Daniel Domingo-Fernández1*

Abstract
Ion Mobility coupled with Mass Spectrometry (IM-MS) is a promising analytical technique that enhances molecular characterization by measuring collision cross section (CCS) values, which are indicative of molecular size and shape. However, the effective application of CCS values in structural analysis is still constrained by the limited availability of experimental data, necessitating the development of accurate machine learning (ML) models for in silico predictions. In this study, we evaluated state-of-the-art Graph Neural Networks (GNNs) trained to predict CCS values using the largest publicly available dataset to date. Although our results confirm the high accuracy of these models within chemical spaces similar to their training environments, their performance significantly declines when applied to structurally novel regions. This discrepancy raises concerns about the reliability of in silico CCS predictions and underscores the need for releasing further publicly available CCS datasets. To mitigate this, we introduce Mol2CCS, which demonstrates how generalization can be partially improved by extending models to account for additional features such as molecular fingerprints, descriptors, and molecule types. Lastly, we also show how confidence models can help by enhancing the reliability of the CCS estimates.

Scientific contribution
We have benchmarked state-of-the-art graph neural networks for predicting collision cross section. Our work highlights the accuracy of these models when trained and evaluated in similar chemical spaces, but also how their accuracy drops when evaluated in structurally novel regions. Lastly, we conclude by presenting potential approaches to mitigate this issue.

Introduction
Ion Mobility coupled to Mass Spectrometry (IM-MS) has emerged as a powerful analytical technique that complements traditional mass spectrometry by providing additional insight into the structural properties of ions [7]. IM-MS measures the time ions take to traverse a gas-filled chamber under an electric field. The drift time is then used to calculate the collision cross section (CCS) of the ions. Since CCS is a reproducible and structure-reflective parameter that characterizes the overall shape and size of ionized molecules, it can be used as an orthogonal feature to identify the compounds in a sample. Therefore, leveraging CCS data can be viewed as a complementary approach to the traditional liquid chromatography mass spectrometry (LC–MS) based approaches, enhancing the accuracy and depth of structural prediction approaches [4].

† Chloe Engler Hart, António José Preto and Shaurya Chanana have contributed equally to this work.
* Correspondence: Daniel Domingo-Fernández, [email protected]
1 Enveda Biosciences, Inc., 5700 Flatiron Pkwy, Boulder, CO 80301, USA


The benefits of using CCS values for structure prediction are evident from the growing number of related publications, but various challenges hinder their wider adoption. A major challenge is the limited availability of comprehensive CCS databases, which often lack extensive coverage of CCS values for a wide array of reference compounds [13, 17, 23]. While many of the databases cover a wide range of predicted values, the number of high-quality experimental references is only in the low thousands.

This deficiency significantly restricts the effective use of CCS as a predictive tool in structural analysis. To account for the lack of experimental CCS values, in silico prediction models can be used. The efficacy of these predictions, however, is contingent upon the precision of these models and the structural scaffolds used during model training. In other words, only models that can reliably predict CCS values with high accuracy are suitable for generating synthetic CCS values that can fill the gaps in experimental values existing in current databases.

The concept of the applicability domain is well known in the field of QSAR/QSPR and property predictions [5, 18]. It refers to the concept that both training and validation compounds and their estimated parameters need to be in a similar structural space (scaffold space) and that predicted properties should be similar in the training and prediction set (response space). This ensures that the models perform reliably and robustly, that confident predictions can be made, and that outliers can be marked as potentially unreliable predictions [10, 11]. Without such advancements, the integration of CCS as a routine parameter in molecular characterization remains underutilized, limiting our capacity to fully exploit its insights into molecular geometry and interactions.

Over the past few years, several machine learning (ML) models for predicting CCS values have been developed [9]. These models require a molecular representation, such as fingerprints or a graph representation of the molecule. The first ML model, DeepCCS [14], encodes SMILES representations and utilizes a convolutional neural network to predict CCS values. It was trained and evaluated on a set of heterogeneous datasets containing over 2,400 molecules combined. Similarly, a Support Vector Regression (SVR) model named CCS Predictor 2.0 [16] leverages molecular fingerprints to predict CCS values and has demonstrated better accuracy than DeepCCS.

More recently, two models based on Graph Neural Networks (GNNs) have surpassed previous state-of-the-art (SOTA) models in predicting CCS values using a graph representation of molecular structures. The first model, SigmaCCS [6], employs Edge-Conditioned Convolutions (ECCs) [19] to embed the original molecular structure along with the adduct type. The authors evaluated the model on CCSBase (v1.2) [17], focusing on over 5,000 high-quality experimental CCS values and three adducts ([M+H]+, [M+Na]+, and [M-H]-). They reported a coefficient of determination (R²) of 0.9945 and a Median Relative Error (MRE) of 1.1751% on a test set. The second GNN, GraphCCS [21], simulates the structure of the resulting adduct and uses it as input for a Graph Convolutional Network (GCN). Although these models have not been directly compared, Xie and colleagues also reported an R² of 0.994 and an MRE of 1.29% on a different test set from CCSBase (v1.2). Lastly, both studies have used their models to generate an in silico database of CCS values.

Until recently, CCSBase [17], and its underlying datasets, was the main publicly available source where researchers could access several thousand CCS data points. Consequently, any prior modeling approach has been constrained by the limited amount of training data available. Notably, although previous evaluations have demonstrated that models can accurately predict CCS values from molecular structures, the chemical space represented in this database is relatively small (approximately 6,075 unique structures, including lipids, peptides, carbohydrates, and small molecules) and highly homogenous (see Supplementary Fig. 1A). The recent release of METLIN-CCS [1, 2], which focuses on synthetic small molecules, significantly expands the availability of experimental CCS data with over 27,000 unique structures. This expansion allows for the benchmarking of previously published models in a broader context. Additionally, the combination of fingerprints and GNNs has recently been applied to similar prediction tasks, such as retention time prediction [22]. Furthermore, leveraging additional metadata, such as the instruments used to generate the CCS values or the types of molecules analyzed, could potentially enhance the accuracy and generalizability of these models.

In this work, we benchmark the state-of-the-art models on the new METLIN-CCS database and assess their generalizability. We also demonstrate increased generalizability using an extension of SigmaCCS, which we call Mol2CCS, that incorporates additional information such as instrument type. Finally, we show that using confidence models to filter predictions can result in increased performance of the models.
Methods
Data
For this work, we leveraged two of the largest public resources for CCS values: CCSBase [17] and METLIN-CCS [1]. CCSBase (v1.3) combines 22 different datasets, providing a total of 16,989 CCS values measured on three different instruments from 6,744 distinct molecules (e.g., small molecules, lipids, peptides, and carbohydrates). Recently published, METLIN-CCS (downloaded on 14/04/2024) is currently the largest CCS database, with over 65,000 CCS values from 27,633 distinct small synthetic molecules. METLIN-CCS contains CCS values exclusively measured on a timsTOF Pro with trapped ion mobility spectrometry (TIMS), and CCS values for individual adduct forms were experimentally acquired in triplicate.

Pre-processing
SMILES
There are multiple ways of representing small molecules, one of the most common being SMILES codes. This format allows encoding the molecular data into string format without loss of information. In this analysis, the SMILES strings available in both datasets are the base representation leveraged to generate molecular fingerprints and graph representations.

Standardization across datasets
Handling adduct ions is a common hurdle when dealing with mass spectrometry (MS) data [20]. As a byproduct of the measurement, adducts become a source of variation that needs to be considered in order to account for the same molecule displaying different CCS values. Both GraphCCS and SigmaCCS take this into account by passing adducts as features; in our work, we decided to also include the adduct in our data splitting schema in order to make sure that the same molecule with different adducts does not get separated across the prediction model training and evaluation stages. Similarly to previous work, we considered only the most common adducts. Since this work is the first to leverage the recently released METLIN-CCS for CCS prediction, we are able to train models on one order of magnitude more data than previous work, and we considered more adduct forms (9) than previous work (3): [M+H]+, [2M+H]+, [M+Na]+, [2M+Na]+, [M-H]-, [2M-H]-, [M+K]+, [M+H-H2O]+, [M+NH4]+.

Due to the diversity of CCSBase and METLIN-CCS, we followed similar processing steps as GraphCCS and SigmaCCS. We first dropped rows with missing SMILES and SMILES with a ".", implying multiple disconnected parts. In CCSBase, multiple instances of duplicate SMILES-adduct pairs were present. To address this issue, we calculated the standard deviation of the CCS values associated with each SMILES-adduct pair. We removed any pairs with a mean absolute deviation greater than five, determined through analysis of a histogram showcasing all mean absolute deviations for duplicates (Supplementary Fig. 2). We then assigned the mean CCS value for each pair to the remaining duplicates. Ultimately, these processing steps reduced the number of data points from 16,989 to 13,617 for CCSBase. All points within the METLIN-CCS database had a mean absolute deviation less than five.
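To make the deduplication step concrete, the snippet below sketches it in pandas. The column and file names (smiles, adduct, ccs) are hypothetical, and this is an illustrative sketch of the described procedure, not the released pre-processing code.

```python
import pandas as pd

# Illustrative sketch of the duplicate handling described above; the column
# and file names are hypothetical, not those of the released datasets.
df = pd.read_csv("ccsbase.csv")  # columns assumed: smiles, adduct, ccs

# Mean absolute deviation of each CCS replicate from its (SMILES, adduct) mean
grouped = df.groupby(["smiles", "adduct"])["ccs"]
df["mad"] = grouped.transform(lambda x: (x - x.mean()).abs().mean())

# Remove pairs deviating by more than five, then collapse the remaining
# duplicates to their mean CCS value
df = df[df["mad"] <= 5]
deduplicated = df.groupby(["smiles", "adduct"], as_index=False)["ccs"].mean()
```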
Feature extraction for Mol2CCS
Similarly to related work [6, 21], the Mol2CCS architecture represents each molecule as a graph, wherein every node and edge denotes an atom and a bond, respectively. The entire molecular graph is represented by three different matrices: i) a node attribute matrix, ii) an edge attribute matrix, and iii) an adjacency matrix. These three matrices respectively store characteristic attributes of the atoms, the bonds, and the connections of the molecular graph. To construct them, we first read the SMILES representation of each molecule using RDKit (v2023.09.5) [8] and subsequently used the ETKDG and MMFF94 conformer generators to obtain the 3D conformers for each molecule. Lastly, from these conformers, we obtained the atoms, bonds, and their attributes. As features for Mol2CCS, we expand upon the node and bond attributes used by Guo et al. [6] (Supplementary Table 1).

In addition to the molecular graph and the adduct, Mol2CCS utilizes seven additional features encoded as one-dimensional vectors. Firstly, similar to previous models, we applied one-hot encoding to represent specific adduct forms (i). Additionally, we employed one-hot encoding to differentiate between monomeric and dimeric adducts (ii). Secondly, we integrated molecular information by incorporating a 256-dimensional vector representing Morgan fingerprints (iii), generated using RDKit with 256 bits and a radius of 2, and a 2-dimensional vector containing the molecular weight of the original molecule as well as the molecular weight of the original molecule plus or minus the adduct (iv). Thirdly, to specify the type of molecule, we included a 35-dimensional vector indicating the presence or absence of several structural classes (v) (e.g., allene, carboxyl, and organic acid) using DrugTax [15]. Additionally, we applied one-hot encoding to categorize the four molecule types (vi) (i.e., small molecule, lipid, peptide, carbohydrate). Lastly, we incorporated a final one-hot encoded vector indicating the instrument type (vii) (i.e., TIMS, DT, TW).
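The following sketch illustrates how such conformers and side features can be derived with RDKit. The function names and exact feature layout are ours for illustration (the DrugTax class vector and molecule-type encoding are omitted); only the RDKit calls — ETKDG embedding, MMFF94 optimization, and Morgan fingerprints with 256 bits and radius 2 — follow the description above.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

ADDUCTS = ["[M+H]+", "[2M+H]+", "[M+Na]+", "[2M+Na]+", "[M-H]-",
           "[2M-H]-", "[M+K]+", "[M+H-H2O]+", "[M+NH4]+"]
INSTRUMENTS = ["TIMS", "DT", "TW"]

def conformer_from_smiles(smiles: str) -> Chem.Mol:
    """Generate a 3D conformer via ETKDG embedding and MMFF94 optimization."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, AllChem.ETKDG())
    AllChem.MMFFOptimizeMolecule(mol)  # MMFF94 force field
    return mol

def side_features(smiles: str, adduct: str, instrument: str) -> np.ndarray:
    """Sketch of the non-graph feature vector (fingerprint, weights, one-hots)."""
    mol = Chem.MolFromSmiles(smiles)
    fingerprint = np.array(
        AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=256))
    adduct_onehot = np.eye(len(ADDUCTS))[ADDUCTS.index(adduct)]
    is_dimer = adduct.startswith("[2M")
    dimer_onehot = np.array([is_dimer, not is_dimer], dtype=float)
    instrument_onehot = np.eye(len(INSTRUMENTS))[INSTRUMENTS.index(instrument)]
    weight = Descriptors.MolWt(mol)
    # The second entry would add or subtract the adduct mass; kept as a
    # placeholder here since the exact adduct mass table is not shown.
    weights = np.array([weight, weight])
    return np.concatenate(
        [adduct_onehot, dimer_onehot, fingerprint, weights, instrument_onehot])
```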
Data splitting
We trained and tested each model on train and test splits using the following strategy. First, we divided the entire dataset (CCSBase and METLIN-CCS data combined) into five distinct groups: lipids, dimers, carbohydrates, peptides, and everything else. This categorization ensured that each of these categories was represented in the train and test sets. We performed an 80/20 train/test split on each of these groups, stratified on the Murcko scaffold [3]. It is important to note that several molecules did not have Murcko scaffolds (mostly lipids) and some of them had only a simple benzene Murcko scaffold. Given the substantial number of molecules in these two categories, we chose to replace the Murcko scaffold with the SMILES string during the stratified split for such molecules. After performing these splits, we combined all of the train splits and all of the test splits to create our training and test sets. There were several Murcko scaffolds that appeared in more than one of our initial five groups, resulting in 745 scaffolds that were present in both the training and test sets. To ensure disjoint testing and training sets, we performed an additional 80/20 train/test split on these 745 scaffolds to divide them into the train and test sets (see Supplementary Fig. 3 for details). For models that were trained solely on a single database (i.e., METLIN-CCS or CCSBase), we used these same data splits confined to that particular database (Fig. 1B). This method ensured that the test data for the combined dataset did not contain any molecules used for training models on the METLIN-CCS or CCSBase data alone.
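A simplified sketch of the scaffold-based splitting is shown below. It groups molecules by Murcko scaffold (falling back to the SMILES string when no scaffold exists) and keeps each group on one side of the split; the five-category stratification and the benzene special case described above are omitted for brevity, and the function names are ours.

```python
import random
from collections import defaultdict

from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_key(smiles: str) -> str:
    """Murcko scaffold used as the grouping key; molecules without a scaffold
    (e.g., most lipids) fall back to their own SMILES string."""
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles)
    return scaffold if scaffold else smiles

def scaffold_split(smiles_list, test_frac=0.2, seed=42):
    """80/20 split that never separates molecules sharing a scaffold."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[scaffold_key(smi)].append(idx)
    scaffolds = sorted(groups)
    random.Random(seed).shuffle(scaffolds)
    test_scaffolds = set(scaffolds[: int(test_frac * len(scaffolds))])
    train_idx, test_idx = [], []
    for scaffold, indices in groups.items():
        (test_idx if scaffold in test_scaffolds else train_idx).extend(indices)
    return train_idx, test_idx
```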
State-of-the-art machine learning models for CCS prediction
We evaluated the performance of the two most accurate ML models trained to predict CCS values: GraphCCS [21] and SigmaCCS [6]. We retrained these models using the reported hyperparameters on the datasets used in this work.

Mol2CCS's architecture and hyper-parameters
Since Mol2CCS is an extension of SigmaCCS [6], we expanded upon its codebase for its implementation. Our goal, rather than developing a new model, was to demonstrate how current SOTA models can be improved by leveraging additional features, similar to [22] for retention time prediction.

The SigmaCCS architecture consists of a GNN with three ECC layers [19] trained on the three matrices representing the molecular graph described in section "Feature extraction for Mol2CCS". The output of these layers is concatenated with a one-hot encoded vector representing the adduct and fed into several fully connected layers to produce a predicted CCS value. In our implementation of Mol2CCS, the GNN module is identical to SigmaCCS's implementation, but we extend the model with a parallel module consisting of four fully connected layers (# neurons: 256, 512, 512, 256) applying dropout to prevent overfitting. This module learns a representation for the additional seven features (including the adduct) (see subsection "Feature extraction for Mol2CCS"). Lastly, the model aggregates the output of both modules into eight fully connected layers with 384 neurons each, as in the original SigmaCCS model (Fig. 1A). For both the ECC layers and the fully connected layers, we used ReLU activation functions and L2 regularization. The final layer, which outputs the predicted CCS value, is also a fully connected layer with ReLU but without regularization applied.

We trained the model for up to 400 epochs, applying early stopping with a patience of 10 epochs. Furthermore, we used a dropout of 0.1 for the novel module, a batch size of 32, an Adam optimizer with a learning rate of 0.0001, and 16, 16, and 128, respectively, as the output dimensions of the three ECC layers. We chose these parameters based on a grid search experiment conducted on a subset of the dataset (Supplementary Table 2). Details about the hardware used can be found in Supplementary Text 1.
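The shape of this extension can be sketched as follows. This is an illustrative PyTorch rendition of the parallel module and the shared head — the released Mol2CCS code instead extends the SigmaCCS codebase, and L2 regularization would correspond to weight decay in the optimizer. The GNN producing the graph embedding is assumed to exist elsewhere.

```python
import torch
import torch.nn as nn

class SideFeatureModule(nn.Module):
    """Parallel fully connected module over the seven extra features
    (neurons 256-512-512-256 with dropout, as described above)."""
    def __init__(self, in_dim: int, dropout: float = 0.1):
        super().__init__()
        dims = [in_dim, 256, 512, 512, 256]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        return self.mlp(x)

class Mol2CCSHead(nn.Module):
    """Concatenates the GNN embedding with the side-feature embedding and
    applies eight 384-neuron layers plus a final ReLU output layer."""
    def __init__(self, gnn_dim: int, side_dim: int):
        super().__init__()
        self.side = SideFeatureModule(side_dim)
        blocks, width = [], gnn_dim + 256
        for _ in range(8):
            blocks += [nn.Linear(width, 384), nn.ReLU()]
            width = 384
        blocks += [nn.Linear(384, 1), nn.ReLU()]  # CCS values are positive
        self.head = nn.Sequential(*blocks)

    def forward(self, gnn_embedding, side_features):
        combined = torch.cat([gnn_embedding, self.side(side_features)], dim=-1)
        return self.head(combined)
```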
Evaluation
We benchmarked all the models using three different settings. In the first setting, we trained each model on 80% of one database (METLIN-CCS or CCSBase) and tested it on the remaining 20% of the same database. This setting evaluated the model's ability to predict CCS values for chemicals within a similar chemical space. In the second setting, we trained the model on the training set for one database and used the other database as a test set. The rationale behind this setting was to evaluate the generalizability of both datasets and models. Finally, we trained the models on a combined dataset using both METLIN-CCS and CCSBase (Fig. 1B). Supplementary Tables 3 and 4 report the number of adducts and molecule types in each database.

We evaluated the performance of the models on a test set using several metrics. We employed correlation metrics such as the coefficient of determination (R²) and the Pearson and Spearman correlations. Additionally, we used mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), relative standard deviation (RSD), and the relative error in percentage between the predicted CCS values and the experimental ones (% CCS error).
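These metrics can be computed with a few lines of scikit-learn and NumPy; the helper below is an illustrative sketch (the MRE is taken as the median of the per-compound relative errors, matching how it is reported here, and RSD is omitted).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def ccs_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Correlation and error metrics for predicted vs. experimental CCS."""
    rel_error = np.abs(y_pred - y_true) / y_true * 100  # % CCS error
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "pearson": pearsonr(y_true, y_pred)[0],
        "spearman": spearmanr(y_true, y_pred)[0],
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mre_percent": float(np.median(rel_error)),
    }
```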
Confidence model
For our confidence model, we used a random forest model implemented using scikit-learn [12] (Supplementary Text 2). The model inputs for each molecule consisted of the seven features described above used for the novel module of Mol2CCS (e.g., SMILES, adduct, molecule type, CCS instrument type, etc.) as well as the experimental and predicted CCS values. To feed the features into the model, we calculated the Morgan fingerprints from the SMILES string using a radius of 2 and default values in RDKit, and one-hot encoded the remaining features. To create the labels for this model, we calculated the difference between the true CCS value and the predicted CCS value. If this difference was less than 5% of the true CCS value, we gave the molecule a label of one (i.e., a proxy for an accurate prediction), otherwise a label of zero (i.e., a proxy for an inaccurate prediction). We then used the model's predicted probabilities as our confidence scores. These confidence models were used for the models trained on one database (database A) and tested on another (database B) as an attempt to improve generalizability. To train the confidence models, we experimented with several training methods. For each of these methods, we used the original test set for database B to test the confidence models. Additionally, we sampled 10% of the original training set for database B to use as a validation set for determining confidence thresholds. We used the following training sets for the confidence model:

• A training set in the same domain as the training set for the CCS prediction model (the original test set for database A). This is equivalent to evaluating the confidence model in a different chemical space.
• The test set for database A with small amounts of data from the domain of database B (sampled from the original training set for database B, excluding the validation set). This is equivalent to evaluating the confidence model in a different chemical space while exposing the model to a few in-domain chemical structures.
• A training set in the same domain as the test set (the original training set for database B, excluding the validation set). This is equivalent to evaluating the confidence model in a similar chemical space.

To select the confidence thresholds (to filter out points with low confidence), we calculated the precision and recall for each threshold from 0 to 1 (with step size 0.1) and selected the highest threshold where the recall remained higher than the precision, to balance recall and precision performance. This threshold varied for the different training sets.
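A condensed sketch of this procedure is given below: a random forest is fit on the binary accuracy labels, and the confidence threshold is chosen by scanning thresholds from 0 to 1 in steps of 0.1. The feature assembly is abstracted into a ready-made matrix X, and the hyperparameters are illustrative rather than those of Supplementary Text 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

def fit_confidence_model(X_train, ccs_true, ccs_pred):
    """Label 1 = prediction within 5% of the experimental CCS value."""
    labels = (np.abs(ccs_pred - ccs_true) < 0.05 * ccs_true).astype(int)
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X_train, labels)
    return model

def pick_threshold(model, X_val, labels_val):
    """Highest threshold (step 0.1) at which recall stays above precision."""
    proba = model.predict_proba(X_val)[:, 1]
    best = 0.0
    for threshold in np.arange(0.0, 1.01, 0.1):
        flagged = (proba >= threshold).astype(int)
        precision = precision_score(labels_val, flagged, zero_division=0)
        recall = recall_score(labels_val, flagged, zero_division=0)
        if recall > precision:
            best = threshold
    return best
```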

Fig. 1 A Model architecture. The upper section of this figure illustrates the conversion of the SMILES representation of the molecule into a molecular graph, which is then represented as three matrices (an adjacency matrix, an edge attributes matrix, and a node attributes matrix). These matrices are fed into a GNN. The GNN's output is concatenated with the output from a linear model which accepts additional features (such as adduct, instrument type, etc.) as input. This concatenated vector is then fed into another set of fully connected layers, which outputs a CCS value. B Evaluation schema. Each database is split into train (80%) and test (20%) sets based on molecule type (e.g., lipid, small molecule, etc.) and Murcko scaffolds. Next, each model is trained on the training set of each database (either CCSBase train or METLIN-CCS train) and evaluated on the two test sets of both databases (CCSBase test and METLIN-CCS test). When the model is evaluated on the same database it has been trained on, the model has already seen similar molecules, and thus, the evaluation is on a similar chemical space (left). When the model is evaluated on a test set containing dissimilar molecules, the evaluation is on a novel chemical space (middle). Lastly, both databases are also combined for training and testing (right)
Results
Benchmarking models on a similar chemical space
We began by assessing the performance of the models in a similar chemical space by training and evaluating them on the same database, either METLIN-CCS or CCSBase, independently. We would like to note that although we applied scaffold splitting rather than a simple train-test split to avoid data leakage, both databases comprise highly similar molecules (Supplementary Fig. 1C-D). Figure 2 shows the predicted CCS and the actual experimental values, as well as the corresponding metrics for the three benchmarked models, when they are evaluated on a test set of the same database on which they were trained. Overall, the performance in both databases across all three models was very high. Although the performance on the CCSBase dataset is slightly worse than reported by either GraphCCS or SigmaCCS, presumably due to the scaffold splitting conducted, both models achieve an R² slightly below 0.986 and a Median Relative Error (MRE) of 1.55% and 1.57%, respectively (Fig. 2). Mol2CCS exhibits high accuracy across all metrics (e.g., R² = 0.985, MRE = 1.51%, and MAE = 4.37). Similarly, the performance is also high for METLIN-CCS, with similar values for RMSE, MAE, and MRE, although R² drops to 0.9 (Fig. 2). Since this resource has a broader chemical space and the molecules it contains are less similar to one another, compared to CCSBase (where the models suffer from data leakage) (Supplementary Fig. 1), we believe the metrics for METLIN-CCS can accurately represent the performance of a model when it is evaluated on a similar chemical space. Finally, we investigated the molecules with the largest deviations between the predicted CCS value and the experimental one for each model and observed a high overlap, suggesting that the outliers across all three models typically correspond to the same molecules (Supplementary Fig. 4).

Fig. 2 Scatterplots of the predictions for each model when training and evaluating on the same database. On the CCSBase dataset (upper row), all models perform equally, with a very high correlation coefficient and an RMSE of approximately 6 square angstroms [Å²]. On METLIN-CCS, R² drops from 0.99 in CCSBase to 0.9. However, the other metrics are comparable for all three models

Benchmarking models on novel chemical space
Here, we explore the generalizability of the models when training them on one database (i.e., CCSBase or METLIN-CCS) and evaluating them on the other. Given that both databases cover different regions of the chemical space (Supplementary Figs. 5 and 6), these results can be used as a proxy to assess the performance of a model in a more realistic application, when the model has not seen molecules closely similar to the ones it is to predict. In this setting, the performance significantly drops across all models, indicating the lack of generality of the models when predicting within an unseen chemical space.

When training on the CCSBase train set and evaluating on the entire METLIN-CCS (Fig. 3, top row), SigmaCCS and Mol2CCS drop to an R² lower than 0.8, and their RMSE, MAE, and MSE are several times larger in comparison to the CCSBase test set. GraphCCS also performs poorly (R² = 0.36, RMSE = 18.43, MAE = 9.61), due to larger errors predicting dimers (the large cloud of orange points deviating from the diagonal). We would like to note that these inaccuracies for dimers are to be expected due to the limited number of dimer examples in the training set (CCSBase) (Supplementary Table 4). When we subset the test set to only monomers, the models tend to perform better (i.e., GraphCCS shows better performance when evaluating solely on the [M+H]+ and [M-H]- adducts) (Supplementary Fig. 7). Likewise, when we trained on METLIN-CCS, which has more dimers, and evaluated on CCSBase, GraphCCS significantly improved its performance on these less common adducts. Overall, Mol2CCS achieves the best performance, as the additional features (e.g., molecular fingerprints, dimer type, molecule type, etc.) added to the model on top of SigmaCCS lead to a better generalization of the model.

We observed a similar trend when training on the METLIN-CCS train set and evaluating on the entire CCSBase (Fig. 3, bottom row). While all models achieve a better performance in this setting (R² above 0.89, RMSE between 17.04 and 33.29, MREs between 3.44 and 5.80%, and MAEs between 10 and 21), boosted by being trained on a larger and broader dataset, the performance is still worse than training and evaluating on a single database. When looking at GraphCCS, we observed again that there is a straight cloud of points deviating from the diagonal, which corresponds to the lipids in CCSBase. The same lipids, however, are more dispersed in both Mol2CCS and SigmaCCS.

Fig. 3 Scatterplots of the predictions for each model when training on one database and evaluating on another. The bottom plots show the evaluation on CCSBase when training on METLIN-CCS. When training on CCSBase and evaluating on METLIN-CCS (upper row), the performance of all models significantly drops. For instance, the R² goes down to 0.36, 0.8, and 0.84 for GraphCCS, SigmaCCS, and Mol2CCS, respectively. However, performance drops less dramatically when training on METLIN-CCS and evaluating on CCSBase, since the models have been trained on several times more data points. Despite the larger training data, the differences in their chemical space can explain why all models exhibit RMSEs three times larger than when they are trained and evaluated on the same database
Evaluating the performance on both databases
As a final experiment, we trained all three models on the combined dataset comprising both databases. As expected for a model evaluated on a chemical space similar to the one it was trained on, the performance of the three models is high, lying between the metrics observed in subsection "Benchmarking models on a similar chemical space" for each database (Fig. 4). All three models have an R² close to 0.95, RMSEs close to 6, and MREs between 1.39% and 1.71%. Mol2CCS minimally improves the performance of SigmaCCS, suggesting that expanding the GNN architecture can also help when models are trained on large datasets and are evaluated on compounds structurally similar to the training data. The best model among the three is GraphCCS, which now achieves a good performance across all nine adducts, closely followed by the other two (Supplementary Fig. 8).

Fig. 4 Scatterplots of the predictions for each model when training and evaluating on the combined dataset
Confidence models can assist in identifying high confidence predictions
In this subsection, we explore the use of confidence models to enhance CCS prediction models by flagging predictions likely to deviate beyond a predefined threshold. Specifically, we focused on predictions for novel chemical spaces using Mol2CCS (see subsection "Benchmarking models on novel chemical space"), where we observed high variability and numerous outliers in the predictions. Consequently, we set a 5% threshold for training the confidence model. It is important to note that other models or datasets could potentially be used for this task.

Initially, we examined the performance of the confidence model trained using data from the same domain as the training dataset used for the CCS prediction model. With this training set, the confidence model was very confident, since all of the predictions it was trained on were relatively accurate (Fig. 2). When we filtered the compounds based on their confidence score, applying a threshold (0.8) determined by the methods described in section "Confidence model", the metrics remained almost unchanged. Nevertheless, MAE and MRE did decrease slightly (Fig. 5C and Supplementary Fig. 9C).

Next, we tried enhancing the confidence model training set with small amounts of scaffold-disjoint data from the test set domain. We examined the MAE, MRE, and R² metrics for sample sizes of 5, 10, 25, 50, 100, and 1,000 (Supplementary Fig. 10). As the sample sizes increased, the MAE and MRE decreased for both models, illustrating that the confidence model was able to remove outliers (Supplementary Fig. 11). Additionally, the correlation between the probabilities of an accurate prediction and the actual performance of the confidence model continued to increase, indicating the confidence model is able to accurately identify the CCS predictions that are off by over 5% (Fig. 5A and B). Interestingly, the model trained on CCSBase exhibited less overall improvement. This is likely because the CCSBase dataset contains several molecule types (such as lipids) that the METLIN-CCS database does not. So, when the confidence model is trained exclusively on METLIN-CCS data and tested on CCSBase data, it has no reference point for these other molecule types. However, when we add a small amount of CCSBase data to the confidence model training set, the confidence model can better handle these out-of-domain molecule types on the CCSBase test set, causing a significant improvement in the metrics. To summarize, these results demonstrate that even if there is only a small amount of in-domain training data available, this data can be used to train confidence models with better performance.

Finally, we examined the performance of the confidence model trained on only data from the same domain as the test set (Supplementary Fig. 12). Using this training set, the MRE and MAE metrics were consistent with those for the confidence model trained with 1,000 in-domain data points described above. Still, the correlation between the probabilities of an accurate prediction and the actual performance of the confidence model increased, particularly for the model trained on CCSBase. This increased correlation may have had a larger impact on the MRE and MAE at a different confidence threshold (Supplementary Fig. 13). Nevertheless, this result further demonstrates that more in-domain data enables confidence models to more accurately identify outliers.

In conclusion, each of these confidence models demonstrated, to varying degrees, that focusing on the set of high confidence predictions can increase model performance. This is particularly true when even a small amount of in-domain data is included in the training set for the confidence model.

Fig. 5 Confidence model for the CCS prediction model trained on METLIN-CCS and tested on CCSBase. A-B Predicted confidences by the confidence model on the test set vs. absolute error of the Mol2CCS prediction. C-D Predictions on the high confidence subset generated from the two experiments where the model is trained on one database and evaluated on the other. A and C are for the confidence model trained only on data from METLIN-CCS. B and D show results for the confidence model trained on METLIN-CCS data with an additional 1,000 data points from CCSBase (that are structure disjoint from the CCSBase test dataset). C displays the metrics of the data after confidence thresholding compared to the metrics without filtering. Comparing A and B, as well as C and D, demonstrates that the confidence model improves when it is trained with some in-domain data. However, as shown in C, even without in-domain data, the MAE and MRE improve slightly when confidence thresholding is used
Discussion
In this work, we evaluated the performance of state-of-the-art deep learning models for predicting CCS values from molecular structures and proposed two novel modeling approaches that could be implemented to improve accuracy and confidence. First, we benchmarked two GNNs (i.e., SigmaCCS and GraphCCS) and Mol2CCS, an adaptation of SigmaCCS, on METLIN-CCS and CCSBase. Our results revealed that the original high accuracy reported when the models were trained on CCSBase does not generalize to other chemical spaces. We observed similar results when training on METLIN-CCS and evaluating on CCSBase. Additionally, we demonstrated how an extension of the GNN architecture (Mol2CCS) including additional features improves the generalizability of SigmaCCS, particularly for dimers and other uncommon adducts. Lastly, we investigated the application of confidence models and showed how employing them can improve the confidence of the underlying predictions.

Our work highlights that one of the major challenges in the field is the lack of data availability. Despite the fact that the release of METLIN-CCS offers several times more data points compared to CCSBase, the lack of generalizability observed is concerning. We believe that our findings impact the usability of any in silico database generated so far, as their predicted CCS values have to be used with caution, especially for uncommon adducts. To mitigate this, we demonstrated how confidence models can be applied to narrow down molecular datasets to high confidence predictions. Another limitation of our evaluation is its restriction in scope, as it mainly focuses on small molecules and specific regions of the chemical space (e.g., METLIN-CCS is based on synthetic structures). Finally, the models can only be as good as the underlying experimental data available. We applied filtering strategies similar to those used in previous studies (see section "Standardization across datasets") to try to mitigate this issue; however, variations in the experimental conditions, such as instrument types, likely introduced some inaccuracies in the CCS values used to train our models.

We foresee several potential avenues for our work. Firstly, the improvements in generalizability shown by our work, together with the promising application of a confidence model, can be leveraged to generate in silico datasets with higher quality. Secondly, as new CCS databases are released, or new ML architectures emerge, we expect the scientific community to conduct similar benchmarks in order to verify that an increase in data and an improvement in model architectures indeed improve generalizability. Thirdly, similar to the confidence model application presented in this work, we anticipate that approaches such as generating a distribution of predictions applying Monte Carlo dropout, or using ensemble models, could be used to assess the confidence of the predictions. Lastly, the confidence model that we trained could be used alongside Monte Carlo dropout to select the prediction with the highest confidence, potentially leading to better predictions for more molecules.
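As an illustration of the Monte Carlo dropout idea mentioned above, the sketch below keeps dropout active at inference time and summarizes the resulting prediction distribution; this is a generic recipe, not part of the published pipeline.

```python
import torch

def mc_dropout_ccs(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 50):
    """Sample CCS predictions with dropout left on; the spread of the samples
    can serve as an additional confidence signal."""
    model.train()  # keeps dropout stochastic (beware of batch-norm layers)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)
```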
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s13321-024-00899-w.

Supplementary Material 1.

Acknowledgements
We would like to thank the creators of the resources referenced in the literature section that were used as the foundation of this work. Furthermore, we would like to thank the reviewers for their feedback and suggestions.

Author contributions
CEH, AJP, SC, TK, and DDF designed the study. CEH and SC prepared the data with help from TK. CEH, AJP, and DDF implemented Mol2CCS. SC, AJP, and DDF adapted the baselines and conducted the benchmark. CEH and DDF implemented the confidence model. CEH and DDF analyzed and interpreted the results. DDF wrote the paper with help from CEH and TK. All authors reviewed the manuscript. All authors have read and approved the final manuscript.

Funding
Not applicable.

Availability of data and materials
Benchmarking scripts and the implemented adaptations for Mol2CCS are released at https://github.com/enveda/ccs-prediction. Here, we include scripts and notebooks to perform grid search, rerun the experiments, process the data, and generate the data splits. All data supporting the conclusions in this article is available at https://zenodo.org/records/11199061.
Declarations

Competing interests
All authors were employees of Enveda Biosciences Inc. during the course of this work and have real or potential ownership interest in the company.

Received: 13 May 2024   Accepted: 19 August 2024

References
1. Baker ES, Hoang C, Uritboonthai W, Heyman HM, Pratt B, MacCoss M et al (2023) METLIN-CCS: an ion mobility spectrometry collision cross section database. Nat Methods 20(12):1836–1837. https://doi.org/10.1038/s41592-023-02078-5
2. Baker ES, Uritboonthai W, Aisporna A, Hoang C, Heyman HM, Connell L et al (2024) METLIN-CCS lipid database: an authentic standards resource for lipid classification and identification. Nat Metab. https://doi.org/10.1038/s42255-024-01058-z
3. Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39(15):2887–2893
4. Das S, Tanemura KA, Dinpazhoh L, Keng M, Schumm C, Leahy L et al (2022) In silico collision cross section calculations to aid metabolite annotation. J Am Soc Mass Spectrom 33(5):750–759. https://doi.org/10.1021/jasms.1c00315
5. Dragos H, Gilles M, Alexandre V (2009) Predicting the predictability: a unified approach to the applicability domain problem of QSAR models. J Chem Inf Model 49(7):1762–1776. https://doi.org/10.1021/ci9000579
6. Guo R, Zhang Y, Liao Y, Yang Q, Xie T, Fan X et al (2023) Highly accurate and large-scale collision cross sections prediction with graph neural networks. Commun Chem 6(1):139. https://doi.org/10.1038/s42004-023-00939-w
7. Kanu AB, Dwivedi P, Tam M, Matz L, Hill HH Jr (2008) Ion mobility–mass spectrometry. J Mass Spectrom 43(1):1–22. https://doi.org/10.1002/jms.1383
8. Landrum G (2016) RDKit: open-source cheminformatics. http://www.rdkit.org/. https://doi.org/10.5281/zenodo.7415128
9. Li X, Wang H, Jiang M, Ding M, Xu X, Xu B et al (2023) Collision cross section prediction based on machine learning. Molecules 28(10):4050. https://doi.org/10.3390/molecules28104050
10. Luque Ruiz I, Gómez-Nieto MÁ (2018) Study of the applicability domain of the QSAR classification models by means of the rivality and modelability indexes. Molecules 23(11):2756. https://doi.org/10.3390/molecules23112756
11. Ochi S, Miyao T, Funatsu K (2017) Structure modification toward applicability domain of a QSAR/QSPR model considering activity/property. Mol Inf 36(12):1700076. https://doi.org/10.1002/minf.201700076
12. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
13. Picache JA, Rose BS, Balinski A, Leaptrot KL, Sherrod SD, May JC, McLean JA (2019) Collision cross section compendium to annotate and predict multi-omic compound identities. Chem Sci 10(4):983–993. https://doi.org/10.1039/C8SC04396E
14. Plante PL, Francovic-Fontaine É, May JC, McLean JA, Baker ES, Laviolette F et al (2019) Predicting ion mobility collision cross-sections using a deep neural network: DeepCCS. Anal Chem 91(8):5191–5199. https://doi.org/10.1021/acs.analchem.8b05821
15. Preto AJ, Correia PC, Moreira IS (2022) DrugTax: package for drug taxonomy identification and explainable feature extraction. J Cheminform 14(1):73. https://doi.org/10.1186/s13321-022-00649-w
16. Rainey MA, Watson CA, Asef CK, Foster MR, Baker ES, Fernández FM (2022) CCS Predictor 2.0: an open-source Jupyter notebook tool for filtering out false positives in metabolomics. Anal Chem 94(50):17456–17466. https://doi.org/10.1021/acs.analchem.2c03491
17. Ross DH, Cho JH, Xu L (2020) Breaking down structural diversity for comprehensive prediction of ion-neutral collision cross sections. Anal Chem 92(6):4548–4557. https://doi.org/10.1021/acs.analchem.9b05772
18. Roy K, Kar S, Ambure P (2015) On a simple approach for determining applicability domain of QSAR models. Chemom Intell Lab Syst 145:22–29. https://doi.org/10.1016/j.chemolab.2015.04.013
19. Simonovsky M, Komodakis N (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3693–3702
20. Stricker T, Bonner R, Lisacek F, Hopfgartner G (2021) Adduct annotation in liquid chromatography/high-resolution mass spectrometry to enhance compound identification. Anal Bioanal Chem 413:503–517. https://doi.org/10.1007/s00216-020-03019-3
21. Xie T, Yang Q, Sun J, Zhang H, Wang Y, Lu H. Large-scale prediction of collision cross-section with graph convolutional network for compound identification.
22. Xue J, Wang B, Ji H, Li W (2024) RT-Transformer: retention time prediction for metabolite annotation to assist in metabolite identification. Bioinformatics. https://doi.org/10.1093/bioinformatics/btae084
23. Zhang H, Luo M, Wang H, Ren F, Yin Y, Zhu ZJ (2023) AllCCS2: curation of ion mobility collision cross-section atlas for small molecules using comprehensive molecular representations. Anal Chem 95(37):13913–13921. https://doi.org/10.1021/acs.analchem.3c02267

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
