
Trends in Analytical Chemistry 189 (2025) 118243


Application of machine learning in LC-MS-based non-targeted analysis


Zhuo-Lin Jin a,b,1, Lu Chen a,b,1, Yu Wang a, Chao-Ting Shi a, Yan Zhou a, Bing Xia a,*
a Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, 610041, PR China
b University of Chinese Academy of Sciences, Beijing, 100049, PR China

A R T I C L E  I N F O

Keywords: Non-targeted analysis; Liquid chromatography; Mass spectrometry; Machine learning; Data processing; De novo structure generation; LC-MS

A B S T R A C T

Non-targeted analysis, a cornerstone in fields such as metabolomics, environmental science, and food safety, allows for the comprehensive screening of constituents in a sample without preconceived target compounds. The vast and intricate data generated by these analyses, however, often present challenges for traditional data processing methods in effectively extracting valuable insights. Machine learning, as a robust tool for data processing and pattern recognition, has increasingly garnered the attention of researchers for its application in non-targeted analysis. This review synthesizes the latest advancements in the application of machine learning to non-targeted analysis. Furthermore, the discussion covers key steps such as data acquisition, data preprocessing, feature extraction, and data analysis and interpretation. It highlights the challenges that machine learning faces in these critical stages and proposes future research directions. By reviewing the most recent research findings, this review aims to provide practical guidance on the selection and optimization of machine learning methods for researchers in non-targeted analysis, thereby fostering further development and application in this domain.

1. Introduction

Non-targeted analysis (NTA) is a chemical analytical technique characterized by its lack of predefined target compounds. Instead, it focuses on identifying and characterizing unknown or understudied chemical entities within a sample. NTA demands high selectivity and sensitivity to address complex challenges, considering both known compounds that have not been specifically focused on ("known unknowns") and entirely unknown potential compounds ("unknown unknowns"). To achieve this, NTA commonly employs technologies such as liquid chromatography-mass spectrometry (LC-MS) to comprehensively reveal the chemical composition of samples and detect significant trends and patterns that may be overlooked in traditional targeted analyses [1,2]. Given the aforementioned strengths of NTA, such as its comprehensiveness, sensitivity, high selectivity, adaptability to complex chemical challenges, innovative application of advanced technologies, and foresight in revealing potential trends and patterns, it has been widely applied across various fields including food safety [3], environmental pollution [4], and metabolomics [5].

Despite the numerous advantages of NTA techniques and their application as a significant analytical tool across various research fields, they still face many limitations and challenges. Firstly, in the field of metabolomics, the extreme diversity in the chemical structure and composition of metabolites poses a significant barrier to their precise identification in complex biological samples, thereby complicating the elucidation of complex metabolic pathways [12]. Secondly, NTA encounters the dilemma of bridging the gap between spectra and molecular structures, particularly when dealing with a vast array of unknown spectra and extracting information from limited experimental data. This underscores the growing demand for more efficient data processing algorithms and tools [13]. Furthermore, the technical application of NTA faces multiple challenges, such as the intricacies of mass spectrometry (MS) techniques, the accuracy of data annotation, the comprehensiveness of database resources, the completeness of data-sharing mechanisms, and the lack of standardized procedures. These challenges collectively restrict the widespread and in-depth application of NTA in both scientific research and practical contexts [14].

* Corresponding author. Chengdu Institute of Biology, Chinese Academy of Sciences, No. 23, Qunxian South Street, Tianfu New Area, Chengdu, Sichuan, 610213,
PR China.
E-mail address: [email protected] (B. Xia).
1 Z.L. Jin and L. Chen contributed equally to this work.

https://doi.org/10.1016/j.trac.2025.118243
Received 24 October 2024; Received in revised form 20 March 2025; Accepted 20 March 2025
Available online 21 March 2025
0165-9936/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

In NTA, the data processing stage is particularly challenging due to the scarcity of prior information. Accurately identifying potential risk ions from complex matrices or low-concentration samples is exceptionally difficult. The identification and confirmation of unknown compounds without matches in the database are fraught with uncertainty, necessitating more advanced identification techniques and a more comprehensive compound library for support [15]. In the realm of data analysis and interpretation, NTA faces similar challenges in quantitative interpretation, risk assessment decision-making, exposure assessment, and risk characterization. Despite numerous efforts to overcome these difficulties, continuous scientific progress and technological innovation are still required to achieve more efficient and accurate data utilization, thereby supporting rapid risk characterization [16].

From the preceding text, it is evident that addressing the limitations and challenges in NTA necessitates the use of auxiliary tools with high flexibility and adaptability. Despite the limitations faced by traditional software and algorithms in this domain, machine learning (ML) offers significant advantages due to its flexibility, adaptability, and efficient processing of complex data. ML has demonstrated exceptional capability in chemical structure and spectral data processing. Initially, it addresses the challenge of chemical structural diversity through automated feature extraction. This process significantly reduces the reliance on manual operations and enhances analytical performance by employing dimensionality reduction techniques to transform complex data into a lower-dimensional space while preserving key information [17]. This feature is particularly crucial for establishing a link between spectral data and molecular structures, as ML utilizes data-driven models to explore and learn the intricate relationships between spectral data and molecular structures, thereby successfully bridging the two domains [18]. Furthermore, ML algorithms play a significant role in optimizing the precision of spectral database matching. These algorithms not only enhance the efficiency of data annotation through automated processes but also significantly improve the accuracy of matching with the introduction of deep learning techniques. A suite of technical methods, including feature extraction and transformation, spectral similarity computation, variable selection and dimensionality reduction, structural similarity prediction, and retention index forecasting, collectively contribute to the effectiveness of ML models. These advancements have led to marked improvements in both the accuracy and efficiency of spectral database matching [19,20]. In the core segment of data processing, ML also plays an indispensable role, particularly in the detection of anomalous data. By integrating advanced analytical methods such as feature extraction, classification modeling, and multivariate statistical analysis, ML can acutely identify anomalous signals within the data. This provides robust data support for NTA, thereby achieving higher analytical accuracy and efficiency [21,22].

Notably, ML-assisted non-targeted analysis for risk assessment has emerged as a cutting-edge technique, extensively applied across the aforementioned three domains. This approach adeptly combines the strengths of High-Resolution Mass Spectrometry (HRMS) in NTA with the robust predictive capabilities of ML algorithms, enabling the precise identification and effective evaluation of potential risks posed by pharmaceuticals and their transformation products (TPs) in aquatic environments [23]. Meanwhile, ML-assisted NTA has demonstrated unique value in metabolomics. It accelerates metabolite identification, achieves automated data processing, and significantly enhances the accuracy of disease classification and the predictive power of clinical outcomes [24]. Similarly, in the field of food safety, ML technology is rapidly advancing. Leveraging ML in data processing, modeling, and performance evaluation markedly improves the efficiency and accuracy of detecting volatile organic compounds (VOCs) in food, which aids in food origin traceability and counterfeit detection [25]. ML-assisted NTA has thus shown significant application value in environmental monitoring, metabolomics, and food safety, providing robust technical support for environmental science research, disease diagnosis, and food safety supervision.

2. Brief overview of machine learning in non-targeted analysis

2.1. Traditional machine learning

ML, an integral branch of artificial intelligence (AI), demonstrates significant potential and importance in NTA. Within the paradigm of ML, traditional ML models are primarily categorized into Supervised Learning, Unsupervised Learning, and Semi-supervised Learning [26].

Supervised learning involves training models on datasets with labeled data, with the objective of predicting outputs based on inputs. Common supervised learning algorithms include Decision Trees, Random Forests (RF), Support Vector Machines (SVM), and Logistic Regression. Unsupervised learning involves processing unlabeled datasets with the aim of discovering hidden structures or patterns within the data. Common unsupervised learning algorithms include K-means Clustering and Principal Component Analysis (PCA). Semi-supervised learning combines the characteristics of supervised and unsupervised learning, utilizing a small amount of labeled data along with a large volume of unlabeled data for model training [27-29].

Traditional ML models have played a significant role in NTA techniques, yet they exhibit certain limitations when dealing with complex high-throughput data analysis. Consequently, to better meet the demands of NTA, new ML models and algorithms are being increasingly applied.

2.2. Innovative machine learning

2.2.1. Generative models
Generative models, by estimating the joint probability distribution of data, serve as a powerful tool in the field of ML, particularly excelling in scenarios that require simulating the data generation process [30]. However, it should be noted that the generative models discussed here are distinct from those referred to in the subsequent context of 'de novo structure generation.' The latter focuses primarily on generating the structures of unknown compounds directly from their MS/MS spectra, which is independent of the structure database. In contrast, generative models in ML are more concerned with learning from data and producing new data samples similar to existing ones.

In non-targeted analytical techniques, the primary generative models utilized include Generative Adversarial Networks (GANs), Diffusion Models, Autoencoders, Variational Autoencoders (VAEs), and Flow-Based Models.

GANs are a potent generative modeling framework capable of learning a data distribution through a competitive training process. These networks comprise two components, the Generator (G) and the Discriminator (D), which are trained in a collaborative yet adversarial manner. The Generator samples random noise vectors z from the latent space to produce samples G(z) that resemble real data, while the Discriminator endeavors to identify whether an input sample is real or generated by the Generator [31,32]. GANs are applied in NTA by generating synthetic samples [33,34] or mitigating batch effects [35], thereby enhancing the quality of data and the performance of classification models (Fig. 1).

Fig. 1. Generative adversarial networks.

Diffusion models corrupt training data by gradually injecting Gaussian noise and subsequently learn to reverse this corruption to recover the original data. Specifically, the framework employs a Markov chain in which Gaussian noise is sequentially applied to the initial data x_0 over discrete timesteps t = 1, ..., T. The forward process is defined by the variational posterior distribution:

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})    (1)

where x_1, ..., x_T denote progressively noisier latent variables [36].
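To make the forward process in Eq. (1) concrete, the minimal sketch below applies a linear Gaussian noise schedule to a synthetic chromatographic trace. The schedule values, timestep, and signal are illustrative assumptions only, not parameters taken from any specific NTA study.

```python
import numpy as np

def forward_diffuse(x0, T=1000, beta_start=1e-4, beta_end=0.02, t=500, seed=0):
    """Sample x_t ~ q(x_t | x_0) under a linear beta schedule.

    Uses the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    which follows from composing the per-step Gaussians in Eq. (1).
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, T)     # noise schedule
    alpha_bar = np.cumprod(1.0 - betas)              # cumulative signal retention
    eps = rng.standard_normal(x0.shape)              # Gaussian noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Illustrative "chromatogram": one Gaussian peak on a flat baseline.
time = np.linspace(0, 10, 512)
x0 = np.exp(-0.5 * ((time - 5.0) / 0.2) ** 2)
x_noisy = forward_diffuse(x0, t=200)
print(f"clean std = {x0.std():.3f}, noised std = {x_noisy.std():.3f}")
```

A trained denoising network would learn the reverse of this corruption; the reverse model is omitted here for brevity.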

These models have achieved state-of-the-art performance in image synthesis tasks [37], surpassing alternative generative approaches such as GANs in both output quality and sample diversity [38]. In NTA, diffusion techniques have been effectively adapted for two main applications: 1) probabilistic identification of unknown contaminants through iterative refinement [39], and 2) enhancement of spectral signals through noise-conditional denoising, improving chromatographic peak detection sensitivity and enabling robust retention time alignment [40] (Fig. 2).

Fig. 2. Diffusion model.

Autoencoders are a class of Artificial Neural Networks (ANNs) that utilize unsupervised learning to encode input data into a low-dimensional representation and subsequently decode it back to the original input, with the primary objective of minimizing reconstruction error. An autoencoder consists of two main components: the encoder f_θ: R^D → R^d (where D is the input dimension, d is the latent dimension, and d ≪ D), which maps high-dimensional data inputs x ∈ R^D to the latent representation z = f_θ(x), and the decoder g_φ: R^d → R^D, which constructs approximations x̂ = g_φ(z) (the reconstruction output) by minimizing the mean squared error reconstruction loss:

L = \mathbb{E}\left[ \| x - g_\phi(f_\theta(x)) \|^2 \right]    (2)

where \mathbb{E} is the expectation operator and ‖·‖ denotes the L2 norm [41,42]. In NTA, autoencoders serve as powerful tools by facilitating data preprocessing [43], batch effect removal [44], chemical library expansion, and small molecule identification [45] (Fig. 3).

Fig. 3. Autoencoder.

VAEs provide a principled framework for learning deep latent variable models via variational inference, combining deep neural networks with generative probabilistic modeling. This approach integrates variational inference with deep learning to model data generation and latent space representation by maximizing the Evidence Lower Bound (ELBO) of the data marginal likelihood:

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right)    (3)

where θ and φ denote the generative and inference network parameters respectively, D_KL represents the Kullback-Leibler divergence, and p_θ(z) is the prior distribution [46]. In VAEs, the inference network of the encoder outputs the mean μ and the log variance log σ² that parameterize a Gaussian distribution in the latent space. These outputs define the probability distribution of the latent variable z conditioned on the input data. In NTA, VAEs are primarily applied to capture nonlinear relationships and enhance the model's generalization capability [47] (Fig. 4).

Fig. 4. Variational autoencoder.
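A minimal PyTorch sketch of the ELBO objective in Eq. (3) is shown below, assuming a Gaussian encoder and a toy input dimension; the layer sizes and random input tensor are placeholders rather than a recommended architecture for spectral data.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Gaussian-prior VAE: the encoder outputs (mu, log_var), the decoder reconstructs x."""
    def __init__(self, d_in=128, d_latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU())
        self.mu = nn.Linear(64, d_latent)
        self.log_var = nn.Linear(64, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(), nn.Linear(64, d_in))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, log_var

def negative_elbo(x, x_hat, mu, log_var):
    # Reconstruction term (Gaussian likelihood up to a constant) ...
    recon = ((x - x_hat) ** 2).sum(dim=1)
    # ... plus the closed-form KL( q(z|x) || N(0, I) ), i.e. the two terms of Eq. (3).
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(dim=1)
    return (recon + kl).mean()

x = torch.randn(32, 128)          # stand-in for a batch of 32 binned spectra / feature vectors
model = TinyVAE()
x_hat, mu, log_var = model(x)
loss = negative_elbo(x, x_hat, mu, log_var)
loss.backward()
print(float(loss))
```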

Flow-based models learn data distributions through a sequence of invertible transformations, creating a bijective mapping between the complex data distribution x and a simple latent distribution z = g_θ^{-1}(x), where g_θ denotes the bijective transformation [48,49]. The model's core mechanism is expressed through the change-of-variables formula:

\log p_\theta(x) = \log p_\theta(z) + \log \left| \det\left( \frac{dz}{dx} \right) \right|    (4)

where det(dz/dx) represents the determinant of the Jacobian matrix of the inverse transformation g_θ^{-1}, and |·| ensures positivity of the probability density. This formulation enables exact log-likelihood computation for training samples, distinguishing flow models from other generative approaches [50]. By constructing an invertible mapping from simple distributions to complex molecular structure distributions, flow-based models enable rapid molecular generation and optimization, highlighting their significant potential in molecular design [51,52] (Fig. 5).

Fig. 5. Flow-based model.

2.2.2. Representation and transformation
The main purpose of representation and transformation models is to convert raw data into features that can be effectively utilized by ML algorithms [53,54]. In NTA, the primary representation and transformation models employed include sequence data processing models and structured data processing models.

2.2.2.1. Sequence data processing. The primary sequence data processing models utilized in NTA include Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and the Transformer.

RNNs are a specialized class of ANNs designed for processing sequential data. These networks are adept at handling variable-length sequences and capturing the temporal dynamics within them. The key feature of RNNs is their recurrent structure, which enables the consideration of previous inputs and outputs when processing the current input. This allows RNNs to retain information from prior contexts [55]. For each time step t in an RNN, the hidden state h_t and output y_t are given by:

h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)    (5)

y_t = g(W_{hy} h_t + b_y)    (6)

where x_t is the input at time step t, h_{t-1} is the previous hidden state, W_{hh}, W_{xh}, and W_{hy} are weight matrices, b_h and b_y are bias terms, and f and g are activation functions. In NTA, RNNs are primarily applied in two key areas: time-series data processing [56] and de novo structure generation [57] (Fig. 6).

Fig. 6. Recurrent neural networks.
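The recurrence in Eqs. (5)-(6) can be written in a few lines. The sketch below is a bare NumPy illustration with random weights and a random input sequence, not a trained model; the dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_hidden, d_out, T = 16, 32, 4, 10

# Random parameters standing in for learned weights.
W_hh = rng.normal(0, 0.1, (d_hidden, d_hidden))
W_xh = rng.normal(0, 0.1, (d_hidden, d_in))
W_hy = rng.normal(0, 0.1, (d_out, d_hidden))
b_h = np.zeros(d_hidden)
b_y = np.zeros(d_out)

x_seq = rng.normal(size=(T, d_in))   # e.g. a sequence of per-scan feature vectors
h = np.zeros(d_hidden)
outputs = []
for x_t in x_seq:
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # Eq. (5), with f = tanh
    y_t = W_hy @ h + b_y                       # Eq. (6), with g = identity
    outputs.append(y_t)

print(np.stack(outputs).shape)   # (10, 4): one output vector per time step
```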


LSTMs are a specialized variant of RNNs, meticulously designed to address the challenges of long-term dependency encountered by traditional RNNs when processing lengthy data sequences [58]. Their architecture employs sophisticated gating mechanisms, defined by the following operations:

Forget gate:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)    (7)

Input gate:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)    (8)

Cell state update:

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)    (9)

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t    (10)

where σ denotes the sigmoid activation, ⊙ represents element-wise multiplication, W and b are trainable weights and biases, h_{t-1} is the previous hidden state, x_t is the current input, \tilde{C}_t is the candidate value for updating the cell state, and C_t is the updated cell state. This structure enables selective retention and update of information through gated interactions between current inputs and historical states [59]. In NTA, LSTMs not only effectively capture long-term dependencies in data from chromatography [60], but also automatically extract key features while suppressing noise [61]. Furthermore, as demonstrated in Ref. [61], they can be combined with other deep learning models to achieve comprehensive feature extraction and analysis (Fig. 7).

Fig. 7. Long Short-Term Memory networks.

The Transformer architecture proposes a deep learning framework based on self-attention mechanisms, departing from conventional recurrent and convolutional paradigms. Its core operation, scaled dot-product attention, computes sequence relationships through:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V    (11)

where Q ∈ R^{n×d_k}, K ∈ R^{m×d_k}, and V ∈ R^{m×d_v} represent the query, key, and value matrices respectively. Here n is the sequence length of the queries, m is the sequence length of the input, d_k is the key dimension, and d_v is the value dimension. The architecture extends this through multi-head attention mechanisms that parallelize attention computations across h subspaces:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O    (12)

where each \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), W^O ∈ R^{h d_v × d_{model}} is the output projection matrix, d_{model} is the model dimension, and Concat concatenates the outputs from multiple attention heads. Positional information is encoded using sinusoidal functions:

PE_{(pos, 2i)} = \sin\left( pos / 10000^{2i/d_{model}} \right)    (13)

PE_{(pos, 2i+1)} = \cos\left( pos / 10000^{2i/d_{model}} \right)    (14)

where pos is the position index and i is the dimension index. The encoder-decoder structure employs layer normalization and residual connections to stabilize training gradients. Each layer contains position-wise feed-forward networks with ReLU activation [62]. In NTA, the Transformer architecture enhances the accuracy and efficiency of data analysis due to its capabilities in precisely predicting MS/MS data [63], interpreting MS/MS spectral similarities [64], predicting retention times (RT) [65], and generating molecular structures [66] (Fig. 8).

Fig. 8. Transformer.
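As a worked illustration of Eq. (11), the following NumPy sketch computes scaled dot-product attention for random query, key, and value matrices; the shapes are arbitrary assumptions and no learned projections or multi-head splitting are included.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (11): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, m, d_k, d_v = 5, 7, 16, 16     # 5 query positions attending over 7 input positions
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # (5, 16): one context vector per query position
```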

2.2.2.2. Structured data processing. The predominant structured data processing models applied in NTA encompass Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs).

CNNs are deep learning models inspired by biological vision, using local connections and weight sharing to automatically extract features while reducing parameters, allowing for the automatic extraction of feature representations from input data. The convolutional layer applies learnable filters to detect local patterns. The feature value at location (i, j) in the k-th feature map of the l-th layer, z^l_{i,j,k}, is calculated as follows:

z^l_{i,j,k} = (w^l_k)^T x^l_{i,j} + b^l_k    (15)

where w^l_k is the weight vector of the k-th filter in the l-th layer, b^l_k is the bias term of the k-th filter of the l-th layer, and x^l_{i,j} is the input patch centered at location (i, j) of the l-th layer. After computing the feature values, non-linearity is introduced via activation functions like ReLU. CNNs use pooling layers to reduce spatial dimensions, which helps mitigate overfitting:

y^l_{i,j,k} = \mathrm{pool}(a^l_{m,n,k}), \quad \forall (m, n) \in R_{ij}    (16)

where y^l_{i,j,k} is the pooled feature map, pool is the function that applies max/average pooling over a local region, a^l_{m,n,k} is the activation value of z^l_{m,n,k}, and R_{ij} is the local neighborhood around location (i, j) [67]. CNNs have been applied in various aspects of NTA, including feature extraction and image classification [68], peak detection and localization [69], data visualization and interpretation, and trend analysis and pollution source identification [70]. These applications have not only enhanced the accuracy and efficiency of the analysis but also advanced the automation and intelligence of NTA techniques (Fig. 9).

Fig. 9. Convolutional neural networks.

GNNs are a class of neural networks designed to handle graph-structured data. These networks excel at capturing the intricate relationships between nodes as well as the overall structural features of the graph. The core idea of GNNs is message passing, where a node updates its representation by aggregating information from its neighbors.

GNNs aim to learn a node's state embedding h_v (a vector for node v) by integrating its own features x_v, edge features x_{co[v]}, and the neighboring nodes' states h_{N_v} and features x_{N_v}. This embedding is computed using a shared parametric function f, defined as:

h_v = f(x_v, x_{co[v]}, h_{N_v}, x_{N_v})    (17)

The resulting embedding h_v is then used to generate an output o_v through another function g, where:

o_v = g(h_v, x_v)    (18)

Both f and g are typically implemented as feedforward neural networks, allowing GNNs to capture neighborhood information for tasks such as node label prediction. Popular GNN variants include the GCN (Graph Convolutional Network), GAT (Graph Attention Network), and GRN (Graph Recurrent Network), each differing in how messages are aggregated [71,72]. The application of GNNs in NTA is reflected in their ability to process graph-structured data. This capability enhances the accuracy of RT predictions, aids in the identification of metabolites, and facilitates the analysis of relationships within complex datasets [73,74]. These applications advance NTA research by improving both the efficiency and precision of data interpretation (Fig. 10).

Fig. 10. Graph neural networks.
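A single message-passing round in the spirit of Eq. (17) can be sketched with mean aggregation over neighbours, as below. The small graph, random features, and random weights are illustrative assumptions; this is not the GNN-RT model cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy graph: 4 nodes with an undirected adjacency matrix (e.g. atoms and bonds).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))        # node features (e.g. atom descriptors)

def message_passing_layer(A, H, W_self, W_neigh):
    """One Eq. (17)-style update: combine each node's state with the mean of its neighbours."""
    deg = A.sum(axis=1, keepdims=True)
    neigh_mean = (A @ H) / np.maximum(deg, 1.0)           # aggregate h_{N_v}
    return np.tanh(H @ W_self + neigh_mean @ W_neigh)

W_self = rng.normal(0, 0.3, (8, 8))
W_neigh = rng.normal(0, 0.3, (8, 8))
H1 = message_passing_layer(A, X, W_self, W_neigh)
H2 = message_passing_layer(A, H1, W_self, W_neigh)

graph_embedding = H2.mean(axis=0)   # simple readout, e.g. feeding a per-molecule RT prediction head
print(graph_embedding.shape)        # (8,)
```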

2.2.3. Advanced processing and optimization
Advanced processing and optimization models refer to a suite of sophisticated and efficient ML models designed to conduct in-depth analysis and optimization of data and problems through advanced algorithms and techniques. These models aim to address challenging issues encountered in practical applications [75]. In NTA, advanced processing and optimization models include Ensemble Learning, Transfer Learning, Contrastive Learning, and Multimodal Learning.

Ensemble learning is a methodology that amalgamates multiple learners (also known as base learners or inducers) to make decisions, particularly in supervised learning tasks. It enhances the predictive performance of individual models by training multiple models and combining their predictions [76]. In NTA, ensemble learning is utilized to enhance the predictive accuracy of metabolite RT across different systems and to mitigate technical variability in metabolomics data [77,78].

Transfer learning refers to the process of leveraging knowledge acquired from one or more source domains to facilitate learning tasks in a target domain. It is particularly beneficial when there is some relevance between the source and target domains, yet they are not identical [79]. Through pre-trained models and fine-tuning on task-specific data [80], transfer learning can significantly enhance the performance and applicability of models, thereby facilitating the accuracy and scalability of complex chemical analysis tasks in NTA.

Contrastive learning is an ML paradigm that facilitates the learning of data representations by leveraging the distances or similarities between positive sample pairs (similar samples) and negative sample pairs (dissimilar samples) in the representation space. It is applicable in various contexts, including self-supervised, supervised, and semi-supervised learning scenarios [81]. In NTA, contrastive learning can enhance the representation quality of MS/MS spectral embeddings for different compounds and improve the identification of structural similarities among spectral data [82,83].

Multimodal learning leverages data from multiple sensors (each sensor's output is referred to as a modality) to address a common learning task. These modalities provide complementary information when observing the same phenomenon, thereby enhancing learning performance [84]. In NTA, the application of multimodal learning is evident in the integration of diverse datasets and techniques to provide complementary information, thereby enhancing analysis accuracy. Multimodal learning plays a significant role in areas such as the integration of mass spectrometry imaging (MSI) with liquid chromatography-mass spectrometry (LC-MS), the combination of various MSI technologies, and the utilization of multiple molecular representations for ML-based predictions [85,86].

3. Application of machine learning in non-targeted analysis

ML plays a crucial role in NTA, significantly contributing to various stages, including data acquisition, preprocessing, feature extraction, and data interpretation. It enhances the efficiency and accuracy of analyses, aiding in the comprehensive understanding and interpretation of complex samples.

During the data acquisition phase, ML is particularly prominent. It facilitates the efficient recording of vast amounts of raw data and enhances data completeness and accuracy through techniques such as RT, collision cross-section (CCS), and molecular fragmentation simulation. These advancements establish a solid foundation for subsequent analyses. In the data preprocessing stage, ML emerges as a robust tool for noise reduction, baseline correction, and peak identification. The use of ML for peak detection and annotation significantly improves the efficiency and precision of data processing, rendering complex datasets more manageable.

During the feature extraction and identification phase, ML leverages its unique strengths. By employing techniques such as de novo structure generation, it accurately extracts statistically significant features from massive data and generates corresponding molecules, thereby significantly accelerating the analytical process. In the data analysis and interpretation phase, ML utilizes methods like molecular fingerprint prediction to characterize the specific chemical composition of the samples. This approach reveals hidden chemical patterns and provides profound insights for scientific research (Fig. 11).

Fig. 11. Workflow of non-targeted analysis based on machine learning. The figure was created in BioRender. Jin, Z. (2025) https://BioRender.com/lv1jsdh.

3.1. Peak detection

Peak detection and annotation are critical steps in NTA, focusing on the identification and characterization of peaks in MS and chromatography data. Peak detection involves identifying signal peaks in LC-MS data that correspond to compounds. These peaks are typically characterized by intensity maxima at specific mass-to-charge ratios (m/z) and RT within the MS data [87].

As a pivotal step in the data processing workflow, the accuracy of peak detection directly influences the reliability of compound identification and quantification [88]. By eliminating noise and baseline drift, peak detection significantly enhances data quality, providing a solid foundation for subsequent statistical analysis [89]. Furthermore, as demonstrated in Ref. [87], efficient peak detection algorithms support peak alignment and RT alignment across multiple samples, ensuring data comparability and consistency.

However, achieving this goal is not a trivial task, particularly when faced with challenges posed by diverse data sources and variable experimental conditions [90]. Peak detection algorithms must adapt to complex data generated by various analytical techniques and manage inconsistencies in peak shape, width, and intensity arising from variations in experimental conditions, instrumental parameters, and sample properties [91].

Traditional algorithms are applied to peak detection by analyzing Total Ion Chromatograms (TICs) and Extracted Ion Chromatograms (XICs). These algorithms, including isotope envelope analysis and feature matching, can accurately identify local maxima, isotope peaks, and other key features in chromatograms. This capability provides robust support for the interpretation of LC-MS data [92].
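The classical local-maxima approach just described can be illustrated on a synthetic XIC with standard signal-processing tools; the smoothing window, intensity threshold, and prominence value below are illustrative choices, not recommended settings for real LC-MS data.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter

rng = np.random.default_rng(7)

# Synthetic XIC: two chromatographic peaks on a drifting baseline plus noise.
rt = np.linspace(0, 12, 1200)                    # retention time, min
signal = (1.0e5 * np.exp(-0.5 * ((rt - 4.2) / 0.08) ** 2)
          + 4.0e4 * np.exp(-0.5 * ((rt - 7.9) / 0.10) ** 2)
          + 2.0e3 * rt                            # baseline drift
          + 5.0e2 * rng.standard_normal(rt.size)) # noise

smoothed = savgol_filter(signal, window_length=21, polyorder=3)  # light smoothing
baseline = np.percentile(smoothed, 10)                           # crude baseline estimate

# Local maxima that clear illustrative intensity and prominence thresholds.
idx, props = find_peaks(smoothed, height=baseline + 1.0e4, prominence=5.0e3)

for i in idx:
    print(f"peak at RT {rt[i]:.2f} min, apex intensity {smoothed[i]:.2e}")
```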

As data complexity continues to increase, traditional methods for peak detection and annotation are becoming inadequate in meeting the demands for high precision and efficiency. Traditional peak detection methods, such as the Savitzky-Golay algorithm, may underperform when dealing with metabolomics data featuring complex peak shapes and low-quality deviations, thereby limiting their applicability and accuracy in certain scenarios [93].

Consequently, the integration of ML models has emerged as an innovative solution in this domain. Among these, CNNs, an exemplary representation of deep learning techniques, have been extensively utilized for the precise detection and meticulous classification of peaks. CNNs demonstrate the capability to effectively capture complex patterns within chromatograms, thereby enabling accurate identification of peaks that may be subtle or challenging to distinguish using conventional approaches.

Arsenty et al. have developed an algorithm named PeakOnly. This algorithm leverages CNNs to address peak detection and integration in LC-MS data. PeakOnly demonstrates significant advantages with its CNN-based feature detection method during the initial stage of raw data processing. It is capable of providing high-quality peak tables, thereby simplifying subsequent analysis work [94].

While PeakOnly has demonstrated significant advantages in peak detection and integration of LC-MS data, it requires a preliminary step of region of interest (ROI) detection and classification before peak separation and integration can be performed, which increases the complexity of the process. The CNN model developed by Alexander et al. offers a more direct and efficient approach to automated peak detection. It excels in handling complex and noisy data, is trained on simulated datasets, and has an innovative advantage in predicting peak probabilities [95].

Concurrently, to enhance the efficiency of chromatographic peak detection in LC-HRMS data, Christoph et al. have introduced an ML tool named PeakBot to identify all local signal maxima in the chromatogram and extract them as standardized regions. These regions are then examined by a custom-trained CNN [96].

Although CNNs excel in peak detection, traditional CNNs are typically designed for image classification tasks and focus primarily on global features of the entire image. To harness the powerful feature extraction capabilities of deep learning and address the need for precise localization in object detection tasks, the R-CNN model was introduced [97].

Faster R-CNN, an advanced deep learning architecture, represents a significant improvement over R-CNN. It integrates a Region Proposal Network (RPN) with a CNN to simultaneously localize and classify multiple objects within an image. This approach has demonstrated exceptional performance in object detection tasks, offering more accurate localization and classification of peaks in chromatograms, along with high recall and precision rates [98].

In addition to the detection and classification of chromatographic peaks, the processing and transformation of MS data are crucial steps in metabolomics research. To achieve bidirectional conversion between centroided data and MS peak data, while addressing the issue of information loss in traditional centroiding processes, Samanipour et al. have developed the Cent2Prof software package. This package employs an adaptive centroiding algorithm that automatically adjusts peak width estimation based on the data and generates peak width information for each centroid. Furthermore, an RF model is utilized to predict MS peak widths, significantly enhancing the accuracy of peak reconstruction [99].
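The general idea of regressing MS peak widths from per-centroid properties with a random forest can be sketched as follows. The descriptors, the synthetic width model, and the forest settings are hypothetical placeholders for illustration; this is not the Cent2Prof feature set or model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
n = 2000

# Hypothetical centroid descriptors: m/z, log10(intensity), retention time (min).
mz = rng.uniform(100, 1000, n)
log_int = rng.uniform(3, 8, n)
rt = rng.uniform(0.5, 15, n)
X = np.column_stack([mz, log_int, rt])

# Synthetic "true" peak width (in m/z) with a weak dependence on m/z and intensity.
width = 0.002 + 1.5e-5 * mz - 5e-4 * (log_int - 3) + rng.normal(0, 5e-4, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, width, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_tr, y_tr)
print(f"MAE on held-out centroids: {mean_absolute_error(y_te, model.predict(X_te)):.5f}")
```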

To further enhance data processing efficiency and extract deeper features, autoencoders have also been ingeniously incorporated. Through unsupervised learning, autoencoders perform dimensionality reduction on raw chromatographic data, thereby laying a solid foundation for subsequent classification tasks.

PeakDetective, developed by Ethan Stancliffe and Gary J. Patti, utilizes deep learning models that employ unsupervised autoencoders for dimensionality reduction and feature extraction, as well as semi-supervised active learning models to distinguish true peaks from false ones. With only a small number of user-labeled peaks, PeakDetective can rapidly train classifiers to adapt to specific LC-MS methods and sample types, thereby maximizing dataset performance [6] (Fig. 12).

Fig. 12. PeakDetective workflow. Reprinted with permission from Ref. [6]. Copyright 2023, American Chemical Society.

The AutoMS tool introduces an innovative approach for peak detection in LC-MS data by employing denoising autoencoders (DAEs) from deep learning to extract and assess the quality of regions of interest (ROIs). This method integrates the advantages of Continuous Wavelet Transform (CWT) and deep learning peak picking. AutoMS accurately identifies peaks within ROIs and provides continuous evaluation. Through the use of DAEs, AutoMS reconstructs denoised signals, assesses peak quality, and allows users to set thresholds to control the false positive rate according to their experimental objectives [100].

Peak quality assessment is a critical step in the peak detection process, enhancing the accuracy of peak identification and providing a foundation for subsequent peak filtering. In one study, Kelsey et al. compared 24 classifiers and proposed a computational method that combines ML with peak quality metrics. This method aims to filter out low-quality peaks, marking significant progress towards developing automated tools for the reliable integration of peaks [101].

Peak detection is a crucial step in LC-MS analysis, directly influencing the accuracy and reliability of subsequent data interpretation, particularly in the context of NTA. In recent years, various algorithms, including CNNs, autoencoders, SVMs, and RFs, have been applied to peak detection. These algorithms optimize the peak detection process through diverse mechanisms such as feature extraction, anomaly detection, and classification, significantly enhancing the accuracy and precision of peak detection. Their potential and flexibility in handling complex data have brought new vitality to the development of MS and chromatographic data analysis. With advancements in computational power and continuous refinement of algorithms, these methods are expected to play an increasingly prominent role in future data analysis of LC-MS-based NTA.

3.2. Retention time prediction

In LC-MS-based NTA, RT prediction is pivotal in improving the precision of compound identification, narrowing down candidate structures, facilitating the identification of unknown compounds, reducing false positives, and enhancing global analytical capabilities [102,103].

Specifically, RT prediction enhances the accuracy of peak annotation, reducing false positives that arise from relying solely on accurate mass matching [104]. Furthermore, the application of RT prediction models can filter out numerous non-target chromatographic peaks, allowing the analytical focus to be directed towards potential compounds of interest. Particularly in the absence of reference standards, comparing the consistency between predicted and experimental RT provides strong supportive evidence for the preliminary identification of compounds [105].

Traditional ML approaches for RT prediction typically involve several steps. Initially, RT data for known compounds are collected using LC-MS technology. Subsequently, molecular descriptors reflecting the compounds' physicochemical properties are calculated using computational chemistry tools. An ML model is then developed to correlate these descriptors with the observed RTs. Finally, the model is trained and optimized to accurately predict the RTs of unknown compounds [104].

In the process of RT prediction, traditional ML methods have played a significant role. Although the basic procedure from data collection to model establishment and optimization is relatively fixed, there are many variables in the specific implementation. Different researchers will choose or develop models and methods suitable for their own needs and available resources.

Reza et al. established both linear and nonlinear models, such as k-Nearest Neighbors (kNN)-GA-MLR and kNN-GA-SVM, to predict the RTs of suspect compounds in broad-spectrum surveys, achieving significant results [106]. Furthermore, through feature selection and hyperparameter tuning, an automated workflow for the development of RT prediction models has been established by Yang et al., significantly improving the accuracy of the predictions [107].
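The descriptor-based workflow described above can be sketched in a few lines with RDKit descriptors and gradient boosting. The tiny SMILES/RT table, the descriptor selection, and the query compound are made-up illustrations; published studies use far larger curated training sets.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import GradientBoostingRegressor

# Tiny, made-up training table: (SMILES, retention time in minutes).
train = [
    ("CCO", 1.2), ("CCCCO", 2.9), ("c1ccccc1O", 4.1),
    ("CC(=O)OC1=CC=CC=C1C(=O)O", 5.0), ("CCCCCCCCO", 6.8),
    ("c1ccc2ccccc2c1", 8.5), ("CCN(CC)CC", 2.2),
    ("CC(C)Cc1ccc(cc1)C(C)C(=O)O", 7.6),
]

def featurize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),              # molecular weight
        Descriptors.MolLogP(mol),            # lipophilicity, closely tied to RP retention
        Descriptors.TPSA(mol),               # topological polar surface area
        Descriptors.NumRotatableBonds(mol),  # flexibility
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ]

X = np.array([featurize(s) for s, _ in train])
y = np.array([rt for _, rt in train])

model = GradientBoostingRegressor(random_state=0).fit(X, y)

query = "CCCCCC(=O)O"   # illustrative query compound
print(f"predicted RT: {model.predict([featurize(query)])[0]:.2f} min")
```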


The OPERA-RT Quantitative Structure-Property Relationship (QSPR) model further extends the application in this field. This model utilizes molecular descriptors as inputs, based on the same liquid chromatography method, to forecast the RTs of compounds [108].

However, the primary challenges in RT prediction studies arise from the scarcity of experimental data and the variability in experimental conditions. Limited data may compromise model accuracy [109], while significant variations in experimental parameters can hinder model generalization [110]. Moreover, the absence of reference standards and experimental RT values for non-target compounds further complicates the analysis [111].

In response to these challenges, researchers have explored a variety of approaches. Milinda et al. trained on data from 2157 compounds and integrated a Retention Index (RI) model to filter the set of candidate compounds, thereby effectively reducing the number of false positives and improving the performance of predictive fragment generation [112].

In addition to traditional methods for RT prediction, researchers have begun exploring advanced ML models, such as ANNs and GNNs. These sophisticated models demonstrate superior capability in handling complex and diverse data, thereby enhancing the accuracy and generalization of predictive models.

Due to the exceptional predictive and generalization capabilities of ANNs in various complex data processing tasks, researchers have naturally introduced this advanced ML model into the field of RT prediction. To validate the potential of ANNs in RT prediction, a series of experiments and tests have been conducted to assess their performance in practical applications.

Richard et al. validated the efficacy of ANNs in predicting RTs using an optimized four-layer backpropagation multilayer perceptron (MLP) model [105]. Concurrently, Kelly et al. employed ANNs to simulate and predict the RTs of 166 pharmaceuticals in wastewater extracts, further assessing the predictive capabilities of these networks [113]. Notably, Barron and McEneff pioneered the demonstration of ANN performance in predicting the chromatographic RTs of 1117 chemically diverse compounds in mixtures [114], providing a significant reference for research in this field.

Researchers have started to explore GNNs, which can directly process molecular structure information, aiming to further enhance predictive performance and model generalizability beyond ANNs. Guowang Xu's team explored the application of GNNs, proposing the GNN-RT method for predicting the liquid chromatographic RTs of small molecules. This model can learn and predict RTs directly from molecular structures without the need for pre-calculated molecular descriptors. Compared to other algorithms, GNN-RT showed superior predictive accuracy and exhibited good transfer learning capabilities [73]. They also found that the combination of GNNs and transfer learning demonstrated exceptional performance [74]. Subsequently, they developed a novel strategy named MetEx, which employs a transfer learning-based GNN-RT model to predict RTs across different chromatographic systems, thereby enhancing the accuracy and coverage of metabolite annotation in non-targeted LC-HRMS data. MetEx uses information entropy and the signal-to-noise ratio (SNR) for the identification of authentic chromatographic peaks and integrates MS/MS spectral similarity scores for automated metabolite annotation, significantly advancing the detection and annotation performance compared to existing software packages [7] (Fig. 13).

Fig. 13. Overview of MetEx to extract and annotate metabolites from UHPLC-HRMS data. Reprinted with permission from Ref. [7]. Copyright 2023, American Chemical Society.

However, individual ML models often have limitations. Therefore, combining ANNs or GNNs with other ML techniques can more effectively predict RTs. By integrating GNNs with transfer learning or utilizing ANNs in conjunction with other ML methods, not only is the accuracy of predictions enhanced, but a powerful tool is also provided for the identification and screening of compounds [115]. Moreover, multi-condition prediction is particularly crucial in LC-MS, as it enables more accurate identification and screening of compounds across various experimental setups. Our laboratory previously proposed a method for ranking chromatographic peaks based on multi-condition RT prediction. This method utilizes an RT-generator neural network to forecast RTs under various chromatographic conditions and assesses the similarity of isomers by comparing the discrepancies between predicted and actual measured values [116].

Given that ML algorithms exhibit different performance on different types of datasets [117], in addition to the deep learning models mentioned above, the development and utilization of datasets can also significantly advance the capability of ML in predicting RTs.

The METLIN database is a comprehensive repository containing a vast array of MS and RT data for small molecules. It includes a dataset known as METLIN SMRT, which contains RT information for 80,038 small molecules obtained through reverse-phase chromatography experiments. Utilizing the information contained within the METLIN dataset, researchers can effectively train ML models to predict the RTs of small molecules [109].
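The pretrain-then-fine-tune idea that underlies such transfer learning approaches can be sketched as below, assuming a large generic descriptor/RT dataset and a small method-specific one. The tensors here are random placeholders, not METLIN SMRT or any published dataset, and the network sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder tensors standing in for (descriptor, RT) pairs.
X_large, y_large = torch.randn(5000, 64), torch.randn(5000, 1)   # large "generic" RT dataset
X_small, y_small = torch.randn(200, 64), torch.randn(200, 1)     # small in-house method

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 1))
loss_fn = nn.MSELoss()

def train(x, y, params, epochs, lr):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return float(loss)

# 1) Pre-train all layers on the large generic dataset.
print("pretrain loss:", train(X_large, y_large, model.parameters(), epochs=50, lr=1e-3))

# 2) Fine-tune: freeze the shared feature extractor and retrain only the output head
#    on the small method-specific dataset.
for layer in list(model.children())[:-1]:
    for p in layer.parameters():
        p.requires_grad = False
print("fine-tune loss:", train(X_small, y_small, model[-1].parameters(), epochs=100, lr=1e-3))
```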


Building upon this foundation, Ju et al. proposed a novel deep neural network (DNN) approach. After being constructed and optimized on the METLIN dataset, this method achieved higher predictive accuracy across 17 different datasets through transfer learning [80].

Following a comprehensive evaluation of the performance of various ML algorithms on metabolomic datasets, researchers have delved into innovative applications of ML in the development of chromatographic methods. Alexander et al. have explored the application of reinforcement learning, specifically the Double Deep Q-Learning (DDQ) algorithm. The study demonstrates that the retention models derived from the scouting runs chosen by the algorithm exhibit comparable or superior performance in predicting retention factors, compared to models using all data points or those selected by chromatographers [118]. Additionally, they investigated the application of GCNs in chromatographic RT prediction. They compared two GCN models against seven benchmark models. The results demonstrated that GCNs generally outperformed or were at least comparable to existing models, providing a powerful tool for non-targeted metabolomic analysis [119].

Various ML models, including MLR, SVMs, GNNs, ANNs, GCNs, and DNNs, each offer unique advantages in RT prediction. The selection of the most suitable model hinges on several factors, such as the characteristics of the data, the complexity of the prediction task, and the need for model interpretability. In practical applications, it may be necessary to conduct experiments and fine-tune models to determine the optimal combination and parameter settings, thereby achieving the best RT prediction performance. Future research could explore innovative models and methods to enhance the accuracy and efficiency of RT prediction.

3.3. Collision Cross Section prediction

Collision Cross Section (CCS) not only provides crucial structural information in NTA but also facilitates cross-platform comparison and the development of new methodologies, while demonstrating significant potential in structural analysis and the identification of unknown substances [120,121].

Given the significance of CCS, accurate prediction of CCS values has become a focal point of research. These predictions not only significantly enhance the accuracy and efficiency of compound identification and characterization but also effectively address the scarcity of reference standards. Moreover, CCS prediction technology has facilitated the development of comprehensive, high-quality CCS databases [122]. These databases, in turn, further improve the accuracy of CCS predictions, creating a positive feedback loop that promotes the widespread application and advancement of Ion Mobility Mass Spectrometry (IM-MS) technology [123].

ML plays a significant role in predicting CCS values. The integration of ML with IM-MS technology can overcome the limitations associated with the lack of reference standards and facilitate the creation of proprietary CCS databases [122]. MaKayla et al. have developed a xenobiotic structural annotation workflow that integrates ion mobility spectrometry (IMS) with MS. This workflow leverages ML to predict CCS values for the identification and differentiation of potential xenobiotic classes and species within large metabolomic feature lists [124].

Zhou et al. presented AllCCS, an IM-CCS atlas for untargeted metabolomics, which leverages ML techniques to optimize the prediction of CCS values, thereby enhancing the precision and coverage of annotation for both known and unknown metabolites. In their study, ML models were employed to predict CCS values from a vast array of chemical structures and to augment the identification of metabolites through a multi-dimensional matching strategy that integrates experimental data, including m/z, RT, CCS, and MS/MS spectra [8] (Fig. 14).

Fig. 14. Overview of AllCCS atlas and annotation of known and unknown metabolites. Figure adapted from Ref. [8], under Creative Commons license.
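The descriptor-to-CCS regression workflow that the following paragraphs describe can be sketched with support vector regression on synthetic data. The descriptors, the synthetic CCS relationship, and the SVR hyperparameters below are illustrative assumptions; this is not the MetCCS or AllCCS implementation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
n = 400   # roughly the size of a small in-house CCS training set

# Hypothetical molecular descriptors: m/z, lipophilicity proxy, H-bond counts, ring count.
X = np.column_stack([
    rng.uniform(100, 900, n),     # m/z
    rng.normal(2.0, 1.5, n),      # lipophilicity proxy
    rng.integers(0, 8, n),        # H-bond donors + acceptors
    rng.integers(0, 5, n),        # rings
])

# Synthetic CCS values: dominated by m/z, perturbed by shape-related terms plus noise.
ccs = 120 + 0.18 * X[:, 0] + 2.5 * X[:, 1] - 1.0 * X[:, 3] + rng.normal(0, 4, n)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=1.0))
scores = cross_val_score(model, X, ccs, cv=5, scoring="neg_mean_absolute_error")
print(f"5-fold CV MAE: {-scores.mean():.1f}")
```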


compounds [122]. [136].


For instance, the CCSP 2.0 algorithm encodes molecular structures as In the prediction of CCS values, both traditional ML methods, such as
neutral InChI strings and utilizes molecular descriptors to generate an SVR, SVM, and PLS, and deep learning techniques, such as neural net­
SVR (Support Vector Regression) model, which can accurately predict works, have been extensively employed, demonstrating exceptional
the CCS values of unknown ions [125]. Similarly, Zhou et al. have performance. These ML models learn from extensive experimental data
employed the SVR algorithm, utilizing 14 commonly used molecular and integrate the physicochemical properties of compounds to establish
descriptors to predict CCS values. They measured the CCS values of predictive models that map molecular structures to CCS values. In
approximately 400 metabolites in nitrogen as training data, constructed practical applications, these predictive models have been integrated into
a hyperplane in high-dimensional space to perform regression, and ul­ various analytical software, significantly enhancing the accuracy and
timately established an SVR prediction method based on this training efficiency of CCS value predictions. Overall, with the continuous
dataset [126]. They further developed the MetCCS Predictor as a web-based server to extend and apply this methodology. Utilizing the SVR regression algorithm and molecular descriptors, it rapidly predicts the CCS values of metabolites, enhancing the accuracy of metabolite identification in non-targeted metabolomics [127].

In addition to SVR, SVM and Partial Least Squares (PLS) algorithms have also been applied to predict CCS values. Song et al. collected experimental CCS values for over a thousand compounds to develop predictive models. After evaluation, the SVM model based on Chemistry Development Kit (CDK) descriptors was found to be the most accurate [128]. Concurrently, they developed four models incorporating both SVM and PLS algorithms. These models were compared with existing CCS prediction tools and demonstrated satisfactory predictive performance [129].

After demonstrating the significant potential of ML models in predicting CCS values by integrating molecular descriptors and computational chemistry data [130], Robbin et al. explored how innovative model designs could further enhance predictive accuracy. To more accurately predict the CCS values of small molecules across various chemical categories, a novel hybrid model has been proposed. This model integrates molecular modeling and ML techniques, enhancing predictive accuracy. The study demonstrates how molecular modeling features and three-dimensional information can be used for CCS value prediction, offering a new perspective on CCS prediction [131].

Traditional ML algorithms face significant challenges in predicting CCS values, primarily due to limitations in dataset size and diversity. The restricted scale and limited variety of existing datasets constrain the accuracy and generalization capabilities of these models. Consequently, traditional machine learning approaches often perform poorly when predicting CCS values for previously unseen molecules, potentially resulting in excessive prediction errors in practical applications [132,133].

To address these challenges, researchers have begun exploring more advanced techniques, such as deep learning, with the aim of enhancing predictive accuracy and improving generalization ability.

Pier-Luc Plante and colleagues have successfully predicted the CCS values of compounds using deep learning algorithms, specifically CNNs. Their model learns an internal representation of compounds and integrates ion types for accurate predictions, demonstrating a high correlation between predicted and experimental values across multiple test datasets [134]. The development of the Multimodal Graph ATtention network for CCS (MGAT-CCS) model also offers a novel solution for predicting the CCS values of small molecules based on their chemical categories. By integrating graph attention networks and multimodal molecular representations, the model achieves accurate CCS predictions. It has demonstrated exceptional performance in predicting the CCS values of metabolites, with low relative errors, thus paving a new way for CCS value prediction [135].

In addition to the aforementioned ML models, the development and utilization of databases can significantly enhance the predictive capabilities of ML for CCS. Dylan et al. constructed a comprehensive database enriched with a vast array of CCS values and extracted features such as Molecular Quantum Numbers (MQNs). Compounds were categorized through unsupervised clustering, which was followed by the training of predictive models for each class. This approach not only improved the accuracy of predictions but also provided interpretability for the results [136]. With the continuous advancement and optimization of ML technology, its application in CCS value prediction and related fields is expected to become more widespread and in-depth.

3.4. MS spectra interpretation and prediction

MS spectra interpretation and prediction primarily encompass two key directions: predicting molecular structures from MS spectra and predicting MS spectra from molecular structures.

Predicting molecular structures from MS spectra involves utilizing ML algorithms to extract key features from MS data and transform them into interpretable molecular fingerprints. This approach aids in the structural identification of unknown compounds and offers new insights for large-scale data analysis in fields such as metabolomics.

Predicting MS spectra from molecular structures focuses on starting with the molecular structure to simulate its fragmentation behavior in a mass spectrometer, thus predicting potential MS spectra. This 'structure-to-spectrum' approach not only assists in the interpretation of MS data but also provides a theoretical basis for predicting the mass spectrometric behavior of new compounds.

3.4.1. Prediction of molecular structures from MS spectra

Predicting molecular structures, especially molecular fingerprints, from MS data involves employing computational approaches (ML techniques) to forecast the molecular fingerprint profiles of compounds. This process typically entails translating mass spectral signals into binary representations of molecular fingerprints, thereby facilitating compound identification and classification [137].

To achieve the prediction of molecular fingerprints, MS data are converted into vectors, which serve as input for training models that predict individual molecular fingerprint bits. These models are optimized using various algorithms. Once trained, the models can accurately predict the molecular fingerprints of unknown compounds. These predicted fingerprints can then be compared with the fingerprints of candidate structures in compound databases to facilitate compound identification [138].

Fragmentation trees, as efficient data structures, adeptly organize and interpret MS data, providing accurate and relevant input for deep learning models. This graph-theoretic construct is instrumental in organizing and storing data generated during mass spectrometric analyses. Particularly in MS/MS, fragmentation trees effectively represent the fragmentation processes of analytes and the interdependencies among the resulting fragment ions [139].

CSI:FingerID combines the automatic construction of fragmentation trees with ML algorithms to convert MS/MS data into predicted molecular structural fingerprints. Initially, the method employs a reference dataset of known molecular structures to train support vector machine models, enabling them to learn various properties within the molecular fingerprint. Subsequently, during the prediction phase, given the MS/MS data of an unknown compound, CSI:FingerID generates the compound's predicted molecular fingerprint and queries extensive molecular structure databases for the compound that best matches the predicted fingerprint. Through the optimization of scoring functions and the augmentation of training data, CSI:FingerID significantly enhances the accuracy and efficiency of the identification of unknown metabolite structures [9] (Fig. 15).

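To make the fingerprint-prediction workflow described above more concrete, the following minimal Python sketch illustrates the general idea with scikit-learn: an MS/MS peak list is binned into a fixed-length vector, one support vector classifier is trained per fingerprint bit, and the predicted fingerprint is compared against candidate fingerprints by Tanimoto similarity. The data here are synthetic stand-ins, and the helper names (bin_spectrum, predict_fingerprint) are illustrative only; this is not the CSI:FingerID implementation or any other published tool.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bin_spectrum(mz, intensity, max_mz=1000.0, bin_width=1.0):
    """Turn an MS/MS peak list into a fixed-length intensity vector."""
    vec = np.zeros(int(max_mz / bin_width))
    for m, i in zip(mz, intensity):
        idx = int(m / bin_width)
        if idx < vec.size:
            vec[idx] += i
    return vec / vec.max() if vec.max() > 0 else vec  # base-peak normalisation

# Synthetic training data: rows of X are binned spectra, columns of Y are fingerprint bits.
rng = np.random.default_rng(42)
X = rng.random((200, 1000))                        # 200 spectra x 1000 m/z bins
Y = (rng.random((200, 64)) > 0.5).astype(int)      # 200 known 64-bit fingerprints

# One linear SVM per fingerprint bit, mirroring the "one model per bit" strategy.
bit_models = [LinearSVC(C=1.0, max_iter=5000).fit(X, Y[:, b]) for b in range(Y.shape[1])]

def predict_fingerprint(spectrum_vec):
    """Predict each fingerprint bit independently for one binned spectrum."""
    return np.array([m.predict(spectrum_vec.reshape(1, -1))[0] for m in bit_models])

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

# Rank candidate structures from a (hypothetical) database against the prediction.
candidate_fps = {"candidate_A": Y[0], "candidate_B": Y[1], "candidate_C": Y[2]}
query_fp = predict_fingerprint(bin_spectrum(mz=[79.9, 121.1, 303.2],
                                            intensity=[15.0, 100.0, 42.0]))
ranking = sorted(candidate_fps, key=lambda k: tanimoto(query_fp, candidate_fps[k]),
                 reverse=True)
print("Best-matching candidate:", ranking[0])
```

In practice, the binned vector would be replaced by richer inputs such as fragmentation-tree features, and the candidate set would come from a structure database, but the bit-wise prediction and similarity ranking sketched here remain the core of the approach described above.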

To further advance exploration in this domain, the COSMIC workflow incorporates SVM and Kernel Density Estimation (KDE) techniques to enhance the accuracy of metabolite annotation and increase confidence in fingerprint spectrum prediction. As a non-parametric approach, KDE demonstrates a unique advantage in capturing the spatial distribution trends of metabolites, enabling the generation of Molecular Probability Maps (MPMs) [140]. This process integrates database generation, CSI:FingerID searching, and confidence scoring. As a result, it enables efficient annotation on large datasets with a low false discovery rate and demonstrates the potential to discover new metabolites [141].

SIRIUS 4, another advanced software tool, not only enhances identification rates through integration with CSI:FingerID but also supports automatic element detection, encompassing both positive and negative ion modes, as well as candidate structure scoring and retrieval [142]. Subsequently, the team of SIRIUS proposed a novel approach that integrates deep learning with kernel techniques by approximating kernel functions as linear features, thereby enhancing the accuracy of molecular fingerprint prediction from MS data. This method not only improves training efficiency for large-scale datasets but also strengthens metabolite identification under low-quality data conditions [143].

Additionally, MetFID, an ANN-based method, also predicts molecular fingerprints based on MS data, which aids in metabolite annotation. By learning the mapping between MS/MS spectra and molecular fingerprints, MetFID outperforms other tools on the test dataset, thereby enhancing the accuracy of metabolite annotation [144]. Subsequently, the research team of MetFID developed an advanced CNN model for predicting molecular fingerprints, thereby facilitating metabolite annotation. This model incorporated a novel mechanism to reduce the number of parameters, thus mitigating overfitting and improving accuracy. The study employed eight independent CNN models to process data from diverse instruments. In comparison to conventional SVM and MLP models, the CNN models exhibited superior performance in both prediction and ranking tasks [145].

To enhance the accuracy and efficiency of fingerprint predictions, researchers have developed a variety of innovative methods and tools. As part of these advancements, Samuel et al. proposed a neural network-driven workflow called MIST. This approach represents MS peaks as chemical formulas and leverages deep learning models to learn meaningful representations of the input mass spectra, thereby offering a novel perspective for the prediction of molecular fingerprints [146]. In the realm of deep learning frameworks, IDSL_MINT utilizes a Transformer model to process MS/MS data, converting MS/MS spectra into molecular fingerprints. This approach enhances the annotation rate in metabolomics and exposomics research by supporting customized molecular fingerprints and improving the annotation capabilities for specific chemical categories through specialized model training [147].

In the realm of fingerprint prediction, various deep learning models have been extensively applied and integrated into numerous analytical tools. These models are embedded into MS analysis software and platforms through diverse mechanisms, such as serving as core prediction engines, integrating with existing software, offering predictive services, and participating in data processing and feature extraction. This integration not only enhances the accuracy of fingerprint spectrum prediction but also significantly improves work efficiency. With the continuous advancement of ML technology, its application in fingerprint spectrum prediction and related fields is expected to become more extensive and profound.

3.4.2. Prediction of MS spectra from molecular structures

Predicting MS spectra from molecular structures, particularly through the analysis of molecular fragmentation patterns, is a technique that employs computational chemistry methods to simulate molecular fragmentation behavior and forecast MS spectral data [148]. This approach aims to predict the mass spectrometric behavior of compounds using theoretical models, without relying on actual experimental MS data. The core of this methodology lies in understanding the fragmentation rules of molecules under various ionization conditions. Based on this understanding, it simulates the potential fragment ions and their relative abundances, thereby constructing a predicted MS spectrum [149].

This predictive approach plays a pivotal role in metabolomics and the identification of small molecules. Researchers can systematically predict possible fragments of metabolites or compounds using various algorithms and match them with experimental tandem MS data, thereby enhancing the accuracy and reliability of identification. This approach not only improves the precision of identification but also strengthens its reliability by integrating additional information such as reference data, patent data, and RTs, providing a powerful tool for the comprehensive NTA of small molecules [150].

The general workflow for predicting MS spectra from molecular fragmentation patterns using ML involves several key steps. First, MS data is collected and preprocessed. Then, this data is used to train ML models to learn the molecular fragmentation patterns. Once trained, these models are used to predict MS spectra for new molecular structures. The accuracy of these predictions is evaluated by comparing them with experimental MS spectra [151].

After comprehending the intricate relationship between molecular fragmentation patterns and MS spectra, researchers have developed various algorithms to simulate and predict these fragmentation processes. The Competitive Fragmentation Modeling (CFM) algorithm is one such efficient and accurate method. It integrates multiple fragmentation modes and reaction pathways to more precisely predict and

Fig. 15. The learning phase of CSI:FingerID. Figure adapted from Ref. [9]. Copyright 2024, National Academy of Sciences.


match experimental spectra, thereby enhancing the accuracy of compound identification [152].

CFM-ID employs CFM algorithms to provide MS peak annotation, spectral prediction, and predictive ranking of candidate structures, thereby facilitating automated metabolite identification. Its superior performance and user-friendly interface make the interpretation of tandem MS data more accessible and intuitive [153]. With the release of CFM-ID 4.0, its performance across various datasets has been significantly enhanced, particularly regarding the prediction of MS peaks and compound identification. This version significantly enhances the accuracy of MS spectra predictions for a variety of compounds by refining molecular topological feature learning, incorporating ring-cleavage sequence modeling, and expanding the foundation of handwritten rule-based predictions [10] (Fig. 16).

Unlike CFM-ID, MassFormer employs Graph Transformer technology to simulate long-range atomic interactions within molecules and fine-tunes the parameters of spectral data through chemical pre-training tasks, thereby significantly enhancing the predictive accuracy. The performance of MassFormer across various datasets has demonstrated its potential and effectiveness as a novel approach for predicting tandem MS fragmentation patterns [154].

By employing computational methods and data-driven parameter optimization techniques, researchers have further enhanced the accuracy of MS fragmentation pattern predictions. The best-performing MAGMa and MIDAS tools have also been fine-tuned, achieving more precise metabolite identification in non-targeted metabolomics [155].

In addition to the aforementioned diverse fragmentation pattern algorithms, deep learning techniques, particularly GCNs and CNNs, have significantly contributed to enhancing the accuracy and efficiency of MS spectra prediction.

GCNs extract features directly from the graph representation of molecules, effectively mitigating the common issue of molecular structure information loss in traditional methods. These models can learn complex mapping relationships between molecular structures and MS peaks, thereby achieving superior predictive performance compared to previous technologies [156].

By integrating information about molecules and their structural motifs at the graph level, heterogeneous GNNs can effectively capture long-range dependencies within molecules, thereby enabling efficient prediction of MS data while maintaining low memory consumption [157].

Under the impetus of a cascade of innovative technologies, Jennifer et al. have introduced a lightweight neural network model named NEIMS. This model employs a novel architecture that integrates both forward and reverse predictive patterns to more accurately simulate fragmentation events in MS. It also optimizes predictions for high-quality spectral regions by considering physical phenomena. Demonstrating high-precision library matching performance within an expanded reference library, NEIMS stands out for its accuracy [158]. Furthermore, RASSP employs an innovative approach combining substructure enumeration with deep learning to generate probabilistic distributions of chemical subformulas and atomic subsets. This method not only enhances the accuracy of predictions but also enables the prediction of spectral data even with low data availability and high resolution [159].

Molecular fragmentation simulations have a broad and profound range of applications, with their core value lying in facilitating the development of innovative computational tools and high-precision algorithms. Researchers have successfully developed various algorithmic frameworks based on fragmentation mechanisms, which, through meticulous design, can effectively capture key physicochemical characteristics during the molecular fragmentation process. The innovative development of these ML methods and tools has not only greatly enriched the technical approaches to MS analysis but also provided robust computational support for understanding the intrinsic connections between molecular structure and properties. Looking ahead, molecular fragmentation simulations are expected to evolve towards more intelligent, high-precision directions and foster interdisciplinary collaborations.

3.5. De novo structure generation for compound identification

The predictive approaches described above fundamentally depend on databases. However, when the relevant structural information is absent from both real and in silico databases, these methods are no longer effective.

De Novo Structure Generation refers to the process of designing novel molecular structures from scratch, without reliance on existing molecular structure libraries [160]. In this domain, ML-assisted De Novo Structure Generation emerges as an innovative approach [161],

Fig. 16. The workflow of CFM-ID 4.0. Figure adapted from Ref. [10], under Creative Commons license.


integrating ML techniques to more efficiently explore the chemical space and generate molecular structures with optimized properties [162]. Furthermore, this method is capable of uncovering potential molecular characteristics and structural relationships, thereby fostering innovation and breakthroughs. Such innovative tools extend the capabilities of NTA, enabling exploration and discovery across a broader chemical space.

In the field of de novo structure generation, ML models have demonstrated significant utility. Specifically, RNNs and their derivatives, including LSTM networks and Gated Recurrent Units (GRUs), are extensively utilized for their ability to process sequential data and capture temporal dependencies.

Building upon this, Michael et al. introduced MSNovelist, a method for designing new molecules using ML. It predicts the molecular structure of unknown compounds by analyzing MS/MS data. Initially, it uses CSI:FingerID to predict molecular fingerprints, followed by the transformation of these fingerprints into SMILES sequences by an RNN, generating candidate molecular structures. Unlike traditional methods that rely on databases, MSNovelist can directly generate structural candidates from MS/MS data, making it particularly suitable for identifying novel compounds [11] (Fig. 17).

Eleni et al. introduced a deep learning framework named Spec2Mol, which employs an encoder-decoder architecture. By learning the embedded representation of MS data and reconstructing SMILES sequences, Spec2Mol enables direct translation from MS data to molecular structures. This method can recommend new molecular structures without a reference database, greatly expanding the possibilities for the identification and structural elucidation of novel molecules [163].

Beyond breakthroughs in direct molecular structure generation from mass spectrometry, researchers continue to explore the integration of generative models. DarkNPS innovatively incorporates data augmentation and an optimized LSTM model to generate potential NPS structures, significantly improving structural analysis accuracy when combined with MS/MS data [164]. To further integrate structure generation and property prediction, the ReLeaSE framework employs dual neural networks to simultaneously generate SMILES sequences and predict compound properties [165]. Additionally, the GENTRL model focuses on novel small molecule design by integrating reinforcement learning, variational inference, and tensor decomposition, further expanding the scope of deep generative model integration [166].

It is noteworthy that de novo structure generation for compound identification remains an emerging field in the current research landscape. In contrast, it has been widely applied and extensively studied in the domains of protein structure prediction and drug molecule design. Although de novo structure generation in analytical chemistry is still in its infancy, its potential for structural identification and molecular characterization should not be underestimated. With ongoing technological advancements and enhanced interdisciplinary collaboration, it is reasonable to anticipate that this field will achieve significant progress in the future, introducing new possibilities to analytical chemistry.

4. Perspective

4.1. Limitations of predictive models and prospects for optimization

Although ML-based NTA has become an important method for exploring unknown compounds, the application of ML predictive models still faces several challenges and limitations in complex chemical environments. Especially in the prediction of RT, CCS values, and fingerprints, the accuracy and reliability of existing models need to be improved.

Predicting RT and CCS values is essential for NTA, as accurate predictions can significantly enhance the precision and efficiency of substance identification. However, ML models face several common challenges in this prediction process, including data scarcity [113,114], limited model interpretability [106], difficulties in cross-platform generalization [117], discrepancies between theoretical and experimental data [126], and concerns regarding data quality and availability [128].

To address these challenges, researchers have proposed various strategies. For data scarcity, methods such as pseudo-labeling, semi-supervised learning [74], and transfer learning [114] have been employed to enhance model performance. To improve model interpretability, approaches include evaluating model attributes, using explanation techniques [167], and conducting feature importance analysis [115]. For cross-platform generalization, strategies like adversarial domain adaptation (ADA) [168], self-challenging mechanisms [169], data augmentation, and feature learning are utilized. To reconcile theoretical with experimental data discrepancies, models are trained and optimized using diversified datasets [134]. For issues related to data quality and availability, solutions involve expanding datasets, improving descriptor calculations, comparing algorithms, and integrating data sources [128].

ML models are also confronted with unpredictable dynamic problems such as complex molecular structures, changes in instrument conditions, and nonlinear relationships.

In the predictive modeling of RT and CCS values using ML, the complexity of small molecules presents numerous challenges. These include the intricacies of physicochemical factors, the confounding similarities among structurally similar compounds, and data dependency [170]. Researchers have adopted graph-based CNNs to tackle these challenges. This approach leverages data-driven features to optimize predictive performance [171]. By learning the graph structure of molecules, it effectively addresses the complex physicochemical properties and structural similarities of small molecules, while also reducing reliance on extensive datasets [172,173].

Fig. 17. Conceptual overview of MSNovelist. Figure adapted from Ref. [11], under Creative Commons license.
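As a concrete illustration of the candidate-scoring step that typically follows de novo generation, the short Python sketch below uses RDKit to discard chemically invalid SMILES (for example, valence violations) and to rank the surviving candidates against a predicted fingerprint by Tanimoto similarity. It is a minimal, generic sketch under simplifying assumptions: the candidate SMILES list and the "predicted" fingerprint are placeholders, and the code is not taken from MSNovelist or any other published tool.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Hypothetical output of a generative model: some strings are not valid molecules.
candidate_smiles = ["CCO", "c1ccccc1O", "C(C)(C)(C)(C)C", "CC(=O)Nc1ccc(O)cc1"]

# Stand-in for a fingerprint predicted from MS/MS data (here: paracetamol's own).
reference = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")
predicted_fp = AllChem.GetMorganFingerprintAsBitVect(reference, 2, nBits=2048)

scored = []
for smi in candidate_smiles:
    mol = Chem.MolFromSmiles(smi)          # returns None if parsing/valence checks fail
    if mol is None:
        print(f"rejected (invalid structure): {smi}")
        continue
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    scored.append((DataStructs.TanimotoSimilarity(predicted_fp, fp), smi))

for score, smi in sorted(scored, reverse=True):
    print(f"{score:.2f}  {smi}")
```

In a full workflow of the kind reviewed above, the candidates would come from an RNN or transformer decoder and the reference fingerprint from a spectrum-based predictor, but a validity filter followed by similarity ranking of this sort is a common final step.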


To address challenges posed by variations in instrument conditions and nonlinear relationships, it is advisable to consider incorporating instrument conditions as input parameters into the model [174]. Additionally, data augmentation techniques can be employed to simulate data under various chromatographic conditions. To better capture nonlinear relationships, feature engineering techniques can be applied to transform the raw features, rendering them more suitable for analysis [175].

In the application of ML techniques for predicting fingerprint spectra, a variety of unique and complex challenges arise beyond the common issues. The primary challenge lies in the laborious preprocessing steps required for spectral data, which are crucial for ensuring data quality [176]. Additionally, the vast amount of information contained in high-dimensional spectral data presents another significant challenge: how to extract the most relevant features [143]. More complex still is the susceptibility of spectral data to noise interference and its high variability, which pose serious threats to the accuracy and robustness of the model [177].

To address these challenges, firstly, data normalization and statistical significance testing can be employed to optimize preprocessing steps, thereby reducing the burden of data processing [178]. Secondly, to tackle issues associated with high-dimensional data, feature extraction algorithms (FEA) and feature selection algorithms (FSA) can effectively filter out the most representative features, thus enhancing model performance [179]. Lastly, to counteract the impact of noise on the model, deep learning and key band selection methods are introduced for denoising, further enhancing the model's stability and accuracy [180]. Through the integrated application of these strategies, the numerous challenges in predicting fingerprint spectra can be more effectively addressed.

4.2. Challenges in MS data analysis and strategies for enhancement

MS, as a powerful analytical tool, generates vast datasets for researchers. However, processing and interpreting these data can be a formidable task for ML. Peak detection, a critical step in MS data analysis, directly affects the accuracy of subsequent analyses due to its sensitivity and specificity. Molecular fragmentation in MS is essential for understanding compound structures, as it involves predicting potential fragment ions from MS data, which is vital for compound identification and structural elucidation.

In the application of ML models for peak detection, researchers encounter several challenges, including peak overlap in complex samples [68], the detection of low-abundance compounds [181], and the identification of isotopic patterns. To address the issue of peak overlap, CNNs [96] and other deep learning techniques [94] can be effectively utilized. For the detection of low-abundance compounds, semi-supervised deep learning methods [68] and CNNs [96] have demonstrated their potential. Furthermore, the identification of isotopic patterns can be resolved by employing improved isotopic cluster detection methods [180].

In the realm of ML tasked with MS fragmentation processes, models encounter multifaceted challenges. The primary challenge stems from the extensive diversity of compound structures [154]; these range from small to large molecules, each exhibiting unique physicochemical properties and fragmentation patterns.

To address the challenge posed by compound structural diversity, a range of advanced methodologies has been proposed to augment the models' understanding and predictive capabilities. Among these methods are the utilization of structural motifs to capture key molecular features, the application of GNNs and heterogeneous graph networks for dissecting complex molecular structures, and the integration of various data sources to provide a more comprehensive informational backdrop [157].

Another significant challenge arises from the complexity of fragmentation mechanisms [154]. Molecular fragmentation is a comprehensive phenomenon involving multiple physical and chemical processes, and its pathways may vary depending on specific chemical environments or energy conditions, increasing the difficulty of prediction. To address this challenge, researchers have adopted various strategies. First, by establishing fragmentation rules [182] and integrating them with data-driven methods [10], the fragmentation behavior under specific conditions can be simulated. Second, predicting the probability distribution of possible substructures and atomic subsets [159] provides a quantitative basis for the diversity of fragmentation outcomes. Furthermore, the proposal of specific hydrogen rearrangement rules [183] further refines the description of the fragmentation process.

4.3. Constraints of structure generation and prospects for innovation

With the rapid advancement of ML, de novo structure generation for compound identification has encountered numerous opportunities. However, despite the revolutionary changes that ML has brought to this field, the approach still faces inherent limitations and challenges.

Firstly, the selection of molecular representation methods, such as SMILES notation and graphical representation, is pivotal to the overall performance of a model. Different representation methods may lead to information loss or bias, thereby affecting the model's accuracy and reliability [184]. Consequently, when choosing a molecular representation method, it is imperative to carefully consider its ability to uniquely and precisely describe molecular structures while preserving comprehensive chemical information. Furthermore, it must be compatible with existing chemical information processing tools and databases to ensure data tractability and analyzability [185].

Secondly, the model's extrapolation capability is a critical issue that should not be overlooked. Although the model may perform admirably within the chemical space of the training data, its performance often significantly deteriorates when confronted with new, uncharted chemical spaces [186]. To address this challenge, advanced techniques such as transfer learning and reinforcement learning can be leveraged. Additionally, employing a diverse array of molecular descriptors and scoring functions can enhance the model's adaptability and predictive capabilities across various chemical spaces [160].

Lastly, the generated molecular structures may not always adhere to fundamental chemical rules, such as valency, which is an issue that requires particular attention [187]. To ensure the validity of the generated structures, additional filtering or constraint mechanisms need to be incorporated. In this regard, combining GANs with reinforcement learning [188], as well as employing advanced methods like molecular graph representation and GNNs [189], provides effective solutions. These approaches not only ensure that the generated molecular structures conform to chemical principles but also enhance the efficiency and innovation of molecular design.

5. Conclusion

This review provides a comprehensive overview of the latest advancements in the application of ML models in NTA. The article begins by discussing the challenges faced by traditional NTA and highlights the unique advantages that ML offers in this field. It then introduces the latest ML models currently applied in NTA. On the technical front, the paper delves into the application and effectiveness of ML models in peak detection, RT prediction, CCS value prediction, fingerprint prediction, MS spectra prediction, and de novo structure generation. Furthermore, the review objectively analyzes the shortcomings and limitations of ML applications in these fields and proposes a reference for future research directions. Overall, ML is of great significance for the enhancement of NTA technology. It not only significantly improves the efficiency and accuracy of data processing and analysis but also effectively promotes the in-depth development of NTA work. However, given the inherent characteristics of ML, its application in NTA still faces many limitations and challenges. In the future, with the continuous advancement of


technology and in-depth research, we have reason to believe that ML will play an increasingly important role in NTA, providing more effective means and methods to solve complex problems in this field.

Glossary

Field-specific terms and definitions:

liquid chromatography-mass spectrometry (LC-MS): A technique that combines the physical separation capabilities of liquid chromatography with the identification and quantification capabilities of mass spectrometry.
emerging contaminants (CECs): New or previously unrecognized pollutants that are becoming increasingly prevalent in the environment and may pose risks to human health or ecosystems.
High-Resolution Mass Spectrometry (HRMS): A mass spectrometry technique that provides highly accurate mass measurements, allowing for improved compound identification and characterization.
transformation products (TPs): Compounds formed from the degradation or metabolism of parent compounds in the environment or biological systems.
volatile organic compounds (VOCs): Organic chemicals that easily vaporize at room temperature and can be harmful to human health or the environment.
MS/MS: Tandem mass spectrometry, a technique where ions generated from a sample are further fragmented in a mass spectrometer to provide structural information about the original compound.
mass spectrometry imaging (MSI): A technique that combines mass spectrometry with imaging to provide spatial and chemical information about the distribution of compounds within a sample.
retention times (RT): The time it takes for a compound to elute from a chromatographic column.
mass-to-charge ratios (m/z): The ratio of the mass of an ion to its charge, used to identify compounds in mass spectrometry.
Total Ion Chromatograms (TICs): A plot of the total ion current detected by a mass spectrometer over time during a chromatographic run.
Extracted Ion Chromatograms (XICs): Chromatograms created by plotting the intensity of a specific m/z value or range over time.
Continuous Wavelet Transform (CWT): A mathematical technique used for signal processing that decomposes a signal into wavelets, which are small wave-like patterns that can provide information about the signal's frequency content at different scales.
regions of interest (ROIs): Specific areas within an image or dataset that are selected for further analysis because they are expected to contain important information.
Retention Index (RI): A standardized measure of a compound's retention time relative to a series of reference compounds.
signal-to-noise (SNR) ratio: A measure used in signal processing that compares the level of a desired signal to the level of background noise.
ion mobility spectrometry (IMS): A technique that separates ions based on their size and shape by measuring how quickly they move through a gas under the influence of an electric field.
Chemistry Development Kit (CDK): An open-source Java library designed to support the development of chemical informatics software. It provides data structures for chemical objects and associated algorithms.
Molecular Quantum Numbers (MQNs): A set of simple integer descriptors that count structural features of a molecule, such as atoms, bonds, polar groups, and topological elements, used to characterize and compare chemical structures.
Kernel Density Estimation (KDE): A non-parametric method for estimating the probability density function of a random variable.
Molecular Probability Maps (MPMs): Visualizations that show the probability distribution of molecular properties or the likelihood of certain chemical transformations.
SMILES: Simplified Molecular Input Line Entry System, a string notation used to represent chemical structures in a compact, text-based format.

CRediT authorship contribution statement

Zhuo-Lin Jin: Writing – original draft. Lu Chen: Writing – review & editing. Yu Wang: Funding acquisition. Chao-Ting Shi: Funding acquisition. Yan Zhou: Resources. Bing Xia: Writing – review & editing, Resources.

Funding

This work was financially supported by the Sichuan Science and Technology Program (No. 2024ZYD0173).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We deeply appreciate the resources provided by the University of Chinese Academy of Sciences and honor the contributions of those in analytical chemistry and machine learning. At the same time, we sincerely apologize to all esteemed colleagues in the related fields whose elegant studies are not included here.

Data availability

Data will be made available on request.

References

[1] B.L. Milman, I.K. Zhurkovich, The chemical space for non-target analysis, TrAC, Trends Anal. Chem. 97 (2017) 179–187.
[2] K.T. Peter, A.L. Phillips, A.M. Knolhoff, et al., Nontargeted analysis study reporting tool: a framework to improve research transparency and reproducibility, Anal. Chem. 93 (41) (2021) 13870–13879.
[3] B. Shao, H. Li, J. Shen, et al., Nontargeted detection methods for food safety and integrity, Annu. Rev. Food Sci. Technol. 10 (1) (2019) 429–455.
[4] J. Hollender, E.L. Schymanski, H.P. Singer, et al., Nontarget screening with high resolution mass spectrometry in the environment: ready to go?, 2017.
[5] C.H. Johnson, J. Ivanisevic, G. Siuzdak, Metabolomics: beyond biomarkers and towards mechanisms, Nat. Rev. Mol. Cell Biol. 17 (7) (2016) 451–459.
[6] E. Stancliffe, G.J. Patti, PeakDetective: a semisupervised deep learning-based approach for peak curation in untargeted metabolomics, Anal. Chem. 95 (25) (2023) 9397–9403.
[7] F. Zheng, L. You, W. Qin, et al., MetEx: a targeted extraction strategy for improving the coverage and accuracy of metabolite annotation in liquid chromatography–high-resolution mass spectrometry data, Anal. Chem. 94 (24) (2022) 8561–8569.
[8] Z. Zhou, M. Luo, X. Chen, et al., Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics, Nat. Commun. 11 (1) (2020) 4334.
[9] K. Dührkop, H. Shen, M. Meusel, et al., Searching molecular structure databases with tandem mass spectra using CSI:FingerID, Proc. Natl. Acad. Sci. USA 112 (41) (2015) 12580–12585.
[10] F. Wang, J. Liigand, S. Tian, et al., CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification, Anal. Chem. 93 (34) (2021) 11692–11700.
[11] M.A. Stravs, K. Dührkop, S. Böcker, et al., MSNovelist: de novo structure generation from mass spectra, Nat. Methods 19 (7) (2022) 865–870.
[12] Y. Cai, Z. Zhou, Z.J. Zhu, Advanced analytical and informatic strategies for metabolite annotation in untargeted metabolomics, TrAC, Trends Anal. Chem. 158 (2023) 116903.
[13] X.Y. Lu, H.P. Wu, H. Ma, et al., Deep learning-assisted spectrum–structure correlation: state-of-the-art and perspectives, Anal. Chem. 96 (20) (2024) 7959–7975.
[14] A.A. Aksenov, R. da Silva, R. Knight, et al., Global chemical analysis of biology by mass spectrometry, Nat. Rev. Chem. 1 (7) (2017) 54.
[15] Y. Fu, C. Zhao, X. Lu, et al., Nontargeted screening of chemical contaminants and illegal additives in food based on liquid chromatography–high resolution mass spectrometry, TrAC, Trends Anal. Chem. 96 (2017) 89–98.


[16] Y. Fu, Y. Zhang, Z. Zhou, et al., Screening and determination of potential risk [47] D.P. Gomari, A. Schweickart, L. Cerchietti, et al., Variational autoencoders learn
substances based on liquid chromatography–high-resolution mass spectrometry, transferrable representations of metabolomics data, Commun. Biol. 5 (1) (2022)
Anal. Chem. 90 (14) (2018) 8454–8461. 645.
[17] S. Khalid, T. Khalil, S. Nasreen, A survey of feature selection and feature [48] D.P. Kingma, P. Dhariwal, Glow: generative flow with invertible 1x1
extraction techniques in machine learning[C]. 2014 Science and Information convolutions, Adv. Neural Inf. Process. Syst. (2018) 31.
Conference, IEEE, 2014, pp. 372–378. [49] J. Ho, X. Chen, A. Srinivas, et al., Flow++: improving flow-based generative
[18] A.S.A. Alwabel, X.J. Zeng, Data-driven modeling of technology acceptance: a models with variational dequantization and architecture design[C]. International
machine learning perspective, Expert Syst. Appl. 185 (2021) 115584. Conference on Machine Learning, PMLR, 2019, pp. 2722–2730.
[19] T. Matsuo, H. Tsugawa, H. Miyagawa, et al., Integrated strategy for unknown [50] G. Papamakarios, E. Nalisnick, D.J. Rezende, et al., Normalizing flows for
EI–MS identification using quality control calibration curve, multivariate probabilistic modeling and inference, J. Mach. Learn. Res. 22 (57) (2021) 1–64.
analysis, EI–MS spectral database, and retention index prediction, Anal. Chem. 89 [51] C. Shi, M. Xu, Z. Zhu, et al., Graphaf: a flow-based autoregressive model for
(12) (2017) 6766–6773. molecular graph generation, arXiv preprint arXiv:2001.09382 (2020). htt
[20] W. Demuth, M. Karlovits, K. Varmuza, Spectral similarity versus structural ps://doi.org/10.48550/arXiv.2001.09382.
similarity: mass spectrometry, Anal. Chim. Acta 516 (1–2) (2004) 75–85. [52] N.C. Frey, V. Gadepally, B. Ramsundar, Fastflows: flow-based models for
[21] E. Quatrini, F. Costantino, G. Di Gravio, et al., Machine learning for anomaly molecular graph generation, arXiv preprint arXiv:2201.12419 (2022). htt
detection and process phase classification to improve safety and maintenance ps://doi.org/10.48550/arXiv.2201.12419.
activities, J. Manuf. Syst. 56 (2020) 117–132. [53] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new
[22] B. Gao, S.E. Holroyd, J.C. Moore, et al., Opportunities and challenges using non- perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
targeted methods for food fraud detection, J. Agric. Food Chem. 67 (31) (2019) [54] M. Tzelepi, P. Nousi, N. Passalis, et al., Representation Learning and retrieval
8425–8430. [M]//Deep Learning for Robot Perception and Cognition, Academic Press, 2022,
[23] Y. Han, L.X. Hu, T. Liu, et al., Discovering transformation products of pp. 221–241.
pharmaceuticals in domestic wastewaters and receiving rivers by using non-target [55] R. Pascanu, C. Gulcehre, K. Cho, et al., How to construct deep recurrent neural
screening and machine learning approaches, Sci. Total Environ. (2024) 174715. networks, arXiv preprint arXiv:1312.6026 (2013). https://doi.org/10.48550/arXi
[24] C.J. Chen, D.Y. Lee, J. Yu, et al., Recent advances in LC-MS-based metabolomics v.1312.6026.
for clinical biomarker discovery, Mass Spectrom. Rev. 42 (6) (2023) 2349–2378. [56] M. Li, X.R. Wang, Peak alignment of gas chromatography–mass spectrometry data
[25] Y. Feng, Y. Wang, B. Beykal, et al., A mechanistic review on machine learning- with deep learning, J. Chromatogr. A 1604 (2019) 460476.
supported detection and analysis of volatile organic compounds for food quality [57] M.A. Stravs, K. Dührkop, S. Böcker, et al., MSNovelist: de novo structure
and safety, Trends Food Sci. Technol. (2023) 104297. generation from mass spectra, Nat. Methods 19 (7) (2022) 865–870.
[26] Computational psychometrics, New Methodologies for a New Generation of [58] A. Sherstinsky, Fundamentals of recurrent neural network (RNN) and long short-
Digital Learning and Assessment: with Examples in R and Python[M], Springer term memory (LSTM) network, Phys. Nonlinear Phenom. 404 (2020) 132306.
Nature, 2022. [59] R.C. Staudemeyer, E.R. Morris, Understanding LSTM–a tutorial into long short-
[27] M. Mohammed, M.B. Khan, E.B.M. Bashier, Machine Learning: Algorithms and term memory recurrent neural networks, arXiv preprint arXiv:1909.09586
applications[M], Crc Press, 2016. (2019). https://doi.org/10.48550/arXiv.1909.09586.
[28] B. Mahesh, Machine learning algorithms-a review, Int. J. Sci. Res. 9 (1) (2020) [60] S. Osipenko, K. Botashev, E. Nikolaev, et al., Transfer learning for small molecule
381–386 [Internet]. retention predictions, J. Chromatogr. A 1644 (2021) 462119.
[29] S. Ray, A Quick Review of Machine Learning algorithms[C]//2019 International [61] K. Li, Y. Zhang, Y. Li, A false peak recognition method based on deep learning,
Conference on Machine Learning, Big Data, Cloud and Parallel Computing Chemometr. Intell. Lab. Syst. 238 (2023) 104849.
(COMITCon), IEEE, 2019, pp. 35–39. [62] A. Vaswani, Attention is all you need, Adv. Neural Inf. Process. Syst. (2017). htt
[30] G.M. Harshvardhan, M.K. Gourisaria, M. Pandey, et al., A comprehensive survey p://arxiv.org/abs/1706.03762.
and analysis of generative models in machine learning, Comput. Sci. Rev. 38 [63] A. Young, H. Röst, B. Wang, Tandem mass spectrum prediction for small
(2020) 100285. molecules using graph transformers, Nat. Mach. Intell. 6 (4) (2024) 404–416.
[31] A. Creswell, T. White, V. Dumoulin, et al., Generative adversarial networks: an [64] D. Bui-Thi, Y. Liu, J.L. Lippens, et al., TransExION: a transformer based
overview, IEEE Signal Process. Mag. 35 (1) (2018) 53–65. explainable similarity metric for comparing IONS in tandem mass spectrometry,
[32] K. Wang, C. Gou, Y. Duan, et al., Generative adversarial networks: introduction J. Cheminf. 16 (1) (2024) 61.
and outlook, IEEE/CAA J. Automat. Sinica 4 (4) (2017) 588–598. [65] J. Xue, B. Wang, H. Ji, et al., RT-Transformer: retention time prediction for
[33] F.J. Moreno-Barea, L. Franco, D. Elizondo, et al., Application of data metabolite annotation to assist in metabolite identification, Bioinformatics 40 (3)
augmentation techniques towards metabolomics, Comput. Biol. Med. 148 (2022) (2024) btae084.
105916. [66] A.D. Shrivastava, N. Swainston, S. Samanta, et al., MassGenie: a transformer-
[34] V.B. Mathema, K. Duangkumpha, K. Wanichthanarak, et al., CRISP: a deep based deep learning method for identifying small molecules from their mass
learning architecture for GC× GC–TOFMS contour ROI identification, simulation spectra, Biomolecules 11 (12) (2021) 1793.
and analysis in imaging metabolomics, Briefings Bioinf. 23 (2) (2022) bbab550. [67] J. Gu, Z. Wang, J. Kuen, et al., Recent advances in convolutional neural networks,
[35] Z. Rong, Q. Tan, L. Cao, et al., NormAE: deep adversarial learning model to Pattern Recogn. 77 (2018) 354–377.
remove batch effects in liquid chromatography mass spectrometry-based [68] J. Zeng, Y. Li, C. Wang, et al., Combination of in silico prediction and
metabolomics data, Anal. Chem. 92 (7) (2020) 5082–5090. convolutional neural network framework for targeted screening of metabolites
[36] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Adv. Neural from LC-HRMS fingerprints: a case study of “Pericarpium Citri Reticulatae-
Inf. Process. Syst. 33 (2020) 6840–6851. Fructus Aurantii”, Talanta 269 (2024) 125514.
[37] J. Wolleb, R. Sandkühler, F. Bieder, et al., Diffusion Models for Implicit Image [69] K. Mottershead, T.H. Miller, Application of deep learning to support peak picking
Segmentation ensembles[C]//International Conference on Medical Imaging with during non-target high resolution mass spectrometry workflows in environmental
Deep Learning, PMLR, 2022, pp. 1336–1348. research, Environmental Science: Advances 2 (6) (2023) 877–885.
[38] T. Amit, T. Shaharbany, E. Nachmani, et al., Segdiff: image segmentation with [70] N. Alygizakis, T. Giannakopoulos, N.S. Τhomaidis, et al., Detecting the sources of
diffusion probabilistic models, arXiv preprint arXiv:2112.00390 (2021). htt chemicals in the Black Sea using non-target screening and deep learning
ps://doi.org/10.48550/arXiv.2112.00390. convolutional neural networks, Sci. Total Environ. 847 (2022) 157554.
[39] J.K. Challis, X.O. Almirall, P.A. Helm, et al., Performance of the organic-diffusive [71] J. Zhou, G. Cui, S. Hu, et al., Graph neural networks: a review of methods and
gradients in thin-films passive sampler for measurement of target and suspect applications, AI open 1 (2020) 57–81.
wastewater contaminants, Environ. Pollut. 261 (2020) 114092. [72] Z. Wu, S. Pan, F. Chen, et al., A comprehensive survey on graph neural networks,
[40] J. Zeng, M. He, H. Wu, et al., Peak alignment for herbal fingerprints from liquid IEEE Transact. Neural Networks Learn. Syst. 32 (1) (2020) 4–24.
chromatography-high resolution mass spectrometry via diffusion model and bi- [73] Q. Yang, H. Ji, H. Lu, et al., Prediction of liquid chromatographic retention time
directional eigenvalues, Microchem. J. 167 (2021) 106296. with graph neural networks to assist in small molecule identification, Anal. Chem.
[41] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural 93 (4) (2021) 2200–2206.
networks, Science 313 (5786) (2006) 504–507. [74] Q. Yang, H. Ji, X. Fan, et al., Retention time prediction in hydrophilic interaction
[42] P. Vincent, H. Larochelle, Y. Bengio, et al., Extracting and composing robust liquid chromatography with graph neural network and transfer learning,
features with denoising autoencoders[C]. Proceedings of the 25th International J. Chromatogr. A 1656 (2021) 462536.
Conference on Machine Learning, 2008, pp. 1096–1103. [75] K. Bian, R. Priyadarshi, Machine learning optimization techniques: a Survey,
[43] A. Kensert, G. Collaerts, K. Efthymiadis, et al., Deep convolutional autoencoder classification, challenges, and Future Research Issues, Arch. Comput. Methods
for the simultaneous removal of baseline noise and baseline drift in Eng. (2024) 1–25.
chromatograms, J. Chromatogr. A 1646 (2021) 462093. [76] O. Sagi, L. Rokach, Ensemble learning: a survey, Wiley Interdiscipl. Revi.: Data
[44] S.J. Pelletier, M. Leclercq, F. Roux-Dalvai, et al., BERNN: enhancing classification Min. Knowl. Discov. 8 (4) (2018) e1249.
of Liquid Chromatography Mass Spectrometry data with batch effect removal [77] S. Han, J. Huang, F. Foppiano, et al., TIGER: technical variation elimination for
neural networks, Nat. Commun. 15 (1) (2024) 3777. metabolomics data using ensemble learning architecture, Briefings Bioinf. 23 (2)
[45] S.M. Colby, J.R. Nuñez, N.O. Hodas, et al., Deep learning to generate in silico (2022) bbab535.
chemical property libraries and candidate molecules for small molecule [78] B. Chen, C. Wang, Z. Fu, et al., RT-Ensemble Pred: a tool for retention time
identification in complex samples, Anal. Chem. 92 (2) (2019) 1720–1729. prediction of metabolites on different LC-MS systems, J. Chromatogr. A 1707
[46] D.P. Kingma, M. Welling, An introduction to variational autoencoders, Found. (2023) 464304.
Trends® in Mach. Learn. 12 (4) (2019) 307–392. [79] K. Weiss, T.M. Khoshgoftaar, D.D. Wang, A survey of transfer learning, J. Big data
3 (2016) 1–40.


[80] R. Ju, X. Liu, F. Zheng, et al., Deep neural network pretrained by weighted [111] L. Bijlsma, R. Bade, A. Celma, et al., Prediction of collision cross-section values for
autoencoders and transfer learning for retention time prediction of small small molecules: application to pesticide residue analysis, Anal. Chem. 89 (12)
molecules, Anal. Chem. 93 (47) (2021) 15651–15658. (2017) 6583–6589.
[81] P.H. Le-Khac, G. Healy, A.F. Smeaton, Contrastive representation learning: a [112] M.A. Samaraweera, L.M. Hall, D.W. Hill, et al., Evaluation of an artificial neural
framework and review, IEEE Access 8 (2020) 193907–193934. network retention index model for chemical structure identification in
[82] H. Guo, K. Xue, H. Sun, et al., Contrastive learning-based embedder for the nontargeted metabolomics, Anal. Chem. 90 (21) (2018) 12752–12760.
representation of tandem mass spectra, Anal. Chem. 95 (20) (2023) 7888–7896. [113] K. Munro, T.H. Miller, C.P.B. Martins, et al., Artificial neural network modelling
[83] L. Chen, B. Xia, Y. Wang, et al., CMSSP, A contrastive mass spectra-structure of pharmaceutical residue retention times in wastewater extracts using gradient
pretraining model for metabolite identification, Anal. Chem. 96 (42) (2024) liquid chromatography-high resolution mass spectrometry data, J. Chromatogr. A
16871–16881. 1396 (2015) 34–44.
[84] D. Ramachandram, G.W. Taylor, Deep multimodal learning: a survey on recent [114] L.P. Barron, G.L. McEneff, Gradient liquid chromatographic retention time
advances and trends, IEEE Signal Process. Mag. 34 (6) (2017) 96–108. prediction for suspect screening applications: a critical assessment of a
[85] J. Novak, A. Skriba, V. Havlicek, CycloBranch 2: molecular formula annotations generalised artificial neural network-based approach across 10 multi-residue
applied to imzML data sets in bimodal fusion and LC-MS data files, Anal. Chem. reversed-phase analytical methods, Talanta 147 (2016) 261–270.
92 (10) (2020) 6844–6849. [115] P.R. Haddad, M. Taraji, R. Szucs, Prediction of analyte retention time in liquid
[86] D. Velickovic, R.K. Chu, A.A. Carrell, et al., Multimodal MSI in conjunction with chromatography, Anal. Chem. 93 (1) (2020) 228–256.
broad coverage spatially resolved MS2 increases confidence in both molecular [116] X. Chen, W. Wu, H. Sun, et al., Development and application of a comprehensive
identification and localization, Anal. Chem. 90 (1) (2018) 702–707. nontargeted screening strategy for aristolochic acid analogues, Anal. Chem. 96 (5)
[87] C.A. Smith, E.J. Want, G. O’Maille, et al., XCMS: processing mass spectrometry (2024) 1922–1931.
data for metabolite profiling using nonlinear peak alignment, matching, and [117] R. Bouwmeester, L. Martens, S. Degroeve, Comprehensive and empirical
identification, Anal. Chem. 78 (3) (2006) 779–787. evaluation of machine learning algorithms for small molecule LC retention time
[88] M. Katajamaa, M. Orešič, Data processing for mass spectrometry-based prediction, Anal. Chem. 91 (5) (2019) 3694–3703.
metabolomics, J. Chromatogr. A 1158 (1–2) (2007) 318–328. [118] A. Kensert, G. Collaerts, K. Efthymiadis, et al., Deep Q-learning for the selection of
[89] R. Tautenhahn, C. Böttcher, S. Neumann, Highly sensitive feature detection for optimal isocratic scouting runs in liquid chromatography, J. Chromatogr. A 1638
high resolution LC/MS, BMC Bioinf. 9 (2008) 1–16. (2021) 461900.
[90] R. Spicer, R.M. Salek, P. Moreno, et al., Navigating freely-available software tools [119] A. Kensert, R. Bouwmeester, K. Efthymiadis, et al., Graph convolutional networks
for metabolomics analysis, Metabolomics 13 (2017) 1–16. for improved prediction and interpretability of chromatographic retention data,
[91] G. Libiseller, M. Dvorzak, U. Kleb, et al., IPO: a tool for automated optimization of Anal. Chem. 93 (47) (2021) 15633–15641.
XCMS parameters, BMC Bioinf. 16 (2015) 1–10. [120] J.C. May, C.B. Morris, J.A. McLean, Ion mobility collision cross section
[92] R. Smith, D. Ventura, J.T. Prince, LC-MS alignment in theory and practice: a compendium, Anal. Chem. 89 (2) (2017) 1032–1044.
comprehensive algorithmic review, Briefings Bioinf. 16 (1) (2015) 104–117. [121] A.P. France, L.G. Migas, E. Sinclair, et al., Using collision cross section
[93] J. Guo, T. Huan, Mechanistic understanding of the discrepancies between distributions to assess the distribution of collision cross section values, Anal.
common peak picking algorithms in liquid chromatography–mass spectrometry- Chem. 92 (6) (2020) 4340–4348.
based metabolomics, Anal. Chem. 95 (14) (2023) 5894–5902. [122] X. Li, H. Wang, M. Jiang, et al., Collision cross section prediction based on
[94] A.D. Melnikov, Y.P. Tsentalovich, V.V. Yanshole, Deep learning for the precise machine learning, Molecules 28 (10) (2023) 4050.
peak detection in high-resolution LC–MS data, Anal. Chem. 92 (1) (2019) [123] J.A. Picache, B.S. Rose, A. Balinski, et al., Collision cross section compendium to
588–592. annotate and predict multi-omic compound identities, Chem. Sci. 10 (4) (2019)
[95] A. Kensert, E. Bosten, G. Collaerts, et al., Convolutional neural network for 983–993.
automated peak detection in reversed-phase liquid chromatography, [124] M.K. Foster, M. Rainey, C. Watson, et al., Uncovering PFAS and other xenobiotics
J. Chromatogr. A 1672 (2022) 463005. in the dark metabolome using ion mobility spectrometry, mass defect analysis,
and machine learning, Environ. Sci. Technol. 56 (12) (2022) 9133–9143.
[96] C. Bueschl, M. Doppler, E. Varga, et al., PeakBot: machine-learning-based chromatographic peak picking, Bioinformatics 38 (13) (2022) 3422–3428.
[97] Z. Zou, K. Chen, Z. Shi, et al., Object detection in 20 years: a survey, Proc. IEEE 111 (3) (2023) 257–276.
[98] J. Zeng, H. Wu, M. He, Image classification combined with faster R-CNN for the peak detection of complex components and their metabolites in untargeted LC-HRMS data, Anal. Chim. Acta 1238 (2023) 340189.
[99] S. Samanipour, P. Choi, J.W. O’Brien, et al., From centroided to profile mode: machine learning for prediction of peak width in HRMS data, Anal. Chem. 93 (49) (2021) 16562–16570.
[100] H. Ji, J. Tian, Deep denoising autoencoder-assisted continuous scoring of peak quality in high-resolution LC–MS data, Chemometr. Intell. Lab. Syst. 231 (2022) 104694.
[101] K. Chetnik, L. Petrick, G. Pandey, MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC–MS metabolomics data, Metabolomics 16 (2020) 1–13.
[102] P. Bonini, T. Kind, H. Tsugawa, et al., Retip: retention time prediction for compound annotation in untargeted metabolomics, Anal. Chem. 92 (11) (2020) 7515–7522.
[103] S. Meshref, Y. Li, Y.L. Feng, Prediction of liquid chromatographic retention time using quantitative structure-retention relationships to assist non-targeted identification of unknown metabolites of phthalates in human urine with high-resolution mass spectrometry, J. Chromatogr. A 1634 (2020) 461691.
[104] M. Cao, K. Fraser, J. Huege, et al., Predicting retention time in hydrophilic interaction liquid chromatography mass spectrometry and its use for peak annotation in metabolomics, Metabolomics 11 (2015) 696–706.
[105] R. Bade, L. Bijlsma, T.H. Miller, et al., Suspect screening of large numbers of emerging contaminants in environmental waters using artificial neural networks for chromatographic retention time prediction and high resolution mass spectrometry data analysis, Sci. Total Environ. 538 (2015) 934–941.
[106] R. Aalizadeh, N.S. Thomaidis, A.A. Bletsou, et al., Quantitative structure–retention relationship models to support nontarget high-resolution mass spectrometric screening of emerging contaminants in environmental samples, J. Chem. Inf. Model. 56 (7) (2016) 1384–1398.
[107] S.R. Newton, R.L. McMahen, J.R. Sobus, et al., Suspect screening and non-targeted analysis of drinking water using point-of-use filters, Environ. Pollut. 234 (2018) 297–306.
[108] J. Yang, F. Zhao, J. Zheng, et al., An automated toxicity based prioritization framework for fast chemical characterization in non-targeted analysis, J. Hazard. Mater. 448 (2023) 130893.
[109] X. Domingo-Almenara, C. Guijas, E. Billings, et al., The METLIN small molecule dataset for machine learning-based retention time prediction, Nat. Commun. 10 (1) (2019) 5811.
[110] R. Bouwmeester, L. Martens, S. Degroeve, Comprehensive and empirical evaluation of machine learning algorithms for LC retention time prediction, bioRxiv (2018) 259168.
[125] C.K. Asef, M.A. Rainey, B.M. Garcia, et al., Unknown metabolite identification using machine learning collision cross-section prediction and tandem mass spectrometry, Anal. Chem. 95 (2) (2023) 1047–1056.
[126] Z. Zhou, X. Shen, J. Tu, et al., Large-scale prediction of collision cross-section values for metabolites in ion mobility-mass spectrometry, 2016.
[127] Z. Zhou, X. Xiong, Z.J. Zhu, MetCCS predictor: a web server for predicting collision cross-section values of metabolites in ion mobility-mass spectrometry based metabolomics, Bioinformatics 33 (14) (2017) 2235–2237.
[128] X.C. Song, N. Dreolin, E. Canellas, et al., Prediction of collision cross-section values for extractables and leachables from plastic products, Environ. Sci. Technol. 56 (13) (2022) 9463–9473.
[129] X.C. Song, N. Dreolin, T. Damiani, et al., Prediction of collision cross section values: application to non-intentionally added substance identification in food contact materials, J. Agric. Food Chem. 70 (4) (2022) 1272–1281.
[130] Y. Lai, J.P. Koelmel, D.I. Walker, et al., High-resolution mass spectrometry for human exposomics: expanding chemical space coverage, Environ. Sci. Technol. 58 (29) (2024) 12784–12822.
[131] R. Bouwmeester, K. Richardson, R. Denny, et al., Predicting ion mobility collision cross sections and assessing prediction variation by combining conventional and data driven modeling, Talanta 274 (2024) 125970.
[132] Z. Zhou, J. Tu, Z.J. Zhu, Advancing the large-scale CCS database for metabolomics and lipidomics at the machine-learning era, Curr. Opin. Chem. Biol. 42 (2018) 34–41.
[133] S.M. de Cripan, T. Arora, A. Olomí, et al., Predicting the predicted: a comparison of machine learning-based collision cross-section prediction models for small molecules, Anal. Chem. 96 (22) (2024) 9088–9096.
[134] P.L. Plante, É. Francovic-Fontaine, J.C. May, et al., Predicting ion mobility collision cross-sections using a deep neural network: DeepCCS, Anal. Chem. 91 (8) (2019) 5191–5199.
[135] C. Wang, C. Yuan, Y. Wang, et al., Predicting collision cross-section values for small molecules through chemical class-based multimodal graph attention network, J. Chem. Inf. Model. 64 (16) (2024) 6305–6315.
[136] D.H. Ross, J.H. Cho, L. Xu, Breaking down structural diversity for comprehensive prediction of ion-neutral collision cross sections, Anal. Chem. 92 (6) (2020) 4548–4557.
[137] V. Consonni, F. Gosetti, V. Termopoli, et al., Multi-task neural networks and molecular fingerprints to enhance compound identification from LC-MS/MS data, Molecules 27 (18) (2022) 5827.
[138] H. Ji, H. Deng, H. Lu, et al., Predicting a molecular fingerprint from an electron ionization mass spectrum with deep neural networks, Anal. Chem. 92 (13) (2020) 8649–8653.
[139] A. Vaniya, O. Fiehn, Using fragmentation trees and mass spectral trees for identifying unknown compounds in metabolomics, TrAC, Trends Anal. Chem. 69 (2015) 52–61.
[140] D. Abu Sammour, J.L. Cairns, T. Boskamp, et al., Spatial probabilistic mapping of metabolite ensembles in mass spectrometry imaging, Nat. Commun. 14 (1) (2023) 1823.
[141] M.A. Hoffmann, L.F. Nothias, M. Ludwig, et al., High-confidence structural annotation of metabolites absent from spectral libraries, Nat. Biotechnol. 40 (3) (2022) 411–421.
[142] K. Dührkop, M. Fleischauer, M. Ludwig, et al., Sirius 4: a rapid tool for turning tandem mass spectra into metabolite structure information, Nat. Methods 16 (4) (2019) 299–302.
[143] K. Dührkop, Deep kernel learning improves molecular fingerprint prediction from tandem mass spectra, Bioinformatics 38 (Supplement_1) (2022) i342–i349.
[144] Z. Fan, A. Alley, K. Ghaffari, et al., MetFID: artificial neural network-based compound fingerprint prediction for metabolite annotation, Metabolomics 16 (2020) 1–11.
[145] S. Gao, H.Y.K. Chau, K. Wang, et al., Convolutional neural network-based compound fingerprint prediction for metabolite annotation, Metabolites 12 (7) (2022) 605.
[146] S. Goldman, J. Wohlwend, M. Stražar, et al., Annotating metabolite mass spectra with domain-inspired chemical formula transformers, Nat. Mach. Intell. 5 (9) (2023) 965–979.
[147] S.F. Baygi, D.K. Barupal, IDSL_MINT: a deep learning framework to predict molecular fingerprints from mass spectra, J. Cheminf. 16 (1) (2024) 8.
[148] F. Allen, R. Greiner, D. Wishart, Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification, Metabolomics 11 (2015) 98–110.
[149] A.D. McEachran, I. Balabin, T. Cathey, et al., Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns, Sci. Data 6 (1) (2019) 141.
[150] C. Ruttkies, E.L. Schymanski, S. Wolf, et al., MetFrag relaunched: incorporating strategies beyond in silico fragmentation, J. Cheminf. 8 (2016) 1–16.
[151] I. Blaženović, T. Kind, H. Torbašinović, et al., Comprehensive comparison of in silico MS/MS fragmentation tools of the CASMI contest: database boosting is needed to achieve 93% accuracy, J. Cheminf. 9 (2017) 1–12.
[152] Y. Djoumbou-Feunang, A. Pon, N. Karu, et al., CFM-ID 3.0: significantly improved ESI-MS/MS prediction and compound identification, Metabolites 9 (4) (2019) 72.
[153] F. Allen, A. Pon, M. Wilson, et al., CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra, Nucleic Acids Res. 42 (W1) (2014) W94–W99.
[154] A. Young, H. Röst, B. Wang, Tandem mass spectrum prediction for small molecules using graph transformers, Nat. Mach. Intell. 6 (4) (2024) 404–416.
[155] D. Verdegem, D. Lambrechts, P. Carmeliet, et al., Improved metabolite identification with MIDAS and MAGMa through MS/MS spectral dataset-driven parameter optimization, Metabolomics 12 (2016) 1–16.
[156] B. Zhang, J. Zhang, Y. Xia, et al., Prediction of electron ionization mass spectra based on graph convolutional networks, Int. J. Mass Spectrom. 475 (2022) 116817.
[157] J. Park, J. Jo, S. Yoon, Mass spectra prediction with structural motif-based graph neural networks, Sci. Rep. 14 (1) (2024) 1400.
[158] J.N. Wei, D. Belanger, R.P. Adams, et al., Rapid prediction of electron–ionization mass spectrometry using neural networks, ACS Cent. Sci. 5 (4) (2019) 700–708.
[159] R.L. Zhu, E. Jonas, Rapid approximate subset-based spectra prediction for electron ionization–mass spectrometry, Anal. Chem. 95 (5) (2023) 2653–2663.
[160] K. Atz, L. Cotos, C. Isert, et al., Prospective de novo drug design with deep interactome learning, Nat. Commun. 15 (1) (2024) 3408.
[161] J. Meyers, B. Fabian, N. Brown, De novo molecular design and generative models, Drug Discov. Today 26 (11) (2021) 2707–2715.
[162] D.D. Martinelli, Generative machine learning for de novo drug discovery: a systematic review, Comput. Biol. Med. 145 (2022) 105403.
[163] E.E. Litsa, V. Chenthamarakshan, P. Das, et al., An end-to-end deep learning framework for translating mass spectra to de-novo molecules, Commun. Chem. 6 (1) (2023) 132.
[164] M.A. Skinnider, F. Wang, D. Pasin, et al., A deep generative model enables automated structure elucidation of novel psychoactive substances, Nat. Mach. Intell. 3 (11) (2021) 973–984.
[165] M. Popova, O. Isayev, A. Tropsha, Deep reinforcement learning for de novo drug design, Sci. Adv. 4 (7) (2018) eaap7885.
[166] A. Zhavoronkov, Y.A. Ivanenkov, A. Aliper, et al., Deep learning enables rapid identification of potent DDR1 kinase inhibitors, Nat. Biotechnol. 37 (9) (2019) 1038–1040.
[167] Z.C. Lipton, The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery, ACM Queue 16 (3) (2018) 31–57.
[168] D. Zhao, F. Liu, Cross-condition and cross-platform remaining useful life estimation via adversarial-based domain adaptation, Sci. Rep. 12 (1) (2022) 878.
[169] Z. Huang, H. Wang, E.P. Xing, et al., Self-challenging improves cross-domain generalization, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, Springer International Publishing, 2020, pp. 124–140.
[170] P.G. Boswell, J.R. Schellenberg, P.W. Carr, et al., A study on retention “projection” as a supplementary means for compound identification by liquid chromatography–mass spectrometry capable of predicting retention with different gradients, flow rates, and instruments, J. Chromatogr. A 1218 (38) (2011) 6732–6741.
[171] D.K. Duvenaud, D. Maclaurin, J. Iparraguirre, et al., Convolutional networks on graphs for learning molecular fingerprints, Adv. Neural Inf. Process. Syst. 28 (2015).
[172] H. Altae-Tran, B. Ramsundar, A.S. Pappu, et al., Low data drug discovery with one-shot learning, ACS Cent. Sci. 3 (4) (2017) 283–293.
[173] K. Yang, K. Swanson, W. Jin, et al., Analyzing learned molecular representations for property prediction, J. Chem. Inf. Model. 59 (8) (2019) 3370–3388.
[174] H. Xu, J. Lin, D. Zhang, et al., Retention time prediction for chromatographic enantioseparation by quantile geometry-enhanced graph neural network, Nat. Commun. 14 (1) (2023) 3095.
[175] M. Liu, C. Guo, L. Xu, An interpretable automated feature engineering framework for improving logistic regression, Appl. Soft Comput. 153 (2024) 111269.
[176] H. Shen, K. Dührkop, S. Böcker, et al., Metabolite identification through multiple kernel learning on fragmentation trees, Bioinformatics 30 (12) (2014) i157–i164.
[177] Z. Fan, A. Alley, K. Ghaffari, et al., MetFID: artificial neural network-based compound fingerprint prediction for metabolite annotation, Metabolomics 16 (2020) 1–11.
[178] X. Wei, X. Shi, S. Kim, et al., Data preprocessing method for liquid chromatography–mass spectrometry based metabolomics, Anal. Chem. 84 (18) (2012) 7963–7971.
[179] P. Ray, S.S. Reddy, T. Banerjee, Various dimension reduction techniques for high dimensional data analysis: a review, Artif. Intell. Rev. 54 (5) (2021) 3473–3515.
[180] W. Xie, Y. Li, X. Jia, Deep convolutional networks with residual learning for accurate spectral-spatial denoising, Neurocomputing 312 (2018) 372–381.
[181] R. Chaleckis, I. Meister, P. Zhang, et al., Challenges, progress and promises of metabolite annotation for LC–MS-based metabolomics, Curr. Opin. Biotechnol. 55 (2019) 44–50.
[182] S. Wolf, S. Schmidt, M. Müller-Hannemann, et al., In silico fragmentation for computer assisted identification of metabolite mass spectra, BMC Bioinf. 11 (2010) 1–12.
[183] H. Tsugawa, T. Kind, R. Nakabayashi, et al., Hydrogen rearrangement rules: computational MS/MS fragmentation and structure elucidation using MS-FINDER software, Anal. Chem. 88 (16) (2016) 7946–7958.
[184] J. You, B. Liu, Z. Ying, et al., Graph convolutional policy network for goal-directed molecular graph generation, Adv. Neural Inf. Process. Syst. 31 (2018).
[185] D.C. Elton, Z. Boukouvalas, M.D. Fuge, et al., Deep learning for molecular design—a review of the state of the art, Mol. Syst. Des. Eng. 4 (4) (2019) 828–849.
[186] G. Chen, Z. Shen, A. Iyer, et al., Machine-learning-assisted de novo design of organic molecules and polymers: opportunities and challenges, Polymers 12 (1) (2020) 163.
[187] M.A. Skinnider, R.G. Stacey, D.S. Wishart, et al., Chemical language models enable navigation in sparsely populated chemical space, Nat. Mach. Intell. 3 (9) (2021) 759–770.
[188] B. Sanchez-Lengeling, C. Outeiral, G.L. Guimaraes, et al., Optimizing distributions over molecular space: an objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC), 2017.
[189] N. De Cao, T. Kipf, MolGAN: an implicit generative model for small molecular graphs, arXiv preprint arXiv:1805.11973 (2018). https://doi.org/10.48550/arXiv.1805.11973.