Machine Learning in RNA Structure Prediction - Advances and Challenges


Please cite this article in press as: Zhang et al., Machine learning in RNA structure prediction: Advances and challenges, Biophysical Journal (2024), https://doi.org/10.1016/j.bpj.2024.01.026

Review

Machine learning in RNA structure prediction: Advances and challenges

Sicheng Zhang,1 Jun Li,1 and Shi-Jie Chen1,2,*
1Department of Physics and Institute of Data Science and Informatics, University of Missouri, Columbia, Missouri and 2Department of Biochemistry, University of Missouri, Columbia, Missouri

ABSTRACT RNA molecules play a crucial role in various biological processes, with their functionality closely tied to their structures. The remarkable advancements in machine learning techniques for protein structure prediction have shown promise in the field of RNA structure prediction. In this perspective, we discuss the advances and challenges encountered in constructing machine learning-based models for RNA structure prediction. We explore topics including model building strategies, specific challenges involved in predicting RNA secondary (2D) and tertiary (3D) structures, and approaches to these challenges. In addition, we highlight the advantages and challenges of constructing RNA language models. Given the rapid advances of machine learning techniques, we anticipate that machine learning-based models will serve as important tools for predicting RNA structures, thereby enriching our understanding of RNA structures and their corresponding functions.

Submitted November 29, 2023, and accepted for publication January 24, 2024.
*Correspondence: [email protected]
Sicheng Zhang and Jun Li contributed equally to this work.
Editor: Meyer Jackson.
https://doi.org/10.1016/j.bpj.2024.01.026
© 2024 Biophysical Society.
Biophysical Journal 123, 1–11, October 1, 2024

INTRODUCTION

RNA molecules play a crucial role in diverse biological processes. Their functions include gene expression regulation, catalyzing biochemical reactions, and genetic information translation. For example, transfer RNA (tRNA) acts as a carrier of amino acids during the process of protein synthesis, messenger RNA (mRNA) transfers genetic information from DNA, while ribosomal RNAs (rRNAs) constitute a primary constituent of ribosomes and catalyze protein synthesis (1). Noncoding RNAs (ncRNAs) do not translate into proteins but play a critical role in many biochemical reactions such as epigenetic regulation and synaptic transmission (1,2). The wide range of functions associated with RNA is connected to its capacity to adopt proper secondary (2D) and tertiary (3D) structures. Therefore, accurate determination of RNA structures through experimental techniques and computational modeling holds significant importance not only in understanding RNA functions, but also in drug and therapeutic designs.

Despite the remarkable advances in experimental determination of RNA structures using x-ray crystallography, NMR spectroscopy, cryogenic electron microscopy, and chemical probing techniques, such as Selective 2′-Hydroxyl Acylation analyzed by Primer Extension (SHAPE) (3,4), for RNA structure probing, structural biology experiments alone are unable to keep pace with the growth of biologically significant RNA sequences. There exists a large gap between the known structures and the sequences. Therefore, computational structure prediction becomes a much needed complementary method for RNA structure determination.

Traditional structure prediction models have achieved significant progress in RNA structure prediction, as demonstrated by the RNA-Puzzles competition (5–8). However, these models still face certain challenges. For RNA 2D structure prediction, thermodynamic models such as RNAfold (9,10), Mfold (11), RNAstructure (12), and Vfold2D (13–19), search for thermodynamically most stable structures from the RNA sequence. The prediction accuracy is hindered by the limited number of experimentally determined thermodynamic parameters and the challenges posed by the sampling and free energy evaluation of pseudoknots and noncanonical basepairs. For RNA 3D structure prediction, the motif/loop structural template-based methods (Vfold3D (20), VfoldLA (21), 3dRNA (22,23), and RNAComposer (24)) often fail to find the appropriate templates for effective conformational sampling due to the limited size of the available template library extracted from the solved RNA structures. For the de novo RNA 3D structure prediction methods (NAST (25), iFoldRNA (26,27), oxRNA (28), MC-Sym (29), FARFAR (30,31), HiRE-RNA (32), SPQR (33), SimRNA (34), BRiQ (35), IsRNA (36–38), RNAJP (39)), they encounter the challenges of performing efficient conformational sampling in all-atom
or coarse-grained 3D space and constructing accurate scoring/energy functions to guide the conformational sampling and rank the structures. Considering these limitations, machine learning-based methods have emerged in recent years and have gained increasing significance in the field of RNA structure prediction. Compared with traditional structure prediction models, machine learning-based methods have several distinct features. First, machine learning-based methods utilize neural networks and deep learning techniques, which have demonstrated remarkable success in various domains such as Natural Language Processing (NLP) (40,41). Second, these methods heavily rely on known data sets to train and optimize the model, including known RNA structures, sequences, and sequence alignments. However, they do not require information such as energy parameters/functions derived from experiments or physics-based (or empirical) calculations. Third, the utilization of neural networks enables machine learning-based methods to capture more intricate relationships and patterns between sequence features and structures.

Machine learning techniques, particularly deep learning techniques, have shown remarkable success in protein structure prediction. Notably, AlphaFold2 (42) participated in the community-wide experiment on the Critical Assessment of Structure Prediction (CASP) (43), a blind test for evaluating protein structure prediction methods, and achieved a groundbreaking outcome (44,45). AlphaFold2 employs an end-to-end deep neural network to directly predict protein structures from amino acid sequences. Inspired by this tremendous breakthrough, similar approaches that utilize an end-to-end framework to predict RNA structures have emerged in recent years. These include, for example, E2Efold-3D (46), RoseTTAFoldNA (47), and DRfold (48). However, these RNA structure prediction models have not yet achieved the same level of success as AlphaFold2 in protein structure prediction. In the following sections, we offer an in-depth analysis of the progress and challenges in machine learning approaches to RNA structure prediction. Other notable machine learning-based models for protein structure prediction include trRosetta (49) and RoseTTAFold (50), which exhibit high accuracy in predicting protein 3D structures or geometric constraints. Similar to proteins, an RNA sequence has a limited alphabet (nucleotides) and the sequence to structure mapping leads to different levels of structural complexity (1D, 2D, and 3D). The remarkable achievements of machine learning techniques in protein structure prediction suggest their promising potential in RNA structure prediction.

In recent years, significant progress has been achieved in the field of natural language processing. These advancements can be attributed to breakthroughs in masked language models (51), attention-based models (40), and self-supervised learning (41), which have been made possible by the availability of increasingly powerful computational resources. These advanced techniques have been introduced to solve protein-related problems (52), and have led to the development of several notable protein language models, such as Evolutionary Scale Modeling (ESM) (53) and ProGen (54). Similarly, the application of RNA language models in RNA structure prediction has gained significant importance and shows great promise (55,56).

In this review and perspective, we highlight the advances and challenges in using machine learning approaches to predict RNA structures. Specifically, we first discuss the latest advancements and model building strategies for predicting RNA secondary (2D) and tertiary (3D) structures. Next, we explore the specific challenges and strategies involved in predicting RNA structures. Finally, we discuss the advantages and challenges encountered in constructing the RNA language model.

ADVANCES AND BUILDING STRATEGIES FOR MACHINE LEARNING-BASED MODELS

2D structure prediction

Although RNAs may fold along a nonhierarchical pathway due to strong tertiary interactions (57,58), for most RNAs the folding process is hierarchical, with the initial folding of the RNA into a 2D structure through basepairing, followed by the stabilization of the 3D structure. Therefore, the prediction of RNA’s 2D structure, or the basepairing pattern, is a crucial initial step in RNA 3D structure prediction. In recent years, machine learning-based models have been effectively employed in predicting RNA 2D structures.

Machine learning-based approaches for predicting the 2D structure of RNA can be roughly classified into three types. The first type of approach directly predicts the 2D structure defined by all the basepairs in the structure. The predicted 2D structures can be represented in the bpseq or dot-bracket formats. Popular models that adopt this approach include DMfold (59) and CDPfold (60). DMfold, for example, employs multilayer long short-term memory networks to predict the pairing states of each nucleotide within the given RNA sequence. Subsequently, it predicts the 2D structure of the RNA in dot-bracket format based on the predicted basepairing pattern. The second type of approach aims to predict a 2D contact map, which is represented as a matrix containing the scores for all the possible basepairs. This contact map can then be refined and converted into the 2D structure, or alternatively employed as distance constraints in the 3D structure modeling. Models using the second type of approach include E2Efold (61), Ufold (62), SPOT-RNA (63), SPOT-RNA2 (64), CNNFold (65), RNAformer (66), and 2dRNA (67). For example, SPOT-RNA treats the 2D structure of RNA as a 2D contact map and uses a hybrid network consisting of ResNets and 2D-BLSTMs for the 2D structure prediction. By treating the 2D structure as a contact map, the model effectively captures information from the entire sequence. The third type of approach entails predicting folding scores. MXfold2 (68) adopts this approach
by integrating neural networks with thermodynamic parameters to calculate the folding scores for each nucleotide. Subsequently, the predicted folding scores are employed to predict the optimal secondary structure using Mfold-style dynamic programming (69).
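
As a rough illustration of how a matrix of pairing scores can be turned into a dot-bracket structure by dynamic programming, the sketch below uses a simplified Nussinov-style recursion. It is not the Mfold-style algorithm or the MXfold2 implementation: the function names `toy_pair_scores`, `max_score_fold`, and `to_dot_bracket`, the score values, and the `min_loop` parameter are assumptions made for illustration, and the toy scores merely stand in for what a trained network might output. The recursion yields only nested (pseudoknot-free) structures.

```python
def toy_pair_scores(seq):
    """Hypothetical stand-in for network output: a matrix of pairing scores."""
    strength = {"GC": 3.0, "CG": 3.0, "AU": 2.0, "UA": 2.0, "GU": 1.0, "UG": 1.0}
    return [[strength.get(a + b, 0.0) for b in seq] for a in seq]


def max_score_fold(scores, min_loop=3):
    """Return the nested set of pairs (i, j) that maximizes the summed score."""
    n = len(scores)
    # best[i][j] holds the best total score achievable on the segment i..j.
    best = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                left = best[i][k - 1] if k > i else 0.0
                value = max(value, left + scores[k][j] + best[k + 1][j - 1])
            best[i][j] = value

    # Trace back through the table to recover one optimal set of pairs.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            left = best[i][k - 1] if k > i else 0.0
            if best[i][j] == left + scores[k][j] + best[k + 1][j - 1]:
                pairs.append((k, j))
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return pairs


def to_dot_bracket(length, pairs):
    """Render a list of nested pairs in dot-bracket notation."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i], symbols[j] = "(", ")"
    return "".join(symbols)


sequence = "GGGAAAUCCC"
print(to_dot_bracket(len(sequence), max_score_fold(toy_pair_scores(sequence))))
```

Because the traceback only ever splits a segment into an outer and an inner part, a contact map that contains pseudoknotted pairs would need a different post-processing step than this one.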
Although many machine learning-based models are developed based on a specific neural network architecture, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), bidirectional long short-term memory networks (bi-LSTMs), and transformer networks, certain models adopt a combined approach to enhance their prediction accuracy. For example, SPOT-RNA (63) and MXfold2 (68) leverage both CNNs and bi-LSTMs in their network architecture. E2Efold (61), on the other hand, employs transformer networks and CNNs to predict the 2D contact map.

FIGURE 1 Schematic overview of the machine learning-based model for RNA structure prediction. The workflow is divided into distinct components: data set, input, and output of the neural networks. The predictions of 2D and 3D structures utilize different network inputs and outputs, which are represented by blue and green, respectively. To see this figure in color, go online.

3D structure prediction

In recent years, significant advancements have been achieved in RNA 3D structure prediction through the utilization of machine learning-based approaches (56,70). These approaches can be categorized into three principal strategies, emphasizing the crucial components of sampling and scoring, which also underpin the importance of the physics of the problem. The first strategy develops scoring functions designed to evaluate RNA 3D structures that are generated by other methods. Notable examples in this category include RNA3DCNN (71), ARES (72), and PaxNet (73), which use deep learning neural networks to predict RMSD from the geometric features extracted from RNA 3D structures. The second strategy focuses on the development of geometric potentials and various constraints such as pairwise distances, basepairing interactions, and torsion angles. These geometric potentials and constraints are used to guide the generation of RNA 3D structures. The methods falling into this category include DeepFoldRNA (74), trRosettaRNA (75), and epRNA (76). For example, trRosettaRNA uses a transformer network, RNAformer, to convert a multiple sequence alignment (MSA) and a predicted secondary structure into various 1D and 2D distances, angles, and torsional angles. These predicted geometries are then utilized as restraints to fold RNA 3D structures based on energy minimization. The third strategy entails an integrated approach where the sampling and scoring processes are combined into an end-to-end framework. The methods in this category include E2Efold-3D (46), RoseTTAFoldNA (47), and DRfold (48). It is noteworthy that RoseTTAFoldNA can predict not only RNA 3D structures but also the 3D structures for protein-DNA and protein-RNA complexes.

Here, we focus on the latter two types of machine learning-based methods, as they have the capability to directly generate RNA 3D structures from sequences, while the first category of methods relies on externally generated structural candidates for the structure prediction. Most of the concerned methods leverage transformer-based techniques in their neural networks, albeit with distinct architectural variations. In terms of input data for these neural networks, some exclusively utilize MSA data (as seen in E2Efold-3D (46) and RoseTTAFoldNA (47)), while others incorporate MSA in conjunction with predicted secondary structures (as in DeepFoldRNA (74) and trRosettaRNA (75)). Alternatively, certain methods rely solely on predicted secondary structures (as demonstrated by DRfold (48)).

CHALLENGES IN DEVELOPING MACHINE LEARNING-BASED MODELS FOR RNA STRUCTURE PREDICTION

Fig. 1 shows a schematic overview of the machine learning-based model for RNA structure prediction. This overview showcases the fundamental components within the architecture of the machine learning-based model, including the inputs and outputs of networks, and the data set employed for training, validation, and testing. In this section, we will analyze these components and discuss the challenges encountered by machine learning-based models in predicting RNA structures, along with potential solutions.

Inputs of neural networks

Machine learning-based models for predicting RNA 2D structures typically use two types of inputs: RNA sequences or MSAs. To improve the prediction accuracy, SPOT-RNA2 incorporates more 2D structural features as inputs to the neural networks (64). These features include the basepairing probability map computed from the sequence, as well as the Position Specific Score Matrix (PSSM) and 2D Direct Coupling Analysis (DCA) computed from the MSA. In the future, incorporating additional structural features, including thermodynamic or covariation-predicted features, may lead to further improvements in the accuracy of the predicted RNA structures.
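
To make the notion of an MSA-derived input feature concrete, the following minimal sketch computes a generic log-odds position-specific score matrix from a small alignment. It is an illustration only, not the feature pipeline of SPOT-RNA2 or any other cited model: the function name `pssm`, the uniform background frequency, the pseudocount, and the toy alignment are all assumptions.

```python
import math

ALPHABET = "ACGU"


def pssm(msa, pseudocount=1.0, background=0.25):
    """Return one dict of log2-odds scores per alignment column."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for col in range(length):
        # Start from the pseudocount so unseen nucleotides get a finite score.
        counts = {base: pseudocount for base in ALPHABET}
        for row in msa:
            base = row[col].upper()
            if base in counts:  # gaps and ambiguity codes are skipped
                counts[base] += 1.0
        total = sum(counts.values())
        matrix.append(
            {base: math.log2(counts[base] / total / background) for base in ALPHABET}
        )
    return matrix


toy_alignment = ["GGAC", "GCAC", "G-AC"]  # hypothetical aligned sequences
profile = pssm(toy_alignment)
for position, column in enumerate(profile):
    print(position, {base: round(score, 2) for base, score in column.items()})
```

Positive scores mark nucleotides that are enriched at a column relative to the background, which is the kind of per-position conservation signal a network can consume alongside the raw sequence.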


To predict RNA 3D structures, recently developed machine learning-based models employ either predicted 2D structures (48), MSAs (46,47), or a combination of both (74,75) as inputs to the neural networks. The accuracy of the predicted 2D structures plays an important role in influencing the performance of these machine learning methods. With the continuous advancements in RNA 2D structure prediction, the overall effectiveness of these methods for RNA 3D structure prediction is expected to be improved correspondingly. In addition, precise 2D structure information can also be obtained from experiments such as SHAPE or other chemical probing experiments.

MSAs contain evolutionary information, such as conserved basepairs, thereby offering rich structure information for the learning process in both RNA 2D and 3D structure predictions. Previous studies indicate that methods incorporating coevolutionary information derived from MSAs tend to exhibit a superior performance in predicting RNA structures (48). However, constructing effective MSAs can pose a challenge, particularly when dealing with a limited number of homologous RNA sequences. In addition, the creation of MSAs can be time-consuming, especially for large-scale genomic data sets.

Outputs of neural networks

In machine learning-based models for 2D structure prediction, the neural networks produce three types of outputs: the 2D structure itself, the 2D contact map, or the folding scores (in MXfold2 (68)). Among these outputs, the 2D contact map provides the best flexibility for subsequent analysis. This is due to its ability to be transformed into conventional 2D structure annotations and its capacity to infer noncanonical basepairs and pseudoknots. As a result, many machine learning-based models that can predict pseudoknots or noncanonical basepairs opt to use the 2D contact map as the networks’ output, rather than the 2D structures represented in dot-bracket notation (56). Remarkably, the 2D contact map closely resembles the basepairing probability map predicted by thermodynamic models, which can be used to infer suboptimal structures. Therefore, there is a potential for future studies to leverage machine learning-based models in predicting suboptimal structures and generating an ensemble of 2D structures. Considering the intrinsic conformational heterogeneity of RNAs, the ability to predict alternative structures is crucial to the understanding of RNA function.

For RNA 3D structure prediction, the outputs of the neural networks vary across different machine learning-based methods. For instance, DeepFoldRNA (74) and trRosettaRNA (75) utilize the neural networks to produce a series of geometric constraints. The predicted RNA 3D structures are then derived and refined through simulations that adhere to the geometric constraints. In contrast, other approaches, such as E2Efold-3D (46), DRfold (48), and RoseTTAFoldNA (47), implement an end-to-end strategy that enables the direct computation of RNA 3D structures. These end-to-end models have the potential to enhance the speed and accuracy of RNA 3D structure prediction. Such enhancements may be achieved by circumventing the intermediate steps of translating geometric constraints into 3D structural models, a process that can be compromised by the ambiguity of those constraints, leading to difficulties in the subsequent prediction and optimization stages.

An additional issue with most of the current machine learning-based models is the limitation that a single, static RNA 3D structure is produced. This limitation hinders their capacity to elucidate the diverse transitional configurations that play a crucial role in the biological functionality of RNA molecules. In the protein structure prediction community, there is a shift toward predicting multiple conformations from a single protein sequence. This can be achieved by modifying the input MSA (77,78), exploring the energy landscapes generated from the predicted contact and distance maps to identify potential conformational states (79), and incorporating generative models (80). Given the highly dynamic nature of RNA conformations, we expect the RNA structure prediction community to follow a similar trend and take on the challenge of predicting alternative folds of RNAs.

Training data set

RNA exhibits significantly fewer experimentally resolved structures compared with proteins, posing a substantial challenge for data-driven machine learning-based techniques. The common data sets employed for training models in RNA structure prediction include the PDB database (81) for RNA 3D structures, and Rfam (82) and RNAstralign (83) for sequence alignments. Additional resources, such as bpRNA (84), Archive II (85), and RNA STRAND (86), are used for training 2D structure prediction models.

As of November 2023, the PDB database contains 1782 RNA-only structures, which account for approximately 0.84% of the total number of structures present. Moreover, when considering high-resolution structures with a resolution of better than 3.0 Å, RNA nucleotides constitute only 2% of all residues (87) in the PDB database. A similar issue arises in the data set of sequence alignments, where the number of sequence alignments in Rfam is only approximately 20% of those in Pfam (82,88,87). The limited availability of the resolved RNA 3D structures and sequence alignments has hindered the advancement in RNA structure prediction, despite the notable accomplishments achieved by AlphaFold in protein structure prediction (87). Addressing the challenge of limited training data for RNA structure prediction demands innovative approaches.

One potential strategy would be developing a manually designed data set. A remarkable example of this approach
is demonstrated by EternaFold (89), which uses a community-designed RNA data set to train the model for 2D structure, chemical mapping reactivity, and riboswitch affinity prediction. In this work, researchers use the EternaBench data set, which consists of over 20,000 synthetic RNA constructs designed using the RNA design platform Eterna, to train the model through multitask learning. Notably, compared with other machine learning-based models that rely solely on "natural" RNA data sets, the resulting model from EternaFold exhibited remarkable performance in predicting RNA 2D structures (89). The result suggests that, through close collaboration between theoretical and experimental experts, it is feasible to expand the training data set by incorporating synthetic structures.

Another potential strategy is to expand the RNA structural data by synthesizing/building new RNA conformations from known structures. This could be accomplished through combinatorial assembly of various RNA modules or RNA structures. Such newly created conformations may offer significant advantages in developing machine learning-based models. This approach can be particularly useful for models that do not use MSAs because MSAs can be challenging to construct for new structures that lack homologous sequences.

Pseudoknots and noncanonical basepairs

RNA pseudoknots frequently occur in biologically significant structures, and the structures often involve noncanonical basepairs in the loop regions. However, the ability to predict pseudoknots and noncanonical basepairs is quite limited within machine learning-based models. In the study by Wu et al. (56), a comprehensive evaluation of 16 machine learning-based models for RNA 2D structure prediction was conducted. The results indicated that only 9 of these models can predict pseudoknots and only 6 models are capable of predicting noncanonical basepairs. The prediction of pseudoknots and noncanonical basepairs in the 2D structure remains challenging for machine learning-based approaches due to two main reasons. Firstly, RNA 2D structure databases are predominantly composed of pseudoknot-free structures and canonical basepairs. Secondly, the formation of pseudoknots and noncanonical basepairs is influenced not only by the RNA sequence but also by the solution environment. For example, ligands (including metal ions) are required for the stabilization of pseudoknots in riboswitches (90).

To demonstrate the challenge in predicting pseudoknots using machine learning-based models, we tested two machine learning-based models (E2Efold (61) and MXfold2 (68)) and a physics-based model (Vfold2D (13–19)) to predict the 2D structure of the CPEB3 ribozyme. The structure of CPEB3 ribozyme, which contains a pseudoknot, was selected as an RNA target in CASP15 (91,92). The prediction results are depicted in Fig. 2. As shown in Fig. 2, B and C, the two machine learning-based models failed to predict the pseudoknot in Fig. 2 A, resulting in an overall low prediction accuracy. In contrast, as shown in Fig. 2 D, the physics-based model Vfold2D exhibited a high prediction accuracy by correctly predicting 6 out of the 7 basepairs in the pseudoknot. The high prediction accuracy achieved by Vfold2D in the case of the CPEB3 ribozyme can be attributed to its incorporation of proper energy parameters for pseudoknots. These results demonstrate the importance of using reliable energy parameters in physics-based models, and highlight the need for further improvements in machine learning-based models to enhance the predictive power for pseudoknots.

FIGURE 2 The 2D structure prediction results for the CPEB3 ribozyme. (A) The native structure, the structures predicted by the machine learning-based models E2Efold (61) (B) and MXfold2 (68) (C), and the physics-based model Vfold2D (13–19) (D). Pseudoknots are depicted using dashed lines. To conduct the prediction using machine learning models, we used the web server for each model with the default parameter setting.

To address the challenge, a data set consisting exclusively of pseudoknots or noncanonical basepairs may be useful. Due to the anticipated small size of this data set, a possible approach would involve training the model on a larger data set first, followed by transferring the learned knowledge to this smaller data set. Another strategy might initially predict the structures without pseudoknots or noncanonical basepairs, then infer the pseudoknots and noncanonical basepairs from the initial structures. One example of a model that adopts this strategy is DMfold (59), which predicts pseudoknots by generating three pseudoknot-free substructures and then combining them to predict the pseudoknots.

Large RNAs

Current machine learning-based methods for predicting RNA 2D and 3D structures are primarily trained on small- to medium-sized RNAs containing single chains. Due to the limited availability of solved large structures and the high training costs that are constrained by GPU memory, the performance of most currently available machine learning-based methods is limited when it comes to larger RNAs with multiple chains. Data generation efforts may serve as a possible strategy to overcome these limitations. These efforts could assemble small- and medium-sized RNAs to construct large RNAs for training purposes. As a caveat, we note that the stability and validation of these assembled structures remain questionable. Another approach might involve the determination of (smaller) independent structural domains based on the prediction of basepairs and tertiary interactions. To achieve this, various models can be employed, including thermodynamic-based, covariation-based, or machine learning-based models, or a combination of these methods. The assembly of the independent domains would give a (larger) complete RNA structure (93).

Motif knowledge

The recent results of CASP15 RNA structure prediction category showed the advantage of physics-based methods over machine learning-based methodologies (91,92). The
less-accurate predictions from the machine learning chine learning-based models exhibit RMSD values of
methods can be primarily attributed to the absence of anal- 27.66 and 29.74 Å, respectively. As depicted in Fig. 3
ogous sequences and structures within the training sets, as B, the two predicted structures encountered difficulties
well as a deficiency of homologous sequences necessary in identifying and incorporating the G-quadruplex,
to construct effective MSAs. In CASP15, the synthetic hairpin-hairpin kissing, and junction loop motifs into the
RNAs contain relatively small structural motifs, such as complete structure. The failure to accurately predict
complementary kissing loops, short junction loops, and he- crucial motifs resulting in a low level of prediction accu-
lices, which can be intuitively assembled into overall struc- racy. The structures and the RMSD values are obtained
tures by human experts. However, machine learning-based from the CASP web site at https://www.predictioncenter.
approaches encounter difficulties in accurately identifying org/casp15/rna_results.cgi?target¼R1126.
and seamlessly integrating these motifs into the overall To address this challenge, future development of machine
structure. learning-based RNA structure prediction methods may
An illustrative example is shown in Fig. 3 to demon- involve predicting structures of individual motifs. Through
strate the challenge machine learning-based models face this approach, we can effectively transfer the structural
in identifying and assembling motifs. As shown in knowledge of motifs learned from solved structures to
Fig. 3 A, R1126 is a synthetic RNA that contains a unknown ones. Based on our experience gained from
G-quadruplex motif at its core, two hairpin-hairpin kissing RNA-Puzzles (5–8) and CASP (92), incorporating motif
motifs that connect the upper and lower regions, and mul- template structures can lead to a significant improvement
tiple-way junctions. This structure is employed as an RNA in prediction performance. Therefore, it is important to
target in CASP15 (91,92). Among the predictions submit- enable machine learning approaches to acquire structural
ted, the two highest-ranked predictions generated by ma- knowledge based on known motifs.

6 Biophysical Journal 123, 1–11, October 1, 2024



FIGURE 3 R1126, a synthetic RNA that contains a G-quadruplex (blue), two hairpin-hairpin kissing motifs (gray), and multiple-way junctions (red). This structure was selected as an RNA target in CASP15 (91,92). (A) The native 3D structure. (B) The two best predicted structures, with RMSDs of 27.66 Å (left) and 29.74 Å (right), respectively, in the pool of various machine learning-generated structures in CASP15. The structures and the RMSD values are obtained from the CASP web site at https://www.predictioncenter.org/casp15/rna_results.cgi?target=R1126. To see this figure in color, go online.

ADVANTAGES AND CHALLENGES IN BUILDING RNA LANGUAGE MODELS

RNA sequences encode information based on a four-nucleotide alphabet (A, C, G, and U). Therefore, akin to human language, RNA sequences can be naturally represented as strings of letters. Moreover, RNA structures can be constructed from motifs and domains, just as human language is built upon words. Given these similarities, the idea of adopting language models, which have demonstrated remarkable success in analyzing human languages, to address RNA-related challenges has gained attention. Recently, several RNA language models have emerged, serving as valuable tools for diverse predictive tasks. These models were developed to predict RNA 2D structures, distance maps, functional annotations, and solvent accessibility. Notable RNA language models include RNA-FM (94), RNA-MSM (95), UNI-RNA (96), SpliceBERT (97), and scBERT (98). Nevertheless, the assessment of RNA language models' performance in RNA structure prediction remains an ongoing process, with continued efforts expected to advance this field further.

The common strategy for building an RNA language model for structure prediction involves initially pretraining the model on a large corpus of unannotated data, such as RNA sequences. Subsequently, the pretrained model is fine-tuned to perform structure prediction tasks using annotated data, including 2D structures and 3D structures. Fig. 4 provides a schematic representation of this process.

Pretraining

Language models for human language and proteins are trained through self-supervised learning using large corpora of unannotated text and protein sequences, respectively (52). Similarly, an RNA language model requires pretraining on RNA sequences. One popular pretraining approach for language models is masked language modeling (MLM). BERT (51), a widely recognized large-scale pretrained model used for MLM, employs transformer architectures to facilitate language model learning. An MLM-based pretraining process is illustrated in Fig. 4 (top). During the training process, nucleotide tokens are randomly masked, and the model is trained to predict them. A distinctive characteristic of language models is that the pretraining process is self-supervised and relies on unannotated RNA data. As a result, structural information such as 2D structures or atom coordinates is not needed. This feature significantly broadens the available data sets for training, as it allows for the utilization of sequence-only databases. An example of such a database is RNAcentral (99), which provides over 34 million RNA sequences.

Therefore, unlike models that rely on MSA, many RNA language models, such as RNA-FM (94) and UNI-RNA (96), can adopt a single RNA sequence as the network input during the pretraining process. MSA-free protein language models have proven effective in accurately and efficiently predicting protein 3D structures (97,100–102). These findings indicate the potential to develop language models designed specifically for RNA 3D structure prediction.

Fine-tuning

Pretrained RNA language models undergo a fine-tuning process on downstream tasks, employing either supervised or semisupervised learning techniques. This approach allows the RNA language model to transfer the general knowledge acquired during pretraining to the structure prediction tasks in the fine-tuning stage, resulting in improved performance and accuracy when compared with models trained solely with annotated data. For example, RNA-MSM fine-tuned its pretrained model for RNA 2D structure prediction, utilizing 2D ResNet networks (95). However, because the fine-tuning process follows a supervised or semisupervised learning architecture, it encounters the challenges discussed in previous sections, including the constrained size of the RNA database, difficulties in


FIGURE 4 Schematic overview of the RNA language model. The RNA language model undergoes a two-step training process: pretraining and fine-tuning. In the pretraining stage (top), the model is trained on the unannotated data set, such as RNA sequences. Following pretraining, the model is fine-tuned to perform structure prediction tasks (bottom). A commonly used pretraining technique for language models is masked language modeling, where nucleotide tokens are randomly masked, and the model is tasked with predicting the original tokens. To see this figure in color, go online.
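The masking step described for the pretraining stage can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure of any specific RNA language model: the 15% masking rate is borrowed from BERT, the `[MASK]` token name and the helper functions `mask_sequence` and `masked_cross_entropy` are hypothetical, and the "model" is a uniform guess rather than a trained transformer.

```python
import math
import random

NUCLEOTIDES = ("A", "C", "G", "U")
MASK_TOKEN = "[MASK]"  # hypothetical placeholder token

def mask_sequence(sequence, mask_rate=0.15, rng=None):
    """Hide a random subset of nucleotides.

    Returns the masked token list and a {position: original nucleotide} dict.
    The 15% default is an assumption borrowed from BERT.
    """
    rng = rng or random.Random()
    tokens = list(sequence)
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_masked)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets

def masked_cross_entropy(predicted, targets):
    """Mean negative log-probability of the original nucleotide at masked positions."""
    return -sum(math.log(predicted[pos][nt]) for pos, nt in targets.items()) / len(targets)

# Toy run with an untrained stand-in model that gives every nucleotide probability 0.25.
tokens, targets = mask_sequence("GGGAAACUUCGGUUUCCC", rng=random.Random(0))
uniform = {pos: {nt: 0.25 for nt in NUCLEOTIDES} for pos in targets}
print(" ".join(tokens))
print(masked_cross_entropy(uniform, targets))  # approximately ln(4) for a uniform guess
```

The loss is computed only at masked positions, so no structural annotation is needed: the sequence itself supplies the training targets, which is what makes the objective self-supervised.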

predicting pseudoknots, noncanonical basepairs, and large RNA molecules.

Training cost

Training a large language model is computationally expensive, as it requires high-performance GPUs. This can be observed in the training details of published RNA language models. For example, RNA-FM utilized 8 A100 GPUs with 80 GB of memory for a month of training (94). Similarly, RNA-MSM employed 8 V100 GPUs with 32 GB of memory during the training process (95), and SpliceBERT was trained on 8 V100 GPUs for a week (97). Cloud computing resources or distributed training methods may help mitigate the costs and infrastructure requirements associated with training language models.

CONCLUSION

There are great opportunities and challenges in developing machine learning-based models for predicting RNA structures, and significant advancements are required to match the remarkable success of machine learning-based methods in protein structure prediction. While many machine learning-based models have achieved impressive results in RNA 2D and 3D structure prediction, several key challenges remain, including the prediction of pseudoknots, noncanonical basepairs, synthetic structures, and large RNAs. These challenges can be mainly attributed to the limited availability of RNA 2D and 3D structures for training. Language models, which have demonstrated remarkable success in human language analysis, have gained increasing momentum in the field of RNA structure prediction. By transferring knowledge learned from unannotated data to the prediction tasks, RNA language models can effectively leverage RNA sequences for learning. Although RNA language models hold promising potential, they are still in the early stages of development, and efforts need to be dedicated toward issues such as reducing the training cost associated with them.

While machine learning-based approaches can accurately predict the 3D structures of certain RNAs, the underlying mechanisms governing the formation of these structures from their sequences and their in vivo functions remain elusive. This is because data-driven machine learning-based methods primarily map sequences to structures, lacking the ability to elucidate the intricate pathways connecting them. Therefore, it is crucial to keep developing physics-based and knowledge-based approaches and integrating them with machine learning approaches to gain understanding of RNA folding dynamics. Furthermore, RNA folding is influenced by environmental factors, including ligands, temperature, and ions, which current machine learning-based models fail to consider. Considering the interactions between RNAs and their respective environments represents a significant challenge for the future advancement of machine learning-based RNA prediction models (103).

Overall, predicting RNA structures from RNA sequences remains a challenging problem for machine learning-based methods. As shown by the results of CASP15, the top 4 methods are traditional energy-based approaches. We anticipate that, through continuous innovation and development in machine learning techniques, and through collaboration between theoretical and experimental experts in the RNA community, machine learning-based models will lead a breakthrough in RNA structure prediction and significantly enhance our understanding of RNA functions.

AUTHOR CONTRIBUTIONS

All authors contributed to the conception of the content described in this article. S.Z. and J.L. assembled an initial draft, and S.-J.C. edited and finalized the manuscript.


ACKNOWLEDGMENTS

This work was supported by the National Institutes of Health under grant R35-GM134919 to S.-J.C.

DECLARATION OF INTERESTS

The authors declare no competing interests.

REFERENCES

1. Morris, K. V., and J. S. Mattick. 2014. The rise of regulatory RNA. Nat. Rev. Genet. 15:423–437.
2. Mattick, J. S., P. P. Amaral, …, M. Wu. 2023. Long non-coding RNAs: definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. 24:430–447.
3. Weeks, K. M., and D. M. Mauger. 2011. Exploring RNA structural codes with SHAPE chemistry. Acc. Chem. Res. 44:1280–1291.
4. Lucks, J. B., S. A. Mortimer, …, A. P. Arkin. 2011. Multiplexed RNA structure characterization with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc. Natl. Acad. Sci. USA. 108:11063–11068.
5. Cruz, J. A., M.-F. Blanchet, …, E. Westhof. 2012. RNA-Puzzles: a CASP-like evaluation of RNA three-dimensional structure prediction. RNA. 18:610–625.
6. Miao, Z., R. W. Adamiak, …, E. Westhof. 2015. RNA-Puzzles Round II: assessment of RNA structure prediction programs applied to three large RNA structures. RNA. 21:1066–1084.
7. Miao, Z., R. W. Adamiak, …, E. Westhof. 2017. RNA-Puzzles Round III: 3D RNA structure prediction of five riboswitches and one ribozyme. RNA. 23:655–672.
8. Miao, Z., R. W. Adamiak, …, E. Westhof. 2020. RNA-Puzzles Round IV: 3D structure predictions of four ribozymes and two aptamers. RNA. 26:982–995.
9. Hofacker, I. L. 2003. Vienna RNA secondary structure server. Nucleic Acids Res. 31:3429–3431.
10. Lorenz, R., S. H. Bernhart, …, I. L. Hofacker. 2011. ViennaRNA Package 2.0. Algorithm Mol. Biol. 6:26.
11. Zuker, M. 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31:3406–3415.
12. Reuter, J. S., and D. H. Mathews. 2010. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinf. 11:129.
13. Cao, S., and S.-J. Chen. 2009. Predicting structures and stabilities for H-type pseudoknots with interhelix loops. RNA. 15:696–706.
14. Tan, Z.-J., and S.-J. Chen. 2009. Predicting electrostatic forces in RNA folding. In Meth. Enzymol. Elsevier, pp. 465–487.
15. Cao, S., D. P. Giedroc, and S.-J. Chen. 2010. Predicting loop–helix tertiary structural contacts in RNA pseudoknots. RNA. 16:538–552.
16. Xu, X., and S.-J. Chen. 2016. A method to predict the structure and stability of RNA/RNA complexes. Methods Mol. Biol. 1490:63–72.
17. Zhao, C., X. Xu, and S.-J. Chen. 2017. Predicting RNA structure with Vfold. Methods Mol. Biol. 1654:3–15.
18. Cheng, Y., S. Zhang, …, S.-J. Chen. 2021. Vfold2D-MC: a physics-based hybrid model for predicting RNA secondary structure folding. J. Phys. Chem. B. 125:10108–10118.
19. Zhang, S., Y. Cheng, …, S.-J. Chen. 2022. VfoldMCPX: predicting multistrand RNA complexes. RNA. 28:596–608.
20. Cao, S., and S.-J. Chen. 2011. Physics-based de novo prediction of RNA 3D structures. J. Phys. Chem. B. 115:4216–4226.
21. Xu, X., C. Zhao, and S.-J. Chen. 2019. VfoldLA: a web server for loop assembly-based prediction of putative 3D RNA structures. J. Struct. Biol. 207:235–240.
22. Zhao, Y., Y. Huang, …, Y. Xiao. 2012. Automated and fast building of three-dimensional RNA structures. Sci. Rep. 2:734.
23. Wang, J., J. Wang, …, Y. Xiao. 2019. 3dRNA v2.0: an updated web server for RNA 3D structure prediction. Int. J. Mol. Sci. 20:4116.
24. Popenda, M., M. Szachniuk, …, R. W. Adamiak. 2012. Automated 3D structure composition for large RNAs. Nucleic Acids Res. 40:e112.
25. Jonikas, M. A., R. J. Radmer, …, R. B. Altman. 2009. Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA. 15:189–199.
26. Sharma, S., F. Ding, and N. V. Dokholyan. 2008. iFoldRNA: three-dimensional RNA structure prediction and folding. Bioinformatics. 24:1951–1952.
27. Krokhotin, A., K. Houlihan, and N. V. Dokholyan. 2015. iFoldRNA v2: folding RNA with constraints. Bioinformatics. 31:2891–2893.
28. Šulc, P., F. Romano, …, A. A. Louis. 2014. A nucleotide-level coarse-grained model of RNA. J. Chem. Phys. 140:235102.
29. Parisien, M., and F. Major. 2008. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature. 452:51–55.
30. Das, R., J. Karanicolas, and D. Baker. 2010. Atomic accuracy in predicting and designing noncanonical RNA structure. Nat. Methods. 7:291–294.
31. Watkins, A. M., R. Rangan, and R. Das. 2020. FARFAR2: improved de novo rosetta prediction of complex global RNA folds. Structure. 28:963–976.e6.
32. Pasquali, S., and P. Derreumaux. 2010. HiRE-RNA: a high resolution coarse-grained energy model for RNA. J. Phys. Chem. B. 114:11957–11966.
33. Poblete, S., S. Bottaro, and G. Bussi. 2018. A nucleobase-centered coarse-grained representation for structure prediction of RNA motifs. Nucleic Acids Res. 46:1674–1683.
34. Boniecki, M. J., G. Lach, …, J. M. Bujnicki. 2016. SimRNA: a coarse-grained method for RNA folding simulations and 3D structure prediction. Nucleic Acids Res. 44:e63.
35. Xiong, P., R. Wu, …, Y. Zhou. 2021. Pairing a high-resolution statistical potential with a nucleobase-centric sampling algorithm for improving RNA model refinement. Nat. Commun. 12:2777.
36. Zhang, D., and S.-J. Chen. 2018. IsRNA: An iterative simulated reference state approach to modeling correlated interactions in RNA folding. J. Chem. Theor. Comput. 14:2230–2239.
37. Zhang, D., J. Li, and S.-J. Chen. 2021. IsRNA1: de novo prediction and blind screening of RNA 3D structures. J. Chem. Theor. Comput. 17:1842–1857.
38. Zhang, D., S.-J. Chen, and R. Zhou. 2021. Modeling noncanonical RNA base pairs by a coarse-grained IsRNA2 model. J. Phys. Chem. B. 125:11907–11915.
39. Li, J., and S.-J. Chen. 2023. RNAJP: enhanced RNA 3D structure predictions with non-canonical interactions and global topology sampling. Nucleic Acids Res. 51:3341–3356.
40. Vaswani, A., N. Shazeer, …, I. Polosukhin. 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30.
41. Chen, T., S. Kornblith, …, G. E. Hinton. 2020. Big self-supervised models are strong semi-supervised learners. Adv. Neural Inf. Process. Syst. 33:22243–22255.
42. Jumper, J., R. Evans, …, D. Hassabis. 2021. Highly accurate protein structure prediction with AlphaFold. Nature. 596:583–589.
43. Moult, J., J. T. Pedersen, …, K. Fidelis. 1995. A Large-Scale Experiment to Assess Protein Structure Prediction Methods.
44. AlQuraishi, M. 2019. AlphaFold at CASP13. Bioinformatics. 35:4862–4865.
45. Jumper, J., R. Evans, …, D. Hassabis. 2021. Applying and improving AlphaFold at CASP14. Proteins. 89:1711–1721.
46. Shen, T., Z. Hu, …, S. Wang. 2022. E2Efold-3D: end-to-end deep learning method for accurate de novo RNA 3D structure prediction. Preprint at arXiv. https://doi.org/10.48550/arXiv.2207.01586.


47. Baek, M., R. McHugh, …, F. DiMaio. 2022. Accurate prediction of nucleic acid and protein-nucleic acid complexes using RoseTTAFoldNA. Preprint at bioRxiv. https://doi.org/10.1101/2022.09.09.507333.
48. Li, Y., C. Zhang, …, Y. Zhang. 2023. Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction. Nat. Commun. 14:5745.
49. Yang, J., I. Anishchenko, …, D. Baker. 2020. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. USA. 117:1496–1503.
50. Baek, M., F. DiMaio, …, D. Baker. 2021. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 373:871–876.
51. Devlin, J., M.-W. Chang, …, K. Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv. https://doi.org/10.48550/arXiv.1810.04805.
52. Ofer, D., N. Brandes, and M. Linial. 2021. The language of proteins: NLP, machine learning & protein sequences. Comput. Struct. Biotechnol. J. 19:1750–1758.
53. Rives, A., J. Meier, …, R. Fergus. 2021. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA. 118, e2016239118.
54. Madani, A., B. McCann, …, R. Socher. 2020. Progen: Language modeling for protein generation. Preprint at arXiv. https://doi.org/10.48550/arXiv.2004.03497.
55. Zhang, J., Y. Fei, …, Q. C. Zhang. 2022. Advances and opportunities in RNA structure experimental determination and computational modeling. Nat. Methods. 19:1193–1207.
56. Wu, K. E., J. Y. Zou, and H. Chang. 2023. Machine learning modeling of RNA structures: methods, challenges and future perspectives. Briefings Bioinf. 24, bbad210.
57. Behrouzi, R., J. H. Roh, …, S. A. Woodson. 2012. Cooperative tertiary interaction network guides RNA folding. Cell. 149:348–357.
58. Schlick, T., and A. M. Pyle. 2017. Opportunities and challenges in RNA structural modeling and design. Biophys. J. 113:225–234.
59. Wang, L., Y. Liu, …, H. Zhang. 2019. DMfold: a novel method to predict RNA secondary structure with pseudoknots based on deep learning and improved base pair maximization principle. Front. Genet. 10:143.
60. Zhang, H., C. Zhang, …, Y. Liu. 2019. A new method of RNA secondary structure prediction based on convolutional neural network and dynamic programming. Front. Genet. 10:467.
61. Chen, X., Y. Li, …, L. Song. 2020. RNA secondary structure prediction by learning unrolled algorithms. Preprint at arXiv. https://doi.org/10.48550/arXiv.2002.05810.
62. Fu, L., Y. Cao, …, X. Xie. 2022. UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Res. 50:e14.
63. Singh, J., J. Hanson, …, Y. Zhou. 2019. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat. Commun. 10:5407.
64. Singh, J., K. Paliwal, …, Y. Zhou. 2021. Improved RNA secondary structure and tertiary base-pairing prediction using evolutionary profile, mutational coupling and two-dimensional transfer learning. Bioinformatics. 37:2589–2600.
65. Saman Booy, M., A. Ilin, and P. Orponen. 2022. RNA secondary structure prediction with convolutional neural networks. BMC Bioinf. 23:58.
66. Franke, J. K., F. Runge, and F. Hutter. 2023. Scalable deep learning for RNA secondary structure prediction. Preprint at arXiv. https://doi.org/10.48550/arXiv.2307.10073.
67. Mao, K., J. Wang, and Y. Xiao. 2022. Length-dependent deep learning model for RNA secondary structure prediction. Molecules. 27:1030.
68. Sato, K., M. Akiyama, and Y. Sakakibara. 2021. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat. Commun. 12:941.
69. Zuker, M., and P. Stiegler. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9:133–148.
70. Zhang, J., M. Lang, …, Y. Zhang. 2024. Predicting RNA structures and functions by artificial intelligence. Trends Genet. 40:94–107.
71. Li, J., W. Zhu, …, W. Wang. 2018. RNA3DCNN: Local and global quality assessments of RNA 3D structures using 3D deep convolutional neural networks. PLoS Comput. Biol. 14, e1006514.
72. Townshend, R. J. L., S. Eismann, …, R. O. Dror. 2021. Geometric deep learning of RNA structure. Science. 373:1047–1051.
73. Zhang, S., Y. Liu, and L. Xie. 2022. Physics-aware graph neural network for accurate RNA 3D structure prediction. Preprint at arXiv. https://doi.org/10.48550/arXiv.2210.16392.
74. Pearce, R., G. S. Omenn, and Y. Zhang. 2022. De novo RNA tertiary structure prediction at atomic resolution using geometric potentials from deep learning. Preprint at bioRxiv. https://doi.org/10.1101/2022.05.15.491755.
75. Wang, W., C. Feng, …, J. Yang. 2023. trRosettaRNA: automated prediction of RNA 3D structure with transformer network. Nat. Commun. 14:7266.
76. Sha, C. M., J. Wang, and N. V. Dokholyan. 2023. Predicting 3D RNA structure from solely the nucleotide sequence using Euclidean distance neural networks. Biophys. J. 122:444a.
77. Wayment-Steele, H. K., S. Ovchinnikov, …, D. Kern. 2022. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. Preprint at bioRxiv. https://doi.org/10.1101/2022.10.17.512570.
78. Stein, R. A., and H. S. Mchaourab. 2022. SPEACH_AF: Sampling protein ensembles and conformational heterogeneity with Alphafold2. PLoS Comput. Biol. 18, e1010483.
79. Hou, M., S. Jin, …, G. Zhang. 2023. Protein multiple conformations prediction using multi-objective evolution algorithm. Preprint at bioRxiv. https://doi.org/10.1101/2023.04.21.537776.
80. Jing, B., E. Erives, …, T. Jaakkola. 2023. EigenFold: Generative Protein Structure Prediction with Diffusion Models. Preprint at arXiv. https://doi.org/10.48550/arXiv.2304.02198.
81. Rose, P. W., A. Prlic, …, Z. Feng. 2016. The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res. gkw1000.
82. Kalvari, I., E. P. Nawrocki, …, A. I. Petrov. 2021. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 49:D192–D200.
83. Tan, Z., Y. Fu, …, D. H. Mathews. 2017. TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res. 45:11570–11581.
84. Danaee, P., M. Rouches, …, D. Hendrix. 2018. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 46:5381–5394.
85. Sloma, M. F., and D. H. Mathews. 2016. Exact calculation of loop formation probability identifies folding motifs in RNA secondary structures. RNA. 22:1808–1818.
86. Andronescu, M., V. Bereg, …, A. Condon. 2008. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinf. 9:340.
87. Schneider, B., B. A. Sweeney, …, M. Szachniuk. 2023. When will RNA get its AlphaFold moment? Nucleic Acids Res. 51:9522–9532.
88. Mistry, J., S. Chuguransky, …, A. Bateman. 2021. Pfam: The protein families database in 2021. Nucleic Acids Res. 49:D412–D419.
89. Wayment-Steele, H. K., W. Kladwang, …, R. Das. 2022. RNA secondary structure packages evaluated and improved by high-throughput experiments. Nat. Methods. 19:1234–1242.
90. Peng, X., W. Liao, …, L. Huang. 2023. Crystal structures of the NAD+-II riboswitch reveal two distinct ligand-binding pockets. Nucleic Acids Res. 51:2904–2914.


91. Das, R., R. C. Kretsch, …, D. Rigden. 2023. Assessment of three-dimensional RNA structure prediction in CASP15. Preprint at bioRxiv. https://doi.org/10.1101/2023.04.25.538330.
92. Li, J., S. Zhang, and S.-J. Chen. 2023. Advancing RNA 3D structure prediction: Exploring hierarchical and hybrid approaches in CASP15. Proteins. 91:1779–1789.
93. Bryant, P., G. Pozzati, …, A. Elofsson. 2022. Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search. Nat. Commun. 13:6028.
94. Chen, J., Z. Hu, …, T. Shen. 2022. Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. Preprint at bioRxiv. https://doi.org/10.1101/2022.08.06.503062.
95. Zhang, Y., M. Lang, …, G. Song. 2023. Multiple sequence-alignment-based RNA language model and its application to structural inference. Preprint at bioRxiv. https://doi.org/10.1101/2023.03.15.532863.
96. Wang, X., R. Gu, …, H. Wen. 2023. UNI-RNA: universal pre-trained models revolutionize RNA research. Preprint at bioRxiv. https://doi.org/10.1101/2023.07.11.548588.
97. Chen, K., Y. Zhou, …, Y. Yang. 2023. Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction. Preprint at bioRxiv. https://doi.org/10.1101/2023.01.31.526427.
98. Yang, F., W. Wang, …, J. Yao. 2022. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4:852–866.
99. The RNAcentral Consortium. 2019. RNAcentral: a hub of information for non-coding RNA sequences. Nucleic Acids Res. 47:D221–D229.
100. Chowdhury, R., N. Bouatta, …, M. AlQuraishi. 2022. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40:1617–1623.
101. Fang, X., F. Wang, …, L. Song. 2023. A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. Nat. Mach. Intell. 5:1087–1096.
102. Lin, Z., H. Akin, …, A. Rives. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 379:1123–1130.
103. Chen, S.-J., M. Hassan, …, G. D. Rose. 2023. Protein folds vs. protein folding: Differing questions, different challenges. Proc. Natl. Acad. Sci. USA. 120, e2214423119.
