Structure-Based ML Paper
a Biozentrum, University of Basel, Basel, Switzerland
b Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
Article info

Article history:
Received 26 September 2022
Received in revised form 21 December 2022
Accepted 21 December 2022
Available online 29 December 2022

Keywords:
Protein structures
Machine learning
Deep learning

Abstract

Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.

© 2023 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Contents
1. Introduction
2. Machine learning in the protein field
  2.1. Protein family based ML
  2.2. Protein universe based ML
3. Computational representations of protein structures
  3.1. Generating structure feature matrices
    3.1.1. Residue level
    3.1.2. Structural environment level
  3.2. Learning protein embeddings
4. Challenges and future directions
  4.1. Structure-based approaches are computationally expensive
  4.2. End-to-end learning on structures
  4.3. Dynamic representations of structure
  4.4. Probing underlying protein mechanisms
  4.5. A unified approach to function
5. Conclusion
CRediT authorship contribution statement
Conflicts of Interest
Acknowledgements
References
https://doi.org/10.1016/j.csbj.2022.12.039
2001-0370/© 2023 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the
CC BY license (http://creativecommons.org/licenses/by/4.0/).
J. Durairaj, D. de Ridder and A.D.J. van Dijk Computational and Structural Biotechnology Journal 21 (2023) 630–643
⁎
Corresponding author.
E-mail addresses: [email protected] (J. Durairaj), [email protected] (D. de Ridder), [email protected] (A.D.J. van Dijk).
products, and inhibitors; state prediction in the context of engineering thermostability, binding affinity and activity; and prediction of the effects of mutations. In many cases, such as the immensely diverse lipocalins [27] and the fast-evolving enzyme families involved in specialised metabolism [28], the sequence diversity within a family makes it impossible for sequence-based techniques to predict family properties. Even very similar sequences can have mutations in key structural regions resulting in completely different activities, which is easier to ascertain from structure than from sequence alone. In addition, insights from computational prediction methods which also use structure as input can better drive experimental studies due to the generally higher accuracy of structure-based prediction, and better enable exploration of the protein family space with structural stability and activity taken into account. We give examples of supervised ML tasks for some well-known protein families below.

The superfamily of G protein-coupled receptors (GPCRs) is the largest family of targets for approved drugs in modern drug discovery, and hence also a popular target for ML approaches to drive exploration and understanding. GPCRs play an essential role in physiological processes such as vision, olfaction, neuronal signal transmission, cell differentiation, pain, muscle contraction, and hormone secretion [29]. Recent ML studies on GPCRs have started incorporating structural information to improve prediction performance, and to derive biological insight into the residues and mechanisms involved. As commonly used ML models for structure, interaction and interface prediction are trained on soluble proteins, specialised GPCR-specific oligomerization and interface predictors were developed [30,31], able to handle their long transmembrane regions. Recent work even modified the existing AlphaFold2 algorithm to generate rarer GPCR conformations [32]. GPCRs often display high conformational flexibility and low thermostability, making their structural, biophysical, and biochemical characterisation in the laboratory challenging. Given that experimental identification of thermostabilizing mutations is very resource intensive and must be repeated for each individual receptor, computational prediction of GPCR mutant stability is a crucial task in this field [33]. Finally, GPCRs bind to a very diverse range of ligands, and ML is used to identify biologically active ligands and binding inhibitors, estimate affinity and other binding properties, and probe ligand-specific binding mechanisms [34].

Another important class of drug targets is the kinases [35], with over 500,000 publications, 20,000 patents, inhibition assays for the majority of the human kinome and 115,000 kinase inhibitors covering 20% of the kinome [36]. With over 7000 structures solved covering 308 kinases across 8 groups and complexed with over 3000 unique ligands and inhibitors, structure-based ML approaches are widely used for addressing challenges within this superfamily. These include methods to predict inhibition [37] and binding affinity [38] in specific kinase families. Another common kinase challenge is predicting conformational change between the so-called active and inactive conformations [39,40]. For drug targets, predicting the effects of mutation of a single protein could also be considered a protein family ML task, as the inputs are still proteins sharing the same structural fold with key differences caused by changes in the sequence. PremPLI [41] uses features from modelled protein-ligand complexes to predict the effect of mutation on binding affinity to a number of inhibitors for a kinase cancer target.

In the field of natural products and specialised metabolism in plants, bacteria, and fungi, ML has slowly been gaining popularity over more traditional approaches involving similarity search or analysis of a few, closely related proteins. ML has been used for successful prediction of substrate [42,43] and product [44] specificity in various natural product enzyme families. In 2013, a structure-informed approach was used to engineer highly thermostable cytochrome P450s [19].

Though computationally predicted structures are shown to be highly accurate at the backbone level, tasks such as the ones described above which involve small molecule binding may need further family-specific processing and ML-based approaches to harness the structural information specifically related to ligand interaction. For example, [45] show that AlphaFold-predicted GPCR structures differ in crucial features such as domain assembly, ligand-binding pockets, and interface conformation, thus impeding their direct use in functional studies.

Unsupervised ML in the protein family space hosts a new sub-field of structural bioinformatics, dubbed “comparative structuromics” by Mohammed AlQuraishi. This is concerned with tools, algorithms, and techniques to compare and contrast assorted datasets of protein structures to answer a variety of biological questions - the evolutionary relationships between structural orthologs, interaction networks and how they are affected by structural changes, folding and changes within different cellular contexts and organisms, and how structure and folding are coupled with different functional characteristics. Zebra3D [46] is an example of such a technique. It provides a systematic analysis of 3D protein structure alignments combined with the identification of subfamily-specific regions using unsupervised ML clustering algorithms - these regions represent patterns of local 3D structure similar within subfamilies, but differing between them, thus likely to be associated with functional diversity and function-related conformational plasticity. The work of de Lima et al. [47] is another example of unsupervised protein family ML concerned with the detection of subfamilies and simultaneous identification of differentiating residues. Clustering and dimensionality reduction techniques have been used to describe the conformational landscape of proteins and identify binding-induced conformational change [48,49].

Protein family ML often has to deal with sparsely populated datasets and rely on algorithms which can handle a large number of features measured across a small number of data points. A wide range of algorithms are at our disposal for these tasks, including but not limited to k-nearest neighbours algorithms (k-NNs) [50], support vector machines (SVMs) [51], Gaussian processes [52], and ensemble methods such as Random Forests [53] and gradient boosting trees [54]. In addition, many approaches in this field aim to interpret prediction results to derive insights about underlying mechanisms and residues which may be important for function. Such predictions and insights obtained from protein family ML are often used to drive experimental research to explore and characterise novel, interesting or relevant proteins.

2.2. Protein universe based ML

The larger-scale protein universe based ML typically uses tens of thousands of proteins from diverse superfamilies to learn global properties of proteins, such as secondary and tertiary structure and folding, interactions, disorder, broad function classes etc. DL is a common choice for such problems, as it is known to drastically outperform other techniques in the presence of large amounts of data. In fact, protein structure prediction is in itself a protein universe task in which the use of DL has in many cases eclipsed other ML or statistical methods. This is true for prediction of secondary structure, solvent accessibility [55], backbone torsion angles [56,57], residue-residue contacts or distance matrices from co-evolution [58–62], and in de novo all atom structure modelling. In fact, all the top-performing Critical Assessment of Structure Prediction (CASP13 [63], CASP14 [64]) methods for de novo modelling rely on deep convolutional neural networks for predicting residue contacts or distances, predicting backbone torsion angles and/or ranking the final models. For recent reviews on the underlying techniques used, including those in AlphaFold2 and related approaches, see [65,66].
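The protein family setting of Section 2.1, in which only a handful of labelled proteins are available, can be made concrete with a small sketch. The example below is an illustration only, not a method taken from the cited works: it runs a k-nearest neighbours classifier under leave-one-out cross-validation, one common way to estimate accuracy when data points are too few for a separate test set. The feature values, the labels "stable" and "unstable", and all function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(X, y, k=3):
    """Hold out each protein in turn, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        correct += knn_predict(train_X, train_y, X[i], k) == y[i]
    return correct / len(X)


# Hypothetical toy data: 6 proteins, 4 structure-derived features each.
X = [
    [0.9, 0.1, 0.8, 0.2],
    [1.0, 0.2, 0.9, 0.1],
    [0.8, 0.0, 0.7, 0.3],
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 1.0, 0.1, 0.9],
    [0.0, 0.8, 0.3, 0.7],
]
y = ["stable", "stable", "stable", "unstable", "unstable", "unstable"]

accuracy = leave_one_out_accuracy(X, y, k=3)
print(accuracy)
```

In practice the feature vectors would be far wider than the number of proteins, which is why distance-based and ensemble methods that tolerate this imbalance are favoured in this setting.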
Table 1
Supervised protein universe tasks, inputs and examples.
The Input column describes the typical form of input given to the algorithms used. Multiple input format possibilities are comma-separated. All inputs refer to the structural
context, i.e. “Protein” refers to the 3D protein structure, “Residue” to aspects associated with each individual residue - its physicochemical, electrostatic, geometric properties etc.
(similarly for “Mutation”), “Ligand” to the 2D and/or 3D structure of a small molecule ligand.
With the availability of protein structures, a number of additional tasks can make use of structure-based ML instead of sequence. These are listed in Table 1, grouped by the kinds of inputs used. Recent examples as well as common datasets used to validate and benchmark novel algorithms created for each task are also listed.

In the 2020 CASP14 competition, the breakthrough results of AlphaFold2 prompted a press release declaring the protein structure problem for single protein chains solved [64]. This emphasis on “single protein chains” revealed the new frontier for structural bioinformatics - complex structures are yet to be successfully predicted at the same breakthrough levels. Thus the related yet distinct tasks of predicting whether two proteins interact, and predicting the interface of two interacting proteins are common protein universe problems with a number of solutions, based on docking [87,104], templates [105], end-to-end learning [84] and, most recently, protein complex prediction approaches building upon AlphaFold2 [128–130]. This latest generation combines the AlphaFold2 DL architecture with a modified paired MSA generation approach which encapsulates co-evolutionary information across the subunits of the desired complex. This yielded success rates for complex prediction up to double that of previous template-based and docking methods, marking significant progress in the field. However, these success rates are still only around 50% and vary drastically across species, protein families, types of complexes, and stoichiometries considered [129,131]. Similarly, the popular de novo protein structure prediction algorithm RoseTTAFold has been extended to the prediction of nucleic acid and protein-nucleic acid complexes [132], though again only around half of the tested complexes could be successfully modelled.

Structure-based drug discovery also hosts some significant applications of protein universe ML [133], starting from the computational modelling of putative receptor targets. Subsequently, binding sites in the target structure and putative drug candidates are identified using cavity/pocket prediction techniques [76], prediction of “druggable” regions, and protein-ligand binding site [134] prediction. This is typically followed by molecular docking to evaluate protein-ligand interaction and affinity between the target and a variety of drug candidates. In the case of unknown target proteins or to identify off-target binding candidates, reverse/inverse docking [135–138] is used to create embeddings of drugs and search across protein structure databases for good docking solutions. In these contexts, ML approaches are used to improve scoring functions of binding affinity and plausible docking poses [81,116,121,138,139]. Indeed, [140] show that computationally predicted structures perform on par with experimental structures at reverse docking tasks - although the docking and scoring methods themselves could use major improvements to further drug discovery and design.

Predicting the effects of variants and mutations, especially those involved in diseases, is another common task. Sen et al. [141] took advantage of the latest de novo structure prediction techniques to model human disease-associated proteins, many of which do not have existing structures or even close homologues. Afterwards, they compared disease-associated mutations to ligand binding sites, protein-protein interfaces and conserved regions predicted from the models, in order to provide some rationale for most of the mutations. However, the current DL-based structure predictors are not yet able to successfully predict the effects of mutations on protein structures, as their training procedure is designed to be robust to small changes in sequence. This has been practically demonstrated in studies aiming to predict stability effects of mutations using predicted structures [142,143], and it indicates an under-explored area of structure prediction.

Approaches building upon AlphaFold2 and its underlying architectures have been used successfully in design tasks [144–147], indicating that the AlphaFold2 breakthrough may also cause a leap in protein design prediction. The process of constructing idealised folds during protein design can reveal new information about the physical and structural constraints that dictate which conformations a protein can adopt [148,149]. Such insights could be of vital importance to solving fundamental biological questions behind the evolution of proteins, as well as for further improvement of protein engineering and design [150]. See [151] for a recent review of DL approaches in the protein design field.

Intrinsically disordered proteins (IDPs) lack a fixed or ordered three-dimensional structure. This widespread phenomenon, thought to occur in over 33% of eukaryotic proteins, has been linked with allosteric regulation, enzyme catalysis, and a variety of diseases [152]. While structure-based prediction of intrinsic disorder may seem contradictory, energy scores obtained from existing structures [100] as well as residue-level computational modelling scores [11,101] contain information correlating with disorder and are effective for prediction. Structure-based ML has also been used to sample the very diverse conformational ensembles of IDPs [153].

Unsupervised techniques in the protein universe support tasks such as structure query and retrieval, clustering for motif and hotspot discovery, and structure-based fold annotation. For the first of these tasks, an array of fast techniques exists that allows near-instant retrieval of structures matching an input structure [154–158]. Recent approaches for structure-based clustering allow pinpointing novel or rare folds [11,159], as well as residues and structural regions associated with function [160]. Another common task is the generation of fixed-dimensional unsupervised embeddings which capture global and local protein characteristics. These can be used in downstream ML algorithms, as discussed in the next section.
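Once every structure is summarised as a fixed-dimensional embedding, query and retrieval reduces to a nearest-neighbour search in the embedding space. The sketch below illustrates that idea with a brute-force scan ranked by cosine similarity. It is not the algorithm of any of the cited retrieval tools, which rely on their own representations and indexing; the protein names, the 4-dimensional vectors and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, database, top_n=2):
    """Return the top_n (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical 4-dimensional embeddings; real ones would come from a trained model
# and typically have hundreds of dimensions.
database = {
    "protein_A": [0.9, 0.1, 0.0, 0.2],
    "protein_B": [0.8, 0.2, 0.1, 0.1],
    "protein_C": [0.0, 0.9, 0.8, 0.1],
    "protein_D": [0.1, 0.8, 0.9, 0.0],
}
query = [1.0, 0.1, 0.0, 0.1]

hits = retrieve(query, database, top_n=2)
for name, score in hits:
    print(name, round(score, 3))
```

A linear scan like this costs one comparison per database entry; at the scale of full structure databases, an approximate nearest-neighbour index would normally replace it.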
Fig. 1. Common steps in structure-based machine learning. A) Starting from a set of protein sequences, structural models can either be retrieved from the PDB or constructed
using computational approaches. B) A number of different feature extraction, feature engineering, or pre-trained embedding approaches can then be used C) to extract a matrix
representation of the input, with the rows as data points and columns representing features or embedding values. D) This matrix forms the input for ML models resulting in
predictions of classes, regression values, or unsupervised clustering and dimensionality reduction. E) Prediction results, combined with the trained model, can be used to inspect
and interpret regions of the protein structure relevant for the task at hand.
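Steps B and C of Fig. 1 can be illustrated with a small sketch that turns per-residue values into a fixed-width matrix, with one row per protein and one column per position of a multiple alignment. This is an assumed, simplified layout for illustration: the alignment, the feature values, the function name and the choice of 0.0 as a placeholder at gap positions are all hypothetical, and real pipelines often treat gaps differently (for example by imputing column means).

```python
def alignment_feature_matrix(aligned_seqs, residue_features, gap_value=0.0):
    """Build one row per protein and one column per alignment position.

    aligned_seqs: dict of name -> aligned sequence, with '-' marking gaps.
    residue_features: dict of name -> per-residue values in sequence order,
        one value for every non-gap character.
    gap_value: placeholder written at gap positions (an assumption of this sketch).
    """
    names = sorted(aligned_seqs)
    width = len(aligned_seqs[names[0]])
    matrix = []
    for name in names:
        seq = aligned_seqs[name]
        if len(seq) != width:
            raise ValueError("all aligned sequences must have the same length")
        values = iter(residue_features[name])
        # Consume the next residue value at each non-gap column.
        row = [gap_value if char == "-" else next(values) for char in seq]
        matrix.append(row)
    return names, matrix


# Hypothetical alignment of three proteins and a made-up per-residue feature
# (for instance relative solvent accessibility).
aligned_seqs = {
    "prot1": "MK-LV",
    "prot2": "MKALV",
    "prot3": "M--LI",
}
residue_features = {
    "prot1": [0.8, 0.5, 0.1, 0.2],
    "prot2": [0.7, 0.6, 0.9, 0.1, 0.3],
    "prot3": [0.9, 0.2, 0.4],
}

names, matrix = alignment_feature_matrix(aligned_seqs, residue_features)
for name, row in zip(names, matrix):
    print(name, row)
```

With several features per residue, the same idea applies column-wise: each alignment position contributes one column per feature, so the matrix width is the alignment length times the number of features.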
3. Computational representations of protein structures

Protein structures contain interconnected high-dimensional information about the amino acids involved, their positions and relative orientations, and the varying physicochemical and electrostatic effects they have on each other. Fig. 1 shows an overview of the most common steps taken in structure-based ML. Once a set of structures with or without associated labels has been collected (Fig. 1A), the next step typically consists of choosing a format to represent this information that can be understood by computers (Fig. 1C). One way to do this is by explicitly extracting a set of attributes or features from proteins to create a tabular feature matrix. Another approach is to generate reduced fixed-dimensional protein representations, referred to as embeddings. Both these approaches (Fig. 1B) are followed by the use of ML algorithms that take the feature matrix or embedding as input and return various results (Fig. 1D) and insights (Fig. 1E) for user interpretation.

A number of studies have demonstrated that high-confidence predicted structural models (both homology-based and DL-based) have predictive power and can even perform as well as experimental structures on specific tasks [11,16,33,161]. However, this is unlikely to hold in general, as it depends strongly on both the types of proteins and the task at hand. For example, membrane proteins, intrinsically disordered proteins, and proteins with high conformational flexibility would still benefit from experimental structures solved in different conditions to increase the diversity of structures available and thus our knowledge of them. In addition, side-chain modelling accuracy, crucial for tasks involving side-chain interactions, tends to lag behind main chain accuracy. Finally, in a significant number of cases, AlphaFold2 and related approaches do not produce high-confidence structures. It was recently shown that while residues predicted by AlphaFold2 with high confidence (> 90 pLDDT) have a very low prediction error (median 0.6 Å), this quickly increases to over 3 Å error for low confidence residues (< 70 pLDDT) [162]. For such cases with only low confidence structure information present, we may still have to fall back on sequence-based approaches or utilise embedding techniques as described in Section 3.2.

3.1. Generating structure feature matrices

Broadly, protein structures are compared at the residue level, where features are extracted from each individual residue in the structure, or at a structural environment level, where features are extracted from well-defined portions of the structure (or the entire structure) containing relevant and localised properties. The former approach is commonly used in structurally conserved protein family ML tasks involving the entire protein, and the latter is used for more divergent proteins or for more specific tasks involving the corresponding structural environments. Both approaches use a range of techniques to align or arrange the extracted features into the fixed-dimensional feature matrix format.

3.1.1. Residue level

Many different features can be extracted from each residue in a protein structure using a plethora of computational tools, as listed in Table 2.

When the proteins under consideration are evolutionarily closely related, multiple protein alignment is commonly used to generate the input feature matrix. While sequence alignment has generally been much more popular than structure alignment, the existence of protein families which share the same structural fold despite having little sequence similarity necessitates the use of structure-based alignment methods. This has driven the development of fast multiple structure aligners capable of scaling to the numbers of proteins required to train ML algorithms [178–180].

An alternative to the tabular format is a (dis)similarity matrix, often used as input to kernel-based methods such as SVMs or in unsupervised dimensionality reduction. For instance, de Lima et al. [47] calculate protein-protein similarity by combining similarities calculated from, among other features, structural alignment, alignment-free structural comparisons, putative active sites, and instability indices.

Table 2
Structural features and tools used to extract them. Apart from DISOPRED, all tools use protein structures as input.

Residue feature                      Tools
Accessible surface area              NACCESS [163], PSAIA [164], FreeSASA [165], DSSP [166], ProtDCal [167]
Half sphere exposure                 BioPython (Bio.PDB.HSExposure) [168]
Residue depth                        MSMS [169], PSAIA [164]
Hydrogen bonding patterns            DSSP [166]
Bond angles                          DSSP [166], MDAnalysis [170]
Secondary structure                  DSSP [166]
Energy                               FoldX [171], Rosetta [172]
Electrostatics                       APBS [173]
Disorder                             DISOPRED [174]
Residue flexibility and stiffness    ProDy [175], MechStiff [176]
Perturbation response                PRS [177]
Thermodynamics                       ProtDCal [167]

3.1.2. Structural environment level

Fig. 2 depicts some structural environments commonly used in computational representations. For tasks such as hotspot prediction or interface residue prediction, each input data point could be a single residue. In such situations, including aggregate features with weighted neighbour averages over the spatial nearest neighbouring residues, as shown in Fig. 2A, often improves the discriminatory power of predictors [181]. Some environment representations were born out of the ease of adapting approaches from other fields to protein structures - for example, viewing the three-dimensional coordinates of atoms in a structure as a 3D image grid (Fig. 2B) allows the application of voxelization followed by the use of 3D convolutional neural networks often applied in the field of computer vision. Whereas in the case of images the red, green and blue values are often encoded as different channels, for proteins these channels have been used to encode different atom types [77,95]. Another approach that can also take into account atomic density and radii is the use of geometric tessellations to define a set of polyhedra around atoms or residues in a structure [182–185] (Fig. 2C).

Representations of the molecular surface (Fig. 2D) are useful for tasks related to protein interactions and protein-solvent interactions. For example, MaSIF [86] depicts the surface as a series of overlapping radial patches with associated geometric features such as shape index and distance-dependent curvature, as well as chemical features such as hydropathy index, continuum electrostatics and the location of free electrons and proton donors. A geometric deep neural network is applied to these input features to spatially localise features and optimise them towards particular tasks. Other approaches have used 3D Zernike or similar descriptors of surfaces which are invariant to rotation, thus allowing structures and surfaces of different proteins to be compared [186–188]. In fact, one of the main problems to solve when representing entire protein structures is this rotational and translational invariance. Fig. 2E depicts one way to address this, namely by using a 2D residue-residue distance or contact map [189,190]. Another approach gaining popularity is the representation of a protein structure as a graph (Fig. 2F) with rotation and translation invariant properties attached to the nodes and/or edges [17,191–194]. These graphs form the ideal input for geometric deep learning approaches and have the capacity to encode most of the information contained in the protein structure [195,196].

Proteins often interact with other molecules - other proteins, peptides, nucleic acids and small molecule ligands - so computational representations of these binding regions or interfaces are necessary for a number of tasks. Graph [122,197,198] and voxel-based [79,116,199] approaches can be used on experimentally solved or computationally docked protein-ligand complexes, usually by zooming in to the ligand binding pocket. In addition, there are specialised approaches to take into account explicit protein-ligand interactions within the ligand binding pocket in a complex [124,200]; see [201] for more examples of protein-ligand feature representations. In cases where data about the complex is absent but unbound structures are present, some approaches concatenate features of the individual entities as their representation [117,119,120].

3.2. Learning protein embeddings

A complementary approach to generate the tabular input required for ML is by using end-to-end or pre-trained embedding algorithms. These typically make use of unsupervised DL methods trained on a large dataset of proteins to produce a series of values representing a given protein in a fixed high-dimensional space, often without the need for explicitly handcrafted features. Due to the training process, these values place similar proteins closer together in this space, thus capturing overall protein variation and relationships between individual proteins. For example, recent global sequence embeddings have been shown to capture amino acid
Fig. 2. Different approaches for computational representation of a protein structure which go beyond features of individual residues. For A-D, features or representations calculated across individual blocks (respectively: spheres, grids, polyhedra, surface patches) are used as input to ML, while for E-F, the entire matrix or graph is often used in methods specifically designed for these kinds of inputs. A) Overlapping spheres. B) 3D voxel grids. C) Geometric tessellations. D) Molecular surface representations. E) Distance/contact maps. F) Graph representations.
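The distance and contact maps of panel E are attractive because they do not change when the whole structure is rotated or translated. The sketch below checks this numerically: it builds both maps from a handful of coordinates, applies a rigid-body transform, and compares the results. The coordinates are made up, the 8 Å contact cutoff is a commonly used but assumed value, and the function names are hypothetical.

```python
import math


def distance_map(coords):
    """Pairwise Euclidean distances between 3D points (e.g. C-alpha atoms)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(coords, cutoff=8.0):
    """Binary map: 1 where two residues lie within `cutoff` angstroms."""
    return [[int(d <= cutoff) for d in row] for row in distance_map(coords)]


def rotate_z_and_translate(coords, angle, shift):
    """Rigid-body transform: rotate about the z axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# Hypothetical C-alpha coordinates (angstroms) for a five-residue fragment.
ca = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (7.6, 7.6, 0.0),
]
moved = rotate_z_and_translate(ca, angle=1.1, shift=(10.0, -4.0, 2.5))

original = distance_map(ca)
transformed = distance_map(moved)
# Largest change in any pairwise distance after the rigid-body transform.
max_diff = max(
    abs(a - b) for row_a, row_b in zip(original, transformed) for a, b in zip(row_a, row_b)
)
print(max_diff < 1e-9)
print(contact_map(ca) == contact_map(moved))
```

Thresholding the distance map also yields the adjacency matrix of a residue graph, so the same computation is a natural starting point for the graph representations of panel F.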
characteristics and other physiological properties of proteins as a whole [202–205]. These have recently been extended to include structural information as well [206,207]. Unlike protein family ML, alignment is generally not an option in such techniques since most proteins used for training are evolutionarily remote; thus most described embedding techniques depend on learning alignment-free patterns across diverse proteins or on generating on-the-fly alignments of sub-groups of data during the learning process.

End-to-end learning is popular in this area, covering techniques which start from the raw protein structure with minimal processing and automatically extract features based on optimising prediction accuracy in a given end task. The intermediate feature representations or embeddings learned are therefore more applicable to the task at hand and can be retrained to adapt better to different tasks. ContactLib-ATT [208] applies this concept to predict the SCOP (Structural Classification Of Proteins) classification of an input structure, using attention-based learning [209] on vectors of hydrogen bond properties extracted from the structure. SASNet [84] is an example of such an approach applied to interface prediction. Local atomic environments of each surface residue are voxelized and a 3D convolutional neural network is applied to the resulting grids of each pair of residues to learn their interaction propensity. Interestingly, this method was trained based only on residues within bound structures of interacting partners and yet also performs exceedingly well on unbound counterparts, indicating that complex features beyond simple shape complementarity can be learned in this end-to-end fashion. dMaSIF [210], the successor to MaSIF (mentioned above), performs end-to-end learning of molecular surface representations directly from 3D point cloud data, optimised to each prediction task. Removing the reliance on handcrafted features improved the running time of dMaSIF by many orders of magnitude compared to MaSIF while maintaining and even improving accuracy. Recent DL approaches use the concept of “equivariance” (i.e. rotating or translating the input coordinates transforms the output correspondingly, so the arbitrary placement of a structure does not affect the learning process) in sequence, graph-based, and diffusion architectures for end-to-end predictive and generative learning [211–213].

GeoPPI [113] is an unsupervised approach that operates on the graph of a protein complex and uses a message passing neural network to reconstruct the structure of a perturbed complex, i.e. one in which a random residue is modified. This enables learning of intrinsic binding interactions, which is well suited to the prediction of protein-protein binding affinity. An advantage of such “self-supervised” approaches is that they are not specific to a single task while still encoding more global protein context; i.e. GeoPPI embeddings could be used as input for other prediction tasks. This kind of repurposing of unsupervised or pretrained embeddings is quite popular in the sequence world [214,215], and the same will likely hold true for structure-based ML in the future. Pretrained embeddings can also be used in a transfer learning context, where they are further fine-tuned to a more specific case of a general protein problem, such as the prediction of antibody-antigen interfaces from an embedding trained across all protein-protein interfaces [17].

Another interesting and relevant approach is structure-guided sequence embeddings [203,216,217]. These make use of structural information only in the training stage, while the input to the embedding algorithm from the perspective of the end user is just the sequence. This provides a compromise between the use of structure data, which may be computationally expensive to produce, and more easily accessible sequence data, while still making use of implicit structural information. Some recent work [194,218] has even made use of the intermediate representations generated by AlphaFold2 during the structure prediction process instead of, or along with, the predicted structure itself; these representations contain
information about homologous sequences and structures, especially useful for predicting the effects of mutations or ligand binding, most of which is lost on generation of the final structure.

4. Challenges and future directions

Despite rapid progress in the direction of structure-based ML, there are challenges to address before it can become as ubiquitously used as sequence-based ML. Just as there exists a wide variety of tools for answering questions from a sequence perspective, there need to be tools in structural bioinformatics that are as easy to use, as intuitive to interpret, as optimised, and as feature-rich.

4.1. Structure-based approaches are computationally expensive

The universal and widespread use of protein sequence data, combined with its one-dimensional nature, has resulted in a diverse landscape of highly optimised sequence-based tools and algorithms. Many of these, including clustering algorithms, aligners, feature extractors etc., scale to hundreds of thousands of sequences with ease. This cannot be said for structure-based approaches yet, both due to their relative newness and to structural data being much more complex than sequence data.

Often this resource intensiveness starts from the very first step, i.e. generating structural models. Template-based or homology modelling approaches take a matter of minutes to hours for generating a single model, often exacerbated by the need to infer multiple models for better robustness and expensive additions such as loop modelling for special cases. Recent template-free methods such as AlphaFold2 and RoseTTAFold run in minutes, though scaling very poorly with the number of residues, and require GPUs and high amounts of memory and disk space. Memory and space requirements for both are somewhat alleviated by the presence of servers such as SWISS-MODEL [219] for template-based modelling and the recently released ColabFold [220] for template-free modelling, both of which allow running these resource-intensive modelling steps on shared external servers. In addition, the growth of the AlphaFold protein structure database [9] will eventually reduce the need for remodelling from scratch for a large number of sequenced proteins. Mutants, designed and novel proteins will still need computational modelling however, indicating that speeding up the modelling process is still a relevant problem in the field. Recent approaches that use protein language model embeddings as input instead of calculating time-intensive multiple sequence alignments (MSAs) provide a step in this direction [221]. With the growth of exascale computing resources, modelling structural dynamics via molecular simulations is increasingly accessible, though there is a long way to go for this to become commonplace.

Once a dataset of structures is gathered or generated, the next steps often involve structural comparison and feature extraction. Alignment-free structural comparison techniques are relatively fast already, but structural aligners that scale to the sizes of datasets required for ML have only recently started to appear. These are still a far cry from the highly optimised sequence aligners, but many of these optimisation techniques can be transferred to structure-based approaches and represent a logical next step as ML on structures grows in popularity. Extraction of many of the features detailed in Table 2 is time consuming as well. While some improvements can be made with parallelisation and making better use of modern hardware, this is unlikely to scale to hundreds of thousands of proteins in a similar timescale as sequence feature extraction.

4.2. End-to-end learning on structures

End-to-end learning, where a DL model learns a mathematical function to map an input to a complex output [222], with minimal handcrafting of intermediate features and tasks, was seen to be highly successful for the extremely complex task of mapping an input sequence to a 3D structure [66]. This has been followed by a boom in end-to-end learning approaches on protein sequences for function prediction, as well as on protein structures for generating designed protein sequences. See [223] for a recent review.

End-to-end learning is becoming popular for a number of tasks as large models trained once on huge datasets of structures can then be reused for smaller sets of proteins and adapted to similar tasks with much less resource consumption and, at the same time, a great increase in performance for even sparse amounts of data [16,212,213,224,225]. In addition, these approaches can learn to make use of relevant intermediate information from proteins that may not be required or prioritised for the structure prediction task but are crucial for other downstream tasks. For example, residue masking in the AlphaFold2 learning procedure increases its robustness and improves overall structure prediction but makes it impossible to predict the structural changes caused by mutations, while much of this information is still present in the intermediate representations and useful for mutant effect prediction [218].

However, these learners do need huge initial training sets of diverse data and careful architecture engineering to avoid overfitting, as well as large amounts of computational resources for training and inference. In addition, results from such approaches are difficult to interpret in terms of which kinds of protein properties are being used to make certain decisions, which is a useful property of more handcrafted ML techniques to hypothesise about the underlying biology.

4.3. Dynamic representations of structure

Since proteins are inherently dynamic in nature, their true “structure” is much more than the rigid three-dimensional coordinates which serve as the basis for many of the approaches detailed in the previous sections. Instead, a protein is an ensemble of possible conformations, with some areas displaying more flexibility than others. This is further influenced by the constant interaction of proteins with the surrounding solvent, small molecules, nucleic acids, peptides and of course other proteins, all of which drive conformational changes within the protein. Protein biological activity often involves adopting specific conformations, contributions from local fluctuations, and even large-scale structural transitions between different conformations. In fact, the old paradigm that sequence encodes structure, and structure determines function can now be rephrased as sequence encodes structure, structure determines dynamics, and dynamics encodes function [226].

Protein flexibility and conformational diversity can be modelled in multiple ways. One of the most common approaches is using molecular dynamics (MD) simulations, which calculate the force exerted on each atom by all other atoms as a function of time using a molecular mechanics force field [227]. However, MD simulations, which are already computationally extremely expensive, do not address covalent bond formation or breakage, both crucial in a number of enzyme families. This sometimes leads to the need for the even more expensive and challenging setup of quantum mechanics/molecular mechanics (QM/MM) simulations [228]. Coarse-grained modelling with Monte Carlo simulations (CG-MC) and elastic network models (ENM, a.k.a. normal mode analysis) both provide simplified protein representations that still allow for understanding some aspects of protein flexibility while greatly reducing computational time [226,229]. [...] structures resolved by cryo-EM, a fast-growing number.

Together, these computational techniques can provide information about globular protein flexibility and mutations [230,231], large-scale structural transitions (e.g. from active to inactive conformations) [232–235], and conformations involved in the formation
of protein complexes [236]. They have also been used to assess and refine 3D models [237–239], improve ligand positioning [240,241], and to create receptor ensembles for ensemble docking [242,243]. The faster and cruder CG-MC and ENM approaches can be combined with atomistic-level MD, providing efficient strategies and starting points for multiscale simulations of proteins and complexes [244]. While ML is becoming more prevalent in the MD and CG-MC fields for constructing force field models, modelling energy surfaces, and performing conformational sampling [245–247], future efforts will likely also use the flexibility information obtained from these techniques as input to ML-based predictors of protein function, with a few early examples already doing this in unsupervised [248,249] and supervised settings [250,251]. There is some evidence that this can improve over static structure-based prediction [252].

4.4. Probing underlying protein mechanisms

[...] multi-omics studies, a great target for ML and DL methods. The future holds an increasing number of opportunities for this combination of network biology and ML [259]: in understanding and fighting diseases by inspecting protein and gene interaction networks, in locating off-target effects of drugs and concocting valuable drug combination therapies based on chemical networks and multi-omics data from drug treatments [260], in understanding microbial interactions through metabolic networks, in finding biosynthetic gene clusters through gene neighbourhoods, transcriptomics, and expression profiling, and in designing synthetic gene circuits combining interconnected genes, promoters, and ribosome binding sites. Apart from a few examples [261], structural data has rarely been used in such large-scale integrative approaches due to its scarcity and complexity. With the former being solved, the future holds promise in finding and using algorithms and approaches to link protein structures with all of their interlinked data in a unified approach to model function [262].
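Section 4.3 describes MD as computing the force exerted on each atom by all other atoms as a function of time. A minimal sketch of that idea is given below, assuming a toy Lennard-Jones pair potential in reduced units and a velocity Verlet integrator; real molecular mechanics force fields contain many further terms (bonds, angles, torsions, electrostatics), and the function names, parameters and coordinates here are hypothetical.

```python
def lj_forces(positions, epsilon=1.0, sigma=1.0):
    """Force on each atom from all other atoms under a Lennard-Jones potential.

    U(r) = 4*epsilon*((sigma/r)**12 - (sigma/r)**6), so the force on atom i
    due to atom j is 24*epsilon/r**2 * (2*(sigma/r)**12 - (sigma/r)**6) * (r_i - r_j).
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta = [a - b for a, b in zip(positions[i], positions[j])]
            r2 = sum(d * d for d in delta)
            sr6 = (sigma * sigma / r2) ** 3
            scale = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2
            for k in range(3):
                forces[i][k] += scale * delta[k]   # action on atom i
                forces[j][k] -= scale * delta[k]   # equal and opposite on atom j
    return forces

def velocity_verlet_step(positions, velocities, dt, mass=1.0):
    """Advance positions and velocities by one time step `dt`."""
    f_old = lj_forces(positions)
    half_v = [[v[k] + 0.5 * dt * f[k] / mass for k in range(3)]
              for v, f in zip(velocities, f_old)]
    new_pos = [[p[k] + dt * hv[k] for k in range(3)]
               for p, hv in zip(positions, half_v)]
    f_new = lj_forces(new_pos)
    new_vel = [[hv[k] + 0.5 * dt * f[k] / mass for k in range(3)]
               for hv, f in zip(half_v, f_new)]
    return new_pos, new_vel

# Three hypothetical atoms in reduced units, initially at rest.
atoms = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.5, 0.0]]
velocities = [[0.0, 0.0, 0.0] for _ in atoms]

forces = lj_forces(atoms)
new_atoms, new_velocities = velocity_verlet_step(atoms, velocities, dt=0.001)
```

The double loop over atom pairs is what makes the cost of each step grow quadratically with system size in this naive form, which gives a sense of why atomistic MD is expensive and why coarse-grained alternatives are attractive.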
[5] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res 2000;28(1):235–42.
[6] Schwede T. Protein modeling: what happened to the “protein structure gap”? Structure 2013;21(9):1531–40. https://doi.org/10.1016/j.str.2013.08.007
[7] Somody JC, MacKinnon SS, Windemuth A. Structural coverage of the proteome for pharmaceutical applications. Drug Discov Today 2017;22(12):1792–9. https://doi.org/10.1016/j.drudis.2017.08.004
[8] Bienert S, Waterhouse A, de Beer TAP, Tauriello G, Studer G, Bordoli L, Schwede T. The SWISS-MODEL Repository: new features and functionality. Nucleic Acids Res 2017;45(D1):D313–9. https://doi.org/10.1093/nar/gkw1132
[9] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583–9. https://doi.org/10.1038/s41586-021-03819-2
[10] Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, et al. AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 2022;50(D1):D439–44.
[11] Akdel M, Pires DEV, Pardo EP, Jänes J, Zalevsky AO, Mészáros B, Bryant P, Good LL, Laskowski RA, Pozzati G, Shenoy A, Zhu W, Kundrotas P, Serra VR, Rodrigues CHM, Dunham AS, Burke D, Borkakoti N, Velankar S, Frost A, Basquin J, Lindorff-Larsen K, Bateman A, Kajava AV, Valencia A, Ovchinnikov S, Durairaj J, Ascher DB, Thornton JM, Davey NE, Stein A, Elofsson A, Croll TI, Beltrao P. A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol 2022;29(11):1056–67. https://doi.org/10.1038/s41594-022-00849-w
[12] Porta-Pardo E, Ruiz-Serra V, Valentini S, Valencia A. The structural coverage of the human proteome before and after AlphaFold. PLoS Comput Biol 2022;18(1). https://doi.org/10.1371/journal.pcbi.1009818
[13] Pfab J, Phan NM, Si D. Deeptracer for fast de novo cryo-em protein structure modeling and special studies on cov-related complexes. Proc Natl Acad Sci USA 2021;118(2):e2017525118.
[14] Jin S, Miller MD, Chen M, Schafer NP, Lin X, Chen X, Phillips GN, Wolynes PG. Molecular-replacement phasing using predicted protein structures from awsem-suite. IUCrJ 2020;7(6):1168–78.
[15] Chai L, Zhu P, Chai J, Pang C, Andi B, McSweeney S, Shanklin J, Liu Q. Alphafold protein structure database for sequence-independent molecular replacement. Crystals 2021;11(10):1227.
[16] Abdin O, Nim S, Wen H, Kim PM. PepNN: a deep attention model for the identification of peptide binding sites. Commun Biol 2022;5(1):1–10. https://doi.org/10.1038/s42003-022-03445-2
[17] Pittala S, Bailey-Kellogg C. Learning context-aware structural representations to predict antigen and antibody binding interfaces. Bioinformatics 2020;36(13):3996–4003. https://doi.org/10.1093/bioinformatics/btaa263
[18] Liu R, Hu J. DNABind: a hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins Struct Funct Bioinform 2013;81(11):1885–99. https://doi.org/10.1002/prot.24330
[19] Romero PA, Krause A, Arnold FH. Navigating the protein fitness landscape with Gaussian Processes. Proc Natl Acad Sci USA 2013;110(3):E193–201. https://doi.org/10.1073/pnas.1215251110
[20] Volkov M, Turk J-A, Drizard N, Martin N, Hoffmann B, Gaston-Mathé Y, Rognan D. On the frustration to predict binding affinities from protein-ligand structures with deep neural networks. J Med Chem 2022;65(11):7946–58. https://doi.org/10.1021/acs.jmedchem.2c00487
[21] Mitchell TM, et al. Machine learning, vol. 45. McGraw Hill; 1997.
[22] Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022;23(1):40–55. https://doi.org/10.1038/s41580-021-00407-0
[23] Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2022;23(1):bbab460. https://doi.org/10.1093/bib/bbab460
[24] Sieow BF-L, De Sotto R, Seet ZRD, Hwang IY, Chang MW. Synthetic biology meets machine learning. In: Selvarajoo K, editor. Computational biology and machine learning for metabolic engineering and synthetic biology. Methods in molecular biology. New York, NY: Springer US; 2023. p. 21–39. https://doi.org/10.1007/978-1-0716-2617-7_2
[25] Wainberg M, Merico D, Delong A, Frey BJ. Deep learning in biomedicine. Nat Biotechnol 2018;36(9):829–38. https://doi.org/10.1038/nbt.4233
[26] Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 2019;18(6):463–77. https://doi.org/10.1038/s41573-019-0024-5
[27] Flower DR, North ACT, Sansom CE. The lipocalin protein family: structural and sequence overview. Biochim Biophys Acta (BBA) Protein Struct Mol Enzymol 2000;1482(1):9–24. https://doi.org/10.1016/S0167-4838(00)00148-5
[28] Durairaj J, DiGirolamo A, Bouwmeester HJ, de Ridder D, Beekwilder J, van Dijk AD. An analysis of characterized plant sesquiterpene synthases. Phytochemistry 2019;158:157–65. https://doi.org/10.1016/j.phytochem.2018.10.020
[29] Böhme I, Beck-Sickinger AG. Illuminating the life of GPCRs. Cell Commun Signal 2009;7(1):1–22.
[30] Barreto CAV, Baptista SJ, Preto AJ, Matos-Filipe P, Mourão J, Melo R, Moreira I. Chapter Four - Prediction and targeting of GPCR oligomer interfaces. In: Giraldo J, Ciruela F, editors. Progress in molecular biology and translational science, Vol. 169 of Oligomerization in health and disease: from enzymes to G protein-coupled receptors. Academic Press; 2020. p. 105–49. https://doi.org/10.1016/bs.pmbts.2019.11.007
[31] Bordner AJ. Predicting protein-protein binding sites in membrane proteins. BMC Bioinform 2009;10(1):312. https://doi.org/10.1186/1471-2105-10-312
[32] Heo L, Feig M. Multi-state modeling of G-protein Coupled Receptors at experimental accuracy. bioRxiv preprint; Nov. 2021. https://doi.org/10.1101/2021.11.26.470086
[33] Popov P, Peng Y, Shen L, Stevens RC, Cherezov V, Liu Z-J, Katritch V. Computational design of thermostabilizing point mutations for G Protein-Coupled Receptors. eLife 2018;7:e34729. https://doi.org/10.7554/eLife.34729
[34] Raschka S, Kaufman B. Machine learning and AI-based approaches for bioactive ligand discovery and GPCR-ligand recognition. Methods 2020;180:89–110. https://doi.org/10.1016/j.ymeth.2020.06.016
[35] Cohen P. Protein Kinases - the major drug targets of the twenty-first century? Nat Rev Drug Discov 2002;1(4):309–15. https://doi.org/10.1038/nrd773
[36] Laufer S, Bajorath J. New frontiers in kinases: second generation inhibitors. J Med Chem 2014;57(6):2167–8.
[37] Afanasyeva A, Nagao C, Mizuguchi K. Developing a kinase-specific target selection method using a structure-based machine learning approach. Adv Appl Bioinform Chem AABC 2020;13:27–40. https://doi.org/10.2147/AABC.S278900
[38] de Ávila MB, Xavier MM, Pintro VO, de Azevedo WF. Supervised machine learning techniques to predict binding affinity. A study for Cyclin-Dependent Kinase 2. Biochem Biophys Res Commun 2017;494(1):305–10. https://doi.org/10.1016/j.bbrc.2017.10.035
[39] McSkimming DI, Rasheed K, Kannan N. Classifying kinase conformations using a machine learning approach. BMC Bioinform 2017;18(1):86. https://doi.org/10.1186/s12859-017-1506-2
[40] Ung PM-U, Rahman R, Schlessinger A. Redefining the protein kinase conformational space with machine learning. Cell Chem Biol 2018;25(7):916–24.e2. https://doi.org/10.1016/j.chembiol.2018.05.002
[41] Sun T, Chen Y, Wen Y, Zhu Z, Li M. PremPLI: a machine learning model for predicting the effects of missense mutations on protein-ligand interactions. Commun Biol 2021;4(1). https://doi.org/10.1038/s42003-021-02826-3
[42] Mou Z, Eakes J, Cooper CJ, Foster CM, Standaert RF, Podar M, Doktycz MJ, Parks JM. Machine learning-based prediction of enzyme substrate scope: application to bacterial nitrilases. Proteins Struct Funct Bioinform 2021;89(3):336–47. https://doi.org/10.1002/prot.26019
[43] Robinson SL, Smith MD, Richman JE, Aukema KG, Wackett LP. Machine learning-based prediction of activity and substrate specificity for OleA enzymes in the Thiolase superfamily. Synth Biol 2020;5(1). https://doi.org/10.1093/synbio/ysaa004
[44] Durairaj J, Melillo E, Bouwmeester HJ, Beekwilder J, de Ridder D, van Dijk ADJ. Integrating structure-based machine learning and co-evolution to investigate specificity in plant sesquiterpene synthases. PLoS Comput Biol 2021;17(3):e1008197. https://doi.org/10.1371/journal.pcbi.1008197
[45] He X-h, You C-z, Jiang H-l, Jiang Y, Xu HE, Cheng X. Alphafold2 versus experimental structures: evaluation on g protein-coupled receptors. Acta Pharmacol Sin 2022:1–7.
[46] Timonina D, Sharapova Y, Švedas V, Suplatov D. Bioinformatic analysis of subfamily-specific regions in 3D-structures of homologs to study functional diversity and conformational plasticity in protein superfamilies. Comput Struct Biotechnol J 2021;19:1302–11.
[47] de Lima EB, Júnior WM, de Melo-Minardi RC. Isofunctional protein subfamily detection using data integration and spectral clustering. PLoS Comput Biol 2016;12(6):e1005001. https://doi.org/10.1371/journal.pcbi.1005001
[48] Ahalawat N, Mondal J. Resolving protein conformational plasticity and substrate binding through the lens of machine-learning. bioRxiv preprint; Jan. 2022. https://doi.org/10.1101/2022.01.07.475334
[49] Joshi A, Haspel N, González E. Characterizing protein conformational spaces using dimensionality reduction and algebraic topology. bioRxiv preprint; Nov. 2021. https://doi.org/10.1101/2021.11.16.468545
[50] Peterson LE. K-Nearest neighbor. Scholarpedia 2009;4(2):1883. https://doi.org/10.4249/scholarpedia.1883
[51] Noble WS. What is a support vector machine? Nat Biotechnol 2006;24(12):1565–7. https://doi.org/10.1038/nbt1206-1565
[52] Rasmussen CE. Gaussian processes in machine learning. In: Bousquet O, von Luxburg U, Rätsch G, editors. Advanced lectures on machine learning: ML summer schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2004. p. 63–71. https://doi.org/10.1007/978-3-540-28650-9_4
[53] Breiman L. Random forests. Mach Learn 2001;45(1):5–32. https://doi.org/10.1023/A:1010933404324
[54] Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001;29(5):1189–232.
[55] Cheng J, Randall AZ, Sweredoski MJ, Baldi P. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 2005;33(Web Server issue):W72–6. https://doi.org/10.1093/nar/gki396
[56] Shen Y, Bax A. Protein backbone and sidechain torsion angles predicted from NMR chemical shifts using artificial neural networks. J Biomol NMR 2013;56(3):227–41. https://doi.org/10.1007/s10858-013-9741-y
[57] Mataeimoghadam F, Newton MAH, Dehzangi A, Karim A, Jayaram B, Ranganathan S, Sattar A. Enhancing protein backbone angle prediction by using
simpler models of deep neural networks. Sci Rep 2020;10(1):19430. https://doi.org/10.1038/s41598-020-76317-6
[58] Xu J. Distance-based protein folding powered by deep learning. Proc Natl Acad Sci USA 2019;116(34):16856–65. https://doi.org/10.1073/pnas.1821309116
[59] Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 2018;34(19):3308–15. https://doi.org/10.1093/bioinformatics/bty341
[60] Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 2017;13(1):e1005324. https://doi.org/10.1371/journal.pcbi.1005324
[61] Liu Y, Palmedo P, Ye Q, Berger B, Peng J. Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst 2018;6(1):65–74.e3. https://doi.org/10.1016/j.cels.2017.11.014
[62] Ovchinnikov S, Park H, Varghese N, Huang P-S, Pavlopoulos GA, Kim DE, Kamisetty H, Kyrpides NC, Baker D. Protein structure determination using metagenome sequence data. Science 2017;355(6322):294–8. https://doi.org/10.1126/science.aah4043
[63] Kryshtafovych A, Schwede T, Topf M, Fidelis K, Moult J. Critical assessment of methods of protein structure prediction (CASP) - round XIII. Proteins Struct Funct Bioinform 2019;87(12):1011–20. https://doi.org/10.1002/prot.25823
[64] Kryshtafovych A, Schwede T, Topf M, Fidelis K, Moult J. Critical assessment of methods of protein structure prediction (CASP) - round XIV. Proteins Struct Funct Bioinform 2021;89(12):1607–17. https://doi.org/10.1002/prot.26237
[65] Kuhlman B, Bradley P. Advances in protein structure prediction and design. Nat Rev Mol Cell Biol 2019;20(11):681–97. https://doi.org/10.1038/s41580-019-0163-x
[66] AlQuraishi M. Machine learning in protein structure prediction. Curr Opin Chem Biol 2021;65:1–8. https://doi.org/10.1016/j.cbpa.2021.04.005
[67] Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, Chandler C, Taylor BC, Fisk IM, Vlamakis H, Xavier RJ, Knight R, Cho K, Bonneau R. Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021;12:3168. https://doi.org/10.1038/s41467-021-23303-9
[68] Rauer C, Sen N, Waman VP, Abbasian M, Orengo CA. Computational approaches to predict protein functional families and functional sites. Curr Opin Struct Biol 2021;70:108–22. https://doi.org/10.1016/j.sbi.2021.05.012
[69] Dana JM, Gutmanas A, Tyagi N, Qi G, O’Donovan C, Martin M, Velankar S. SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res 2019;47(D1):D482–9. https://doi.org/10.1093/nar/gky1114
[70] Parthiban V, Gromiha MM, Schomburg D. CUPSAT: prediction of protein stability upon point mutations. Nucleic Acids Res 2006;34(suppl_2):W239–42. https://doi.org/10.1093/nar/gkl190
[71] Li B, Yang YT, Capra JA, Gerstein MB. Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput Biol 2020;16(11):e1008291. https://doi.org/10.1371/journal.pcbi.1008291
[72] Masso M, Vaisman II. Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis. Bioinformatics 2008;24(18):2002–9. https://doi.org/10.1093/bioinformatics/btn353
[73] Quan L, Lv Q, Zhang Y. STRUM: Structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics 2016;32(19):2936–46.
[83] Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP11 statistics and the prediction center evaluation system. Proteins Struct Funct Bioinform 2016;84(S1):15–9. https://doi.org/10.1002/prot.25005
[84] Townshend R, Bedi R, Suriana P, Dror R. End-to-end Learning on 3D protein structure for interface prediction. Adv Neural Inf Process Syst 2019;32.
[85] Sanchez-Garcia R, Sorzano COS, Carazo JM. A method for the prediction of partner-specific protein-protein interfaces. Bioinformatics 2019;35(3):470–7. https://doi.org/10.1093/bioinformatics/bty647
[86] Gainza P, Sverrisson F, Monti F, Rodolà E, Boscaini D, Bronstein MM, Correia BE. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods 2020;17(2):184–92. https://doi.org/10.1038/s41592-019-0666-6
[87] Ghani U, Desta I, Jindal A, Khan O, Jones G, Kotelnikov S, Padhorny D, Vajda S, Kozakov D. Improved docking of protein models by a combination of alphafold2 and cluspro. bioRxiv preprint; Sep. 2021. https://doi.org/10.1101/2021.09.07.459290
[88] Bendell CJ, Liu S, Aumentado-Armstrong T, Istrate B, Cernek PT, Khan S, Picioreanu S, Zhao M, Murgita RA. Transient protein-protein interface prediction: datasets, features, algorithms, and the rad-t predictor. BMC Bioinform 2014;15(1):1–12.
[89] Das S, Chakrabarti S. Classification and prediction of protein-protein interaction interface using machine learning algorithm. Sci Rep 2021;11(1):1–12.
[90] Xu Q, Dunbrack RL. Protcid: a data resource for structural information on protein interactions. Nat Commun 2020;11(1):1–16.
[91] Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, Chaleil R, Jiménez-García B, Bates PA, Fernandez-Recio J, Bonvin AMJJ, Weng Z. Updates to the integrated protein-protein interaction benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J Mol Biol 2015;427(19):3031–41. https://doi.org/10.1016/j.jmb.2015.07.016
[92] Kundrotas PJ, Anishchenko I, Dauzhenka T, Kotthoff I, Mnevets D, Copeland MM, Vakser IA. Dockground: a comprehensive data resource for modeling of protein complexes. Protein Sci 2018;27(1):172–81.
[93] Morehead A, Chen C, Sedova A. Dips-plus: The enhanced database of interacting protein structures for interface prediction. arXiv preprint arXiv:2106.04362; 2021.
[94] Jiménez J, Doerr S, Martínez-Rosell G, Rose AS, Fabritiis GDe. DeepSite: Protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics 2017;33(19):3036–42. https://doi.org/10.1093/bioinformatics/btx350
[95] Kozlovskii I, Popov P. Spatiotemporal identification of druggable binding sites using deep learning. Commun Biol 2020;3(1):1–12. https://doi.org/10.1038/s42003-020-01350-0
[96] Krivák R, Hoksza D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform 2018;10(1):39. https://doi.org/10.1186/s13321-018-0285-8
[97] Desaphy J, Bret G, Rognan D, Kellenberger E. sc-PDB: a 3D-database of ligandable binding sites - 10 years on. Nucleic Acids Res 2015;43(D1):D399–404. https://doi.org/10.1093/nar/gku928
[98] Roy A, Yang J, Zhang Y. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res 2012;40(W1):W471–7. https://doi.org/10.1093/nar/gks372
[99] Schmidtke P, Souaille C, Estienne F, Baurin N, Kroemer RT. Large-scale comparison of four binding site detection algorithms. J Chem Inf Model 2010;50(12):2191–200. https://doi.org/10.1021/ci1000289
[100] Mészáros B, Erdős G, Dosztányi Z. IUPred2A: Context-dependent prediction of
https://doi.org/10.1093/bioinformatics/btw361 protein disorder as a function of redox State and protein binding. Nucleic Acids
[74] Nikam R, Kulandaisamy A, Harini K, Sharma D, Gromiha MM. ProThermDB: Res 2018;46(W1):W329–37. https://doi.org/10.1093/nar/gky384
thermodynamic database for proteins and mutants revisited after [101] McGuffin LJ. Intrinsic disorder prediction from the analysis of multiple protein
15 years. Nucleic Acids Res 2021;49(D1):D420–4. https://doi.org/10.1093/nar/ fold recognition models. Bioinformatics 2008;24(16):1798–804. https://doi.org/
gkaa1035 10.1093/bioinformatics/btn326
[75] R.J. Townshend, M. Vögele, P. Suriana, A. Derry, A. Powers, Y. Laloudakis, S. [102] Schad E, Fichó E, Pancsa R, Simon I, Dosztányi Z, Mészáros B. DIBS: a repository of
Balachandar, B. Jing, B. Anderson, S. Eismann, et al., Atom3d: Tasks on mole disordered binding sites mediating interactions with ordered proteins.
cules in three dimensions, arXiv preprint arXiv:2012.04035 (2020). Bioinformatics 2018;34(3):535–7. https://doi.org/10.1093/bioinformatics/btx640
[76] Naderi M, Lemoine JM, Govindaraj RG, Kana OZ, Feinstein WP, Brylinski M. [103] Piovesan D, Tabaro F, Mičetić I, Necci M, Quaglia F, Oldfield CJ, Aspromonte MC,
Binding site matching in rational drug design: algorithms and applications. Davey NE, Davidović R, Dosztányi Z, Elofsson A, Gasparini A, Hatos A, Kajava AV,
Brief Bioinform 2019;20(6):2167–84. https://doi.org/10.1093/bib/bby078 Kalmar L, Leonardi E, Lazar T, Macedo-Ribeiro S, Macossay-Castillo M, Meszaros
[77] Pu L, Govindaraj RG, Lemoine JM, Wu H-C, Brylinski M. DeepDrug3D: classifi A, Minervini G, Murvai N, Pujols J, Roche DB, Salladini E, Schad E, Schramm A,
cation of ligand-binding pockets in proteins with a convolutional neural net Szabo B, Tantos A, Tonello F, Tsirigos KD, Veljković N, Ventura S, Vranken W,
work. PLoS Comput Biol 2019;15(2):e1006718https://doi.org/10.1371/journal. Warholm P, Uversky VN, Dunker A, Longhi S, Tompa P, Tosatto SC. DisProt 7.0: a
pcbi.1006718 major update of the database of disordered proteins. Nucleic Acids Res
[78] Brylinski M. eMatchSite: Sequence order-independent structure alignments of 2017;45(D1):D219–27. https://doi.org/10.1093/nar/gkw1056
ligand binding pockets in protein models. PLoS Comput Biol [104] Wass MN, Fuentes G, Pons C, Pazos F, Valencia A. Towards the prediction of
2014;10(9):e1003829https://doi.org/10.1371/journal.pcbi.1003829 protein interaction partners using physical docking. Mol Syst Biol
[79] Ragoza M, Hochuli J, Idrobo E, Sunseri J, Koes DR. Protein-ligand scoring with 2011;7(1):469. https://doi.org/10.1038/msb.2011.3
convolutional neural networks. J Chem Inf Model 2017;57(4):942–57. https:// [105] Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, Thu CA, Bisikirska B, Lefebvre C,
doi.org/10.1021/acs.jcim.6b00740 Accili D, Hunter T, Maniatis T, Califano A, Honig B. Structure-based prediction of
[80] Pagès G, Charmettant B, Grudinin S. Protein model quality assessment using 3D protein-protein interactions on a genome-wide scale. Nature
oriented convolutional neural networks. Bioinformatics 2019;35(18):3313–9. 2012;490(7421):556–60. https://doi.org/10.1038/nature11503
https://doi.org/10.1093/bioinformatics/btz122 [106] I.R. Humphreys, J. Pei, M. Baek, A. Krishnakumar, I. Anishchenko, S.
[81] Shen C, Ding J, Wang Z, Cao D, Ding X, Hou T. From machine learning to deep Ovchinnikov, J. Zhang, T.J. Ness, S. Banjade, S. Bagde, V.G. Stancheva, X.-H. Li, K.
learning: advances in scoring functions for protein-ligand docking. WIREs Liu, Z. Zheng, D.J. Barrero, U. Roy, I.S. Fernández, B. Szakal, D. Branzei, E.C.
Comput Mol Sci 2020;10(1):e1429https://doi.org/10.1002/wcms.1429 Greene, S. Biggins, S. Keeney, E.A. Miller, J.C. Fromme, T.L. Hendrickson, Q. Cong,
[82] Hiranuma N, Park H, Baek M, Anishchenko I, Dauparas J, Baker D. Improved D. Baker, Structures of core eukaryotic protein complexes, bioRxiv Preprint
protein structure refinement guided by deep learning based accuracy estima (Sep. 2021). 10.1101/2021.09.30.462231.
tion. Nat Commun 2021;12(1):1340. https://doi.org/10.1038/s41467-021- [107] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database
21511-x of interacting proteins: 2004 update. Nucleic Acids Res
2004;32(suppl_1):D449–51.
J. Durairaj, D. de Ridder and A.D.J. van Dijk Computational and Structural Biotechnology Journal 21 (2023) 630–643
[108] Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, Doncheva NT, Legeay M, Fang T, Bork P, Jensen LJ, von Mering C. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 2021;49(D1):D605–12. https://doi.org/10.1093/nar/gkaa1074
[109] Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TKB, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JGN, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, Pandey A. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003;13(10):2363–71. https://doi.org/10.1101/gr.1680803
[110] Oughtred R, Rust J, Chang C, Breitkreutz B-J, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S, Coulombe-Huntington J, Chatr-Aryamontri A, Dolinski K, Tyers M. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci 2021;30(1):187–200. https://doi.org/10.1002/pro.3978
[111] Kumar R, Nanduri B. HPIDB - a unified resource for host-pathogen interactions. BMC Bioinform 2010;11(6):S16. https://doi.org/10.1186/1471-2105-11-S6-S16
[112] Zhang N, Chen Y, Lu H, Zhao F, Alvarez RV, Goncearenco A, Panchenko AR, Li M. MutaBind2: Predicting the impacts of single and multiple mutations on protein-protein interactions. iScience 2020;23(3). https://doi.org/10.1016/j.isci.2020.100939. (Mar.).
[113] Liu X, Luo Y, Li P, Song S, Peng J. Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol 2021;17(8):e1009284. https://doi.org/10.1371/journal.pcbi.1009284
[114] Geng C, Vangone A, Folkers GE, Xue LC, Bonvin AMJJ. iSEE: Interface structure, evolution, and energy-based machine learning predictor of binding affinity changes upon mutations. Proteins Struct Funct Bioinform 2019;87(2):110–9. https://doi.org/10.1002/prot.25630
[115] Jankauskaitė J, Jiménez-García B, Dapkunas J, Fernández-Recio J, Moal IH. SKEMPI 2.0: an updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation. Bioinformatics 2019;35(3):462–9. https://doi.org/10.1093/bioinformatics/bty635
[116] Jiménez J, Škalič M, Martínez-Rosell G, Fabritiis GDe. KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J Chem Inf Model 2018;58(2):287–96. https://doi.org/10.1021/acs.jcim.7b00650
[117] Ahmed A, Mam B, Sowdhamini R. DEELIG: A deep learning approach to predict protein-ligand binding affinity. Bioinform Biol Insights 2021;15:11779322211030364. https://doi.org/10.1177/11779322211030364
[118] Ballester PJ, Mitchell JBO. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 2010;26(9):1169–75. https://doi.org/10.1093/bioinformatics/btq112
[119] Boyles F, Deane CM, Morris GM. Learning from docked ligands: Ligand-based features rescue structure-based scoring functions when trained on docked poses. J Chem Inf Model 2021. https://doi.org/10.1021/acs.jcim.1c00096. (Sep.).
[120] Kundu I, Paul G, Banerjee R. A machine learning approach towards the prediction of protein-ligand binding affinity based on fundamental molecular properties. RSC Adv 2018;8(22):12127–37. https://doi.org/10.1039/C8RA00003D
[121] Li H, Leung K-S, Wong M-H, Ballester PJ. Improving AutoDock Vina using Random Forest: the growing accuracy of binding affinity prediction by the effective exploitation of larger data sets. Mol Inf 2015;34(2–3):115–26. https://doi.org/10.1002/minf.201400132
[122] S. Li, J. Zhou, T. Xu, L. Huang, F. Wang, H. Xiong, W. Huang, D. Dou, H. Xiong, Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, ACM, Virtual Event Singapore, 2021, pp. 975–985. 10.1145/3447548.3467311.
[123] Stepniewska-Dziubinska MM, Zielenkiewicz P, Siedlecki P. Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics 2018;34(21):3666–74. https://doi.org/10.1093/bioinformatics/bty374
[124] Wójcikowski M, Kukiełka M, Stepniewska-Dziubinska MM, Siedlecki P. Development of a Protein-Ligand Extended Connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics 2019;35(8):1334–41. https://doi.org/10.1093/bioinformatics/bty757
[125] Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, Nie W, Liu Y, Wang R. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 2015;31(3):405–12. https://doi.org/10.1093/bioinformatics/btu626
[126] Hu L, Benson ML, Smith RD, Lerner MG, Carlson HA. Binding MOAD (Mother Of All Databases). Proteins Struct Funct Bioinform 2005;60(3):333–40. https://doi.org/10.1002/prot.20512
[127] Mysinger MM, Carchia M, Irwin JJ, Shoichet BK. Directory of useful decoys, enhanced (dud-e): better ligands and decoys for better benchmarking. J Med Chem 2012;55(14):6582–94.
[128] R. Evans, M. O’Neill, A. Pritzel, N. Antropova, A. Senior, T. Green, A. Žídek, R. Bates, S. Blackwell, J. Yim, O. Ronneberger, S. Bodenstein, M. Zielinski, A. Bridgland, A. Potapenko, A. Cowie, K. Tunyasuvunakool, R. Jain, E. Clancy, P. Kohli, J. Jumper, D. Hassabis, Protein complex prediction with AlphaFold-Multimer, bioRxiv Preprint (Oct. 2021). 10.1101/2021.10.04.463034.
[129] Bryant P, Pozzati G, Elofsson A. Improved prediction of protein-protein interactions using AlphaFold2. Nat Commun 2022;13(1):1265. https://doi.org/10.1038/s41467-022-28865-w
[130] P. Bryant, G. Pozzati, W. Zhu, A. Shenoy, P. Kundrotas, A. Elofsson, Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search, Nat Commun 13(1) (2022) 6028. 10.1038/s41467-022-33729-4.
[131] Yin R, Feng BY, Varshney A, Pierce BG. Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants. Protein Sci 2022;31(8):e4379. https://doi.org/10.1002/pro.4379
[132] M. Baek, R. McHugh, I. Anishchenko, D. Baker, F. DiMaio, Accurate prediction of nucleic acid and protein-nucleic acid complexes using rosettafoldna, bioRxiv (2022). 10.1101/2022.09.09.507333.
[133] Lima AN, Philot EA, Trossini GHG, Scott LPB, Maltarollo VG, Honorio KM. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov 2016;11(3):225–39. https://doi.org/10.1517/17460441.2016.1146250
[134] Zhao J, Cao Y, Zhang L. Exploring the computational methods for protein-ligand binding site prediction. Comput Struct Biotechnol J 2020;18:417–26. https://doi.org/10.1016/j.csbj.2020.02.008
[135] Lee M, Kim D. Large-scale reverse docking profiles and their applications. BMC Bioinform 2012;13(17):S6. https://doi.org/10.1186/1471-2105-13-S17-S6
[136] Grinter SZ, Liang Y, Huang S-Y, Hyder SM, Zou X. An inverse docking approach for identifying new potential anti-cancer targets. J Mol Graph Model 2011;29(6):795–9. https://doi.org/10.1016/j.jmgm.2011.01.002
[137] Fernández A. Artificial intelligence teaches drugs to target proteins by tackling the induced folding problem. Mol Pharm 2020;17(8):2761–7. https://doi.org/10.1021/acs.molpharmaceut.0c00470
[138] Z. Xu, O.R. Wauchope, A.T. Frank, Navigating chemical space by interfacing generative artificial intelligence and molecular docking, J Chem Inf Model 61(11) (2021) 5589–5600. 10.1021/acs.jcim.1c00746.
[139] P. Drotár, A.R. Jamasb, B. Day, C. Cangea, P. Liò, Structure-aware generation of drug-like molecules, arXiv Preprint (Nov. 2021).
[140] Wong F, Krishnan A, Zheng EJ, Stärk H, Manson AL, Earl AM, Jaakkola T, Collins JJ. Benchmarking alphafold-enabled molecular docking predictions for antibiotic discovery. Mol Syst Biol 2022;18(9):e11081.
[141] N. Sen, I. Anishchenko, N. Bordin, I. Sillitoe, S. Velankar, D. Baker, C. Orengo, Characterizing disease-associated human proteins without available protein structures or homologues, bioRxiv Preprint (Nov. 2021). 10.1101/2021.11.17.468998.
[142] Pak MA, Ivankov DN. Best templates outperform homology models in predicting the impact of mutations on protein stability. Bioinformatics 2022;btac515. https://doi.org/10.1093/bioinformatics/btac515
[143] M.A. Pak, K.A. Markhieva, M.S. Novikova, D.S. Petrov, I.S. Vorobyev, E.S. Maksimova, F.A. Kondrashov, D.N. Ivankov, Using alphafold to predict the impact of single mutations on protein stability and function, BioRxiv (2021).
[144] C. Norn, B.I.M. Wicky, D. Juergens, S. Liu, D. Kim, B. Koepnick, I. Anishchenko, F. Players, D. Baker, S. Ovchinnikov, Protein sequence design by explicit energy landscape optimization, bioRxiv (2020). 10.1101/2020.07.23.218917.
[145] D. Tischer, S. Lisanza, J. Wang, R. Dong, I. Anishchenko, L.F. Milles, S. Ovchinnikov, D. Baker, Design of proteins presenting discontinuous functional sites using deep learning, bioRxiv (2020). 10.1101/2020.11.29.402743.
[146] J. Wang, S. Lisanza, D. Juergens, D. Tischer, I. Anishchenko, M. Baek, J.L. Watson, J.H. Chun, L.F. Milles, J. Dauparas, M. Expòsit, W. Yang, A. Saragovi, S. Ovchinnikov, D. Baker, Deep learning methods for designing proteins scaffolding functional sites, bioRxiv Preprint (Nov. 2021). 10.1101/2021.11.10.468128.
[147] Anishchenko I, Pellock SJ, Chidyausiku TM, Ramelot TA, Ovchinnikov S, Hao J, Bafna K, Norn C, Kang A, Bera AK, DiMaio F, Carter L, Chow CM, Montelione GT, Baker D. De novo protein design by deep network hallucination. Nature 2021;600(7889):547–52. https://doi.org/10.1038/s41586-021-04184-w
[148] Lin Y-R, Koga N, Tatsumi-Koga R, Liu G, Clouser AF, Montelione GT, Baker D. Control over overall shape and size in de novo designed proteins. Proc Natl Acad Sci USA 2015;112(40):E5478–85. https://doi.org/10.1073/pnas.1509508112
[149] Marcos E, Chidyausiku TM, McShan AC, Evangelidis T, Nerli S, Carter L, Nivón LG, Davis A, Oberdorfer G, Tripsianes K, Sgourakis NG, Baker D. De novo design of a non-local β-sheet protein with high stability and accuracy. Nat Struct Mol Biol 2018;25(11):1028–34. https://doi.org/10.1038/s41594-018-0141-6
[150] Baker D. What has de novo protein design taught us about protein folding and biophysics? Protein Sci 2019;28(4):678–83. https://doi.org/10.1002/pro.3588
[151] N. Ferruz, M. Heinzinger, M. Akdel, A. Goncearenco, L. Naef, C. Dallago, From sequence to function through structure: deep learning for protein design, bioRxiv (2022).
[152] Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 2004;337(3):635–45. https://doi.org/10.1016/j.jmb.2004.02.002
[153] A. Gupta, S. Dey, H.-X. Zhou, Artificial Intelligence Guided Conformational Mining of Intrinsically Disordered Proteins, bioRxiv Preprint (Nov. 2021). 10.1101/2021.11.21.469457.
[154] Budowski-Tal I, Nov Y, Kolodny R. FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc Natl Acad Sci USA 2010;107(8):3481–6. https://doi.org/10.1073/pnas.0914097107
[155] Liu Y, Ye Q, Wang L, Peng J. Learning structural motif representations for efficient protein structure search. Bioinformatics 2018;34(17):i773–80. https://doi.org/10.1093/bioinformatics/bty585
[156] Guzenko D, Burley SK, Duarte JM. Real time structural search of the protein data bank. PLoS Comput Biol 2020;16(7):e1007970. https://doi.org/10.1371/journal.pcbi.1007970
[157] T. Aderinwale, V. Bharadwaj, C. Christoffer, G. Terashi, Z. Zhang, R. Jahandideh, Y. Kagaya, D. Kihara, Real-Time Structure Search and Structure Classification for AlphaFold Protein Models, bioRxiv Preprint (Oct. 2021). 10.1101/2021.10.21.465371.
[158] Foldseek: fast and accurate protein structure search, bioRxiv. 10.1101/2022.02.07.479398v4
[159] N. Bordin, I. Sillitoe, V. Nallapareddy, C. Rauer, S.D. Lam, V.P. Waman, N. Sen, M. Heinzinger, M. Littmann, S. Kim, S. Velankar, M. Steinegger, B. Rost, C. Orengo, AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms, bioRxiv (Jun. 2022). 10.1101/2022.06.02.494367.
[160] Niu B, Scott AD, Sengupta S, Bailey MH, Batra P, Ning J, Wyczalkowski MA, Liang W-W, Zhang Q, McLellan MD, Sun SQ, Tripathi P, Lou C, Ye K, Mashl RJ, Wallis J, Wendl MC, Chen F, Ding L. Protein-structure-guided discovery of functional mutations across 19 cancer types. Nat Genet 2016;48(8):827–37. https://doi.org/10.1038/ng.3586
[161] Berliner N, Teyra J, Çolak R, Lopez SG, Kim PM. Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PLoS One 2014;9(9):e107353. https://doi.org/10.1371/journal.pone.0107353
[162] Terwilliger TC, Liebschner D, Croll TI, Williams CJ, McCoy AJ, Poon BK, Afonine PV, Oeffner RD, Richardson JS, Read RJ, Adams PD. AlphaFold predictions: great hypotheses but no match for experiment, preprint. Biochemistry 2022. https://doi.org/10.1101/2022.11.21.517405. (Nov.).
[163] Hubbard SJ, Thornton JM, et al. naccess, computer program, department of biochemistry and molecular biology. Univ Coll Lond 1993;2(1).
[164] Mihel J, Šikić M, Tomić S, Jeren B, Vlahoviček K. Psaia-protein structure and interaction analyzer. BMC Struct Biol 2008;8(1):1–11.
[165] Mitternacht S. Freesasa: An open source c library for solvent accessible surface area calculations. F1000Research 2016;5.
[166] Touw WG, Baakman C, Black J, Te Beek TA, Krieger E, Joosten RP, Vriend G. A series of pdb-related databanks for everyday needs. Nucleic Acids Res 2015;43(D1):D364–8.
[167] Ruiz-Blanco YB, Paz W, Green J, Marrero-Ponce Y. Protdcal: A program to compute general-purpose-numerical descriptors for sequences and 3d-structures of proteins. BMC Bioinform 2015;16(1):1–15.
[168] Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics 2009;25(11):1422–3.
[169] Sanner MF, Olson AJ, Spehner J-C. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 1996;38(3):305–20.
[170] R.J. Gowers, M. Linke, J. Barnoud, T.J.E. Reddy, M.N. Melo, S.L. Seyler, J. Domanski, D.L. Dotson, S. Buchoux, I.M. Kenney, et al., Mdanalysis: a python package for the rapid analysis of molecular dynamics simulations, Tech. rep., Los Alamos National Lab. (LANL), Los Alamos, NM (United States) (2019).
[171] Buß O, Rudat J, Ochsenreither K. Foldx as protein engineering tool: better than random based approaches? Comput Struct Biotechnol J 2018;16:25–33.
[172] Alford RF, Leaver-Fay A, Jeliazkov JR, O’Meara MJ, DiMaio FP, Park H, Shapovalov MV, Renfrew PD, Mulligan VK, Kappel K, et al. The rosetta all-atom energy function for macromolecular modeling and design. J Chem Theory Comput 2017;13(6):3031–48.
[173] Baker NA, Sept D, Joseph S, Holst MJ, McCammon JA. Electrostatics of nanosystems: application to microtubules and the ribosome. Proc Natl Acad Sci USA 2001;98(18):10037–41.
[174] Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT. The disopred server for the prediction of protein disorder. Bioinformatics 2004;20(13):2138–9.
[175] Bakan A, Meireles LM, Bahar I. Prody: protein dynamics inferred from theory and experiments. Bioinformatics 2011;27(11):1575–7.
[176] Mikulska-Ruminska K, Kulik AJ, Kaya C, BenAdiba C, Dietler G, Nowak W, Bahar I. Mechstiff: A new tool for evaluating stress-induced dynamics and application to cell adhesion proteins. Biophys J 2017;112(3):45a–6a.
[177] Atilgan C, Atilgan AR. Perturbation-response scanning reveals ligand entry-exit mechanisms of ferric binding protein. PLoS Comput Biol 2009;5(10):e1000544.
[178] Shegay MV, Suplatov DA, Popova NN, Švedas VK, Voevodin VV. parMATT: Parallel multiple alignment of protein 3D-structures with translations and twists for distributed-memory systems. Bioinformatics 2019;35(21):4456–8. https://doi.org/10.1093/bioinformatics/btz224
[179] J. Durairaj, M. Akdel, D. de Ridder, A.D. van Dijk, Fast and adaptive protein structure representations for machine learning, bioRxiv Preprint (Apr. 2021). 10.1101/2021.04.07.438777.
[180] Shegay MV, Švedas VK, Voevodin VV, Suplatov DA, Popova NN. Guide tree optimization with genetic algorithm to improve multiple protein 3D-structure alignment. Bioinformatics 2021. https://doi.org/10.1093/bioinformatics/btab798
[181] Ezkurdia I, Bartoli L, Fariselli P, Casadio R, Valencia A, Tress ML. Progress and challenges in predicting protein-protein interaction sites. Brief Bioinform 2009;10(3):233–46. https://doi.org/10.1093/bib/bbp021
[182] Poupon A. Voronoi and voronoi-related tessellations in studies of protein structure and interaction. Curr Opin Struct Biol 2004;14(2):233–41. https://doi.org/10.1016/j.sbi.2004.03.010
[183] Pan Y, Wang Z, Zhan W, Deng L. Computational identification of binding energy hot spots in protein-RNA complexes using an ensemble approach. Bioinformatics 2018;34(9):1473–80. https://doi.org/10.1093/bioinformatics/btx822
[184] Igashov I, Olechnovič K, Kadukova M, Venclovas Č, Grudinin S. VoroCNN: Deep convolutional neural network built on 3D voronoi tessellation of protein structures. Bioinformatics 2021;37(16):2332–9. https://doi.org/10.1093/bioinformatics/btab118
[185] Bernauer J, Bahadur RP, Rodier F, Janin J, Poupon A. DiMoVo: A voronoi tessellation-based method for discriminating crystallographic and biological protein-protein interactions. Bioinformatics 2008;24(5):652–8. https://doi.org/10.1093/bioinformatics/btn022
[186] Durairaj J, Akdel M, de Ridder D, van Dijk ADJ. Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics 2020;36(Supplement_2):i718–25. https://doi.org/10.1093/bioinformatics/btaa839
[187] Kihara D, Sael L, Chikhi R, Esquivel-Rodriguez J. Molecular surface representation using 3D Zernike descriptors for protein shape comparison and docking. Curr Protein Peptide Sci 2011;12(6):520–30. https://doi.org/10.2174/138920311796957612
[188] Yin S, Proctor EA, Lugovskoy AA, Dokholyan NV. Fast screening of protein surfaces using geometric invariant fingerprints. Proc Natl Acad Sci USA 2009;106(39):16622–6. https://doi.org/10.1073/pnas.0906146106
[189] Namrata A, Po-Ssu H. Generative modeling for protein structures. Adv Neural Inf Process Syst 2018:7494–505.
[190] Jiang M, Li Z, Zhang S, Wang S, Wang X, Yuan Q, Wei Z. Drug-target affinity prediction using graph neural network and contact maps. RSC Adv 2020;10(35):20701–12. https://doi.org/10.1039/D0RA02297G
[191] Wang X, Flannery ST, Kihara D. Protein docking model evaluation by graph neural networks. Front Mol Biosci 2021;8:402. https://doi.org/10.3389/fmolb.2021.647915
[192] Strokach A, Becerra D, Corbi-Verge C, Perez-Riba A, Kim PM. Fast and flexible protein design using deep graph neural networks. Cell Syst 2020;11(4):402–11.e4. https://doi.org/10.1016/j.cels.2020.08.016
[193] Ingraham J, Garg V, Barzilay R, Jaakkola T. Generative models for graph-based protein design. Adv Neural Inf Process Syst 2019;32:15820–31.
[194] Q. Yuan, S. Chen, J. Rao, S. Zheng, H. Zhao, Y. Yang, AlphaFold2-aware protein-DNA binding site prediction using graph transformer, bioRxiv Preprint (Dec. 2021). 10.1101/2021.08.25.457661.
[195] A.R. Jamasb, R. Viñas, E.J. Ma, C. Harris, K. Huang, D. Hall, P. Lió, T.L. Blundell, Graphein - a Python library for geometric deep learning and network analysis on protein structures and interaction networks, bioRxiv Preprint (Oct. 2021). 10.1101/2020.07.15.204701.
[196] Somnath VR, Bunne C, Krause A. Multi-scale representation learning on proteins. Adv Neural Inf Process Syst 2021;34.
[197] Lim J, Ryu S, Park K, Choe YJ, Ham J, Kim WY. Predicting drug-target interaction using a novel graph neural network with 3D structure-embedded graph representation. J Chem Inf Model 2019;59(9):3981–8. https://doi.org/10.1021/acs.jcim.9b00387
[198] Morrone JA, Weber JK, Huynh T, Luo H, Cornell WD. Combining docking pose rank and structure with deep learning improves protein-ligand binding mode prediction over a baseline docking approach. J Chem Inf Model 2020;60(9):4170–9. https://doi.org/10.1021/acs.jcim.9b00927
[199] Sunseri J, King JE, Francoeur PG, Koes DR. Convolutional neural network scoring and minimization in the D3R 2017 community challenge. J Comput Aided Mol Des 2019;33(1):19–34. https://doi.org/10.1007/s10822-018-0133-y
[200] Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V. MoleculeNet: A benchmark for molecular machine learning. Chem Sci 2018;9(2):513–30. https://doi.org/10.1039/C7SC02664A
[201] Qin T, Zhu Z, Wang XS, Xia J, Wu S. Computational representations of protein-ligand interfaces for structure-based virtual screening. Expert Opin Drug Discov 2021;16(10):1175–92. https://doi.org/10.1080/17460441.2021.1929921
[202] Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 2019;16(12):1315–22. https://doi.org/10.1038/s41592-019-0598-1
[203] T. Bepler, B. Berger, Learning protein sequence embeddings using information from structure, arXiv Preprint (Oct. 2019). arXiv:1902.08661.
[204] Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, Rost B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform 2019;20(1):723. https://doi.org/10.1186/s12859-019-3220-8
[205] Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021;118(15). https://doi.org/10.1073/pnas.2016239118. (Apr.).
[206] Mansoor S, Baek M, Madan U, Horvitz E. Toward more general embeddings for protein design: harnessing joint representations of sequence and structure. bioRxiv Preprint 2021. https://doi.org/10.1101/2021.09.01.458592. (Sep.).
[207] P. Hermosilla, T. Ropinski, Contrastive representation learning for 3d protein structures, arXiv preprint arXiv:2205.15675 (2022).
[208] C. Chen, Y. Zha, D. Zhu, K. Ning, X. Cui, Hydrogen bonds meet self-attention: all you need for general-purpose protein structure embedding, bioRxiv Preprint (Aug. 2021). 10.1101/2021.01.31.428935.
[209] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.
[210] F. Sverrisson, J. Feydy, B.E. Correia, M.M. Bronstein, Fast end-to-end learning on protein surfaces, bioRxiv Preprint (Dec. 2020). 10.1101/2020.12.28.424589.
[211] G. Corso, H. Stärk, B. Jing, R. Barzilay, T. Jaakkola, DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking, arXiv:2210.01776 [physics, q-bio] (Oct. 2022). 10.48550/arXiv.2210.01776.
[212] O.-E. Ganea, X. Huang, C. Bunne, Y. Bian, R. Barzilay, T. Jaakkola, A. Krause, Independent SE(3)-equivariant models for end-to-end rigid protein docking, arXiv:2111.07786 [cs] (Mar. 2022). 10.48550/arXiv.2111.07786.
[213] A. Schneuing, Y. Du, C. Harris, A. Jamasb, I. Igashov, W. Du, T. Blundell, P. Lió, C. Gomes, M. Welling, M. Bronstein, B. Correia, Structure-based drug design with equivariant diffusion models, arXiv:2210.13695 [cs, q-bio] (Oct. 2022). 10.48550/arXiv.2210.13695.
[214] Kim PT, Winter R, Clevert D-A. Unsupervised representation learning for proteochemometric modeling. Int J Mol Sci 2021;22(23):12882. https://doi.org/10.3390/ijms222312882
[215] Villegas-Morcillo A, Makrodimitris S, van Ham RCHJ, Gomez AM, Sanchez V, Reinders MJT. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 2021;37(2):162–70. https://doi.org/10.1093/bioinformatics/btaa701
[216] S. Sledzieski, R. Singh, L. Cowen, B. Berger, Sequence-based prediction of protein-protein interactions: a structure-aware interpretable deep learning model, bioRxiv (2021). 10.1101/2021.01.22.427866.
[217] M. Heinzinger, M. Littmann, I. Sillitoe, N. Bordin, C. Orengo, B. Rost, Contrastive learning on protein embeddings enlightens midnight zone at lightning speed, bioRxiv Preprint (Nov. 2021). 10.1101/2021.11.14.468528.
[218] Y. Zhang, P. Li, F. Pan, H. Liu, P. Hong, X. Liu, J. Zhang, Applications of AlphaFold beyond protein structure prediction, bioRxiv Preprint (Dec. 2021). 10.1101/2021.11.03.467194.
[219] Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L, Lepore R, Schwede T. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 2018;46(W1):W296–303. https://doi.org/10.1093/nar/gky427
[220] M. Mirdita, S. Ovchinnikov, M. Steinegger, ColabFold - Making protein folding accessible to all, bioRxiv Preprint (Aug. 2021). 10.1101/2021.08.15.456425.
[221] Weißenow K, Heinzinger M, Rost B. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure 2022.
[222] AlQuraishi M, Sorger PK. Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms. Nat Methods 2021;18(10):1169–80. https://doi.org/10.1038/s41592-021-01283-4
[223] Ferruz N, Heinzinger M, Akdel M, Goncearenco A, Naef L, Dallago C. From sequence to function through structure: deep learning for protein design. Comput Struct Biotechnol J 2023;21:238–50. https://doi.org/10.1016/j.csbj.2022.11.014
[224] Dauparas J, Anishchenko I, Bennett N, Bai H, Ragotte RJ, Milles LF, Wicky BIM, Courbet A, de Haas RJ, Bethel N, Leung PJY, Huddy TF, Pellock S, Tischer D, Chan F, Koepnick B, Nguyen H, Kang A, Sankaran B, Bera AK, King NP, Baker D. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022;378(6615):49–56. https://doi.org/10.1126/science.add2187
[225] J.L. Watson, D. Juergens, N.R. Bennett, B.L. Trippe, J. Yim, H.E. Eisenach, W. Ahern, A.J. Borst, R.J. Ragotte, L.F. Milles, B.I.M. Wicky, N. Hanikel, S.J. Pellock, A. Courbet, W. Sheffler, J. Wang, P. Venkatesh, I. Sappington, S.V. Torres, A. Lauko, V.D. Bortoli, E. Mathieu, R. Barzilay, T.S. Jaakkola, F. DiMaio, M. Baek, D. Baker, Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models, pages: 2022.12.09.519842 Section: New Results (Dec. 2022). 10.1101/2022.12.09.519842.
[226] Kmiecik S, Kouza M, Badaczewska-Dawid AE, Kloczkowski A, Kolinski A. Modeling of protein structural flexibility and large-scale dynamics: coarse-grained simulations and Elastic Network Models. Int J Mol Sci 2018;19(11):3496. https://doi.org/10.3390/ijms19113496
[227] Hollingsworth SA, Dror RO. Molecular dynamics simulation for all. Neuron 2018;99(6):1129–43. https://doi.org/10.1016/j.neuron.2018.08.011
[228] Quesne MG, Borowski T, de Visser SP. Quantum mechanics/molecular mechanics modeling of enzymatic processes: caveats and breakthroughs. Chem Eur J 2016;22(8):2562–81. https://doi.org/10.1002/chem.201503802
[229] Atilgan AR, Durell SR, Jernigan RL, Demirel MC, Keskin O, Bahar I. Anisotropy of fluctuation dynamics of proteins with an Elastic Network Model. Biophys J 2001;80(1):505–15. https://doi.org/10.1016/S0006-3495(01)76033-X
[230] Jamroz M, Orozco M, Kolinski A, Kmiecik S. Consistent view of protein fluctuations from all-atom molecular dynamics and coarse-grained dynamics with knowledge-based force-field. J Chem Theory Comput 2013;9(1):119–25. https://doi.org/10.1021/ct300854w
[231] Frappier V, Najmanovich RJ. A coarse-grained elastic network atom contact model and its use in the simulation of protein dynamics and the prediction of the effect of mutations. PLoS Comput Biol 2014;10(4):e1003569. https://doi.org/10.1371/journal.pcbi.1003569
[232] Tekpinar M, Zheng W. Predicting order of conformational changes during protein conformational transitions using an interpolated Elastic Network Model. Proteins Struct Funct Genet 2010;78(11):2469–81. https://doi.org/10.1002/prot.22755
[233] Kmiecik S, Gront D, Kouza M, Kolinski A. From coarse-grained to atomic-level characterization of protein dynamics: transition state for the folding of B domain of protein A. J Phys Chem B 2012;116(23):7026–32. https://doi.org/10.1021/jp301720w
[234] Mahajan S, Sanejouand Y-H. On the relationship between low-frequency normal modes and the large-scale conformational changes of proteins. Arch Biochem Biophys 2015;567:59–65. https://doi.org/10.1016/j.abb.2014.12.020
[235] Yang L, Song G, Jernigan RL. How well can we understand large-scale protein motions using normal modes of Elastic Network Models? Biophys J 2007;93(3):920–9. https://doi.org/10.1529/biophysj.106.095927
[236] Takada S, Kanada R, Tan C, Terakawa T, Li W, Kenzaki H. Modeling structural dynamics of biomolecular complexes by coarse-grained molecular simulations. Acc Chem Res 2015;48(12):3026–35. https://doi.org/10.1021/acs.accounts.5b00338
[237] Singharoy A, Teo I, McGreevy R, Stone JE, Zhao J, Schulten K. Molecular dynamics-based refinement and validation for sub-5 Å cryo-electron microscopy maps. eLife 2016;5. https://doi.org/10.7554/eLife.16105. (Jul.).
[238] Mirjalili V, Noyes K, Feig M. Physics-based protein structure refinement through multiple molecular dynamics trajectories and structure averaging. Proteins Struct Funct Genet 2014;82(Suppl 2):196–207. https://doi.org/10.1002/prot.24336
[239] Gniewek P, Kolinski A, Jernigan RL, Kloczkowski A. Elastic network normal modes provide a basis for protein structure refinement. J Chem Phys 2012;136(19):195101. https://doi.org/10.1063/1.4710986
[240] Schneider J, Korshunova K, SiChaib Z, Giorgetti A, Alfonso-Prieto M, Carloni P. Ligand pose predictions for human G Protein-Coupled Receptors: insights from the Amber-based hybrid molecular mechanics/coarse-grained approach. J Chem Inf Model 2020;60(10):5103–16. https://doi.org/10.1021/acs.jcim.0c00661
[241] Wang A, Zhang Y, Chu H, Liao C, Zhang Z, Li G. Higher accuracy achieved for protein-ligand binding pose prediction by Elastic Network Model-based ensemble docking. J Chem Inf Model 2020;60(6):2939–50. https://doi.org/10.1021/acs.jcim.9b01168
[242] Cavasotto CN. Normal mode-based approaches in receptor ensemble docking. In: Baron R, editor. Computational drug discovery and design, methods in molecular biology. New York, NY: Springer; 2012. p. 157–68. https://doi.org/10.1007/978-1-61779-465-0_11
[243] Evangelista Falcon W, Ellingson SR, Smith JC, Baudry J. Ensemble docking in drug discovery: how many protein configurations from molecular dynamics simulations are needed to reproduce known ligand binding? J Phys Chem B 2019;123(25):5189–95. https://doi.org/10.1021/acs.jpcb.8b11491
[244] Stansfeld PJ, Sansom MSP. From coarse grained to atomistic: a serial multiscale approach to membrane protein simulations. J Chem Theory Comput 2011;7(4):1157–66. https://doi.org/10.1021/ct100569y
[245] Noé F, Tkatchenko A, Müller K-R, Clementi C. Machine learning for molecular simulation. Annu Rev Phys Chem 2020;71(1):361–90. https://doi.org/10.1146/annurev-physchem-042018-052331
[246] Noé F, De Fabritiis G, Clementi C. Machine learning for protein folding and dynamics. Curr Opin Struct Biol 2020;60:77–84. https://doi.org/10.1016/j.sbi.2019.12.005
[247] Jin Y, Johannissen LO, Hay S. Predicting new protein conformations from molecular dynamics simulation conformational landscapes and machine learning. Proteins Struct Funct Bioinform 2021;89(8):915–21.
[248] Karamzadeh R, Karimi-Jafari MH, Sharifi-Zarchi A, Chitsaz H, Salekdeh GH, Moosavi-Movahedi AA. Machine learning and network analysis of molecular dynamics trajectories reveal two chains of red/ox-specific residue interactions in human protein Disulfide Isomerase. Sci Rep 2017;7(1):3666. https://doi.org/10.1038/s41598-017-03966-5
[249] Spiwok V, Kříž P. Time-lagged t-Distributed Stochastic Neighbor Embedding (t-SNE) of molecular simulation trajectories. Front Mol Biosci 2020;7.
[250] Wang DD, Ou-Yang L, Xie H, Zhu M, Yan H. Predicting the impacts of mutations on protein-ligand binding affinity based on molecular dynamics simulations and machine learning methods. Comput Struct Biotechnol J 2020;18:439–54. https://doi.org/10.1016/j.csbj.2020.02.007
[251] Marchetti F, Moroni E, Pandini A, Colombo G. Machine learning prediction of allosteric drug activity from molecular dynamics. J Phys Chem Lett 2021;12(15):3724–32. https://doi.org/10.1021/acs.jpclett.1c00045
[252] Glazer DS, Radmer RJ, Altman RB. Improving structure-based function prediction using molecular dynamics. Structure 2009;17(7):919–29. https://doi.org/10.1016/j.str.2009.05.010
[253] C. Outeiral, D.A. Nissley, C.M. Deane, Current protein structure predictors do not produce meaningful folding pathways, bioRxiv Preprint (Sep. 2021). 10.1101/2021.09.20.461137.
[254] Hochuli J, Helbling A, Skaist T, Ragoza M, Koes DR. Visualizing convolutional neural network protein-ligand scoring. J Mol Graph Model 2018;84:96–108. https://doi.org/10.1016/j.jmgm.2018.06.005
[255] Kim E, Goren A, Ast G. Alternative splicing: current perspectives. BioEssays 2008;30(1):38–47. https://doi.org/10.1002/bies.20692
[256] Owji H, Nezafat N, Negahdaripour M, Hajiebrahimi A, Ghasemi Y. A comprehensive review of signal peptides: structure, roles, and applications. Eur J Cell Biol 2018;97(6):422–41. https://doi.org/10.1016/j.ejcb.2018.06.003
[257] Ribeiro AJM, Das S, Dawson N, Zaru R, Orchard S, Thornton JM, Orengo C, Zeqiraj E, Murphy JM, Eyers PA. Emerging concepts in pseudoenzyme classification, evolution, and signaling. Sci Signal 2019;12(594). https://doi.org/10.1126/scisignal.aat9797. (Aug.).
[258] Smith LM, Kelleher NL. Proteoforms as the next proteomics currency. Science 2018;359(6380):1106–7.
[259] Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ. Next-generation machine learning for biological networks. Cell 2018;173(7):1581–92. https://doi.org/10.1016/j.cell.2018.05.015
[260] Fuentealba M, Dönertas HM, Williams R, Labbadia J, Thornton JM, Partridge L. Using the drug-protein interactome to identify anti-ageing compounds for humans. PLoS Comput Biol 2019;15(1):e1006639. https://doi.org/10.1371/journal.pcbi.1006639
[261] Murray D, Petrey D, Honig B. Integrating 3D structural information into systems biology. J Biol Chem 2021;296:100562. https://doi.org/10.1016/j.jbc.2021.100562
[262] Aloy P, Russell RB. Structural systems biology: modelling protein interactions. Nat Rev Mol Cell Biol 2006;7(3):188–97. https://doi.org/10.1038/nrm1859