Base Paper
In Brief
Prediction of protein structure from
sequence is important for understanding
protein function, but it remains very
challenging, especially for proteins with
few homologs. Existing prediction
methods are human engineered, with
many complex parts developed over
decades. We introduce a new approach
based entirely on machine learning that
predicts protein structure from sequence
using a single neural network. The model
achieves state-of-the-art accuracy and
does not require co-evolution information
or structural homologs. It is also much
faster, making predictions in milliseconds
versus hours or days, which enables new
applications in drug discovery and protein design.

Highlights
• Neural network predicts protein structure from sequence without using co-evolution
Article
End-to-End Differentiable
Learning of Protein Structure
Mohammed AlQuraishi1,2,3,*
1Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
2Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
3Lead Contact
*Correspondence: [email protected]
https://doi.org/10.1016/j.cels.2019.03.006
Cell Systems 8, 292–301, April 24, 2019 © 2019 Elsevier Inc.
Figure 1. Conventional Pipelines for Protein Structure Prediction
Prediction process begins with query sequence (top, green box) whose constituent domains and co-evolutionary relationships are identified through multiple sequence alignments. In free modeling (left), fragment libraries are searched to derive distance restraints, which, along with restraints derived from co-evolutionary data, guide simulations that iteratively minimize energy through sampling. Coarse conformations are then refined to yield the final structure. In template-based modeling (right pipeline), the PDB is searched for templates. If found, fragments from one or more templates are combined to assemble a structure, which is then optimized and refined to yield the final structure. Orange boxes indicate sources of input information beyond query sequence, including prior physical knowledge. Diagram is modeled on the I-TASSER and Quark pipelines (Zhang et al., 2018).
[Figure 1 diagram labels (image omitted): protein sequence; domain splitting; fragment search; threading; statistical and physical potentials (H-bonding, van der Waals, solvation, electrostatics, atomic distances, strand packing); distance restraints; co-evolution contacts; selected templates; template sorting; template fragments; structure assembly; clustering; structure reassembly; atomic refinement.]
[Figure 2 diagram (labels only; image omitted): Internal State, Prior State, Next State, Weighted Average, Circular Projection, Torsional Angles (φ, ψ, ω), GEOMETRY, Nascent Structure, Existing Atoms (Input), Bond Rotation, Bond Extension, Torsional Rotation, Output, Predicted Structure, Experimental Structure.]
without violating their covalent chemistry, (3) coupling local protein structure to its global representation via recurrent geometric units, and (4) using a differentiable loss function to capture deviations between predicted and experimental structures. We find that the new approach outperforms other methods, including co-evolution ones, when predicting novel folds even though it uses only primary sequences and position-specific scoring matrices (PSSMs) that summarize individual residue propensities for mutation. We also find that when predicting known folds, the new approach is on average within 1–2 Å of other approaches, including template-based ones, despite being template-free.

RESULTS

Recurrent Geometric Networks
Our model takes a sequence of amino acids and PSSMs as input and outputs a 3D structure. It comprises three stages—computation, geometry, and assessment—which we term a recurrent geometric network (RGN). The first stage is made of computational units that for each residue position, integrate information about its amino acid and PSSM with information coming from adjacent units. By laying these units in a recurrent bidirectional topology (Figure 2), the computations for each residue integrate information from residues upstream and downstream all the way to the N and C terminus, covering the entire protein. By further stacking units in multiple layers (data not shown), the model implicitly encodes a multi-scale representation of proteins. Each unit outputs three numbers, corresponding to the torsional angles of the residue. We do not specify a priori how angles are computed. Instead, each unit's computation is described by an equation whose parameters are optimized so that RGNs accurately predict structures.

The second stage is made of geometric units that take as input the torsional angles for a given residue and the partially completed backbone resulting from the geometric unit upstream of it, and output a new backbone extended by one residue, which is fed into the adjacent downstream unit (AlQuraishi, 2019a; Parsons et al., 2005). The last unit outputs the completed 3D structure of the protein. During model training, a third stage computes deviations between predicted and experimental structures using the distance-based root mean square deviation (dRMSD) metric. The dRMSD first computes pairwise distances between all atoms in the predicted structure and all atoms in the experimental one (separately) and then computes the root mean square of the distance between these sets of distances. Because dRMSD is distance-based, it is invariant to reflections, which can lead RGNs to predict reflected structures (effectively wrong chirality) that must be corrected by a counter-reflection. RGN parameters are optimized to minimize the dRMSD between predicted and experimental structures using backpropagation (Goodfellow et al., 2016). Hyperparameters, which describe higher-level aspects of the model such as the number of computational units, were determined through manual exploration of hyperparameter space. See Supplemental Information for a complete mathematical treatment.

Assessment of Model Error
Machine learning models must be trained against as large a proportion of available data as possible to fit model parameters and
then evaluated against a distinct test set to assess accuracy. Reliable evaluation is frequently complicated by unanticipated information leakage from the training set into the test set, especially for protein sequences that share an underlying evolutionary relationship. Partly to address this problem, the critical assessment of protein structure prediction (CASP) (Moult et al., 1995) was organized to assess methods in a blinded fashion, by testing predictors using sequences of solved structures that have not been publicly released. To assess RGNs, we therefore sought to recreate the conditions of past CASPs by assembling the ProteinNet datasets (AlQuraishi, 2019b). For every CASP from 7 through 12, we created a corresponding ProteinNet test set comprising CASP structures, and a ProteinNet training set comprising all sequences and structures publicly available prior to the start of that CASP. Using multiple CASP datasets enables a deeper and more thorough assessment that spans a broader range of dataset sizes than relying on the most recent CASP alone. We also adopted the CASP division of test structures into free modeling (FM) targets that assess prediction of novel folds and template-based modeling (TBM and TBM-hard) targets that assess prediction of folds with known homologs in the Protein Data Bank (PDB) (Bernstein et al., 1977). We set aside a subset of the training data as a validation set to determine when to stop model training and to further insulate training and test data.

ProteinNet datasets were used for all analyses described here. RGN hyperparameters were fit by repeated evaluations on the ProteinNet 11 validation set, followed by three evaluations on the ProteinNet 11 test set. Once chosen, the same hyperparameters were used to train models on ProteinNet 7–12 training sets, with a single evaluation made at the end on each test set (excepting ProteinNet 11) to generate Table 1. Subsequently, additional test set evaluations were made to generate Table S1, with one evaluation per number reported. No additional test set evaluations were made. Overall, this represents a rigorous approach to evaluation with the lowest possible risk of information leakage.

Predicting New Folds without Co-evolution
We first assessed RGNs on a difficult task that has not consistently been achieved by any existing method: predicting novel protein folds without co-evolutionary data. FM structures served as targets for this exercise. Table 1 compares the average dRMSD of RGN predictions on FM structures to the top five automated predictors in CASP 7–12, known as "servers" in CASP parlance ("humans" are combined server and human-expert pipelines—we do not compare against this group as our processing is automated). In Figure 3A, we break down the predictions by target against the top performing server and in Figure 3C against the dRMSD distribution of all CASP servers.

On all CASPs, RGNs had the best performance, even compared to servers that use co-evolution data (in CASP 11 [Kryshtafovych et al., 2016; Ovchinnikov et al., 2016] and CASP 12 [Schaarschmidt et al., 2018]). RGNs outperformed other methods at both short and long multi-domain proteins, suggesting their performance is not limited to one regime (e.g., short single-domain proteins), despite having no explicit knowledge of domain boundaries. While the margin between RGNs and the next best server is small for most CASPs, such small gaps are representative of the differences between the top five performers in Table 1. In general, small gains in accuracy at the top end are difficult, with only minimal gains obtained over a 10-year time span from CASP 6 to CASP 11 (Kryshtafovych et al., 2018). More substantial gains were seen in CASP 12 as a result of the use of co-evolutionary information (Moult et al., 2018), but RGNs match these advances without using co-evolutionary data and by operating in a fundamentally distinct and complementary way. The accuracy gap between RGNs and other servers is highest on CASP 11, which benefits from having the RGN hyperparameters fit on the ProteinNet11 validation set, suggesting similar gains may be obtained by optimizing RGN hyperparameters for each dataset (this would not correspond to overfitting as only the validation set is used to fit hyperparameters but would require substantially more compute resources for training). ProteinNet datasets of earlier CASPs are smaller, which may have also reduced accuracy. To assess the contribution of dataset size to model error, we used RGNs trained on earlier ProteinNet datasets to predict later CASP test sets (Table S1). As expected, accuracy drops as datasets shrink.

The dRMSD metric does not require structures to be pre-aligned and is consequently able to detect regions of high local concordance even when global concordance is poor. dRMSD assesses predictions at all length scales; however, it penalizes large global deviations in proportion to their distance, which can result in a very high error for far apart regions. To obtain a complementary assessment of model accuracy, we also tested
[Figure 3, panels A–C (plot data omitted): scatter plots of best CASP server dRMSD (Å) versus RGN dRMSD (Å), with points labeled by CASP target and colored by CASP 7–12, and the distribution of dRMSD (Å) across CASP servers.]
(D) Distribution of RGN dRMSDs (ends of boxes correspond to upper and lower quartiles, whiskers to highest and lowest values, wide white line to median, and short white line to mean) on ProteinNet validation sets grouped by maximum % sequence identity to training set over all CASPs.
(E) Traces of backbone atoms of well (left), fairly (middle), and poorly (right) predicted RGN structures are shown (bottom) along with their experimental counterparts (top). CASP identifier is displayed above each structure and dRMSD below. A color spectrum spans each protein chain to aid visualization.
See also Figure S1.
second-level CATH categories (three right panels in Figure 5C). We again see contiguous regions for each category, albeit with greater overlap, likely owing to the continuous nature of protein structure space and reduction of RGN space to just two dimensions. These visualizations suggest RGNs are learning a useful representation of protein sequence space that may yield insights into the nature of protein structure space.

RGNs Are 6–7 Orders of Magnitude Faster Than Existing Methods
Existing structure prediction pipelines are multi-staged (Figure 1), first detecting domains that can be separately modeled and running multiple algorithms to estimate secondary structure propensities, solvent accessibility, and disordered regions. Co-evolutionary methods use multiple sequence alignments to predict contact maps, and template-based methods search the PDB for templates. Their predictions are converted into geometric constraints to guide a conformation sampling process, where fragments are swapped in and out of putative structures to minimize an expertly derived energy model. Because of this complexity, prediction times range from hours to days and require codebases as large as several million lines of code (Leaver-Fay et al., 2011).

In contrast, a trained RGN model is a single mathematical function that is evaluated once per prediction. Computation of this function implicitly carries out domain splitting, property finding, energy minimization, and conformational sampling simultaneously. We found that 512 concurrent RGN-based predictions, with sequence length 700, can be made in 5.4 s on a single GPU, i.e., 10 ms per structure. Table 2 compares training and prediction speeds of RGNs to established methods that rely heavily on simulation with limited learning (first row), and deep learning plus co-evolution-based contact prediction methods that rely on learning (second row), combined with CONFOLD (Adhikari et al., 2015) to convert predicted contact maps into tertiary structures. While training RGNs can take weeks to months, once trained, they make predictions 6–7 orders of magnitude faster than existing pipelines. This speed enables new types of applications, such as the integration of structure prediction within docking and virtual screening in which ligand-aware RGNs could output distinct protein conformations in response to distinct ligand poses.

DISCUSSION

A key limitation of explicit sequence-to-structure maps, including molecular dynamics and fragment assembly, is a reliance on fixed energy models that do not learn from data; a second limitation is the exclusive use of single-scale atomic or residue-level representations. In contrast, modern co-evolution methods leverage learning and multi-scale representations to substantially improve performance (Liu et al., 2018b; Wang et al., 2017). RGNs go one step further by building a fully differentiable map extending from sequence to structure with all of the steps in existing prediction pipelines implicitly encoded and learnable from data. Through their recurrent architecture, RGNs can capture sequence-structure motifs and multiple scales from residues to domains (Alva et al., 2015; Ponting and Russell, 2002). When tracking structure prediction during RGN training (Video S1), RGNs appear to first learn global aspects of protein folds and then refine their predictions to generate a more accurate local structure.

RGNs are multi-representational, operating on three distinct parameterizations of protein structure. The first is torsional, capturing local relationships between atoms with bond lengths and angles held fixed and torsional angles as the immediate outputs of computational units. This virtually guarantees that predictions are structurally correct at a local level. The second is Cartesian, built by geometric units and capturing the global coordination of multiple atoms in 3D space, the catalytic triad of an enzyme's active site for example, even if the residues are distant along the protein chain. Future augmentations—e.g., 3D convolutional networks that operate directly on the Cartesian representation—may further improve the detection and quality of long-range interactions. The third parameterization, built in the dRMSD stage, is the matrix of inter-atomic distances and is simultaneously local and global. It is useful for optimizing RGN parameters de novo, as we have used it, but can also be used to incorporate prior knowledge expressible in terms of atomic distances; such knowledge includes physical features (e.g., electrostatics) and statistical data on interactions (e.g., evolutionary couplings).

One limitation of current RGNs is their reliance on PSSMs, which we have found to be helpful to achieving high-accuracy predictions. PSSMs are much weaker than multiple sequence alignments, as they are based on single residue mutation frequencies and ignore how each residue mutates in response to all other residues. Co-evolutionary couplings require pairwise frequencies, resulting in quadratically rather than linearly scaling statistical cost. Nonetheless, removing PSSMs and relying
range of scientific and technical fields available to protein folding and biophysics. We predict that hybrid systems using deep learning and co-evolution as priors and physics-based approaches for refinement will soon solve the long-standing problem of accurate and efficient structure prediction. It is also possible that the use of neural-network-probing techniques (Alain and Bengio, 2016; Koh and Liang, 2017; Nguyen et al., 2016; Shrikumar et al., 2017; Simonyan et al., 2013) with RGNs will provide new insight into the physical chemistry of folding and the types of intermediate structures that proteins use to sample conformational space.

REFERENCES

Adhikari, B., Bhattacharya, D., Cao, R., and Cheng, J. (2015). CONFOLD: residue-residue contact-guided ab initio protein folding. Proteins 83, 1436–1449.

Alain, G., and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv, arXiv:1610.01644v4 https://arxiv.org/abs/1610.01644.

AlQuraishi, M. (2019a). Parallelized natural extension reference frame: parallelized conversion from internal to Cartesian coordinates. J. Comput. Chem. 40, 885–892.

AlQuraishi, M. (2019b). ProteinNet: a standardized data set for machine learning of protein structure. arXiv, arXiv:1902.00249v1 https://arxiv.org/abs/1902.00249.

Alva, V., Söding, J., and Lupas, A.N. (2015). A vocabulary of ancient peptides at the origin of folded proteins. Elife 4, e09410.

Dill, K.A., and MacCallum, J.L. (2012). The protein-folding problem, 50 years on. Science 338, 1042–1046.

Gajda, M.J., Pawlowski, M., and Bujnicki, J.M. (2011a). Protein structure prediction: from recognition of matches with known structures to recombination of fragments. In Multiscale Approaches to Protein Modeling, A. Kolinski, ed. (Springer), pp. 231–254.

Gajda, M.J., Pawlowski, M., and Bujnicki, J.M. (2011b). Multiscale Approaches to Protein Modeling (Springer).

Gao, Y., Wang, S., Deng, M., and Xu, J. (2017). Real-value and confidence prediction of protein backbone dihedral angles through a hybrid method of clustering and deep learning. arXiv, arXiv:1712.07244v1 https://arxiv.org/abs/1712.07244v1.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (The MIT Press).

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.

Hopf, T.A., Schärfe, C.P.I., Rodrigues, J.P.G.L.M., Green, A.G., Kohlbacher, O., Sander, C., Bonvin, A.M.J.J., and Marks, D.S. (2014). Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3.

Hopf, T.A., Ingraham, J.B., Poelwijk, F.J., Schärfe, C.P.I., Springer, M., Sander, C., and Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nat. Biotech. 35, 128–135.

Juan, D., de Pazos, F., and Valencia, A. (2013). Emerging methods in protein co-evolution. Nat. Rev. Genet. 14, 249–261.

Koh, P.W., and Liang, P. (2017). Understanding black-box predictions via influence functions. arXiv, arXiv:1703.04730v2 https://arxiv.org/abs/1703.04730.

Kryshtafovych, A., Monastyrskyy, B., and Fidelis, K. (2016). CASP11 statistics and the prediction center evaluation system. Proteins 84, 15–19.

Kryshtafovych, A., Monastyrskyy, B., Fidelis, K., Moult, J., Schwede, T., and Tramontano, A. (2018). Evaluation of the template-based modeling in CASP12. Proteins 86, 321–334.

Leaver-Fay, A., Tyka, M., Lewis, S.M., Lange, O.F., Thompson, J., Jacak, R., Kaufman, K., Renfrew, P.D., Smith, C.A., Sheffler, W., et al. (2011). ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Meth. Enzymol. 487, 545–574.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444.

Li, H., Hou, J., Adhikari, B., Lyu, Q., and Cheng, J. (2017). Deep learning methods for protein torsion angle prediction. BMC Bioinformatics 18, 417.

Liu, T., Ish-Shalom, S., Torng, W., Lafita, A., Bock, C., Mort, M., Cooper, D.N., Bliven, S., Capitani, G., Mooney, S.D., et al. (2018a). Biological and functional relevance of CASP predictions. Proteins 86, 374–386.

Liu, Y., Palmedo, P., Ye, Q., Berger, B., and Peng, J. (2018b). Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst. 6, 65–74.

Ovchinnikov, S., Kim, D.E., Wang, R.Y.-R., Liu, Y., DiMaio, F., and Baker, D. (2016). Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins 84, 67–75.

Ovchinnikov, S., Park, H., Varghese, N., Huang, P.S., Pavlopoulos, G.A., Kim, D.E., Kamisetty, H., Kyrpides, N.C., and Baker, D. (2017). Protein structure determination using metagenome sequence data. Science 355, 294–298.

Parsons, J., Holmes, J.B., Rojas, J.M., Tsai, J., and Strauss, C.E.M. (2005). Practical conversion from torsion space to Cartesian space for in silico protein synthesis. J. Comput. Chem. 26, 1063–1068.

Perez, A., Morrone, J.A., Brini, E., MacCallum, J.L., and Dill, K.A. (2016). Blind protein structure prediction using accelerated free-energy simulations. Sci. Adv. 2, e1601274.

Ponting, C.P., and Russell, R.R. (2002). The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31, 45–71.

Ramachandran, G.N., Ramakrishnan, C., and Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99.

Schaarschmidt, J., Monastyrskyy, B., Kryshtafovych, A., and Bonvin, A.M.J.J. (2018). Assessment of contact predictions in CASP12: co-evolution and deep learning coming of age. Proteins 86, 51–66.

Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning important features through propagating activation differences. In ICML'17 Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3145–3153.

Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv, arXiv:1312.6034v2 https://arxiv.org/abs/1312.6034.

Wang, S., Sun, S., Li, Z., Zhang, R., and Xu, J. (2017). Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324.

Xu, D., and Zhang, Y. (2012). Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins 80, 1715–1735.

Xu, J., and Zhang, Y. (2010). How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26, 889–895.

Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang, Y. (2015). The I-TASSER suite: protein structure and function prediction. Nat. Methods 12, 7–8.

Zhang, Y., and Skolnick, J. (2004). Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710.

Zhang, C., Mortuza, S.M., He, B., Wang, Y., and Zhang, Y. (2018). Template-based and free modeling of I-TASSER and Quark pipelines using predicted contact maps in CASP12. Proteins 86, 136–151.

Zhao, F., Peng, J., and Xu, J. (2010). Fragment-free approach to protein folding using conditional neural fields. Bioinformatics 26, i310–i317.

Zhou, Y., Duan, Y., Yang, Y., Faraggi, E., and Lei, H. (2011). Trends in template/fragment-free protein structure prediction. Theor. Chem. Acc. 128, 3–16.
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Mohammed AlQuraishi ([email protected]).
METHOD DETAILS
Model
We featurize a protein of length $L$ as a sequence of vectors $(x_1, \ldots, x_L)$ where $x_t \in \mathbb{R}^d$ for all $t$. The dimensionality $d$ is 41, where 20 dimensions are used as a one-hot indicator of the amino acid residue at a given position, another 20 dimensions are used for the PSSM of that position, and 1 dimension is used to encode the information content of the position. The PSSM values are sigmoid transformed to lie between 0 and 1. The sequence of input vectors is fed to an LSTM (Hochreiter and Schmidhuber, 1997), whose basic formulation is described by the following set of equations:
$$i_t = \sigma\!\left(W_i [x_t; h_{t-1}] + b_i\right)$$

$$f_t = \sigma\!\left(W_f [x_t; h_{t-1}] + b_f\right)$$

$$o_t = \sigma\!\left(W_o [x_t; h_{t-1}] + b_o\right)$$

$$\tilde{c}_t = \tanh\!\left(W_c [x_t; h_{t-1}] + b_c\right)$$

$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}$$

$$h_t = o_t \odot \tanh(c_t)$$

$W_i$, $W_f$, $W_o$, $W_c$ are weight matrices, $b_i$, $b_f$, $b_o$, $b_c$ are bias vectors, $h_t$ and $c_t$ are the hidden and memory cell state for residue $t$, respectively, and $\odot$ is element-wise multiplication. We use two LSTMs, running independently in opposite directions (1 to $L$ and $L$ to 1), to output two hidden states $h_t^{(f)}$ and $h_t^{(b)}$ for each residue position $t$, corresponding to the forward and backward directions. Depending on the RGN architecture, these two hidden states are either the final output states or they are fed as inputs into one or more additional LSTM layers.
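To make the featurization and the recurrence concrete, the following is a minimal NumPy sketch of the 41-dimensional input vector and of a single LSTM step as written above. The helper names (featurize_residue, lstm_step), the dictionary-based parameter passing, and the choice to apply the sigmoid transform inside featurize_residue are illustrative assumptions, not the released TensorFlow implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def featurize_residue(aa_index, pssm_row, info_content, n_aa=20):
    """Build the 41-dimensional input vector for one residue:
    20-dim one-hot amino acid + 20-dim sigmoid-transformed PSSM + 1-dim
    information content (per the featurization described above)."""
    one_hot = np.zeros(n_aa)
    one_hot[aa_index] = 1.0
    return np.concatenate([one_hot, sigmoid(pssm_row), [info_content]])

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the equations above.
    W and b are dicts of weight matrices / bias vectors keyed by gate name."""
    z = np.concatenate([x_t, h_prev])        # [x_t; h_{t-1}]
    i = sigmoid(W["i"] @ z + b["i"])         # input gate
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate
    o = sigmoid(W["o"] @ z + b["o"])         # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = i * c_tilde + f * c_prev           # element-wise products
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

In the RGN, two such recurrences run over the sequence in opposite directions and their hidden states are concatenated at each residue position.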
The outputs from the last LSTM layer form a sequence of concatenated hidden state vectors $\left(\left[h_1^{(f)}; h_1^{(b)}\right], \ldots, \left[h_L^{(f)}; h_L^{(b)}\right]\right)$. Each concatenated vector is then fed into an angularization layer described by the following set of equations:

$$p_t = \mathrm{softmax}\!\left(W_\varphi \left[h_t^{(f)}; h_t^{(b)}\right] + b_\varphi\right)$$

$$\varphi_t = \arg\!\left(p_t \exp(i\Phi)\right)$$

$W_\varphi$ is a weight matrix, $b_\varphi$ is a bias vector, $\Phi$ is a learned alphabet matrix, and $\arg$ is the complex-valued argument function. Exponentiation of the complex-valued matrix $i\Phi$ is performed element-wise. The $\Phi$ matrix defines an alphabet of size $m$ whose letters correspond to triplets of torsional angles defined over the 3-torus. The angularization layer interprets the LSTM hidden state outputs as weights over the alphabet, using them to compute a weighted average of the letters of the alphabet (independently for each torsional angle) to generate the final set of torsional angles $\varphi_t \in S^1 \times S^1 \times S^1$ for residue $t$ (we are overloading the standard notation for protein backbone torsional angles, with $\varphi_t$ corresponding to the $(\psi, \varphi, \omega)$ triplet). Note that $\varphi_t$ may be alternatively computed using the following equation, where the trigonometric operations are performed element-wise:

$$\varphi_t = \mathrm{atan2}\!\left(p_t \sin(\Phi),\ p_t \cos(\Phi)\right)$$
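A minimal NumPy sketch of the angularization layer, using the atan2 form above, is given below. The argument shapes and variable names (W_phi, b_phi, Phi of shape (m, 3)) mirror the symbols in the equations and are assumptions of this sketch.

```python
import numpy as np

def angularize(h_concat, W_phi, b_phi, Phi):
    """Map a concatenated hidden state to a torsional-angle triplet by taking
    a softmax-weighted circular average over a learned alphabet Phi of shape
    (m, 3), following the angularization equations above."""
    logits = W_phi @ h_concat + b_phi        # shape (m,)
    p = np.exp(logits - logits.max())
    p = p / p.sum()                          # softmax weights over the alphabet
    # Circular (angle-aware) weighted average, computed independently
    # for each of the three torsional angles.
    sin_avg = p @ np.sin(Phi)                # shape (3,)
    cos_avg = p @ np.cos(Phi)
    return np.arctan2(sin_avg, cos_avg)      # angles in (-pi, pi]
```

The circular average avoids the wrap-around problem of naively averaging angles near the ±π boundary.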
The geometric units then translate these torsional angles into Cartesian coordinates, extending the backbone by one atom at a time according to the following set of equations:

$$m_k = c_{k-1} - c_{k-2}$$

$$n_k = m_{k-1} \times \hat{m}_k$$

$$M_k = \left[\hat{m}_k,\ \hat{n}_k \times \hat{m}_k,\ \hat{n}_k\right]$$

$$c_k = M_k \tilde{c}_k + c_{k-1}$$

where $r_k$ is the length of the bond connecting atoms $k-1$ and $k$, $\theta_k$ is the bond angle formed by atoms $k-2$, $k-1$, and $k$, $\varphi_{\lfloor k/3 \rfloor, k \bmod 3}$ is the predicted torsional angle formed by atoms $k-2$ and $k-1$, $\tilde{c}_k$ is the position of the new atom expressed in the local reference frame defined by $r_k$, $\theta_k$, and the predicted torsional angle, $c_k$ is the position of the newly predicted atom $k$, $\hat{m}$ is the unit-normalized version of $m$, and $\times$ is the cross product. Note that $k$ indexes atoms 1 through $3L$, since there are three backbone atoms per residue. For each residue $t$ we compute $c_{3t-2}$, $c_{3t-1}$, and $c_{3t}$ using the three predicted torsional angles of residue $t$, i.e., $\varphi_{t,j}$ for $j = \{0,1,2\}$. The bond lengths and angles are fixed, with three bond lengths $(r_0, r_1, r_2)$ corresponding to N–Cα, Cα–C′, and C′–N, and three bond angles $(\theta_0, \theta_1, \theta_2)$ corresponding to N–Cα–C′, Cα–C′–N, and C′–N–Cα. As there are only three unique values, we have $r_k = r_{k \bmod 3}$ and $\theta_k = \theta_{k \bmod 3}$. In practice we employ a modified version of the above equations which enables much higher computational efficiency (AlQuraishi, 2019a).
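The sketch below restates the extension step in NumPy, assuming the standard natural extension reference frame convention for the local-frame vector $\tilde{c}_k$; the sign and ordering conventions of that vector are assumptions of this sketch and may differ from the published implementation.

```python
import numpy as np

def extend_backbone(c_prev3, r_k, theta_k, phi_k):
    """Place the next backbone atom from the three preceding atoms using the
    reference-frame construction above (m_k, n_k, M_k). The local-frame vector
    c_tilde follows the common natural extension reference frame convention;
    the exact sign conventions of the released code may differ."""
    c_km3, c_km2, c_km1 = c_prev3              # atoms k-3, k-2, k-1
    m_prev = c_km2 - c_km3                     # m_{k-1}
    m_k = c_km1 - c_km2                        # m_k = c_{k-1} - c_{k-2}
    m_hat = m_k / np.linalg.norm(m_k)
    n_k = np.cross(m_prev, m_hat)              # n_k = m_{k-1} x m_hat_k
    n_hat = n_k / np.linalg.norm(n_k)
    M_k = np.stack([m_hat, np.cross(n_hat, m_hat), n_hat], axis=1)
    # New atom expressed in the local frame, from bond length r_k, bond angle
    # theta_k, and torsional angle phi_k (convention assumed by this sketch).
    c_tilde = r_k * np.array([-np.cos(theta_k),
                              np.cos(phi_k) * np.sin(theta_k),
                              np.sin(phi_k) * np.sin(theta_k)])
    return M_k @ c_tilde + c_km1               # c_k = M_k c_tilde + c_{k-1}
```

Iterating this function over k = 1, ..., 3L, with the fixed bond lengths and angles cycling every three atoms, traces out the full backbone.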
The resulting sequence $(c_1, \ldots, c_{3L})$ fully describes the protein backbone chain structure and is the model's final predicted output. For training purposes a loss is necessary to optimize model parameters. We use the dRMSD metric as it is differentiable and captures both local and global aspects of protein structure. It is defined by the following set of equations:
$$\tilde{d}_{j,k} = \left\| c_j - c_k \right\|_2$$

$$d_{j,k} = \tilde{d}_{j,k}^{(\mathrm{exp})} - \tilde{d}_{j,k}^{(\mathrm{pred})}$$

$$\mathrm{dRMSD} = \frac{\|D\|_2}{L(L-1)}$$

where $\{d_{j,k}\}$ are the elements of matrix $D$, and $\tilde{d}_{j,k}^{(\mathrm{exp})}$ and $\tilde{d}_{j,k}^{(\mathrm{pred})}$ are computed using the coordinates of the experimental and predicted structures, respectively. In effect, the dRMSD computes the $\ell_2$-norm of the distances over distances, by first computing the pairwise distances between all atoms in both the predicted and experimental structures individually, and then computing the distances between those distances. For most experimental structures, the coordinates of some atoms are missing. They are excluded from the dRMSD by not computing the differences between their distances and the predicted ones.
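A minimal NumPy sketch of this computation, including the exclusion of missing atoms via a boolean mask, is shown below. The mask argument and the normalization (a root mean square over included pairs, which may differ by a constant factor from the normalization printed above) are assumptions of this sketch.

```python
import numpy as np

def drmsd(coords_pred, coords_exp, mask=None):
    """Distance-based RMSD between two structures given as (N, 3) coordinate
    arrays. Pairs involving missing experimental atoms are excluded via the
    boolean `mask` (True = atom resolved), mirroring the treatment above."""
    def pairwise(c):
        diff = c[:, None, :] - c[None, :, :]
        return np.linalg.norm(diff, axis=-1)   # (N, N) distance matrix

    d_pred = pairwise(coords_pred)
    d_exp = pairwise(coords_exp)
    n = coords_pred.shape[0]
    if mask is None:
        mask = np.ones(n, dtype=bool)
    pair_mask = np.outer(mask, mask) & ~np.eye(n, dtype=bool)
    d = (d_exp - d_pred)[pair_mask]            # differences of distances
    return np.sqrt(np.mean(d ** 2))            # RMS over included pairs
```

Because only inter-atomic distances enter the calculation, no structural superposition is required, and the result is invariant to rotations, translations, and reflections, as noted in the Results.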
Hyperparameters
RGN hyperparameters were manually fit, through sequential exploration of hyperparameter space, using repeated evaluations on the
ProteinNet11 validation set and three evaluations on the ProteinNet11 test set. Once chosen, the same hyperparameters were used to
train RGNs on ProteinNet7-12 training sets. The validation sets were used to determine early stopping criteria, followed by single
evaluations on the ProteinNet7-12 test sets to generate the final reported numbers (excepting ProteinNet11).
The final model consisted of two bidirectional LSTM layers, each comprising 800 units per direction, and in which outputs from the two directions are first concatenated before being fed to the second layer. Input dropout set at 0.5 was used for both layers, and the alphabet size was set to 60 for the angularization layer. Inputs were duplicated and concatenated; this had a separate effect from decreasing dropout probability. LSTMs were randomly initialized with a uniform distribution with support [−0.01, 0.01], while the alphabet was similarly initialized with support [−π, π]. ADAM was used as the optimizer, with a learning rate of 0.001, β1 = 0.95 and β2 = 0.99, and a batch size of 32. Gradients were clipped using norm rescaling with a threshold of 5.0. The loss function used for optimization was length-normalized dRMSD (i.e. dRMSD divided by protein length), which is distinct from the standard dRMSD we use for reporting accuracies.
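For reference, the hyperparameters stated above can be gathered into a single summary; the dictionary below is an illustrative restatement, not the configuration format used by the released RGN code.

```python
import math

# Illustrative summary of the hyperparameters reported above (not the
# configuration format of the released RGN code).
rgn_hyperparameters = {
    "bidirectional_lstm_layers": 2,
    "units_per_direction": 800,
    "input_dropout": 0.5,
    "alphabet_size": 60,
    "lstm_init_support": (-0.01, 0.01),          # uniform initialization
    "alphabet_init_support": (-math.pi, math.pi),
    "optimizer": "ADAM",
    "learning_rate": 0.001,
    "beta1": 0.95,
    "beta2": 0.99,
    "batch_size": 32,
    "gradient_clip_norm": 5.0,
    "training_loss": "length-normalized dRMSD",
}
```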
Dataset
We use the ProteinNet dataset for all analyses (AlQuraishi, 2019b). ProteinNet recreates the conditions of past CASP assessments by
restricting the set of sequences (for building PSSMs) and structures used to those available prior to the start of each CASP assess-
ment. Each ProteinNet entry is comprised of two inputs, the raw protein sequence, represented by a one-hot vector, and the protein’s
PSSM and information content profiles, derived using 5 iterations of JackHMMer with an e-value threshold of $10^{-10}$. PSSM values are
normalized to lie between 0 and 1. The output for each ProteinNet entry is comprised of the Cartesian coordinates of the protein’s
backbone atoms, annotated by metadata denoting which atoms are missing from the experimental structure. These atoms are
excluded from the dRMSD loss calculation, which enables use of partially resolved experimental structures that would otherwise
be excluded from the dataset.
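The inputs and outputs of a ProteinNet entry described above can be sketched as a simple record; the field names and array shapes below are illustrative assumptions, not ProteinNet's on-disk format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ProteinNetEntry:
    """Schematic view of one ProteinNet entry as described above
    (field names and shapes are illustrative, not the ProteinNet format)."""
    sequence_one_hot: np.ndarray      # (L, 20) one-hot amino acid encoding
    pssm: np.ndarray                  # (L, 20) PSSM, normalized to [0, 1]
    information_content: np.ndarray   # (L,) per-position information content
    backbone_coords: np.ndarray       # (3L, 3) backbone atom coordinates
    atom_mask: np.ndarray             # (3L,) True where the atom was resolved
```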
For ProteinNet7-11, the publicly available CASP structures were used as test sets. For ProteinNet12, the publicly available CASP12
structures are incomplete, as some structures are still embargoed. We obtained a private set of structures from the CASP organizers
that includes all structures used in CASP12 (except two), and we used this set for model assessment. For training all RGN models, the
90% ‘‘thinning’’ version of ProteinNet was used.
TensorFlow (Abadi et al., 2016) code for training new RGN models, as well as pre-trained RGN models used in reporting results for
CASP 7-12, are available on GitHub at https://github.com/aqlaboratory/rgn.