
Article

End-to-End Differentiable Learning of Protein Structure

Mohammed AlQuraishi

Correspondence
[email protected]

In Brief
Prediction of protein structure from sequence is important for understanding protein function, but it remains very challenging, especially for proteins with few homologs. Existing prediction methods are human engineered, with many complex parts developed over decades. We introduce a new approach based entirely on machine learning that predicts protein structure from sequence using a single neural network. The model achieves state-of-the-art accuracy and does not require co-evolution information or structural homologs. It is also much faster, making predictions in milliseconds versus hours or days, which enables new applications in drug discovery and protein design.

Highlights
• Neural network predicts protein structure from sequence without using co-evolution
• Model replaces structure prediction pipelines with one mathematical function
• Achieves state-of-the-art performance on novel protein folds
• Learns a low-dimensional representation of protein sequence space

AlQuraishi, 2019, Cell Systems 8, 292–301
April 24, 2019 © 2019 Elsevier Inc.
https://doi.org/10.1016/j.cels.2019.03.006
Cell Systems

Article

End-to-End Differentiable Learning of Protein Structure

Mohammed AlQuraishi1,2,3,*
1Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
2Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
3Lead Contact
*Correspondence: [email protected]
https://doi.org/10.1016/j.cels.2019.03.006

SUMMARY

Predicting protein structure from sequence is a central challenge of biochemistry. Co-evolution methods show promise, but an explicit sequence-to-structure map remains elusive. Advances in deep learning that replace complex, human-designed pipelines with differentiable models optimized end to end suggest the potential benefits of similarly reformulating structure prediction. Here, we introduce an end-to-end differentiable model for protein structure learning. The model couples local and global protein structure via geometric units that optimize global geometry without violating local covalent chemistry. We test our model using two challenging tasks: predicting novel folds without co-evolutionary data and predicting known folds without structural templates. In the first task, the model achieves state-of-the-art accuracy, and in the second, it comes within 1–2 Å; competing methods using co-evolution and experimental templates have been refined over many years, and it is likely that the differentiable approach has substantial room for further improvement, with applications ranging from drug discovery to protein design.

INTRODUCTION

Proteins are linear polymers that fold into very specific and ordered three-dimensional (3D) conformations based on their amino acid sequence (Branden and Tooze, 1999; Dill, 1990). Understanding how this occurs is a foundational problem in biochemistry. Computational approaches to protein folding not only seek to make structure determination faster and less costly; they aim to understand the folding process itself. Existing computational methods fall into two broad categories (Gajda et al., 2011a, 2011b). The first category builds explicit sequence-to-structure maps using computational procedures to transform raw amino acid sequences into 3D structures. This includes physics-based molecular dynamics simulations (Marx and Hutter, 2012), which are restricted by computational cost to small proteins, and fragment assembly methods (Gajda et al., 2011a), which find energy-minimizing conformations by sampling statistically derived protein fragments. Fragment assembly usually achieves high accuracy only when homologous protein structures are used as templates. Such template-based methods use one or more experimental structures—found through homology searches—as the basis for making predictions.

The second category of methods eschews explicit sequence-to-structure maps and instead identifies co-evolving residues within protein families to derive residue-residue contact maps, using co-evolution as an indicator of contact in physical space (Hopf et al., 2014; Marks et al., 2011). With a large and diverse set of homologous sequences—typically tens to hundreds of thousands—co-evolution methods can accurately predict contact maps (Juan et al., 2013). A correct contact map can guide fragment assembly methods to an accurate 3D structure 25%–50% of the time (Ovchinnikov et al., 2017). However, because co-evolutionary methods do not construct a model of the relationship between individual sequences and structures, they are unable to predict structures for which no sequence homologs exist, as in new bacterial taxa or de novo protein design. Moreover, even for well-characterized proteins, such methods are generally unable to predict the structural consequences of minor sequence changes such as mutations or indels because they operate on protein families rather than individual sequences (they do, however, show promise in predicting the functional consequences of mutations [Hopf et al., 2017]). Thus, there remains a substantial need for new and potentially better approaches.

End-to-end differentiable deep learning has revolutionized computer vision and speech recognition (LeCun et al., 2015), but protein structure pipelines continue to resemble the ways in which computers tackled vision and speech prior to deep learning, by having many human-engineered stages, each independently optimized (Xu and Zhang, 2012; Yang et al., 2015) (Figure 1). End-to-end differentiable models replace all components of such pipelines with differentiable primitives to enable joint optimization from input to output. In contrast, use of deep learning for structure prediction has so far been restricted to individual components within a larger pipeline (Aydin et al., 2012; Gao et al., 2017; Li et al., 2017; Lyons et al., 2014; Zhao et al., 2010), for example, prediction of contact maps (Liu et al., 2018b; Wang et al., 2017). This stems from the technical challenge of developing an end-to-end differentiable model that rebuilds the entire structure prediction pipeline using differentiable primitives. We have developed such a model by combining four ideas: (1) encoding protein sequence using a recurrent neural network, (2) parameterizing (local) protein structure by torsional angles to enable a model to reason over diverse conformations without violating their covalent chemistry, (3) coupling local protein structure to its global representation via recurrent geometric units, and (4) using a differentiable loss function to capture deviations between predicted and experimental structures. We find that the new approach outperforms other methods, including co-evolution ones, when predicting novel folds even though it uses only primary sequences and position-specific scoring matrices (PSSMs) that summarize individual residue propensities for mutation. We also find that when predicting known folds, the new approach is on average within 1–2 Å of other approaches, including template-based ones, despite being template-free.
Figure 1. Conventional Pipelines for Protein Structure Prediction
[Figure 1 graphics omitted: flowcharts of the free modeling pipeline (fragment search, domain splitting, sampling and simulation guided by statistical and physical potentials, clustering, structure reassembly, atomic refinement) and the template-based modeling pipeline (threading, template sorting, structure assembly from template fragments, clustering, atomic refinement, quality estimation), both ending in a final prediction.]
Prediction process begins with query sequence (top, green box) whose constituent domains and co-evolutionary relationships are identified through multiple sequence alignments. In free modeling (left), fragment libraries are searched to derive distance restraints, which, along with restraints derived from co-evolutionary data, guide simulations that iteratively minimize energy through sampling. Coarse conformations are then refined to yield the final structure. In template-based modeling (right pipeline), the PDB is searched for templates. If found, fragments from one or more templates are combined to assemble a structure, which is then optimized and refined to yield the final structure. Orange boxes indicate sources of input information beyond query sequence, including prior physical knowledge. Diagram is modeled on the I-Tasser and Quark pipelines (Zhang et al., 2018).


Figure 2. Recurrent Geometric Networks
[Figure 2 graphics omitted: schematic of an RGN, showing protein sequence input, computational units (LSTM gating and angularization), geometric units that extend the nascent structure by bond rotation, bond extension, and torsional rotation using the (φ, ψ, ω) angles, and the dRMSD loss between predicted and experimental structures.]
Protein sequences are fed one residue at a time to the computational units of an RGN (bottom-left), which compute an internal state that is integrated with the states of adjacent units. Based on these computations, torsional angles are predicted and fed to geometric units, which sequentially translate them into Cartesian coordinates to generate the predicted structure. dRMSD is used to measure deviation from experimental structures, serving as the signal for optimizing RGN parameters. Top-left inset: geometric units take new torsional angles and a partial backbone chain and extend it by one residue. Bottom-right inset: computational units, based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), use gating units (blue) to control information flow in and out of the internal state (gray) and angularization units (purple) to convert raw outputs into angles. Rightmost inset: angularization units select from a learned set of torsion angles ("alphabet") a mixture of torsions, which are then averaged in a weighted manner to generate the final set of torsions. Mixing weights are determined by computational units.

RESULTS

Recurrent Geometric Networks
Our model takes a sequence of amino acids and PSSMs as input and outputs a 3D structure. It comprises three stages—computation, geometry, and assessment—which we term a recurrent geometric network (RGN). The first stage is made of computational units that, for each residue position, integrate information about its amino acid and PSSM with information coming from adjacent units. By laying these units in a recurrent bidirectional topology (Figure 2), the computations for each residue integrate information from residues upstream and downstream all the way to the N and C terminus, covering the entire protein. By further stacking units in multiple layers (data not shown), the model implicitly encodes a multi-scale representation of proteins. Each unit outputs three numbers, corresponding to the torsional angles of the residue. We do not specify a priori how angles are computed. Instead, each unit's computation is described by an equation whose parameters are optimized so that RGNs accurately predict structures.

The second stage is made of geometric units that take as input the torsional angles for a given residue and the partially completed backbone resulting from the geometric unit upstream of it, and output a new backbone extended by one residue, which is fed into the adjacent downstream unit (AlQuraishi, 2019a; Parsons et al., 2005). The last unit outputs the completed 3D structure of the protein. During model training, a third stage computes deviations between predicted and experimental structures using the distance-based root mean square deviation (dRMSD) metric. The dRMSD first computes pairwise distances between all atoms in the predicted structure and all atoms in the experimental one (separately) and then computes the root mean square of the distance between these sets of distances. Because dRMSD is distance-based, it is invariant to reflections, which can lead RGNs to predict reflected structures (effectively wrong chirality) that must be corrected by a counter-reflection. RGN parameters are optimized to minimize the dRMSD between predicted and experimental structures using backpropagation (Goodfellow et al., 2016). Hyperparameters, which describe higher-level aspects of the model such as the number of computational units, were determined through manual exploration of hyperparameter space. See Supplemental Information for a complete mathematical treatment.
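To make the three stages concrete, the following schematic sketch shows how they compose into a single differentiable function. Every name used here (bidirectional_lstm, angularize, extend_backbone, drmsd) is a hypothetical placeholder for a component described above, not the released implementation:

```python
# Schematic sketch of an RGN forward pass; all functions named here are
# hypothetical placeholders for the components described in the text.
def rgn_forward(features, target_coords=None):
    states = bidirectional_lstm(features)        # stage 1: computation
    torsions = [angularize(h) for h in states]   # one (phi, psi, omega) per residue
    coords = []                                  # stage 2: geometry
    for phi in torsions:
        coords = extend_backbone(coords, phi)    # adds three backbone atoms
    if target_coords is None:
        return coords
    return coords, drmsd(coords, target_coords)  # stage 3: assessment (training)
```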


Table 1. Comparative Accuracy of RGNs Using dRMSD (Å)

             FM (Novel Folds)                           TBM (Known Folds)
             CASP7 CASP8 CASP9 CASP10 CASP11 CASP12    CASP7 CASP8 CASP9 CASP10 CASP11 CASP12
RGN           9.3*  7.3*  8.7* 10.0*   8.5*  10.7*      5.6   5.9   6.5   6.9    7.4    6.9
1st server    9.3   8.3   9.0  10.3    9.3   11.0       4.0*  4.3*  5.2*  5.3*   5.8*   4.7*
2nd server    9.9   8.6   9.1  10.6    9.6   11.2       4.0   4.6   5.2   5.4    6.0    4.8
3rd server   10.0   9.2   9.7  10.9   11.2   11.3       4.1   4.8   5.4   5.7    6.5    5.6
4th server   10.1   9.9  10.1  11.7   11.7   11.4       4.2   5.0   5.4   5.9    6.8    5.8
5th server   10.4  10.4  13.5  12.0   12.9   13.0       4.8   5.0   5.5   7.2    6.9    5.9

The average dRMSD (lower is better; asterisk indicates best performing method) achieved by RGNs and the top five servers at each CASP is shown for the novel folds (left) and known folds (right) categories. Numbers are based on the common set of structures predicted by the top 5 servers during each CASP. A different RGN was trained for each CASP, using the corresponding ProteinNet training set containing all sequences and structures available prior to the start of that CASP. See also Tables S1–S3.

Assessment of Model Error
Machine learning models must be trained against as large a proportion of available data as possible to fit model parameters and then evaluated against a distinct test set to assess accuracy. Reliable evaluation is frequently complicated by unanticipated information leakage from the training set into the test set, especially for protein sequences that share an underlying evolutionary relationship. Partly to address this problem, the critical assessment of protein structure prediction (CASP) (Moult et al., 1995) was organized to assess methods in a blinded fashion, by testing predictors using sequences of solved structures that have not been publicly released. To assess RGNs, we therefore sought to recreate the conditions of past CASPs by assembling the ProteinNet datasets (AlQuraishi, 2019b). For every CASP from 7 through 12, we created a corresponding ProteinNet test set comprising CASP structures, and a ProteinNet training set comprising all sequences and structures publicly available prior to the start of that CASP. Using multiple CASP datasets enables a deeper and more thorough assessment that spans a broader range of dataset sizes than relying on the most recent CASP alone. We also adopted the CASP division of test structures into free modeling (FM) targets that assess prediction of novel folds and template-based modeling (TBM and TBM-hard) targets that assess prediction of folds with known homologs in the Protein Data Bank (PDB) (Bernstein et al., 1977). We set aside a subset of the training data as a validation set to determine when to stop model training and to further insulate training and test data.

ProteinNet datasets were used for all analyses described here. RGN hyperparameters were fit by repeated evaluations on the ProteinNet 11 validation set, followed by three evaluations on the ProteinNet 11 test set. Once chosen, the same hyperparameters were used to train models on ProteinNet 7–12 training sets, with a single evaluation made at the end on each test set (excepting ProteinNet 11) to generate Table 1. Subsequently, additional test set evaluations were made to generate Table S1, with one evaluation per number reported. No further test set evaluations were made. Overall, this represents a rigorous approach to evaluation with the lowest possible risk of information leakage.

Predicting New Folds without Co-evolution
We first assessed RGNs on a difficult task that has not consistently been achieved by any existing method: predicting novel protein folds without co-evolutionary data. FM structures served as targets for this exercise. Table 1 compares the average dRMSD of RGN predictions on FM structures to the top five automated predictors in CASP 7–12, known as "servers" in CASP parlance ("humans" are combined server and human-expert pipelines; we do not compare against this group as our processing is automated). In Figure 3A, we break down the predictions by target against the top performing server and in Figure 3C against the dRMSD distribution of all CASP servers.

On all CASPs, RGNs had the best performance, even compared to servers that use co-evolution data (in CASP 11 [Kryshtafovych et al., 2016; Ovchinnikov et al., 2016] and CASP 12 [Schaarschmidt et al., 2018]). RGNs outperformed other methods at both short and long multi-domain proteins, suggesting their performance is not limited to one regime (e.g., short single-domain proteins), despite having no explicit knowledge of domain boundaries. While the margin between RGNs and the next best server is small for most CASPs, such small gaps are representative of the differences between the top five performers in Table 1. In general, small gains in accuracy at the top end are difficult, with only minimal gains obtained over a 10-year time span from CASP 6 to CASP 11 (Kryshtafovych et al., 2018). More substantial gains were seen in CASP 12 as a result of the use of co-evolutionary information (Moult et al., 2018), but RGNs match these advances without using co-evolutionary data and by operating in a fundamentally distinct and complementary way. The accuracy gap between RGNs and other servers is highest on CASP 11, which benefits from having the RGN hyperparameters fit on the ProteinNet11 validation set, suggesting similar gains may be obtained by optimizing RGN hyperparameters for each dataset (this would not correspond to overfitting, as only the validation set is used to fit hyperparameters, but would require substantially more compute resources for training). ProteinNet datasets of earlier CASPs are smaller, which may have also reduced accuracy. To assess the contribution of dataset size to model error, we used RGNs trained on earlier ProteinNet datasets to predict later CASP test sets (Table S1). As expected, accuracy drops as datasets shrink.

The dRMSD metric does not require structures to be pre-aligned and is consequently able to detect regions of high local concordance even when global concordance is poor. dRMSD assesses predictions at all length scales; however, it penalizes large global deviations in proportion to their distance, which can result in a very high error for far apart regions.


Figure 3. Results Overview
[Figure 3 graphics omitted: (A and B) scatterplots of RGN dRMSD versus best CASP server dRMSD for individual FM and TBM targets across CASP 7–12; (C) box plots of mean server dRMSD distributions for the FM and TBM categories; (D) box plots of RGN dRMSD by maximum % sequence identity to the training set; (E) backbone traces of predicted versus experimental structures for T0811 (2.5 Å), T0856 (3.2 Å), T0785 (7.2 Å), T0827 (7.6 Å), T0816 (10.1 Å), and T0806 (10.3 Å).]
(A and B) Scatterplots of individual FM (A) and TBM (B) predictions made by RGN and top CASP server. Two TBM outliers (T0629 and T0719) were dropped for visualization purposes.
(C) Distributions of mean dRMSD (lower is better; ends of boxes correspond to upper and lower quartiles, whiskers to highest and lowest values, and white line to median) achieved by servers predicting all structures with >95% coverage at CASP 8–12 are shown for FM (novel folds) and TBM (known folds) categories. Thick black (white on dark background) bars mark RGN dRMSD. RGN percentile rankings are shown for the TBM category (below whiskers). CASP 7 is omitted because of lack of server metadata.
(D) Distribution of RGN dRMSDs (ends of boxes correspond to upper and lower quartiles, whiskers to highest and lowest values, wide white line to median, and short white line to mean) on ProteinNet validation sets grouped by maximum % sequence identity to training set over all CASPs.
(E) Traces of backbone atoms of well (left), fairly (middle), and poorly (right) predicted RGN structures are shown (bottom) along with their experimental counterparts (top). CASP identifier is displayed above each structure and dRMSD below. A color spectrum spans each protein chain to aid visualization.
See also Figure S1.


To obtain a complementary assessment of model accuracy, we also tested RGNs using TM scores (Zhang and Skolnick, 2004), which are defined by the following equation:

$$\text{TM score} = \max\left[\frac{1}{L_{\text{target}}}\sum_{i=1}^{L_{\text{aligned}}}\frac{1}{1 + \left(d_i / d_0(L_{\text{target}})\right)^2}\right],$$

where $L_{\text{target}}$ and $L_{\text{aligned}}$ are the lengths of the full protein and the aligned region, respectively, $d_i$ is the distance between the $i$th residues in the experimental and predicted structures, and $d_0(L_{\text{target}}) = 1.24\sqrt[3]{L_{\text{target}} - 15} - 1.8$ is used to normalize scores. TM scores do require structures to be pre-aligned and thus can penalize predictions with high local concordance if a global alignment cannot be found, but they are less sensitive to large deviations because they only compute error over the aligned regions. TM scores range from 0 to 1, with a score of <0.17 corresponding to a random unrelated protein, and >0.5 generally corresponding to the same protein fold (Xu and Zhang, 2010). Since TM scores are not invariant to reflections, we compute them for both the original and reflected RGN structures and use the higher of the two. Table S2 compares TM scores of RGN predictions to CASP servers. In general, RGNs rank among the top five servers but do not consistently outperform all other methods as they do on dRMSD, possibly reflecting the lack of partial credit assignment by TM scores.
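For reference, the per-alignment score inside the max can be computed directly from the aligned inter-residue distances; a minimal NumPy sketch, assuming the alignment itself (and hence the max over alignments) is produced by an external structural alignment tool:

```python
import numpy as np

def tm_score_term(d_aligned, L_target):
    """TM score for one alignment, from per-residue distances d_i (in Å).

    d_aligned: (L_aligned,) distances between matched residues of the
    experimental and predicted structures. A sketch of the equation above;
    the max over candidate alignments is assumed to be handled elsewhere.
    """
    d0 = 1.24 * np.cbrt(L_target - 15) - 1.8     # length-dependent normalizer
    return np.sum(1.0 / (1.0 + (d_aligned / d0) ** 2)) / L_target
```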
Predicting Known Folds without Templates
We next assess RGNs on predicting known protein folds without experimental templates, a challenging task that provides an advantage to template-based methods (Zhou et al., 2011). TBM structures served as targets for this purpose. Tables 1 and S2 compare RGN predictions to top CASP servers using dRMSD and TM score, respectively, while Figure 3B breaks down predictions by target, and Figure 3C shows the distribution over all CASP servers. A representative sampling of the full quality spectrum of FM and TBM predictions is shown in Figure 3E. In general, RGNs underperform the very top CASP servers, all of which use templates, although 60% of predictions are within 1.5 Å of the best-performing server.

Since RGNs do not use templates, this suggests that they learn generalizable aspects of protein structure, and their improved accuracy on TBM targets relative to FM reflects denser sampling in TBM regions of protein space. To investigate this possibility, we partitioned ProteinNet validation sets into groups based on maximum sequence identity to the training set and computed dRMSDs within each group across CASPs 7–12 (Figure 3D) and by individual CASP (Figure S1). RGN performance robustly transfers to sequences with >40% sequence identity, predicting structures with a median dRMSD of 5 Å, and then begins to deteriorate. There was little difference in dRMSD between 50% and 90% sequence identity, with substantial error remaining at 90%, which is suggestive of underfitting.

Template-based methods are particularly accurate where template and query sequences overlap and are inaccurate where they do not; unfortunately, non-overlapping regions are often the regions of high biological interest. Errors in these critical non-overlapping regions can be masked by large overlapping regions, inflating overall accuracy (Contreras-Moreira et al., 2005; Dill and MacCallum, 2012; Liu et al., 2018a; Perez et al., 2016). To determine whether RGNs suffer from similar limitations, we split TBM domains into short fragments ranging in size from 5 to 50 residues and computed the RMSD for every fragment (with respect to the experimental structure) from the best template, the best CASP prediction, and the RGN prediction (Figure 4). We found CASP predictions to be correlated (average R2 = 0.44) with template quality across length scales as previously reported (Kryshtafovych et al., 2018), whereas RGN predictions were not (average R2 = 0.06). This distinction persists even when predictions with >3 Å accuracy are excluded (average R2 = 0.49 for best CASP predictions; average R2 = 0.02 for RGN predictions). Thus, RGNs perform equally well on regions of proteins with experimental templates and on those without.

RGNs Learn an Implicit Representation of Protein Fold Space
Applications of deep learning in sensory domains often result in models whose internal representation of the data is interpretable, e.g., placing semantically similar words nearby in a natural language model. To ascertain whether RGNs behave similarly, we extracted the internal state of their computational units after processing each protein sequence in the ProteinNet12 training set. For each protein, we obtained multiple high-dimensional vectors, one per layer and direction of the RGN. We then used linear dimensionality reduction techniques to visualize these vectors in two dimensions, separately for each layer and direction (Figure 5A) and by concatenating all layers together (Figure 5B). When we color each protein (dot) according to the fraction of secondary structure present in its original PDB structure, clear visual patterns emerge (Figure 5B). This is notable because secondary structure was neither used as input to aid model prediction nor as an output signal to guide training; i.e., the model was not explicitly encoded with the concept of secondary structure, yet it uses secondary structure as the dominant factor in shaping its representation of protein fold space.

We next used the CATH database (Dawson et al., 2017), which hierarchically classifies proteins into structural families, to partition data points into CATH classes and visualize their distribution in RGN space. At the topmost CATH level, divided into "Mainly Alpha," "Mainly Beta," "Alpha Beta," and "Few Secondary Structures," we see clearly demarcated regions for each class (represented by differently colored contour plots), with "Alpha Beta" acting unsurprisingly as the bridge (leftmost panel in Figure 5C).
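As an illustration of the kind of linear dimensionality reduction used here, a PCA-based projection of the extracted state vectors can be computed in a few lines; `hidden_states` is a hypothetical (n_proteins, n_features) array of per-protein internal states, and plain PCA is an assumption since the exact linear technique is not specified above:

```python
import numpy as np

def project_states_2d(hidden_states):
    """Project per-protein RGN internal state vectors to 2D with PCA.

    hidden_states: (n_proteins, n_features) array, e.g., the concatenated
    LSTM states across layers and directions. A sketch only; the paper
    does not name the exact linear method, so PCA is assumed here.
    """
    X = hidden_states - hidden_states.mean(axis=0)    # center features
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # principal axes
    return X @ Vt[:2].T                               # (n_proteins, 2)
```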




Figure 4. Correlation between Prediction Accuracy and Template Quality
Scatterplots of fragment RMSDs, ranging in size from 5 to 50 residues, comparing the best CASP templates to the best CASP server predictions (top) and RGN
predictions (bottom). R2 values are computed over all data points (non-parenthesized) and over data points in which predictions achieved <3 Å accuracy
(parenthesized). TBM domains were used (excluding TBM-hard that do not have good templates), and only templates and predictions covering >85% of full
domain sequences were considered. Templates and predictions were selected based on global dRMSD with respect to experimental structure. CASP 7 and 8 are
omitted because of lack of full template information.
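As a sketch of this fragment analysis, sliding-window RMSDs over a domain can be computed as follows; `kabsch_rmsd` (optimal-superposition RMSD, e.g., via the Kabsch algorithm) is a hypothetical placeholder assumed to exist elsewhere:

```python
import numpy as np

def fragment_rmsds(pred, exp, frag_len):
    """RMSD of every contiguous fragment of length frag_len (sliding window).

    pred, exp: (L, 3) coordinate arrays of matching atoms. A sketch of the
    fragment analysis above; `kabsch_rmsd` is an assumed external routine.
    """
    L = len(pred)
    return [kabsch_rmsd(pred[i:i + frag_len], exp[i:i + frag_len])
            for i in range(L - frag_len + 1)]
```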

We then reapplied dimensionality reduction to data in each class and visualized the distributions of their respective second-level CATH categories (three right panels in Figure 5C). We again see contiguous regions for each category, albeit with greater overlap, likely owing to the continuous nature of protein structure space and reduction of RGN space to just two dimensions. These visualizations suggest RGNs are learning a useful representation of protein sequence space that may yield insights into the nature of protein structure space.

RGNs Are 6–7 Orders of Magnitude Faster Than Existing Methods
Existing structure prediction pipelines are multi-staged (Figure 1), first detecting domains that can be separately modeled and running multiple algorithms to estimate secondary structure propensities, solvent accessibility, and disordered regions. Co-evolutionary methods use multiple sequence alignments to predict contact maps, and template-based methods search the PDB for templates. Their predictions are converted into geometric constraints to guide a conformation sampling process, where fragments are swapped in and out of putative structures to minimize an expertly derived energy model. Because of this complexity, prediction times range from hours to days and require codebases as large as several million lines of code (Leaver-Fay et al., 2011).

In contrast, a trained RGN model is a single mathematical function that is evaluated once per prediction. Computation of this function implicitly carries out domain splitting, property finding, energy minimization, and conformational sampling simultaneously. We found that 512 concurrent RGN-based predictions, with sequence length 700, can be made in 5.4 s on a single GPU, i.e., ~10 ms per structure. Table 2 compares training and prediction speeds of RGNs to established methods that rely heavily on simulation with limited learning (first row), and deep learning plus co-evolution-based contact prediction methods that rely on learning (second row), combined with CONFOLD (Adhikari et al., 2015) to convert predicted contact maps into tertiary structures. While training RGNs can take weeks to months, once trained, they make predictions 6–7 orders of magnitude faster than existing pipelines. This speed enables new types of applications, such as the integration of structure prediction within docking and virtual screening, in which ligand-aware RGNs could output distinct protein conformations in response to distinct ligand poses.

DISCUSSION

A key limitation of explicit sequence-to-structure maps, including molecular dynamics and fragment assembly, is a reliance on fixed energy models that do not learn from data; a second limitation is the exclusive use of single-scale atomic or residue-level representations. In contrast, modern co-evolution methods leverage learning and multi-scale representations to substantially improve performance (Liu et al., 2018b; Wang et al., 2017). RGNs go one step further by building a fully differentiable map extending from sequence to structure with all of the steps in existing prediction pipelines implicitly encoded and learnable from data.



Figure 5. The Latent Space of RGNs
(A and B) 2D projection of the separate (A) and combined (B) internal state of all RGN computational layers, with dots corresponding to individual protein sequences in the ProteinNet12 training set. (B) Proteins are colored by fractional secondary structure content, as determined by annotations of original protein structures.
(C) Contour plots of the probability density (50%–90% quantiles) of proteins belonging to categories in the topmost level of the CATH hierarchy (first from left) and proteins belonging to categories in the second-level CATH classes of "Mainly Alpha" (second), "Mainly Beta" (third), and "Alpha Beta" (fourth). Distinct colors correspond to distinct CATH categorizations; see Figures S2–S5 for complete legends. The topmost CATH class "Few Secondary Structures" is omitted because it has no subcategories.

Through their recurrent architecture, RGNs can capture sequence-structure motifs and multiple scales from residues to domains (Alva et al., 2015; Ponting and Russell, 2002). When tracking structure prediction during RGN training (Video S1), RGNs appear to first learn global aspects of protein folds and then refine their predictions to generate a more accurate local structure.

RGNs are multi-representational, operating on three distinct parameterizations of protein structure. The first is torsional, capturing local relationships between atoms, with bond lengths and angles held fixed and torsional angles as the immediate outputs of computational units. This virtually guarantees that predictions are structurally correct at a local level. The second is Cartesian, built by geometric units and capturing the global coordination of multiple atoms in 3D space, the catalytic triad of an enzyme's active site for example, even if the residues are distant along the protein chain. Future augmentations—e.g., 3D convolutional networks that operate directly on the Cartesian representation—may further improve the detection and quality of long-range interactions. The third parameterization, built in the dRMSD stage, is the matrix of inter-atomic distances and is simultaneously local and global. It is useful for optimizing RGN parameters de novo, as we have used it, but can also be used to incorporate prior knowledge expressible in terms of atomic distances; such knowledge includes physical features (e.g., electrostatics) and statistical data on interactions (e.g., evolutionary couplings).



One limitation of current RGNs is their reliance on PSSMs, which we have found to be helpful to achieving high-accuracy predictions. PSSMs are much weaker than multiple sequence alignments, as they are based on single-residue mutation frequencies and ignore how each residue mutates in response to all other residues. Co-evolutionary couplings require pairwise frequencies, resulting in quadratically rather than linearly scaling statistical cost. Nonetheless, removing PSSMs and relying exclusively on raw sequences could robustify RGNs for many applications, including prediction of genetic variants. Achieving this may require more data-efficient model architectures. For protein design, RGNs can be used as is, by fixing the desired structure and optimizing the raw sequence and PSSMs to match it (i.e., by computing derivatives of the dRMSD between predicted and desired structures with respect to the inputs, as opposed to model parameters). Co-evolution methods do not have this capability as their inputs are the inter-residue couplings themselves, making the approach circular.
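As a sketch of this design-by-optimization idea, in a TF1-style graph one can mark the inputs as trainable and hold the model parameters fixed. `rgn_forward`, `drmsd`, `initial_inputs`, and `desired_coords` below are hypothetical placeholders, not the released RGN API:

```python
import tensorflow as tf  # TF1-style API

# `initial_inputs`: starting sequence + PSSM features; `desired_coords`:
# target backbone coordinates. Both are assumed to be defined elsewhere.
inputs = tf.Variable(initial_inputs)           # inputs, not weights, are trainable
coords = rgn_forward(inputs)                   # trained RGN with frozen parameters
loss = drmsd(coords, desired_coords)           # match the desired structure
step = tf.train.GradientDescentOptimizer(0.1).minimize(loss, var_list=[inputs])
```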
The history of protein structure prediction suggests that new methods complementary to existing ones are eventually incorporated into hybrids. RGNs have this benefit, being an almost entirely complementary modeling approach. For example, structural templates or co-evolutionary information could be incorporated as priors in the distance-based parameterization or even as raw inputs for learning. RGNs can also include secondary structure predicted by other algorithms. This is likely to be advantageous since the RGNs described here often predict global fold correctly but do less well with secondary structure (e.g., T0827 in Figure 3E). RGNs can also be made to predict side-chain conformations, by outputting a branched curve in lieu of the current linear curve, and are applicable to a wide range of other polymers (e.g., RNA tertiary structure). Our demonstration that state-of-the-art performance in structure prediction can be achieved using an end-to-end differentiable model will make very rapid improvements in machine learning across a wide range of scientific and technical fields available to protein folding and biophysics. We predict that hybrid systems using deep learning and co-evolution as priors and physics-based approaches for refinement will soon solve the long-standing problem of accurate and efficient structure prediction. It is also possible that the use of neural-network-probing techniques (Alain and Bengio, 2016; Koh and Liang, 2017; Nguyen et al., 2016; Shrikumar et al., 2017; Simonyan et al., 2013) with RGNs will provide new insight into the physical chemistry of folding and the types of intermediate structures that proteins use to sample conformational space.

Table 2. Prediction and Training Speeds of Structure Prediction Methods

Model                                 Prediction Speed    Training Time
Rosetta, I-Tasser, Quark              Hours to days       N/A
Raptor X, DeepContact + CONFOLD       One to few hours    Hours
Recurrent geometric networks (RGNs)   Milliseconds        Weeks to months

Top row corresponds to the most complex and established set of methods, which rely heavily on simulation and sampling and typically have only a minimal learning component. Second row corresponds to methods combining co-evolution-based contact prediction with deep learning, which rely on a learning procedure, plus the CONFOLD method to convert predicted contact maps into tertiary structures. Time estimates are based on workflows used for CASP predictions, which (excepting RGNs) generate a large ensemble of structures, increasing prediction time. RGN predictions are deterministic and thus necessitate only a single prediction. All time estimates exclude multiple sequence alignment (MSA) generation times.

STAR+METHODS

Detailed methods are provided in the online version of this paper and include the following:

• KEY RESOURCES TABLE
• CONTACT FOR REAGENT AND RESOURCE SHARING
• METHOD DETAILS
  - Model
  - Hyperparameters
  - Dataset
• DATA AND SOFTWARE AVAILABILITY

SUPPLEMENTAL INFORMATION

Supplemental Information can be found online at https://doi.org/10.1016/j.cels.2019.03.006.

ACKNOWLEDGMENTS

We are indebted to Peter Sorger for his mentorship and support and thank him for extensive editorial feedback on this manuscript. We thank Jasper Snoek and Adrian Jinich for their editorial comments and many helpful discussions; Uraib Aboudi, Ramy Arnaout, Karen Sachs, Michael Levitt, Nazim Bouatta, and Jinbo Xu for their feedback; Martin Steinegger and Milot Mirdita for their help with using the HHblits and MMseqs2 packages; Sergey Ovchinnikov for discussions about the manuscript and help with metagenomics sequences; Andriy Kryshtafovych for his help with CASP structures; Sean Eddy for his help with using the JackHMMer package; and Raffaele Potami, Amir Karger, and Kristina Holton for their help with using the HPC resources at Harvard Medical School. Finally, we thank the anonymous reviewers for their constructive feedback. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPUs used for this research. This work was supported by NIGMS grant P50GM107618 and NCI grant U54-CA225088.

AUTHOR CONTRIBUTIONS

M.A. conceived the model, conducted the experiments, and wrote the paper.

DECLARATION OF INTERESTS

The author declares no competing interests.

Received: June 22, 2018
Revised: February 1, 2019
Accepted: March 11, 2019
Published: April 17, 2019

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.

Adhikari, B., Bhattacharya, D., Cao, R., and Cheng, J. (2015). CONFOLD: residue-residue contact-guided ab initio protein folding. Proteins 83, 1436–1449.

Alain, G., and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. Arxiv, arXiv:1610.01644v4 https://arxiv.org/abs/1610.01644.

AlQuraishi, M. (2019a). Parallelized natural extension reference frame: parallelized conversion from internal to Cartesian coordinates. J. Comp. Chem. 40, 885–892.

AlQuraishi, M. (2019b). ProteinNet: a standardized data set for machine learning of protein structure. Arxiv, arXiv:1902.00249v1 https://arxiv.org/abs/1902.00249.

Alva, V., Söding, J., and Lupas, A.N. (2015). A vocabulary of ancient peptides at the origin of folded proteins. Elife 4, e09410.


Aydin, Z., Thompson, J., Bilmes, J., Baker, D., and Noble, W.S. (2012). Protein torsion angle class prediction by a hybrid architecture of Bayesian and neural networks. In 13th International Conference on Bioinformatics and Computational Biology, pp. 2012–2018.

Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The Protein Data Bank. A computer-based archival file for macromolecular structures. Eur. J. Biochem 80, 319–324.

Branden, C., and Tooze, J. (1999). Introduction to Protein Structure (Garland Science).

Contreras-Moreira, B., Ezkurdia, I., Tress, M.L., and Valencia, A. (2005). Empirical limits for template-based protein structure prediction: the CASP5 example. FEBS Lett. 579, 1203–1207.

Dawson, N.L., Lewis, T.E., Das, S., Lees, J.G., Lee, D., Ashford, P., Orengo, C.A., and Sillitoe, I. (2017). CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 45, D289–D295.

Dill, K.A. (1990). Dominant forces in protein folding. Biochemistry 29, 7133–7155.

Dill, K.A., and MacCallum, J.L. (2012). The protein-folding problem, 50 years on. Science 338, 1042–1046.

Gajda, M.J., Pawlowski, M., and Bujnicki, J.M. (2011a). Protein structure prediction: from recognition of matches with known structures to recombination of fragments. In Multiscale Approaches to Protein Modeling, A. Kolinski, ed. (Springer), pp. 231–254.

Gajda, M.J., Pawlowski, M., and Bujnicki, J.M. (2011b). Multiscale Approaches to Protein Modeling (Springer).

Gao, Y., Wang, S., Deng, M., and Xu, J. (2017). Real-value and confidence prediction of protein backbone dihedral angles through a hybrid method of clustering and deep learning. Arxiv, arXiv:1712.07244v1 https://arxiv.org/abs/1712.07244v1.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (The MIT Press).

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.

Hopf, T.A., Schärfe, C.P.I., Rodrigues, J.P.G.L.M., Green, A.G., Kohlbacher, O., Sander, C., Bonvin, A.M.J.J., and Marks, D.S. (2014). Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3.

Hopf, T.A., Ingraham, J.B., Poelwijk, F.J., Schärfe, C.P.I., Springer, M., Sander, C., and Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nat. Biotech 35, 128–135.

de Juan, D., Pazos, F., and Valencia, A. (2013). Emerging methods in protein co-evolution. Nat. Rev. Genet. 14, 249–261.

Koh, P.W., and Liang, P. (2017). Understanding black-box predictions via influence functions. Arxiv, arXiv:1703.04730v2 https://arxiv.org/abs/1703.04730.

Kryshtafovych, A., Monastyrskyy, B., and Fidelis, K. (2016). CASP11 statistics and the prediction center evaluation system. Proteins 84, 15–19.

Kryshtafovych, A., Monastyrskyy, B., Fidelis, K., Moult, J., Schwede, T., and Tramontano, A. (2018). Evaluation of the template-based modeling in CASP12. Proteins 86, 321–334.

Leaver-Fay, A., Tyka, M., Lewis, S.M., Lange, O.F., Thompson, J., Jacak, R., Kaufman, K., Renfrew, P.D., Smith, C.A., Sheffler, W., et al. (2011). ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Meth. Enzymol 487, 545–574.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444.

Li, H., Hou, J., Adhikari, B., Lyu, Q., and Cheng, J. (2017). Deep learning methods for protein torsion angle prediction. BMC Bioinformatics 18, 417.

Liu, T., Ish-Shalom, S., Torng, W., Lafita, A., Bock, C., Mort, M., Cooper, D.N., Bliven, S., Capitani, G., Mooney, S.D., et al. (2018a). Biological and functional relevance of CASP predictions. Proteins 86, 374–386.

Liu, Y., Palmedo, P., Ye, Q., Berger, B., and Peng, J. (2018b). Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst 6, 65–74.

Lyons, J., Dehzangi, A., Heffernan, R., Sharma, A., Paliwal, K., Sattar, A., Zhou, Y., and Yang, Y. (2014). Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network. J. Comput. Chem. 35, 2040–2046.

Marks, D.S., Colwell, L.J., Sheridan, R., Hopf, T.A., Pagnani, A., Zecchina, R., and Sander, C. (2011). Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766.

Marx, D., and Hutter, J. (2012). Ab Initio Molecular Dynamics: Basic Theory and Advanced Methods (Cambridge University Press).

Moult, J., Pedersen, J.T., Judson, R., and Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins 23, ii–iv.

Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., and Tramontano, A. (2018). Critical assessment of methods of protein structure prediction (CASP)-round XII. Proteins 86, 7–15.

Nguyen, A., Yosinski, J., and Clune, J. (2016). Multifaceted feature visualization: uncovering the different types of features learned by each neuron in deep neural networks. Arxiv, arXiv:1602.03616v2 https://arxiv.org/abs/1602.03616.

Ovchinnikov, S., Kim, D.E., Wang, R.Y.-R., Liu, Y., DiMaio, F., and Baker, D. (2016). Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins 84, 67–75.

Ovchinnikov, S., Park, H., Varghese, N., Huang, P.S., Pavlopoulos, G.A., Kim, D.E., Kamisetty, H., Kyrpides, N.C., and Baker, D. (2017). Protein structure determination using metagenome sequence data. Science 355, 294–298.

Parsons, J., Holmes, J.B., Rojas, J.M., Tsai, J., and Strauss, C.E.M. (2005). Practical conversion from torsion space to Cartesian space for in silico protein synthesis. J. Comput. Chem. 26, 1063–1068.

Perez, A., Morrone, J.A., Brini, E., MacCallum, J.L., and Dill, K.A. (2016). Blind protein structure prediction using accelerated free-energy simulations. Sci. Adv 2, e1601274.

Ponting, C.P., and Russell, R.R. (2002). The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct 31, 45–71.

Ramachandran, G.N., Ramakrishnan, C., and Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99.

Schaarschmidt, J., Monastyrskyy, B., Kryshtafovych, A., and Bonvin, A.M.J.J. (2018). Assessment of contact predictions in CASP12: co-evolution and deep learning coming of age. Proteins 86, 51–66.

Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning important features through propagating activation differences. In ICML'17 Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3145–3153.

Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep inside convolutional networks: visualising image classification models and saliency maps. Arxiv, arXiv:1312.6034v2 https://arxiv.org/abs/1312.6034.

Wang, S., Sun, S., Li, Z., Zhang, R., and Xu, J. (2017). Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324.

Xu, D., and Zhang, Y. (2012). Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins 80, 1715–1735.

Xu, J., and Zhang, Y. (2010). How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26, 889–895.

Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang, Y. (2015). The I-TASSER suite: protein structure and function prediction. Nat. Methods 12, 7–8.

Zhang, Y., and Skolnick, J. (2004). Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710.

Zhang, C., Mortuza, S.M., He, B., Wang, Y., and Zhang, Y. (2018). Template-based and free modeling of I-TASSER and Quark pipelines using predicted contact maps in CASP12. Proteins 86, 136–151.

Zhao, F., Peng, J., and Xu, J. (2010). Fragment-free approach to protein folding using conditional neural fields. Bioinformatics 26, i310–i317.

Zhou, Y., Duan, Y., Yang, Y., Faraggi, E., and Lei, H. (2011). Trends in template/fragment-free protein structure prediction. Theor. Chem. Acc 128, 3–16.



STAR+METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE   SOURCE               IDENTIFIER

Software and Algorithms
TensorFlow            Abadi et al., 2016   tensorflow.org
ProteinNet            AlQuraishi, 2019b    https://github.com/aqlaboratory/proteinnet

CONTACT FOR REAGENT AND RESOURCE SHARING

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Mohammed AlQuraishi ([email protected]).

METHOD DETAILS

Model
We featurize a protein of length $L$ as a sequence of vectors $(x_1, \ldots, x_L)$ where $x_t \in \mathbb{R}^d$ for all $t$. The dimensionality $d$ is 41, where 20 dimensions are used as a one-hot indicator of the amino acid residue at a given position, another 20 dimensions are used for the PSSM of that position, and 1 dimension is used to encode the information content of the position. The PSSM values are sigmoid transformed to lie between 0 and 1. The sequence of input vectors is fed to an LSTM (Hochreiter and Schmidhuber, 1997), whose basic formulation is described by the following set of equations:

$$i_t = \sigma(W_i [x_t; h_{t-1}] + b_i)$$
$$f_t = \sigma(W_f [x_t; h_{t-1}] + b_f)$$
$$o_t = \sigma(W_o [x_t; h_{t-1}] + b_o)$$
$$\tilde{c}_t = \tanh(W_c [x_t; h_{t-1}] + b_c)$$
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$

$W_i, W_f, W_o, W_c$ are weight matrices, $b_i, b_f, b_o, b_c$ are bias vectors, $h_t$ and $c_t$ are the hidden and memory cell state for residue $t$, respectively, and $\odot$ is element-wise multiplication. We use two LSTMs, running independently in opposite directions (1 to $L$ and $L$ to 1), to output two hidden states $h_t^{(f)}$ and $h_t^{(b)}$ for each residue position $t$, corresponding to the forward and backward directions. Depending on the RGN architecture, these two hidden states are either the final output states or they are fed as inputs into one or more LSTM layers.

The outputs from the last LSTM layer form a sequence of concatenated hidden state vectors $([h_1^{(f)}; h_1^{(b)}], \ldots, [h_L^{(f)}; h_L^{(b)}])$. Each concatenated vector is then fed into an angularization layer described by the following set of equations:

$$p_t = \mathrm{softmax}\left(W_\varphi [h_t^{(f)}; h_t^{(b)}] + b_\varphi\right)$$
$$\varphi_t = \arg(p_t \exp(i\Phi))$$

$W_\varphi$ is a weight matrix, $b_\varphi$ is a bias vector, $\Phi$ is a learned alphabet matrix, and $\arg$ is the complex-valued argument function. Exponentiation of the complex-valued matrix $i\Phi$ is performed element-wise. The $\Phi$ matrix defines an alphabet of size $m$ whose letters correspond to triplets of torsional angles defined over the 3-torus. The angularization layer interprets the LSTM hidden state outputs as weights over the alphabet, using them to compute a weighted average of the letters of the alphabet (independently for each torsional angle) to generate the final set of torsional angles $\varphi_t \in S^1 \times S^1 \times S^1$ for residue $t$ (we are overloading the standard notation for protein backbone torsional angles, with $\varphi_t$ corresponding to the $(\psi, \varphi, \omega)$ triplet). Note that $\varphi_t$ may be alternatively computed using the following equation, where the trigonometric operations are performed element-wise:

$$\varphi_t = \mathrm{atan2}\left(p_t \sin(\Phi),\; p_t \cos(\Phi)\right)$$
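A minimal NumPy sketch of the angularization layer using the atan2 form above (shapes and names here are illustrative assumptions, not the released implementation):

```python
import numpy as np

def angularize(h, W_phi, b_phi, Phi):
    """Map an LSTM output vector h to a (psi, phi, omega) torsion triplet.

    W_phi: (m, n_hidden); b_phi: (m,); Phi: (m, 3) learned alphabet of
    torsion triplets in [-pi, pi]. A sketch of the equations above.
    """
    z = W_phi @ h + b_phi
    p = np.exp(z - z.max())
    p /= p.sum()                     # softmax weights over the alphabet
    # Weighted circular mean, computed independently per torsion angle:
    return np.arctan2(p @ np.sin(Phi), p @ np.cos(Phi))
```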


In general, the geometry of a protein backbone can be represented by three torsional angles $\varphi$, $\psi$, and $\omega$ that define the angles between successive planes spanned by the N, C$\alpha$, and C' protein backbone atoms (Ramachandran et al., 1963). While bond lengths and angles vary as well, their variation is sufficiently limited that they can be assumed fixed. Similar claims hold for side chains as well, although we restrict our attention to backbone structure. The resulting sequence of torsional angles $(\varphi_1, \ldots, \varphi_L)$ from the angularization layer is fed sequentially, along with the coordinates of the last three atoms of the nascent protein chain $(c_1, \ldots, c_{3t})$, into recurrent geometric units that convert this sequence into 3D Cartesian coordinates, with three coordinates resulting from each residue, corresponding to the N, C$\alpha$, and C' backbone atoms. Multiple mathematically equivalent formulations exist for this transformation; we adopt one based on the Natural Extension Reference Frame (Parsons et al., 2005), described by the following set of equations:

$$\tilde{c}_k = r_{k \bmod 3} \begin{bmatrix} -\cos(\theta_{k \bmod 3}) \\ \cos(\varphi_{\lceil k/3 \rceil,\, k \bmod 3}) \sin(\theta_{k \bmod 3}) \\ \sin(\varphi_{\lceil k/3 \rceil,\, k \bmod 3}) \sin(\theta_{k \bmod 3}) \end{bmatrix}$$
$$m_k = c_{k-1} - c_{k-2}$$
$$n_k = m_{k-1} \times \hat{m}_k$$
$$M_k = [\hat{m}_k,\; \hat{n}_k \times \hat{m}_k,\; \hat{n}_k]$$
$$c_k = M_k \tilde{c}_k + c_{k-1}$$

where $r_k$ is the length of the bond connecting atoms $k-1$ and $k$, $\theta_k$ is the bond angle formed by atoms $k-2$, $k-1$, and $k$, $\varphi_{\lceil k/3 \rceil,\, k \bmod 3}$ is the predicted torsional angle formed by atoms $k-2$ and $k-1$, $c_k$ is the position of the newly predicted atom $k$, $\hat{m}$ is the unit-normalized version of $m$, and $\times$ is the cross product. Note that $k$ indexes atoms 1 through $3L$, since there are three backbone atoms per residue. For each residue $t$ we compute $c_{3t-2}$, $c_{3t-1}$, and $c_{3t}$ using the three predicted torsional angles of residue $t$, specifically $\varphi_{t,j} = \varphi_{3t,\,(3t+j) \bmod 3}$ for $j = \{0, 1, 2\}$. The bond lengths and angles are fixed, with three bond lengths $(r_0, r_1, r_2)$ corresponding to N–C$\alpha$, C$\alpha$–C', and C'–N, and three bond angles $(\theta_0, \theta_1, \theta_2)$ corresponding to N–C$\alpha$–C', C$\alpha$–C'–N, and C'–N–C$\alpha$. As there are only three unique values we have $r_k = r_{k \bmod 3}$ and $\theta_k = \theta_{k \bmod 3}$. In practice we employ a modified version of the above equations which enables much higher computational efficiency (AlQuraishi, 2019a).
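A minimal NumPy sketch of a single (non-parallelized) extension step implied by these equations:

```python
import numpy as np

def extend_backbone(c_prev3, r, theta, phi):
    """Place backbone atom k from the positions of the previous three atoms.

    c_prev3: (3, 3) array with rows c_{k-3}, c_{k-2}, c_{k-1}; r, theta, phi:
    bond length, bond angle, and predicted torsion for atom k. A sketch of
    the equations above (the paper uses a parallelized variant).
    """
    c_tilde = r * np.array([-np.cos(theta),
                            np.cos(phi) * np.sin(theta),
                            np.sin(phi) * np.sin(theta)])
    m_prev = c_prev3[1] - c_prev3[0]            # m_{k-1}
    m = c_prev3[2] - c_prev3[1]                 # m_k
    m_hat = m / np.linalg.norm(m)
    n = np.cross(m_prev, m_hat)                 # n_k
    n_hat = n / np.linalg.norm(n)
    M = np.stack([m_hat, np.cross(n_hat, m_hat), n_hat], axis=1)
    return M @ c_tilde + c_prev3[2]             # c_k
```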
The resulting sequence $(c_1, \ldots, c_{3L})$ fully describes the protein backbone chain structure and is the model's final predicted output. For training purposes a loss is necessary to optimize model parameters. We use the dRMSD metric as it is differentiable and captures both local and global aspects of protein structure. It is defined by the following set of equations:

$$\tilde{d}_{j,k} = \lVert c_j - c_k \rVert_2$$
$$d_{j,k} = \tilde{d}_{j,k}^{(\mathrm{exp})} - \tilde{d}_{j,k}^{(\mathrm{pred})}$$
$$\mathrm{dRMSD} = \frac{\lVert D \rVert_2}{L(L-1)}$$

where $\{d_{j,k}\}$ are the elements of matrix $D$, and $\tilde{d}_{j,k}^{(\mathrm{exp})}$ and $\tilde{d}_{j,k}^{(\mathrm{pred})}$ are computed using the coordinates of the experimental and predicted structures, respectively. In effect, the dRMSD computes the $\ell_2$-norm of the distances over distances, by first computing the pairwise distances between all atoms in both the predicted and experimental structures individually, and then computing the distances between those distances. For most experimental structures, the coordinates of some atoms are missing. They are excluded from the dRMSD by not computing the differences between their distances and the predicted ones.
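A minimal NumPy sketch of this loss; the handling of missing atoms via a hypothetical per-atom `mask` argument is an assumption for illustration:

```python
import numpy as np

def drmsd(pred, exp, mask=None):
    """dRMSD between predicted and experimental coordinates, (N, 3) arrays.

    A sketch of the equations above, with L taken as the number of points;
    `mask` is a hypothetical (N,) indicator of experimentally resolved atoms.
    """
    def pdist(c):
        return np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)

    D = pdist(exp) - pdist(pred)
    if mask is not None:
        D = D * np.outer(mask, mask)   # zero out pairs involving missing atoms
    L = pred.shape[0]
    return np.linalg.norm(D) / (L * (L - 1))   # ||D||_2 is the Frobenius norm
```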

Hyperparameters
RGN hyperparameters were manually fit, through sequential exploration of hyperparameter space, using repeated evaluations on the ProteinNet11 validation set and three evaluations on the ProteinNet11 test set. Once chosen, the same hyperparameters were used to train RGNs on ProteinNet7-12 training sets. The validation sets were used to determine early stopping criteria, followed by single evaluations on the ProteinNet7-12 test sets to generate the final reported numbers (excepting ProteinNet11).

The final model consisted of two bidirectional LSTM layers, each comprised of 800 units per direction, in which outputs from the two directions are first concatenated before being fed to the second layer. Input dropout set at 0.5 was used for both layers, and the alphabet size was set to 60 for the angularization layer. Inputs were duplicated and concatenated; this had a separate effect from decreasing dropout probability. LSTMs were randomly initialized with a uniform distribution with support [−0.01, 0.01], while the alphabet was similarly initialized with support [−π, π]. ADAM was used as the optimizer, with a learning rate of 0.001, β₁ = 0.95 and β₂ = 0.99, and a batch size of 32. Gradients were clipped using norm rescaling with a threshold of 5.0. The loss function used for optimization was length-normalized dRMSD (i.e., dRMSD divided by protein length), which is distinct from the standard dRMSD we use for reporting accuracies.

RGNs are very seed sensitive. As a result, we used a milestone scheme to restart underperforming models early. If a dRMSD loss milestone is not achieved by a given iteration, training is restarted with a new initialization seed. Table S3 summarizes the milestones, which were determined based on preliminary runs. In general, 8 models were started and, after surviving all milestones, were run for 250k iterations, at which point the lower-performing half were discarded, and similarly at 500k iterations, ending with 2 models that were usually run for 2.5M iterations. Once validation error stabilized we reduced the learning rate by a factor of 10 to 0.0001, and ran for a few thousand additional iterations to gain a small but detectable increase in accuracy before ending model training.
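These optimization settings translate directly into a TF1-style training step; a sketch under the assumption that `loss` is the length-normalized dRMSD tensor built elsewhere (names are illustrative, not the released code):

```python
import tensorflow as tf  # TF1-style API

# `loss`: length-normalized dRMSD tensor, assumed to be defined elsewhere.
opt = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.95, beta2=0.99)
grads, params = zip(*opt.compute_gradients(loss))
grads, _ = tf.clip_by_global_norm(grads, 5.0)   # norm rescaling, threshold 5.0
train_op = opt.apply_gradients(zip(grads, params))
```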

Dataset
We use the ProteinNet dataset for all analyses (AlQuraishi, 2019b). ProteinNet recreates the conditions of past CASP assessments by restricting the set of sequences (for building PSSMs) and structures used to those available prior to the start of each CASP assessment. Each ProteinNet entry is comprised of two inputs, the raw protein sequence, represented by a one-hot vector, and the protein's PSSM and information content profiles, derived using 5 iterations of JackHMMer with an e-value threshold of 10⁻¹⁰. PSSM values are normalized to lie between 0 and 1. The output for each ProteinNet entry is comprised of the Cartesian coordinates of the protein's backbone atoms, annotated by metadata denoting which atoms are missing from the experimental structure. These atoms are excluded from the dRMSD loss calculation, which enables use of partially resolved experimental structures that would otherwise be excluded from the dataset.

For ProteinNet7-11, the publicly available CASP structures were used as test sets. For ProteinNet12, the publicly available CASP12 structures are incomplete, as some structures are still embargoed. We obtained a private set of structures from the CASP organizers that includes all structures used in CASP12 (except two), and we used this set for model assessment. For training all RGN models, the 90% "thinning" version of ProteinNet was used.
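A sketch of assembling the 41-dimensional input features from a ProteinNet entry; the amino acid ordering `AA` is an assumption for illustration (ProteinNet defines its own encoding):

```python
import numpy as np

AA = 'ACDEFGHIKLMNPQRSTVWY'  # 20 canonical amino acids; ordering assumed

def featurize(seq, pssm, info_content):
    """Build the (L, 41) input: one-hot (20) + normalized PSSM (20) + info (1).

    pssm: (L, 20) values already normalized to [0, 1]; info_content: (L,).
    A sketch of the featurization described above.
    """
    onehot = np.zeros((len(seq), 20))
    onehot[np.arange(len(seq)), [AA.index(a) for a in seq]] = 1.0
    return np.concatenate([onehot, pssm, info_content[:, None]], axis=1)
```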

DATA AND SOFTWARE AVAILABILITY

TensorFlow (Abadi et al., 2016) code for training new RGN models, as well as pre-trained RGN models used in reporting results for
CASP 7-12, are available on GitHub at https://github.com/aqlaboratory/rgn.

