2021 09 08 AlphaFold Webinar Slides
AlphaFold
Presenter: Kathryn Tunyasuvunakool
Research Scientist at DeepMind
Agenda Private & Confidential
● Introduction
● Accessing AlphaFold
● Future work
DeepMind and protein folding
A key AlphaFold input is the MSA, containing sequences evolutionarily related to the target.
Related sequences are found using standard tools and public databases.
Inputs
The input sequence is used to create an array of numbers representing all residue pairs.
Inputs
AlphaFold can also use template structures from the PDB, found using standard tools.
However, it often produces accurate predictions without a template.
Network
The Evoformer blocks extract information about the relationship between residues.
The MSA representation can update the pair representation and vice versa.
Network
Feeding certain outputs back through the network again improves performance.
Other outputs
Interpreting predictions
Identifying domains & possible disordered regions
● pLDDT > 70: residues 65-342 and 418-784 form a confident domain prediction
● pLDDT < 50: a disorder prediction, not a structure prediction
Assessing confidence within a domain
● pLDDT > 90: reasonable to investigate side chains / active site details
● pLDDT > 70: lower confidence on these specific parts
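The pLDDT bands on this slide can be applied residue by residue. The following is a minimal sketch, assuming the per-residue pLDDT values are already available as a list of floats. The function names and band labels are illustrative, and the slide does not describe the 50-70 range, so it is given a generic "low" label here as an assumption.

```python
from typing import List, Tuple


def plddt_band(score: float) -> str:
    """Map one pLDDT value (0-100) to the bands named on the slide."""
    if score > 90:
        return "very_high"   # reasonable to investigate side chains / active site
    if score > 70:
        return "confident"   # confident domain-level prediction
    if score < 50:
        return "disorder"    # a disorder prediction, not a structure prediction
    return "low"             # 50-70: not described on the slide (assumption)


def confident_segments(plddt: List[float],
                       threshold: float = 70.0) -> List[Tuple[int, int]]:
    """Return 1-based inclusive residue ranges where pLDDT > threshold."""
    segments = []
    start = None
    for i, score in enumerate(plddt, start=1):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(plddt)))
    return segments
```

Contiguous segments above the threshold give candidate domain boundaries, in the spirit of the "residues 65-342 and 418-784" example.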
Predicted LDDT: pitfalls
High pLDDT on all domains does not imply AlphaFold is confident of their relative positions
Mainly used to assess relative domain positions, but applicable whenever pairwise confidence is relevant
Predicted Aligned Error: format
[Figure: PAE plot for a protein with two domains, residues 1-375 and residues 400-722, annotated first with residue numbers 272 and 163, then with 272 and 429]
Predicted Aligned Error: usage
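Because PAE is pairwise, confidence in the relative placement of two domains can be summarised by averaging the off-diagonal blocks of the matrix. This is a minimal sketch, assuming the PAE matrix is a nested list `pae[i][j]` in Å with 0-based indices. The function names are hypothetical, and averaging is only one possible summary.

```python
from typing import Sequence


def mean_block(pae: Sequence[Sequence[float]], rows: range, cols: range) -> float:
    """Mean PAE over the rectangular block rows x cols."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


def inter_domain_pae(pae: Sequence[Sequence[float]],
                     domain_a: range, domain_b: range) -> float:
    """Average both off-diagonal blocks, since PAE is not symmetric."""
    ab = mean_block(pae, domain_a, domain_b)
    ba = mean_block(pae, domain_b, domain_a)
    return (ab + ba) / 2
```

A low value within each domain combined with a high inter-domain value matches the pitfall described earlier: confident domains whose relative position is uncertain.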
Things to be aware of
● If AlphaFold is uncertain, it won’t necessarily place domains sensibly relative to each other
○ Membrane proteins won’t leave space for the cell membrane
○ Clashes can occur
Complexes
● For proteins that exist in complex, AlphaFold is missing context about their binding partners
○ Heteromers more problematic than homomers
○ Worst case: the protein is flexible in isolation
● Some have had success predicting complexes by joining 2 sequences with a linker
○ We think it is possible to extend the ideas in AlphaFold to complexes
○ However, this linker setup remains to be benchmarked
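The linker setup mentioned above amounts to concatenating two sequences with a flexible spacer between them. This is a minimal sketch of that preparation step only, assuming a poly-glycine linker. The function name, the default linker length and the returned spans are illustrative choices, not a benchmarked protocol.

```python
from typing import Tuple


def join_with_linker(seq_a: str, seq_b: str,
                     linker_length: int = 20,
                     linker_residue: str = "G") -> Tuple[str, Tuple[int, int], Tuple[int, int]]:
    """Concatenate two chains with a linker.

    Returns the joined sequence plus the 0-based half-open spans of each
    original chain, so the linker residues can be ignored when reading
    the prediction. The default length of 20 is an assumption.
    """
    linker = linker_residue * linker_length
    joined = seq_a + linker + seq_b
    span_a = (0, len(seq_a))
    span_b = (len(seq_a) + linker_length, len(joined))
    return joined, span_a, span_b
```

Keeping the spans makes it straightforward to strip the linker from the output before interpreting it.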
Benchmarking AlphaFold
AlphaFold isn’t trained on these recent chains, so it hasn’t had access to them before.
Our code includes a max_template_date flag that can be used to limit the templates the model can use.
AlphaFold DB predictions may use any template released before February 15th 2021.
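The idea behind a template date cutoff can be illustrated in a few lines. This is a sketch of the concept only, not AlphaFold's implementation: the `Template` record, the `filter_templates` function and the example identifiers are hypothetical, and the strict "before" comparison follows the wording of the slide.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Template:
    template_id: str      # hypothetical identifier, for illustration only
    release_date: date


def filter_templates(templates: List[Template],
                     max_template_date: date) -> List[Template]:
    """Keep only templates released before the cutoff date."""
    return [t for t in templates if t.release_date < max_template_date]


candidates = [
    Template("older_entry", date(2019, 6, 1)),
    Template("newer_entry", date(2021, 5, 1)),
]
allowed = filter_templates(candidates, date(2021, 2, 15))
```

When benchmarking on a recent structure, setting the cutoff before that structure's release date keeps it, and anything released after it, out of the template set.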
Accessing AlphaFold
1. EMBL-EBI database
2. AlphaFold Colab
EMBL-EBI database
Pros
● Easiest way to access a prediction - no code, no wait times
● Bulk download available
● Data is CC-BY 4.0
Cons
● Your protein of interest may be missing
● Can’t play around with how the prediction is generated
AlphaFold Colab
Pros
● Can run an arbitrary sequence of interest
● No coding or installation required
Cons
● Not suitable for large prediction jobs
● May be unreliable for long sequences
● Wait times
● Limited ability to influence the prediction
Note: community-made Colabs are also available that use AlphaFold models! Sergey will talk about one later.
These may support a wider range of options for customising the prediction.
Please cite our methods paper (Jumper et al. 2021) if you use an AlphaFold prediction.
Open source code
You can also download the code and run AlphaFold on your own machine.
Most dependencies are provided in a Docker container, but you need to download the genetics / template databases and trained models separately.
Pros
● Can run an arbitrary sequence of interest
● Can run large numbers of predictions
● Full freedom to edit the code and change how the prediction is generated
Cons
● Requires sufficient space for the databases (~2.2 TB) and a GPU
● A little technical expertise required
● Wait times
Future work
Thank you to everyone who made AlphaFold possible!
Sameer Velankar
EMBL-EBI
AlphaFold Database
• Title
• Molecule description
• Name, source
• COMPND, SOURCE records
• Citation
• Cross-reference to UniProt
• DBREF record
• Sequence
• SEQRES record
Model archive extension
• Entry details
• Accession, version, authors, version history
• Molecule description
• Name, source, sequence, sequence version
• Cross-reference
• UniProt, Taxonomy id
• Quality measure
• Per residue quality, Global quality
• Possibility to add protocols, MSA details
• Can be extended to include more metadata
AlphaFold web pages
• Basic search system
• Allows search using UniProt accession, UniProt id,
protein name, gene name and organism
UniProt Pfam
InterPro
PDBe-KB
AlphaFold Database – limitations
• Information on complexes with other proteins, nucleic acids (DNA or RNA) or ligands. In
some cases, the single-chain prediction may correspond to the structure adopted in a
complex. The missing context from surrounding molecules may lead to an uninformative
prediction
• AlphaFold does not make any predictions about any of the non-protein components such as
cofactors, metals, ligands including drug-like molecules, ions, carbohydrates and other
post-translational modifications
• Protein dynamics - AlphaFold will usually only produce one of multiple conformations
• AlphaFold has not been trained or validated for predicting the effect of mutations
• May (or may not) lead to hypotheses about protein function – any hypotheses have to be
tested by further experimentation
What’s next – under discussion
• Additional metadata
• MSA – need to consider data size
• information on templates
• quality criteria e.g. predicted TM score
• Integrative/hybrid methods
• Models for individual components
I/H Methods Structures
552-protein yeast Nuclear Pore Complex
Kim et al. (2018) Nature 555, 475-82
PDBDEV_00000010; PDBDEV_00000011; PDBDEV_00000012
• Combination of sparse experimental data and predicted model may lead to actionable data to test hypotheses
• Chemical footprinting
• Hydrogen-Deuterium exchange
• smFRET - Single molecule fluorescence resonance energy transfer
Structural biology across scales
[Figure: scales from organism to atom, with levels labelled Organism, Infected Cell, Virus, Complex Assembly, Molecule Chains, Chemical Entity and Atom; examples listed: organism, strain/variant, cellular organelle, carbohydrate chain, immune-system evasion]
github.com/sokrypton/ColabFold
ColabFold - Advanced options
Can predict protein-protein/peptide interactions
Don't actually need a G-linker!
G-linker!
UNK-linker!
Protein-peptide interaction
Can predict protein-protein/peptide interactions
consistent w/
biochem data
Preprints rolling in...
Cross-species
Predicting homo-oligomers
Modeling protein given MSA
● Inputs: Multiple Sequence Alignment and residue index [1,2,3,4,5,6]
● Residue_index in AlphaFold is used to create a relative positional encoding
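A relative positional encoding built from the residue index can be sketched as pairwise index differences, clipped to a fixed window and shifted into bin numbers. The sketch below is illustrative, not the AlphaFold code: the function names are hypothetical, and the clipping window of 32 and the gap of 200 are assumptions. The second helper shows one way a jump in the residue index makes residues on either side of it look far apart in sequence, which is an assumption about how separate chains could be represented without a physical linker.

```python
from typing import List


def relative_position_bins(residue_index: List[int],
                           max_offset: int = 32) -> List[List[int]]:
    """Clipped pairwise offsets i - j, shifted to bin numbers 0..2*max_offset."""
    bins = []
    for i in residue_index:
        row = []
        for j in residue_index:
            offset = max(-max_offset, min(max_offset, i - j))
            row.append(offset + max_offset)
        bins.append(row)
    return bins


def residue_index_with_gaps(chain_lengths: List[int], gap: int = 200) -> List[int]:
    """1-based residue index with an extra jump of `gap` at each chain boundary."""
    index: List[int] = []
    start = 1
    for length in chain_lengths:
        index.extend(range(start, start + length))
        start += length + gap
    return index
```

With a contiguous index, neighbouring residues fall into bins next to the central one. With a large jump, every pair spanning the jump lands in the outermost bin, the same bin used for any pair more than `max_offset` residues apart.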
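Complexes - monomer
[Figure: PAE (Å) plot, positions vs positions]
Complexes - homodimer
[Figure: PAE (Å) plot, positions vs positions, chains A and B]
Complexes - homo-6mer
[Figure: PAE (Å) plot, positions vs positions, chains A-F]
Complexes - homo-8mer?
[Figure: PAE (Å) plot, positions vs positions, by chain]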
BENCHMARK* - Can AlphaFold predict homo-oligomers, and is inter-PAE a good metric for this?
* Technically in the training set, but AlphaFold was only trained on single chains.
Homo-oligomeric dataset
Ponstingl, H., Kabir, T. and Thornton, J.M., 2003. Automatic inference of protein quaternary structure from crystals.
Journal of Applied Crystallography, 36(5), pp.1116-1122.
BENCHMARK* - Lower inter-PAE scores appear predictive of homo-dimeric formation
[Figure: PAE plots for chains A and B]
pTMscore integrates PAE info. We recommend using it instead of pLDDT for ranking predicted complexes.
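An inter-PAE summary for an oligomer can be tabulated per pair of chains. This is a minimal sketch, assuming the chains are concatenated in order and the PAE matrix is a nested list in Å. The function name is hypothetical, and the mean is one possible summary rather than the metric used in the benchmark.

```python
from typing import List, Sequence


def chain_pair_pae(pae: Sequence[Sequence[float]],
                   chain_lengths: List[int]) -> List[List[float]]:
    """Mean PAE for every ordered pair of chains.

    Entry [a][b] averages the rows of chain a against the columns of
    chain b; diagonal entries are intra-chain, the rest are inter-chain.
    """
    bounds = []
    start = 0
    for length in chain_lengths:
        bounds.append((start, start + length))
        start += length
    table = []
    for r0, r1 in bounds:
        row = []
        for c0, c1 in bounds:
            block = [pae[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(sum(block) / len(block))
        table.append(row)
    return table
```

Low off-diagonal entries indicate chain pairs whose relative placement the model is more confident about.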
How about hetero-oligomers (or a mixture)?
Hetero-dimer (1:1) - CASP target H1065
[Figure: PAE (Å) plot by position and chain, 1:1 stoichiometry]
Sometimes unpaired MSA works (example: CASP target H1065)
● Success: 1/5
● Success: ~3/5
Combining paired+unpaired helps (example: CASP target H1065)
● Unpaired+Paired MSA: success 5/5
Homo/hetero-oligomer
[Figure: PAE plot by position and chain, 2:2 stoichiometry]
Homo/hetero-oligomer - D-methionine transport system
[Figure: PAE (Å) plot by position and chain, chains A-E, 2:1:2 stoichiometry]
What if your protein/complex is too big to fit into memory? - Trim it!
A1-A36,A92-A197
A47-A242 B65-B477
Put it together in PyMOL!
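Trimming specifications written like the ones on this slide can be turned into per-chain residue ranges before cutting the sequences. This is a minimal sketch, assuming the notation means "chain letter plus 1-based residue number, inclusive at both ends". The function names are hypothetical and this is not the ColabFold parser.

```python
import re
from typing import Dict, List, Tuple

_RANGE = re.compile(r"^([A-Za-z])(\d+)-([A-Za-z])(\d+)$")


def parse_trim(spec: str) -> Dict[str, List[Tuple[int, int]]]:
    """Parse a spec such as 'A1-A36,A92-A197' into ranges grouped by chain."""
    kept: Dict[str, List[Tuple[int, int]]] = {}
    for part in re.split(r"[,\s]+", spec.strip()):
        match = _RANGE.match(part)
        if match is None:
            raise ValueError(f"unrecognised range: {part!r}")
        chain_a, start, chain_b, end = match.groups()
        if chain_a != chain_b or int(start) > int(end):
            raise ValueError(f"invalid range: {part!r}")
        kept.setdefault(chain_a, []).append((int(start), int(end)))
    return kept


def trim(sequence: str, ranges: List[Tuple[int, int]]) -> str:
    """Keep only the listed 1-based inclusive ranges of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in ranges)
```

Each trimmed piece can then be predicted separately and the resulting models assembled afterwards.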
Conclusions
● AlphaFold can be “hacked” into predicting protein-protein complexes.
● This is an unintended use and should only be used as a hypothesis-generation tool that needs further experimental validation.
Acknowledgments Coauthors
Martin Milot
Sergey
Some AlphaFold use cases
Alex Bateman
Pfam use case: Calmodulin_bind (PF07887)
• Found in plants
• Involved in stress responses
• Built family in 2004
Seed alignment
• No PDB structure
Structural model for Calmodulin_bind (PF07887)
[Figure: structural model with regions labelled 1, 2 and 3]
Structural model for Calmodulin_bind (PF07887)
[Figure: regions 2 and 3, shown with a +GG linker and a +GGGG linker]
Negative control: pLDDT values for a spurious protein
• KKKKDDDDKKKKDDDDKKKKDDDDKKKKDDDDKKKKDDDDKKKKDDDDKKKKDDDD
Acknowledgements
[Figure labels: EFEMP2; TM region; top pocket prediction; kinase domain; long disordered C-term]
Pocket detection on the full models is likely to result in false positives and false negatives due to
modelling of low confidence regions and interactions. We tested this on a benchmark dataset.
Pocket detection and how to filter the models
We retained 230 of 304 proteins from a dataset by Clark et al., 2020. Pocket detection was
performed using ghecom (Kawabata, 2010), as done previously in (Clark et al., 2020).
Bálint Mészáros
EMBL Heidelberg
08/09/2021
AlphaFold2 indicates the presence of IDRs
AF2 generates coordinates for every residue, even ones that have no fixed structure
human
p53
Two interpretations for low confidence:
• AF2 isn’t good enough to predict the structure
• There is no structure to predict
pLDDT is a good indicator of disordered regions (in this case)
Let’s test the generic case – binary disordered prediction
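A binary disorder prediction from pLDDT can be sketched as a threshold on a smoothed per-residue score. The sketch below is illustrative only: the threshold of 50 is borrowed from the earlier "pLDDT < 50" guidance, the window smoothing is an assumption, and the function names are hypothetical. It is not the method evaluated in this talk.

```python
from typing import List


def smooth(values: List[float], window: int) -> List[float]:
    """Centred moving average; windows are truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def predict_disorder(plddt: List[float],
                     threshold: float = 50.0,
                     window: int = 1) -> List[bool]:
    """True where the (optionally smoothed) pLDDT falls below the threshold."""
    return [score < threshold for score in smooth(plddt, window)]
```

A wider window suppresses isolated low-confidence residues inside otherwise confident regions, at the cost of blurring region boundaries.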
AlphaFold2 as a disorder prediction method
[Figure: pLDDT score distribution on the human proteome]
AlphaFold2 as a disorder prediction method
Structural parameters derived from AF2 serve as excellent predictors of disorder – outperforming our current
workhorse
https://tinyurl.com/AF2-ProViz
AF2 identifies functional sites in IDRs
• Homo-tetramerization region
• Binding site for the MDM2 ubiquitin ligase
AF2 can build complex structures for certain IDRs
• Homo-tetramerization region
• Binding site for the MDM2 ubiquitin ligase
• KID:KIX
• E2F1:DP1:Rb
• RelA:CBP
• collagen trimer
• Right binding site, wrong orientation
• Highly specialized homotrimeric folding
AF2 can build complex structures for certain IDRs
• AF2 can predict IDR complexes but success depends on several factors
• Helps prediction:
• Helical / β bound IDP conformation
• Well defined, hydrophobic binding groove
• Asymmetric bound IDP structure (secondary structural elements along the IDR sequence)
• Hinders prediction:
• Short IDRs
• Irregular bound structure
• Phosphorylation-dependent binding
• Presence of ions in the interface
• Symmetric bound structures (long helices or arrays of short similar structural elements)
• Examples + documentation: https://tinyurl.com/AF2-IDRcomplex
Acknowledgements
Thank you!