CLASSIFICATION OF PROTEIN STRUCTURES
Comparing Protein Structures: Why?
• detect evolutionary relationships
• identify recurring motifs
• detect structure/function relationships
• predict function
• assess predicted structures
• classify structures (used for many purposes)
Structure is more conserved than sequence
28% sequence identity
Chain/Domain Library
Hundreds of thousands of gene sequences are
translated to proteins (SwissProt, PIR)
~35,000 solved structures (PDB) as of March, 2006
Goals:
Predict structure from sequence
Predict function based on sequence
Predict function based on structure
Fig. 1. Examples of structural alignments obtained with MAMMOTH
Angel R. Ortiz et al. Protein Sci 2002; 11: 2606-2621
(A) Alignment of 1pts_A with 1mup. The structural alignment score is 9.52;
(B) Structural alignment of 1pgb with 5tss_A. The score in this case is 6.29.
• Recognizing Structural Similarity
• GOAL: Of all solved structures, find the structure or
substructure most similar to a protein of interest
• By eye: tried and true, but requires an expert viewer
with a GREAT memory!
• Automated detection: good for database searching
• How would you do this?
Features of automated structure comparison
1. What representation will you use for the protein?
2. How will you assess structural similarity?
3. How will you search the possible comparisons?
4. How significant is a “hit”?
»Example: Superposition to minimize RMSD
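• 1. Define measure of similarity: $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lvert x_i - y_i \rvert^2}$,
where $x_i$ and $y_i$ are the positions of the $N$ corresponding atoms in the two structures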
• 2. Determine correspondence between residues of each protein
(e.g. by sequence alignment, or a guess)
• 3. Align centers of mass
• 4. Use matrix methods to solve for the rotation that gives minimal
RMSD (variety of methods available)
• 5. Evaluate the resulting number
• 6. Refine the alignment
• 7. iterate
»Very useful. Commonly used for comparing similar structures.
»But… Not a good choice when proteins are only partially similar. Why?
»Also, points far from center of mass are weighted more heavily.
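The superposition steps above can be sketched in a few lines. This is a minimal illustration, not the method of any particular program: it assumes the Kabsch (SVD-based) solution as one of the "variety of methods available", weights all atoms equally, and assumes the residue correspondence is already known (row i of one array pairs with row i of the other). The function names `rmsd` and `superpose` are made up for this example.

```python
import numpy as np


def rmsd(a, b):
    """Root-mean-square deviation between two (N, 3) coordinate arrays."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt((diff * diff).sum() / len(diff)))


def superpose(mobile, target):
    """Rigidly superpose `mobile` onto `target`; rows are assumed to be paired.

    Returns the transformed copy of `mobile` and the RMSD after fitting.
    """
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    # Step 3: align centers of mass (equal weights assumed).
    mob_c = mobile - mobile.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    # Step 4: rotation from the SVD of the covariance matrix (Kabsch).
    u, _, vt = np.linalg.svd(mob_c.T @ tgt_c)
    # Flip the last axis if needed so the result is a proper rotation,
    # not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    fitted = mob_c @ rot.T + target.mean(axis=0)
    # Step 5: evaluate the resulting number.
    return fitted, rmsd(fitted, target)
```

Steps 6 and 7 (refining the correspondence and iterating) are not shown; they would call `superpose` again on an updated set of paired residues.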
Algorithms for detecting structure similarity
Dynamic Programming
- works on 1D strings, so the problem must be reduced to this form
- can't accommodate topological changes
- example: Sequential Structure Alignment Program (SSAP)
3D Comparison/Clustering
- identify secondary structure elements or fragments
- look for a similar arrangement of these between different structures
- allows for different topology, large insertions
- example: Vector Alignment Search Tool (VAST)
Distance Matrix
- identify contact patterns of groups that are close together
- compare these for different structures
- fast, insensitive to insertions
- example: Distance-matrix ALIgnment (DALI)
Unit vector RMS
- map structure to sphere of vectors
- minimize the difference between spheres
- fast, insensitive to outliers
- example: Matching Molecular Models Obtained from Theory (MAMMOTH)
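To make the distance-matrix idea concrete, the sketch below builds the internal distance matrix of a structure and shows why no superposition is needed: internal distances do not change when the structure is rotated or translated. This illustrates the representation only. It is not DALI's scoring function, and the names `distance_matrix`, `contact_map`, `matrix_difference` and the 8 Å cutoff are assumptions chosen for the example.

```python
import numpy as np


def distance_matrix(coords):
    """All-against-all distances between the rows of an (N, 3) array."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def contact_map(coords, cutoff=8.0):
    """Boolean matrix marking pairs closer than `cutoff`.

    The 8.0 default (in angstroms) is a hypothetical choice, not a value
    taken from the slides.
    """
    return distance_matrix(coords) < cutoff


def matrix_difference(coords_a, coords_b):
    """Mean absolute difference between two equally sized distance matrices."""
    delta = distance_matrix(coords_a) - distance_matrix(coords_b)
    return float(np.abs(delta).mean())
```

For two copies of the same structure in different orientations, `matrix_difference` is zero up to rounding, whereas a raw coordinate RMSD without superposition would be large.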
Structural Classification of Proteins
• Structure vs. structure comparisons (e.g. using DALI)
reveal related groups of proteins
• Structurally-similar proteins with detectable sequence
homology are assumed to be evolutionarily related
• Similarities between non-homologous proteins suggest
convergent evolution to a favorable or useful fold
• A number of different groups have proposed classification schemes
– SCOP (by hand)
– CATH (uses SSAP)
– FSSP (uses Dali)
Classification of structures
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
(domains, good annotation)
CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
CE: http://cl.sdsc.edu/ce.html
Dali Domain Dictionary: http://columba.ebi.ac.uk:8765/holm/ddd2.cgi
FSSP: http://www2.ebi.ac.uk/dali/fssp/
(chains, updated weekly)
HOMSTRAD: http://www-cryst.bioc.cam.ac.uk/~homstrad/
HSSP: http://swift.embl-heidelberg.de/hssp/
SCOP Hierarchy of Structures
Class: upper hierarchy
Family: evolutionarily related with a significant sequence identity (2327 in SCOP)
Superfamily: different families whose structural and functional features
suggest common evolutionary origin (1294 in SCOP)
Fold: different superfamilies having same major secondary structures in
same arrangement and with same topological connections (800 in SCOP)
Classification of structural data (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/
Pennisi, E. (1998) Science 279, 978
Hubbard et al. (1999) Nucleic Acids Res 254.
605 folds
947 superfamilies
1557 families
12794 proteins
PIR web site: http://pir.georgetown.edu
Statistics from July 2005
→ 945 FOLDS
→ 1539 SUPERFAMILIES
→ 2845 FAMILIES
→ 70859 DOMAINS
Classification of Protein Structure: SCOP
http://scop.mrc-lmb.cam.ac.uk/scop/
http://scop.berkeley.edu/
SCOP is organized into 4 hierarchical layers:
(1) Classes:
(2) Folds: Major structural similarity
Proteins are defined as having a common fold if they have the
same major secondary structures in the same arrangement and
with the same topological connections
(3) Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and
functional features suggest that a common evolutionary origin is probable,
are placed together in superfamilies
(4) Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related.
Generally, this means that pairwise residue identities between the proteins
are 30% or greater
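The 30% rule of thumb for families can be checked on a pairwise alignment. The sketch below is one possible way to compute pairwise residue identity. It assumes the two sequences are already aligned, uses "-" as the gap character, and divides by the number of columns where both sequences have a residue; other denominators are also in use, so treat this definition as an assumption. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def meets_family_threshold(aligned_a, aligned_b, threshold=30.0):
    """True if the pair reaches the slide's 30% identity rule of thumb."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Because the slide says "generally", a pair below the threshold is not automatically excluded from a family; the check only flags the clear cases.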
Classification of Protein Structure: CATH
http://www.biochem.ucl.ac.uk/bsm/cath/
[Figure: CATH hierarchy diagram. Class (C): Alpha, Mixed Alpha Beta, Beta. Other labels in the figure: Barrel, Super Roll, Sandwich, TIM Barrel, Other Barrel.]
The DALI Domain Dictionary
http://www.ebi.ac.uk/dali/domain/
• All-against-all comparison of PDB90 using
DALI
• Define score of each pair as a Z-score
• Regroup proteins based on pair-wise
score:
– Z-score > 2: “Folds”
– Z-score >4, 6, 8, 10 : sub-groups of “folds”
(different from Families, and sub-families!)
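The regrouping by Z-score cutoff can be illustrated with a small sketch. The slides give only the thresholds, not the clustering rule, so single linkage (two structures share a group if a chain of pairs above the cutoff connects them) is an assumption made here for simplicity. The function name `group_by_zscore` and the input layout (a dict of pairwise Z-scores) are hypothetical as well.

```python
def group_by_zscore(names, pair_z, threshold):
    """Group structures by single linkage on pairwise Z-scores.

    names: list of structure identifiers.
    pair_z: dict mapping (name_a, name_b) to the Z-score of that pair.
    threshold: pairs with Z-score strictly above this value are linked.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), z in pair_z.items():
        if z > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())
```

Running the same function at cutoffs 2, 4, 6, 8 and 10 yields progressively finer groups nested inside the Z > 2 "folds", which mirrors the sub-group levels listed above.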
Summary
• Classification is an important part of biology; protein structures are not
exempt
• Prior to being classified, proteins are cut into domains
• While all structural biologists agree that proteins are usually a collection of
domains, there is no consensus on how to delineate the domains
• There are three main protein structure classifications:
- SCOP (manual)
source of evolutionary information
- CATH (semi-automatic)
source of geometric information
- FSSP (automatic)
source of raw data