This document describes an automated method to assign protein classifications in the SCOP and CATH databases based on scores from the FSSP database. The method, called Classification by Optimization (CO), finds the SCOP fold and CATH topology classification for a protein that minimizes the cost, defined based on FSSP Z-scores. The classifications assigned by CO have a high success rate when ambiguous cases are excluded. These ambiguous cases represent the inherent limitations of structure-based protein classification.


PROTEINS: Structure, Function, and Genetics 46:405–415 (2002)

DOI 10.1002/prot.1176

Automated Assignment of SCOP and CATH Protein Structure Classifications From FSSP Scores

Gad Getz,1 Michele Vendruscolo,2 David Sachs,3 and Eytan Domany1*
1 Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot, Israel
2 Oxford Centre for Molecular Sciences, New Chemistry Laboratory, Oxford, United Kingdom
3 Department of Physics, Princeton University, Princeton, New Jersey

ABSTRACT We present an automated procedure to assign CATH and SCOP classifications to proteins whose FSSP score is available. CATH classification is assigned down to the topology level, and SCOP classification is assigned to the fold level. Because the FSSP database is updated weekly, this method makes it possible to update also CATH and SCOP with the same frequency. Our predictions have a nearly perfect success rate when ambiguous cases are discarded. These ambiguous cases are intrinsic in any protein structure classification that relies on structural information alone. Hence, we introduce the “twilight zone for structure classification.” We further suggest that to resolve these ambiguous cases, other criteria of classification, based also on information about sequence and function, must be used. Proteins 2002;46:405–415. © 2002 Wiley-Liss, Inc.

Key words: protein structure; protein databases; CATH; FSSP; SCOP; classification; clustering

Grant sponsor: Minerva Foundation; Grant sponsor: Germany-Israel Science Foundation; Grant sponsor: US-Israel Science Foundation (BSF).
*Correspondence to: Eytan Domany, Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel. E-mail: [email protected].
Received 23 February 2001; Accepted 13 July 2001
Published online xx January 2002

INTRODUCTION

The first step to analyze the vast amount of information provided by genome sequencing projects is to organize proteins (the gene products) into classes with similar properties. Because during evolution protein structures are much more conserved than sequences and functions,1 proteins are usually classified first by their structural similarity (phenetic classification) and then by the similarity of their sequences or by the similarity of their functions (phylogenetic classification).2

A reliable structural classification scheme is useful for several reasons. Perhaps the most exciting perspective is the possibility to routinely assign a function to newly identified genes.3 This goal may be achievable because a classified database provides a library of representative structures to perform prediction of protein structure by homology4,5 or by threading,6–8 and it allows for the identification of distant evolutionary relationships.9 In addition, given a particular protein, it provides a tool to identify other proteins of similar structure and function.10 The knowledge of the structure helps to reveal the mechanism of molecular recognition involved in catalysis, signaling, and binding2 and may lead to the rational design of new drugs.11 At a more abstract level, the physical principles dictating structural stability of proteins are revealed by their folded state. Therefore, most of the recently proposed methods to derive energy functions to perform protein fold predictions rely in different ways on structural data.12,13

The most comprehensive repository of three-dimensional structures of proteins is the Protein Data Bank (PDB).14 The number of released structures is increasing at the pace of about 50 per week, and >12,000 complete sets of coordinates were available at the time of writing. Many research groups maintain web-accessible hierarchical classifications of PDB entries. The most widely used are FSSP,15 CATH,16 SCOP,17 HOMSTRAD,18 MMDB,19 and 3Dee20 (see Table I for a list of abbreviations). Here we consider three of these: the FSSP, the CATH, and the SCOP databases. Each group has its own way to compare and classify proteins; these three classification schemes are, however, consistent with each other to a large extent.21,22

FSSP Database

The FSSP (Fold classification based on Structure-Structure alignment of Proteins) uses a fully automated structure comparison algorithm, DALI (Distances ALIgnment algorithm),23,24 to calculate a pairwise structural similarity measure (the S-score) between protein chains. The algorithm searches for that amino acid alignment between the two protein chains that yields the most similar pair of Cα distance maps. In general, the more geometrically similar two chain structures are, the higher their S-score is. The mean and standard deviations of the S-scores obtained for all the pairs of proteins are evaluated. Shifting the S-scores by their mean and rescaling by the standard deviation yield the statistically meaningful Z-scores.

For classification of structures, the FSSP uses the Z-scores for all pairs in a representative subset of the PDB. A fold tree is generated by applying an average-linkage hierarchical clustering algorithm25 to this all-against-all
Z-score matrix. An alternate classification based on a more common four-level hierarchy is also available.24

TABLE I. Abbreviations and Definitions

Abbreviation   Definition
3Dee           Database of protein domain definitions
ASTRAL         The ASTRAL compendium for sequence and structure analysis
CATH           Protein structure classification
CO             Classification by optimization
DALI           Protein structure comparison by alignment of distance matrices
DHS            Dictionary of homologous superfamilies
FSSP           Fold classification based on structure-structure alignment of proteins
HOMSTRAD       Homologous structure alignment database
MMDB           Molecular modeling database
PDB            Protein data bank
SCOP           Structural classification of proteins
SSAP           Structure comparison algorithm

CATH Database

Orengo and coworkers use a combination of automatic and manual procedures to create a hierarchical classification of domains (CATH).16 They arrange domains in a four-level hierarchy of families according to the protein class (C), architecture (A), topology (T), and homologous superfamily (H). The class level describes the secondary structures found in the domain26 and is created automatically. There are four class types: mainly-α, mainly-β, α-β, and proteins with few secondary structures (FSS). The architecture level, on the other hand, is assigned manually (using human judgment) and describes the shape created by the relative orientation of the secondary structure units. The shape families are chosen according to a commonly used structure classification (e.g., barrel, sandwich, roll, etc.). The topology level groups together all structures with similar sequential connectivity between their secondary structure elements. Structures with high structural and functional similarity are put in the same fourth-level family, called homologous superfamily. Both the topology and homologous superfamily levels are assigned by thresholding a calculated structural similarity measure (SSAP) at two different levels, respectively.27,28 The CATH database has been recently linked to the Dictionary of Homologous Superfamilies (DHS) database,29 which allows further analysis of structural and functional features of evolutionarily related proteins. There is a growing need for annotating proteins classified in structural databases because structural genomic initiatives are providing a large number of new proteins whose function might be gathered from distant homology information.

SCOP Database

The Structural Classification of Proteins (SCOP)17 database is organized hierarchically. The lower two levels (family and superfamily) describe near and distant evolutionary relationship, the third (fold) describes structural similarity, and the top level (class) describes the secondary structure content.26 SCOP is linked to the ASTRAL compendium,30 which provides a series of tools for further analysis of the classified structures, mainly through the use of their sequence. At variance with FSSP and CATH, SCOP is constructed manually, by visual inspection and comparison of not only structures but also sequences and functions.

Automated Assignment of SCOP and CATH Classifications

In this work we present a method, Classification by Optimization (CO), to predict without human intervention the SCOP fold level and the CATH topology level from the FSSP pairwise structure similarity score. A protein for which the Z-score is available is classified into a SCOP fold and into a CATH topology by the CO method, an optimization procedure that finds the assignment of minimal cost, where the cost is defined in terms of Z-scores (see Materials and Methods). The query for the classification of any such protein can be submitted to the web site.31

RESULTS

Consistency of the FSSP, CATH, and SCOP Classifications

We found that the FSSP and CATH databases are consistent.21 In this section we show that SCOP is also consistent with these to a large extent (see also Ref. 22). In the rest of this work we use this fact to derive an automated procedure to assign the CATH and SCOP classifications starting from the FSSP Z-scores (which are updated weekly) in a fully automated fashion to include new releases in the PDB.1 Here we further discuss the consistency of the three classification schemes by introducing concepts and quantities that are later used in the prediction of the CATH and SCOP classifications.

We first illustrate the correlation between the FSSP similarity score and the CATH classification. A simple and visually appealing way to study this problem is shown in Figure 1. The element Zij of the Z-score matrix [Fig. 1(a)] represents the score for superimposing structure i with structure j of the set PFrCs (a subset of the proteins in FSSP and CATH, see Table III and Materials and Methods) using the DALI algorithm.23,24 In Figure 1(a) only the pairs with Z > 2 are shown; therefore, the matrix is sparse and the proteins are ordered in a random fashion. Figure 1(b) is produced by reordering the rows and columns of the original Z-score matrix [Fig. 1(a)]. The reordering is performed according to the CATH classification in the following way: for each of the proteins in this set we have the CATH classifications at all levels. First, we order the proteins by their class; within the class, by the architecture; within it by the topology, and so on. This reordering generates a permutation of the columns and rows of the Z matrix. The solid black grid in Figure 1(b) separates the proteins according to their CATH class, and a thin grid is placed at the boundaries between architectures.

Figure 1(b) shows the underlying order behind the apparent randomness of Figure 1(a) and reveals the extent

Fig. 1. a: Z-score matrix between all pairs of proteins in the PFrCs set. A black dot represents Z > 2.0. b: Same Z-score matrix with rows and columns rearranged by using the CATH classification (see text). Part (b) shows the underlying order behind the apparent randomness of part (a) and illustrates the extent to which the FSSP Z-scores reflect the CATH classification. The regions A, B, C, and D are discussed in the text.
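The two operations behind this figure, standardizing S-scores into Z-scores and permuting the matrix by CATH classification, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`standardize`, `reorder`, `dots`), the label tuples, and the toy scores are all hypothetical, and the choice to take the mean and standard deviation over off-diagonal pairs only is an assumption.

```python
from statistics import mean, pstdev

def standardize(s):
    """Turn pairwise S-scores into Z-scores: subtract the mean, divide by the std.

    Assumption: statistics are taken over off-diagonal pairs only, and the
    diagonal is simply set to 0.0.
    """
    n = len(s)
    off_diag = [s[i][j] for i in range(n) for j in range(n) if i != j]
    mu, sigma = mean(off_diag), pstdev(off_diag)
    return [[(s[i][j] - mu) / sigma if i != j else 0.0 for j in range(n)]
            for i in range(n)]

def reorder(z, labels):
    """Sort proteins by their (class, architecture, topology) tuple and
    apply the same permutation to the rows and columns of z."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return order, [[z[i][j] for j in order] for i in order]

def dots(z, threshold=2.0):
    """True wherever Z exceeds the display threshold (a black dot in the figure)."""
    return [[v > threshold for v in row] for row in z]

# Toy, made-up data: four proteins carrying two hypothetical label tuples.
labels = [(2, 60, 40), (1, 10, 8), (2, 60, 40), (1, 10, 8)]
z = [[0.0, 0.5, 9.0, 0.1],
     [0.5, 0.0, 0.3, 7.0],
     [9.0, 0.3, 0.0, 0.2],
     [0.1, 7.0, 0.2, 0.0]]
order, z_sorted = reorder(z, labels)
picture = dots(z_sorted)
```

After reordering, the high-scoring pairs of the toy matrix fall into two diagonal blocks, which is the small-scale analogue of the structure visible in part (b).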

to which the FSSP Z-scores reflect the CATH classification.

Several interesting observations can be made. First, consider the Class level of CATH. As can be seen in Figure 1(b), there are no matrix elements with Z > 2.0 in region A that connect proteins of the mainly-α class to the mainly-β class. At variance with this, some proteins from both of these classes have large Z-scores with proteins from the α-β class (region B). This is reasonable, because of the way similarity is defined by FSSP; a mainly-α protein can have a high Z-score with an α-β protein because of high similarity with the α part. Second, in the Architecture level, we observe that there are architecture families that are highly connected within themselves, e.g., α-β barrels (482–525: region C), whereas for others the intrafamily connections are more sparse. The similarities within the mainly-β sandwich family (318–406: region D) have two relatively distinct subgroups, which suggest an inner structure corresponding to the lower levels in the CATH hierarchy. Checking the topology level (the third CATH level) for this architecture, one indeed finds two large topology subfamilies, the immunoglobulin-like proteins (324–366: upper left part of region D) and the Jelly-Rolls (373–402: lower right part of region D), which correspond precisely to the two strongly connected subgroups that appear in Figure 1(b).

We found that the CATH classification at the level of topology is reflected in the Z-matrix. This is to be expected because the Z-score measures the structural similarity of two aligned proteins while preserving their connectivity. Overall, this analysis shows that the Z-matrix is correlated with the CATH classification. In a similar way it is possible to show that the Z-score is correlated with the SCOP classification. The results are available at the web site.31

These findings suggest that Z-scores can be used to predict the CATH and SCOP classifications of yet unclassified proteins. In what follows, we demonstrate that this indeed can be done. We also estimate the success rate of our predictions and provide a web site31 that can be used to retrieve our predictions for the CATH topology and the SCOP fold for new entries in FSSP.

We also verified that the CATH and SCOP classifications are to a large extent mutually compatible. An immediate consequence of this is that it is possible to construct a “translation table,” T̂, from the proteins that have already both a CATH and a SCOP classification. In this way, given a CATH entry, one can obtain the corresponding SCOP classification (see Fig. 2). Row i of the table refers to a particular CATH topology and column j to a particular SCOP fold. The element T̂ij of the table is the measured fraction of times that a protein has a CATH topology i and a SCOP fold j. This number is calculated by enumerating all the 10,197 single-domain proteins with known CATH and SCOP classifications (PCsSs), and it is an estimate of Tij, the joint probability distribution for a protein to have CATH topology i and SCOP fold j. If the CATH and SCOP classifications had been independent, every element Tij could have been expressed as a product of Ci, the fraction of proteins that belong to CATH topology i, and Sj, the fraction that belongs to SCOP fold j, that is, Tij = Ci * Sj. Randomly placing 10,197 proteins using such a probability distribution yields 4780 ± 40 nonzero elements in the matrix. In the other extreme case, if there had been a full correspondence between the SCOP and CATH classifications, the table would have had a single nonzero element in each row and column (in each CATH topology row the nonzero element would have been in that SCOP fold column that corresponds to it). In this case, the proteins in PCsSs would have been distributed among 284

Fig. 2. Translation table from the CATH topology to the SCOP fold and vice versa. Nonzero entries of T̂ij
appear as black dots. T̂ij is proportional to the number of proteins of CATH topology i that have a SCOP fold j in
PCsSs.
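How such a translation table is built and used can be sketched as follows. This is an illustrative outline only: the function names and the topology and fold labels (`T1`, `F1`, and so on) are hypothetical placeholders rather than real CATH or SCOP identifiers, and the toy list stands in for the set of proteins with both classifications known.

```python
from collections import Counter

def translation_table(pairs):
    """Estimate T-hat: the fraction of proteins having CATH topology i and
    SCOP fold j, for every (i, j) combination that actually occurs."""
    counts = Counter(pairs)
    n = len(pairs)
    return {key: c / n for key, c in counts.items()}

def best_fold_for_topology(table):
    """For each CATH topology, pick the SCOP fold with the largest table entry."""
    best = {}
    for (topo, fold), frac in table.items():
        if topo not in best or frac > table[(topo, best[topo])]:
            best[topo] = fold
    return best

def translation_accuracy(pairs, mapping):
    """Fraction of proteins whose SCOP fold equals the fold chosen for their topology."""
    hits = sum(1 for topo, fold in pairs if mapping[topo] == fold)
    return hits / len(pairs)

# Toy, made-up (CATH topology, SCOP fold) pairs for eight proteins.
pairs = [("T1", "F1")] * 3 + [("T1", "F2")] + [("T2", "F3")] * 4
table = translation_table(pairs)
mapping = best_fold_for_topology(table)
accuracy = translation_accuracy(pairs, mapping)
```

In the toy data the table has three nonzero entries for two topologies, so the correspondence is close to, but not exactly, one-to-one; the same kind of count on the real set is what the nonzero-element figures in the text refer to.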

nonzero elements (the number of distinct CATH topologies in PCsSs).

We found 369 nonzero elements in T̂, meaning that the CATH and SCOP classifications are highly dependent. Still, the correspondence is not entirely one-to-one; in general, more than one SCOP fold corresponds to a given CATH topology. The number of such folds is, however, typically small. Such a translation table may be used to predict the SCOP classification of a structure already classified in CATH or at least to significantly restrict the number of possibilities and vice versa. For example, the assignment of the CATH topology to a protein with known SCOP fold can be done by selecting the CATH topology with the largest value in the translation table for that particular SCOP fold. Such an assignment is correct in 93% of the cases. The corresponding assignment of the SCOP fold from the CATH topology is correct in 82% of the cases. Although this is possibly useful information, in this work we do not assign classifications in this way.

SUMMARY OF THE CO CLASSIFICATION PERFORMANCE

Every time the FSSP Z-scores are updated (once a week) the CO classification can be applied to all the proteins that appear in the new FSSP release but are not yet classified in CATH or in SCOP. The possible outcomes of the classification procedure are as follows:

1. Correct classification: the predicted classification will agree with the future release of the databases.
2. Rejection: the program is unable to classify the structure.
3. Ambiguous classification: a classification is returned (both for CATH and SCOP), but a later release provides a different classification.

The frequencies of these outcomes greatly depend on the statistics of the set of proteins to be classified. More specifically, rejected proteins are of two types: proteins that do not have high Z-scores with any other proteins (“islands”; see Materials and Methods) and clusters of proteins that are similar among themselves but do not have high Z-scores with other proteins outside their cluster (“superislands”). The fraction of islands and superislands is a feature of the particular set of proteins to be classified. The occurrence of a superisland suggests that a new classification type (a new topology for CATH and a new fold for SCOP) might be needed. The work of maintaining CATH and SCOP can thus be focused on the classification of a representative from each of these superislands.

For the set PFCs, the fraction of islands and superislands is 5%. We used this set to provide an upper bound for the performance of the CO method (see below); however, for the set PFC̄ the fraction of rejections goes up to 22%. If rejections are not counted, we classify correctly 98% of the PFCs proteins. On the other hand, we could test our predictions also against the new CATH release v2.0. Of 1582 proteins that were assigned to previously existing CATH topologies, CO has classified correctly 80%. The difference in success rates between PFCs and PFC̄ is due to the different way in which the test set is nested in the larger set of structures with known classification. In the first case, the test set consisted of 20% of the members of PFCs, selected at random; the remaining 80% were used to “predict” the classification of the test set. In the second case, the members of CATH v1.7 were used to predict the classification of the new proteins that were added when

Fig. 3. a: Z-score matrix between all pairs of proteins in the combined PFrCs + PFC̄ sets. The submatrix in the upper left corner is the reordered Z-score matrix of the set PFrCs, which was already shown in Figure 1(b). The rest of the matrix presents the Z-scores for the proteins in the set PFC̄. b: The same matrix as in (a) with the rows and columns relative to the proteins in PFC̄ reordered according to our assignment of their CATH topology. With the CO method, the original order in the submatrix PFrCs is propagated to the entire matrix.
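The idea of propagating known classifications through Z-scores, together with the two quality measures used in the text (success rate and purity), can be illustrated with a deliberately simplified stand-in. The sketch below is not the CO method, whose cost function is defined in Materials and Methods; it is a plain best-neighbor rule with a rejection threshold, and the threshold value, function names, protein identifiers, and scores are all assumptions made for illustration.

```python
def classify_by_best_neighbor(z_row, known_labels, z_min=2.0):
    """Return the label of the highest-scoring classified neighbor with
    Z > z_min, or None (reject) when no such neighbor exists."""
    best_label, best_z = None, z_min
    for neighbor, z in z_row.items():
        if neighbor in known_labels and z > best_z:
            best_label, best_z = known_labels[neighbor], z
    return best_label

def success_rate_and_purity(predicted, truth):
    """Success rate: correct / all test proteins.
    Purity: correct / proteins that were not rejected."""
    correct = sum(1 for p in truth if predicted[p] == truth[p])
    answered = sum(1 for p in truth if predicted[p] is not None)
    return correct / len(truth), (correct / answered if answered else 0.0)

# Toy, made-up data: two classified proteins and four test proteins.
known = {"a": "T1", "b": "T2"}
z_rows = {
    "x": {"a": 9.0, "b": 3.0},   # closest to "a"
    "y": {"a": 1.0},             # no neighbor above threshold: rejected
    "z": {"b": 8.0, "x": 12.0},  # "x" is unclassified, so only "b" counts
    "w": {"a": 2.5, "b": 6.0},   # closest to "b", although its true label is T1
}
truth = {"x": "T1", "y": "T1", "z": "T2", "w": "T1"}
predicted = {p: classify_by_best_neighbor(row, known) for p, row in z_rows.items()}
success, purity = success_rate_and_purity(predicted, truth)
```

Protein `y` behaves like an island (it is rejected), while `w` is answered but wrong, so in this toy case purity is higher than the success rate, which is the same relation the text reports for CO.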

CATH v2.0 was released. These new structures are not distributed uniformly at random among the members of CATH v1.7.

Ambiguous classifications are due to two different mechanisms. The first stems from a well-known problem with the way the FSSP similarity index is calculated (the “Russian doll effect”; see below). The second kind of “mistake” is actually not a wrong classification; rather, it happens when the newly classified structure lies within the ambiguous “twilight zone” between two closely related topologies (for CATH) or folds (for SCOP), as demonstrated in detail below.

Automated Assignment of CATH Classification From FSSP

In this section we describe the procedure that we used to predict CATH topology level from the FSSP scores. We identified a set of 7431 proteins (PFC̄; see Materials and Methods) that appear in FSSP but were not yet processed by CATH 1.7. Our goal is to predict the CATH topology of these 7431 proteins by using (a) the Z-scores between all proteins in PF (see Materials and Methods) and (b) the known classifications of the set PFrCs (see Materials and Methods).

Predicting topologies is a classification problem that we treated with pattern recognition tools. We tested several prediction algorithms using cross-validation to estimate their performance.21 Every one of the algorithms that were tested can be viewed as a two-stage process. In the first stage, a new similarity measure is produced from the original Z-scores. This is done either by a direct rescaling of the original Z-scores or by using the results of various hierarchical clustering methods to produce new similarity measures. The second stage consists of using these similarities as the input to some classification method, yielding predictions for the classes and architectures. In this work we present only results obtained by one particular method (CO), which uses the original Z-score as a similarity measure (see Materials and Methods). A complete list of the results obtained by using other methods can be found in Ref. 21, which is available on the web site.31

Our final assignments for the set PFC̄ using the CO method are listed in the web site. A more illustrative way to present these results is shown in Figure 3. In Figure 3(a) we present the Z-score matrix for the combined set PFrCs + PFC̄. The submatrix in the upper left corner is the reordered Z-score matrix of the set PFrCs, which was already shown in Figure 1(b). The rest of the matrix in Figure 3(a) presents the Z-scores of PFrCs with the set PFC̄ (randomly ordered) and the Z-scores of PFC̄ among themselves. In Figure 3(b) we reordered the rows and columns whose index was >860, corresponding to proteins in PFC̄. Although in the matrix of Figure 3(a) these proteins appear in a random order, in Figure 3(b) they appear in the order imposed by our prediction of their CATH topology. One can see that the original order in the submatrix PFrCs is propagated by our assignment procedure to the set PFC̄. For example, focus on the small black square at the upper left corner of the matrix. This small black square represents the high Z-scores among the mainly-α class of proteins in PFrCs. In the corresponding top rows of the full matrix we see high Z-scores between these structures and some proteins from PFC̄. In particular, the small group with indices near 2476 are “close” to these mainly-α structures and hence are also classified as such. On the other hand, there is a large group of structures from PFC̄ (between 861 and 2476), which do not have high Z-scores with any of the proteins in PFrCs or with any of the other structures in PFC̄ with index >2476. Hence,
we are unable to classify this group of structures on the basis of their FSSP scores.

Figure 3(b) illustrates the central idea of this work. We perform a task that is intermediate between clustering and classification. We take proteins of known classification and we use them as fixed a priori values in a clustering procedure.

The overall success rate of our prediction estimated by cross-validation was 93%. To understand the significance of these success rates, we derived a statistical (see Materials and Methods) upper bound for this kind of prediction. This upper bound is 95% (see Materials and Methods), hence the figure of 93/95 = 98% given above.*

* One must keep in mind that the estimated success rate is calculated for all proteins; both FSSP representatives (≈10% of the proteins) and nonrepresentatives. Because the presence of homologous proteins can create a bias in these estimates, we also tested the success rate of predicting the CATH topology only for the FSSP representatives, which yielded 63%, to be compared with the corresponding upper bound of 74%.

We estimated the accuracy of the prediction by using the following procedure. First, the set PFCs was randomly “diluted”; that is, we randomly chose a certain fraction of the proteins in PFCs and placed them in a test set, pretending that we did not know their classification. The FSSP scores of the entire set were then used to classify the test set. For each protein from the test set, we either return a predicted classification or reject the protein (i.e., we declare that we are unable to classify it). The quality of any classification algorithm (see Materials and Methods) is measured by its success rate (fraction of correctly classified proteins, out of the test set) and by the purity (success rate out of the nonrejected proteins). For the CO method, the results were 93% for the success rate and 98% for the purity (using a dilution of 20%). More extensive tests at other dilutions and for other methods of classification are discussed in Ref. 21 and available at the web site.31

We also tested directly the reliability of the CO assignments by using the CATH version 2.0 (PC2). In PC2, 1640 single-domain proteins that are present in PFC̄ were assigned to one of the topologies that existed in v1.7. Fifty-eight of these we “rejected.” In 1266 cases of the remaining 1582 (80%), our prediction agrees with the one given in CATH v2.0. Almost all the cases in which we misassigned a domain can be explained in a simple way. These cases are discussed in detail in a following section.

The CO method can also be used to predict directly the C level and the A level of CATH. We found that when the C and A levels were predicted as a byproduct of predicting the T level, the resulting C and A were consistent with those predicted directly.

Automated Assignment of SCOP Classification From FSSP

We used the CO method to predict the SCOP fold for a set of 3451 proteins (PFS̄) that belong in PF but not yet in PS. The results are available on the web site.31 The estimated success rate (by cross-validation) was 93%. As in the case of CATH, this number increased when we discarded proteins in the “twilight zone” (see the next section).

Twilight Zone for Protein Classification

The attempt to assign a new protein to a known fold might lead to frustration because at times one is undecided about two or more possibilities. To assess that two proteins have similar structures, a similarity score is needed. FSSP uses the Z-score, CATH uses the SSAP score, and SCOP uses a subjective evaluation, which is also a kind of score. The problem arises when the protein to be classified has high scores with two proteins already classified, but to different topologies. In this article, these proteins are called borders (see Materials and Methods). Being a border protein depends on the similarity score. We showed, however, that FSSP, CATH, and SCOP are to a large extent consistent classifications. Therefore, we suggest that there are “intrinsically” ambiguous cases: cases that are unavoidable in structure comparison. We refer to these ambiguous regions in structure space as the “twilight zone” in analogy with the case of protein sequence comparison where proteins with sequence similarity below 30% cannot be reliably assigned to the same fold. We illustrate this concept by a typical case, shown in Figure 4. This is a border protein. Protein 1dhn (the central one) is the one to be classified (in fact, it is a three-layer sandwich according to CATH). It has a Z-score of 9.3 with protein 1a8rA (on the left), which is a three-layer sandwich topology and a Z-score of 8.7 with protein 1b66A (on the right), which is a two-layer sandwich topology. This example illustrates how structural information alone might not provide a clear-cut criterion for classification of this protein. The incidence of the twilight zone is shown in Figure 5. In Figure 5(a) we present the histogram of the number of protein pairs that have different CATH topologies as a function of their Z-score. This number is a rapidly decaying function of Z. On the contrary, the number of pairs with the same CATH topology is a slowly decaying function of Z. For Z > 3, the probability of having the same CATH topology becomes greater than that of having different topologies. For Z > 7.5, the probability to have the same topology is 97.5%. In Figure 5(b) we show the corresponding figure for SCOP. The number of folds in SCOP is larger than the number of topologies in CATH; therefore, there is more ambiguity. However, also in this case for Z > 7.5, the probability to have the same topology is 93.5%. Taken together, these results indicate that the twilight zone for structure comparison can be bound by Z ≤ 7.

There are other cases in which the classification of a particular protein is inconsistent with that of all its neighbors. For example, proteins that we called colonies (see Materials and Methods) are such that none of their neighbors are of their own kind. This means that the FSSP scores imply that these proteins are similar only to proteins of different classes and architectures. Identifying these proteins can also focus the attention to possible misclassification or to drawbacks of the Z-score. For example, 1 of the 49 colonies (at the architecture level) that

Fig. 4. Center: Protein 1dhn, which has a CATH α β three-layer (ββα) sandwich Aspartylglucosaminidase chain B (3.50.11) topology. Left: Protein 1a8rA, which has also a CATH α β three-layer (ββα) sandwich Aspartylglucosaminidase chain B (3.50.11) topology and has Z-score of 9.3 with protein 1dhn. Right: Protein 1b66A, which has a CATH α β two-layer sandwich Tetrahydropterin Synthase, subunit A (3.30.479) topology and has Z-score of 8.7 with protein 1dhn. This example illustrates how structural information alone might be insufficient to provide a clear-cut criterion for the classification of this protein.

Fig. 5. Twilight zone for protein structure classification. a: The number of protein pairs with a given FSSP Z-score that have different CATH folds is a rapidly decaying function of Z. On the contrary, the number of protein pairs with the same CATH fold is decaying slowly. For Z < 5 there is a non-negligible probability to have different folds. We call this threshold the “twilight zone for structure classification.” b: The corresponding histogram for SCOP folds. The number of SCOP folds is larger than the number of CATH topologies; hence the twilight zone is Z ≃ 7.
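The quantities behind this figure can be sketched as two small functions: the fraction of protein pairs above a Z threshold that share a classification, and a check for the border situation illustrated by 1dhn. The function names and the list of (Z, same-class) pairs are hypothetical; the precise definition of a border is given in Materials and Methods, so the rule used here (neighbors above the threshold belong to more than one class) is an assumption. Only the 1dhn scores and topology codes are taken from the text.

```python
def same_class_fraction(pairs, z_min):
    """Among pairs with Z > z_min, the fraction that share a classification.
    Returns None when no pair exceeds the threshold."""
    selected = [same for z, same in pairs if z > z_min]
    return sum(selected) / len(selected) if selected else None

def is_border(z_row, known_labels, z_min):
    """Assumed rule: a protein is a border when its classified neighbors
    with Z > z_min belong to more than one class."""
    classes = {known_labels[n] for n, z in z_row.items()
               if n in known_labels and z > z_min}
    return len(classes) > 1

# Toy, made-up (Z-score, same topology?) pairs.
pairs = [(2.5, False), (3.1, True), (4.0, False),
         (8.0, True), (9.3, True), (8.7, False)]
fraction_above_7_5 = same_class_fraction(pairs, 7.5)

# The 1dhn case from the text: Z = 9.3 with 1a8rA (topology 3.50.11)
# and Z = 8.7 with 1b66A (topology 3.30.479).
neighbors_1dhn = {"1a8rA": 9.3, "1b66A": 8.7}
topologies = {"1a8rA": "3.50.11", "1b66A": "3.30.479"}
border_at_7_5 = is_border(neighbors_1dhn, topologies, 7.5)
```

With both neighbors above Z = 7.5 and in different topologies, 1dhn is flagged as a border even at a threshold where most pairs share a topology.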

we found in CATH is the PDB entry 1rboC, which is classified as an α-β two-layer sandwich. It has 15 neighbors in PC, 14 of which are classified as mainly-β sandwiches.

We summarize the results about the assignments of the CATH architecture for proteins that already have a CATH classification (PFCs) in a “confusion table” (see Table II). The first column lists the “correct” classification (as given in CATH v1.7 for the test set); the second column gives the assignments by CO (correct, incorrect, or reject), and the third column lists the corresponding percentages. A full list of the inconsistent proteins is available on the web site.31

Another problem is that there are some large Z-scores between proteins of different architectures. Such large Z-scores arise when a protein of one particular architecture has a similar structure to a part of a protein of a different architecture. Swindells et al.32 call the phenomenon of structures within structures the “Russian doll” effect. Such cases are common between architectures of long proteins that contain substructures corresponding to architectures of shorter proteins; for example, there are many two-layer sandwich proteins that resemble a part of three-layer sandwich proteins. Such relationships can occur at the class level [e.g., α-β proteins that contain mainly-α or mainly-β proteins (1rboC, 1hgeA)]. They can also occur at the architecture level within the same class [e.g., α-β complex architecture contains α-β two-layer sandwich (1regX)]. Other inconsistencies occur when proteins fit two architecture definitions.
412 G. GETZ ET AL.

TABLE II. Summary of a “Confusion Table”

Original classification     Assigned classification     Cases (%)

Mainly alpha
1.10 Orthogonal bundle      1.10 Orthogonal bundle           96.3
                            reject                            3.3
1.20 Up-down bundle         1.20 Up-down bundle              97.7
                            4.10 Irregular                    1.2
                            1.10 Orthogonal bundle            0.7
1.25 Horseshoe              1.25 Horseshoe                  100.0
1.50 Alpha-alpha barrel     1.50 Alpha-alpha barrel         100.0

Mainly beta
2.10 Ribbon                 2.10 Ribbon                      93.9
                            reject                            5.7
2.20 Single sheet           2.20 Single sheet                97.2
                            reject                            2.3
2.30 Roll                   2.30 Roll                        97.2
                            reject                            2.1
                            3.10 Roll                         0.7
2.40 Barrel                 2.40 Barrel                      91.0
                            reject                            8.8
2.50 Clam                   2.50 Clam                        94.4
                            2.40 Barrel                       5.6
2.60 Sandwich               2.60 Sandwich                    86.1
                            reject                           13.9
2.70 Distorted sandwich     2.70 Distorted sandwich          96.1
                            2.60 Sandwich                     3.9
2.80 Trefoil                2.80 Trefoil                    100.0
2.90 Orthogonal prism       2.90 Orthogonal prism           100.0
2.100 Aligned prism         2.100 Aligned prism             100.0
2.102 3-layer sandwich      2.102 3-layer sandwich           78.6
                            2.30 Roll                        21.4
2.110 4 Propellor           2.110 4 Propellor               100.0
2.120 6 Propellor           2.120 6 Propellor                96.1
                            reject                            3.9
2.130 7 Propellor           2.130 7 Propellor               100.0
2.140 8 Propellor           2.140 8 Propellor                85.3
                            reject                           14.7
2.160 3 Solenoid            2.160 3 Solenoid                100.0
2.170 Complex               2.170 Complex                    83.3
                            2.60 Sandwich                     8.6
                            reject                            8.0

Mixed alpha-beta
3.10 Roll                   3.10 Roll                        99.9
3.20 Barrel                 3.20 Barrel                     100.0
3.30 2-layer sandwich       3.30 2-layer sandwich            93.5
                            reject                            6.0
3.40 3-layer(aba) sandwich  3.40 3-layer(aba) sandwich       96.1
                            reject                            3.8
3.50 3-layer(bba) sandwich  3.50 3-layer(bba) sandwich       72.1
                            reject                           27.2
                            3.30 2-layer sandwich             0.7
3.60 4-layer sandwich       3.60 4-layer sandwich            99.7
3.70 Box                    3.70 Box                        100.0
3.75 5-stranded propeller   3.75 5-stranded propeller       100.0
3.80 Horseshoe              3.80 Horseshoe                  100.0
3.90 Complex                3.90 Complex                     97.9
                            reject                            0.7

Few secondary structures
4.10 Irregular              4.10 Irregular                   90.8
                            reject                            8.3
                            1.20 Up-down bundle               0.8

This table summarizes the results about the assignments of the CATH architecture for proteins that have already a CATH classification. Only cases that occur >0.5% are listed. These figures were calculated by using 100 cross-validation runs at 20% dilution.
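
The cross-validation figures quoted for Table II rest on the success, reject, and purity estimates defined in Materials and Methods (Eqs. 2–4). The sketch below shows one way those averages could be computed. The helper names and the per-trial tuple layout are hypothetical, and reading “20% dilution” as hiding the classification of a random 20% of the proteins in each trial is an assumption, not something the text states explicitly.

```python
import random

def split(ids, dilution, rng):
    """Randomly hide the labels of a fraction `dilution` of the proteins.

    Returns (train_ids, test_ids); the test proteins are the ones whose
    known classification is withheld and later compared with the prediction.
    """
    ids = list(ids)
    n_test = round(dilution * len(ids))
    test = set(rng.sample(ids, n_test))
    train = [p for p in ids if p not in test]
    return train, sorted(test)

def cv_estimates(trials):
    """Average per-trial rates over T trials.

    trials : list of (n_test, n_success, n_reject), one tuple per trial
    Returns (p_success, p_reject, p_pure) following Eqs. (2)-(4).
    """
    T = len(trials)
    p_success = sum(n_succ / n_test for n_test, n_succ, _ in trials) / T
    p_nonreject = sum((n_test - n_rej) / n_test for n_test, _, n_rej in trials) / T
    p_reject = 1.0 - p_nonreject
    p_pure = p_success / (1.0 - p_reject)  # success rate among non-rejected proteins
    return p_success, p_reject, p_pure

# Two made-up trials, only to show the call.
p_success, p_reject, p_pure = cv_estimates([(100, 90, 5), (100, 80, 15)])
```

Averaging the per-trial ratios, rather than pooling the counts, mirrors the 1/T sum in Eqs. (2) and (3); the two coincide when every trial uses the same test-set size.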

TABLE III. The Search Result When Submitting “1cuoA” to the Web Site http://www.weizmann.ac.il/physics/complex/compphys/f2cs/

            CATH v1.7       CATH v2.0       CATH prediction   SCOP 1.53   SCOP prediction
Chain id    #   C  A  T     #   C  A  T     C  A  T           #   C  F    C  F
1cuoa       −1              1   2  60 40    2  60 40          −1          2  5

This protein was classified by neither CATH v1.7 nor SCOP 1.53, which are the basis of our predictions. We predicted it to belong to CATH topology 2.60.40 and SCOP fold 2.5. Later it was indeed classified by CATH v2.0 as 2.60.40. The −1 in both CATH v1.7 and SCOP 1.53 represents that it was not classified by them.

Class Prediction Using the Web Site

To retrieve our prediction for the CATH topology or SCOP fold of a protein, one can use the web site31 by entering the protein chain identifier in the search box and submitting the query. If the protein appears in our database, then a table will be returned containing both the known and the predicted SCOP and CATH classifications. For example, the submission of the chain identifier “1cuoA” returns Table III. This protein was classified by neither CATH v1.7 nor SCOP 1.53, which are the basis of our predictions. We predicted it to belong to CATH topology 2.60.40 and SCOP fold 2.5. Later, the release CATH v2.0 identified 1cuoA as 2.60.40.

CONCLUSIONS

The rapidly increasing number of experimentally derived protein structures requires a continuous updating of the existing structure classification databases. Each group adopts different classification criteria at the level of sequence, of structure, and of function similarities. A comparison between different classification schemes can help to understand the optimal interplay between different levels, it can reveal possible misclassification, and it can ultimately offer a fully automated updating procedure. Manual steps can be automated in an ever-increasing way by using the tools made available by other databases.

In this work we showed that it is possible to automatically predict the CATH topology and the SCOP fold from the FSSP Z-scores. It is possible to submit a protein of unknown CATH or SCOP classifications but known FSSP Z-scores to the web site31 to obtain its CATH and SCOP classifications. Because the FSSP database is updated weekly, our procedure offers the possibility to update also CATH and SCOP with the same frequency (at least down to the topology and fold level, respectively). We introduced a classification method that clusters together structures of known and unknown classification according to their Z-scores. When proteins outside the twilight zone for structure comparison are considered, our method is highly reliable. We suggest that, to classify proteins within the twilight zone, other classification criteria, based on sequence and function similarity, must be adopted.

The advent of genome projects is multiplying the efforts in the field of protein classification. In the past, the aim was to find the structure of the particular protein that was interesting at a given time. Now the hope is to find a large representative set of structures that can encompass most of the existing folds, possibly all of them.3 In such a large-scale project, human intervention, which is precious in setting the principles of classification, should be gradually replaced by automated procedures.

ACKNOWLEDGMENTS

We thank Liisa Holm for making the raw FSSP data available to us and for useful discussions during the initial stages of this project. This work is based on a thesis for the M.Sc. degree submitted by G.G. to Tel-Aviv University (1998). We also thank Noam Shental for discussions. M.V. is supported by a European Molecular Biology Organization (EMBO) long-term fellowship; he also thanks the Einstein Center for Theoretical Physics for partial support of his stay at the Weizmann Institute. D.S. thanks the Weizmann Institute of Science for hospitality while part of this work was carried out.

MATERIALS AND METHODS
Databases and Protein Sets

Because the CATH and SCOP databases classify domains and FSSP deals with chains, we considered only chains that form a single domain; therefore, these proteins appear as a single entry in the three databases. Several groups have developed methods to identify protein domains.20,23,33–35 In this work, we used the Dali Domain Dictionary24 to identify single-domain proteins.

We used the following databases. The CATH release 1.7, which contains 15,802 protein chains, among which 10,906 are classified as single domain. This latter set is called PCs. We also used the CATH release 2.0, which contains 20,780 protein chains, among which 14,389 are single domain (PC2s). The SCOP release 1.53, which contains 20,021 protein chains, among which 15,375 are single domain (PSs). The FSSP release from 14 January 2001, which contains 22,660 protein chains (PF). The FSSP proteins are grouped into 2,494 homology classes so that within a class the sequence similarity is >25%. One protein per class is selected as representative, and we call PFr the set of all representatives. All the protein sets and their sizes are listed in Table IV.

Classification by Optimization (CO) Method

The classification scheme that we used is based on the minimization of a particular cost function, defined as follows (for the case of the prediction of CATH topology; a similar definition holds for SCOP folds). Each protein is

TABLE IV. Protein Sets and Their Sizes

Name    Description                                                             Size of set

PF      All chains in FSSP (14 Jan, 2001)                                            22,660
PFr     Representative chains in FSSP (14 Jan, 2001)                                  2,494
PC      Chains in CATH v1.7                                                          15,802
PCs     Single-domain chains in CATH v1.7                                            10,906
PC2     Chains in CATH v2.0                                                          20,780
PC2s    Single-domain chains in CATH v2.0                                            14,389
PS      Chains in SCOP 1.53                                                          20,021
PSs     Single-domain chains in SCOP 1.53                                            15,375
PCsSs   Single-domain chains in SCOP 1.53 and CATH v1.7                              10,197
PFrCs   Single-domain chains in CATH that are FSSP representatives (PFr ∩ PCs)          860
PFrSs   Single-domain chains in SCOP that are FSSP representatives (PFr ∩ PSs)        1,626
PFCs    Chains in FSSP and single domain in CATH v1.7                                10,541
PFSs    Chains in FSSP and single domain in SCOP 1.53                                14,716
PFC̄     Chains in FSSP and not in CATH v1.7                                           7,431
PFS̄     Chains in FSSP and not in SCOP 1.53                                           3,451

assigned an integer number c_i, describing its topology (1–305). We assign to proteins with known classification the value of c_i determined by their CATH classification. To the yet unclassified proteins we initially assign random values from 1 to 305. A cost is calculated for each configuration C = {c_i} of topologies, which penalizes the assignment of different topologies to any pair of proteins. The value of this penalty is chosen to be the similarity measure Z_ij between proteins i and j; the higher the similarity Z_ij, the more costly it is to place proteins i and j in different topologies. The cost function is defined as the sum of penalties for all protein pairs ⟨i, j⟩,

$$E(C) = \sum_{\langle i,j \rangle} Z_{ij} \left[ 1 - \delta(c_i, c_j) \right]. \qquad (1)$$

The classification problem is stated as finding the minimal cost configuration of the unclassified proteins, while keeping the topologies (i.e., the c_i values) of the classified proteins fixed. This problem corresponds to finding the ground state of a random field Potts ferromagnet.

We search for a classification C of minimal cost by an iterative greedy algorithm described in detail elsewhere.21 The algorithm identifies at which iteration, if any, it performed a heuristic decision. For low fractions of unknown topologies, the algorithm usually reaches the global minimum of the cost function.

Bounds on the Success Rate of the Prediction

In this section we establish a statistical upper bound for the prediction success rate relevant to a family of prediction algorithms.

The Z-matrix can be reinterpreted as a weighted graph; each vertex in the graph represents a protein and the weights on the edges connecting two vertices are the corresponding Z-scores. Edges with Z < 2.0 are absent from the graph. Following this representation, we define two proteins as neighbors if they are connected by an edge. By analyzing the connectivity properties of set PC we make inferences about our predictive power.

One can characterize the FSSP-based neighborhood of a protein according to the CATH classification of itself and its neighbors. Every protein must belong to one of four categories:
“Island”: The protein has no neighbors.
“Colony”: It has neighbors, but none of its own kind.
“Border”: It has neighbors of its own kind as well as of other kinds.
“Interior”: The protein has only neighbors of its own kind.

Using these definitions we can arrange the proteins of PC in groups according to their neighborhood category at the class, architecture, and topology levels. The distribution of the proteins among these groups can be used to calculate an upper bound for the CO method, if we assume that the set of unclassified proteins has the same distribution as the classified ones. For example, islands cannot be classified and are therefore rejected. Colonies are bound to be misclassified because none of their neighbors gives a clue to their type. Because the fraction of proteins in each category was estimated on the basis of a sample, it can be interpreted only as a statistical upper bound.

We consider the set PFCs to obtain a first type of upper bound for the success rate of the CO method. This set (see Table IV) is formed by 10,541 proteins, among which 5% are islands, a negligible fraction (0.2%) are colonies, 6% are borders, and 88% are interiors. Therefore, the upper bound that we found is about 95% for predicting the topology level in CATH.

The actual prediction performed in this work is done on the set PFC̄, which is formed by the 7,431 proteins that are in FSSP (14 January 2001) but not in CATH v1.7 (see Table IV). Within PFC̄ there is a subset of 1,617 (about 22%) proteins that are either islands or superislands, that is, they are connected only with other proteins in the subset and therefore they have no connection to proteins with known classification. Thus, the upper bound for this second type of prediction is about 78%.
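
The cost function of Eq. (1) and the four neighborhood categories can be illustrated on a toy Z-score graph. The following is a minimal sketch, not the greedy optimizer of ref. 21: the data layout (a dict of scored pairs and a dict of labels) and all function names are assumptions made for this example, and only the Z < 2.0 edge cutoff is taken from the text.

```python
from collections import defaultdict

Z_CUTOFF = 2.0  # edges with Z below this value are absent from the graph

def build_graph(z_scores):
    """z_scores: dict mapping (i, j) -> Z. Returns a symmetric adjacency dict."""
    graph = defaultdict(dict)
    for (i, j), z in z_scores.items():
        if z >= Z_CUTOFF:
            graph[i][j] = z
            graph[j][i] = z
    return graph

def cost(graph, labels):
    """Eq. (1): sum of Z_ij over connected pairs that carry different labels."""
    total = 0.0
    for i, neighbors in graph.items():
        for j, z in neighbors.items():
            if i < j and labels[i] != labels[j]:  # count each pair once
                total += z
    return total

def category(graph, labels, p):
    """Island, colony, border, or interior, as defined in the text."""
    neighbors = graph.get(p, {})
    if not neighbors:
        return "island"
    same = any(labels[q] == labels[p] for q in neighbors)
    other = any(labels[q] != labels[p] for q in neighbors)
    if same and other:
        return "border"
    return "interior" if same else "colony"

def upper_bound(graph, labels):
    """Fraction of proteins that are neither islands nor colonies."""
    ok = sum(category(graph, labels, p) in ("border", "interior") for p in labels)
    return ok / len(labels)

# Toy data: five hypothetical proteins with made-up Z-scores and topology labels.
labels = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
z_scores = {("a", "b"): 10.0, ("b", "c"): 4.0, ("c", "d"): 1.5, ("d", "a"): 3.0}
graph = build_graph(z_scores)
categories = {p: category(graph, labels, p) for p in labels}
```

In this toy case the c–d pair falls below the cutoff and is dropped, so c and d are colonies, e is an island, and only a and b (borders) could be recovered by any neighbor-based method. The same counting, applied to PFCs, is what yields the roughly 95% bound quoted above.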

Evaluating a Classification Prediction Algorithm

Because an algorithm can output either a predicted classification or a “rejection,” if it does not have any prediction, one has to estimate two probabilities: Psuccess and Preject. Robust estimation of these parameters is produced by cross-validation, a procedure that consists in averaging over many (T) randomly sampled test trials. In each trial, the set is divided into two subsets; one is used for training the algorithm and the other set, of Ntest proteins, is used to test the algorithm by comparing its prediction to the true classification. The probability estimates are given by

$$\hat{P}_{\text{success}} = \frac{1}{T} \sum_{t=1}^{T} \frac{N_{\text{success}}}{N_{\text{test}}} \qquad (2)$$

$$\hat{P}_{\text{non-reject}} = 1 - \hat{P}_{\text{reject}} = \frac{1}{T} \sum_{t=1}^{T} \frac{N_{\text{test}} - N_{\text{reject}}}{N_{\text{test}}} \qquad (3)$$

Another figure of merit, the purity Ppure, is the probability of correctly classifying nonrejected proteins. It is estimated by

$$\hat{P}_{\text{pure}} = \frac{\hat{P}_{\text{success}}}{1 - \hat{P}_{\text{reject}}} \qquad (4)$$

REFERENCES
1. Holm L, Sander C. Mapping the protein universe. Science 1996;273:595–602.
2. Thornton JM, Orengo CA, Todd AE, Pearl FMG. Protein folds, functions and evolution. J Mol Biol 1999;293:333–342.
3. Šali A. 100,000 protein structures for the biologist. Nat Struct Biol 1998;5:1029–1032.
4. Martí-Renom MA, Ashley AC, Fiser A, Sanchez R, Melo F, Šali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000;29:291–325.
5. Heger A, Holm L. Towards a covering set of protein family profiles. Prog Biophys Mol Biol 2000;73:321–337.
6. Bowie JU, Lüthy R, Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991;253:164–170.
7. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature 1992;358:86–89.
8. Fisher D, Rice D, Bowie JU, Eisenberg D. Assigning amino acid sequences to 3-dimensional protein folds. FASEB J 1996;10:126–136.
9. Gerstein M, Levitt M. A structural census of the current population of protein sequences. Proc Natl Acad Sci USA 1997;94:11911–11916.
10. Murzin AG. Structural classification of proteins: new superfamilies. Curr Opin Struct Biol 1996;6:386–394.
11. Blundell TL, Mizuguchi K. Structural genomics: an overview. Prog Biophys Mol Biol 2000;73:289–295.
12. Finkelstein AV. Protein structure: What is possible to predict now? Curr Opin Struct Biol 1997;7:60–71.
13. Simons KT, Bonneau R, Ruczinski I, Baker D. Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 1999;37(Suppl 3):171–176.
14. Bernstein F, Koetzle T, Williams G, Meyer EJ, Brice M, Rodgers J, Kennard O, Shimanouchi T, Tasumi M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977;112:535–542.
15. Holm L, Sander C. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res 1997;25:231–234.
16. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH – a hierarchic classification of protein domain structures. Structure 1997;5:1093–1108.
17. Conte LL, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res 2000;28:257–259.
18. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: a database for protein structure alignments for homologous families. Protein Sci 1998;7:2469–2471.
19. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin Struct Biol 1996;6:377–385.
20. Siddiqui AS, Barton GJ. Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci 1995;4:872–884.
21. Getz G. Clustering and classification of protein structures. M.Sc. Thesis, Tel-Aviv University, 1998.
22. Hadley C, Jones DT. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 1999;7:1099–1112.
23. Holm L, Sander C. The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 1994;22:3600–3609.
24. Dietmann S, Park J, Notredame C, Heger A, Lappe M, Holm L. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res 2001;29:55–57.
25. Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall; 1988.
26. Levitt M, Chothia C. Structural patterns in globular proteins. Nature 1976;261:552–558.
27. Taylor WR, Orengo CA. Protein structure alignment. J Mol Biol 1989;208:1–22.
28. Orengo CA, Brown NP, Taylor WR. Fast structure alignment for protein databank searching. Proteins 1992;14:139–167.
29. Bray JE, Todd AE, Pearl FMG, Thornton JM, Orengo CA. The CATH Dictionary of Homologous Superfamilies: a consensus approach to analyze distant structural homologues. Protein Eng 2000;13:153–165.
30. Brenner SE, Koehl P, Levitt M. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000;28:254–256.
31. http://www.weizmann.ac.il/physics/complex/compphys/f2cs/index.html.
32. Swindells MB, Orengo CA, Jones DT, Hutchinson EG, Thornton JM. Contemporary approaches to protein structure classification. Bioessays 1998;20:884–891.
33. Islam SA, Luo J, Sternberg MJE. Identification and analysis of domains in proteins. Protein Eng 1995;8:513–525.
34. Swindells MB. A procedure for detecting structural domains in proteins. Protein Sci 1995;4:103–112.
35. Sowdahamini R, Rufino SD, Blundell TL. Nuclear dynamics and electronic transition in a photosynthetic reaction center. J Am Chem Soc 1997;119:3948–3958.
