Article
DOI 10.1002/prot.1176
ABSTRACT We present an automated procedure to assign CATH and SCOP classifications to proteins whose FSSP score is available. CATH classification is assigned down to the topology level, and SCOP classification is assigned to the fold level. Because the FSSP database is updated weekly, this method makes it possible to update also CATH and SCOP with the same frequency. Our predictions have a nearly perfect success rate when ambiguous cases are discarded. These ambiguous cases are intrinsic in any protein structure classification that relies on structural information alone. Hence, we introduce the “twilight zone for structure classification.” We further suggest that to resolve these ambiguous cases, other criteria of classification, based also on information about sequence and function, must be used. Proteins 2002;46:405–415.
© 2002 Wiley-Liss, Inc.

[…]vealed by their folded state. Therefore, most of the recently proposed methods to derive energy functions to perform protein fold predictions rely in different ways on structural data.12,13

The most comprehensive repository of three-dimensional structures of proteins is the Protein Data Bank (PDB).14 The number of released structures is increasing at the pace of about 50 per week, and >12,000 complete sets of coordinates were available at the time of writing. Many research groups maintain web-accessible hierarchical classifications of PDB entries. The most widely used are FSSP,15 CATH,16 SCOP,17 HOMSTRAD,18 MMDB,19 and 3Dee20 (see Table I for a list of abbreviations). Here we consider three of these: the FSSP, the CATH, and the SCOP databases. Each group has its own way to compare and classify proteins; these three classification schemes are, however, consistent with each other to a large extent.21,22
TABLE I. Abbreviations and Definitions

Abbreviation   Definition
3Dee           Database of protein domain definitions
ASTRAL         The ASTRAL compendium for sequence and structure analysis
CATH           Protein structure classification
CO             Classification by optimization
DALI           Protein structure comparison by alignment of distance matrices
DHS            Dictionary of homologous superfamilies
FSSP           Fold classification based on structure-structure alignment of proteins
HOMSTRAD       Homologous structure alignment database
MMDB           Molecular modeling database
PDB            Protein data bank
SCOP           Structural classification of proteins
SSAP           Structure comparison algorithm

[…] Z-score matrix. An alternate classification based on a more common four-level hierarchy is also available.24

[…] structure content.26 SCOP is linked to the ASTRAL compendium,30 which provides a series of tools for further analysis of the classified structures, mainly through the use of their sequence. At variance with FSSP and CATH, SCOP is constructed manually, by visual inspection and comparison of not only structures but also sequences and functions.

Automated Assignment of SCOP and CATH Classifications

In this work we present a method, Classification by Optimization (CO), to predict without human intervention the SCOP fold level and the CATH topology level from the FSSP pairwise structure similarity score. A protein for which the Z-score is available is classified into a SCOP fold and into a CATH topology by the CO method, an optimization procedure that finds the assignment of minimal cost, where the cost is defined in terms of Z-scores (see Materials and Methods). The query for the classification of any such protein can be submitted to the web site.31
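As a rough illustration of what "assignment of minimal cost" means, the Python sketch below evaluates the cost given as Eq. (1) in Materials and Methods (the sum of Z-scores over protein pairs placed in different topologies) and lowers it with a simple greedy pass. The greedy rule, the function names, and the toy Z-scores are assumptions made for this example; the algorithm actually used in the paper is described in Ref. 21 and is not reproduced here.

```python
def cost(z, labels):
    """Sum of Z-scores over labelled protein pairs placed in different topologies.

    `z` maps a pair (i, j) to its Z-score; pairs with an unlabelled member are skipped.
    """
    return sum(
        score
        for (i, j), score in z.items()
        if i in labels and j in labels and labels[i] != labels[j]
    )


def greedy_assign(z, known, unknown):
    """Illustrative greedy pass (hypothetical, not the authors' procedure).

    Repeatedly labels the unclassified protein that has the strongest total
    Z-score to one topology among its already-labelled neighbours. Proteins
    with no labelled neighbour at the end are returned as rejected.
    """
    labels = dict(known)          # known classifications stay fixed
    pending = set(unknown)
    while pending:
        best = None               # (total Z-score, protein, topology)
        for p in sorted(pending):
            votes = {}
            for (i, j), score in z.items():
                if p in (i, j):
                    other = j if i == p else i
                    if other in labels:
                        votes[labels[other]] = votes.get(labels[other], 0.0) + score
            for label, total in votes.items():
                if best is None or total > best[0]:
                    best = (total, p, label)
        if best is None:
            break                 # nothing left that touches a labelled protein
        _, p, label = best
        labels[p] = label
        pending.remove(p)
    return labels, sorted(pending)


# Toy data (made up): proteins "a" and "c" are classified, the rest are not.
z = {("a", "b"): 9.3, ("b", "c"): 8.7, ("c", "d"): 4.0, ("a", "x"): 6.0, ("x", "c"): 2.5}
labels, rejected = greedy_assign(z, {"a": 1, "c": 2}, ["b", "x", "d", "e"])
print(labels, rejected, cost(z, labels))
```

Keeping the known labels fixed while only the unclassified proteins are assigned mirrors the description of CO as a task between clustering and classification; a greedy pass like this one is not guaranteed to reach the global minimum.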
Fig. 1. a: Z-score matrix between all pairs of proteins in the PFrCs set. A black dot represents Z > 2.0. b:
Same Z-score matrix with rows and columns rearranged by using the CATH classification (see text). Part (b)
shows the underlying order behind the apparent randomness of part (a) and illustrates the extent to which the
FSSP Z-scores reflect the CATH classification. The regions A, B, C, and D are discussed in the text.
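The two operations behind Figure 1 (marking pairs with Z > 2.0 and reordering rows and columns by classification) can be sketched as follows. This is a minimal illustration on a made-up 4 × 4 matrix; the function names and the labels are hypothetical, and real CATH labels would typically be (class, architecture, topology) tuples so that sorting groups the hierarchy level by level.

```python
def contact_matrix(z, threshold=2.0):
    """Boolean matrix: True where the Z-score exceeds the threshold (a black dot)."""
    return [[score > threshold for score in row] for row in z]


def reorder(matrix, labels):
    """Permute rows and columns so that proteins with the same label are adjacent."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[matrix[i][j] for j in order] for i in order], order


# Toy symmetric Z-score matrix (made up) and hypothetical class labels.
z = [
    [0.0, 0.5, 6.1, 1.0],
    [0.5, 0.0, 1.2, 7.4],
    [6.1, 1.2, 0.0, 0.3],
    [1.0, 7.4, 0.3, 0.0],
]
labels = ["beta", "alpha", "beta", "alpha"]

dots = contact_matrix(z)
ordered, order = reorder(dots, labels)
for row in ordered:
    print("".join("#" if cell else "." for cell in row))
```

After reordering, the dots of this toy matrix fall into two diagonal blocks, one per label, which is the kind of structure part (b) of the figure makes visible.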
to which the FSSP Z-scores reflect the CATH classification.

Several interesting observations can be made. First, consider the Class level of CATH. As can be seen in Figure 1(b), there are no matrix elements with Z > 2.0 in region A that connect proteins of the mainly-α class to the mainly-β class. At variance with this, some proteins from both of these classes have large Z-scores with proteins from the α-β class (region B). This is reasonable, because of the way similarity is defined by FSSP; a mainly-α protein can have a high Z-score with an α-β protein because of high similarity with the α part. Second, in the Architecture level, we observe that there are architecture families that are highly connected within themselves, e.g., α-β barrels (482–525: region C), whereas for others the intrafamily connections are more sparse. The similarities within the mainly-β sandwich family (318–406: region D) have two relatively distinct subgroups, which suggest an inner structure corresponding to the lower levels in the CATH hierarchy. Checking the topology level (the third CATH level) for this architecture, one indeed finds two large topology subfamilies, the immunoglobulin-like proteins (324–366: upper left part of region D) and the Jelly-Rolls (373–402: lower right part of region D), which correspond precisely to the two strongly connected subgroups that appear in Figure 1(b).

We found that the CATH classification at the level of topology is reflected in the Z-matrix. This is to be expected because the Z-score measures the structural similarity of two aligned proteins while preserving their connectivity. Overall, this analysis shows that the Z-matrix is correlated with the CATH classification. In a similar way it is possible to show that the Z-score is correlated with the SCOP classification. The results are available at the web site.31

These findings suggest that Z-scores can be used to predict the CATH and SCOP classifications of yet unclassified proteins. In what follows, we demonstrate that this indeed can be done. We also estimate the success rate of our predictions and provide a web site31 that can be used to retrieve our predictions for the CATH topology and the SCOP fold for new entries in FSSP.

We also verified that the CATH and SCOP classifications are to a large extent mutually compatible. An immediate consequence of this is that it is possible to construct a “translation table,” T̂, from the proteins that have already both a CATH and a SCOP classification. In this way, given a CATH entry, one can obtain the corresponding SCOP classification (see Fig. 2). Row i of the table refers to a particular CATH topology and column j to a particular SCOP fold. The element T̂ij of the table is the measured fraction of times that a protein has a CATH topology i and a SCOP fold j. This number is calculated by enumerating all the 10,197 single-domain proteins with known CATH and SCOP classifications (PCsSs), and it is an estimate of Tij, the joint probability distribution for a protein to have CATH topology i and SCOP fold j. If the CATH and SCOP classifications had been independent, every element Tij could have been expressed as a product of Ci, the fraction of proteins that belong to CATH topology i, and Sj, the fraction that belongs to SCOP fold j, that is, Tij = Ci * Sj. Randomly placing 10,197 proteins using such a probability distribution yields 4780 ± 40 nonzero elements in the matrix. In the other extreme case, if there had been a full correspondence between the SCOP and CATH classifications, the table would have had a single nonzero element in each row and column (in each CATH topology row the nonzero element would have been in that SCOP fold column that corresponds to it). In this case, the proteins in PCsSs would have been distributed among 284
408 G. GETZ ET AL.
Fig. 2. Translation table from the CATH topology to the SCOP fold and vice versa. Nonzero entries of T̂ij
appear as black dots. T̂ij is proportional to the number of proteins of CATH topology i that have a SCOP fold j in
PCsSs.
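A translation table of this kind can be tallied directly from a list of (CATH topology, SCOP fold) pairs. The sketch below is a minimal illustration under that assumption; the function names and the placeholder labels "T1", "T2", "F1", "F2" are hypothetical and do not correspond to real CATH or SCOP entries.

```python
from collections import Counter


def translation_table(pairs):
    """Fraction of proteins observed with each (CATH topology, SCOP fold) pair."""
    counts = Counter(pairs)
    total = len(pairs)
    return {key: n / total for key, n in counts.items()}


def best_cath_for_scop(table, scop_fold):
    """CATH topology with the largest table entry in the given SCOP fold column."""
    column = {cath: frac for (cath, scop), frac in table.items() if scop == scop_fold}
    return max(column, key=column.get) if column else None


# One (topology, fold) pair per doubly classified protein; labels are placeholders.
pairs = [("T1", "F1"), ("T1", "F1"), ("T2", "F1"), ("T2", "F2")]
table = translation_table(pairs)
print(len(table), "nonzero elements")
print(best_cath_for_scop(table, "F1"))
```

Counting the keys of the table gives the number of nonzero elements, the quantity the text compares against the independent and one-to-one extremes.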
nonzero elements (the number of distinct CATH topologies in PCsSs).

We found 369 nonzero elements in T̂, meaning that the CATH and SCOP classifications are highly dependent. Still, the correspondence is not entirely one-to-one; in general, more than one SCOP fold corresponds to a given CATH topology. The number of such folds is, however, typically small. Such a translation table may be used to predict the SCOP classification of a structure already classified in CATH or at least to significantly restrict the number of possibilities and vice versa. For example, the assignment of the CATH topology to a protein with known SCOP fold can be done by selecting the CATH topology with the largest value in the translation table for that particular SCOP fold. Such an assignment is correct in 93% of the cases. The corresponding assignment of the SCOP fold from the CATH topology is correct in 82% of the cases. Although this is possibly useful information, in this work we do not assign classifications in this way.

SUMMARY OF THE CO CLASSIFICATION PERFORMANCE

Every time the FSSP Z-scores are updated (once a week) the CO classification can be applied to all the proteins that appear in the new FSSP release but are not yet classified in CATH or in SCOP. The possible outcomes of the classification procedure are as follows:

1. Correct classification: the predicted classification will agree with the future release of the databases.
2. Rejection: the program is unable to classify the structure.
3. Ambiguous classification: a classification is returned (both for CATH and SCOP), but a later release provides a different classification.

The frequencies of these outcomes greatly depend on the statistics of the set of proteins to be classified. More specifically, rejected proteins are of two types: proteins that do not have high Z-scores with any other proteins (“islands”; see Materials and Methods) and clusters of proteins that are similar among themselves but do not have high Z-scores with other proteins outside their cluster (“superislands”). The fraction of islands and superislands is a feature of the particular set of proteins to be classified. The occurrence of a superisland suggests that a new classification type (a new topology for CATH and a new fold for SCOP) might be needed. The work of maintaining CATH and SCOP can thus be focused on the classification of a representative from each of these superislands.

For the set PFCs, the fraction of islands and superislands is 5%. We used this set to provide an upper bound for the performance of the CO method (see below); however, for the set PFC the fraction of rejections goes up to 22%. If rejections are not counted, we classify correctly 98% of the PFCs proteins. On the other hand, we could test our predictions also against the new CATH release v2.0. Of 1582 proteins that were assigned to previously existing CATH topologies, CO has classified correctly 80%. The difference in success rates between PFCs and PFC is due to the different way in which the test set is nested in the larger set of structures with known classification. In the first case, the test set consisted of 20% of the members of PFCs, selected at random; the remaining 80% were used to “predict” the classification of the test set. In the second case, the members of CATH v1.7 were used to predict the classification of the new proteins that were added when
AUTOMATED ASSIGNMENT OF PROTEIN CLASSIFICATION 409
Fig. 3. a: Z-score matrix between all pairs of proteins in the combined PFrCs + PFC sets. The submatrix in the upper left corner is the reordered Z-score matrix of the set PFrCs, which was already shown in Figure 1(b). The rest of the matrix presents the Z-scores for the proteins in the set PFC. b: The same matrix as in (a) with the rows and columns relative to the proteins in PFC reordered according to our assignment of their CATH topology. With the CO method, the original order in the submatrix PFrCs is propagated to the entire matrix.
CATH v2.0 was released. These new structures are not distributed uniformly at random among the members of CATH v1.7.

Ambiguous classifications are due to two different mechanisms. The first stems from a well-known problem with the way the FSSP similarity index is calculated (the “Russian doll effect”; see below). The second kind of “mistake” is actually not a wrong classification; rather, it happens when the newly classified structure lies within the ambiguous “twilight zone” between two closely related topologies (for CATH) or folds (for SCOP), as demonstrated in detail below.

Automated Assignment of CATH Classification From FSSP

In this section we describe the procedure that we used to predict CATH topology level from the FSSP scores. We identified a set of 7431 proteins (PFC; see Materials and Methods) that appear in FSSP but were not yet processed by CATH 1.7. Our goal is to predict the CATH topology of these 7431 proteins by using (a) the Z-scores between all proteins in PF (see Materials and Methods) and (b) the known classifications of the set PFrCs (see Materials and Methods).

Predicting topologies is a classification problem that we treated with pattern recognition tools. We tested several prediction algorithms using cross-validation to estimate their performance.21 Every one of the algorithms that were tested can be viewed as a two-stage process. In the first stage, a new similarity measure is produced from the original Z-scores. This is done either by a direct rescaling of the original Z-scores or by using the results of various hierarchical clustering methods to produce new similarity measures. The second stage consists of using these similarities as the input to some classification method, yielding predictions for the classes and architectures. In this work we present only results obtained by one particular method (CO), which uses the original Z-score as a similarity measure (see Materials and Methods). A complete list of the results obtained by using other methods can be found in Ref. 21, which is available on the web site.31

Our final assignments for the set PFC using the CO method are listed in the web site. A more illustrative way to present these results is shown in Figure 3. In Figure 3(a) we present the Z-score matrix for the combined set PFrCs + PFC. The submatrix in the upper left corner is the reordered Z-score matrix of the set PFrCs, which was already shown in Figure 1(b). The rest of the matrix in Figure 3(a) presents the Z-scores of PFrCs with the set PFC (randomly ordered) and the Z-scores of PFC among themselves. In Figure 3(b) we reordered the rows and columns whose index was >860, corresponding to proteins in PFC. Although in the matrix of Figure 3(a) these proteins appear in a random order, in Figure 3(b) they appear in the order imposed by our prediction of their CATH topology. One can see that the original order in the submatrix PFrCs is propagated by our assignment procedure to the set PFC. For example, focus on the small black square at the upper left corner of the matrix. This small black square represents the high Z-scores among the mainly-α class of proteins in PFrCs. In the corresponding top rows of the full matrix we see high Z-scores between these structures and some proteins from PFC. In particular, the small group with indices near 2476 are “close” to these mainly-α structures and hence are also classified as such. On the other hand, there is a large group of structures from PFC (between 861 and 2476), which do not have high Z-scores with any of the proteins in PFrCs or with any of the other structures in PFC with index >2476. Hence,
we are unable to classify this group of structures on the basis of their FSSP scores.

Figure 3(b) illustrates the central idea of this work. We perform a task that is intermediate between clustering and classification. We take proteins of known classification and we use them as fixed a priori values in a clustering procedure.

The overall success rate of our prediction estimated by cross-validation was 93%. To understand the significance of these success rates, we derived a statistical upper bound for this kind of prediction (see Materials and Methods). This upper bound is 95%, hence the figure of 93/95 = 98% given above.*

* One must keep in mind that the estimated success rate is calculated for all proteins, both FSSP representatives (≈10% of the proteins) and nonrepresentatives. Because the presence of homologous proteins can create a bias in these estimates, we also tested the success rate of predicting the CATH topology only for the FSSP representatives, which yielded 63%, to be compared with the corresponding upper bound of 74%.

We estimated the accuracy of the prediction by using the following procedure. First, the set PFCs was randomly “diluted”; that is, we randomly chose a certain fraction of the proteins in PFCs and placed them in a test set, pretending that we did not know their classification. The FSSP scores of the entire set were then used to classify the test set. For each protein from the test set, we either return a predicted classification or reject the protein (i.e., we declare that we are unable to classify it). The quality of any classification algorithm (see Materials and Methods) is measured by its success rate (fraction of correctly classified proteins, out of the test set) and by the purity (success rate out of the nonrejected proteins). For the CO method, the results were 93% for the success rate and 98% for the purity (using a dilution of 20%). More extensive tests at other dilutions and for other methods of classification are discussed in Ref. 21 and available at the web site.31

We also tested directly the reliability of the CO assignments by using the CATH version 2.0 (PC2). In PC2, 1640 single-domain proteins that are present in PFC were assigned to one of the topologies that existed in v1.7. Fifty-eight of these we “rejected.” In 1266 cases of the remaining 1582 (80%), our prediction agrees with the one given in CATH v2.0. Almost all the cases in which we misassigned a domain can be explained in a simple way. These cases are discussed in detail in a following section.

The CO method can also be used to predict directly the C level and the A level of CATH. We found that when the C and A levels were predicted as a byproduct of predicting the T level, the resulting C and A were consistent with those predicted directly.

Automated Assignment of SCOP Classification From FSSP

We used the CO method to predict the SCOP fold for a set of 3451 proteins (PFS) that belong in PF but not yet in PS. The results are available on the web site.31 The estimated success rate (by cross-validation) was 93%. As in the case of CATH, this number increased when we discarded proteins in the “twilight zone” (see the next section).

Twilight Zone for Protein Classification

The attempt to assign a new protein to a known fold might lead to frustration because at times one is undecided about two or more possibilities. To assess whether two proteins have similar structures, a similarity score is needed. FSSP uses the Z-score, CATH uses the SSAP score, and SCOP uses a subjective evaluation, which is also a kind of score. The problem arises when the protein to be classified has high scores with two proteins already classified, but to different topologies. In this article, these proteins are called borders (see Materials and Methods). Being a border protein depends on the similarity score. We showed, however, that FSSP, CATH, and SCOP are to a large extent consistent classifications. Therefore, we suggest that there are “intrinsically” ambiguous cases, that is, cases that are unavoidable in structure comparison. We refer to these ambiguous regions in structure space as the “twilight zone” in analogy with the case of protein sequence comparison where proteins with sequence similarity below 30% cannot be reliably assigned to the same fold. We illustrate this concept by a typical case, shown in Figure 4. This is a border protein. Protein 1dhn (the central one) is the one to be classified (in fact, it is a three-layer sandwich according to CATH). It has a Z-score of 9.3 with protein 1a8rA (on the left), which is a three-layer sandwich topology and a Z-score of 8.7 with protein 1b66A (on the right), which is a two-layer sandwich topology. This example illustrates how structural information alone might not provide a clear-cut criterion for classification of this protein. The incidence of the twilight zone is shown in Figure 5. In Figure 5(a) we present the histogram of the number of protein pairs that have different CATH topologies as a function of their Z-score. This number is a rapidly decaying function of Z. On the contrary, the number of pairs with the same CATH topology is a slowly decaying function of Z. For Z > 3, the probability of having the same CATH topology becomes greater than that of having different topologies. For Z > 7.5, the probability to have the same topology is 97.5%. In Figure 5(b) we show the corresponding figure for SCOP. The number of folds in SCOP is larger than the number of topologies in CATH; therefore, there is more ambiguity. However, also in this case for Z > 7.5, the probability to have the same topology is 93.5%. Taken together, these results indicate that the twilight zone for structure comparison can be bounded by Z ≤ 7.

There are other cases in which the classification of a particular protein is inconsistent with that of all its neighbors. For example, proteins that we called colonies (see Materials and Methods) are such that none of their neighbors are of their own kind. This means that the FSSP scores imply that these proteins are similar only to proteins of different classes and architectures. Identifying these proteins can also focus the attention to possible misclassification or to drawbacks of the Z-score. For example, 1 of the 49 colonies (at the architecture level) that
Fig. 4. Center: Protein 1dhn, which has a CATH α-β three-layer (ββα) sandwich Aspartylglucosaminidase chain B (3.50.11) topology. Left: Protein 1a8rA, which also has a CATH α-β three-layer (ββα) sandwich Aspartylglucosaminidase chain B (3.50.11) topology and has a Z-score of 9.3 with protein 1dhn. Right: Protein 1b66A, which has a CATH α-β two-layer sandwich Tetrahydropterin Synthase, subunit A (3.30.479) topology and has a Z-score of 8.7 with protein 1dhn. This example illustrates how structural information alone might be insufficient to provide a clear-cut criterion for the classification of this protein.
Fig. 5. Twilight zone for protein structure classification. a: The number of protein pairs with a given FSSP Z-score that have different CATH folds is a rapidly decaying function of Z. On the contrary, the number of protein pairs with the same CATH fold is decaying slowly. For Z < 5 there is a non-negligible probability to have different folds. We call this threshold the “twilight zone for structure classification.” b: The corresponding histogram for SCOP folds. The number of SCOP folds is larger than the number of CATH topologies; hence the twilight zone is Z ≃ 7.
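The histograms in Figure 5 and the "probability to have the same topology above a given Z" quoted in the text amount to simple counting over protein pairs. The sketch below shows one way to do that counting; the function names, the bin width, and the handful of (Z-score, same-label) pairs are made up for illustration and do not reproduce the paper's data.

```python
from collections import Counter


def pair_histograms(pairs, bin_width=1.0):
    """Count pairs per Z-score bin, separately for same-label and different-label pairs.

    `pairs` is a list of (z_score, is_same_label) tuples.
    """
    same, different = Counter(), Counter()
    for z, is_same in pairs:
        (same if is_same else different)[int(z // bin_width)] += 1
    return same, different


def same_label_fraction(pairs, z_min):
    """Among pairs with Z > z_min, the fraction that share the same label."""
    selected = [is_same for z, is_same in pairs if z > z_min]
    return sum(selected) / len(selected) if selected else None


# Toy pairs (made up): low Z-scores mostly join different topologies.
pairs = [
    (2.1, False), (2.4, False), (3.2, False), (3.5, True),
    (5.0, True), (8.0, True), (9.3, True), (8.7, False),
]
same, different = pair_histograms(pairs)
print(dict(same), dict(different))
print(same_label_fraction(pairs, 7.5))
```

Scanning `z_min` upward until the returned fraction is acceptably close to 1 is one way to locate the upper edge of the twilight zone on a given data set.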
we found in CATH is the PDB entry 1rboC, which is classified as an α-β two-layer sandwich. It has 15 neighbors in PC, 14 of which are classified as mainly-β sandwiches.

We summarize the results about the assignments of the CATH architecture for proteins that already have a CATH classification (PFCs) in a “confusion table” (see Table II). The first column lists the “correct” classification (as given in CATH v1.7 for the test set); the second column gives the assignments by CO (correct, incorrect, or reject), and the third column lists the corresponding percentages. A full list of the inconsistent proteins is available on the web site.31

Another problem is that there are some large Z-scores between proteins of different architectures. Such large Z-scores arise when a protein of one particular architecture has a similar structure to a part of a protein of a different architecture. Swindells et al.32 call the phenomenon of structures within structures the “Russian doll” effect. Such cases are common between architectures of long proteins that contain substructures corresponding to architectures of shorter proteins; for example, there are many two-layer sandwich proteins that resemble a part of three-layer sandwich proteins. Such relationships can occur at the class level [e.g., α-β proteins that contain mainly-α or mainly-β proteins (1rboC, 1hgeA)]. They can also occur at the architecture level within the same class [e.g., α-β complex architecture contains α-β two-layer sandwich (1regX)]. Other inconsistencies occur when proteins fit two architecture definitions.
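The bookkeeping behind a confusion table of the kind described above (correct, incorrect, or reject, as percentages per true class), together with the success rate and purity measures defined earlier in the text, can be sketched in a few lines. The function names and the toy proteins are assumptions for this example; a rejected protein is represented here by a prediction of None.

```python
from collections import Counter


def confusion_table(true_labels, predicted):
    """Percentage of correct, incorrect and rejected outcomes for each true class."""
    counts = Counter()
    for protein, truth in true_labels.items():
        guess = predicted.get(protein)
        if guess is None:
            outcome = "reject"
        else:
            outcome = "correct" if guess == truth else "incorrect"
        counts[(truth, outcome)] += 1
    totals = Counter()
    for (truth, _), n in counts.items():
        totals[truth] += n
    return {key: 100.0 * n / totals[key[0]] for key, n in counts.items()}


def success_rate_and_purity(true_labels, predicted):
    """Success rate: correct / all test proteins. Purity: correct / non-rejected."""
    correct = sum(1 for p, t in true_labels.items() if predicted.get(p) == t)
    answered = sum(1 for p in true_labels if predicted.get(p) is not None)
    purity = correct / answered if answered else None
    return correct / len(true_labels), purity


# Toy test set (made up): p4 is rejected, p2 is misassigned.
truth = {"p1": "sandwich", "p2": "sandwich", "p3": "barrel", "p4": "barrel", "p5": "barrel"}
guess = {"p1": "sandwich", "p2": "barrel", "p3": "barrel", "p4": None, "p5": "barrel"}
print(confusion_table(truth, guess))
print(success_rate_and_purity(truth, guess))
```

Because rejected proteins count against the success rate but not against the purity, the purity is never lower than the success rate, which matches the 93% and 98% figures reported for CO.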
TABLE III. The Search Result When Submitting “1cuoA” to the Web Site http://www.weizmann.ac.il/physics/complex/compphys/f2cs/

            CATH v1.7       CATH v2.0        CATH prediction   SCOP 1.53    SCOP prediction
Chain id    #   C  A  T     #  C  A   T      C  A   T          #   C  F     C  F
1cuoa       −1              1  2  60  40     2  60  40         −1           2  5

This protein was classified by neither CATH v1.7 nor SCOP 1.53, which are the basis of our predictions. We predicted it to belong to CATH topology 2.60.40 and SCOP fold 2.5. Later it was indeed classified by CATH v2.0 as 2.60.40. The −1 in both CATH v1.7 and SCOP 1.53 indicates that the protein was not classified by them.
Class Prediction Using the Web Site

To retrieve our prediction for the CATH topology or SCOP fold of a protein, one can use the web site31 by entering the protein chain identifier in the search box and submitting the query. If the protein appears in our database, then a table will be returned containing both the known and the predicted SCOP and CATH classifications. For example, the submission of the chain identifier “1cuoA” returns Table III. This protein was classified by neither CATH v1.7 nor SCOP 1.53, which are the basis of our predictions. We predicted it to belong to CATH topology 2.60.40 and SCOP fold 2.5. Later, the release CATH v2.0 identified 1cuoA as 2.60.40.

CONCLUSIONS

The rapidly increasing number of experimentally derived protein structures requires a continuous updating of the existing structure classification databases. Each group adopts different classification criteria at the level of sequence, of structure, and of function similarities. A comparison between different classification schemes can help to understand the optimal interplay between different levels, it can reveal possible misclassification, and it can ultimately offer a fully automated updating procedure. Manual steps can be automated in an ever-increasing way by using the tools made available by other databases.

In this work we showed that it is possible to automatically predict the CATH topology and the SCOP fold from the FSSP Z-scores. It is possible to submit a protein of unknown CATH or SCOP classifications but known FSSP Z-scores to the web site31 to obtain its CATH and SCOP classifications. Because the FSSP database is updated weekly, our procedure offers the possibility to update also CATH and SCOP with the same frequency (at least down to the topology and fold level, respectively). We introduced a classification method that clusters together structures of known and unknown classification according to their Z-scores. When proteins outside the twilight zone for structure comparison are considered, our method is highly reliable. We suggest that, to classify proteins within the twilight zone, other classification criteria, based on sequence and function similarity, must be adopted.

The advent of genome projects is multiplying the efforts in the field of protein classification. In the past, the aim was to find the structure of the particular protein that was interesting at a given time. Now the hope is to find a large representative set of structures that can encompass most of the existing folds, possibly all of them.3 In such a large-scale project, human intervention, which is precious in setting the principles of classification, should be gradually replaced by automated procedures.

ACKNOWLEDGMENTS

We thank Liisa Holm for making the raw FSSP data available to us and for useful discussions during the initial stages of this project. This work is based on a thesis for the M.Sc. degree submitted by G.G. to Tel-Aviv University (1998). We also thank Noam Shental for discussions. M.V. is supported by a European Molecular Biology Organization (EMBO) long-term fellowship; he also thanks the Einstein Center for Theoretical Physics for partial support of his stay at the Weizmann Institute. D.S. thanks the Weizmann Institute of Science for hospitality while part of this work was carried out.

MATERIALS AND METHODS
Databases and Protein Sets

Because the CATH and SCOP databases classify domains and FSSP deals with chains, we considered only chains that form a single domain; therefore, these proteins appear as a single entry in the three databases. Several groups have developed methods to identify protein domains.20,23,33–35 In this work, we used the Dali Domain Dictionary24 to identify single-domain proteins.

We used the following databases. The CATH release 1.7 contains 15,802 protein chains, among which 10,906 are classified as single domain; this latter set is called PCs. The CATH release 2.0 contains 20,780 protein chains, among which 14,389 are single domain (PC2s). The SCOP release 1.53 contains 20,021 protein chains, among which 15,375 are single domain (PSs). The FSSP release from 14 January 2001 contains 22,660 protein chains (PF). The FSSP proteins are grouped into 2,494 homology classes so that within a class the sequence similarity is >25%. One protein per class is selected as representative, and we call PFr the set of all representatives. All the protein sets and their sizes are listed in Table IV.

Classification by Optimization (CO) Method

The classification scheme that we used is based on the minimization of a particular cost function, defined as follows (for the case of the prediction of CATH topology; a similar definition holds for SCOP folds). Each protein is
assigned an integer number $c_i$, describing its topology (1–305). We assign to proteins with known classification the value of $c_i$ determined by their CATH classification. To the yet unclassified proteins we assign initially random values from 1 to 305. A cost is calculated for each configuration $C = \{c_i\}$ of topologies, which penalizes the assignment of different topologies to any pair of proteins. The value of this penalty is chosen to be the similarity measure $Z_{ij}$ between proteins i and j; the higher the similarity $Z_{ij}$, the more costly it is to place proteins i and j in different topologies. The cost function is defined as the sum of penalties for all protein pairs $\langle i,j \rangle$,

$$E(C) = \sum_{\langle i,j \rangle} Z_{ij}\,[1 - \delta(c_i, c_j)]. \tag{1}$$

The classification problem is stated as finding the minimal cost configuration of the unclassified proteins, while keeping the topologies (i.e., the $c_i$ values) of the classified proteins fixed. This problem corresponds to finding the ground state of a random field Potts ferromagnet.

We search for a classification C of minimal cost by an iterative greedy algorithm described in detail elsewhere.21 The algorithm identifies at which iteration, if any, it performed a heuristic decision. For low fractions of unknown topologies, the algorithm usually reaches the global minimum of the cost function.

Bounds on the Success Rate of the Prediction

In this section we establish a statistical upper bound for the prediction success rate relevant to a family of prediction algorithms.

The Z-matrix can be reinterpreted as a weighted graph; each vertex in the graph represents a protein and the weights on the edges connecting two vertices are the corresponding Z-scores. Edges with Z < 2.0 are absent from the graph. Following this representation, we define two proteins as neighbors if they are connected by an edge. By analyzing the connectivity properties of set PC we make inferences about our predictive power.

One can characterize the FSSP-based neighborhood of a protein according to the CATH classification of itself and its neighbors. Every protein must belong to one of four categories:

“Island”: The protein has no neighbors.
“Colony”: It has no neighbors of its own kind.
“Border”: It has neighbors of its own kind as well as of other kinds.
“Interior”: The protein has only neighbors of its own kind.

Using these definitions we can arrange the proteins of PC in groups according to their neighborhood category at the class, architecture, and topology levels. The distribution of the proteins among these groups can be used to calculate an upper bound for the CO method, if we assume that the set of unclassified proteins has the same distribution as the classified ones. For example, islands cannot be classified and are therefore rejected. Colonies are bound to be misclassified because none of their neighbors give a clue on their type. Because the fraction of proteins in each category was estimated on the basis of a sample, it can be interpreted only as a statistical upper bound.

We consider the set PFCs to obtain a first type of upper bound for the success rate of the CO method. This set (see Table IV) is formed by 10,541 proteins, among which 5% are islands, a negligible fraction (0.2%) are colonies, 6% are borders, and 88% are interiors. Therefore, the upper bound that we found is about 95% for predicting the topology level in CATH.

The actual prediction performed in this work is done on the set PFC, which is formed by the 7431 proteins that are in FSSP (14 January 2001) but not in CATH1.7 (see Table IV). Within PFC there is a subset of 1617 (about 22%) proteins that are either islands or superislands, that is, they are connected only with other proteins in the subset and therefore they have no connection to proteins with known classification. Thus, the upper bound for this second type of prediction is about 78%.
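As a rough illustration of Eq. (1) and of the four neighborhood categories, the following Python sketch computes the cost of a labeling and sorts each protein into island, colony, border, or interior. The function names, the toy proteins, and their Z-scores are invented for illustration. Restricting the sum in Eq. (1) to pairs with Z ≥ 2.0 is an assumption based on the graph definition given in the text, and the sketch does not reproduce the greedy minimization described in ref. 21.

```python
from collections import Counter

Z_CUTOFF = 2.0  # assumption: pairs with Z below this value carry no edge


def edges(z_scores):
    """Yield (i, j, z) for every pair whose Z-score reaches the cutoff."""
    for (i, j), z in z_scores.items():
        if z >= Z_CUTOFF:
            yield i, j, z


def cost(labels, z_scores):
    """Eq. (1): sum Z_ij over connected pairs placed in different topologies."""
    return sum(z for i, j, z in edges(z_scores) if labels[i] != labels[j])


def neighborhood_category(protein, labels, z_scores):
    """Classify one protein by the labels of its graph neighbors."""
    neighbors = set()
    for i, j, _ in edges(z_scores):
        if i == protein:
            neighbors.add(j)
        elif j == protein:
            neighbors.add(i)
    if not neighbors:
        return "island"
    same = sum(labels[n] == labels[protein] for n in neighbors)
    if same == 0:
        return "colony"
    return "interior" if same == len(neighbors) else "border"


def upper_bound(labels, z_scores):
    """Fraction of proteins that are neither islands nor colonies."""
    counts = Counter(neighborhood_category(p, labels, z_scores) for p in labels)
    return (counts["border"] + counts["interior"]) / len(labels)


# Toy data (hypothetical): six proteins, topology labels, pairwise Z-scores.
labels = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3, "f": 3}
z_scores = {("a", "b"): 8.0, ("b", "c"): 3.0, ("c", "d"): 5.0,
            ("a", "d"): 1.5, ("a", "f"): 4.0}

total_cost = cost(labels, z_scores)
categories = {p: neighborhood_category(p, labels, z_scores) for p in labels}
bound = upper_bound(labels, z_scores)
print(total_cost, categories, bound)
```

Applied to the fractions quoted in the text (5% islands and 0.2% colonies), the same bookkeeping gives roughly 95% as the share of proteins that can be classified correctly at all.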
Evaluating a Classification Prediction Algorithm

Because an algorithm can output either a predicted classification or a “rejection,” if it does not have any prediction, one has to estimate two probabilities: $P_{\mathrm{success}}$ and $P_{\mathrm{reject}}$. Robust estimation of these parameters is produced by cross-validation, a procedure that consists in averaging over many (T) randomly sampled test trials. In each trial, the set is divided into two subsets; one is used for training the algorithm and the other set, of $N_{\mathrm{test}}$ proteins, is used to test the algorithm by comparing its prediction to the true classification. The probability estimates are given by

$$\hat{P}_{\mathrm{success}} = \frac{1}{T} \sum_{t=1}^{T} \frac{N_{\mathrm{success}}}{N_{\mathrm{test}}} \tag{2}$$

$$\hat{P}_{\mathrm{non\text{-}reject}} = 1 - \hat{P}_{\mathrm{reject}} = \frac{1}{T} \sum_{t=1}^{T} \frac{N_{\mathrm{test}} - N_{\mathrm{reject}}}{N_{\mathrm{test}}} \tag{3}$$

Another figure of merit, the purity $P_{\mathrm{pure}}$, is the probability of correctly classifying nonrejected proteins. It is estimated by

$$\hat{P}_{\mathrm{pure}} = \frac{\hat{P}_{\mathrm{success}}}{1 - \hat{P}_{\mathrm{reject}}} \tag{4}$$

REFERENCES

1. Holm L, Sander C. Mapping the protein universe. Science 1996;273:595–602.
2. Thornton JM, Orengo CA, Todd AE, Pearl FMG. Protein folds, functions and evolution. J Mol Biol 1999;293:333–342.
3. Šali A. 100,000 protein structures for the biologist. Nat Struct Biol 1998;5:1029–1032.
4. Martí-Renom MA, Ashley AC, Fiser A, Sanchez R, Melo F, Šali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000;29:291–325.
5. Heger A, Holm L. Towards a covering set of protein family profiles. Prog Biophys Mol Biol 2000;73:321–337.
6. Bowie JU, Lüthy R, Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991;253:164–170.
7. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature 1992;358:86–89.
8. Fisher D, Rice D, Bowie JU, Eisenberg D. Assigning amino acid sequences to 3-dimensional protein folds. FASEB J 1996;10:126–136.
9. Gerstein M, Levitt M. A structural census of the current population of protein sequences. Proc Natl Acad Sci USA 1997;94:11911–11916.
10. Murzin AG. Structural classification of proteins: new superfamilies. Curr Opin Struct Biol 1996;6:386–394.
11. Blundell TL, Mizuguchi K. Structural genomics: an overview. Prog Biophys Mol Biol 2000;73:289–295.
12. Finkelstein AV. Protein structure: What is possible to predict now? Curr Opin Struct Biol 1997;7:60–71.
13. Simons KT, Bonneau R, Ruczinski I, Baker D. Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 1999;37(Suppl 3):171–176.
14. Bernstein F, Koetzle T, Williams G, Meyer EJ, Brice M, Rodgers J, Kennard O, Shimanouchi T, Tasumi M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977;112:535–542.
15. Holm L, Sander C. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res 1997;25:231–234.
16. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH – a hierarchic classification of protein domain structures. Structure 1997;5:1093–1108.
17. Conte LL, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res 2000;28:257–259.
18. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: a database for protein structure alignments for homologous families. Protein Sci 1998;7:2469–2471.
19. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin Struct Biol 1996;6:377–385.
20. Siddiqui AS, Barton GJ. Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci 1995;4:872–884.
21. Getz G. Clustering and classification of protein structures. M.Sc. Thesis, Tel-Aviv University, 1998.
22. Hadley C, Jones DT. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 1999;7:1099–1112.
23. Holm L, Sander C. The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 1994;22:3600–3609.
24. Dietmann S, Park J, Notredame C, Heger A, Lappe M, Holm L. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res 2001;29:55–57.
25. Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall; 1988.
26. Levitt M, Chothia C. Structural patterns in globular proteins. Nature 1976;261:552–558.
27. Taylor WR, Orengo CA. Protein structure alignment. J Mol Biol 1989;208:1–22.
28. Orengo CA, Brown NP, Taylor WR. Fast structure alignment for protein databank searching. Proteins 1992;14:139–167.
29. Bray JE, Todd AE, Pearl FMG, Thornton JM, Orengo CA. The CATH Dictionary of Homologous Superfamilies: a consensus approach to analyze distant structural homologues. Protein Eng 2000;13:153–165.
30. Brenner SE, Koehl P, Levitt M. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000;28:254–256.
31. http://www.weizmann.ac.il/physics/complex/compphys/f2cs/index.html.
32. Swindells MB, Orengo CA, Jones DT, Hutchinson EG, Thornton JM. Contemporary approaches to protein structure classification. Bioessays 1998;20:884–891.
33. Islam SA, Luo J, Sternberg MJE. Identification and analysis of domains in proteins. Protein Eng 1995;8:513–525.
34. Swindells MB. A procedure for detecting structural domains in proteins. Protein Sci 1995;4:103–112.
35. Sowdahamini R, Rufino SD, Blundell TL. Nuclear dynamics and electronic transition in a photosynthetic reaction center. J Am Chem Soc 1997;119:3948–3958.
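The cross-validation estimators of Eqs. (2)–(4), defined in the section "Evaluating a Classification Prediction Algorithm", can be sketched in Python as follows. This is a minimal illustration only: the function name, the tuple layout of a trial, and the trial counts are hypothetical and do not come from the paper.

```python
import math


def estimate_rates(trials):
    """Estimate success, reject and purity rates from cross-validation trials.

    `trials` is a list of (n_test, n_success, n_reject) tuples, one per trial
    (a hypothetical layout chosen for this sketch).
    """
    T = len(trials)
    # Eq. (2): average fraction of test proteins classified correctly.
    p_success = sum(n_succ / n_test for n_test, n_succ, _ in trials) / T
    # Eq. (3): average fraction of test proteins that were not rejected.
    p_non_reject = sum((n_test - n_rej) / n_test
                       for n_test, _, n_rej in trials) / T
    # Eq. (4): purity, undefined when every protein is rejected.
    p_pure = p_success / p_non_reject if p_non_reject > 0 else math.nan
    return p_success, 1.0 - p_non_reject, p_pure


# Made-up counts for three trials.
trials = [(100, 80, 10), (100, 90, 5), (200, 150, 30)]
p_success, p_reject, p_pure = estimate_rates(trials)
print(p_success, p_reject, p_pure)
```

Because each trial contributes a ratio rather than raw counts, trials with different test-set sizes are weighted equally, as in Eqs. (2) and (3).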