'Now take a sheep', the Sergeant said. 'What is a sheep only millions of
little bits of sheepness whirling around and doing intricate convolutions
inside the sheep? What else is it but that?'
(from The Third Policeman, Flann O'Brien)
The Practical Approach Series
A Practical Approach
Edited by
D. Higgins
Department of Biochemistry,
University College, Cork,
Ireland
and
W. Taylor
Division of Mathematical Biology,
National Institute for Medical Research,
The Ridgeway, Mill Hill,
London NW7 1AA, UK
OXFORD
UNIVERSITY PRESS
Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University's objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford New York
Athens Auckland Bangkok Bogota Buenos Aires Cape Town
Chennai Dar es Salaam Delhi Florence Hong Kong Istanbul Karachi
Kolkata Kuala Lumpur Madrid Melbourne Mexico City Mumbai Nairobi
Paris Sao Paulo Shanghai Singapore Taipei Tokyo Toronto Warsaw
with associated companies in Berlin Ibadan
Oxford is a registered trade mark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
© Oxford University Press, 2000
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First published 2000
Reprinted 2001
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press,
or as expressly permitted by law, or under terms agreed with the appropriate
reprographics rights organization. Enquiries concerning reproduction
outside the scope of the above should be sent to the Rights Department,
Oxford University Press, at the address above
You must not circulate this book in any other binding or cover
and you must impose this same condition on any acquirer
A catalogue record for this book is available from the British Library
Library of Congress Cataloging in Publication Data
(Data available)
ISBN 0 19 963791 1 (Hbk.)
ISBN 0 19 963790 3 (Pbk.)
10 9 8 7 6 5 4 3 2 1
Typeset in Swift by Footnote Graphics, Warminster, Wilts
Printed in Great Britain on acid-free paper by
The Bath Press, Avon
Preface
laboratories all over the world, to discover the function of newly sequenced
genes by carrying out FASTA searches of databases of characterized proteins.
Fortunately, by this time the databases were just big enough to give some chance
of finding a similar sequence in a search with a randomly chosen gene. Sadly, the
chances were small initially, but by the early nineties they had risen to 1 in 3 and
now are well over 50%.
By 1990, even FASTA was too slow for some types of search to be carried out
routinely, but this was alleviated by the development of faster and faster
workstations. A parallel development was the use of specialist hardware such as
super-computers or massively parallel computers. These allowed Smith and
Waterman searches to be carried out in seconds and one very successful service
was provided by John Collins and Andrew Coulson in Edinburgh, UK. The snag
with these developments was the sheer cost of these specialist computers and
the great skill required to write the computer code, so networks were important:
if you could not afford a big fast box of specialized chips, you might know
someone who would allow you to use theirs, and you could log on to it using a
computer network.
In 1990, a new program called BLAST appeared. It was written by a collection
of biologists, mathematicians and computer scientists, mainly at the new NCBI,
in Washington DC, USA. It filled a similar niche to the FASTA program but was
an order of magnitude faster for many types of search. It also featured the use of
a probability calculation in order to help rank the importance of the sequences
that were hit in the search (see Chapter 8 for some details). Probability calcula-
tions are now very important in many areas of bioinformatics (such as hidden
Markov models; see Chapter 4).
Before the advent of multiple genome data, this favoured route often came to
a halt before it started: when no similar sequence could be found even to make an
alignment. However, with the genomes of phylogenetically widespread organisms
either completed or promised soon (bacteria, yeast, plasmodium, worm, fly, fish,
man) there is now a good chance of finding proteins from each that can compile
a useful multiple-sequence alignment. At the threading stage (2) in the above
progression, the current problem and worry is that there may not be a protein
structure on which the alignment can be fitted. Failure at this stage generally
compromises any success in the final modelling stage (unless sufficient struc-
tural constraints are available from other experimental sources). This problem
will be eased by structural genomics programmes (often associated with a
genome program) for the large-scale determination of protein structures. As
with the genome, these data will greatly increase the chance of finding at least
one structure onto which the protein can be modelled.
will shed light on the most ancient origins of protein structure and on the distant
relationships between biological systems.
The ultimate aim of Bioinformatics must surely be the complete understand-
ing of an organism—given its genome. This will require the characterization and
modelling of extremely complex systems: not only within the cell but also
including the fantastic network of cell-cell interactions that go to make up an
organism (and how the whole system boot-straps itself). However, as Sergeant
Pluck has told us: what is an organism but only millions of little bits of itself
whirling around and doing intricate convolutions. If a genome can tell us all
these bits (and sure it will be no time till we have the genome for a sheep) then
all we have to do is figure out how it all whirls around. For this, without a doubt,
the Sergeant would have recommended the careful application of algebra—and,
had he known about them, I'm sure he would have used a computer.
D.H. and W.T., 2000
Contents
Preface page v
List of protocols xvii
Abbreviations xix
5 Using HMMER2 82
Overview of using HMMER 83
Making the first alignment 83
Making a profile-HMM from an alignment 84
Finding homologues and extending the alignment 84
6 False positives 85
7 Validating a profile-HMM match 85
8 Practical issues of the theories behind profile-HMMs 86
Overview of profile-HMMs 86
Statistics for profile-HMM 87
Profile-HMM construction 89
Priors and evolutionary information 89
Technical issues 90
References 91
2 A user's primer 217
A simple query 219
Exploiting links between databases 220
Using Views to explore query results 221
Launching analysis tools 223
Overview 225
3 Advanced tools and concepts 225
Refining queries 225
Creating custom Views 230
SRS world wide: using DATABANKS 232
Interfacing with SRS over the network 233
4 SRS server side 236
User's point of view 236
Administrator's point of view 238
5 Where to turn to for help 240
Acknowledgements 241
References 241
PROTOCOL LIST
A user's primer
Performing a simple SRS query 219
Applying a link query to selected entries 220
Displaying selected entries with one of the pre-defined views 223
Launching an external application program for selected entries 223
Advanced tools and concepts
Browsing the index for a database field 227
Search SRS world wide 232
Abbreviations
1 Introduction
As attempts to sequence entire genomes increase the number of protein
sequences by a factor of two each year, the gap between sequence and structural
information stored in public databases is growing rapidly. In stark contrast to
sequencing techniques, experimental methods for structure determination are
time-consuming, and limited in their application, and therefore will not be able
to keep pace with the flood of newly characterized gene products. The develop-
ment of practical methods for predicting protein structure from sequence is
therefore of considerable importance in the field of biology.
Several different approaches have been used to predict protein structure from
sequence, with varying degrees of success. Ab initio methods encompass any
means of calculating co-ordinates for a protein sequence from first principles—
that is, without reference to existing protein structures. Little success has been
seen in this area, with more theory produced than actual useful methodology.
Comparative (or homology) modelling attempts to predict protein structure on
the strength of a protein's sequence similarity to another protein of known
structure (following the theory that similar sequence implies similar structure).
Some success has been achieved, but several limitations to this method, not
least of which are its dependence on alignment quality and the existence of a
good sequence homologue, indicate it is not applicable to a large fraction of pro-
tein sequences. The third main category of protein structure prediction, falling
somewhere between comparative modelling and ab initio prediction, is fold
recognition, or threading.
2 Threading methods
The term 'threading' was first coined in 1992 by Jones et al. (1), but the field has
grown considerably since then with many different methods being proposed:
for example, Godzik and Skolnick (2); Ouzounis et al. (3); Abagyan et al. (4);
Overington et al. (5); Matsuo et al. (6); Madej et al. (7); Lathrop and Smith (8);
DAVID JONES AND CAROLINE HADLEY
Figure 1 An example of a pair of protein structures in the same family. (a) Human myoglobin
[2mm1], (b) pig haemoglobin, alpha chain [2pghA]. At the family level, proteins have higher
sequence identity (in this case, 32%) and have highly similar structures. Figures created
using Molscript (19).
Figure 2 A pair of structures within the same superfamily. (a) A. denitrificans azurin [1azcA],
(b) poplar plastocyanin [1plc]. Members of the same superfamily may have insignificant
sequence identity (16% in this case), but still share most features of the protein fold,
reflecting a common evolutionary origin.
Taylor (9) amongst others. The idea behind threading came about from the
observation that a large percentage of proteins adopt one of a limited number of
folds (Figures 1-3). In fact, just 10 different folds (the 'superfolds') account for 50%
of the known structural similarities between protein superfamilies (18). Thus,
rather than trying to find the correct structure for a protein from the huge
number of all possible conformations available to a polypeptide chain, the cor-
rect (or close to correct) structure is likely to have already been observed and
already stored in a structural database. Of course, in cases where the target
protein shares significant sequence similarity to a protein of known 3-D struc-
ture, the 'fold recognition' problem is trivial—simple sequence comparison will
identify the correct fold. The hope was, however, that threading might be able
to detect structural similarities that are not accompanied by any detectable
sequence similarity, and this has subsequently been proven to be the case.
THREADING METHODS FOR PROTEIN STRUCTURE PREDICTION
Figure 4 This is an outline of the fold recognition approach to protein structure prediction,
and identifies three clear aspects of the problem that need consideration: a fold library, a
method for modelling the object sequence on each fold, and a means for assessing the
goodness-of-fit between the sequence and the structure.
the energy calculations, etc., choosing the best 'fit'). Significant progress may
also arise from improvements in the threading library used (i.e. the templates
upon which the sequences will be threaded).
To get some idea of the variety of methods which have been developed, four
distinct approaches to the fold-recognition problem will be described. Virtually
all fold-recognition methods are similar to at least one of these methods, and
some newer methods incorporate concepts from more than one.
is only expected to work for family or superfamily level similarities between the
target and template proteins. This is both a positive and negative feature of the
method. The negative aspect is that, of course, many purely structural similari-
ties will not be detected by the method. The positive aspect is that superfamily
relationships produce the most reliable results, and also allow some aspects of
the function of the target protein to be inferred from the matched template
structure. This latter point is particularly useful when annotating unknown
genome sequences. Figure 5 shows the current applicability of different types of
fold recognition method to a genome such as that of M. genitalium.
Unlike full threading methods, which require a great deal of computer power
to run, this type of method can be made readily available to the public via a
simple Web server. The GenTHREADER method is available from the following
URL:
http://globin.bio.warwick.ac.uk/psipred
perform in real situations where the answers are not known at the time the
predictions are made. It was not until these methods were tested in a set of blind
trials—the Critical Assessment in Structure Prediction experiments (CASP)—
that it became clear how powerful these methods could be when used without
prior knowledge of the correct answer. The CASP experiment has now been run
three times (CASP1 in 1994, CASP2 in 1996, CASP3 in 1998), and in the last meeting
results from over 30 methods were evaluated by the independent assessors.
Up to date information on all of the CASP experiments can be obtained from the
following Web address:
http://predictioncenter.llnl.gov
5 The future
One major difference between the academic challenge of protein structure
prediction and the practical applications of such methods is that in the latter
case there is an eventual end in sight. As more structures are solved, more target
sequences will find matches in the available fold libraries—matched either by
sequence comparison or threading methods. In terms of practical application,
the protein-folding problem will thus begin to vanish. There will of course still
be a need to better understand protein-folding for applications such as de novo
protein design, and the problem of modelling membrane protein structure will
probably remain unsolved for some time to come, but nonetheless, from a
practical viewpoint, the problem will be effectively solved. How long until this
point is reached? Given the variety of estimates for the number of naturally
occurring protein folds, it is difficult to come to a definite conclusion, but taking
an average of the published estimates for the number of naturally occurring
protein folds and applying some intelligent guesswork, it seems likely that when
threading fold libraries contain around 1500 different domain folds it will be
possible to build useful models for almost every globular protein sequence in a
given proteome. At the present rate at which protein structures are being
solved, this point is possibly 15-20 years away. However, pilot projects are now
underway to explore the possibility of crystallizing every globular protein in a
typical bacterial proteome. If such projects get fully under way, which seems
likely, then a complete domain fold library may be only five years away.
References
1. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). A new approach to protein fold
recognition. Nature, 358, 86.
2. Godzik, A. and Skolnick, J. (1992). Sequence-structure matching in globular proteins:
Application to supersecondary and tertiary structure determination. Proc. Natl. Acad.
Sci. USA, 89, 12098.
3. Ouzounis, C., Sander, C., Scharf, M., and Schneider, R. (1993). Prediction of protein
structure by evaluation of sequence-structure fitness. Aligning sequences to contact
profiles derived from three-dimensional structures. J. Mol. Biol., 232, 805.
4. Abagyan, R., Frishman, D., and Argos, P. (1994). Recognition of distantly related
proteins through energy calculations. Proteins: Struct. Funct. Genet., 19, 132.
5. Overington, J., Donnelly, D., Johnson, M. S., Sali, A., and Blundell, T. L. (1992).
Environment-specific amino-acid substitution tables—tertiary templates and
prediction of protein folds. Protein Sci., 1, 216.
6. Matsuo, Y., Nakamura, H., and Nishikawa, K. (1995). Detection of protein 3-D-1-D
compatibility characterised by the evaluation of side-chain packing and electrostatic
interactions. J. Biochem. (Japan), 118, 137.
7. Madej, T., Gibrat, J.-F., and Bryant, S. H. (1995). Threading a database of protein cores.
Proteins, 23, 356.
8. Lathrop, R. H. and Smith, T. F. (1996). Global optimum protein threading with gapped
alignment and empirical pair score functions. J. Mol. Biol., 255, 641.
9. Taylor, W. R. (1997). Multiple sequence threading: An analysis of alignment quality
and stability. J. Mol. Biol., 269, 902.
10. Bowie, J. U., Luthy, R., and Eisenberg, D. (1991). A method to identify protein
sequences that fold into a known three-dimensional structure. Science, 253, 164.
11. Taylor, W. R. and Orengo, C. A. (1989). Protein structure alignment. J. Mol. Biol.,
208, 1.
12. Jones, D. T. (1998). THREADER: Protein Sequence Threading by Double Dynamic
Programming. In Computational methods in molecular biology (ed. S. Salzberg, D. Searls,
and S. Kasif). Elsevier, Amsterdam.
13. Sippl, M. J. (1990). Calculation of conformational ensembles from potentials of mean
force. An approach to the knowledge-based prediction of local structures in globular
proteins. J. Mol. Biol., 213, 859.
14. Rost, B. (1997). Protein fold recognition by prediction-based threading. J. Mol. Biol.,
270, 1.
15. Jones, D. T. (1999). GenTHREADER: An efficient and reliable protein fold recognition
method for genomic sequences. J. Mol. Biol., 287, 797.
16. Murzin, A. G. and Bateman, A. (1997). Distant homology recognition using structural
classification of proteins. Proteins Suppl., 1, 105.
17. Russell, R. B. and Barton, G. J. (1994). Structural features can be unconserved in
proteins with similar folds—an analysis of side-chain to side-chain contacts,
secondary structure and accessibility. J. Mol. Biol., 244, 332.
18. Orengo, C. A., Jones, D. T., and Thornton, J. M. (1994). Protein superfamilies and
domain superfolds. Nature, 372, 631.
19. Kraulis, P. J. (1991). Molscript—a program to produce both detailed and schematic
plots of protein structures. J. Appl. Crystallogr., 24, 946.
20. Jones, D. T., and Thornton, J. M. (1996). Potential energy functions for threading. Curr.
Opin. Struct. Biol., 6, 210.
Chapter 2
Comparison of protein
three-dimensional structures
Mark S. Johnson and Jukka V. Lehtonen
Department of Biochemistry and Pharmacy, Åbo Akademi University, Tykistökatu
6 A, 20520 Turku, Finland.
1 Introduction
In this chapter we define the different types of questions that may be asked
through the comparison of the three-dimensional (3-D) structures of proteins,
how to make the comparisons necessary to answer each question, and how to
interpret them. We shall focus on the different strategies used, and the assump-
tions made within typical computer programs that are available.
Protein structure comparisons are often used to highlight the similarities and
differences among related—homologous—3-D structures. Homologous proteins are
descended from a common ancestral protein, but have subsequently duplicated,
evolved along separate paths, and thus changed over time. The independent
evolution of related proteins with the same function, orthologous proteins, which
are found in different species, and the paralogous proteins, which have evolved
different functions, all retain information on the original relationship. The amino
acid sequences change over time reflecting the mutations, insertions and de-
letions that occur in their genes during evolution, and for many proteins the
sequences themselves are so similar that common ancestry is apparent. For
others, the sequences can be so dissimilar that the case for homology may be diffi-
cult to make on the basis of the primary structure. Nonetheless, comparing the
3-D structures when they are available can identify homologous proteins. This is
possible since the evolution of proteins occurs such that their folds are highly
conserved even though the sequences that encode them may not be recognizably
similar.
Homologous proteins are often compared in order to highlight features
(typically the amino acids and their relative orientations to one another), which
have come under strong evolutionary pressure not to change because of struc-
tural and functional restraints placed on them. Conversely, differences in an
otherwise conserved active site or binding site are used to explain differences in
observed function.
Dayhoff and coworkers (1) long ago predicted that about 1000 different protein
families should exist in nature, and it has become clear over recent years that
most newly-solved 3-D structures do fall into an existing family of structures
(2, 3). The approximately 100 000 proteins encoded in the human genome, whose
sequences will be known early in this century, will fall within this limited
number of families. Thus, one key bioinformational goal has been to compare
and classify all proteins and their component domains into family groups, and
one immediate goal is to solve at least one representative structure for each
sequence family that is not obviously connected to any existing structural family.
This single representative structure can then be used in knowledge-based
modelling (4) to estimate the 3-D structures for other members of the family.
Comparisons are also made among non-homologous proteins to try and high-
light structural features that are locally similar, but whose present-day sequences
have not arisen as a consequence of evolutionary divergence from a common
ancestor. Classic examples include the active site similarities among serine
proteinases, subtilisins and serine carboxypeptidase II (5), each of which invokes
the participation of histidine, serine and an aspartic acid in their proteolytic
mechanism of action. The folds are different and the relative positions of these
key amino acids along the sequence are different too. In the 3-D structures, how-
ever, the residues are similarly positioned to reproduce a common catalytic
mechanism that has been exploited by nature on at least three separate occasions.
Comparisons among non-homologous proteins can highlight structural units
that are common features of the protein fold and comparisons have been made
to classify amino acid conformations, regular elements of secondary structure
(helices, strands, turns), supersecondary structure, and cofactor and ligand bind-
ing sites.
The comparison of protein structures can be achieved in many different ways.
In this chapter, we present several of the basic procedures used in the wide
variety of programs that have been developed over the years. These methods
range from rigid-body comparisons, to methods more typical of sequence com-
parisons—dynamic programming, and to those methods that employ Monte
Carlo simulations, simulated annealing and genetic algorithms to find solutions
for combinatorially-complex structural comparison problems. We will describe
methods that demand partial solutions as input to the procedure, as well as
strategies for automatic hands-off solutions; and approaches to both homologous
and non-homologous structural comparisons.
and z directions in the co-ordinate system. The rotation matrix describes the α, β,
and γ rotations in the three orthogonal planes. One of the main tasks of many
superposition procedures is to define these values and then to apply them to
the co-ordinates of the objects, which will then be superposed on each other.
Two identical objects will have all points superposed exactly.
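As a rough illustration of applying a rotation matrix and a translation vector to a set of co-ordinates, the following Python sketch may help. The function names, the composition order of the three rotations (about z, then y, then x, written as Rz·Ry·Rx), and the choice to rotate before translating are assumptions made for this example only; they are not taken from any particular program described in this chapter.

```python
import math


def rotation_matrix(alpha, beta, gamma):
    """Compose rotations about x (alpha), y (beta) and z (gamma), in radians.

    Assumed convention for this sketch: R = Rz(gamma) * Ry(beta) * Rx(alpha).
    """
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def transform(points, rotation, translation):
    """Rotate each (x, y, z) point, then add the translation vector."""
    moved = []
    for x, y, z in points:
        moved.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + shift
            for row, shift in zip(rotation, translation)
        ))
    return moved


# Quarter turn about z, followed by a shift of 5 units along z.
quarter_turn = rotation_matrix(0.0, 0.0, math.pi / 2)
print(transform([(1.0, 0.0, 0.0)], quarter_turn, (0.0, 0.0, 5.0)))
```

Applying the same rotation and translation to every atom of one structure moves it as a rigid body, which is why identical objects can be made to coincide exactly.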
The major difficulty with non-identical objects, such as a pair of protein
structures, is that they typically have different numbers of amino acids, different
amino acids with different numbers, types and connectivities of atoms. Further-
more, amino acids present in one structure can be missing in the other: inser-
tions and deletions—the gaps seen in a sequence alignment. Thus, except in the
case of one protein co-ordinate set being compared with itself, no two proteins
will have atoms in exactly identical positions. A protein whose structure has
been solved several times will also vary with overall differences in the main
chain co-ordinates of no more than about 0.3 Å, but they will be different.
The superposition of most protein structures as rigid-bodies, therefore, is not
straightforward, and several different considerations need to be resolved in
advance of the comparison. These include:
Properties
(a) Residues                                   (b) Segments
Identity                                       Secondary structure type
Physical properties                            Amphipathicity
Local conformation                             Improper dihedral angle
Distance from gravity centre                   Distance from gravity centre
Number of neighbours in vicinity               Average Cα density
Position in space                              Position in space
Global direction in space                      Global direction
Main chain accessibility                       Main chain accessibility
Side chain accessibility                       Side chain accessibility
Main chain orientation                         Orientation relative to gravity centre
Side chain orientation
Main chain dihedral angles

Relations
(a) Residues                                   (b) Segments
Disulfide bond                                 Relative orientation of two or more segments
Vectors^b to one or more nearest neighbours    Vectors^b to one or more nearest neighbours
Distances to one or more nearest neighbours    Distances to one or more nearest neighbours
(e.g. atom pairs or contact maps)
Change in number of neighbours in vicinity
Ionic bond
Hydrogen bond
Hydrophobic cluster

a See refs 7, 8, 10, 11.
b Vector defines both distance and direction in the local reference frame.
β-strands, matching closely and sequentially along the fold of the two structures.
Differences in Cα-atom traces are more often seen at loop regions that
connect the strands and helices in proteins. Frequently these loop regions are
exposed to the solvent at the surface of the protein and thus have fewer
constraints placed on their conformations.
For more dissimilar protein structures, rigid body movements and other
structural changes can occur in one structure relative to the other. When this
happens, rigid-body comparisons of the 3-D structures can often lead to poorly
matched structures, although the folds are the same. If these changes are not
large, then dynamic programming procedures (6) that consider only Cα-Cα atom
distances or other structural properties of the amino acids (Table 1) after an
initial rigid-body comparison can be quite effective in matching all residues
from the protein structures (7-9). Others have described automated procedures
that involve the comparison of structural relationships that require special
techniques to solve these problems of combinatorial complexity (7, 8, 10, 11).
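To make the idea of dynamic programming on Cα-Cα distances more concrete, here is a minimal Python sketch. It assumes the two Cα traces have already been superposed as rigid bodies and are given as lists of (x, y, z) tuples. The scoring scheme (a distance cutoff minus the inter-atomic distance), the linear gap penalty, and the default values are illustrative assumptions, not parameters of any published method.

```python
import math


def align_by_distance(ca_a, ca_b, cutoff=3.5, gap=1.0):
    """Globally align two superposed C-alpha traces.

    Pairing residue i of A with residue j of B scores cutoff - distance,
    so only pairs closer than `cutoff` contribute positively.
    Returns the list of matched (index_in_a, index_in_b) pairs.
    """
    n, m = len(ca_a), len(ca_b)

    def pair_score(i, j):
        return cutoff - math.dist(ca_a[i], ca_b[j])

    # Fill the score matrix; row 0 and column 0 hold pure gap penalties.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i - 1, j - 1),
                score[i - 1][j] - gap,
                score[i][j - 1] - gap,
            )

    # Trace back from the corner to recover the equivalent positions.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + pair_score(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

The matched pairs returned here could then serve as the equivalent positions for a further round of rigid-body superposition.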
algorithm is used to give the best 'sequence alignment' based on the structural
features that have been supplied.
(a) Property comparisons may require an initial alignment (e.g. rigid-body).
(b) Relationships can be aligned by a variety of methods, e.g. Monte Carlo simula-
tions (11), simulated annealing (7), double dynamic programming (10), genetic
algorithms.
3 The structures can subsequently be superposed according to the matches in the
alignment, but a single global superposition may be meaningless when large move-
ments, such as domain movements, have taken place. In that case, each domain
should be superposed separately.
where ℜ is the rotation matrix being sought that minimizes the differences
between a total of N equivalent co-ordinate sets A^eq from the first protein and B^eq
from the second protein; w is a weighting that can be applied to each ith pair of
equivalent positions.
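The quantities named above can be illustrated with a short Python sketch of a weighted least-squares residual. The form used here, a weighted sum of squared distances between the rotated positions of the first protein and the positions of the second, is one common way of writing such a residual and is an assumption of this example; the function name and the choice to rotate the first set rather than the second are likewise illustrative.

```python
def weighted_residual(a_eq, b_eq, rotation, weights=None):
    """Sum over i of w_i * |R * a_i - b_i|^2 for N equivalent positions.

    a_eq, b_eq: equal-length lists of (x, y, z) tuples.
    rotation:   3 x 3 rotation matrix as a list of rows.
    weights:    optional per-pair weights (defaults to 1.0 for every pair).
    """
    if len(a_eq) != len(b_eq):
        raise ValueError("equivalent sets must contain the same number of atoms")
    if weights is None:
        weights = [1.0] * len(a_eq)
    total = 0.0
    for (ax, ay, az), b, w in zip(a_eq, b_eq, weights):
        rotated = [row[0] * ax + row[1] * ay + row[2] * az for row in rotation]
        total += w * sum((r - c) ** 2 for r, c in zip(rotated, b))
    return total
```

Setting a weight to zero removes that pair from the fit, while larger weights force the superposition to favour positions judged to be more reliable.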
Numerous methods have been developed to solve this pairwise least-squares
problem in a variety of different ways (12-16). Others have described more gen-
eral methods suitable for the least-squares comparison of more than two three-
dimensional structures (17, 18). In our experience, the method of Kearsley (19)
is a straightforward and simple means to obtain the optimal rotation matrix
for a set of equivalent co-ordinates. We will only consider this procedure here
(Protocol 2).
The major obstacle to solving the least-squares problem is that matched atom
pairs from the two structures to be compared need to be specified to the
algorithm at the beginning of any calculations. Thus, the computer program
requires some idea of the final alignment before it can proceed. There are
common situations where the comparisons would be made over a pre-defined
set of residues: for example, (a) comparisons over residues that line an active site
or binding site—to highlight similarities and differences over those positions; (b)
comparisons of independent structure solutions for the same protein. In these
cases, the atomic positions to be compared are usually known a priori, and a
single round of rigid-body comparison is sufficient to obtain the optimal match.
Frequently, however, global comparisons are made between proteins where the
best-matched positions are not obvious in advance. In the case of similar protein
structures, the requirement of an initial set of matches to seed the comparison
is inconvenient at best, requires the preanalysis of the proteins involved, and in
the case of more dissimilar proteins, may be difficult to define. Additionally, we
have often observed that when part of the answer is specified at the beginning
of the comparison, then the final solution can be prejudiced to give a final result
that is not necessarily the optimal one: the comparison is locked into a set of
possible solutions by the information supplied to seed the procedure. Despite
these criticisms, there are many good methods that employ this strategy.
For example, Sutcliffe et al. (16) specify a set of at least three Cα-atoms common to
the two structures (3 positions define a unique plane in each structure). Good
candidates for these common residues, supplied a priori, can be conserved resi-
dues at an active site or ligand binding site, be positions conserved in terms of
the sequence similarity, or can be equivalent positions observed to form part of
the common fold when the proteins are examined on a graphics device. This
and other similar methods use an iterative procedure to progress towards better
and better solutions that incorporate more and more equivalent atom pairs. (Later
in this chapter we will detail several automatic strategies that have been used to
get around this need for predetermining a set of equivalent atoms at the onset
of the structural comparison.)
In the equation describing the residual (above), A^eq and B^eq contain the x, y,
and z axes co-ordinates for exactly the same number of atoms from each of the
two structures. These atoms are termed equivalent positions, and are those aligned
positions that the superposition will now be calculated for. All other atoms in
the molecules are ignored in determining the superposition, but the translation
vector and the rotation matrix determined on the basis of these equivalent
positions is subsequently applied to all atoms in the co-ordinate file, including
any bound ligand, metal ions, and water molecules. Here, we will detail how to
calculate the translation vector for each protein and describe one simple yet
elegant method for determining the rotation matrix, developed by Kearsley (19,
20), which we use as the method of choice for our own procedures (Protacd 2).
In other words, sum all of the x co-ordinates together and divide by N to give the
average x co-ordinate for the equivalent set of atoms; repeat for the y and z co-
ordinates. Repeat for the corresponding N equivalent atoms in the second structure:
Thus, the centre of mass is a single x, y, and z co-ordinate set for each of the proteins.
2 Translate both structures, all atoms in the file, so that their centres of mass (accord-
ing to the set of equivalent atoms used) are located at the origin of the co-ordinate
system. For every atom i in the first structure:
COMPARISON OF PROTEIN THREE-DIMENSIONAL STRUCTURES
In other words, subtract the x, y, and z co-ordinate values for the centre of mass
from the x, y, and z co-ordinate values for every atom in the co-ordinate file. Repeat
for the second structure:
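The translation steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `centroid` and `translate_to_origin` and the toy co-ordinates are assumptions made for the example, and the 'centre of mass' is taken, as in the text, to be the unweighted average of the equivalent atom positions.

```python
def centroid(atoms):
    """Average x, y and z over a list of (x, y, z) tuples."""
    n = len(atoms)
    return tuple(sum(atom[k] for atom in atoms) / n for k in range(3))


def translate_to_origin(all_atoms, equivalent_atoms):
    """Shift every atom so the centroid of the equivalent set sits at the origin."""
    cx, cy, cz = centroid(equivalent_atoms)
    return [(x - cx, y - cy, z - cz) for x, y, z in all_atoms]


# Hypothetical toy structure: three equivalent C-alpha atoms plus one ligand atom.
structure = [(1.0, 2.0, 3.0), (3.0, 2.0, 1.0), (2.0, 5.0, 2.0), (10.0, 10.0, 10.0)]
equivalent = structure[:3]
shifted = translate_to_origin(structure, equivalent)
```

Note that the ligand atom is moved by the same vector even though it played no part in defining the centroid; the same call would be repeated for the second structure with its own equivalent set.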
The rotation matrix: the Kearsley method (ref. 19) minimizes the average
difference between sets of atoms using quaternion algebra
1 Generate a symmetric 4 × 4 matrix by adding selected combinations of differences
and sums of co-ordinates calculated for each matched pair of equivalent atoms to
the elements of the matrix (19). These are the co-centred co-ordinates, but only the
co-ordinates of equivalent matched atom pairs, $A^{eq}$ and $B^{eq}$, are used at this stage.
2 Diagonalize the 4 × 4 matrix in order to obtain its eigenvalues and eigenvectors
(see ref. 21 for general procedures).
3 Select the lowest eigenvalue and use elements of the corresponding eigenvector to
construct the 3 × 3 rotation matrix R (see ref. 19 for details).
4 Multiplication of each co-ordinate in the second structure B by R will produce the
superposition of the entire structure onto protein A, where the average distance
between matched atoms of the equivalent set is a minimum: B(trans., rot.) = R ×
B(trans.).
5 The selected eigenvalue divided by the number of atom pairs in the equivalent set is
equal to the square of the RMSD after rotation. R, calculated above, leads to the
superposition whose RMSD is a minimum for these sets of equivalent atoms.
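The five steps above can be illustrated with a short NumPy sketch. It is written under stated assumptions and is not taken from ref. 19: the function name `kearsley_rotation` is hypothetical, the 4 × 4 matrix is accumulated as a sum of products of small per-pair matrices (which gives the same symmetric matrix of differences and sums), the quaternion is taken in (w, x, y, z) order, and the test co-ordinates are random numbers rather than protein data.

```python
import numpy as np


def kearsley_rotation(a, b):
    """Return (R, rmsd) such that rotating the rows of b by R best fits a.

    a, b: (N, 3) arrays of equivalent atoms, each already centred on the origin.
    """
    diff = a - b   # co-ordinate differences for each matched pair
    summ = a + b   # co-ordinate sums for each matched pair
    q_matrix = np.zeros((4, 4))
    for (m1, m2, m3), (p1, p2, p3) in zip(diff, summ):
        # For a unit quaternion q, |a_i - rotate(b_i)|^2 equals |M_i q|^2.
        m_i = np.array([
            [0.0, -m1, -m2, -m3],
            [m1, 0.0, -p3, p2],
            [m2, p3, 0.0, -p1],
            [m3, -p2, p1, 0.0],
        ])
        q_matrix += m_i.T @ m_i          # step 1: symmetric 4 x 4 matrix
    eigenvalues, eigenvectors = np.linalg.eigh(q_matrix)  # step 2, ascending order
    w, x, y, z = eigenvectors[:, 0]      # step 3: eigenvector of lowest eigenvalue
    rotation = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])
    rmsd = float(np.sqrt(max(eigenvalues[0], 0.0) / len(a)))  # step 5
    return rotation, rmsd


# Toy check: b is a rotated copy of a, so the fit should be essentially exact.
rng = np.random.default_rng(0)
a_eq = rng.normal(size=(10, 3))
a_eq -= a_eq.mean(axis=0)
angle = np.radians(40.0)
true_rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle), np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
b_eq = a_eq @ true_rot                   # each row of a rotated by the inverse
rotation, rmsd = kearsley_rotation(a_eq, b_eq)
b_fitted = b_eq @ rotation.T             # step 4: apply R to every atom of B
```

Because `np.linalg.eigh` returns eigenvalues in ascending order, the first column of the eigenvector matrix is the one required in step 3; the sign of the eigenvector is arbitrary but does not affect the rotation matrix.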
Figure 1 Reduction in the extent of the common equivalent matches in pairwise structural
superpositions as a function of decreasing percentage sequence identity. Traces of the
backbones are shown for Cα-positions within 2.5 Å after rigid-body superposition with the
computer program MNYFIT (16). The haemoglobin α-chain of Pagothenia bernacchii (Protein
Data Bank (PDB, ref. 51) code: 1PBX) is aligned in (a) with the α-chain of equine
haemoglobin (2MHB) and in (b) with the β-chain of human haemoglobin (2HHB). (c) The
human haemoglobin β-chain (2HHB) aligned with the sea lamprey globin (2LHB). (d) The
erythrocruorin of Chironomus thummi thummi (1ECD) aligned with the leghaemoglobin of
Lupinus luteus (1LH1). (From ref. 4, with permission.)
(c) The dynamic programming algorithm produces a full alignment of all positions
in the structures (residues are aligned with each other or with gaps),
while the rigid-body methods align fewer and fewer positions in the structures
as the sequence similarity decreases (Figure 1).
(d) Dynamic programming algorithms do not give a superposition of the
structures suitable for visualization. This can be obtained from the alignment
by applying the rigid body method to the defined matched pairs.
(e) Dynamic programming can often lead to alignments of the structures where
rigid-body movements have occurred in the structures themselves, for
example, the large movements of entire domains seen between the liganded
and unliganded structures of the periplasmic bacterial lysine-arginine-
ornithine binding protein (Figure 2). Rigid-body comparisons can be applied,
Figure 2 Two different conformations of the 3-D structure for the same protein, the
lysine-arginine-ornithine binding protein from Salmonella typhimurium. Left: the structure of
the protein in complex with lysine (1LST), lysine not shown. Right: the uncomplexed structure
(2LAO). The smaller domain in the upper part of the figure is in the same orientation, and the
arrow pointing to the Cα-atom of Glu 216 illustrates the magnitude of the movement of the
larger domain at the bottom of the figure. Figure prepared with MOLSCRIPT (52).
In rigid-body comparisons, where the Cα-atoms of the protein backbone have
been used as a basis for comparison, a distance cut-off typically in the range 2.5 Å
to 4.5 Å has been used. Values above 3 Å lead to more multiple matches to a
single atom, because the distance between two consecutive Cα-atoms along the protein
backbone is only about 3.8 Å. Lower values will reduce the number of equivalent
matches when more dissimilar proteins are compared. Distances or dissimilarity
measures will also be required for the comparison of other structural features,
both properties and relationships (see ref. 7 for example). Common to both rigid-
body methods, which rely on simple distance data, and other methods, which
incorporate other types of information into the alignment process, is the need
to determine the matching of locations between the structures to be superposed
(Protocol 3). This can be part of an iterative procedure to provide a new set of
equivalent atoms that are then used to determine a new translation vector and
rotation matrix in order to improve a match. It is also one of the final steps in
any comparison procedure, where the resultant alignment is determined. Three
basic approaches have been used: (a) dynamic programming, (b) graph-theoretical
match list handling and clique detection methods, and (c) methods more
suitable for solving combinatorially complex matching problems.
The Needleman and Wunsch (6) method is a convenient, fast method for
aligning proteins. By scoring all possible pairs of matches between two structures,
the method ensures that the optimal scoring solution is found for the
scoring scheme employed. The method accommodates a loss of elements in one
structure relative to another: the gaps corresponding to insertions and deletions.
Thus, the method provides a full alignment where every residue position
in each protein is matched to either a residue position in the other protein
or a gap. In this way, the method can efficiently resolve the multiple matching and
many combinatorial problems seen with the list sorting procedure. Once structural
relationships have been equivalenced between a pair of structures, this
information can also be used within the dynamic programming method.
With the match list sorting procedure (for example, ref. 22), possible equivalent
matches between the proteins are tabulated: matches of protein B to each
position in protein A in one list, and matches of protein A to protein B in a
second list. These lists contain authentic matches of conserved structure,
chance matches that need to be eliminated from the lists, and multiple matches
between one element in one protein and several different elements in the other
protein. The challenge, then, is to cull these lists by keeping the best matches
(i.e. matches that can extend a series of previous matches, have a good matching
score, or give a good fit), removing structurally unlikely matches (matches that
are not co-linear, i.e. out of sequence with other matches, and isolated matches
that do not extend other matches), and reducing multiple matches to
single matches.
A more elaborate approach was introduced by Mitchell et al. (23). Their
method does not filter out extraneous matches, but instead tests each com-
bination of matches to find the optimal equivalent set. As a result, a 'clique', the
maximum sub-graph common to two graphs representing the structures is
found. The clique detection algorithm is based on graph theory and offers a way
to find similar parts of structures that have not been superimposed. The basic
idea is to represent each structure as a graph of nodes and vertices (the
connections between nodes, usually called edges in graph theory). Each node
corresponds to either an atom, piece of main chain, secondary structure element,
or similar definite piece of structure. Each vertex is a relation between two
nodes in a structure: the distance between the atoms, the vector from one atom to
another (both distance and direction in some co-ordinate frame), the distance and
angle between two secondary structure elements, or a more complicated
distance measure involving other properties of the nodes. If two structures
contain a similar substructure, then the nodes belonging to that substructure are
connected in both structures by very similar vertices. The task is to find the
maximal common sub-graph from the set of possible common sub-graphs. While
this is an NP-complete task, it is feasible due to (a) efficient search algorithms developed
within graph theory, and (b) the use of (few) secondary structure elements (SSEs)
as the compared pieces of the structures instead of (many) atoms. Several other
programs have been described that use a very similar approach (see ref. 24 and
citations therein); the main differences are in the ways structures are represented
and in the method used to reduce the search space for efficiency.
The comparison of relationships among features in one structure relative to
another is a powerful addition to any structural comparison procedure (see ref. 7
for an excellent discussion). Relationships, such as patterns of hydrogen bonding,
involve the comparison of a minimum of two residue positions for every
hydrogen bond in both structures. In certain cases, e.g. in the method of Taylor
and Orengo (10), relationships (in this case inter-atomic vectors) are compared
using their novel double dynamic programming method. More often, the
matching of relationships is treated as a combinatorially intensive task: there
are many candidate pairs of hydrogen bonds in each structure, and matching
them relies on methods such as simulated annealing, Monte Carlo simulations,
and genetic algorithms.
features of the protein (see Table 1), or similarity scores derived from distances. In
this description, we will refer to a matrix filled with similarity scores derived from
distances.
4 Beginning at one corner (amino-terminal end or carboxyl-terminal end of the
sequences) of the matrix and heading towards the opposite corner, sum diagonal
values to the current position if they are the best score (a residue-residue match), or
sum with an off-diagonal score minus a penalty (indicates a possible gap in one
protein or the other).
5 The largest value found at one edge of the matrix specifies the first two aligned
positions and gives the optimal alignment score for the comparison.
6 The full alignment that produced the optimal score can be traced beginning at the
highest value and progressing towards the opposite side of the matrix by following
the next best score in the matrix. When the next highest value is on the diagonal,
residues are matched in sequence; when an off-diagonal score (less a penalty) is the
next best choice, then a gap is indicated.
7 This method produces the full alignment including gap regions, but elements within
a cut-off value can be used to determine the rigid-body superposition of the structures.
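The summation and trace-back steps can be sketched as follows. This is a simplified illustration, not the exact procedure of the protocol: it assumes a plain global alignment with a single linear gap penalty that is also charged at the ends, and the function name `needleman_wunsch`, the parameter `gap`, and the toy similarity matrix are all hypothetical.

```python
def needleman_wunsch(sim, gap=1.0):
    """Global alignment over a similarity matrix sim[i][j] (rows: A, columns: B).

    Returns the optimal score and a list of (i, j) pairs, where None marks a gap.
    """
    n, m = len(sim), len(sim[0])
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = -gap * i
    for j in range(1, m + 1):
        f[0][j] = -gap * j
    # Fill: best of a diagonal step (match) or an off-diagonal step (gap, penalised).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(f[i - 1][j - 1] + sim[i - 1][j - 1],
                          f[i - 1][j] - gap,
                          f[i][j - 1] - gap)
    # Trace back from the final corner by following the step that produced each score.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or f[i][j] == f[i - 1][j] - gap):
            pairs.append((i - 1, None))   # residue of A opposite a gap
            i -= 1
        else:
            pairs.append((None, j - 1))   # residue of B opposite a gap
            j -= 1
    pairs.reverse()
    return f[n][m], pairs


# Toy similarity scores: B carries one inserted residue (column 1) relative to A.
similarity = [[2, -1, -1, -1],
              [-1, -1, 2, -1],
              [-1, -1, -1, 2]]
score, alignment = needleman_wunsch(similarity, gap=1.0)
```

In a structural comparison the entries of `similarity` would be derived from the distances between superposed atoms (for example, high scores for close pairs), and only the aligned pairs within the cut-off would be passed on to the rigid-body superposition.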
Clique detection methods (23, 25)
1 Represent each structure as a graph of nodes (Cα-atoms or secondary structure
elements) and vertices connecting the nodes (the connections usually called edges
in graph theory). Each vertex is a distance between the two connected nodes (atoms).
2 List for each vertex in structure A all such vertices in structure B that are similar
within an error threshold (i.e. vertices connecting the same kind of nodes with
similar distances).
3 Find the maximal common sub-graph (the largest set of nodes and vertices that exists
in both structure graphs) using a tree search algorithm, Monte Carlo simulation, or
a genetic algorithm. Each vertex in the common sub-graph corresponds uniquely to
one vertex in both structures A and B.
4 The nodes included in the sub-graph are equivalent for the two structures. If the
nodes are atoms, the superposition can be made directly (see Protocol 2). Also, the
secondary structure elements can be superimposed as if they were atoms of a rigid
molecule, or the Cα-atoms within the SSEs can be superimposed.
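One common way to realise steps 1 to 3 is to build a 'correspondence graph' whose nodes are candidate pairings of a node in A with a node in B, and then search it for the largest clique. The sketch below takes that route with a Bron-Kerbosch tree search; it is an assumption-laden illustration, not the algorithm of refs 23 or 25. The names `correspondence_graph` and `max_clique`, the tolerance `tol`, and the toy co-ordinates are hypothetical, and the nodes are treated as unlabelled points compared by distance only (so mirror images are not distinguished).

```python
from itertools import combinations
from math import dist


def correspondence_graph(a, b, tol=0.5):
    """Nodes are pairings (i, j); two pairings are linked if the A-A and B-B
    distances they imply agree within tol."""
    nodes = [(i, j) for i in range(len(a)) for j in range(len(b))]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l and abs(dist(a[i], a[k]) - dist(b[j], b[l])) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def max_clique(adj):
    """Largest clique by a Bron-Kerbosch tree search with pivoting."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        if len(r) + len(p) <= len(best):
            return                          # cannot beat the best clique so far
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand([], set(adj), set())
    return sorted(best)


# Toy data: B holds the four points of A (translated and reordered) plus one extra.
struct_a = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (0, 0, 5)]
struct_b = [(10, 10, 15), (10, 10, 10), (60, 60, 60), (10, 14, 10), (13, 10, 10)]
equivalences = max_clique(correspondence_graph(struct_a, struct_b, tol=0.1))
```

The clique returned is a list of (node in A, node in B) pairings that can be handed directly to the superposition procedure. The search is exponential in the worst case, which is why the text recommends a small number of secondary structure elements as nodes.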
Match list approaches (22)
This method is a variation of the clique detection method, which assumes that the
structures are initially superimposed, but equivalent matches are not known.
1 In the case of Cα-Cα distance comparisons, create two lists, one for each protein A
and B.
(a) In one list, tabulate all Cα-atoms in protein B with matches within a cut-off
distance, say 3.5 Å, to a position in protein A.
(b) In a second list, tabulate all Cα-atoms in protein A with matches within a cut-off
distance to a position in protein B.
2 Filter from the list the poorest matches to reduce the number of matches to a
unique set of equivalent matches:
(a) Remove matches that are not part of a contiguous run of at least four Cα-atoms.
(b) Reduce multiple matches from one protein to a single Cα-atom in the other protein,
e.g. does one of the matches extend a contiguous run of existing matches?
(c) If there are still multiple matches remaining, then the match with the shortest
distance is kept and the others are removed.
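The tabulation and filtering steps might be sketched as below. This is a simplified reading of the protocol, not the implementation of ref. 22: the two lists are held in one table keyed by (i, j), step 2(b) is folded into the run-length filter and the shortest-distance rule, and the function name `match_list` and the toy co-ordinates are hypothetical.

```python
from math import dist


def match_list(a, b, cutoff=3.5, min_run=4):
    """Return a unique list of (i, j) matches between superimposed atom lists a and b."""
    # Step 1: tabulate every pair of atoms lying within the cut-off distance.
    candidates = {}
    for i, p in enumerate(a):
        for j, q in enumerate(b):
            d = dist(p, q)
            if d <= cutoff:
                candidates[(i, j)] = d

    def run_length(i, j):
        """Length of the contiguous run of matches passing through (i, j)."""
        length, k = 1, 1
        while (i - k, j - k) in candidates:
            length, k = length + 1, k + 1
        k = 1
        while (i + k, j + k) in candidates:
            length, k = length + 1, k + 1
        return length

    # Step 2(a): drop matches that are not in a run of at least min_run atoms.
    in_runs = {ij: d for ij, d in candidates.items() if run_length(*ij) >= min_run}

    # Step 2(b, c): resolve multiple matches, accepting the shortest distances first.
    used_a, used_b, unique = set(), set(), []
    for (i, j), _ in sorted(in_runs.items(), key=lambda item: item[1]):
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            unique.append((i, j))
    return sorted(unique)


# Toy data: five atoms of B track A closely; atom 6 of B is an isolated chance match.
chain_a = [(3.8 * i, 0.0, 0.0) for i in range(6)]
chain_b = [(3.8 * i, 0.5, 0.0) for i in range(5)] + [(100.0, 100.0, 100.0), (0.0, -0.3, 0.0)]
matches = match_list(chain_a, chain_b)
```

In the toy data the isolated atom is actually the closest match to the first atom of A, but it is removed because it does not belong to a contiguous run, which is the behaviour step 2(a) is meant to enforce.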
Comparisons of relationships (7, 10, 11, 21)
1 The matching of relationships among features of one structure with relationships
among features of another structure is accomplished using one of several different
techniques.
(a) Monte Carlo simulations (11, 21).
(b) Simulated annealing (7, 21).
(c) Double dynamic programming (10).
(d) Genetic algorithms (22, 26, 27) can also be used.
2 The matched relationships may be insufficient in themselves to accurately align the
3-D structures, and thus would be combined with the feature comparisons within a
dynamic programming procedure, for example, to give the final alignment (7).
given the cut-off value of 3 Å, the RMSD obtained and each of the Cα-Cα atom
distances contributing to the RMSD will be less than 3 Å. Alternatively, the
RMSD can be calculated over all matched Cα-atom pairs, regardless of the distance
between the superposed atoms. Of course, the RMSD can also be calculated
between sets of any type of superposed atoms, not just Cα-atom pairs as
illustrated in Protocol 4.
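The two ways of reporting the RMSD discussed here differ only in which atom pairs enter the sum. A minimal sketch, assuming the structures are already superposed and using a hypothetical function name `rmsd` and toy co-ordinates:

```python
from math import dist, sqrt


def rmsd(pairs, cutoff=None):
    """RMSD over superposed atom pairs; with a cutoff, only pairs within it count.

    Returns (rmsd, number_of_pairs_used); rmsd is None if no pair qualifies.
    """
    distances = [dist(p, q) for p, q in pairs]
    if cutoff is not None:
        distances = [d for d in distances if d <= cutoff]
    if not distances:
        return None, 0
    return sqrt(sum(d * d for d in distances) / len(distances)), len(distances)


# Toy superposed pairs separated by 1, 2 and 5 distance units.
origin = (0.0, 0.0, 0.0)
toy_pairs = [(origin, (1.0, 0.0, 0.0)), (origin, (0.0, 2.0, 0.0)), (origin, (0.0, 0.0, 5.0))]
rmsd_all, n_all = rmsd(toy_pairs)
rmsd_cut, n_cut = rmsd(toy_pairs, cutoff=3.0)
```

Reporting the number of pairs alongside the value matters, because an RMSD computed with a cut-off is bounded by that cut-off and is only meaningful together with the count of equivalent positions it covers.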
3.2 Comparisons
In the comparison of identical proteins that have 3-D structures that differ to
varying degrees, the structures need to be compared using a rigid-body
approach one time only (Protocol 5). No iteration is necessarily required to
achieve the best result, since one would typically supply all atom positions in
the structure for comparison. Likewise, no pre-comparison is necessary to supply
a seed set of residues for the comparison. In practice, iterative procedures are
used. Again, if big differences in the structures are anticipated, e.g. the relative
domain movements in Figure 2, then this approach may not be appropriate
except to provide an RMSD value that is an indication of the relative changes to
the structures.
(b) Those methods that sample the realm of possible solutions and, as a result,
automatically find optimal alignments without specifying an initial starting
alignment.
Figure 3 The differences in alignments of the aspartic proteinase amino- and carboxyl-
terminal domains (labelled with an 'N' or 'C', respectively) from (a) multifeature (7) and from
(b) multi-sequence comparisons. Asterisks in (a) indicate those positions among the
structures that were found to be equivalent under rigid-body superposition with the computer
program MNYFIT (16). PDB codes: 4APE, endothiapepsin; 2APP, penicillopepsin; 2APR,
rhizopuspepsin. (From ref. 8, with permission.)
Method
1 Supply a minimum of three conserved residues from a sequence-based alignment,
or
2 Supply key residues implicated in a conserved binding or catalytic motif, or
3 Supply segments corresponding to secondary structure elements observed on a
graphics device to be conserved between the structures.
Semi-automatic methods
Required data
• Co-ordinates of the proteins to be compared
• Initial set of equivalent atom pairs to seed the alignment procedure
Method
1 Calculate translation vector based on seed residues, translate all co-ordinates to the
origin and calculate the rotation matrix for the seed residues (see Protocol 2).
2 Apply the rotation matrix to all atoms of the second protein to achieve the first
superposition (see Protocol 2).
3 Obtain the alignment using dynamic programming or clique analysis (see Protocol
3).
4 For all matched residue pairs in the alignment, calculate the Euclidean distance.
Those matched pairs within the distance cut-off value will form the new updated
set of equivalent atom pairs for the next round of superpositioning.
5 Repeat steps 1-4 until convergence is obtained: the calculated RMSD (Protocol 4)
does not decrease and the number of equivalent atom pairs matched in the two
proteins does not increase.
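The iterative loop can be sketched with NumPy as follows. Several simplifications are assumed and should be read as such: the residue-by-residue alignment is taken as fixed (row i of A is paired with row i of B) instead of being recomputed by dynamic programming in each round, the rotation is obtained with the SVD-based Kabsch method rather than the Kearsley method of Protocol 2, B is moved into the frame of A rather than both being centred at the origin, and convergence is declared when the equivalent set stops changing. The names `fit_transform` and `iterate_superposition` and the random test data are hypothetical.

```python
import numpy as np


def fit_transform(a_eq, b_eq):
    """Least-squares rotation and centroids fitting rows of b_eq onto a_eq (Kabsch/SVD)."""
    ca, cb = a_eq.mean(axis=0), b_eq.mean(axis=0)
    u, _, vt = np.linalg.svd((b_eq - cb).T @ (a_eq - ca))
    sign = np.sign(np.linalg.det(vt.T @ u.T))     # guard against a reflection
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    return rotation, ca, cb


def iterate_superposition(a, b, seed, cutoff=3.5, max_rounds=20):
    """a, b: (N, 3) arrays whose rows are already aligned residue by residue."""
    equivalent = sorted(seed)
    for _ in range(max_rounds):
        rotation, ca, cb = fit_transform(a[equivalent], b[equivalent])
        b_fit = (b - cb) @ rotation.T + ca          # apply the fit to all atoms
        distances = np.linalg.norm(a - b_fit, axis=1)
        updated = [int(i) for i in np.flatnonzero(distances <= cutoff)]
        if len(updated) < 3 or updated == equivalent:
            break                                   # converged, or too few pairs
        equivalent = updated
    rmsd = float(np.sqrt(np.mean(distances[equivalent] ** 2)))
    return equivalent, rmsd, b_fit


# Toy data: B is a rotated, shifted copy of A whose last ten atoms have moved away.
rng = np.random.default_rng(1)
coords_a = rng.normal(scale=10.0, size=(30, 3))
angle = np.radians(40.0)
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle), np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
coords_b = coords_a @ rot + np.array([5.0, -3.0, 2.0])
coords_b[20:] += np.array([20.0, 0.0, 0.0])
equivalent_set, final_rmsd, _ = iterate_superposition(coords_a, coords_b, seed=[0, 1, 2])
```

Starting from three seed pairs, the equivalent set grows to the twenty atoms that genuinely superpose, while the displaced atoms remain outside the cut-off, mirroring the behaviour expected when one domain has moved relative to another.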
A rigid-body comparison has been used as a starting point for more detailed structural
comparisons involving multiple structural features (e.g. the program COMPARER
described in ref. 7).
Required data
• Cα-atom co-ordinates of the compared structures and their sequences
Method
1 Align the amino acid sequences with a dynamic programming algorithm (Protocol
3, but using sequence-matching scores to produce the alignment).
2 Superimpose the structures according to Protocol 7 using the most conserved portions
of the sequence alignment as the initial set of seed residues.
(c) Methods that do not make rigid-body comparisons directly, but instead make
comparisons on the basis of similarities in structural properties and/or
relationships (Protocols 11-14).
Required data
• Cα-atom co-ordinates of compared structures
Method
1 Create a large random set of superpositions for the pair of structures.
2 Assign equivalent matches (Cα-atoms within a specified distance cut-off) using
dynamic programming and score each alignment (see Protocol 3).
3 Create a new set of superpositions by crossing-over and mutating the existing
solutions.
4 Repeat steps 2-3 until a close to final solution is achieved.
5 Optimize the best found superposition/alignment by least squares minimization
(see Protocol 2).
6 Calculate the final alignment with the dynamic programming algorithm (see
Protocol 3).
Method
1 VERTAA, for each of two structures, plots the number of Cα-atoms within a
given radius (14.0 Å) from each Cα-atom in the structure. Other properties can be
used too.
2 These 'spectra' are scaled and overlapping segments are aligned. More than one
alignment method is available:
(a) The dynamic programming algorithm (Protocol 3). Fast and robust if the input
values are properly scaled.
(b) The Fourier correlation (21). The values can be considered as a function over a
limited range and a correlation function obtained with the fast Fourier transform
to bring the spectra into register. Dynamic programming is then used to
define equivalent matches (Protocol 3).
3 Superimpose the structures (see Protocol 2) based on equivalent matches defined in
step 2.
4 Define a new alignment with dynamic programming and the Cα-Cα distances of the
superimposed structures within 3.5 Å (see Protocol 3).
5 While the alignment and superimposition improve, repeat steps 3 and 4.
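The density 'spectrum' of step 1 is straightforward to compute. The sketch below is an illustration only and is not code from VERTAA: the function names `ca_density` and `centre_spectrum` are hypothetical, the atom itself is excluded from its own count, and the scaling is limited to shifting the average to zero.

```python
from math import dist


def ca_density(coords, radius=14.0):
    """Number of other atoms within radius of each atom (one value per residue)."""
    return [
        sum(1 for j, q in enumerate(coords) if j != i and dist(p, q) <= radius)
        for i, p in enumerate(coords)
    ]


def centre_spectrum(values):
    """Shift a spectrum so that its average is zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Toy chain: five atoms on a line, 10 units apart, so only neighbours are in range.
toy_coords = [(10.0 * i, 0.0, 0.0) for i in range(5)]
counts = ca_density(toy_coords)
spectrum = centre_spectrum(counts)
```

Because the counts depend only on distances within one structure, the spectra of two proteins can be compared before any superposition exists, which is what allows this approach to start without seed residues.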
Figure 4 Plots of Cα-atom densities, alignment of plots, and the corresponding superposition
of the structures. (a) Cα-atom densities of residues in γ-chymotrypsin A (PDB code 2GCH).
(b) Cα-atom densities of residues in Streptomyces griseus proteinase B (PDB code 3SGB,
chain E). For both spectra, the average density is set equal to 0. (c) The parts of the two
plots from (a) (dark) and (b) (light), which correspond to each other according to the alignment
of their spectra. (d) The superpositioned 3-D structures (2GCH dark, 3SGB light) based on the
alignment specified in (c). The side chains of the catalytic triad are shown and the closely
matching parts are drawn as ribbon diagrams. This superposition was made with the
computer program VERTAA (Lehtonen and Johnson, unpublished results) and contains 118
residues within 3.5 Å with an RMSD of 1.8 Å. Figure (d) was prepared with MOLSCRIPT (52).
Method
1 Calculate a distance matrix for protein A. Element (i, j) of the matrix contains
the intramolecular distance between the ith and jth Cα-atom in A. Likewise, calculate
a distance matrix for protein B.
2 List from each distance matrix all possible 6 by 6 sub-matrices.
3 Reduce the number of sub-matrices by clustering similar ones and using the mean of
each cluster as the contact pattern. Sort contact patterns by intra-pattern distance.
4 Compare each pair of two contact patterns from A with all pairs of sub-matrices
from B. Compare each pair of two contact patterns from B with all pairs of sub-matrices
from A. List all pair-pair matches.
5 Remove redundancy from the list of matches and sort it by match quality, which is
a function of the differences between the sub-matrices from A and from B.
6 Find the most extensive, non-exclusive collection of matches from the list. DALI
uses a Monte Carlo simulation to search the best 40 000 matches. The simulation
tries to extend the matches by combining matches that contain a common contact
pattern in both distance matrices. The random element of the simulation is used to
find the best scoring combination from mutually exclusive possibilities.
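Steps 1 and 2, together with a simple measure of how different two sub-matrices are, can be sketched as follows. This is only an illustration of the data structures involved and not DALI's scoring function: the names `distance_matrix`, `submatrix`, and `pattern_difference`, the mean absolute difference used as the measure, and the idealized helix used as test data are all assumptions.

```python
from math import cos, dist, radians, sin


def distance_matrix(coords):
    """Element (i, j) is the distance between atoms i and j of one structure."""
    return [[dist(p, q) for q in coords] for p in coords]


def submatrix(dm, i, j, size=6):
    """The size x size block of dm starting at row i and column j."""
    return [row[j:j + size] for row in dm[i:i + size]]


def pattern_difference(block_a, block_b):
    """Mean absolute difference between two equally sized blocks."""
    diffs = [abs(x - y) for ra, rb in zip(block_a, block_b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy data: an idealized helix, and the same helix rotated 90 degrees and shifted.
helix = [(2.3 * cos(radians(100 * i)), 2.3 * sin(radians(100 * i)), 1.5 * i)
         for i in range(12)]
moved = [(-y + 7.0, x - 2.0, z + 4.0) for x, y, z in helix]
dm_a, dm_b = distance_matrix(helix), distance_matrix(moved)
same = pattern_difference(submatrix(dm_a, 0, 6), submatrix(dm_b, 0, 6))
different = pattern_difference(submatrix(dm_a, 0, 6), submatrix(dm_a, 0, 1))
```

The first comparison gives essentially zero even though the second helix has been rotated and translated, which shows why distance matrices allow structures to be compared without superposing them first.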
Method
1 Calculate a distance matrix for protein A. Element (i, j) of the matrix contains the
intramolecular vector from the ith to the jth Cα-atom in A. The vector is in the co-ordinate
frame defined by the covalent bonds of A's ith Cα-atom. Likewise, calculate
a distance matrix for protein B.
2 Calculate intramolecular difference matrices for each pair of rows from the two
distance matrices. Thus, element (i, j) of the matrix constructed from row h of A's,
and row k of B's distance matrix will contain the difference of the magnitude of the
A general approach
1 Use a sequence alignment procedure to align the proteins and to cluster them as a
bifurcating tree (see refs 31-34 and several chapters in ref. 35).
2 Use a pairwise structural alignment method to align clusters according to the tree
topology. This will involve comparing pairs of structures, one structure with a set of
previously aligned structures, and aligned structures with aligned structures, until
all clusters have been coalesced into a final alignment involving all of the proteins.
segments of Cα-atoms (GENFIT) or SSEs (SARF2). GENFIT starts with 'too many'
equivalent matches and reduces them until a maximal, but non-conflicting set,
is obtained. This is done for each of the many parallel comparisons being made,
but a single optimal result will be obtained in any one run: the parallel com-
Figure 6 Local similarity of ATP cofactor binding site seen in the pairwise superposition of
ribonucleotide reductase (PDB code: 3R1R, chain A; light grey) with cAMP-dependent protein
kinase (1CDK; dark grey) (a and b), and with D-Ala:D-Ala ligase (1IOW; dark grey) (c). The
bound ATP molecules of the structures are shown as stick models. In (a), the four common
segments are drawn as ribbon diagrams. (b and c) The environment around the cofactor is
illustrated by showing the equivalent hydrogen bonds (dashed lines) and equivalent Cα-atoms
(spheres) forming hydrophobic contacts to the cofactor. (From ref. 37, with permission.)
parisons converge towards that result. SARF2 searches among a large set of
matches between the structures and finds the largest non-conflicting subset of
matches. Both methods are free from restraints on the order and chain direction
of objects along the sequence, but optional restraints can be applied.
Required data
• Cα-atom co-ordinates of the two structures
Method
1 Search for and tabulate main-chain fragments from the structures that are similar
to five-residue long templates of typical α-helices and β-strands.
2 Create a list of SSE pairs from the first structure that match SSE pairs from the
second structure. Distance and angular criteria between the SSEs in both structures
are important to the determination of a match.
3 Combine matches to find the largest collection of SSEs that can be aligned. SARF2
uses an exhaustive, recursive search algorithm to find possible solutions (see ref. 41
for details).
4 For the best solutions found, superimpose the matched SSEs and then add nearby
Cα-atoms to matched regions using the dynamic programming method. Iteratively
repeat the superpositions of Cα-atoms until the maximum number of matched
atoms has been found.
5 A list of superpositions, ranked according to an alignment score, results.
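Step 2 of this protocol, the matching of SSE pairs by distance and angular criteria, can be illustrated with a small sketch. It does not reproduce SARF2's actual criteria: each SSE is reduced here to a type label and the two end points of its axis, the descriptor is just the midpoint distance and the inter-axis angle, and the names `descriptor` and `matching_sse_pairs`, the tolerances, and the toy elements are hypothetical.

```python
from math import acos, degrees, dist, sqrt


def midpoint(sse):
    _, start, end = sse
    return tuple((s + e) / 2 for s, e in zip(start, end))


def axis(sse):
    _, start, end = sse
    vector = [e - s for s, e in zip(start, end)]
    norm = sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def descriptor(sse1, sse2):
    """Distance between midpoints and angle (degrees) between the axes of two SSEs."""
    cosine = sum(u * v for u, v in zip(axis(sse1), axis(sse2)))
    cosine = max(-1.0, min(1.0, cosine))      # guard against rounding error
    return dist(midpoint(sse1), midpoint(sse2)), degrees(acos(cosine))


def matching_sse_pairs(sses_a, sses_b, max_dist_diff=2.0, max_angle_diff=20.0):
    """List ((i, k), (j, l)): SSE pair (i, k) of A resembles SSE pair (j, l) of B."""
    matches = []
    for i in range(len(sses_a)):
        for k in range(i + 1, len(sses_a)):
            d_a, t_a = descriptor(sses_a[i], sses_a[k])
            for j in range(len(sses_b)):
                for l in range(len(sses_b)):
                    same_types = (sses_a[i][0], sses_a[k][0]) == (sses_b[j][0], sses_b[l][0])
                    if j == l or not same_types:
                        continue
                    d_b, t_b = descriptor(sses_b[j], sses_b[l])
                    if abs(d_a - d_b) <= max_dist_diff and abs(t_a - t_b) <= max_angle_diff:
                        matches.append(((i, k), (j, l)))
    return matches


# Toy data: two helices (H) and a strand (E); B is A with its axes cyclically permuted.
sses_a = [("H", (0, 0, 0), (0, 0, 10)), ("H", (10, 0, 0), (10, 0, 10)),
          ("E", (-5, 10, 0), (5, 10, 0))]
sses_b = [("H", (0, 0, 0), (0, 10, 0)), ("H", (0, 0, 10), (0, 10, 10)),
          ("E", (10, 0, -5), (10, 0, 5))]
pair_matches = matching_sse_pairs(sses_a, sses_b)
```

The two parallel helices match in either order, so the list still contains an ambiguity; resolving such ambiguities is the job of the combination search in step 3.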
GENFIT (22)
Automatic alignment of two locally similar protein structures using a genetic algorithm.
This implementation has been designed for parallel processing environments.
Required data
• Cα-atom co-ordinates of the two structures
Method
1 Create a large random set of superpositions for the pair of structures.
2 Assign equivalent matches using the match list algorithm (see Protocol 3). Criteria
for a match include:
(a) Cα-atom matches must be within a user-specified distance cut-off.
(b) Matches must include a minimum of four consecutive Cα-atoms.
(c) The direction of the main chain for matched segments is unimportant by default.
(d) Matches do not need to be co-linear (i.e. the location of a match along the
sequence relative to other matches is unimportant).
3 Calculate an alignment score for each superposition and create a new set of
superpositions by crossing-over and mutating existing ones (see ref. 22 for details).
4 Repeat steps 2 and 3 until convergence has been achieved.
5 Optimize the best superposition/alignment by least-squares rigid-body minimization
(Protocol 2).
6 Recalculate the alignment with the match list algorithm (Protocol 3).
7 If the number of equivalent matches has increased or the fit has improved, repeat
steps 5 and 6 with the current alignment.
8 Repetitive runs can produce different results showing that equally likely alternative
results exist.
References
1. Dayhoff, O. M., Barker, W. C., and Hunt, L. T. (1983). In Methods in enzymology (ed.
C. H. W. Hirs and S. W. Timasheff), Vol. 91, p. 524. Academic Press, London.
2. Chothia, C. (1992). Nature, 357, 543.
3. Blundell, T. L. and Johnson, M. S. (1993). Protein Sci., 2, 877.
4. Johnson, M. S., Srinivasan, N., Sowdhamini, R., and Blundell, T. L. (1994). Crit. Rev.
Biochem. Mol. Biol., 29, 1.
5. Robertus, J. D., Alden, R. A., Birktoft, J. J., Kraut, J., Powers, J. C., and Wilcox, P. E.
(1972). Biochemistry, 11, 2449.
6. Needleman, S. B. and Wunsch, C. D. (1970). J. Mol. Biol., 48, 443.
7. Sali, A. and Blundell, T. L. (1990). J. Mol. Biol., 212, 403.
8. Johnson, M. S., Sali, A., and Blundell, T. L. (1990). In Methods in enzymology (ed. R. F.
Doolittle), Vol. 183, p. 670. Academic Press, San Diego.
9. Russell, R. B. and Barton, G. J. (1992). Proteins, 14, 309.
10. Taylor, W. R. and Orengo, C. A. (1989). J. Mol. Biol., 208, 1.
11. Holm, L. and Sander, C. (1993). J. Mol. Biol., 233, 123.
12. McLachlan, A. D. (1972). Acta Crystallogr., A28, 656.
13. McLachlan, A. D. (1979). J. Mol. Biol., 128, 49.
14. McLachlan, A. D. (1982). Acta Crystallogr., A38, 871.
15. Diamond, R. (1988). Acta Crystallogr., A44, 211.
16. Sutcliffe, M. J., Haneef, I., Carney, D., and Blundell, T. L. (1987). Protein Eng., 1, 377.
17. Diamond, R. (1992). Protein Sci., 1, 1279.
18. Shapiro, A., Botha, J. D., Pastor, A., and Lesk, A. M. (1992). Acta Crystallogr., A48, 11.
19. Kearsley, S. K. (1989). Acta Crystallogr., A45, 208.
20. Kearsley, S. K. (1990). J. Comput. Chem., 11, 1187.
21. Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (ed.) (1992).
Numerical recipes in C. The art of scientific computing. (2nd edn). Cambridge University
Press, Cambridge.
22. Lehtonen, J. V., Denessiouk, K., May, A. C. W., and Johnson, M. S. (1999). Proteins, 34,
341.
23. Mitchell, E. M., Artymiuk, P. J., Rice, D. W., and Willett, P. (1990). J. Mol. Biol., 212,
151.
24. Gibrat, J.-F., Madej, T., and Bryant, S. H. (1996). Curr. Opin. Struct. Biol., 6, 377.
25. Grindley, H. M., Artymiuk, P. J., Rice, D. W., and Willett, P. (1993). J. Mol. Biol., 229 (3),
707.
26. May, A. C. W. and Johnson, M. S. (1994). Protein Eng., 7, 475.
27. May, A. C. W. and Johnson, M. S. (1995). Protein Eng., 8, 873.
28. Johnson, M. S., Sutcliffe, M. J., and Blundell, T. L. (1990). J. Mol. Evol., 30, 43.
29. Goldberg, D. E. (ed.) (1989). Genetic algorithms in search, optimization, and machine learning.
Addison-Wesley, Reading, MA.
30. Kleywegt, G. J. and Jones, T. A. (1997). In Methods in enzymology (ed. C. W. Carter and
R. M. Sweet), Vol. 277, p. 525. Academic Press.
31. Barton, G. J. and Sternberg, M. J. (1987). J. Mol. Biol., 198, 327.
32. Feng, D. F. and Doolittle, R. F. (1987). J. Mol. Evol., 25, 351.
33. Johnson, M. S. and Overington, J. P. (1993). J. Mol. Biol., 233, 716.
34. Johnson, M. S., May, A. C. W., Rodionov, M. A., and Overington, J. P. (1996). In Methods
in enzymology (ed. R. F. Doolittle), Vol. 266, p. 575. Academic Press.
35. Doolittle, R. F. (ed.) (1996). In Methods in enzymology, Vol. 266, p. 711. Academic Press,
San Diego.
36. Denessiouk, K., Lehtonen, J. V., Korpela, T., and Johnson, M. S. (1998). Protein Sci., 7,
1136.
37. Denessiouk, K., Lehtonen, J. V., and Johnson, M. S. (1998). Protein Sci., 7, 1768.
38. Denessiouk, K., Denesyuk, A. I., Lehtonen, J. V., Korpela, T., and Johnson, M. S. (1999).
Proteins, 35, 250.
39. Kobayashi, N. and Go, N. (1997). Nature Struct. Biol., 4, 6.
40. Alexandrov, N. N., Takahashi, T., and Go, T. (1992). J. Mol. Biol., 225, 5.
41. Alexandrov, N. N. and Fischer, D. (1996). Proteins, 25, 354.
42. Sali, A., Overington, J. P., Johnson, M. S., and Blundell, T. L. (1990). Trends Biochem. Sci.,
15, 235.
43. Overington, J. P., Johnson, M. S., Sali, A., and Blundell, T. L. (1990). Proc. R. Soc. Lond. B,
241, 132.
44. May, A. C. W., Johnson, M. S., Rufino, S. D., Wako, H., Zhu, Z.-Y., Sowdhamini, R., et al.
(1994). Phil. Trans. R. Soc. Lond. B, 344, 373.
45. Mizuguchi, K., Deane, C. M., Blundell, T. L., Johnson, M. S., and Overington, J. P.
(1998). Bioinformatics, 14, 617.
46. Holm, L. and Sander, C. (1998). Nucleic Acids Res., 26, 316.
47. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and Thornton, J. M.
(1997). Structure, 5, 1093.
48. Marchler-Bauer, A., Addess, K. J., Chappey, C., Geer, L., Madej, T., Matsuo, Y., et al.
(1999). Nucleic Acids Res., 27, 240.
49. Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. (1995). J. Mol. Biol., 247,
536.
50. Rufino, S. D. and Blundell, T. L. (1994). J. Comput. Aided Mol. Des., 8, 5.
MARK S. JOHNSON AND JUKKA V. LEHTONEN
51. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. J. Jr, Brice, M. D., Rodgers,
J. K., et al. (1977). J. Mol. Biol., 112, 535.
52. Kraulis, P. J. (1991). J. Appl. Crystallogr., 24, 946.
Chapter 3
Multiple alignments for
structural, functional, or
phylogenetic analyses of
homologous sequences
L. Duret
Laboratoire de Biométrie, Génétique et Biologie des Populations, Université
Claude Bernard, France.
S. Abdeddaim
Département d'Informatique de Rouen, Université de Rouen, France.
1 Introduction
Understanding the structure, function and evolution of genes is one of the main
goals of genome sequencing projects. Classically, gene function has been investi-
gated experimentally through the analysis of mutant phenotypes. More recently,
comparative analysis of homologous sequences has proved to be a very efficient
approach to studying gene function (this approach has been termed 'comparative
genomics' or 'phylogenomics'). Indeed, the evolution of living organisms may be
considered as an ongoing large-scale mutagenesis experiment. For more than
three billion years, genomes have continuously undergone mutations (substitutions,
insertions, deletions, recombination, and so on). Deleterious mutations are
generally rapidly eliminated by natural selection, while mutations that have no
phenotypic effect (neutral mutations) may, by random genetic drift, eventually
become fixed in the population. Overall, advantageous mutations are very rare,
and hence residues that are poorly conserved during evolution generally correspond
to regions that are weakly constrained by selection (1). Thus, studying
mutation patterns through the analysis of homologous sequences is useful not
only to study evolutionary relationships between sequences, but also to identify
structural or functional constraints on sequences (DNA, RNA, or protein).
The alignment of homologous sequences consists of trying to place residues
(nucleotides or amino acids) in columns that derive from a common ancestral
residue. This is achieved by introducing gaps (which represent insertions or de-
letions) into sequences. Thus, an alignment is a hypothetical model of mutations
• Molecular phylogeny
Molecular phylogenetic trees rely on multiple alignments (protein or DNA) to infer mutation
events from which it is possible to retrace evolutionary relationships between sequences.
Such trees are useful to reconstruct the history of species or multigenic families, and notably
to identify gene duplication events to distinguish orthologues from paralogues. It is important
to note that unreliable parts of alignments should not be used to build phylogenetic trees
since they do not reflect the real pattern of mutations that occurred during evolution and may
lead to artifactual results.
• Structure prediction
The use of multiple alignments significantly increases the efficiency of protein secondary
structure prediction. Moreover, the identification of covariant sites (or compensatory
mutations) in alignments (protein or RNA) strongly suggests that these sites
interact in the molecule in vivo. Finally, alignments are commonly used for homology
modeling, i.e. for the structure prediction of sequences by comparison with homologues of
known structure.
• Function prediction
The three-dimensional (3D) structure of homologous proteins or RNA is often much more
conserved than their primary sequence. Similar shape usually implies similar function. Thus, if
a new gene is found to be homologous to an already characterized gene it is possible to infer
the likely function of the new gene from the known one. Such inferences should however be
used with great caution.
• Design of primers for PCR (polymerase chain reaction)
Identification of related genes
Figure 1 Global versus local alignment. (a) Conserved regions occur in the same order in all
sequences. They can be represented in a single global alignment. (b) Some conserved
regions are duplicated or occur in a different order along sequences. It is necessary to
perform local alignments to display similarities between all conserved regions.
2.3.1 Substitutions
The probability of substitution of one amino acid by another depends on the
structure of the genetic code (i.e. on the number of mutations necessary to pass
from one codon to another) and also on the phenotypic effect of that mutation.
Substitutions of one amino acid by another with similar biochemical properties
generally do not greatly affect the structure and hence the function of the
protein. Thus, during evolution, such conservative substitutions are relatively
frequent compared to other substitutions. It is important to note that the prob-
ability of substitution of one amino acid by another depends on the evolutionary
distance between sequences. At short evolutionary distances, probabilities of
substitution mainly reflect the structure of the genetic code, whereas at larger
distances, probabilities of substitution depend essentially on biochemical simi-
larities between amino acids. Various methods have been proposed to build
series of matrices that give estimates of probabilities of all possible substitutions
for different evolutionary distances (6-8). The most commonly used are the PAM
and BLOSUM substitution matrices. PAM matrices suitable for increasing
evolutionary distances are indicated by increasing indices (e.g. PAM80, PAM120,
and PAM250). The opposite convention has been used for the BLOSUM series (e.g.
BLOSUM80 for short evolutionary distances, BLOSUM45 for large evolutionary
distances). Generally, alignment programs allow users to choose which substitu-
tion matrix to use. In the CLUSTAL W program (9) (see Section 4.2) substitution
matrices are automatically selected and varied at different alignment stages
according to the divergence of the sequences to be aligned.
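The idea of matching the substitution matrix to the divergence of the sequences can be sketched as follows. This is only an illustration: the function names and the identity thresholds are assumptions chosen for the example, and they are not the values used by CLUSTAL W or by any other program.

```python
def fraction_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues between two aligned rows of equal length.

    Columns in which either row has a gap are ignored.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def choose_blosum(identity):
    """Pick a BLOSUM matrix name from a fraction of identity (0 to 1).

    The thresholds below are hypothetical, for illustration only.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must lie between 0 and 1")
    if identity >= 0.8:
        return "BLOSUM80"   # short evolutionary distance
    if identity >= 0.5:
        return "BLOSUM62"   # intermediate distance
    return "BLOSUM45"       # large evolutionary distance


identity = fraction_identity("ACTTA", "A-GTA")
print(identity, choose_blosum(identity))
```

Note that the BLOSUM index decreases as divergence increases, whereas a PAM-based version of the same function would return increasing indices.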
Probabilities of substitutions also vary along sequences according to the local
environment of amino acids in the folded protein. Thus, several environment-specific
substitution matrices have been developed (e.g. for α-helix or β-sheet)
(10). However, to our knowledge, these matrices are rarely used for multiple
alignments.
At the DNA level, probabilities of substitution vary according to the bases.
Notably, transitions (substitutions between two purines—A, G—or two pyri-
midines—C, T) are generally more frequent than transversions (substitutions
between a purine and a pyrimidine). Thus, multiple alignment programs
generally offer a parameter to weight transversions more heavily than
transitions. Probabilities of nucleotide substitution also depend on neighbouring
bases (e.g. in vertebrates, C in CG dinucleotides is hypermutable) (11, 12).
However, currently available alignment programs do not make use of such
information.
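A transition/transversion weighting of the kind described above can be sketched in a few lines. The function name and the default costs (1 for a transition, 2 for a transversion) are assumptions made for this example, not the parameters of a particular program.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_cost(x, y, transition_cost=1.0, transversion_cost=2.0):
    """Cost of replacing base x by base y in a DNA alignment.

    Identical bases cost nothing; transitions (purine <-> purine or
    pyrimidine <-> pyrimidine) cost less than transversions.
    """
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if x == y:
        return 0.0
    same_class = (x in PURINES) == (y in PURINES)
    return transition_cost if same_class else transversion_cost


for pair in ("AG", "CT", "AC", "GT"):
    print(pair, substitution_cost(*pair))
```

As noted in the text, this scheme treats every site independently and therefore ignores context effects such as the hypermutability of C in CG dinucleotides.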
where L is the length of the gap, a the gap opening penalty, and b the gap
extension penalty. However, analyses of alignments of homologous sequences
have shown, both for protein and nucleic acid sequences, that this model
underestimates the probability of long indels (7, 13, 14). Indeed, more realistic indel
penalties can be estimated with models of the following form:
However, because of computational complexity, such models have not been im-
plemented in commonly used alignment programs. Fortunately, other approaches
have been proposed to align sequences with large indels (see Section 4.3).
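To give a feeling for the difference between the two kinds of model, the sketch below compares a linear (affine) penalty, assumed here to have the usual form a + bL, with a penalty that grows with the logarithm of the gap length. The logarithmic form, the function names, and the values a = 10 and b = 1 are assumptions made for illustration only; they are not taken from a specific program.

```python
import math


def affine_gap_penalty(length, a=10.0, b=1.0):
    """Linear model: opening penalty a plus extension penalty b per position."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return a + b * length


def log_gap_penalty(length, a=10.0, b=1.0):
    """Assumed alternative: the penalty grows with the logarithm of the length."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return a + b * math.log(length)


# Long gaps are penalized far more heavily by the linear model.
for length in (1, 10, 100):
    print(length, affine_gap_penalty(length), round(log_gap_penalty(length), 2))
```

With the linear model, the cost of a gap keeps increasing in proportion to its length, which is why long indels tend to be over-penalized.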
The probability of occurrence of indels in proteins also depends on the degree
of divergence between sequences (7, 13). Thus, as for amino acid substitution
matrices, indel penalty parameters should ideally be varied according to the
divergence of the sequences to be aligned. The probability also depends on the
nature of the sequences: protein, structural RNA, non-coding DNA (in which
transposable elements may be inserted), etc. Moreover, probabilities of indel
may vary along sequences. In proteins notably, indels are more frequent within
external loops than in the core of the structure. Thus, knowledge of the
structure of proteins can be used to weight indels. For example, the CLUSTAL W
program uses residue-specific indel penalties and locally reduced indel penalties
to encourage new gaps in potential loop regions rather than in regular secondary
structure. In cases where secondary structure information is available, indel
penalty masks can also be used to guide the alignment.
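The notion of an indel penalty mask can be illustrated with a short sketch. It assumes a secondary-structure string with one character per residue ('H' for helix, 'E' for strand, anything else treated as loop); the function name and the scaling factors are hypothetical and do not reproduce the rules used by CLUSTAL W.

```python
def gap_opening_mask(secondary_structure, base_penalty=10.0,
                     loop_factor=0.5, structured_factor=2.0):
    """Return one gap opening penalty per position.

    Positions in helices ('H') or strands ('E') get a raised penalty,
    all other positions (loops) get a reduced one.
    """
    penalties = []
    for state in secondary_structure:
        factor = structured_factor if state in ("H", "E") else loop_factor
        penalties.append(base_penalty * factor)
    return penalties


print(gap_opening_mask("HHHH--EEE"))
```

An alignment routine would then look up the opening penalty at the position where a gap starts instead of using a single global value.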
It is important to note that, in most programs, default parameters for gap
penalties have been set for typical globular proteins. These may not be optimal
for other sequences.
Table 2 Websites for text-based searches in sequence databases
consists in taking one member of the family and comparing it to the entire
database with a similarity search program such as FASTA (22) or BLAST (23, 24).
To guarantee a more exhaustive search, one may repeat this procedure with
several distantly related homologues identified in the first step. See the review
by Altschul, et al. (25) for a comprehensive discussion of sequence similarity
searches.
The sensitivity of a sequence similarity search may be improved by weighting
sites according to their degree of conservation. Thus, once several homologous
sequences have been identified, it is possible to use methods such as profile
searches (see Chapter 5 in this volume) or PSI-BLAST (24) that rely on a multiple
alignment to identify more distantly related members of the family. A list of
some similarity search WWW servers is presented in Table 3.
using heuristics to gain speed and limit space requirements. Although these
heuristics are not guaranteed to find the optimal alignment, they are very useful
in practice and often give results very close to the exact solution. In the following
we will focus on four families of multiple alignment algorithms:
(a) Algorithms that are guaranteed to find the optimal alignment for a given scoring
scheme; these algorithms can be used only for a limited number of short
sequences.
(b) Heuristic algorithms that are based on a progressive pairwise alignment
approach.
(c) Heuristic algorithms that build a global alignment based on local alignments.
(d) Heuristic algorithms that build local multiple alignments.
It should be noted that this list is not exhaustive. Other multiple alignment
methods such as those based on hidden Markov models (26) or genetic algorithms
(27) can also be used. For a review of multiple alignment algorithms see refer-
ence (28).
Many of the programs reviewed here can be used directly through the WWW
(see Table 4) or downloaded over the Internet to be installed on a local computer
(see Table 5).
multiple alignment cost is more complex. One possible solution, known as the
Sum of Pairs (SP) alignment cost (29), consists of calculating the multiple
alignment cost from pairwise alignment costs. A multiple alignment a(S1, ..., Sn)
contains n(n-1)/2 pairwise alignments a(Si, Sj), where 1 ≤ i < j ≤ n. Each projection
a(Si, Sj) is the pairwise alignment built from a(S1, ..., Sn) by removing all the rows
except rows i and j, and then removing all the columns that contain two
null letters (gaps). The SP multiple alignment cost is defined as the sum of the costs
of all its projections (29).
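The definition of the SP cost translates almost directly into code. The sketch below is a minimal illustration: the pairwise cost used (0 for a match, 1 for a mismatch, 2 for each gapped column) is an assumption made for the example, and any pairwise cost function could be substituted.

```python
from itertools import combinations

GAP = "-"


def project(row_i, row_j):
    """Projection on two rows: drop the columns that contain two gaps."""
    kept = [(x, y) for x, y in zip(row_i, row_j)
            if not (x == GAP and y == GAP)]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


def pairwise_cost(row_i, row_j, mismatch=1, gap=2):
    """Illustrative pairwise cost: matches are free, gaps cost per column."""
    cost = 0
    for x, y in zip(row_i, row_j):
        if x == GAP or y == GAP:
            cost += gap
        elif x != y:
            cost += mismatch
    return cost


def sp_cost(alignment):
    """Sum of the costs of the n(n-1)/2 pairwise projections."""
    return sum(pairwise_cost(*project(a, b))
               for a, b in combinations(alignment, 2))


alignment = ["AC-TA",
             "AG-TA",
             "A--TA"]
print(project(alignment[0], alignment[1]))
print(sp_cost(alignment))
```

The third column is removed from the projection on the first two rows because both rows carry a gap there, so it does not contribute to that pair's cost.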
Simple SP alignment cost may, however, be inappropriate when some groups
of sequences are heavily over- or under-represented in a family. This drawback
may be corrected by introducing a proper weighting system (30, 31) which
assigns a weight to each sequence. This can be used to give less weight to
sequences from overrepresented groups. Another solution consists of using a
cost function based on an evolutionary tree. The tree leaves are the sequences we
want to align, and the internal nodes are their hypothetical ancestral sequences.
For a given tree, the cost of an alignment a(S1, ..., Sn) is the sum of the costs of all its
projections a(Si, Sj) on adjacent sequences Si and Sj in the tree (32).
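Both corrections can be sketched with small variations on the SP cost. In the example below, scaling each pair by the product of the two sequence weights is one simple convention assumed for illustration (references 30 and 31 describe actual weighting schemes), and the tree is given as a hypothetical list of pairs of adjacent rows, ancestral sequences included.

```python
from itertools import combinations


def pair_cost(row_i, row_j, mismatch=1, gap=2):
    """Cost of the projection of an alignment on two of its rows."""
    cost = 0
    for x, y in zip(row_i, row_j):
        if x == "-" and y == "-":
            continue            # column dropped by the projection
        if x == "-" or y == "-":
            cost += gap
        elif x != y:
            cost += mismatch
    return cost


def weighted_sp_cost(rows, weights):
    """SP cost in which each pair (i, j) is scaled by weights[i] * weights[j]."""
    if len(rows) != len(weights):
        raise ValueError("one weight per sequence is required")
    return sum(weights[i] * weights[j] * pair_cost(rows[i], rows[j])
               for i, j in combinations(range(len(rows)), 2))


def tree_cost(rows, edges):
    """Sum of projection costs over pairs of rows adjacent in the tree."""
    return sum(pair_cost(rows[i], rows[j]) for i, j in edges)


# Three identical sequences from one group and a single divergent one.
rows = ["ACGT", "ACGT", "ACGT", "AGGT"]
print(weighted_sp_cost(rows, [1, 1, 1, 1]))          # plain SP cost
print(weighted_sp_cost(rows, [1 / 3, 1 / 3, 1 / 3, 1]))  # group down-weighted

# Two leaves (rows 0 and 1) joined to a hypothetical ancestor (row 2).
print(tree_cost(["ACGT", "AGGT", "ACGT"], edges=[(0, 2), (1, 2)]))
```

With equal weights the overrepresented group contributes three times to the mismatch with the divergent sequence; with the reduced weights it contributes as much as a single sequence would.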
tree, and the ancestral sequences such that the alignment cost is minimal. The
problem remains hard even if the tree is given (36).
Figure 2 Progressive alignment process. (a) All sequences are compared to each other.
(b) A guide tree is calculated from the pairwise distance matrix. (c) Sequences are
progressively aligned following the guide tree.
The simplest method for this problem reduces each alignment to a consensus
sequence, and uses a pairwise alignment algorithm to do the work. In the con-
sensus sequence, each column of the alignment is represented by its most
frequent letter. Consensus alignment was used in the first version of CLUSTAL.
In most programs, each alignment is considered as a profile (see Chapter 5). In a
profile, a column is reduced to a distribution giving the frequency of each letter.
Two profiles are aligned as two sequences by dynamic programming without
major modification of the algorithm. The alignment of two profiles of length l
takes O(a²l²) time, where a is the alphabet size. CLUSTAL W uses profile alignment
with position-specific gap penalties (see Section 2.3.2).
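The two ways of summarizing an alignment can be sketched as follows. The function names are ours, the gap character is simply counted as one more letter, and the column score shown (an average of a pairwise score over all letter pairs) is one common choice assumed for illustration rather than the scoring used by a particular program.

```python
from collections import Counter


def columns(alignment):
    """Return the columns of an alignment given as equal-length rows."""
    return list(zip(*alignment))


def consensus(alignment):
    """Most frequent letter of each column (gaps count as a letter)."""
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in columns(alignment))


def profile(alignment):
    """One dictionary per column giving the frequency of each letter."""
    n = len(alignment)
    return [{letter: count / n for letter, count in Counter(col).items()}
            for col in columns(alignment)]


def column_score(col_p, col_q, score):
    """Average score between two profile columns.

    The double loop over letters is the source of the a^2 factor
    in the running time of profile alignment.
    """
    return sum(p * q * score(x, y)
               for x, p in col_p.items()
               for y, q in col_q.items())


alignment = ["ACTTA", "A-GTA", "ACGTA"]
print(consensus(alignment))
print(profile(alignment)[2])
```

The consensus keeps only one letter per column, whereas the profile retains the information that the third column is a mixture of T and G.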
Suppose that the guide tree based on pairwise comparison of entire sequences
indicates that we should first align sequence x with sequence y, followed by the
alignment of sequence z with the first two (already aligned together). At the first
step, there are three possible alignments of x and y giving exactly the same
score:
x ACTTA    x ACTTA    x ACTTA
y A-GTA    y AGT-A    y AG-TA
At the later step, the gap that was introduced cannot be changed. Thus adding
sequence z could give the following three alignments:
x ACTTA    x ACTTA    x ACTTA
y A-GTA    y AGT-A    y AG-TA
z ACGTA    z ACGTA    z ACGTA
Only the first of these alignments is optimal. At the first step, only one of the
three possibilities will be used. If it is the wrong one, we cannot correct this
later.
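The example can be checked numerically. The sketch below assumes a simple scoring scheme (match +1, mismatch -1, gap -2 per column) and a sum-of-pairs score for the three-sequence alignment; these values are chosen for illustration only.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score of two aligned rows under the illustrative scheme."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sp_score(rows):
    """Sum of the pairwise scores of all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


x, z = "ACTTA", "ACGTA"
y_choices = ["A-GTA", "AGT-A", "AG-TA"]

first_step = [pair_score(x, y) for y in y_choices]
final_step = [sp_score([x, y, z]) for y in y_choices]
print(first_step)   # the three x/y alignments tie
print(final_step)   # only the first remains best once z is added
```

Under these assumptions the three alignments of x and y are indistinguishable at the first step, and the difference appears only when z is taken into account, by which time the gap position is frozen.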
To avoid that problem, iterative optimization strategies such as RIW or DNR
(42) have been proposed. These methods are reported to perform better than
CLUSTAL W (42). However, although these methods are much faster than
optimal algorithms, they are still too slow for large datasets.
Another limitation of the progressive approach described above is that it
requires computing pairwise distances between all sequences to calculate the
guide tree. One may sometimes have to align a set of homologous sequences that
includes some non-overlapping fragments (e.g. partial protein sequences). When
two sequences do not overlap, they appear completely unrelated in a pairwise
comparison, and thus the guide tree generated may be totally wrong. The alignment
produced in this case can be unpredictable.
Figure 3 (a) Consistent set of blocks. (b) Non-consistent set of blocks.
conserved blocks that will be used as anchors in order to align the sequences.
Blocks are alignments of fragments (segments) of sequences (local alignments).
Most methods consider gap-free blocks. Depending on the programs used, the
blocks allowed can be exact (composed of identical segments) or not exact and
they may be uniform (found in every sequence) or not. The selected set of blocks
must be consistent, i.e. the blocks can occur together in a multiple global
alignment (Figure 3). Once blocks have been computed, it is possible to use a
classical approach to align regions between blocks (e.g. ref. 43).
The first multiple block alignment program (44) used a sorting algorithm in
order to compute uniform exact blocks. Faster algorithms based on suffix trees
(45), or equivalent data structures, can also be used to compute exact blocks.
However, homologous regions are rarely exactly conserved. ASSEMBLE (46) per-
forms a dot matrix analysis on all pairs of sequences and then compares these
dot matrices to find uniform blocks that are not necessarily exact. In practice, it
often happens that some blocks are not present in all sequences. Thus, a further
improvement has consisted of developing methods that allow blocks that are
not necessarily uniform. DIALIGN (47,48) is based on computing gap-free blocks
between pairs of segments (diagonals).
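The simplest case, uniform exact blocks, can be illustrated with a short sketch that looks for words of a fixed length present in every sequence. It uses hash tables rather than the sorting or suffix-tree approaches cited above, and the function names and the fixed word length k are assumptions made for the example.

```python
def kmer_positions(seq, k):
    """Map each word of length k to the list of its start positions."""
    positions = {}
    for start in range(len(seq) - k + 1):
        positions.setdefault(seq[start:start + k], []).append(start)
    return positions


def uniform_exact_blocks(sequences, k):
    """Words of length k found in every sequence, with their positions.

    Returns {word: [positions in sequence 1, positions in sequence 2, ...]}.
    """
    if not sequences:
        return {}
    tables = [kmer_positions(seq, k) for seq in sequences]
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    return {word: [table[word] for table in tables]
            for word in sorted(shared)}


sequences = ["GGACGTTT", "ACGTAA", "TTACGT"]
print(uniform_exact_blocks(sequences, 4))
```

Because only identical words are detected, such a search misses homologous regions that contain even a single substitution, which is the limitation that methods allowing non-exact blocks address.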
A set of uniform blocks is consistent when each pair of blocks is ordered (they
do not cross each other). Using this observation, selecting an optimal consistent
set of blocks can be reduced to a classic optimal-path algorithm in a graph (44).
The optimal-path algorithm requires O(M²) time for M blocks. Faster algorithms
(sub-quadratic) have been proposed in order to compute an optimal consistent
uniform set of blocks (49, 50). However, finding an optimal consistent set of
non-uniform blocks is an intractable problem (51). Indeed, the consistency of
non-uniform blocks cannot be reduced to a binary relation between them. A set of
three non-uniform blocks, such that all its three pairs of blocks are consistent, is
not necessarily consistent. To compute a 'good' consistent set of diagonals,
DIALIGN uses a heuristic algorithm in which diagonals are incorporated by
decreasing score order into a consistent set of diagonals. Diagonals not consistent
with the set of selected diagonals are rejected. In order to check if a new diagonal
is consistent or not with the set of selected diagonals, DIALIGN maintains a data
structure in O(kL²) time, for k sequences of total length L. This makes it slower
than progressive alignment programs. This computation time can however be
reduced to O(k²L + L²) (52) and even, thanks to recent developments, to O(L²).
Thus, faster versions of block-based alignment methods should be available in
the near future.
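For uniform blocks, the reduction to an optimal-path problem can be sketched as a quadratic dynamic programme. The block representation used here (a dictionary with the start position in each sequence, a common length, and a score) is a hypothetical one chosen for the example, and the code is a sketch of the general idea rather than the algorithm of reference 44.

```python
def precedes(a, b):
    """True if block a ends before block b starts in every sequence."""
    return all(start_a + a["length"] <= start_b
               for start_a, start_b in zip(a["starts"], b["starts"]))


def best_consistent_chain(blocks):
    """Highest-scoring set of pairwise ordered blocks (quadratic in M)."""
    # If a precedes b, a also sorts before b (block lengths are >= 1).
    order = sorted(range(len(blocks)), key=lambda i: blocks[i]["starts"])
    best, previous = {}, {}
    for pos, i in enumerate(order):
        best[i], previous[i] = blocks[i]["score"], None
        for j in order[:pos]:
            candidate = best[j] + blocks[i]["score"]
            if precedes(blocks[j], blocks[i]) and candidate > best[i]:
                best[i], previous[i] = candidate, j
    if not best:
        return 0, []
    node = max(best, key=best.get)
    total, chain = best[node], []
    while node is not None:
        chain.append(node)
        node = previous[node]
    return total, chain[::-1]


blocks = [
    {"starts": (0, 0), "length": 3, "score": 3},
    {"starts": (4, 12), "length": 2, "score": 5},   # crosses blocks 2 and 3
    {"starts": (3, 6), "length": 4, "score": 4},
    {"starts": (8, 10), "length": 2, "score": 2},
]
print(best_consistent_chain(blocks))
```

In this example the high-scoring block 1 is left out because it cannot be ordered with respect to blocks 2 and 3, whose combined score is larger.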
methods give more or less correct results. Moreover, in such cases, any reason-
able set of parameters (substitution matrix, and usually, gap opening and gap
extension penalties) will give similar alignments. However, when at least two
sequences in a given family share less identity, or if homologous regions are
interrupted by large gaps of different sizes, the result of alignment may vary
considerably according to programs and parameters used.
Several comparative analyses of multiple alignment programs have been
published (42, 48, 61, 62). These comparisons are based on the ability to detect
motif patterns on several protein families or based on reference alignments
derived from three-dimensional protein structures. Comparative analysis can
also be based on the effect of the multiple alignment programs on phylogeny.
Such a study was done on 18S rDNA from 43 protozoan taxa (63). These com-
parisons must be taken only as indications. Indeed, the parameter values (sub-
stitution matrices, gap penalty, etc.) used in these comparisons may not be
optimal for other sequence families (61). In addition these parameters are not
really comparable, even if the programs use the same strategies. For example a
gap opening score of 5 does not have the same meaning in CLUSTAL W (9) as it
does in MULTAL (38), as the value 5 will be modified in the programs (multiplied
by constants for example). For these reasons and because no known method
is guaranteed to find the correct alignment, it is still necessary to combine different
methods from different families of algorithms and human expertise to obtain
satisfactory alignments.
Figure 4 summarizes indications to guide users in their choice according to
the sequences they have to align. For the alignment of two sequences, one
should use an optimal pairwise alignment method (for example LALIGN or SIM
(64), see Table 6). For more than two sequences, one generally has to use heuristic
approaches. As a first step, the user should try to compute the multiple
alignment with a progressive alignment program. These programs are rapid, do
not demand large memory capacity and may thus be run on large datasets even
on micro-computers. Among programs using this approach, we recommend
CLUSTAL W (or its graphical user interface version: CLUSTAL X) (65, 66). This
includes useful features such as automatic selection of amino-acid substitution
matrix during alignment and lower weighting of gaps in potential protein loops.
If this first alignment shows that all sequences are related to each other over
their entire lengths, it is unlikely that any other method will give a better result
(Figure 4a).
However, if there are some highly divergent sequences, large gaps, or poorly
conserved regions, it is recommended to compare the results of different methods
and/or sets of parameters. Figure 4b shows homologous sequences sharing con-
served blocks separated by non-conserved regions of varying size. This situation,
which is frequently observed in practice (e.g. in genomic DNA sequences and in
many protein families), is particularly error prone for progressive alignment
methods, notably because the linear weighting of gaps tends to over-penalize
long indels. Block-based global methods (e.g. DIALIGN, ITERALIGN) (47, 48, 67)
are not sensitive to these long gaps and are particularly appropriate for such
Figure 4 Choice of multiple alignment methods according to the nature of the sequence set.
Figure 6 MASE format. This format is used to store nucleotide or protein multiple
alignments along with annotations relative to the whole alignment (indicated in the header),
or specific to each sequence. The file must begin with a header of at
least one line (but the content of this header may be empty). The header lines begin with ';;'.
The body of the file has the following structure: first, each entry begins with one (or more)
annotation lines. Annotation lines begin with the character ';'. Again, this annotation line may
be empty. After the annotations, the name of the sequence is written on a separate line.
Finally, the sequence itself is written on the following lines.
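The layout described in this caption is simple enough to be read with a few lines of code. The parser below is a minimal sketch based only on the description given here (header lines starting with ';;', then for each entry one or more ';' annotation lines, a name line, and sequence lines); the function name is ours, and real MASE files may follow conventions that are not covered.

```python
def parse_mase(text):
    """Parse MASE-formatted text into (header_lines, entries)."""
    lines = text.splitlines()
    pos = 0
    header = []
    while pos < len(lines) and lines[pos].startswith(";;"):
        header.append(lines[pos][2:])
        pos += 1
    entries = []
    while pos < len(lines):
        if not lines[pos].strip():
            pos += 1                      # tolerate blank lines between entries
            continue
        annotations = []
        while pos < len(lines) and lines[pos].startswith(";"):
            annotations.append(lines[pos][1:])
            pos += 1
        if not annotations or pos >= len(lines):
            raise ValueError("each entry needs an annotation line and a name")
        name = lines[pos].strip()
        pos += 1
        chunks = []
        while pos < len(lines) and not lines[pos].startswith(";"):
            chunks.append(lines[pos].strip())
            pos += 1
        entries.append({"name": name,
                        "annotations": annotations,
                        "sequence": "".join(chunks)})
    return header, entries


example = ";;example alignment\n;first sequence\nseq1\nACG-T\nAC\n;\nseq2\nACGGT\nAC\n"
header, entries = parse_mase(example)
print(header)
for entry in entries:
    print(entry["name"], entry["sequence"])
```

The second entry of the example shows the case mentioned in the caption where the mandatory annotation line is present but empty.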
useful when building phylogenetic trees where one needs to exclude unreliable
parts of alignments (i.e. regions for which the alignment is ambiguous). It is also
useful to select particular domains for profile searches. Definitions of groups
and blocks can be saved along with the alignment in MASE format (Figure 6).
• Protein domains
ProDom    http://protein.toulouse.inra.fr/prodom.html
PRINTS    http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
DOMO      http://www.infobiogen.fr/~gracy/domo/
PFAM      http://genome.wustl.edu/Pfam/
BLOCKS    http://blocks.fhcrc.org/
• RNA/DNA
Ribosomal Database Project    http://www.cme.msu.edu/RDP/
The rRNA WWW server           http://rrna.uia.ac.be/
ACUTS                         http://pbil.univ-lyon1.fr/acuts/ACUTS.html
7 Summary
In this chapter, we describe methods commonly used to align homologous
sequences. Searching for the best alignment consists of finding the one that
represents the most likely evolutionary scenario (substitutions, insertions, and
deletions). Different alignment algorithms have been developed, but none of
them is ideal. Because of time and memory requirements, algorithms that
guarantee to find the best alignment for a given evolutionary model can be used
in practice only with a very limited number of short sequences. Therefore, non-
optimal algorithms based on heuristics have been proposed to gain speed and
References
1. Kimura, M. (1983). The neutral theory of molecular evolution. Cambridge University Press,
Cambridge.
2. Doolittle, R. F. (1994). Trends Biochem. Sci., 19, 15.
3. Wootton, J. C. and Federhen, S. (1993). Computers Chem., 17, 149.
4. Altschul, S. F. and Gish, W. (1996). In Methods in enzymology (ed. R. F. Doolittle), Vol. 266,
p. 460. Academic Press, London.
5. Patthy, L. (1996). Matrix Biol., 15, 301.
6. Schwartz, R. M. and Dayhoff, M. O. (1978). In Atlas of protein sequence and structure
(ed. M. O. Dayhoff), p. 353. Nat. Biomed. Res. Found., Washington DC.
7. Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Science, 256, 1443.
8. Henikoff, S. and Henikoff, J. G. (1992). Proc. Natl. Acad. Sci. USA, 89, 10915.
9. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Nucleic Acids Res., 22, 4673.
10. Overington, J., Donnelly, D., Johnson, M. S., Sali, A., and Blundell, T. L. (1992). Protein
Sci., 1, 216.
11. Bains, W. (1992). Mutat. Res., 267, 43.
12. Hess, S. T., Blake, J. D., and Blake, R. D. (1994). J. Mol. Biol., 236, 1022.
13. Benner, S. A., Cohen, M. A., and Gonnet, G. H. (1993). J. Mol. Biol., 229, 1065.
14. Gu, X. and Li, W. H. (1995). J. Mol. Evol., 40, 464.
15. Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., Ouellette, B. F. F., Rapp, B. A.,
et al. (1999). Nucleic Acids Res., 27, 12.
16. Stoesser, G., Tuli, M. A., Lopez, R., and Sterk, P. (1999). Nucleic Acids Res., 27, 18.
17. Bairoch, A. and Apweiler, R. (1999). Nucleic Acids Res., 27, 49.
18. Barker, W. C., Garavelli, J. S., McGarvey, P. B., Marzec, C. R., Orcutt, B. C., Srinivasarao,
G. Y., et al. (1999). Nucleic Acids Res., 27, 39.
19. Schuler, G. D., Epstein, J. A., Ohkawa, H., and Kans, J. A. (1996). In Methods in
enzymology (ed. R. F. Doolittle), Vol. 266, p. 141. Academic Press, London.
20. Etzold, T. and Argos, P. (1993). CABIOS, 9, 49.
21. Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., and Di-Paola, G. (1985). Comput. Appl.
Biosci., 1, 167.
22. Pearson, W. R. and Lipman, D. J. (1988). Proc. Natl. Acad. Sci. USA, 85, 2444.
23. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). J. Mol. Biol.,
215, 403.
24. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. H., Zhang, Z., Miller, W., et al.
(1997). Nucleic Acids Res., 25, 3389.
25. Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994). Nature Genet., 6, 119.
26. Hughey, R. and Krogh, A. (1996). Comput. Appl. Biosci., 12, 95.
27. Notredame, C. and Higgins, D. G. (1996). Nucleic Acids Res., 25, 4570.
28. Chan, S. C., Wong, A. K. C., and Chiu, D. K. Y. (1992). Bull. Math. Biol., 54, 563.
29. Carillo, H. and Lipman, D. (1988). SIAM J. Appl. Math., 48, 1073.
30. Altschul, S. F., Carroll, R. J., and Lipman, D. J. (1989). J. Mol. Biol., 207, 647.
31. Gotoh, O. (1995). Comput. Appl. Biosci., 11, 543.
32. Sankoff, D. (1975). SIAM J. Appl. Math., 78, 35.
33. Needleman, S. B. and Wunsch, C. D. (1970). J. Mol. Biol., 48, 443.
34. Lipman, D. J., Altschul, S. F., and Kececioglu, J. D. (1989). Proc. Natl. Acad. Sci. USA, 86,
4412.
35. Gupta, S., Kececioglu, J. D., and Schaffer, A. (1995). J. Comput. Biol., 2, 459.
36. Jiang, T., Lawler, E. L., and Wang, L. (1994). ACM Sympos. Theory Comput., 26, 760.
37. Feng, D. F. and Doolittle, R. F. (1987). J. Mol. Evol., 25, 351.
38. Taylor, W. R. (1988). J. Mol. Evol., 28, 161.
39. Higgins, D. G. and Sharp, P. M. (1988). Gene, 73, 237.
40. Sneath, H. A. and Sokal, R. R. (1973). Numerical taxonomy. W. H. Freeman, San Francisco.
41. Saitou, N. and Nei, M. (1987). Mol. Biol. Evol., 4, 406.
42. Gotoh, O. (1996). J. Mol. Biol., 264, 823.
43. Vingron, M. and Argos, P. (1989). Comput. Appl. Biosci., 5, 115.
44. Sobel, E. and Martinez, H. (1986). Nucleic Acids Res., 14, 363.
45. McCreight, E. M. (1976). J. ACM, 232, 262.
46. Vingron, M. and Argos, P. (1991). J. Mol. Biol., 218, 33.
47. Morgenstern, B., Dress, A., and Werner, T. (1996). Proc. Natl. Acad. Sci. USA, 93, 12098.
48. Morgenstern, B., Atchley, W. R., Hahn, K., and Dress, A. (1998). Ismb, 6, 115.
49. Zhang, Z., Raghavachari, B., Hardison, R., and Miller, W. (1996). J. Comput. Biol., 1, 217.
50. Myers, E. and Miller, W. (1995). In Proc. 6th ACM-SIAM Symposium On Discrete Algorithms,
p. 38.
51. Zhang, Z., He, B., and Miller, W. (1996). J. Disc. Appl. Math., 71, 337.
52. Abdeddaim, S. (1997). In Lecture notes in computer science, Vol. 1264, p. 167.
Springer-Verlag.
53. Smith, R. F. and Waterman, M. S. (1981). J. Mol. Biol., 147, 195.
54. Waterman, M. S. and Jones, R. (1990). In Methods in enzymology (ed. R. F. Doolittle),
Vol. 183, p. 221. Academic Press, London.
55. Schuler, G. D., Altschul, S. F., and Lipman, D. J. (1991). Proteins, 9, 180.
56. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton,
J. C. (1993). Science, 262, 208.
57. Henikoff, S., Henikoff, J. G., Alford, W. J., and Pietrokovski, S. (1995). Gene-COMBIS, Gene,
163, 17.
58. Bailey, T. L. and Elkan, C. (1995). Ismb, 3, 21.
59. Lawrence, C. E. and Reilly, A. A. (1990). Proteins, 7, 41.
60. Cardon, L. R. and Stormo, G. D. (1992). J. Mol. Biol., 223, 159.
61. McClure, M. A., Vasi, T. K., and Fitch, W. M. (1994). Mol. Biol. Evol., 11, 571.
62. Briffeuil, P., Baudoux, G., Lambert, C., De Bolle, X., Vinals, C., Feytmans, E., et al.
(1998). Bioinformatics, 14, 357.
63. Morrison, D. A. and Ellis, J. T. (1997). Mol. Biol. Evol., 14, 428.
64. Huang, X. and Miller, W. (1991). Adv. Appl. Math., 12, 337.
65. Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F., and Higgins, D. G. (1997).
Nucleic Acids Res., 25, 4876.
66. Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G., and Gibson, T. J. (1998).
Trends Biochem. Sci., 23, 403.
67. Brocchieri, L. and Karlin, S. (1998). J. Mol. Biol., 276, 249.
68. Galtier, N., Gouy, M., and Gautier, C. (1996). Comput. Appl. Biosci., 12, 543.
69. Parry-Smith, D. J., Payne, A. W., Michie, A. D., and Attwood, T. K. (1998). Gene, 221,
GC57.
70. Schneider, T. D. and Stephens, R. M. (1990). Nucleic Acids Res., 18, 6097.
71. Srinivasarao, G. Y., Yeh, L. S. L., Marzec, C. R., Orcutt, B. C., Barker, W. C., and Pfeiffer,
F. (1999). Nucleic Acids Res., 27, 284.
72. Yona, G., Linial, N., Tishby, N., and Linial, M. (1998). Ismb, 6, 212.
73. Corpet, F., Gouzy, J., and Kahn, D. (1999). Nucleic Acids Res., 27, 263.
74. Attwood, T. K., Flower, D. R., Lewis, A. P., Mabey, J. E., Morgan, S. R., Scordis, P., et al.
(1999). Nucleic Acids Res., 27, 220.
75. Gracy, J. and Argos, P. (1998). Bioinformatics, 14, 164.
76. Gracy, J. and Argos, P. (1998). Bioinformatics, 14, 174.
77. Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Finn, R. D., and Sonnhammer, E. L. L.
(1999). Nucleic Acids Res., 27, 260.
78. Henikoff, J. G., Henikoff, S., and Pietrokovski, S. (1999). Nucleic Acids Res., 27, 226.
79. Duret, L., Mouchiroud, D., and Gouy, M. (1994). Nucleic Acids Res., 22, 2360.
80. Maidak, B. L., Cole, J. R., Parker Jr, C. T., Garrity, G. M., Larsen, N., Li, B., et al. (1999).
Nucleic Acids Res., 27, 171.
81. Van de Peer, Y., Robbrecht, E., de Hoog, S., Caers, A., de Rijk, P., and de Wachter, R.
(1999). Nucleic Acids Res., 27, 179.
82. De Rijk, P., Robbrecht, E., de Hoog, S., Caers, A., Van de Peer, Y., and de Wachter, R.
(1999). Nucleic Acids Res., 27, 174.
83. Duret, L., Gasteiger, E., and Perriere, G. (1996). Comput. Appl. Biosci., 12, 507.
Chapter 4
Hidden Markov models for
database similarity searches
Ewan Birney
The Sanger Centre, Wellcome Trust Genome Campus, Cambridge, UK.
1 Introduction
Despite the huge number of genes in an organism, the protein coding genes are
thought to be made from a limited number of basic protein structures. Evolution
has reused these protein structures, combining them to form different proteins,
and altering them in different genes to achieve different functions. The diversity
of species, each with its own copies of genes made from the limited number of
building blocks, means that, for a protein of interest, a number of different re-
lated proteins may be found. In this chapter, I will discuss one set of techniques
which can be used to take advantage of this diversity of protein sequence. These
techniques are all related to the use of profiles, which are also discussed in
Chapter 5. In this chapter, the emphasis will be on the use of hidden Markov
models (HMMs) for profile analysis. Some practitioners consider profiles to be a
type of HMM.
It is important to realize that a protein might be related to another protein in
a variety of different ways. It could be that the entire protein is homologous
(that is, derived from a common ancestor) to another, such as human and mouse
src protein (see Figure 1) or the human src2 protein which is a paralog to the src
protein. Alternatively only a portion of the protein might be derived from a
common ancestor, such as the fyn protein, which shares a common C-terminal
region with the src protein but has a divergent N terminus. Finally, only a small
region might be conserved, such as the SH3 domain, which is also found in the Grb2
protein (along with many other proteins) with no other organization conserved
between the two proteins. This last type of conservation, conservation of a
domain, generally corresponds to a structural domain of the protein which can
fold independently and, in most cases, function independently of other regions.
Figuring out when you have really defined a domain rather than a more exten-
sive piece of conservation is one of the challenges for a researcher. Profile
analysis is useful for all these different types of conservation. It is especially
useful for domain analysis as this is the hardest feature to define using other
methods.
Figure 1 Three different types of relationship are shown. The grey ovals indicate regions
which are conserved, whereas the lines and other boxes show regions which are not related.
(1) Two very closely related genes, where the entire protein sequence in each gene is
conserved. (2) Two genes where the C termini are related but the N termini are unique.
(3) Two genes which share one domain but the other regions are entirely different.
2 Overview
For people coming from outside the field, the use of profiles and profile-HMMs
can require confronting much confusing jargon and cryptic computer programs.
This chapter is meant to demystify this type of analysis. The first point to
emphasize is that the programs are, basically, just employing some concept of a
'consensus'. This follows intuitively from the observation that if some sequences
have an Aspartate before a critical catalytic residue and others a Glutamate then
a new enzyme can be expected to have either an Aspartate or a Glutamate at
this position. This sort of simplistic rule is recast into a mathematically
convenient form, giving a probability for each possible amino acid at each
position; this is called a profile. The difficulty, as in many areas of
sequence analysis, is that there may be different numbers of amino acids between
conserved residues. A consequence of the differing lengths is that there are
usually a number of different ways of providing a match to a 'consensus', and
some way of choosing the 'best' one must be decided. The variable lengths
between conserved residues also make the statistical behaviour of the technique
very hard to handle using conventional statistical analysis.
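To make the idea of a profile concrete, here is a minimal sketch in Python that turns an ungapped toy alignment into per-position amino acid frequencies. The function name and the four three-residue sequences are invented for illustration; real profile methods also handle gaps, sequence weighting, and prior information, none of which is attempted here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(aligned_seqs):
    """Return, for each alignment column, a dict of amino acid -> observed frequency."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        # Only the 20 standard amino acids are counted; other symbols are ignored.
        total = sum(counts[a] for a in AMINO_ACIDS)
        profile.append({a: counts[a] / total for a in AMINO_ACIDS if counts[a]})
    return profile

# Toy alignment (invented): Asp or Glu precedes a conserved His.
toy = ["GDH", "GEH", "ADH", "GEH"]
profile = column_profile(toy)
print(profile[1])
```

In this toy case the second column gives equal weight to Aspartate and Glutamate, which is the 'either an Aspartate or a Glutamate' rule expressed as probabilities.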
This chapter will concentrate first on using databases of profile-HMMs
through the World Wide Web (WWW), which is by far the easiest way of using
them. Then we will concentrate on PSI-BLAST (1), which is the easiest
do-it-yourself profile method, also available through the Web. The final example will
cover the use of the HMMER2 (2) package, which I find to be the most effective
profile-HMM package available. It is UNIX based and relatively easy to use,
though it is not currently available through the Web. Finally I will outline some
of the theory behind profile-HMMs from the point of view of how it affects
their practical use.
The reader should be aware that there are many other profile-HMM packages.
I would draw your attention in particular to the Meta-MEME package (3) and the
PROBE package (4) as well thought out solutions. There are also a number of
other profile packages (5, 6) which are more focused on the use of the package
by their own groups. Finally, a number of commercial solutions exist (7-9), and
you may well have access to them. If you know someone on site who is already
skilled in using one of these packages, it is best to use that local knowledge and
treat this chapter as more of an introduction to the concepts involved. In
addition, it is likely that by the time you read this chapter, new methods
or new presentations of old methods will have become available. To keep up to date,
use the Web, and try the URLs in Table 1 to find the most up-to-date resources.
Finally, I would like to warn users that I have a strong bias towards using a
probabilistic framework to explain and justify the methods: this fits most easily with
the HMM formalism and the use of Bayesian statistics (a branch of probability
analysis). Other researchers are less zealous about using this sort of framework
to explain the results. In either case the most important question is whether
these methods are biologically useful, whatever the theories say.
Once you have a protein sequence it is probably best to put it into Fasta
format (see Section 8.5), though many resources will allow you to use other for-
mats. Then connect to one of the resources shown in Table 1 and find the page
for searching with your sequence.
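If your sequence is not yet in Fasta format, a few lines of Python are enough to produce it. The sketch below assumes the usual convention of a '>' header line followed by the residues wrapped to a fixed width; the function name, the sequence name, and the residues are invented for illustration.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a Fasta record: a '>' header line, then wrapped residues."""
    # Remove any whitespace pasted in with the sequence and use upper case.
    seq = "".join(sequence.split()).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

# Invented example sequence, wrapped at 10 residues per line for display.
record = to_fasta("my_protein", "mgsnkskpkd asqrrrslep", width=10)
print(record, end="")
```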
Choose the 'search' page. Then use the 'file-upload' button on the forms to
submit your own sequence, and click 'submit' or 'run analysis'. The search against
the database will probably take a little over a minute, and should not take more
than 10 minutes. Each resource returns its own particular format of results, but
what is generally reported is the type of the domain, the position in your
protein as a start- and end-point, and some indication of how confident you can be
of the hit. They all provide a nice graphical representation of the domain on
your sequence as a cartoon of the sequence with different coloured or shaped
regions indicating the different domains. Clicking on the graphic will usually
take you to an in-depth description of the domain, which in many cases will
contain links to other resources and literature references. How to interpret the
precise results varies from resource to resource.
3.1 Pfam
Pfam (10) is a database of protein families and corresponding profile-HMMs. Pfam
uses the HMMER2 package to provide tools for making the HMMs in the first
place and then for searching them. A search against Pfam will provide you with
three ways of deciding confidence in the matches. The first is a classical e-value
(expectation value) which generally is considered significant if it is below 1.0 for
individual searches. The second is the bits score which is derived from the
underlying scoring scheme used to score the match between the sequence and
the profile-HMMs. It is related to the Bayesian inference of the probability of the
match (for a deeper explanation of the statistics read Section 8). A final check is
provided by a manually derived cut-off which an 'expert' has chosen to separate
the true examples from false examples. These cut-offs are chosen conservatively
so that, to the researcher's knowledge, they do not misclassify any protein. This
can mean that, in some cases, known trues are missed using this cut-off.
At the time of writing the Pfam database (Version 3.3) had 1344 protein
families, which covered 57% of the protein primary sequence database. In new
genome projects over one third of proteins had at least one hit to a protein
family.
3.3 SMART
SMART (12) is currently based on conventional, non-HMM profile technology.
The raw score is meaningless; rather, you must trust the manually set cut-offs
provided internally. At the time of writing there were 302 profiles in SMART.
SMART is not focused on coverage but rather on providing very accurate
alignments and resources of the domains of interest. It is likely that, by the time
you read this, SMART will have switched to using the HMMER2 package rather than
old-style profiles.
4 Using PSI-BLAST
PSI-BLAST (1) is a profile building and searching package which is fast, accessible
through the Web, and aimed at a less expert audience than the other profile
packages. This makes it ideal for occasional use or quick investigations of a
particular protein sequence.
In many ways PSI-BLAST follows the same methodology as using the HMMER
package below, except that this is done behind the scenes. PSI-BLAST starts from a
single sequence, which is then searched against a database using the fast BLAST
method. The resulting matches are aligned back to the query sequence, and this
derived multiple alignment is used to estimate a profile. The profile is then used
to search the database, collect homologues, and align back to the profile, and so
the process iterates onwards until it stabilizes or some cut-off is exceeded.
A URL to start the process off is given in Table 1. You load in your protein
sequence and launch the first search. At the end of each search you have the
option of including or rejecting each sequence for the next iteration. This gives
you the chance to eliminate potential false positives and include weak but true
matches from your knowledge of the biology. An e-value statistic is provided to
give an automatic selection of the next round of sequences, which should guide
you in your selection.
Many of the problems inherent in using PSI-BLAST are also present when
using HMMER, and so I would encourage you to read Sections 6 and 7
carefully. Crucially, you must be aware that the statistic to quote for the significance
of a match is the one from the first round in which it appears: once a particular
sequence has been included in the set which makes the profile, it will,
unsurprisingly, score very well against the resulting profile.
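The iterate-until-stable loop, together with the rule of quoting the statistic from the round in which a sequence first appears, can be sketched as follows. This is not PSI-BLAST itself: `search` stands for any hypothetical function that builds a profile from the current members and returns e-values for database sequences, and the 0.01 inclusion cut-off and the toy numbers are assumptions made purely for illustration.

```python
def iterate_search(seed_ids, search, max_rounds=5, evalue_cutoff=0.01):
    """Generic iterative profile search.

    search(member_ids) is a caller-supplied (hypothetical) function that builds a
    profile from the current members, searches the database, and returns
    {sequence_id: e_value}.
    Returns {sequence_id: (round, e_value)} as recorded when each hit FIRST passed
    the cut-off, which is the value that should be quoted.
    """
    members = set(seed_ids)
    first_seen = {}
    for round_no in range(1, max_rounds + 1):
        hits = search(members)
        new = {sid: ev for sid, ev in hits.items()
               if ev <= evalue_cutoff and sid not in members}
        if not new:  # nothing new to add: the process has stabilized
            break
        for sid, ev in new.items():
            first_seen[sid] = (round_no, ev)
        members.update(new)
    return first_seen

def toy_search(members):
    # Invented numbers: "C" only becomes detectable once "B" is in the profile,
    # and "B" then scores far better against a profile that contains it.
    if "B" in members:
        return {"A": 1e-80, "B": 1e-60, "C": 1e-4}
    return {"A": 1e-50, "B": 1e-5, "C": 3.0}

result = iterate_search({"A"}, toy_search)
print(result)
```

In the toy run, sequence B is reported with the e-value from round 1, not with the much smaller value it obtains after it has been built into the profile.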
The other problems of PSI-BLAST are less to do with the method and more to do
with how it is used. Because it starts with a single sequence, it is tempting to put in
an entire sequence of interest and simply start iterating. If the sequence
contains one common domain, PSI-BLAST will find all the homologues of
the sequence, both those including the domain and those excluding it, but your
results will be dominated by this domain and become unmanageable. As you focus
your effort on a particular region, it is better to excise that region and use that
as a starting point for further analysis.
5 Using HMMER2
HMMER2 (2) is a package of UNIX command line programs which make and use
profile HMMs. If you have no experience of the UNIX command line, then using
HMMER2 is going to be a struggle. I suggest taking a short course in UNIX first.
In addition to the HMMER2 software, you will need a number of other reason-
ably standard bioinformatics resources. In particular:
(a) A copy of an up-to-date protein database in fasta format as a single file.
(b) A method of retrieving sequences from this database, preferably with the
ability to retrieve only a portion.
(c) A multiple alignment program such as Clustal W (17).
(d) A specialized multiple alignment editor.
It may also be useful to have some experience of a text reformatting language
such as Perl or Python, or access to someone who can write small glue programs
for you. Installing the resources is best done with the co-operation of the systems
support group for the UNIX machine you are using.
Figure 2 A flow diagram of how profile-HMMs are commonly used. The programs in the
HMMER package which are used to provide the different transformations are given beside the
arrows. PSI-BLAST uses the same principles although much of the mechanics are then hidden
from the user.
as the second argument. The database file is in Fasta format. The results are
printed on standard output, so you usually need to redirect the output to save
that information in a file. The results give you the following information:
(a) A list of sequences which the HMM hit, ranked from most significant to least.
(b) A list of domains contained in the sequences, ranked from most significant to
least.
Notice that a particular sequence can contain more than one domain. In
particular, although each of the domain scores might, on its own, not be
significant, the combined score of a multi-domain match might easily be so.
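For readers who drive their analyses from a script, the search just described (the Fasta database given as the second argument, with standard output redirected to a file) might be wrapped as in the sketch below. It assumes that the profile-HMM file is the first argument of hmmsearch and that the program is on your path; the function names and file names are hypothetical.

```python
import subprocess

def hmmsearch_command(hmm_file, fasta_db):
    """Build the argument list: the profile-HMM file first, the Fasta database second."""
    return ["hmmsearch", hmm_file, fasta_db]

def run_hmmsearch(hmm_file, fasta_db, out_file):
    """Run the search and redirect its standard output into out_file."""
    with open(out_file, "w") as out:
        # check=True raises an error if hmmsearch exits with a failure status.
        subprocess.run(hmmsearch_command(hmm_file, fasta_db), stdout=out, check=True)

# Hypothetical usage (requires HMMER to be installed):
# run_hmmsearch("sh3.hmm", "proteins.fasta", "sh3_hits.out")
```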
Both the per-sequence and per-domain matches are provided with two
statistics: a bits score and an e-value (see Section 8.2). The more reliable score is
the e-value; e-values below 1.0 can be considered significant (an e-value of 1.0
means that, by chance, 1 random sequence is expected to get this score in a
database of the size which you used). The bits score is helpful as it is
independent of database size.
Having chosen a significance level one would then like to make a new multi-
ple alignment of all the protein sequences found by the HMM. At the moment,
this is the most labour intensive step, as the HMMER package does not provide
all the functionality for this task. Somehow, one needs to extract all the sequences
which are hit and truncate them to the correct start/end points. This is best done
by a Perl script or similar device. Once you have all the sequences which were
hit, as a Fasta file, the program hmmalign will provide a multiple alignment of
the proteins on the basis of the HMM. This multiple alignment can then be used
to make a new HMM for the next round.
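The extract-and-truncate step is the kind of small glue program mentioned earlier. The sketch below uses Python rather than Perl and assumes that you have already collected the hits as (identifier, start, end) triples with 1-based, inclusive coordinates; the function names, the identifier/start-end naming of the output records, and the toy sequences are all illustrative choices, not part of HMMER.

```python
def read_fasta(text):
    """Parse Fasta text into {identifier: sequence}; the identifier is the first word after '>'."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}

def excise_hits(seqs, hits):
    """hits: iterable of (identifier, start, end), 1-based inclusive coordinates.
    Returns Fasta text with one truncated record per hit, named identifier/start-end."""
    records = []
    for name, start, end in hits:
        fragment = seqs[name][start - 1:end]
        records.append(">%s/%d-%d\n%s\n" % (name, start, end, fragment))
    return "".join(records)

# Invented toy database and hit list.
db = ">seqA test\nMKTAYIAKQR\nQISFVKSHFS\n>seqB\nGGGWWWGGG\n"
out = excise_hits(read_fasta(db), [("seqA", 3, 12), ("seqB", 4, 6)])
print(out, end="")
```

The resulting Fasta text is what would then be passed to hmmalign, as described above.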
6 False positives
One of the problems inherent to the iterative procedures, both PSI-BLAST and the
use of HMMER outlined above, is that if a false positive is added to the alignment,
it and any close relatives will score highly against the profile-HMM. For
example, if you inadvertently add a globin sequence to a protein kinase
alignment, the resulting HMM will match globin sequences surprisingly well.
This ability to start collecting false positives at will means that a researcher
should ideally be very vigilant as the iterations progress. Indications that the
profile might be picking up noise are:
(a) Low complexity regions occurring in the alignment.
(b) A region overlapping a known domain, where it is clear that the multiple
alignment is not a divergent subfamily for this domain.
(c) Biological information that indicates that this match is false.
correct measure of the significance of the match, as it includes all the sequences
you wish to score, and they will all score well. In fact the problem of justifying a
grouping of sequences is not well handled by the current statistics, in particular
when an iterative strategy is used. The following lines of evidence may be used
to give a researcher confidence that the similarity they observe is not by chance.
(a) See whether all the sequences can be connected together by significant single
sequence scores (e.g. from programs such as BLAST2). Ideally one should be
able to show this with the full length proteins (just taking the domain
improves the statistics considerably).
(b) Quote the significance of the 'new' sequences for the first time they provided
a significant score against the profile-HMM.
(c) To show that A is related to B, show that by starting from either A or B one can
produce a profile which finds the other sequence using criterion (b).
(d) Provide biological justification that the relationship makes sense (e.g. com-
mon mode of enzymatic action). Conversely, biological information which
indicates that they should not be related should lessen the researcher's belief
in the result.
it' and 'if this sequence does have an example of the HMM, what is the align-
ment of the HMM to the sequence'.
The HMMs which have been used in this field deliberately mimic the profile
model described above. For each conserved column in the multiple alignment,
three possible states are permitted: a match state which indicates that a single
residue is being aligned to the position, a delete state, which indicates that no
amino acid in this protein is present for this position, and an insert state, which
allows any number of amino acids to be inserted after the match position. A full-
length HMM will have some 500 or so different states, broken down into triplets
representing conserved column positions. The behaviour of these states is
governed by probabilities for the production of different amino acids from
match and insert states and probabilities for the transitions to the neighbouring
states. These probabilities are analogous to the scores of the profile and the gap
penalties in the profile respectively. Indeed, for practical purposes, the
probability representation of a profile-HMM is rarely used. Instead, the probabilities
are transformed into sensibly sized integers via a log transformation. In this
logged representation, adding the numbers is equivalent to multiplying the
underlying probabilities, making the correspondence between profiles and
profile-HMMs all the more clear.
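The equivalence between adding logged scores and multiplying probabilities is easy to check numerically. In the sketch below the base-2 logarithm and the scale factor of 1000 are assumptions chosen to give conveniently sized integers; the two probabilities are invented.

```python
import math

def log_score(probability, scale=1000):
    """Turn a probability into a scaled integer log score (log base 2)."""
    return round(scale * math.log2(probability))

# Two successive events with invented probabilities, e.g. an emission and a transition.
p_first, p_second = 0.5, 0.25

summed = log_score(p_first) + log_score(p_second)   # add the integer scores
direct = log_score(p_first * p_second)              # multiply, then take the log
print(summed, direct)
```

With probabilities that are not exact powers of two, rounding to integers means the two routes can differ by a unit or so, which is the price paid for fast integer arithmetic.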
Given a particular profile-HMM, the questions 'does this sequence have an
example of my HMM' and, given that the answer to that question is yes, 'what is the
alignment of the sequence to the HMM' can be easily answered using some well
known algorithms. These two questions are essentially what hmmsearch and
hmmalign provide answers for in the HMMER package.
What is quoted is the log-likelihood ratio of the two models: when the base of
the log is 2, this statistic (the log-likelihood ratio) is called a bits score. A bits score
of 0 corresponds to a likelihood ratio of 1, and hence each model is equally likely to
have produced the sequence. Depending on how the ratio is quoted, either more
negative or more positive scores indicate that the desired model is more likely.
In the HMMER package, the more positive the bits score the better the match to
the profile-HMM.
The likelihood ratio does not provide quite enough information to allow an
estimate of the probability of the profile-HMM occurring, given the sequence
seen. An additional piece of information, being the probability of the profile-
HMM occurring without seeing the sequence data needs to be defined. As this
information has to be defined without seeing any sequence, it is called prior
information. Mathematically this is the same idea as the prior information
which will be introduced in the next section, but in practice it is used in a very
different way. Sensible priors include 1/d, where d is the size of the database,
or one could use the probability of a random sequence having this domain in a
genome, say 1/10 000. Finally, to be confident of the match, the probability of
the profile-HMM occurring should be over 0.95. These two extra manipulations
(the prior information and the need for a significant probability) translate into
a bits cut-off above which one considers matches to be significant. 25 bits
corresponds to sensible choices of prior and significance, and so matches over 25 bits
can be considered to be significant.
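The way a prior and a required probability turn into a bits cut-off can be worked through with Bayes' rule in odds form: the posterior odds are the likelihood ratio (2 raised to the bits score) multiplied by the prior odds. The sketch below is my own illustration of that arithmetic, not code from any package; the function names and the second prior (one in a million, standing in for 1/d) are assumptions.

```python
import math

def posterior_probability(bits, prior):
    """Posterior probability of the profile-HMM given the sequence, from a bits score and a prior."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = (2.0 ** bits) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

def bits_needed(prior, posterior=0.95):
    """Smallest bits score at which the posterior probability reaches the requested level."""
    prior_odds = prior / (1.0 - prior)
    target_odds = posterior / (1.0 - posterior)
    return math.log2(target_odds / prior_odds)

for prior in (1 / 10_000, 1 / 1_000_000):
    print(prior, round(bits_needed(prior), 1))
```

Under these assumptions a prior of 1/10 000 needs roughly 17.5 bits and a prior of one in a million roughly 24.2 bits, so a cut-off of 25 bits is on the cautious side of both.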
There is another statistic that can be used to estimate whether the match is
significant or not. This is a classical (or frequentist) statistic and is one that most
users will be more familiar with. To provide a frequentist statistic, one needs to
assume that the match is random, derive the probability that a random match
would produce the score, and reject the assumption if this seems very unlikely
to have occurred. The problem with this sort of analysis was that it was clear
that the distribution of scores of random sequences against a profile-HMM was
not normally distributed, and so estimation of probability was very difficult. In
recent years the field has produced theoretical and empirical evidence that the
distribution is closely related to an Extreme Value Distribution (EVD) (15).
PSI-BLAST assumes that for a particular way of making a profile, all profiles have the
same EVD parameters, regardless of content. PSI-BLAST therefore tabulates this
information for all possible profile construction mechanisms, and uses the
tabulated parameters. HMMER uses a separate calibration step, where the
profile-HMM is compared to a large random database, and an EVD is fitted to the
resulting distribution. The parameters from this fitting are stored in the HMM so
they can be reused for individual sequence searches.
The natural way of reporting the classical statistic is as an expectation value
(e-value). This is the number of sequences expected to get this score by chance,
and is simply d × r, where d is the database size and r is the probability that a random
sequence will get this score. An e-value of 1.0 is therefore where you expect to
start seeing random sequences: e-values less than this are significant.
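As a sketch of how an e-value follows from a fitted EVD, the code below uses the standard Gumbel form of the extreme value distribution with a location parameter mu and a scale parameter lambda. The parameter values in the example are invented, and the function names are mine rather than those of any package.

```python
import math

def evd_pvalue(score, mu, lam):
    """P(S >= score) for one random sequence under a Gumbel-type extreme value distribution."""
    # 1 - exp(-exp(-lam * (score - mu))), written with expm1 for accuracy in the far tail.
    return -math.expm1(-math.exp(-lam * (score - mu)))

def evalue(score, mu, lam, db_size):
    """Expected number of random sequences in the database reaching this score: d x r."""
    return db_size * evd_pvalue(score, mu, lam)

# Invented parameters: the same score searched against databases of two sizes.
for d in (100_000, 200_000):
    print(d, evalue(20.0, mu=-5.0, lam=0.7, db_size=d))
```

Doubling the database size doubles the e-value for the same score, which is the dependence on database size discussed in the text.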
Which statistic to use: bits score or e-value? It is clear that the e-value statistic
is more robust and more sensitive in the HMMER2 package, and it is what I
would recommend. However, the e-value has some less desirable properties: in
particular, it changes as the database size changes, unlike the bits score. Of
course, if you decide the prior on the bits score should be 1/d (d is database size),
the cut-off for significance of the bits score will also change with database size.
Quoting both in publications is very sensible.
2. The estimation of amino acid probabilities, taking into account protein
evolution, has a stronger theoretical backing. The problem is phrased as an
undersampling problem: although one has a column of, say, 10 amino acids at
this position, the observed frequencies of amino acids are not an
ideal way to estimate the underlying probabilities; clearly not all the amino
acids can be represented even once! This problem occurs in many other
situations and has been well studied. A good solution is to provide the estimation
machinery with prior knowledge of what sort of amino acid frequencies one
expects in columns, for example, one with high leucine, valine, and isoleucine
probabilities, and another with a high probability of arginine and lysine, but low
probability for hydrophobic amino acids. These distributions are represented in
a complicated mathematical form called Dirichlet mixtures (16) and, by using
them, the estimation of probabilities for amino acid positions can take into
account evolutionary information. A Dirichlet mixture is just a convenient
mathematical form for this information; there is nothing special about them
except that they make the downstream mathematics far easier to handle. The
Dirichlet mixture can be thought of rather like a protein comparison matrix
used in ad hoc profile methods.
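The effect of such prior knowledge can be seen in a deliberately simplified sketch that uses a single Dirichlet component rather than a full mixture. With one component, the estimated probability of each amino acid is its observed count plus a prior pseudocount, divided by the total of both. The pseudocount values below are invented to represent an 'aliphatic' column type and are not taken from any published mixture.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def posterior_mean(counts, alpha):
    """Estimated amino acid probabilities under a single Dirichlet prior.

    counts: observed residue counts in one column; alpha: prior pseudocounts.
    """
    total = sum(counts.get(a, 0) + alpha[a] for a in AMINO_ACIDS)
    return {a: (counts.get(a, 0) + alpha[a]) / total for a in AMINO_ACIDS}

# Invented prior favouring leucine, valine, and isoleucine.
alpha = {a: 0.1 for a in AMINO_ACIDS}
alpha.update({"L": 2.0, "V": 2.0, "I": 2.0})

# A column of 10 residues in which isoleucine was never observed.
column = {"L": 6, "V": 4}
probs = posterior_mean(column, alpha)
print(round(probs["I"], 3), round(probs["D"], 3))
```

Isoleucine, although unobserved in these 10 residues, receives a much higher probability than aspartate, which is the behaviour the prior is meant to supply. A mixture extends this by weighting several such components according to how well each explains the observed column.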
References
1. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al.
(1997). Nucleic Acids Res., 25, 3389.
2. Eddy, S. R. (1998). Bioinformatics, 14, 755.
3. Grundy, W. N., Bailey, T., Elkan, C. P., and Baker, M. E. (1997). Comput. Appl. Biosci., 13,
397.
4. Neuwald, A. F., Liu, J. S., Lipman, D. J., and Lawrence, C. E. (1997). Nucleic Acids Res.,
25, 1665.
5. Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. (1994). J. Mol. Biol.,
235, 1501.
6. Bucher, P., Karplus, K., Moeri, N., and Hofmann, K. (1996). Comput. Chem., 20, 3.
7. http://www.gcg.com/
8. http://www.netid.com/
9. http://www.compugen.com/
10. Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Finn, R. D., and Sonnhammer, E. L. L.
(1999). Nucleic Acids Res., 27, 260.
11. Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. (1999). Nucleic Acids Res., 27, 215.
12. Ponting, C. P., Schultz, J., Milpetz, F., and Bork, P. (1999). Nucleic Acids Res., 27, 229.
13. Attwood, T. K., Flower, D. R., Lewis, A. P., Mabey, J. E., Morgan, S. R., Scordis, P., et al.
(1999). Nucleic Acids Res., 27, 220.
14. Henikoff, J. G., Henikoff, S., and Pietrokovski, S. (1999). Nucleic Acids Res., 27, 226.
15. Altschul, S. F. and Gish, W. (1996). In Methods in enzymology (ed. R. F. Doolittle), Vol.
266, p. 460. Academic Press.
16. Brown, M., Hughey, R., Krogh, A., Mian, I. S., Sjolander, K., and Haussler, D. (1993). In
Proceedings of the First International Conference on Intelligent Systems for Molecular Biology (ed.
L. Hunter, D. Searls, and J. Shavlik), p. 47. AAAI Press, Menlo Park, CA, USA.
17. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Nucleic Acids Res., 22, 4673.
Chapter 5
Protein family-based methods
for homology detection and
analysis
Steven Henikoff and Jorja Henikoff
Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center,
Seattle, USA.
1 Introduction
1.1 Expanding protein families
Most methods for homology detection have traditionally relied upon pairwise
comparisons of protein sequences, and in recent years, several improvements in
pairwise methods have been introduced (see Chapter 8). But, with sequence data
becoming available at an accelerating rate, there is an increasing opportunity to
use multiple related sequences for improved homology detection. Even when
functional information is lacking for known members of a protein family, these
members can be aligned and the alignments used in searches. Protein multiple
alignments have been shown to improve performance of secondary structure
prediction methods by identifying constraints on positions (1,2), and so it seems
reasonable to expect that improvements will be likewise obtained by using
multiple alignments for homology detection and analysis. In this chapter we re-
view some of the numerous methods that are aimed at achievement of this goal.
methods that decide upon gap placement (described in Chapter 3) and use
gap-based tools, especially dynamic programming and hidden Markov models
(described in Chapter 4), for database probing. As is so often the case, the truth
lies somewhere between the extremes. So although we blockers prefer to
reduce the protein alignment problem to finding a set of ungapped blocks to
represent a protein family or module, we recognize that insertions and deletions
occur occasionally within conserved regions, and this is challenging for block-
based methods.
Alignment usefulness is the major driving force in developing methodology.
Obtaining a correct alignment is more important for some applications than for
others. The ability to find corresponding residues and local regions that have
similar functions is of unquestionable value, and the better the conservation of
a residue or local region in a sequence, the more likely it is that common func-
tion can be inferred. Regions of uncertain alignment, such as those that differ-
ent alignment programs using various score parameters disagree on, have little
if any value for drawing functional inferences. However, so much alignment
information is present in conserved regions that it might make sense to align
beyond what can be done with confidence in order that more nuggets are
captured. We suspect that this accounts for the success of many gap-based
approaches: gapped alignments may have a high degree of uncertainty, but the
proportion that is aligned successfully is sufficient to identify extensive shared
regions of sequence similarity in database searches, even to the point of dis-
covering correct folds more successfully than structure-based threading (3).
Practical utility requires ready availability to the general public. Nowadays,
this means access via the World Wide Web using a browser, and so nearly all
methods highlighted here (Table 1) can be performed without any special soft-
ware, hardware, or computational expertise. Some potentially powerful tools
are too computationally intensive to be made available in this way. Additionally,
some tools require a specialist's knowledge and are not sufficiently automated
for the average biologist to use them wisely. We believe that such tools should
be avoided if possible: sequence alignment is fraught with hazards, and erron-
eous conclusions drawn from naive use of powerful sequence analysis tools
abound (4).
Table 1 URLs
1. Displaying alignments
Boxshade      http://www.ch.embnet.org/software/BOX_form.html
Logos         http://blocks.fhcrc.org/about_logos.html
Trees         http://blocks.fhcrc.org/about_trees.html
2. Finding alignments
BCM launcher  http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
MACAW         ftp://ncbi.nlm.nih.gov/repository
Such displays not only become complex, but also they fail to represent shared
similarities. Because of these limitations, multiple alignment representations
that emphasize regions of high similarity have been introduced.
Traditional displays of multiple sequence alignments show aligned sequences
one above the next, highlighting identical or similar residues in a column using
boxes, shading, or colour. These displays can be complex, especially when repre-
senting protein families that consist of large numbers of sequences that group
into distinct subfamilies. Position-based representations greatly simplify the
display of multiple alignments, because relationships between successive amino
acids in a sequence are not shown. Indeed, computer programs that utilize multi-
ple sequence alignment information in searches likewise consider all positions
in aligned sequences to be independent of one another, and so position-based
representations depict approximately what a searching program examines.
2.2 Patterns
The simplest position-based representations of multiple alignments are patterns,
which display only key conserved residues. The Prosite database (5) is a com-
2.3 Logos
Sequence logos (8) are vivid graphical displays of multiple sequence alignments
consisting of ordered stacks of letters representing amino acids at successive
positions (Figure 1). The height of a letter in a stack increases with increasing
frequency (or probability) of the amino acid, and the height of a stack of letters
increases with increasing conservation of the aligned position. Stack heights are
displayed in bit units. One bit is the answer to a yes-or-no question, where yes is
as likely as no. About 4 bits are required to fully specify a residue at a given
position, because the first question narrows the field from 20 residues to 10, the
second to 5, etc. The most probable amino acid is at the top of the stack, making
it more visible, and below it is the next most probable residue, and so on. Logo
colours or shades are chosen to emphasize similar amino acid properties. Logos
can be scaled such that the stack height is proportional to the observed
frequency of a residue divided by the frequency with which the residue is
expected to occur by chance (odds ratio).
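The stack and letter heights described above can be computed for a single column as in the following sketch. It takes the maximum information for 20 equally likely residues (log2 of 20, a little over 4 bits) and subtracts the uncertainty observed in the column. It is a simplified illustration under that assumption: it ignores corrections for small numbers of sequences and the odds-ratio scaling, and the function name and example columns are invented.

```python
import math
from collections import Counter

def logo_column(residues, alphabet_size=20):
    """Return (stack_height_bits, {residue: letter_height_bits}) for one alignment column."""
    counts = Counter(residues)
    n = len(residues)
    freqs = {aa: c / n for aa, c in counts.items()}
    # Uncertainty (entropy) of the column in bits; zero when fully conserved.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    stack = math.log2(alphabet_size) - entropy
    # Each letter takes a share of the stack in proportion to its frequency.
    return stack, {aa: f * stack for aa, f in freqs.items()}

conserved_stack, _ = logo_column("WWWW")   # fully conserved column
mixed_stack, mixed = logo_column("DDEE")   # two residues, equally frequent
print(round(conserved_stack, 2), round(mixed_stack, 2))
```

A column split evenly between two residues loses exactly one bit relative to a fully conserved column, in line with the yes-or-no question picture given in the text.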
2.4 Trees
The most serious drawback of position-based displays is that they show only
alignment information in common among the sequences in a family, not the
Figure 1 Sequence logo depicting the chromodomain block (BL00598 in Blocks v. 11.0).
Each alignment position is represented as a stack of letters, where the height of the stack
and the height of each letter are measured in bit units.
STEVEN HENIKOFF AND JORJA HENIKOFF
threshold score are combined into an ungapped block (12). The extent of the
block is limited by the requirement that each column have some minimum de-
gree of homogeneity. Blocks that are separated by the same number of residues
in all sequences may be fused, and so blocks can contain both conserved and
diverged positions. MACAW is an interactive program that allows users to choose
a set of blocks from among candidates. The threshold score for block searching
can be relaxed by the user in order to find new blocks in regions between blocks
that were found in the first pass.
Starting with pairwise alignments presents the same potential drawback as
for gap-based hierarchical multiple sequence alignment programs (Chapter 5),
which is that information in common for all of the sequences might not be
represented in the pairwise alignments. In addition, the number of pairwise
comparisons needed grows as n² for n sequences, and this can become somewhat
impractical for large protein families and long sequences. Simultaneous methods
for finding motifs, described below, can potentially avoid these problems.
3.4 Implementations
Some of these methods are conveniently available over the internet, and
sequences in FASTA format may be submitted by either pasting into a window
or by file browsing. Because many real motifs can be subtle and as short as a few
residues, sensitive methods may return alignments for sequences that are not
based upon true relationships. A simple experiment (Protocol 1) demonstrates
that even sequences chosen at random from a database can be aligned to yield
motifs that appear convincing and will easily detect the parent sequences and
their homologues from sequence databanks. Furthermore, even gross misalign-
ments can be masked by the existence of significant similarity among just a
fraction of sequences, and visual examination is notoriously unreliable (Figure 2).
One solution is to report a reliability measure for each position (22), and one
measure is implemented in Match-Box (23). BlockMaker's solution (24) is to
apply two very different motif finders with different scoring systems, Motif and
Gibbs sampling. In each case, a block assembly algorithm (25) is used to
determine a best set of blocks representing a protein family, and the two sets
are compared by the user: blocks with similar alignments obtained by the two
methods may be trusted, but those that differ require scrutiny. Both Match-Box
and BlockMaker require that the blocks be in order along the sequences, and so
repeats might be missed. However, the EM-based program, MEME (21), does not
impose an ordering criterion, and MEME finds repeats and displays them within
blocks. BlockMaker, MEME and Match-Box are available from the BCM multiple
alignment search launcher, which allows successive searches of a single query
with several tools, both traditional and motif-based. Performance evaluation of
the methods available over the Web shows that there are trade-offs between
sensitivity and reliability (23), and so it is worthwhile to try several methods on
any particular set of sequences and compare the results.
3 From the BlockMaker results page, examine the alignments in both sets of blocks
(from Motif above and Gibbs below), and choose the set of blocks that has the most
total residues. Click on the MAST direct link (to http://meme.sdsc.edu/meme/website/mast.html)
above the chosen set. MAST search results will be returned by
e-mail.
4 Compare the names of your submitted sequences to the significant hits from your
MAST search. Other significant hits may be homologues of your submitted sequences.
5 Now that you have done the necessary control, you are ready to use your own
sequences. Return to the BCM alignment launcher, copy and paste your own
sequences into the box and successively click on the various choices of multiple
aligners.
simple PSSM has as many columns as there are positions in the alignment, and
20 rows, one for each amino acid. In some applications, a PSSM consists of rows
that correspond to successive positions in the alignment (27), rather than
columns, and in some, there are position-specific gap scores.
Because they consist of numbers, PSSMs are useful for computer-based align-
ment and database searching methods but not for visual display. However, logos
are computed from PSSMs, and rules can be applied to convert PSSMs to pat-
terns (6) or consensus sequences (32). The construction of PSSMs from multiple
alignments has improved over the years, and as a result, we are better able to
detect weak similarities in searches (33). To construct effective PSSMs, two
major issues, described below, must be addressed.
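A simple PSSM of the kind described above can be sketched as follows. This is an illustration only: the uniform background frequency, the flat pseudocount of 1, the absence of sequence weighting, and the names `build_pssm`, `score_segment`, and `best_hit` are all assumptions made for the example, not the procedure used by any particular server.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score in bits} dict per alignment column.

    Assumes an ungapped block of equal-length sequences, a uniform
    background frequency of 1/20 and a flat pseudocount.
    """
    n_seqs = len(alignment)
    width = len(alignment[0])
    background = 1.0 / len(AMINO_ACIDS)
    denom = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(width):
        column = [seq[pos] for seq in alignment]
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores; positions are treated as independent."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Slide the PSSM along a sequence and return (best score, offset)."""
    width = len(pssm)
    return max(
        (score_segment(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


block = ["GDSGG", "GDSGG", "GDAGG", "GESGG"]  # toy block, not real data
pssm = build_pssm(block)
score, offset = best_hit(pssm, "MKTAYGDSGGPLV")
```

Storing one score dictionary per column mirrors the layout in the text (one column per alignment position, 20 residue scores each), and the sliding-window search shows why such matrices suit database searching better than visual display.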
Currently, there are several choices of family databases and searching options
available over the internet (Table 1). An illustrative example is depicted in Figure 3,
which shows how well the different methods detected key features of a protein
that we recently described, a cytosine-5 DNA methyltransferase homologue with
an embedded chromodomain module, called a 'chromomethylase', which is
encoded by the Arabidopsis thaliana CMT1 locus (45). In addition to being the
subject of current experimental work in our group, we chose this sequence
because chromomethylases are not yet present in any of the family databases,
although both the cytosine-5 DNA methyltransferases and the chromodomains
are represented in all of them, and because this example reveals strengths and
weaknesses of the different methods especially well. Both the DNA methyltrans-
ferase and the chromodomain represent novel subfamilies of their respective
families, and so detection in their entirety can be challenging for a protein
family classification method that does not generalize well from known examples.
This is an anecdotal example, and overall performance can only be judged using
are carried out on such a large scale. The first public database of this type was
introduced in 1990 (48), and several have been introduced over the years, only
some of which are extant. ProDom, which was introduced in 1994 (49), has been
continually maintained and enhanced (50); version 36 (August, 1998) contains
17 777 entries from Swiss-Prot with more than 2 sequences. ProDom entries vary
from short single motifs to longer stretches of similarity that might encompass
nearly entire sequences. ProDom is searched with multiple alignments or con-
sensus sequences. Using either option, ProDom detected the central and down-
stream conserved regions of the DNA methyltransferase, missing the upstream
region and the chromodomain.
Recently, three new clustering databases have been introduced. DOMO (51),
which is based on Swiss-Prot and PIR, is similar to ProDom, although it uses
different methodology to generate the database. DOMO clusters tend to be
longer and fewer in number than ProDom clusters. At present, DOMO does not
allow user-supplied sequences to be searched for classification. Protomap (52),
which is based on Swiss-Prot, does not yield multiple alignments as do ProDom
and DOMO, but rather provides a graphical tree-like view of the clustering. To
classify a protein sequence with Protomap, a Smith-Waterman search of Swiss-
Prot is performed, and each individual cluster that contains a sequence hit is
reported. For the chromomethylase, Protomap detected the chromodomain and
the central and downstream conserved regions of the DNA methyltransferase,
missing the upstream region of conservation. Prof_pat (53) extracts patterns
from clustering Swiss-Prot/TrEMBL, and these can be searched. Prof_pat did not
detect either the DNA methyltransferase or the chromodomain above false
positives.
block maps for intelligent interpretation of search results. MAST accepts PSSMs
directly from MEME and BlockMaker. Additionally, the Blocks server provides a
processor that can be used to convert other multiple alignments into efficient
PSSMs for sending directly to the MAST server.
out its neighbours in subsequent rounds, and this can lead to erroneous infer-
ences of homology. A defence against this type of error is to use conservative
levels of statistical significance for addition of sequences to the PSSM. However,
because proteins are not comprised of random sequences of residues, the
random statistical model that underlies the BLAST programs can be unreliable
(63), and so novel conclusions drawn from iterative searches should be viewed
with appropriate caution.
References
1. Rost, B. (1996). In Methods in enzymology (ed. R. F. Doolittle), Vol. 266, p. 525. Academic
Press.
2. Garnier, J., Gibrat, J.-F., and Robson, B. (1996). In Methods in enzymology, Vol. 266, p. 540.
3. Moult, J., Hubbard, T., Bryant, S. H., Fidelis, K., and Pedersen, J. T. (1997). Proteins:
Struct. Funct. Genet., Suppl. 1, 2.
4. Henikoff, S. (1991). New Biol., 3, 1148.
5. Bairoch, A., Bucher, P., and Hofmann, K. (1997). Nucleic Acids Res., 25, 217.
6. Jonassen, I., Collins, J. F., and Higgins, D. G. (1995). Protein Sci., 4, 1587.
7. Nevill-Manning, C. G., Wu, T. D., and Brutlag, D. L. (1998). Proc. Natl. Acad. Sci. USA, 95,
5865.
8. Schneider, T. D. and Stephens, R. M. (1990). Nucleic Acids Res., 18, 6097.
9. Saitou, N. and Nei, M. (1987). Mol. Biol. Evol., 4, 406.
10. Vingron, M. and Argos, P. (1991). J. Mol. Biol., 218, 33.
11. Boguski, M. S., Hardison, R. C., Schwartz, S., and Miller, W. (1991). New Biol., 4, 247.
12. Schuler, G. D., Altschul, S. F., and Lipman, D. J. (1991). Proteins: Struct. Funct. Genet., 9,
180.
13. Sobel, E. and Martinez, H. M. (1986). Nucleic Acids Res., 14, 363.
14. Posfai, J., Bhagwat, A. S., Posfai, G., and Roberts, R. J. (1989). Nucleic Acids Res., 17, 2421.
15. Smith, H. O., Annau, T. M., and Chandrasegaran, S. (1990). Proc. Natl. Acad. Sci. USA, 87,
826.
16. Depiereux, E. and Feytmans, E. (1992). CABIOS, 8, 501.
17. Neuwald, A. F. and Green, P. (1994). J. Mol. Biol., 239, 698.
18. Bacon, D. J. and Anderson, W. F. (1986). J. Mol. Biol., 191, 153.
19. Stormo, G. D. and Hartzell, G. W. 3rd (1989). Proc. Natl. Acad. Sci. USA, 86, 1183.
20. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton,
J. C. (1993). Science, 262, 208.
21. Bailey, T. and Elkan, C. (1994). In Proceedings of the Second International Conference on
Intelligent Systems for Molecular Biology, pp. 28-36. AAAI Press, Menlo Park, CA.
22. Notredame, C., Holm, L., and Higgins, D. G. (1998). Bioinformatics, 14, 407.
23. Briffeuil, P., Baudoux, G., Lambert, C., De Bolle, X., Vinals, C., Feytmans, E., et al.
(1998). Bioinformatics, 14, 357.
24. Henikoff, S., Henikoff, J. G., Alford, W. J., and Pietrokovski, S. (1995). Gene, 163, GC17.
25. Henikoff, S. and Henikoff, J. G. (1991). Nucleic Acids Res., 19, 6565.
26. McLachlan, A. D. (1983). J. Mol. Biol., 169, 15.
27. Gribskov, M., McLachlan, A. D., and Eisenberg, D. (1987). Proc. Natl. Acad. Sci. USA, 84,
4355.
28. Bowie, J. U., Luthy, R., and Eisenberg, D. (1991). Science, 253, 164.
29. Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. (1994). J. Mol. Biol.,
235, 1501.
30. Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. (1994). Proc. Natl. Acad. Sci.
USA, 91, 1059.
31. Eddy, S. R., Mitchison, G., and Durbin, R. (1995). J. Comput. Biol., 2, 9.
32. Patthy, L. (1987). J. Mol. Biol., 198, 567.
33. Henikoff, J. G. and Henikoff, S. (1996). CABIOS, 12, 135.
34. Luthy, R., Xenarios, I., and Bucher, P. (1994). Protein Sci., 3, 139.
35. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CABIOS, 10, 19.
36. Sibbald, P. R. and Argos, P. (1990). J. Mol. Biol., 216, 813.
37. Henikoff, S. and Henikoff, J. G. (1994). J. Mol. Biol., 243, 574.
38. Dayhoff, M. (1978). Atlas of protein sequence and structure, Vol. 5, suppl. 3, pp. 345-58.
National Biomedical Research Foundation, Washington, DC.
39. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). CABIOS, 8, 275.
40. Henikoff, S. and Henikoff, J. G. (1992). Proc. Natl. Acad. Sci. USA, 89, 10915.
41. Altschul, S. F. (1991). J. Mol. Biol., 219, 555.
42. Dodd, I. B. and Egan, J. B. (1987). J. Mol. Biol., 194, 557.
43. Brown, M. P., Hughey, R., Krogh, A., Mian, I. S., Sjolander, K., and Haussler, D. (1993).
In Proc. First Int. Conf. on Intelligent Systems for Molecular Biology (ed. L. Hunter, D. Searls,
and J. Shavlik), pp. 47-55. AAAI Press, Washington DC.
44. Tatusov, R. L., Altschul, S. F., and Koonin, E. V. (1994). Proc. Natl. Acad. Sci. USA, 91,
12091.
45. Henikoff, S. and Comai, L. (1998). Genetics, 149, 307.
46. Henikoff, S. and Henikoff, J. G. (1997). Protein Sci., 6, 698.
47. Sonnhammer, E. L., Eddy, S. R., and Durbin, R. (1997). Proteins: Struct. Funct. Genet., 28,
405.
48. Smith, R. F. and Smith, T. F. (1990). Proc. Natl. Acad. Sci. USA, 87, 118.
49. Sonnhammer, E. L. L. and Kahn, D. (1994). Protein Sci., 3, 482.
50. Corpet, F., Gouzy, J., and Kahn, D. (1998). Nucleic Acids Res., 26, 323.
51. Gracy, J. and Argos, P. (1998). Bioinformatics, 14, 174.
52. Yona, G., Linial, N., Tishby, N., and Linial, M. (1998). ISMB, 6, 212.
53. Bachinsky, A. G., Yarigin, A. A., Guseva, E. H., Kulichkov, V. A., and Nizolenko, L. P.
(1997). CABIOS, 13, 115.
54. Wu, C. H., Zhao, S., and Chen, H. L. (1996). J. Comput. Biol., 3, 547.
55. Wu, C. H., Shivakumar, S., Shivakumar, C. V., and Chen, S. C. (1998). Bioinformatics, 14,
223.
56. Bucher, P., Karplus, K., Moeri, N., and Hofmann, K. (1996). Comput. Chem., 20, 3.
57. Bailey, T. L. and Gribskov, M. (1997). J. Comput. Biol., 4, 45.
58. Nicodeme, P. (1998). Bioinformatics, 14, 508.
59. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al.
(1997). Nucleic Acids Res., 25, 3389.
60. Pearson, W. R. (1995). Protein Sci., 4, 1145.
61. Altschul, S. F. (1998). Proteins: Struct. Funct. Genet., 32, 88.
62. Alexandrov, M. M. and Luethy, R. (1998). Protein Sci., 7, 254.
63. Brenner, S. E., Chothia, C., and Hubbard, T. J. (1998). Proc. Natl. Acad. Sci. USA, 95, 6073.
64. Pietrokovski, S. (1996). Nucleic Acids Res., 24, 3836.
Chapter 6
Predicting secondary structure
from protein sequences
Jaap Heringa
National Institute for Medical Research, The Ridgeway, Mill Hill,
London NW7 1AA, UK
1 Introduction
Protein structure is intrinsically hierarchic in its internal organization. The
highest level in this hierarchy is constituted by complete proteins or assemblies of
such proteins, which become subdivided through domains via super-secondary
structure to secondary structure at the lowest hierarchical level.
At higher levels within this hierarchy, especially from the domain level
upwards, the connectivity of the polypeptide backbone between substructures
becomes less important. A protein thus can retain a stable structure irrespective
of the sequential arrangement of domains and presence of fragments linking
them together. Such linker regions often constitute exposed surface loops that
do not disrupt the folds of the domains they connect (1).
At the level of protein secondary structure, however, the elements are not
only crucially dependent on their amino acid compositions, but, unlike domain
and higher-order structures, are also very much context dependent; i.e. they rely
critically on the substructures in their environment. It is because of this context
dependency, that predicting protein secondary structure is a very difficult task,
which after three decades of research has not attained the accuracy on which
further prediction of tertiary structure can be based. It must be stressed,
however, that some successful predictions of higher-order structure, based on a
knowledge of the secondary structure, have been achieved (e.g. ref. 2).
This chapter covers some background aspects of secondary structure pre-
diction and describes recent and successful prediction methods, most of which
are available through the World Wide Web and so can be used by virtually every
biologist who would like to find out about the secondary structure associated with a
particular protein query sequence.
JAAP HERINGA
distinct geometrical features. The two basic secondary structures are the α-helix
and the β-strand. Both show distinct structural features and are easily recognizable
in a protein structure (Figure 1). Other secondary structure types occurring
in protein structures are more difficult to classify as they are less regular than
α-helices or β-strands. Such structures are defined in the context of most
prediction methods as coil; i.e. leftover secondary structures that cannot be
considered in the α-helical or β-stranded conformation.
In general, about 50% of the amino acids fold into α-helices or β-strands, so
that roughly half of protein structure is irregularly shaped. The primary
reason for the regularity observed for helices and strands is the inherent polar
nature of the protein backbone, which contributes a polar nitrogen and oxygen
atom for each amino acid. To satisfy energetic constraints, the parts of the
main-chain buried in the internal protein core need to form hydrogen-bonds
between those polar atoms. The α-helix and β-strand conformations are optimal
as each main-chain nitrogen atom can associate with an oxygen partner (and vice
versa) whenever they adopt one of these two secondary structure types. It must
be stressed that, in order to satisfy their hydrogen-bonding constraints, β-strands
need to interact with other β-strands, which they can do in a parallel and anti-parallel
fashion, thus forming a β-pleated sheet. β-strands thus depend on crucial
long-range interactions between residues remote in sequence. They therefore
are more context dependent than α-helices, which would be more able to fold
'on their own'. The fact that the vast majority of prediction methods have
greatest difficulty in delineating β-strands correctly is believed to be due to their
pronounced context dependency.
PREDICTING SECONDARY STRUCTURE FROM PROTEIN SEQUENCES
tural elements would associate early during folding to provide a structural frame
to which subsequently other substructures could attach. Therefore, knowledge
of protein secondary structural regions along the sequence is a prerequisite to
model the folding process or kinetics associated with it. Also for tertiary model
building, the ability to predict the secondary structure from the sequence alone
is crucial, as it allows for docking experiments to be carried out on the predicted
α-helices and β-strands.
On the architectural side of protein structure, it is possible to recognize the
three-dimensional topology by comparing the successfully predicted secondary
structural elements of a query protein with a database of known topologies.
Successful prediction here means parts of those helices and strands essential for
the topology would have been predicted, without necessarily accurate predict-
ion of the edges of those structures or the detection of non-essential secondary
structures. An example of topologically essential secondary structures for the
flavodoxin fold is given in Figure 2. The figure shows a schematic representation
Figure 2 TOPS diagrams for four flavodoxin structures and their basic topology. The
essential secondary structures are given in the basic topology diagram.
α-helix:
(a) As the number of residues per turn is 3.6 in the ideal case and helices are often
positioned against a buried core, they have one phase contacting hydrophobic
amino acids, while the other phase interacts with the solvent. Such
amphipathic helices (8) thus show a periodicity of three to four residues in
hydrophobicity of the associated sequence stretch (Figure 3).
(b) Proline residues do not occur in middle segments as they disrupt the α-helical
turn. However, they are seen in the first two positions of α-helices.
β-strand:
(a) β-Strands mostly fold into so-called β-pleated sheets which have two strands
forming either edge. Therefore the hydrophobic nature of edge strands is different
from that of strands internal to a β-sheet. As side-chains of constituent
residues along a β-strand alternate the direction in which they protrude,
edge strands of β-sheets can show an alternating pattern of hydrophobic-
hydrophilic residues, while buried strands tend to contain merely hydrophobic
residues (Figure 3).
(b) As the β-strand is the most extended conformation (i.e. consecutive Cα atoms are
farthest apart), it takes relatively few residues to cross the protein core with a
strand. Therefore, the number of residues in a β-strand is usually limited and
can be as few as two or three amino acids, whereas helices shielding
such strands from solvent comprise more residues.
(c) The β-strands can be disrupted by single residues that induce a kink in the
extended structure of the main-chain. Such so-called β-bulges are often
comprised of relatively hydrophobic residues.
Coil:
(a) Multiple alignments of protein sequences often display gapped and/or highly
variable regions, which would be expected to be associated with loop regions
rather than the two basic secondary structures.
(b) Loop regions contain a high proportion of small polar residues like Ala, Gly,
Ser, and Thr. Glycine residues are seen in loop regions due also to their
inherent flexibility.
(c) Proline residues are often seen in loops as well. They are not observed in
helices and strands as they kink the main-chain, although they can occur in
the N-terminal two positions of α-helices as mentioned above.
In addition to the positional requirements in hydrophobicity, there are also
general compositional differences between helix, strand and coil conformations
and this is the signal used in many of the early prediction methods (see below)
for single sequences. Methods that utilize multiple alignments can also exploit
the fact that the amino acid exchange patterns are different for the three
secondary structure states.
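The periodicity rules for amphipathic helices and edge strands can be illustrated with a hydrophobic moment calculation. This technique is swapped in here for illustration and is not one of the prediction methods described in this chapter; the Kyte-Doolittle hydropathy scale, the angles of 100° per residue for a helix and 180° for a strand, and the function names are all assumptions of this sketch.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values (choice of scale is an assumption).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, angle_deg):
    """Strength of the hydrophobicity periodicity at a given angle per residue.

    Each residue contributes a vector whose length is its mean-centred
    hydropathy and whose direction advances by angle_deg per position.
    """
    angle = math.radians(angle_deg)
    values = [KYTE_DOOLITTLE[aa] for aa in segment]
    mean = sum(values) / len(values)
    total = sum(
        (h - mean) * cmath.exp(1j * k * angle) for k, h in enumerate(values)
    )
    return abs(total) / len(values)


def periodicity(segment):
    """Label a segment by whichever periodicity is stronger."""
    helix = hydrophobic_moment(segment, 100.0)   # 3.6 residues per turn
    strand = hydrophobic_moment(segment, 180.0)  # alternating side-chains
    return "helix-like" if helix > strand else "strand-like"


helix_like = "LKKLLKK" * 2   # hydrophobic residues every three to four positions
strand_like = "VK" * 6       # strictly alternating pattern
```

Note that a buried strand made up only of hydrophobic residues shows no periodicity at all, so a measure of this kind can only pick up amphipathic helices and edge strands.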
A few additional rules can help in clarifying the structure or function of a
protein sequence, once the secondary structure is predicted:
(a) Hydrophobic and particularly conserved hydrophobic residues are normally
buried in the protein core.
(b) More than 95% of all so-called β-α-β motifs, i.e. a β-strand followed in sequence
by an α-helix and another β-strand, show a right-handed chirality. The aforementioned
flavodoxin family (Figure 2) indeed shows only right-handed β-α-β
motifs. This fact can be used to build a topology for the secondary structures
of the sequence(s) considered.
(c) Helices often cover up a core of β-strands. Therefore, if both α-helices and
β-strands are predicted, an attempt should be made to distribute the helices
evenly at either phase of a tentative β-sheet in topology modelling.
(d) As mentioned, strictly conserved residues in different regions of a multiple
alignment can be predicted with great confidence to be responsible for the
catalytic functions, particularly if they are polar and predicted to be in loop
structures hence unlikely to be buried. As active site residues are positioned
together in a protein 3-D structure, the coil structures they constitute should
be brought together in a topology model.
where Ps and Ns are respectively the numbers of positive and negative cases correctly
predicted for the structural state considered, and ~Ps and ~Ns the numbers
of false positives and negatives, respectively. Three-state predictions would
thus yield three Matthews' correlation coefficients. If overprediction or under-
prediction occurs for any of the structural states, this is more dramatically re-
flected in the Matthews' correlations than in the Q3 percentage. A third way to
assess prediction accuracy is by weights of evidence, defined for each secondary
structural type S as:
Jackknife testing
1 Take out one protein of the complete set of N proteins.
2 Train the method on the remaining N-1 proteins (the training set).
3 Predict the secondary structure for the protein taken out.
4 Repeat steps 1-3 for all N proteins and assess the average accuracy.
It is possible to test the method by averaging the predictions over all combinations of x
proteins (1<x<N), each time using the method trained on the remaining N-x proteins.
This provides an impression of the influence of different training sets on the sustained
accuracy of a single protein being predicted. As the number of combinations grows
rapidly with x, the training phase of most methods is too slow for extensive testing using
this mode. It can, however, also be used to save computation time if the database is split
evenly in test groups of sequences (e.g. 9), as each sequence within a test group is
associated with a single training set, thus saving training overhead.
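The four jackknife steps can be sketched as a generic leave-one-out loop. The predictor used here (always predict the most common state in the training set) is a deliberately trivial, hypothetical stand-in, and the three toy proteins are invented for the example; only the loop structure reflects the protocol.

```python
from collections import Counter


def jackknife(proteins, train, predict, accuracy):
    """Leave-one-out test: returns the mean accuracy over all N proteins."""
    scores = []
    for i, held_out in enumerate(proteins):
        # Steps 1 and 2: take one protein out and train on the remaining N-1.
        training_set = proteins[:i] + proteins[i + 1:]
        model = train(training_set)
        # Step 3: predict the secondary structure of the protein taken out.
        predicted = predict(model, held_out["sequence"])
        scores.append(accuracy(predicted, held_out["structure"]))
    # Step 4: average over all N proteins.
    return sum(scores) / len(scores)


def train_majority(training_set):
    """Toy 'method': remember the most frequent state in the training set."""
    counts = Counter("".join(p["structure"] for p in training_set))
    return counts.most_common(1)[0][0]


def predict_majority(model, sequence):
    return model * len(sequence)


def per_residue_accuracy(predicted, observed):
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


toy_set = [  # invented data: H = helix, E = strand, C = coil
    {"sequence": "AAAAA", "structure": "HHHHH"},
    {"sequence": "VVVV", "structure": "EEEE"},
    {"sequence": "ALAL", "structure": "HHHC"},
]
mean_accuracy = jackknife(
    toy_set, train_majority, predict_majority, per_residue_accuracy
)
```

Because the protein being predicted never contributes to training, the resulting average is a fairer estimate than one obtained by testing on the training set itself.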
based on the overlap of predicted and observed segments rather than on indi-
vidual positions (15-20). A recent secondary structure assignment program that
combines many of the features of earlier methods, such as checking hydrogen
bonding patterns and stereochemical characteristics, is the knowledge-based
method STRIDE (21), claimed to yield assignments in close agreement to those
made by crystallographic experts.
3.1.1 Lim
Lim (26) developed a set of complicated stereochemical prediction rules for
α-helices and β-sheets based on their packing as observed in globular proteins. Apart
from being the most successful early method (see below), Lim's stereochemical
rules are quite important for understanding protein folding. An example is the
set of hydrophobicity rules for α-helices with terminal hydrophobic pairs at
sequence positions i and i + 1, hydrophobic pairs in middle helical segments
positioned at (i, i + 4) and middle hydrophobic triplets positioned at (i, i + 1, i + 4)
or (i, i + 3, i + 4) (see also Figure 3). The Lim method never gained widespread
popularity because a computer implementation has not been available until
recently.
3.1.2 Chou-Fasman
The most widely used pioneering method is the one by Chou and Fasman (25), in
which predictions are based on differences in residue composition for three
states of secondary structure: α-helix, β-strand, and turn (i.e. neither α-helix nor
β-strand). Chou and Fasman performed a statistical analysis over a number of
crystallographically determined protein tertiary structures and determined the
frequency of each amino acid type in the three states. The position of turn
residues was included in the frequency calculations given significant positional
differences in residue type occurrences at turn sites. The frequencies were
normalized to amino acid type preferences for each of the structural states by
dividing each by that found in all positions of the known structures. For helix
and strand, effects of neighbouring residues in the protein sequence were taken
into account by averaging the preferences over three residues for α-helix predictions
and over two for β-strands. Secondary structures were initiated according
to the higher preference values and minimum nucleation lengths required
for each structural state. Extensions were effected as long as preferences
remained high and certain residues were not encountered (e.g. proline in an
α-helix). The Chou-Fasman method owed its early popularity to the straightforward
underlying statistics that are easy to understand.
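The preference calculation can be sketched as follows: the frequency of a residue type within one structural state is divided by its frequency over all positions, and the resulting preferences are averaged over a short window. The toy sequences, the function names, and the window length passed in the example are illustrative assumptions; nucleation and extension rules are not shown.

```python
from collections import Counter


def preferences(sequences, structures, state):
    """Preference of each residue type for one state.

    Computed as (fraction of the state's residues that are this type)
    divided by (fraction of all residues that are this type).
    """
    in_state = Counter()
    overall = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            overall[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_state = sum(in_state.values())
    n_all = sum(overall.values())
    return {
        aa: (in_state[aa] / n_state) / (overall[aa] / n_all) for aa in overall
    }


def windowed_average(sequence, pref, window):
    """Average the preferences over each run of `window` consecutive residues."""
    return [
        sum(pref[aa] for aa in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


# Invented toy data: H = helix, C = everything else.
toy_sequences = ["AEAEGPGP", "AAEEGGPP"]
toy_structures = ["HHHHCCCC", "HHHHCCCC"]
helix_pref = preferences(toy_sequences, toy_structures, "H")
profile = windowed_average("AEGP", helix_pref, 3)
```

A preference above 1 marks a residue type that is over-represented in the state, and one below 1 marks a residue type that tends to avoid it.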
3.1.3 GOR
The GOR method quickly became the standard for a decade after its first
appearance. Although the initial versions GOR I and GOR II predicted four states
by discriminating between coil and turn secondary structures, GOR III (28) and
the most recent version, GOR IV (29) perform the common three-state pre-
diction. The GOR method relies on the frequencies observed for residues in a 17-
residue window (i.e. eight residues N-terminal and eight C-terminal of the
central window position) for each of the three structural states. The amino acid
frequencies are exploited using an information function based on conditional
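probabilities defined as:
$$I(S; R) = \log \frac{f_{S,R} / f_R}{f_S / n}$$
where $f_{S,R}$ is the frequency of residue type R in state S, $f_R$ the general frequency of
residue type R, and $f_S/n$ that of structural state S (the fraction of all n residues found in that state). Significant in this formula is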
that the information of a particular residue type in one of the structural states is
not only based on the normalized frequency, but shows an extra weighting
based on the inverse fraction of all residues in that state. In the GOR method,
this formula is used to calculate the information difference between the various
states defined as I(ΔS; R) = I(S; R) − I(!S; R) with !S denoting all other states (not S).
The information difference formula then becomes:
$$I(\Delta S; R) = \log \frac{f_{S,R}}{f_{!S,R}} + \log \frac{f_{!S}}{f_S}$$
The above formula is defined for a single sequence position, but can be easily
extended to the GOR 17-residue window by, for example, writing R17 instead of
R. Unfortunately, it is not feasible to sample all possible 17-residue fragments
directly from the PDB (as there are 20^17 possibilities). The subsequent versions of
the GOR method over the years have explored increasingly detailed approxi-
mations of this sampling problem, along with the increase of data in the PDB:
(a) GOR I just treated the 17 positions in the window independently, and so single-
position information could be summed over the 17-residue window.
(b) GOR II did the same but sampled over a larger database.
(c) GOR III (28) refined by including pair frequencies derived from 16 pairs be-
tween each non-central and the central residue in the 17-long window. As
the PDB at the time was not large enough to provide sufficient data, dummy
frequencies were calculated (28).
(d) The current version, GOR IV (29) uses pairwise information over all possible
paired positions in a window (there are 17 × 16/2 possibilities), albeit with a
relatively small weight as compared with the GOR I-type single-position
information (a) which is included as well.
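The GOR I approximation in (a) can be sketched as follows: the information difference I(ΔS; R) = I(S; R) − I(!S; R) is evaluated separately for each of the 17 window positions and summed, and the state with the highest total is predicted. Expanding the difference with the frequency definitions gives log(f_{S,R}/f_{!S,R}) + log(f_{!S}/f_S), which is what the code computes. The pseudocount, the toy training data, and all function names are assumptions of this sketch, not part of any published GOR version.

```python
import math
from collections import defaultdict

STATES = "HEC"
HALF = 8  # eight residues on either side of the central position


def train_gor1(sequences, structures, pseudocount=1.0):
    """Count residues at each window offset, given the state of the centre."""
    counts = {s: defaultdict(lambda: defaultdict(float)) for s in STATES}
    totals = {s: 0.0 for s in STATES}
    for seq, ss in zip(sequences, structures):
        for j, state in enumerate(ss):
            totals[state] += 1
            for m in range(-HALF, HALF + 1):
                if 0 <= j + m < len(seq):
                    counts[state][m][seq[j + m]] += 1
    return counts, totals, pseudocount


def info_difference(model, state, offset, residue):
    """Single-position information difference between state S and not-S."""
    counts, totals, pc = model
    others = [o for o in STATES if o != state]
    f_s_r = counts[state][offset][residue] + pc
    f_not_s_r = sum(counts[o][offset][residue] for o in others) + pc
    f_s = totals[state] + pc
    f_not_s = sum(totals[o] for o in others) + pc
    return math.log(f_s_r / f_not_s_r) + math.log(f_not_s / f_s)


def predict(model, seq):
    """Sum the information over the window and pick the best state per residue."""
    out = []
    for j in range(len(seq)):
        offsets = [m for m in range(-HALF, HALF + 1) if 0 <= j + m < len(seq)]
        best = max(
            STATES,
            key=lambda s: sum(
                info_difference(model, s, m, seq[j + m]) for m in offsets
            ),
        )
        out.append(best)
    return "".join(out)


# Invented toy training set: one helix, one coil and one strand segment.
model = train_gor1(
    ["AAAAAAAAGGGGGGGGVVVVVVVV"], ["HHHHHHHHCCCCCCCCEEEEEEEE"]
)
helix_call = predict(model, "AAAAAAAA")
strand_call = predict(model, "VVVVVVVV")
```

Treating the window positions independently keeps the number of parameters small (17 × 20 per state), which is what made the approach feasible with the limited structural data available at the time.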
The theoretical principles used in the GOR method are statistically sound and no
ad-hoc rules or artificial variables are invoked, which makes it one of the most
elegant methods, with a high accuracy given that it predicts from single sequences.
However, as in many other methods (vide infra), a post-processing step was intro-
duced for the GOR IV method to refine the predictions. Helices are required to
be at least four residues in length and strands should consist of two or more
residues. If a shorter helix or strand fragment is initially predicted, the method
assesses the probabilities of extending the fragment to the minimum associated
length or deleting it (i.e. changing it to coil).
primary on to the secondary structure and to thus enhance the success rate of
prediction. These include:
(a) Neural network applications (9, 35).
(b) Nearest-neighbour methods (36-39).
(c) Linear discriminant analysis (40).
(d) Inductive logic programming (ILP) (41).
Examples of the first three formalisms will be described in the following section.
The latter computational concept (ILP) is designed for learning structural
relationships between objects. Muggleton et al. (41) used the ILP computer program
Golem to derive automatically qualitative rules for residues that adopt the α-helix
conformation at the centre of a 9-residue window. The rules made use of the
physicochemical amino acid characterizations of Taylor (42) and were established during
iterative training steps over a small set of 12 known α/α protein structures. With
the set of rules thus obtained, α-helices in four independent α/α proteins were
predicted with an accuracy of 81% on a per-residue basis (Q3). The Golem algorithm
is of limited use because it is only able to predict helices in all-helical proteins.
PHD
The PHD method (Profile network from HeiDelberg) (9) combines the added
information from multiple sequence alignments with the optimization strength
of the neural network formalism. The method makes use of three consecutive
complete neural networks:
(a) The first network produces the first raw 3-state prediction for each alignment
position. It takes as input the fractions of the 20 amino acids at each multiple
alignment position together with the two 6-residue flanking regions; i.e. a 13-
residue window (w = 13) is used to predict each alignment position with the
central residue in the middle position. The output of the first network for
each alignment position is three probabilities for the three states (helix,
strand, and coil).
(b) A second network refines the raw predictions of the first level by filtering the
3-state probabilities for each alignment position based on the probabilities of
the flanking positions. It takes as input the output of the first network and
processes the information using a 17-residue window. The output of the
second network comprises for each alignment position the three adjusted
state probabilities. This post-processing step for the raw predictions of the
first network is aimed at correcting unfeasible predictions and would, for
example, change (HHHEEHH) into (HHHHHHH).
(c) The first two networks perform the basic prediction of the secondary struc-
ture associated with a query multiple alignment. However, as the networks
can be trained in various ways, PHD employs a number of separately trained
consecutive network pairs ((a) and (b)) and feeds their predictions (3-state
probabilities) into a third network for a so-called jury decision.
The predictions obtained by the jury network undergo a final filtering that changes
predicted helices of one or two residues into coil. The method
was trained on a non-redundant set of 130 alignments from the HSSP database
(46), each containing one sequence with a known structure. The method showed
an overall prediction accuracy of 70.8% in a jackknife test over 126 alignments
(4 of the 130 alignments were transmembrane protein families), which for
computational reasons were divided into 7 groups (see Protocol 1). Although this figure is not
the highest accuracy reported, the PHD method to date shows the most sustained
performance as compared with all other methods available on the Web.
If the PHD webserver is given a single sequence for prediction, it performs a
BLAST-search to find a set of homologous sequences and aligns those using the
MAXHOM alignment program (46). The resulting alignment is then fed into the
actual PHD neural net algorithm.
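To make the input to the first network more concrete, the sketch below turns a small alignment into per-column amino acid fractions and cuts these into 13-column windows. It covers only the part of the encoding described in the text. The zero-padding at the alignment ends, the function names, and the toy alignment are assumptions for illustration, and no network is trained here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_fractions(alignment):
    """Per-column fractions of the 20 amino acids (gaps and unknown letters ignored)."""
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]
        total = len(residues) or 1          # avoid division by zero for all-gap columns
        profile.append([residues.count(a) / total for a in AMINO_ACIDS])
    return profile

def windows(profile, w=13):
    """One flattened w x 20 input vector per alignment position (ends padded with zeros)."""
    half = w // 2
    pad = [[0.0] * len(AMINO_ACIDS)] * half
    padded = pad + profile + pad
    return [
        [x for row in padded[i:i + w] for x in row]
        for i in range(len(profile))
    ]

# Hypothetical three-sequence alignment, five columns wide.
alignment = ["MKVLA", "MRVLG", "MKIL-"]
inputs = windows(column_fractions(alignment))
print(len(inputs), len(inputs[0]))   # number of positions, values per window (13 * 20)
```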
Pred2ary
Another accurate profile and neural net-based prediction method is Pred2ary
(35) which was assessed with an accuracy of 74.8% and balanced prediction over
the three structural states. The method employs a second neural net to filter the
raw predictions of the first net, as does the PHD method (9). A recent extended
version, which combines in a jury decision the outputs of a massive number of
120 networks individually trained, is claimed to predict 75.9% ± 7.9% accurately.
This is achieved by constructing a priori probabilities of correctly predicting the
structural state at each query sequence position for all combinations of network
output weights for helix and strand. These probabilities are then used for a final
state prediction corresponding to the highest of the a priori probabilities for each
of the three states.
Yi and Lander
Yi and Lander (36) were the first to use nearest-neighbour classifiers for pre-
diction of secondary structure. A database of 110 proteins with known tertiary
structure was used to derive a large collection of 19-residue exemplars for which
only the environmental states were noted; i.e. the residue type information was
discarded. As a label for each exemplar the secondary structural state of the
central residue was taken. For each 19-residue window of the query protein, 50
nearest neighbour exemplars were identified using the amino acid environ-
mental scoring system of Bowie et al. (47), which includes as parameters the
secondary structure state, accessible surface area and polarity; and scores the
likelihood of a residue type to be in a particular state (or range) over these three
parameters. As a score, the average was taken of 19 residues within a query
window matched with the 19-position exemplar considered. During training, for
each exemplar a cut-off score was determined, which should be met by the
query fragment compared to it in order to count the exemplar as a neighbour.
The cut-off score can be viewed as a reliability check for the predictive value of
the exemplars. The 50 thus obtained nearest neighbours showed a distribution
of the associated secondary structure labels, from which probability estimates
for the three structural states were derived for the query fragment considered.
Yi and Lander explored various scoring systems and found that the best performer
included 15 environmental classes (3 secondary structures times 5 different
accessibility/polarity classes) combined with an amino acid exchange score from
the Gonnet et al. matrix (48). Note that for this final scoring system, the amino
acid types of the exemplars were taken into account. This scenario resulted in a
prediction accuracy of 67.1%. Using a neural network for a jury decision over six
different scoring systems led to the final accuracy of 68%, as assessed through
jackknife testing (Protocol 1).
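The nearest-neighbour idea can be illustrated with a toy example: rank labelled exemplar windows by similarity to the query window and read state probabilities from the labels of the k best. The sketch below is not the scoring system of Yi and Lander. It assumes a simple identity score in place of the environment-based scores, uses 7-residue windows and k = 3 instead of 19 and 50, omits the per-exemplar cut-off, and all exemplars are invented.

```python
from collections import Counter

def window_score(query, exemplar):
    """Toy similarity: fraction of identical positions (stand-in for an environment score)."""
    return sum(q == e for q, e in zip(query, exemplar)) / len(query)

def knn_state_probabilities(query, exemplars, k=3):
    """exemplars: list of (window, label). Returns state -> fraction among the k best."""
    ranked = sorted(exemplars, key=lambda ex: window_score(query, ex[0]), reverse=True)
    labels = Counter(label for _, label in ranked[:k])
    return {state: labels.get(state, 0) / k for state in "HEC"}

# Invented exemplars; each label is the state of the central residue of the window.
exemplars = [
    ("AEELLKK", "H"), ("AEEMLKR", "H"), ("VTVKVYV", "E"),
    ("GSPNGSG", "C"), ("AEELLRK", "H"), ("VKVTVEV", "E"),
]
probs = knn_state_probabilities("AEELLKR", exemplars, k=3)
print(probs)
```

Here the three most similar exemplars are all helical, so the central residue of the query window would be assigned to helix.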
NNSSP
The NNSSP (Nearest Neighbour Secondary Structure Prediction) (37) method
adopts the nearest neighbour approach of Yi and Lander (36) for single sequence
prediction. Differences with the Yi and Lander method are:
PREDATOR
The PREDATOR method of Frishman and Argos (38, 39) owes its accuracy mostly
to the incorporation of long-range interactions for β-strand prediction and attains
68% prediction accuracy for single sequence prediction, which was assessed
using a one-at-a-time jackknife test (see Protocol 1) over the protein set of Rost
and Sander (RS) (9). Using a k-nearest neighbour approach (with k = 25 and
13-residue windows), propensities for the general three states (P_H, P_E, and P_C) were
determined for each residue. Using pairwise potentials involving long-range
interactions, two more propensities for β-strand were determined. This was
done by assessing the likelihood for all pairwise 5-residue fragments (separated
by more than six amino acids) to form parallel or anti-parallel β-bridges, based
on summing residue hydrogen bonding propensities obtained from known
structures (two sets of propensities for anti-parallel and one for parallel bridges).
As the final parallel and anti-parallel β-strand propensity for each residue (P_par
and P_antipar), the maximum scoring window pair was taken with the residue
considered at the N-terminal position in one of the windows. Pairwise hydrogen
bonding potentials were also determined for α-helical residues at a sequence
separation of four residues. Their sum was calculated over a 7-residue window to
arrive at an extra helix propensity for the residue N-terminal in the window
(P_helix). The last additional propensity concerned β-turns (P_turn) and was obtained
by summing single-residue propensities in classic β-turn positions 1-4 (49) using
Figure 4 Usage of local alignments in the PREDATOR algorithm. For details, see text.
(see Section 1.3). This information is processed using linear statistics. Apart from
the conformational propensities, the following concepts are used:
• N-terminal and C-terminal sequence fragments are normally coil.
• Moments of hydrophobicity (see Figure 3).
• Alignment positions comprising gaps are indicative of coil regions.
• Moments of conservation.
• Autocorrelation.
• Residue ratios in the alignment.
• Feedback of predicted secondary structure information.
• Simple filtering.
The relative importance of these concepts was determined in five runs, which
successively relied on increased information as follows:
(a) Run 1: The GOR method was used on each of the aligned sequences and the
average GOR score for each of the three states was compiled for each align-
ment position.
(b) Run 2: For each position in the query multiple alignment, a so-called attribute
vector was compiled, consisting of 10 attributes: three averaged GOR scores
for H, E, and C; distance to alignment edge; hydrophobic moment assuming
helix; hydrophobic moment assuming strand; number of insertions; number
of deletions; conservation moment assuming helix and that assuming strand.
(c) Run 3: Positional 20-attribute vectors were determined consisting of the
above 10 attributes and those in a smoothed fashion.
(d) Run 4: Positional 27-attribute vectors were compiled comprising the 20
attributes of the preceding round, combined with fractions of predicted
α-helix and β-strand, and fractions of the five most discriminating residue
types: His, Glu, Gln, Asp, and Arg.
(e) Run 5: A set of 11 filter rules was employed for a final prediction, such as,
for example, ([E/C]CE[H/E/C][H/C]) -> C. These filter rules were found
automatically using machine learning.
For runs 2 to 4 (items (b) to (d)), a linear discrimination function was determined
for each of the three secondary structural states. A linear discrimination function is
effectively a set of weights for the attributes in the positional vector; the
secondary structure associated with the highest scoring discrimination function
is assigned to the alignment position considered.
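A linear discrimination function of this kind can be written in a few lines. The sketch below is illustrative only: the weights, the bias terms, and the three-attribute vector are hypothetical and are not the values trained for DSC, which uses 10 to 27 attributes per position.

```python
def predict_state(attributes, weights, bias):
    """Return the state whose linear discrimination function scores highest."""
    scores = {
        state: bias[state] + sum(w * a for w, a in zip(weights[state], attributes))
        for state in weights
    }
    return max(scores, key=scores.get), scores

# Hypothetical weights for a 3-attribute vector (averaged GOR scores for H, E, and C).
weights = {
    "H": [1.0, -0.5, -0.5],
    "E": [-0.5, 1.0, -0.5],
    "C": [-0.5, -0.5, 1.0],
}
bias = {"H": 0.0, "E": 0.0, "C": 0.1}

state, scores = predict_state([0.2, 0.7, 0.1], weights, bias)
print(state, scores)
```

With these made-up numbers the strand function scores highest, so the position would be assigned to strand.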
The DSC predictions are based on the information arising from the five above
runs. Based on the Rost-Sander protein set, the Q3 was assessed for successively
increasing numbers of runs (run 1, runs 1 and 2, runs 1-3, 1-4, and 1-5) and
amounted to 63.5%, 67.8%, 68.3%, 69.4%, and 70.1% (DSC), respectively.
The DSC method performs especially well for moderately sized proteins in
the range 90-170 residues. A special feature of the DSC technique is that it
accepts predictions by the PHD algorithm as input and attempts to refine those
using the above concepts. The Q3 of this PHD-DSC combinatorial procedure was
evaluated at 72.4% (40).
Figure 5 Secondary structure prediction for chemotaxis protein cheY (3chy). The top
alignment block represents the multiple alignment of the 3chy sequence with 13 distant
flavodoxin sequences by the method PRALINE. The middle block is the same sequence set
aligned by CLUSTALX. Under each of the alignments are given the predictions by five
secondary structure prediction methods. The bottom block depicts consensus secondary
structures determined by Jpred over the five methods used, respectively for the PRALINE and
CLUSTALX alignments, as well as for a set of 32 homologous sequences aligned by
CLUSTALX (cons HOMOLOGS). Vertical bars ('|') under each of the consensus predictions
indicate correct predictions. The bottom line identifies the standard of truth as obtained from
the 3chy tertiary structure by the DSSP program (10). The secondary structure states
assigned by DSSP other than 'H' and 'E' were set to ' ' (coil) for clarity.
5 Coiled-coil structures
If a protein is predicted to contain α-helices, higher-order information as well as
increased confidence in predictions made could be gained from testing the
possibility that a pair of helices adopts a superhelical twist resulting in a coiled-
coil conformation. The left-handed coiled-coil interaction involves a repeated
motif of seven helical residues (abcdefg). The a and d positions are normally
occupied by non-polar residues constituting the hydrophobic core of the
helix-helix interface, whereas the other positions display a high likelihood to
comprise hydrophilic residues. The e and g positions in addition are often charged
and can form salt-bridges to each other. The program COILS2 (81, 82) exploits
this information and compares a query sequence with a database of known
parallel two-stranded coiled-coils. A similarity score is derived and compared to
two score distributions, one for globular proteins (without coiled-coils) and one
for known coiled-coil structures, and a probability is then calculated for the
query sequence to adopt a coiled-coil conformation. As the program assumes the
presence of heptad repeats, the probabilities are derived using windows of 14,
21, and 28 amino acids. However, the program offers the option to include
user-defined window lengths to allow the handling of cases with extreme
coiled-coil lengths. A recently updated scoring matrix, which includes new structures
with known coiled-coils and contains amino acid type propensities at the various
positions in the heptad repeats, led to increased recognition of coiled-coil
elements. The COILS2 method accurately recognizes left-handed two-stranded
coiled-coils but loses sensitivity for coiled-coil structures composed of more than
two strands. It is not able to recognize right-handed or buried coiled-coil helices
and therefore is not applicable to transmembrane coiled-coil structures, which are
known to show essentially the same coiled-coil conformations as those in soluble
proteins, albeit with dramatically different and more hydrophobic constituent amino
acids (56).
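The heptad idea can be illustrated by asking, for a given window, which of the seven possible registers places the most non-polar residues at the a and d positions. The sketch below is not the COILS2 algorithm, which scores all seven heptad positions with a propensity matrix and converts the score into a probability. The set of residues treated as non-polar and the idealized test window are assumptions.

```python
HYDROPHOBIC = set("AILMFVWY")   # assumed non-polar set, for illustration only

def best_heptad_register(window):
    """Return (register, fraction) maximising non-polar occupancy of positions a and d."""
    best = (0, 0.0)
    for register in range(7):
        core = [
            res for i, res in enumerate(window)
            if (i + register) % 7 in (0, 3)      # heptad positions a (0) and d (3)
        ]
        if not core:                             # window too short for this register
            continue
        fraction = sum(res in HYDROPHOBIC for res in core) / len(core)
        if fraction > best[1]:
            best = (register, fraction)
    return best

# Idealized 28-residue window: Leu at every a and d position, polar residues elsewhere.
window = "LKELEEK" * 4
print(best_heptad_register(window))
```

A window from a globular protein would typically give a much lower best fraction than this idealized repeat, which is the contrast a coiled-coil predictor exploits.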
6 Threading
If a homologous protein with known structure is available for a query sequence,
this structure can then be aligned to the query sequence using the threading
technique (83). Threading methods test the feasibility for a given sequence to
adopt a particular fold, based on assessing the likelihood for the amino acids in
the query sequence to occur in the local residue environments within the known
tertiary structure. The optimal fit of the query sequence through the tertiary
structure effectively leads to an alignment, which can be used to copy the
secondary structure of the known fold to the query sequence. Although the
incorporation of tertiary structure information should lead to better alignment
and recognition of related sequences, the increased sensitivity of available
threading methods as compared with conventional sequence alignments is not
always clear. Jones et al. (84) discuss various threading methods available and
also how their results should be interpreted.
Table 1 Websites of various secondary structure prediction methods and related services
References
1. Heringa, J. and Taylor, W. R. (1997). Curr. Opin. Struct. Biol., 7, 416.
2. Springer, T. A. (1997). Proc. Natl. Acad. Sci. USA, 94, 65.
3. Baldwin, R. L. and Roder, H. (1991). Curr. Biol., 1, 218.
4. Goldenberg, D. P., Frieden, R. W., Haack, J. A., and Morrison, T. B. (1989). Nature, 338,
127.
5. Baldwin, R. L. (1990). Nature, 346, 409.
6. Flores, T. P., Moss, D. S., and Thornton, J. M. (1994). Protein Eng., 7, 31.
7. Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Brice, M. D., Rodgers, J. R.,
et al. (1977). J. Mol. Biol., 112, 535.
8. Schiffer, M. and Edmundson, A. B. (1967). Biophys. J., 7, 121.
9. Rost, B. and Sander, C. (1993). J. Mol. Biol., 232, 584.
10. Kabsch, W. and Sander, C. (1983). Biopolymers, 22, 2577.
11. Colloc'h, N., Etchebest, C., Thoreau, E., Henrissat, B., and Mornon, J.-P. (1993). Protein
Eng., 6, 377.
12. Sklenar, H., Etchebest, C., and Lavery, R. (1989). Proteins: Struct. Funct. Genet., 6, 46.
13. Woodcock, S., Mornon, J.-P., and Henrissat, B. (1992). Protein Eng., 5, 629.
14. Russell, R. B. and Barton, G. J. (1993). J. Mol. Biol., 234, 951.
15. Taylor, W. R. (1984). J. Mol. Biol., 173, 512.
16. Cohen, F. E., Abarbanel, R. M., Kuntz, I. D., and Fletterick, R. J. (1986). Biochemistry, 25,
266.
17. Cohen, F. E. and Kuntz, I. D. (1989). In Prediction of protein structure and the principles of
protein conformation (ed. G. D. Fasman), pp. 647-706. Plenum, New York, London.
18. Sternberg, M. J. E. (1992). Curr. Opin. Struct. Biol., 2, 237.
19. Benner, S. A., Cohen, M. A., and Gerloff, D. (1993). J. Mol. Biol., 229, 295.
20. Rost, B., Sander, C., and Schneider, R. (1994). J. Mol. Biol., 235, 13.
21. Frishman, D. and Argos, P. (1995). Proteins: Struct. Funct. Genet., 25, 633.
22. Szent-Gyorgyi, A. G. and Cohen, C. (1957). Science, 126, 697.
23. Periti, P. F., Quagliarotti, G., and Liquori, A. M. (1967). J. Mol. Biol., 24, 313.
24. Nagano, K. (1973). J. Mol. Biol., 75, 401.
25. Chou, P. Y. and Fasman, G. D. (1974). Biochemistry, 13, 211.
26. Lim, V. I. (1974). J. Mol. Biol., 88, 857.
27. Garnier, J., Osguthorpe, D. J., and Robson, B. (1978). J. Mol. Biol., 120, 97.
28. Gibrat, J.-F., Garnier, J., and Robson, B. (1987). J. Mol. Biol., 198, 425.
29. Garnier, J. G., Gibrat, J.-F., and Robson, B. (1996). In Methods in enzymology (ed. R. F.
Doolittle), Vol. 266, pp. 540-53. Academic Press.
30. Schultz, G. A. (1988). Annu. Rev. Biophys. Chem., 17, 1.
31. Cohen, F. E., Abarbanel, R. M., Kuntz, I. D., and Fletterick, R. J. (1983). Biochemistry, 25,
4894.
32. Taylor, W. R. and Thornton, J. M. (1983). Nature, 354, 105.
33. Rooman, M. J., Wodak, S., and Thornton, J. M. (1989). Protein Eng., 3, 23.
34. Presnell, S. R., Cohen, B. I., and Cohen, F. E. (1992). Biochemistry, 31, 983.
35. Chandonia, J.-M. and Karplus, M. (1998). Proteins: Struct. Funct. Genet., 35, 293.
36. Yi, T.-M. and Lander, E. S. (1993). J. Mol. Biol., 232, 1117.
37. Salamov, A. A. and Solovyev, V. V. (1995). J. Mol. Biol., 247, 11.
38. Frishman, D. and Argos, P. (1996). Protein Eng., 9, 133.
39. Frishman, D. and Argos, P. (1997). Proteins: Struct. Funct. Genet., 27, 329.
40. King, R. D. and Sternberg, M. J. E. (1996). Protein Sci., 5, 2298.
41. Muggleton, S., King, R., and Sternberg, M. J. E. (1992). Protein Eng., 5, 647.
42. Taylor, W. R. (1986). J. Theor. Biol., 119, 205.
43. Zvelebil, M. J., Barton, G. J., Taylor, W. R., and Sternberg, M. J. E. (1987). J. Mol. Biol.,
195, 957.
44. Levin, J. M., Pascarella, S., Argos, P., and Garnier, J. (1993). Protein Eng., 6, 849.
45. Minsky, M. and Papert, S. (1988). Perceptrons. MIT Press, Cambridge, MA.
46. Sander, C. and Schneider, R. (1991). Proteins: Struct. Funct. Genet., 9, 56.
47. Bowie, J. U., Luthy, R., and Eisenberg, D. (1991). Science, 253, 164.
48. Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Science, 256, 1443.
Chapter 7
Methods for discovering
conserved patterns in protein
sequences and structures
Inge Jonassen
Department of Informatics, University of Bergen HIB, 5020 Bergen, Norway
1 Introduction
The amount of available biomolecular data is exploding: the number of known
protein sequences is increasing rapidly while the number of known structures is
also increasing, though not as rapidly. A very useful observation in this situation
is that common features among proteins can be used to group them into
families, and then study the proteins on a family level. By a family we mean a
set of proteins sharing some definite biological properties in terms of common
function and/or structure, often implying that the proteins have evolved from a
common ancestor, i.e. that they are homologous.
When studying a family, one can compare the sequences and structures (if
known) of the proteins in the family in order to find what sequence or structure
properties are shared by the family members and how these could explain the
biological properties shared by the proteins in the family. A description of
sequence properties is called a sequence pattern, and a description of structure
properties is called a structure pattern. If a pattern is common to a family of
proteins, it is called a motif for the family.
An example protein family is the set of proteins containing the classical zinc-
finger DNA-binding domain. Most of the sequences in this family match the
pattern C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, thus this is a sequence motif
for this protein family. A sequence matches this pattern if it contains a C
followed by 2-4 arbitrary letters followed by C and 3 arbitrary letters and one of
L, I, V, M, F, Y, W, or C, and so on (this pattern notation is described in detail
below). This particular pattern describes two cysteines and two histidines that
are needed for coordinating the zinc ion in the classical zinc-finger domain,
which means that this particular pattern has a direct biological interpretation.
Additionally, the pattern can be used for classification, since not only do most
sequences in the family match the pattern, but also very few sequences outside
the family match the pattern. So, if one finds a match to this pattern in a new
sequence, the chances are good that this new protein contains a zinc finger
domain and binds to DNA.
A large number of patterns have been compiled and collected in different
protein family databases. An example is the PROSITE database (1) which con-
tains more than 1000 protein families and for most of these it gives a pattern
which occurs in most of the sequences in the family. The patterns are regular
expressions (like the zinc finger pattern given above) or profiles (position specific
scoring matrices). Other databases also use local alignments, profiles, and
Hidden Markov models (HMMs). We describe briefly some of these databases
and how they can be used later in the chapter.
When one knows the structures of some of the proteins in the protein family
under study, the structures can be compared and similarities described as
patterns or motifs. In the same way that protein structures can be described at
different levels (e.g. atom, residue, secondary structure element level), structure
patterns can describe structure properties at different levels. For example, one
pattern could describe the packing of four alpha-helices and another pattern
could describe the relative position of the cysteines and histidines in the
classical zinc finger.
In this chapter we will use a very broad definition of patterns including both
sequence and structure patterns and all the ways in which these can be defined.
When going into more detail we will focus on sequence patterns which are of
the regular expression type and on one particular type of structure patterns
which describes packing of individual residues. In the following we discuss how
existing databases of patterns can be used for analysing a new protein query
sequence and how to assess the output of such a search. Later we describe
different approaches to finding respectively sequence and structure patterns for
a family.
2 Pattern descriptions
A very simple type of pattern is the substring pattern: a sequence matches a
substring pattern if it contains the substring (a contiguous word in the sequence).
For example, the substring pattern CDEC is matched by all sequences
containing CDEC as a substring. This very simple type of pattern can sometimes be
useful in the analysis of protein or nucleotide sequences. We will first define the
concept of approximate pattern matching and then describe different
generalizations of substring patterns.
on the distance to be allowed. One simple way to measure the distance between
two strings (or a pattern and a string) is to count the number of character
changes needed to transform one into the other. This is called the number of
mismatches or Hamming distance and can measure the distance between two
strings only if they have equal length. For example, the sequence AGCDFCALKW
approximately matches the substring pattern CDEC since the substring CDFC
can be transformed into CDEC by substituting the F with an E.
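Approximate matching of a substring pattern under the Hamming distance amounts to sliding the pattern along the sequence and counting mismatches at each offset. The following minimal sketch (function names are our own) reports the closest equal-length substring for the pattern CDEC and the sequence AGCDFCALKW used above.

```python
def hamming(a, b):
    """Number of mismatching positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def best_mismatch_hit(pattern, sequence):
    """Return (mismatches, start, substring) for the closest equal-length substring."""
    m = len(pattern)
    return min(
        (hamming(pattern, sequence[i:i + m]), i, sequence[i:i + m])
        for i in range(len(sequence) - m + 1)
    )

print(best_mismatch_hit("CDEC", "AGCDFCALKW"))   # (1, 2, 'CDFC')
```

The sequence therefore matches the pattern if one mismatch is allowed, and does not match if none is allowed.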
More general distance measures allow for insertion and deletion of characters
in addition to substitutions. The edit distance between two strings (sequences) is
the minimum number of single character insertions/deletions and substitutions
needed to transform one into the other. For example, the sequence AGCDDALKW
approximately matches the pattern CDED when one allows for an edit distance
of one (deleting the E gives CDD), but it does not match if one allows only for one mismatch.
When comparing protein sequences, one may also penalize different substitu-
tions differently since some amino acid replacements are found more often in
equivalent positions in homologous (evolutionarily related) proteins. For example
substitutions can be penalized using a substitution matrix, e.g. PAM-matrices (2)
or BLOSUM matrices (3).
In order to find the edit distance between two sequences, one can use the
dynamic programming algorithm (4) which can also be used when substitution
matrices are used. For substring patterns, the matching problem is very similar
to local pairwise alignment and database searching where speed-ups can be
used, e.g. BLAST (5), FASTA (6).
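For a substring pattern, dynamic programming can find the smallest edit distance between the pattern and any substring of the sequence. The sketch below uses unit costs for substitutions, insertions, and deletions. A substitution matrix could replace the 0/1 substitution cost, but that is not shown, and the function name is our own.

```python
def best_edit_distance(pattern, sequence):
    """Smallest edit distance between the pattern and any substring of the sequence."""
    # previous[j]: best distance of the pattern prefix against a substring ending at j.
    previous = [0] * (len(sequence) + 1)       # the empty pattern matches anywhere at no cost
    for i, p in enumerate(pattern, start=1):
        current = [i]                          # pattern prefix against nothing: i deletions
        for j, s in enumerate(sequence, start=1):
            current.append(min(
                previous[j - 1] + (p != s),    # match or substitution
                previous[j] + 1,               # pattern letter p has no partner in the sequence
                current[j - 1] + 1,            # sequence letter s has no partner in the pattern
            ))
        previous = current
    return min(previous)                       # the match may end anywhere in the sequence

print(best_edit_distance("CDEC", "AGCDFCALKW"))
print(best_edit_distance("CDED", "AGCDDALKW"))
```

Both calls give a distance of one. The first is reached by a substitution (CDFC), the second by deleting the E from the pattern (CDD), which no single substitution can achieve.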
Figure 1 Example of a local alignment of sequence segments containing the classical
zinc-finger motif. The consensus sequence below shows for each position which amino acid is
the most frequent, and to the right is shown, for each segment, the number of mismatches
between the segment and the consensus. Grey shading marks the positions conserved in all
four segments. The positions of the conserved cysteines and histidines of the zinc-finger
motif are underlined in the consensus sequence.
(c) Set of residues given in curly brackets: matches any one sequence letter not
in the set, e.g. {KER} matches any letter except K, E, and R in the sequence.
(d) Wildcard x: matches any one letter in a sequence.
Additionally:
(a) Single pattern elements can be followed by parentheses (i, j) which means
that the sequence can contain between i and j (inclusive) letters each match-
ing the preceding pattern element. For example, x(3,5) matches between 3
and 5 arbitrary sequence letters.
(b) The pattern can start with '<' meaning that the pattern should match from
the beginning of a sequence.
(c) The pattern can end with '>' meaning that the pattern should match until
the end of a sequence.
(d) The pattern elements are separated by hyphens '-'.
Consecutive pattern elements should be matched by consecutive sequence
symbols, so for example the pattern C-x(2,3)-[DE] matches any sequence con-
taining a C followed by two or three arbitrary letters followed by a D or an E.
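Since patterns in this notation are regular expressions, they can be translated mechanically into the syntax of a general-purpose regular expression engine. The sketch below converts the elements described above into a Python regular expression. It handles only the constructs listed in this section, and the function name and the test segment are invented for the example.

```python
import re

def prosite_to_regex(pattern):
    """Translate the pattern notation described above into a Python regular expression."""
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    parts = []
    for element in pattern.strip("<>").split("-"):
        # Split off an optional repeat count such as (3) or (2,4).
        core, _, repeat = element.partition("(")
        if core == "x":
            core = "."                             # wildcard: any one letter
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"         # any letter except those listed
        # Single letters and [..] sets are already valid regular expression syntax.
        if repeat:
            core += "{" + repeat.rstrip(")") + "}"
        parts.append(core)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")

zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(zinc_finger)
print(regex)   # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Invented segment with the Cys and His spacing of the pattern.
print(bool(re.search(regex, "YKCPECGKSFSQKSDLVKHQRTH")))   # True
```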
which is the best one, and to find whether the identified pattern could be the
result of chance.
2.4.1 Characterization
Patterns can be used to describe biologically important features of the proteins,
that is, one wants to describe which features are compulsory in order for the
protein to belong to a particular family and which features are optional. If one
has available a set of sequences (or structures) from the same family, these can
be compared, and it can be determined which residues are conserved through
evolution and therefore likely to be important to the proteins' function and
structure. If the available proteins have undergone little evolution since their
last common ancestor (for example if their sequences are 90% identical), it will
not be easy to find which of the conserved residues are most important for the
biological function of the proteins. Therefore one should try to collect proteins
that are as diverse as possible while avoiding inclusion of unrelated proteins.
Having developed a pattern conserved in a set of sequences, one should find
whether such a pattern is likely to be conserved by chance and therefore not
necessarily biologically important. One common method is to calculate an esti-
mate of the probability that a set of random sequences (equal in number and
lengths to the sequences under analysis) would share a pattern of the same
strength as the identified pattern, as a result of chance. If this probability is very
low, one has a better reason for believing that the identified pattern has some
biological meaning. When evaluating the significance of the discovered pattern,
one should also take into account the number of patterns that have been con-
sidered in the pattern discovery phase (see e.g. ref. 14). For example, a pattern
with probability 10^-6 is expected to be found by chance if one million patterns
are considered.
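The arithmetic behind this warning is simple enough to check directly. The sketch below assumes that the patterns considered are independent, which is a simplification since patterns examined during discovery usually overlap. The function name is our own.

```python
def chance_findings(p, n_patterns):
    """Expected number of chance hits, and probability of at least one, assuming independence."""
    expected = p * n_patterns
    at_least_one = 1.0 - (1.0 - p) ** n_patterns
    return expected, at_least_one

expected, at_least_one = chance_findings(1e-6, 1_000_000)
print(expected, round(at_least_one, 2))
```

With a per-pattern probability of 10^-6 and one million patterns considered, about one chance finding is expected, and the probability of at least one is roughly 0.63.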
An alternative to calculating pattern probabilities is to measure the patterns'
information content (15). The higher the information content the pattern
possesses, the less likely a random sequence is to match it. The measure was
designed for ranking patterns matching the same number of sequences. Using
the principle of minimum description length (MDL) from machine learning (16),
this has been extended to also take into account the number of sequences
matching each pattern (17).
An alternative approach is to do a series of pattern discovery experiments on
sets of sequences with characteristics similar to the sequences in which the
patterns were found. The sequences should be chosen so that they share no
significant patterns. The result of the pattern discovery experiments will give
information about what type of patterns can be found by chance. For example,
one can repeat x times: shuffle all the sequences, and check which patterns can
be found to match at least the same number of (the shuffled) sequences as the
pattern under analysis. An advantage of this approach is that in assessing the
'background probability' one can use sequences which have the same character-
istics as the original similar sequences (local sequence composition, etc.) for
example by using special shuffling operations.
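A shuffling experiment of this kind can be sketched as follows. Note that the sketch is simpler than the procedure described above: it re-tests one fixed pattern on shuffled sequences rather than re-running pattern discovery on them, so it tends to underestimate how often some pattern of similar strength arises by chance. The plain residue shuffle preserves only the overall composition of each sequence. The function names and the toy family are assumptions.

```python
import random
import re

def shuffled_support(regex, sequences, trials=200, seed=0):
    """For each trial, shuffle every sequence and count how many still match the pattern."""
    rng = random.Random(seed)
    compiled = re.compile(regex)
    counts = []
    for _ in range(trials):
        matched = 0
        for seq in sequences:
            letters = list(seq)
            rng.shuffle(letters)               # keeps composition, destroys residue order
            matched += bool(compiled.search("".join(letters)))
        counts.append(matched)
    return counts

def empirical_p(regex, sequences, trials=200, seed=0):
    """Observed support, and fraction of trials where shuffled data match at least as many."""
    observed = sum(bool(re.search(regex, s)) for s in sequences)
    counts = shuffled_support(regex, sequences, trials, seed)
    return observed, sum(c >= observed for c in counts) / trials

# Toy family in which every sequence contains C-x(2)-C.
family = ["MKCAACLLH", "GGCPECWWK", "TTCDDCAAR"]
observed, p_value = empirical_p("C.{2}C", family, trials=500)
print(observed, p_value)
```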
When evaluating discovered patterns, it is important to take into consider-
ation whether (some of) the sequences under analysis are very closely related.
When calculating the probabilities, the model normally assumes that the
sequences are independently generated by some probabilistic model and the
shuffling would normally also extinguish any close similarities between the
sequences. If some of the sequences are very similar, they will contain many
common patterns, and any pattern matching one of them will probably match
all, and is therefore likely to be deemed more significant than it is. A scoring
scheme taking this into account has been proposed (18).
2.4.2 Classification
When a pattern is to be used for classification, it should ideally match all family
members and no other sequences. Most often, however, the pattern fails to
match some member sequences (called false negatives), and it may match some
sequences outside the family (false positives). For an illustration, see Figure 2. The
fewer false negatives, the more sensitive the pattern is said to be, and the fewer
false positives, the more specific it is. Ideally, a pattern should have zero false
positives and negatives.
An estimate of the number of matches in a sequence database can be found
by multiplying the probability that one random sequence matches the pattern
by the number of sequences in the database. In order to calculate the probability
we assume that random sequences are generated using a specific probabilistic
model. Sternberg (19) did this for all the patterns in the PROSITE database and
showed a clear correlation between the expected number of false positives and
the actual number, i.e. the number of unrelated sequences in the SWISS-PROT
database (20) matching the pattern.
Figure 2 Illustration of the concepts of true positives, true negatives, false positives, and
false negatives. The circles are the family members and the squares are non-family
members. A pattern matches the encircled objects, and the status of each object is shown
by its colour (unfilled means 'true' and filled means 'false').
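As a rough illustration of this estimate, the sketch below assumes a very simple probabilistic model: residues are drawn independently with fixed frequencies, the pattern has fixed length, and the possible match positions within a sequence are treated as independent (an approximation). The pattern representation (a list whose entries are either a string of allowed residues or None for a wildcard), the uniform frequencies, and the function names are assumptions made for the example.

```python
def window_probability(pattern, freqs):
    """Probability that one window of len(pattern) residues matches."""
    p = 1.0
    for allowed in pattern:
        if allowed is not None:                 # None stands for wildcard x
            p *= sum(freqs[a] for a in allowed)
    return p

def expected_matches(pattern, freqs, n_sequences, seq_length):
    """Expected number of database sequences matching the pattern, for a
    database of n_sequences random sequences of the same length."""
    windows = max(seq_length - len(pattern) + 1, 0)
    # Probability of at least one matching window in one sequence,
    # treating windows as independent (an approximation).
    p_sequence = 1.0 - (1.0 - window_probability(pattern, freqs)) ** windows
    return n_sequences * p_sequence

# Uniform residue frequencies, used here only to keep the example short.
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
c_x2_c = ["C", None, None, "C"]                 # the pattern C-x(2)-C
estimate = expected_matches(c_x2_c, uniform, n_sequences=1000, seq_length=100)
```

In practice one would use residue frequencies estimated from the database and its real length distribution rather than the uniform values used here.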
Denoting the number of true positives (sequences in the family matching the
pattern) by TP and the number of false negatives by FN, the sensitivity of a pattern
(21) can be defined as
Sensitivity = TP/(TP+FN)
which measures how large a proportion of the family sequences are 'picked up by'
(matched by) the pattern. Similarly, the specificity of the pattern can be defined as
Specificity = TN/(TN+FP)
(where TN and FP are, respectively, the number of true negatives and the number
of false positives), which measures how large a proportion of the sequences
outside the family are not matched by the pattern. Yet another useful number is
the positive predictive value (PPV), which says how large a proportion of the
sequences matching the pattern are actually in the family:
PPV = TP/(TP+FP)
The value range for all three is from zero to one, one being the best possible.
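The three measures translate directly into code. The following minimal sketch uses a hypothetical function name and made-up counts; it does not guard against a zero denominator (for instance a pattern with no matches at all, for which PPV is undefined).

```python
def classification_measures(tp, fn, tn, fp):
    """Sensitivity, specificity and PPV as defined in the text above."""
    return {
        "sensitivity": tp / (tp + fn),   # share of family members matched
        "specificity": tn / (tn + fp),   # share of non-members not matched
        "ppv": tp / (tp + fp),           # share of matches that are members
    }

# Made-up counts: a family of 100 sequences in a database of 10 010.
measures = classification_measures(tp=90, fn=10, tn=9900, fp=10)
```

With these counts the sensitivity and the PPV are both 0.9, while the specificity is very close to one simply because the non-family sequences vastly outnumber the family.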
When evaluating patterns to be used for classification, one needs to use more
than one of the measures. This can be illustrated by two degenerate cases: (1) the
empty pattern, which matches any protein, and (2) a pattern matching one single
protein that is a member of the family. Pattern (1) has perfect sensitivity, but very
bad specificity and PPV, while pattern (2) has perfect specificity and PPV, but bad
sensitivity. For a concrete example of the use of these equations, see below. In
practice one often needs to make a trade-off between sensitivity and specificity
when choosing which pattern to use for a family. One way to evaluate a
probabilistic pattern's ability to discriminate between family members and other
sequences is to find a cut-off on the score that gives the same number of false
positives and false negatives. Tatusov et al. (22) evaluated alternative ways of
deriving weight matrices from local ungapped alignments using this approach.
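One way to locate such a cut-off is sketched below. It is an illustration only, not the procedure used in (22): it assumes that a sequence is reported as a match when its score is at least the cut-off, tries every observed score as a candidate, and returns the first candidate for which the numbers of false negatives and false positives are closest. The function name and the scores are hypothetical, and both score lists are assumed to be non-empty.

```python
def balanced_cutoff(family_scores, other_scores):
    """Return (cutoff, false_negatives, false_positives) for the cut-off at
    which the two error counts are as close to equal as possible."""
    best = None
    for cutoff in sorted(set(family_scores) | set(other_scores)):
        fn = sum(1 for s in family_scores if s < cutoff)    # members missed
        fp = sum(1 for s in other_scores if s >= cutoff)    # others matched
        if best is None or abs(fn - fp) < best[0]:
            best = (abs(fn - fp), cutoff, fn, fp)
    return best[1], best[2], best[3]

# Made-up scores for four family members and five unrelated sequences.
cutoff, fn, fp = balanced_cutoff([9, 8, 7, 4], [6, 5, 3, 2, 1])
```

The smaller the number of errors at the balanced cut-off, the better the probabilistic pattern separates the family from other sequences.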
2.4.3 Discussion
Often the patterns that describe biologically important features will also be good
for classification purposes and vice versa. However, it is possible that a pattern
that gives perfect discrimination can be derived and yet lacks any biological
interpretation. Also, it may be that the features described by a pattern are
important in the family, but not unique to the family, so that the pattern is not
specific enough to be used for classification purposes.
ing and pattern definition) while others are made (to a varying degree)
automatically. For some summary information about a few family databases, see
Table 1.
2.5.1 PROSITE
One of the most widely used databases is PROSITE (1), in which, for each family,
one or several patterns and/or profiles are given in the format described above
(Section 2.2). Profiles are described using the generalised profile syntax (23).
Statistics are given which describe the patterns' ability to discriminate between
family members and other sequences in the SWISS-PROT protein sequence
database (20), in the form of the number of false positives and false negatives.
Also, a number of unknowns is given, which is the number of sequences in
SWISS-PROT which match the pattern, but for which it is not yet known whether
they belong to the family or not.
Figure 3 shows a PROSITE entry giving a signature pattern (motif) for the
actinin-type actin-binding domain. We will explain the most important (for our
purpose) parts of the entry. First, the ID and AC lines give, respectively, the
name and the accession number of the PROSITE entry. The pattern is given on
the PA line (the pattern can continue over several PA lines, the end of the
pattern being marked by a period). The NR lines give statistics about the
pattern's discriminatory power with respect to the SWISS-PROT database. The
first NR line says that the statistics are with respect to release 35 of SWISS-PROT,
which has 69113 sequence entries. The next NR line gives the number of
matches of different categories (true positives, false positives, false negatives,
etc.) in the SWISS-PROT database. Each is given as x(y) meaning that there are x
matches to the pattern in y different sequences.
The DR lines give references to the corresponding SWISS-PROT entries both
by their names and accession numbers. Each reference is of the form 'AC, ID,
Status;' where Status is one of T (true positive), N (false negative), F (false positive),
P (partial), and ? (unknown). Finally, the 3-D line gives names of PDB (Protein
Data Bank) entries containing structures of proteins in the family, and the DO
line gives the accession number of the entry in the PROSITE documentation part
corresponding to this entry.
Figure 3 Example of a PROSITE entry taken from release 14.0 (November 1997). See text
for a detailed explanation.
2.5.2 BLOCKS
Another protein family database is BLOCKS (8) which contains the same families
as PROSITE, but instead of giving patterns or profiles, it gives a set of blocks for
each family. A block is an ungapped local multiple alignment. The database is
constructed fully automatically. For each family the member sequences (as given
in PROSITE) are subjected to pattern discovery and local alignment methods. The
blocks can also be linked in chains. It is recommended that a query sequence is
matched against both PROSITE and BLOCKS, because even though they describe
the same families, the patterns and the blocks in a sense have complementary
strengths when used as classifiers.
2.5.3 PRINTS
The PRINTS database is a collection of fingerprints and is constructed semi-
automatically (9). A fingerprint is defined as a list of motifs, each motif being a
local ungapped alignment. The fingerprints are made by first manually making
an alignment of some family members and then iteratively scanning a database
of protein sequences, adding new members, updating the fingerprint, scanning
again until convergence (no new family members are found). Each entry in
PRINTS gives the local alignments corresponding to the motifs and information
about partial matches etc. On the PRINTS website, a tool, FingerPRINTScan, is
available for scanning a query sequence against the database. The output can be
visualized, showing the positions of the motif matches in the query sequence.
2.5.4 Pfam
Pfam is a database of multiple sequence alignments and HMM-profiles of protein
domains (24). It is partly manually curated. Seed alignments for each family are
made semi-automatically, and these are extended using HMM methods. Version
3.1 (August 1998) contains 1313 families.
On the Web site, tools are available for matching a query sequence
against the database. Also, a special database, SWISSPFam, is provided which
shows the domain organization (according to Pfam) of the sequences in
SWISS-PROT and TrEMBL.
2.5.5 Identify
Identify (25) is a database of patterns of the same form as used in PROSITE but
without flexible length wildcards. It contains patterns for the families in the
BLOCKS and PRINTS databases. The patterns were constructed from the
ungapped alignments (blocks) in the BLOCKS and PRINTS databases by using a
pattern finding program (EMOTIF). For each block, there can be several patterns
so that each pattern matches a subset of the sequences. The patterns were
generated to have a certain specificity (calculated as the probability that a
random sequence matches the pattern by chance, cf. ref. 19), and patterns were
generated for different specificity levels. The World Wide Web server allows the
user to input a query sequence which then is matched against the pattern
collection and the user is given the list of matching patterns together with links
to the corresponding BLOCKS and PRINTS database entries.
ScanProsite that can be used for both (1) and (2) with regular expression type
patterns. For scanning against the profiles in PROSITE, the tool ProfileScan can
be used. The ScanProsite program does not return any information about the
probability that the match between the sequence and the pattern could be by
chance, and it allows only exact matching between the pattern and the
sequence.
Another very useful tool is PdbMotif (26), which takes as input a protein
structure and finds all matches between the protein's sequence and patterns in
the PROSITE database. It generates a script that can be input to RasMol (27) to
highlight the pattern matches. PdbMotif also outputs a probability that the
sequence should match the PROSITE pattern by chance, which can help to
identify possible false positives.
ally of the PROSITE type, but the set of possible amino acid groups is limited and
given by an Amino Acid Class Covering (AACC) hierarchy. The two sequences
aligned are now replaced by their common pattern, and the procedure is repeated
until there remains only one pattern matching all of the input sequences. For an
example, see Figure 4. Note that this approach is closely analogous to the
progressive multiple alignment methods used, for example, by Thompson et al. (31)
and Taylor (32).
a conserved pattern. Pratt uses a two-step search for finding conserved patterns
from the chosen solution space having maximum fitness. The fitness is
normally defined as the information content of the pattern (see below).
In the following sections we give a practical guide to how Pratt should be
used and then some details about the algorithms used in Pratt. For more detailed
technical descriptions, see the original papers (15, 33) and the Pratt home page
on the World Wide Web: http://www.ii.uib.no/~inge/Pratt.html.
Table 2 The table shows some example patterns, and for each example, the minimum
values to be used for some Pratt parameters if the pattern is to be discovered. For example,
in order to discover the bottom-most pattern, one needs to increase the value of the PX
parameter to at least 12
using the option -px 15 to allow long wildcards (if we want to re-discover the
known motif, we need to set PX to at least 12 since during the initial pattern
search, a spacing of 12 is needed between the last conserved cysteine and the
first histidine). Also, if one wants to find all patterns matching at least 90%
of these sequences, one can use the command
pratt fasta c2h2 -px 15 -c% 90
Figure 6 Example of a pattern graph. The paths in the graph define the patterns
A-B-x(0,2)-C-x(3,3)-D, A-B-x(0,2)-C, A-B, B-x(0,2)-C-x(3,3)-D, B-x(0,2)-C, C-x(3,3)-D,
A-x(1,3)-C-x(3,3)-D, and A-x(1,3)-C.
The initial search explores all patterns that can be derived from paths in the
pattern graph and that are contained in the class of patterns defined by the user.
The search is focused on finding only the highest scoring patterns. Branch-and-
bound techniques are used to avoid considering parts of the search space that
cannot possibly contain patterns with higher scores than already identified
patterns.
Also, heuristics have been implemented that effectively avoid exploring
search paths unlikely to produce patterns scoring higher than patterns already
found. The user can adjust the greediness of the search. Setting the E parameter
to zero gives a non-heuristic search (guaranteed to find the highest-scoring
patterns), setting E to 1 gives the same guarantee in cases where no flexibility is
allowed in wildcard regions, and E values above 1 give an increasingly greedy
search. The default value is 3. The greedier the search, the faster it will be, and
the more likely Pratt is to miss the highest-scoring patterns. Experiments have
shown that E = 3 gives a good compromise between speed and accuracy for protein
sequences, while for DNA a lower value should be used (for instance, E = 1.5).
Figure 7 Example of the set B of w-segments made for a set of sequences. The B segments
are used in the block data structure.
each sequence w-1 dummy symbols '-'. For an example of the w-segments made
for a set of sequences, see Figure 7.
Now, for each amino acid symbol a, and for each i between 1 and w, construct
the set b_{i,a}, that is, the set of all w-segments having character a in position i.
These sets can be used to quickly find the set of w-segments matching any pattern
considered by Pratt whose length does not exceed w. For instance, the set of
segments matching A-x(2)-B is b_{1,A} ∩ b_{4,B}. In the recursive search, the set of
segments matching P is used together with the block data structure to find
the segments matching each extension P-x(i,j)-A of P. For a more detailed
description, see Jonassen et al. (15).
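The block data structure can be illustrated with a small Python sketch. It is not Pratt's implementation: it assumes that each sequence is padded at its end with w-1 dummy symbols, that one w-segment starts at every sequence position, and that patterns have fixed-length wildcards only. The function names and the toy sequences (which use the symbols A and B from the example above) are hypothetical.

```python
from collections import defaultdict

def make_segments(sequences, w):
    """All w-segments: one per start position, padded with '-' at the end."""
    segments = []
    for seq_index, seq in enumerate(sequences):
        padded = seq + "-" * (w - 1)
        for start in range(len(seq)):
            segments.append((seq_index, start, padded[start:start + w]))
    return segments

def build_blocks(segments):
    """blocks[(i, a)] = ids of the segments with symbol a at position i
    (positions are 1-based, as in the text)."""
    blocks = defaultdict(set)
    for seg_id, (_, _, text) in enumerate(segments):
        for i, a in enumerate(text, start=1):
            if a != "-":                        # dummy symbols never match
                blocks[(i, a)].add(seg_id)
    return blocks

def match_fixed_pattern(blocks, components):
    """Segments matching a pattern given as (position, symbol) pairs,
    e.g. A-x(2)-B is [(1, 'A'), (4, 'B')]."""
    sets = [blocks.get(c, set()) for c in components]
    return set.intersection(*sets) if sets else set()

segments = make_segments(["AKLBC", "GAMNB"], w=4)
blocks = build_blocks(segments)
matching = match_fixed_pattern(blocks, [(1, "A"), (4, "B")])
```

Each pattern lookup is reduced to set intersections, so extending a pattern by one component only requires intersecting the current match set with one more block.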
If the aim is to find patterns to be used for classification, Pratt can evaluate
the discovered patterns by their positive predictive value (PPV, see Section 2.4.2,
Classification). It is assumed that the sequences under analysis are all in the
SWISS-PROT database (20). The number of false positives for each pattern is
found by matching the patterns against the SWISS-PROT database that must be
locally available in flat file format.
5 Structure motifs
Finding recurring patterns (motifs) in protein structures helps us to better
understand the rules underlying the formation of protein structures. Since
structure is better conserved during evolution than sequence, structural
similarities can also help to identify remote evolutionary relationships.
Structure motifs can help in approaching the structure prediction problem and in
assigning function to proteins. Structure motifs can represent common structural
features at different levels. For example, they can represent packing of
secondary structure elements, local packing of residues, and atom coordinates of
binding atoms in active site (or ligand binding) residues. Structure motifs
describing functional sites in proteins have been developed by, for example,
Wallace et al. (34). They call their
motifs 'templates' and suggest that they can be used for finding functional
sites in proteins. They have also developed a database PROCAT of such templates
(35) which allows the user to search for 3-D enzyme active sites in a protein
structure.
In order to find recurring patterns in protein structures one can use methods
for the comparison of protein structures. A number of such methods have been
developed, most of them for comparing pairs of structures, but also some for
multiple structure comparison. The methods differ in what similarities they are
able to find. Some represent the structures as composed of secondary structure
elements (alpha helices, beta strands, and loops) and have provided methods for
finding conserved patterns at this level. Other methods find patterns
of residues (or atoms) that have similar configurations in space. Brown et al. (36)
give a survey of a large number of different methods, focusing on their way of
representing similarities. An important difference between methods is whether
they require matched elements to be in the same order along the proteins'
primary structure. A number of different methods have been used, including
extensions of the dynamic programming algorithms used for pairwise sequence
comparison (37), graph-theoretic methods (38), and methods from computer
vision (39).
Also, some more direct methods for structure motif discovery have been
suggested. Here we will describe the SPratt program, a more detailed description
of which can be found in (40).
encoding structural features in the form of strings and input these to Pratt
producing patterns common to the strings. Next one needs to check if the
patterns found in the structure description strings correspond to similarities
between the structures. The encoding of structural properties that we adopted
was one described by Karlin and Zhu (41). In their method, they make one string
per residue in each structure. The strings contain information about the spatial
neighbourhood of each residue. Karlin and Zhu describe alternative methods for
making these strings, and we chose one of them to be used in SPratt.
for which the occurrences do not superpose very well using a measure of RMSD.
However, the RMSD value for superposing the coordinates of a small number of
residues, for example four, is not very informative. The significance of the
patterns found by Pratt can be further assessed using the structure alignment
program SAP (42). We have done this by rewarding alignment of residues in
agreement with the motif, and in this way checking whether the matching of the
few residues described by the motif can be extended to an alignment of larger
parts of the structures.
6 Examples
In (15) we described the application of the first version of Pratt to the analysis of
some protein families in PROSITE. For example, we analysed the Snake toxin
family (PS00272 in PROSITE) containing 164 sequences of average length 64. We
retrieved the sequences from the SWISS-PROT database and input them
(unaligned) to the Pratt program. Using default parameters (which require patterns
to match all the sequences), no patterns were found. However, using Pratt to
discover patterns matching at least 155 out of the 164 sequences, we got the
pattern G-C-x(1,3)-C-P-x(8,10)-C-C-x(2)-[PDEN]. This pattern turned out not to
match any sequences in SWISS-PROT apart from the family members and has
since been included in the PROSITE database as the pattern for this family.
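To show how a pattern of this kind can be matched against sequences, the sketch below translates a subset of the PROSITE pattern syntax into a Python regular expression. It is an illustration only, not part of Pratt or PROSITE: the function name is hypothetical, it handles single residues, x wildcards, [..] groups, {..} exclusions and (n) or (n,m) repeat counts, and it ignores the anchors '<' and '>'. The test sequence is made up and is not a real toxin sequence.

```python
import re

# One pattern element: a core symbol followed by an optional repeat count.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate e.g. 'C-x(2,4)-[ST]' into the regex 'C.{2,4}[ST]'."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                           # wildcard: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # {P} means any residue but P
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("G-C-x(1,3)-C-P-x(8,10)-C-C-x(2)-[PDEN]")
made_up = "MK" + "GC" + "AA" + "CP" + "A" * 9 + "CC" + "AA" + "D" + "KK"
found = re.search(regex, made_up) is not None
```

Matching every sequence of a database with such a regular expression and counting the hits inside and outside the family gives the counts needed for the sensitivity, specificity and PPV measures of Section 2.4.2.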
Using the SPratt program we analysed a set of cupredoxin protein structures
(40). The proteins were selected from the cupredoxins super-family in SCOP (43)
so that all pairwise sequence similarities were 30% or less. The 10 structures
were input to the SPratt program and it was instructed to search for patterns
containing single residue elements and also allowing the match-set [MLQ] (since
the methionine ligand-binding residue is known to be substituted by L or Q in
some proteins). SPratt used two minutes and identified three patterns, all
matching around the copper binding sites. The substructure occurrences of the
pattern superpose with very low RMSD values (0.7 Å or less). We also used the
motif identified by SPratt to guide the structure alignment program SAP, using
the pairs of equivalenced residues as extra constraints on the alignment. This
resulted in a greatly improved alignment with RMSD of 1.56 Å over 63 pairs of
residues (as compared to the alignment with RMSD of 5.1 Å over 26 residues
when SAP was run without any motif information).
7 Conclusions
Patterns and pattern discovery tools can help in the analysis of protein families,
i.e. sets of proteins believed to share structural and/or functional properties. The
most well conserved parts of the sequences or structures can be identified and it
can be analysed whether the conserved patterns are statistically significant and
therefore likely to have biological importance. Furthermore, the patterns can be
used to identify additional related proteins. Patterns can describe protein prop-
References
1. Bairoch, A., Bucher, P., and Hofmann, K. (1996). Nucleic Acids Res., 24, 189.
2. Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978). In Atlas of protein sequence and
structure (ed. M. O. Dayhoff), Vol. 5, Suppl. 3, p. 345. National Biomedical Research
Foundation, Washington DC.
3. Henikoff, S. and Henikoff, J. G. (1992). Proc. Natl. Acad. Sci. USA, 89, 10915.
4. Needleman, S. B. and Wunsch, C. D. (1970). J. Mol. Biol., 48, 443.
5. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). J. Mol. Biol.,
215, 403.
6. Lipman, D. J. and Pearson, W. R. (1985). Science, 227, 1435.
7. Brazma, A., Jonassen, I., Eidhammer, I., and Gilbert, D. (1998). J. Comp. Biol., 5, 279.
8. Pietrokovski, S., Henikoff, J. G., and Henikoff, S. (1996). Nucleic Acids Res., 24, 197.
9. Attwood, T. K., Beck, M. E., Bleasby, A. J., Degtyarenko, K., and Smith, D. J. P. (1996).
Nucleic Acids Res., 24, 182.
10. Gribskov, M., McLachlan, A. D., and Eisenberg, D. (1987). Proc. Natl. Acad. Sci. USA, 84,
4355.
11. Krogh, A., Brown, M., Mian, I. S., Sjoelander, K., and Haussler, D. (1994). J. Mol. Biol.,
235, 1501.
12. Brown, M., Hughey, R., Krogh, A., Mian, I. S., Sjoelander, K., and Haussler, D. (1993). In
Proc. 1st Int. Conf. On Intell. Systems for Mol. Biol., pp. 47-53. AAAI Press.
13. Altschul, S. F., Carroll, R. J., and Lipman, D. J. (1989). J. Mol. Biol., 207, 309.
14. Neuwald, A. F. and Green, P. (1994). J. Mol. Biol., 239, 698.
15. Jonassen, I., Collins, J. F., and Higgins, D. G. (1995). Protein Sci., 4, 1587.
16. Rissanen, J. (1978). Automatica-J. IFAC, 14, 465.
17. Brazma, A., Jonassen, I., Ukkonen, E., and Vilo, J. (1996). In Proc. 4th Int. Conf. On Intell.
Systems for Mol. Biol., pp. 34-43. AAAI Press.
18. Jonassen, I., Helgesen, C., and Higgins, D. (1996). Reports in Informatics, report no.
116. Dept. of Informatics, Univ. of Bergen, Norway.
19. Sternberg, M. J. E. (1991). Nature, 349, 111.
20. Bairoch, A. and Boeckmann, B. (1992). Nucleic Acids Res., 20, 2019.
21. Lathrop, R., Webster, T., Smith, R., Winston, P., and Smith, T. (1993). In Artificial
intelligence and molecular biology (ed. L. Hunter), pp. 211-58. AAAI Press/The MIT Press.
22. Tatusov, R. L., Altschul, S. F., and Koonin, E. V. (1994). Proc. Natl. Acad. Sci. USA, 91, 12091.
23. Bucher, P. and Bairoch, A. (1994). In Proc. of 2nd Int. Conf. On Intell. Systems for Mol. Biol.,
pp. 53-61. AAAI Press.
24. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A., and Durbin, R. (1998). Nucleic
Acids Res., 26, 320.
25. Nevill-Manning, C. G., Wu, T. D., and Brutlag, D. L. (1998). Proc. Natl. Acad. Sci. USA, 95,
5865.
26. Saqi, M. A. S. and Sayle, R. (1994). Comput. Appl. Biosci., 10, 545.
27. Sayle, R. A. and Milner-White, E. J. (1995). Trends Biochem. Sci., 20, 374.
28. Smith, H. O., Annau, T. M., and Chandrasegaran, S. (1990). Proc. Natl. Acad. Sci. USA, 87,
826.
29. Smith, R. F. and Smith, T. F. (1990). Proc. Natl. Acad. Sci. USA, 87, 118.
30. Smith, T. F. and Waterman, M. S. (1981). J. Mol. Biol., 147, 195.
31. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Nucleic Acids Res., 22, 4673.
32. Taylor, W. R. (1988). J. Mol. Evol., 28, 161.
33. Jonassen, I. (1997). Comput. Appl. Biosci., 13, 509.
34. Wallace, A. C., Borkakoti, N., and Thornton, J. M. (1997). Protein Sci., 6, 2308.
35. http://www.biochem.ucl.ac.uk/bsm/PROCAT/PROCAT.html
36. Brown, N. P., Orengo, C. A., and Taylor, W. R. (1996). Comput. Chem., 20, 359.
37. Taylor, W. R. and Orengo, C. A. (1989). J. Mol. Biol., 208, 1.
38. Artymiuk, P. J., Poirrette, A. R., Grindley, H. M., Rice, D. W., and Willett, P. (1994).
J. Mol. Biol., 243, 327.
39. Nussinov, R. and Wolfson, H. J. (1991). Proc. Natl. Acad. Sci. USA, 88, 10495.
40. Jonassen, I., Eidhammer, I., and Taylor, W. R. (1999). Proteins: Struct. Funct. Genet., 34, 206.
41. Karlin, S. and Zhu, Z.-Y. (1996). Proc. Natl. Acad. Sci. USA, 93, 8344.
42. Taylor, W. R. (1999). Prot. Sci., 8, 654.
43. Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. (1995). J. Mol. Biol., 247, 536.
44. Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Genome Res., 8, 1202.
Chapter 8
Comparison of protein sequences and practical database searching
Golan Yona and Steven E. Brenner*
Department of Structural Biology, Stanford University, USA.
* Department of Plant and Microbial Biology, University of California, Berkeley, USA
1 Introduction
During the last three decades a considerable effort has been made to develop
algorithms that compare sequences of biological macromolecules (proteins, DNA).
The purpose of such algorithms is to detect evolutionary, and thus structural
and functional, relations among sequences. Successful sequence comparison
would allow us to infer the biological properties of new sequences from data
accumulated on related genes. For example, a similarity between a translated
nucleotide sequence and a known protein sequence suggests a homologous
coding region in the corresponding nucleotide sequence. Significant sequence
similarity among proteins may imply that the proteins share the same
secondary and tertiary structures, and have close biological functions. The prediction of
unknown protein structures is often based on the study of known structures of
homologous proteins.
Today, the routine procedure for analysis of a new protein sequence almost
always starts with a comparison of the sequence to hand with the sequences in
one or more of the main sequence databases. A new sequence is analysed by
extrapolating the properties of its 'neighbours' in a database search. Such
methods have been applied during the last three decades with much success and
have helped to identify the biological function of many protein sequences, as
well as to reveal many distant and interesting relationships between protein
families. In fact, more sequences have been putatively characterized by
database searches than by any other single technology.
Detecting homology may often help in determining the function of new
proteins. By definition, homologous proteins have evolved from the same ancestor
protein. The degree of sequence conservation varies among protein families.
Yet, homologous proteins almost always have the same fold (1-3). Although the
common evolutionary origin of two proteins is almost never directly observed,
we can deduce homology, with a high statistical confidence, given that the
sequence similarity is significant.
In principle, similarity does not necessarily imply homology (similarity may
be quantified whereas homology is a relation that either holds or does not hold).
Therefore, similarity should be used carefully in attempting to deduce homology.
The deduction of biological function from sequence similarity is not
straightforward, and sequence comparison procedures may lead to false conclusions
when applied simple-mindedly. Today sequence comparison algorithms are
accompanied by statistical estimates which provide a measure of statistical
significance of the observed sequence similarities. These estimates can further
help in assessing the significance of the similarity, and in many cases can lead to
deduction of homology. The confidence in the deduction clearly depends on the
level of statistical significance. In this view, database searches should be treated
as experiments analogous to wet-lab characterization. Their use deserves the
same care both in the design of the experiment and in the interpretation of
results.
Planning a good experiment requires understanding of the methods being
applied. Fundamentally, database searches are a simple operation: a query
sequence is aligned with each of the sequences (called targets) in a database. A
score is computed from each alignment, and the query/target pairs with the best
scores are then reported to the user. Statistics are used to help improve the
ability to interpret these scores and distinguish true relations between proteins
from chance similarities. A more detailed description of this process, the
sequence-comparison algorithms, the scoring schemes, and the statistics of
sequence alignments is given next.
2 Alignment of sequences
During evolution, sequences have changed by insertions, deletions, and
mutations. These evolutionary events may be traced today by applying algorithms
for sequence alignment. Suppose that a DNA sequence a has evolved to the
sequence b through substitutions, insertions and deletions. This transformation
can be represented by an alignment where a is written above b with the
common (conserved) bases aligned appropriately. For example, say that a = ACTTGA
and b is obtained by substituting the second base from C to G, inserting an A
between the second and the third bases, and by deleting the fifth base (G). The
corresponding alignment will be:
a     =  A  C  -  T  T  G  A
b     =  A  G  A  T  T  -  A
score =  1  0 -1  1  1 -1  1
We usually do not actually know which sequence evolved from the other.
Therefore the events are not directional and insertion of A in b might have been
a deletion of A in a.
In a typical application we are given two related sequences and we wish to
recover the evolutionary events that transformed one to the other. The goal of
sequence alignment is to find the correct alignment that encodes the true series
of evolutionary events that have occurred. The alignment can be assigned a
score which accounts for the number of identities (a match of two identical
letters), the number of substitutions (a match of two different letters), and the
number of gaps (insertions/deletions). For example, in the alignment above, a
score of 1 was given for each identity, a score of 0 was given for each
substitution, and a negative score of -1 was given for each gap. Overall, the
alignment scored 2, which is the sum of all pair scores and gap scores. In general, the
scores for identities and substitutions which are used to score the alignment are
called the scoring matrix, and the scores for gaps are called gap penalties.
Altogether they are called the scoring scheme (see Section 4.5 for details). With
high (positive) scores for identities, and low (or negative) scores for substitutions
and gaps, the basic strategy towards tracing the correct alignment seeks the
alignment which scores best. In the following sections we describe in detail the
common algorithms for sequence comparison. The discussion focuses on the
comparison of protein sequences, but it holds for DNA sequences as well.
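The scoring of the example alignment above can be reproduced with a few lines of Python. The function name is hypothetical, and the default scores are the ones used in the example (1 for an identity, 0 for a substitution, -1 for a gap); a real scoring scheme would look up pair scores in a scoring matrix instead.

```python
def score_alignment(row_a, row_b, identity=1, substitution=0, gap=-1):
    """Per-column scores of an alignment given as two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    scores = []
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            scores.append(gap)              # insertion or deletion
        elif x == y:
            scores.append(identity)         # two identical letters
        else:
            scores.append(substitution)     # two different letters
    return scores

columns = score_alignment("AC-TTGA", "AGATT-A")
total = sum(columns)    # 2, the alignment score quoted in the text
```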
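every possible subalignment is calculated only once, and in constant time¹, out
of its optimal subalignments.
Formally speaking, denote by S_{i,j} the score of the best alignment of the
substring a_1 a_2 ... a_i with the substring b_1 b_2 ... b_j, i.e.
$$S_{i,j} = S(a_1 a_2 \cdots a_i,\; b_1 b_2 \cdots b_j)$$
Assume that the gap penalty is constant and equals α. Then, after an
initialization step
$$S_{0,0} = 0, \qquad S_{i,0} = -i\,\alpha \quad (1 \le i \le n), \qquad S_{0,j} = -j\,\alpha \quad (1 \le j \le m)$$
(where n and m are the lengths of the sequences a and b respectively) define S_{i,j}
recursively
$$S_{i,j} = \max\{\, S_{i-1,j-1} + s(a_i, b_j),\; S_{i-1,j} - \alpha,\; S_{i,j-1} - \alpha \,\}$$
where s(a_i, b_j) is the scoring-matrix entry for aligning a_i with b_j.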
Therefore, the score S(a,b) can be calculated recursively. Since the subalignment
for each i and j has to be calculated, the time complexity of this algorithm is
proportional to the product of the lengths of the sequences compared (a
quadratic time complexity). In practice, the scores are stored in a two-dimensional
array of size (n+1) × (m+1). The initialization sets the values at row zero and
column zero, and the computation proceeds row by row so that the value of each
matrix cell is calculated from entries which were already calculated (see Figure 1).
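The row-by-row computation can be sketched as follows. This is a minimal illustration that assumes the standard dynamic programming recurrence for global alignment with a constant penalty per gap position, where each cell takes the best of the top-left, top and left neighbours; the function names are hypothetical, and `identity_score` is a toy scoring function (1 for an identity, 0 for a substitution) rather than a real scoring matrix.

```python
def identity_score(x, y):
    """Toy scoring function: 1 for an identity, 0 for a substitution."""
    return 1 if x == y else 0

def global_alignment_score(a, b, score, gap_penalty):
    """Best global alignment score of a and b, using an (n+1) x (m+1) array."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: row zero and column zero hold pure-gap alignments.
    for i in range(1, n + 1):
        S[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        S[0][j] = -j * gap_penalty
    # Row by row, each cell depends only on cells computed earlier.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # pair a_i, b_j
                S[i - 1][j] - gap_penalty,                    # gap in b
                S[i][j - 1] - gap_penalty,                    # gap in a
            )
    return S[n][m]

best = global_alignment_score("ACTTGA", "AGATTA", identity_score, gap_penalty=1)
```

For the two example sequences of Section 2 and this toy scoring scheme the function returns 3, the score of the alignment without gaps, which is higher than the score of 2 obtained for the illustrative alignment shown there. The two nested loops make the quadratic running time explicit.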
¹ This is true with linear gap functions. With non-linear gap penalties, the calculation of this
optimal subalignment may need up to i+j+1 operations.
COMPARISON OF PROTEIN SEQUENCES AND PRACTICAL DATABASE SEARCHING
Figure 1 Calculating the global similarity score. The score of the (i,j) entry in the matrix is
calculated from three matrix cells: the one on the left, the one on the top, and the one
located at the top left corner of the current cell. In case of a non-constant gap penalty we
need also to check all the cells in the same row and all the cells in the same column (along
the dashed lines).
unrelated. In such cases global alignment may not be the appropriate tool. In
the search for an optimal global alignment, local similarities may be masked by
long unrelated regions. Consequently, the score of such an alignment can be as
low as for totally unrelated sequences. Moreover, the algorithm may even mis-
align the common region. Therefore, usually it is better to compare sequences
locally. A local alignment of a and b is defined as an alignment between a
substring of a and a substring of b. The local similarity of sequences a and b is
defined as the maximal score over all possible local alignments.
The algorithm which finds the best local alignment is based on a minor mod-
ification of the dynamic programming algorithm for global alignment. Specific-
ally, whenever the score of the optimal subalignment of two subsequences
becomes negative, the score is set to zero, meaning that the corresponding
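subsequences should not be aligned. Following the notation of the previous
section, $S_{i,j}$ is now defined
$$S_{i,j} = \max \{\, 0,\;\; S_{i-1,j-1} + s(a_i, b_j),\;\; S_{i-1,j} + \alpha,\;\; S_{i,j-1} + \alpha \,\}$$
with $S_{i,0} = S_{0,j} = 0$, where $\alpha$ is the constant gap penalty. The local
similarity score is the largest entry, $\max_{i,j} S_{i,j}$.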
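The modification for local alignment can be sketched as below: every cell is clamped at zero and the best cell anywhere in the array is reported. The function names and the toy scoring values (+2 identity, -1 substitution, -2 per gap position) are assumptions chosen for illustration; local alignment only behaves sensibly when substitutions and gaps score negatively on average.

```python
def local_score(a, b, s, gap):
    """Local similarity score: best score over all pairs of substrings."""
    m = len(b)
    prev = [0] * (m + 1)          # row i-1; row 0 is all zeros
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)       # column 0 is zero as well
        for j in range(1, m + 1):
            cur[j] = max(
                0,                                    # start a new alignment
                prev[j - 1] + s(a[i - 1], b[j - 1]),  # extend diagonally
                prev[j] + gap,                        # gap in b
                cur[j - 1] + gap,                     # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


def toy_score(x, y):
    """Illustrative scores: +2 for an identity, -1 for a substitution."""
    return 2 if x == y else -1


# The shared region ACDE is found despite unrelated flanking residues.
print(local_score("XXXACDEYYY", "QQACDEWW", toy_score, gap=-2))
```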
suitable for this purpose. For example, the comparison of a sequence of average
length (350 amino acids) against a typical database (like SWISSPROT (10), with
more than 80000 sequences) may take a few CPU hours on a current standard PC
(Pentium III).
Several algorithms have been developed to speed up the alignment procedure.
The two main algorithms are FASTA (11) and BLAST (12). These are heuristic
algorithms which are not guaranteed to find the optimal alignment. However,
they proved to be very effective for sequence comparison, and they are significantly
faster than the rigorous dynamic programming algorithm.²
² In the last few years, biotechnology companies such as Compugen and Paracel have developed
special purpose hardware that accelerates the dynamic programming algorithm (13). This special-
purpose hardware has again made the dynamic programming algorithm competitive with
FASTA and BLAST, both in speed and in simplicity of use. However, meanwhile, FASTA and
BLAST have become standard in this field and are being used extensively by biologists all over
the world. Both algorithms are fast, effective, and do not require the purchase of additional hard-
ware. BLAST has an additional advantage, as it may reveal similarities which are missed by the
dynamic programming algorithm, for example when two similar regions are separated by a long
dissimilar region.
chance similarities, thus making this algorithm an important tool for molecular
biologists.
Current improvements of BLAST allow gapped alignments, by using dynamic
programming to extend a central seed in both directions (15). This is com-
plemented by PSI-BLAST, an iterative version of BLAST, with a position-specific
score matrix (see Section 4.5) that is generated from significant alignments found
in round i and used in round i+1. The latter may better detect weak similarities
that are missed in database searches with a simple sequence query.
2.2.2 FASTA
FASTA is another heuristic that performs a fast sequence comparison. The
algorithm starts by creating a hash table of all k-tuples (a string of length k) in
the query sequence (usually, k = 1 or 2 for protein sequences, where k = 1 gives
higher sensitivity). This table stores the k-tuples in a way which enables fast
lookup and retrieval of each k-tuple. Then, when scanning a library
sequence, each k-tuple of the library sequence is looked up in the hash table,
and if it is found (this means k-tuple identity) it is marked. At a second stage, the
ten regions with the highest density of identities are rescanned. Common k-
tuples which are on the same diagonal (same offset in both sequences), and not
very far apart (the exact parameters are set heuristically), are joined to form a
region (a gapless local alignment, or HSP in BLAST terminology). The regions are
scored to account for the matches as well as the mismatches, and the best region
is reported (its score is termed 'initial score' or 'init1'). Then, the algorithm tries
to join nearby high scoring regions, even if they are not on the same diagonal
(the corresponding score being termed 'initn score'). Finally, a bounded dynamic
programming is run in a band around the best region, to obtain the 'optimized
score'. If the sequences are related then the optimized score is usually much
higher than the initial score.
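The first stage of this procedure (the k-tuple table and the count of identities per diagonal) can be sketched as follows. This is not the FASTA implementation: the function names are hypothetical, a diagonal is identified simply by the offset between library and query positions, and the later rescoring, joining, and banded dynamic programming stages are left out.

```python
from collections import defaultdict


def ktuple_table(query, k):
    """Map each k-tuple of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(query, library, k=2):
    """Count k-tuple identities on each diagonal (offset j - i)."""
    table = ktuple_table(query, k)
    counts = defaultdict(int)
    for j in range(len(library) - k + 1):
        # Look up the library k-tuple; every query position is a hit.
        for i in table.get(library[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


def best_diagonals(query, library, k=2, top=10):
    """Return up to `top` diagonals with the most k-tuple identities."""
    hits = diagonal_hits(query, library, k)
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


print(best_diagonals("ACDEFG", "XXACDEFGYY", k=2))
```

Counting whole diagonals is a simplification; the text describes regions of high identity density within a diagonal, which would need positions as well as counts.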
³ Two exceptions are segments with unusual amino acid composition, and similarity that is due
to convergent evolution.
this procedure many times it is possible to estimate the mean and the variance
of the distribution, and a reasonable measure of statistical significance (e.g. by
means of the z-score) can be obtained. Formally, denote by S the global similarity
score. Let $\mu$ and $\sigma^2$ be the mean and the variance of the distribution of
scores. Then, the z-score associated with the score S is defined as
$$z = \frac{S - \mu}{\sigma}.$$
This score measures how many units of standard deviation apart the score S is
from the mean of the distribution. The larger it is, the more significant is the
score S.
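As a small worked sketch, the z-score can be computed from a list of scores obtained under the random model. The function name is hypothetical, the background scores below are invented numbers, and the use of the population standard deviation is an assumption; how the randomized scores are generated is not shown here.

```python
import statistics


def z_score(score, random_scores):
    """Number of standard deviations by which `score` exceeds the mean
    of the scores obtained from randomized comparisons."""
    mu = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)   # population standard deviation
    if sigma == 0:
        raise ValueError("random scores have zero variance")
    return (score - mu) / sigma


# Invented background scores, for illustration only.
background = [10, 12, 14, 16, 18]
print(round(z_score(22, background), 3))
```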
where $u = (\ln Kmn)/\lambda$, and K is a constant that can be estimated from the background
distribution and the scoring matrix (Karlin & Altschul 1990).
Figure 2 Probability density function for the extreme value distribution with $u = 0$ and $\lambda = 1$.
This result helps to calculate the probability that a given MSP score could have
been obtained by chance. The score will be statistically significant at the 1% level
if $S > x_0$, where $x_0$ is determined by the equation $Kmn\,e^{-\lambda x_0} = 0.01$. In general, a
pairwise alignment with score S has a p-value of p, where $p = Kmn\,e^{-\lambda S}$, i.e. there
is a probability p that this score could have happened by chance.
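The two relations above are easy to evaluate numerically. In the sketch below the function names are hypothetical, and the values of K and λ are placeholders chosen only for illustration; in practice both depend on the scoring matrix and the background distribution.

```python
import math


def pairwise_p_value(score, m, n, K, lam):
    """p = K*m*n*exp(-lam*score); a useful approximation only when p is small."""
    return K * m * n * math.exp(-lam * score)


def significance_threshold(m, n, K, lam, level=0.01):
    """Score x0 solving K*m*n*exp(-lam*x0) = level."""
    return math.log(K * m * n / level) / lam


# Placeholder parameters (not values from the text).
K, lam = 0.1, 0.3
x0 = significance_threshold(300, 300, K, lam)
print(round(x0, 2), pairwise_p_value(x0, 300, 300, K, lam))
```

Because the threshold grows with ln(mn), doubling both sequence lengths raises the required score by only (ln 4)/λ.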
The probability p, that a similarity score S could have been obtained simply
by chance from the comparison of two random sequences, should be adjusted
when multiple comparisons are performed. One example of this is when a
sequence is compared with each of the sequences in a database with D
sequences. Denote by p-match a match between two sequences that has a p-value
< p (i.e. its score > S). The probability P of observing at least one p-match
(i.e. at least one 'success') in a database search follows the Poisson distribution
$$P = 1 - e^{-pD}.$$
Since not all library sequences have the same probability of sharing a similar
region with the query sequence, D should be replaced with the effective size of
the database. If the query sequence is of length n, and the (pairwise) alignment
of interest involves a library segment of length m, and the database has a total of
N amino acids, then D should be replaced with N/m. Thus,
$$P = 1 - e^{-pN/m} = 1 - e^{-KnN\,e^{-\lambda S}}$$
so the effective size of the search space is Nn (intuitively, this is the number of
possible starting positions of a match).
This is the expected number of distinct matches (segment pairs) that would
obtain a score > S by chance in a database search, with a database of size N
(amino acids) and composition P (the background distribution of amino acids).
The higher it is, the less significant the match. For example, if E = 0.01, then the
expected number of random hits with a score > S is 0.01. In other words, we
may expect a random hit with that score only once in 100 independent searches.
If E = 10, then we should expect 10 hits with a score > S by chance, in a single
database search. This means that such a hit is not significant. (Note that
E ≈ P for P < 0.1.)
Finally, by setting a value for E and solving the equation above for S, it is
possible to define a threshold score, above which hits are reported. This is the
score above which the number of hits that are expected to occur at random is
< E. Therefore, we can deduce that a match with this score or above reflects a true
biological relationship, but we should expect up to E errors per search. The
specific value of E affects both the sensitivity of a search (the number of true
relationships detected) and its selectivity (the number of errors). A lower value
of E would decrease the error rate. However, it would decrease the sensitivity as
well. A reasonable choice for E is between 0.1 and 0.001.
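The relation between E, P, and the threshold score can be sketched numerically. The sketch assumes that the expectation has the form E = K·n·N·e^(-λS), i.e. the pairwise p-value with the product mn replaced by the search space Nn, and that P = 1 - e^(-E); the function names and the parameter values are placeholders, not values from the text.

```python
import math


def e_value(score, n, N, K, lam):
    """Expected number of chance hits scoring at least `score`
    (assumed form: K * n * N * exp(-lam * score))."""
    return K * n * N * math.exp(-lam * score)


def p_at_least_one(E):
    """Probability of at least one chance hit, 1 - exp(-E)."""
    return -math.expm1(-E)


def score_threshold(E, n, N, K, lam):
    """Score above which fewer than E chance hits are expected."""
    return math.log(K * n * N / E) / lam


# Placeholder parameters: query of 350 residues, database of 3e7 residues.
K, lam, n, N = 0.1, 0.3, 350, 30_000_000
for E in (10, 0.1, 0.001):
    S = score_threshold(E, n, N, K, lam)
    print(E, round(S, 1), round(p_at_least_one(E), 5))
```

The loop shows the trade-off described above: lowering E from 0.1 to 0.001 raises the threshold by (ln 100)/λ score units, which removes chance hits but also weak true relationships.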
observations, Pearson (28) has derived statistical estimates for local alignment
with gaps, using the extreme value distribution for scores obtained from a
database search. A database search provides tens of thousands of scores from
sequences which are unrelated to the query sequence, and therefore are effect-
ively random. As discussed above, these scores are thus expected to follow the
extreme-value distribution. This is true as long as the gap penalties are not too
low. Otherwise the alignments shift from local to global and the extreme value
distribution no longer applies.
Since the logarithmic growth in the sequence length holds in this case, scores
are corrected first for the expected effect of sequence length. The correction is
done by calculating the regression line $S = a + b \ln n$ for the scores obtained in
a database search, after removing very high scoring sequences (probably related
sequences). The process is repeated as many as five times. The regression line
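and the average variance of the normalized scores are used to define the z-score:
$$z\text{-score} = \frac{S - (a + b \ln n)}{\sigma}$$
where $\sigma^2$ is that average variance. The probability of exceeding a given
z-score is taken from the extreme value distribution,
$$p(z\text{-score} > x) = 1 - \exp\left(-e^{-c_1 x - c_2}\right)$$
where $c_1$ and $c_2$ are constants, and the expectation value is defined as before by
$E(z\text{-score} > x) = N \cdot p$ where N is the number of sequences in the database (the
number of tests).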
This empirical approach has the advantage of internal calibration of the
accuracy of the estimates, and has proved to be very accurate in estimating the
statistical significance of gapped similarity scores (28). (See also refs. 18 and 29.)
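The length correction can be sketched as a least-squares fit of score against ln(length), repeated after discarding very high scores. This is a simplified stand-in, not the published procedure: the function names, the cutoff of three standard deviations, and the use of a single overall residual variance are all assumptions made here.

```python
import math


def _fit(xs, ys):
    """Least-squares line y = a + b*x, plus the RMS residual."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sigma = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n)
    return a, b, sigma


def length_corrected_z(lengths, scores, max_rounds=5, cutoff=3.0):
    """Fit S = a + b*ln(n), dropping high outliers for up to max_rounds,
    then return (a, b, sigma, z-scores for all input sequences)."""
    keep = list(range(len(scores)))
    for _ in range(max_rounds):
        a, b, sigma = _fit([math.log(lengths[i]) for i in keep],
                           [scores[i] for i in keep])
        kept = [i for i in keep
                if scores[i] - (a + b * math.log(lengths[i])) <= cutoff * sigma]
        if len(kept) == len(keep):
            break                 # no more very high scores to remove
        keep = kept
    # Final fit on the retained (presumably unrelated) sequences.
    a, b, sigma = _fit([math.log(lengths[i]) for i in keep],
                       [scores[i] for i in keep])
    z = [(s - (a + b * math.log(n))) / sigma for n, s in zip(lengths, scores)]
    return a, b, sigma, z
```

The z-scores are returned for every sequence, including those excluded from the fit, since the excluded high scorers are exactly the candidates of interest.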
the search should take place at the protein level, as proteins allow one to detect
far more distant homologies than DNA. Another aspect is that in DNA com-
parisons, there is noise from comparisons of non-coding frames (though this
latter issue still arises in DNA-as-protein searches). DNA versus DNA comparison
is typically only used to find identical regions of sequence in a database. One
would do such a search to discover whether another group has sequenced or
studied a gene, and to learn where it is expressed or where splice junctions occur.
In short, protein-level searches are valuable for detecting evolutionarily related
genes, while DNA searches are best for locating nearly identical regions of
sequence (see Table 1 for available comparison programs and the corresponding
types of comparison).
4.2 Databases
Next, it is necessary to select a database to search against. There are several
commonly used databases (e.g. GenBank, SwissProt, ESTs, etc.). For homology
searches, it is best to use a comprehensive collection of all known proteins. Two
such databases are available. One is the nr database at the NCBI website
(http://www.ncbi.nlm.nih.gov/). The nr (which stands for non-redundant) protein
database combines data from several sources (GenPept, SwissProt, PIR, PRF, and
PDB), removes the redundant identical sequences, and yields a collection with
nearly all known proteins. The second nr database is available at the ExPASy
website in Switzerland (http://www.expasy.ch/). Both databases are frequently
updated, to incorporate as many sequences as possible. Obviously, a search will
not identify a sequence that has not been included in the database, and since
databases are growing so rapidly, it is essential to use a current database.
The main sources of these non-redundant databases are the SwissProt database
and the TrEMBL database (10), the PIR database (30), and the GenPept database
(31). The SwissProt database is maintained at the ExPASy centre in Switzerland.
This is a non-redundant highly annotated database which offers a lot of valuable
biological information on almost all of its entries (more than 86000 in the latest
release, June 2000). Such information may include for example the description
of the function of a protein, its domain structure, post-translational modifica-
tions, etc. This database is supplemented by TrEMBL, which is a collection of
all the translations of EMBL nucleotide sequence entries not yet integrated in
SwissProt. For most of these entries some biological information is available,
usually based on sequence analysis carried out by the ExPASy team. PIR is another
database that offers a lot of biological information on entries through an exten-
sive annotation as well as classification to families and superfamilies and links
to alignments with other family members. GenPept is a database that contains
all translations of DNA sequences in the GenBank database.
Several specialized databases are also available, all of which overlap with the
composite non-redundant databases. For example, if one is interested in search-
ing for proteins of known structure, it is best to just search the smaller PDB
database. Other specialized databases are available for each of the fully sequenced
genomes, as well as for subsets of protein families (such as protein kinases
or immunoglobulins), etc. See Table 2 for a list of the main databases (see also
Chapters 9 and 10).
One may also wish to search DNA databases at the protein level. Programs
can do so automatically by first translating the DNA in all six reading frames and
then making comparisons with each of these conceptual translations. The nr
DNA database (containing most known DNA sequences except GSS, EST, STS, or
HTGS sequences) is useful to search when hunting new genes; the identified
genes in this database would already be in the protein nr database. Searches
against the GSS, EST, STS, and HTGS databases can find new homologous genes,
and are especially useful to learn about expression data or genome map location.
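The six conceptual translations mentioned above can be sketched as follows. The sketch assumes the standard genetic code, writes stop codons as '*', translates unrecognized codons as 'X', and drops incomplete trailing codons; the function and frame names are hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons give 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```

A search program would then compare the protein query against each of the six translated strings in turn.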
4.3 Algorithms
The choice of the comparison algorithm should be based on the desired com-
parison type, the available computational resources, and the goals of the search.
All standard comparison algorithms can be run over the Web and can be downloaded
from FTP sites to run locally (see Table 3). The rigorous Smith-Waterman
algorithm is available (as the SSEARCH program), as well as the FASTA program, within the
FASTA package. The Smith-Waterman algorithm is more sensitive than the others, but it is also
much slower. The FASTA program is faster, and with the parameter ktup set to 1,
is almost as sensitive as the Smith-Waterman algorithm (32, 18). The fastest
algorithm is BLAST, the newest versions of which support gapped alignments
(15) and provide a reliable, sensitive and fast option (the older versions are slower,
detect fewer homologs, and have problems with some statistics). Iterative pro-
grams like PSI-BLAST require extreme care in their operation, as they can provide
very misleading results; however, they have the potential to find more homologs
than purely pairwise methods.
4.4 Filtering
The statistics for database searches assume that unrelated sequences look
essentially random with respect to each other. Specifically, the theoretical results
that were obtained for the statistics of local alignments without gaps (see
Section 3.2) are subject to the restriction that the amino acid compositions of the
two sequences that are compared are not too dissimilar (20). Assuming that both
sequences are drawn from the background distribution, the amino acid
composition of both should resemble the background distribution. Without this
restriction the statistical estimates overestimate the probability of similarity
scores, and indeed, this is observed in protein sequences with unusual com-
positions (18, 29). The most common exceptions are long runs of a small number of
different residues (such as a poly-alanine tract). Such regions of a sequence may
spuriously obtain extremely high match scores. For this reason, it is recom-
mended to filter out these regions using programs such as SEG (33). The NCBI
or less. The acronym PAM stands for Percent of Accepted Mutations (and hence
the distance is in percentages) or for Point Accepted Mutations (and hence the
distance in number of mutations per 100 amino acids).
The PAM-1 matrix is then extrapolated to yield the family of PAM-k matrices.
Each PAM-k matrix is obtained from PAM-1 by k consecutive multiplications, and
is suitable for comparison of sequences which have diverged k%, or are k
evolutionary units apart. For example, PAM-250 = $(\text{PAM-1})^{250}$ reflects the frequencies
of mutations for proteins which have diverged 250% (250 mutations
per 100 amino acids). The actual scoring matrices that are used by search programs
are derived from the transition probability matrices and the background
probabilities. The score of each pair s(a,b) is defined as the logarithm of the
likelihood ratio of the transition probability $M_{ab}$ (mutation) versus the probability
of a random occurrence of the amino acid b in the second sequence, i.e.,
$s(a,b) = \log(M_{ab}/p_b)$.
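The extrapolation and the log-odds conversion can be illustrated on a toy two-letter alphabet. Everything numeric below is invented for the illustration: the transition matrix, the background frequencies (chosen to be its stationary distribution), the row-to-column convention for $M_{ab}$, and the use of the natural logarithm. The function names are hypothetical.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(pam1, k):
    """PAM-k as the k-th power of the PAM-1 transition matrix (k >= 1)."""
    result = pam1
    for _ in range(k - 1):
        result = mat_mul(result, pam1)
    return result


def log_odds(M, p):
    """Scores s(a,b) = log(M[a][b] / p[b])."""
    n = len(M)
    return [[math.log(M[a][b] / p[b]) for b in range(n)] for a in range(n)]


# Toy data: M[a][b] is the probability that letter a is replaced by b.
PAM1_TOY = [[0.99, 0.01],
            [0.02, 0.98]]
BACKGROUND = [2 / 3, 1 / 3]

pam250 = pam_k(PAM1_TOY, 250)
print([[round(x, 3) for x in row] for row in pam250])
print([[round(x, 2) for x in row] for row in log_odds(PAM1_TOY, BACKGROUND)])
```

In this toy case the rows of the 250th power are already close to the background frequencies, so the corresponding log-odds scores are close to zero; real PAM-250 matrices retain more signal because amino acid substitution patterns decay at different rates.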
The PAM matrices were later refined by Jones et al. (35) based on a much larger
data set. Significant differences were detected for substitutions that were
hardly observed in the original data set of (34).
The PAM-250 matrix. The PAM-250 matrix is one of the most extensively
used matrices in this field. This matrix corresponds to a divergence of 250
mutations per 100 amino acids. Naturally one may ask whether it makes sense
to compare sequences which have diverged this much. Surprising as it may
seem, when calculating the probability that a sequence remains unchanged
after 250 PAMs (this is given by the sum $\sum_a p_a M_{aa}$, where $p_a$ is the probability of a
random occurrence of amino acid a and $M_{aa}$ is the diagonal entry in the PAM-250
matrix that corresponds to the amino acid a) the outcome is that such
sequences are expected to share about 20% of their amino acids. For reference,
note that the expected percentage of identity in a random match is $100 \sum_a p_a^2$,
and for a typical distribution of amino acids (in a large ensemble of protein
sequences), we should expect less than 6% identities.
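Both sums are simple to evaluate once the frequencies are known. The sketch below uses hypothetical function names and, as an assumed reference case, a uniform distribution over the 20 amino acids, which gives exactly 5% random identity; the unequal frequencies of real proteins raise this figure somewhat.

```python
def percent_unchanged(p, M):
    """100 * sum_a p[a] * M[a][a]: expected percent identity after divergence.

    p maps each amino acid to its background frequency;
    M maps each amino acid to a dict of transition probabilities.
    """
    return 100.0 * sum(p[a] * M[a][a] for a in p)


def percent_random_identity(p):
    """100 * sum_a p[a]^2: expected percent identity in a random match."""
    return 100.0 * sum(f * f for f in p.values())


# Assumed reference case: all 20 amino acids equally frequent.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
uniform = {a: 1 / 20 for a in AMINO_ACIDS}
print(round(percent_random_identity(uniform), 6))
```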
To reduce the bias in the amino acid pair frequencies caused by multiple counts
from closely related sequences, segments in a block with at least x% identity are
clustered and pairs are counted between clusters, i.e., pairs are counted only between
segments less than x% identical. When counting pair frequencies between
clusters, the contributions of all segments within a cluster are averaged, so that
each cluster is weighted as a single sequence. Varying the percentage of identity
x within clusters results in a family of matrices BLOSUM-x, where x ranges from
30 to 100. For example, BLOSUM-62 is based on pairs that were counted only
between segments less than 62% identical.
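The clustering and weighted counting can be sketched for a single block of equal-length segments. The sketch assumes single-linkage clustering (a segment joins a cluster if it is at least x% identical to any member) and weights each segment pair by the inverse product of the two cluster sizes; the function names and the example block are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def percent_identity(s, t):
    """Percent identical positions between two equal-length segments."""
    return 100.0 * sum(x == y for x, y in zip(s, t)) / len(s)


def cluster_segments(segments, x):
    """Group segment indices so that segments >= x% identical share a cluster."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(segments)), 2):
        if percent_identity(segments[i], segments[j]) >= x:
            parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for i in range(len(segments)):
        clusters[find(i)].append(i)
    return list(clusters.values())


def weighted_pair_counts(segments, x):
    """Count amino acid pairs between clusters; each cluster weighs as one sequence."""
    counts = defaultdict(float)
    for c1, c2 in combinations(cluster_segments(segments, x), 2):
        weight = 1.0 / (len(c1) * len(c2))
        for i in c1:
            for j in c2:
                for a, b in zip(segments[i], segments[j]):
                    counts[tuple(sorted((a, b)))] += weight
    return dict(counts)


block = ["ACDE", "ACDE", "ACDF", "GHIK"]   # invented block of four segments
print(weighted_pair_counts(block, x=75))
```

In the example the first three segments fall into one cluster, so the block contributes the equivalent of one sequence pair (four columns) rather than three.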
where s(a,b) is the similarity of amino acids a and b according to some scoring
matrix. For a review on algorithms for multiple alignment and profile techniques,
see refs 7-9, 43, 44.
Table 4 Parameters for sequence comparison programs. PSI-BLAST and gapped BLAST are
executed by the same program (blastpgp). The default mode is a simple gapped BLAST (i.e.,
the parameter j is set to 1).
Finally, there is a third set of parameters which controls the output of the
program, e.g. how many results are reported, and how many alignments are
displayed. The number of hits reported is often controlled by the e-value
parameter (see Section 3.2). For example, by default, the BLAST programs will
report only matches with an e-value up to 10 (this parameter also affects the
sensitivity of the method, in an indirect manner). The total number of matches
is limited to the best 500, and detailed information with the alignment is
provided for up to 100 pairs. To retrieve more matches, these numbers can be
altered (see Table 4).
5 Interpretation of results
Interpretation of the results of a sequence database search involves first eval-
uating the matches, to determine whether they are significant and therefore
imply homology. The most effective way of doing so is through use of the
statistical scores (the e-values). The e-values are more useful than the raw or bit
scores, and they are far more powerful than percentage identity (which is best
not even considered unless the identity is very high) (18). Fortunately, the e-
values from FASTA, SSEARCH, and gapped BLAST seem to be accurate and are
therefore easy to interpret (18, 29).
The e-value (or expectation-value) of a match should measure the expected
number of sequences in the database which would achieve a given score. Therefore,
in the average database search, one expects to find ten random matches
with an e-value of 10; obviously, such matches are not significant. However,
lacking better matches, sequences with these scores may provide hints of function
or suggest new experiments. E-values below 0.01 would occur by chance only
very rarely, and are therefore likely to indicate homology, unless biased in some
way. E-values near 1e-50 are now seen frequently, and these offer extremely
high confidence that the query protein is evolutionarily related to the matched
target in the database.
Inferring function from the homologous matched sequences is a process still
fraught with difficulty. If the score is extremely good and the alignment covers
the whole of both proteins, then there is a good chance that they will share the
same or a related function. However, it is dangerous to place too much trust in the
query having the same function as the matched protein: functions do diverge,
and organismal or cellular roles may alter even when biochemical function is
unchanged. Moreover, a significant fraction of functional annotations in data-
bases are wrong (45), so one needs to be suspicious. There are other complexi-
ties; for example, if only a portion of the proteins align, they may share a
domain which only contributes an aspect of the overall function. It is often the
case that all of the highest-scoring hits align to one region of the query, and
matches to other regions need to be sought much lower in the score ranking.
For this reason, it is necessary to consider carefully the overlap between the
query and each of the targets.
Database search methods are also limited because most homologous sequences
6 Conclusion
One should neither have excessive faith in the results of a database search, nor
should they be blithely disregarded. The standard search programs such as
FASTA, gapped BLAST and SSEARCH are well-tested and reliable indicators of
sequence similarity, and their underlying principles are straightforward. These
programs and their parameters have been optimized over the hundreds of
thousands of runs performed every day. If one is careful about posing the database search
experiment and interprets the results with care, sequence comparison methods
can be trusted to rapidly and easily provide an incomparable wealth of bio-
logical information.
References
1. Sander, C. and Schneider, R. (1991). Database of homology-derived protein structures
and the structural meaning of sequence alignment. Proteins, 9, 56.
2. Hilbert, M., Bohm, G., and Jaenicke, R. (1993). Structural relationships of homologous
proteins as a fundamental principle in homology modeling. Proteins, 17, 138.
3. Pearson, W. R. (1996). Effective protein sequence comparison. In Methods in enzymology
(ed. R. F. Doolittle), Vol. 266, pp. 227-58. Academic Press.
4. Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search
for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443.
30. George, D. G., Barker, W. C., Mewes, H. W., Pfeiffer, F., and Tsugita, A. (1996). The PIR-
International protein sequence database. Nucleic Acids Res., 24, 17.
31. Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., Ouellette, B. F., Rapp, B. A., et al.
(1999). GenBank. Nucleic Acids Res., 27, 12.
32. Pearson, W. R. (1995). Comparison of methods for searching protein sequence
databases. Protein Sci., 4, 1145.
33. Wootton, J. C. and Federhen, S. (1993). Statistics of local complexity in amino acid
sequences and sequence databases. Comput. Chem., 17, 149.
34. Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978). A model of evolutionary
change in proteins. In Atlas of protein sequence and structure (ed. M. Dayhoff), Vol. 5,
Suppl. 3, pp 345-52. National Biomedical Research Foundation, Silver Spring, MD.
35. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The rapid generation of
mutation data matrices from protein sequences. Comput. Appl. Biosci., 8:3, 275.
36. Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein
blocks. Proc. Natl Acad. Sci. USA, 89, 10915.
37. Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Exhaustive matching of the
entire protein sequence database. Science, 256, 1443.
38. Risler, J. L., Delorme, M. O., Delacroix, H., and Henaut, A. (1988). Amino acid
substitutions in structurally related proteins. A pattern recognition approach.
Determination of a new and efficient scoring matrix. J. Mol. Biol., 204, 1019.
39. Johnson, M. S. and Overington, J. P. (1993). A structural basis for sequence
comparisons. An evaluation of scoring methodologies. J. Mol. Biol., 233, 716.
40. Henikoff, S. and Henikoff, J. G. (1991). Automated assembly of protein blocks for
database searching. Nucleic Acids Res., 19, 6565.
41. Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic
perspective. J. Mol. Biol., 219, 555.
42. Henikoff, S. and Henikoff, J. G. (1993). Performance evaluation of amino acid
substitution matrices. Proteins, 17, 49.
43. Gribskov, M. and Veretnik, S. (1996). Identification of sequence patterns with profile
analysis. In Methods in enzymology (ed. R. F. Doolittle). Vol. 266, pp. 198-211. Academic
Press.
44. Taylor, W. R. (1996). Multiple protein sequence alignment: algorithms and gap
insertion. In Methods in enzymology (ed. R. F. Doolittle). Vol. 266, 343-67. Academic
Press.
45. Brenner, S. E. (1999). Errors in genome annotation. Trends Genet, 15, 132.
46. Karplus, K., Barrett, C., and Hughey, R. (1998). Hidden Markov models for detecting
remote protein homologies. Bioinformatics, 14:10, 846.
Chapter 9
Networking for the biologist
R. A. Harper
EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus,
Cambridge, UK.
Chuck Berry
1 Introduction
Every research worker would like to have the tools on hand to make his job
quicker and more efficient, and with the advent of the World Wide Web many
of the tasks associated with molecular biology have become freely available
online. In the past when a scientist wanted to know something about a par-
ticular subject then the first option was to talk to colleagues in the laboratory
and ask for their advice. If that was not sufficient then it was off to the library to
scan abstracts or the latest journals for the relevant information.
However times are changing and so are working habits. Why ask questions
of people in your laboratory when you can ask the same question on the
Bionet newsgroups http://www.bio.net of research workers all over the world?
Why thumb through textbooks for references when you can type in keywords
to an Internet search engine such as Lycos or Alta Vista and get a satisfactory
answer in no time at all? But often you find that the major search engines index
everything on the Web, which makes it difficult to find exactly what you want.
So often it is more profitable to use search engines that are totally dedicated to
biology.
In Europe you could use BiowURLd http://search.ebi.ac.uk:8888/compassl or
BioHunt http://www.expasy.ch/BioHunt, which deal exclusively with biology-related
subjects. Another comprehensive listing exists at the Virtual Library in the
BioSciences division, http://www.vlib.org/Biosciences.html, and from China there is
the NEE-HOW project, http://biology.neehow.org, which is an invaluable resource
for research workers from the Pacific Rim.
In the USA one of the original and best lists of biological resources, put together
by Keith Robison, can be found at Harvard http://golgi.harvard.edu/biopages.list
and of course there is the ever popular Pedro's BioMolecular Research Tools at
R. A. HARPER
NETWORKING FOR THE BIOLOGIST
Figure 1 Primary uses of the Web. (Copyright 1994-1998 Georgia Tech Research
Corporation. All rights Reserved. Source: GVU's WWW User Survey
www.gvu.gatech.edu/user_surveys)
Figure 2 Primary places of WWW access. (Copyright 1994-1998 Georgia Tech Research
Corporation. All rights Reserved. Source: GVU's WWW User Survey
www.gvu.gatech.edu/user_surveys)
are being used the most. If the scientist insists that they can get by with their
VT100 terminal and a text based Lynx browser very soon they will be unable to
browse sites that are visually rich or rely on Java scripts or CORBA interfaces. It is
clear from the latest survey results that the most widely used computing
platform is Windows 95 (Figure 3). No doubt this is partly due to the popularity of
Microsoft Internet Explorer which comes bundled with the operating system.
The browser wars between Netscape and Microsoft have already led to legal
battles in the American courts.
Figure 4 Average round trip times in ms from EMBnet node CAOS/CAMM in the
Netherlands to the EBI in the UK.
In 1995, a Network Usage and Quality Advisory Group of the Dutch Network
organisation SURFnet, defined 'an upper RTT (Round Trip Time) limit of 125 msec.
without packet loss' as a minimum QOS (Quality of Service) level for interactive online
work. The RTT values from the Dutch EMBnet node to the EBI can be
represented by the following graph.
This shows that the RTT from the Netherlands to the EBI in the UK has
consistently been below the recommended time of 125 msec, which means
scientists from the Netherlands should have no difficulties in contacting the EBI
web server. It is interesting to note that the results from October 1998 show that
of the thirty-two nodes monitored, twenty-two have a RTT of less than 125 msec.
This must surely be good news for networking within Europe (Table 1).
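The quality-of-service criterion quoted above (a round trip time of at most 125 msec, without packet loss) amounts to a simple check per node. The sketch below is illustrative only: the function names, node names, and measurements are invented, not taken from Table 1.

```python
QOS_RTT_LIMIT_MS = 125.0   # upper RTT limit quoted from the SURFnet advisory group


def meets_qos(avg_rtt_ms, packet_loss_pct):
    """True if a node satisfies the limit: RTT <= 125 ms and no packet loss."""
    return avg_rtt_ms <= QOS_RTT_LIMIT_MS and packet_loss_pct == 0


def count_within_qos(measurements):
    """Count nodes meeting the criterion.

    measurements maps a node name to (average RTT in ms, packet loss in %).
    """
    return sum(meets_qos(rtt, loss) for rtt, loss in measurements.values())


# Invented measurements for three hypothetical nodes.
example = {
    "node-a": (42.0, 0),
    "node-b": (118.5, 0),
    "node-c": (310.0, 2),
}
print(count_within_qos(example), "of", len(example), "nodes meet the limit")
```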
It is essential that research workers learn to use the services provided for
them within their own countries. Penalties are always paid when you network
across international borders. It would seem that the more borders you cross, the
less efficient the network becomes. However, networking within your own
country is more efficient because more often than not a basic infrastructure
already exists between the major universities. In the mind of the molecular
biologist, however, Mecca is either at the NCBI or EBI and that is the direction
they religiously point their browsers to, only to suffer frustration when they
cannot get their work done due to poor bandwidth and increased traffic directed
towards these sites. For this reason EMBnet tries to co-ordinate its activities so
that all the EMBnet nodes provide easy access for database query and retrieval.
Many of the EMBnet nodes use a mirror package to update their databases on a
daily basis, via remote ftp from the databases stored on the EBI anonymous ftp
server at ftp://ftp.ebi.ac.uk/pub/databases.
The major databases such as EMBL or SWISS-PROT are then indexed at the
EMBnet nodes and can be queried with the SRS package. SRS, which was
developed at EMBL Heidelberg by Thure Etzold, has been adopted by many of the
EMBnet nodes throughout Europe and also abroad. SRS is also unique in that it
is able to index very many different databases. A list of all EMBnet sites that use
SRS is given in Table 2.
Table 1 EMBnet Internet PING results from 1-30 September 1998. These give
network response times from the CAOS/CAMM centre in Nijmegen (The Netherlands) to all
of the European EMBnet sites. The main figures show the average, minimum, and maximum
RTT.
EBI you simply send an e-mail message to [email protected] and include in the
main body of the message the word help; full instructions on how to operate the
service will then be sent by e-mail.
A similar method for sequence retrieval is employed by the NCBI, and the e-mail
query system utilizes the Entrez retrieval system that they have developed
Table 2 A list of the EMBnet nodes and some other sites around the world which support
SRS
for their website. Many people would argue that getting sequences via e-mail is
old-fashioned technology. It is primitive in that it only delivers simple
ASCII-formatted text. However, the e-mail query server at the NCBI is clever enough to
be able to return the sequence to you in a variety of different formats including
GenBank, FASTA, or HTML.
It is often more convenient to shoot off a query by e-mail and get an answer
within a few minutes than it is to struggle with trying to access a website
that has bandwidth problems. The address for the NCBI e-mail server is at
[email protected]. To receive full instructions on how the server works just
send an e-mail message to [email protected] and in the main body of the
message type the word help. I have often found that people who have used an e-
mail server generally have a better understanding of databases and sequence
retrieval than those who have only used a WWW interface.
DB n
UID U30150,U30153
DOPT f
Will search the nucleotide database for entries whose accession numbers are U30150 and
U30153, and display them in FASTA format.
DB m
UID 88055872
DOPT r
HTML
Will search the MEDLINE database for the record with MEDLINE UID 88055872 and dis-
play it in MEDLINE Report format. Send the results in HTML format for viewing through
a WWW browser.
DB p
UID sp|P11598|
DOPT m
Will search the protein database, using a FASTA formatted UID, to retrieve the entry whose
Swiss-Prot accession number is P11598, and display the MEDLINE links for that protein
record as document summaries.
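The three examples above share one pattern: a DB line, a UID line, a DOPT line, and optionally an HTML line. A minimal Python sketch of composing such a message body follows. The function name build_query is hypothetical, only the parameters and values that appear in the examples are used, and the sketch composes text only; it does not send any mail.

```python
def build_query(db, uids, dopt, html=False):
    """Compose the body of a retrieval request from DB, UID and DOPT lines.

    db   -- database code as used in the examples ('n', 'm' or 'p')
    uids -- list of identifiers, joined with commas on the UID line
    dopt -- display option code as used in the examples ('f', 'r' or 'm')
    html -- if True, append the HTML line to request HTML-formatted results
    """
    if not uids:
        raise ValueError("at least one UID is required")
    lines = [f"DB {db}", "UID " + ",".join(uids), f"DOPT {dopt}"]
    if html:
        lines.append("HTML")
    return "\n".join(lines)


# The first and second examples from the text, rebuilt from their parts.
print(build_query("n", ["U30150", "U30153"], "f"))
print()
print(build_query("m", ["88055872"], "r", html=True))
```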
BEGIN
>XYZ012 mygene XYZ
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
The actual search request begins with the mandatory parameter 'PROGRAM' in the first
column followed by the value 'blastn' (the name of the program) for searching nucleic
acids. The next line contains the mandatory search parameter 'DATALIB' with the value
'month' for the newest nucleic acid sequences. The third line contains an optional
EXPECT parameter and the value desired for it. The fourth line contains the mandatory
'BEGIN' directive, followed by the query sequence in FASTA/Pearson format. Each line of
information must be less than 80 characters in length. Once the e-mail message has been
sent it will be processed automatically at the NCBI and the results returned to your e-mail
address once they have been computed.
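The layout of a search request described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function name build_blast_request is hypothetical, a single space between each parameter and its value is assumed, and the EXPECT value of 10 in the usage lines is purely illustrative because the value used in the original example is not given. The sketch only composes the message text and enforces the rule that every line be shorter than 80 characters.

```python
MAX_LINE = 79   # each line must be less than 80 characters
SEQ_WIDTH = 60  # sequence characters per line (an arbitrary choice below 80)


def build_blast_request(program, datalib, title, sequence, expect=None):
    """Compose PROGRAM, DATALIB, optional EXPECT, BEGIN and a FASTA record."""
    lines = [f"PROGRAM {program}", f"DATALIB {datalib}"]
    if expect is not None:
        lines.append(f"EXPECT {expect}")  # optional parameter
    lines.append("BEGIN")                 # mandatory directive before the query
    lines.append(f">{title}")             # FASTA/Pearson header line
    # Split the sequence into fixed-width chunks so that no line is too long.
    lines.extend(sequence[i:i + SEQ_WIDTH]
                 for i in range(0, len(sequence), SEQ_WIDTH))
    for line in lines:
        if len(line) > MAX_LINE:
            raise ValueError(f"line longer than {MAX_LINE} characters: {line[:30]}...")
    return "\n".join(lines)


query = "tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat"
print(build_blast_request("blastn", "month", "XYZ012 mygene XYZ", query, expect=10))
```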
The BLAST algorithm was developed by the National Center for Biotechnology
Information at the National Library of Medicine. The BLAST family of
programs employs this algorithm to compare an amino acid query sequence
against a protein sequence database or a nucleotide query sequence against a
nucleotide sequence database, as well as other combinations of protein and
nucleic acid. If you use BLAST as a tool in your published research, the following
reference should be cited:
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W.,
and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., Sept. 1, 25(17), 3389.
It used to be that the NCBI exclusively provided access to BLAST, but in recent
years it has become possible to run BLAST searches from many different sites
around the world, a clear indication that this program has become a very
popular method for doing homology searches. That it appears in so
many places may be because it is available for free from the NCBI
anonymous ftp server at ftp://ftp.ncbi.nlm.nih.gov/blast/.
Historically the EBI has always provided homology searches through FASTA.
The following reference should be cited when you have used FASTA:
2.4.2 Compugen
Compugen have succeeded in introducing the Bioccelerator to many pharmaceutical
companies to aid them in their search for novel drugs. The EBI
has a Bioccelerator, which is online and is available for public use. At the EBI
there are two different interfaces to this service: the one provided by Compugen,
called GeneWeb, and a simple custom interface developed at the EBI at
http://www.ebi.ac.uk/bic_sw. The interface to the BIC-SW at the EBI is very compact
and easy to use. Most people just accept the default settings, paste in their query
sequences and run the program. The interface page is shown in Figure 7 and
some results are shown in Figure 8.
If you are from the Mediterranean area then perhaps it would be more
convenient to try the GeneWeb interface from the Weizmann Institute in
Israel, which is also open to the public for unregistered users. The URL is
(http://sgbcd.weizmann.ac.il:80/cgi-bin/gcnweb/main.cgi).
Figure 10 The first few lines of some search results from an FDF search at the Swiss
EMBnet node.
and fill in your search criteria as keywords. For example, in Figure 12 we see a
sample query using the fields Organism (Rhizobium), Authors (Ausubel),
SeqLength (a range 500:700), and the date (a range 1-Jan-1998:30-Dec-1998).
SRS will then display two hits in Swiss-Prot for that particular year with a
sequence length between 500 and 700 (Figure 13). It should also be noted that SRS
gives the possibility to launch an application such as BLAST or FASTA for
any of the sequences that you care to select. You may also select different views
of a sequence, for example the FASTA format, which then allows you to launch
a multiple sequence alignment using ClustalW directly from within SRS. This
Figure 14 The results from launching ClustalW as an application on the results in Figure 13.
method is a great time saver since there is no need to cut and paste your
sequence into a separate ClustalW application (Figure 14).
4 Submitting sequences
The research worker not only wants to query, retrieve, and analyse sequences;
occasionally they also want to submit their own sequences to the databanks, be
it GenBank or EMBL. The three major organizations that collect sequence
information work in collaboration with each other, so that sequences entered into
GenBank are transferred daily by FTP to both EBI and DDBJ (and vice versa) in an
attempt to keep the major databases synchronized.
At any given time the three institutes are continually swapping data, so it is a
false idea to believe that any one database is more current than the others. All
three institutes have online methods of submitting sequence data through the
Web. The NCBI were the first to come online with BANKIT. The EBI then
followed with WEBIN, and the Japanese at DDBJ have SAKURA.
It should also be noted that the NCBI developed a stand-alone program
for Macintosh, PC, and Unix computers called SEQUIN that allows the end-user to enter their
data from a personal computer and to send the submission via e-mail or to
simply post the disk to the appropriate institute, where it is then uploaded into
the database. Sequin is strongly recommended if you have bulk submissions to
make,
Figure 15 The welcoming page of the BankIt service from the NCBI.
Figure 16 The welcoming screen of the Sequin data submission program from the NCBI.
robust error checking and accommodates very long sequences and complex
annotations.
Although Sequin has been implemented by the NCBI, the opening screen
allows you to select which database you would like to submit your sequence to, be
it GenBank, EMBL, or DDBJ. Usually when a sequence is submitted there may be
a process whereby the submitter has to be in contact with the annotators of the
sequence by telephone to clarify certain details. Therefore it is wise to choose a
submission centre in your geographical region if you want to avoid long-distance
telephone calls. A screen capture of Sequin is shown in Figure 16.
Once you have completed the submission, depending on which database you
selected at the beginning, you will be prompted to send an e-mail to
gb-[email protected] for NCBI, [email protected] for EMBL, or [email protected]
for DDBJ. Sequin runs on Macintosh, PC/Windows, and UNIX computers. The
program itself, along with its on-line help documentation, is available by anonymous
FTP from the EBI (UK) at ftp://ftp.ebi.ac.uk/pub/software/sequin/ or from the
NCBI (USA) at ftp://ncbi.nlm.nih.gov/sequin/. A useful FAQ to help you if you run
into problems during submission can be found at
http://www.ebi.ac.uk/~sterk/sqndocs/index.html.
submitting your sequence. You are able to check your sequence data prior to
submission for potential vector contamination by running a BLASTN search
against EMVEC, a vector database containing information on more than 2000
vectors from the EMBL/GenBank/DDBJ Database SYN(thetic) division. The results
will list sequences producing significant alignments and associated information
like vector name, score, alignment, etc. The EBI suggests that you remove vector
contamination from your sequence data before submitting to the database.
5 Conclusions
Historically there has been a collaboration between EBI, NCBI, and DDBJ. These
three sites are still the only places that have the infrastructure set up to handle
Figure 18 A page from the SAKURA DNA submission system from DDBJ.
Chapter 10
SRS—Access to molecular
biological databanks and
integrated data analysis tools
D. P. Kreil and T. Etzold
EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus,
Cambridge, UK.
1 Introduction
This first section gives an introduction to SRS. Section 2 (A user's primer) is a
tutorial that demonstrates basic tasks: simple database queries, exploiting links
between databases, exploration of results, and launching analysis tools. Section 3
(Advanced tools and concepts) builds on the skills imparted by the tutorial, and
shows how to refine queries, create custom views on data, and use distributed
SRS resources. Section 4 (SRS server side) introduces aspects of using a local SRS
installation, and outlines how to take advantage of one. Section 5 (Where to
turn for help) suggests where to turn to, if this chapter does not address a par-
ticular question or problem.
entries. SRS was first applied to databases with information on protein or nucleo-
tide sequences. It has since developed into a much more general data-retrieval
tool. It is used, for example, for bibliographical databases, hierarchically struc-
tured databases like taxonomies, or clinical data on mutations.
Besides giving researchers the freedom to store information the way they
want to, an approach centred on the use of plain text files benefits from a
medium that is portable across different computer architectures and is usually
easy to read for humans. In the integration of distributed resources, this can be
quite helpful in resolving conflicts that may be triggered by the individually
evolving components.
SRS parsers, definitions of database structures and interfaces to analysis tools,
and a considerable amount of the SRS core functionality itself are written in
Icarus, a language especially designed for that task. Icarus is used to define de-
scriptive data structures (meta-data) which are extensively used to represent SRS
concepts like database structures. Icarus code in parsers is interpreted. These
two design features allow rapid development: using meta-data reduces the
amount of code to be written, and interpreted parsers can be modified and
tested fast—it often takes just a few hours to integrate a new database starting
from scratch. Not only has this approach proven to scale very well with the
number of databases to integrate, the flexible combination of a recursive break-
down of structure and the powerful in-place processing of parsed data that is
characteristic of SRS parsers handles even very complex database formats
elegantly. Developers who have used an interpreted language before will have
little difficulty adapting to Icarus. We think this is reflected by its ready
acceptance by SRS server maintainers around the world, who have added well
over a hundred databases to the system.
Icarus is evolving towards a general-purpose object-oriented programming
language with special support for recursive lazy parsing and definition of meta-
data structures. We expect that it will play a central role in scripting and large-
scale data analysis in the future of SRS.
SRS integrates and interfaces to databases and analysis tools of other re-
searchers, and it is also often used as an engine for database access in other
systems like OPM (8) and BioKleisli (9): 'wrap and be wrapped'! Communication
with an SRS system can be through the World Wide Web (the most
popular form of access), from the Unix command line, from within an Icarus
script, or using the C application programming interface (API). Current new
developments include prototypes of a Corba server, and of a Perl API. Adding
support for other language APIs (e.g. for Java via JNI) would be straightforward.
2 A user's primer
This section will introduce core SRS functionality with simple step-by-step
instructions. It concludes with an overview of the screens covered in the primer.
We suggest that new users follow the examples given below. The concepts are
much more readily understood with some hands-on experience.
Figure 1 (a) The SRS Home Page at EMBL-EBI, http://srs.ebi.ac.uk/. (b) The 'Top Page': Select the databases to query.
If the list of results is long, the user can page through the list using the links
at the bottom of the screen ('go to entries in chunk [...]'). Every item in the list
of results has an identifier composed of database name and entry-id, such as
'SWISSPROT:LPTR_BACST'. They are hyper-linked to the respective complete
entries.
To submit another query of the same database(s), press the 'Query Form'
button and continue with Protocol 1, step 5. For a new query, press the 'Top Page'
button and go to Protocol 1, step 3. See Figure 6 for more options.
Protocol 2 shows how to apply a link query to a selected set of entries. The
example query retrieves an EMBL entry that contains the nucleotide sequence of
the gene that encodes the protein described by 'SWISSPROT:TCRB_BACSU'. In
addition, entries that hold larger stretches of genomic DNA containing this gene
are retrieved (such as the complete genome of Bacillus subtilis).
Figure 4 More examples of Views showing the same selection of entries. The screen on the right (b) employs a Java Applet to display
various local protein properties as controlled by the user.
Figure 5 (a) The application launch page of the 'SW' (Smith-Waterman search) application program. Here users can set application
parameters. Clicking on a hyper-linked parameter prompt will open a separate window with related help text. This window is re-used to
reduce screen clutter. (b) Results of an 'SW' application launch as seen with the view 'SwMofeFamilies'.
2.5 Overview
Here we give a flow-chart like overview of the interface screens covered by the
protocols of this section (Figure 6). Users may abort operations and return to the
Top Page or a new Query Form by pressing the respective buttons shown at the
top of most screens.
Figure 6 An overview of the pages discussed so far. The thick arrows show the path of
usual progression as described in the protocols. There may be an intermediate page before
the list of results after an application launch.
'homo sapiens sapiens', or a combination of all these terms? The index browser is
the easiest way to inspect the terms a query can be matched against. Always
consult the index browser when working with an unfamiliar database or data-
base field, and when in doubt. Protocol 5 shows how to do that.
Looking at the returned values, one sees the need to restrict the search to
'human' without a trailing wildcard '*' (to exclude, e.g. the human viruses).
Checking the index for terms matching 'homo*', one respectively finds the same
number of entries for 'homo' and 'homo sapiens' as for 'human'. Queries using
logical operators (see below) can be used to check the suspicion that the three
terms are equivalent and that it suffices to use only one of them.
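The effect of a trailing wildcard on an index lookup can be illustrated with a short Python sketch. The index contents and entry counts below are hypothetical and chosen only to mirror the situation described; the matching is done with Python's own fnmatch module, not with SRS, and serves merely as an analogy for how 'human' and 'human*' select different sets of index terms.

```python
from fnmatch import fnmatchcase

# Hypothetical excerpt of an 'Organism' index: term -> number of entries.
organism_index = {
    "homo": 5000,
    "homo sapiens": 5000,
    "human": 5000,
    "human adenovirus": 12,
    "human immunodeficiency virus": 40,
}


def browse(index, pattern):
    """Return the index terms matching `pattern`, where '*' matches any text."""
    return {term: count
            for term, count in sorted(index.items())
            if fnmatchcase(term, pattern)}


for pattern in ("human", "human*", "homo*"):
    print(pattern, "->", browse(organism_index, pattern))
```

Without the wildcard only the exact term is matched; with it, the virus entries are pulled in as well, which is what the index browser lets the user see before running the query.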
Figure 7 The index browser page for the database field 'Organism'.
Figure 8 This query retrieves all entries on tetracycline that are relevant to antibiotic
resistance but are neither trans-membrane proteins nor peptides. Note that one can directly
search for the phrase 'antibiotic resistance' in the 'Keyword' index. To search for
entries described by 'resistance protein', one has to ask for each word separately
(namely, 'resistance & protein') because the index for 'Description' only contains single
words (cf. Index Browser, above).
and that expressions are evaluated strictly left to right (all operators have the
same precedence). However, the drop-down menu 'Combine searches with' can
be used to select a logical operator that will be applied to join the lines of the
query form. Figure 8 shows a query that retrieves all entries on tetracycline that
are relevant to antibiotic resistance but are neither trans-membrane proteins
nor peptides.
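Strict left-to-right evaluation with equal precedence can give a different answer from the precedence rules familiar from programming languages. The Python sketch below illustrates this on sets of entry numbers. It is an analogy rather than SRS code: the function name and the tiny index are hypothetical, '&' and '!' (but not) are taken from the examples in this chapter, and the use of '|' for a logical OR is an assumption.

```python
def evaluate(tokens, index):
    """Evaluate alternating terms and operators strictly left to right.

    tokens -- e.g. ["a", "|", "b", "&", "c"]
    index  -- maps each term to the set of entries that contain it
    """
    if len(tokens) % 2 == 0:
        raise ValueError("expected: term (operator term)*")
    result = set(index.get(tokens[0], ()))
    for op, term in zip(tokens[1::2], tokens[2::2]):
        hits = set(index.get(term, ()))
        if op == "&":      # AND: keep entries found in both
            result &= hits
        elif op == "|":    # OR (assumed symbol): keep entries found in either
            result |= hits
        elif op == "!":    # BUTNOT: drop entries found on the right
            result -= hits
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


index = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}  # hypothetical index

# Left to right: (a | b) & c
print(evaluate(["a", "|", "b", "&", "c"], index))
# With '&' binding more tightly, a | (b & c) would be a different set:
print(index["a"] | (index["b"] & index["c"]))
```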
Figure 9 A query to retrieve entries described as 'tetracycline' and with a sequence about
which there is disagreement in the available literature.
Figure 10 The Query Manager page offers previous results for reuse and allows for more
complex operations.
to keep for reference, run a new query as shown in Figure 9. Go to the Query
Manager page, select the last two entries, and finally choose 'BUTNOT' as
operator and click on the 'combine' button (Figure 10).
Advanced users will learn to appreciate the option to enter more complex
query expressions directly. The previous result can also be obtained by entering
'[SWISSPROT-Description: tetracycline ! peptide] &
[SWISSPROT-Keywords: antibiotic resistance ! transmembrane] !
([SWISSPROT-FtKey: conflict] > parent)' (without the quotes) into the window next to
the 'expression' button. Press the 'expression' button to request the evaluation
of the query. This example demonstrates several features:
(a) Simple queries take the form '[DatabaseName-FieldName:QueryString]'.
Logical operators may be used with the same restrictions as in the Query
Form. For brevity, the abbreviation of the field name, as displayed in the
Index Browser, can be used, e.g. '[SWISSPROT-des: tetracyc*]' to query
the 'Description' field.
(b) Queries can be combined using logical operators. Here, parentheses may be
used for grouping.
(c) The syntax for a link query is 'A < B' or 'A > B', where A and B can each be a
database name or the name of a set of entries (such as those automatically
assigned, e.g. 'Q2'). The first expression retrieves those entries of A that are
linked to B, while the second one retrieves the entries of B that are linked to
A (as suggested by the points of the 'arrows'). Thus, 'A > B' and 'B < A' are
equivalent expressions.
(d) Special links: Some entries can have sub-entries (such as the SWISS-PROT
Features). These are indexed and can be queried just like the main entries.
Linking with the special pre-defined 'parent' provides a mechanism to access
the entries that contain the retrieved sub-entries. There are other special
links for moving up and down in hierarchical databases and to link hits of
a search application to the originally searched database ('searchdb').
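The meaning of the two link operators in point (c) can be sketched in Python with plain sets. This is an illustration, not SRS code: the function names are hypothetical, the EMBL entry identifiers are invented, and links are modelled as unordered pairs because SRS links can be followed in both directions.

```python
def linked(source, target, links):
    """Return the entries of `target` linked to at least one entry of `source`."""
    found = set()
    for x, y in links:  # each link is usable in both directions
        if x in source and y in target:
            found.add(y)
        if y in source and x in target:
            found.add(x)
    return found


def a_gt_b(a, b, links):
    """'A > B': the entries of B that are linked to A."""
    return linked(a, b, links)


def a_lt_b(a, b, links):
    """'A < B': the entries of A that are linked to B."""
    return linked(b, a, links)


swissprot = {"SWISSPROT:TCRB_BACSU", "SWISSPROT:LPTR_BACST"}
embl = {"EMBL:X1", "EMBL:X2"}                 # invented identifiers
links = {("SWISSPROT:TCRB_BACSU", "EMBL:X1")}

print(a_gt_b(swissprot, embl, links))  # the linked EMBL entries
print(a_lt_b(swissprot, embl, links))  # the SWISS-PROT entries that have a link
```

Written this way, the equivalence of 'A > B' and 'B < A' follows directly: both calls reduce to linked(A, B, links).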
SRS Links can always be searched in both directions; they are bi-directional. If a
specialized database explicitly refers to a common (e.g. repository) database, one
can thus also request linked entries of the specialized database starting from an
entry in the common database. The designers of the common database do not
need to know even of the existence of the specialized database. SRS also finds
entries that are linked via an intermediary. For example, a SWISS-PROT entry
that has links to both EMBL and PDB creates (indirect) links between the respective
EMBL and PDB entries. Consequently, indirect links can involve any
number of intermediate linking steps. SRS uses the shortest way through the
net of databases when resolving a link. SRS server administrators can set up
penalties reflecting the desirability of using a particular inter-database link.
Users can choose a different path through the network by explicitly requesting a
series of links ('A > B > ... > C').
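Finding the shortest way through a net of databases in which each link carries a penalty is a classic shortest-path problem. The Python sketch below uses Dijkstra's algorithm as one way of solving it; the text does not say which method SRS uses internally, so this is a stand-in technique, and the databases, links, and penalty values shown are hypothetical.

```python
import heapq


def cheapest_path(links, start, goal):
    """Return (total penalty, path) for the cheapest route, or None if none exists.

    links -- maps a pair of database names to the penalty of that link;
             every link can be followed in both directions.
    """
    graph = {}
    for (a, b), penalty in links.items():
        graph.setdefault(a, []).append((b, penalty))
        graph.setdefault(b, []).append((a, penalty))

    queue = [(0, start, [start])]  # (penalty so far, database, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, penalty in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + penalty, neighbour, path + [neighbour]))
    return None


# Hypothetical penalties: the direct EMBL-PDB link is made unattractive.
links = {
    ("EMBL", "SWISSPROT"): 1,
    ("SWISSPROT", "PDB"): 1,
    ("EMBL", "PDB"): 5,
}
print(cheapest_path(links, "EMBL", "PDB"))
```

With these penalties the route via SWISSPROT is preferred over the direct link, which is the kind of preference an administrator can express; an explicit 'A > B > ... > C' request overrides it.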
display. Values are extracted from the original entry format of the database
text file, and the data are usually re-formatted for easy readability and con-
cise display. Table Views are ideal for summaries, excerpts, and also when
data from different databases needs to be combined. All the Views shown in
Figure 3, Figure 4, and Figure 5 are Table Views.
The View Manager allows the creation of more complicated Views, such as the
one shown in Figure 5b. Also, Views can be deleted. Named custom Views can be
created from scratch, or derived by editing existing Views. Named custom Views
are then available just like the pre-defined Views. Users can specify the type of
the View (List View or Table View), and whether the abbreviated forms or the
complete names of the displayed database fields are to be shown in Table Views.
Views are defined for a set of databases (the root-libraries). Views also may in-
clude fields that are only present in some of the chosen root-libraries. Many
elaborate Views link to additional databases that are to be specified (the leaves of
the View). When all the involved databases have been selected, the user chooses
the respective database fields that should be shown, and the formats in which to
display data objects (e.g. sequences, alignments) of particular fields.
Advanced options allow the following:
(a) 'Display only number of linked entries': The query that links a displayed
entry to a leaf database may result in a set of several entries. For summaries,
the size of this set (instead of the individual entries) can be displayed; this
includes a hyper-link to the set of entries for inspection of details.
(b) 'Use view to display entries': In List Views, the entries of each leaf-database
can be shown in a specified View.
Figure 11 For this List View, the fields 'AccNumber' (accession number), 'Description',
'CitationNo', 'Authors', 'Citation', 'Sequence', and 'FtDescription' (feature sub-entry
description) have been selected for display. The 'Swiss' (SWISS-PROT) format has been
chosen for the sequence field.
(c) 'Use query instead of link': This allows complex operations, such as per-
forming a link following a specified path through the net of databases, or
excluding certain entries. The set 'entry' holds the entry to be displayed and
can be used in the query expression.
Figure 12 A typical DATABANKS entry. The entry contains a copy of the respective remote
SRS databank information page, which includes a description, references and links, as well
as detailed documentation of database fields and indices. It concludes with a listing of
alternative sites that offer ENZYME. Direct links to these sites and the remote query forms
for ENZYME are provided. For users in the network vicinity of a particular DATABANKS server,
the relative response times compiled by that server give a clue to the net distances to other
sites ('N/A' indicates problems connecting at the specified time).
Figure 13 Query for databanks that have a description containing the terms 'sequence'
and 'align'. The second line of the query form requests that the results be restricted to
one representative databank for each group of alternatives.
Figure 14 The results of a query for databanks named 'PIRALN'. The number of indexed
entries and the release number (where assigned by the server maintainers) help users to
choose a nearby server that offers a current version of the appropriate database.
with '+' instead of with spaces. To include literal spaces in your parameters,
replace them with '%20'.
There are various ways in which entries can be displayed, for example:
(a) As plain lists of entries: Specify your query as you would in the expression
box of the Query Manager.
(b) As individual entries: Specify the '-e' (entry) switch together with an SRS
query, e.g. '.../wgetz?-e+[SWISSPROT-ID:LPTR_BACST]'.
(c) As lists of entries that can be browsed: Specify the '-sl' (sequence list)
switch with an SRS query. This yields a page that looks like the list of results
in the standard SRS Web interface.
For each of the above, a particular View may be requested, e.g.
'-view+SequenceSimple'.
Some SRS functions can be accessed directly as entry points into the system,
e.g. requesting the query form for a specified set of databases. The switch '-l'
(libraries) sets the databases to operate on, and the '-fun' (function) switch selects
the operation requested, e.g.
'.../wgetz?-fun+PageQueryForm+-l+SWISSPROT%20SWISSNEW'.
To return to a previous session, or to maintain data across several wgetz
calls, a user context needs to be specified with the '-id' switch. The id of a
session is part of most of the links it displays; it is easily seen in the 'resume'
link on the Top Page (cf. Figure 6). Chapter 3 of the on-line manual contains
further information on linking to SRS servers using the Web.
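The wgetz examples in this section all follow the same pattern: the parameters are joined with '+', and a literal space inside a parameter is written as '%20'. A minimal Python sketch of assembling such a link is given below. The function name and the server address are hypothetical, and only the space character is encoded; other characters that might need escaping in a URL are not handled.

```python
def wgetz_url(base, *params):
    """Join wgetz parameters with '+' and write literal spaces as '%20'."""
    encoded = [p.replace(" ", "%20") for p in params]
    return base + "?" + "+".join(encoded)


BASE = "http://srs.example.org/srs/wgetz"  # hypothetical server address

# A single entry, as in point (b) above.
print(wgetz_url(BASE, "-e", "[SWISSPROT-ID:LPTR_BACST]"))

# The query form for two databases; the space between them becomes '%20'.
print(wgetz_url(BASE, "-fun", "PageQueryForm", "-l", "SWISSPROT SWISSNEW"))
```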
Future versions of SRS will also support other mechanisms of remote access
to SRS servers. We have already developed a prototype Corba server. Generally,
SRS is able to serve structured data and methods that operate on them to any
object oriented or procedural environment.
More examples can be found in the 'demo' sub-directory of the SRS installa-
tion. This is the interface to use for the most extensive access to SRS features.
However, only the functions introduced in the examples are guaranteed to
be supported in future versions—other functions may change as we improve
and extend the system.
(b) More and more, Icarus allows access to SRS functionality. SRS queries and
access to token tables are already supported. Token tables hold the results
of the parsing process. In future versions, application launching and all other
SRS functions and data will be accessible through Icarus. The example shown
in Figure 15a prints the entry-ids and the molecular weights of all matches to
the query 'SWISSPROT-des:Tetracycline*'. The example shown in Figure
15b takes an SRS query string on the command line and dumps the 'fields'
token table. See Chapters 5-7 of the on-line manual to learn more about
Icarus.
(c) Future versions of SRS will offer native language interfaces to other general
purpose languages like C++ or Java, and to popular scripting languages like
Perl (for which a prototype has already been developed).
Sometimes users struggle to extract particular data through a query.
Often a simple modification or extension of the parsers involved solves the
problem. Clearly, many users do not want to deal with the parsers themselves,
or they may not even have permission to do so. Still, they need to know when it
pays to ask for help. Typical problems that can be solved include:
Figure 15 Example of using Icarus as a scripting language (for SRS version 5.1 or higher).
The program displayed on the left prints the entry-ids and the molecular weights of all
matches to the query 'SWISSPROT-des:Tetracycline*'. The program shown on the right takes
an SRS query string as its argument on the command line and prints the parser's 'fields'
token table: for each token, it shows both the token code (a label, which can be used to
determine how a token is further processed) and the token string. The token object (which
is used, e.g., to hold the sequence object) could have been accessed using $token.obj.
$Query returns a $Set object. See Chapter 7 of the on-line manual for a reference to Icarus
classes and Chapters 5 and 6 to learn more about Icarus syntax and functions.
D. P. KREIL AND T. ETZOLD
(a) Extracting data from a field: If you find yourself trying to query certain
parts of a field regularly, that extraction process should be delegated to the
parser. In a parser, this extraction can also be much more sophisticated.
(b) Querying phrases: The parser controls which terms can be matched by a
query (see Sections 3.1.1 and 3.1.2). Sometimes the way a parser breaks a field
into indexed terms is not well suited for answering a particular question.
Changing the parser or adding a new field that takes particular requirements
into account solves the problem. Consider a comment field, for which the
parser extracts single words for indexing; the query 'not & known' will not
only find the phrase 'not known', but also retrieve instances that contain the
two words out of context. To address this problem, the parser might be
changed to watch, e.g., for words preceded by 'not', and write the two-word
phrase to the index.
(c) Similarly, queries that use logical operators to combine data from several fields
are sometimes not sufficient if the necessary context has been lost in the
indexing process. Changing the parser is again the only satisfactory solution.
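The phrase problem described in item (b) can be reproduced with a toy index. The sketch below is plain Python, not an SRS parser; the tokenizer functions, the entry ids and the comment texts are hypothetical. It contrasts a word-by-word index, where 'not & known' also matches the two words out of context, with a variant that additionally writes 'not' plus the following word to the index as one term.

```python
import re
from collections import defaultdict


def word_tokens(text):
    """Index every word on its own, as described for the comment field."""
    return set(re.findall(r"[a-z]+", text.lower()))


def phrase_tokens(text):
    """Hypothetical parser variant: also emit 'not <word>' as a single term."""
    words = re.findall(r"[a-z]+", text.lower())
    tokens = set(words)
    for first, second in zip(words, words[1:]):
        if first == "not":
            tokens.add(f"not {second}")
    return tokens


def build_index(entries, tokenizer):
    """Map each indexed term to the set of entry ids that contain it."""
    index = defaultdict(set)
    for entry_id, comment in entries.items():
        for term in tokenizer(comment):
            index[term].add(entry_id)
    return index


# Made-up comment fields: only E1 contains the phrase 'not known'.
ENTRIES = {
    "E1": "Function not known.",
    "E2": "Known substrate; the product is not characterised.",
}

word_index = build_index(ENTRIES, word_tokens)
phrase_index = build_index(ENTRIES, phrase_tokens)

# 'not & known' on the word index: both entries match.
and_hits = word_index["not"] & word_index["known"]
# Lookup of the two-word term on the phrase-aware index: only E1 matches.
phrase_hits = phrase_index["not known"]

print(sorted(and_hits))     # ['E1', 'E2']
print(sorted(phrase_hits))  # ['E1']
```

The context is lost at indexing time, so no combination of logical operators on the word index can recover it; only a change on the parser side does.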
which we call the SRS root directory, and which we will print as '. . .' for the rest
of this chapter. From this directory, first run './srsinstall all'. To also
install the SRS web interface, run './srsinstall www' next. At completion,
this prints two lines that have to be inserted into the 'srm.conf' file of your
web server. Ask your system administrator to do this if necessary. If no web
server is installed at your site, save the lines for later reference; you will
need to install a web server before you can use them. We now normally use
Apache (see http://www.apache.org/).
Acknowledgements
Many people have contributed to the wealth of SRS database and application
modules that is publicly available now, and we are indebted to them all!
We wish to warmly thank Rob Falla for his help and the fruitful late-night
discussions. Also, we gratefully thank Mark Wooding who thoroughly examined
the final draft of this chapter. Any errors were certainly introduced afterwards!
References
1. Discala, C., et al. (1998). DBCAT, the public catalog of databases. INFOBIOGEN,
Villejuif, France; contact [email protected].
2. Frishman, D., Heumann, K., Lesk, A., and Mewes, H.-W. (1998). Bioinformatics, 14, 551.
3. Brenner, S. E. (1995). Science, 268, 622.
4. Kreil, D. P. and Etzold, T. (1998). Trends Biochem. Sci., 24, 155.
5. Etzold, T., Ulyanov, A., and Argos, P. (1996). In Methods in enzymology (ed. R. F. Doolittle).
Vol. 266, p. 114. Academic Press.
6. Etzold, T. and Argos, P. (1993). Comput. Appl. Biosci., 9, 49.
7. Etzold, T. and Argos, P. (1993). Comput. Appl. Biosci., 9, 59.
8. Markowitz, V. M. and Ritter, O. (1995). J. Comput. Biol., 2, 537.
9. Davidson, S. B., Overton, C., Tannen, V., and Wong, L. (1997). Int. J. Digit. Libr., 1, 36.
10. Altschul, S. F., et al. (1990). J. Mol. Biol., 215, 403.
11. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Nucleic Acids Res., 22, 4673.
12. Sonnhammer, E. L., et al. (1998). Nucleic Acids Res., 26, 320.
13. Attwood, T. K., Beck, M. E., Bleasby, A. J., and Parry-Smith, D. J. (1994). Nucleic Acids Res.,
24, 182.
14. Barker, W. C., et al. (1998). Nucleic Acids Res., 26, 27.
15. Duret, L., Mouchiroud, D., and Gouy, M. (1994). Nucleic Acids Res., 22, 2360.
16. Lanave, C., et al. (1999). Nucleic Acids Res., 27, 134.
17. Maidak, B. L., et al. (1997). Nucleic Acids Res., 25, 109.
18. Holm, L., et al. (1992). Protein Sci., 1, 1691.
19. Sander, C. and Schneider, R. (1991). Proteins, 9, 56.
20. Heinemeyer, T., et al. (1998). Nucleic Acids Res., 26, 362.
List of suppliers
Anderman and Co. Ltd., 145 London Road, Kingston-upon-Thames, Surrey KT2 6NH, UK.
Tel: 0181 541 0035 Fax: 0181 541 0623
Beckman Coulter (UK) Ltd., Oakley Court, Kingsmead Business Park, London Road, High Wycombe, Buckinghamshire HP11 1JU, UK.
Tel: 01494 441181 Fax: 01494 447558
URL: http://www.beckman.com
Beckman Coulter Inc., 4300 N Harbor Boulevard, PO Box 3100, Fullerton, CA 92834-3100, USA.
Tel: 001 714 871 4848 Fax: 001 714 773 8283
URL: http://www.beckman.com
Becton Dickinson and Co., 21 Between Towns Road, Cowley, Oxford OX4 3LY, UK.
Tel: 01865 748844 Fax: 01865 781627
URL: http://www.bd.com
Becton Dickinson and Co., 1 Becton Drive, Franklin Lakes, NJ 07417-1883, USA.
Tel: 001 201 847 6800
URL: http://www.bd.com
Bio 101 Inc., c/o Anachem Ltd., Anachem House, 20 Charles Street, Luton, Bedfordshire LU2 0EB, UK.
Tel: 01582 456666 Fax: 01582 391768
URL: http://www.anachem.co.uk
Bio 101 Inc., PO Box 2284, La Jolla, CA 92038-2284, USA.
Tel: 001 760 598 7299 Fax: 001 760 598 0116
URL: http://www.bio101.com
Bio-Rad Laboratories Ltd., Bio-Rad House, Maylands Avenue, Hemel Hempstead, Hertfordshire HP2 7TD, UK.
Tel: 0181 328 2000 Fax: 0181 328 2550
URL: http://www.bio-rad.com
Bio-Rad Laboratories Ltd., Division Headquarters, 1000 Alfred Noble Drive, Hercules, CA 94547, USA.
Tel: 001 510 724 7000 Fax: 001 510 741 5817
URL: http://www.bio-rad.com
CP Instrument Co. Ltd., PO Box 22, Bishop Stortford, Hertfordshire CM23 3DX, UK.
Tel: 01279 757711 Fax: 01279 755785
URL: http://www.cpinstrument.co.uk
Dupont (UK) Ltd., Industrial Products Division, Wedgwood Way, Stevenage, Hertfordshire SG1 4QN, UK.
Tel: 01438 734000 Fax: 01438 734382
URL: http://www.dupont.com
Dupont Co. (Biotechnology Systems Division), PO Box 80024, Wilmington, DE 19880-002, USA.
Tel: 001 302 774 1000 Fax: 001 302 774 7321
URL: http://www.dupont.com
Eastman Chemical Co., 100 North Eastman Road, PO Box 511, Kingsport, TN 37662-5075, USA.
Tel: 001 423 229 2000
URL: http://www.eastman.com
Fisher Scientific UK Ltd., Bishop Meadow Road, Loughborough, Leicestershire LE11 5RG, UK.
Tel: 01509 231166 Fax: 01509 231893
URL: http://www.fisher.co.uk
Fisher Scientific, Fisher Research, 2761 Walnut Avenue, Tustin, CA 92780, USA.
Tel: 001 714 669 4600 Fax: 001 714 669 1613
URL: http://www.fishersci.com
Fluka, PO Box 2060, Milwaukee, WI 53201, USA.
Tel: 001 414 273 5013 Fax: 001 414 273 4979
URL: http://www.sigma-aldrich.com
Fluka Chemical Co. Ltd., PO Box 260, CH-9471, Buchs, Switzerland.
Tel: 0041 81 745 2828 Fax: 0041 81 756 5449
URL: http://www.sigma-aldrich.com
Hybaid Ltd., Action Court, Ashford Road, Ashford, Middlesex TW15 1XB, UK.
Tel: 01784 425000 Fax: 01784 248085
URL: http://www.hybaid.com
Hybaid US, 8 East Forge Parkway, Franklin, MA 02038, USA.
Tel: 001 508 541 6918 Fax: 001 508 541 3041
URL: http://www.hybaid.com
HyClone Laboratories, 1725 South HyClone Road, Logan, UT 84321, USA.
Tel: 001 435 753 4584 Fax: 001 435 753 4589
URL: http://www.hyclone.com
Invitrogen Corp., 1600 Faraday Avenue, Carlsbad, CA 92008, USA.
Tel: 001 760 603 7200 Fax: 001 760 603 7201
URL: http://www.invitrogen.com
Invitrogen BV, PO Box 2312, 9704 CH Groningen, The Netherlands.
Tel: 00800 5345 5345 Fax: 00800 7890 7890
URL: http://www.invitrogen.com
Life Technologies Ltd., PO Box 35, Free Fountain Drive, Inchinnan Business Park, Paisley PA4 9RF, UK.
Tel: 0800 269210 Fax: 0800 838380
URL: http://www.lifetech.com
Life Technologies Inc., 9800 Medical Center Drive, Rockville, MD 20850, USA.
Tel: 001 301 610 8000
URL: http://www.lifetech.com
Merck Sharp & Dohme, Research Laboratories, Neuroscience Research Centre, Terlings Park, Harlow, Essex CM20 2QR, UK.
URL: http://www.msd-nrc.co.uk
MSD Sharp and Dohme GmbH, Lindenplatz 1, D-85540, Haar, Germany.
URL: http://www.msd-deutschland.com
Millipore (UK) Ltd., The Boulevard, Blackmoor Lane, Watford, Hertfordshire WD1 8YW, UK.
Tel: 01923 816375 Fax: 01923 818297
URL: http://www.millipore.com/local/UK.htm
Millipore Corp., 80 Ashby Road, Bedford, MA 01730, USA.
Tel: 001 800 645 5476 Fax: 001 800 645 5439
URL: http://www.millipore.com
New England Biolabs, 32 Tozer Road, Beverley, MA 01915-5510, USA.
Tel: 001 978 927 5054
Nikon Inc., 1300 Walt Whitman Road, Melville, NY 11747-3064, USA.
Tel: 001 516 547 4200 Fax: 001 516 547 0299
URL: http://www.nikonusa.com
Nikon Corp., Fuji Building, 2-3, 3-chome, Marunouchi, Chiyoda-ku, Tokyo 100, Japan.
Tel: 00813 3214 5311 Fax: 00813 3201 5856
URL: http://www.nikon.co.jp/main/index_e.htm
Nycomed Amersham plc, Amersham Place, Little Chalfont, Buckinghamshire HP7 9NA, UK.
Tel: 01494 544000 Fax: 01494 542266
URL: http://www.amersham.co.uk
Nycomed Amersham, 101 Carnegie Center, Princeton, NJ 08540, USA.
Tel: 001 609 514 6000
URL: http://www.amersham.co.uk
Perkin Elmer Ltd., Post Office Lane, Beaconsfield, Buckinghamshire HP9 1QA, UK.
Tel: 01494 676161
URL: http://www.perkin-elmer.com
Pharmacia Biotech (Biochrom) Ltd., Unit 22, Cambridge Science Park, Milton Road, Cambridge CB4 0FJ, UK.
Tel: 01223 423723 Fax: 01223 420164
URL: http://www.biochrom.co.uk
Pharmacia and Upjohn Ltd., Davy Avenue, Knowlhill, Milton Keynes, Buckinghamshire MK5 8PH, UK.
Tel: 01908 661101 Fax: 01908 690091
URL: http://www.eu.pnu.com
Promega UK Ltd., Delta House, Chilworth Research Centre, Southampton SO16 7NS, UK.
Tel: 0800 378994 Fax: 0800 181037
URL: http://www.promega.com
Promega Corp., 2800 Woods Hollow Road, Madison, WI 53711-5399, USA.
Tel: 001 608 274 4330 Fax: 001 608 277 2516
URL: http://www.promega.com
Qiagen UK Ltd., Boundary Court, Gatwick Road, Crawley, West Sussex RH10 2AX, UK.
Tel: 01293 422911 Fax: 01293 422922
URL: http://www.qiagen.com
Qiagen Inc., 28159 Avenue Stanford, Valencia, CA 91355, USA.
Tel: 001 800 426 8157 Fax: 001 800 718 2056
URL: http://www.qiagen.com
Roche Diagnostics Ltd., Bell Lane, Lewes, East Sussex BN7 1LG, UK.
Tel: 01273 484644 Fax: 01273 480266
URL: http://www.roche.com
Roche Diagnostics Corp., 9115 Hague Road, PO Box 50457, Indianapolis, IN 46256, USA.
Tel: 001 317 845 2358 Fax: 001 317 576 2126
URL: http://www.roche.com
Roche Diagnostics GmbH, Sandhoferstrasse 116, 68305 Mannheim, Germany.
Tel: 0049 621 759 4747 Fax: 0049 621 759 4002
URL: http://www.roche.com
Schleicher and Schuell Inc., Keene, NH 03431A, USA.
Tel: 001 603 357 2398
Shandon Scientific Ltd., 93-96 Chadwick Road, Astmoor, Runcorn, Cheshire WA7 1PR, UK.
Tel: 01928 566611
URL: http://www.shandon.com
Sigma-Aldrich Co. Ltd., The Old Brickyard, New Road, Gillingham, Dorset SP8 4XT, UK.
Tel: 01747 822211 Fax: 01747 823779
URL: http://www.sigma-aldrich.com
Sigma-Aldrich Co. Ltd., Fancy Road, Poole, Dorset BH12 4QH, UK.
Tel: 01202 722114 Fax: 01202 715460
URL: http://www.sigma-aldrich.com
Sigma Chemical Co., PO Box 14508, St Louis, MO 63178, USA.
Tel: 001 314 771 5765 Fax: 001 314 771 5757
URL: http://www.sigma-aldrich.com
Stratagene Europe, Gebouw California, Hogehilweg 15, 1101 CB Amsterdam Zuidoost, The Netherlands.
Tel: 00800 9100 9100
URL: http://www.stratagene.com
United States Biochemical, PO Box 22400, Cleveland, OH 44122, USA.
Tel: 001 216 464 9277
Index
Brackets denote figures
1D-3D profile 5
Compugen computer company 201
consensus sequence 145, [146]
expectation value (e-value) 177, 187
extreme value distribution 175-6, [176]
topological equivalence 25
TOPS topology program 116-7
transmembrane prediction of
  structure 133-6
  topology 136
VERTAA structure comparison program 37, [38]
Webin sequence submission program 210, [211]
World Wide Web (WWW) 191
z-score 178