Selecting the Right Similarity-Scoring Matrix
William R. Pearson
University of Virginia School of Medicine, Charlottesville, Virginia
ABSTRACT
Protein sequence similarity searching programs like BLASTP, SSEARCH, and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. “Deep” scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20% to 30% identity, while “shallow” scoring matrices (e.g., VTML10 to VTML80) target alignments that share 90% to 50% identity, reflecting much less evolutionary change. While “deep” matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into nonhomologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full-length protein sequences, but short domains or restricted evolutionary look-back require shallower scoring matrices. Curr. Protoc. Bioinform. 43:3.5.1-3.5.9.
© 2013 by John Wiley & Sons, Inc.
Selecting the Right Similarity-Scoring Matrix
Table 3.5.1 Scoring Matrix Target Identity, Information Content, and Alignment Length (a)
Thus, a 10/2 penalty produces a penalty of 12 for a one residue gap, 14 for two residues, etc.
(c) Scaled in 1/3-bit units; all other matrices are scaled in 1/2-bit units.
(d) As calculated according to Mueller et al. (2002).
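The open/extend arithmetic in the footnote above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the search programs: the function names are hypothetical, and the convention that the first gapped residue costs open plus extend is inferred from the 10/2 example (12 for one residue, 14 for two). The second function simply evaluates the simulation-based rule for 1/2-bit matrices that this unit quotes from Reese and Pearson (2002) in the section on gap penalties.

```python
def affine_gap_penalty(gap_length, gap_open=10, gap_extend=2):
    """Total penalty for a gap of gap_length residues.

    Assumes the convention implied by the footnote: the first residue
    of a gap costs gap_open + gap_extend, each further residue gap_extend.
    """
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    return gap_open + gap_extend * gap_length


def suggested_gap_open(pam_distance):
    """Gap-open penalty for a 1/2-bit matrix at the given PAM distance,
    using the rule quoted in this unit: 16.7 - 0.067 * PAM distance."""
    return 16.7 - 0.067 * pam_distance


if __name__ == "__main__":
    for length in (1, 2, 5):
        print(length, affine_gap_penalty(length))
    print(round(suggested_gap_open(20), 2))
```

With these definitions, a 10/2 penalty gives 12 and 14 for gaps of one and two residues, and the quoted rule gives roughly 15 for a PAM 20 matrix.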
3.5.2
Supplement 43 Current Protocols in Bioinformatics
replacement frequencies over long evolutionary times (Mueller et al., 2002).
In 1992, Steve and Jorja Henikoff described a direct approach to counting replacement frequencies at long evolutionary distances (Henikoff and Henikoff, 1992). The BLOSUM scoring matrices avoided the problem of extrapolating from PAM1 replacement frequencies by counting replacement frequencies directly with the BLOSUM series of matrices. Rather than relying on alignments of relatively closely related proteins, they identified conserved BLOCKS, or ungapped patches of conserved sequences, in sets of proteins that were potentially very distantly related. They then counted the amino acid replacements within these blocks, using a percent identity threshold to exclude closely and more moderately related sequences. In their description of the BLOSUM matrices, they showed that BLOSUM62 performed much more effectively than either the PAM120 (BLOSUM62 equivalent information content) or the PAM250 matrix (BLOSUM45 equivalent) for identifying distant homologs. BLOSUM62 was then incorporated as the default for the BLASTP (UNIT 3.4) program, while FASTA (UNIT 3.9) and SSEARCH (UNIT 3.10) switched to the BLOSUM50 matrix, which is more sensitive than BLOSUM62, but requires longer alignments.
THE ALGEBRA OF SIMILARITY SCORING (LOG-ODDS) MATRICES
Scoring Matrices as Odds Ratios
Similarity scoring matrices for local sequence alignments, which are rigorously calculated by the Smith-Waterman algorithm (Smith and Waterman, 1981) and heuristically calculated by BLASTP (Altschul et al., 1990; Altschul et al., 1997) or FASTA (Pearson and Lipman, 1988), require scoring matrices that produce negative values on average between random sequences. If the average or expected matrix score is positive, the alignment will extend to the ends of the sequences, and be global, rather than local. Dayhoff’s initial PAM matrices were calculated as log-odds ratios, the logarithm of the ratio of the alignment frequency observed after a given evolutionary distance divided by the alignment frequency expected by chance:
$$\log\left(\frac{\text{frequency in homologs}}{\text{frequency by chance}}\right)$$
The Henikoffs used the same odds-ratio algebra when developing the BLOSUM matrices, but calculated their transition frequencies by counting the number of weighted changes in different blocks.
In 1991, Altschul published a seminal paper (Altschul, 1991) that showed that any scoring matrix appropriate for local alignments (one with a negative expected score) could be treated as a “log-odds” matrix of the form: $\lambda s_{i,j} = \log(q_{i,j}/p_i p_j)$, where $s_{i,j}$ is the score given to the i,j alignment, $q_{i,j}$ is the replacement frequency for amino acid i to j, and the $p_i p_j$ term gives the expected frequency of two amino acids aligning by chance. The $\lambda$ term is used to scale the matrix so that individual scores can be accurately represented with integers. Widely used scoring matrix values typically range from −10 to +20, reflecting $\lambda$ scale factors of ln(2)/2 (half-bit units, used by BLOSUM62 and PAM120) or ln(2)/3 (third-bit units, used by BLOSUM50 and PAM250). For example, the BLOSUM62 score for aligning aspartic acid (“D”) with itself is +6, and BLOSUM62 is scaled in 1/2-bit units, so a D:D alignment in related proteins is $6 = 2.0 \times \log_2(q_{D,D}/p_D p_D)$, or $2^3 = 8$ times more likely to occur because of homology than by chance. Likewise, the BLOSUM62 matrix assigns a D:L alignment a score of −4, which means that it is $2^2 = 4$ times more likely to occur by chance than in the homologous blocks aligned for BLOSUM62.
This ratio of homologous replacement frequency to chance alignment frequency explains why modern scoring matrices can give very different scores to identical residues. In the denominator, amino acids are not uniformly abundant (common amino acids like L, A, S, and G are found more than four times more frequently than rare amino acids like W, C, H, and M; see APPENDIX 1A for a table of the 1-letter amino acid codes), so common amino acids often have lower identity scores than rare ones. Likewise, amino acids are not uniformly mutable: A, S, and T change frequently over evolutionary time, while W and C change rarely. Thus, the highest identity score in the BLOSUM62 matrix (Fig. 3.5.1) is 11, corresponding to a W:W alignment, while A, I, L, S, and V get identity alignment scores of 4. Differences in identity scores, together with positive scores for nonidentity alignments between conserved amino acids, explain why sequence similarity scores are dramatically more sensitive than percent identity for inferring homology (see UNIT 3.1).
Finding Similarities and Inferring Homologies
A R N D C Q E G H I L K M F P S T W Y V X
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Figure 3.5.1 The BLOSUM62 matrix. The BLOSUM62 matrix used by BLASTP, BLASTX, and TBLASTN is actually 23 × 23: 20 amino acids plus X (any amino acid), B (D or N), and Z (E or Q). Only the lower half of the symmetric matrix is shown to highlight the identity scores on the diagonal. The most positive value is 11 (W:W alignment); the most negative is −4 (found for many hydrophobic/hydrophilic and small/large replacements). The BLOSUM62 matrix is scaled in 1/2-bit units, so the W:W alignment of 11 is $2^{5.5} \approx 45$ times more common in homologous proteins than by chance. Weighted by amino acid abundance, the average similarity score is about −1 half-bits.
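The conversion between integer matrix scores and odds ratios used in the text and in this caption can be checked with a short Python sketch. The function names are hypothetical, and the frequencies in the usage lines are made-up values chosen only so that the ratio is 8; the scores 6 (D:D), −4 (D:L), and 11 (W:W) are the BLOSUM62 values from Figure 3.5.1.

```python
import math


def odds_ratio(score, units_per_bit=2):
    """Homolog-to-chance odds ratio implied by an integer log-odds score.

    units_per_bit=2 means half-bit units (e.g., BLOSUM62);
    units_per_bit=3 means third-bit units (e.g., BLOSUM50).
    """
    return 2.0 ** (score / units_per_bit)


def log_odds_score(q_ij, p_i, p_j, units_per_bit=2):
    """Unrounded score for a replacement frequency q_ij and
    background frequencies p_i and p_j, in 1/units_per_bit-bit units."""
    return units_per_bit * math.log2(q_ij / (p_i * p_j))


if __name__ == "__main__":
    print(odds_ratio(6))              # D:D in BLOSUM62
    print(odds_ratio(-4))             # D:L in BLOSUM62
    print(round(odds_ratio(11), 1))   # W:W in BLOSUM62
    # Hypothetical frequencies giving an 8-fold enrichment over chance.
    print(round(log_odds_score(0.02, 0.05, 0.05), 6))
```

A score of −4 gives an odds ratio of 1/4, which is the same statement as "4 times more likely to occur by chance."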
Figure 3.5.2 Comparison of a “shallow” (VTML 20) and “deep” (BLOSUM62) scoring matrix.
Both matrices are scaled in 1/2-bits. For the small part of the matrices shown here, the VTML20
matrix produces an average 2.80 half-bit identity score, and an average −0.59 nonidentical score
(weighted by amino-acid abundance). In contrast, BLOSUM62 produces 1.86 for identities but
only −0.06 for nonidentities. Thus, VTML20 targets shorter, higher-identity alignments, because
it penalizes nonidentities much more strongly.
matrix. Thus, in addition to differing in information content, scoring matrices have a range of target percent identities and alignment lengths (Table 3.5.1). Shallower scoring matrices produce shorter, more identical alignments, because they give more negative scores to nonidentical aligned residues. “Deeper” scoring matrices produce longer alignments with lower percent identities because the penalty for a mismatch is much lower and more conservative nonidentities get positive scores.
In practice, the relationship between scoring matrix evolutionary distance, information content, percent identity, and alignment length suggests two reasons for changing from the BLOSUM62 and BLOSUM50 matrices used by BLASTP and SSEARCH/FASTA. First, one should change to a shallower matrix when looking for short alignments. We need a shallower scoring matrix for short domains, short exons, or short DNA reads because deep scoring matrices like BLOSUM62 do not have enough information content to produce significant scores. Short alignments require shallow scoring matrices.
One should also use a shallower scoring matrix when looking for orthologs (sequences that differ because of speciation events and are likely to share similar functions) between “relatively” closely related organisms (100 to 500 My). Protein sequence comparison algorithms are very sensitive; BLASTP and SSEARCH routinely find significant alignments between human and yeast (1.2 billion year divergence) and human and E. coli (>2.4 billion years). Because of this sensitivity, a mouse-human comparison often reports not only the orthologs (sequences that diverged at the primate/rodent split 80 million years ago), but also dozens of more distantly related paralogs that may have diverged 200 to 2000 million years ago. Mouse and human orthologs share about 83% amino acid identity; thus, for mammals, the VTML 20 matrix is expected to find all orthologs and paralogs that have diverged over the past 200 million years, but the matrix is much less likely to identify paralogs that share less than 40% sequence identity (divergence time > 1000 million years).
SCORING MATRICES AND GAP PENALTIES
While there is an intuitive mathematical explanation for pairwise similarity scores from the log-odds perspective, sensitive sequence alignments require both aligned residues and insertion or deletion gaps. Unfortunately, we do not have an analytical model for gap penalties and evolutionary distances. The default gap penalties provided for BLASTP, SSEARCH, and FASTA were determined empirically (e.g., Pearson, 1991) with a focus on identifying distant homologs. In general, default gap penalties for BLASTP and SSEARCH/FASTA are set as low as possible; lower gap penalties would convert alignments from local to global, which would invalidate the statistical estimates. Thus, when considering whether to change gap penalties to improve search selectivity for a particular protein family, gap penalties should be increased (made more stringent), not decreased. Just as “shallower” scoring matrices target less divergence by giving higher scores to identities and more negative scores to nonidentities, gap penalties should increase with shallower scoring matrices (Reese and Pearson, 2002). Simulations to maximize the significance of short alignments suggest that for 1/2-bit scoring matrices, gap open penalties of 16.7 − (0.067 × PAM distance), e.g., 16.7 − (0.067 × 20) ≈ 15 for
VTML 20, and gap extend penalties of 2, are most effective (Reese and Pearson, 2002).
Low gap penalties can dramatically reduce the information content and average percent identity associated with a scoring matrix, and can dramatically increase the lengths of alignments produced by the matrix. The target percent identity, information content, and alignment lengths presented in Table 3.5.1 reflect the observed median values of the highest-scoring alignment produced by random queries against real protein sequences with the specified matrix and gap penalties. If gaps are not allowed, the average percent identity and information content increase and alignment length gets shorter. For example, if gaps are not allowed with BLOSUM62, the median percent identity increases from 28.9 (Table 3.5.1) to 33, information content almost doubles from 0.40 to 0.74, and median random alignment length drops from 86 to 45 residues. A similar effect is seen with VTML 80, where information content increases and alignment lengths decrease almost 2-fold when gaps are not allowed. Gap effects are less dramatic with shallower matrices like VTML 20 (from 86% to 89% identity, from 3.3 to 3.5 bits per position, and from 11 to 10 residue median alignment lengths) because short evolutionary distances should allow many fewer insertions and deletions.
BLASTP Gap Penalties with Shallow Scoring Matrices
While the BLAST programs offer a set of scoring matrices with different evolutionary horizons (BLOSUM50 and BLOSUM62 are “deep”; PAM30 is relatively “shallow”), the modest gap penalties provided with their shallow matrices dramatically modify their effective evolutionary distance (Table 3.5.1). The “shallowest” combination of scoring matrix (PAM30) and gap penalties (9/1) requires an average of 56 aligned amino acids, or more than 160 nucleotides, to produce a 50-bit alignment score. Because these gap penalties are too low (Reese and Pearson, 2002), the BLAST protein matrices are less effective for short alignments or short evolutionary distances than they would be with higher penalties.
LONG ALIGNMENTS AND OVEREXTENSION
In addition to differing in information content (score or “bits” per aligned position) and optimal evolutionary distances (percent identity), different scoring matrices have different preferred alignment lengths (Table 3.5.1). Shallow scoring matrices have large negative values for amino acid replacements (Fig. 3.5.2), so alignments to nonhomologous (random) sequences will be short. Deep scoring matrices have less negative average replacement scores (VTML20’s average nonidentity score is −5.8 half-bits, while BLOSUM62’s is −1.2 half-bits), so their alignments tend to be longer. Table 3.5.1 (random alignment length column) summarizes the median alignment length between random queries and real protein sequences. BLAST and SSEARCH/FASTA statistics are very accurate (UNIT 3.1), so sequences that share statistically significant scores will always share a homologous domain. However, BLAST and SSEARCH/FASTA calculate local sequence alignments (the alignments begin and end at a position that maximizes the alignment score), so the boundaries of the alignment depend on both the location of the homologous domain and the scoring matrix used to produce the alignment. When a deep scoring matrix like BLOSUM62 is used to align more closely related sequences, the alignment can extend (overextend) into nonhomologous neighboring sequence. Gonzalez and Pearson (2010) termed this artifact “homologous overextension,” and showed that it is a major source of errors in PSI-BLAST searches.
Homologous overextension often occurs from short repeated domains. For example, Figure 3.5.3A shows a BLASTP alignment of VAV_HUMAN (P15498) with SKAP2_XENTR (Q5FVW6), a protein that contains an SH3 domain that is homologous over 58 amino acids. However, the alignment is 198 residues long; the additional 140 residues in the alignment include a 100-residue Pleckstrin domain in SKAP2_XENTR that is not homologous (VAV_HUMAN contains an SH2 domain in the region that aligns to the Pleckstrin domain in SKAP2_XENTR). The 58-residue homologous SH3 domain contributes 85% of the bit score, with the additional 140 residues contributing less than 15% of the score. Using the slightly more stringent (shallower) BLOSUM80 matrix does not change the alignment overextension.
The FASTA programs offer a new option for identifying homologous overextension: subdomain scoring (Fig. 3.5.3B). By using the domain annotations available for one of the sequences to subdivide the alignment, it becomes apparent that the 58-residue SH3 domain is responsible for almost all of the
A
sp|P15498.4|VAV_HUMAN Proto-oncogene vav: Length: 845
sp|Q5FVW6.2|SKAP2_XENTR Src kinase-associated phosphoprotein 2 Length: 328
[ SH2
Query 649 WFPCNRVKPYVHGPPQDLSVHL WYAGPMERAGAESILAN--RSDGTFLVRQRVKDAAEFA 706
W C Y +G +D ++ RA L + D F + K +FA
Sbjct 128 WCVCTNSMFYYYGSDKDKQQKGAFSLDGYRAKMNDTLRKDAKKDCCFEIFAPDKRVYQFA 187
]
Query 707 ISIKYNVEVKHIKIMTAEGLYRITEKKAFRGLTELVEFYQQNSLKDCFKSLDTTLQFPF K 766
S E IM + G +++ + + + V+ + +D ++ L + P
Sbjct 188 ASSPKEAEEWVNIIMNSRGNIPTEDEELYDDVNQEVDASHE---EDIYEELPEESEKPVT 244
Pleckstrin ]
[ SH3
Query 767 EPEKRTISRPAVGST K------YFGTAKARYDFCARDRSELSLKEGDIIKILNKK-GQQG 819
E E + V +T Y + +D ELS K GD I IL+K+ G
Sbjct 245 EIETPKATPVPVNNTSGKENT DYANFYRGLWDCTGDHPDELSFKHGDTIYILSKEYNTYG 304
[
]
Query 820 WWRGEIYGRVGWFPANYVEEDY 841
WW GE+ G +G P Y+ E Y
Sbjct 305 WWVGEMKGTIGLVPKAYIMEMY 326
]
B
sp|P15498.4|VAV_HUMAN Proto-oncogene vav
sp|Q5FVW6.2|SKAP2_XENTR Src kinase-associated phosphoprotein (328 aa)
sRegion: 626-725:103-206 : score=7; bits=8.7; Id=0.202; Q= 0.0 : Pleckstrin
qRegion: 671-765:150-243 : score=7; bits=8.7; Id=0.175; Q= 0.0 : SH2
qRegion: 782-841:260-326 : score=83; bits=35.8; Id=0.343; Q=53.4 : SH3
sRegion: 783-841:266-326 : score=88; bits=37.6; Id=0.383; Q=58.7 : SH3
s-w opt: 116 bits: 47.6 E(454402): 0.0006
Figure 3.5.3 Overextension of an alignment of homologous SH3 domains. (A) BLASTP alignment of VAV_HUMAN with SKAP2_XENTR. The two proteins share a homologous SH3 domain (highlighted in red) over about 58 amino acids that contributes more than 85% of the similarity score. The remaining 140 amino acid alignment juxtaposes an SH2 domain from VAV_HUMAN (brown) with a Pleckstrin domain from SKAP2_XENTR (green). These two domains are not homologous; they are classified as having different folds in SCOP. (B) Sub-alignment scores produced by the SSEARCH36 program using the same scoring matrix as BLASTP (BLOSUM62, 11/1) for the VAV_HUMAN/SKAP2_XENTR alignment. Boundaries for annotated domains in the two proteins were taken from InterPro using the query VAV_HUMAN (qRegion) or the subject SKAP2_XENTR (sRegion). Thus, 103-206 for the Pleckstrin domain comes from InterPro annotations for SKAP2_XENTR, while 671-765 for the SH2 domain comes from the annotations for VAV_HUMAN. The raw score, bit-score, and percent identity are shown for the subregions. The Q-score is $-10 \log_{10}(p\text{-value})$ based on the bit score; thus Q = 30 corresponds to a probability (uncorrected for database size) of 0.001.
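The Q-score defined in this caption is a simple transform of the p-value, and a short Python sketch makes the scale concrete. The function names are hypothetical; the only relationship used is the caption's Q = −10 log10(p-value), with no correction for database size.

```python
import math


def q_score(p_value):
    """Q-score as defined in the caption: -10 * log10(p-value)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return -10.0 * math.log10(p_value)


def p_value_from_q(q):
    """Invert the Q-score back to an (uncorrected) p-value."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    print(q_score(0.001))            # the caption's Q = 30 example
    print(p_value_from_q(58.7))      # SH3 sRegion in panel B
    print(p_value_from_q(0.0))       # Q = 0.0 regions: p-value of 1
```

On this scale the Q = 0.0 reported for the Pleckstrin and SH2 subregions corresponds to a p-value of 1, while the SH3 subregions (Q above 50) correspond to p-values below 0.00001.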
significant similarity found. It is often very difficult to judge the quality of a distant alignment visually; subdomain scoring provides a quantitative strategy for identifying overextension.
SCORING MATRICES FOR DNA
DNA scoring matrices, which are usually implemented as match/mismatch scores, can also be treated as log-odds matrices with target evolutionary distances (States et al., 1991). For example, the default match/mismatch penalties used by BLASTN in its most sensitive mode (-task blastn) are +2 for a match and −3 for a mismatch, which targets sequences at PAM10, or 90% identity (States et al., 1991). By default, searches on the NCBI nucleotide BLAST Web site use MEGABLAST (-task megablast), with match/mismatch scores of +1/−3 that target sequences that are 99% identical. By default, the FASTA program uses +5/−4 (also
available with BLASTN, -task blastn), which corresponds approximately to PAM 40, or 70% identity. Because DNA sequence comparison is much less sensitive than protein sequence comparison, it is very difficult to detect statistically significant DNA:DNA sequence similarity at distances greater than PAM 40 (PAM 40 is a short distance for protein comparisons).
In practice, the effective target identity for heuristic methods like BLAT, BLASTN, MEGABLAST, and other genome-alignment programs that do use scoring matrices, may be difficult to estimate from the reported match/mismatch scores. Heuristic programs typically use a hierarchy of filters to accelerate the similarity search, and each of those filters will affect the percentage identity and evolutionary distance of the alignments that are displayed. As a result, it is possible that the displayed alignments may have a lower percent identity than other possible alignments that were excluded during the early stages of the filtering process.
Ideally, the match/mismatch penalties used in genome alignment would match the evolutionary distances of the sequences being aligned; human DNA is expected to be more than 99.9% identical to itself, but human-mouse alignments in protein-coding regions will be less than 80% identical (outside of protein-coding regions, identity will typically be undetectable at <50%). Likewise, match/mismatch parameters should reflect potential alignment length; searches with short sequences will need higher match/mismatch ratios with higher information content (States et al., 1991).
SUMMARY
The BLAST and FASTA/SSEARCH protein-alignment programs use “deep” similarity scoring matrices like BLOSUM62 or BLOSUM50 to identify homologs that share less than 25% sequence identity. Deep scoring matrices require long sequence alignments to achieve statistically significant similarity scores and are more likely to extend alignments outside the homologous region. Shallower scoring matrices are more effective when searching for short homologous domains or short (<150-nt) exons, or when searching over shorter evolutionary distances. Scoring matrices that are matched to the evolutionary distance of the homologous sequences are also less likely to produce homologous overextension.
The match/mismatch ratios used in DNA similarity searches also have target evolutionary distances. The stringent match/mismatch ratios used by MEGABLAST are most effective at matching sequences that are essentially 100% identical, e.g., mRNA sequences to genomic exons. Deeper, more sensitive DNA scoring parameters are more effective for longer DNA evolutionary distances, e.g., mouse-human.
While scoring matrices and gap penalties can dramatically affect search sensitivity and alignment regions, modern sequence-comparison programs provide accurate similarity statistics, so it is unlikely that the wrong scoring matrix will produce a significant match to a nonhomologous protein. However, the wrong matrix can prevent short homologous regions from being found, or allow an overextension into a nonhomologous region from a homologous domain. The rapidly increasing volume of protein sequence means that close homologs will often be available, and shallower scoring matrices can produce more reliable, functionally informative alignments when closer homologs (>50% identical) are found.
ACKNOWLEDGMENTS
W.R.P. has been supported by funding from the National Library of Medicine (LM04969).
LITERATURE CITED
Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. A basic local alignment search tool. J. Mol. Biol. 215:403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256:1443-1445.
Gonzalez, M.W. and Pearson, W.R. 2010. Homologous over-extension: A challenge for iterative similarity searches. Nucleic Acids Res. 38:2177-2189.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915-10919.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8:275-282.
Mueller, T., Spang, R., and Vingron, M. 2002. Estimating amino acid substitution models: A comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol. 19:8-13.
Pearson, W.R. 1991. Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11:635-650.
Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-2448.
Reese, J.T. and Pearson, W.R. 2002. Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 18:1500-1507.
Schwartz, R.M. and Dayhoff, M. 1978. Matrices for detecting distant relationships. In Atlas of Protein Sequence and Structure, Volume 5, Supplement 3 (M. Dayhoff, ed.), pp. 353-358. National Biomedical Research Foundation, Silver Spring, Maryland.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
States, D.J., Gish, W., and Altschul, S.F. 1991. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods Enzymol. 3:66-70.