
Selecting the Right Similarity-Scoring Matrix
UNIT 3.5

William R. Pearson
University of Virginia School of Medicine, Charlottesville, Virginia

ABSTRACT
Protein sequence similarity searching programs like BLASTP, SSEARCH, and FASTA use scoring
matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for
BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are
most effective at different evolutionary distances. “Deep” scoring matrices like BLOSUM62 and
BLOSUM50 target alignments with 20% to 30% identity, while “shallow” scoring matrices (e.g.,
VTML10 to VTML80) target alignments that share 90% to 50% identity, reflecting much less
evolutionary change. While “deep” matrices provide very sensitive similarity searches, they also
require longer sequence alignments and can sometimes produce alignment overextension into
nonhomologous regions. Shallower scoring matrices are more effective when searching for short
protein domains, or when the goal is to limit the scope of the search to sequences that are likely
to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match
and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit,
we will discuss the theoretical foundations that drive practical choices of protein and DNA
similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50)
should be used for sensitive searches with full-length protein sequences, but short domains or
restricted evolutionary look-back require shallower scoring matrices. Curr. Protoc. Bioinform.
43:3.5.1-3.5.9. © 2013 by John Wiley & Sons, Inc.

Keywords: similarity scoring matrices, PAM matrices, BLOSUM matrices, sequence alignment

Finding Similarities and Inferring Homologies
Current Protocols in Bioinformatics 3.5.1-3.5.9, October 2013
Published online October 2013 in Wiley Online Library (wileyonlinelibrary.com).
DOI: 10.1002/0471250953.bi0305s43 (Supplement 43)
Copyright © 2013 John Wiley & Sons, Inc.

SIMILARITY SEARCHING, HOMOLOGY, AND STATISTICAL SIGNIFICANCE

Protein similarity scoring matrices dramatically improve evolutionary look-back time
because they capture amino acid substitution preferences that have emerged over evolutionary
time. Amino acid changes can range from biochemically conservative, e.g., leucine to valine
or arginine to lysine, to dramatically different, e.g., tryptophan to glycine. Amino acid
scoring matrices capture this evolutionary information; conservative changes receive positive
scores, while nonconservative changes will receive the largest negative scores.
As a result, statistical expectation values (E() values) based on amino-acid similarity
scores are far more sensitive than percent identity for finding homologs (UNIT 3.1).

In this unit, we provide a brief overview of the history of scoring matrices, the algebra
used to calculate scoring matrices, and the important concepts of matrix information
content and matrix target evolutionary distance. Because finding distantly related protein
sequences is more challenging than finding closely related sequences, the BLOSUM62
matrix used by the BLAST programs and the BLOSUM50 matrix used by the FASTA
programs are designed to identify distant homologs using long (typically full-length)
sequences. Understanding the explicit or implicit evolutionary models used in similarity
scoring matrices makes it much easier to choose the right scoring matrix. Generally,
searches for short domains (or with shorter query sequences) require shallower scoring
matrices. Likewise, shallow scoring matrices can be more effective at highlighting common
orthologs when comparing proteins that have diverged in the past 100 to 500 million
years. While deep scoring matrices are more effective in identifying distant relationships,
deep scoring matrices can also contribute to homologous overextension when
two closely related domains are embedded in
nonhomologous protein contexts. Using the appropriate scoring matrix can improve both
search sensitivity and alignment accuracy.

AMINO ACID SUBSTITUTION MATRICES: HISTORY AND CLASSIFICATION

The earliest amino acid scoring matrices were based on amino acid properties or genetic
code differences, but modern amino acid scoring matrices are based on empirical
measurements of amino acid replacement frequencies from large sets of homologous sequences
(Schwartz and Dayhoff, 1978). Empirical replacement frequency scoring matrices can be
divided into two types: those with an explicit evolutionary model and the BLOSUM scoring
matrices. Model-based scoring matrices include Dayhoff’s original PAM series of matrices
(Schwartz and Dayhoff, 1978), which were updated by Jones, Taylor, and Thornton (Jones
et al., 1992). More recently, Gonnet (Gonnet et al., 1992) and Vingron and Mueller (VT
and VTML; Mueller et al., 2002) developed model-based parameters using alignments
between more distantly related proteins.

Model-based scoring matrices are appealing because they can be calculated for alignments
at any evolutionary distance. Dayhoff’s original PAM250 matrix was calculated based
on 1572 observed mutations in 71 families of proteins with alignments that were more
than 85% identical. The frequency of mutations was normalized for 1% change (99%
identity), or PAM1, and then extrapolated to much longer evolutionary distances simply by
repeatedly multiplying the replacement frequency matrix by itself.
Thus, PAM10 corresponds to ∼90% identity, PAM30, ∼75% identity, PAM70, ∼55% identity,
PAM120, ∼37% identity, and PAM250, ∼20% identity. Table 3.5.1 presents a more
comprehensive set of scoring matrices and target percent identities. More recently,
Vingron and Mueller described strategies for estimating replacement frequencies that use
measurements from a broader range of evolutionary distances. However, evolutionary
models assume that the model accurately describes
replacement frequencies over long evolutionary times (Mueller et al., 2002).

In 1992, Steve and Jorja Henikoff described a direct approach to counting replacement
frequencies at long evolutionary distances (Henikoff and Henikoff, 1992). The
BLOSUM scoring matrices avoided the problem of extrapolating from PAM1 replacement
frequencies by counting replacement frequencies directly. Rather than relying on
alignments of relatively closely related proteins, they identified conserved BLOCKS, or
ungapped patches of conserved sequences, in sets of proteins that were potentially very
distantly related. They then counted the amino acid replacements within these blocks,
using a percent identity threshold to exclude closely and more moderately related
sequences. In their description of the BLOSUM matrices, they showed that BLOSUM62
performed much more effectively than either the PAM120 (BLOSUM62 equivalent
information content) or the PAM250 matrix (BLOSUM45 equivalent) for identifying distant
homologs. BLOSUM62 was then incorporated as the default for the BLASTP (UNIT 3.4)
program, while FASTA (UNIT 3.9) and SSEARCH (UNIT 3.10) switched to the BLOSUM50
matrix, which is more sensitive than BLOSUM62, but requires longer alignments.

THE ALGEBRA OF SIMILARITY SCORING (LOG-ODDS) MATRICES

Scoring Matrices as Odds Ratios

Local sequence alignments, which are rigorously calculated by the Smith-Waterman algorithm
(Smith and Waterman, 1981) and heuristically calculated by BLASTP (Altschul et al.,
1990; Altschul et al., 1997) or FASTA (Pearson and Lipman, 1988), require scoring
matrices that produce negative values on average between random sequences. If the average or
expected matrix score is positive, the alignment will extend to the ends of the sequences,
and be global, rather than local. Dayhoff’s initial PAM matrices were calculated as log-odds
ratios, the logarithm of the ratio of the alignment frequency observed after a given
evolutionary distance divided by the alignment frequency expected by chance:

$$\log\left(\frac{\text{frequency in homologs}}{\text{frequency by chance}}\right)$$

The Henikoffs used the same odds-ratio algebra when developing the BLOSUM matrices,
but calculated their transition frequencies by counting the number of weighted changes
in different blocks.

In 1991, Altschul published a seminal paper (Altschul, 1991) that showed that any
scoring matrix appropriate for local alignments (one with a negative expected score) could be
treated as a “log-odds” matrix of the form $\lambda s_{ij} = \log(q_{ij}/p_i p_j)$, where
$s_{ij}$ is the score given to the i,j alignment, $q_{ij}$ is the replacement frequency
for amino acid i to j, and the $p_i p_j$ term gives the expected frequency of two amino
acids aligning by chance. The λ term is used to scale the matrix so that individual scores
can be accurately represented with integers. Widely used scoring matrix values typically
range from −10 to +20, reflecting λ scale factors of ln(2)/2 (half-bit units, used by
BLOSUM62 and PAM120) or ln(2)/3 (third-bit units, used by BLOSUM50 and PAM250). For
example, the BLOSUM62 score for aligning aspartic acid (“D”) with itself is +6, and
BLOSUM62 is scaled in 1/2-bit units, so for a D:D alignment in related proteins
$6 = 2.0 \times \log_2(q_{D,D}/p_D p_D)$, which makes the alignment $2^{3} = 8$ times more likely
to occur because of homology than by chance. Likewise, the BLOSUM62 matrix assigns a
D:L alignment a score of −4, which means that it is $2^{2} = 4$ times more likely to occur by
chance than in the homologous blocks aligned for BLOSUM62.

This ratio of homologous replacement frequency to chance alignment frequency
explains why modern scoring matrices can give very different scores to identical residues. In
the denominator, amino acids are not uniformly abundant (common amino acids like
L, A, S, and G are found more than four times more frequently than rare amino acids like W,
C, H, and M; see APPENDIX 1A for a table of the 1-letter amino acid codes), so common
amino acids often have lower identity scores than rare ones. Likewise, amino acids are not
uniformly mutable: A, S, and T change frequently over evolutionary time, while W and C
change rarely. Thus, the highest identity score in the BLOSUM62 matrix (Fig. 3.5.1) is 11,
corresponding to a W:W alignment, while A, I, L, S, and V get identity alignment scores of
4. Differences in identity scores, together with positive scores for nonidentity alignments
between conserved amino acids, explain why sequence similarity scores are dramatically more
sensitive than percent identity for inferring homology (see UNIT 3.1).

Table 3.5.1 Scoring Matrix Target Identity, Information Content, and Alignment Length (a)

Matrix          Gap penalty (b)  % Identity  Bits/position  Random alignment length  50-bit length

SSEARCH version 36.3.6
BLOSUM50 (c)    10/2             25.3        0.21           160                      238
BLOSUM62        11/1             28.9        0.40            86                      125
VTML 160 (c,d)  12/2             23.9        0.25           139                      200
VTML 140        10/1             28.4        0.44            82                      114
VTML 120        11/1             32.1        0.54            62                       93
VTML 80         10/1             40.5        0.74            47                       68
VTML 40         13/1             64.7        1.92            18                       26
VTML 20         15/2             86.1        3.30            11                       15
VTML 10         16/2             90.9        3.87             9                       13

BLAST version 2.2.27+
BLOSUM50 (c)    13/2             29.4        0.39            85                      128
BLOSUM62        11/1             29.6        0.41            82                      122
BLOSUM80        10/1             32.0        0.48            69                      104
PAM70           10/1             33.9        0.58            56                       86
PAM30            9/1             45.9        0.90            34                       56

(a) Median percent identity, bits per aligned position, alignment length, and alignment length
required for a 50-bit score based on searches of 140 random sequences against 240,000 real
protein sequences using the specified scoring matrix and gap penalties.
(b) Gap open/extend penalty, where the total penalty is open + r × extend, where r is the number
of residues in the gap. Thus, a 10/2 penalty produces a penalty of 12 for a one residue gap,
14 for two residues, etc.
(c) Scaled in 1/3-bit units; all other matrices are scaled in 1/2-bit units.
(d) As calculated according to Mueller et al. (2002).

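
The 50-bit length column of Table 3.5.1 appears to follow directly from the bits-per-position
column, and footnote b gives the total gap cost as open + r × extend. The short Python sketch
below reproduces both calculations for a few SSEARCH rows of the table. The function names and
the dictionary are illustrative choices, not part of SSEARCH or BLAST, and rounding to the
nearest residue is an assumption that happens to match the tabulated values.

```python
# Bits per aligned position, copied from the SSEARCH rows of Table 3.5.1.
BITS_PER_POSITION = {
    "BLOSUM50": 0.21,
    "BLOSUM62": 0.40,
    "VTML 80": 0.74,
    "VTML 20": 3.30,
}


def aligned_length_for_bits(bits_per_position, target_bits=50.0):
    """Approximate number of aligned positions needed to reach target_bits.

    Assumption: the score grows linearly with alignment length, and the
    result is rounded to the nearest whole residue.
    """
    return round(target_bits / bits_per_position)


def gap_cost(gap_open, gap_extend, residues):
    """Total penalty for one gap of `residues` residues (footnote b)."""
    return gap_open + residues * gap_extend


if __name__ == "__main__":
    for name, bits in BITS_PER_POSITION.items():
        print(name, aligned_length_for_bits(bits))
    # A 10/2 penalty applied to gaps of one and two residues.
    print(gap_cost(10, 2, 1), gap_cost(10, 2, 2))
```

Under these assumptions a deep matrix such as BLOSUM50 needs an alignment more than ten
times longer than VTML 20 to reach the same 50-bit score.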
A R N D C Q E G H I L K M F P S T W Y V X
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Figure 3.5.1 The BLOSUM62 matrix. The BLOSUM62 matrix used by BLASTP, BLASTX, and
TBLASTN is actually 23 × 23: 20 amino acids plus X (any amino acid), B (D or N), and Z (E or
Q). Only the lower half of the symmetric matrix is shown to highlight the identity scores on the
diagonal. The most positive value is 11 (W:W alignment); the most negative is −4 (found for many
hydrophobic/hydrophilic and small/large replacements). The BLOSUM62 matrix is scaled in 1/2-bit
units, so the W:W alignment of 11 is $2^{5.5} \approx 45$ times more common in homologous proteins than
by chance. Weighted by amino acid abundance, the average similarity score is about −1 half-bits.
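
As a rough illustration of the half-bit scaling described in the caption of Figure 3.5.1, the
following Python sketch converts an integer matrix score into the odds ratio it implies. The
function names are hypothetical; the only inputs are scores quoted in this unit (W:W = 11,
D:D = 6, and D:L = −4 in BLOSUM62) and the stated scale factors ln(2)/2 and ln(2)/3.

```python
import math


def lambda_scale(units_per_bit):
    """Scale factor lambda for a matrix with `units_per_bit` score units per bit.

    units_per_bit = 2 gives ln(2)/2 (half-bit units, e.g., BLOSUM62);
    units_per_bit = 3 gives ln(2)/3 (third-bit units, e.g., BLOSUM50).
    """
    return math.log(2) / units_per_bit


def odds_ratio(score, units_per_bit=2):
    """Odds (homologs vs. chance) implied by a log-odds score: q_ij / (p_i * p_j)."""
    return math.exp(lambda_scale(units_per_bit) * score)


if __name__ == "__main__":
    for pair, score in (("W:W", 11), ("D:D", 6), ("D:L", -4)):
        print(pair, score, round(odds_ratio(score), 2))
```

A negative score gives a ratio below 1: the D:L score of −4 corresponds to 0.25, that is, the
pair is four times more likely to align by chance than in homologs.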

Matrix Information Content, Target Identity, and Alignment Length

In addition to generalizing scoring matrices as log-odds matrices, Altschul (1991)
also showed that log-odds scoring matrices have an associated information content
(relative entropy) or score per aligned position (“bits-per-position”). “Bits-per-position” can
be used to estimate the number of aligned residues required to produce a statistically
significant score. Shallow scoring matrices (e.g., PAM/VTML 10, PAM/VTML 20, or
PAM/VTML 40) have higher information content than deep matrices (BLOSUM62,
PAM250), which means that a shorter alignment (10 to 50 residues) can produce a
statistically significant score. At the same time, shallower matrices tend to produce higher
identity alignments, because they give higher positive scores to identities and more
negative scores to replacements (Table 3.5.1 and Fig. 3.5.2). For example, if an alignment needs
a 50-bit score to be significant in a database search (UNIT 3.1), and the average bit score
for BLOSUM62 is about 0.4 bits per aligned position (Table 3.5.1), then about 50/0.4 =
125 residues must be included in the alignment. In contrast, the VTML 20 matrix provides
about 3.3 bits per aligned position, so even a 15-residue alignment can be significant. Thus,
in a large-scale similarity search that needs a 50-bit score for statistical significance,
domains shorter than 125 amino acids, or DNA exons shorter than 375 nucleotides, often would
not produce statistically significant scores with BLOSUM62, the default matrix used by
BLAST, while exons shorter than 50 nucleotides can easily be detected with VTML 20.

“Shallow” scoring matrices have more information content because they give more
positive scores to identities and more negative scores to nonidentical replacements by
varying the $q_{ij}$ term in the log-odds matrices (the $p_i p_j$ values do not depend on
evolutionary distance). From the evolutionary perspective, sequences that have diverged for
less time, e.g., 10% to 20% change, will have more identical residues and fewer replacements
simply because there has been less time for the sequences to change. Alternatively, sequences
that have less than 25% identity because of a large amount of change will have many
fewer identities and many more conservative replacements (PAM200 sequences will be less
than 25% identical, on average). The numerical basis for this difference can be seen in
Fig. 3.5.2, which compares parts of a “shallow” (VTML 20) and “deep” (BLOSUM62)
matrix. Thus, in addition to differing in information content, scoring matrices have a range of
target percent identities and alignment lengths (Table 3.5.1). Shallower scoring matrices
produce shorter, more identical alignments, because they give more negative scores to
nonidentical aligned residues. “Deeper” scoring matrices produce longer alignments with
lower percent identities because the penalty for a mismatch is much lower and more
conservative nonidentities get positive scores.

VTML 20
      A   R   N   D   C   Q   E
  A   7
  R  -7   8
  N  -6  -5   8
  D  -6 -12  -1   8
  C  -3  -7  -8 -14  12
  Q  -5  -2  -4  -4 -13   9
  E  -5 -10  -5  -1 -14  -1   7

BLOSUM62
      A   R   N   D   C   Q   E
  A   4
  R  -1   5
  N  -2   0   6
  D  -2  -2   1   6
  C   0  -3  -3  -3   9
  Q  -1   1   0   0  -3   5
  E  -1   0   0   2  -4   2   5

Figure 3.5.2 Comparison of a “shallow” (VTML 20) and “deep” (BLOSUM62) scoring matrix.
Both matrices are scaled in 1/2-bits. For the small part of the matrices shown here, the VTML20
matrix produces an average 2.80 half-bit identity score, and an average −0.59 nonidentical score
(weighted by amino-acid abundance). In contrast, BLOSUM62 produces 1.86 for identities but
only −0.06 for nonidentities. Thus, VTML20 targets shorter, higher-identity alignments, because
it penalizes nonidentities much more strongly.

In practice, the relationship between scoring matrix evolutionary distance, information
content, percent identity, and alignment length suggests two reasons for changing from the
BLOSUM62 and BLOSUM50 matrices used by BLASTP and SSEARCH/FASTA. First,
one should change to a shallower matrix when looking for short alignments. We need a
shallower scoring matrix for short domains, short exons, or short DNA reads because deep
scoring matrices like BLOSUM62 do not have enough information content to produce
significant scores. Short alignments require shallow scoring matrices.

One should also use a shallower scoring matrix when looking for orthologs (sequences
that differ because of speciation events and are likely to share similar functions)
between “relatively” closely related organisms (100 to 500 My). Protein sequence
comparison algorithms are very sensitive; BLASTP and SSEARCH routinely find
significant alignments between human and yeast (1.2 billion year divergence) and human
and E. coli (>2.4 billion years). Because of this sensitivity, a mouse-human comparison often
reports not only the orthologs (sequences that diverged at the primate/rodent split 80 million
years ago), but also dozens of more distantly related paralogs that may have diverged 200
to 2000 million years ago. Mouse and human orthologs share about 83% amino acid
identity; thus, for mammals, the VTML 20 matrix is expected to find all orthologs and paralogs
that have diverged over the past 200 million years, but the matrix is much less likely to
identify paralogs that share less than 40% sequence identity (divergence time > 1000
million years).

SCORING MATRICES AND GAP PENALTIES

While there is an intuitive mathematical explanation for pairwise similarity scores from
the log-odds perspective, sensitive sequence alignments require both aligned residues and
insertion or deletion gaps. Unfortunately, we do not have an analytical model for gap
penalties and evolutionary distances. The default gap penalties provided for BLASTP,
SSEARCH, and FASTA were determined empirically (e.g., Pearson, 1991) with a
focus on identifying distant homologs. In general, default gap penalties for BLASTP and
SSEARCH/FASTA are set as low as possible; lower gap penalties would convert alignments
from local to global, which would invalidate the statistical estimates. Thus, when
considering whether to change gap penalties to improve search selectivity for a particular
protein family, gap penalties should be increased (made more stringent), not decreased. Just as
“shallower” scoring matrices target less divergence by giving higher scores to identities and more
negative scores to nonidentities, gap penalties should increase with shallower scoring
matrices (Reese and Pearson, 2002). Simulations to maximize the significance of short
alignments suggest that for 1/2-bit scoring matrices, gap open penalties of
16.7−(0.067 × pam-distance), e.g., 16.7−(0.067 × 20) ≈ 15 for
VTML 20, and gap extend penalties of 2, are most effective (Reese and Pearson, 2002).

Low gap penalties can dramatically reduce the information content and average percent
identity associated with a scoring matrix, and can dramatically increase the lengths of
alignments produced by the matrix. The target percent identity, information content, and
alignment lengths presented in Table 3.5.1 reflect the observed median values of the
highest-scoring alignment produced by random queries against real protein sequences
with the specified matrix and gap penalties. If gaps are not allowed, the average percent
identity and information content increase and alignment length gets shorter. For example, if
gaps are not allowed with BLOSUM62, the median percent identity increases from 28.9
(Table 3.5.1) to 33, information content almost doubles from 0.40 to 0.74, and median random
alignment length drops from 86 to 45 residues. A similar effect is seen with VTML 80, where
information content increases and alignment lengths decrease almost 2-fold when gaps are
not allowed. Gap effects are less dramatic with shallower matrices like VTML 20 (from 86%
to 89% identity, from 3.3 to 3.5 bits per position, and from 11 to 10 residue median
alignment lengths) because short evolutionary distances should allow many fewer insertions and
deletions.

BLASTP Gap Penalties with Shallow Scoring Matrices

While the BLAST programs offer a set of scoring matrices with different evolutionary
horizons (BLOSUM50 and BLOSUM62 are “deep”; PAM30 is relatively “shallow”),
the modest gap penalties provided with their shallow matrices dramatically modify their
effective evolutionary distance (Table 3.5.1). The “shallowest” combination of scoring
matrix (PAM30) and gap penalties (9/1) requires an average of 56 aligned amino acids, or
more than 160 nucleotides, to produce a 50-bit alignment score. Because these gap
penalties are too low (Reese and Pearson, 2002), the BLAST protein matrices are less
effective for short alignments or short evolutionary distances than they would be with higher
penalties.

LONG ALIGNMENTS AND OVEREXTENSION

In addition to differing in information content (score or “bits” per aligned position)
and optimal evolutionary distances (percent identity), different scoring matrices have
different preferred alignment lengths (Table 3.5.1). Shallow scoring matrices have
large negative values for amino acid replacements (Fig. 3.5.2), so alignments to
nonhomologous (random) sequences will be short. Deep scoring matrices have less negative
average replacement scores (VTML20’s average nonidentity score is −5.8 half-bits, while
BLOSUM62’s is −1.2 half-bits), so their alignments tend to be longer. Table 3.5.1
(random alignment length column) summarizes the median alignment length between random
queries and real protein sequences. BLAST and SSEARCH/FASTA statistics are very
accurate (UNIT 3.1), so sequences that share statistically significant scores will always share
a homologous domain. However, BLAST and SSEARCH/FASTA calculate local sequence
alignments (the alignments begin and end at a position that maximizes the alignment score),
so the boundaries of the alignment depend on both the location of the homologous domain
and the scoring matrix used to produce the alignment. When a deep scoring matrix like
BLOSUM62 is used to align more closely related sequences, the alignment can extend
(overextend) into nonhomologous neighboring sequence. Gonzalez and Pearson (2010)
termed this artifact “homologous overextension,” and showed that it is a major source of
errors in PSI-BLAST searches.

Homologous overextension often occurs from short repeated domains. For example,
Figure 3.5.3A shows a BLASTP alignment of VAV_HUMAN (P15498) with
SKAP2_XENTR (Q5FVW6), a protein that contains an SH3 domain that is homologous
over 58 amino acids. However, the alignment is 198 residues long; the additional 140
residues in the alignment include a 100-residue Pleckstrin domain in SKAP2_XENTR that
is not homologous (VAV_HUMAN contains an SH2 domain in the region that aligns to
the Pleckstrin domain in SKAP2_XENTR). The 58-residue homologous SH3 domain
contributes 85% of the bit score, with the additional 140 residues contributing less than 15%
of the score. Using the slightly more stringent (shallower) BLOSUM80 matrix does not
change the alignment overextension.

The FASTA programs offer a new option for identifying homologous overextension:
subdomain scoring (Fig. 3.5.3B). By using the domain annotations available for one of
the sequences to subdivide the alignment, it becomes apparent that the 58-residue SH3
domain is responsible for almost all of the
significant similarity found. It is often very difficult to judge the quality of a distant
alignment visually; subdomain scoring provides a quantitative strategy for identifying
overextension.

A
sp|P15498.4|VAV_HUMAN Proto-oncogene vav: Length: 845
sp|Q5FVW6.2|SKAP2_XENTR Src kinase-associated phosphoprotein 2 Length: 328

Range 1: 128 to 326

Score Expect M Identities Positives Gaps
45.8 bits(107) 1e-05 Comp. matrix adjust. 49/202(24%) 78/202(38%) 12/202(5%)

[ SH2
Query 649 WFPCNRVKPYVHGPPQDLSVHL WYAGPMERAGAESILAN--RSDGTFLVRQRVKDAAEFA 706
W C Y +G +D ++ RA L + D F + K +FA
Sbjct 128 WCVCTNSMFYYYGSDKDKQQKGAFSLDGYRAKMNDTLRKDAKKDCCFEIFAPDKRVYQFA 187

]
Query 707 ISIKYNVEVKHIKIMTAEGLYRITEKKAFRGLTELVEFYQQNSLKDCFKSLDTTLQFPF K 766
S E IM + G +++ + + + V+ + +D ++ L + P
Sbjct 188 ASSPKEAEEWVNIIMNSRGNIPTEDEELYDDVNQEVDASHE---EDIYEELPEESEKPVT 244
Pleckstrin ]
[ SH3
Query 767 EPEKRTISRPAVGST K------YFGTAKARYDFCARDRSELSLKEGDIIKILNKK-GQQG 819
E E + V +T Y + +D ELS K GD I IL+K+ G
Sbjct 245 EIETPKATPVPVNNTSGKENT DYANFYRGLWDCTGDHPDELSFKHGDTIYILSKEYNTYG 304
[
]
Query 820 WWRGEIYGRVGWFPANYVEEDY 841
WW GE+ G +G P Y+ E Y
Sbjct 305 WWVGEMKGTIGLVPKAYIMEMY 326
]

B
sp|P15498.4|VAV_HUMAN Proto-oncogene vav
sp|Q5FVW6.2|SKAP2_XENTR Src kinase-associated phosphoprotein (328 aa)
sRegion: 626-725:103-206 : score=7; bits=8.7; Id=0.202; Q= 0.0 : Pleckstrin
qRegion: 671-765:150-243 : score=7; bits=8.7; Id=0.175; Q= 0.0 : SH2
qRegion: 782-841:260-326 : score=83; bits=35.8; Id=0.343; Q=53.4 : SH3
sRegion: 783-841:266-326 : score=88; bits=37.6; Id=0.383; Q=58.7 : SH3
s-w opt: 116 bits: 47.6 E(454402): 0.0006

Figure 3.5.3 Overextension of an alignment of homologous SH3 domains. (A) BLASTP alignment of
VAV_HUMAN with SKAP2_XENTR. The two proteins share a homologous SH3 domain (highlighted in red) over
about 58 amino acids that contributes more than 85% of the similarity score. The remaining 140 amino acid
alignment juxtaposes an SH2 domain from VAV_HUMAN (brown) with a Pleckstrin domain from SKAP2_XENTR
(green). These two domains are not homologous; they are classified as having different folds in SCOP. (B)
Sub-alignment scores produced by the SSEARCH36 program using the same scoring matrix as BLASTP
(BLOSUM62, 11/1) for the VAV_HUMAN / SKAP2_XENTR alignment. Boundaries for annotated domains in the
two proteins were taken from InterPro using the query VAV_HUMAN (qRegion) or the subject SKAP2_XENTR
(sRegion). Thus, 103-206 for the Pleckstrin domain comes from InterPro annotations for SKAP2_XENTR, and
671-765 for the SH2 domain comes from annotations for VAV_HUMAN. The raw score, bit-score, and percent
identity are shown for the subregions. The Q-score is −10 log10(p-value) based on the bit score; thus
Q = 30 corresponds to a probability (uncorrected for database size) of 0.001.

SCORING MATRICES FOR DNA

DNA scoring matrices, which are usually implemented as match/mismatch scores, can
also be treated as log-odds matrices with target evolutionary distances (States et al., 1991). For
example, in its most sensitive mode (-task blastn), BLASTN by default uses a score of +2
for a match and −3 for a mismatch, which targets sequences at PAM10, or 90% identity
(States et al., 1991). By default, searches on the NCBI nucleotide BLAST Web site use
MEGABLAST (-task megablast), with match/mismatch scores of +1/−3 that target
sequences that are 99% identical. By default, the FASTA program uses +5/−4 (also
available with BLASTN, -task blastn), which corresponds approximately to PAM 40,
or 70% identity. Because DNA sequence comparison is much less sensitive than protein
sequence comparison, it is very difficult to detect statistically significant DNA:DNA sequence
similarity at distances greater than PAM 40 (PAM 40 is a short distance for protein
comparisons).

In practice, the effective target identity for heuristic methods like BLAT, BLASTN,
MEGABLAST, and other genome-alignment programs that do use scoring matrices may
be difficult to estimate from the reported match/mismatch scores. Heuristic programs
typically use a hierarchy of filters to accelerate the similarity search, and each of those
filters will affect the percentage identity and evolutionary distance of the alignments that
are displayed. As a result, it is possible that the displayed alignments may have a lower
percent identity than other possible alignments that were excluded during the early stages of
the filtering process.

Ideally, the match/mismatch penalties used in genome alignment would match the
evolutionary distances of the sequences being aligned; human DNA is expected to be more
than 99.9% identical to itself, but human-mouse alignments in protein-coding regions
will be less than 80% identical (outside of protein-coding regions, identity will
typically be undetectable at <50%). Likewise, match/mismatch parameters should reflect
potential alignment length; searches with short sequences will need higher match/mismatch
ratios with higher information content (States et al., 1991).

SUMMARY

The BLAST and FASTA/SSEARCH protein-alignment programs use “deep” similarity
scoring matrices like BLOSUM62 or BLOSUM50 to identify homologs that share
less than 25% sequence identity. Deep scoring matrices require long sequence alignments
to achieve statistically significant similarity scores and are more likely to extend
alignments outside the homologous region. Shallower scoring matrices are more effective
when searching for short homologous domains or short (<150-nt) exons, or when
searching over shorter evolutionary distances. Scoring matrices that are matched to the
evolutionary distance of the homologous sequences are also less likely to produce homologous
overextension.

The match/mismatch ratios used in DNA similarity searches also have target
evolutionary distances. The stringent match/mismatch ratios used by MEGABLAST are most
effective at matching sequences that are essentially 100% identical, e.g., mRNA sequences
to genomic exons. Deeper, more sensitive DNA scoring parameters are more effective
for longer DNA evolutionary distances, e.g., mouse-human.

While scoring matrices and gap penalties can dramatically affect search sensitivity
and alignment regions, modern sequence-comparison programs provide accurate
similarity statistics, so it is unlikely that the wrong scoring matrix will produce a significant match
to a nonhomologous protein. However, the wrong matrix can prevent short homologous
regions from being found, or allow an overextension into a nonhomologous region from
a homologous domain. The rapidly increasing volume of protein sequence means that
close homologs will often be available, and shallower scoring matrices can produce more
reliable, functionally informative alignments when closer homologs (>50% identical) are
found.

ACKNOWLEDGMENTS

W.R.P. has been supported by funding from the National Library of Medicine (LM04969).

LITERATURE CITED

Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective.
J. Mol. Biol. 219:555-565.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. A basic local alignment
search tool. J. Mol. Biol. 215:403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997.
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.
Nucleic Acids Res. 25:3389-3402.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein
sequence database. Science 256:1443-1445.
Gonzalez, M.W. and Pearson, W.R. 2010. Homologous over-extension: A challenge for iterative
similarity searches. Nucleic Acids Res. 38:2177-2189.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc.
Natl. Acad. Sci. U.S.A. 89:10915-10919.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data
matrices from protein sequences. Comp. Appl. Biosci. 8:275-282.
Mueller, T., Spang, R., and Vingron, M. 2002. Estimating amino acid substitution models: A
comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method.
Mol. Biol. Evol. 19:8-13.
Pearson, W.R. 1991. Searching protein sequence libraries: Comparison of the sensitivity and
selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11:635-650.
Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc.
Natl. Acad. Sci. U.S.A. 85:2444-2448.
Reese, J.T. and Pearson, W.R. 2002. Empirical determination of effective gap penalties for
sequence comparison. Bioinformatics 18:1500-1507.
Schwartz, R.M. and Dayhoff, M. 1978. Matrices for detecting distant relationships. In Atlas of
Protein Sequence and Structure, Volume 5, Supplement 3 (M. Dayhoff, ed.), pp. 353-358. National
Biomedical Research Foundation, Silver Spring, Maryland.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J.
Mol. Biol. 147:195-197.
States, D.J., Gish, W., and Altschul, S.F. 1991. Improved sensitivity of nucleic acid database
searches using application-specific scoring matrices. Methods Enzymol. 3:66-70.
