Mount - 2008 - Using PAM Matrices in Sequence Alignments
Topic Introduction
Adapted from “Alignment of Pairs of Sequences,” Chapter 3, in Bioinformatics: Sequence and Genome
Analysis, 2nd edition, by David W. Mount. Cold Spring Harbor Laboratory Press, Cold Spring Harbor,
NY, USA, 2004.
INTRODUCTION
Certain amino acid substitutions commonly occur in related proteins from different species. Because
a protein still functions with these substitutions, the substituted amino acids are compatible with
protein structure and function. Knowing the types of changes that are most and least common in a large
number of proteins can assist with predicting alignments for any set of protein sequences. If related
protein sequences are quite similar, they are easy to align, and one can readily determine the
single-step amino acid changes. If ancestor relationships among a group of proteins are assessed, the most
likely amino acid changes that occurred during evolution can be predicted. This type of analysis was
pioneered by Margaret Dayhoff and used by her to produce a type of scoring matrix called a percent
accepted mutation (PAM) matrix. This article introduces Dayhoff PAM matrices, explains how they are
constructed and how they can be used for sequence alignments, and highlights their strengths and
limitations.
RELATED INFORMATION
The use of blocks amino acid substitution matrices (BLOSUM) is described in Using BLOSUM in
Sequence Alignments (Mount 2008a), while PAM and BLOSUM are compared in Comparison of
the PAM and BLOSUM Amino Acid Substitution Matrices (Mount 2008b). PAM matrices are based
on a Markov model of protein evolution. This model is tested in A Test of the Markov Model of
Evolution in Proteins (Mount 2008c). The appropriate choice for gap penalties to be used with
various matrices is discussed in Using Gaps and Gap Penalties to Optimize Pairwise Sequence
Alignments (Mount 2008d). BLOSUM and other scoring matrices are compared in combination with
various alignment algorithms and gap penalties in Studies of Varying Alignment Algorithm, Amino
Acid Scoring Matrix, and Gap Penalties (Mount 2008e). Several other scoring matrices for
database searches are described in Strategies for Sequence Similarity Database Searches (Mount
2007a), Using a FASTA Sequence Database Similarity Search (Mount 2007b), Using the Basic
Local Alignment Search Tool (BLAST) (Mount 2007c), and Steps Used by the BLAST Algorithm
(Mount 2007d).
© 2008 Cold Spring Harbor Laboratory Press. Vol. 3, Issue 6, June 2008
in a relatively short period of time, so that they are still 50% or more similar. Another gives the changes
expected of proteins that have diverged over a much longer period, leaving only 20% similarity. These
predicted changes are used to produce optimal alignments between two protein sequences and to
score the alignment. The assumption in this evolutionary model is that the amino acid substitutions
observed over short periods of evolutionary history can be extrapolated to longer distances. In
contrast, another type of matrix, BLOSUM (see Using BLOSUM in Sequence Alignments [Mount
2008a]), is based on scoring substitutions found over a range of evolutionary periods. The BLOSUM
matrices reveal that substitutions are not always as predicted by the PAM model.
In deriving the PAM matrices, each change in the current amino acid at a particular site is assumed
to be independent of previous mutational events at that site (Dayhoff 1978). Thus, the probability of
change of any amino acid a to amino acid b is the same, regardless of the previous changes at that
site and also regardless of the position of amino acid a in a protein sequence. Amino acid substitutions
in a protein sequence are thus viewed as a Markov model (see A Test of the Markov Model of
Evolution in Proteins [Mount 2008c]), which is characterized by a series of changes of state in a
system such that a change from one state to another does not depend on the previous history of the
state. Use of this model makes it possible to extrapolate amino acid substitutions observed over a
relatively short period of evolutionary time to longer periods of evolutionary time.
To prepare the Dayhoff PAM matrices, amino acid substitutions that occurred in a group of
evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least
85% similar. Because these changes were observed in closely related proteins, they represented amino
acid substitutions that do not significantly change the function of the protein. Hence they are called
“accepted mutations,” defined as amino acid changes “accepted” by natural selection. Similar
sequences were first organized into a phylogenetic tree. The number of changes of each amino acid
into every other amino acid was then counted. To make these numbers useful for sequence analysis,
information on the relative amount of change (relative mutabilities) for each amino acid was needed.
Relative mutabilities were evaluated by counting, in each group of related sequences, the
number of changes of each amino acid and by dividing this number by a factor, called the exposure to
mutation of the amino acid. This factor is the product of the frequency of occurrence of the amino
acid in that group of sequences being analyzed and the total number of all amino acid changes that
occurred in that group per 100 sites. This factor normalizes the data for variations in amino acid
composition, mutation rate, and sequence length. The normalized frequencies were then summed for all
sequence groups. By these scores, Asn, Ser, Asp, and Glu were the most mutable amino acids, and Cys
and Trp were the least mutable. An example for changing Phe to any other amino acid is shown in
Figure 1 and in the next section.
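As a rough illustration of the exposure-to-mutation normalization described above, the following Python sketch computes relative mutabilities for two made-up sequence groups. The group data, the dictionary layout, and the function name `relative_mutability` are assumptions for illustration only; none of the numbers come from the Dayhoff tables, and the calculation simply follows the wording of the text (changes divided by exposure, summed over groups).

```python
# Hypothetical toy data: two groups of related sequences.
# "changes": observed changes involving each amino acid in the group
# "frequency": fraction of residues in the group that are that amino acid
# "changes_per_100_sites": all amino acid changes in the group per 100 sites
TOY_GROUPS = [
    {
        "changes": {"Asn": 12, "Trp": 1},
        "frequency": {"Asn": 0.04, "Trp": 0.01},
        "changes_per_100_sites": 5.0,
    },
    {
        "changes": {"Asn": 6, "Trp": 1},
        "frequency": {"Asn": 0.05, "Trp": 0.02},
        "changes_per_100_sites": 4.0,
    },
]


def relative_mutability(groups):
    """Sum, over all groups, the changes of each amino acid divided by its exposure."""
    totals = {}
    for group in groups:
        for amino_acid, n_changes in group["changes"].items():
            # Exposure to mutation = frequency of the amino acid in the group
            # multiplied by the total changes per 100 sites in that group.
            exposure = group["frequency"][amino_acid] * group["changes_per_100_sites"]
            totals[amino_acid] = totals.get(amino_acid, 0.0) + n_changes / exposure
    return totals


mutability = relative_mutability(TOY_GROUPS)
for amino_acid, value in sorted(mutability.items()):
    print(amino_acid, round(value, 2))
```

In this toy data set Asn comes out more mutable than Trp, which mirrors the ranking reported in the text, but the values themselves are arbitrary.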
The above amino acid exchange counts and mutability values were then used to generate a 20 ×
20 mutation probability matrix representing all possible amino acid changes. Because amino acid
change was modeled by a Markov model, in which the probability of mutation at each site is
independent of the previous history of mutations, the changes predicted for more distantly related
proteins that have undergone N percent mutations could be calculated. A PAM1 matrix showing the
relative frequencies of change for each amino acid into any other adding up to a total frequency of
1% change was first calculated. According to the Markov model, the PAM1 matrix could be multiplied
by itself N times (see Fig. 1B) to give transition matrices for comparing sequences with lower and
lower levels of similarity due to separation over longer periods of evolutionary history. Thus, the
commonly used PAM250 matrix represents a level of 250% of change expected in 2500 Myr (million
years). Although this amount of change seems very large, sequences at this level of divergence still
have ~20% similarity. For example, alanine will be matched with alanine 13% of the time and with
another amino acid 87% of the time.
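The extrapolation by repeated matrix multiplication can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the three-letter alphabet and the values in `TOY_PAM1` are invented (a real PAM1 matrix is 20 × 20), rows are taken to be the original residue and columns the replacement, and the helper names `mat_mul` and `mat_power` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, n):
    """Return the product of n copies of matrix (n >= 1)."""
    result = matrix
    for _ in range(n - 1):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM1-like matrix: each row sums to 1, and each letter has a
# 1% chance of changing in one step. These are not Dayhoff's values.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam2 = mat_power(TOY_PAM1, 2)      # the single multiplication of Figure 1B
toy_pam250 = mat_power(TOY_PAM1, 250)  # analogue of the PAM250 distance

print("P(no change), 1 step:   ", TOY_PAM1[0][0])
print("P(no change), 2 steps:  ", round(toy_pam2[0][0], 6))
print("P(no change), 250 steps:", round(toy_pam250[0][0], 4))
```

Each row of the resulting matrices still sums to 1, and the diagonal (no-change) entries shrink as the number of steps grows, which is the behavior the Markov model predicts for increasingly divergent sequences.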
The percentage of remaining similarity for any PAM matrix can be calculated by summing the
percentages for amino acids not changing (Ala versus Ala, etc.) after multiplying each by the frequency
of that amino acid in the database (e.g., 0.089 for Ala) (Dayhoff 1978). The PAM120, PAM80, and
PAM60 matrices should be used for aligning sequences that are 40%, 50%, and 60% similar,
respectively. Simulations by George et al. (1990) have shown that, as predicted, the PAM250 matrix provides
a better-scoring alignment than lower-numbered PAM matrices for distantly related proteins of
14%–27% similarity. Do not confuse this mutation probability form of the PAM250 matrix with the log odds
form of the matrix described below.
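The remaining-similarity calculation is a frequency-weighted sum of the diagonal of the mutation probability matrix. The sketch below assumes a made-up three-letter alphabet (`toy_freqs` and `toy_matrix` are invented) and a hypothetical function name `expected_identity`; only the alanine figures (frequency 0.089 and a 13% chance of Ala remaining Ala at PAM250) are taken from the text.

```python
def expected_identity(frequencies, transition):
    """Sum over letters of (background frequency) x (probability of no change)."""
    return sum(f * transition[i][i] for i, f in enumerate(frequencies))


# Hypothetical background frequencies and mutation probability matrix
# (rows = original letter, columns = replacement letter).
toy_freqs = [0.5, 0.3, 0.2]
toy_matrix = [
    [0.40, 0.35, 0.25],
    [0.45, 0.30, 0.25],
    [0.50, 0.30, 0.20],
]

identity = expected_identity(toy_freqs, toy_matrix)
print("Toy remaining similarity:", round(identity, 4))

# Contribution of alanine alone at PAM250, using the two values quoted in the text.
ala_contribution = 0.089 * 0.13
print("Ala contribution at PAM250:", round(ala_contribution, 5))
```

For the real PAM250 matrix, the same sum taken over all 20 amino acids is what yields the figure of roughly 20% similarity quoted above.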
FIGURE 1. (A) Normalized probability scores for changing Phe to any other amino acid (or of not changing) at PAM1
and PAM250 evolutionary distances. (B) The multiplication of two PAM1 matrices to give a PAM2 matrix.
PAM matrices are usually converted into another form, called log odds matrices. Because the
direction of mutation is not known, the log odds score in the scoring matrix for changing amino acid
a into amino acid b is the same as that for changing b into a. These scores are the average of the observed
changes of a to b, and of b to a. The odds score represents the ratio of the chance of amino acid
substitution by two different hypotheses: one that the change actually represents an authentic
evolutionary variation at that site (the numerator), and the other that the change occurred because of
random sequence variation of no biological significance (the denominator). Odds ratios are converted
to logarithms to give log odds scores for convenience in multiplying odds scores of amino acid pairs
in an alignment by adding the logarithms (Fig. 2).
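The equivalence between multiplying odds and adding log odds can be checked numerically. In the sketch below the four odds ratios in `pair_odds` are invented for illustration, and the function names are hypothetical; the half-bit scale (2 × log2 of the odds) is the one described in the Figure 2 legend.

```python
import math


def half_bits(odds):
    """Convert an odds ratio to a log odds score in half-bits: 2 * log2(odds)."""
    return 2 * math.log2(odds)


def odds_from_half_bits(score):
    """Convert a half-bit score back to an odds ratio: 2 ** (score / 2)."""
    return 2 ** (score / 2)


# Hypothetical odds ratios for the four aligned pairs of a short alignment.
pair_odds = [2.0, 1.0, 4.0, 0.5]

product_of_odds = math.prod(pair_odds)
sum_of_scores = sum(half_bits(o) for o in pair_odds)

print("Product of odds:", product_of_odds)
print("Sum of half-bit scores:", sum_of_scores)
print("Odds recovered from summed score:", odds_from_half_bits(sum_of_scores))
```

With these toy values the summed score is 4 half-bits, which converts back to a fourfold odds ratio, the same relationship stated in the Figure 2 legend.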
FIGURE 2. Use of amino acid substitution matrix to evaluate an alignment of two protein sequences. The score for each
amino acid pair (Tyr/Phe, etc.) is looked up in the BLOSUM62 matrix. Each value represents an odds score, the
likelihood that the two amino acids will be aligned in alignments of similar proteins divided by the likelihood that they will
be aligned by chance in an alignment of unrelated proteins. In a series of individual matches in an alignment, these odds
scores are multiplied to give an overall odds score for the alignment itself. For convenience, odds scores are converted
to log odds scores so that the values for amino acid pairs in an alignment may be summed to obtain the log odds score
of the alignment. In this case, the logarithms are calculated to the base 2 and multiplied by 2 to give values designated
as half-bits (a bit is the unit of an odds score that has been converted to a logarithm to the base 2). The value of 4
indicates that the 4 amino acid alignment is 2^(4/2) = fourfold more likely than expected by chance.
At one time, the PAM250 scoring matrix was modified in an attempt to improve the alignments
obtained when PAM250 was used as the scoring matrix with dynamic programming. All scores for
matching a particular amino acid were normalized to the same mean and standard deviation, for
example, by summing the scores and choosing the mean, and all amino acid identities were given the
same score to provide an equal contribution for each amino acid in a sequence alignment (Gribskov
and Burgess 1986). These modifications were included as the default matrices for the GCG sequence
alignment programs in versions 8 and earlier; they are optional in later versions. Their use is not
recommended because these modifications will not give an optimal alignment that is in accord with the
evolutionary model.
Calculations for Obtaining the Log Odds Score for Changes between Phe and Tyr at an Evolutionary Distance of 250 PAMs
An example for changing Phe to any other amino acid is presented here and in Figure 1.
1. Of 1572 observed amino acid changes, there were 260 changes between Phe and Tyr. These
numbers were multiplied by (1) the relative mutability of Phe (see previous section) and (2) the
fraction of Phe to Tyr changes over all changes of Phe to any other amino acid (because Phe to Tyr
and Tyr to Phe changes are not distinguished in the original mutation counts, sums of changes are
used to calculate the fraction) to obtain a mutation probability score of Phe to Tyr. A similar score
was obtained for changes of Phe to each of the other 18 amino acids, and also for the calculated
probability of not changing at all. The resulting 20 scores were summed and divided by a
normalizing factor such that their sum represented a probability of change of 1%, as illustrated in
Figure 1.
In this matrix, the score for changing Phe to Tyr was 0.0021, as opposed to a score of Phe not
changing at all of 0.9946, as shown in Figure 1A. These calculations were repeated for Tyr
changing to any other amino acid. The score for changing Tyr to Phe was 0.0028, and that of not
changing Tyr was 0.9946 (not shown). These scores were placed in the PAM1 matrix, in which the overall
probability of each amino acid changing to another is ~1%, and that of each not changing is
~99%.
2. The above PAM1 matrix was multiplied by itself 250 times to obtain the distribution of changes
expected for 250 PAMs of evolutionary change, as illustrated in Figure 1B for the first of these
multiplications. These changes can include both forward changes to another amino acid and reverse
changes to a former one. At this distance, the probability of change of Phe to Tyr was 0.15 as
opposed to a probability of 0.32 of no change in Phe. The corresponding probabilities for Tyr to
Phe at 250 PAMs, obtained through the matrix multiplication described above, were 0.20 and 0.31
for no change.
3. The log odds values for changes between Phe and Tyr were then calculated. The Phe to Tyr
score in the 250 PAM matrix, 0.15, was divided by the frequency of Phe in the sequence data,
0.040, to give the relative frequency of change. This ratio, 0.15/0.04 = 3.75, was converted to
a logarithm to the base 10 (log10(3.75) = 0.57) and multiplied by 10 to remove fractional values
(0.57 × 10 = 5.7). Similarly, the Tyr-to-Phe ratio is 0.20/0.03 = 6.7, the logarithm of this
number is log10(6.7) = 0.83, and multiplying by 10 gives 0.83 × 10 = 8.3. The average of 5.7 and
8.3 is 7, the number entered in the log odds table for changes between Phe and Tyr at 250
PAMs of evolutionary distance.
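The arithmetic of step 3 can be reproduced directly. The sketch below uses only the four numbers quoted in the steps above (0.15, 0.040, 0.20, and 0.03); the function name `scaled_log_odds` is a hypothetical label, and intermediate values are not rounded, so the Tyr-to-Phe score comes out slightly below the 8.3 obtained in the text from the rounded ratio 6.7.

```python
import math


def scaled_log_odds(probability, frequency):
    """Return 10 * log10(probability / frequency), as in step 3 of the text."""
    return 10 * math.log10(probability / frequency)


# Values quoted in the text for an evolutionary distance of 250 PAMs.
phe_to_tyr = scaled_log_odds(0.15, 0.040)  # about 5.7
tyr_to_phe = scaled_log_odds(0.20, 0.03)   # about 8.2 without intermediate rounding

average = (phe_to_tyr + tyr_to_phe) / 2

print("Phe to Tyr:", round(phe_to_tyr, 2))
print("Tyr to Phe:", round(tyr_to_phe, 2))
print("Table entry (rounded average):", round(average))
```

Either way of rounding leads to the same integer entry of 7 for the Phe/Tyr pair.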
The log odds form of the PAM250 matrix, which is sometimes referred to as the mutation data
matrix (MDM) at 250 PAMs and also as MDM78, is shown in Figure 3. The log odds scores in this
matrix lie within the range of −8 to +17. A value of 0 indicates that the frequency of the
substitution between a matched pair of amino acids in related proteins is as expected by chance; a value
less than 0 or greater than 0 indicates that the frequency is less than or greater than that expected
by chance, respectively. Using such a matrix, a high positive score between two amino acids means
that the pair is more likely to be found aligned in sequences that are related than in unrelated
sequences. The highest-scoring replacements are for amino acids whose side chains are chemically
similar, as might be expected if the amino acid substitution is not to impede function. In the
original data, the largest number of observed changes (83) was between the acidic amino acids Asp
(D) and Glu (E), which have similar chemical properties. This number is reflected as a log odds
score of +3 in the MDM. Many changes were not observed. For example, there were no changes
between Gly (G) and Trp (W), resulting in a score of −7 in the matrix.
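Scoring an ungapped alignment with a log odds matrix amounts to looking up each aligned pair and adding the values. The following sketch is a minimal illustration that contains only the three PAM250 entries quoted in the Figure 3 legend (Y/Y = 10, Y/T = −3, Y/P = −5); the dictionary `PAM250_SUBSET` and the function names are assumptions, and a real application would load the full 20 × 20 matrix.

```python
# Only the three entries quoted in the Figure 3 legend; keys are sorted pairs
# because the log odds matrix is symmetric.
PAM250_SUBSET = {
    ("Y", "Y"): 10,
    ("T", "Y"): -3,
    ("P", "Y"): -5,
}


def pair_score(a, b):
    """Look up the log odds score for an aligned pair, ignoring order."""
    return PAM250_SUBSET[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2):
    """Sum the pair scores of an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("YY", "YY"))  # 10 + 10 = 20
print(alignment_score("YY", "TP"))  # -3 + -5 = -8
```

The two printed totals match the worked examples in the Figure 3 legend.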
Choosing the Best PAM Scoring Matrices for Detecting Sequence Similarity
The ability of PAM scoring matrices to distinguish statistically between chance and biologically
meaningful alignments has been analyzed using a statistical theory for sequences (Altschul 1991).
As discussed above, each PAM matrix is designed to score alignments between sequences that have
diverged by a particular degree of evolutionary distance. Altschul (1991) has examined how well
the PAM matrices actually can distinguish proteins that have diverged to a greater or lesser extent,
when these proteins are subjected to a local alignment.
FIGURE 3. The log odds form or MDM of the PAM250 scoring matrix. Amino acids are grouped according to the
chemistry of the side group: (C) sulfhydryl; (STPAG) small hydrophilic; (NDEQ) acid, acid amide, and hydrophilic; (HRK) basic;
(MILV) small hydrophobic; and (FYW) aromatic. Each matrix value is calculated from an odds score, the probability that
the amino acid pair will be found in alignments of homologous proteins divided by the probability that the pair will be
found in alignments of unrelated proteins by random chance. The logarithm of these odds scores to the base 10 is
multiplied by 10 and then used as the table value. Thus, +10 means the ancestor probability is greater, 0 that the
probabilities are equal, and −4 that the alignment is more often a chance one than due to an ancestor relationship. Because
these numbers are logarithms, they may be added to give a combined score for two or more amino acid pairs in an
alignment. Thus, the log odds score for aligning two Ys in an alignment YY/YY is 10 + 10 = 20, a very significant score,
whereas that of YY/TP is −3 − 5 = −8, a rare and unexpected alignment between homologous sequences.
Initially, when using a scoring matrix to produce an alignment, the amount of similarity between
sequences may not be known. However, the ungapped alignment scores obtained are maximal when
the correct PAM matrix, i.e., the one corresponding to the degree of similarity in the target sequences,
is used (Altschul 1991). One approach is to use a series of PAM matrices and then to choose the best
alignment score. Altschul (1991) has also examined the ability of PAM matrices to provide a reliable
enough indication of an ungapped local alignment score between sequences on an initial attempt of
alignment. For sequence alignments, the PAM200 matrix is able to detect a significant ungapped
alignment of 16–62 amino acids whose score is within 87% of the optimal one. Alternatively, several
combinations, such as PAM80 and PAM250 or PAM120 and PAM350, can also be used. Altschul
(1993) has also proposed using a single matrix and adjusting a statistical parameter in the scoring
system to reach more distantly related sequences, but this change would primarily be for database
searches (see the next section).
In addition to the aforementioned differences among PAM scoring matrices for scoring
alignments of more- or less-related proteins, the ability of each PAM matrix to discriminate real local
alignments from chance alignments also varies. To calculate the ability of the entire matrix to discriminate
related from unrelated sequences (H, the relative entropy), the score for each amino acid pair s_ij (in
units of log2, called bits) is multiplied by the probability of occurrence of that pair in the original data
set, q_ij (Altschul 1991). This weighted score is then summed over all the amino acid pairs to produce
a score that represents the ability of the average amino acid pair in the matrix to discriminate actual
from chance alignments. When the correct PAM matrix is chosen for a pair of sequences, as discussed
above, this discriminatory ability is best utilized.
In information theory, this score is called the average mutual information content per pair, and
the sum over all pairs is the relative entropy of the matrix (termed H). The relative entropy will be a
small positive number. For the PAM120 matrix the number is +0.98; for PAM160, +0.70; and for
PAM250, +0.36. Thus, the value of H goes down as the PAM number increases. The lower value of H in
the higher PAM matrices is a reflection of the increased uncertainty that arises as to the number of
changes that have occurred in sequences that are more divergent. In general, all other factors being
equal, the higher the value of H for a scoring matrix, the more likely it is to be able to distinguish real
from chance alignments, and the practical application of H is to choose a scoring matrix that has the
highest value of H. Using the PAM250 for sequences that are very similar is therefore not a useful way
to distinguish real from chance alignments, because this matrix will not take advantage of the higher
expected level of identity.
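The relative entropy calculation can be sketched for a toy two-letter alphabet. Everything in the example is an assumption made for illustration: the pair frequency tables `q_close` and `q_distant` are invented, the background frequencies are taken as the row sums of the pair table, the score is computed as s_ij = log2(q_ij / (p_i × p_j)), and the sum runs over every cell of the symmetric table.

```python
import math


def relative_entropy(q):
    """Return H = sum over pairs of q_ij * s_ij, with s_ij in bits.

    q is a symmetric table of aligned-pair frequencies that sums to 1 and has
    no zero entries; background frequencies p_i are taken as its row sums.
    """
    n = len(q)
    p = [sum(row) for row in q]
    h = 0.0
    for i in range(n):
        for j in range(n):
            s_ij = math.log2(q[i][j] / (p[i] * p[j]))  # log odds score in bits
            h += q[i][j] * s_ij
    return h


# Hypothetical pair frequencies for closely related sequences (mostly identities)
# and for distantly related sequences (pairs close to chance expectation).
q_close = [[0.45, 0.05],
           [0.05, 0.45]]
q_distant = [[0.30, 0.20],
             [0.20, 0.30]]

h_close = relative_entropy(q_close)
h_distant = relative_entropy(q_distant)

print("H for close toy model:  ", round(h_close, 3))
print("H for distant toy model:", round(h_distant, 3))
```

The toy model for divergent sequences gives the smaller value of H, the same trend reported for the PAM matrices as the PAM number increases.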
acids found at other sites in the protein and depend only on the current amino acid at the site. The
assumptions that underlie the method of constructing the Dayhoff scoring matrix have been
challenged (for discussion, see George et al. 1990; States and Boguski 1991). First, it is assumed that each
amino acid position is equally mutable, whereas, in fact, sites vary considerably in their degree of
mutability. Mutagenesis hot spots are well known in molecular genetics, and variations in mutability
of different amino acid sites in proteins are also well known.
The more conserved amino acids in similar proteins from different species are ones that play an
essential role in structure and function, and the less conserved are in sites that can vary without
having a significant effect on function. Thus, there are many factors that influence both the location and
types of amino acid changes that occur in proteins. Wilbur (1985) has tested the Markov model of
evolution (see A Test of the Markov Model of Evolution in Proteins [Mount 2008c]) and has shown
that it can be valid if certain changes are made in the way that the PAM matrices are calculated.
In addition to the questions raised by Wilbur, a further criticism of the PAM scoring matrices is that
they are not more useful for sequence alignment than simpler matrices, such as one based on a
chemical grouping of amino acid side chains. Although alignment of related proteins is straightforward and
quite independent of the symbol comparison scoring scheme, alignments of less-related proteins are
much more speculative (Feng et al. 1985). PAM and BLOSUM matrices have both been very useful for
finding more distantly related sequences (George et al. 1990).
Once a family has been identified, family-specific scoring matrices can be produced, and there is
no point in using these general matrices. A scoring matrix representing a section of aligned sequences
with no gaps, or a matrix representing a section of aligned sequences with matches, mismatches, and
gaps (a profile), is the best tool to search for more family members.
Another criticism of the PAM matrix is that constructing phylogenetic relationships prior to
scoring mutations has limitations, due to the difficulty of determining ancestral relationships among
sequences, a topic discussed in Distance Methods for Phylogenetic Prediction (Mount 2008f).
Early on in the Dayhoff analysis, the evolutionary trees were estimated by a voting scheme for the
branches in the tree, each node being estimated by the most abundant amino acid in distal parts
of the tree. Once available, the PAM matrices were used to estimate the evolutionary distance
between proteins, given the amount of sequence similarity. Such data can be used to produce a tree
based on evolutionary distances (see Distance Methods for Phylogenetic Prediction [Mount
2008f]). This circular analysis of using alignments to score amino acid changes and then using the
matrices to produce new alignments has also been criticized. However, no method has yet been
devised in any type of sequence analysis for completely circumventing this problem. Evidence that
the values in the scoring matrix are insensitive to changes in the phylogenetic relationships has been
provided (George et al. 1990).
Finally, the Dayhoff PAM matrices have been criticized because they are based on a small set of
closely related proteins. The Dayhoff data set has been augmented to include the 1991 protein
database (Gonnet et al. 1992; Jones et al. 1992) as discussed in the next section. The Dayhoff matrices
have also been extensively compared to other scoring matrices, as discussed in Studies of Varying
Alignment Algorithm, Amino Acid Scoring Matrix, and Gap Penalties (Mount 2008e).
There are significant differences between both the new Gonnet92 and JTT250 matrices and the
old Dayhoff matrix at 250 PAMs and also between the Gonnet92 and JTT250 matrices. The most
striking differences with the Dayhoff matrix are in the substitutions between C (Cys) and other amino acids
and W (Trp) and other amino acids. C changes to other amino acids and W changes to other amino
acids quite rarely, but the later data sets used to make the Gonnet92 and JTT250 matrices have
examples. In the new protein comparisons, C (Cys) and W (Trp) were exchanged for each other, whereas
they were not exchanged in the original Dayhoff analysis. There are also some significant differences
between the Gonnet92 and JTT250 matrices, attributable to the overall number of sequences
compared and their relatedness.
On the basis of these and other differences in these scoring matrices, the best recommendation
that can be made is to use the JTT PAM-N matrices as a more modern substitute for the Dayhoff
PAM-N matrices (where N is the PAM distance) and the Gonnet92 matrix as a substitute for the Dayhoff
PAM250 matrix for scoring amino acid changes between proteins that are more distantly related.
Gonnet92 behaves similarly to the BLOSUM62 scoring matrix (see Studies of Varying Alignment
Algorithm, Amino Acid Scoring Matrix, and Gap Penalties [Mount 2008e]).
REFERENCES
Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555–565.
Altschul, S.F. 1993. A protein alignment scoring system sensitive to all evolutionary distances. J. Mol. Evol. 36: 290–300.
Benner, S.A., Cohen, M.A., and Gonnet, G.H. 1994. Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng. 7: 1323–1332.
Dayhoff, M.O. 1978. Survey of new data and computer methods of analysis. In Atlas of protein sequence and structure, Vol. 5, Suppl. 3, p. 29. National Biomedical Research Foundation, Washington, D.C.
Feng, D.F., Johnson, M.S., and Doolittle, R.F. 1985. Aligning amino acid sequences: Comparison of commonly used methods. J. Mol. Evol. 21: 112–125.
George, D.G., Barker, W.C., and Hunt, L.T. 1990. Mutation data matrix and its uses. Methods Enzymol. 183: 333–351.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256: 1443–1445.
Gribskov, M. and Burgess, R.R. 1986. Sigma factors from E. coli, B. subtilis, phage SP01, and phage T4 are homologous proteins. Nucleic Acids Res. 14: 6745–6763.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8: 275–282.
Mount, D.W. 2007a. Strategies for sequence similarity database searches. CSH Protocols doi: 10.1101/pdb.top15.
Mount, D.W. 2007b. Using a FASTA sequence database similarity search. CSH Protocols doi: 10.1101/pdb.top16.
Mount, D.W. 2007c. Using the basic local alignment search tool (BLAST). CSH Protocols doi: 10.1101/pdb.top17.
Mount, D.W. 2007d. Steps used by the BLAST algorithm. CSH Protocols doi: 10.1101/pdb.ip41.
Mount, D.W. 2008a. Using BLOSUM in sequence alignments. CSH Protocols (this issue) doi: 10.1101/pdb.top39.
Mount, D.W. 2008b. Comparison of the PAM and BLOSUM amino acid substitution matrices. CSH Protocols (this issue) doi: 10.1101/pdb.ip59.
Mount, D.W. 2008c. A test of the Markov model of evolution in proteins. CSH Protocols (this issue) doi: 10.1101/pdb.ip58.
Mount, D.W. 2008d. Using gaps and gap penalties to optimize pairwise sequence alignments. CSH Protocols (this issue) doi: 10.1101/pdb.top40.
Mount, D.W. 2008e. Studies of varying alignment algorithm, amino acid scoring matrix and gap penalties. CSH Protocols (this issue) doi: 10.1101/pdb.ip60.
Mount, D.W. 2008f. Distance methods for phylogenetic prediction. CSH Protocols doi: 10.1101/pdb.top33.
States, D.J. and Boguski, M.S. 1991. Similarity and homology. In Sequence analysis primer (eds. M Gribskov and J Devereux), pp. 92–124. Stockton Press, New York.
Wilbur, W.J. 1985. On the PAM model of protein evolution. Mol. Biol. Evol. 2: 434–447.