Mount - 2008 - Using BLOSUM in Sequence Alignments

The document discusses the creation and application of BLOSUM scoring matrices for protein sequence alignments, highlighting their development from a large set of conserved amino acid patterns. BLOSUM matrices, particularly BLOSUM62, are shown to provide effective scoring for diverse protein sequences, balancing information content and data size. The article also references related works on PAM matrices and the evaluation of amino acid substitution matrices.

Downloaded from http://cshprotocols.cshlp.org/ at UNIVERSITE LAVAL on June 24, 2014 - Published by
Cold Spring Harbor Laboratory Press

Using BLOSUM in Sequence Alignments


David W. Mount

Cold Spring Harb Protoc; doi: 10.1101/pdb.top39


Topic Introduction

Adapted from “Alignment of Pairs of Sequences,” Chapter 3, in Bioinformatics: Sequence and Genome
Analysis, 2nd edition, by David W. Mount. Cold Spring Harbor Laboratory Press, Cold Spring Harbor,
NY, USA, 2004.

INTRODUCTION
The original Dayhoff percent accepted mutation (PAM) matrices were developed based on a small
number of protein sequences and an evolutionary model of protein change. By extrapolating from the
observed changes at small evolutionary distances to large ones, it was possible to establish a PAM250
scoring matrix for sequences that were highly divergent. Another approach to finding a scoring matrix
for divergent sequences is to start with a more divergent set of sequences and produce a scoring
matrix from the substitutions found in those less-related sequences. The blocks amino acid
substitution matrix (BLOSUM) scoring matrices were prepared this way. This article explains how
BLOSUM scoring matrices were created and how they can best be used.

RELATED INFORMATION
More information on PAM and BLOSUM matrices can be found in Using PAM Matrices in Sequence
Alignments (Mount 2008a) and Comparison of the PAM and BLOSUM Amino Acid Substitution
Matrices (Mount 2008b). PAM matrices are based on a Markov model of protein evolution. This
model is tested in A Test of the Markov Model of Evolution in Proteins (Mount 2008c). The appropriate
choice for gap penalties to be used with various matrices is discussed in Using Gaps and Gap
Penalties to Optimize Pairwise Sequence Alignments (Mount 2008d). In Studies of Varying
Alignment Algorithm, Amino Acid Scoring Matrix, and Gap Penalties (Mount 2008e), BLOSUM
and other scoring matrices are compared in combination with various alignment algorithms and gap
penalties.

Please cite as: CSH Protocols; 2008; doi:10.1101/pdb.top39. www.cshprotocols.org
© 2008 Cold Spring Harbor Laboratory Press. Vol. 3, Issue 6, June 2008.

BLOCKS AMINO ACID SUBSTITUTION MATRICES (BLOSUM)

The BLOSUM62 substitution matrix (Henikoff and Henikoff 1992) is widely used for scoring protein
sequence alignments. The matrix values are based on the observed amino acid substitutions in a large
set of ~2000 conserved amino acid patterns, called blocks. These blocks have been found in a database
of protein sequences representing more than 500 families of related proteins (Henikoff and
Henikoff 1992) and act as signatures of these protein families. The BLOSUM matrices, which are
designed to find the conserved domains of proteins, are thus based on an entirely different type of
sequence analysis and a much larger data set than the Dayhoff PAM matrices.

The protein families used for making the BLOSUM matrices were originally identified by Amos
Bairoch in the Prosite catalog (Bairoch 1991). This catalog provides lists of proteins that are in the
same family because they have a similar biochemical function. For each family, a pattern of amino
acids that are characteristic of that function is provided. Henikoff and Henikoff (1991) examined each
Prosite family for the presence of ungapped amino acid patterns (blocks) that were present in each
family and that could be used to identify members of that family. To locate these patterns, the
sequences of each protein family were searched for similar amino acid patterns by the MOTIF program
of H. Smith (Smith et al. 1990), which can find patterns of the type aa1 d1 aa2 d2 aa3, where aa1,
aa2, and aa3 are conserved amino acids and d1 and d2 are stretches of intervening sequence up to 24
amino acids long located in all sequences. These initial patterns were organized into larger ungapped
patterns (blocks) between 3 and 60 amino acids long by the Henikoffs’ PROTOMAT program
(http://blocks.fhcrc.org). Because these blocks were present in all of the sequences in each family, they
could be used to identify other members of the same family. Thus, the family collections were
enlarged by searching the available sequence databases for more proteins with these same conserved
blocks.
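As a rough illustration of the pattern search, the following Python sketch looks for spaced patterns of the form aa1 d1 aa2 d2 aa3 that are shared by a set of sequences. It is not the MOTIF program: the function names are hypothetical, d1 and d2 are assumed to mean the number of intervening residues, and a pattern is kept only if it occurs in every sequence.

```python
def spaced_triplets(seq, max_gap=24):
    """Collect every (aa1, d1, aa2, d2, aa3) pattern present in one sequence.

    d1 and d2 are the numbers of residues between the three chosen positions
    (an assumption made for this sketch), each limited to max_gap.
    """
    found = set()
    n = len(seq)
    for i in range(n):
        for d1 in range(max_gap + 1):
            j = i + 1 + d1
            if j >= n:
                break  # larger d1 only moves j further past the end
            for d2 in range(max_gap + 1):
                k = j + 1 + d2
                if k >= n:
                    break
                found.add((seq[i], d1, seq[j], d2, seq[k]))
    return found


def shared_triplets(sequences, max_gap=24):
    """Keep only the patterns that occur in all of the sequences."""
    common = spaced_triplets(sequences[0], max_gap)
    for seq in sequences[1:]:
        common &= spaced_triplets(seq, max_gap)
    return common


# Made-up sequences for illustration only.
family = ["MKACDEFGH", "TTACDWFPP", "ACDQFLL"]
shared = shared_triplets(family)
print(sorted(shared))
```

Each shared pattern marks a candidate anchor around which a longer ungapped block could then be assembled, which is the role the text assigns to PROTOMAT.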
The columns of the aligned blocks that characterized each family indicated the types of amino
acid substitutions that occurred. The amino acid changes that were observed in each column of the
alignment could then be counted. The types of substitutions were then scored for all aligned patterns
in the database and used to prepare a scoring matrix, the BLOSUM matrix, indicating the frequency
of each type of substitution. BLOSUM matrix values were given as log odds scores, that is, as the
logarithm of the ratio of the observed frequency of an amino acid substitution to the frequency
expected by chance. An example of the calculations is shown in Figure 1.
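The counting and log odds steps can be sketched in Python for the single example column used in Figure 1 (9 A and 1 S). This is a minimal sketch with a hypothetical function name; it handles one column only, whereas the published matrices pool the counts from every column of every block.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log2


def column_log_odds(column):
    """Return pair counts f, pair frequencies q, residue frequencies p,
    and unrounded half-bit scores s for one alignment column."""
    # Count every unordered pair of residues in the column.
    f = Counter(tuple(sorted(pair)) for pair in combinations(column, 2))
    total = sum(f.values())
    q = {pair: n / total for pair, n in f.items()}

    # Frequency of each residue being in a pair.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Log odds in half-bits: 2 * log2(observed / expected).
    s = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        s[(a, b)] = 2 * log2(freq / expected)
    return f, q, dict(p), s


f, q, p, s = column_log_odds("AAAAAAAAAS")
for pair in sorted(s):
    print(pair, f[pair], round(q[pair], 2), round(s[pair], 2), round(s[pair]))
```

For this column the sketch gives 36 AA pairs and 9 AS pairs, with unrounded scores of about −0.04 and 0.30 half-bits, both of which round to 0. Pairs that are never observed (here SS) get no score, because the logarithm of a zero frequency is undefined.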
FIGURE 1. Derivation of the matrix values in the BLOSUM62 scoring matrix. As an example of the calculations, if a
column in one of the blocks consisted of 9 A and 1 S amino acids, the following is true for this data set (see Henikoff and
Henikoff 1992).
1. Since the original sequence from which the others were derived is not known, each column position has to be
considered a possible ancestor of the other nine positions. Hence, there are 8 + 7 + 6 + . . . + 1 = 36 possible AA pairs
(fAA) and 9 possible AS pairs (fAS) to be compared.
2. There are 20 + 19 + 18 + . . . + 1 = 210 possible amino acid pairs.
3. The frequency of occurrence of an AA pair, qAA = fAA/(fAA + fAS) = 36/(36 + 9) = 0.8, and that of an AS pair, qAS =
fAS/(fAA + fAS) = 9/(36 + 9) = 0.2.
4. The expected frequency of A being in a pair, pA = (qAA + qAS/2) = 0.8 + 0.2/2 = 0.9, and that of pS = qAS/2 = 0.1.
5. The expected frequency of occurrence of AA pairs, eAA = pA × pA = 0.9 × 0.9 = 0.81, and that of AS, eAS = 2 ×
pS × pA = 2 × 0.9 × 0.1 = 0.18.
6. The matrix entry for AA will be calculated from the ratio of the occurrence frequency to the expected frequency. For
AA, the ratio = qAA/eAA = 0.8/0.81 = 0.99, and for AS, the ratio = qAS/eAS = 0.2/0.18 = 1.11.
7. Both ratios are converted to logarithms to the base 2, called bits (logs to the base 2 may be calculated from natural
logarithms by dividing by 0.693, or from logs to the base 10 by dividing by 0.301), and then multiplied by 2 to give units
of half-bits. Matrix entry for AA, sAA = 2 × log2(qAA/eAA) = −0.04, and for AS, sAS = 2 × log2(qAS/eAS) = 0.30. These
logarithms are both rounded to the nearest half-bit unit, which in this case would be 0 and 0, respectively. The entire
BLOCKS multiple sequence alignment database was then used in this same manner to obtain the values shown in Figure 2.

This procedure of counting all of the amino acid changes in the blocks, however, can lead to an
overrepresentation of amino acid substitutions that occur in the most closely related members of each
family. To reduce this dominant contribution from the most-alike block sequences, these block
sequences were grouped together into one sequence before scoring the amino acid substitutions in
the aligned blocks. The amino acid changes within these grouped sequences were then averaged.
Patterns that were at least 60% identical were grouped together to make one substitution matrix called
BLOSUM60, and those at least 80% alike to make another matrix called BLOSUM80, and so on. As with the
PAM matrices, these matrices differ in the degree to which the more common amino acid pairs
are scored relative to the less common pairs. Thus, when used for aligning protein sequences, they
provide a different level of distinction between the more common and less common amino acid
pairs. Experience has shown that BLOSUM62 generally provides the best alignment of proteins over a
range of sequence similarity and that other BLOSUM matrices may provide a better alignment when
dealing with more closely related proteins.
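The grouping step can be sketched as follows. This is a simplified illustration under stated assumptions, not the Henikoffs' code: the function names are hypothetical, percent identity is measured over the block segment itself, segments are joined by single linkage (a segment joins a group if it reaches the threshold against any member), and each group is then counted as one sequence by weighting every member by 1/(group size).

```python
from collections import defaultdict
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions in two equal-length block segments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(segments, threshold):
    """Single-linkage grouping of segments at the given identity threshold."""
    clusters = []
    for seg in segments:
        merged, rest = [seg], []
        for c in clusters:
            if any(identity(seg, member) >= threshold for member in c):
                merged.extend(c)  # seg links this group to the new one
            else:
                rest.append(c)
        clusters = rest + [merged]
    return clusters


def weighted_pair_counts(segments, threshold):
    """Count residue pairs between groups only, each group acting as one sequence."""
    clusters = cluster(segments, threshold)
    counts = defaultdict(float)
    for ci, cj in combinations(clusters, 2):
        weight = 1.0 / (len(ci) * len(cj))  # averages over the members
        for a in ci:
            for b in cj:
                for x, y in zip(a, b):
                    counts[tuple(sorted((x, y)))] += weight
    return clusters, dict(counts)


# Made-up block of four segments, grouped at 62% identity.
block = ["AAKL", "AAKL", "AARL", "SGRI"]
clusters, counts = weighted_pair_counts(block, 0.62)
print(clusters)
print(counts)
```

In this toy block the three similar segments collapse into one group, so substitutions among them are not counted at all, and their pairings with the fourth segment contribute a total weight of one sequence pair per column.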
The ability of these different BLOSUM matrices to distinguish real from chance alignments and to
identify as many members as possible of a protein family has been determined (Henikoff and Henikoff
1992). Two types of analyses were performed: (1) an information-content analysis of each matrix and
(2) an actual comparison of the ability of each matrix to find members of the same families in a database
search. As the grouping percentage was increased, the ability of the resulting matrix to distinguish
actual from chance alignments, defined as the relative entropy of the matrix or the average
information content per residue pair, also increased. As grouping increased from 45% to 62%, the
information content per residue increased from ~0.4 to 0.7 bits per residue, and was ~1.0 bits at
80% grouping. However, at the same time, the number of blocks that contributed information
decreased by 25% between no grouping and 62% grouping. BLOSUM62 represents a balance
between information content and data size. The BLOSUM62 matrix is shown in Figure 2.
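The relative entropy mentioned above is the expected log odds score per aligned residue pair, H = Σ qij × log2(qij/eij), in bits. A minimal Python sketch follows; the function name and both toy inputs are illustrative assumptions, not values taken from the BLOCKS database.

```python
from math import log2


def relative_entropy(q, e):
    """Average information per residue pair, in bits.

    q maps each residue pair to its observed frequency and e to the
    frequency expected by chance; pairs with q = 0 contribute nothing.
    """
    return sum(q[pair] * log2(q[pair] / e[pair]) for pair in q if q[pair] > 0)


# Toy column of 9 A and 1 S: observed pairs are close to chance expectation.
q_flat = {("A", "A"): 0.8, ("A", "S"): 0.2}
e_flat = {("A", "A"): 0.81, ("A", "S"): 0.18}

# Toy case in which A and S are equally common but never pair with each other.
q_sharp = {("A", "A"): 0.5, ("S", "S"): 0.5}
e_sharp = {("A", "A"): 0.25, ("S", "S"): 0.25}

print(relative_entropy(q_flat, e_flat))
print(relative_entropy(q_sharp, e_sharp))
```

The first case carries very little information (about 0.016 bits) because the observed pairs barely depart from chance, whereas the second carries 1 bit per pair. A matrix with higher relative entropy separates real from chance alignments more sharply per aligned position.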
Henikoff and Henikoff (1993) have prepared a set of interval BLOSUM matrices that represent the
changes observed between more closely related or more distantly related representatives of each
block. Rather than representing the changes observed in very-alike sequences up to sequences that
were n% alike to give a BLOSUM-n matrix, the new BLOSUM-nm matrix represented the changes
observed in sequences that were between n% alike and m% alike. The idea behind these matrices was
to have a set of matrices corresponding to amino acid changes in sequence blocks that are separated
by different evolutionary distances.

FIGURE 2. The BLOSUM62 amino acid substitution matrix. The amino acids in the table are grouped according to the
chemistry of the side group: (C) sulfhydryl; (STPAG) small hydrophilic; (NDEQ) acid, acid amide, and hydrophilic; (HRK)
basic; (MILV) small hydrophobic; and (FYW) aromatic. Each entry is the logarithm of the odds score, found by dividing
the frequency of occurrence of the amino acid pair in the BLOCKS database (after sequences 62% or more in similarity
have been clustered) by the likelihood of an alignment of the amino acids by random chance. The denominator in this
ratio is calculated from the frequency of occurrence of each of the two individual amino acids in the BLOCKS database
and provides a measure of a chance alignment of the two amino acids. The actual/expected ratio is expressed as a log
odds score in so-called half-bit units, obtained by converting the odds ratio to a logarithm to the base 2, and then
multiplying by 2. A zero score means that the frequency of the amino acid pair in the database is as expected by chance, a
positive score that the pair is found more often than by chance, and a negative score that the pair is found less often
than by chance. The accumulated score of an alignment of several amino acids in two sequences may be obtained by
adding up the respective scores of each individual pair of amino acids. As with the PAM250-derived matrix, the
highest-scoring matches are between amino acids that are in the same chemical group, and the very highest-scoring matches
are for cysteine-cysteine matches and for matches among the aromatic amino acids. Compared to the PAM160 matrix,
however, the BLOSUM62 matrix gives a more positive score to mismatches with the rare amino acids, e.g., cysteine, a
more positive score to mismatches with hydrophobic amino acids, but a more negative score to mismatches with
hydrophilic amino acids (Henikoff and Henikoff 1992).
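The additive scoring described in the caption can be sketched in Python. Only a handful of BLOSUM62 entries are included here, enough for the example, and the two peptide sequences and function names are made up for illustration; a real application would load the complete matrix.

```python
# A few BLOSUM62 entries in half-bits, keyed by alphabetically sorted pairs.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("C", "C"): 9,
    ("W", "W"): 11,
    ("K", "R"): 2,
    ("I", "L"): 2,
    ("F", "Y"): 3,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score of one aligned pair; the matrix is symmetric."""
    return matrix[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum the pair scores of an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


total = alignment_score("ACKIWF", "ACRLWY")
odds = 2 ** (total / 2)  # half-bits back to an odds ratio
print(total, odds)
```

Because the entries are logarithms, adding them is equivalent to multiplying the odds ratios of the individual pairs. The example sums to 31 half-bits, which corresponds to odds of about 46,000 to 1 in favor of a meaningful alignment over a chance one, before any correction for the number of comparisons made.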


REFERENCES
Bairoch, A. 1991. PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Res. 19: 2241–2245.
Henikoff, S. and Henikoff, J.G. 1991. Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19: 6565–6572.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.
Henikoff, S. and Henikoff, J.G. 1993. Performance evaluation of amino acid substitution matrices. Proteins Struct. Funct. Genet. 17: 49–61.
Mount, D.W. 2008a. Using PAM matrices in sequence alignments. CSH Protocols (this issue) doi: 10.1101/pdb.top38.
Mount, D.W. 2008b. Comparison of the PAM and BLOSUM amino acid substitution matrices. CSH Protocols (this issue) doi: 10.1101/pdb.ip59.
Mount, D.W. 2008c. A test of the Markov model of evolution in proteins. CSH Protocols (this issue) doi: 10.1101/pdb.ip58.
Mount, D.W. 2008d. Using gaps and gap penalties to optimize pairwise sequence alignments. CSH Protocols (this issue) doi: 10.1101/pdb.top40.
Mount, D.W. 2008e. Studies of varying alignment algorithm, amino acid scoring matrix and gap penalties. CSH Protocols (this issue) doi: 10.1101/pdb.ip60.
Smith, H.O., Annau, T.M., and Chandrasegaran, S. 1990. Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. 87: 826–830.
