Downloaded from http://cshprotocols.cshlp.org/ at UNIVERSITE LAVAL on June 24, 2014 - Published by
Cold Spring Harbor Laboratory Press

Using BLOSUM in Sequence Alignments

David W. Mount

Cold Spring Harb Protoc; doi: 10.1101/pdb.top39


Topic Introduction

Adapted from “Alignment of Pairs of Sequences,” Chapter 3, in Bioinformatics: Sequence and Genome
Analysis, 2nd edition, by David W. Mount. Cold Spring Harbor Laboratory Press, Cold Spring Harbor,
NY, USA, 2004.

INTRODUCTION
The original Dayhoff percent accepted mutation (PAM) matrices were developed based on a small
number of protein sequences and an evolutionary model of protein change. By extrapolating from the
observed changes at small evolutionary distances to large ones, it was possible to establish a PAM250
scoring matrix for sequences that were highly divergent. Another approach to finding a scoring matrix
for divergent sequences is to start with a more divergent set of sequences and produce a scoring
matrix from the substitutions found in those less-related sequences. The blocks amino acid
substitution matrix (BLOSUM) scoring matrices were prepared this way. This article explains how
BLOSUM scoring matrices were created and how they can best be used.

RELATED INFORMATION
More information on PAM and BLOSUM matrices can be found in Using PAM Matrices in Sequence
Alignments (Mount 2008a) and Comparison of the PAM and BLOSUM Amino Acid Substitution
Matrices (Mount 2008b). PAM matrices are based on a Markov model of protein evolution. This
model is tested in A Test of the Markov Model of Evolution in Proteins (Mount 2008c). The
appropriate choice for gap penalties to be used with various matrices is discussed in Using Gaps and
Gap Penalties to Optimize Pairwise Sequence Alignments (Mount 2008d). In Studies of Varying
Alignment Algorithm, Amino Acid Scoring Matrix, and Gap Penalties (Mount 2008e), BLOSUM
and other scoring matrices are compared in combination with various alignment algorithms and gap
penalties.

BLOCKS AMINO ACID SUBSTITUTION MATRICES (BLOSUM)

The BLOSUM62 substitution matrix (Henikoff and Henikoff 1992) is widely used for scoring protein
sequence alignments. The matrix values are based on the observed amino acid substitutions in a large
set of ~2000 conserved amino acid patterns, called blocks. These blocks have been found in a
database of protein sequences representing more than 500 families of related proteins (Henikoff and
Henikoff 1992) and act as signatures of these protein families. The BLOSUM matrices, which are
designed to find the conserved domains of proteins, are thus based on an entirely different type of
sequence analysis and a much larger data set than the Dayhoff PAM matrices.

Please cite as: CSH Protocols; 2008; doi:10.1101/pdb.top39 (www.cshprotocols.org)
© 2008 Cold Spring Harbor Laboratory Press. Vol. 3, Issue 6, June 2008

The protein families used for making the BLOSUM matrices were originally identified by Amos
Bairoch in the Prosite catalog (Bairoch 1991). This catalog provides lists of proteins that are in the
same family because they have a similar biochemical function. For each family, a pattern of amino
acids that are characteristic of that function is provided. Henikoff and Henikoff (1991) examined each
Prosite family for the presence of ungapped amino acid patterns (blocks) that were present in each
family and that could be used to identify members of that family. To locate these patterns, the
sequences of each protein family were searched for similar amino acid patterns by the MOTIF program
of H. Smith (Smith et al. 1990), which can find patterns of the type aa1 d1 aa2 d2 aa3, where aa1,
aa2, and aa3 are conserved amino acids and d1 and d2 are stretches of intervening sequence up to 24
amino acids long located in all sequences. These initial patterns were organized into larger ungapped
patterns (blocks) between 3 and 60 amino acids long by the Henikoffs’ PROTOMAT program
(http://blocks.fhcrc.org). Because these blocks were present in all of the sequences in each family, they
could be used to identify other members of the same family. Thus, the family collections were
enlarged by searching the available sequence databases for more proteins with these same conserved
blocks.
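
To make the aa1 d1 aa2 d2 aa3 pattern type concrete, the following is a minimal Python sketch, not the MOTIF program itself. It assumes that d1 and d2 are exact spacer lengths (0 to 24 residues) and that a pattern must occur in every sequence of the family. The function name and the toy sequences are hypothetical and serve only as illustration.

```python
def shared_spaced_triplets(seqs, max_gap=24):
    """Return the set of (aa1, d1, aa2, d2, aa3) patterns found in every sequence.

    Assumption: d1 and d2 are the exact numbers of intervening residues.
    """
    def triplets(seq):
        found = set()
        n = len(seq)
        for i in range(n):
            for d1 in range(max_gap + 1):
                j = i + d1 + 1          # position of aa2
                if j >= n:
                    break
                for d2 in range(max_gap + 1):
                    k = j + d2 + 1      # position of aa3
                    if k >= n:
                        break
                    found.add((seq[i], d1, seq[j], d2, seq[k]))
        return found

    common = triplets(seqs[0])
    for seq in seqs[1:]:
        common &= triplets(seq)         # keep only patterns present in all sequences
    return common


# Hypothetical toy family: each sequence contains C, two residues, H, two residues, W.
toy_family = ["GCAAHTTW", "PCLLHSSW", "CKRHNDWQ"]
patterns = shared_spaced_triplets(toy_family)
print(patterns)
```

In this toy family the only pattern shared by all three sequences is C, a spacer of two residues, H, a spacer of two residues, and W.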
The columns of the aligned blocks that characterized each family indicated the types of amino
acid substitutions that occurred. The amino acid changes that were observed in each column of the
alignment could then be counted. The types of substitutions were then scored for all aligned patterns
in the database and used to prepare a scoring matrix, the BLOSUM matrix, indicating the frequency
of each type of substitution. BLOSUM matrix values were given as log odds scores, that is, logarithms
of the ratio of the observed frequency of each amino acid substitution to the frequency expected by
chance. An example of the calculations is shown in Figure 1.
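
The calculation summarized in Figure 1 can be followed in a few lines of Python. The sketch below handles a single block column only (nine A and one S residues, as in the figure); the function name is hypothetical, and the real matrices were built by accumulating pair counts over all columns of all blocks before computing frequencies.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def column_log_odds(column):
    """Log-odds scores in half-bits for one aligned block column."""
    # Count every unordered residue pair in the column.
    pairs = Counter(tuple(sorted(pair)) for pair in combinations(column, 2))
    total = sum(pairs.values())
    q = {pair: count / total for pair, count in pairs.items()}  # observed pair frequencies

    # Frequency of each residue being in a pair: p_a = q_aa + sum of q_ab / 2.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Expected pair frequency: p_a * p_a for identical pairs, 2 * p_a * p_b otherwise.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return q, dict(p), scores


q, p, scores = column_log_odds("AAAAAAAAAS")
for pair, value in scores.items():
    print(pair, round(value, 2), "-> rounded matrix entry:", round(value))
```

For this column the script gives q_AA = 0.8, q_AS = 0.2, p_A = 0.9, and p_S = 0.1, with scores of about −0.04 and 0.30 half-bits that both round to 0, in agreement with the figure.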

FIGURE 1. Derivation of the matrix values in the BLOSUM62 scoring matrix. As an example of the calculations, if a
column in one of the blocks consisted of 9 A and 1 S amino acids, the following is true for this data set (see Henikoff
and Henikoff 1992).
1. Since the original sequence from which the others were derived is not known, each column position has to be
considered a possible ancestor of the other nine positions. Hence, there are 8 + 7 + 6 + ... + 1 = 36 possible AA pairs
(f_AA) and 9 possible AS pairs (f_AS) to be compared.
2. There are 20 + 19 + 18 + ... + 1 = 210 possible amino acid pairs.
3. The frequency of occurrence of an AA pair, q_AA = f_AA/(f_AA + f_AS) = 36/(36 + 9) = 0.8, and that of an AS pair,
q_AS = f_AS/(f_AA + f_AS) = 9/(36 + 9) = 0.2.
4. The expected frequency of A being in a pair, p_A = q_AA + q_AS/2 = 0.8 + 0.2/2 = 0.9, and that of S, p_S = q_AS/2 = 0.1.
5. The expected frequency of occurrence of AA pairs, e_AA = p_A × p_A = 0.9 × 0.9 = 0.81, and that of AS pairs,
e_AS = 2 × p_A × p_S = 2 × 0.9 × 0.1 = 0.18.
6. The matrix entry for AA will be calculated from the ratio of the occurrence frequency to the expected frequency. For
AA, the ratio = q_AA/e_AA = 0.8/0.81 = 0.99, and for AS, the ratio = q_AS/e_AS = 0.2/0.18 = 1.11.
7. Both ratios are converted to logarithms to the base 2, called bits (logarithms to the base 2 may be calculated from
natural logarithms by dividing by 0.693, or from logarithms to the base 10 by dividing by 0.301), and then multiplied
by 2 to give units of half-bits. Matrix entry for AA, s_AA = 2 × log2(q_AA/e_AA) = −0.04, and for AS,
s_AS = 2 × log2(q_AS/e_AS) = 0.30. These values are both rounded to the nearest half-bit unit, which in this case would
be 0 and 0, respectively. The entire BLOCKS multiple sequence alignment database was then used in this same manner to
obtain the values shown in Fig. 2.

This procedure of counting all of the amino acid changes in the blocks, however, can lead to an
overrepresentation of amino acid substitutions that occur in the most closely related members of each
family. To reduce this dominant contribution from the most-alike block sequences, these block
sequences were grouped together into one sequence before scoring the amino acid substitutions in
the aligned blocks. The amino acid changes within these grouped sequences were then averaged.
Patterns that were at least 60% identical were grouped together to make one substitution matrix called
BLOSUM60, and those at least 80% alike to make another matrix called BLOSUM80, and so on. As with the
PAM matrices, these matrices differ in the degree to which the more common amino acid pairs
are scored relative to the less common pairs. Thus, when used for aligning protein sequences, they
provide a different level of distinction between the more common and less common amino acid
pairs. Experience has shown that BLOSUM62 generally provides the best alignment of proteins over a
range of sequence similarity and that other BLOSUM matrices may provide a better alignment when
dealing with more closely related proteins.
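
The grouping of similar block sequences before counting can be sketched as follows. This is one plausible reading of the step, stated as an assumption: segments are merged by single linkage whenever their percent identity reaches the threshold, pairs within a cluster are not counted, and each pair between clusters is down-weighted so that every cluster contributes as a single sequence. The function names and the toy block are hypothetical.

```python
from collections import Counter
from itertools import combinations


def percent_identity(a, b):
    """Percent of identical positions between two equal-length ungapped segments."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_segments(segments, threshold):
    """Single-linkage clustering; returns a cluster label for each segment."""
    cluster_of = list(range(len(segments)))
    for i, j in combinations(range(len(segments)), 2):
        if percent_identity(segments[i], segments[j]) >= threshold:
            old, new = cluster_of[j], cluster_of[i]
            cluster_of = [new if label == old else label for label in cluster_of]
    return cluster_of


def weighted_pair_counts(segments, threshold):
    """Count residue pairs between clusters, each cluster weighted as one sequence."""
    cluster_of = cluster_segments(segments, threshold)
    size = Counter(cluster_of)
    counts = Counter()
    for i, j in combinations(range(len(segments)), 2):
        if cluster_of[i] == cluster_of[j]:
            continue  # pairs inside a cluster are not counted
        weight = 1.0 / (size[cluster_of[i]] * size[cluster_of[j]])
        for x, y in zip(segments[i], segments[j]):
            counts[tuple(sorted((x, y)))] += weight
    return counts


# Hypothetical toy block of four aligned, ungapped segments.
toy_block = ["ACDEW", "ACDEW", "ACDQW", "SCNEF"]
labels = cluster_segments(toy_block, threshold=62)
counts = weighted_pair_counts(toy_block, threshold=62)
print(labels)
print(dict(counts))
```

At a 62% threshold the first three segments form one cluster, so their three comparisons with the fourth segment together contribute one weighted pair per column instead of three.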
The ability of these different BLOSUM matrices to distinguish real from chance alignments and to
identify as many members as possible of a protein family has been determined (Henikoff and Henikoff
1992). Two types of analyses were performed: (1) an information-content analysis of each matrix and
(2) an actual comparison of the ability of each matrix to find members of the same families in a
database search. As the grouping percentage was increased, the ability of the resulting matrix to
distinguish actual from chance alignments, defined as the relative entropy of the matrix or the average
information content per residue pair, also increased. As grouping increased from 45% to 62%, the
information content per residue pair increased from ~0.4 to ~0.7 bits, and was ~1.0 bits at
80% grouping. However, at the same time, the number of blocks that contributed information
decreased by 25% between no grouping and 62% grouping. BLOSUM62 represents a balance
between information content and data size. The BLOSUM62 matrix is shown in Figure 2.
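
The relative entropy mentioned above is the average log odds score per residue pair, weighted by the observed pair frequencies. As a minimal sketch, the snippet below evaluates it for the single-column example of Figure 1 (q_AA = 0.8, q_AS = 0.2, e_AA = 0.81, e_AS = 0.18); the function name is hypothetical, and a real matrix would sum over all 210 amino acid pairs.

```python
import math


def relative_entropy(q, e):
    """Average information per residue pair in bits: sum of q_ij * log2(q_ij / e_ij)."""
    return sum(q[pair] * math.log2(q[pair] / e[pair]) for pair in q if q[pair] > 0)


# Observed and expected pair frequencies from the Figure 1 example column.
q = {("A", "A"): 0.8, ("A", "S"): 0.2}
e = {("A", "A"): 0.81, ("A", "S"): 0.18}

H = relative_entropy(q, e)
print(f"{H:.4f} bits per residue pair")
```

The value for this nearly invariant toy column is small (about 0.016 bits), far below the ~0.7 bits quoted for BLOSUM62, because the observed pair frequencies barely differ from those expected by chance.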
Henikoff and Henikoff (1993) have prepared a set of interval BLOSUM matrices that represent the
changes observed between more closely related or more distantly related representatives of each
block. Rather than representing the changes observed in very-alike sequences up to sequences that
were n% alike to give a BLOSUM-n matrix, the new BLOSUM-nm matrix represented the changes
observed in sequences that were between n% alike and m% alike. The idea behind these matrices was
to have a set of matrices corresponding to amino acid changes in sequence blocks that are separated
by different evolutionary distances.

FIGURE 2. The BLOSUM62 amino acid substitution matrix. The amino acids in the table are grouped according to the
chemistry of the side group: (C) sulfhydryl; (STPAG) small hydrophilic; (NDEQ) acid, acid amide, and hydrophilic; (HRK)
basic; (MILV) small hydrophobic; and (FYW) aromatic. Each entry is the logarithm of the odds score, found by dividing
the frequency of occurrence of the amino acid pair in the BLOCKS database (after sequences 62% or more in similarity
have been clustered) by the likelihood of an alignment of the amino acids by random chance. The denominator in this
ratio is calculated from the frequency of occurrence of each of the two individual amino acids in the BLOCKS database
and provides a measure of a chance alignment of the two amino acids. The actual/expected ratio is expressed as a log
odds score in so-called half-bit units, obtained by converting the odds ratio to a logarithm to the base 2, and then
multiplying by 2. A zero score means that the frequency of the amino acid pair in the database is as expected by chance, a
positive score that the pair is found more often than by chance, and a negative score that the pair is found less often
than by chance. The accumulated score of an alignment of several amino acids in two sequences may be obtained by
adding up the respective scores of each individual pair of amino acids. As with the PAM250-derived matrix, the
highest-scoring matches are between amino acids that are in the same chemical group, and the very highest-scoring matches
are for cysteine-cysteine matches and for matches among the aromatic amino acids. Compared to the PAM160 matrix,
however, the BLOSUM62 matrix gives a more positive score to mismatches with the rare amino acids, e.g., cysteine, a
more positive score to mismatches with hydrophobic amino acids, but a more negative score to mismatches with
hydrophilic amino acids (Henikoff and Henikoff 1992).
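
Because the matrix entries are log odds scores, the score of an ungapped alignment is simply the sum of the entries for each aligned pair. The sketch below illustrates this with a small excerpt of BLOSUM62 entries (C-C 9, W-W 11, A-A 4, A-S 1, I-L 2) taken from the published matrix; the dictionary layout, function names, and the two short sequences are hypothetical choices for illustration only.

```python
# Excerpt of BLOSUM62 entries in half-bits, keyed by alphabetically sorted pairs.
BLOSUM62_EXCERPT = {
    ("C", "C"): 9,
    ("W", "W"): 11,
    ("A", "A"): 4,
    ("A", "S"): 1,
    ("I", "L"): 2,
}


def pair_score(a, b, matrix):
    """Look up a symmetric substitution score for residues a and b."""
    return matrix[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2, matrix):
    """Sum the pair scores of an ungapped alignment of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment requires sequences of equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


score = alignment_score("CALW", "CSIW", BLOSUM62_EXCERPT)
print(score)  # 9 + 1 + 2 + 11 = 23
```

The two identities (C-C and W-W) dominate the total, which reflects the high scores the matrix assigns to cysteine and aromatic matches.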


REFERENCES
Bairoch, A. 1991. PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Res. 19: 2241–2245.
Henikoff, S. and Henikoff, J.G. 1991. Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19: 6565–6572.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.
Henikoff, S. and Henikoff, J.G. 1993. Performance evaluation of amino acid substitution matrices. Proteins Struct. Funct. Genet. 17: 49–61.
Mount, D.W. 2008a. Using PAM matrices in sequence alignments. CSH Protocols (this issue) doi: 10.1101/pdb.top38.
Mount, D.W. 2008b. Comparison of the PAM and BLOSUM amino acid substitution matrices. CSH Protocols (this issue) doi: 10.1101/pdb.ip59.
Mount, D.W. 2008c. A test of the Markov model of evolution in proteins. CSH Protocols (this issue) doi: 10.1101/pdb.ip58.
Mount, D.W. 2008d. Using gaps and gap penalties to optimize pairwise sequence alignments. CSH Protocols (this issue) doi: 10.1101/pdb.top40.
Mount, D.W. 2008e. Studies of varying alignment algorithm, amino acid scoring matrix and gap penalties. CSH Protocols (this issue) doi: 10.1101/pdb.ip60.
Smith, H.O., Annau, T.M., and Chandrasegaran, S. 1990. Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. 87: 826–830.
