Protein Alignment Scoring - PAM and BLOSUM

The document discusses protein sequence alignment scoring methods. It introduces the PAM and BLOSUM matrices which are commonly used to score substitutions between amino acids in protein alignments based on empirical substitution frequencies observed in related proteins. The PAM matrix models substitution probabilities directly observed in very similar proteins, while the BLOSUM matrix averages these probabilities over clusters of more distantly related proteins to avoid issues with low-probability estimates.


Wright State University

CORE Scholar

Computer Science and Engineering Faculty Publications

Computer Science & Engineering

2003

Protein Alignment Scoring - PAM and BLOSUM


Dan E. Krane
Wright State University - Main Campus, [email protected]

Michael L. Raymer
Wright State University - Main Campus, [email protected]

Follow this and additional works at: https://corescholar.libraries.wright.edu/cse

Part of the Computer Sciences Commons, and the Engineering Commons

Repository Citation
Krane, D. E., & Raymer, M. L. (2003). Protein Alignment Scoring - PAM and BLOSUM.
https://corescholar.libraries.wright.edu/cse/388

This Presentation is brought to you for free and open access by Wright State University’s CORE Scholar. It has been
accepted for inclusion in Computer Science and Engineering Faculty Publications by an authorized administrator of
CORE Scholar. For more information, please contact [email protected].
Sequence Alignments Revisited
• Scoring nucleotide sequence alignments was easier
  • Match score
  • Possibly different scores for transitions and transversions
• For amino acids, there are many more possible substitutions
• How do we decide which substitutions are heavily penalized and which are only moderately penalized?
  • Physical and chemical characteristics
  • Empirical methods
Protein-Related Algorithms Intro to Bioinformatics 1
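The nucleotide scheme in the first bullet can be sketched in a few lines. The score values (+1, -1, -2) and the function names are illustrative assumptions, not values taken from the slides.

```python
# Purines and pyrimidines: a substitution within one class is a transition,
# a substitution between classes is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides (default values are illustrative)."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def alignment_score(s, t):
    """Sum the per-column scores of two equal-length, gap-free sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(a, b) for a, b in zip(s, t))


# Columns: A/A match, C/T transition, G/G match, T/A transversion.
example_score = alignment_score("ACGT", "ATGA")
```

With only four nucleotides, three numbers cover every case. For 20 amino acids there are 190 distinct unordered substitution pairs, which is why a full scoring matrix is needed.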
Scoring Mismatches
• Physical and chemical characteristics
• V → I – Both small, both hydrophobic,
conservative substitution, small penalty
• V → K – Small → large, hydrophobic → charged,
large penalty
• Requires some expert knowledge and judgement
• Empirical methods
• How often does the substitution V → I occur in
proteins that are known to be related?
⇒ Scoring matrices: PAM and BLOSUM



PAM matrices
• PAM = “Point Accepted Mutation”: we are interested only in mutations that have been “accepted” by natural selection
• Starts with a multiple sequence alignment of very similar (>85% identity) proteins, which are assumed to be homologous
• Compute the relative mutability, $m_i$, of each amino acid
  • e.g. $m_A$ = how many times was alanine substituted with anything else?
Relative mutability
• ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
• Across all pairs of sequences, there are 28 A → X substitutions
• There are 10 Ala residues, so $m_A = 28 / 10 = 2.8$
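The two counts on this slide can be checked directly from the seven sequences. The sketch below is one reading of "across all pairs": for every pair of sequences, each column where exactly one of the two residues is alanine counts as one A → X substitution. The function names are hypothetical.

```python
from itertools import combinations

# The seven aligned sequences from the slide.
ALIGNMENT = [
    "ACGCTAFKI",
    "GCGCTAFKI",
    "ACGCTAFKL",
    "GCGCTGFKI",
    "GCGCTLFKI",
    "ASGCTAFKL",
    "ACACTAFKL",
]


def count_pair_substitutions(alignment, residue):
    """Count, over all sequence pairs, the columns where `residue` faces a different residue."""
    total = 0
    for s, t in combinations(alignment, 2):
        for a, b in zip(s, t):
            if a != b and residue in (a, b):
                total += 1
    return total


def relative_mutability(alignment, residue):
    """Substitutions involving `residue` divided by its number of occurrences."""
    occurrences = sum(seq.count(residue) for seq in alignment)
    return count_pair_substitutions(alignment, residue) / occurrences


substitutions_A = count_pair_substitutions(ALIGNMENT, "A")  # 28
m_A = relative_mutability(ALIGNMENT, "A")                   # 28 / 10
```

The 28 substitutions come from three columns: column 1 (4 A against 3 G, 12 pairs), column 3 (1 A against 6 G, 6 pairs), and column 6 (5 A against one G and one L, 10 pairs).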
PAM matrices, cont’d
• Construct a phylogenetic tree for the sequences in the alignment

ACGCTAFKI
  A→G: GCGCTAFKI
    A→G: GCGCTGFKI
    A→L: GCGCTLFKI
  I→L: ACGCTAFKL
    C→S: ASGCTAFKL
    G→A: ACACTAFKL

• Calculate substitution frequencies $F_{X,Y}$, e.g. $F_{G,A} = 3$
• Substitutions may have occurred either way, so A → G also counts as G → A.
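Counting along the tree edges can be sketched as follows. The edge list is transcribed from the tree on this slide; the symmetric counting follows the last bullet. The names `TREE_EDGES` and `substitution_counts` are hypothetical.

```python
from collections import Counter

# (parent, child) edges of the phylogenetic tree on the slide.
TREE_EDGES = [
    ("ACGCTAFKI", "GCGCTAFKI"),  # A→G
    ("ACGCTAFKI", "ACGCTAFKL"),  # I→L
    ("GCGCTAFKI", "GCGCTGFKI"),  # A→G
    ("GCGCTAFKI", "GCGCTLFKI"),  # A→L
    ("ACGCTAFKL", "ASGCTAFKL"),  # C→S
    ("ACGCTAFKL", "ACACTAFKL"),  # G→A
]


def substitution_counts(edges):
    """Count substitutions along tree edges, crediting both directions."""
    counts = Counter()
    for parent, child in edges:
        for a, b in zip(parent, child):
            if a != b:
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # the change may have occurred either way
    return counts


F = substitution_counts(TREE_EDGES)
F_GA = F[("G", "A")]  # two A→G edges plus one G→A edge
```

Unlike the all-pairs count used for relative mutability, this counts each inferred mutation event once per tree edge.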
Mutation Probabilities
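• $M_{i,j}$ represents the probability of a $j \to i$ substitution.

$$M_{i,j} = \frac{m_j F_{i,j}}{\sum_i F_{i,j}}$$

[Figure: phylogenetic tree rooted at ACGCTAFKI, with edges A→G and I→L to GCGCTAFKI and ACGCTAFKL, then A→G, A→L, C→S, G→A to the leaves GCGCTGFKI, GCGCTLFKI, ASGCTAFKL, ACACTAFKL]

• $M_{G,A} = \frac{2.7 \times 3}{4} = 2.025$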


The PAM matrix
• The entries $R_{i,j}$ are the log of the $M_{i,j}$ values divided by the frequency of occurrence, $f_i$, of residue $i$: $R_{i,j} = \log(M_{i,j} / f_i)$
• $f_G$ = 10 Gly / 63 residues = 0.1587
• $R_{G,A} = \log(2.025 / 0.1587) = \log(12.760) = 1.106$
• The log is taken so that we can add, rather than multiply, entries to get compound probabilities.
• Log-odds matrix
• Diagonal entries are $1 - m_j$
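The log-odds arithmetic on this slide can be checked as below. The base-10 logarithm is an assumption, chosen because it reproduces the slide's value of 1.106; the function name is hypothetical.

```python
import math


def log_odds(M_ij, f_i):
    """Log-odds entry: log10 of the mutation value over the background frequency."""
    return math.log10(M_ij / f_i)


f_G = 10 / 63                # 10 glycines among 7 * 9 = 63 residues
R_GA = log_odds(2.025, f_G)  # about 1.106

# Why the log helps: the score of two independent columns is a sum of logs,
# which equals the log of the product of the two odds ratios.
odds_1, odds_2 = 12.76, 0.5
summed = math.log10(odds_1) + math.log10(odds_2)
product = math.log10(odds_1 * odds_2)
```

A positive entry means the substitution is seen more often than chance would predict from the background frequency. A negative entry means it is seen less often.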



Interpretation of PAM matrices
• PAM-1 – one substitution per 100 residues (a PAM unit of time)
• Multiply the PAM-1 mutation probability matrix by itself N times to get PAM-N (e.g. PAM-100)
• “Suppose I start with a given polypeptide
sequence M at time t, and observe the
evolutionary changes in the sequence until 1% of
all amino acid residues have undergone
substitutions at time t+n. Let the new sequence at
time t+n be called M’. What is the probability that
a residue of type j in M will be replaced by i in
M’?”
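Extrapolating from PAM-1 to PAM-N by repeated multiplication can be illustrated with a toy matrix. The 3-residue matrix below is invented for illustration (a real PAM-1 matrix is 20 × 20); entry `[i][j]` is the probability that residue `j` is replaced by residue `i` in one PAM unit, so each column sums to 1.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Return M multiplied by itself n times (identity for n = 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical PAM-1-like matrix: 99% of residues unchanged per step.
TOY_PAM1 = [
    [0.990, 0.006, 0.002],
    [0.007, 0.990, 0.008],
    [0.003, 0.004, 0.990],
]

toy_pam100 = mat_power(TOY_PAM1, 100)
column_sums = [sum(row[j] for row in toy_pam100) for j in range(3)]
```

The multiplication is applied to the probability matrix, and the log-odds scores are computed afterwards. After 100 steps the diagonal entries are well below 0.99, because residues accumulate changes over time.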
PAM matrix considerations

• If $M_{i,j}$ is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified.
• PAM-1 – similar sequences, PAM-1000 – very dissimilar sequences
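The effect of a poorly estimated small probability can be seen in a two-state toy chain. All rates below are invented for illustration: the "true" one-step rate is 0.001, and a small sample is assumed to have produced the estimate 0.002.

```python
def n_step_probability(a, b, n):
    """Probability of being in state 1 after n steps, starting in state 0.

    `a` is the one-step rate 0 -> 1 and `b` is the one-step rate 1 -> 0.
    """
    p = 0.0
    for _ in range(n):
        p = p * (1 - b) + (1 - p) * a
    return p


TRUE_RATE = 0.001       # hypothetical true substitution probability
ESTIMATED_RATE = 0.002  # hypothetical estimate from too few observations
BACK_RATE = 0.01        # hypothetical reverse rate, assumed well estimated

error_after_1 = abs(n_step_probability(ESTIMATED_RATE, BACK_RATE, 1)
                    - n_step_probability(TRUE_RATE, BACK_RATE, 1))
error_after_100 = abs(n_step_probability(ESTIMATED_RATE, BACK_RATE, 100)
                      - n_step_probability(TRUE_RATE, BACK_RATE, 100))
```

In this toy case the absolute error in the extrapolated probability after 100 steps is many times larger than the one-step error. That is the concern the slide raises about high-numbered PAM matrices.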



BLOSUM matrix
• Starts by clustering proteins by similarity
• Avoids problems with small probabilities by
using averages over clusters
• Numbering works in the opposite direction from PAM: a higher number means more similar sequences
  • BLOSUM-62 is appropriate for sequences of about 62% identity, while BLOSUM-80 is appropriate for more similar sequences.
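The clustering step can be sketched on the toy alignment from the PAM slides. This is a simplified illustration, not the full BLOSUM procedure: it only groups sequences by percent identity using single linkage (a sequence joins a cluster if it meets the threshold against any member), and the thresholds and function names are assumptions.

```python
SEQUENCES = [
    "ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
    "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL",
]


def percent_identity(s, t):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(a == b for a, b in zip(s, t)) / len(s)


def cluster_by_identity(seqs, threshold):
    """Single-linkage clustering: merge every cluster that has a member
    at least `threshold` identical to the incoming sequence."""
    clusters = []
    for seq in seqs:
        linked = [c for c in clusters
                  if any(percent_identity(seq, member) >= threshold for member in c)]
        merged = [seq]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


# Sequences differing at one of nine positions are 8/9 (about 89%) identical.
clusters_85 = cluster_by_identity(SEQUENCES, 0.85)
clusters_95 = cluster_by_identity(SEQUENCES, 0.95)
```

At an 85% threshold all seven sequences chain into one cluster, while at 95% each sequence stands alone. Treating each cluster as a single averaged sequence keeps a family of near-identical proteins from dominating the substitution counts.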

