Bioinformatics Module 2 Notes
• The PAM matrices for amino acids, along with the single letter
abbreviations used for genetically encoded amino acids, were
developed by Margaret Dayhoff.
PAM
🞂 A point accepted mutation (PAM), also called a percent
accepted mutation, is the replacement of a single amino acid
in the primary structure of a protein with another single
amino acid, which is accepted by the processes of natural
selection.
🞂 These mutations were identified by comparing
highly similar sequences with at least 85% identity.
5/19/2024
• PAM also defines a unit of evolutionary time, where 1 PAM is the time in
which 1 in 100 amino acids (1%) is expected to undergo an accepted mutation.
• The PAM1 probability matrix shows the probability of the amino acid
at row i being replaced by the amino acid at column j.
PAM
🞂 Each entry indicates the likelihood of the amino acid
of that row being replaced with the amino acid of
that column through a series of one or more point
accepted mutations during a specified evolutionary
interval, rather than these two amino acids being
aligned due to chance.
🞂 Different PAM matrices correspond to different
lengths of time in the evolution of the protein
sequence.
PAM Matrices
🞂 PAM matrices are amino acid substitution matrices that
encode the expected evolutionary change at the amino
acid level.
🞂 Each PAM matrix is designed to compare two sequences
which are a specific number of PAM units apart.
🞂 One PAM unit is the amount of evolution that changes, on
average, 1% of amino acid positions.
🞂 Two sequences S1 and S2 are at an evolutionary distance of 1
PAM unit if S1 has converted to S2 with an average of one
amino acid substitution per 100 amino acids.
🞂 250 PAM = 250 accepted mutations per 100 amino acids, an
average of 2.5 accepted mutations per amino acid position.
The same position can mutate more than once, so the two
sequences do not differ at every position.
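The arithmetic behind "250 mutations per 100 amino acids" can be illustrated with a deliberately simplified sketch. It assumes that every PAM unit hits each position independently with probability 0.01, which is an assumption made here for illustration and not Dayhoff's actual model (real amino acids have different mutabilities, and substituted positions can revert or match by chance, which this sketch ignores). The function name and the `rate` parameter are hypothetical.

```python
def fraction_never_substituted(n_pam: int, rate: float = 0.01) -> float:
    """Fraction of positions expected to escape every substitution
    after n_pam PAM units, assuming independent hits at a uniform rate."""
    return (1.0 - rate) ** n_pam


# Compare a few distances from the PAM series.
for n in (1, 100, 250):
    print(n, round(fraction_never_substituted(n), 3))
```

Under this toy assumption about 8% of positions are never hit at 250 PAM, even though the average position has received 2.5 substitutions.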
PAM Matrices
🞂 When used for protein comparison, each mutation
probability is divided by the background frequency of
the replacement amino acid to give an odds ratio, and
the logarithm is taken. This lets us add the scores along
a protein instead of multiplying the probabilities.
The resulting matrix is the "log-odds" matrix, known
as the PAM matrix.
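A minimal sketch of the log-odds step follows. The three-letter alphabet, the probabilities, and the background frequencies are made-up toy values, not real PAM data, and the scaling `10 * log10` is assumed here as one common convention. The helper names are hypothetical. Because the toy numbers are not tuned, the resulting matrix is not symmetric.

```python
from math import log10

# Hypothetical 3-letter alphabet with toy numbers (not real PAM data).
ALPHABET = "ABC"
# mutation_prob[i][j]: probability that residue i is replaced by residue j.
mutation_prob = [
    [0.990, 0.006, 0.004],
    [0.007, 0.990, 0.003],
    [0.005, 0.005, 0.990],
]
# Assumed background frequency of each residue.
background = [0.5, 0.3, 0.2]


def log_odds_matrix(prob, freq, scale=10.0):
    """Return scale * log10(prob[i][j] / freq[j]) for every pair (i, j)."""
    return [
        [scale * log10(p / freq[j]) for j, p in enumerate(row)]
        for row in prob
    ]


def alignment_score(seq1, seq2, scores):
    """Sum the per-position log-odds scores of an ungapped alignment."""
    return sum(
        scores[ALPHABET.index(x)][ALPHABET.index(y)]
        for x, y in zip(seq1, seq2)
    )


scores = log_odds_matrix(mutation_prob, background)
print(alignment_score("AAB", "AAC", scores))
```

Adding the three scores gives the same number as taking the log of the product of the three odds ratios, which is the reason for working in log space.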
PAM Series
🞂 There is a whole series of matrices: PAM10 ... PAM250.
🞂 These matrices are extrapolated from the PAM1 matrix
by matrix multiplication: the PAMn probability matrix is
the PAM1 probability matrix raised to the nth power.
🞂 The PAM120 score matrix is designed to compare
sequences that are 120 PAM units apart: the score it
gives a pair of aligned sequences is the log of the odds
that one evolved into the other during 120 PAM units of
evolution, rather than being aligned by chance.
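The extrapolation by matrix multiplication can be sketched in plain Python. The 3x3 matrix below is a toy stand-in for the 20x20 PAM1 probability matrix: the values are invented, and each diagonal entry is set to 0.99 so that 1% of residues change per step. The function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def pam_n(pam1, n):
    """Raise the PAM1 probability matrix to the nth power (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy PAM1-like matrix: row i holds the replacement probabilities of residue i.
pam1 = [
    [0.990, 0.006, 0.004],
    [0.007, 0.990, 0.003],
    [0.005, 0.005, 0.990],
]

pam250 = pam_n(pam1, 250)
for row in pam250:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, and the diagonal entries shrink as n grows, reflecting the lower chance that a residue is unchanged over a longer interval.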
PAM Series
🞂 For any specific pair (Ai, Aj) of amino acids, the (i,j)
entry in the PAM n matrix reflects the frequency at
which Ai is expected to be replaced by Aj in two
sequences that have diverged by n PAM units. These
frequencies are estimated by gathering statistics on
replaced amino acids.
PAM 100
BLOSUM
🞂 Block Substitution Matrix.
🞂 BLOSUM matrices were first introduced by Steven Henikoff
and Jorja Henikoff.
🞂 Only blocks of amino acid sequences with small
change between them are considered. These blocks
are called conserved blocks.
🞂 BLOSUM matrices are based on local alignments.
BLOSUM
🞂 The Blocks database contains multiply aligned
ungapped segments corresponding to the most
highly conserved regions of proteins (local alignment
versus global alignment).
🞂 Blocks contains sequences at all different
evolutionary distances.
BLOSUM
🞂 In each block alignment, sequences whose percent identity exceeds
a threshold are clustered into groups, and the members of each
group are averaged so that the group counts as a single sequence.
🞂 Different BLOSUM matrices differ in the % sequence identity
threshold used in clustering.
🞂 The number in the matrix name is that threshold: for BLOSUM62,
sequences that are 62% or more identical are clustered, so the
matrix reflects substitutions between sequences with at most
about 62% identity.
🞂 BLOSUM62 represents more closely related sequences than BLOSUM45.
Construction of BLOSUM
Step 0: Eliminate sequences that are more than r% identical
(r is the number in the matrix name, e.g. r = 62 for BLOSUM62).
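One simple way to carry out Step 0 is sketched below, assuming the block is a list of equal-length, ungapped segments. The function names, the toy segments, and the linking rule (a segment joins any cluster containing a member more than r% identical to it) are illustrative assumptions, not a reproduction of the original BLOSUM software.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions with the same residue in two equal-length,
    ungapped segments."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_by_identity(segments: list[str], r: float) -> list[list[str]]:
    """Group segments so that any two linked by more than r% identity
    end up in the same cluster."""
    clusters: list[list[str]] = []
    for seg in segments:
        linked, rest = [], []
        for c in clusters:
            close = any(percent_identity(seg, m) > r for m in c)
            (linked if close else rest).append(c)
        # Merge the new segment with every cluster it is linked to.
        merged = [seg] + [m for c in linked for m in c]
        clusters = rest + [merged]
    return clusters


# Toy block over a hypothetical 3-letter alphabet.
block = ["AABCA", "AABCC", "CCBAB"]
print(cluster_by_identity(block, r=62))
```

Here the first two segments are 80% identical and fall into one cluster, while the third stays on its own. Each cluster would then contribute to the pair counts as a single sequence.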
Construction of BLOSUM
🞂 Step 3: Count the observed frequency of each amino acid
pair.
◦ AB_obs = 8/60
🞂 Step 4: Calculate the expected frequency of each amino acid
pair.
◦ AB_exp = (14/24 × 4/24) × 2 = 112/576,
where 14/24 and 4/24 are the frequencies of A and B.
● The factor of 2 appears because the ancestral states are not known,
so the substitutions AB and BA are treated as equiprobable.
🞂 Step 5: Calculate the log-odds ratio.
◦ 2 log2(AB_obs / AB_exp) = 2 log2((8/60) / (112/576))
= -1.09
Construction of BLOSUM
Pair   Observed (O)   Expected (E)   2 log2(O/E)
AA     26/60          196/576         0.70
AB     8/60           112/576        -1.09
AC     10/60          168/576        -1.61
BB     3/60           16/576          1.70
BC     6/60           48/576          0.53
CC     7/60           36/576          1.80
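The table can be reproduced with a short sketch of Steps 3 to 5. The pair counts are taken from the Observed column. Deriving the residue frequencies from those pair counts is an assumption made here, and it gives the same 14/24, 4/24 and 6/24 used for the Expected column. Variable and function names are hypothetical.

```python
from math import log2

# Observed pair counts from the worked example (60 pairs in total).
pair_counts = {"AA": 26, "AB": 8, "AC": 10, "BB": 3, "BC": 6, "CC": 7}
total_pairs = sum(pair_counts.values())

# Each pair contributes two residues, so count both letters of every pair.
residue_counts = {}
for pair, n in pair_counts.items():
    for residue in pair:
        residue_counts[residue] = residue_counts.get(residue, 0) + n
total_residues = sum(residue_counts.values())
freq = {r: n / total_residues for r, n in residue_counts.items()}


def blosum_score(pair: str) -> float:
    """Return 2 * log2(observed / expected) for one residue pair."""
    x, y = pair
    observed = pair_counts[pair] / total_pairs
    # Unlike pairs are doubled because XY and YX are treated as equiprobable.
    expected = freq[x] * freq[y] * (1 if x == y else 2)
    return 2 * log2(observed / expected)


scores = {pair: round(blosum_score(pair), 2) for pair in pair_counts}
print(scores)
```

In a real BLOSUM matrix these values are additionally rounded to the nearest integer.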
BLOSUM Matrices
🞂 No extrapolations are made in going to higher
evolutionary distances.
🞂 High number - closely related sequences
🞂 Low number - distant sequences.
🞂 BLOSUM62 is the most popular: best for general
alignment.
PAM VS BLOSUM
PAM | BLOSUM
PAM matrices are used to score alignments between closely related protein sequences. | BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.
Based on global alignments. | Based on local alignments.
Alignments have higher similarity than BLOSUM alignments. | Alignments have lower similarity than PAM alignments.
Higher numbers in the PAM matrix name denote greater evolutionary distance. | Higher numbers in the BLOSUM matrix name denote higher sequence similarity and smaller evolutionary distance.
Useful at short evolutionary distances (PAM10 - PAM120). | At long evolutionary distances, for example PAM250 or 20% identity, BLOSUM matrices are more effective.
Example: PAM250 is used for more distant sequences than PAM120. | Example: BLOSUM80 is used for more closely related sequences than BLOSUM62.