04 CAP5510 Fall21
Substitution Patterns
Tamer Kahveci
CISE Department
University of Florida
Goals
• Understand how mutations occur
• Learn models for predicting the number of
mutations
• Understand why scoring matrices are used
and how they are derived
• Learn major scoring matrices
Why Substitution Patterns?
• Mutations happen because of mistakes in DNA
replication and repair.
• Our genetic code changes due to mutations
– Insert, delete, replace
• Three types of mutations
– Advantageous
– Disadvantageous
– Neutral
• We only observe substitutions that passed the
selection process
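Mutation Rates
• A parent organism diverged into Organism A and Organism B a time T ago
• K: number of substitutions between A and B
• Substitution rate: R = K/(2T)
– Both lineages have been accumulating substitutions for time T, so the
total time separating A and B is 2T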
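A minimal sketch of the rate formula R = K/(2T). The function name and the numbers in the usage line (K = 0.2 substitutions per site, T = 80 million years) are hypothetical values chosen only for illustration.

```python
def substitution_rate(k_substitutions, divergence_time):
    """Return R = K / (2T) for two lineages that split a time T ago."""
    if divergence_time <= 0:
        raise ValueError("divergence time T must be positive")
    # Each lineage evolved independently for T, hence the factor 2.
    return k_substitutions / (2 * divergence_time)


# Hypothetical example: K = 0.2 substitutions per site, T = 80 million years.
rate = substitution_rate(0.2, 80e6)
print(f"R = {rate:.3e} substitutions per site per year")
```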
Functional Constraints
• Functional sites are less likely to mutate
– Noncoding = 3.33 (subs/10^9 yr)
– Coding = 1.58 (subs/10^9 yr)
• Indels about 10 times less likely than
substitutions
Nucleotide Substitutions and Amino Acids
• Synonymous substitutions do not change amino acids
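• Nonsynonymous substitutions do change amino acids
• Degeneracy (a property of a codon position)
– Fourfold degenerate: all four nucleotides at the site give the same
amino acid, e.g. the third position of gly = {GGG, GGA, GGU, GGC}
– Twofold degenerate: two nucleotides at the site give the same amino
acid, e.g. the third position of asp = {GAU, GAC}, glu = {GAA, GAG}
– Non-degenerate: every substitution at the site changes the amino acid,
e.g. the first position of phe = UUU, leu = CUU, ile = AUU, val = GUU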
• Example substitution rates in human and mouse
– Fourfold degenerate: 2.35
– Twofold degenerate: 1.67
– Non-degenerate: 0.56
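The degeneracy classes above can be checked mechanically. The sketch below builds the standard genetic code (codons enumerated in U, C, A, G order) and counts how many of the four nucleotides at a given codon position yield the same amino acid. The helper name `site_degeneracy` is an illustrative choice, not part of the lecture.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def site_degeneracy(codon, position):
    """Count nucleotides at `position` (0-based) that keep the amino acid unchanged."""
    original = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[codon[:position] + base + codon[position + 1:]] == original
        for base in BASES
    )


print(site_degeneracy("GGU", 2))  # third position of a gly codon
print(site_degeneracy("GAU", 2))  # third position of an asp codon
print(site_degeneracy("UUU", 0))  # first position of a phe codon
```

A result of 4 corresponds to a fourfold degenerate site, 2 to a twofold degenerate site, and 1 to a non-degenerate site.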
Predicting Substitutions
Jukes-Cantor Model
• Each nucleotide can change into another
one with the same probability
– P(A->A', 1) = x, for each A' ≠ A
– P(A->A, 1) = 1 – 3x
• Compute P(A->A', 2) & P(A->A, 2)
– P(A->A, t+1) = 3 P(A->A', t) P(A'->A, 1) + P(A->A, t) P(A->A, 1)
– P(A->A, t) ~ 1/4 + (3/4) e^{-4xt}
• K = num. subst. = -(3/4) ln(1 – (4/3) f), f =
fraction of observed substitutions
• Oversimplification
(Figure: nucleotides A, C, G, T; every pair is connected by an edge labeled x.)
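A small sketch of the Jukes-Cantor formulas, assuming x is the per-step probability of changing into one specific other nucleotide and f is the observed fraction of differing sites. It iterates the one-step recurrence, compares the result with the exponential approximation, and computes the distance K. All function names and the sample values are illustrative.

```python
import math


def p_same_iterative(x, t):
    """Apply the one-step recurrence t times, starting from P(A->A, 0) = 1."""
    p_same = 1.0
    for _ in range(t):
        # By symmetry, each of the three other nucleotides has the same probability.
        p_other = (1.0 - p_same) / 3.0
        p_same = 3 * p_other * x + p_same * (1 - 3 * x)
    return p_same


def p_same_closed_form(x, t):
    """Approximation P(A->A, t) ~ 1/4 + (3/4) e^{-4xt}."""
    return 0.25 + 0.75 * math.exp(-4 * x * t)


def jukes_cantor_distance(f):
    """K = -(3/4) ln(1 - (4/3) f); defined only for 0 <= f < 3/4."""
    if not 0 <= f < 0.75:
        raise ValueError("f must satisfy 0 <= f < 0.75")
    return -0.75 * math.log(1 - 4 * f / 3)


print(p_same_iterative(0.001, 100), p_same_closed_form(0.001, 100))
print(jukes_cantor_distance(0.3))
```

The estimated K is larger than the observed fraction f because some sites have changed more than once, and those repeated changes are not visible in a direct comparison.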
Two Parameter Model
• Transition:
– purine->purine (A, G), pyrimidine->pyrimidine (C, T)
• Transversion:
– purine <-> pyrimidine
• Transitions are more likely than transversions.
• Use different probabilities for transitions and transversions.
(Figure: purines A and G; pyrimidines C and T.)
Two Parameter Model
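• P(AA,1) = 1-x-2y, where x is the probability of the transition and y is
the probability of each of the two transversions
• Compute P(AA,2)
– P(AA,2) = (1-x-2y) P(AA,1) + x P(AG,1) + y P(AC,1) + y P(AT,1)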
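The two-step probability can be computed by composing the one-step matrix with itself. This sketch assumes x is the transition probability and y the probability of each transversion; the numeric values are hypothetical.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def one_step_matrix(x, y):
    """One-step substitution probabilities under the two-parameter model."""
    matrix = {}
    for a in NUCLEOTIDES:
        matrix[a] = {}
        for b in NUCLEOTIDES:
            if a == b:
                matrix[a][b] = 1 - x - 2 * y
            elif (a in PURINES) == (b in PURINES):
                matrix[a][b] = x  # transition: same chemical class
            else:
                matrix[a][b] = y  # transversion: purine <-> pyrimidine
    return matrix


def compose(first, second):
    """Probabilities of going from a to b via any intermediate nucleotide c."""
    return {
        a: {
            b: sum(first[a][c] * second[c][b] for c in NUCLEOTIDES)
            for b in NUCLEOTIDES
        }
        for a in NUCLEOTIDES
    }


x, y = 0.02, 0.005  # hypothetical values with x > y
one_step = one_step_matrix(x, y)
two_step = compose(one_step, one_step)
print(two_step["A"]["A"])
```

With the one-step values substituted, P(AA,2) reduces to (1-x-2y)^2 + x^2 + 2y^2, which the code reproduces.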
More Parameters?
• Assign a different probability for each pair
of nucleotides
• Not harder to compute than simpler
models
• Not necessarily better than simpler models
Amino Acid substitutions (1)
• Harder to model than nucleotides
– An amino acid can be substituted for another in more
than one way
– The number of nucleotide substitutions needed to
transform one amino acid to another may differ
• Pro = CCC, leu = CUC, ile = AUC
– The likelihood of nucleotide substitutions may differ
• Asp = GAU, asn = AAU, his = CAU
– Amino acid substitutions may have different effects on
the protein function
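The two codon examples above can be made concrete with a short sketch that lists the differing positions between two codons and labels each change as a transition or a transversion. It uses only the single representative codons given on the slide; real amino acids usually have several codons, so this is an illustration rather than a full amino acid distance.

```python
PURINES = {"A", "G"}  # every other nucleotide (C, U, T) is treated as a pyrimidine


def codon_differences(codon_a, codon_b):
    """Return (position, from, to, kind) for every site where the codons differ."""
    differences = []
    for position, (a, b) in enumerate(zip(codon_a, codon_b)):
        if a != b:
            same_class = (a in PURINES) == (b in PURINES)
            kind = "transition" if same_class else "transversion"
            differences.append((position, a, b, kind))
    return differences


# Number of substitutions needed differs: pro -> leu needs one, pro -> ile needs two.
print(codon_differences("CCC", "CUC"))
print(codon_differences("CCC", "AUC"))
# Likelihood differs: asp -> asn is a transition, asp -> his is a transversion.
print(codon_differences("GAU", "AAU"))
print(codon_differences("GAU", "CAU"))
```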
Amino Acid substitutions (2)
• Mutation rates may vary greatly among
genes
– Nonsynonymous substitution may affect
functionality with smaller probability in some
genes
• Molecular clock (Zuckerkandl, Pauling)
– Mutation rates may be different for different
organisms, but they remain almost constant
over time.
Scoring Matrices
What is it & why?
• Let alphabet contain N letters
– N = 4 for nucleotides and N = 20 for amino acids
• N x N matrix
• Entry (i,j) shows the relationship between the ith and jth
letters.
– Positive number if letter i is likely to mutate into letter j
– Negative otherwise
– Magnitude shows the degree of proximity
• Symmetric
The BLOSUM45 Matrix
A R N D C Q E G H I L K M F P S T W Y V
A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -2 -2 0
R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 -1 -2
N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3
D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3
C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3
E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3
G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3
H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 2 -3
I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3
L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1
K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 -1 -2
M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1
F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0
P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3
S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5 -3 -1 0
W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15 3 -3
Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1
V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5
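Scoring Matrices for DNA
Identity (match 1, mismatch 0)
A C G T
A 1 0 0 0
C 0 1 0 0
G 0 0 1 0
T 0 0 0 1
Uniform mismatch penalty (match 1, mismatch -3)
A C G T
A 1 -3 -3 -3
C -3 1 -3 -3
G -3 -3 1 -3
T -3 -3 -3 1
Transitions penalized less than transversions (match 1, transition -1, transversion -5)
A C G T
A 1 -5 -1 -5
C -5 1 -5 -1
G -1 -5 1 -5
T -5 -1 -5 1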
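As a rough illustration of how such a matrix is used, the sketch below scores an ungapped alignment by summing the matrix entries for each aligned pair of letters. It uses the matrix with entries 1, -1 and -5 from this slide; the function name and the two example sequences are made up for the example.

```python
NUCLEOTIDES = "ACGT"

# Rows and columns in A, C, G, T order: match 1, transition -1, transversion -5.
TRANSITION_TRANSVERSION = [
    [1, -5, -1, -5],
    [-5, 1, -5, -1],
    [-1, -5, 1, -5],
    [-5, -1, -5, 1],
]


def alignment_score(seq_a, seq_b, matrix, alphabet=NUCLEOTIDES):
    """Sum the matrix entries over the aligned letter pairs (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment needs sequences of equal length")
    index = {letter: i for i, letter in enumerate(alphabet)}
    return sum(matrix[index[a]][index[b]] for a, b in zip(seq_a, seq_b))


# A/G is a transition (-1), C/C a match (1), G/T a transversion (-5), T/T a match (1).
print(alignment_score("ACGT", "GCTT", TRANSITION_TRANSVERSION))  # -4
```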
Scoring Matrices for Amino Acids
• Chemical similarities
– Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
– Polar, Hydrophilic (S, T, C, Y, N, Q)
– Electrically charged (D, E, K, R, H)
– Requires expert knowledge
• Genetic code: Nucleotide substitutions
– E: GAA, GAG
– D: GAU, GAC
– F: UUU, UUC
• Actual substitutions
– PAM
– BLOSUM
Scoring Matrices: Actual Substitutions
• Manually align proteins
• Look for amino acid substitutions
• Entry ~ log(freq(observed)/freq(expected))
• Log-odds matrices
PAM Matrices
(Dayhoff 1972)
PAM
• PAM = “Point Accepted Mutation”: we are
interested only in mutations that have
been “accepted” by natural selection
• An accepted mutation is a mutation that
occurred and was positively selected by
the environment; that is, it did not cause
the demise of the particular organism
where it occurred.
Interpretation of PAM matrices
• PAM-1 : one substitution per 100 residues (a
PAM unit of time)
• “Suppose I start with a given polypeptide sequence M at
time t, and observe the evolutionary changes in the
sequence until 1% of all amino acid residues have
undergone substitutions at time t+n. Let the new
sequence at time t+n be called M’. What is the probability
that a residue of type j in M will be replaced by i in M’?”
• PAM-K : K PAM time units
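A sketch of how K PAM time units relate to one, assuming PAM-K is obtained by multiplying the PAM-1 mutation probability matrix by itself K times. The 3-letter matrix below is a toy stand-in (1% total change per unit of time), not real PAM-1 data, and the helper names are illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(matrix, k):
    """Raise a square matrix to the k-th power by repeated multiplication."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM-1" over a hypothetical 3-letter alphabet: each letter stays
# unchanged with probability 0.99 after one PAM unit of time.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam250[0])
```

After 250 units the probability of observing the original letter is far below 0.99 but still above the 1/3 expected by chance, because repeated substitutions at the same position can restore the original letter.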
PAM Matrices (1)
• Starts with a multiple sequence alignment
of very similar (>85% identity) proteins.
Assumed to be homologous
• Compute the relative mutability, m_i, of
each amino acid
– e.g. m_A = how many times was alanine
substituted with anything else on the
average?
Relative Mutability
• ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
• Across all pairs of sequences, there are 28
A -> X substitutions
• There are 10 ALA residues, so m_A = 2.8
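The counts in this example can be reproduced directly from the seven sequences. The sketch below counts, over all pairs of sequences, the aligned positions where exactly one of the two letters is the residue of interest, then divides by the number of occurrences of that residue. The function name is an illustrative choice.

```python
from itertools import combinations

ALIGNMENT = [
    "ACGCTAFKI",
    "GCGCTAFKI",
    "ACGCTAFKL",
    "GCGCTGFKI",
    "GCGCTLFKI",
    "ASGCTAFKL",
    "ACACTAFKL",
]


def relative_mutability(alignment, residue):
    """Return (substitutions, occurrences, ratio) for one residue type."""
    substitutions = 0
    for seq_a, seq_b in combinations(alignment, 2):
        for a, b in zip(seq_a, seq_b):
            # Count positions where the residue is aligned with a different letter.
            if a != b and residue in (a, b):
                substitutions += 1
    occurrences = sum(seq.count(residue) for seq in alignment)
    return substitutions, occurrences, substitutions / occurrences


print(relative_mutability(ALIGNMENT, "A"))  # (28, 10, 2.8)
```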
PAM Matrices (2)
• Construct a phylogenetic tree for the sequences in the
alignment
(Figure: tree with ancestor ACGCTAFKI; the A->G branch leads to GCGCTAFKI
and the I->L branch leads to ACGCTAFKL. F_G,A = 3.)
PAM - Discussion
• PAM-K with smaller K is better for closely related
sequences; larger K is better for distantly related
sequences
• Biased towards closely related sequences since it starts
from highly similar sequences (BLOSUM solves this)
• If M_i,j is very small, we may not have a large enough
sample to estimate the real probability. When we
multiply the PAM matrices many times, the error is
magnified.
• Mutation rate may change from one gene to another
BLOSUM Matrices
BLOSUM Matrix
• Begin with a set of protein sequences and obtain blocks.
– ~2000 blocks from 500 families of related proteins
– More data than PAM
• A block is the ungapped alignment of a highly conserved region of a
family of proteins.
• The MOTIF program is used to find blocks
• Substitutions in these blocks are used to compute the BLOSUM matrix
Constructing the Matrix: Example
• Example column of a block: 9 A, 1 S
• f_AA = 36, f_AS = 9
• Observed frequencies of pairs
– q_AA = f_AA/(f_AA+f_AS) = 36/45 = 0.8
– q_AS = 9/45 = 0.2
• Expected frequencies of letters
– p_A = q_AA + q_AS/2 = 0.9
– p_S = q_AS/2 = 0.1
• Expected frequencies of pairs
– e_AA = p_A x p_A = 0.81
– e_AS = 2 x p_A x p_S = 0.18
• Matrix entries
– M_AA = 2 x log2(q_AA/e_AA) = -0.04 ~ 0
– M_AS = 2 x log2(q_AS/e_AS) = 0.3 ~ 0
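The worked example can be reproduced with a short sketch that follows the same steps for a single column: count letter pairs, turn them into observed pair frequencies, derive letter frequencies and expected pair frequencies, and take 2 x log2 of the ratio. The function name and the dictionary layout are illustrative choices, and only pairs that actually occur in the column are scored.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column_log_odds(column):
    """Return {(a, b): 2 * log2(q_ab / e_ab)} for the pairs seen in one column."""
    # f: number of times each unordered pair of letters occurs in the column.
    pair_counts = Counter(tuple(sorted(pair)) for pair in combinations(column, 2))
    total_pairs = sum(pair_counts.values())
    # q: observed pair frequencies.
    q = {pair: count / total_pairs for pair, count in pair_counts.items()}
    # p: letter frequencies; a mixed pair contributes half to each of its letters.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # e: expected pair frequency under independence.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * log2(freq / expected)
    return scores


scores = column_log_odds("AAAAASAAAA")  # 9 A and 1 S
print(round(scores[("A", "A")], 2), round(scores[("A", "S")], 2))  # -0.04 0.3
```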
Computation of BLOSUM-K
• Different levels of the BLOSUM matrix can be created by
differentially weighting the degree of similarity between
sequences. For example, a BLOSUM62 matrix is
calculated from protein blocks such that if two
sequences are more than 62% identical, then the
contribution of these sequences is weighted to sum to
one. In this way the contributions of multiple entries of
closely related sequences are reduced.
• Larger K is used to measure recent divergence;
the default is BLOSUM62
BLOSUM 62 Matrix
Check scores for
– MILV: small hydrophobic
– NDEQ: acid, hydrophilic
– HRK: basic
– FYW: aromatic
– STPAG: small hydrophilic
– C: sulphydryl
PAM vs. BLOSUM
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45