Second - Done - W14a - Substitution Patterns
Why Substitution Patterns?
• Mutations happen because of mistakes in DNA
replication and repair.
• Our genetic code changes due to mutations
– Insert, delete, replace
• Three types of mutations
– Advantageous
– Disadvantageous
– Neutral
• We only observe substitutions that passed the selection process
Mutation Rates
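• A parent organism diverges into Organism A and Organism B over time T
• K: number of substitutions between A and B
• Substitution rate: R = K/(2T) (each of the two lineages has evolved for time T, hence the factor 2)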
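As a small illustration of the rate formula on this slide, the sketch below evaluates R = K/(2T). The function name and the example numbers are hypothetical and chosen only to show the arithmetic.

```python
def substitution_rate(k: float, t_years: float) -> float:
    """Rate R = K / (2T): K substitutions separate two lineages that each
    evolved independently for T years since their common parent."""
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * t_years)


# Hypothetical input: K = 0.2 substitutions per site, T = 50 million years.
rate = substitution_rate(0.2, 50e6)
print(f"{rate:.2e} substitutions per site per year")
```

Dividing by 2T rather than T reflects that the K differences accumulated along both branches.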
Functional Constraints
• Functional sites are less likely to mutate
– Noncoding = 3.33 (subs/10^9 yr)
– Coding = 1.58 (subs/10^9 yr)
• Indels are about 10 times less likely than substitutions
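Nucleotide Substitutions and Amino Acids
• Synonymous substitutions do not change the amino acid
• Nonsynonymous substitutions do
• Degeneracy of a codon position
– Fourfold degenerate: any base at the position gives the same amino acid, e.g. third position of gly = {GGG, GGA, GGU, GGC}
– Twofold degenerate: two of the four bases give the same amino acid, e.g. third position of asp = {GAU, GAC}, glu = {GAA, GAG}
– Non-degenerate: every change alters the amino acid, e.g. first position of phe = UUU, leu = CUU, ile = AUU, val = GUU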
• Example substitution rates in human and mouse
– Fourfold degenerate: 2.35
– Twofold degenerate: 1.67
– Non-degenerate: 0.56
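The degeneracy classes above can be checked mechanically. The following is a minimal sketch that assumes the standard genetic code; the table layout (codons enumerated in UCAG order) and the function name `degeneracy` are choices made for this illustration, not part of the slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def degeneracy(codon: str, pos: int) -> int:
    """Count the bases at position `pos` (0-based) that keep the amino acid."""
    aa = CODON_TABLE[codon]
    variants = (codon[:pos] + b + codon[pos + 1:] for b in BASES)
    return sum(CODON_TABLE[v] == aa for v in variants)


print(degeneracy("GGU", 2))  # third position of a glycine codon
print(degeneracy("GAU", 2))  # third position of an aspartate codon
print(degeneracy("UUU", 0))  # first position of a phenylalanine codon
```

A count of 4 corresponds to a fourfold degenerate site, 2 to a twofold degenerate site, and 1 to a non-degenerate site.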
Predicting Substitutions
Jukes-Cantor Model
• Each nucleotide can change into any other one with the same probability x
– $P(A \to A', 1) = x$ for each $A' \neq A$
– $P(A \to A, 1) = 1 - 3x$
• Compute $P(A \to A', 2)$ and $P(A \to A, 2)$
– $P(A \to A, t+1) = 3\,P(A \to A', t)\,P(A' \to A, 1) + P(A \to A, t)\,P(A \to A, 1)$
– $P(A \to A, t) \approx \frac{1}{4} + \frac{3}{4} e^{-4xt}$
• K = number of substitutions per site: $K = -\frac{3}{4} \ln\left(1 - \frac{4}{3} f\right)$, where f = fraction of observed substitutions
• Oversimplification
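The recurrence and the distance formula on this slide can be tried numerically. This is a sketch under two assumptions that go beyond the slide text: by symmetry, P(A->A', t) is taken to be (1 - P(A->A, t))/3 for each specific A', and the rate in the exponent of the closed form is taken to be the per-step probability x. All function names are hypothetical.

```python
import math


def p_same(x: float, t: int) -> float:
    """P(A->A, t) computed step by step from the recurrence."""
    same = 1.0  # P(A->A, 0)
    for _ in range(t):
        diff = (1.0 - same) / 3.0  # P(A->A', t) for one specific A' (symmetry)
        same = 3.0 * diff * x + same * (1.0 - 3.0 * x)
    return same


def p_same_closed(x: float, t: float) -> float:
    """Continuous-time approximation 1/4 + (3/4) exp(-4xt)."""
    return 0.25 + 0.75 * math.exp(-4.0 * x * t)


def jukes_cantor_distance(f: float) -> float:
    """K = -(3/4) ln(1 - (4/3) f), with f the observed fraction of differences."""
    if not 0.0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


print(p_same(0.01, 2))                               # (1 - 3x)^2 + 3x^2
print(p_same(0.001, 100), p_same_closed(0.001, 100))  # recurrence vs. closed form
print(jukes_cantor_distance(0.3))                     # exceeds the observed 0.3
```

The estimated K is larger than the observed fraction f because some sites change more than once and multiple changes at a site are seen as at most one difference.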
Two Parameter Model
• Transition:
– purine -> purine (A, G), pyrimidine -> pyrimidine (C, T)
• Transversion:
– purine <-> pyrimidine
• Transitions are more likely than transversions.
• Use different probabilities for transitions and transversions.
Two Parameter Model
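• x: probability of the transition; y: probability of each of the two transversions
• $P(AA, 1) = 1 - x - 2y$
• Compute $P(AA, 2)$:
$P(AA, 2) = (1 - x - 2y)\,P(AA, 1) + x\,P(AG, 1) + y\,P(AC, 1) + y\,P(AT, 1)$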
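To make the two-step computation concrete, here is a minimal sketch that builds the one-step table and sums over the intermediate nucleotide. It assumes x is the probability of the transition and y the probability of each transversion; the function names and the numeric values are hypothetical.

```python
def two_parameter_matrix(x: float, y: float) -> dict:
    """One-step probabilities: x for the transition, y for each transversion."""
    purines = {"A", "G"}
    bases = "ACGT"
    matrix = {}
    for a in bases:
        for b in bases:
            if a == b:
                matrix[a, b] = 1.0 - x - 2.0 * y
            elif (a in purines) == (b in purines):
                matrix[a, b] = x  # transition: same chemical class
            else:
                matrix[a, b] = y  # transversion: purine <-> pyrimidine
    return matrix


def two_step(matrix: dict, a: str, b: str) -> float:
    """P(ab, 2) = sum over intermediate bases c of P(ac, 1) * P(cb, 1)."""
    return sum(matrix[a, c] * matrix[c, b] for c in "ACGT")


m = two_parameter_matrix(x=0.02, y=0.01)
print(two_step(m, "A", "A"))  # equals (1 - x - 2y)^2 + x^2 + 2y^2
```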
More Parameters?
• Assign a different probability to each pair of nucleotides
• Not harder to compute than simpler models
• Not necessarily better than simpler models
Amino Acid substitutions (1)
• Harder to model than nucleotides
– An amino acid can be substituted for another in more than one way
– The number of nucleotide substitutions needed to transform one amino acid into another may differ
• pro = CCC, leu = CUC, ile = AUC: pro -> leu needs one substitution, pro -> ile needs two
– The likelihood of nucleotide substitutions may differ
• asp = GAU, asn = AAU, his = CAU: asp -> asn is a transition (G -> A), asp -> his is a transversion (G -> C)
– Amino acid substitutions may have different effects on protein function
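The two codon examples on this slide can be compared position by position. The sketch below uses only the specific codons quoted above; other codons for the same amino acids could give different answers, and the function name is hypothetical.

```python
PURINES = {"A", "G"}


def codon_changes(src: str, dst: str) -> list:
    """Return (position, kind) for each nucleotide that differs between codons."""
    changes = []
    for pos, (a, b) in enumerate(zip(src, dst)):
        if a != b:
            same_class = (a in PURINES) == (b in PURINES)
            changes.append((pos, "transition" if same_class else "transversion"))
    return changes


print(codon_changes("CCC", "CUC"))  # pro -> leu
print(codon_changes("CCC", "AUC"))  # pro -> ile
print(codon_changes("GAU", "AAU"))  # asp -> asn
print(codon_changes("GAU", "CAU"))  # asp -> his
```

The length of each list is the number of nucleotide substitutions needed, and the labels show whether each one is a transition or a transversion.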
Amino Acid substitutions (2)
• Mutation rates may vary greatly among genes
– In some genes, a nonsynonymous substitution is less likely to affect function
• Molecular clock (Zuckerkandl, Pauling)
– Mutation rates may differ between organisms, but each remains almost constant over time
Scoring Matrices
What Is a Scoring Matrix, and Why?
• Let the alphabet contain N letters
– N = 4 for nucleotides and N = 20 for amino acids
• N x N matrix
• Entry (i,j) shows the relationship between the ith and jth letters
– Positive number if letter i is likely to mutate into letter j
– Negative otherwise
– Magnitude shows the degree of proximity
• Symmetric
The BLOSUM45 Matrix
A R N D C Q E G H I L K M F P S T W Y V
A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -2 -2 0
R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 -1 -2
N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3
D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3
C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3
E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3
G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3
H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 2 -3
I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3
L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1
K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 -1 -2
M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1
F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0
P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3
S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5 -3 -1 0
W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15 3 -3
Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1
V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5
Scoring Matrices for DNA
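Identity (match 1, mismatch 0):
    A   C   G   T
A   1   0   0   0
C   0   1   0   0
G   0   0   1   0
T   0   0   0   1
Uniform mismatch penalty (match 1, mismatch -3):
    A   C   G   T
A   1  -3  -3  -3
C  -3   1  -3  -3
G  -3  -3   1  -3
T  -3  -3  -3   1
Transitions penalized less than transversions (match 1, transition -1, transversion -5):
    A   C   G   T
A   1  -5  -1  -5
C  -5   1  -5  -1
G  -1  -5   1  -5
T  -5  -1  -5   1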
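As an illustration of how such a matrix is used, the sketch below scores an ungapped alignment with the matrix from this slide whose entries are 1, -1 and -5. The function name and the example sequences are hypothetical.

```python
BASES = "ACGT"
# Matrix from the slide: match 1, transition -1, transversion -5.
ROWS = [
    [1, -5, -1, -5],
    [-5, 1, -5, -1],
    [-1, -5, 1, -5],
    [-5, -1, -5, 1],
]
SCORE = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}


def alignment_score(s: str, t: str) -> int:
    """Sum the matrix entries over the aligned positions of two sequences."""
    if len(s) != len(t):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(SCORE[a, b] for a, b in zip(s, t))


print(alignment_score("ACGT", "ACAT"))  # one transition (G/A)
print(alignment_score("ACGT", "ACCT"))  # one transversion (G/C)
```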
Scoring Matrices for Amino Acids
• Chemical similarities
– Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
– Polar, Hydrophilic (S, T, C, Y, N, Q)
– Electrically charged (D, E, K, R, H)
– Requires expert knowledge
• Genetic code: Nucleotide substitutions
– E: GAA, GAG
– D: GAU, GAC
– F: UUU, UUC
• Actual substitutions
– PAM
– BLOSUM
Scoring Matrices: Actual Substitutions
• Manually align proteins
• Look for amino acid substitutions
• Entry ~ log(freq(observed)/freq(expected))
• Log-odds matrices
PAM Matrices
(Dayhoff 1972)
PAM
• PAM = “Point Accepted Mutation”: only mutations that have been “accepted” by natural selection are of interest
• An accepted mutation is a mutation that occurred and passed selection; that is, it did not cause the demise of the particular organism where it occurred.
Interpretation of PAM matrices
• PAM-1 : one substitution per 100 residues (a
PAM unit of time)
• “Suppose I start with a given polypeptide sequence M at
time t, and observe the evolutionary changes in the
sequence until 1% of all amino acid residues have
undergone substitutions at time t+n. Let the new
sequence at time t+n be called M’. What is the
probability that a residue of type j in M will be replaced
by i in M’?”
• PAM-K : K PAM time units
PAM Matrices (1)
• Starts with a multiple sequence alignment
of very similar (>85% identity) proteins.
Assumed to be homologous
• Compute the relative mutability, $m_i$, of each amino acid
– e.g. $m_A$ = how many times, on average, was alanine substituted with anything else?
Relative Mutability
• ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
• Across all pairs of sequences, there are 28 A -> X substitutions
• There are 10 Ala residues, so $m_A$ = 28/10 = 2.8
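The counts on this slide can be reproduced from the seven sequences. The sketch below counts, over all pairs of sequences, the aligned positions where exactly one of the two residues is alanine, then divides by the number of alanines; the function name is hypothetical.

```python
from itertools import combinations

ALIGNMENT = [
    "ACGCTAFKI",
    "GCGCTAFKI",
    "ACGCTAFKL",
    "GCGCTGFKI",
    "GCGCTLFKI",
    "ASGCTAFKL",
    "ACACTAFKL",
]


def relative_mutability(alignment, residue):
    """Return (substitutions, occurrences, ratio) for one residue type."""
    substitutions = 0
    for s, t in combinations(alignment, 2):
        for a, b in zip(s, t):
            # Count positions where exactly one of the two letters is `residue`.
            if (a == residue) != (b == residue):
                substitutions += 1
    occurrences = sum(seq.count(residue) for seq in alignment)
    return substitutions, occurrences, substitutions / occurrences


print(relative_mutability(ALIGNMENT, "A"))
```

The 28 substitutions come from three columns: the first (4 A against 3 non-A), the third (1 against 6) and the sixth (5 against 2).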
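PAM Matrices (2)
• Construct a phylogenetic tree for the sequences in the alignment
[Figure: tree rooted at ACGCTAFKI; the branch to GCGCTAFKI is labeled A -> G and the branch to ACGCTAFKL is labeled I -> L]
• $F_{ij}$: number of j -> i substitutions counted on the tree, e.g. $F_{G,A} = 3$
• $M_{ij} = \frac{m_j F_{ij}}{\sum_i F_{ij}}$, e.g. $M_{G,A} = \frac{2.8 \times 3}{4} = 2.1$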
The PAM Matrix
• The entries of the scoring matrix are the $M_{ij}$ values divided by the frequency of occurrence, $f_i$, of residue i, on a log scale: $R_{ij} = \log_{10}(M_{ij} / f_i)$
• $f_G$ = 10 Gly / 63 residues = 0.1587
• $R_{G,A} = \log_{10}(2.1 / 0.1587) = \log_{10}(13.23) \approx 1.12$
• Log-odds matrix
• Diagonal entries are $M_{jj} = 1 - m_j$
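The worked numbers for the G/A entry can be recomputed in a few lines. This sketch assumes a base-10 logarithm, which is what the slide's own figures imply; evaluating 2.1/0.1587 directly gives roughly 13.23, whose base-10 logarithm is about 1.12. The function names are hypothetical.

```python
import math


def mutation_entry(m_j: float, f_ij: float, f_column_total: float) -> float:
    """M_ij = m_j * F_ij / sum_i F_ij."""
    return m_j * f_ij / f_column_total


def log_odds(m_ij: float, freq_i: float) -> float:
    """R_ij = log10(M_ij / f_i); base 10 is an assumption of this sketch."""
    return math.log10(m_ij / freq_i)


m_ga = mutation_entry(2.8, 3, 4)  # relative mutability 2.8, F_GA = 3, column total 4
f_g = 10 / 63                     # 10 Gly among 63 residues
r_ga = log_odds(m_ga, f_g)
print(round(m_ga, 3), round(f_g, 4), round(r_ga, 3))
```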
Computation of PAM-K
• Assume that changes at time T+1 are independent of the changes at time T.
• Markov chain
• $P(A \to B) = \sum_X P(A \to X)\,P(X \to B)$
• $\text{PAM-}K = (\text{PAM-1})^K$
• PAM-250 is most commonly used
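The Markov-chain rule above amounts to repeated matrix multiplication. The sketch below raises a small one-step matrix to the K-th power in plain Python; the 3-letter matrix is a made-up toy, not real PAM-1 values, and the helper names are hypothetical.

```python
def mat_mul(p, q):
    """Matrix product: entry (i, j) sums P(i -> k) * P(k -> j) over k."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, k):
    """K-step transition matrix, obtained by multiplying the one-step matrix K times."""
    n = len(p)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        result = mat_mul(result, p)
    return result


# Hypothetical 3-letter one-step matrix (each row sums to 1).
TOY_PAM1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.01, 0.98],
]
toy_pam2 = mat_pow(TOY_PAM1, 2)
toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam2[0][1])
print([round(v, 3) for v in toy_pam250[0]])
```

Each row of the K-step matrix still sums to 1, and for large K the rows drift toward a common distribution, which is why large K suits distant relationships.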
PAM - Discussion
• PAM-K with small K is better for closely related sequences; large K is better for distantly related sequences
• Biased towards closely related sequences, since it starts from highly similar sequences (BLOSUM addresses this)
• If $M_{ij}$ is very small, the sample may not be large enough to estimate the real probability. Multiplying the PAM matrices many times magnifies the error.
• Mutation rate may change from one gene to another
BLOSUM Matrices
BLOSUM Matrix
• Begin with a set of protein sequences and obtain blocks.
– ~2000 blocks from 500 families of related proteins
– More data than PAM
• A block is the ungapped alignment of a highly conserved region of a
family of proteins.
• MOTIF program is used to find blocks
• Substitutions in these blocks are used to compute BLOSUM matrix
Constructing the Matrix
• Count the frequency of occurrence of each amino acid. This gives the background distribution $p_a$
• Count the number of times amino acid a is aligned with amino acid b: $f_{ab}$
– A block of width w and depth s contributes $ws(s-1)/2 = n_p$ pairs
• Compute the occurrence probability of each pair
– $q_{ab} = f_{ab} / n_p$
• Compute the probability of occurrence of amino acid a
– $p_a = q_{aa} + \sum_{b \neq a} q_{ab} / 2$
• Compute the expected probability of occurrence of each pair
– $e_{ab} = 2 p_a p_b$ if $a \neq b$, and $e_{aa} = p_a p_a$ otherwise
• Compute the log likelihood ratios, normalize, and round
– $2 \log_2 (q_{ab} / e_{ab})$
Constructing the Matrix: Example
• Block: a single column containing 9 A and 1 S
• $f_{AA} = 36$, $f_{AS} = 9$
• Observed frequencies of pairs
– $q_{AA} = f_{AA}/(f_{AA}+f_{AS}) = 36/45 = 0.8$
– $q_{AS} = 9/45 = 0.2$
• Expected frequencies of letters
– $p_A = q_{AA} + q_{AS}/2 = 0.9$
– $p_S = q_{AS}/2 = 0.1$
• Expected frequencies of pairs
– $e_{AA} = p_A \times p_A = 0.81$
– $e_{AS} = 2 \times p_A \times p_S = 0.18$
• Matrix entries
– $M_{AA} = 2 \log_2(q_{AA}/e_{AA}) = -0.04 \approx 0$
– $M_{AS} = 2 \log_2(q_{AS}/e_{AS}) = 0.3 \approx 0$
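The worked example can be reproduced end to end. The following is a minimal sketch of the counting and log-odds steps for ungapped block columns; it omits the sequence weighting used for real BLOSUM matrices, and the function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_scores(columns):
    """columns: one string per block column. Returns (pair counts, scores)."""
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1   # f_ab, unordered pairs
    n_pairs = sum(pair_counts.values())               # n_p
    q = {pair: n / n_pairs for pair, n in pair_counts.items()}
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:                                         # split mixed pairs evenly
            p[a] += q_ab / 2
            p[b] += q_ab / 2
    scores = {}
    for (a, b), q_ab in q.items():
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[a, b] = 2 * math.log2(q_ab / e_ab)
    return pair_counts, scores


counts, scores = blosum_scores(["AAAAAAAAAS"])  # one column: 9 A and 1 S
print(counts["A", "A"], counts["A", "S"])
print(round(scores["A", "A"], 2), round(scores["A", "S"], 2))
```

Both entries round to 0 because a single column with one mismatch carries almost no information about substitution preferences.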
Computation of BLOSUM-K
• Different levels of the BLOSUM matrix can be created by differentially weighting the degree of similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, the contribution of these sequences is weighted to sum to one. In this way the contribution of multiple entries of closely related sequences is reduced.
• Larger numbers are used to measure recent divergence; the default is BLOSUM62
BLOSUM 62 Matrix
Check scores for
• MILV: small hydrophobic
• NDEQ: acid, hydrophilic
• HRK: basic
• FYW: aromatic
• STPAG: small hydrophilic
• C: sulphydryl
PAM vs. BLOSUM
Approximately equivalent matrices:
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45