1. Allele Frequencies
Section 1 Slide 1
Probability
Subjective probability:
Probability is a measure of belief.
Section 1 Slide 2
First Law of Probability
The first law says that a probability can take values only in the range zero
to one, and that an event which is certain has probability one.
0 ≤ Pr(E) ≤ 1
Section 1 Slide 3
Second Law of Probability
If E and F are mutually exclusive events, the probability of E or F is
the sum of the two probabilities:
Pr(E or F) = Pr(E) + Pr(F)
Section 1 Slide 4
Complementary Probability
Pr(E) + Pr(Ē) = 1
Pr(Ē) = 1 − Pr(E)
The probability that E is false is one minus the probability it is
true.
Section 1 Slide 5
Third Law of Probability
For any two events, G and H, the third law can be written:
Pr(G and H) = Pr(G|H) Pr(H) = Pr(H|G) Pr(G)
For example, the probability that a man has genotype AA and transmits
allele A to his child is
Pr(AA and transmit A) = Pr(transmit A | AA) × Pr(AA) = 1 × pA²
Section 1 Slide 6
Independent Events
Pr(G|H) = Pr(G)
and the third law then gives
Pr(G and H) = Pr(G) Pr(H)
Section 1 Slide 7
Law of Total Probability
For any events G and H,
Pr(G) = Pr(G|H) Pr(H) + Pr(G|H̄) Pr(H̄)
Section 1 Slide 9
Bayes’ Theorem Example
Now what is the probability that a man has genotype A1A2 given
that he transmits allele A1 to his child?
Pr(G|H) = Pr(H|G) Pr(G) / Pr(H)
        = (0.5 × 2p1p2) / p1
        = p2
Here Pr(G) = 2p1p2 is the genotype probability, Pr(H|G) = 0.5, and
Pr(H) = p1 is the transmission probability given by the law of total
probability.
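As a quick numeric check of this calculation (a sketch, with illustrative
values for p1 and p2; any valid frequencies give the same cancellation):

```python
# Numeric check of the Bayes' theorem example.
# p1, p2 are illustrative allele frequencies for A1 and A2.
p1, p2 = 0.3, 0.2

pr_G = 2 * p1 * p2        # Pr(G): man is A1A2, assuming Hardy-Weinberg
pr_H_given_G = 0.5        # Pr(H|G): an A1A2 man transmits A1
pr_H = p1                 # Pr(H): total probability of transmitting A1

print(pr_H_given_G * pr_G / pr_H)   # 0.2, equal to p2
```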
Section 1 Slide 10
Sampling
Section 1 Slide 11
Classical Model
Reference population
(Usually assumed infinite and in equilibrium)
        ↓                             ↓
Time 1  Population of size N   · · ·  Population of size N
        ↓                             ↓
Time 2  Population of size N   · · ·  Population of size N
        ↓                             ↓
        ...                           ...
        ↓                             ↓
Time t  Population of size N   · · ·  Population of size N
        ↓                             ↓
        Sample of size n       · · ·  Sample of size n

Replicate populations of size N descend independently from the reference
population, and a sample of size n is taken from each replicate at time t.
Section 1 Slide 12
Aside: Coalescent Theory
Section 1 Slide 13
Binomial Distribution
Section 1 Slide 14
Properties of Estimators
Section 1 Slide 15
Binomial Distribution
Section 1 Slide 16
Binomial distribution
p × p × · · · × p = p^x
(1 − p) × (1 − p) × · · · × (1 − p) = (1 − p)^(n−x)
L(p|x) ∝ p^x (1 − p)^(n−x)
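A minimal sketch of this likelihood in Python (illustrative x and n; the
grid maximum agrees with the MLE x/n):

```python
import numpy as np

# Binomial likelihood L(p|x) ∝ p^x (1-p)^(n-x), evaluated on a grid.
n, x = 4, 1
p = np.linspace(0.001, 0.999, 999)
L = p**x * (1 - p)**(n - x)

print(p[np.argmax(L)])   # ~0.25, the MLE x/n
```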
Section 1 Slide 18
Likelihood L(p|x, n = 4)
Section 1 Slide 19
Binomial Mean
E(p̃) = p
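This follows from E(x) = np for the binomial count x; as a one-line
derivation in the notation above:

```latex
\mathrm{E}(\tilde p) = \mathrm{E}\!\left(\frac{x}{n}\right)
                     = \frac{1}{n}\,\mathrm{E}(x)
                     = \frac{np}{n} = p
```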
Section 1 Slide 20
Binomial Variance
Var(p̃) = Var(x/n) = np(1 − p)/n² = p(1 − p)/n
Section 1 Slide 21
Normal Approximation
Provided np is not too small (e.g. not less than 5), the binomial
distribution can be approximated by the normal distribution with
the same mean and variance. In particular:
p̃ ∼ N(p, p(1 − p)/n)
To use the normal distribution in practice, change to the standard
normal variable z with a mean of 0, and a variance of 1:
z = (p̃ − p) / √(p(1 − p)/n)
For a standard normal, 95% of the values lie between ±1.96.
The normal approximation to the binomial therefore implies that
95% of the values of p̃ lie in the range
p ± 1.96 √(p(1 − p)/n)
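In practice p is replaced by p̃ in the standard error. A sketch with
illustrative counts:

```python
import math

# Approximate 95% confidence interval for a binomial proportion,
# using the normal approximation with p~ in place of p.
x, n = 30, 100
p_tilde = x / n
se = math.sqrt(p_tilde * (1 - p_tilde) / n)

print(p_tilde - 1.96 * se, p_tilde + 1.96 * se)   # (0.210, 0.390)
```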
Section 1 Slide 22
Confidence Intervals
Section 1 Slide 23
Confidence Intervals
Section 1 Slide 24
Confidence Intervals
Section 1 Slide 25
Exact Confidence Intervals: One-sided
The upper bound pU is the largest value of p for which the probability
of the observed count x or fewer is at least 5%:
Σ_{k=0}^{x} C(n, k) pU^k (1 − pU)^(n−k) ≥ 0.05
If x = 0, then (1 − pU)^n ≥ 0.05 if pU ≤ 1 − 0.05^(1/n), and this is
0.0295 when n = 100. More generally, pU ≈ 3/n when x = 0.
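The bound can be found numerically for any x; a bisection sketch (the
function name is mine, not part of the notes):

```python
from math import comb

# Exact one-sided upper 95% bound: the largest p with Pr(X <= x; p) >= 0.05.
def upper_bound(x, n, alpha=0.05):
    def tail(p):
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):          # bisection
        mid = (lo + hi) / 2
        if tail(mid) > alpha:
            lo = mid             # p can be pushed higher
        else:
            hi = mid
    return lo

print(upper_bound(0, 100))   # ~0.0295 = 1 - 0.05**(1/100)
```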
Section 1 Slide 26
Aside: Two-sided Exact Confidence Intervals
Σ_{k=0}^{x} C(n, k) pU^k (1 − pU)^(n−k) ≥ 0.025
Σ_{k=x}^{n} C(n, k) pL^k (1 − pL)^(n−k) ≥ 0.025
Section 1 Slide 27
Exact CIs for n = 10
One-sided Two-sided
x p̃ pU x pL p̃ pU
0 0.00 0.26 0 0.00 0.00 0.31
1 0.10 0.39 1 0.00 0.10 0.45
2 0.20 0.51 2 0.03 0.20 0.56
3 0.30 0.61 3 0.07 0.30 0.65
4 0.40 0.70 4 0.12 0.40 0.74
5 0.50 0.78 5 0.19 0.50 0.81
6 0.60 0.85 6 0.26 0.60 0.88
7 0.70 0.91 7 0.35 0.70 0.93
8 0.80 0.96 8 0.44 0.80 0.97
9 0.90 0.99 9 0.55 0.90 1.00
10 1.00 1.00 10 0.69 1.00 1.00
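These limits can be reproduced from beta quantiles (the Clopper-Pearson
form); a sketch assuming scipy is available:

```python
from scipy.stats import beta

# Two-sided 95% exact (Clopper-Pearson) limits for x successes in n trials.
def exact_ci(x, n, alpha=0.05):
    pL = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    pU = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return pL, pU

for x in range(11):
    pL, pU = exact_ci(x, 10)
    print(x, round(pL, 2), round(pU, 2))   # matches the two-sided columns
```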
Section 1 Slide 28
Bootstrapping
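A minimal sketch of the idea for an allele frequency, assuming a sample of
allele indicators and percentile intervals (illustrative data, not from
the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bootstrap: resample the observed alleles with replacement many times
# and use percentiles of the resampled frequencies as an interval.
alleles = np.array([1] * 30 + [0] * 70)   # 1 = allele A, so p~ = 0.30
boot = [rng.choice(alleles, alleles.size, replace=True).mean()
        for _ in range(10_000)]

print(np.percentile(boot, [2.5, 97.5]))   # approximate 95% interval
```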
Section 1 Slide 29
Allele Frequency Sampling
Section 1 Slide 30
Multinomial Distribution
For a SNP with alleles A and a, the three genotypes and their
probabilities are
Genotype Probability
AA PAA
Aa or aA PAa
aa Paa
E(P̃i) = Pi
Var(P̃i) = Pi(1 − Pi)/n
Cov(P̃i, P̃j) = −PiPj/n, i ≠ j
The covariance is defined as E[(P̃i − Pi)(P̃j − Pj)].
For the genotype counts ni:
E(ni) = nPi
Var(ni) = nPi(1 − Pi)
Cov(ni, nj) = −nPiPj, i ≠ j
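A simulation sketch verifying these moments (illustrative Pi and n):

```python
import numpy as np

rng = np.random.default_rng(0)

# Multinomial genotype counts: check E, Var and Cov of the P~i.
P = np.array([0.25, 0.50, 0.25])          # PAA, PAa, Paa
n, reps = 100, 200_000
P_tilde = rng.multinomial(n, P, size=reps) / n

print(P_tilde.mean(axis=0))                        # ~ P
print(P_tilde.var(axis=0))                         # ~ P(1-P)/n
print(np.cov(P_tilde[:, 0], P_tilde[:, 1])[0, 1])  # ~ -P0*P1/n = -0.00125
```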
Section 1 Slide 32
Allele Frequency Sampling Distribution
Section 1 Slide 33
Within-population Variance
PAA = pA² + f pA pa
PAa = 2pA pa − 2f pA pa
Paa = pa² + f pA pa
Section 1 Slide 35
Bounds on f
Since
pA ≥ PAA = pA² + f pA(1 − pA) ≥ 0
pa ≥ Paa = pa² + f pa(1 − pa) ≥ 0
there are bounds on f:
−pA/(1 − pA) ≤ f ≤ 1
−pa/(1 − pa) ≤ f ≤ 1
or, since 1 − pA = pa,
max(−pA/pa, −pa/pA) ≤ f ≤ 1
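As a direct transcription of the bound (a hypothetical helper, not from
the notes):

```python
# Feasible range of f for a given allele frequency pA (pa = 1 - pA).
def f_bounds(pA):
    pa = 1 - pA
    return max(-pA / pa, -pa / pA), 1.0

print(f_bounds(0.5))   # (-1.0, 1.0): widest range at pA = 0.5
print(f_bounds(0.1))   # (-0.111..., 1.0)
```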
Section 1 Slide 36
An aside: Indicator Variables
Let xjk be an indicator for the k-th allele (k = 1, 2) of individual j:
xjk = 1 if that allele is A, and xjk = 0 otherwise. Then
E(xjk) = pA
E(xjk²) = pA
E(xjk xjk′) = PAA, k ≠ k′
If there is random sampling, individuals are independent, and
E(xjk xj′k′) = pA², j ≠ j′
Section 1 Slide 37
An aside: Intraclass Correlation
Var(xjk) = E(xjk²) − [E(xjk)]²
         = pA(1 − pA)
         = Var(xjk′), k ≠ k′
and
Cov(xjk, xjk′) = E(xjk xjk′) − E(xjk) E(xjk′)
              = PAA − pA²
              = f pA(1 − pA), k ≠ k′
so the intraclass correlation of the two indicators within an
individual is f.
Section 1 Slide 38
Allele Dosage
Section 1 Slide 39
Maximum Likelihood Estimation: Allele Data
Section 1 Slide 40
Maximum Likelihood Estimation: Genotype Data
P̂AA = nAA/n
P̂Aa = nAa/n
P̂aa = naa/n
Section 1 Slide 41
Maximum Likelihood Estimation: f
Because
PAA = pA² + f pA(1 − pA)
PAa = 2pA(1 − pA) − 2f pA(1 − pA)
Paa = (1 − pA)² + f pA(1 − pA)
the likelihood function for pA, f is
L(pA, f) = n!/(nAA! nAa! naa!) × [pA² + pA(1 − pA)f]^nAA
           × [2pA(1 − pA)(1 − f)]^nAa × [(1 − pA)² + pA(1 − pA)f]^naa
Bailey's method equates the observed genotype proportions to their
expected values:
nAA/n = p̂A² + f̂ p̂A(1 − p̂A)
nAa/n = 2p̂A(1 − p̂A) − 2f̂ p̂A(1 − p̂A)
naa/n = (1 − p̂A)² + f̂ p̂A(1 − p̂A)
Solving these equations (adding the first equation to half the second
gives p̂A; substituting that into one equation then gives f̂):
p̂A = (2nAA + nAa)/(2n) = p̃A
f̂ = 1 − nAa/[2n p̃A(1 − p̃A)] = 1 − P̃Aa/(2p̃A p̃a)
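A direct transcription with illustrative genotype counts:

```python
# MLEs of pA and f from genotype counts (illustrative values).
nAA, nAa, naa = 30, 40, 30
n = nAA + nAa + naa

p_tilde = (2 * nAA + nAa) / (2 * n)                 # p~A = 0.5
f_hat = 1 - (nAa / n) / (2 * p_tilde * (1 - p_tilde))

print(p_tilde, f_hat)   # 0.5, 0.2
```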
Section 1 Slide 43
Aside: Three-allele Case
With three alleles, there are six genotypes and 5 df. To use Bailey's
method, we would need five parameters: two allele frequencies and three
inbreeding coefficients. For example
P11 = p1² + f12 p1p2 + f13 p1p3
P12 = 2p1p2 − 2f12 p1p2
P22 = p2² + f12 p1p2 + f23 p2p3
P13 = 2p1p3 − 2f13 p1p3
P23 = 2p2p3 − 2f23 p2p3
P33 = p3² + f13 p1p3 + f23 p2p3
We would generally prefer to have only one inbreeding coefficient
f . It is a difficult numerical problem to find the MLE for f .
Section 1 Slide 44
Method of Moments
(Bailey’s method is for the special case where the MLEs are also
MoM estimates.)
Section 1 Slide 45
Aside: Method of Moments
For the inbreeding coefficient at loci with m alleles Au, two possible
MoM estimates are (for large sample sizes)
f̂LH1 = Σ_{u=1}^{m} (P̃uu − p̃u²) / Σ_{u=1}^{m} p̃u(1 − p̃u)
f̂LH5 = [1/(m − 1)] Σ_{u=1}^{m} (P̃uu − p̃u²)/p̃u
These both have low bias. Their variances depend on the value
of f .
For loci with two alleles, m = 2, the two moment estimates are
equal to each other and to the maximum likelihood estimate:
f̂LH1 = f̂LH5 = 1 − P̃Aa/(2p̃A p̃a)
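A sketch of both estimators for an m-allele locus (illustrative
homozygote proportions P̃uu and allele frequencies p̃u):

```python
import numpy as np

# Moment estimators of f from homozygote proportions and allele freqs.
P_uu = np.array([0.30, 0.15, 0.10])   # P~uu for u = 1..m
p = np.array([0.50, 0.30, 0.20])      # p~u
m = len(p)

f_LH1 = np.sum(P_uu - p**2) / np.sum(p * (1 - p))
f_LH5 = np.sum((P_uu - p**2) / p) / (m - 1)

print(f_LH1, f_LH5)   # the two estimates differ when m > 2
```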
Section 1 Slide 46
Aside: Maximum Likelihood for Recessive Alleles
If allele a is recessive, only the naa homozygotes aa can be
distinguished, and the likelihood depends on pa alone:
L(pa) = (pa²)^naa (1 − pa²)^(n−naa)
ln[L(pa)] = 2naa ln(pa) + (n − naa) ln(1 − pa²)
Differentiating wrt pa:
∂ ln L(pa)/∂pa = 2naa/pa − 2pa(n − naa)/(1 − pa²)
Setting this to zero leads to an equation that can be solved
explicitly: p̂a = √(naa/n).
Section 1 Slide 47
Aside: EM Algorithm for Recessive Alleles
Section 1 Slide 48
Aside: EM Algorithm for Recessive Alleles
The E step replaces the unobservable count nAa by its expectation among
the n − naa dominant-phenotype individuals, 2pa(n − naa)/(1 + pa); the M
step then re-estimates pa:
p̂a = (nAa + 2naa)/(2n)
    = [2pa(n − naa)/(1 + pa) + 2naa] / (2n)
    = 2(npa + naa) / [2n(1 + pa)]
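Iterating this update converges to the explicit MLE; a sketch with
illustrative counts:

```python
import math

# EM for a recessive allele: iterate the update until convergence.
n, naa = 100, 16
p = 0.5                                  # any start in (0, 1)
for _ in range(50):
    p = (n * p + naa) / (n * (1 + p))    # E and M steps combined

print(p, math.sqrt(naa / n))             # both ~0.4, the explicit MLE
```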
Section 1 Slide 49
EM Algorithm for Two Loci
A more interesting application of the EM algorithm is the estimation of
two-locus gamete frequencies from unphased genotype data. For locus A
with alleles A, a and locus B with alleles B, b, the ten two-locus
frequencies are:
Genotype   Frequency
AB/AB      pAB²
AB/Ab      2pAB pAb
AB/aB      2pAB paB
AB/ab      2pAB pab
Ab/Ab      pAb²
Ab/aB      2pAb paB
Ab/ab      2pAb pab
aB/aB      paB²
aB/ab      2paB pab
ab/ab      pab²
Section 1 Slide 50
EM Algorithm for Two Loci
Section 1 Slide 51
EM Algorithm for Two Loci
Section 1 Slide 52
Example
B b Total
A nAB = x nAb = 12 − x nA = 12
a naB = 6 − x nab = x + 12 na = 18
Total nB = 6 nb = 24 2n = 30
0≤x≤6
Section 1 Slide 53
Example
EM iterative equation:
x′ = 2nAABB + nAABb + nAaBB + nAaBb × 2pAB pab / (2pAB pab + 2pAb paB)
   = 0 + 0 + 1 + 3 × 2x(x + 12) / [2x(x + 12) + 2(12 − x)(6 − x)]
   = 1 + 3x(x + 12) / [x(x + 12) + (12 − x)(6 − x)]
where x′ is the updated count of AB gametes.
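Iterating from any starting value of x in [0, 6] converges to the EM
estimate; a sketch:

```python
# EM iteration for the expected count x of AB gametes in the example.
x = 3.0                                   # any start in [0, 6]
for _ in range(200):
    t = x * (x + 12)
    x = 1 + 3 * t / (t + (12 - x) * (6 - x))

print(x, x / 30)   # converged x and the gamete frequency pAB = x/30
```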
Section 1 Slide 54
Example