
ALLELE FREQUENCIES

Section 1 Slide 1
Probability

Probability provides the language of data analysis.

Equiprobable outcomes definition:

The probability of event E is the number of outcomes favorable to E divided by the total number of outcomes, e.g. the probability of a head is 1/2.

Long-run frequency definition:

If event E occurs n times in N identical experiments, the probability of E is the limit of n/N as N goes to infinity.

Subjective probability:
Probability is a measure of belief.

Section 1 Slide 2
First Law of Probability

The law says that probability can take values only in the range zero to one, and that an event which is certain has probability one:

0 ≤ Pr(E) ≤ 1

Pr(E|E) = 1 for any E

i.e. if event E is known to be true, then it has probability 1. For example:

Pr(Seed is Round|Seed is Round) = 1

Section 1 Slide 3
Second Law of Probability

If G and H are mutually exclusive events, then:

Pr(G or H) = Pr(G) + Pr(H)


For example,

Pr(Seed is Round or Wrinkled) = Pr(Round) + Pr(Wrinkled)

More generally, if Ei, i = 1, . . . , r, are mutually exclusive, then

Pr(E1 or . . . or Er) = Pr(E1) + . . . + Pr(Er) = Σi Pr(Ei)

Section 1 Slide 4
Complementary Probability

If Pr(E) is the probability that E is true, then Pr(Ē) denotes the probability that E is false. Because these two events are mutually exclusive,

Pr(E or Ē) = Pr(E) + Pr(Ē)


and they are also exhaustive in that between them they cover all possibilities – one or other of them must be true. So,

Pr(E) + Pr(Ē) = 1
Pr(Ē) = 1 − Pr(E)

The probability that E is false is one minus the probability that it is true.

Section 1 Slide 5
Third Law of Probability

For any two events, G and H, the third law can be written:

Pr(G and H) = Pr(G) Pr(H|G)


There is no reason why G should precede H and the law can also
be written:

Pr(G and H) = Pr(H) Pr(G|H)


For example

Pr(Seed is round & is type AA)

= Pr(Seed is round|Seed is type AA) × Pr(Seed is type AA)

= 1 × pA²

Section 1 Slide 6
Independent Events

If the information that H is true does nothing to change uncertainty about G, then

Pr(G|H) = Pr(G)
and

Pr(H and G) = Pr(H) Pr(G)

Events G, H are independent.

Section 1 Slide 7
Law of Total Probability

If G, Ḡ are two mutually exclusive and exhaustive events (Ḡ = not G), then for any other event E the law of total probability states that

Pr(E) = Pr(E|G) Pr(G) + Pr(E|Ḡ) Pr(Ḡ)


This generalizes to any set of mutually exclusive and exhaustive events {Si}:

Pr(E) = Σi Pr(E|Si) Pr(Si)
For example

Pr(Seed is round) = Pr(Round|Type AA) Pr(Type AA)
                  + Pr(Round|Type Aa) Pr(Type Aa)
                  + Pr(Round|Type aa) Pr(Type aa)
                = 1 × pA² + 1 × 2pApa + 0 × pa²
                = pA(2 − pA)
Section 1 Slide 8
Bayes’ Theorem

Bayes’ theorem relates Pr(G|H) to Pr(H|G):


Pr(G|H) = Pr(G and H)/Pr(H)          (from the third law)
        = Pr(H|G) Pr(G)/Pr(H)        (from the third law)

If {Gi} are exhaustive and mutually exclusive, Bayes' theorem can be written as

Pr(Gi|H) = Pr(H|Gi) Pr(Gi) / Σi Pr(H|Gi) Pr(Gi)

Section 1 Slide 9
Bayes’ Theorem Example

Suppose G is the event that a man has genotype A1A2 and H is the event that he transmits allele A1 to his child. Then Pr(H|G) = 0.5.

Now what is the probability that a man has genotype A1A2, given that he transmits allele A1 to his child?

Pr(G|H) = Pr(H|G) Pr(G)/Pr(H)
        = (0.5 × 2p1p2)/p1
        = p2
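A quick numeric check of this result (a sketch in Python; the allele frequencies p1, p2 are arbitrary illustrative values):

    # Bayes' theorem check: Pr(G|H) = Pr(H|G) Pr(G) / Pr(H)
    p1, p2 = 0.3, 0.2            # freq(A1), freq(A2)
    pr_G = 2 * p1 * p2           # Pr(A1A2 genotype) under Hardy-Weinberg
    pr_H_given_G = 0.5           # a heterozygote transmits A1 half the time
    pr_H = p1                    # a transmitted allele is A1 with probability p1
    print(pr_H_given_G * pr_G / pr_H, p2)   # both print 0.2, so Pr(G|H) = p2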

Section 1 Slide 10
Sampling

Statistical sampling: the variation among repeated samples from the same population (“fixed” sampling). Inferences can be made about that particular population.

Genetic sampling: the variation among replicate (conceptual) populations (“random” sampling). Inferences are made to all populations with the same history.

Section 1 Slide 11
Classical Model

Reference population
(usually assumed infinite and in equilibrium)
          ↓                          ↓
Time 1    Population of size N  · · ·  Population of size N
          ↓                          ↓
Time 2    Population of size N  · · ·  Population of size N
          ↓                          ↓
          ...                        ...
          ↓                          ↓
Time t    Population of size N  · · ·  Population of size N
          ↓                          ↓
          Sample of size n      · · ·  Sample of size n
Section 1 Slide 12
Aside: Coalescent Theory

An alternative framework works with the genealogical history of a sample of alleles. There is a tree linking all alleles in a current sample to the “most recent common ancestral allele.” Allelic variation is due to mutations since that ancestral allele.

The coalescent approach requires mutation and may be more appropriate for long-term evolution and analyses involving more than one species. The classical approach allows mutation but does not require it: within one species, variation among populations may be due primarily to drift.

Section 1 Slide 13
Binomial Distribution

Section 1 Slide 14
Properties of Estimators

Consistency:   increasing accuracy as sample size increases

Unbiasedness:  expected value is the parameter

Efficiency:    smallest variance

Sufficiency:   contains all the information in the data about the parameter

Section 1 Slide 15
Binomial Distribution

Most population genetic data consist of numbers of observations in some categories. The values and frequencies of these counts form a distribution.

Toss a coin n times and note the number of heads. There are (n + 1) outcomes, and the number of times each outcome is observed in many sets of n tosses gives the sampling distribution. Or: sample n alleles from a population and observe x copies of type A.

Section 1 Slide 16
Binomial distribution

If every toss has the same chance p of giving a head:

The probability of x heads in a row of independent tosses is

p × p × . . . × p = p^x

The probability of n − x tails in a row of independent tosses is

(1 − p) × (1 − p) × . . . × (1 − p) = (1 − p)^(n−x)

The number of ways of ordering x heads and n − x tails among n outcomes is n!/[x!(n − x)!].

The binomial probability of x successes in n trials is

Pr(x|p) = {n!/[x!(n − x)!]} p^x (1 − p)^(n−x)
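This formula translates directly into code; a minimal sketch (the function name binom_pmf is my own):

    from math import comb

    def binom_pmf(x: int, n: int, p: float) -> float:
        """Binomial probability of x successes in n trials."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    print(binom_pmf(3, 4, 0.5))   # chance of 3 heads in 4 fair tosses: 0.25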
Section 1 Slide 17
Binomial Likelihood

The quantity Pr(x|p) is the probability of the data, x successes in n trials, when each trial has probability p of success.

The same quantity, written as L(p|x), is the likelihood of the parameter p when the value x has been observed. The terms that do not involve p are not needed, so

L(p|x) ∝ p^x (1 − p)^(n−x)

Each value of x gives a different likelihood curve, and each curve points to a p value with maximum likelihood. This leads to maximum likelihood estimation.

Section 1 Slide 18
Likelihood L(p|x, n = 4)
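[Figure: likelihood curves L(p|x) for n = 4, presumably one curve per x = 0, 1, . . . , 4.]

Curves like these can be drawn with a short script (a sketch; rescaling each curve to a maximum of 1 is my presentation choice):

    import numpy as np
    import matplotlib.pyplot as plt

    p = np.linspace(0.001, 0.999, 500)
    n = 4
    for x in range(n + 1):
        L = p**x * (1 - p)**(n - x)              # likelihood up to a constant
        plt.plot(p, L / L.max(), label=f"x = {x}")
    plt.xlabel("p"); plt.ylabel("relative likelihood"); plt.legend()
    plt.show()                                   # each curve peaks at p = x/n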

Section 1 Slide 19
Binomial Mean

If there are n trials, each of which has probability p of giving a success, the mean or expected number of successes is np.

The sample proportion of successes is

p̃ = x/n

(This is also the maximum likelihood estimate of p.)

The expected, or mean, value of p̃ is p:

E(p̃) = p

Section 1 Slide 20
Binomial Variance

The expected value of the squared difference between the number of successes and its mean, (x − np)², is np(1 − p). This is the variance of the number of successes in n trials, and indicates the spread of the distribution.

The variance of the sample proportion p̃ is

Var(p̃) = p(1 − p)/n
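A quick simulation check of the mean and variance formulas (a sketch; the parameter values and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, reps = 100, 0.3, 200_000
    x = rng.binomial(n, p, size=reps)
    print(x.mean(), n * p)                   # ~30.0 vs 30
    print(x.var(), n * p * (1 - p))          # ~21.0 vs 21.0
    print((x / n).var(), p * (1 - p) / n)    # ~0.0021 vs 0.0021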

Section 1 Slide 21
Normal Approximation

Provided np is not too small (e.g. not less than 5), the binomial
distribution can be approximated by the normal distribution with
the same mean and variance. In particular:
p̃ ∼ N(p, p(1 − p)/n)

To use the normal distribution in practice, change to the standard normal variable z, with a mean of 0 and a variance of 1:

z = (p̃ − p) / √[p(1 − p)/n]

For a standard normal, 95% of the values lie between ±1.96. The normal approximation to the binomial therefore implies that 95% of the values of p̃ lie in the range

p ± 1.96 √[p(1 − p)/n]

Section 1 Slide 22
Confidence Intervals

A 95% confidence interval is a variable quantity. It has endpoints which vary with the sample. It is expected that 95% of samples will lead to an interval that includes the unknown true value p.

The standard normal variable z has 95% of its values between −1.96 and +1.96. This suggests that a 95% confidence interval for the binomial parameter p is

p̃ ± 1.96 √[p̃(1 − p̃)/n]
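A minimal sketch of this normal-based interval (the function name wald_ci is my own):

    from math import sqrt

    def wald_ci(x: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """Normal-approximation 95% CI for a binomial proportion."""
        pt = x / n
        half = z * sqrt(pt * (1 - pt) / n)
        return max(0.0, pt - half), min(1.0, pt + half)

    print(wald_ci(3, 10))   # (0.016, 0.584) ≈ (0.02, 0.58), the p̃ = 0.3 row below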

Section 1 Slide 23
Confidence Intervals

For samples of size 10, the 11 possible confidence intervals are:

p̃     Confidence interval
0.0   0.0 ± 2√0.000 = (0.00, 0.00)
0.1   0.1 ± 2√0.009 = (0.00, 0.29)
0.2   0.2 ± 2√0.016 = (0.00, 0.45)
0.3   0.3 ± 2√0.021 = (0.02, 0.58)
0.4   0.4 ± 2√0.024 = (0.10, 0.70)
0.5   0.5 ± 2√0.025 = (0.19, 0.81)
0.6   0.6 ± 2√0.024 = (0.30, 0.90)
0.7   0.7 ± 2√0.021 = (0.42, 0.98)
0.8   0.8 ± 2√0.016 = (0.55, 1.00)
0.9   0.9 ± 2√0.009 = (0.71, 1.00)
1.0   1.0 ± 2√0.000 = (1.00, 1.00)

The interval can be modified a little by extending it by the “continuity correction” ±1/(2n) in each direction.

Section 1 Slide 24
Confidence Intervals

To be 95% sure that the estimate is no more than 0.01 from the true value, 1.96√[p(1 − p)/n] should be less than 0.01. The widest confidence interval is when p = 0.5, and then the sample size should satisfy

0.01 ≥ 1.96 √(0.5 × 0.5/n)

which means that n ≥ 10,000. For a width of 0.03 instead of 0.01, n ≈ 1,000, as is common in public opinion surveys.

If the true value of p were about 0.05, however,

0.01 ≥ 2 √(0.05 × 0.95/n)
n ≥ 1,900 ≈ 2,000
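The required sample size for any target width follows the same pattern (a sketch; the function name sample_size is my own):

    from math import ceil

    def sample_size(width: float, p: float = 0.5, z: float = 1.96) -> int:
        """Smallest n with z*sqrt(p*(1-p)/n) <= width."""
        return ceil((z / width) ** 2 * p * (1 - p))

    print(sample_size(0.01))                # 9604, about 10,000
    print(sample_size(0.03))                # 1068, about 1,000
    print(sample_size(0.01, p=0.05, z=2))   # 1900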

Section 1 Slide 25
Exact Confidence Intervals: One-sided

The normal-based confidence intervals are constructed to be symmetric about the sample value, unless the interval goes outside the interval from 0 to 1. They are therefore less satisfactory the closer the true value is to 0 or 1.

More accurate confidence limits follow exactly from the binomial distribution. For events with low probabilities p: how large could p be for there to be at least a 5% chance of seeing no more than x (i.e. 0, 1, 2, . . . , x) occurrences of that event among n events? If this upper bound is pU,

Σ_{k=0}^{x} Pr(k) ≥ 0.05

Σ_{k=0}^{x} (n choose k) pU^k (1 − pU)^(n−k) ≥ 0.05

If x = 0, then (1 − pU)^n ≥ 0.05 when pU ≤ 1 − 0.05^(1/n), and this is 0.0295 when n = 100. More generally, pU ≈ 3/n when x = 0.
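The upper limit pU can be found numerically for any x (a sketch using simple bisection; the function names are my own):

    from math import comb

    def binom_cdf(x: int, n: int, p: float) -> float:
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

    def upper_limit(x: int, n: int, alpha: float = 0.05) -> float:
        """Largest p with Pr(no more than x successes) >= alpha."""
        lo, hi = 0.0, 1.0
        for _ in range(60):              # the CDF decreases in p, so bisect
            mid = (lo + hi) / 2
            if binom_cdf(x, n, mid) >= alpha:
                lo = mid
            else:
                hi = mid
        return lo

    print(upper_limit(0, 100))   # 0.0295, matching 1 - 0.05**(1/100)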
Section 1 Slide 26
Aside: Two-sided Exact Confidence Intervals

A two-sided interval is bounded above by pU, for which there is at least a 2.5% chance of seeing no more than x (i.e. 0, 1, 2, . . . , x) occurrences, and is bounded below by pL, for which there is at least a 2.5% chance of seeing at least x (i.e. x, x + 1, x + 2, . . . , n) occurrences:

Σ_{k=0}^{x} (n choose k) pU^k (1 − pU)^(n−k) ≥ 0.025

Σ_{k=x}^{n} (n choose k) pL^k (1 − pL)^(n−k) ≥ 0.025

If x = 0, then (1 − pU)^n ≥ 0.025 requires pU ≤ 1 − 0.025^(1/n), and this gives pU ≤ 0.036 when n = 100.
If x = n, then pL^n ≥ 0.025 requires pL ≥ 0.025^(1/n), and this gives pL ≥ 0.964 when n = 100.
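These limits are the classical Clopper-Pearson interval, computable from the beta quantile function via a standard identity between the binomial tail and the beta distribution (a sketch):

    from scipy.stats import beta

    def exact_ci(x: int, n: int, level: float = 0.95) -> tuple[float, float]:
        """Two-sided exact (Clopper-Pearson) interval for a binomial proportion."""
        a = (1 - level) / 2
        lo = 0.0 if x == 0 else beta.ppf(a, x, n - x + 1)
        hi = 1.0 if x == n else beta.ppf(1 - a, x + 1, n - x)
        return lo, hi

    print(exact_ci(0, 100))   # (0.0, 0.036)
    print(exact_ci(3, 10))    # (0.067, 0.652); cf. the x = 3 row of the next slide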

Section 1 Slide 27
Exact CIs for n = 10

One-sided Two-sided
x p̃ pU x pL p̃ pU
0 0.00 0.26 0 0.00 0.00 0.31
1 0.10 0.39 1 0.00 0.10 0.45
2 0.20 0.51 2 0.03 0.20 0.56
3 0.30 0.61 3 0.07 0.30 0.65
4 0.40 0.70 4 0.12 0.40 0.74
5 0.50 0.78 5 0.19 0.50 0.81
6 0.60 0.85 6 0.26 0.60 0.88
7 0.70 0.91 7 0.35 0.70 0.93
8 0.80 0.96 8 0.44 0.80 0.97
9 0.90 0.99 9 0.55 0.90 1.00
10 1.00 1.00 10 0.69 1.00 1.00

The two-sided CI is not symmetrical around p̃.

Section 1 Slide 28
Bootstrapping

An alternative method for constructing confidence intervals uses numerical resampling. A set of samples is drawn, with replacement, from the original sample to mimic the variation among samples from the original population. Each new sample is the same size as the original sample, and is called a bootstrap sample.

The middle 95% of the sample values p̃ from a large number of bootstrap samples provides a 95% confidence interval.
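A percentile-bootstrap sketch for a binomial sample (the data, x = 3 successes in n = 10 trials, and the seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = np.array([1]*3 + [0]*7)            # 3 successes in 10 trials
    boots = rng.choice(sample, size=(10_000, sample.size), replace=True)
    p_tilde = boots.mean(axis=1)                # p̃ in each bootstrap sample
    print(np.percentile(p_tilde, [2.5, 97.5]))  # middle 95%: about (0.0, 0.6)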

Section 1 Slide 29
Allele Frequency Sampling

Section 1 Slide 30
Multinomial Distribution

For a SNP with alleles A, a the three genotypes and their probabilities are

Genotype Probability
AA PAA
Aa or aA PAa
aa Paa

For a sample of n independently sampled individuals, the multinomial distribution gives the probability of x of AA, y of Aa and z of aa. The probability of x genotypes AA is (PAA)^x, etc. The number of ways of ordering x, y, z occurrences of the three outcomes is n!/(x!y!z!), where n = x + y + z.

The multinomial probability is:

Pr(x, y, z) = [n!/(x!y!z!)] (PAA)^x (PAa)^y (Paa)^z
Section 1 Slide 31
Multinomial Variances and Covariances

If {Pi} are the probabilities for a series of categories, the sample proportions P̃i from a sample of n observations have these properties:

E(P̃i) = Pi
Var(P̃i) = Pi(1 − Pi)/n
Cov(P̃i, P̃j) = −PiPj/n,  i ≠ j

The covariance is defined as E[(P̃i − Pi)(P̃j − Pj)].

For the sample counts:

E(ni) = nPi
Var(ni) = nPi(1 − Pi)
Cov(ni, nj) = −nPiPj,  i ≠ j

Section 1 Slide 32
Allele Frequency Sampling Distribution

If a locus has alleles A and a, in a sample of size n the allele counts are sums of genotype counts:

n = nAA + nAa + naa
nA = 2nAA + nAa
na = 2naa + nAa
2n = nA + na

Genotype counts in a random sample are multinomially distributed. What about allele counts? Approach this question by calculating the variance of nA.

Section 1 Slide 33
Within-population Variance

Var(nA) = Var(2nAA + nAa)
        = Var(2nAA) + 2Cov(2nAA, nAa) + Var(nAa)
        = 4nPAA(1 − PAA) − 4nPAAPAa + nPAa(1 − PAa)
        = 2npA(1 − pA) + 2n(PAA − pA²)

This is not the same as the binomial variance 2npA(1 − pA) unless PAA = pA². In general, the allele frequency distribution is not binomial.

The variance of the sample allele frequency p̃A = nA/(2n) can be written as

Var(p̃A) = pA(1 − pA)/(2n) + (PAA − pA²)/(2n)
Section 1 Slide 34
Within-population Variance

It is convenient to reparameterize the genotype frequencies with the within-population inbreeding coefficient f:

PAA = pA² + f pApa
PAa = 2pApa − 2f pApa
Paa = pa² + f pApa

Then the variance can be written as

Var(p̃A) = pA(1 − pA)(1 + f)/(2n)

This variance is different from the binomial variance pA(1 − pA)/(2n).
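A simulation check of this variance (a sketch; parameter values and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    pA, f, n, reps = 0.3, 0.2, 50, 200_000
    pa = 1 - pA
    geno_probs = [pA**2 + f*pA*pa,        # PAA
                  2*pA*pa - 2*f*pA*pa,    # PAa
                  pa**2 + f*pA*pa]        # Paa
    counts = rng.multinomial(n, geno_probs, size=reps)  # columns: nAA, nAa, naa
    p_tilde = (2*counts[:, 0] + counts[:, 1]) / (2*n)
    print(p_tilde.var())                  # ~0.00252
    print(pA*pa*(1 + f) / (2*n))          # 0.00252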

Section 1 Slide 35
Bounds on f

Since

pA ≥ PAA = pA² + f pA(1 − pA) ≥ 0
pa ≥ Paa = pa² + f pa(1 − pa) ≥ 0

there are bounds on f:

−pA/(1 − pA) ≤ f ≤ 1
−pa/(1 − pa) ≤ f ≤ 1

or

max(−pA/pa, −pa/pA) ≤ f ≤ 1

This range of values is [−1, 1] when pA = pa.

Section 1 Slide 36
An aside: Indicator Variables

A very convenient way to derive many statistical genetic results is to define an indicator variable xjk for allele k in individual j:

xjk = 1 if the allele is A
xjk = 0 if the allele is not A

Then

E(xjk) = pA
E(xjk²) = pA
E(xjk xjk′) = PAA

If there is random sampling, individuals are independent, and

E(xjk xj′k′) = E(xjk) E(xj′k′) = pA²

These expectations are the averages of values from many samples from the same population.

Section 1 Slide 37
An aside: Intraclass Correlation

In general, the inbreeding coefficient is the correlation of the indicator variables for the two alleles k, k′ at a locus carried by an individual j. This is because:

Var(xjk) = E(xjk²) − [E(xjk)]²
        = pA(1 − pA)
        = Var(xjk′), k ≠ k′

and

Cov(xjk, xjk′) = E(xjk xjk′) − E(xjk) E(xjk′), k ≠ k′
              = PAA − pA²
              = f pA(1 − pA)

so

Corr(xjk, xjk′) = Cov(xjk, xjk′) / √[Var(xjk) Var(xjk′)] = f
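This intraclass correlation can be checked by simulation (a sketch; values and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    pA, f, N = 0.3, 0.2, 500_000
    pa = 1 - pA
    probs = [pA**2 + f*pA*pa, 2*pA*pa*(1 - f), pa**2 + f*pA*pa]  # PAA, PAa, Paa
    geno = rng.choice(3, size=N, p=probs)       # 0 = AA, 1 = Aa, 2 = aa
    first = rng.integers(0, 2, N)               # order of alleles in heterozygotes
    x1 = np.where(geno == 0, 1, np.where(geno == 2, 0, first))
    x2 = np.where(geno == 0, 1, np.where(geno == 2, 0, 1 - first))
    print(np.corrcoef(x1, x2)[0, 1])            # ~0.2, i.e. f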

Section 1 Slide 38
Allele Dosage

The dosage X of allele A for an individual is the number of copies of A (0, 1, 2) that the individual carries (the sum of its two allele indicators).

The probabilities for X are

Pr(X = 0) = Paa, Pr(X = 1) = PAa, Pr(X = 2) = PAA

so the expected value of X is 2PAA + PAa = 2pA.

The expected value of X² is 4PAA + PAa = 2(pA + PAA), and this leads to a variance of the dosage for an individual of

Var(X) = 2PAA + 2pA − 4pA² = 2pA(1 − pA)(1 + f)

We will come back to this result, but note here that the f term is usually not included in genetic data analysis packages.

Section 1 Slide 39
Maximum Likelihood Estimation: Allele Data

For a sample of n independent alleles, the likelihood of pA when there are nA alleles of type A is

L(pA|nA) = C (pA)^nA (1 − pA)^(n−nA)

and this is maximized when

∂L(pA|nA)/∂pA = 0, or when ∂ ln L(pA|nA)/∂pA = 0

Now

ln L(pA|nA) = ln C + nA ln(pA) + (n − nA) ln(1 − pA)

so

∂ ln L(pA|nA)/∂pA = nA/pA − (n − nA)/(1 − pA)

and this is zero when pA = nA/n. The MLE of pA is its sample value: p̂A = p̃A.

Section 1 Slide 40
Maximum Likelihood Estimation: Genotype Data

If {ni} are multinomial with parameters n and {Pi}, then the MLEs of Pi are ni/n. This will always hold for genotype proportions, but not always for allele proportions.

For two alleles, the MLEs for genotype proportions are:

P̂AA = nAA/n
P̂Aa = nAa/n
P̂aa = naa/n

Does this lead to estimates of allele proportions and the within-population inbreeding coefficient?

Section 1 Slide 41
Maximum Likelihood Estimation: f

Because

PAA = pA² + f pA(1 − pA)
PAa = 2pA(1 − pA) − 2f pA(1 − pA)
Paa = (1 − pA)² + f pA(1 − pA)

the likelihood function for pA, f is

L(pA, f) = [n!/(nAA! nAa! naa!)] × [pA² + pA(1 − pA)f]^nAA
           × [2pA(1 − pA) − 2pA(1 − pA)f]^nAa
           × [(1 − pA)² + pA(1 − pA)f]^naa

and it is difficult to find, algebraically, the values of pA and f that maximize this function or its logarithm.

There is an alternative way of finding maximum likelihood estimates in this case: equating the observed and expected values of the genotype frequencies.
Section 1 Slide 42
Bailey’s Method

Because the number of parameters (2) equals the number of degrees of freedom in this case, we can just equate observed and expected genotype proportions based on the estimates of pA and f:

nAA/n = p̂A² + f̂ p̂A(1 − p̂A)
nAa/n = 2p̂A(1 − p̂A) − 2f̂ p̂A(1 − p̂A)
naa/n = (1 − p̂A)² + f̂ p̂A(1 − p̂A)

Solving these equations (e.g. by adding the first equation to half the second equation to give a solution for p̂A, and then substituting that into one equation):

p̂A = (2nAA + nAa)/(2n) = p̃A

f̂ = 1 − nAa/[2np̃A(1 − p̃A)] = 1 − P̃Aa/(2p̃Ap̃a)
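Bailey's estimates are one line each in code (a sketch; the genotype counts in the example call are arbitrary):

    def bailey(nAA: int, nAa: int, naa: int) -> tuple[float, float]:
        """MLEs of pA and f from genotype counts (Bailey's method)."""
        n = nAA + nAa + naa
        pA = (2*nAA + nAa) / (2*n)
        f = 1 - (nAa/n) / (2*pA*(1 - pA))
        return pA, f

    print(bailey(6, 3, 1))   # pA = 0.75, f = 0.2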
Section 1 Slide 43
Aside: Three-allele Case

With three alleles, there are six genotypes and 5 df. To use Bailey's method, we would need five parameters: 2 allele frequencies and 3 inbreeding coefficients. For example

P11 = p1² + f12 p1p2 + f13 p1p3
P12 = 2p1p2 − 2f12 p1p2
P22 = p2² + f12 p1p2 + f23 p2p3
P13 = 2p1p3 − 2f13 p1p3
P23 = 2p2p3 − 2f23 p2p3
P33 = p3² + f13 p1p3 + f23 p2p3

We would generally prefer to have only one inbreeding coefficient f. It is a difficult numerical problem to find the MLE for f.

Section 1 Slide 44
Method of Moments

An alternative to maximum likelihood estimation is the method of moments (MoM), where observed values of statistics are set equal to their expected values, regardless of degrees of freedom. In general, this does not lead to unique estimates or to estimates with variances as small as those for maximum likelihood.

(Bailey's method is for the special case where the MLEs are also MoM estimates.)

Section 1 Slide 45
Aside: Method of Moments

For the inbreeding coefficient at loci with m alleles Au, two possible MoM estimates are (for large sample sizes)

f̂_LH1 = Σ_{u=1}^{m} (P̃uu − p̃u²) / Σ_{u=1}^{m} p̃u(1 − p̃u)

f̂_LH5 = [1/(m − 1)] Σ_{u=1}^{m} (P̃uu − p̃u²)/p̃u

These both have low bias. Their variances depend on the value of f.

For loci with two alleles, m = 2, the two moment estimates are equal to each other and to the maximum likelihood estimate:

f̂_LH1 = f̂_LH5 = 1 − P̃Aa/(2p̃Ap̃a)

Li CC, Horvitz DG. 1953. Am J Hum Genet 5:107-116. Equations 1 and 5.
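Both estimators in code (a sketch; the two-allele example frequencies are arbitrary):

    import numpy as np

    def mom_f(P_hom: np.ndarray, p: np.ndarray) -> tuple[float, float]:
        """Li & Horvitz moment estimators of f.

        P_hom[u] = sample frequency of homozygote AuAu, p[u] = allele frequency.
        """
        m = len(p)
        f_lh1 = np.sum(P_hom - p**2) / np.sum(p * (1 - p))
        f_lh5 = np.sum((P_hom - p**2) / p) / (m - 1)
        return f_lh1, f_lh5

    P_hom = np.array([0.6, 0.1])    # P̃AA, P̃aa, so P̃Aa = 0.3
    p = np.array([0.75, 0.25])
    print(mom_f(P_hom, p))          # (0.2, 0.2): equal, and equal to the MLE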
Section 1 Slide 46
Aside: MLE for Recessive Alleles

Suppose allele a is recessive to allele A, and a sample of n individuals has naa recessive homozygotes. The genotypes of the other (n − naa) individuals can be AA or Aa. If there is Hardy-Weinberg equilibrium, the likelihood for the two phenotypes is

L(pa) = (pa²)^naa (1 − pa²)^(n−naa)
ln L(pa) = 2naa ln(pa) + (n − naa) ln(1 − pa²)

Differentiating with respect to pa:

∂ ln L(pa)/∂pa = 2naa/pa − 2pa(n − naa)/(1 − pa²)

Setting this to zero leads to an equation that can be solved explicitly: p̂a = √(naa/n).
Section 1 Slide 47
Aside: EM Algorithm for Recessive Alleles

An alternative way of finding maximum likelihood estimates when there are “missing data” involves Estimation of the missing data and then Maximization of the likelihood. For a locus with allele A dominant to a, the missing information is the counts of the AA and Aa genotypes. Only the joint count (n − naa) of AA + Aa is observed.

Estimate the missing genotype counts (assuming independence of alleles) as proportions of the total count of dominant phenotypes:

nAA = (n − naa) (1 − pa)²/(1 − pa²) = (1 − pa)(n − naa)/(1 + pa)
nAa = (n − naa) 2pa(1 − pa)/(1 − pa²) = 2pa(n − naa)/(1 + pa)

Section 1 Slide 48
Aside: EM Algorithm for Recessive Alleles

Maximize the likelihood (using Bailey's method):

p̂a = (nAa + 2naa)/(2n)
   = [2pa(n − naa)/(1 + pa) + 2naa]/(2n)
   = 2(npa + naa)/[2n(1 + pa)]

An initial estimate pa is put into the right-hand side to give an updated estimate p̂a on the left-hand side. This is then put back into the right-hand side to give an iterative equation for pa.

This procedure also has the explicit solution p̂a = √(naa/n).
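The iteration is short to code and converges quickly (a sketch; the counts n = 100, naa = 16 are arbitrary):

    from math import sqrt

    def em_recessive(n: int, naa: int, pa: float = 0.5, iters: int = 50) -> float:
        """EM iteration for a recessive allele frequency."""
        for _ in range(iters):
            pa = (n*pa + naa) / (n*(1 + pa))   # E and M steps combined
        return pa

    print(em_recessive(100, 16))   # 0.4
    print(sqrt(16/100))            # the explicit MLE, also 0.4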

Section 1 Slide 49
EM Algorithm for Two Loci
A more interesting application of the EM algorithm is the estimation of two-locus gamete frequencies from unphased genotype data. For locus A with alleles A, a and locus B with alleles B, b, writing P(X/Y) for the frequency of the genotype with gametes X and Y, the ten two-locus frequencies are:

Genotype   Actual      Expected        Genotype   Actual      Expected
AB/AB      P(AB/AB)    pAB²            AB/Ab      P(AB/Ab)    2pABpAb
AB/aB      P(AB/aB)    2pABpaB         AB/ab      P(AB/ab)    2pABpab
Ab/Ab      P(Ab/Ab)    pAb²            Ab/aB      P(Ab/aB)    2pAbpaB
Ab/ab      P(Ab/ab)    2pAbpab         aB/aB      P(aB/aB)    paB²
aB/ab      P(aB/ab)    2paBpab         ab/ab      P(ab/ab)    pab²

Section 1 Slide 50
EM Algorithm for Two Loci

Gamete frequencies are marginal sums:

pAB = P(AB/AB) + ½[P(AB/Ab) + P(AB/aB) + P(AB/ab)]
pAb = P(Ab/Ab) + ½[P(AB/Ab) + P(Ab/aB) + P(Ab/ab)]
paB = P(aB/aB) + ½[P(AB/aB) + P(Ab/aB) + P(aB/ab)]
pab = P(ab/ab) + ½[P(AB/ab) + P(Ab/ab) + P(aB/ab)]

Arrange the gamete frequencies as a two-way table to show that only one of them is unknown when the allele frequencies are known:

       B     b
A     pAB   pAb   pA
a     paB   pab   pa
      pB    pb    1

Section 1 Slide 51
EM Algorithm for Two Loci

The two double-heterozygote counts n(AB/ab), n(Ab/aB) are “missing data.”

Assume an initial value of pAB and Estimate the missing counts as proportions of the total count nAaBb of double heterozygotes:

n(AB/ab) = nAaBb × 2pABpab/(2pABpab + 2pAbpaB)
n(Ab/aB) = nAaBb × 2pAbpaB/(2pABpab + 2pAbpaB)

and then Maximize the likelihood by setting

pAB = [2n(AB/AB) + n(AB/Ab) + n(AB/aB) + n(AB/ab)]/(2n)

or

nAB = 2n(AB/AB) + n(AB/Ab) + n(AB/aB) + n(AB/ab)

Section 1 Slide 52
Example

As an example, consider these data:


BB Bb bb Total
AA nAABB = 0 nAABb = 0 nAAbb = 2 nAA = 2
Aa nAaBB = 1 nAaBb = 3 nAabb = 4 nAa = 8
aa naaBB = 0 naaBb = 1 naabb = 4 naa = 5
Total nBB = 1 nBb = 4 nbb = 10 n = 15

There is one unknown gamete count x = nAB for AB:

       B            b              Total
A      nAB = x      nAb = 12 − x   nA = 12
a      naB = 6 − x  nab = x + 12   na = 18
Total  nB = 6       nb = 24        2n = 30

0 ≤ x ≤ 6
Section 1 Slide 53
Example

EM iterative equation:

x′ = 2nAABB + nAABb + nAaBB + n(AB/ab)
   = 2nAABB + nAABb + nAaBB + nAaBb × 2pABpab/(2pABpab + 2pAbpaB)
   = 0 + 0 + 1 + 3 × 2x(x + 12)/[2x(x + 12) + 2(12 − x)(6 − x)]
   = 1 + 3x(x + 12)/[x(x + 12) + (12 − x)(6 − x)]
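This update is a one-line loop (a sketch; it reproduces the iterates tabulated on the next slide):

    def em_two_locus(x: float, iters: int = 16) -> None:
        """Iterate the EM update for the AB gamete count (2n = 30)."""
        for i in range(1, iters + 1):
            print(f"{i:2d}  {x:.4f}  {x/30:.4f}")   # iterate, x, x/2n
            x = 1 + 3*x*(x + 12) / (x*(x + 12) + (12 - x)*(6 - x))

    em_two_locus(2.4)   # 2.4000, 2.5000, 2.5647, ... converging near 2.678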

Section 1 Slide 54
Example

A good starting value would assume independence of the A and B alleles: x = 2n × p̃A × p̃B = 30 × (12/30) × (6/30) = 2.4. Successive iterates are:
Iterate x x/2n
1 2.4000 0.0800
2 2.5000 0.0833
3 2.5647 0.0855
4 2.6063 0.0869
5 2.6327 0.0878
6 2.6494 0.0883
7 2.6600 0.0887
8 2.6667 0.0889
9 2.6709 0.0890
10 2.6736 0.0891
11 2.6752 0.0892
12 2.6763 0.0892
13 2.6769 0.0892
14 2.6773 0.0892
15 2.6776 0.0893
16 2.6778 0.0893
... ... ...
Section 1 Slide 55
