
ALLELE FREQUENCIES

Section 1 Slide 1
Probability

Probability provides the language of data analysis.

Equiprobable outcomes definition:

The probability of event E is the number of outcomes favorable to E divided by the total number of outcomes, e.g. the probability of a head is 1/2.

Long-run frequency definition:

If event E occurs n times in N identical experiments, the probability of E is the limit of n/N as N goes to infinity.

Subjective probability:
Probability is a measure of belief.

Section 1 Slide 2
First Law of Probability

The law says that probability can take values only in the range zero to one, and that an event which is certain has probability one:

0 ≤ Pr(E) ≤ 1

Pr(E|E) = 1 for any E

i.e. if event E is known to be true, then it has probability 1. For example:

Pr(Seed is Round|Seed is Round) = 1

Section 1 Slide 3
Second Law of Probability

If G and H are mutually exclusive events, then:

Pr(G or H) = Pr(G) + Pr(H)


For example,

Pr(Seed is Round or Wrinkled) = Pr(Round) + Pr(Wrinkled)

More generally, if Ei, i = 1, . . . , r, are mutually exclusive, then

Pr(E1 or . . . or Er) = Pr(E1) + . . . + Pr(Er) = Σi Pr(Ei)

Section 1 Slide 4
Complementary Probability

If Pr(E) is the probability that E is true, then Pr(Ē) denotes the probability that E is false. Because these two events are mutually exclusive,

Pr(E or Ē) = Pr(E) + Pr(Ē)


and they are also exhaustive in that between them they cover all possibilities – one or other of them must be true. So,

Pr(E) + Pr(Ē) = 1
Pr(Ē) = 1 − Pr(E)

The probability that E is false is one minus the probability that it is true.

Section 1 Slide 5
Third Law of Probability

For any two events, G and H, the third law can be written:

Pr(G and H) = Pr(G) Pr(H|G)


There is no reason why G should precede H and the law can also
be written:

Pr(G and H) = Pr(H) Pr(G|H)


For example

Pr(Seed is round & is type AA)

= Pr(Seed is round|Seed is type AA) × Pr(Seed is type AA)

= 1 × pA²

Section 1 Slide 6
Independent Events

If the information that H is true does nothing to change uncertainty about G, then

Pr(G|H) = Pr(G)
and

Pr(H and G) = Pr(H) Pr(G)

Events G, H are independent.

Section 1 Slide 7
Law of Total Probability

If G, Ḡ are two mutually exclusive and exhaustive events (Ḡ = not G), then for any other event E the law of total probability states that

Pr(E) = Pr(E|G) Pr(G) + Pr(E|Ḡ) Pr(Ḡ)


This generalizes to any set of mutually exclusive and exhaustive events {Si}:

Pr(E) = Σi Pr(E|Si) Pr(Si)
For example

Pr(Seed is round) = Pr(Round|Type AA) Pr(Type AA)
                  + Pr(Round|Type Aa) Pr(Type Aa)
                  + Pr(Round|Type aa) Pr(Type aa)
                = 1 × pA² + 1 × 2pApa + 0 × pa²
                = pA(2 − pA)
Section 1 Slide 8
Bayes’ Theorem

Bayes’ theorem relates Pr(G|H) to Pr(H|G):


Pr(G|H) = Pr(G and H)/Pr(H)          (from the third law)
        = Pr(H|G) Pr(G)/Pr(H)        (from the third law)

If {Gi} are exhaustive and mutually exclusive, Bayes' theorem can be written as

Pr(Gi|H) = Pr(H|Gi) Pr(Gi) / Σi Pr(H|Gi) Pr(Gi)

Section 1 Slide 9
Bayes’ Theorem Example

Suppose G is the event that a man has genotype A1A2 and H is the event that he transmits allele A1 to his child. Then Pr(H|G) = 0.5.

Now what is the probability that a man has genotype A1A2, given that he transmits allele A1 to his child?

Pr(G|H) = Pr(H|G) Pr(G)/Pr(H)
        = (0.5 × 2p1p2)/p1
        = p2
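A quick numeric check of this result (a sketch in Python; the allele frequencies p1, p2 are arbitrary illustrative values):

    # Bayes' theorem check: Pr(G|H) = Pr(H|G) Pr(G) / Pr(H)
    p1, p2 = 0.3, 0.2            # freq(A1), freq(A2)
    pr_G = 2 * p1 * p2           # Pr(A1A2 genotype) under Hardy-Weinberg
    pr_H_given_G = 0.5           # a heterozygote transmits A1 half the time
    pr_H = p1                    # a transmitted allele is A1 with probability p1
    print(pr_H_given_G * pr_G / pr_H, p2)   # both print 0.2, so Pr(G|H) = p2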

Section 1 Slide 10
Sampling

Statistical sampling: the variation among repeated samples from the same population (“fixed” sampling). Inferences can be made about that particular population.

Genetic sampling: the variation among replicate (conceptual) populations (“random” sampling). Inferences are made to all populations with the same history.

Section 1 Slide 11
Classical Model

Reference population
(usually assumed infinite and in equilibrium)
          ↓                          ↓
Time 1    Population of size N  · · ·  Population of size N
          ↓                          ↓
Time 2    Population of size N  · · ·  Population of size N
          ↓                          ↓
          ...                        ...
          ↓                          ↓
Time t    Population of size N  · · ·  Population of size N
          ↓                          ↓
          Sample of size n      · · ·  Sample of size n
Section 1 Slide 12
Aside: Coalescent Theory

An alternative framework works with the genealogical history of a sample of alleles. There is a tree linking all alleles in a current sample to the “most recent common ancestral allele.” Allelic variation is due to mutations since that ancestral allele.

The coalescent approach requires mutation and may be more appropriate for long-term evolution and analyses involving more than one species. The classical approach allows mutation but does not require it: within one species, variation among populations may be due primarily to drift.

Section 1 Slide 13
Binomial Distribution

Section 1 Slide 14
Properties of Estimators

Consistency:   increasing accuracy as sample size increases

Unbiasedness:  expected value is the parameter

Efficiency:    smallest variance

Sufficiency:   contains all the information in the data about the parameter

Section 1 Slide 15
Binomial Distribution

Most population genetic data consist of numbers of observations in some categories. The values and frequencies of these counts form a distribution.

Toss a coin n times and note the number of heads. There are (n + 1) outcomes, and the number of times each outcome is observed in many sets of n tosses gives the sampling distribution. Or: sample n alleles from a population and observe x copies of type A.

Section 1 Slide 16
Binomial distribution

If every toss has the same chance p of giving a head:

The probability of x heads in a row of independent tosses is

p × p × . . . × p = p^x

The probability of n − x tails in a row of independent tosses is

(1 − p) × (1 − p) × . . . × (1 − p) = (1 − p)^(n−x)

The number of ways of ordering x heads and n − x tails among n outcomes is n!/[x!(n − x)!].

The binomial probability of x successes in n trials is

Pr(x|p) = {n!/[x!(n − x)!]} p^x (1 − p)^(n−x)
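This formula translates directly into code; a minimal sketch (the function name binom_pmf is my own):

    from math import comb

    def binom_pmf(x: int, n: int, p: float) -> float:
        """Binomial probability of x successes in n trials."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    print(binom_pmf(3, 4, 0.5))   # chance of 3 heads in 4 fair tosses: 0.25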
Section 1 Slide 17
Binomial Likelihood

The quantity Pr(x|p) is the probability of the data, x successes in n trials, when each trial has probability p of success.

The same quantity, written as L(p|x), is the likelihood of the parameter p when the value x has been observed. The terms that do not involve p are not needed, so

L(p|x) ∝ p^x (1 − p)^(n−x)

Each value of x gives a different likelihood curve, and each curve points to a p value with maximum likelihood. This leads to maximum likelihood estimation.

Section 1 Slide 18
Likelihood L(p|x, n = 4)
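[Figure: likelihood curves L(p|x) for n = 4, presumably one curve per x = 0, 1, . . . , 4.]

Curves like these can be drawn with a short script (a sketch; rescaling each curve to a maximum of 1 is my presentation choice):

    import numpy as np
    import matplotlib.pyplot as plt

    p = np.linspace(0.001, 0.999, 500)
    n = 4
    for x in range(n + 1):
        L = p**x * (1 - p)**(n - x)              # likelihood up to a constant
        plt.plot(p, L / L.max(), label=f"x = {x}")
    plt.xlabel("p"); plt.ylabel("relative likelihood"); plt.legend()
    plt.show()                                   # each curve peaks at p = x/n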

Section 1 Slide 19
Binomial Mean

If there are n trials, each of which has probability p of giving a success, the mean or expected number of successes is np.

The sample proportion of successes is

p̃ = x/n

(This is also the maximum likelihood estimate of p.)

The expected, or mean, value of p̃ is p:

E(p̃) = p

Section 1 Slide 20
Binomial Variance

The expected value of the squared difference between the number of successes and its mean, (x − np)², is np(1 − p). This is the variance of the number of successes in n trials, and indicates the spread of the distribution.

The variance of the sample proportion p̃ is

Var(p̃) = p(1 − p)/n
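A quick simulation check of the mean and variance formulas (a sketch; the parameter values and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, reps = 100, 0.3, 200_000
    x = rng.binomial(n, p, size=reps)
    print(x.mean(), n * p)                   # ~30.0 vs 30
    print(x.var(), n * p * (1 - p))          # ~21.0 vs 21.0
    print((x / n).var(), p * (1 - p) / n)    # ~0.0021 vs 0.0021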

Section 1 Slide 21
Normal Approximation

Provided np is not too small (e.g. not less than 5), the binomial
distribution can be approximated by the normal distribution with
the same mean and variance. In particular:
p̃ ∼ N(p, p(1 − p)/n)

To use the normal distribution in practice, change to the standard normal variable z, with a mean of 0 and a variance of 1:

z = (p̃ − p) / √[p(1 − p)/n]

For a standard normal, 95% of the values lie between ±1.96. The normal approximation to the binomial therefore implies that 95% of the values of p̃ lie in the range

p ± 1.96 √[p(1 − p)/n]

Section 1 Slide 22
Confidence Intervals

A 95% confidence interval is a variable quantity. It has endpoints which vary with the sample. It is expected that 95% of samples will lead to an interval that includes the unknown true value p.

The standard normal variable z has 95% of its values between −1.96 and +1.96. This suggests that a 95% confidence interval for the binomial parameter p is

p̃ ± 1.96 √[p̃(1 − p̃)/n]
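A minimal sketch of this normal-based interval (the function name wald_ci is my own):

    from math import sqrt

    def wald_ci(x: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """Normal-approximation 95% CI for a binomial proportion."""
        pt = x / n
        half = z * sqrt(pt * (1 - pt) / n)
        return max(0.0, pt - half), min(1.0, pt + half)

    print(wald_ci(3, 10))   # (0.016, 0.584) ≈ (0.02, 0.58), the p̃ = 0.3 row below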

Section 1 Slide 23
Confidence Intervals

For samples of size 10, the 11 possible confidence intervals are:

p̃     Confidence interval
0.0   0.0 ± 2√0.000 = (0.00, 0.00)
0.1   0.1 ± 2√0.009 = (0.00, 0.29)
0.2   0.2 ± 2√0.016 = (0.00, 0.45)
0.3   0.3 ± 2√0.021 = (0.02, 0.58)
0.4   0.4 ± 2√0.024 = (0.10, 0.70)
0.5   0.5 ± 2√0.025 = (0.19, 0.81)
0.6   0.6 ± 2√0.024 = (0.30, 0.90)
0.7   0.7 ± 2√0.021 = (0.42, 0.98)
0.8   0.8 ± 2√0.016 = (0.55, 1.00)
0.9   0.9 ± 2√0.009 = (0.71, 1.00)
1.0   1.0 ± 2√0.000 = (1.00, 1.00)

The interval can be modified a little by extending it by the “continuity correction” ±1/(2n) in each direction.

Section 1 Slide 24
Confidence Intervals

To be 95% sure that the estimate is no more than 0.01 from the true value, 1.96√[p(1 − p)/n] should be less than 0.01. The widest confidence interval is when p = 0.5, and then the sample size should satisfy

0.01 ≥ 1.96 √(0.5 × 0.5/n)

which means that n ≥ 10,000. For a width of 0.03 instead of 0.01, n ≈ 1,000, as is common in public opinion surveys.

If the true value of p were about 0.05, however,

0.01 ≥ 2 √(0.05 × 0.95/n)
n ≥ 1,900 ≈ 2,000
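The required sample size for any target width follows the same pattern (a sketch; the function name sample_size is my own):

    from math import ceil

    def sample_size(width: float, p: float = 0.5, z: float = 1.96) -> int:
        """Smallest n with z*sqrt(p*(1-p)/n) <= width."""
        return ceil((z / width) ** 2 * p * (1 - p))

    print(sample_size(0.01))                # 9604, about 10,000
    print(sample_size(0.03))                # 1068, about 1,000
    print(sample_size(0.01, p=0.05, z=2))   # 1900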

Section 1 Slide 25
Exact Confidence Intervals: One-sided

The normal-based confidence intervals are constructed to be symmetric about the sample value, unless the interval goes outside the interval from 0 to 1. They are therefore less satisfactory the closer the true value is to 0 or 1.

More accurate confidence limits follow exactly from the binomial distribution. For events with low probabilities p: how large could p be for there to be at least a 5% chance of seeing no more than x (i.e. 0, 1, 2, . . . , x) occurrences of that event among n events? If this upper bound is pU,

Σ_{k=0}^{x} Pr(k) ≥ 0.05

Σ_{k=0}^{x} (n choose k) pU^k (1 − pU)^(n−k) ≥ 0.05

If x = 0, then (1 − pU)^n ≥ 0.05 when pU ≤ 1 − 0.05^(1/n), and this is 0.0295 when n = 100. More generally, pU ≈ 3/n when x = 0.
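The upper limit pU can be found numerically for any x (a sketch using simple bisection; the function names are my own):

    from math import comb

    def binom_cdf(x: int, n: int, p: float) -> float:
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

    def upper_limit(x: int, n: int, alpha: float = 0.05) -> float:
        """Largest p with Pr(no more than x successes) >= alpha."""
        lo, hi = 0.0, 1.0
        for _ in range(60):              # the CDF decreases in p, so bisect
            mid = (lo + hi) / 2
            if binom_cdf(x, n, mid) >= alpha:
                lo = mid
            else:
                hi = mid
        return lo

    print(upper_limit(0, 100))   # 0.0295, matching 1 - 0.05**(1/100)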
Section 1 Slide 26
Aside: Two-sided Exact Confidence Intervals

A two-sided interval is bounded above by pU, for which there is at least a 2.5% chance of seeing no more than x (i.e. 0, 1, 2, . . . , x) occurrences, and is bounded below by pL, for which there is at least a 2.5% chance of seeing at least x (i.e. x, x + 1, x + 2, . . . , n) occurrences:

Σ_{k=0}^{x} (n choose k) pU^k (1 − pU)^(n−k) ≥ 0.025

Σ_{k=x}^{n} (n choose k) pL^k (1 − pL)^(n−k) ≥ 0.025

If x = 0, then (1 − pU)^n ≥ 0.025 requires pU ≤ 1 − 0.025^(1/n), and this gives pU ≤ 0.036 when n = 100.
If x = n, then pL^n ≥ 0.025 requires pL ≥ 0.025^(1/n), and this gives pL ≥ 0.964 when n = 100.
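These limits are the classical Clopper-Pearson interval, computable from the beta quantile function via a standard identity between the binomial tail and the beta distribution (a sketch):

    from scipy.stats import beta

    def exact_ci(x: int, n: int, level: float = 0.95) -> tuple[float, float]:
        """Two-sided exact (Clopper-Pearson) interval for a binomial proportion."""
        a = (1 - level) / 2
        lo = 0.0 if x == 0 else beta.ppf(a, x, n - x + 1)
        hi = 1.0 if x == n else beta.ppf(1 - a, x + 1, n - x)
        return lo, hi

    print(exact_ci(0, 100))   # (0.0, 0.036)
    print(exact_ci(3, 10))    # (0.067, 0.652); cf. the x = 3 row of the next slide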

Section 1 Slide 27
Exact CIs for n = 10

One-sided Two-sided
x p̃ pU x pL p̃ pU
0 0.00 0.26 0 0.00 0.00 0.31
1 0.10 0.39 1 0.00 0.10 0.45
2 0.20 0.51 2 0.03 0.20 0.56
3 0.30 0.61 3 0.07 0.30 0.65
4 0.40 0.70 4 0.12 0.40 0.74
5 0.50 0.78 5 0.19 0.50 0.81
6 0.60 0.85 6 0.26 0.60 0.88
7 0.70 0.91 7 0.35 0.70 0.93
8 0.80 0.96 8 0.44 0.80 0.97
9 0.90 0.99 9 0.55 0.90 1.00
10 1.00 1.00 10 0.69 1.00 1.00

The two-sided CI is not symmetrical around p̃.

Section 1 Slide 28
Bootstrapping

An alternative method for constructing confidence intervals uses numerical resampling. A set of samples is drawn, with replacement, from the original sample to mimic the variation among samples from the original population. Each new sample is the same size as the original sample, and is called a bootstrap sample.

The middle 95% of the sample values p̃ from a large number of bootstrap samples provides a 95% confidence interval.
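A percentile-bootstrap sketch for a binomial sample (the data, x = 3 successes in n = 10 trials, and the seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = np.array([1]*3 + [0]*7)            # 3 successes in 10 trials
    boots = rng.choice(sample, size=(10_000, sample.size), replace=True)
    p_tilde = boots.mean(axis=1)                # p̃ in each bootstrap sample
    print(np.percentile(p_tilde, [2.5, 97.5]))  # middle 95%: about (0.0, 0.6)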

Section 1 Slide 29
Allele Frequency Sampling

Section 1 Slide 30
Multinomial Distribution

For a SNP with alleles A, a the three genotypes and their probabilities are

Genotype Probability
AA PAA
Aa or aA PAa
aa Paa

For a sample of n independently sampled individuals, the multinomial distribution gives the probability of x of AA, y of Aa and z of aa. The probability of x genotypes AA is (PAA)^x, etc. The number of ways of ordering x, y, z occurrences of the three outcomes is n!/(x!y!z!), where n = x + y + z.

The multinomial probability is:

Pr(x, y, z) = [n!/(x!y!z!)] (PAA)^x (PAa)^y (Paa)^z
Section 1 Slide 31
Multinomial Variances and Covariances

If {Pi} are the probabilities for a series of categories, the sample proportions P̃i from a sample of n observations have these properties:

E(P̃i) = Pi
Var(P̃i) = Pi(1 − Pi)/n
Cov(P̃i, P̃j) = −PiPj/n,  i ≠ j

The covariance is defined as E[(P̃i − Pi)(P̃j − Pj)].

For the sample counts:

E(ni) = nPi
Var(ni) = nPi(1 − Pi)
Cov(ni, nj) = −nPiPj,  i ≠ j

Section 1 Slide 32
Allele Frequency Sampling Distribution

If a locus has alleles A and a, in a sample of size n the allele counts are sums of genotype counts:

n = nAA + nAa + naa
nA = 2nAA + nAa
na = 2naa + nAa
2n = nA + na

Genotype counts in a random sample are multinomially distributed. What about allele counts? Approach this question by calculating the variance of nA.

Section 1 Slide 33
Within-population Variance

Var(nA) = Var(2nAA + nAa)
        = Var(2nAA) + 2Cov(2nAA, nAa) + Var(nAa)
        = 4nPAA(1 − PAA) − 4nPAAPAa + nPAa(1 − PAa)
        = 2npA(1 − pA) + 2n(PAA − pA²)

This is not the same as the binomial variance 2npA(1 − pA) unless PAA = pA². In general, the allele frequency distribution is not binomial.

The variance of the sample allele frequency p̃A = nA/(2n) can be written as

Var(p̃A) = pA(1 − pA)/(2n) + (PAA − pA²)/(2n)
Section 1 Slide 34
Within-population Variance

It is convenient to reparameterize the genotype frequencies with the within-population inbreeding coefficient f:

PAA = pA² + f pApa
PAa = 2pApa − 2f pApa
Paa = pa² + f pApa

Then the variance can be written as

Var(p̃A) = pA(1 − pA)(1 + f)/(2n)

This variance is different from the binomial variance pA(1 − pA)/(2n).
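A simulation check of this variance (a sketch; parameter values and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    pA, f, n, reps = 0.3, 0.2, 50, 200_000
    pa = 1 - pA
    geno_probs = [pA**2 + f*pA*pa,        # PAA
                  2*pA*pa - 2*f*pA*pa,    # PAa
                  pa**2 + f*pA*pa]        # Paa
    counts = rng.multinomial(n, geno_probs, size=reps)  # columns: nAA, nAa, naa
    p_tilde = (2*counts[:, 0] + counts[:, 1]) / (2*n)
    print(p_tilde.var())                  # ~0.00252
    print(pA*pa*(1 + f) / (2*n))          # 0.00252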

Section 1 Slide 35
Bounds on f

Since

pA ≥ PAA = pA² + f pA(1 − pA) ≥ 0
pa ≥ Paa = pa² + f pa(1 − pa) ≥ 0

there are bounds on f:

−pA/(1 − pA) ≤ f ≤ 1
−pa/(1 − pa) ≤ f ≤ 1

or

max(−pA/pa, −pa/pA) ≤ f ≤ 1

This range of values is [−1, 1] when pA = pa.

Section 1 Slide 36
An aside: Indicator Variables

A very convenient way to derive many statistical genetic results is to define an indicator variable xjk for allele k in individual j:

xjk = 1 if the allele is A
xjk = 0 if the allele is not A

Then

E(xjk) = pA
E(xjk²) = pA
E(xjk xjk′) = PAA

If there is random sampling, individuals are independent, and

E(xjk xj′k′) = E(xjk) E(xj′k′) = pA²

These expectations are the averages of values from many samples from the same population.

Section 1 Slide 37
An aside: Intraclass Correlation

In general, the inbreeding coefficient is the correlation of the indicator variables for the two alleles k, k′ at a locus carried by an individual j. This is because:

Var(xjk) = E(xjk²) − [E(xjk)]²
        = pA(1 − pA)
        = Var(xjk′), k ≠ k′

and

Cov(xjk, xjk′) = E(xjk xjk′) − E(xjk) E(xjk′), k ≠ k′
              = PAA − pA²
              = f pA(1 − pA)

so

Corr(xjk, xjk′) = Cov(xjk, xjk′) / √[Var(xjk) Var(xjk′)] = f
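This intraclass correlation can be checked by simulation (a sketch; values and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    pA, f, N = 0.3, 0.2, 500_000
    pa = 1 - pA
    probs = [pA**2 + f*pA*pa, 2*pA*pa*(1 - f), pa**2 + f*pA*pa]  # PAA, PAa, Paa
    geno = rng.choice(3, size=N, p=probs)       # 0 = AA, 1 = Aa, 2 = aa
    first = rng.integers(0, 2, N)               # order of alleles in heterozygotes
    x1 = np.where(geno == 0, 1, np.where(geno == 2, 0, first))
    x2 = np.where(geno == 0, 1, np.where(geno == 2, 0, 1 - first))
    print(np.corrcoef(x1, x2)[0, 1])            # ~0.2, i.e. f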

Section 1 Slide 38
Allele Dosage

The dosage X of allele A for an individual is the number of copies of A (0, 1, 2) that the individual carries (the sum of its two allele indicators).

The probabilities for X are

Pr(X = 0) = Paa, Pr(X = 1) = PAa, Pr(X = 2) = PAA

so the expected value of X is 2PAA + PAa = 2pA.

The expected value of X² is 4PAA + PAa = 2(pA + PAA), and this leads to a variance of the dosage for an individual of

Var(X) = 2PAA + 2pA − 4pA² = 2pA(1 − pA)(1 + f)

We will come back to this result, but note here that the f term is usually not included in genetic data analysis packages.

Section 1 Slide 39
Maximum Likelihood Estimation: Allele Data

For a sample of n independent alleles, the likelihood of pA when there are nA alleles of type A is

L(pA|nA) = C (pA)^nA (1 − pA)^(n−nA)

and this is maximized when

∂L(pA|nA)/∂pA = 0, or when ∂ ln L(pA|nA)/∂pA = 0

Now

ln L(pA|nA) = ln C + nA ln(pA) + (n − nA) ln(1 − pA)

so

∂ ln L(pA|nA)/∂pA = nA/pA − (n − nA)/(1 − pA)

and this is zero when pA = nA/n. The MLE of pA is its sample value: p̂A = p̃A.

Section 1 Slide 40
Maximum Likelihood Estimation: Genotype Data

If {ni} are multinomial with parameters n and {Pi}, then the MLEs of Pi are ni/n. This will always hold for genotype proportions, but not always for allele proportions.

For two alleles, the MLEs for genotype proportions are:

P̂AA = nAA/n
P̂Aa = nAa/n
P̂aa = naa/n

Does this lead to estimates of allele proportions and the within-population inbreeding coefficient?

Section 1 Slide 41
Maximum Likelihood Estimation: f

Because

PAA = pA² + f pA(1 − pA)
PAa = 2pA(1 − pA) − 2f pA(1 − pA)
Paa = (1 − pA)² + f pA(1 − pA)

the likelihood function for pA, f is

L(pA, f) = [n!/(nAA! nAa! naa!)] × [pA² + pA(1 − pA)f]^nAA
           × [2pA(1 − pA) − 2pA(1 − pA)f]^nAa
           × [(1 − pA)² + pA(1 − pA)f]^naa

and it is difficult to find, algebraically, the values of pA and f that maximize this function or its logarithm.

There is an alternative way of finding maximum likelihood estimates in this case: equating the observed and expected values of the genotype frequencies.
Section 1 Slide 42
Bailey’s Method

Because the number of parameters (2) equals the number of degrees of freedom in this case, we can just equate observed and expected genotype proportions based on the estimates of pA and f:

nAA/n = p̂A² + f̂ p̂A(1 − p̂A)
nAa/n = 2p̂A(1 − p̂A) − 2f̂ p̂A(1 − p̂A)
naa/n = (1 − p̂A)² + f̂ p̂A(1 − p̂A)

Solving these equations (e.g. by adding the first equation to half the second equation to give a solution for p̂A, and then substituting that into one equation):

p̂A = (2nAA + nAa)/(2n) = p̃A

f̂ = 1 − nAa/[2np̃A(1 − p̃A)] = 1 − P̃Aa/(2p̃Ap̃a)
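Bailey's estimates are one line each in code (a sketch; the genotype counts in the example call are arbitrary):

    def bailey(nAA: int, nAa: int, naa: int) -> tuple[float, float]:
        """MLEs of pA and f from genotype counts (Bailey's method)."""
        n = nAA + nAa + naa
        pA = (2*nAA + nAa) / (2*n)
        f = 1 - (nAa/n) / (2*pA*(1 - pA))
        return pA, f

    print(bailey(6, 3, 1))   # pA = 0.75, f = 0.2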
Section 1 Slide 43
Aside: Three-allele Case

With three alleles, there are six genotypes and 5 df. To use Bailey's method, we would need five parameters: 2 allele frequencies and 3 inbreeding coefficients. For example

P11 = p1² + f12 p1p2 + f13 p1p3
P12 = 2p1p2 − 2f12 p1p2
P22 = p2² + f12 p1p2 + f23 p2p3
P13 = 2p1p3 − 2f13 p1p3
P23 = 2p2p3 − 2f23 p2p3
P33 = p3² + f13 p1p3 + f23 p2p3

We would generally prefer to have only one inbreeding coefficient f. It is a difficult numerical problem to find the MLE for f.

Section 1 Slide 44
Method of Moments

An alternative to maximum likelihood estimation is the method of moments (MoM), where observed values of statistics are set equal to their expected values, regardless of degrees of freedom. In general, this does not lead to unique estimates or to estimates with variances as small as those for maximum likelihood.

(Bailey's method is for the special case where the MLEs are also MoM estimates.)

Section 1 Slide 45
Aside: Method of Moments

For the inbreeding coefficient at loci with m alleles Au, two possible MoM estimates are (for large sample sizes)

f̂_LH1 = Σ_{u=1}^{m} (P̃uu − p̃u²) / Σ_{u=1}^{m} p̃u(1 − p̃u)

f̂_LH5 = [1/(m − 1)] Σ_{u=1}^{m} (P̃uu − p̃u²)/p̃u

These both have low bias. Their variances depend on the value of f.

For loci with two alleles, m = 2, the two moment estimates are equal to each other and to the maximum likelihood estimate:

f̂_LH1 = f̂_LH5 = 1 − P̃Aa/(2p̃Ap̃a)

Li CC, Horvitz DG. 1953. Am J Hum Genet 5:107-116. Equations 1 and 5.
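Both estimators in code (a sketch; the two-allele example frequencies are arbitrary):

    import numpy as np

    def mom_f(P_hom: np.ndarray, p: np.ndarray) -> tuple[float, float]:
        """Li & Horvitz moment estimators of f.

        P_hom[u] = sample frequency of homozygote AuAu, p[u] = allele frequency.
        """
        m = len(p)
        f_lh1 = np.sum(P_hom - p**2) / np.sum(p * (1 - p))
        f_lh5 = np.sum((P_hom - p**2) / p) / (m - 1)
        return f_lh1, f_lh5

    P_hom = np.array([0.6, 0.1])    # P̃AA, P̃aa, so P̃Aa = 0.3
    p = np.array([0.75, 0.25])
    print(mom_f(P_hom, p))          # (0.2, 0.2): equal, and equal to the MLE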
Section 1 Slide 46
Aside: MLE for Recessive Alleles

Suppose allele a is recessive to allele A, and a sample of n individuals has naa recessive homozygotes. The genotypes of the other (n − naa) individuals can be AA or Aa. If there is Hardy-Weinberg equilibrium, the likelihood for the two phenotypes is

L(pa) = (pa²)^naa (1 − pa²)^(n−naa)
ln L(pa) = 2naa ln(pa) + (n − naa) ln(1 − pa²)

Differentiating with respect to pa:

∂ ln L(pa)/∂pa = 2naa/pa − 2pa(n − naa)/(1 − pa²)

Setting this to zero leads to an equation that can be solved explicitly: p̂a = √(naa/n).
Section 1 Slide 47
Aside: EM Algorithm for Recessive Alleles

An alternative way of finding maximum likelihood estimates when there are “missing data” involves Estimation of the missing data and then Maximization of the likelihood. For a locus with allele A dominant to a, the missing information is the counts of the AA and Aa genotypes. Only the joint count (n − naa) of AA + Aa is observed.

Estimate the missing genotype counts (assuming independence of alleles) as proportions of the total count of dominant phenotypes:

nAA = (n − naa) (1 − pa)²/(1 − pa²) = (1 − pa)(n − naa)/(1 + pa)
nAa = (n − naa) 2pa(1 − pa)/(1 − pa²) = 2pa(n − naa)/(1 + pa)

Section 1 Slide 48
Aside: EM Algorithm for Recessive Alleles

Maximize the likelihood (using Bailey's method):

p̂a = (nAa + 2naa)/(2n)
   = [2pa(n − naa)/(1 + pa) + 2naa]/(2n)
   = 2(npa + naa)/[2n(1 + pa)]

An initial estimate pa is put into the right-hand side to give an updated estimate p̂a on the left-hand side. This is then put back into the right-hand side to give an iterative equation for pa.

This procedure also has the explicit solution p̂a = √(naa/n).
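The iteration is short to code and converges quickly (a sketch; the counts n = 100, naa = 16 are arbitrary):

    from math import sqrt

    def em_recessive(n: int, naa: int, pa: float = 0.5, iters: int = 50) -> float:
        """EM iteration for a recessive allele frequency."""
        for _ in range(iters):
            pa = (n*pa + naa) / (n*(1 + pa))   # E and M steps combined
        return pa

    print(em_recessive(100, 16))   # 0.4
    print(sqrt(16/100))            # the explicit MLE, also 0.4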

Section 1 Slide 49
EM Algorithm for Two Loci
A more interesting application of the EM algorithm is the estimation of two-locus gamete frequencies from unphased genotype data. For locus A with alleles A, a and locus B with alleles B, b, writing P(X/Y) for the frequency of the genotype with gametes X and Y, the ten two-locus frequencies are:

Genotype   Actual      Expected        Genotype   Actual      Expected
AB/AB      P(AB/AB)    pAB²            AB/Ab      P(AB/Ab)    2pABpAb
AB/aB      P(AB/aB)    2pABpaB         AB/ab      P(AB/ab)    2pABpab
Ab/Ab      P(Ab/Ab)    pAb²            Ab/aB      P(Ab/aB)    2pAbpaB
Ab/ab      P(Ab/ab)    2pAbpab         aB/aB      P(aB/aB)    paB²
aB/ab      P(aB/ab)    2paBpab         ab/ab      P(ab/ab)    pab²

Section 1 Slide 50
EM Algorithm for Two Loci

Gamete frequencies are marginal sums:

pAB = P(AB/AB) + ½[P(AB/Ab) + P(AB/aB) + P(AB/ab)]
pAb = P(Ab/Ab) + ½[P(AB/Ab) + P(Ab/aB) + P(Ab/ab)]
paB = P(aB/aB) + ½[P(AB/aB) + P(Ab/aB) + P(aB/ab)]
pab = P(ab/ab) + ½[P(AB/ab) + P(Ab/ab) + P(aB/ab)]

Arrange the gamete frequencies as a two-way table to show that only one of them is unknown when the allele frequencies are known:

       B     b
A     pAB   pAb   pA
a     paB   pab   pa
      pB    pb    1

Section 1 Slide 51
EM Algorithm for Two Loci

The two double-heterozygote counts n(AB/ab), n(Ab/aB) are “missing data.”

Assume an initial value of pAB and Estimate the missing counts as proportions of the total count nAaBb of double heterozygotes:

n(AB/ab) = nAaBb × 2pABpab/(2pABpab + 2pAbpaB)
n(Ab/aB) = nAaBb × 2pAbpaB/(2pABpab + 2pAbpaB)

and then Maximize the likelihood by setting

pAB = [2n(AB/AB) + n(AB/Ab) + n(AB/aB) + n(AB/ab)]/(2n)

or

nAB = 2n(AB/AB) + n(AB/Ab) + n(AB/aB) + n(AB/ab)

Section 1 Slide 52
Example

As an example, consider these data:


BB Bb bb Total
AA nAABB = 0 nAABb = 0 nAAbb = 2 nAA = 2
Aa nAaBB = 1 nAaBb = 3 nAabb = 4 nAa = 8
aa naaBB = 0 naaBb = 1 naabb = 4 naa = 5
Total nBB = 1 nBb = 4 nbb = 10 n = 15

There is one unknown gamete count x = nAB for AB:

       B            b              Total
A      nAB = x      nAb = 12 − x   nA = 12
a      naB = 6 − x  nab = x + 12   na = 18
Total  nB = 6       nb = 24        2n = 30

0 ≤ x ≤ 6
Section 1 Slide 53
Example

EM iterative equation:

x′ = 2nAABB + nAABb + nAaBB + n(AB/ab)
   = 2nAABB + nAABb + nAaBB + nAaBb × 2pABpab/(2pABpab + 2pAbpaB)
   = 0 + 0 + 1 + 3 × 2x(x + 12)/[2x(x + 12) + 2(12 − x)(6 − x)]
   = 1 + 3x(x + 12)/[x(x + 12) + (12 − x)(6 − x)]
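This update is a one-line loop (a sketch; it reproduces the iterates tabulated on the next slide):

    def em_two_locus(x: float, iters: int = 16) -> None:
        """Iterate the EM update for the AB gamete count (2n = 30)."""
        for i in range(1, iters + 1):
            print(f"{i:2d}  {x:.4f}  {x/30:.4f}")   # iterate, x, x/2n
            x = 1 + 3*x*(x + 12) / (x*(x + 12) + (12 - x)*(6 - x))

    em_two_locus(2.4)   # 2.4000, 2.5000, 2.5647, ... converging near 2.678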

Section 1 Slide 54
Example

A good starting value would assume independence of the A and B alleles: x = 2n × p̃A × p̃B = 30 × (12/30) × (6/30) = 2.4. Successive iterates are:
Iterate x x/2n
1 2.4000 0.0800
2 2.5000 0.0833
3 2.5647 0.0855
4 2.6063 0.0869
5 2.6327 0.0878
6 2.6494 0.0883
7 2.6600 0.0887
8 2.6667 0.0889
9 2.6709 0.0890
10 2.6736 0.0891
11 2.6752 0.0892
12 2.6763 0.0892
13 2.6769 0.0892
14 2.6773 0.0892
15 2.6776 0.0893
16 2.6778 0.0893
... ... ...
Section 1 Slide 55
