Thesis Paternity
$\prod_{l=1}^{L} P(A_{l i_l} A_{l j_l})$, where $A_{l i_l}$ is the $i$th allele on locus $l$ and $L$ is the total number of typed STR loci. Using the allele probabilities
estimated for the Danish population, the probability of observing the DNA profile of Table 1.1
when sampling a random person from the population is $1.327 \times 10^{-10}$.
1.1 Qualitative models
When a crime is committed, DNA evidence is often considered in the court of law when deciding
whether a suspect is guilty or innocent. Let $H_p$ and $H_d$ denote the hypotheses relating to
the guilt and innocence of the suspect, respectively, and $E$ the evidence relevant for the hypotheses. Then the
court is interested in the posterior ratio $P(H_p|E)/P(H_d|E)$. However, such statements are impossible
for the forensic geneticist to quantify, since they involve the prior ratio $P(H_p)/P(H_d)$, which is
unknown to the forensic expert. What the expert witness can evaluate is the likelihood ratio
$P(E|H_p)/P(E|H_d)$, using a model for the occurrence of the evidence given that the hypothesis
is true.
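The relation between the three ratios is Bayes' theorem in odds form: posterior ratio = likelihood ratio times prior ratio. The following minimal sketch illustrates why the expert can report only the likelihood ratio. The function names and all numbers are hypothetical and chosen for illustration only.

```python
def posterior_odds(likelihood_ratio: float, prior_odds: float) -> float:
    """Posterior odds P(Hp|E)/P(Hd|E) from Bayes' theorem in odds form."""
    return likelihood_ratio * prior_odds


def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of Hp into the probability of Hp."""
    return odds / (1.0 + odds)


# Hypothetical numbers: the expert reports LR = 1,000,000, while the prior
# odds (unknown to the expert) are varied to show their influence.
for prior in (1e-6, 1e-3, 1.0):
    post = posterior_odds(1_000_000, prior)
    print(f"prior odds {prior:g} -> posterior odds {post:g}, "
          f"P(Hp|E) = {odds_to_probability(post):.6f}")
```

The same likelihood ratio leads to very different posterior probabilities depending on the prior ratio, which is why the prior is left to the court.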
The likelihood ratio, LR, is the essential quantity in forensic genetics, and this thesis discusses
several ways to include more of the available information in its evaluation. Consider a crime
case with an identified suspect. Let $G_S$ denote the suspect's DNA profile and $E_c$ the DNA
stain obtained from the scene of crime, and assume that $E_c$ is consistent with $G_S$. That is, all
alleles in $G_S$ are present in $E_c$, which we denote $G_S \equiv E_c$. The two competing hypotheses state,
respectively, $H_p$: "The suspect is the donor of the DNA stain" and $H_d$: "An unknown person, unrelated to the
suspect, is the donor of the DNA stain". The latter hypothesis is what is called a
"random man" hypothesis. Let $G_U$ denote the DNA profile of the random man, which, assuming
no typing errors, implies that $G_U \equiv G_S$. In this case the LR is given by:
$$\mathrm{LR} = \frac{P(E|H_p)}{P(E|H_d)} = \frac{P(E_c, G_S|H_p)}{P(E_c, G_S|H_d)} = \frac{P(E_c|G_S)P(G_S)}{P(E_c|G_U)P(G_S|G_U)P(G_U)} = \frac{1}{P(G_U|G_S)},$$
where $P(G_U|G_S)$, under some model assumptions, is the probability of observing the crime scene
profile at random in the population. Hence, the evidence enables the forensic geneticist to make
statements like "The probability of observing this particular profile at random in the reference
population is 1 in 1,000,000" or, equivalently, "the DNA evidence is 1,000,000 times more likely
under $H_p$ than under $H_d$". There is an ongoing debate in the forensic genetic community on
which probability is of relevance to the court. In the recent decade the "match probability"
(Balding, 2005), which takes subpopulation structures or common coancestry into account, has
become prevalent. That is, rather than considering the profiles of the suspect and the random man as
independent, one computes the probability of the observed profile conditioned on the suspect's
profile. Hence, by using the "posterior" distribution of alleles rather than the "prior" distribution,
rare alleles become less extreme. This implies a more conservative evaluation of the evidence, since
one accounts for the possibility that an allele that is rare in an admixed population is more
commonly observed in one of its subpopulations, to which the suspect (and possibly the culprit) might
belong.
Over recent years the national databases of STR profiles have grown in size due to the
success of forensic DNA analysis in solving crimes. With these vast numbers of profiles available, it
is possible to test the validity and applicability of population models to forensic genetics (Weir,
2004, 2007; Curran et al., 2007; Mueller, 2008). Furthermore, the accumulation of DNA
profiles implies that the probability of a random match or near match between two randomly selected
DNA profiles in the database increases. If all pairs of profiles in the database are compared to each other,
this corresponds to $\binom{n}{2} = n(n-1)/2$ pairwise comparisons in a database with $n$ DNA
profiles. In the Danish DNA reference database there are approximately 52,000 DNA profiles,
which yield 1,351,974,000 pairwise comparisons. With this large number of comparisons it is
likely that one observes DNA profiles that coincide on many loci, which has concerned some
commentators and raised questions about overstating the power of DNA evidence. Hence, it is important
to demonstrate that the observed and expected numbers of matches are sufficiently close in order
to retain confidence in DNA typing in general and in the population genetic models used for
evidential calculations in particular.
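The number of pairwise comparisons quoted above can be checked directly. This is a small sketch; the function name is illustrative and not taken from the thesis.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n profiles, n(n-1)/2."""
    return comb(n, 2)


# Approximate size of the Danish reference database quoted in the text.
print(f"{pairwise_comparisons(52_000):,}")
```

Because the count grows quadratically in $n$, doubling the database roughly quadruples the number of pairs in which a coincidental (partial) match can occur.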
1.2 Quantitative models
The commercial kits used for analysis of DNA evidence provide quantitative and qualitative
information to the analyst. The qualitative information reports which alleles are present in
the data (as in Table 1.1), whereas the quantitative part gives information on peak intensities in
terms of the height and area of the peaks obtained from the electropherogram (EPG). An example
of an EPG is given in Figure 1.1, where the peak intensities (peak heights and areas) are plotted
in relative fluorescent units (rfu) against the base pair (bp) length. Peaks with a low bp value
correspond to alleles with short amplicons (amplicons are made up of the primer binding site
and the STR repeat structure). The peak intensities are measured by a CCD camera, where the
observed intensity corresponds to the amount of light emitted from the fluorescent dye.
[Figure 1.1: electropherogram; x-axis: Base pair (bp), 100 to 400; y-axis: Peak height (rfu), 0 to 4000; legend: blue, green, yellow and red fluorescent dye]
Figure 1.1: An example of an electropherogram (EPG) for the SGM-Plus kit (Applied Biosystems)
with peak height (in rfu) plotted against the base pair (bp) length.
Thus the crime scene evidence, $E_c$, consists of two components: the qualitative (or genetic)
part, $G$, and the quantitative part with peak intensities, $Q$. The peak intensities reflect the amount
of DNA contributed to the particular allele, and since the technology is indifferent to the origin
of the various DNA fragments, the DNA amounts contributed to shared alleles add up. The
resulting peak intensities are registered via a CCD camera that detects the light emitted from a
fluorochrome attached to the DNA molecules corresponding to an STR allele. A difference in electric
potential forces the DNA molecules to move through the capillary, where the size difference of the
molecules implies that some DNA fragments pass the CCD camera before others.
Since the lengths of the repeat sequences of the STR loci under investigation overlap,
most commercial STR kits apply 3-5 different fluorochrome dyes in order to concurrently detect
signals from multiple alleles and loci. One of these dyes labels DNA fragments of known
length, which are used for fragment size determination of the observed peak intensities. These
fixed lengths are used to align the observed peak intensities to an allelic ladder, which converts
an observed fragment length to an allelic repeat number. For the SGM-Plus kit this size marker
is given a red fluorochrome (represented by dashed lines in Figure 1.1).
In a single-contributor DNA sample it is possible to observe one or two alleles per locus, depending
on whether the DNA profile is homozygous or heterozygous, respectively. However, when $m$
DNA profiles contribute to the same sample it is possible to observe one to $2m$ alleles per locus,
since the individuals may share all or no alleles. The peak intensities associated with the alleles
reflect the amount of DNA contributed to that particular allele. Hence, in a two-person mixture
the peaks of alleles to which the major component (the DNA profile with the largest amount of DNA contributed
to the sample) contributes are often larger than those of the minor component. However, if the
DNA profiles share alleles, the peak intensities of the common alleles are approximately the sum
of the contributions.
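The additivity of contributions on shared alleles can be illustrated with a deliberately simple sketch. It is not one of the models of this thesis: it assumes that each contributor splits a hypothetical DNA amount evenly over its two allele copies at a locus, and that the expected peak intensity is proportional to the summed amount. The function name, allele labels and amounts are illustrative.

```python
from collections import Counter


def expected_peak_contributions(profiles, dna_amounts):
    """Sum the DNA amount that each contributor puts on each allele of one locus.

    profiles    -- list of genotypes, each a pair of allele labels
    dna_amounts -- DNA amount per contributor (arbitrary units, assumed)
    """
    totals = Counter()
    for (allele_1, allele_2), amount in zip(profiles, dna_amounts):
        # Each of the two allele copies carries half of the contributor's DNA;
        # a homozygote therefore puts the full amount on a single allele.
        totals[allele_1] += amount / 2
        totals[allele_2] += amount / 2
    return dict(totals)


major = ("14", "16")  # hypothetical major contributor
minor = ("16", "18")  # hypothetical minor contributor sharing allele 16
print(expected_peak_contributions([major, minor], [800.0, 200.0]))
```

In this example the shared allele 16 receives contributions from both profiles, so its expected intensity exceeds that of allele 14 even though the major contributor is heterozygous.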
When assigning weight to the evidence under a given hypothesis, the methodology needs to
consider both parts of the data. This is particularly important when the data originate from a
DNA mixture, since the quantitative evidence is currently the only means used to separate the
observed alleles into contributing profiles. Often the peak intensities are used only to reduce
the number of possible combinations entering the likelihood ratio. This approach is sometimes
called the "binary model" in the forensic literature, e.g. by Bill et al. (2005) and Buckleton et al.
(2005). However, a more correct approach would be to attach a likelihood to each combination
of profiles, measuring the agreement between the observed peak intensities and the expected
intensities under some model. Let the evidence be $E = (E_c, K)$, where $K$ are the known profiles
associated with the crime; then the extended LR taking $Q$ into account is given by:
$$\mathrm{LR} = \frac{P(E|H_p)}{P(E|H_d)} = \frac{P(Q, G, K|H_p)}{P(Q, G, K|H_d)} = \frac{P(Q|G, K, H_p)P(G|K, H_p)P(K|H_p)}{P(Q|G, K, H_d)P(G|K, H_d)P(K|H_d)}, \qquad (1.1)$$
where $P(Q|\cdot)$ measures the agreement between the observed and expected peak intensities. Ideally,
the model for $P(Q|\cdot)$ should take the entire EPG signal into account, which includes the noise
component (pictured as a "rug" close to 0 rfu in Figure 1.1), adjustment and correction for
technical artefacts (stutters and pull-up effects, cf. below), detection of degradation (discussed at the
end of this section), the genotypes of the contributors, etc.
However, evaluating the LR under such a model is computationally intensive and complicated.
That is, for each locus every pair of alleles constructed as a Cartesian product of the allelic ladder
should be considered, even if the peak height imbalances (ratios of peak heights) within and
between loci were extreme. For practical purposes such an approach would be infeasible and
too computationally intensive for standard casework. Hence, it is common to reduce $Q$ to a smaller
set of observations by using a criterion to separate the noise and signal into two parts, such that
the number of possible combinations of DNA profiles decreases. A limit of detection is often
used to discriminate between the noise and signal. However, such a threshold approach induces
the risk of wrongly assigning noise and signal, i.e. false positive and false negative calls.
In forensic genetics, these are commonly denoted drop-ins and drop-outs, which refer to
extra alleles in the signal not contributed by the true donors of the stain and to alleles of
the true donors that are missing because they fall below the limit of detection, respectively.
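The threshold approach and its two types of error can be sketched as follows. The sketch assumes a fixed limit of detection of 50 rfu and that the true donor alleles are known, which is only the case in controlled experiments. Function names and peak heights are hypothetical.

```python
def call_alleles(peaks, limit_of_detection=50.0):
    """Return the alleles whose peak height is at or above the limit of detection."""
    return {allele for allele, height in peaks.items()
            if height >= limit_of_detection}


def compare_with_donors(called, donor_alleles):
    """Split the disagreement between called and true alleles into drop-outs and drop-ins."""
    drop_outs = set(donor_alleles) - set(called)  # true alleles that were missed
    drop_ins = set(called) - set(donor_alleles)   # called alleles with no true donor
    return drop_outs, drop_ins


# Hypothetical peak heights (rfu) at one locus; the true donor is (12, 14).
peaks = {"12": 410.0, "14": 45.0, "17": 62.0}
called = call_alleles(peaks)
print(called, compare_with_donors(called, {"12", "14"}))
```

Here allele 14 falls just below the limit and is recorded as a drop-out, while the spurious peak at allele 17 passes the limit and is recorded as a drop-in.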
Let $Q$ denote the part of the EPG that is classified as true signal. As mentioned above, $Q$ is
currently the basis for separating DNA mixtures into their contributing components. That is, by defining
a model for $Q$ given a set of contributing profiles, it is possible to determine the goodness-of-fit
between a hypothesised combination of DNA profiles and the observed peak intensities.
Methods exist for modelling $P(Q|G, \mathcal{G}, H)$, of which some are more heuristic than statistical (Bill et
al., 2005; Wang et al., 2006), but progress is being made towards models based on statistical methods
(Perlin and Szabady, 2001; Cowell et al., 2007a, b, 2010; Curran, 2008; Tvedebrink et al., 2010).
In cases where the amount of DNA contributed by the donor of the profile is low, there is a
risk of the peak heights being below the limit of detection, which is introduced
in order to distinguish between noise and true signals. This may imply allelic drop-out, which
causes only a partial (or no) profile to be typed. Hence, a true contributing profile to an observed
stain may have one or more alleles not present in the case sample. Not taking allelic drop-out into
consideration could imply that the true donor is erroneously excluded from further consideration.
In order to include the possibility of drop-out in the evidence evaluation, it is necessary to be
able to quantify this possibility in terms of a probability.
In contrast to drop-out, which means missing alleles, the biotechnology used in the typing of DNA
profiles may cause additional peaks to be present in the observed stain. The PCR process, which
amplifies the DNA by making multiple copies of the alleles present, causes extra peaks in the
position in front of the true peak. These peaks are called stutters and are due to mispairings
between the Taq enzyme and the amplicon. This creates a DNA product one repeat unit shorter than
the true amplicon. Stutters may be produced in any cycle of the PCR process, and a rule of thumb
says that the stutter peak height is about 10-15% of the true peak height. This percentage is an
overall value across alleles and loci, but shorter alleles tend to have a lower stutter percentage than
longer alleles.
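The rule of thumb for stutters can be turned into a small sketch of the expected signal. It assumes a single constant stutter ratio (10% here) for all alleles and integer repeat numbers only, so microvariants are ignored; neither assumption is the regression-based correction of Chapter 7. The function name and peak heights are hypothetical.

```python
def add_stutter(true_peaks, stutter_ratio=0.10):
    """Expected peak heights once each true peak adds a stutter one repeat shorter.

    true_peaks    -- dict mapping integer repeat number to true peak height (rfu)
    stutter_ratio -- stutter height relative to the parental peak (assumed constant)
    """
    observed = dict(true_peaks)
    for allele, height in true_peaks.items():
        stutter_position = allele - 1  # one repeat unit shorter than the parent
        observed[stutter_position] = (observed.get(stutter_position, 0.0)
                                      + stutter_ratio * height)
    return observed


# Hypothetical heterozygote (15, 16): the stutter of 16 lands on the true allele 15.
print(add_stutter({15: 1000.0, 16: 800.0}))
```

The example shows why stutters complicate mixture interpretation: a stutter can either create a peak at an allele no donor carries or inflate the peak of a true allele in the neighbouring position.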
Another systematic component caused by the typing technology is the so-called pull-up (or
bleed-through) effect, where the light emitted from one fluorescent dye is detected in the spectrum
of a different fluorochrome. This implies false detection of peaks with a fragment length similar to that of
the parental peak, but on a different dye band. Furthermore, using a fixed limit of detection, of 50
rfu say, neglects important information about the noise level in a sample. If a peak in the interval
40 rfu to 49 rfu is observed, the fixed-threshold protocol treats this peak as undetected.
However, by using a model for the threshold, it might be reasonable to have a variable limit set
such that e.g. 99.95% of all noise peaks are removed. This may in some cases imply a threshold
as low as 25 rfu, allowing for a more flexible analysis scheme, which may be valuable for samples
with low amounts of DNA.
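A variable limit of this kind can be sketched as an empirical quantile of the noise peaks in a sample. This is an assumption made for illustration; the distribution analysis actually used in the thesis is described in Chapter 7. The function name and the simulated noise are hypothetical.

```python
import math
import random


def floating_threshold(noise_heights, coverage=0.9995):
    """Smallest observed noise height h such that at least `coverage` of the
    noise peaks have height <= h (an empirical quantile)."""
    ordered = sorted(noise_heights)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]


# Simulated noise heights (rfu); the half-normal shape and scale are assumptions.
rng = random.Random(1)
noise = [abs(rng.gauss(0.0, 7.0)) for _ in range(20_000)]
print(f"floating threshold: {floating_threshold(noise):.1f} rfu")
```

With a low noise level the resulting threshold lies well below a fixed limit of 50 rfu, so peaks in the 40 rfu to 49 rfu range would no longer be discarded automatically.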
When DNA is exposed to inhibitors such as chemicals, moisture, sunlight and heat, the DNA
molecules are prone to degrade and the DNA strand to be damaged. This causes the results of the
DNA investigation to have a characteristic profile with peak intensities decreasing as a function
of the DNA fragment length. The longer the amplicon, the more likely it is that the peaks will
have low emission values. This implies that the risk of allelic drop-out increases for longer
amplicons and may result in partial DNA profiles, since some loci fail to produce any signal.
Degradation of biological material is pronounced in samples taken from the debris of mass disasters
or from mass graves.
1.3 Outline
The following seven chapters (Chapters 2-8) present the seven journal papers constituting this
PhD thesis. Each chapter is organised such that the paper is presented in its journal
form (including bibliography), followed by supplementary remarks about the results, how
they relate to the previous chapters, further discussion and additional data analysis. As a
consequence, notation is not necessarily consistent between the chapters and some of the material is
repeated in different chapters. On the other hand, the chapters may be read independently of
each other. Each chapter has its own bibliography with the references used there, and on the
last pages of the thesis there is a complete list of all references. In chapters where there is a
reference to supplementary material, e.g. as in journal papers, the material is available on-line at
http://people.math.aau.dk/tvede/thesis.
The chapters are ordered such that the number of factors considered in the evaluation of the
evidence increases. First, only the qualitative part of the data is considered in the likelihood ratio,
with the correction for population stratification effects. Later, the quantitative data are added to
the likelihood ratio, where each model relaxes the assumptions made in the preceding chapters.
Finally, the last chapter combines the results and suggests topics for future research.
Chapter 2 discusses the topic of substructures in populations and how to account for them in
evidential calculations. Identity-by-descent and subpopulation effects are common
concepts from population genetics. The idea of measuring population stratification goes back
to Wright (1951), who defined three quantities measuring the degree of relatedness between
individuals, subpopulations and the total population. The model discussed in the chapter handles
this from a statistical point of view by defining the correlation among individuals' DNA profiles
as overdispersion and shows how it is manifested in the so-called $\theta$-correction used in forensic
genetics.
Chapter 3 is an analysis of the Danish DNA profile reference database. By the beginning of
2009 the database included 51,517 unique DNA profiles typed on ten forensic autosomal STR
loci. We investigated the methodology of Weir (2004, 2007), who made pairwise comparisons of
every pair of DNA profiles in the database. We derived an efficient way to compute the expected
number of matches and partial matches for a given $\theta$, cf. above. Furthermore, in line with Curran
et al. (2007), we extended the model to allow for closer familial relationships (full-siblings,
first-cousins, parent-child and avuncular), and we derived expressions for the variance of the number
of matches and partial matches in the database.
Chapter 4 is the first of five papers on the quantitative part of the data available from STR results.
The paper is an extension of the work I did in my MSc thesis, where the peak intensities of the
EPG were modelled by a multivariate normal distribution. The challenging part of the model
is the fact that the dimensions of the data vector (and sub-vectors hereof) vary among DNA
mixtures due to the different numbers of shared alleles between individuals. An EM-algorithm
was proposed for optimisation, and we demonstrated that the model is in fact a mixed effects
model.
Chapter 5 discusses a simpler and more operational model for DNA mixtures than the one from
the previous chapter. In order to separate an observed DNA mixture into the contributing DNA
profiles, we derived a statistical model that is suited for a greedy optimisation algorithm.
The algorithm is very efficient, separating complex DNA mixtures in a few seconds. It is
implemented as an on-line tool, which provides valuable graphical output for further interpretation by
the forensic geneticists.
Chapter 6 addresses an important question in forensic genetics and evidential calculations:
estimating the probability of allelic drop-out. We define a proxy for the amount of DNA contributed
to a sample and use this quantity to derive a logistic regression model to estimate the probability
of allelic drop-out.
Chapter 7 presents a methodology for filtering the quantitative data from STR results. The
observed data are a conversion of the light emitted from a fluorochrome and detected by a CCD camera. This
implies that the signal consists of a noise component and further systematic components, the
so-called pull-up effects and stutters. We demonstrate how to determine a "floating threshold"
using distribution analysis of the noise component. Pull-up and stutter corrections were
performed by regression analysis. The methodology significantly decreases the number of allelic
drop-outs compared to the standard protocol.
Chapter 8 is a short communication on how to model degraded DNA in a simple and
intuitive manner. Degraded DNA is a common problem in crime case samples, since the biological
material from which the DNA is extracted has often been exposed to non-optimal conditions.
Sunlight, humidity, bacteria and chemicals are some of the reasons for observing degraded DNA,
which complicates the succeeding analysis and interpretation. The model presented in the
paper is used to modify the drop-out model discussed in Chapter 6 by adjusting the proxy for the
amount of DNA to take the level of degradation into account.
Chapter 9 summarises the results from the preceding seven chapters by forming a unifying
likelihood ratio. The terms in this likelihood ratio consist of:
$$P(Q_{mis}|Q_{obs}, G_{mis}, G_{obs}, \mathcal{G})\,P(Q_{obs}|G_{mis}, G_{obs}, \mathcal{G})\,P(G_{mis}, G_{obs}|K, \mathcal{G})\,P(K|\mathcal{G})\,P(\mathcal{G}),$$
where $\mathcal{G}$ is a combination of DNA profiles consistent with the hypothesis under consideration.
Furthermore, $Q$ and $G$ symbolise the quantitative and qualitative parts of the evidence,
respectively. The first term, $P(Q_{mis}|\cdot)$, evaluates the probability of allelic drop-out using the models of
Chapters 6, 7 and 8; $P(Q_{obs}|\cdot)$ is evaluated by one of the models for DNA mixtures (Chapters 4
and 5); and the last terms are evaluated using the $\theta$-correction discussed in Chapters 2 and 3.
CHAPTER 2
Overdispersion in allelic counts and $\theta$-correction in forensic genetics
Publication details
Co-authors: None
Journal: Theoretical Population Biology (In Press)
DOI: 10.1016/j.tpb.2010.07.002
Abstract:
We present a statistical model for incorporating the extra variability in allelic counts due to
subpopulation structures. In forensic genetics, this effect is modelled by the identical-by-descent
parameter $\theta$, which measures the relationship between pairs of alleles within a population
relative to the relationship of alleles between populations (Weir, 2007). In our statistical approach,
we demonstrate that $\theta$ may be defined as an overdispersion parameter capturing the
subpopulation effects. This formulation allows derivation of maximum likelihood estimates of the allele
probabilities and $\theta$, together with computation of the profile log-likelihood, confidence intervals
and hypothesis testing.
In order to compare our method with existing methods, we reanalysed FBI data from Budowle
and Moretti (1999) with allele counts in six US subpopulations. Furthermore, we investigate
properties of our methodology through simulation studies.
Keywords:
$\theta$-correction; Forensic genetics; Subpopulation; Dirichlet-multinomial distribution; Maximum
likelihood estimate; Confidence interval.
2.1 Introduction
Attaching probabilities to different levels of relatedness in paternity disputes or evaluating the
weight of evidence in crime cases with biological traces present at the scene of crime are essential
tasks in forensic genetics. For this purpose, the difference in the genetic constitution of individuals
in the population is used to assess the probabilities of the evidence under competing hypotheses.
Currently, 10 to 20 locations on the genome (loci) are investigated for identification purposes,
and an individual's DNA profile is made up of the different states (alleles) of the loci.
It is well known that allele frequencies may vary between ethnic groups, geographically remote
populations and subpopulations. However, due to a common evolutionary past it is assumed that
the allele frequencies of the subpopulations have a common mean, and that the variation between
subpopulations is due to genetic sampling (Weir, 1996).
In forensic genetics, population structures are of great importance when the probability of
observing a given suspect's DNA profile is assessed under various hypotheses. Budowle and
Moretti (1999) published allele frequencies from six different US subpopulations (African
American, Bahamian, Jamaican, Trinidad, Caucasian and Hispanic) for 13 CODIS Core STR loci. In
this study, the authors obtained allele frequency estimates varying significantly across
subpopulations. For example, the frequencies range from 6.9% (Hispanic) to 27.3% (Jamaican) for
allele 28 in locus D21, indicating that a homozygote on this locus could be 16 times more likely
in the Jamaican than in the Hispanic subpopulation (when assuming Hardy-Weinberg
equilibrium). The ability to distinguish true genetic differences from sampling effects depends on the
sample size. That is, testing the significance of such allele frequency differences depends on the
database sizes, since the variance of the estimates scales inversely with the number of sampled
DNA profiles.
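The factor of 16 follows from Hardy-Weinberg equilibrium, under which a homozygote has probability equal to the squared allele frequency. A short check using the two frequencies quoted above (the function name is illustrative):

```python
def homozygote_probability(allele_frequency: float) -> float:
    """Genotype probability of a homozygote under Hardy-Weinberg equilibrium."""
    return allele_frequency ** 2


# Allele 28 at D21: 27.3% (Jamaican) versus 6.9% (Hispanic).
ratio = homozygote_probability(0.273) / homozygote_probability(0.069)
print(f"ratio of homozygote probabilities: {ratio:.2f}")
```

The ratio is $(0.273/0.069)^2$, roughly 15.7, which the text rounds to 16.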
In order to correct for subpopulation structure, Nichols and Balding (1991) suggested the
$\theta$-correction to be used when inferring the weight of evidence in forensic genetics. Our approach
acknowledges the extra variability in the allelic counts and addresses this as overdispersion. The
statistical model of the present paper has the same properties as the genetic model. We exploit
results from the statistical literature in order to obtain maximum likelihood estimates (MLEs)
of the allele frequencies and the $\theta$-parameter, and compute profile log-likelihoods for providing
approximate confidence intervals.
The basic idea and principle of overdispersion in allelic counts formulated in Section 2.2 has
previously been noted in the forensic literature, although not called overdispersion, by Balding
(2005). However, the terminology of overdispersion (or heterogeneity) explicitly underlines that
a simple assumption about the sampling distribution (the multinomial distribution) is insufficient to
model the data. Through the notion of overdispersion, it becomes easier for statisticians with limited
knowledge of population genetics to appreciate the concept of variability between population
groups. Hence, these rather specialised types of model are put into a more general statistical
framework.
2.2 Overdispersion in allelic counts
Our set-up assumes that the allelic counts in a given subpopulation, $X$, follow a multinomial
distribution with some unknown allele probabilities. Due to an evolutionary past, there exists
some variation among different subpopulations in terms of allele probabilities. However, these
allele probabilities have a common distribution across subpopulations with a mean and variance.
For now, we just let $E(P) = \pi$ be the mean of this distribution and $V(P)$ its covariance matrix.
Note that this parametrisation of $E(P)$ implies that $\pi$ are the allele probabilities in the reference
population from which the subpopulations are assumed to have descended.
Let $n$ be the number of alleles sampled from a given subpopulation with $k$ alleles. Then the
model can be formulated as
$$P(X = x|P = p) = \binom{n}{x}\prod_{j=1}^{k} p_j^{x_j}, \quad \text{where} \quad \binom{n}{x} = \frac{n!}{\prod_{j=1}^{k} x_j!}, \qquad (2.1)$$
is the multinomial coefficient. Thus $X$ follows a multinomial distribution when conditioned
on $P = p$. This implies that $E(X) = E(E\{X|P\}) = E(nP) = n\pi$ from the assumption of
$E(P) = \pi$.
2.2.1 Dirichlet-multinomial distribution
In line with other authors (Lange, 1995b,a; Weir, 1996; Rannala and Hartigan, 1996; Balding,
2003), we assume the distribution of allele probabilities to be a Dirichlet distribution. The
assumption of a Dirichlet distribution is based on theoretical arguments from population genetics
together with the convenience that the Dirichlet distribution is the conjugate prior of the
multinomial distribution. The Dirichlet distribution has density function
$$f(p_1, \ldots, p_k; \gamma_1, \ldots, \gamma_k) = \frac{\Gamma(\gamma_+)}{\prod_{j=1}^{k}\Gamma(\gamma_j)} \prod_{j=1}^{k} p_j^{\gamma_j - 1}, \qquad (2.2)$$
where $\gamma_+ = \sum_{j=1}^{k} \gamma_j$. When assuming a Dirichlet distribution of $P$, we can derive the marginal
distribution of $X$ by multiplying (2.2) and (2.1) and integrating over $p$. The resulting
distribution is called the Dirichlet-multinomial distribution (Johnson et al., 1997) or the multivariate Pólya
distribution (from its relation to the Pólya urn scheme, Green and Mortera (2009)), with density
$$P(X = x) = \binom{n}{x}\frac{\Gamma(\gamma_+)}{\Gamma(n + \gamma_+)}\prod_{j=1}^{k}\frac{\Gamma(x_j + \gamma_j)}{\Gamma(\gamma_j)}. \qquad (2.3)$$
Using the results of Mosimann (1962), the mean of $X_j$ may be computed as $E(X_j) = n\gamma_j/\gamma_+$,
where $\gamma_j/\gamma_+$ is the mean of $P_j$, $E(P_j) = \pi_j = \gamma_j/\gamma_+$. Furthermore, the covariance matrix of $X$ is
given by $V(X) = cn[\mathrm{diag}(\pi) - \pi\pi^\top]$, where $c = (n + \gamma_+)/(1 + \gamma_+)$ and $\pi^\top$ is the transpose of $\pi$.
Hence, the covariance matrix of the Dirichlet-multinomial distribution is inflated by the factor $c$
compared to an ordinary multinomial covariance.
The Dirichlet-multinomial distribution derived in (2.3) is almost identical to Eq. (8) in Curran
et al. (1999) except for the multinomial coefficient, which is merely a constant with respect to the
parameters of the model. Furthermore, by introducing $\theta$ as in Curran et al. (1999), $\gamma_+ = (1-\theta)/\theta$
or equivalently $\theta = (1 + \gamma_+)^{-1}$, we may rewrite $c$ in terms of $\theta$:
$$c = \frac{n + \gamma_+}{1 + \gamma_+} = \theta(n + \gamma_+) = n\theta + (1 - \theta) = 1 + \theta(n - 1).$$
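The density (2.3) and the inflation factor can be checked numerically. The following is a minimal sketch, assuming $0 < \theta < 1$ and strictly positive $\pi_j$, with the Dirichlet parameters written as $\gamma_j = \pi_j(1-\theta)/\theta$ (the symbol $\gamma$ and the function names are notational assumptions). It uses log-gamma functions for numerical stability.

```python
from math import exp, lgamma


def dirichlet_multinomial_pmf(x, pi, theta):
    """P(X = x) as in (2.3), parametrised by the means pi and by theta."""
    n = sum(x)
    gamma_plus = (1.0 - theta) / theta
    gammas = [p * gamma_plus for p in pi]
    # log of the multinomial coefficient n! / prod(x_j!)
    log_p = lgamma(n + 1) - sum(lgamma(x_j + 1) for x_j in x)
    log_p += lgamma(gamma_plus) - lgamma(n + gamma_plus)
    log_p += sum(lgamma(x_j + g_j) - lgamma(g_j) for x_j, g_j in zip(x, gammas))
    return exp(log_p)


def inflation_factor(n, theta):
    """Overdispersion factor c = 1 + theta (n - 1) relative to the multinomial."""
    return 1.0 + theta * (n - 1)


# Two allele types, n = 3 sampled alleles: enumerate all outcomes.
pi, theta, n = (0.3, 0.7), 0.1, 3
probs = {x1: dirichlet_multinomial_pmf((x1, n - x1), pi, theta) for x1 in range(n + 1)}
mean = sum(x1 * p for x1, p in probs.items())
variance = sum((x1 - mean) ** 2 * p for x1, p in probs.items())
print(sum(probs.values()), mean, variance,
      inflation_factor(n, theta) * n * pi[0] * (1 - pi[0]))
```

The enumerated variance of $X_1$ agrees with $c\,n\,\pi_1(1-\pi_1)$, i.e. the multinomial variance inflated by $c = 1 + \theta(n-1)$.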
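This implies that $V(X) = n[\mathrm{diag}(\pi) - \pi\pi^\top][1 + \theta(n-1)]$ [...] $\hat{P} = \{\hat{P}_j\}_{j=1}^{k} = \{X_j/n\}_{j=1}^{k}$
is an unbiased estimator of $\pi$ with covariance matrix $n^{-1}[\mathrm{diag}(\pi) - \pi\pi^\top][1 + \theta(n-1)]$ [...]
$$\lim_{n\to\infty}\left(\theta + \frac{1-\theta}{n}\right)[\mathrm{diag}(\pi) - \pi\pi^\top] = \theta\,[\mathrm{diag}(\pi) - \pi\pi^\top],$$
as opposed to the asymptotic behaviour under the multinomial model, where $\lim_{n\to\infty} n^{-1}[\mathrm{diag}(\pi) - \pi\pi^\top] = 0$ [...]
$$P(Y_{n+1} = j \mid y_1, \ldots, y_n) = \frac{\int f(p)\prod_{i=1}^{n+1} P(Y_i = y_i|p)\,dp}{\int f(p)\prod_{i=1}^{n} P(Y_i = y_i|p)\,dp} = \frac{\Gamma(\gamma_j + x_j^{n+1})\,\Gamma(\gamma_+ + n)}{\Gamma(\gamma_j + x_j^{n})\,\Gamma(\gamma_+ + n + 1)} = \frac{x_j^{n}\theta + (1-\theta)\pi_j}{1 + (n-1)\theta}, \qquad (2.4)$$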
where we used $f(p)$ from (2.2) and $x_j^{n+1} = x_j^{n} + 1$. This expression emphasises that the probability
of observing a future $j$ allele depends on the previously sampled alleles only through the total allele
count, $n$, and how many of these alleles were of type $j$, $x_j^{n}$. Hence, we also apply the notation
$P(j|x_j^{n})$ for this probability, which is identical to $P_n(A) = (n_A\theta + \{1-\theta\}p_A)/(1 + \{n-1\}\theta)$ in the
recursion equation of Balding and Nichols (1997, equation (1), where we changed their notation
from $F$ to $\theta$), which is the probability of observing an $A$ allele after $n_A$ of $n$ alleles being of type
$A$.
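The recursion lends itself to a direct implementation: the probability of an ordered sequence of alleles is the product of the successive predictive probabilities. The following sketch uses illustrative function names and hypothetical allele probabilities.

```python
def next_allele_probability(x_j, n, pi_j, theta):
    """Probability that the next allele is of type j after x_j of n alleles were of type j."""
    return (x_j * theta + (1.0 - theta) * pi_j) / (1.0 + (n - 1) * theta)


def sequence_probability(alleles, pi, theta):
    """Probability of an ordered sequence of alleles under the recursion."""
    counts = {}
    prob = 1.0
    for n, allele in enumerate(alleles):  # n alleles have been seen so far
        prob *= next_allele_probability(counts.get(allele, 0), n, pi[allele], theta)
        counts[allele] = counts.get(allele, 0) + 1
    return prob


pi = {"a": 0.1, "b": 0.3}  # hypothetical reference allele probabilities
print(sequence_probability("aa", pi, theta=0.02))
print(sequence_probability("ab", pi, theta=0.02),
      sequence_probability("ba", pi, theta=0.02))
```

For $n = 0$ the expression reduces to $\pi_j$, and the sequence probability does not depend on the order of the alleles, in line with the remark that only the counts $n$ and $x_j^n$ matter.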
2.2.2 Application to paternity testing
Forensic genetics is widely used in paternity disputes or when a person applies for family
reunification. In the setting of a paternity dispute, let $H_1$ be the hypothesis "The alleged father
is the true father" and $H_2$ the hypothesis "A man unrelated to the alleged father is the true
father". The paternity index (PI) is defined as $\mathrm{PI} = P(E|H_1)/P(E|H_2)$, where the evidence, $E$, is
the DNA profiles of the involved individuals, i.e. child, mother and alleged father.
In paternity testing, the $\theta$-correction enters the PI through the assumption of correlated
individuals in the population due to subpopulation structures (Balding and Nichols, 1995; Evett and
Weir, 1998). Consider only one locus, where a child's DNA profile is $(ac)$ and its mother's
profile is $(ab)$. Assuming no mutations, the true father must pass on a $c$ allele to the child.
If the alleged father's DNA profile is $(cd)$, the $H_1$-hypothesis implies that the parents $(ab, cd)$
have offspring $(ac)$. The probability of the evidence, given $H_1$, is computed as $P(E|H_1) =
P(ac|ab, cd)P(ab, cd)$, where $P(ac|ab, cd)$ is the probability that a child is $ac$ when its parents
are $(ab, cd)$, i.e. $P(ac|ab, cd) = \frac{1}{4}$, and $P(ab, cd)$ is the probability of observing alleles $a$, $b$, $c$
and $d$ in the population. The other hypothesis, $H_2$, claims that the child got its $c$ allele from a
man unrelated to the alleged father. Then the paternity index, as derived in Appendix 2.A.1, is
given by
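$$\mathrm{PI}(\theta) = \frac{1 + 3\theta}{2[\theta + (1-\theta)\pi_c]}, \qquad (2.5)$$
where $\mathrm{PI}(\theta)$ is used to emphasise the PI's dependence on $\theta$. Tables 1 and 2 in Balding and Nichols
(1995) give the (reciprocal) $\mathrm{PI}(\theta)$ for other parent-child scenarios (with $\theta$ denoted by $F$).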
As an example, let us assume that this specific trio scenario is replicated for all $S$ loci used for
DNA profile testing. The consequence of applying $\theta > 0$ rather than using the simple $\mathrm{PI}(0) =
1/(2\pi_c)$ for independent profiles is very pronounced even for reasonably common alleles. If
$\pi_c = 0.025$ and $\theta = 0.03$, then $\mathrm{PI}(0.03) \approx 10$ while $\mathrm{PI}(0) = 20$. Hence, the ratio
between the two paternity indexes for $S$ independent DNA markers is $\{\mathrm{PI}(0)/\mathrm{PI}(0.03)\}^S \approx 2^S$,
which for the typical forensic typing kits with $S \geq 10$ yields a ratio of the order of 1,000 or more. That is, the
evidential weight may decrease by several orders of magnitude by correcting for possible IBD
or population stratification.
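The numbers in this example can be reproduced from (2.5). This is a small sketch with an illustrative function name.

```python
def paternity_index(pi_c, theta):
    """Paternity index (2.5) for child (ac), mother (ab) and alleged father (cd)."""
    return (1.0 + 3.0 * theta) / (2.0 * (theta + (1.0 - theta) * pi_c))


pi_uncorrected = paternity_index(0.025, 0.0)
pi_corrected = paternity_index(0.025, 0.03)
per_locus_ratio = pi_uncorrected / pi_corrected
print(pi_uncorrected, round(pi_corrected, 3), round(per_locus_ratio, 3))
print(round(per_locus_ratio ** 10))  # accumulated over S = 10 independent loci
```

The per-locus ratio is slightly below 2, so for $S = 10$ the accumulated ratio is close to, rather than exactly, $2^{10} = 1024$.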
2.2.3 Application to DNA mixtures
When two or more individuals contribute to a biological stain, the observable DNA profile is a
mixture of the various alleles contributing to the stain, and is therefore called a DNA mixture. In
an $m$-person DNA mixture, it is possible to observe 1 to $2m$ alleles per locus, since the involved
DNA profiles may share all or no alleles (see e.g. Tvedebrink et al., 2010, for a further discussion
of DNA mixtures). Assume for a two-person mixture, e.g. in a rape case, that we observe the alleles
$(abc)$ and that the victim's DNA profile is $(ab)$ and the suspect's DNA profile is $(cc)$. Then, in
line with the paternity index, the likelihood ratio is defined as $\mathrm{LR} = P(E|H_1)/P(E|H_2)$, where
$E$ is the evidence $(abc)$ and the known DNA profiles, and $H_1$ and $H_2$ are the prosecutor's and the
defence's hypotheses, respectively (in the literature $H_p$ and $H_d$ are commonly used for the same
hypotheses). The hypothesis $H_1$ states "The victim and suspect constitute the DNA mixture",
whereas $H_2$ acquits the suspect: "The victim and an unknown individual constitute the DNA
mixture". Let $P(abc|ab, ij)$ be the probability of observing the crime scene stain $(abc)$ given that the
mixture originates from genotypes $ab$ and $ij$. When assuming no typing errors, this probability
is 1 if $(ij) \in \{(ac), (bc), (ca), (cb), (cc)\}$ and 0 otherwise. In line with the derivation of $\mathrm{PI}(\theta)$ (see
Appendix 2.A.1 for the details of $\mathrm{PI}(\theta)$), we get
LR() =
P(abc|ab, cc)P(ab, cc)
_
i j
P(abc|ab, i j)P(ab, cc, i j)
=
(1 + 3)(1 + 4)
(7 + {1 }[2
a
+ 2
b
+
c
])(2 + {1 }
c
)
(2.6)
In Figure 2.1, we have plotted the $LR(\theta)$ function for the DNA mixture above with $\pi_a$ and $\pi_b$ fixed at 0.1. The solid line represents the uncorrected $LR(\theta = 0)$ and the broken lines show the corrected $LR(\theta)$ for $\theta$-values as described by the legend. The inserted plot shows the behaviour close to the value $\pi_c = 0.71$ where the effect of the $\theta$-correction is reversed. We see that the effect of $\theta$ is minimal for common alleles and more pronounced for rare ones. Hence, in practice, the larger $\theta$ is, the more conservative the LR-estimates are.
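The behaviour described for Figure 2.1 can be explored numerically by evaluating the closed form in (2.6). The sketch below is illustrative only: the function name `mixture_lr` and the three $\pi_c$ values are assumptions chosen to show a rare, a moderately common and a very common allele.

```python
def mixture_lr(theta: float, pi_a: float, pi_b: float, pi_c: float) -> float:
    """LR(theta) of (2.6): stain (abc), victim (ab), suspect (cc)."""
    numerator = (1 + 3 * theta) * (1 + 4 * theta)
    denominator = ((7 * theta + (1 - theta) * (2 * pi_a + 2 * pi_b + pi_c))
                   * (2 * theta + (1 - theta) * pi_c))
    return numerator / denominator


thetas = (0.00, 0.01, 0.02, 0.03, 0.04)        # the legend values of Figure 2.1
for pi_c in (0.05, 0.30, 0.80):                # illustrative allele probabilities
    row = [round(mixture_lr(t, 0.1, 0.1, pi_c), 3) for t in thetas]
    print(pi_c, row)
```

For the rare allele the corrected LR is well below the uncorrected one, while for the most common allele the ordering is reversed, in line with the reversal reported for large $\pi_c$.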
The use of $\theta$ in evidential computations can be seen as a means of smoothing the allele probabilities over possible subpopulations and thereby adjusting for the uncertainty associated with unobserved or unobservable substructures in the larger database. This latent structure may be seen as a reason for overdispersion in statistical terms, i.e. inhomogeneity due to unobserved/unobservable variables.
[Figure 2.1: plot of the likelihood ratio, $LR(\theta)$, against $\pi_c$, with curves for $\theta$ = 0.00, 0.01, 0.02, 0.03 and 0.04 and an inset covering $\pi_c$ from 0.68 to 0.76.]
Figure 2.1: The effect of $\theta$ on $LR(\theta)$ for a single locus as exemplified. $LR(\theta)$ is plotted for various $\theta$-values ranging from 0.00 (no subpopulation effect) to 0.04 (large subpopulation effect) against the allele frequency of the allele in question (here allele $c$) with the other probabilities ($\pi_a$ and $\pi_b$) fixed at 0.1. Inserted is a blow-up of the curve near $\pi_c = 0.71$ (this point is marked in the plot).
If the suspect or alleged father in the two situations considered above has an ethnicity or nationality that indicates that a specific database is representative of his genetic origin, then allele frequencies estimated from this database are the most appropriate reference sample to use for evidential weight calculations. However, the database and the population that it resembles may be constituted by several subpopulations or groups, which causes this conceptual population to be heterogeneous. That is, geopolitical or tribal structures together with marital and religious preferences may induce genetic diversity causing overdispersion. Hence, a database that seems to be the most appropriate for a particular suspect may not be sampled at a sufficiently high resolution to obtain a homogeneous reference subpopulation. In fact, it may not even be possible to obtain samples with this property. Thus, genetic diversity and the resulting overdispersion in allele counts must be accounted for by the $\theta$-correction.
2.3 Parameter estimation
Assume that we have allelic counts from $N$ different subpopulations such that $x_{ij}$ denotes the number of allele $j$ in subpopulation $i$ and that for each subpopulation $i$, $i = 1, \ldots, N$, there is a total of $n_i$ counts, $n = (n_1, \ldots, n_N)$. In addition, we assume that the subpopulations are independent, implying that the likelihoods of the counts from the subpopulations multiply.
The likelihood may then be derived by multiplying over the terms of (2.3). This likelihood implies differentiation of $\Gamma$-functions in order to solve the likelihood equations. A useful observation about the $\Gamma$-function is that
$$\prod_{r=1}^{y} \{\gamma + (r - 1)\} = \gamma(\gamma + 1) \cdots (\gamma + y - 1) = \frac{\Gamma(\gamma + y)}{\Gamma(\gamma)}, \tag{2.7}$$
using the fact that $x\Gamma(x) = \Gamma(x + 1)$. Hence, an equivalent way of expressing the distribution in (2.3) using the rising factorials of (2.7) is given as
$$P(X = x) = \binom{n}{x} \frac{\prod_{j=1}^{k} \prod_{r=1}^{x_j} \{\pi_j(1 - \theta) + (r - 1)\theta\}}{\prod_{r=1}^{n} \{1 - \theta + (r - 1)\theta\}}.$$
From this probability function we can compute the log-likelihood function $\ell(\pi, \theta; x)$. Discarding the multinomial coefficient (which is constant with respect to the parameters), the log-likelihood is
$$\ell(\pi, \theta; x) = \sum_{i=1}^{N} \sum_{j=1}^{k} \sum_{r=1}^{x_{ij}} \log\{\pi_j(1 - \theta) + (r - 1)\theta\} - \sum_{i=1}^{N} \sum_{r=1}^{n_i} \log\{1 - \theta + (r - 1)\theta\}. \tag{2.8}$$
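As a sanity check of (2.7) and (2.8), the log-likelihood can be evaluated both through the rising factorials and through log-gamma functions. The following is a minimal sketch, assuming $0 < \theta < 1$ and allele probabilities that sum to one; the toy counts and the function names are hypothetical and only serve the illustration.

```python
import math


def loglik_pi_theta(pi, theta, counts):
    """Log-likelihood (2.8), multinomial coefficient dropped; needs 0 < theta < 1."""
    total = 0.0
    for row in counts:                      # one row of allele counts per subpopulation
        for pi_j, x_ij in zip(pi, row):
            total += sum(math.log(pi_j * (1 - theta) + (r - 1) * theta)
                         for r in range(1, x_ij + 1))
        total -= sum(math.log(1 - theta + (r - 1) * theta)
                     for r in range(1, sum(row) + 1))
    return total


def loglik_via_lgamma(pi, theta, counts):
    """Same quantity through (2.7), using gamma_j = pi_j (1 - theta) / theta."""
    gamma = [pi_j * (1 - theta) / theta for pi_j in pi]
    g_plus = sum(gamma)
    total = 0.0
    for row in counts:
        total += sum(math.lgamma(g + x) - math.lgamma(g) for g, x in zip(gamma, row))
        total -= math.lgamma(g_plus + sum(row)) - math.lgamma(g_plus)
    return total


toy_counts = [[3, 5, 2], [6, 1, 3]]         # hypothetical data: N = 2, k = 3
toy_pi = [0.3, 0.5, 0.2]
print(loglik_pi_theta(toy_pi, 0.05, toy_counts))
print(loglik_via_lgamma(toy_pi, 0.05, toy_counts))
```

The two functions agree because every factor in (2.8) is the corresponding rising-factorial term multiplied by $\theta$, and these extra factors cancel between numerator and denominator.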
The corresponding likelihood equations, $\partial\ell(\pi, \theta; x)/\partial(\pi, \theta) = 0$, cannot be solved analytically for the parameters; hence numerical methods need to be invoked for parameter estimation. Let $\psi$ denote the parameter vector $\psi = (\pi, \theta) = (\{\pi_j\}_{j=1}^{k-1}, \theta)$, since $\pi_k = 1 - \sum_{j=1}^{k-1} \pi_j$. A possible numerical method for solving the likelihood equations is Fisher-scoring, where the parameter estimates in each iteration are updated using $\psi^{(m+1)} = \psi^{(m)} + \{I(\psi^{(m)})\}^{-1} u(\psi^{(m)})$, where $\psi^{(m)}$ is the estimate in the $m$th iteration, $u(\psi^{(m)})$ is the score function, $\partial\ell(\psi; x)/\partial\psi$, and $I(\psi^{(m)})$ is the expected Fisher Information Matrix (FIM), both evaluated in $\psi^{(m)}$. Paul et al. (2005) derived exact expressions for the expected FIM entries, $I(\psi)$. The results of Paul et al. (2005) imply that the expected FIM may be computed using expressions only involving the marginal distributions of $X_j$. Similar results were obtained by Neerchal and Morel (2005).
2.3.1 Computational considerations
Most of the methodology discussed in this section and its subsections has been implemented in the R-package dirmult, available online in the CRAN repository at http://www.r-project.org (Tvedebrink, 2009).
Even though the expressions for the expected FIM, $I(\pi, \theta)$, given in Paul et al. (2005, pp. 232) are compact, they cause the parameter estimation to be computationally inefficient. Numerical work has shown that it is much more convenient to estimate the $\gamma$-parameters and transform the estimates, rather than estimate $\pi$ and $\theta$ directly. The log-likelihood $\ell(\gamma; x)$ is
$$\ell(\gamma; x) = \sum_{i=1}^{N} \sum_{j=1}^{k} \sum_{r=1}^{x_{ij}} \log\{\gamma_j + r - 1\} - \sum_{i=1}^{N} \sum_{r=1}^{n_i} \log\{\gamma_+ + r - 1\}, \tag{2.9}$$
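where we used (2.3) and (2.7). The first-order and second-order derivatives of the log-likelihood $\ell(\gamma; x)$ are given by
$$\frac{\partial\ell(\gamma; x)}{\partial\gamma_j} = \sum_{i=1}^{N} \left[ \sum_{r=1}^{x_{ij}} \frac{1}{\gamma_j + r - 1} - \sum_{r=1}^{n_i} \frac{1}{\gamma_+ + r - 1} \right] \tag{2.10}$$
$$\frac{\partial^2\ell(\gamma; x)}{\partial\gamma_j^2} = \sum_{i=1}^{N} \left[ \sum_{r=1}^{n_i} \frac{1}{(\gamma_+ + r - 1)^2} - \sum_{r=1}^{x_{ij}} \frac{1}{(\gamma_j + r - 1)^2} \right] \tag{2.11}$$
$$\frac{\partial^2\ell(\gamma; x)}{\partial\gamma_j\,\partial\gamma_l} = \sum_{i=1}^{N} \sum_{r=1}^{n_i} \frac{1}{(\gamma_+ + r - 1)^2}, \tag{2.12}$$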
where (2.10) gives the elements of the score function $u(\gamma)$. Furthermore, this implies that the diagonal elements of the expected FIM, $I(\gamma)$, are
$$I(\gamma_j, \gamma_j) = \sum_{i=1}^{N} \left[ \sum_{r=1}^{x_{ij}} \frac{P(X_{ij} \geq r)}{(\gamma_j + r - 1)^2} - \sum_{r=1}^{n_i} \frac{1}{(\gamma_+ + r - 1)^2} \right],$$
for $j = 1, \ldots, k$, and the off-diagonal elements, $I(\gamma_j, \gamma_l)$, equal (2.12). However, for most practical purposes using the observed FIM, $J(\gamma)$, rather than the expected FIM, $I(\gamma)$, in the Newton-Raphson scoring ensures much lower computational time. Numerical investigations indicate that the $J(\gamma)$-implementation converges to the same extrema, and much more quickly, since the diagonal elements, $J(\gamma_j, \gamma_j)$, of this matrix are as in (2.11), i.e. the terms $P(X_{ij} \geq r)$, $r = 1, \ldots, x_{ij}$, where $X_{ij} \sim \text{Beta-Binomial}(\gamma_j, \gamma_+ - \gamma_j)$, need not be computed.
The inverse of the expected FIM is the asymptotic covariance matrix of the MLE. As our interest is in $(\pi, \theta)$, we exploit that $I(\pi, \theta) = \Delta^\top I(\gamma) \Delta$, where $\{\Delta\}_{ij} = \{\partial\gamma/\partial(\pi, \theta)\}_{ij}$.
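When implementing the derivatives above, a quick way to guard against sign and index mistakes is to compare the analytical score (2.10) with a finite-difference approximation of (2.9). The sketch below does exactly that; the toy counts, the $\gamma$ values and the function names are hypothetical, and this is not the dirmult implementation.

```python
import math


def loglik_gamma(gamma, counts):
    """Log-likelihood (2.9) in the gamma parametrisation."""
    g_plus = sum(gamma)
    total = 0.0
    for row in counts:
        for g_j, x_ij in zip(gamma, row):
            total += sum(math.log(g_j + r - 1) for r in range(1, x_ij + 1))
        total -= sum(math.log(g_plus + r - 1) for r in range(1, sum(row) + 1))
    return total


def score_gamma(gamma, counts):
    """Analytical score (2.10), one element per gamma_j."""
    g_plus = sum(gamma)
    common = sum(1.0 / (g_plus + r - 1)
                 for row in counts for r in range(1, sum(row) + 1))
    return [sum(1.0 / (g_j + r - 1)
                for row in counts for r in range(1, row[j] + 1)) - common
            for j, g_j in enumerate(gamma)]


def numerical_score(gamma, counts, h=1e-6):
    """Central finite differences of (2.9), used only as a check."""
    grads = []
    for j in range(len(gamma)):
        up, down = list(gamma), list(gamma)
        up[j] += h
        down[j] -= h
        grads.append((loglik_gamma(up, counts) - loglik_gamma(down, counts)) / (2 * h))
    return grads


toy_counts = [[3, 5, 2], [6, 1, 3]]         # hypothetical data: N = 2, k = 3
toy_gamma = [5.7, 9.5, 3.8]                 # hypothetical gamma values
print(score_gamma(toy_gamma, toy_counts))
print(numerical_score(toy_gamma, toy_counts))
```

The same pattern extends to the second derivatives (2.11) and (2.12) by differencing the score instead of the log-likelihood.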
Simulations
Standard asymptotic theory assures that the MLE is the most efficient estimator. However, inference about $\theta$ depends mainly on the number of subpopulations sampled, $N$, and only to a minor degree on the subpopulation sample sizes, $n$. Hence, in order to verify our implementation and the performance of the maximum likelihood estimator for different numbers of subpopulations, we simulated data with known allele frequencies, $\pi$, and $\theta$-value. When simulating the $m$th data matrix, $x_m$, for $m = 1, \ldots, M$, we used the following sampling scheme:
1. Draw $p_{i,m} \sim \text{Dirichlet}(\{\pi_j(1 - \theta)/\theta\}_{j=1}^{k})$, $i = 1, \ldots, N$.
2. Draw $x_{i,m} \sim \text{Multinomial}(n_i, p_{i,m})$, $i = 1, \ldots, N$.
3. The $m$th data matrix is $x_m = [x_{1,m}, \ldots, x_{N,m}]^\top$.
This ensures that the random variable $X_i$, of which $x_{i,m}$ is a realisation, follows a Dirichlet-multinomial distribution with parameters $\pi$ and $\theta$. Note that the concept of $N$ subpopulations is a theoretical one. In practice only an overall database would exist, which neglects the present substructure. However, the intention is to account for this partitioning using the $\theta$-correction.
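The three-step sampling scheme above can be sketched with the Python standard library alone. This is an illustrative implementation under stated assumptions, not the code used for the simulations: the helper names, the seed and the three-allele $\pi$ vector are made up, and the Dirichlet draw uses the standard construction from normalised gamma variates.

```python
import random


def draw_dirichlet(alpha, rng):
    """Dirichlet draw via independent Gamma(alpha_j, 1) variates, normalised."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]


def draw_multinomial(n, p, rng):
    """Multinomial(n, p) draw returned as a list of counts."""
    counts = [0] * len(p)
    for j in rng.choices(range(len(p)), weights=p, k=n):
        counts[j] += 1
    return counts


def simulate_counts(pi, theta, sizes, rng):
    """Steps 1-3: one data matrix with a row of allele counts per subpopulation."""
    alpha = [pi_j * (1 - theta) / theta for pi_j in pi]            # step 1 parameters
    return [draw_multinomial(n_i, draw_dirichlet(alpha, rng), rng)  # steps 1 and 2
            for n_i in sizes]                                       # step 3: stack rows


rng = random.Random(2010)                     # fixed seed, illustrative only
x_m = simulate_counts([0.5, 0.3, 0.2], 0.03, [200] * 8, rng)
for row in x_m:
    print(row)
```

Each call to `simulate_counts` yields one data matrix; repeating it $M$ times gives the replicated datasets used to study an estimator.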
In Weir and Hill (2002), the authors argue that if the expectation of a ratio were the ratio of expectations, then the method of moments (MoM) estimator, $\hat\theta_{MoM}$, of Weir and Hill (2002, equation 5) would be an unbiased estimator of $\theta$:
$$\hat\theta_{MoM} = \frac{\sum_{j=1}^{k} (MSP_j - MSG_j)}{\sum_{j=1}^{k} (MSP_j + (n_c - 1) MSG_j)},$$
where $n_c = (N - 1)^{-1} \left[ \sum_{i=1}^{N} n_i - n_+^{-1} \sum_{i=1}^{N} n_i^2 \right]$ and $n_+ = \sum_{i=1}^{N} n_i$. The quantities $MSG_j$ and $MSP_j$ are two mean squares defined as
$$MSP_j = \frac{1}{N - 1} \sum_{i=1}^{N} n_i (\tilde{p}_{ij} - \bar{p}_j)^2 \quad \text{and} \quad MSG_j = \frac{1}{\sum_{i=1}^{N} (n_i - 1)} \sum_{i=1}^{N} n_i \tilde{p}_{ij} (1 - \tilde{p}_{ij}),$$
with $\tilde{p}_{ij} = x_{ij}/n_i$ and $\bar{p}_j = n_+^{-1} \sum_{i=1}^{N} x_{ij}$. Even though the expectation does not satisfy the property mentioned above, the $\hat\theta_{MoM}$-estimator seems to perform reasonably well on average.
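Because the MoM estimator is a closed-form expression, it translates directly into code. The sketch below follows the formulas above term by term; the function name `theta_mom` and the two toy datasets are illustrative assumptions.

```python
def theta_mom(counts):
    """Method-of-moments estimate of theta from an N x k matrix of allele counts."""
    n_sub = len(counts)                        # N, number of subpopulations
    k = len(counts[0])                         # number of alleles
    n = [sum(row) for row in counts]           # n_i
    n_plus = sum(n)
    n_c = (n_plus - sum(v * v for v in n) / n_plus) / (n_sub - 1)
    numerator = denominator = 0.0
    for j in range(k):
        p_bar = sum(row[j] for row in counts) / n_plus
        p_tilde = [row[j] / n_i for row, n_i in zip(counts, n)]
        msp = sum(n_i * (p - p_bar) ** 2 for n_i, p in zip(n, p_tilde)) / (n_sub - 1)
        msg = (sum(n_i * p * (1 - p) for n_i, p in zip(n, p_tilde))
               / sum(n_i - 1 for n_i in n))
        numerator += msp - msg
        denominator += msp + (n_c - 1) * msg
    return numerator / denominator


print(theta_mom([[50, 50], [50, 50]]))     # identical subpopulations: slightly negative
print(theta_mom([[100, 0], [0, 100]]))     # completely differentiated subpopulations
```

Note that the estimator is not restricted to the interval $[0, 1]$: identical subpopulations give a small negative value, which is one visible consequence of it being a ratio of moment estimates.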
More recently, Zhou and Lange (2010) have derived MM (minorisation-maximisation) algorithms for some discrete multivariate distributions, among these the Dirichlet-multinomial distribution. The authors have provided Matlab scripts (online supplementary material available at the website of Journal of Computational and Graphical Statistics) for estimating parameters in the MM set-up.
In the following we compare the MLE, MoM and MM estimates on simulated data using the relative frequencies in locus D13 from data published in Budowle and Moretti (1999) as $\pi$ and $\theta = 0.03$. The box plots in Figure 2.2 show $\theta$-estimates of 100 simulated datasets ($M = 100$) with sample sizes, $n_i$, of 200 and an increasing number of databases (increasing number of subpopulations, $N$).
From the box plots it is evident that the MLE has a lower variance, but also that on average the MoM and MM estimates are closer to the true value. However, as the number of databases increases so does the accuracy of the estimates, as one would expect. In addition to the accuracy of the estimation procedure, it is relevant to compare the computational speed and ease of implementation of the various methods. Naturally, the MoM estimator is the easiest to implement, and since no iterations are applied, the estimate is available immediately. Both MLE and MM estimates are based on iterative procedures. Whereas several statistical tools exist for easy implementation of Newton-Raphson iterations, a little more code needs to be written for MM algorithms. However, the script files of Zhou and Lange (2010) elegantly demonstrate how these obstacles can be handled in Matlab. We compared the computation times for the various iterative methods (Zhou and Lange, 2010, implemented simple and more advanced MM methods in their paper) and the number of iterations needed to satisfy the convergence criteria. The MLE method implemented in R is always faster and needs fewer iterations for convergence compared to the standard MM implementation. However, the more advanced MM updating schemes are more efficient than the MLE for small database counts. We tested the same algorithms on larger datasets (Danish and Greenlandic forensic databases of 20,000 and 2,000 DNA profiles). For these larger databases, the MLE implementation was 10 times faster than the specialised MM algorithms and up to 1,000 times faster than the standard MM implementation. However, this is only true when using the observed FIM, $J(\gamma)$, while the computation of the expected FIM, $I(\gamma)$, is very slow even for databases of moderate size.
[Figure 2.2: box plots of the $\theta$-estimates (vertical axis from 0.00 to 0.12) against the number of databases (2, 4, 8, 16, 32, 64); legend: MLE, MoM, MM, Posterior mean.]
Figure 2.2: Box plots of 100 estimates based on simulated data with $\theta = 0.03$ for an increasing number of databases with a fixed number of observations per database ($n_i = 200$ for all $i$). White boxes are MLE, grey boxes are MoM estimates, dark grey boxes are MM estimates, and the light grey boxes are posterior means. The average of the estimates within each block is also marked.
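Profile log-likelihood
From the box plots in Figure 2.2, there seems to be a tendency for the MLE to underestimate the $\theta$-parameter. In order to investigate the reason for this behaviour and compute the confidence intervals for $\theta$, we derived the profile log-likelihood, $\hat\ell(\theta) = \max_{\gamma} \ell(\gamma; x)$, where the maximisation is subject to the constraint $\gamma_+ = \gamma_+^{\theta}$, with $\gamma_+^{\theta} = (1 - \theta)/\theta$ the value of $\gamma_+$ implied by $\theta$. Introducing a Lagrange multiplier, $\lambda$, gives the function $\tilde\ell(\gamma, \lambda) = \ell(\gamma; x) - \lambda(\gamma_+ - \gamma_+^{\theta})$. The partial derivatives yield
$$\frac{\partial\tilde\ell(\gamma, \lambda)}{\partial\gamma_i} = \frac{\partial\ell(\gamma; x)}{\partial\gamma_i} - \lambda; \qquad \frac{\partial\tilde\ell(\gamma, \lambda)}{\partial\lambda} = \gamma_+^{\theta} - \gamma_+,$$
which implies that the score function for this new system is $u(\gamma, \lambda) = (u(\gamma) - \lambda 1_k, \gamma_+^{\theta} - \gamma_+)^\top$, where $u(\gamma)$ is the score function from (2.10) and $1_k$ is a $k$-dimensional vector of ones. The observed FIM, $J(\gamma, \lambda)$, is also almost preserved from the likelihood equations,
$$J(\gamma, \lambda) = \begin{bmatrix} J(\gamma) & 1_k \\ 1_k^\top & 0 \end{bmatrix},$$
where $J(\gamma)$ is the observed FIM from Section 2.3.1. Hence, we may apply Newton-Raphson iterations in order to maximise $\ell(\gamma; x)$ under the constraint $\gamma_+ = \gamma_+^{\theta}$. Alternatively, this constrained optimisation problem could have been solved using (recursive) quadratic programming. However, for this particular log-likelihood function the Newton-Raphson procedure works very well with Lagrange multipliers, and the existing code for maximisation is easily extended to handle the extra terms induced by the constraints.
In Figure 2.3 the profile log-likelihood for simulated data with $\theta = 0.03$ is plotted. Each panel is standardised such that the maximum value of $\hat\ell(\theta)$ is zero, i.e. the curve shown is $2[\hat\ell(\theta) - \hat\ell(\hat\theta)]$. [...] The posterior mean is computed as
$$\hat\theta_{PM} = \frac{\sum_{i=1}^{n} \theta_i \exp[\hat\ell(\theta_i)]}{\sum_{i=1}^{n} \exp[\hat\ell(\theta_i)]}. \tag{2.13}$$
The $n$ different $\theta$-values used in the sums of (2.13) are the same as those used for computing the profile log-likelihood, e.g. equidistant points covering the 95%-confidence interval. Table 2.1 lists the posterior means and estimates for the data in Figure 2.3, where the data points used for computing each posterior mean lie within the 95%-confidence interval.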
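The weighted average in (2.13) is simple to compute once the profile log-likelihood has been evaluated on a grid. The sketch below assumes such a grid is already available; the quadratic profile used here is a hypothetical stand-in, not output from the constrained Newton-Raphson procedure, and subtracting the maximum before exponentiating is a numerical-stability choice that leaves the ratio in (2.13) unchanged.

```python
import math


def posterior_mean(thetas, profile_loglik):
    """Weighted mean (2.13) of theta over a grid of profile log-likelihood values."""
    shift = max(profile_loglik)                        # avoids overflow/underflow in exp
    weights = [math.exp(value - shift) for value in profile_loglik]
    return sum(t * w for t, w in zip(thetas, weights)) / sum(weights)


# Hypothetical example: equidistant grid and a symmetric, quadratic profile.
grid = [0.01 + 0.001 * i for i in range(41)]           # 0.010, 0.011, ..., 0.050
profile = [-0.5 * ((t - 0.03) / 0.005) ** 2 for t in grid]
print(posterior_mean(grid, profile))
```

For a symmetric profile the posterior mean coincides with the maximiser; for the right-skewed profiles typical of few databases it is pulled towards larger $\theta$-values.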
[Figure 2.3: six panels (number of databases: 2, 4, 8, 16, 32 and 64) showing $2[\hat\ell(\theta) - \hat\ell(\hat\theta)]$ (vertical axis from -10 to 0) against $\theta$.]
Figure 2.3: Profile log-likelihoods for simulated data for an increasing number of databases with $\theta = 0.03$ for all simulations (marked in each panel). The MLE, MoM, MM and posterior mean (+) are plotted together with a 95%-confidence interval (intersection of the dotted line and the profile log-likelihood curve). The horizontal dashed and solid lines represent bootstrap confidence intervals based on randomisation and cluster resampling, respectively.
Table 2.1: Posterior means and estimates for the data in Figure 2.3 ($\theta = 0.03$).
Number of databases    MLE       MoM       MM        Posterior mean
2                      0.0395    0.1315    0.0411    0.0532
4                      0.0271    0.0374    0.0278    0.0310
8                      0.0243    0.0349    0.0249    0.0263
16                     0.0258    0.0247    0.0265    0.0267
32                     0.0297    0.0345    0.0306    0.0303
64                     0.0286    0.0316    0.0295    0.0290
We see that the posterior mean estimate in most situations improves on the MLE (except for the first row) and reduces the amount of bias for small numbers of databases. In Figure 2.2 the light grey boxes (rightmost box-and-whiskers for each stratum) represent the posterior means for the simulated data computed using $\theta$-values within the 95%-confidence interval for the associated MLE. Table 2.1 indicates that the bias is reduced for the posterior means, with only a minor increment in the variance (see Figure 2.2).
A full Bayesian implementation with prior distributions on $\pi$ and $\theta$ (or equivalently on the $\gamma$-parameters) was not pursued in this study. However, several authors (see e.g. Holsinger, 1999) have discussed estimation of $\theta$ (and other population genetics diversity measures) from a Bayesian perspective. We refer to the review paper by Holsinger and Weir (2009) for further results and discussions on Bayesian methodologies.
Bootstrapping confidence intervals
In addition to computing a confidence interval for $\theta$ using the $\chi^2_1$-approximation of the profile log-likelihood, we also investigated the performance of bootstrap methods to construct the confidence intervals. However, there are some problems when bootstrapping clustered data in order to assess the variability of the intra-cluster correlation parameter $\theta$.
Several studies (Davison and Hinkley, 1997; Ukoumunne et al., 2003; Fields and Welsh, 2007) indicate that special attention needs to be paid when one applies the bootstrap methodology to this problem. The general recommendation is to sample on a subpopulation (cluster) level rather than an individual (randomised) level due to the dependence structure implied by the intra-cluster correlation factor. In Figure 2.3, we have superimposed bootstrap confidence intervals (horizontal solid and dashed lines) based on both kinds of bootstrap regime. The general picture is that the cluster sampling underestimates $\theta$ (solid line, missing in the first two panels due to few databases), whereas the randomised bootstrap provides overestimated values (dashed line).
From numerical studies we recommend the use of the profile log-likelihood method in order to estimate the confidence intervals for $\theta$, since this method is valid for any number of subpopulations in the data. This might not be surprising (Davison and Hinkley, 1997; Ukoumunne et al., 2003; Fields and Welsh, 2007). However, bootstrapping is often applied when assessing the variability of estimates, but for $\theta$ this is inappropriate.
Significance test
Testing whether $\theta$ satisfies certain numerical properties is interesting since equality of $\theta$ across loci simplifies PI and LR computations. Further simplifications are possible if $\theta = 0$ is supported by data. This implies that there is no detectable difference among the databases, where the reasons for this may be small sample sizes (and thus large variation), or that the databases are as if sampled from a homogeneous population.
Samanta et al. (2009, Section 3: Hypothesis testing) derived hypothesis tests for inference about $\theta$ under various population assumptions. Here we begin by testing for equality of $\theta_s$ for the various loci, $s = 1, \ldots, S$. The null hypothesis is $\theta_1 = \cdots = \theta_S = \tilde\theta$, where $\theta_s$ is the $\theta$-value for locus $s$, with the alternative hypothesis specifying that at least one $\theta_s$ is different from $\tilde\theta$. The log-likelihood under the null hypothesis is
$$\ell(\{\pi_s\}_{s=1}^{S}, \tilde\theta; x) = \sum_{s=1}^{S} \ell_s(\pi_s, \tilde\theta; x_s), \tag{2.14}$$
where $\ell_s$ is the regular log-likelihood in (2.8) with $\theta_s = \tilde\theta$. The likelihood ratio test statistic is
$$-2\left[ \ell(\{\hat\pi_s\}_{s=1}^{S}, \hat{\tilde\theta}; x) - \sum_{s=1}^{S} \ell_s(\hat\pi_s, \hat\theta_s; x_s) \right],$$
and is approximately $\chi^2_{S-1}$-distributed from the $S - 1$ degrees of freedom (DoF). Details of finding stationary points of (2.14) are given in Appendix 2.A.2.
Furthermore, testing whether $\theta = 0$ is another interesting hypothesis test. Under the null hypothesis there is no evident substructure in the data. Having support for $\theta = 0$ implies that DNA profiles may be regarded as independent, which has a high influence on the estimation of the evidential weight (see Sections 2.2.2 and 2.2.3). The Dirichlet-multinomial model with $\theta = 0$ is equivalent to the simpler multinomial model. However, testing the hypothesis that $\theta = 0$ cannot be based on standard asymptotic theory, nor can it be inferred from the inclusion/exclusion of zero in the confidence intervals from the profile log-likelihood, since $\theta = 0$ lies on the boundary of the parameter space.
A possible method is to use a parametric bootstrap, where we simulate data $x_1, \ldots, x_M$ under the null hypothesis $\theta = 0$. From these simulated data we estimate $\hat\theta_m$ and obtain an approximate distribution of $\hat\theta$ under the null hypothesis, which we apply in order to test the significance of $\hat\theta > 0$ for the observed data, $x$. Hence, the parametric bootstrap comprises two steps: (1) draw $x_m \sim \text{Multinomial}(\{x_{i+}\}_{i=1}^{N}, \{x_{+j}\}_{j=1}^{k}/n_+)$ and (2) estimate $\hat\theta_m$.
By choosing $M$ large, e.g. $M = 1000$, one gets $M$ estimates of $\theta$, of which most should be smaller than $\hat\theta$ when the hypothesis $\theta = 0$ is false. An empirical p-value is computed by $\#\{\hat\theta_m > \hat\theta\}/M$, i.e. the ratio of the number of larger parametric bootstrap estimates to the total number of bootstraps.
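The two bootstrap steps and the empirical p-value can be sketched as follows. To keep the example self-contained it plugs in the method-of-moments estimator rather than the MLE used in the text; this substitution, the function names, the seed and the toy data are all assumptions made for illustration.

```python
import random


def theta_mom(counts):
    """Stand-in estimator (method of moments); the text uses the MLE here."""
    n_sub, k = len(counts), len(counts[0])
    n = [sum(row) for row in counts]
    n_plus = sum(n)
    n_c = (n_plus - sum(v * v for v in n) / n_plus) / (n_sub - 1)
    num = den = 0.0
    for j in range(k):
        p_bar = sum(row[j] for row in counts) / n_plus
        p = [row[j] / n_i for row, n_i in zip(counts, n)]
        msp = sum(n_i * (q - p_bar) ** 2 for n_i, q in zip(n, p)) / (n_sub - 1)
        msg = sum(n_i * q * (1 - q) for n_i, q in zip(n, p)) / sum(v - 1 for v in n)
        num += msp - msg
        den += msp + (n_c - 1) * msg
    return num / den


def bootstrap_p_value(counts, estimator, n_boot=1000, seed=0):
    """Empirical p-value #{theta_m > theta_hat} / M under the null theta = 0."""
    rng = random.Random(seed)
    k = len(counts[0])
    row_totals = [sum(row) for row in counts]                       # x_{i+}
    n_plus = sum(row_totals)
    pooled = [sum(row[j] for row in counts) / n_plus for j in range(k)]  # x_{+j} / n_+
    theta_hat = estimator(counts)
    larger = 0
    for _ in range(n_boot):
        simulated = []
        for n_i in row_totals:                                      # step (1)
            row = [0] * k
            for j in rng.choices(range(k), weights=pooled, k=n_i):
                row[j] += 1
            simulated.append(row)
        if estimator(simulated) > theta_hat:                        # step (2)
            larger += 1
    return larger / n_boot


print(bootstrap_p_value([[90, 10], [10, 90]], theta_mom, n_boot=200))
print(bootstrap_p_value([[50, 50], [50, 50]], theta_mom, n_boot=200))
```

Passing the estimator as an argument makes it straightforward to swap in a maximum likelihood routine without touching the resampling logic.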
2.4 Results
The paper of Budowle and Moretti (1999) presents allele frequencies of 13 CODIS Core STR loci in six US subpopulations. The data have previously been used to estimate the magnitude of $\theta$ used for forensic purposes; see e.g. Weir (2007). Henceforth we refer to these data as the FBI data.
Estimates of $\theta$ based on the MoM, MLE and MM are given in Table 2.2. There are some distinct differences between $\hat\theta_{MLE}$ and $\hat\theta_{MoM}$, often of a factor two; furthermore, the standard errors are often very much smaller for the MLE than for the MoM estimates. The standard errors are asymptotic, where SE($\hat\theta_{MoM}$) follows Weir and Hill (2002, pp. 730), and SE($\hat\theta_{MLE}$) $= \{(I^{-1})_{\theta,\theta}\}^{1/2}$ from Section 2.3.1. Standard errors of the MM estimates are not readily obtained from the Matlab scripts of the supplementary material of Zhou and Lange (2010), hence these are not provided in Table 2.2. The ratio $\hat\theta_{MoM}/\hat\theta_{MLE}$ of the estimates in Table 2.2 repeats the pattern which was indicated by the plots in Figures 2.2 and 2.3. For most loci, the MoM estimate lies within the 95%-confidence interval. The MM estimates coincide with the MLE for all loci. The posterior means are for most loci close to the MLE, which is due to the rather symmetric shape of the profile log-likelihoods plotted in Figure 2.4, where the profile log-likelihoods for the FBI data are plotted together with the MLE, MoM, MM and posterior mean (+).
Table 2.2: Estimates of $\theta$ for the 13 CODIS loci in the FBI data (PM: posterior mean).
Locus   MoM      SE(MoM)   MLE      SE(MLE)   95%-CI for theta    MM       PM
D3      0.0108   0.0085    0.0056   0.0020    (0.0028; 0.0110)    0.0057   0.0061
vWA     0.0107   0.0085    0.0053   0.0017    (0.0027; 0.0098)    0.0053   0.0056
FGA     0.0050   0.0051    0.0037   0.0010    (0.0021; 0.0061)    0.0037   0.0038
D8      0.0140   0.0106    0.0084   0.0024    (0.0049; 0.0145)    0.0085   0.0089
D21     0.0126   0.0097    0.0053   0.0013    (0.0031; 0.0086)    0.0053   0.0055
D18     0.0142   0.0107    0.0086   0.0019    (0.0056; 0.0133)    0.0087   0.0089
D5      0.0226   0.0157    0.0161   0.0042    (0.0097; 0.0276)    0.0163   0.0170
D13     0.0264   0.0180    0.0147   0.0040    (0.0088; 0.0254)    0.0149   0.0156
D7      0.0061   0.0056    0.0035   0.0013    (0.0015; 0.0072)    0.0036   0.0038
CSF     0.0050   0.0049    0.0091   0.0026    (0.0049; 0.0167)    0.0092   0.0097
TPOX    0.0306   0.0205    0.0248   0.0066    (0.0147; 0.0433)    0.0254   0.0263
TH01    0.0328   0.0217    0.0189   0.0054    (0.0110; 0.0340)    0.0193   0.0202
D16     0.0117   0.0091    0.0069   0.0023    (0.0036; 0.0131)    0.0070   0.0074
We tested the hypothesis of equality of $\theta$ for all loci in the FBI data. From Table 2.2, it is clear that there are differences among loci, but also some clustering of the estimates. In Table 2.3, we have listed the results from testing different hypotheses.
Table 2.3: Results from testing hypotheses of equality of $\theta$ for multiple loci.
[Figure 2.4: 13 panels (D3, vWA, FGA, D8, D21, D18, D5, D13, D7, CSF, TPOX, TH01 and D16) showing $2[\hat\ell(\theta) - \hat\ell(\hat\theta)]$ (vertical axis from -10 to 0) against $\theta$.]
Figure 2.4: Profile log-likelihoods for the 13 CODIS loci from the FBI data of Budowle and Moretti (1999). The MLE, MoM and MM estimates are marked in each panel. For all loci the MLE and MM estimate coincide. For most loci, the MoM estimate lies within the MLE confidence interval. The + indicates the posterior mean.
The tests indicate that there are groups of loci with similar $\theta$-values. The mean, $\bar\theta$, of the four loci (D5, D13, TPOX and TH01) with the largest $\theta$-estimates in Table 2.2 is $\bar\theta = 0.0186$, and the mean of the remaining loci is $\bar\theta = 0.0063$. In both groups, the estimated $\tilde\theta = 0.0183$ (95%-CI: [0.0126; 0.0269]) and [...]
[Rotated table in the source: per-locus numeric entries for D8, D21, D18, D5, D13, D7, CSF, TPOX, TH01 and D16; the values are not recoverable from the extraction.]
[...] appropriate to use the locus-specific MLEs (or the common $\theta$-values for groups of loci in Table 2.3) when computing $\mathrm{PI}(\theta)$.
2.6 Conclusion
We have demonstrated how the genetic dependence caused by the identical-by-descent assumption can be modelled as overdispersion from a statistical point of view. This allowed for maximum likelihood estimation of the allele probabilities in the reference population, $\pi$, and the identical-by-descent measure, $\theta$. By using recent results from the statistical literature, the FIM was computed analytically and confidence intervals based on profile log-likelihoods were provided.
Acknowledgements
I would like to thank my PhD supervisor Associate Professor Poul Svante Eriksen (Aalborg University, Denmark), Professor Niels Morling (University of Copenhagen, Denmark) and Professor Bruce S. Weir (University of Washington, USA) for comments and valuable discussions. I am thankful to Professor Weir for inviting me as visiting scientist to The Department of Biostatistics, University of Washington, which I was visiting while working on this paper. Furthermore, I would like to thank Associate Professor Esben Høg (Aalborg University, Denmark) and three anonymous reviewers for their comments, which have significantly improved the final version of this paper.
Appendix
2.A Mathematical details
In Appendix 2.A.1, we give some mathematical details on how to derive the paternity index,
PI(), of (2.5), and Appendix 2.A.2 is about testing for equality of across loci.
2.A.1 Deriving paternity index (PI)
We demonstrate how to derive the paternity index, PI, in (2.5) using $P(Y_{n+1}=j \mid Y_n = y_n) = P(j \mid x^n_j)$ in (2.4). In a given locus, the child's profile is $(ac)$ and the mother is heterozygous $(ab)$, where $c$ is different from $a$ and $b$. Discarding the possibility of mutations, the true father needs to pass on a $c$ allele to the child. Assume that the alleged father is heterozygous $(cd)$, which implies $P(ac \mid ab, cd) = \tfrac{1}{4}$, i.e. under hypothesis $H_1$ the probability of the child's profile given its parents' profiles is $\tfrac{1}{4}$. The PI is determined by:
$$PI = \frac{P(ac, ab, cd \mid H_1)}{P(ac, ab, cd \mid H_2)} = \frac{P(ac \mid ab, cd)\,P(ab, cd)}{\sum_{i,j}^{k} P(ac \mid ab, cd, ij)\,P(ab, cd, ij)},$$
where $(ij)$ denotes the profile of the true father under $H_2$ and summation is over all $k$ alleles in the given locus. However, when omitting the possibility of mutations, unless $i$ or $j$ equals $c$ the child cannot be the true father's offspring, i.e. $P(ac \mid ab, cd, ij) = 0$ for $(i, j)$ where $c \neq i$ and $c \neq j$. Hence, we fix $j = c$ and sum over all $i = 1, \ldots, k$, where $P(ac \mid ab, cd, cc) = \tfrac{1}{2}$ and $P(ac \mid ab, cd, ic) = \tfrac{1}{4}$ for all $i \neq c$ under $H_2$. This implies that the expression for the PI is given by
$$PI = \frac{\tfrac{1}{4} P(ab, cd)}{\tfrac{1}{2} P(ab, cd, cc) + 2\sum_{i \neq c} \tfrac{1}{4} P(ab, cd, ic)} = \frac{P(ab, cd)}{2 P(ab, cd, c)\left[P(c \mid ab, cd, c) + \sum_{i \neq c} P(i \mid ab, cd, c)\right]} = \frac{1}{2 P(c \mid ab, cd)},$$
where the sum in square brackets by definition is one. Using the expression $P(j \mid x^n_j)$ in (2.4) with $x^4 = (x^4_a, x^4_b, x^4_c, x^4_d) = (1, 1, 1, 1)$, we have,
$$PI = \frac{1}{2 P(c \mid x^4_c)} = \frac{1 + (n-1)\theta}{2[x_c\theta + (1-\theta)\pi_c]} = \frac{1 + 3\theta}{2[\theta + (1-\theta)\pi_c]}.$$
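As a small numerical illustration of the final expression, the sketch below evaluates the PI for the trio considered above (child $(ac)$, mother $(ab)$, alleged father $(cd)$). The function names and the example allele probability of 0.1 are hypothetical and chosen only for illustration; the sampling formula is the one quoted from (2.4).

```python
def allele_prob(x_j, n, theta, pi_j):
    """Sampling formula of (2.4): P(j | x^n_j)."""
    return (x_j * theta + (1.0 - theta) * pi_j) / (1.0 + (n - 1) * theta)


def paternity_index(theta, pi_c):
    """PI for child (ac), mother (ab) and alleged father (cd)."""
    # Mother and alleged father contribute four alleles (a, b, c, d),
    # each observed once, so x_c = 1 and n = 4.
    return 1.0 / (2.0 * allele_prob(1, 4, theta, pi_c))


# Hypothetical allele probability pi_c = 0.1 for a few theta values.
for theta in (0.0, 0.01, 0.03):
    print(theta, round(paternity_index(theta, pi_c=0.1), 3))
```

For $\theta = 0$ the expression reduces to $1/(2\pi_c)$, and the PI decreases as $\theta$ grows.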
2.A.2 Testing equality of $\theta$ for multiple loci
In order to find stationary points for the log-likelihood of (2.14), we use Fisher-scoring with Lagrange multipliers, $\lambda = \{\lambda_s\}_{s=1}^{S}$, ensuring equal $\theta$ for all loci. Translating the common parameter
to
ensures computational simplicity. The observed FIM, J(), associated with (2.14) is
J() =
_
_
[J(
1
)] O
1,2
O
1,S
g
k
1
O
2,1
[J(
2
)] O
2,S
g
k
2
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
O
S,1
O
S,S 1
[J(
2
)] g
k
S
g
k
1
g
k
2
g
k
S
0
_
_
where $O_{s,t}$ is a $(k_s+1) \times (k_t+1)$-matrix of zeros,
[J(
s
)] =
_
J(
s
) 1
k
s
1
k
s
0
_
and g
k
s
=
_
0
k
s
1
_
Furthermore, the score function is
u({
s
,
s
}
S
s=1
,
) = ({u(
s
)
s
1
k
s
, (
s+
)}
S
s=1
,
+
)
where u(
s
) is the score function of (2.10).
Bibliography
Balding, D. J. (2003). Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology 63, 221–230.
Balding, D. J. (2005). Weight-of-evidence for Forensic DNA Profiles. Chichester, West Sussex: John Wiley & Sons, Ltd.
Balding, D. J. and R. A. Nichols (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12.
Balding, D. J. and R. A. Nichols (1997). Significant genetic correlations among Caucasians at forensic DNA loci. Heredity 78(6), 583–589.
Barndorff-Nielsen, O. E. and D. R. Cox (1994). Inference and Asymptotics. Number 52 in Monographs on Statistics and Applied Probability. London: Chapman & Hall.
Box, G. E. P. and N. R. Draper (1987). Empirical model-building and response surfaces. Wiley.
Budowle, B. and T. R. Moretti (1999). Genotype profiles for six population groups at the 13 CODIS short tandem repeat core loci and other PCR-based loci. Forensic Science Communications.
Cockerham, C. C. (1969). Variance of gene frequencies. Evolution 23(1), 72–84.
Cockerham, C. C. (1973). Analysis of gene frequencies. Genetics 74(4), 679–700.
Curran, J. M., J. S. Buckleton, C. M. Triggs, and B. S. Weir (2002). Assessing uncertainty in DNA evidence caused by sampling effects. Science and Justice 42(1), 29–37.
Curran, J. M., C. M. Triggs, J. S. Buckleton, and B. S. Weir (1999). Interpreting DNA mixtures in structured populations. Journal of Forensic Science 44(5), 987–995.
Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods and their Application. Cambridge University Press.
Evett, I. W. and B. S. Weir (1998). Interpreting DNA Evidence: Statistical Genetics for Forensic Scientists. Sunderland, MA: Sinauer Associates.
Fields, C. A. and A. H. Welsh (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society. Series B, Statistical methodology 69(3), 369–390.
Green, P. J. and J. Mortera (2009). Sensitivity of inferences in forensic genetics to assumptions about founding genes. Annals of Applied Statistics 3(2), 731–763.
Hardy, G. H. (1908). Mendelian proportions in a mixed population. Science 28(706), 49–50.
Holsinger, K. E. (1999). Analysis of genetic diversity in geographically structured populations: A Bayesian perspective. Hereditas 130, 245–255.
Holsinger, K. E. and B. S. Weir (2009). Genetics in geographically structured populations: defining, estimating and interpreting $F_{ST}$. Nature Reviews. Genetics 10(9), 639–650.
Johnson, N. L., S. Kotz, and N. Balakrishnan (1997). Discrete Multivariate Distributions. Wiley.
Lange, K. (1995a). Applications of the Dirichlet distribution to forensic match probabilities. Genetica 96, 107–117.
Lange, K. (1995b). Mathematical and Statistical Methods for Genetic Analysis (2 ed.). Springer.
Mosimann, J. E. (1962). On the compound multinomial distribution, the multivariate $\beta$-distribution, and correlations among proportions. Biometrika 49(1-2), 65–82.
Neerchal, N. K. and J. G. Morel (2005). An improved method for the computation of maximum likelihood estimates for multinomial overdispersion models. Computational Statistics & Data Analysis 49, 33–43.
Nichols, R. A. and D. J. Balding (1991). Effects of population structure on DNA fingerprint analysis in forensic science. Heredity 66, 297–302.
Paul, S. R., U. Balasooriya, and T. Banerjee (2005). Fisher information matrix for the Dirichlet-multinomial distribution. Biometrical Journal 47(2), 230–236.
Phillips, C., T. Tvedebrink, et al. (2010). Analysis of global variability in 15 established and 5 new European Standard Set (ESS) STRs using the CEPH human genome diversity panel. Forensic Science International: Genetics. In Press.
Rannala, B. and J. A. Hartigan (1996). Estimating gene flow in island populations. Genetical Research 67, 147–158.
Samanta, S., Y.-J. Li, and B. S. Weir (2009). Drawing inferences about the coancestry coefficient. Theoretical Population Biology 75, 312–319.
Tvedebrink, T. (2009). dirmult: Estimation in Dirichlet-Multinomial distribution. R package version 0.1.3.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010). Evaluating the weight of evidence using quantitative STR data in DNA mixtures. Journal of the Royal Statistical Society. Series C, Applied statistics. In Press.
Ukoumunne, O. C., A. C. Davison, M. C. Gulliford, and S. Chinn (2003). Non-parametric bootstrap confidence intervals for the intraclass correlation coefficient. Statistics in Medicine 22, 3805–3821.
Weinberg, W. (1908). Über den Nachweis der Vererbung beim Menschen. Jahreshefte des Vereins für vaterländische Naturkunde in Württemberg 64, 368–382.
Weir, B. S. (1996). Genetic Data Analysis II. Sinauer Associates, Inc.
Weir, B. S. (2007). The rarity of DNA profiles. The Annals of Applied Statistics 1(2), 358–370.
Weir, B. S. and C. C. Cockerham (1984). Estimating F-statistics for the Analysis of Population Structure. Evolution 38(6), 1358–1370.
Weir, B. S. and W. G. Hill (2002). Estimating F-statistics. Annual Review of Genetics 36, 721–750.
Wright, S. (1951). The genetical structure of populations. Annals of Eugenics 15, 323–354.
Zhou, H. and K. Lange (2010). MM algorithms for some discrete multivariate distributions. Journal of Computational and Graphical Statistics. In Press.
2.7 Supplementary remarks
The methodology presented above for estimating $\theta$ and computing the profile log-likelihood has
been applied in the publication by Phillips et al. (2010). I performed some of the computations
of that paper using the dirmult package and made plots similar to those of Figures 2.3 and
2.4 (Fig. 4 in Phillips et al., 2010). Plots in higher resolution are available from my web page
(http://people.math.aau.dk/tvede under "List of publications").
In population genetics the Hardy-Weinberg equilibrium (HWE) constitutes a fundamental point of reference. Proposed independently by Hardy (1908) and Weinberg (1908), the HWE states that, assuming random mating, no selection, no mutations and infinite population size, the probability of a diploid genotype is the product of allele probabilities, $P(A_iA_j) = 2p_ip_j$ and $P(A_iA_i) = p_i^2$.
We know immediately from these assumptions that HWE fails to hold, since no real-world population satisfies these restrictions. However, the remark of Box and Draper (1987, p. 74), "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful", applies also to HWE. In fact, testing for HWE is often done to test for data quality, where the test is performed on genetic data to detect possible over-representation of homozygotes due to typing errors.
Over the last 100 years since the publication of the Hardy-Weinberg principle, several genetic models have been proposed to relax the assumptions mentioned above. One such attempt was that of Wright (1951), who defined the F-statistics ($F_{ST}$, $F_{IT}$ and $F_{IS}$), which are measures of population differentiation (Holsinger and Weir, 2009). In forensic genetics the most interesting of the parameters is $F_{ST}$, which measures the divergence between a subpopulation, $S$, and the total population, $T$. Cockerham (1969, 1973) showed that for most interesting assumptions made about the population structure and breeding patterns $\theta$ is identical to $F_{ST}$ (Weir and Cockerham, 1984, p. 1358). The use of the $\theta$-correction alters the genotype probabilities to $P(A_iA_j) = 2p_ip_j(1-\theta)$ and $P(A_iA_i) = p_i\theta + p_i^2(1-\theta)$, where the magnitude of $\theta$ controls the deviation from HWE.
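The effect of the $\theta$-correction on the genotype probabilities can be illustrated with a short sketch. The function name and the three-allele example frequencies are hypothetical; the formulas are the HWE and $\theta$-corrected expressions stated above.

```python
from itertools import combinations_with_replacement


def genotype_probs(freqs, theta=0.0):
    """Genotype probabilities under HWE (theta = 0) or with theta-correction."""
    probs = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        if a == b:
            # Homozygote: p_i * theta + p_i^2 * (1 - theta)
            probs[(a, b)] = freqs[a] * theta + freqs[a] ** 2 * (1 - theta)
        else:
            # Heterozygote: 2 * p_i * p_j * (1 - theta)
            probs[(a, b)] = 2 * freqs[a] * freqs[b] * (1 - theta)
    return probs


freqs = {"A": 0.5, "B": 0.3, "C": 0.2}  # hypothetical allele frequencies
hwe = genotype_probs(freqs)
corrected = genotype_probs(freqs, theta=0.03)
for genotype in hwe:
    print(genotype, round(hwe[genotype], 4), round(corrected[genotype], 4))
```

A positive $\theta$ moves probability mass from heterozygotes to homozygotes while the genotype probabilities still sum to one.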
In the following chapter, a paper discussing the $\theta$-correction in relation to a DNA reference profile database is presented. In that setting only one database is available, hence there is no point of reference for the extent to which a particular subsampled database differs in allelic constitution from another. Therefore different means of estimating $\theta$ need to be considered. In the setting above $\theta$ was a measure of subpopulation structure in a larger database, whereas in the subsequent setting $\theta$ is a measure of correlation between gametes (within and between individuals). Hence, by making pairwise comparisons of all individuals in the database we may be able to quantify $\theta$ by analysing the difference between expected and observed counts of matching loci.
CHAPTER 3
Analysis of matches and partial-matches
in Danish DNA reference profile database
Publication details
Co-authors: Poul Svante Eriksen
, James Curran
Department of Statistics
University of Auckland
$$\sum_{i=1}\;\sum_{j>i}^{n} M(G_i, G_j), \qquad (3.1)$$
which corresponds to $N = \binom{n}{2} = n(n-1)/2$ pairwise comparisons of $n$ DNA profiles. With the database size of $n = 51{,}517$ this results in $N = 1{,}326{,}974{,}886$ comparisons.
The result of analysing the Danish database with $n = 51{,}517$ DNA profiles is summarised in Table 3.1, where $M_{m/p}$ corresponds to the number of pairs with $m$ matching loci and $p$ partially-matching loci. From Table 3.1 we find that e.g. the number of pairs of profiles with 5 matching loci and 4 partially-matching loci out of ten autosomal loci is $M_{5/4} = 17{,}060$. Figure 3.1 shows the summary statistic in an informative way, where we have plotted the observed counts on $\log_{10}$-scale.
Two of the authors (T. Tvedebrink and J. Curran) implemented computationally efficient functions for constructing the M-table in the statistical software R (R Development Core Team, 2009). The compare-function from the DNAtools-package (Curran and Tvedebrink, 2010b) took less than 5 minutes to perform all 1,326,974,886 pairwise comparisons on a 2.50 GHz laptop computer. Most of the methodology in this paper has been implemented in the DNAtools-package together with specialised plotting functions. The package is described in more detail elsewhere (Curran and Tvedebrink, 2010a).
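To make the construction of the M-table concrete, the following is a minimal sketch of the pairwise comparison. It is a naive quadratic-time illustration and not the implementation used in DNAtools; the function names, the profile representation (a list of allele pairs, one pair per locus) and the toy data are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def shared_alleles(g1, g2):
    """Number of alleles (0, 1 or 2) two single-locus genotypes share."""
    return sum((Counter(g1) & Counter(g2)).values())


def summary_matrix(profiles):
    """M[m][p] = number of pairs with m matching and p partially-matching loci."""
    n_loci = len(profiles[0])
    M = [[0] * (n_loci + 1) for _ in range(n_loci + 1)]
    for prof_a, prof_b in combinations(profiles, 2):
        shared = [shared_alleles(x, y) for x, y in zip(prof_a, prof_b)]
        m = shared.count(2)  # loci where both alleles agree
        p = shared.count(1)  # loci where exactly one allele agrees
        M[m][p] += 1
    return M


# Three toy profiles typed at two loci.
profiles = [
    [(1, 2), (3, 3)],
    [(2, 1), (3, 4)],
    [(5, 6), (3, 3)],
]
for row in summary_matrix(profiles):
    print(row)
```

Each of the $n(n-1)/2$ pairs contributes to exactly one cell, so the cells of the resulting table sum to the number of comparisons.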
Table 3.1: Summary matrix M for the Danish reference DNA profile database with 51,517 DNA profiles. $M_{m/p}$ is the number of pairs of profiles with $m$ matching (where $m$ is the row number) and $p$ partially-matching (where $p$ is the column number) loci. Owing to lack of space the font size is reduced for the least interesting part of the table (low number of matching loci).
m \ p  0  1  2  3  4  5  6  7  8  9  10
0  906,881  8,707,969  37,632,872  96,157,037  160,570,778  182,820,115  143,627,613  76,852,119  26,786,782  5,486,572  501,671
1  1,100,493  9,484,061  36,229,766  80,292,877  113,733,413  106,635,954  66,164,365  26,183,818  5,992,415  604,900
2  595,135  4,531,792  14,996,133  28,165,271  32,810,688  24,271,278  11,132,519  2,887,555  325,493
3  188,146  1,237,733  3,467,281  5,353,738  4,913,791  2,683,854  805,798  103,305
4  38,094  212,192  487,484  592,929  401,832  143,202  21,490
5  5,114  23,490  42,459  37,933  17,060  3,100
6  470  1,685  2,272  1,414  378
7  26  96  91  64
8  3  6  21
9  0  0
10  0
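As a quick consistency check, the cell counts of Table 3.1 can be summed and compared with the number of pairwise comparisons, $N = n(n-1)/2$. The sketch below simply transcribes the table; the variable names are chosen for illustration.

```python
# Rows m = 0, ..., 10 of Table 3.1; row m holds the columns p = 0, ..., 10 - m.
TABLE_3_1 = [
    [906_881, 8_707_969, 37_632_872, 96_157_037, 160_570_778, 182_820_115,
     143_627_613, 76_852_119, 26_786_782, 5_486_572, 501_671],
    [1_100_493, 9_484_061, 36_229_766, 80_292_877, 113_733_413, 106_635_954,
     66_164_365, 26_183_818, 5_992_415, 604_900],
    [595_135, 4_531_792, 14_996_133, 28_165_271, 32_810_688, 24_271_278,
     11_132_519, 2_887_555, 325_493],
    [188_146, 1_237_733, 3_467_281, 5_353_738, 4_913_791, 2_683_854,
     805_798, 103_305],
    [38_094, 212_192, 487_484, 592_929, 401_832, 143_202, 21_490],
    [5_114, 23_490, 42_459, 37_933, 17_060, 3_100],
    [470, 1_685, 2_272, 1_414, 378],
    [26, 96, 91, 64],
    [3, 6, 21],
    [0, 0],
    [0],
]

n = 51_517
n_comparisons = n * (n - 1) // 2
total = sum(sum(row) for row in TABLE_3_1)
print(total, n_comparisons, total == n_comparisons)
```

The total of the cells equals the number of comparisons, which is the linear constraint on M that makes its covariance matrix rank deficient.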
[Figure 3.1: x-axis: Match/Partial match (categories 0/0, 0/1, ..., 10/0); y-axis: Counts on $\log_{10}$-scale ($1$ to $10^8$).]
Figure 3.1: Plot of observed counts versus the number of matching and partially-matching loci (counts on $\log_{10}$-scale) for the Danish database. The superimposed points represent the expected counts (under the model described in Section 3.2.2) and the vertical bars indicate an approximate 95%-confidence interval computed by $N\pi \pm 2\sqrt{\mathrm{diag}\{\Sigma(\theta)\}}$ (see Sections 3.3 and 3.3.2).
3.2 Materials and methods 41
3.2.2 Population genetic model
The model proposed by Weir (2004, 2007) defines for each of the $L$ loci three probabilities ($P_{0/0}$, $P_{0/1}$, $P_{1/0}$), which are the probabilities for two randomly selected profiles sharing none, one or both alleles at a given locus (Weir denoted the probabilities $P_0$, $P_1$, $P_2$; the change of subscript will hopefully be clear in the following). The probabilities $P_{m/p}$ depend on the coancestry coefficient $\theta$ through the match probability equations (Nichols and Balding, 1991) that are derived using the recursion formula $P(A_i \mid x^n) = [x^n_i\theta + (1-\theta)p_i]/[1 + (n-1)\theta]$, which is the probability of observing an allele of type $A_i$ given that $x^n_i$ alleles of type $A_i$ were observed among $n$ previously sampled alleles.
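The recursion formula is easy to evaluate directly. The sketch below returns the conditional probability of each allele type given the counts of previously sampled alleles; the function name and the example frequencies are hypothetical.

```python
def next_allele_probs(counts, freqs, theta):
    """P(A_i | x^n) for every allele type, given the counts x^n seen so far."""
    n = sum(counts.values())
    denom = 1.0 + (n - 1) * theta
    return {
        allele: (counts.get(allele, 0) * theta + (1.0 - theta) * p) / denom
        for allele, p in freqs.items()
    }


freqs = {"A": 0.5, "B": 0.3, "C": 0.2}  # hypothetical allele frequencies
seen = {"A": 2, "B": 1}                 # three alleles sampled so far
probs = next_allele_probs(seen, freqs, theta=0.03)
print({allele: round(p, 4) for allele, p in probs.items()})
```

Alleles that have already been observed become more probable than their population frequency, and for $\theta = 0$ the formula returns the frequencies unchanged.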
The expected values associated with the observed counts in M under this model are computed as $N\pi$, where $\pi = \{\pi_{m/p}\}_{m,p}$ is the matrix of probabilities for the match/partial-match events $(m, p)$. The elements of $\pi$, $\pi_{m/p}$, $m = 0, \ldots, L$; $p = 0, \ldots, L-m$, may be computed using recursion over loci: Let $\pi^{\ell}_{m/p}$ denote the probability based on $\ell$ loci, i.e. using only a subset of size $\ell$ of the $L$ loci. Then the following equation shows how to compute $\pi^{\ell+1}_{m/p}$ for $\ell = 1, \ldots, L-1$:
$$\pi^{\ell+1}_{m/p} = P^{\ell+1}_{0/0}\,\pi^{\ell}_{m/p} + P^{\ell+1}_{0/1}\,\pi^{\ell}_{m/p-1} + P^{\ell+1}_{1/0}\,\pi^{\ell}_{m-1/p}, \qquad (3.2)$$
where the sum of the subscripts for each term on the right hand side equals the subscript on the left hand side, and $P^{\ell}_{m/p}$ refers to the $P_{m/p}$ probabilities for the $\ell$th added locus. When either $m = 0$ and/or $p = 0$ we have these boundary equations:
$$\pi^{\ell+1}_{0/0} = P^{\ell+1}_{0/0}\,\pi^{\ell}_{0/0}, \quad \pi^{\ell+1}_{0/p} = P^{\ell+1}_{0/0}\,\pi^{\ell}_{0/p} + P^{\ell+1}_{0/1}\,\pi^{\ell}_{0/p-1} \quad\text{and}\quad \pi^{\ell+1}_{m/0} = P^{\ell+1}_{0/0}\,\pi^{\ell}_{m/0} + P^{\ell+1}_{1/0}\,\pi^{\ell}_{m-1/0},$$
where $\pi^{1}_{1/0} = P^{1}_{1/0}$, $\pi^{1}_{0/1} = P^{1}_{0/1}$ and $\pi^{1}_{0/0} = P^{1}_{0/0}$. These equations are easily implemented in computer software and efficiently compute the expected numbers for various $\theta$-values.
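The recursion over loci in (3.2) can be sketched as follows. The function name and the per-locus probabilities in the example are hypothetical; the code starts from an empty set of loci (probability one for the event 0/0), which after the first locus gives the same initial values as stated above.

```python
def match_table(locus_probs):
    """pi[m][p] from per-locus triples (P_0/0, P_0/1, P_1/0) via recursion (3.2)."""
    n_loci = len(locus_probs)
    pi = [[0.0] * (n_loci + 1) for _ in range(n_loci + 1)]
    pi[0][0] = 1.0  # no loci compared yet
    for p00, p01, p10 in locus_probs:
        new = [[0.0] * (n_loci + 1) for _ in range(n_loci + 1)]
        for m in range(n_loci + 1):
            for p in range(n_loci + 1 - m):
                value = p00 * pi[m][p]            # new locus shares no alleles
                if p > 0:
                    value += p01 * pi[m][p - 1]   # new locus is a partial match
                if m > 0:
                    value += p10 * pi[m - 1][p]   # new locus is a full match
                new[m][p] = value
        pi = new
    return pi


# Two loci with identical, hypothetical probabilities.
table = match_table([(0.5, 0.3, 0.2), (0.5, 0.3, 0.2)])
for row in table:
    print([round(v, 4) for v in row])
```

Multiplying the resulting table by $N$ gives the expected cell counts; the boundary equations are handled by the two `if` conditions.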
Weir (2007) focused in his survey paper primarily on comparison between the observed counts and the expected number, $N\pi(\theta)$, for different values of $\theta$. However, as Curran et al. (2007) discussed, one needs to consider normalisation of these differences for a proper comparison between the observed and expected counts. In this paper we show how to compute the covariance matrix of M in order to make a more rigorous comparison taking the correlation between cell counts into consideration.
Close relatedness
Weir (2007) showed that for a specified family relationship of a pair of profiles, $P_{m/p}$ is updated to $\tilde P_{m/p}$ using the probabilities, $k_I$, that the two individuals share $I$ alleles identical-by-descent (IBD):
$$\tilde P_{0/0} = k_0 P_{0/0}, \quad \tilde P_{0/1} = k_1(1-\theta)(1-S_2) + k_0 P_{0/1} \quad\text{and}\quad \tilde P_{1/0} = k_2 + k_1[\theta + (1-\theta)S_2] + k_0 P_{1/0},$$
where $S_2 = \sum_{i=1}^{K} p_i^2$
10
6
(Curran
et al., 2007, -estimate in caption of Fig. 1).
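The IBD-based update of the per-locus probabilities can be sketched as below. The IBD coefficients $(k_0, k_1, k_2)$ are the standard textbook values for the relationships considered in this chapter and are not taken from the text; the function name and the numerical inputs are likewise hypothetical.

```python
# Standard IBD coefficients (k0, k1, k2); assumed textbook values.
IBD = {
    "FS": (0.25, 0.50, 0.25),  # full-siblings
    "PC": (0.00, 1.00, 0.00),  # parent-child
    "AV": (0.50, 0.50, 0.00),  # avuncular
    "1C": (0.75, 0.25, 0.00),  # first-cousins
    "UN": (1.00, 0.00, 0.00),  # unrelated
}


def related_locus_probs(P, k, theta, S2):
    """Update (P_0/0, P_0/1, P_1/0) for a pair with IBD coefficients k."""
    P00, P01, P10 = P
    k0, k1, k2 = k
    return (
        k0 * P00,
        k1 * (1 - theta) * (1 - S2) + k0 * P01,
        k2 + k1 * (theta + (1 - theta) * S2) + k0 * P10,
    )


P_unrelated = (0.6, 0.3, 0.1)  # hypothetical single-locus probabilities
for name, k in IBD.items():
    updated = related_locus_probs(P_unrelated, k, theta=0.03, S2=0.2)
    print(name, [round(v, 4) for v in updated])
```

The updated probabilities still sum to one, and for unrelated pairs the input is returned unchanged.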
These probabilities might be used in relation to crime cases where a suspect, $S$, declares that a close relative is the culprit, $C$. Let $G_S$ be the suspect's profile (known to the investigator) and $G_C$ the profile of the culprit (unknown, but may be identical to $G_S$). For some crime cases the defence may claim that the circumstances of the crime are such that the true offender is a close relative of $S$. Given a specific familial relationship, $r$, it is possible to compute the probability that $S$ and $C$ share the same DNA profile. We need to distinguish between the situations of $G_S$ being heterozygous or homozygous, and let $P(G_C = A_iA_j \mid G_S = A_iA_j, R = r)$ and $P(G_C = A_iA_i \mid G_S = A_iA_i, R = r)$ denote these probabilities, where $r$ is the specified familial relationship of $C$ and $S$. Furthermore, the information about $r$ implies knowledge of $k$, which gives these expressions for the two probabilities:
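$$\begin{aligned}
P(G_C = A_iA_j \mid G_S = A_iA_j, R = r) &= k_2 + \frac{k_1}{2}\left[P(A_i \mid A_j, A_iA_j) + P(A_j \mid A_i, A_iA_j)\right] + k_0 P(A_iA_j \mid A_iA_j) \\
&= k_2 + \frac{k_1}{2}\,\frac{2\theta + (1-\theta)(p_i + p_j)}{1 + 2\theta} + 2k_0\,\frac{\theta^2 + \theta(1-\theta)(p_i + p_j) + (1-\theta)^2 p_ip_j}{(1 + 2\theta)(1 + \theta)}
\end{aligned} \qquad (3.4)$$
$$\begin{aligned}
P(G_C = A_iA_i \mid G_S = A_iA_i, R = r) &= k_2 + k_1 P(A_i \mid A_i, A_iA_i) + k_0 P(A_iA_i \mid A_iA_i) \\
&= k_2 + k_1\,\frac{3\theta + (1-\theta)p_i}{1 + 2\theta} + k_0\,\frac{6\theta^2 + 5\theta(1-\theta)p_i + (1-\theta)^2 p_i^2}{(1 + 2\theta)(1 + \theta)}
\end{aligned} \qquad (3.5)$$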
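The closed-form expressions (3.4) and (3.5) translate directly into code. In the sketch below the function names and the example allele frequencies are hypothetical, and the IBD coefficients for full-siblings are the standard values $(\tfrac14, \tfrac12, \tfrac14)$, which are assumed rather than taken from the text.

```python
def match_prob_het(p_i, p_j, theta, k):
    """Eq. (3.4): P(G_C = A_i A_j | G_S = A_i A_j, R = r)."""
    k0, k1, k2 = k
    one_ibd = (2 * theta + (1 - theta) * (p_i + p_j)) / (1 + 2 * theta)
    none_ibd = (
        2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j)
        / ((1 + theta) * (1 + 2 * theta))
    )
    return k2 + 0.5 * k1 * one_ibd + k0 * none_ibd


def match_prob_hom(p_i, theta, k):
    """Eq. (3.5): P(G_C = A_i A_i | G_S = A_i A_i, R = r)."""
    k0, k1, k2 = k
    one_ibd = (3 * theta + (1 - theta) * p_i) / (1 + 2 * theta)
    none_ibd = (
        (2 * theta + (1 - theta) * p_i) * (3 * theta + (1 - theta) * p_i)
        / ((1 + theta) * (1 + 2 * theta))
    )
    return k2 + k1 * one_ibd + k0 * none_ibd


full_siblings = (0.25, 0.5, 0.25)  # assumed standard (k0, k1, k2)
unrelated = (1.0, 0.0, 0.0)
print(match_prob_het(0.1, 0.2, 0.03, full_siblings))
print(match_prob_het(0.1, 0.2, 0.03, unrelated))
print(match_prob_hom(0.1, 0.03, full_siblings))
```

The products in `none_ibd` expand to the polynomial numerators of (3.4) and (3.5); for $\theta = 0$ and unrelated individuals the functions reduce to $2p_ip_j$ and $p_i^2$.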
If the suspect is not the true culprit, then the probability that $G_S \equiv G_C$ (share the same DNA profile) is given by $\pi_{10/0}$. For the five types of relatedness considered here, the probabilities are plotted in the right-most category in Figure 3.2 for $\theta = 0.03$.
3.3 Results
3.3.1 Simulations
We used the model discussed above to simulate DNA profile databases with known allele frequencies (the estimated allele frequencies from the Danish database) and various values for $\theta$. For a specified number of DNA profiles, we used the recursive formula of Nichols and Balding (1991) for individuals only remotely related, $P(A_i \mid x^n) = [x^n_i\theta + (1-\theta)p_i]/[1 + (n-1)\theta]$, to simulate alleles with a correlation governed by $\theta$, where $p_i$ in the formula is the allele frequency of allele $A_i$ and the vector $x^n = (x^n_1, \ldots, x^n_K)$ is the sufficient summary statistic (Tvedebrink, 2010). In order to take close relationships among the individuals into consideration, we simulated the number of individuals with a specified relationship $n_R = (n_{FS}, n_{1C}, n_{PC}, n_{AV}, n - n_+)$, where all $n_r$ are even numbers. The subscripts relate to full-siblings (FS), first-cousins (1C), parent-child (PC) and avuncular (AV). The last entry in $n_R$ refers to the remaining number of unrelated DNA profiles (UN). Since the comparisons $M(G_i, G_j)$ only consider pairs of profiles, the closely related DNA profiles are simulated in pairs such that:
1. Simulate the first relative $R_1$: $R_1 \sim P(A_iA_j \mid x^n) = P(A_i \mid x^{n+1})P(A_j \mid x^n)$, where $x^{n+1} = x^n + e_j$ and $e_j$ is a vector of zeros except for a one in entry $j$.
2. Simulate the number of alleles the second relative $R_2$ shares IBD with $R_1$: $I \sim P(k)$.
3. Profile $R_2$ is simulated conditioned on the value of $I$:
I = 0: $R_2$ is generated unrelated to $R_1$: $R_2 \sim P(A_kA_l \mid A_iA_j, x^n)$, and may be identical (by state) to $R_1$.
I = 1: The first allele of $R_2$ is drawn randomly from the alleles of $R_1$, e.g. $A_i$ is sampled. The second allele is then sampled from $P(A_k \mid A_i, A_iA_j, x^n)$.
I = 2: $R_2$ is identical to $R_1$. Note that only full-siblings have this possibility in our simulations.
By using this sampling scheme we make $n_r/2$ pairwise comparisons for relatedness on level $r$, since all other pairs of simulated relatives are mutually unrelated to each other. Hence, the known vector $p_r = \{P(R = r)\}_{r \in R}$ is for each simulated database:
$$p_r = \left(\frac{n_{FS}}{n(n-1)},\ \frac{n_{1C}}{n(n-1)},\ \frac{n_{PC}}{n(n-1)},\ \frac{n_{AV}}{n(n-1)},\ 1 - \frac{n_+}{n(n-1)}\right)$$
From the expressions above it is clear that for increasing database sizes the number of comparisons between relatives is $o(n^2)$. However, the impact on M depends on the product of the matching probabilities and the fraction of comparisons, $\pi_r p_r$. Mueller (2008) argued that the number of full-sibling pairs in the Arizonian database ($n = 65{,}493$) needed to be between 1,000 and 3,000 pairs. This gives that the fraction of pairwise comparisons attributed to full-siblings is between $4.73 \times 10^{-7}$ and $1.42 \times 10^{-6}$ for the Arizonian database.
In the formulation of Weir (2004, 2007) $\theta$ was assumed constant across loci. However, this need not be the case due to different mutation rates, and possibly selection or indirect selection by linkage to other genes/markers subject to selection (Tvedebrink, 2010). In our simulations we used a constant $\theta$ across loci for simplicity. For each simulated database we estimated $\theta$ using five optimisation criteria:
$$C_1(\theta) = \sum (M - N\pi(\theta))^2, \quad C_2(\theta) = \sum \frac{(M - N\pi(\theta))^2}{N\pi(\theta)}, \quad C_3(\theta) = \sum \frac{|M - N\pi(\theta)|}{N\pi(\theta)} \qquad (3.6)$$
$$T_1(\theta) = \sum \frac{(M - N\pi(\theta))^2}{\mathrm{diag}\{\Sigma(\theta)\}}, \quad T_2(\theta) = \{M - N\pi(\theta)\}^{\top}\,\Sigma(\theta)^{-}\,\{M - N\pi(\theta)\}, \qquad (3.7)$$
where summation is over the vector entries. The object functions in (3.6) were investigated by Curran et al. (2007) as a means to compare the expected and observed counts. The authors argued that numerical work indicated that $C_3(\theta)$ yielded good results since special emphasis is placed on the upper tail of the distribution (large number of matching loci). The functions in (3.7) use the covariance matrix, $\Sigma(\theta)$, computed in this paper (cf. below). The first function, $T_1(\theta)$, does not take correlations into account, whereas $T_2(\theta)$ is a natural measure of similarity (a so-called Mahalanobis distance) incorporating the covariance matrix.
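The element-wise criteria can be sketched in a few lines. The code assumes that $C_1$ is the sum of squared differences between observed and expected counts, which is a reading of (3.6) rather than a confirmed definition; $T_2$ is omitted because it requires a generalised matrix inverse. The function name and the toy counts are hypothetical.

```python
def object_functions(observed, expected, variances):
    """Return (C1, C2, C3, T1) for vectors of observed and expected counts."""
    diffs = [o - e for o, e in zip(observed, expected)]
    c1 = sum(d * d for d in diffs)                            # assumed squared form
    c2 = sum(d * d / e for d, e in zip(diffs, expected))      # chi-square type
    c3 = sum(abs(d) / e for d, e in zip(diffs, expected))     # relative absolute error
    t1 = sum(d * d / v for d, v in zip(diffs, variances))     # variance-scaled
    return c1, c2, c3, t1


observed = [10, 5]       # toy cell counts
expected = [8.0, 7.0]    # toy expected counts N * pi(theta)
variances = [4.0, 2.0]   # toy diagonal of Sigma(theta)
print(object_functions(observed, expected, variances))
```

An estimate of $\theta$ would then be obtained by evaluating the chosen criterion over a grid of $\theta$-values and taking the minimiser.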
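Let $\boldsymbol{M}$ be the M-matrix written in vector format (see Appendix 3.A for details on the transformation). We derived the expression for the variance of $\boldsymbol{M}$, $\Sigma(\theta)$, such that $T_2(\theta) = \{N\pi(\theta) - \boldsymbol{M}\}^{\top}\,\Sigma(\theta)^{-}\,\{N\pi(\theta) - \boldsymbol{M}\}$ may be compared for various values of $\theta$ in order to obtain the minimal $T_2(\theta)$. We use the generalised inverse of $\Sigma(\theta)$ since $\Sigma(\theta)$ is not of full rank due to the linear constraint $N = M_{+/+}$, where the +-notation indicates summation over the index. Let all the DNA profile identifiers, $(i_1, i_2, i_3, i_4)$, be different, then the variance is computed as:
$$\Sigma(\theta) = \binom{n}{2}\mathbb{V}\left[\boldsymbol{M}(G_{i_1}, G_{i_2})\right] + 6\binom{n}{3}\mathbb{C}\left[\boldsymbol{M}(G_{i_1}, G_{i_2}), \boldsymbol{M}(G_{i_1}, G_{i_3})\right] + 6\binom{n}{4}\mathbb{C}\left[\boldsymbol{M}(G_{i_1}, G_{i_2}), \boldsymbol{M}(G_{i_3}, G_{i_4})\right], \qquad (3.8)$$
where the covariances $\mathbb{C}\left[\boldsymbol{M}(G_{i_1}, G_{i_2}), \boldsymbol{M}(G_{i_1}, G_{i_3})\right]$ and $\mathbb{C}\left[\boldsymbol{M}(G_{i_1}, G_{i_2}), \boldsymbol{M}(G_{i_3}, G_{i_4})\right]$ are the most involved terms to compute since $\mathbb{V}\left[\boldsymbol{M}(G_{i_1}, G_{i_2})\right] = \mathrm{diag}\{\pi(\theta)\} - \pi(\theta)\pi(\theta)^{\top}$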
[Figure 3.3: legend entry: Extreme observations (simulations).]
Figure 3.3: Box plots of the cell counts (on $\log_{10}$-scale) for the various categories for 1,000 simulated databases with 10,000 DNA profiles and $\theta = 0.03$. The legend explains the plot characters.
[Figure 3.4: panels $\theta_0 = 0.00, 0.01, 0.02, 0.03, 0.04$; x-axis: object functions $C_1$, $C_2$, $C_3$, $T_1$, $T_2$; y-axis: $\hat\theta - \theta_0$ (from $-0.04$ to $0.04$).]
Figure 3.4: Comparisons of the performance of the object functions in (3.6) and (3.7).
Table 3.3: Mean square errors for the five different measures of similarity stratified on $\theta$.
                   $C_1(\theta)$    $C_2(\theta)$    $C_3(\theta)$    $T_1(\theta)$    $T_2(\theta)$
$\theta$ = 0.00    1.072 × 10^-7    1.136 × 10^-7    1.078 × 10^-7    1.077 × 10^-7    1.205 × 10^-7
$\theta$ = 0.01    3.432 × 10^-5    3.418 × 10^-5    3.457 × 10^-5    3.280 × 10^-5    3.264 × 10^-5
$\theta$ = 0.02    7.509 × 10^-5    7.456 × 10^-5    7.601 × 10^-5    7.538 × 10^-5    7.460 × 10^-5
$\theta$ = 0.03    1.213 × 10^-4    1.205 × 10^-4    1.231 × 10^-4    1.222 × 10^-4    1.208 × 10^-4
$\theta$ = 0.04    1.711 × 10^-4    1.697 × 10^-4    1.730 × 10^-4    1.727 × 10^-4    1.702 × 10^-4
Overall            8.034 × 10^-5    7.977 × 10^-5    8.132 × 10^-5    8.061 × 10^-5    7.963 × 10^-5
Simulations including close relatives
The simulations in the previous section only considered remote relatedness through allelic correlation governed by $\theta$. However, most realistic reference DNA profile databases will contain DNA profiles from closely related individuals, e.g. brothers and father-son pairs. Hence, we also investigated the performance of the $C(\theta)$ and $T(\theta)$-functions for databases with pairs of close relatives. For each $\theta$-value we simulated databases with the number of relatives as specified in Table 3.4.
As in the previous section we want to minimise the deviation between the observed and expected counts. However, for these simulations the expected value depends on $\theta$ and $p_r$ through the expression $E(M; \theta, p_r) = \sum_{r \in R} P(R = r)E(M \mid \theta; R = r) = \sum_{r \in R} p_r N\pi_r$, as discussed in relation to (3.3). Let $\tilde C(\theta)$ and $\tilde T(\theta)$ be as in (3.6) and (3.7), but with $N\pi(\theta)$ replaced by $\sum_{r \in R} p_r N\pi_r$, then we seek $(\hat\theta, \hat p_r) = \arg\min_{(\theta, p_r)} \tilde F(\theta)$ for $\tilde F$ being either $\tilde C$ or $\tilde T$.
Table 3.4: The number of simulated relatives for the various $\theta$-values with a total of 10,000 DNA profiles. The numbers in brackets are the relative frequency of pairwise comparisons between DNA profiles with the specified relationship, i.e. the known P(r)-values.
Full-siblings       First-cousins       Parent-child        Avuncular           Unrelated
2,000 (2 × 10^-5)   2,000 (2 × 10^-5)   2,000 (2 × 10^-5)   2,000 (2 × 10^-5)   2,000 (0.99992)
5,000 (5 × 10^-5)   1,000 (1 × 10^-5)   1,000 (1 × 10^-5)   1,000 (1 × 10^-5)   2,000 (0.99992)
1,000 (1 × 10^-5)   5,000 (5 × 10^-5)   1,000 (1 × 10^-5)   1,000 (1 × 10^-5)   2,000 (0.99992)
1,000 (1 × 10^-5)   1,000 (1 × 10^-5)   5,000 (5 × 10^-5)   1,000 (1 × 10^-5)   2,000 (0.99992)
1,000 (1 × 10^-5)   1,000 (1 × 10^-5)   1,000 (1 × 10^-5)   5,000 (5 × 10^-5)   2,000 (0.99992)
It should be noted that for consistency the variance of M should in this case be computed as
$\Sigma(\theta) = E(V(M \mid R)) + V(E(M \mid R))$. However, we argue that the complexity and cost in computing
$10^{-7}$, which is the approximate value obtained if one assumes that every individual of the Danish adult population has exactly one full-sibling. However, it is likely that the frequency of full-siblings is larger in the reference database than in the population due to various factors, e.g. the police's sampling criteria and social factors. Inserting these values in $\pi$ and $\Sigma(\theta)$ gives the expected values and covariance matrix, and given these quantities we computed marginal 95%-confidence intervals (superimposed in Figure 3.1).
[Figure 3.5: panels $\theta = 0.00, 0.01, 0.02, 0.03, 0.04$; x-axis: object functions $\tilde C_1$, $\tilde C_2$, $\tilde C_3$, $\tilde T_1$, $\tilde T_2$; axis ticks 0.00, 0.05, 0.10; legend: P(Full-siblings), P(First-cousins), P(Parent-child), P(Avuncular).]
Figure 3.5: Box plot of the differences between $\theta_0$ and $\hat\theta$ (with $\theta$ replaced for the relevant parameters) for various $\theta$-values and number of relatives in the simulated databases.
The argument for using the $\theta$-correction when assessing the evidential weight of a given DNA profile is to adjust for possible subpopulation effects in the population from which the suspect and profiles for estimating allele probabilities are drawn. A structured population causes the probability of observing a specific DNA profile to be heterogeneous, since the prevalence of its constituting alleles may be higher in some subpopulation relative to the entire population. Taking the argument further, one could argue that adjustment should be made for close relatedness between the suspect and the "random man". Hence, when forming the likelihood ratio, LR, the hypothesis in the denominator could be $H_d$: "A man possibly related to the suspect is the true donor of the biological stain". The evaluation of $P(E \mid H_d)$ would then be a sum $\sum_{r \in R} P(E \mid H_d, R = r)P(R = r)$, where $(H_d, R = r)$ concretises the specific relationship $r$ between suspect and culprit.
Table 3.5: Mean squared errors (MSE) for various number of relatives stratified by $\theta$-values.
Parameter          $C_1(\theta)$     $C_2(\theta)$     $C_3(\theta)$     $T_1(\theta)$     $T_2(\theta)$
$\theta$ = 0.00    1.354 × 10^-7     1.316 × 10^-7     1.539 × 10^-7     2.514 × 10^-4     1.202 × 10^-7
P(FS)              1.008 × 10^-11    1.897 × 10^-8     1.683 × 10^-9     1.653 × 10^-10    5.238 × 10^-6
P(1C)              4.813 × 10^-6     8.078 × 10^-6     5.762 × 10^-7     1.505 × 10^-10    3.630 × 10^-6
P(PC)              6.835 × 10^-11    5.191 × 10^-10    1.362 × 10^-10    4.788 × 10^-9     7.962 × 10^-7
P(AV)              1.835 × 10^-8     7.247 × 10^-8     6.895 × 10^-9     1.693 × 10^-10    7.217 × 10^-8
$\theta$ = 0.01    3.221 × 10^-5     3.273 × 10^-5     3.049 × 10^-5     2.032 × 10^-4     2.984 × 10^-5
P(FS)              2.878 × 10^-11    8.088 × 10^-8     8.669 × 10^-10    1.453 × 10^-10    1.995 × 10^-6
P(1C)              3.825 × 10^-5     6.430 × 10^-5     7.402 × 10^-6     1.431 × 10^-10    6.560 × 10^-7
P(PC)              1.708 × 10^-10    1.428 × 10^-9     2.053 × 10^-10    8.610 × 10^-9     1.052 × 10^-6
P(AV)              7.710 × 10^-9     3.483 × 10^-9     1.590 × 10^-8     3.275 × 10^-9     6.465 × 10^-6
$\theta$ = 0.02    7.006 × 10^-5     7.165 × 10^-5     6.542 × 10^-5     2.472 × 10^-4     6.853 × 10^-5
P(FS)              4.029 × 10^-11    1.727 × 10^-8     1.131 × 10^-9     1.719 × 10^-10    6.694 × 10^-7
P(1C)              7.252 × 10^-5     1.055 × 10^-4     9.885 × 10^-6     1.050 × 10^-9     3.598 × 10^-7
P(PC)              1.976 × 10^-10    7.711 × 10^-10    2.924 × 10^-10    1.024 × 10^-8     5.935 × 10^-7
P(AV)              4.547 × 10^-9     5.264 × 10^-10    1.201 × 10^-8     1.250 × 10^-8     5.160 × 10^-6
$\theta$ = 0.03    1.232 × 10^-4     1.263 × 10^-4     1.153 × 10^-4     2.237 × 10^-4     1.213 × 10^-4
P(FS)              5.841 × 10^-11    5.695 × 10^-9     7.739 × 10^-10    1.714 × 10^-10    7.768 × 10^-7
P(1C)              1.688 × 10^-4     1.649 × 10^-4     1.854 × 10^-5     2.294 × 10^-7     3.728 × 10^-7
P(PC)              2.294 × 10^-10    3.160 × 10^-10    3.701 × 10^-10    1.490 × 10^-8     7.187 × 10^-7
P(AV)              3.361 × 10^-10    5.053 × 10^-10    5.101 × 10^-8     6.679 × 10^-9     4.757 × 10^-6
$\theta$ = 0.04    1.661 × 10^-4     1.698 × 10^-4     1.532 × 10^-4     2.839 × 10^-4     1.463 × 10^-4
P(FS)              8.469 × 10^-11    1.109 × 10^-9     1.195 × 10^-9     2.560 × 10^-10    1.063 × 10^-6
P(1C)              1.886 × 10^-4     1.542 × 10^-4     2.180 × 10^-5     3.568 × 10^-7     5.585 × 10^-7
P(PC)              2.318 × 10^-10    2.240 × 10^-10    4.195 × 10^-10    1.665 × 10^-8     1.028 × 10^-6
P(AV)              5.348 × 10^-10    8.605 × 10^-11    1.799 × 10^-8     1.416 × 10^-8     5.296 × 10^-6
Table 3.6: Estimated values for the Danish database using various object functions.
Method           $\theta$   P(Full-siblings)   P(First-cousins)   P(Parent-child)   P(Avuncular)
$C_1(\theta)$    0.0000     2.592 × 10^-6      8.413 × 10^-9      1.072 × 10^-12    1.930 × 10^-9
$C_2(\theta)$    0.0000     3.700 × 10^-7      5.100 × 10^-7      1.000 × 10^-8     4.600 × 10^-7
$C_3(\theta)$    0.0000     5.005 × 10^-6      3.534 × 10^-7      6.089 × 10^-13    2.475 × 10^-7
$T_1(\theta)$    0.0125     1.072 × 10^-6      4.573 × 10^-8      5.197 × 10^-5     5.930 × 10^-9
$T_2(\theta)$    0.0107     2.263 × 10^-6      1.757 × 10^-7      1.491 × 10^-6     5.882 × 10^-9
The problem with this approach would be to quantify $P(R = r)$ for a given suspect. One approach could be to take $p_r$ as estimated from the database and then form a weighted sum in the denominator. By doing so for the Danish database with the estimated $\theta$, frequencies for alleles and pairs of relatives we obtained LR and $LR_r$, where $LR_r$ denotes the LR taking close relatives into account:
$$LR_r = \frac{P(E \mid H_p)}{P(E \mid H_d)} = \frac{P(E \mid H_p)}{\sum_{r \in R} P(E \mid H_d, R = r)P(R = r)} = \frac{1}{\sum_{r \in R} P(C \mid S, R = r)P(R = r)},$$
where $P(C \mid S, R = r)$ is computed by multiplying (3.4) and (3.5) over loci.
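The weighted sum in the denominator can be sketched as follows. The function name, the per-locus match probabilities and the relationship probabilities in the example are hypothetical and only illustrate the structure of the computation.

```python
from math import prod


def lr_with_relatives(locus_match_probs, priors):
    """LR_r = 1 / sum_r P(C | S, R = r) P(R = r).

    locus_match_probs[r] holds the per-locus match probabilities for
    relationship r; priors[r] is P(R = r).
    """
    denominator = sum(
        priors[r] * prod(locus_match_probs[r]) for r in priors
    )
    return 1.0 / denominator


# Hypothetical two-locus example with unrelated (UN) and full-sibling (FS) terms.
match_probs = {"UN": [0.1, 0.2], "FS": [0.4, 0.5]}
print(lr_with_relatives(match_probs, {"UN": 1.0}))               # unrelated only
print(lr_with_relatives(match_probs, {"UN": 0.99, "FS": 0.01}))  # with relatives
```

Because relatives match with higher probability than unrelated individuals, giving them positive weight increases the denominator and lowers the LR.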
For each profile in the database we computed LR assuming that the profile was that of a suspect in a single-contributor crime case, i.e. $LR = 1/P(U \mid S)$, where $P(U \mid S)$ is the probability of observing an unknown profile (the defence hypothesis) given the suspect's profile. Similarly we computed $LR_r$ under the same circumstances, except that the unknown profile may be a close relative of $S$.
In Figure 3.6, we have plotted $\log_{10} LR_r$ against $\log_{10} LR$ and see that the relationship is close to linear in $\log_{10} LR$. Estimating the slope and intercept as 0.115 and 8.59, respectively, we obtain a simple formula to calculate $LR_r$ from LR: $LR_r = 10^{8.59} \cdot LR^{0.115}$. In Figure 3.6, we have superimposed the predicted value (solid line) with the uncertainty represented by the predictive interval (dashed lines). The estimated mean and standard deviation of $\log_{10} LR/LR_r$ are respectively 3.128 and 0.97. Hence, an approximate confidence interval for the ratio is given as $10^{3.128 \pm 1.96 \cdot 0.97} = [27 ; 106{,}955]$, i.e. taking close relatives into account decreases the LR by up to five orders of magnitude. The dominating contribution to the sum of $P(E \mid H_d)$ is that of full-siblings, $P(E \mid H_d, R = FS)\,p_{FS}$, which accounts for approximately 99.5% of $LR_r$. In Figure 3.2 this was also the category with the largest $\pi_{10/0}$. Hence, for practical purposes the only relevant type of close relatedness to include in $LR_r$ is full-siblings since the decrease in $P(E \mid H_d, R)$ for the remaining types of relatives is minimal relative to $p_r$. Furthermore, we saw previously that the model only including full-siblings and unrelated increased P(Full-siblings). Thus, this would decrease $LR_r$ further, yielding a more conservative evaluation of the evidence.
3.4 Discussion
It is evident from the analysis of the Danish reference DNA profile database that a $\theta$-correction close to 1% is sufficient to capture the effects from substructure among the typed DNA profiles. Furthermore, the analysis indicated the presence of close relatives in the database, a fact that was known beforehand, although the number of close relatives was unknown. However, the significance of the estimated probabilities, $\hat p_r$, was not assessed, implying that some of them may be zero.
It is unknown whether it makes sense to present the $LR_r$ in court since often the judge and jury are more interested in the LR for a specific relationship rather than a mean over common relationships with numerical impact on $P(E \mid H_d)$. However, $LR_r$ may be used in order to accommodate the fact that the "unrelated man" may in fact be an unknown close relative of the suspect.
[Figure 3.6: x-axis: $\log_{10} LR$ (10 to 18); y-axis: $\log_{10} LR_r$ (9.6 to 10.6); hexagonal bin counts from 1 to 1200.]
Figure 3.6: Relationship between LR and $LR_r$ with a predictive interval superimposed (solid line: mean, dashed lines: predictive limits). The shaded hexagons indicate bin counts.
3.5 Conclusion
The main objective of the work presented in this paper was to analyse the Danish reference DNA profile database of 51,517 different individuals. This was to accommodate the fact that at some point two apparently unrelated individuals in the Danish population will share DNA profiles for all ten loci. If a specified relationship is determined, it is straightforward to calculate the probability of identical DNA profiles; however, one still needs to account for remote coancestry for both related and unrelated pairs of profiles.
Furthermore, modelling only the expected value or calculating the mean is never satisfactory in statistics. A measure of precision or variability is needed in order to discuss the extremity of an observation relative to the expectation under a given model. Hence, deriving and computing the covariance matrix of M was essential. However, the simulations exemplified that there was no pronounced improvement from using the Mahalanobis distance, $T^2(\theta) = [M - N(\theta)]^{\top}\,\Sigma(\theta)^{-}\,[M - N(\theta)]$, rather than the $C(\theta)$-functions for estimating $\theta$.
Acknowledgements
The authors would like to thank Ms. Kirstine Kristensen and Ms. Line Maria Irlund Pedersen (both from The Section of Forensic Genetics, University of Copenhagen) for their assistance in verifying the familial relationships of the twins in the database, and validating some near matches due to typing errors.
Appendix
3.A Derivation and computation of the variance
In order to compute the variance of the summary matrix, we use the definition of variance and covariance for random variables. First, note that the matrix $M(G_i, G_j)$ may be listed as a vector, where the mapping operates on the m/p values: $f(m, p; L) = m[(L + 1) + (m - 1)/2] + (p + 1)$, where L is the total number of loci. Next, we expand the expression $V(M) = \Sigma(\theta)$:

$$
\begin{aligned}
\Sigma(\theta) &= V\Big[\sum_{i=1}^{n-1}\sum_{j>i}^{n} M(G_i, G_j)\Big] \\
&= \sum_{i=1}^{n-1}\sum_{j>i}^{n} V\big[M(G_i, G_j)\big]
 + 6\sum_{i=1}^{n-2}\sum_{j>i}^{n-1}\sum_{k>j}^{n} C\big[M(G_i, G_j), M(G_i, G_k)\big] \\
&\quad + \sum_{i=1}^{n-1}\sum_{j>i}^{n}\ \sum_{k \notin \{i,j\}}^{n-1}\ \sum_{\substack{l>k \\ l \notin \{i,j\}}}^{n} C\big[M(G_i, G_j), M(G_k, G_l)\big] \\
&= \binom{n}{2} V\big[M(G_{i_1}, G_{i_2})\big]
 + 6\binom{n}{3} C\big[M(G_{i_1}, G_{i_2}), M(G_{i_1}, G_{i_3})\big]
 + 6\binom{n}{4} C\big[M(G_{i_1}, G_{i_2}), M(G_{i_3}, G_{i_4})\big],
\end{aligned}
$$

where $(i_1, i_2, i_3, i_4)$ in the last line relates to any of the DNA profiles in the database as long as they are different profiles. We go from the first to the second line by expanding the sum and observing that $C[M(G_i, G_j), M(G_i, G_k)] = C[M(G_i, G_j), M(G_j, G_k)] = C[M(G_i, G_k), M(G_j, G_k)]$ since $M(\cdot, \cdot)$ is symmetric. The sum over the last term in the expansion, $C[M(G_i, G_j), M(G_k, G_l)]$ with all profile indexes different, also contains several symmetries, implying the weights in the final expression. In order to compute the covariances, we need to compute $E[M(G_i, G_j)M(G_i, G_k)^{\top}]$ and $E[M(G_i, G_j)M(G_k, G_l)^{\top}]$, respectively, given that the DNA profile indexes i, j, k and l are all different.
For computing $E[M(G_i, G_j)M(G_i, G_k)^{\top}]$ we need to account for the fact that profile $G_i$ enters in both pairwise comparisons. Hence, we need to condition on $G_i$ when deriving the probabilities $\pi_{m/p,\tilde m/\tilde p} = \sum_{i \le j} P(m/p, \tilde m/\tilde p \mid G_i = A_iA_j)\,P(A_iA_j)$ for all combinations of $m/p$ and $\tilde m/\tilde p$, where $m/p$ relates to the number of matches/partial-matches of $G_i$ and $G_j$, with a similar definition of $\tilde m/\tilde p$ for profiles $G_i$ and $G_k$.
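As for the mean we use a recursion formula over loci to compute $\pi_{m/p,\tilde m/\tilde p}$. However, in this setting there are nine terms on the right hand side:

$$
\begin{aligned}
\pi^{\ell+1}_{m/p,\tilde m/\tilde p} ={}& \pi^{\ell}_{m/p,\tilde m/\tilde p}\,P^{\ell+1}_{0/0,0/0}
 + \pi^{\ell}_{m/p-1,\tilde m/\tilde p}\,P^{\ell+1}_{0/1,0/0}
 + \pi^{\ell}_{m-1/p,\tilde m/\tilde p}\,P^{\ell+1}_{1/0,0/0} \\
&+ \pi^{\ell}_{m/p,\tilde m/\tilde p-1}\,P^{\ell+1}_{0/0,0/1}
 + \pi^{\ell}_{m/p,\tilde m-1/\tilde p}\,P^{\ell+1}_{0/0,1/0}
 + \pi^{\ell}_{m/p-1,\tilde m/\tilde p-1}\,P^{\ell+1}_{0/1,0/1} \\
&+ \pi^{\ell}_{m-1/p,\tilde m/\tilde p-1}\,P^{\ell+1}_{1/0,0/1}
 + \pi^{\ell}_{m/p-1,\tilde m-1/\tilde p}\,P^{\ell+1}_{0/1,1/0}
 + \pi^{\ell}_{m-1/p,\tilde m-1/\tilde p}\,P^{\ell+1}_{1/0,1/0}.
\end{aligned}
$$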
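The nine-term recursion can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used for the thesis: the function name `joint_match_distribution` and the layout of the per-locus tables are assumptions, and the recursion is written in its equivalent "forward" form, where each state at locus $\ell$ distributes its probability over the nine joint events at locus $\ell + 1$. Starting from the empty state makes the boundary conditions implicit.

```python
from itertools import product

# Possible events at one locus for one pairwise comparison,
# coded as (matches, partial-matches): 0/0, 0/1 and 1/0.
EVENTS = [(0, 0), (0, 1), (1, 0)]


def joint_match_distribution(locus_tables):
    """Joint distribution of (m, p, m~, p~) accumulated over loci.

    locus_tables[l][(e1, e2)] is the (assumed given) probability that at
    locus l the comparison (G_i, G_j) shows event e1 while the comparison
    (G_i, G_k) shows event e2.
    """
    dist = {(0, 0, 0, 0): 1.0}  # no loci processed yet
    for table in locus_tables:
        new = {}
        for (m, p, mt, pt), prob in dist.items():
            for e1, e2 in product(EVENTS, repeat=2):  # the nine terms
                key = (m + e1[0], p + e1[1], mt + e2[0], pt + e2[1])
                new[key] = new.get(key, 0.0) + prob * table[(e1, e2)]
        dist = new
    return dist


# Hypothetical table in which all nine joint events are equally likely.
uniform = {(e1, e2): 1.0 / 9.0 for e1 in EVENTS for e2 in EVENTS}
two_loci = joint_match_distribution([uniform, uniform])
```

With ten loci the state space stays small (at most 66 x 66 states), so a dictionary keyed by the four counts is sufficient for illustration.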
When one or more of the subscripts are zero there are similar boundary conditions for $\pi_{m/p,\tilde m/\tilde p}$ as those specified in Section 3.2.2. The probabilities $P_{m/p,\tilde m/\tilde p}$ are found by considering the events separately. For each configuration of $(m/p, \tilde m/\tilde p) \in \{(x_0/y_0, x_1/y_1) : (x_i, y_i) \in \{0,1\}^2,\ 0 \le x_i + y_i \le 1\}$ we compute the probabilities:

$$
P_{m/p,\tilde m/\tilde p} = P(m/p, \tilde m/\tilde p) = \sum_{i \le j} P(m/p, \tilde m/\tilde p \mid G_i = A_iA_j)\,P(A_iA_j).
$$
Each of the probabilities in the sums is expanded such that the events specified by $m/p$ and $\tilde m/\tilde p$ are satisfied, e.g. $m/p = 1/0$ and $\tilde m/\tilde p = 1/0$, implying that both profiles $G_j$ and $G_k$ match the profile $G_i$ on that particular locus:

$$
\begin{aligned}
P(1/0, 1/0) &= \sum_{i<j} P(A_iA_j, A_iA_j \mid A_iA_j)\,P(A_iA_j) + \sum_{i} P(A_iA_i, A_iA_i \mid A_iA_i)\,P(A_iA_i) \\
&= 2\sum_{i,\, j \ne i} P(A_iA_iA_jA_j \mid A_iA_j)\,P(A_iA_j) + \sum_{i} P(A_iA_iA_iA_i \mid A_iA_i)\,P(A_iA_i) \\
&= 4\sum_{i,\, j \ne i} P(A_iA_iA_iA_jA_jA_j) + \sum_{i} P(A_iA_iA_iA_iA_iA_i).
\end{aligned}
$$
From the recursive formula $P(A_i \mid x^n) = [x^n_i\theta + (1 - \theta)p_i]/[1 + (n - 1)\theta]$, we see that the denominator depends only on the total number of sampled alleles, n, and not on which alleles were sampled. Hence, for a probability like $P(A_iA_jA_kA_jA_iA_i)$ that involves six alleles, the denominator will always be $\prod_{n=1}^{5}[1 + (n - 1)\theta]$. Hence, to keep the formulae simple, we only consider the numerator in the following derivations. First, we observe that:

$$
\begin{aligned}
P(A_iA_jA_kA_jA_iA_i) &= P(A_i \mid A_jA_kA_jA_iA_i)\,P(A_jA_kA_jA_iA_i) \\
&= [(\gamma_i - 1)\theta + (1 - \theta)p_i]\,P(A_jA_kA_jA_iA_i) \\
&= (\gamma_i - 1)\theta\,P(A_jA_kA_jA_iA_i) + (1 - \theta)p_i\,P(A_jA_kA_jA_iA_i),
\end{aligned}
\tag{3.9}
$$

where $\gamma_i$ counts the number of $A_i$ alleles in the expression on the left hand side. Now, the term $(\gamma_i - 1)\theta\,P(A_jA_kA_jA_iA_i)$ follows a similar expansion as the left hand side of (3.9). However, the latter term of (3.9) involves $p_i$, which needs to be taken into account when evaluating $P(A_jA_kA_jA_iA_i)$. By following the recursion to the end, that is when the left hand side of (3.9) is, say, $P(A_iA_j) = P(A_i \mid A_j)P(A_j) = [(\gamma_i - 1)\theta + (1 - \theta)p_i]\,p_j = (1 - \theta)p_ip_j$, we end up with terms of the form $a_0\,\theta^{a_1}(1 - \theta)^{a_2}\,p_1^{\gamma_1} \cdots p_K^{\gamma_K}$ for some constants $a = (a_0, a_1, a_2)$ and $\gamma = (\gamma_1, \ldots, \gamma_K)$. The values of a and $\gamma$ are built up during the recursion, hence determining the actual value is only a matter of bookkeeping.
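As a numerical companion to the recursion, the following Python sketch evaluates the probability of an ordered allele sequence by multiplying the conditional probabilities one allele at a time. It is only an illustration under stated assumptions: the function name `sequence_probability` and the dictionary of allele probabilities are hypothetical, the full fraction (numerator and denominator) is used rather than the numerator-only bookkeeping described above, and $0 \le \theta < 1$ is required.

```python
def sequence_probability(alleles, freqs, theta):
    """P(ordered allele sequence) under the theta-correction recursion.

    alleles: sequence of allele labels, e.g. ["i", "j", "k", "j", "i", "i"]
    freqs:   dict mapping allele label to its population probability p
    theta:   coancestry parameter, 0 <= theta < 1
    """
    counts = {}  # how often each allele has been sampled so far
    prob = 1.0
    for n, allele in enumerate(alleles):  # n alleles sampled before this one
        numerator = counts.get(allele, 0) * theta + (1.0 - theta) * freqs[allele]
        denominator = 1.0 + (n - 1) * theta
        prob *= numerator / denominator
        counts[allele] = counts.get(allele, 0) + 1
    return prob


# Hypothetical allele probabilities, for illustration only.
p = {"i": 0.2, "j": 0.3, "k": 0.5}
het = sequence_probability(["i", "j"], p, theta=0.01)  # equals (1 - theta) p_i p_j
```

Because the sampling formula is exchangeable, reordering the alleles in the sequence leaves the probability unchanged, which is what permits collecting terms by allele counts only.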
Furthermore, consider the case where the product of allele probabilities is $p_i^2p_j^2p_k^2$ where the indexes are different. A first step would be to replace $\sum_{k \ne i,j} p_k^2 = S_2 - p_i^2 - p_j^2$, where $S_2 = \sum_k p_k^2$, which gives $p_i^2p_j^2(S_2 - p_i^2 - p_j^2)$ for the sum over k:

$$
\sum_{i \ne j \ne k} p_i^2p_j^2p_k^2 = \sum_{i \ne j} p_i^2p_j^2\,(S_2 - p_i^2 - p_j^2) = S_2\sum_{i \ne j} p_i^2p_j^2 - \sum_{i \ne j} p_i^4p_j^2 - \sum_{i \ne j} p_i^2p_j^4,
$$

where the notation implies summation over different values of the indexes. Rewriting the expression above with the powers replaced by the $\gamma$-parameters we get this more general expression:

$$
\sum_{i \ne j \ne k} p_i^{\gamma_i}p_j^{\gamma_j}p_k^{\gamma_k} = S_{\gamma_k}\sum_{i \ne j} p_i^{\gamma_i}p_j^{\gamma_j} - \sum_{i \ne j} p_i^{\gamma_i + \gamma_k}p_j^{\gamma_j} - \sum_{i \ne j} p_i^{\gamma_i}p_j^{\gamma_j + \gamma_k},
$$

where $S_{\gamma} = \sum_k p_k^{\gamma}$ and all $\gamma$-parameters were 2 in the previous example. The formula can be programmed in a computer as a recursion formula. Hence, in contrast to the simpler situations only involving a pair of DNA profiles where a few equations give the necessary probabilities (Weir, 2004, 2007), we let the computer compute the expectations $E[M(G_{i_1}, G_{i_2})M(G_{i_1}, G_{i_3})^{\top}]$ and $E[M(G_{i_1}, G_{i_2})M(G_{i_3}, G_{i_4})^{\top}]$.
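The reduction of sums over different indexes to ordinary power sums lends itself to a short recursive function. The Python sketch below is an illustration only (the names `power_sum`, `distinct_index_sum` and `brute_force` are assumptions, not taken from the thesis): the last index is summed out using a power sum, and the terms where it coincides with one of the remaining indexes are subtracted by merging the exponents. A brute-force enumeration is included so the recursion can be checked on a small example.

```python
from itertools import permutations


def power_sum(p, g):
    """S_g = sum_k p_k ** g."""
    return sum(x ** g for x in p)


def distinct_index_sum(p, gammas):
    """Sum over pairwise different indexes of prod_r p[i_r] ** gammas[r]."""
    if not gammas:
        return 1.0
    *rest, last = gammas
    # Sum the last index freely, then remove coincidences with earlier ones.
    total = power_sum(p, last) * distinct_index_sum(p, rest)
    for r in range(len(rest)):
        merged = rest[:r] + [rest[r] + last] + rest[r + 1:]
        total -= distinct_index_sum(p, merged)
    return total


def brute_force(p, gammas):
    """Direct enumeration of all tuples of different indexes (check only)."""
    total = 0.0
    for idx in permutations(range(len(p)), len(gammas)):
        term = 1.0
        for i, g in zip(idx, gammas):
            term *= p[i] ** g
        total += term
    return total


# Hypothetical allele probabilities; all exponents equal to 2 as in the text.
p = [0.1, 0.2, 0.3, 0.4]
value = distinct_index_sum(p, [2, 2, 2])
```

The recursion only ever needs power sums of the allele probabilities, so its cost does not grow with the number of alleles the way the direct enumeration does.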
2
which for non-negative parameters, = (0.13, 0.87, 14.71), is a
monotonically increasing function. That is, the probability increases with $\theta$, i.e. the more heterogeneous the population is, the larger is the probability of coinciding DNA profiles.
However, this fact does not imply that DNA profiling is overrated, nor that the weight of evidence reported in court is overstated. When using the LR-approach the reported evidential value relates to the specific DNA profile of a suspect. The pairwise comparisons of each pair in the DNA database were used to validate the population genetic model. The diagnostics presented above indicated that the differences between the observed and expected counts were not too extreme, and thus we may still have confidence in the models used for reporting the evidential weight in court.
CHAPTER 4
Evaluating the weight of evidence using
quantitative STR data in DNA mixtures
Publication details
Co-authors: Poul Svante Eriksen
, of
a two-person mixture, the consistency with G induces the set C = {G
: (G
, G
$\sum_{G_U \in C_d} P(Q \mid G, G_S, G_V, G_U)\,P(G, G_S, G_V, G_U)$,
4.1 Introduction 63
Table 4.1: The four DNA profiles used in the controlled pairwise two-person mixture experiments.
     D3     vWA    D16    D2     D8     D21        D18    D19    TH0  FGA
A    14,18  17,19  12,14  20,24  10,13  30.2,32.2  13,13  12,13  8,9  20,22
B    15,16  14,16  10,12  17,25  13,16  30,30      13,13  14,15  6,9  19,23
C    15,16  15,17  11,11  19,25  8,12   29,31      15,17  13,13  6,8  23,24
D    16,19  15,17  10,12  23,25  13,13  28,30      12,16  13,15  6,7  20,23
where $P(Q \mid G, G_S, G_V, G_U) = P(Q \mid G_V, G_U)$ and $P(G, G_S, G_V, G_U) = P(G_S, G_V, G_U)$ due to $(G_V, G_U) \equiv G$ and $H_d$ being assumed. Hence,

$$
P(E, G_S, G_V \mid H_d) = \sum_{G_U \in C_d} P(Q \mid G_V, G_U)\,P(G_S, G_V, G_U).
$$

Similar arguments apply to the numerator of LR, and assuming independence between the profiles involved, i.e. unrelated individuals such that $P(G_S, G_V, G_U) = P(G_S)P(G_V)P(G_U)$, the final LR expression is:

$$
LR = \frac{P(Q \mid G_S, G_V)}{\sum_{G_U \in C_d} P(Q \mid G_V, G_U)\,P(G_U)}, \tag{4.1}
$$

where the factors $P(G_S)P(G_V)$ have cancelled out. The numerator $P(Q \mid G_S, G_V)$ of (4.1) assesses the probability of observing the quantitative information given that the mixture consists of genetic material from the profiles $G_S$ and $G_V$. The denominator equals the mean value of the quantitative likelihood among the pairs of profiles that are consistent with the genetic trace. If we assume $P(Q \mid G_S, G_V) = P(Q \mid G_V, G_U)$ for all $G_U$, i.e. the observed quantitative information has equal probability for all profiles paired with $G_V$, then (4.1) reduces to the usual likelihood ratio as in Evett and Weir (1998), since $P(Q \mid G_S, G_V)$ and $P(Q \mid G_V, G_U)$ then cancel each other in (4.1). The assumption that the profiles $G_S$, $G_V$ and $G_U$ are independent is rather strong. The so-called $\theta$-correction incorporates the correlation from shared ancestry (Balding and Nichols, 1994), and closer familial relationships induce further correlation of the genetic profiles. However, for the purpose of introducing the factorisation of the qualitative and quantitative evidence the assumption used in (4.1) is adequate.
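The structure of (4.1) can be made concrete with a small Python sketch. It assumes that the quantitative likelihoods and the profile probabilities have already been computed elsewhere; the function name `quantitative_lr` and the numbers in the example are hypothetical and serve only to show the weighting in the denominator.

```python
def quantitative_lr(lik_suspect, candidates):
    """Likelihood ratio of the form (4.1).

    lik_suspect: P(Q | G_S, G_V), the quantitative likelihood under H_p
    candidates:  iterable of pairs (P(Q | G_V, G_U), P(G_U)), one pair for
                 each unknown profile G_U in the set C_d
    """
    denominator = sum(lik * prior for lik, prior in candidates)
    return lik_suspect / denominator


# Hypothetical values: when every candidate has the same quantitative
# likelihood as the suspect, the LR reduces to 1 / sum of P(G_U).
equal = quantitative_lr(0.2, [(0.2, 0.01), (0.2, 0.03)])

# A candidate that fits the quantitative data far better than the suspect
# pulls the LR down, even though its profile probability is unchanged.
skewed = quantitative_lr(0.2, [(0.2, 0.01), (20.0, 0.03)])
```

The second call illustrates why the quantitative information can lower the weight of evidence: the denominator is dominated by the best-fitting pair of profiles.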
The objective of the present paper is to develop a methodology and an adequate statistical model
to describe P(Q|G
, G
and G
DNA =
s
=
$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \big(A^{(1)}_{s,15}, A^{(1)}_{s,16}, A^{(2)}_{s,16}, A^{(2)}_{s,19}\big)^{\top} + \varepsilon_s$,
adding together the entries in $A_s$ that relate to the same allele, i.e. allele 16.
The number of allelic measurements within each locus varies from case to case since different pairs of profiles will share a different number of alleles. A mixture of persons A and B would have $n_{D3} = 4$, whereas B and C have $n_{D3} = 2$ (see Table 4.1). Not only will the number of alleles vary; the specific alleles present in a given mixture also depend on the profiles in the mixture, e.g. A and B give alleles {14, 15, 16, 18}, and B and C give {15, 16}. This makes it difficult to incorporate a covariance structure covering all allele combinations.
We standardised the residual, $\varepsilon$, by the observed peak heights, $h = (h_s)_{s \in S}$ with $h_s = (h_{s,i})_{i=1}^{n_s}$, by defining the scaled residual, $\tilde\varepsilon = (\tilde\varepsilon_s)_{s \in S}$, where $\tilde\varepsilon_s = (\varepsilon_{s,i}/\sqrt{h_{s,i}})_{i=1}^{n_s}$. To make the model operational, we assumed a compound symmetry model for the covariance of $\tilde\varepsilon$, $\mathrm{Cov}(\tilde\varepsilon) = \tilde\Sigma$, and that this does not depend on the specific alleles in the mixture. The only case-specific adjustment made was to make the dimensions of the compound symmetry concordant with the number of observed peaks for each locus. The compound symmetry structure of $\tilde\Sigma$ implies that sub-vectors of $\tilde\varepsilon$ share some properties with respect to the scaled covariance $\tilde\Sigma$. There are three different types of correlation in our setting:
Different loci ($s \ne t$): $\mathrm{Cov}(\tilde\varepsilon_{s,i}, \tilde\varepsilon_{t,j}) = \omega_{st}$.
Same locus, different alleles ($s = t$, $i \ne j$): $\mathrm{Cov}(\tilde\varepsilon_{s,i}, \tilde\varepsilon_{s,j}) = \omega_{ss}$.
Same locus, same allele ($s = t$, $i = j$): $\mathrm{Cov}(\tilde\varepsilon_{s,i}, \tilde\varepsilon_{s,i}) = \mathrm{Var}(\tilde\varepsilon_{s,i}) = \omega_{ss} + \tau_s$.
Hence, we can parameterise $\tilde\Sigma$ by $\tau = \{\tau_s\}_{s \in S}$ and $\Omega = \{\omega_{st}\}_{s,t \in S}$. The interpretation of $\omega_{st}$ is that the correlations between observations at different loci depend only on the loci and not on the specific alleles present on each locus. Similarly, the correlation between alleles on the same locus, s, is independent of the specific alleles, whereas for identical elements the covariance corresponds to the variance, and the addition of $\tau_s$ allows for a larger variance than that given by the intra-locus covariance.
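The three cases translate directly into a rule for filling in the covariance matrix of the scaled residuals once the number of observed peaks per locus is known. The Python sketch below is illustrative only: the names `compound_symmetry_cov`, `omega` and `tau` are labels chosen here for the inter/intra-locus covariances and the additional variance components, and the numerical values are hypothetical.

```python
def compound_symmetry_cov(n_peaks, omega, tau):
    """Covariance matrix of the scaled residuals for one case.

    n_peaks: dict locus -> number of observed peaks at that locus
    omega:   dict (locus, locus) -> covariance, given for both key orders
    tau:     dict locus -> additional variance added on the diagonal
    Returns the row/column labels (locus, peak index) and the matrix.
    """
    labels = [(s, i) for s in n_peaks for i in range(n_peaks[s])]
    cov = []
    for s, i in labels:
        row = []
        for t, j in labels:
            value = omega[(s, t)]       # different loci, or same locus
            if s == t and i == j:
                value += tau[s]         # same locus, same allele
            row.append(value)
        cov.append(row)
    return labels, cov


# Hypothetical two-locus case: two peaks at D3 and three peaks at vWA.
n_peaks = {"D3": 2, "vWA": 3}
omega = {("D3", "D3"): 4.0, ("vWA", "vWA"): 9.0,
         ("D3", "vWA"): 3.0, ("vWA", "D3"): 3.0}
tau = {"D3": 0.5, "vWA": 0.0}
labels, cov = compound_symmetry_cov(n_peaks, omega, tau)
```

Only the dimensions change from case to case; the parameters themselves are shared, which is the point of the compound symmetry assumption.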
4.2.3 Implementation of the EM-algorithm
In order to handle the latent structure of A and the associated missing data problem, we used
the EM-algorithm to impute the missing observations and estimate the parameters in the condi-
tional distribution of A given M. However, since the dimensions of M and sub-vectors hereof
varied from case to case, we obtained a likelihood, which was not very well suited for the imple-
mentation of the EM-algorithm. The problem was solved by introducing appropriate auxiliary
variables.
This allowed for an implementation of the EM-algorithm in the usual full exponential family framework with the constraint that the $\omega_{ss}$-parameters should be positive, i.e. this method implies positive intra-locus covariances. However, the inter-locus covariances $\omega_{st}$ are not constrained. The parameters estimated using the EM-algorithm are not case specific but reflect the distribution of the quantitative STR DNA in the laboratory.
Appendices 4.A and 4.B give mathematical details on the model and the implementation of the
EM-algorithm.
4.3 Impact on the likelihood ratio
As mentioned in Section 4.1.2, both the qualitative and quantitative evidence need to be evaluated
for proper use of the available information from a crime scene. The probability P(Q|G
, G
) in
the likelihood ratio of (4.1) is evaluated by using the tted model to calculate L(M|G
, G
) =
|
(G
,G
)
|
1/2
exp{
1
2
(M
(G
,G
)
)
1
(G
,G
)
(M
(G
,G
)
)} of (G
, G
, G
) for each
pair of profiles. As mentioned in Bill et al. (2005), this approach will not yield the correct LR
as all possible combinations should be weighted by their associated L(M|G
, G
)-value. This
attempt to evaluate the LR aims at including more of the available information and thus yielding a better approximation to the actual LR, since each pair of profiles has its own weight reflecting how well it fits the quantitative data.
In the example, we demonstrate the effect of including the quantitative information in the evidence evaluation for three different suspect profiles. The suspect profiles used in the example
Table 4.3: Profiles of the suspects (a)-(c), unknowns and best matching pairs of profiles (*) in the example of Section 4.3.1. For all the suspects, only one unknown matches the chosen suspect among the 860 combinations. In loci where the suspect combination differs from the best matching combination in part (*), allelic numbers are in bold font.
Locus         D3     vWA    D16    D2     D8     D21    D18    D19    TH0  FGA
(a) Suspect   15,16  14,16  10,12  17,17  13,16  29,31  15,15  14,15  9,9  19,19
    Unknown   15,16  15,17  11,11  19,25  8,12   30,31  13,17  13,13  6,8  23,24
(b) Suspect   15,16  14,16  10,12  17,25  13,16  30,30  17,17  14,15  6,9  19,19
    Unknown   15,16  15,17  11,11  19,25  8,12   29,31  13,15  13,13  6,8  23,24
(c) Suspect   15,16  14,16  10,12  17,25  13,16  30,30  13,15  14,15  6,9  19,23
    Unknown   15,16  15,17  11,11  19,25  8,12   29,31  15,17  13,13  6,8  23,24
(*) Minor     15,16  14,16  10,12  17,25  13,16  29,29  13,13  14,15  6,9  19,23
    Major     15,16  15,17  11,11  19,25  8,12   30,31  15,17  13,13  6,8  23,24
are given in Table 4.3, together with the unknown profile $G_U$ that maximises $L(M \mid G_S, G_U)$ for each suspect profile, $G_S$. For each suspect profile, only one of the 860 pairs of profiles satisfies $(G_U, G_S) \equiv G$, which implies a product of $L(M \mid G_S, G_U)$ and $P(G_U)$ in the numerator for each suspect profile, and 860 terms in the sum of the denominator, of which the combination of Minor and Major of Table 4.3, part (*), has the largest quantitative likelihood value. Throughout the example, the main focus will be on the suspect of part (a) in Table 4.3, with comparisons to the results obtained using the suspects of parts (b) and (c).
In Fig. 4.4 and Fig. 4.5, the observed quantitative peaks are plotted together with the expected peaks for the profiles of part (a) and (*) of Table 4.3, respectively. The expected peaks are
given by
M = T , where T and H
(k)
in
s,k
=
s
H
(k)
are computed for the specic pair of
profiles. It is clear from Fig. 4.4 that the imbalances induced by the suspect combination in part (a) imply substantial deviation from the observed data for loci D2, D21, D18, TH0 and FGA. These are also the loci where the two pairs of profiles of part (a) and (*) in Table 4.3 differ.
First, we make a non-quantitative evaluation of the LR using only allele probabilities for the suspect of part (a). Since there is only one combination among the 860 that includes this suspect, the likelihood ratio is $LR = P(G_U)/[\sum P(G_{U_1})P(G_{U_2})]$, where the sum in the denominator is over the set $C_d$, but here this set consists of 860 combinations satisfying $H_b \in [3/5 ; 5/3]$ and $M_x^s \in [\bar M_x \pm 0.25]$ for computational simplicity. This yields a non-quantitative likelihood ratio, $LR_G$, estimate of $4.527 \times 10^{13}$, which is very strong evidence in favour of the hypothesis that the suspect is a contributor to the stain.
The dominating values of the quantitative likelihood in the numerator and denominator are given by $L(M \mid G_S^{(a)}, G_U^{(a)}) = 5.9 \times 10^{-119}$ and $L(M \mid G_{U_1}^{(*)}, G_{U_2}^{(*)}) = 5.57 \times 10^{-100}$, respectively. A large difference in the quantitative likelihood values was expected from the difference in fit to the observed peaks pictured in Figs. 4.4 and 4.5. Thus, including the quantitative evidence, the quantitative likelihood ratio estimate, $LR_{GQ}$, decreased by a factor $10^{17}$ to $7.63 \times 10^{-4}$, which is strongly in favour of the suspect not having contributed to the stain.
Figure 4.4: Observed and expected peaks, assuming a two-person mixture of the suspect and unknown in Table 4.3, part (a). One panel per locus (D16, D18, D19, D2, D21, D3, D8, FGA, TH0, vWA). Abscissa: Basepair (bp) values computed using the allelic number and STR locus, Ordinate: Peak heights in rfu.
Table 4.4: Likelihood ratios for the three different suspects in Table 4.3. Here, $LR_G$ and $LR_{GQ}$ denote the non-quantitative and quantitative likelihood ratios, respectively, and $LR_{GQ}/LR_G$ is the relative change in the weight of the evidence. The allele frequencies used in the calculations were provided by The Section of Forensic Genetics, University of Copenhagen.
              LR_G             LR_GQ            LR_GQ/LR_G
Suspect (a)   4.527 x 10^13    7.630 x 10^-4    1.685 x 10^-17
Suspect (b)   4.216 x 10^13    5.185 x 10^8     1.230 x 10^-5
Suspect (c)   3.596 x 10^13    9.744 x 10^13    2.710
Together with similar computations for the suspects of parts (b) and (c), this information is given in Table 4.4. Here, we see that for the suspects of parts (b) and (c), the change in the weight of evidence is a moderate decrease and a small increase, respectively. Note that part (b) differs from the best matching pair of profiles in three loci (D21, D18 and FGA) and part (c) in the two loci D21 and D18.
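The relative changes in Table 4.4 follow from simple division of the two likelihood ratios, which the following Python lines reproduce from the tabulated values. The dictionary layout and variable names are chosen here for illustration.

```python
# (LR_G, LR_GQ) for suspects (a)-(c), copied from Table 4.4.
likelihood_ratios = {
    "a": (4.527e13, 7.630e-4),
    "b": (4.216e13, 5.185e8),
    "c": (3.596e13, 9.744e13),
}

# Relative change in the weight of evidence, LR_GQ / LR_G.
relative_change = {
    suspect: lr_gq / lr_g
    for suspect, (lr_g, lr_gq) in likelihood_ratios.items()
}

for suspect, change in relative_change.items():
    print(f"Suspect ({suspect}): LR_GQ / LR_G = {change:.3e}")
```

A ratio below one means that the quantitative data weaken the evidence against the suspect, as for suspects (a) and (b); a ratio above one, as for suspect (c), means they strengthen it.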
Figure 4.5: Observed and expected peaks, assuming a two-person mixture of the minor and major profiles in Table 4.3, part (*). One panel per locus (D16, D18, D19, D2, D21, D3, D8, FGA, TH0, vWA). Abscissa and ordinate as in Fig. 4.4.
The non-quantitative likelihood ratio estimates, $LR_G$, of Table 4.4 will in many legal systems point towards conviction of any of the suspects. When including the quantitative information, we see that the change in the weight of evidence may add further to the evidence against the suspect (as in part (c)), or may decrease the likelihood ratio estimate such that it provides strong evidence in favour of the suspect (part (a)); situations in between these two extremes will also occur (part (b)). This example shows that, even when a person's genotype matches the genetic stain, imbalanced STR DNA profiles judged by the observed quantitative data may speak strongly in favour of the suspect. However, weighing each pair of genotypes by the associated quantitative likelihood value may add further to the evidence against the suspect when the suspect's profile only causes a few or small imbalances with respect to the observed peaks.
4.4 Parameter estimation
The EM-algorithm and the specific expressions derived in Appendix 4.B were implemented in the statistical software package R (R Development Core Team, 2009). In order to validate the implementation, we simulated peak area data given the peak heights from controlled experiments and known model parameters. After 30,000 iterations, the parameter estimates were close to the true values, indicating a successful implementation of the fitting algorithm.
In order to estimate the model parameters, we used a training set consisting of results of investigations of DNA mixtures from 71 controlled experiments conducted at The Section of Forensic Genetics, University of Copenhagen. These 71 cases were chosen such that all alleles from the contributing profiles were present in the data, i.e. no drop-out events occurred (see Tvedebrink et al., 2009, for discussion on allelic drop-out). The algorithm was executed using several different sets of initial values. For each set, we ran 30,000 iterations of the EM-algorithm, all converging to the same parameter estimates.
In order to monitor the convergence of the EM-algorithm, we computed the deviance after each
iteration. After 1,100 iterations, the absolute improvement for successive deviances was less
than 0.01.
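Monitoring the deviance in this way amounts to a simple stopping rule. The Python sketch below shows the idea in generic form; it is not the R implementation referred to above, and the names `run_until_converged`, `step` and `deviance` are placeholders for a single EM update and the deviance evaluation, respectively. The toy functions in the usage example are hypothetical stand-ins.

```python
def run_until_converged(step, deviance, params, tol=0.01, max_iter=30000):
    """Iterate `step` until successive deviances differ by less than `tol`.

    step:     function mapping current parameters to updated parameters
    deviance: function mapping parameters to the deviance
    Returns the final parameters and the number of iterations used.
    """
    previous = deviance(params)
    for iteration in range(1, max_iter + 1):
        params = step(params)
        current = deviance(params)
        if abs(previous - current) < tol:
            return params, iteration
        previous = current
    return params, max_iter


# Toy stand-ins: halve the parameter each step; deviance is its square.
final, used = run_until_converged(lambda x: x / 2.0, lambda x: x * x, 8.0)
```

In the analysis described above all 30,000 iterations were run and the deviance was only monitored; stopping early, as in this sketch, is a design choice that trades some accuracy for run time.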
In the covariance part of Table 4.5, the shading shows the locus correlations, $\omega_{st}/\sqrt{\omega_{ss}\omega_{tt}}$, while the above-diagonal part shows the locus covariances, $\omega_{st}$, when $\tau_s = 0$ (see Section 4.5.2). Most of the loci were highly correlated. This indicates that evaluation of quantitative DNA evidence with the assumption of independence across loci is an extensive simplification.
The different signal intensities of the fluorescent dyes were also identifiable in the parameter estimates. The strong signals of the green dye band and the weaker signals of the yellow dye band (Butler, 2005) were reflected in the parameter estimates of
s
. In Table 4.5, we see that the
magnitude of these estimates for the yellow fluorescence was smaller than that for the blue fluorescence, which again was smaller than that for the green fluorescence (except for loci D16 and D21).
In addition to the parameter estimates and deviance, we also computed the asymptotic variances
of the estimates by the normality approximation of the MLE with the inverse Fisher Information
as covariance matrix. We found that the estimated standard deviation of both and
2
indicated
reasonably good estimates of these parameters. Large asymptotic standard deviations of did,
however, indicate the possibility of model reductions.
4.5 Discussion
4.5.1 Validity of the hypothesis of a two-person mixture
When analysing the STR results of a crime scene stain, we need to be able to determine whether
the stain is likely to originate from a two-person mixture or not. In this section, we demon-
strate how this is possible using our model for the quantitative STR DNA data. In order to
verify the hypothesis of a given two-person mixture, we simulated 1,000 vectors of peak areas, $M_1, \ldots, M_{1000}$, for each of the 71 cases from the controlled experiments.
Simulations of the peak areas were conditioned on the observed peak heights and true profiles of the mixture, and we used the parameter estimates from Table 4.5. This corresponds to simulating under a null hypothesis with the T-matrix, $H = (H^{(1)}, H^{(2)})$ and $h$ known together with fixed
parameters ,
2
and , i.e. assuming that the stain originates from a two-person mixture.
Table 4.5: Parameter estimates after 30,000 iterations of the EM-algorithm with $\tau = 0$ (Section 4.5.2). The $\Omega$-matrix shows the covariances $\omega_{st}$ (diagonal and above) and correlations $\omega_{st}/\sqrt{\omega_{ss}\omega_{tt}}$ (below the diagonal, shaded in the original). Dye groups: yellow dye fluorescence (FGA, TH0, D19), blue dye fluorescence (D2, vWA, D3, D16), green dye fluorescence (D21, D8, D18).
       FGA      TH0      D19      D2       vWA      D3       D16      D21      D8       D18
FGA    1151.54  773.91   1441.00  1492.01  1044.48  857.46   1305.36  1033.50  397.34   1461.59
TH0    0.75     925.69   1042.03  1090.16  587.58   664.55   527.21   654.26   582.88   1085.37
D19    0.94     0.76     2052.83  2151.76  1319.49  1050.95  1716.85  1279.93  619.22   1964.68
D2     0.88     0.72     0.95     2481.70  1438.36  1237.07  1848.14  1354.94  765.35   2077.81
vWA    0.94     0.59     0.89     0.88     1082.10  821.78   1339.64  975.91   340.35   1449.91
D3     0.86     0.74     0.79     0.85     0.85     864.07   954.08   776.64   536.56   1142.77
D16    0.88     0.40     0.87     0.85     0.93     0.74     1915.78  1201.83  246.50   1653.91
D21    0.99     0.70     0.92     0.88     0.96     0.86     0.89     952.27   380.72   1353.25
D8     0.41     0.67     0.48     0.54     0.36     0.64     0.20     0.43     821.00   750.02
D18    0.92     0.76     0.93     0.89     0.94     0.83     0.81     0.94     0.56     2196.65
Locus-specific parameter estimates (two rows, columns as above):
       5.53     5.99     6.15     7.01     7.64     8.25     9.10     8.92     10.19    10.18
       2596.53  730.29   1002.93  1236.31  1146.78  1331.79  1821.04  1797.39  1854.94  3208.73
For each of the simulated peak area vectors, $M_i$, we found the pair of profiles maximising the likelihood, $\hat G_i = (\hat G_{i1}, \hat G_{i2})$, using the approach of Tvedebrink et al. (2010, Chapter 5 of this thesis) and computed T and H associated with $\hat G_i$. Using these quantities, we can determine the Mahalanobis distance,

$$
M_d(M_i, \hat G_i) = (M_i - \hat M_{\hat G_i})^{\top}\,\mathrm{Var}(M_{\hat G_i})^{-1}\,(M_i - \hat M_{\hat G_i}), \tag{4.2}
$$

where $\hat M_{\hat G_i}$ and $\mathrm{Var}(M_{\hat G_i})$ are the expected peak areas and variance, respectively, assuming a mixture of $\hat G_i$. If $\hat G_i$ were equal to the true profiles of the mixture, then $M_d$ would follow a $\chi^2_n$-distribution with n being the number of observations in the mixture. However, the true mixture profiles may not always be identical to the pair of profiles maximising the likelihood. This may be due to stochastic variations and systematic components, e.g. stutter and pull-up effects. The former is caused by artefacts in the polymerase chain reaction resulting in an increase of peak intensities, typically in the allelic position before the true allele. Pull-up effects are manifested as an increase of the true peaks caused by overlap of the spectra of the light emitted from the various fluorochromes, which are detected by a CCD camera in the data generating process (Butler, 2005). Hence, on average we expect $M_d$ for $\hat G_i$ to be smaller than for the true profiles, which implies fewer degrees of freedom in the $\chi^2$-distribution. Fig. 4.6 shows a histogram of 1,000 simulated Mahalanobis distances for the data given in Table 4.2. The superimposed curves indicate that the expectation of fewer than n degrees of freedom for the $\chi^2$-distribution is reasonable, where n = 31 in this example. The hypothesis that the Mahalanobis distance follows a $\chi^2_{29}$-distribution is supported by a Kolmogorov-Smirnov test (p-value of 0.2410), whereas both 30 and 31 degrees of freedom are rejected (p-values are 0.0307 and $1.966 \times 10^{-8}$, respectively).
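A distance of the form (4.2) can be computed without forming the inverse covariance matrix explicitly. The Python sketch below uses a Cholesky factorisation and forward substitution in plain Python; it assumes a positive definite covariance matrix given as a list of lists, and the function names `cholesky` and `mahalanobis` as well as the numbers in the example are chosen here for illustration only.

```python
import math


def cholesky(a):
    """Lower-triangular factor `low` with low * low^T = a (a positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def mahalanobis(m, mu, cov):
    """(m - mu)^T cov^{-1} (m - mu), computed via the Cholesky factor."""
    low = cholesky(cov)
    diff = [x - y for x, y in zip(m, mu)]
    z = []
    for i in range(len(diff)):  # forward substitution: solve low * z = diff
        z.append((diff[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    return sum(v * v for v in z)


# Hypothetical observed and expected peak areas with correlated errors.
distance = mahalanobis([3.0, 5.0], [2.0, 4.0], [[2.0, 1.0], [1.0, 2.0]])
```

Under the model, the resulting distance would then be compared with a $\chi^2$-distribution, which is not included in this sketch.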
In crime casework the DNA may be degraded or partly degraded, which implies that results are only obtained for short STR loci/alleles (loci/alleles with low base pair numbers), while no results (or only weak results) are obtained for longer STR loci/alleles (loci/alleles with high base pair numbers). This is a potential problem since it is not incorporated in the model due to the assumptions on inter-loci correlation.
However, the Mahalanobis distance $M_d$ in (4.2) can be decomposed into two parts evaluating the quality of the sample, $M_d^{(q)}$ in (4.3), and the goodness of fit of a proposed mixture G of two profiles, $M_d^{(m)}$ in (4.4). Let $\Sigma_{M|M_+} = \mathrm{Var}(M_G \mid M_+)$ and $\Sigma_{M_+} = \mathrm{Var}(M_{+,G})$, then

$$
M_d^{(q)}(M, G) = (M_+ - \mu_{+,G})^{\top}\,\Sigma_{M_+}^{-1}\,(M_+ - \mu_{+,G}), \tag{4.3}
$$

$$
M_d^{(m)}(M, G) = (M - \mu_{G|+})^{\top}\,\Sigma_{M|M_+}^{-}\,(M - \mu_{G|+}), \tag{4.4}
$$

where $M_+$ is the vector of loci peak area sums and $\mu_{G|+}$ ($\mu_{+,G}$) are the expected peak areas (sums) conditioned on the loci sums for profiles G. The reason for this decomposition follows from the normality assumption, where $f(M) = f_{M|+}(M \mid M_+)\,f_+(M_+)$, which in density functions yields

$$
|\Sigma_M|^{-\frac{1}{2}}\,e^{-\frac{1}{2}M_d(M,G)} = |\Sigma_{M|M_+}|^{-\frac{1}{2}}\,|\Sigma_{M_+}|^{-\frac{1}{2}}\,e^{-\frac{1}{2}\left[M_d^{(m)}(M,G) + M_d^{(q)}(M,G)\right]},
$$

where $\Sigma_M = \mathrm{Var}(M_G)$. We note that $M \mid M_+$ is a distribution restricted to the affine subspace with fixed peak area sums.
Figure 4.6: Histogram of Mahalanobis distances for simulations based on data from Table 4.2. Superimposed are a $\chi^2_{31}$-distribution (solid), Gaussian based kernel density estimate (dashed) and a $\chi^2_{29}$-distribution (dotted).
Since $|\Sigma_M|^{-\frac{1}{2}} = |\Sigma_{M|M_+}|^{-\frac{1}{2}}\,|\Sigma_{M_+}|^{-\frac{1}{2}}$, taking $-2\log$ on both sides of the equation gives the decomposition of the Mahalanobis distance (4.2) into the two parts (4.3) and (4.4). Both Mahalanobis distances, $M_d^{(q)}(M, G)$ and $M_d^{(m)}(M, G)$, follow $\chi^2$-distributions with S and $n - S$ degrees of freedom, respectively.
In Fig. 4.7, we have plotted histograms of the p-values for $M_d^{(m)}$ and $M_d^{(q)}$ for 66 real crime cases made available by The Section of Forensic Genetics, University of Copenhagen. In none of the cases are the contributors known for certain. However, the circumstances of the crime cases made a victim and suspect profile available for each case. The two profiles matched and completely explained the mixed profile of the stain.
The left panel shows the histogram of the p-values from $M_d^{(m)}$, assessing how well the proposed pair of profiles matched the mixture given the assumptions of the model. The histogram of the p-values indicated that the model is applicable to STR results in real crime cases, since large p-values, or equivalently small Mahalanobis distances, imply that $H_p$ is supported by the evidence.
The right panel of Fig. 4.7 shows that more than half (35 cases) of the p-values from the test of the sample quality were less than 0.01. This indicates that most of the crime case samples had been subject to degradation of the DNA material. Degradation of the DNA often complicates the interpretation of DNA mixtures. It is worth emphasising that imbalances caused by degraded DNA may imply that no pair of profiles has $M_d \le \chi^2_{n,(1-\alpha)}$, where $\chi^2_{k,(1-\alpha)}$ is the critical value on significance level $\alpha$ (e.g. $\alpha = 0.01$) for a $\chi^2_k$-distributed variable. However, conditioned on the loci sums, such imbalances do not affect the evaluation of a particular pair of profiles, i.e. $M_d^{(m)} \le \chi^2_{n-S,(1-\alpha)}$ is possible.
Figure 4.7: Histogram of p-values of the Mahalanobis distances of 66 crime cases in which we had found the pair of profiles maximising the likelihood. For these profiles, we have decomposed the overall Mahalanobis distance $M_d$ into $M_d^{(m)}$ and $M_d^{(q)}$.
In order to investigate whether an observed stain may originate from a two-person mixture, the evaluation of $M_d^{(m)}(M, \hat G)$ needs to be less than $\chi^2_{n-S,(1-\alpha)}$. If this is not the case for the observed stain, it may be a mixture of more than two contributors, or the results may be strongly influenced by DNA degradation, drop-outs, stutters, pull-up effects, etc. With $M_d^{(m)}(M, \hat G) \le \chi^2_{n-S,(1-\alpha)}$, it is plausible for the observed stain to be a mixture of two individuals since, for the pair of profiles maximising the likelihood, the conditional Mahalanobis distance is sufficiently small. Then the quality of the sample may be investigated by evaluating $M_d^{(q)}$ and observing whether it falls above the critical value $\chi^2_{S,(1-\alpha)}$, e.g. $\alpha = 0.01$. If so, this indicates unexpected imbalances between loci, which may be due to e.g. degraded DNA, inhibitors affecting only certain loci or allelic drop-outs.
4.5.2 Model reductions
When fitting the parameters of the model, we found for our specific data set that the additional
variance components, one for each locus $s \in S$, were infinitesimally small compared to the
contributions of the corresponding diagonal covariance parameters.
A $\chi^2$-test indicated that the goodness of fit was not significantly improved by these parameters.
Hence, the results reported in Table 4.5 correspond to the model with the additional variance
components set to zero for all $s \in S$.
Investigations showed that further reduction of the covariance structure was not supported by the
data (see Appendix 4.C for more details).
4.6 Conclusion
In the example of Section 4.3.1, the usual evaluation of the likelihood by considering
$LR_G = P(G|H_p)/P(G|H_d)$ gave a likelihood ratio larger than $10^{13}$, supporting the $H_p$-hypothesis.
However, when including the quantitative information, the weight of evidence
decreased to a likelihood ratio, $LR_{GQ}$, of less than one. This was true even with limits of
0.25 for the mixture proportion balances in the setup of Bill et al. (2005). The likelihood ratio
without taking the quantitative information into account corresponded to the situation where all
combinations passing the guidelines of Bill et al. (2005) were given identical weights. Hence,
excluding possible combinations from entering the likelihood ratio based on the quantitative
information was not sufficient for an accurate estimate of the likelihood ratio based on quantitative
information.
For cases where the qualitative results strongly support that the suspect contributed to a mixed
stain, the inclusion of the quantitative information may further support the conclusion.
Conversely, the likelihood ratio may decrease, supporting the $H_d$-hypothesis. Both situations were
demonstrated by the example of Section 4.3.1. Hence, the evaluation of the quantitative
information using a statistical model is of great importance in order to assess the weight of evidence
obtained from DNA mixtures.
The model derived in this paper incorporates both information on qualitative traits (STR alleles)
and on quantitative aspects of the STR alleles (peak heights and areas). Graphical diagnostics
(not included in this manuscript) indicate that the model is well suited for the evaluation
of $P(Q|G, H)$. Furthermore, assuming independence of the peak areas of the various STR loci is a
simplification that cannot be supported by the work carried out in this paper. Hence, inter-locus
correlations or other means of correction need to be considered when assessing the weight of
evidence from quantitative data in forensic DNA STR settings.
The concordance between the model properties and prior knowledge of differences in amplification
efficiency of various STR loci and in emission intensities of various fluorescent dyes adds
further support to the model.
The model described in the present paper is also applicable in other fields of science. A useful
property is the handling of variable dimension of the observations while exploiting compound
symmetries (Votaw, 1948). For example, similar problems with modelling covariance structures
may arise in animal breeding studies, where the litter size varies and offspring may be related
through the same breeding lines.
Appendices
4.A The model
In this section, we provide more mathematical details than given in Section 4.2.2. The model
assumes proportionality of the mean and variance of A N
4S
(, ). The covariance, , is a
diagonal matrix with elements
2
s
H
(k)
and is a vector partitioned in a similar way with the
element
s
H
(k)
for both peak areas associated with locus s and person k.
The observable peak area measurements, M, were dened as a linear transformation, T, such
that M = TA + . In order to model the proportionality of the mean and variance of M, we
dened the scaled residuals =
_
i
/
h
i
_
n
i=1
, where n =
_
sS
n
s
. For , we assumed a compound
symmetry covariance matrix
(Votaw, 1948). Since = diag(h)
1/2
, the covariance of
is Cov() = = diag(h)
1/2
diag(h)
1/2
. We parametrised the covariance,
, as an additive
structure using = {
st
}
s,tS
and = (
s
)
sS
, such that Cov(
s
,
t
) =
st
1
n
s
1
n
t
+
st
s
I
n
s
,
where
s
are the scaled residuals of locus s, 1
k
is a k-dimensional vector of ones, and
st
is the
Kronecker delta. For implementation of the EM-algorithm, we need the conditional distribution
of A|M. Using Lauritzen (1996, Proposition C.5), this is
A|M N
4S
_
+ T
_
TT
+
_
1
(M T), T
_
TT
+
_
1
T
_
. (4.5)
The model for M corresponds to a linear mixed eects model:
M = X+ Z(
1
,
2
)
, where
1
N(0, diag(1
4
2
s
)
sS
) and
2
N(0, ) (4.6)
for some case-specific design matrices $X$ and $Z$. However, estimation of the variance components
is complicated by the varying dimensions of $M$ and $M_s$, $s \in S$, from case to case.
4.B EM-estimators
In order to handle the complete structure of A that includes the missing data problem, we used
the EM-algorithm to impute the unobservable data. However, since the dimensions of M and
sub-vectors hereof varied from case to case, we obtained a likelihood that was not very well
suited for implementation of the EM-algorithm. This was due to the dependence on n
s
in the
covariance of the locus-wise average of the scaled residuals
= (
1
, . . . ,
S
),
Cov(
) = diag(
s
/n
s
)
sS
+ = diag(/n) + ,
where n = (n
s
)
sS
and the vector division is done component-wise, x/y = (x
i
/y
i
)
n
i=1
.
The problem was solved using appropriate auxiliary variables v and u, which we assumed to
be independent and zero-mean normal distributed variables with covariances and diag(/n),
respectively. By introducing v and u, we obtained a likelihood of a full exponential family,
where the estimation of and may be done separately. The use of auxiliary variables is
equivalent to adding constraints on the diagonal elements of . By assuming Cov(v) = , we get
the constraint that
ss
> 0, s S. In (4.6), this corresponds to splitting
2
into two independent
parts
21
and
22
,
2
=
21
+
22
, where
21
N(0, Q
c
Q
c
) and
22
N(0, diag(
s
1
n
s
)
sS
) with
Q
c
dened in (4.7).
Hence, the E-step consisted of imputing A, u and v given the observations M. In the M-step,
we used that the full likelihood factorises into two terms modelling the biological part of the data
given the measurement noise, (A, M)|(u, v, ), and the noise, (u, v, ), respectively:
f (A, M, u, v, ; , , , |H, h) = g(A, M; , |u, v, , H, h)h(u, v, ; , |H, h)
with g and h being the density functions of the two multivariate normal distributions below:
g :
_
A
M
_
N
__
T+
_
,
_
T
T TT
__
h :
_
_
u
v
_
_
N
_
_
_
_
0
0
0
_
_
,
_
_
diag(/n) O diag(/n)Q
c
O Q
c
Q
c
diag(/n) Q
c
_
_
_
,
where Q
c
is dened in (4.7). In order to derive the estimators of the parameters entering the
functions g and h, we dened two matrices Q and Q
c
,
Q =
_
_
1
4
. . . O
.
.
.
.
.
.
.
.
.
O . . . 1
4
_
_
and Q
c
=
_
_
1
n
1c
. . . O
.
.
.
.
.
.
.
.
.
O . . . 1
n
S c
_
_
, (4.7)
where subscript c refers to case c, c = 1, . . . , C. Furthermore, the DNA proxy H = (H
(1)
, H
(2)
) is
expanded to a 4S -dimensional vector, H = (H
s
)
sS
, where the components H
s
are xed for all
loci, H
s
= (H
(1)
, H
(1)
, H
(2)
, H
(2)
). Note, that the compound symmetry structure of the covariance
of with = 0 can be written as
= Q
c
Q
c
. The estimators of and
2
can be found as
=
_
c
Q
E(A
c
|M
c
)
_
c
Q
H
c
2
= (4C 1)
1
c
Q
_
{E(A
c
|M
c
)
c
}
2
+ diag{Cov(A
c
|M
c
)}
H
c
_
,
where the squaring of a vector is done component-wise, x
2
= (x
2
i
)
n
i=1
and diag{B} extract the
diagonal vector of B, diag{B} = (B
ii
)
n
i=1
. Furthermore, the moments of A
c
|M
c
are given in (4.5).
The estimators of = (
s
)
sS
and are,
s
= n
1
s+
c
_
E(u
2
sc
|M
c
)n
sc
+ E(
s
s
n
s
2
s
|M
c
)
_
= C
1
c
_
E(v
c
|M
c
)E(v
c
|M
c
)
+ Cov(v
c
|M
c
)
_
.
For both v and u, the covariance with M is expressed as Cov(x, M) =
Cov(x)Q
c
diag(h)
1/2
, for xreplaced by v or u. The conditional moments entering the estimation
equations may be found using the formulae for computing conditional moments in the multivari-
ate normal distribution, E(X|Y ) =
X
+
12
1
22
(Y
Y
) and Cov(X|Y ) =
11
12
1
22
21
for (X, Y )
N
_
(
X
,
Y
)
,
_
with =
_
11
12
21
22
_
(Lauritzen, 1996, Proposition C.5).
4.C Model reduction
As mentioned in Section 4.4, the large asymptotic standard deviations indicated that the covari-
ance structure of
could be simplied. The estimated parameters for nearly all loci were neg-
ligible compared to
ss
. Let Diag(A
i
)
n
i=1
be a block-diagonal matrix with matrices A
i
, i = 1, . . . , n
as elements and the square root of a vector dened as
x = (
x
i
)
n
i=1
. Then, we may write the
covariance matrix of M,
M
, as:
M
= TT
+ = Diag
_
2
s
T
s
diag(H
s
)T
s
+
s
_
h
s
_
h
s
_
sS
+
_
st
_
h
s
_
h
t
_
s,tS
.
From the equation above, we see that setting the additional variance components to zero does not
introduce any singularities in the covariance matrix of $M$.
Hence, the asymptotic theory is not violated. In order to test whether these components were
statistically significant, we used an approximately $\chi^2$-distributed test statistic with the
difference in the number of parameters as degrees of freedom (Cox and Hinkley, 1974). In the full
model, there were $S(S + 3)/2$ parameters. By restricting the additional variance components to
zero, we removed $S$ parameters, and the $\chi^2_S$-test yielded a p-value of 0.9999,
supporting the hypothesis that they are zero. The reported parameter estimates in Table 4.5 were based
on this restricted model.
Data exploration and the estimated parameters of from Table 4.5 suggest that further model
reductions may be feasible. Possible parametrisations of
may be,
Cov(
s
,
t
) =
d(s),d(t)
1
n
s
1
n
t
+
st
s
I
n
s
(4.8)
Cov(
s
,
t
) =
d(s),d(t)
1
n
s
1
n
t
+
d(s)d(t)
d(s)
I
n
s
(4.9)
Cov(
s
,
t
) = 1
n
s
1
n
t
+
st
s
I
n
s
, (4.10)
where $d$ maps locus to fluorescent dye colour, e.g. $d(\mathrm{FGA}) = \mathrm{Yellow}$. The covariance
structures in (4.8)-(4.10) all use fewer covariance parameters than the restricted model, with
$D(D+1)/2 + S$, $D(D+3)/2$ and $1 + S$ parameters, respectively, where $D$ is the number of dye colours.
In our data $D = 3$ and $S = 10$, and thus we removed 39, 46 and 44 parameters, respectively. The three tests
indicated that there were significant differences between the full model and any of the reduced
models, all with p-values < 0.0001. Hence, the model with the best fit included locus-dependent
parameters for the between and within covariance of the measurement errors. Inspection
of the correlation matrix in Table 4.5 indicated that locus D8 was the only locus with an average
between-locus correlation of less than 0.5. This may well cause the dye covariance models to have
a poor fit.
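The parameter counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and the dictionary labels are illustrative choices, and the only inputs taken from the text are the formulas $S(S+3)/2$, $D(D+1)/2 + S$, $D(D+3)/2$ and $1 + S$ together with $S = 10$ and $D = 3$.

```python
def covariance_parameter_counts(S, D):
    """Count covariance parameters for the models compared in Appendix 4.C."""
    full = S * (S + 3) // 2      # full model
    restricted = full - S        # additional variance components set to zero
    reduced = {
        "(4.8)": D * (D + 1) // 2 + S,
        "(4.9)": D * (D + 3) // 2,
        "(4.10)": 1 + S,
    }
    # Parameters removed relative to the restricted model
    removed = {name: restricted - count for name, count in reduced.items()}
    return full, restricted, reduced, removed


full, restricted, reduced, removed = covariance_parameter_counts(S=10, D=3)
print(full, restricted)  # 65 55
print(removed)           # {'(4.8)': 39, '(4.9)': 46, '(4.10)': 44}
```

The differences are taken against the restricted model with 55 parameters, which reproduces the 39, 46 and 44 removed parameters stated above.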
However, one has to bear in mind that the parameter estimates were based on a limited training
set. Hence, the rejections of the hypotheses of simpler models may be biased towards the four
profiles included in the training set. In order to fully verify the model we need to increase the
proportion of alleles from each locus and also the number of homozygous profiles. This will
reduce the possible individual-specific effect that may exist in the training set. Such work is in
progress.
A more detailed description of the model and the implementation of the EM-algorithm with full
R source code are available on-line at http://people.math.aau.dk/tvede/dna. The programs can
also be obtained from http://www.blackwellpublishing.com/rss.
Acknowledgements
The authors would like to thank Prof. Bruce S. Weir, University of Washington, for some
clarifying comments on an earlier version of the manuscript. We also thank Ms. Catharina Steentoft
for collecting the DNA profiles from the crime case work used in Section 4.5.1, and Ms. Lisbeth
Grubbe Nielsen for thorough review of language and grammar. Furthermore, very helpful
comments were made by the journal's editors and anonymous reviewers. The 22nd ISFG
Congress proceedings (Tvedebrink et al., 2008) has a brief model description.
Bibliography
Balding, D. J. and R. A. Nichols (1994). DNA profile match probability calculation: how to
allow for population stratification, relatedness, database selection and single bands. Forensic
Science International 64, 125-140.
Bill, M. et al. (2005). PENDULUM - a guideline-based approach to the interpretation of STR
mixtures. Forensic Science International 148, 181-189.
Butler, J. M. (2005). Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers
(2 ed.). Burlington, MA: Elsevier Academic Press Inc., U.S.
Cowell, R. G. (2009). Validation of an STR peak area model. Forensic Science International:
Genetics 3(3), 193-199.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2007a). A gamma model for DNA mixture
analyses. Bayesian Analysis 2(2), 333-348.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2007b). Identification and separation of DNA
mixtures using peak area information. Forensic Science International 166, 28-34.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2010). Probabilistic expert systems for handling
artifacts in complex DNA mixtures. Forensic Science International: Genetics. In Press.
Cox, D. R. and D. V. Hinkley (1974). Theoretical Statistics. Chapman and Hall Ltd.
Curran, J. M. (2008). A MCMC method for resolving two person mixtures. Science & Justice 48,
168-177.
Evett, I. W. and B. S. Weir (1998). Interpreting DNA Evidence: Statistical Genetics for Forensic
Scientists. Sunderland, MA: Sinauer Associates.
Gill, P. D. et al. (1998). Interpreting simple STR mixtures using allele peak areas. Forensic
Science International 91(1), 41-53.
Gill, P. D. et al. (2006). DNA commission of the International Society of Forensic Genetics:
Recommendations on the interpretation of mixtures. Forensic Science International 160(2-3),
90-101.
Lauritzen, S. L. (1996). Graphical models. Oxford University Press.
Little, R. and D. Rubin (2002). Statistical Analysis with missing data (2 ed.). Wiley.
Perlin, M. W. and B. Szabady (2001). Linear mixture analysis: A mathematical approach to
resolving mixed DNA samples. Journal of Forensic Science 46(6), 1372-1378.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing.
Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2008). Amplification of DNA
mixtures - Missing data approach. Forensic Science International: Genetics Supplement
Series 1, 664-666.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2009). Estimating the
probability of allelic drop-out of STR alleles in forensic genetics. Forensic Science International:
Genetics 3(4), 222-226.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010). Identifying contributors
of DNA mixtures by means of quantitative information of STR typing. Journal of Computational
Biology. Accepted for publication.
Votaw, D. F. (1948). Testing compound symmetry in a normal multivariate distribution. Annals
of Mathematical Statistics 19(4), 447-473.
Wang, T., N. Xue, and J. D. Birdwell (2006). Least-square deconvolution: A framework for
interpreting short tandem repeat mixtures. Journal of Forensic Science 51(6), 1284-1297.
4.7 Supplementary remarks
As briefly mentioned in Appendix 4.A, the model presented above is a case of the larger class of linear
mixed effects models. However, what distinguishes the model from other types of linear mixed
effects models is the property of handling varying dimensions of the observation matrix and
subvectors hereof under the assumed mean and covariance structure. Typically an experimental
design is set up such that $n_s$ and $n$ (as defined above) are constant over the various factors of the
experiment. In order to construct interesting and realistic experiments useful to forensic genetics,
it is not possible to fulfil such restrictions. However, by restricting the intra-locus correlations to
be positive, the EM-algorithm may be used to fit the model to data where the subvectors of the
response vary across samples.
The model extends the LR by including the quantitative information in the evidence calculations.
By evaluating $L(M|G)$ for a given pair of DNA profiles, $G$, it is possible to assess the
goodness-of-fit for a proposed pair of DNA profiles versus the observed peak intensities. However, since
the model presented above assumes intra-locus correlations, it is very time consuming and
computationally intensive to search for a pair of best matching profiles $\hat{G} = \arg\max_G L(M|G)$, since the
configurations on the various loci affect each other through the non-zero correlations.
Hence, in order to perform such a task, we need to relax some of the assumptions for fast
computation and evaluation. In the following chapter we present a statistical model and an efficient
algorithm for finding a pair of best matching profiles. The basic assumptions are similar to those
discussed above, with the difference that the peak intensities within each locus are assumed
conditionally independent. That is, by conditioning on an ancillary statistic (for the mixture ratio)
we assume that the configuration of the DNA profiles in locus $s$ is independent of configurations
in locus $t$ for all $t \neq s$.
The methodology differs from previous approaches since it is frequentist and based on a
statistical model taking the present proportionality of mean and variance of the peak intensities into
account. There are several Bayesian methods for modelling and separating DNA mixtures, e.g.
Cowell et al. (2007a,b, 2010); Cowell (2009) discussed the use of probabilistic expert systems to
model DNA mixtures using first a normal distribution (2007a-paper) and later a gamma
distribution, and also Curran (2008) took a Bayesian approach and modelled the peak intensities using a
multivariate normal distribution. However, Curran (2008) did not include the proportionality of
the mean and variance, which is an intrinsic feature of the gamma models of Cowell et al.
Earlier, Perlin and Szabady (2001) and Wang et al. (2006) used linear models to model the peak
intensities of DNA mixtures using a frequentist approach. However, their models did not take
the mentioned proportionalities of the first two moments into account, and their methods did
not allow for efficient and consistent modelling of all loci simultaneously. For example, Wang
et al. (2006) did not incorporate a common mixture ratio across loci even though there are strong
biological and biochemical arguments for this assumption. Furthermore, the method of Wang
et al. (2006) called for a reasonably large amount of manual labour in order to use the output from
their method.
CHAPTER 5
Identifying contributors of DNA mixtures by
means of quantitative information of STR typing
Publication details
Co-authors: Poul Svante Eriksen
s
, (5.1)
where $\alpha$ denotes the proportion with which person 1 contributes to the mixture, and
$C_s = I_{n_s} - n_s^{-1} 1_{n_s} 1_{n_s}^\top$ with $n_s$, $1 \le n_s \le 4$, being the number of observed peaks at locus $s$. Note that
$\alpha$ is supposed to be common to all loci. The definition of the covariance matrix is close to the
ordinary covariance when conditioning on the vector sum. However, as the variance of the peak
area is assumed proportional to the mean, we use the diagonal matrix $\mathrm{diag}(h_s)$, where $h_s$ are the
associated peak heights on locus $s$, to obtain weighted observations that stabilise the variance.
Furthermore, $\tau^2$ is a common variance parameter for all loci, $s \in S$.
The $P_{s,k}$-vector is a vector of indicators taking values 0, 1 or 2 referring to the number of copies
that person $k$ has of each allele in the mixture on locus $s$. E.g., if the two individuals contributing
to the mixture have genotypes (10, 12) and (14, 14), respectively, we will have $P_{s,1} = (1, 1, 0)^\top$
and $P_{s,2} = (0, 0, 2)^\top$ [...] and $P_{s,2} = (2, 0)^\top$ be unlikely as we
assumed person 1 to have the lowest contribution and the second area to be the larger.
The numbers of possible pairs of profiles for loci with two, three and four observations are
7, 12 and 6, respectively, when discarding the information from peak areas and only using
combinatorics. Thus, using the assumptions of the model, we decrease the number of profiles which
need to be examined in order to find the most likely profiles forming the observed mixture.
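The combinatorial counts 7, 12 and 6 can be reproduced by brute-force enumeration. The sketch below assumes that a "pair of profiles" is an ordered pair (person 1, person 2) of unordered genotypes that together show exactly the observed alleles; the function names `genotype_pairs` and `indicator` are illustrative and not part of the original work. The second function builds the indicator vectors $P_{s,k}$ for the genotype example given earlier.

```python
from itertools import combinations_with_replacement, product


def genotype_pairs(alleles):
    """Ordered (person 1, person 2) genotype pairs showing exactly `alleles`."""
    observed = sorted(alleles)
    genotypes = list(combinations_with_replacement(observed, 2))
    return [(g1, g2) for g1, g2 in product(genotypes, repeat=2)
            if set(g1) | set(g2) == set(observed)]


def indicator(genotype, alleles):
    """Number of copies (0, 1 or 2) of each observed allele in one genotype."""
    return tuple(genotype.count(a) for a in sorted(alleles))


counts = [len(genotype_pairs(range(n))) for n in (2, 3, 4)]
print(counts)  # [7, 12, 6]

observed = [10, 12, 14]
print(indicator((10, 12), observed), indicator((14, 14), observed))
# (1, 1, 0) (0, 0, 2)
```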
We assume the peak areas to be normally distributed with conditional means and covariances
as specified in (5.1). Due to the conditional independence of the loci, the overall estimates of
$\alpha$ and $\tau^2$ are found as sums over the loci. Let $W_s = C_s \mathrm{diag}(h_s) C_s^\top$, then we can write the
conditional distribution as $A_s | A_{s,+} \sim N_{n_s}(\alpha x^s_0 + x^s_1, \tau^2 W_s)$, where $x^s_0 = (P_{s,1} - P_{s,2}) A_{s,+}/2$ and
$x^s_1 = P_{s,2} A_{s,+}/2$ are the terms of the mean, linear and constant in $\alpha$, respectively. Solving the
likelihood equation with respect to $\alpha$ and $\tau^2$ yields the unbiased estimators
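$$\hat{\alpha} = \frac{\sum_{s \in S} x^{s\top}_0 W_s^{-} (A_s - x^s_1)}{\sum_{s \in S} x^{s\top}_0 W_s^{-} x^s_0} \quad \text{and} \tag{5.2}$$
$$\hat{\tau}^2 = N^{-1} \sum_{s \in S} (A_s - \hat{\alpha} x^s_0 - x^s_1)^\top W_s^{-} (A_s - \hat{\alpha} x^s_0 - x^s_1),$$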
where $N = n_+ - S - 1 = \sum_{s \in S} (n_s - 1) - 1$ and $W_s^{-}$ is the generalised inverse of $W_s$. We have
to use the generalised inverse of $W_s$ as $W_s$ has rank $n_s - 1$. An approximation to this model
assumes that the precision matrix, $\tau^{-2} W_s^{-1}$, is given by $\tau^{-2} C_s \mathrm{diag}(h_s)^{-1} C_s^\top$. Hence, we have
a closed form expression for the inverse covariance matrix yielding simple expressions for the
estimators of $\alpha$ and $\tau^2$,
_
sS
_
n
s
i=1
x
s
0,i
(A
s,i
x
s
1,i
)h
1
s,i
_
sS
_
n
s
i=1
x
s
0,i
2
h
1
s,i
and
2
= N
1
sS
n
s
i=1
(A
s,i
x
s
0,i
x
s
1,i
)
2
h
1
s,i
,
where $A_{s,i}$, $h_{s,i}$, $x^s_{0,i}$ and $x^s_{1,i}$ are the $i$th components of the respective vectors. We denote
the unbiased maximum likelihood estimates for the two models as $(\hat{\alpha}, \hat{\tau})$ and $(\tilde{\alpha}, \tilde{\tau})$, respectively.
The latter version is what is implemented in an on-line tool as discussed in Section 5.4.2.
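The closed-form estimators of the approximate model are simple enough to sketch in a few lines. The code below is only an illustration and not the on-line tool's implementation: the function name `simple_estimates`, the dictionary layout of a locus, and the noise-free toy data (constructed with a mixture proportion of 0.3) are all assumptions made for the example.

```python
def simple_estimates(loci):
    """Closed-form estimates of (alpha, tau^2) under the diagonal approximation.

    `loci` is a list of dicts with keys 'A' (peak areas), 'h' (peak heights),
    'P1' and 'P2' (allele-copy indicators of person 1 and person 2).
    Assumes at least one locus where the two profiles differ.
    """
    num = den = 0.0
    terms = []
    for locus in loci:
        total = sum(locus["A"])                    # locus sum A_{s,+}
        for a, h, p1, p2 in zip(locus["A"], locus["h"], locus["P1"], locus["P2"]):
            x0 = (p1 - p2) * total / 2.0           # part of the mean linear in alpha
            x1 = p2 * total / 2.0                  # part of the mean constant in alpha
            num += x0 * (a - x1) / h
            den += x0 * x0 / h
            terms.append((a, x0, x1, h))
    alpha = num / den
    n_plus = sum(len(locus["A"]) for locus in loci)
    N = n_plus - len(loci) - 1                     # N = n_+ - S - 1
    tau2 = sum((a - alpha * x0 - x1) ** 2 / h for a, x0, x1, h in terms) / N
    return alpha, tau2


# Hypothetical noise-free data generated with alpha = 0.3
loci = [
    {"A": [150, 150, 350, 350], "h": [15, 15, 35, 35],
     "P1": [1, 1, 0, 0], "P2": [0, 0, 1, 1]},
    {"A": [300, 300, 1400], "h": [30, 30, 140],
     "P1": [1, 1, 0], "P2": [0, 0, 2]},
]
alpha, tau2 = simple_estimates(loci)
print(round(alpha, 3), round(tau2, 6))
```

Because the toy areas equal their expected values exactly, the estimate recovers the generating proportion and the variance estimate is zero up to rounding.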
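In addition to the estimate of $\alpha$, we are also interested in determining a confidence interval for
$\alpha$. The conditional variance of $\hat{\alpha}$ given $A_+$ is found using the covariance operator on both sides
of (5.2),
$$\mathrm{Var}(\hat{\alpha} | A_+) = \frac{\mathrm{Cov}\left\{\sum_{s \in S} x^{s\top}_0 W_s^{-} (A_s - x^s_1) \,\middle|\, A_+\right\}}{\left(\sum_{s \in S} x^{s\top}_0 W_s^{-} x^s_0\right)^2}
= \frac{\sum_{s \in S} x^{s\top}_0 W_s^{-} \mathrm{Cov}(A_s | A_+) W_s^{-} x^s_0}{\left(\sum_{s \in S} x^{s\top}_0 W_s^{-} x^s_0\right)^2}
= \tau^2 \left(\sum_{s \in S} x^{s\top}_0 W_s^{-} x^s_0\right)^{-1}, \tag{5.3}$$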
where we from the first to second equality used the conditional independence of $A_s$ and $A_t$ given
$A_+$, and from the second to third properties of the covariance together with the expression of $\mathrm{Cov}(A_s | A_+)$
in (5.1). The confidence interval of $\alpha$ given $A_+$ is then given by
$$CI_\beta(\hat{\alpha}) = \hat{\alpha} \pm t_{1-\beta/2,N} \frac{\hat{\tau}}{\sqrt{\sum_{s \in S} x^{s\top}_0 W_s^{-} x^s_0}},$$
where $t_{1-\beta/2,N}$ is the critical value on significance level $\beta$ for a t-distribution with $N = n_+ - S - 1$
degrees of freedom. A similar confidence interval using the $(\tilde{\alpha}, \tilde{\tau})$-estimates is obtained by
inserting the $(\tilde{\alpha}, \tilde{\tau})$-estimates instead of $(\hat{\alpha}, \hat{\tau})$ and replacing $W^{-}$ with $W^{-1}$. From the expression
of $CI_\beta(\hat{\alpha})$, it is obvious that a small $\tau$-estimate decreases the width of the confidence interval and
thus increases the trust in the estimated mixture proportion.
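As a small numerical illustration of the interval, the sketch below evaluates the estimate plus or minus the critical value times the estimated standard error. The function name, the argument `info` (standing for the sum of the quadratic forms in the denominator) and the value `8.0e6` are hypothetical; the critical value 2.086 is the 97.5% quantile of a t-distribution with 20 degrees of freedom and is passed in by the caller rather than computed.

```python
import math


def alpha_confidence_interval(alpha_hat, tau2_hat, info, t_crit):
    """Return alpha_hat -/+ t_crit * sqrt(tau2_hat / info).

    `info` is the sum over loci of the quadratic forms x0' W^- x0
    (or x0' diag(h)^-1 x0 for the simple estimators).
    """
    half_width = t_crit * math.sqrt(tau2_hat / info)
    return alpha_hat - half_width, alpha_hat + half_width


# Hypothetical inputs: info is made up for illustration only
low, high = alpha_confidence_interval(0.43, 1134.04, info=8.0e6, t_crit=2.086)
print(round(low, 3), round(high, 3))
```

A smaller variance estimate or a larger `info` narrows the interval, which matches the remark above on the trust in the estimated mixture proportion.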
5.4.1 Greedy algorithm
This model was used in an algorithm for finding the most likely pair of profiles contributing to
an observed mixture where the STR profiles of both individuals were assumed unknown. First,
define the set $J = \{J_1, \ldots, J_4\}$, where $J_i$ is the set of plausible profiles for loci with $n_s = i$.
These sets were defined in Section 5.4 (Table 5.3). The pseudo code for a greedy algorithm
finding a pair of profiles (locally) maximising the likelihood of the model specified by (5.1) is
given in Figure 5.1. A greedy algorithm is any algorithm that solves a problem by making the
locally optimum choice at each stage with the hope of finding the global optimum. A graphical
representation of the algorithm is given in Figure 5.7 for a general number of contributors, $m$.
The algorithm works with both $(\hat{\alpha}, \hat{\tau})$ or $(\tilde{\alpha}, \tilde{\tau})$ as estimates of $(\alpha, \tau)$.
The greedy algorithm initiates by estimating $\alpha$ based on a locus $s$ with four present alleles. The
loci of $S_4$ contain full information on the mixture ratio, $\alpha$, and are thus used for assessing this
quantity. In succession, the loci with three and two ($S_3$ and $S_2$, respectively) observations are
analysed and the combination with the smallest contribution to $\tau^2$ and best concordance to the
previously determined mixture proportion is chosen. The set $T$ contains a list of the optimal
combinations on previously visited loci and is updated after each iteration. On termination, the
greedy algorithm returns the best matching pair of profiles together with the estimates of $\alpha$ and
$\tau$. The algorithm is designed to perform calculations and decisions similar to those of a forensic
geneticist when analysing a two-person mixture.
Algorithm: Find best matching pair of STR profiles.
Let $T = \emptyset$, $\hat{\alpha} = 0$ and $\hat{\tau}^2 = \infty$.
While $\hat{\tau}^2$ decreases or $T \neq S$
    For $i \in \{4, 3, 2\}$
        For $s \in S_i = \{s : s \in S \text{ and } n_s = i\}$
            Choose combination $j \in J_i$ minimising $\hat{\tau}^2$
            Set $T = \{T \setminus (s, \cdot)\} \cup (s, j)$ and compute $\hat{\alpha}$
Return $\hat{\alpha}$, $\hat{\tau}$ and $T$.
Figure 5.1: Greedy algorithm for finding a pair of profiles (locally) maximising the likelihood
of (5.1).
The optimisation problem is complicated since the inputs of the function that we are interested
in minimising depend on each other, $f(\alpha, (P_{s,1}, P_{s,2})_{s \in S}) = \sum_{s \in S} D_s$, where
$D_s = (A_s - \alpha x^s_0 - x^s_1)^\top W_s^{-} (A_s - \alpha x^s_0 - x^s_1)$. Here, $f$ denotes the objective function and $(P_{s,1}, P_{s,2})_{s \in S}$ the set of possible
combinations for all loci, $s \in S$. It is easy to see that, for a fixed $\alpha$, we can minimise $D_s$ for each
locus $s$ by choosing the combination yielding the smallest square distance. Similarly, fixing
the combinations for all loci, $\alpha$ is estimated using (5.2). However, from the construction of the
greedy algorithm, the algorithm chooses the combination that minimises $\tau^2$ for locus $s$ given $\alpha$
and the configurations on previously visited loci, $t \in \{T \setminus s\}$. This ensures locally optimal
solutions, and for most practical purposes, the algorithm returns a global maximum. One should
note that when the algorithm recovers the best matching pair of profiles, we still need to consider
all profiles close to these profiles consistent with the evidence for likelihood ratio evaluation (see
Section 5.5 for further details).
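The greedy search described above can be sketched as follows. This is a simplified illustration under several assumptions, not the on-line tool's code: the candidate sets are taken to be all combinatorially possible ordered pairs rather than the restricted sets $J_i$ of Table 5.3, the simple (diagonal) estimators are used, the labelling constraint that person 1 is the minor contributor is not enforced, and the names `candidates`, `fit` and `greedy` as well as the toy data are hypothetical.

```python
from itertools import combinations_with_replacement, product


def candidates(n):
    """All ordered pairs of allele-copy vectors (P1, P2) covering n observed alleles."""
    genotypes = list(combinations_with_replacement(range(n), 2))
    pairs = []
    for g1, g2 in product(genotypes, repeat=2):
        if set(g1) | set(g2) == set(range(n)):
            pairs.append((tuple(g1.count(a) for a in range(n)),
                          tuple(g2.count(a) for a in range(n))))
    return pairs


def fit(loci, chosen):
    """Return (alpha, weighted residual sum of squares) for the loci in `chosen`."""
    rows = []
    for s, (p1, p2) in chosen.items():
        total = sum(loci[s]["A"])
        for a, h, c1, c2 in zip(loci[s]["A"], loci[s]["h"], p1, p2):
            rows.append((a, (c1 - c2) * total / 2.0, c2 * total / 2.0, h))
    den = sum(x0 * x0 / h for _, x0, _, h in rows)
    # Assumption: alpha is not identifiable if both profiles coincide; use 0.5.
    alpha = sum(x0 * (a - x1) / h for a, x0, x1, h in rows) / den if den > 0 else 0.5
    rss = sum((a - alpha * x0 - x1) ** 2 / h for a, x0, x1, h in rows)
    return alpha, rss


def greedy(loci, max_sweeps=20):
    """Visit loci with 4, 3 and then 2 peaks, keeping the locally best combination."""
    order = [s for i in (4, 3, 2) for s in loci if len(loci[s]["A"]) == i]
    chosen, best = {}, float("inf")
    for _ in range(max_sweeps):
        for s in order:
            options = candidates(len(loci[s]["A"]))
            # For a fixed set of visited loci, minimising the residual sum of
            # squares is equivalent to minimising the variance estimate.
            chosen[s] = min(options, key=lambda j: fit(loci, {**chosen, s: j})[1])
        alpha, rss = fit(loci, chosen)
        if rss >= best:          # stop when the fit no longer improves
            break
        best = rss
    n_plus = sum(len(loci[s]["A"]) for s in chosen)
    tau2 = rss / (n_plus - len(chosen) - 1)
    return alpha, tau2, chosen


# Hypothetical noise-free two-locus mixture generated with alpha = 0.3
loci = {
    "L1": {"A": [150, 150, 350, 350], "h": [15, 15, 35, 35]},
    "L2": {"A": [300, 300, 1400], "h": [30, 30, 140]},
}
alpha, tau2, chosen = greedy(loci)
print(round(min(alpha, 1 - alpha), 3), chosen)
```

Since swapping the two persons gives the same fit with proportion one minus the estimate, the example reports the smaller of the two values.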
5.4.2 On-line implementation
The greedy algorithm of Figure 5.1 together with the methods for evaluating the goodness of
fit for a given pair of profiles are implemented in an on-line application. The on-line
implementation applies the $(\tilde{\alpha}, \tilde{\tau})$-estimates when finding the best matching pair of profiles. The
two-person (and three-person) mixture separator is available on-line at the first author's website
(http://people.math.aau.dk/tvede/dna/). The script can plot the expected and observed peak
areas for visual inspection of the fit (see Figure 5.2).
The script allows for user uploads of csv-files containing information about loci, alleles, peak
heights and peak areas. The loci implemented are those contained in the SGMPlus and Identifiler
kits (AB) excluding amelogenin.
Figure 5.2: Plot produced by the on-line implementation of the algorithm
(http://people.math.aau.dk/tvede/dna/ - sample data file Paper case). The observed
peaks are based on data from Table 5.4, and the expected peaks assume a mixture of
the best matching pair of STR profiles (Table 5.5). The observed and expected peaks coincide
for nearly all peaks.
Apart from finding the best matching pair of unknown profiles, the user can specify a suspect
profile, and the script finds the best matching unknown profile for two-person mixtures.
Example of a two-person mixture separation in a 1:1 mixture ratio
We demonstrate the algorithm and implementation on data from a controlled experiment
conducted at the Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health
Sciences, University of Copenhagen, Denmark. The data are presented in Table 5.4 together
with information on the true profiles of the mixture (denoted by separate markers for profile 1
and profile 2).
The algorithm found that the two profiles of Table 5.5 are the best matching pair of profiles.
The profiles are consistent with the true profiles of the mixture except for loci TH0 and FGA.
In Figure 5.2, we have plotted the data from Table 5.4 (solid cones) together with the best
matching pair of profiles as listed in Table 5.5.
In Figure 5.3, the traces of the parameter estimates of $\alpha$ (dashed) and $\tau^2$ (solid) are plotted for
each successive iteration with the final parameter estimates being $\hat{\alpha} = 0.43$ (95%-CI: [0.40 ;
0.45]) and $\hat{\tau}^2 = 1134.04$. Evaluating the mixture of the true profiles (as marked in
Table 5.4), the estimate of $\alpha$ is almost unchanged ($\hat{\alpha} = 0.42$), but with an increase in $\hat{\tau}^2$ to 1266.34,
indicating a slightly worse fit.
Table 5.4: Data used in demonstrating the algorithm. The two markers represent profile 1 and 2,
respectively.
Locus  Allele  Height   Area
D3     15      1802    15410
D3     16      1939    16282
vWA    14       712     6128
vWA    15       725     6620
vWA    16       626     5637
vWA    17       830     7362
D16    10       824     7910
D16    11      1772    17231
D16    12       586     6101
D2     17       434     4558
D2     19       612     6563
D2     25       843     9257
D8     8       1284    10782
D8     12      1232    10359
D8     13       903     7891
D8     16       638     5291
D21    29      1073     9454
D21    30      1469    12828
D21    31       798     6992
D18    13      1247    12302
D18    15       899     9104
D18    17       726     7549
D19    13      1332    10534
D19    14       416     3478
D19    15       504     3968
TH0    6        820     6739
TH0    8        668     5573
TH0    9        486     4004
FGA    19       490     4415
FGA    23       865     7968
FGA    24       527     5036
Table 5.5: Best matching pair of profiles for the data in Table 5.4. This pair of profiles is pictured
in Figure 5.2 as the expected peaks.
Locus  D3     vWA    D16    D2     D8     D21    D18    D19    TH0  FGA
Minor  15,16  14,16  10,12  17,25  13,16  30,30  13,13  14,15  6,6  23,23
Major  15,16  15,17  11,11  19,25  8,12   29,31  15,17  13,13  8,9  19,24
The fact that a combination different from the true one has a better fit indicates that there are
multiple explanations of the trace since it is a 1:1-mixture ($\alpha$ close to 0.5). However, the
difference in $\tau^2$-estimates for the two combinations will only have a minor influence in the evaluation
of the evidence.
Example of a two-person mixture separation in a 1:2 mixture ratio
Wang et al. (2006, Table 10) presented data from a two-person DNA mixture with known minor
(victim) and major (suspect) profiles. Curran (2008) and others have analysed these data in
order to demonstrate their models for separating two-person DNA mixtures. Using the on-line
implementation we obtained the true profiles with $\hat{\alpha} = 0.30$ (95%-CI: [0.28 ; 0.32]) and
$\hat{\tau}^2 = 124.87$.
Figure 5.3: Trace of the parameter estimates of $\alpha$ (dashed/right ordinate labels) and
$\tau^2$ (solid/left ordinate labels). The plot is produced by the on-line tool available at
http://people.math.aau.dk/tvede/dna/.
5.4.3 Dropping non-tting loci
In some cases, the stain may be contaminated, and it may be subject to drop-in or drop-out.
Drop-ins are allelic peaks present in the DNA profile not belonging to the true profiles. Drop-ins
may occur at random (contamination) or by more systematic mechanisms such as stuttering or
pull-up effects. Stutters are caused by artefacts in the polymerase chain reaction resulting in an
increase of peak intensities typically in the allelic position before the true peaks. Pull-up effects
are manifested as an increase of true peaks caused by overlap of the spectra of the light emitted
from the various fluorochromes, which are detected by a CCD camera in the data generating
process (Butler, 2005). Drop-outs are allelic peaks of the true profiles that are absent in the DNA
profile due to, e.g., low amount of DNA or degradation of the DNA. In such cases, the observed
peak heights and peak areas no longer originate solely from a two-person mixture. Hence, the
proportionalities of Section 5.1 need no longer be satisfied and the mean structure of (5.1)
may not explain the observed peak heights and peak areas in all loci.
We use an F-test approach to evaluate whether any of the included loci s S has signicant
unexpected balances due to e.g. stutters, degradation or contamination. The purpose is to return
a list of loci in which the hypothesis of a two-person mixture can be supported.
For each locus, the contribution to $\hat{\tau}^2$ is computed by $D_s$, which we assume to follow a
$\chi^2_{n_s-1}$-distribution. Hence, to test whether any locus contributes significantly to the overall variance,
$\tau^2$, we evaluate for each locus $s \in S$ the ratio
\[
\frac{(n_s-1)^{-1} D_s}{(n_+ - S - n_s - 1)^{-1} \sum_{t \in S \setminus \{s\}} D_t} \sim F_{(n_s-1),\,(n_+ - S - n_s - 1)},
\]
where $F_{(\nu_1),(\nu_2)}$ is an F-distribution with $\nu_1$ numerator and $\nu_2$ denominator degrees of freedom.
Since we perform this test for all loci, we make a Bonferroni correction to compensate for multiple
testing. We apply this procedure successively and drop the most significant locus (if any)
until no locus has a significant test value. This facility is also available in the on-line implementation.
If the variance contribution from multiple loci is large, the test value will not indicate any
significant locus, as the overall noise of the sample is large or the sample may be a mixture of more than two
individuals. This will result in large values of the overall $\hat{\tau}^2$.
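The successive test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the on-line implementation: the function names (`f_sf`, `drop_nonfitting_loci`), the significance level of 0.05 and the demo values of $D_s$ and $n_s$ are hypothetical, and the F tail probability is computed with a standard continued-fraction evaluation of the regularised incomplete beta function so that only the standard library is needed.

```python
import math

def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def _betainc(a, b, x):
    """Regularised incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_sf(f, d1, d2):
    """Upper tail probability P(F > f) for an F(d1, d2) distribution."""
    if f <= 0.0:
        return 1.0
    return _betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))

def drop_nonfitting_loci(D, n, level=0.05):
    """Drop the most significant locus repeatedly until none is significant.

    D maps locus -> D_s, n maps locus -> number of peaks n_s.
    """
    kept, dropped = list(D), []
    while len(kept) > 2:
        S = len(kept)
        n_plus = sum(n[s] for s in kept)
        total = sum(D[s] for s in kept)
        pvals = {}
        for s in kept:
            d1, d2 = n[s] - 1, n_plus - S - n[s] - 1
            rest = total - D[s]
            if d1 <= 0 or d2 <= 0 or rest <= 0:
                continue  # ratio not defined for this locus
            pvals[s] = f_sf((D[s] / d1) / (rest / d2), d1, d2)
        if not pvals:
            break
        worst = min(pvals, key=pvals.get)
        if pvals[worst] >= level / S:  # Bonferroni-corrected threshold
            break
        kept.remove(worst)
        dropped.append(worst)
    return kept, dropped

# Hypothetical demo: every locus fits as expected except D8.
n_demo = {"D3": 2, "vWA": 4, "D16": 3, "D2": 3, "D8": 4,
          "D21": 3, "D18": 3, "D19": 3, "TH0": 3, "FGA": 4}
D_demo = {s: float(k - 1) for s, k in n_demo.items()}
D_demo["D8"] = 600.0
kept_demo, dropped_demo = drop_nonfitting_loci(D_demo, n_demo)
print(dropped_demo)
```

In the demo only the inflated locus is removed; after it is dropped, the remaining ratios are all close to one and the loop stops.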
5.5 Likelihood ratio
Let G be the DNA profile of the crime stain, and $G_S$ and $G_{U_i}$ the profiles of the suspect and
unknown contributor i, respectively. Furthermore, the evidence, E, consists of both quantitative
information (peak heights and areas), Q, and the genetic crime stain (allelic information), G.
The probability P(E|H) factorises as P(Q, G|H) = P(Q|G, H)P(G|H) using the definition of conditional
probabilities. Since Q is a continuous stochastic variable, we use the likelihood of our
model, $L(A|G_1, G_2) = \prod_{s \in S} \{|W_s|^{1/2} \exp(-\tfrac{1}{2} D_s)\}$, to evaluate P(Q|G, H), where the hypothesis
H involves profiles $G_1$ and $G_2$.
Let $C_p = \{G_U : (G_S, G_U) \text{ consistent with } G\}$ be the set of unknown profiles that together with $G_S$ are consistent
with G, then $P(G|G_S, G_U) = 1$ for $G_U \in C_p$ and 0 otherwise, i.e. $C_p$ is the set of possible
unknowns under $H_p$. Similarly, let $C_d = \{(G_{U_1}, G_{U_2}) : (G_{U_1}, G_{U_2}) \text{ consistent with } G\}$ be the set of two unknown
profiles consistent with G, i.e. possible pairs of profiles under $H_d$. This partitioning of the set
of profiles is equivalent to Assumption 2 in Evett et al. (1998), where the authors argue that the
only genotype configurations of interest are those profiles (G, G, G
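Peaks  Alleles  Profiles   Expected peak areas
2      a^3 b    (aa, ab)   $(1+\alpha,\ 1-\alpha)\, A_{s,+}/2$
                (ab, aa)   $(2-\alpha,\ \alpha)\, A_{s,+}/2$
       a^2 b^2  (aa, bb)   $(2\alpha,\ 2(1-\alpha))\, A_{s,+}/2$
                (ab, ab)   $(1,\ 1)\, A_{s,+}/2$
3      a^2 bc   (aa, bc)   $(2\alpha,\ 1-\alpha,\ 1-\alpha)\, A_{s,+}/2$
                (ab, ac)   $(1,\ \alpha,\ 1-\alpha)\, A_{s,+}/2$
                (bc, aa)   $(2(1-\alpha),\ \alpha,\ \alpha)\, A_{s,+}/2$
4      abcd     (ab, cd)   $(\alpha,\ \alpha,\ 1-\alpha,\ 1-\alpha)\, A_{s,+}/2$
                (ac, bd)   $(\alpha,\ 1-\alpha,\ \alpha,\ 1-\alpha)\, A_{s,+}/2$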
In some cases, the value of $L(A|G_S, G_V)$ may be very much lower than the likelihood value
for the pair of best matching profiles. This indicates that it is inappropriate to assume that the
evidence is a mixture of $G_S$ and $G_V$, even though the profiles $(G_S, G_V)$ are consistent with G.
The sums involved in the evaluation of the likelihood ratio will often involve an intractable number
of terms, depending on the number of loci and the number of observed peaks in each locus. As
the inclusion of all possible combinations is infeasible, we need at least to include the combinations
with a numerical impact on the likelihood ratio for the approximation of the true likelihood ratio
to be satisfactory for forensic use.
The best matching pair of profiles will provide an estimate, $\hat{\alpha}$, of the mixture proportion $\alpha$. The
expected peak areas in Table 5.6 (expressed in terms of $\alpha$) indicate that alternative combinations
need to have an $\alpha$-estimate close to the estimate of the best matching pair in order to have a
reasonable fit. We exploit this result when defining our proposal distribution in the section on
importance sampling.
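As a small illustration of how the expected peak areas depend on $\alpha$, the sketch below evaluates $A_{s,+}/2 \cdot \{\alpha P_{s,1} + (1-\alpha) P_{s,2}\}$ for one locus, with $P_{s,i}$ taken as the allele counts of profile i. The function name and the numerical values are hypothetical and only serve as an example.

```python
from collections import Counter

def expected_peak_areas(profile_1, profile_2, alpha, area_sum):
    """Expected peak areas on one locus for a two-person mixture.

    profile_1 contributes the proportion alpha, profile_2 the rest;
    area_sum plays the role of A_{s,+}, the sum of the observed areas.
    """
    p1, p2 = Counter(profile_1), Counter(profile_2)
    alleles = sorted(set(p1) | set(p2))
    return {a: area_sum / 2.0 * (alpha * p1[a] + (1.0 - alpha) * p2[a])
            for a in alleles}

# Two candidate explanations of the same two peaks (hypothetical numbers).
first = expected_peak_areas(("a", "a"), ("a", "b"), alpha=0.3, area_sum=1000.0)
second = expected_peak_areas(("a", "b"), ("a", "a"), alpha=0.3, area_sum=1000.0)
print(first)
print(second)
```

With the same value of $\alpha$, the two orderings of the profiles predict clearly different peak balances, which is why a combination can only fit well if its own $\alpha$-estimate is compatible with the observed areas.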
5.6 Importance sampling of the likelihood ratio
An exact assessment of the weight of evidence comprises evaluation of every term of the numerator
and denominator of (5.4). However, this is infeasible and other methods of evaluating the
evidence need to be considered. In this section, we show how importance sampling can be used
for estimation of the weight of evidence by assigning weights to the individual combinations.
Maimon (2010) also considered importance sampling in a Bayesian context for modelling DNA
mixtures.
Let $C_d = \{(G_{U_1}, G_{U_2}) : (G_{U_1}, G_{U_2}) \text{ consistent with } G\}$, and let $\mathcal{G}$ denote the corresponding set of
profile combinations over all loci. The expression of $P(E|H_d)$ can be interpreted as an expectation, with
$h(E) = L(A|G)$, with respect to the probability measure P on $\mathcal{G}$:
\[
P(E|H_d) = \sum_{G \in C_d} L(A|G) P(G) = E(h(E); P). \tag{5.5}
\]
Hence, simulating combinations G from $\mathcal{G}$ with respect to P may be used to estimate $P(E|H_d)$.
However, simulation with respect to P does not take the quantitative evidence, Q, into account
and will thus yield a poor estimate of $P(E|H_d)$ due to the possibly larger numerical impact of
$L(A|G)$ compared to $P(G)$ in (5.4). To handle this, we use importance sampling based on the
marginal likelihood values of each combination.
Let $q(G) = \prod_{s \in S} q_s(G_s)$, where $G_s$ is the pair of profiles on locus s and
\[
q_s(G_s) = \frac{L(A|G_s, \hat{G}_{-s}) P(G_s)}{\sum_{i=1}^{N_s} L(A|G_{s,i}, \hat{G}_{-s}) P(G_{s,i})}, \tag{5.6}
\]
where $N_s$ is the number of combinations for the observed number of alleles, $(G_s, \hat{G}_{-s})$ is the
particular combination on locus s merged with the best matching combination, $\hat{G}$, in the remaining
loci, $t \in S \setminus \{s\}$, and the sum in the denominator is over all $N_s$ possible combinations in locus s,
each merged with the best matching combination in the remaining loci. Hence, $L(A|G_s, \hat{G}_{-s})$ is called
the marginal likelihood, as it gives the likelihood for the particular combination on locus s with
the combinations on the remaining loci identical to the best matching pair of profiles. Furthermore,
the denominator of (5.6) is a constant, $B_s$, for each locus. Using this proposal distribution,
$P(E|H_d)$ may be expressed as an expectation with respect to q,
\[
P(E|H_d) = \sum_{G \in C_d} L(A|G) \frac{P(G)}{q(G)} q(G) = E(h(E)W(E); q),
\]
where $W(E) = P(G)/q(G)$ is the importance weight. Since $P(G) = \prod_{s \in S} P(G_s)$ and $B = \prod_{s \in S} B_s$,
the ratio $L(A|G)P(G)/q(G)$ is nearly constant:
\[
\frac{L(A|G) P(G)}{\prod_{s \in S} \{L(A|G_s, \hat{G}_{-s}) P(G_s)\}} \prod_{s \in S} B_s
= \frac{L(A|G)\, B}{\prod_{s \in S} L(A|G_s, \hat{G}_{-s})},
\]
where the product in the denominator in many cases is a good approximation to $L(A|G)$. This
near-constancy of $h(E)W(E)$ improves the performance of importance sampling and reduces the
number of samples needed for results with low variance (Robert and Casella, 2004).
In order to estimate $P(E|H_d)$, we draw combinations $G_i$, $i = 1, \dots, M$, from $q(G)$ and compute
the Monte Carlo estimate,
\[
\hat{P}(E|H_d) = \frac{1}{M} \sum_{i=1}^{M} L(A|G_i) W(G_i), \qquad G_i \sim q(G),
\]
where $W(G_i) = P(G_i)/q(G_i)$ are the importance weights.
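The estimator can be tried out on a toy problem that is small enough to sum exactly. The sketch below is not the thesis implementation: the three "loci", their prior probabilities, the per-locus fit terms and the mild cross-locus interaction in `likelihood` are all hypothetical stand-ins for $P(G_s)$ and $L(A|G)$. The proposal is built per locus from the marginal likelihood with the other loci fixed at the best matching combination, in the spirit of (5.6).

```python
import itertools
import math
import random

# Hypothetical toy inputs: prior[s][g] stands in for P(G_s), fit[s][g] for a
# per-locus fit term, and best[s] for the best matching combination on locus s.
prior = [[0.5, 0.3, 0.2], [0.25, 0.25, 0.25, 0.25], [0.6, 0.4]]
fit = [[1.0, 0.05, 0.01], [0.02, 0.9, 0.1, 0.03], [0.8, 0.2]]
best = [0, 1, 0]

def likelihood(G):
    """Toy L(A|G): product of per-locus terms times a mild interaction."""
    mismatches = sum(g != b for g, b in zip(G, best))
    return math.prod(fit[s][g] for s, g in enumerate(G)) * (1.0 + 0.02 * mismatches)

def prior_prob(G):
    return math.prod(prior[s][g] for s, g in enumerate(G))

def proposal_for_locus(s):
    """q_s: marginal likelihood times prior, normalised by the constant B_s."""
    terms = []
    for g in range(len(prior[s])):
        G = list(best)
        G[s] = g
        terms.append(likelihood(G) * prior[s][g])
    B_s = sum(terms)
    return [t / B_s for t in terms]

q = [proposal_for_locus(s) for s in range(len(prior))]

def importance_estimate(M, rng):
    """Average of L(A|G_i) P(G_i) / q(G_i) over M draws G_i from q."""
    total = 0.0
    for _ in range(M):
        G = [rng.choices(range(len(q_s)), weights=q_s)[0] for q_s in q]
        q_G = math.prod(q[s][g] for s, g in enumerate(G))
        total += likelihood(G) * prior_prob(G) / q_G
    return total / M

# Exact value by full enumeration (feasible only because the toy is tiny).
exact = sum(likelihood(G) * prior_prob(G)
            for G in itertools.product(*(range(len(p)) for p in prior)))
approx = importance_estimate(5000, random.Random(1))
print(exact, approx)
```

Because the toy likelihood is close to a product of per-locus terms, the sampled terms are almost constant and the estimate agrees closely with the enumerated sum.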
The estimate is unbiased as the terms are independently simulated from $q(G)$ and all have expectation
$E(h(E)W(E); q) = P(E|H_d)$. For the variance of $\hat{P}(E|H_d)$, we compute
\[
\mathrm{Var}(\hat{P}(E|H_d)) = \frac{1}{M-1} \sum_{i=1}^{M} \left\{ [L(A|G_i) W(G_i)]^2 - \hat{P}(E|H_d)^2 \right\}.
\]
The numerator of LR, $P(E|H_p)$, can be handled similarly, taking into consideration that we are
summing over a restricted set of combinations, $C_p$, all including the suspect's profile,
$C_p = \{G_U : (G_S, G_U) \text{ consistent with } G\}$. The greedy algorithm of Figure 5.1 is also applicable when specifying
a suspect. We only need another ordering of the observations and a different set of J-matrices
using the extra information of the suspect's profile. This implies that there exists a best matching
combination, $\hat{G}^{(S)}$, in $C_p$ having the same properties as $\hat{G}$ for the unrestricted set, $C_d$. Hence,
importance sampling may also be used in estimating $P(E|H_p)$ with similar formulae as those for
estimating $P(E|H_d)$.
5.6.1 Example of estimating LR using importance sampling
The best matching pair of profiles for the data in Table 5.4 was given in Table 5.5 and was used
for estimating $q(G)$ and the constant B. In the computations, we assumed uniform distributions
of the allele probabilities. Table 5.7 lists the profile of a fictive suspect, $G_S$, together with the
unknown profile maximising the likelihood with $G_S$ fixed. This pair plays the role of $\hat{G}^{(S)}$ in
this example. In Figure 5.4 the observed and expected peak heights, assuming a mixture
of these profiles, are plotted.
Table 5.7: Suspect's STR profile together with the best matching STR profile of an unknown person.
Locus D3 vWA D16 D2 D8 D21 D18 D19 TH0 FGA
Suspect 16,16 15,17 11,11 25,25 13,16 30,30 15,17 15,15 8,9 24,24
Unknown 15,15 14,16 10,12 17,19 8,12 29,31 13,13 13,14 6,6 19,23
In order to verify the validity of our methodology and implementation of the importance sampler,
we limited our data to include only loci on the blue fluorescent dye band (D3, vWA, D16
and D2). The total number of possible combinations for the blue loci is $7^1 \cdot 12^2 \cdot 6^1 = 6{,}048$ and
it is therefore possible to compute the correct value of $P(E|H_d) = 0.481335 \times 10^{-10}$. For the
suspect's profile specified in Table 5.7, locus D3 is the only blue locus for which it is possible
to alter the unknown profile and still have consistency with G. Hence, there are only
two terms in $P(E|H_p)$ when restricting the analysis to the blue dye band. The value is
$P(E|H_p) = 0.225730 \times 10^{-13}$, indicating that the suspect is not likely to be a true contributor to the
DNA mixture since $P(E|H_p) < P(E|H_d)$.
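The counts of 7, 12 and 6 combinations for loci with two, three and four observed alleles can be checked by brute-force enumeration. The sketch below treats a combination as an ordered pair of genotypes whose alleles together are exactly the observed ones; the function names are hypothetical, and the numbers of peaks per blue locus are read off the profiles in Table 5.5.

```python
from itertools import combinations_with_replacement, product

def consistent_pairs(alleles):
    """Ordered genotype pairs whose combined alleles equal the observed set."""
    target = set(alleles)
    genotypes = list(combinations_with_replacement(sorted(target), 2))
    return [(g1, g2) for g1, g2 in product(genotypes, repeat=2)
            if set(g1) | set(g2) == target]

def unknowns_given(suspect, alleles):
    """Genotypes of one unknown that, with the suspect, explain all alleles."""
    target = set(alleles)
    return [g for g in combinations_with_replacement(sorted(target), 2)
            if set(g) | set(suspect) == target]

# Number of combinations per number of observed alleles.
counts = {k: len(consistent_pairs(range(k))) for k in (2, 3, 4)}

# Observed alleles on the blue loci: D3 (2), vWA (4), D16 (3), D2 (3).
blue_loci_peaks = {"D3": 2, "vWA": 4, "D16": 3, "D2": 3}
total = 1
for k in blue_loci_peaks.values():
    total *= counts[k]

# Under H_p with the suspect of Table 5.7, only D3 leaves a choice.
d3_unknowns = unknowns_given((16, 16), {15, 16})
print(counts, total, d3_unknowns)
```

The enumeration reproduces the 6,048 terms under $H_d$ and the two possible unknown profiles on D3 under $H_p$.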
In order to evaluate the performance of the importance sampler, we computed 1,000 estimates of
$P(E|H_d)$, each based on 10,000 samples. The estimates are plotted together with the correct value
in the histogram of Figure 5.5. The distribution of the estimates tends to be skewed for this particular
example, but with most of the estimates close to the true value of $P(E|H_d)$. The mean of the
estimates, $\hat{P}(E|H_d)$, is $0.483731 \times 10^{-10}$ with a standard deviation of $0.184432 \times 10^{-10}$. From the
central limit theorem we may approximate the (positive) distribution of $\hat{P}(E|H_d)$ with a normal
distribution and compute an approximate 95% confidence interval: $[0.122243, 0.8452178] \times 10^{-10}$.
Figure 5.4: Plot of the observed peaks and the expected peaks, assuming a mixture of the
suspect and best matching unknown (STR profiles of Table 5.7).
In forensic genetics it is common practice to evaluate the evidence conservatively, meaning
that the estimates and approximations are favourable to the suspect/defendant (Balding,
2005). For a conservative LR the estimate of the denominator should be larger than the true value,
$\hat{P}(E|H_d) > P(E|H_d)$. However, 66% of the importance sampling estimates are smaller than the true
value for this particular example. A likely explanation for this is that the sampling scheme places
too much of the probability mass close to the best matching pair of profiles. Hence, the (very)
large set of less likely combinations is not included in the estimate.
5.7 Results
The algorithm was tested on data from 71 controlled two-person mixtures with known profiles.
Hence, it was possible to validate the suggested profiles returned by the separation algorithm.
Table 5.8 summarises the comparisons between the best matching pair of profiles and the true mixture
profiles.
Figure 5.5: Histogram of 1,000 estimates of $P(E|H_d)$, each based on 10,000 samples (horizontal axis:
estimates; vertical axis: frequency; the correct value of $P(E|H_d)$ is marked).
Table 5.8: Detailed summary table with the number of correctly separated loci, x, stratified by
mixture ratio.
        Cases with both profiles    Cases with major profile    Cases with minor profile
        correct in x of 10 loci     correct in x of 10 loci     correct in x of 10 loci
Ratio    3  4  5  6  7  8  9 10      3  4  5  6  7  8  9 10      3  4  5  6  7  8  9 10
1:1      1  3  0  2  2  2  1  0      1  2  1  2  2  2  1  0      1  3  0  2  2  2  1  0
1:2      0  0  0  0  2  5  7  8      0  0  0  0  0  2  8 12      0  0  0  0  2  4  8  8
1:4      0  0  0  1  3  2  6  7      0  0  0  0  0  0  0 19      0  0  0  1  3  2  6  7
1:8      0  0  0  2  5  4  0  4      0  0  0  0  0  0  0 15      0  0  0  2  5  4  0  4
1:16     0  0  1  0  0  0  0  3      0  0  0  0  0  0  0  4      0  0  1  0  0  0  0  3
Total    1  3  1  5 12 13 14 22      1  2  1  2  2  4  9 50      1  3  1  5 12 12 15 22
From the bottom row of Table 5.8, we see that the separation algorithm returned the true mixture
profiles as the best matching combination 22 times. The numbers of cases where one (14 cases),
two (13 cases) or three (12 cases) loci were wrongly separated were almost the same. In five
cases, half or fewer of the loci were correctly separated.
In 50 cases, the true major profile was correctly identified, and in another 13 there was inconsistency
in at most two loci between the major profile of the best matching pair and the true major
profile. Furthermore, Table 5.8 shows that the eight remaining cases with incorrect identification
of the major profile had mixture ratio 1:1. Hence, in these cases, there was no obvious major
profile as the amounts of DNA contributed were (almost) equal. Furthermore, for 1:1 mixtures,
there are many pairs of profiles yielding similar goodness of fit to the observed peak intensities,
as previously exemplified in Section 5.4.2. The algorithm is less successful in identification
of the minor profile. However, in most cases, the minor profile was separated correctly in
seven or more loci.
In addition to the 71 DNA mixtures from controlled experiments, the separation algorithm was
used to separate 64 two-person DNA mixtures from real crime cases. For each of the 64 crime
cases the laboratory had two reference samples that were consistent with the observed stain.
Three experienced forensic geneticists tried to identify both the major and minor profiles of
each mixture without knowing the true profiles (blinded experiment).
In Table 5.9, the results from the separation using the separation algorithm are compared to those
of the forensic geneticists.
Table 5.9: Comparison of the performance of the separation algorithm and forensic geneticists.
The counts show the number of mixtures by the number of loci in which the minor and major
profiles were correctly identified.
               Geneticists       Algorithm
Correct loci   Minor   Major     Minor   Major
10                 8      31        16      36
9                 16       8        16       9
8                 13       8        14       5
7                 13       4        10       6
6                  6       7         1       2
5                  3       2         3       2
4                  2       2         2       2
3                  2       1         2       2
2                  0       0         0       0
1                  1       1         0       0
The total number of correctly separated mixtures was 16 for the separation algorithm and 8 for
the forensic geneticists. The samples where the minor contributor was correctly identified in all
loci also had the major component correct (see Table 5.9). As for the controlled experiments,
the success rate depended on the mixture ratio, with the number of correctly separated loci
decreasing as $\alpha$ increased towards 0.5.
Furthermore, it should be noted that the forensic geneticists were forced to call some pairs of
profiles, resulting in some inconclusive statements. That is, the forensic geneticists were forced
to deduce major and minor profiles in cases where the regular protocol of the laboratory would
not support the separation of profiles.
5.8 Discussion
Using the quantitative information from STR DNA analysis in terms of peak intensities is
presently the only way to separate STR mixture results. Based on a statistical model, we developed
a simple greedy algorithm for finding the best matching pair of profiles.
Our model is based on few assumptions that are widely accepted among forensic geneticists.
The statistical model made it possible to make objective comparisons of various combinations
by evaluating the likelihood values. From the normal distribution assumption, this value is computed
by $\hat{\tau}_N$, which implies that the lower the estimate, the better the concordance between observed
and expected peaks.
Importance sampling was used in order to estimate the likelihood ratio since this becomes
computationally difficult when $7^{S_2} 12^{S_3} 6^{S_4}$ terms need to be evaluated in the denominator of the LR
with $H_d : (G_{U_1}, G_{U_2})$. The method proved to be efficient, and future work will consist of
implementation of sampling schemes that explore more of the sample space. This implementation
would ideally result in fewer estimates that are less than the true value.
5.9 Conclusion
By using the greedy algorithm of Section 5.4.1, we demonstrated that it is possible to automate
the separation of DNA mixtures. However, due to the assumption of no occurrence of drop-out
or stutters, the model may be too simple for more complicated cases. Hence, this methodology
is applicable to cases where the analysis today is standard but time-consuming. This allows the
forensic geneticists to focus on more complex crime cases.
Future work comprises the development of a methodology for handling drop-outs and stutters.
Since stutters are profile independent (stutters from parental peaks are constant for all alleged
combinations of profiles), it is possible to remove stutters from the data prior to separation and
interpretation. Allowing for drop-outs while finding a best matching pair of profiles is also possible.
Using the methodology of Tvedebrink et al. (2009), the probability of drop-out, $P(D|\hat{H})$,
is assessed conditioned on a given profile.
Acknowledgements
The authors would like to thank Dr. Jakob Larsen and Dr. Frederik Torp Petersen (both from
The Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health Sci-
ences, University of Copenhagen, Denmark) for their assistance in manually analysing the DNA
mixtures from real crime cases.
Appendix
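5.A The general case with m contributors
In the general case with m contributors to the mixed stain, our method can be generalised by
assuming the mixture proportions $\alpha_1, \dots, \alpha_m$ to be strictly increasing,
\[
\alpha_1 < \dots < \alpha_{m-1} < \alpha_m = 1 - \alpha_+, \qquad \alpha_+ = \sum_{i=1}^{m-1} \alpha_i. \tag{5.7}
\]
The conditional covariance structure is the same as specified in (5.1), where the conditional mean
is:
\[
E(A_s | A_{s,+}) = \frac{A_{s,+}}{2} \left\{ \sum_{i=1}^{m-1} \alpha_i P_{s,i} + P_{s,m} \left( 1 - \sum_{i=1}^{m-1} \alpha_i \right) \right\}
= X_{s,m} + \sum_{i=1}^{m-1} \alpha_i X_{s,i}, \tag{5.8}
\]
where $X_{s,i} = (P_{s,i} - P_{s,m}) A_{s,+}/2$ for $i = 1, \dots, m-1$ and $X_{s,m} = P_{s,m} A_{s,+}/2$. In order to find the
MLE of $\alpha = (\alpha_i)_{i=1}^{m-1}$, we solve the likelihood equations for $\ell(\alpha, \tau^2; (A_s)_{s \in S})$ with respect to $\alpha$.
This implies that the MLE of $\alpha$ is:
\[
\hat{\alpha} = \left( \sum_{s \in S} X_s^\top W_s X_s \right)^{-1} \left( \sum_{s \in S} X_s^\top W_s (A_s - X_{s,m}) \right),
\]
where $X_s$ is the matrix with columns $X_{s,1}, \dots, X_{s,m-1}$.
Furthermore, the estimate of $\tau^2$ in the general setting is
\[
\hat{\tau}^2 = N^{-1} \sum_{s \in S} (A_s - X_s \hat{\alpha} - X_{s,m})^\top W_s (A_s - X_s \hat{\alpha} - X_{s,m}),
\]
where $N = n_+ - S - m + 1 = \sum_{s \in S} (n_s - 1) - (m - 1)$.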
5.A.1 Greedy algorithm
The greedy algorithm of Figure 5.1 needs only a few modifications to be applicable to the general
case. Most important is the specification of the number of contributors, m. This needs to be
decided before running the algorithm. For the algorithm to be successful, there should preferably
be at least one locus with 2m peaks, as this increases the confidence in the estimate $\hat{\alpha}$. The
modified greedy algorithm for m contributors to a DNA mixture is given in Figure 5.6.
Furthermore, it is necessary to check whether the $\alpha$-estimate satisfies the inequalities of (5.7) for each
combination. In Table 5.10, we list fictive data together with two combinations both implying
a perfect fit. Both matrices are valid as the orders in the $\alpha$-sum columns satisfy the condition
(5.7). However, for Combination 1, the estimate $\hat{\alpha}_1 = (0.2, 0.45)$ does not satisfy (5.7), while
the estimate for Combination 2, $\hat{\alpha}_2 = (0.2, 0.35)$, does. Hence, Combination 2 is chosen over
Combination 1.
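The two estimates quoted for Table 5.10 can be reproduced with a short least-squares sketch for m = 3 contributors and a single locus. Two simplifications are assumptions of the sketch rather than part of the model: the weight matrix $W_s$ is replaced by the identity (harmless here because both combinations fit perfectly), and the 2 x 2 normal equations are solved by Cramer's rule. The function names are hypothetical.

```python
def estimate_alpha(areas, P):
    """Least-squares estimate of (alpha_1, alpha_2) for three contributors.

    areas: observed peak areas on one locus.
    P: allele-count vectors (P_1, P_2, P_3) of the three profiles.
    Identity weights are used in place of W_s (an assumption of this sketch).
    """
    half = sum(areas) / 2.0  # A_{s,+} / 2
    x1 = [(p1 - p3) * half for p1, p3 in zip(P[0], P[2])]  # X_{s,1}
    x2 = [(p2 - p3) * half for p2, p3 in zip(P[1], P[2])]  # X_{s,2}
    y = [a - p3 * half for a, p3 in zip(areas, P[2])]      # A_s - X_{s,3}
    a11 = sum(u * u for u in x1)
    a12 = sum(u * v for u, v in zip(x1, x2))
    a22 = sum(v * v for v in x2)
    b1 = sum(u * w for u, w in zip(x1, y))
    b2 = sum(v * w for v, w in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def satisfies_ordering(alpha):
    """Check alpha_1 < alpha_2 < alpha_3 with alpha_3 = 1 - alpha_+, cf. (5.7)."""
    full = list(alpha) + [1.0 - sum(alpha)]
    return all(lo < hi for lo, hi in zip(full, full[1:]))

areas = [200.0, 450.0, 550.0, 800.0]
combination_1 = ([1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1])
combination_2 = ([1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1])

alpha_1 = estimate_alpha(areas, combination_1)
alpha_2 = estimate_alpha(areas, combination_2)
print(alpha_1, satisfies_ordering(alpha_1))
print(alpha_2, satisfies_ordering(alpha_2))
```

For Combination 1 the implied third proportion is 0.35, which is smaller than the second, so the ordering fails; for Combination 2 the proportions are 0.2, 0.35 and 0.45, which satisfy it.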
Algorithm: Find best matching set of m profiles
Specify the number m of contributors.
Let $T = \emptyset$, $\hat{\alpha} = 0$ and $\hat{\tau}^2 = \infty$.
While $\hat{\tau}^2$ decreases or $T \neq S$
    For $i \in \{2m, \dots, 2\}$
        For $s \in S_i = \{s : s \in S \text{ and } n_s = i\}$
            Choose combination $j \in J_i$ minimising $\hat{\tau}^2$ and satisfying the restrictions of (5.7)
            Set $T = \{T \setminus (s, \cdot)\} \cup (s, j)$ and compute $\hat{\alpha}$
Return $\hat{\alpha}$, $\hat{\tau}^2$ and T.
Figure 5.6: Greedy algorithm for finding a set of profiles (locally) maximising the likelihood of
(5.8).
Table 5.10: Fictive data showing the importance of ensuring (5.7) is satisfied.
        Combination 1                          Combination 2
Area    P_1  P_2  P_3  $\alpha$-sum            P_1  P_2  P_3  $\alpha$-sum
200     1    0    0    $\alpha_1$              1    0    0    $\alpha_1$
450     0    1    0    $\alpha_2$              0    0    1    $1 - \alpha_+$
550     1    0    1    $1 - \alpha_2$          1    1    0    $\alpha_1 + \alpha_2$
800     0    1    1    $1 - \alpha_1$          0    1    1    $1 - \alpha_1$
In Figure 5.7, the greedy algorithm of Figure 5.6 is described by a diagram emphasising the
various steps in the procedure of finding the best matching combination.
In step A, the parameters $\alpha$ and $\tau^2$ are estimated using only the loci with 2m observed peaks. Step
B determines the profile combination (see step D) on the current locus that minimises $\hat{\tau}^2$ given
the combinations on the already visited loci. The algorithm visits the blocks of loci with equal
numbers of observed alleles in decreasing order: $2m-1, \dots, 2$. If any of the blocks is empty, the
algorithm skips forward to the next nonempty block. The order within each block of loci with
$2m-i$ observed peaks is arbitrary. When reaching the last locus, the combination and estimates
of $\alpha$ and $\tau^2$ are saved.
In step C, the algorithm visits each locus searching for a combination that might decrease $\hat{\tau}^2$ with
all remaining loci combinations fixed. If $\hat{\tau}^2$ is unchanged, the algorithm stops. Otherwise step C
is looped until a fixed $\hat{\tau}^2$-value is obtained. On termination the algorithm returns the combination
and estimates of $\alpha$ and $\tau^2$.
Step D pictures that, for each locus with fewer than 2m peaks, there are several combinations of
profiles that need to be investigated. In the figure, each marker depicts a combination and a distinct
marker symbolises the current optimal configuration. The black arrow shows which combination is currently tested.
When all the combinations are tested, the one with the smallest $\hat{\tau}^2$ is returned.
Figure 5.7: Diagram describing the greedy algorithm for resolving DNA mixtures (panels A to D, with
loci grouped in blocks of $2m$, $2m-1$, ..., $2m-i$, ..., 2 observed peaks). The shaded
boxes show the loci previously visited by the algorithm. The bold lined box shows the current
locus under investigation.
Bibliography
Balding, D. J. (2005). Weight-of-evidence for Forensic DNA Profiles. Chichester, West Sussex:
John Wiley & Sons, Ltd.
Bill, M. et al. (2005). PENDULUM - a guideline-based approach to the interpretation of STR
mixtures. Forensic Science International 148, 181-189.
Buckleton, J. S., C. M. Triggs, and S. J. Walsh (2005). Forensic DNA evidence interpretation,
pp. 217-274. Boca Raton, FL: CRC Press.
Butler, J. M. (2005). Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers
(2 ed.). Burlington, MA: Elsevier Academic Press Inc., U.S.
Clayton, T. M., J. P. Whitaker, R. Sparkes, and P. D. Gill (1998). Analysis and interpretation of
mixed forensic stains using DNA STR profiling. Forensic Science International 91, 55-70.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2007a). A gamma model for DNA mixture
analyses. Bayesian Analysis 2(2), 333-348.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2007b). Identification and separation of DNA
mixtures using peak area information. Forensic Science International 166, 28-34.
Cox, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical
Statistics 29(2), 357-372.
Curran, J. M. (2008). A MCMC method for resolving two person mixtures. Science & Justice 48,
168-177.
Evett, I. W., P. D. Gill, and J. A. Lambert (1998). Taking account of peak areas when interpreting
mixed DNA profiles. Journal of Forensic Sciences 43(1), 62-69.
Evett, I. W. and B. S. Weir (1998). Interpreting DNA Evidence: Statistical Genetics for Forensic
Scientists. Sunderland, MA: Sinauer Associates.
Gill, P. D. et al. (1998). Interpreting simple STR mixtures using allele peak areas. Forensic
Science International 91(1), 41-53.
Gill, P. D. et al. (2006). DNA commission of the International Society of Forensic Genetics:
Recommendations on the interpretation of mixtures. Forensic Science International 160(2-3),
90-101.
Maimon, G. (2010). A Bayesian approach to the statistical interpretation of DNA evidence.
Ph.D. thesis, Department of Mathematics and Statistics, McGill University, Montreal, Canada.
Nichols, R. A. and D. J. Balding (1991). Effects of population structure on DNA fingerprint
analysis in forensic science. Heredity 66, 297-302.
Perlin, M. W. and B. Szabady (2001). Linear mixture analysis: A mathematical approach to
resolving mixed DNA samples. Journal of Forensic Science 46(6), 1372-1378.
Robert, C. P. and G. Casella (2004). Monte Carlo Statistical Methods (2 ed.). Springer.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2009). Estimating the probability
of allelic drop-out of STR alleles in forensic genetics. Forensic Science International:
Genetics 3(4), 222-226.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010). Evaluating the weight
of evidence using quantitative STR data in DNA mixtures. Journal of the Royal Statistical
Society. Series C, Applied Statistics. In press.
Wang, T., N. Xue, and J. D. Birdwell (2006). Least-square deconvolution: A framework for
interpreting short tandem repeat mixtures. Journal of Forensic Science 51(6), 1284-1297.
5.10 Supplementary remarks
In the above manuscript only two-person mixtures were analysed in practice, but the appendix
demonstrated how to extend the model and algorithm to handle m-person mixtures. The Section
of Forensic Genetics, University of Copenhagen, also prepared three-person mixtures. Five
different DNA profiles were mixed in trios in the mixture ratio 1:2:4. The five DNA profiles
are listed in Table 5.11. There are $\binom{5}{3} = 10$ different triple-wise combinations and each triple
is analysed in six different mixture ratios (permutations of the three profiles). This gives 120
samples since each case is analysed in duplicate. However, 17 samples were discarded due to
pipette and amplification errors, leaving 103 samples to be analysed.
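The sample count can be retraced with a few lines of Python; the variable names are only illustrative, and the design is enumerated as every assignment of three of the five profiles to the ratio positions 1:2:4.

```python
from itertools import combinations, permutations
from math import comb

profiles = "ABCDE"
triples = list(combinations(profiles, 3))                 # unordered trios
designs = [p for t in triples for p in permutations(t)]   # order = ratio 1:2:4
planned = 2 * len(designs)                                # duplicates
analysed = planned - 17                                   # discarded samples
print(len(triples), len(designs), planned, analysed)
```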
Table 5.11: The five DNA profiles used in the three-person mixtures.
D3 vWA D16 D2 D8 D21 D18 D19 TH0 FGA
A 14,18 17,19 12,14 20,24 10,13 30.2,32.2 13,13 12,13 8,9 20,22
B 17,18 14,17 9,9 17,23 13,15 28,28 14,19 14,15.2 8,9.3 20,24
C 16,18 16,19 10,13 16,23 11,14 31,32.2 15,19 12,15 9,9.3 20,24
D 15,18 16,18 9,11 20,21 12,13 29,32.2 13,14 13,14 6,7 22,22
E 15,19 15,17 12,13 16,19 12,13 27,30 13,15 13,14 9,9.3 19,25
The on-line implementation is programmed such that it handles the analysis of single source
stains as well as two- and three-person mixtures. In Figure 5.8 the peak intensities for a mixture of profiles
B, D and C (see Table 5.11) are plotted together with the expected values for the best matching
combination.
Since we know the true profiles, we are able to compare the best matching combination with the
true profiles as for the two-person mixtures. In Table 5.12 the three inferred profiles are listed.
The major profile coincides with profile B while the mid profile differs from profile D by one
allele in locus D18. The minor profile has five correct and four partially correct loci compared to
profile C.
Table 5.12: The estimated profiles from the separation of the three-person mixture of Figure 5.8.
The major profile coincides with profile B in all loci, the mid profile differs by one allele from
profile D in locus D18, and the minor is correctly identified in five loci (compared to profile C).
Locus D3 vWA D16 D2 D8 D21 D18 D19 TH0 FGA
Minor profile 16,17 17,19 10,13 16,17 11,14 28,31 14,14 12,15 9,9.3 20,24
Mid profile 15,18 16,18 9,11 20,21 12,13 29,32.2 13,15 13,14 6,7 22,22
Major profile 17,18 14,17 9,9 17,23 13,15 28,28 14,19 14,15.2 8,9.3 20,24
Figure 5.8: Three-person DNA mixture of profiles B, D and C in mixture ratio 4:2:1 (see Table 5.11).
The performance of the mixture separator for the three-person DNA mixtures is summarised
in Table 5.13. For each three-person mixture the numbers of correctly (both alleles correctly
identified) and semi-correctly (exactly one allele correctly identified) separated loci are computed. This
is done separately for the major, mid and minor profiles, where the medians of the corresponding
amounts of DNA for these classifications are 335 pg, 168 pg and 84 pg. The first count in each
cell refers to the major profile, the second to the mid profile and the last to the minor component.
In 76 cases (73.8%) the major profile was correctly identified on at least eight loci (and partially
correct on the remaining ones), while 52 cases (50.5%) had the mid profile correct on at least
six loci. The success rate for the minor component was unsatisfactorily low. However, the low
amounts of DNA compared to the other two components imply that the contributions from the
minor profile are within the limits of variation one would expect for the larger peak intensities.
That is, the imbalances induced by adding the fraction from the minor component to the peaks
of the mid and major profiles are masked by the variability of these peaks.
The authors have, in collaboration with Aalborg University and University of Copenhagen,
applied for a patent for the intellectual rights of the mixture separating algorithm presented above:
Name of invention: A Computer-Assisted Method of Analyzing a DNA Mixture.
Application details: U.S. Provisional Application 61/148221 filed Jan. 29, 2009.
Table 5.13: Summary table of the separation of three-person mixtures. Each cell corresponds to
the number of major/mid/minor counts for the number of correctly identified loci stratified on
matches and partial matches. Row number is full matches and columns partial matches.
     0        1        2        3        4        5        6        7        8        9        10
0    0/0/0    0/0/0    0/0/0    0/0/0    0/0/0    0/0/0    0/0/0    0/0/0    0/1/0    0/1/1    0/0/0
1    0/0/0    0/0/0    0/0/0    0/0/0    0/0/0    0/1/1    0/0/1    0/1/3    0/0/1    0/0/1
2    0/0/0    0/0/0    0/0/0    0/0/0    0/0/0    0/1/2    0/2/6    0/5/3    0/2/5
3    0/0/0    0/0/0    0/0/0    0/0/0    0/2/5    0/2/6    0/2/10   0/5/4
4    0/0/0    0/0/0    0/0/1    0/1/1    0/4/5    0/2/7    0/4/6
5    0/0/0    0/0/0    0/2/1    1/1/0    0/5/8    3/7/6
6    0/0/0    0/0/0    0/1/0    2/6/8    8/9/4
7    0/0/0    0/2/0    0/6/2    13/8/2
8    0/0/0    0/2/0    26/11/3
9    0/0/0    25/6/0
10   25/1/0
CHAPTER 6
Estimating the probability of allelic
drop-out of STR alleles in forensic genetics
Publication details
Co-authors: Poul Svante Eriksen
Human DNA Quantification kit (Applied Biosystems) with Human Genomic DNA Male
(Promega) as the quantification standard on an ABI Prism
SGM Plus kit
(Applied Biosystems) as recommended by the manufacturer in an ABI GeneAmp 9700 PCR
thermocycler.
One µl of the amplificates in 15 µl HiDi
3100 Genetic Analyzer using POP4 as the polymer and 5 kV injection voltage for 6
seconds. DNA fragments were detected and fragment sizes were estimated with GeneScan 3.7
with a detection threshold of 50 rfu. Genotypes were assigned using GenoTyper 3.7 with the
Kazam macro (Applied Biosystems) with no stutter filter applied.
We excluded all alleles in stutter positions of true alleles to avoid complications of masked
drop-outs due to stutter effects. Table 6.1 presents the number of observed alleles, drop-outs and the
proportion of drop-outs for each locus.
Table 6.1: Observed drop-outs in the data set stratified by locus. All drop-outs were single
contributor alleles.
D3 vWA D16 D2 D8 D21 D18 D19 TH0 FGA
Observed 306 356 322 398 362 375 315 220 258 313
Drop-outs 10 11 11 14 11 7 10 10 17 18
Proportion 0.03 0.03 0.03 0.04 0.03 0.02 0.03 0.05 0.07 0.06
There was a tendency for the high molecular weight loci to have more drop-outs than the remaining
ones within each fluorescent dye colour. This indicates a locus dependence of the probability of
drop-out.
6.2.2 Logistic regression model
Let $D$ be the event "The contributor's allele has dropped out", and $\bar{D}$ the event that no drop-out occurs, implying that $P(\bar{D}) = 1 - P(D)$. For evidence evaluation, we are interested in quantifying the probability of allelic drop-out $P(D)$. As mentioned in Section 6.1, we wish to model this probability conditioned on the observed stain.
[Figure 6.1: one panel per locus (D3, vWA, D16, D2, D8, D21, D18, D19, TH0, FGA); horizontal axis $\hat{H}$ with ticks at 20, 50, 150, 400, 1100 and 3000; vertical axis $P(D \mid \hat{H})$; box-plots for $D$ and $\bar{D}$ in each panel.]
Figure 6.1: Locus specific logistic curves (solid) together with an overall estimate (dashed). The plot is on log-scale ensuring $P(D \mid \hat{H})$, where $\hat{H}$ is found from $H$ as (for $K = 2$),
$$P(D \mid \hat{H}) = \begin{cases} P(D \mid H), & \text{Non-shared het allele} \\ P(D \mid 2H), & \text{Non-shared hom allele} \\ P(D \mid H^{(1)} + H^{(2)}), & \text{Shared het allele,} \end{cases}$$
where $H^{(1)}$ and $H^{(2)}$ may be weighted by 2 if the contributors of the shared alleles are homozygous.
Logistic regression is a standard way to estimate the probabilities for a dichotomous response variable when explanatory variables are assumed to change the probability of the event (McCullagh and Nelder, 1989). The logistic model is particularly simple in this case since we only have one explanatory variable, $\hat{H}$,
$$P(D \mid \hat{H}) = \frac{\exp(\beta_0 + \beta_1 \log \hat{H})}{1 + \exp(\beta_0 + \beta_1 \log \hat{H})},$$
where $\beta_1$ turned out to be negative such that $P(D \mid \hat{H})$ decreases as $\hat{H}$ increases, and the use of $\log \hat{H}$ rather than $\hat{H}$ ensures that, with $\beta_1$ being negative, $P(D \mid \hat{H} = 0) = 1$. When we condition on $\hat{H}$, we assume that the events of two allelic drop-outs of the same contributor are independent, which is also an underlying assumption of the logistic regression. That is, $P(D_1, D_2 \mid \hat{H}) = P(D_1 \mid \hat{H}) P(D_2 \mid \hat{H})$, where $D_i$: "Allele $i$ of the contributor with DNA proxy $\hat{H}$ has dropped out".
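The logistic curve and the conditional independence assumption can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function names are hypothetical, and the parameter values are the overall estimates quoted in Section 6.3, with the slope taken as negative as stated in the text.

```python
import math

def dropout_probability(h_hat, beta0, beta1):
    """P(D | H) under logit P(D | H) = beta0 + beta1 * log(H)."""
    if h_hat <= 0:
        # With beta1 < 0 the limit of the model as H -> 0 is 1.
        return 1.0
    eta = beta0 + beta1 * math.log(h_hat)
    # Numerically stable evaluation of exp(eta) / (1 + exp(eta)).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def joint_dropout_probability(h_hat, beta0, beta1, n_alleles=2):
    """Probability that n_alleles alleles of one contributor all drop out,
    assuming the drop-out events are independent given H."""
    return dropout_probability(h_hat, beta0, beta1) ** n_alleles

# Overall (non-stratified) estimates; the sign of the slope is an assumption
# based on the statement that beta1 is negative.
BETA0, BETA1 = 17.56, -4.14

p_low = dropout_probability(50.0, BETA0, BETA1)    # proxy at the detection threshold
p_high = dropout_probability(500.0, BETA0, BETA1)  # well-amplified profile
p_both = joint_dropout_probability(50.0, BETA0, BETA1)
```

With these values the drop-out probability is high near the 50 rfu threshold and negligible at 500 rfu, which is the qualitative behaviour described above.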
6.3 Results and discussion
The analysis showed that the intercept parameter, $\beta_0$, varied between loci with a p-value of 0.01, indicating a significant difference between loci (Venables and Ripley, 2002). A similar test for the slope parameter, $\beta_1$, indicated that this parameter did not vary significantly across loci (p-value of 0.49). In addition, there was no significant change of the drop-out probability caused by the allelic number, indicating that larger alleles within the same locus have the same drop-out probability as smaller alleles. However, in the data set, the largest allelic difference was eight repeat units. This variability may be too small to demonstrate that a possible allelic effect is significant.
The parameters for locus $s$ are thus $\beta_{0,s}$ and $\beta_1$ for computing $P(D \mid \hat{H})$.
Table 6.2: Estimated parameters $\beta_{0,s}$ for each locus.
Locus          D3     vWA    D16    D2     D8     D21    D18    D19    TH0    FGA
$\beta_{0,s}$  18.26  18.43  18.75  18.31  18.28  17.45  18.07  19.40  19.40  19.21
Note that $\beta_{0,s}$ are larger for the loci of the yellow fluorescent dye band, indicating their larger drop-out probability as observed in Table 6.1. The corresponding logistic curves for the parameters of Table 6.2 are plotted in Figure 6.1 together with an overall estimate not stratifying on loci. The parameters for the overall curve are $\beta_0 = 17.56$ and $\beta_1 = -4.14$.
In Figure 6.1, the box-plot added to each panel shows the DNA proxy $\hat{H}$ for the drop-outs ($D$) and observed alleles ($\bar{D}$). The boxes indicate the inter-quartile range (middle fifty percent of the data) of the observations and the whiskers extend to the most extreme data points within 1.5 times the lengths of the boxes. Remaining points are marked by dots.
It is clear from Figure 6.1 that there is an overlap of the whiskers in the box-plots. This implies that the classification of drop-outs is associated with uncertainty as one would expect. In particular, it is true for D21 where all drop-outs observed had a mean height, $\hat{H}$, above 70. This may be due to the specific alleles in our data set (for D21, these were 28, 29, 30, 30.2, 31 and 32.2) and possible individual specific effects from having only four different profiles in the data.
We used the estimated parameters of Table 6.2 in order to create a table of the mean peak heights that correspond to specific drop-out probabilities. For the ten different loci included in our data set, these mean heights are presented in Table 6.3.
Table 6.3: Mean peak heights (rfu) for various drop-out probabilities for ten STR loci.
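Mean peak heights of this kind follow from inverting the logistic model: solving $\operatorname{logit}(p) = \beta_{0,s} + \beta_1 \log \hat{H}$ for $\hat{H}$ gives $\hat{H} = \exp\{(\operatorname{logit}(p) - \beta_{0,s})/\beta_1\}$. The Python sketch below illustrates the inversion. It rests on two assumptions: the intercepts are paired with loci in the order of Table 6.1, and the common slope is the value $-4.35$ used in the worked examples of Appendix 6.A. The function name is hypothetical and the output is recomputed from rounded parameters, so it need not match the original table entries exactly.

```python
import math

# Locus-specific intercepts; pairing with loci follows the order of Table 6.1
# (an assumption). The common slope is the value used in Appendix 6.A.
BETA0 = {"D3": 18.26, "vWA": 18.43, "D16": 18.75, "D2": 18.31, "D8": 18.28,
         "D21": 17.45, "D18": 18.07, "D19": 19.40, "TH0": 19.40, "FGA": 19.21}
BETA1 = -4.35

def height_for_dropout_probability(p, beta0, beta1=BETA1):
    """Mean peak height (rfu) at which the modelled drop-out probability equals p."""
    logit = math.log(p / (1.0 - p))
    return math.exp((logit - beta0) / beta1)

# Heights at which drop-out and detection are equally likely, per locus.
h_half = {locus: height_for_dropout_probability(0.5, b0) for locus, b0 in BETA0.items()}

for locus, h in h_half.items():
    print(f"{locus}: P(D) = 0.5 at about {h:.0f} rfu")
```

Because the slope is shared, the loci differ only by a multiplicative shift in height: loci with larger intercepts need higher peaks to reach the same drop-out probability.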
$P(D \mid \hat{H}_i))^2 = 0.02$, where $D_i$ is the indicator for drop-out of the $i$th allele of the data and $\hat{H}_i$ is the associated proxy for the amount of DNA. A Brier score close to zero indicates that the model is adequate. A simulated p-value of 0.156 indicates a satisfactory fit of the model. Furthermore, we tried to improve the model by using linear splines (Harrell Jr., 2001) with knots at log(75) and log(100), but these model extensions were not supported by the data.
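The Brier score is the mean squared difference between the observed indicators and the predicted probabilities (Brier, 1950). A minimal Python sketch of the computation follows; the indicator and probability values are made-up illustrations, not data from this study, and the function name is hypothetical.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    if len(outcomes) != len(probabilities):
        raise ValueError("outcomes and probabilities must have equal length")
    n = len(outcomes)
    return sum((d - p) ** 2 for d, p in zip(outcomes, probabilities)) / n

# Hypothetical toy data: 1 = allele dropped out, 0 = allele observed.
toy_outcomes = [1, 0, 0, 1]
toy_probabilities = [0.8, 0.1, 0.2, 0.6]
toy_score = brier_score(toy_outcomes, toy_probabilities)
```

A score of zero corresponds to perfect predictions, while predicting 0.5 for every allele gives a score of 0.25.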
The use of the logit function implies that the interpretation is made in terms of log odds. The log odds of the drop-out probability conditioned on $\hat{H}$ is linear in $\log \hat{H}$,
$$\operatorname{logit} P(D \mid \hat{H}) = \log \frac{P(D \mid \hat{H})}{P(\bar{D} \mid \hat{H})} = \beta_{0,s} + \beta_{1,s} \log \hat{H}.$$
Using $H$ as the explanatory variable implies lower variability on the DNA proxy than if only using a single peak height observation, e.g. the peak height on the same locus of a heterozygous allele that has not dropped out. Furthermore, in real crime cases such an allele might not be observed, since both alleles of a heterozygote might have dropped out or the other allele may be shared with another contributor if the stain is a mixture.
Gill et al. (2000) discussed the importance of addressing the risk of allelic drop-out and how to incorporate this into the likelihood ratio. Combining our approach for estimating $P(D \mid \hat{H})$ with the methodology of Gill et al. (2000) may be a feasible approach for better assessment of the weight of evidence when the level of the peak heights indicates the possibility of drop-outs.
6.4 Conclusion
We have demonstrated a simple and applicable way of assessing the drop-out probabilities of STR alleles in forensic genetics. The drop-out probabilities computed using the model concur with the prior knowledge of the drop-out behaviour varying with the observed peak heights.
Future work consists of testing the model on a larger data set including more alleles. With a larger data set, it may also be possible to test whether alleles or fragment length has a significant effect on the drop-out probability as the individual specific effect decreases with the number of different profiles.
It is worth emphasising that the drop-out probabilities may vary between laboratories, machinery within the same laboratory and typing kits used for profiling. This is due to differences in e.g. the ability to amplify the DNA in the PCR and in the potential to measure the light intensities for the electropherogram. Hence, before applying this methodology in the likelihood ratio for evidence calculations, the laboratory needs to perform experiments with known profiles in order to estimate the parameters in the logistic regression model.
Appendix
6.A Examples
In forensic genetics it is common to use the likelihood ratio $LR = P(E \mid H_p)/P(E \mid H_d)$ as a means to assess the weight of evidence. Here $P(E \mid H)$ is the probability of observing the evidence $E$ given the hypothesis $H$. The prosecutor's hypothesis, $H_p$, often includes more profiles from identified individuals than the defence hypothesis $H_d$. For a single contributor stain, $H_p$ may state "The suspect is the only contributor to the crime stain", whereas $H_d$: "An unknown individual unrelated to the suspect is the only contributor to the crime stain".
In the situation where the hypotheses imply that an allelic drop-out has occurred, one needs to specify the profiles that constitute the observed stain in order to compute the profile specific drop-out probability for both $H_p$ and $H_d$.
6.A.1 Example with data from a controlled experiment
We used the data in Table 6.4 to demonstrate the technique of computing the drop-out probability of a given allele. The data originated from a controlled experiment with a mixture of the two profiles A and B, where A contributed 31.4 pg/µl and B 424.6 pg/µl.
Table 6.4: Data used in the example of Appendix 6.A. The sample was a mixture of the two profiles A and B contributing 31.4 pg/µl and 424.6 pg/µl, respectively. Alleles without a recorded height and area were not observed.
Locus  Allele  Height  Area
D3     15      766     7264
D3     16      991     9165
D3     19      -       -
vWA    15      788     7631
vWA    17      710     6678
D16    10      117     1201
D16    11      1765    18858
D16    12      -       -
D2     19      746     8816
D2     23      -       -
D2     25      696     8432
D8     8       967     9145
D8     12      895     8350
D8     13      -       -
D21    28      70      660
D21    29      767     7169
D21    30      102     1024
D21    31      889     8283
D18    12      70      736
D18    15      766     8501
D18    16      127     1341
D18    17      687     7856
D19    13      1525    12862
D19    15      -       -
TH0    6       836     7333
TH0    7       82      736
TH0    8       595     5249
FGA    20      -       -
FGA    23      638     6507
FGA    24      549     5542
Under the assumption that the data in Table 6.4 originated from a two-person mixture, we need to specify a possible pair of profiles explaining the observed alleles. We compute the individual DNA proxies $H^{(A)}$ and $H^{(B)}$ as defined in Section 6.2.2 for the two profiles A and B of Table 6.4,
$$H^{(A)} = \frac{117 + 70 + 102 + 70 + 127 + 82}{6} = 94.67$$
$$H^{(B)} = \frac{766 + 1765 + 746 + 967 + 895 + 767 + 889 + 766 + 687 + 595 + 549}{10 + (2 \cdot 1)} = 782.67.$$
Let allele 19 in locus D3 be denoted by $\mathrm{D3}_{19}$, then from Table 6.4 we found that the homozygous allele $\mathrm{D8}_{13}$ and the following non-shared heterozygous alleles of profile A had dropped out: $\mathrm{D3}_{19}$, $\mathrm{D16}_{12}$, $\mathrm{D2}_{23}$, $\mathrm{D19}_{15}$, and $\mathrm{FGA}_{20}$.
The DNA proxy was the same for all the heterozygous drop-outs, $\hat{H} = H^{(A)}$, and for the homozygous allele $\hat{H} = 2H^{(A)}$. The parameter estimates of Table 6.2 were then used in order to compute the locus specific drop-out probabilities. Below, we demonstrate how to compute the drop-out probabilities for $\mathrm{D3}_{19}$, $\mathrm{D19}_{15}$ and $\mathrm{D8}_{13}$:
$$P(D_{\mathrm{D3}_{19}} \mid \hat{H}) = \frac{\exp(18.26 - 4.35 \log(94.67))}{1 + \exp(18.26 - 4.35 \log(94.67))} = 0.177,$$
$$P(D_{\mathrm{D19}_{15}} \mid \hat{H}) = \frac{\exp(19.40 - 4.35 \log(94.67))}{1 + \exp(19.40 - 4.35 \log(94.67))} = 0.403,$$
$$P(D_{\mathrm{D8}_{13}} \mid \hat{H}) = \frac{\exp(18.28 - 4.35 \log(189.33))}{1 + \exp(18.28 - 4.35 \log(189.33))} = 0.011.$$
Suppose we only had information on profile B, e.g. B being the victim of a crime, and that the profile of the suspect S only gave a partial match. For simplicity, we use the same mean height estimate for the suspect as for A, i.e. $H^{(S)} = H^{(A)}$. In locus D19, only allele 13 was observed and a shared allele may have dropped out. Assuming suspect S is homozygous for allele 11 and profile B is heterozygous with alleles 11 and 13, the DNA proxy is $\hat{H} = 2H^{(S)} + H^{(B)} = 189.33 + 782.67 = 972$ and the drop-out probability is
$$P(D_{\mathrm{D19}_{11}} \mid \hat{H}) = \frac{\exp(19.40 - 4.35 \log(972))}{1 + \exp(19.40 - 4.35 \log(972))} = 2.69 \times 10^{-5}.$$
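The arithmetic of this example can be retraced with a short Python sketch. It assumes natural logarithms and the parameter values appearing in the worked formulas (intercepts 18.26, 18.28 and 19.40 for D3, D8 and D19, common slope $-4.35$); the variable and function names are hypothetical.

```python
import math

BETA1 = -4.35                                      # common slope from the worked formulas
BETA0 = {"D3": 18.26, "D8": 18.28, "D19": 19.40}   # locus-specific intercepts

def dropout_probability(h_hat, beta0, beta1=BETA1):
    """P(D | H) = exp(eta) / (1 + exp(eta)) with eta = beta0 + beta1 * log(H)."""
    eta = beta0 + beta1 * math.log(h_hat)
    return math.exp(eta) / (1.0 + math.exp(eta))

# Peak heights (rfu) entering the two proxies, as listed in the example.
heights_a = [117, 70, 102, 70, 127, 82]
heights_b = [766, 1765, 746, 967, 895, 767, 889, 766, 687, 595, 549]

h_a = sum(heights_a) / len(heights_a)
# Denominator 10 + 2 * 1 as in the text; presumably ten heterozygous peaks
# plus one homozygous peak weighted twice.
h_b = sum(heights_b) / (10 + 2 * 1)

p_d3_19 = dropout_probability(h_a, BETA0["D3"])            # heterozygous allele of A
p_d19_15 = dropout_probability(h_a, BETA0["D19"])          # heterozygous allele of A
p_d8_13 = dropout_probability(2 * h_a, BETA0["D8"])        # homozygous allele of A
p_d19_11 = dropout_probability(2 * h_a + h_b, BETA0["D19"])  # shared allele, S homozygous

print(f"H(A) = {h_a:.2f}, H(B) = {h_b:.2f}")
print(f"{p_d3_19:.3f} {p_d19_15:.3f} {p_d8_13:.3f} {p_d19_11:.2e}")
```

The shared-allele case shows how strongly the proxy drives the result: adding the major contributor's proxy moves the drop-out probability from the order of 0.4 to the order of $10^{-5}$.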
6.A.2 Example in the recommendation of the ISFG Commission
Following the idea of Example 1 given in (Gill et al., 2006, Appendix B.2), we compute the likelihood ratio using our model for assessing the drop-out probabilities.
Assume that the genetic stain is $G = (a, c, d)$ and that the prosecutor's hypothesis claims that the suspect, $G_S = (a, b)$, is a contributor to the stain. For this hypothesis to be true, the $b$ allele must have dropped out. In this example, we only consider data from one locus as in Table 3 of the ISFG recommendations. We re-use the data from TH0 in Table 6.4 in order to exemplify how to evaluate the LR. For consistency with the example of Gill et al. (2006), denote allele 7 by $a$ and let $c$ and $d$ be alleles 6 and 8, respectively.
From Table 6.4, we compute the following estimates of $\hat{H}$ and the associated $P(D \mid \hat{H})$ for every combination of the alleles assuming a two-person mixture. Suppose that a contributor has non-shared alleles $mn$ and that the DNA proxy for this combination is $H_{mn}$. Then $P(D_{mn}) = P(D \mid H_{mn})$ is the drop-out probability of either $m$ or $n$. Alternatively, in the actual case there may be one shared allele, $m$, and in this case $P(D_{m,m}) = P(D \mid \hat{H}_{m,m}) = P(D \mid H_{mn} + H_{mo})$ is the drop-out probability for allele $m$ when shared by two individuals with the combinations $mn$ and $mo$. The probability $P(G \mid H_p)$ is
$$P(G \mid H_p) = 2 P(cd) P(\bar{D}_{cd})^2 P(D_{ab}) P(\bar{D}_{ab}),$$
since allele $b$ is assumed to have dropped out.
Assume that an allele, $Q$, has dropped out, implying that the two profiles are heterozygous and do not share any allele. That is, $Q$ is any allele of $\mathcal{A}_{\mathrm{TH0}} \setminus \{a, c, d\}$ with allele probability $P(Q) = 1 - [P(a) + P(c) + P(d)]$, where $\mathcal{A}_{\mathrm{TH0}}$ is the set of alleles for locus TH0. All of the observed alleles must be paired with the missing allele in order to compute the specific drop-out probabilities as these differ due to the different peak heights. From Table 6.5, it is clear that $P(D_{aQ})$ is the largest of the three, as expected since the peak height of $a$ is only 82 rfu. When paired with any of $c$ or $d$, the drop-out probabilities are practically zero as one would require. Hence, the terms $P(D_{cQ})$ and $P(D_{dQ})$ are also indicators of a poor agreement with the heterozygote balance when pairing $ad$ and $ac$, respectively.
Table 6.5: DNA proxies and drop-out probabilities for various profiles
Profile(s)  Notation       $H$           $\hat{H}$  $P(D \mid \hat{H})$
aQ          $P(D_{aQ})$    82.0          82.0       5.57 x 10^-1
ac          $P(D_{ac})$    459.0         459.0      7.02 x 10^-4
ad          $P(D_{ad})$    338.5         338.5      2.63 x 10^-3
cQ          $P(D_{cQ})$    836.0         836.0      5.17 x 10^-5
dQ          $P(D_{dQ})$    595.0         595.0      2.27 x 10^-4
cd          $P(D_{cd})$    715.5         715.5      1.02 x 10^-4
aa          $P(D_{aa})$    82.0          164.0      5.82 x 10^-2
cc          $P(D_{cc})$    836.0         1672.0     2.54 x 10^-6
dd          $P(D_{dd})$    595.0         1190.0     1.11 x 10^-5
ac, ad      $P(D_{a,a})$   459.0, 338.5  797.5      6.35 x 10^-5
ac, cd      $P(D_{c,c})$   459.0, 715.5  1174.5     1.18 x 10^-5
ad, cd      $P(D_{d,d})$   338.5, 715.5  1054.0     1.89 x 10^-5
The probability of the evidence given the defence hypothesis and that one allele has dropped out is given as
$$P_1(G \mid H_d) = 8 P(acdQ) \left[ P(\bar{D}_{ac})^2 P(D_{dQ}) P(\bar{D}_{dQ}) + P(\bar{D}_{ad})^2 P(D_{cQ}) P(\bar{D}_{cQ}) + P(\bar{D}_{cd})^2 P(D_{aQ}) P(\bar{D}_{aQ}) \right],$$
where the multiplication by 8 is due to the number of pairwise combinations of the alleles, e.g. pairing the alleles $ac$ and $dQ$ may be done as $(ac)(dQ)$, $(ac)(Qd)$, $(ca)(dQ)$ and $(ca)(Qd)$; interchanging the profiles yields the eight combinations.
The defence hypothesis, $H_d$, also comprises the scenario where no alleles have dropped out. This implies that either an allele is shared or one contributor is homozygous. The probability $P_0(G \mid H_d)$ is:
$$P_0(G \mid H_d) = P(acd) \Big[ P(a) \big\{ 4 P(\bar{D}_{aa}) P(\bar{D}_{cd})^2 + 8 P(\bar{D}_{a,a}) P(\bar{D}_{ac}) P(\bar{D}_{ad}) \big\} + P(c) \big\{ 4 P(\bar{D}_{cc}) P(\bar{D}_{ad})^2 + 8 P(\bar{D}_{c,c}) P(\bar{D}_{ac}) P(\bar{D}_{cd}) \big\} + P(d) \big\{ 4 P(\bar{D}_{dd}) P(\bar{D}_{ac})^2 + 8 P(\bar{D}_{d,d}) P(\bar{D}_{ad}) P(\bar{D}_{cd}) \big\} \Big].$$
It is worth noting that the probabilities $P(D_{ac})$ and $P(D_{ad})$ are misleading as the combination of $a$ together with $c$ or $d$ causes substantial imbalances in the profile's peak heights.
In order to compute the likelihood ratio, LR, we need only to compute the ratio of $P(G \mid H_p)$ to $P_1(G \mid H_d) + P_0(G \mid H_d)$. As in Gill et al. (2006), we assume uniform allele probabilities of 0.1 for the observed alleles implying that $P(Q) = 0.7$, yielding an LR of
$$LR = \frac{P(G \mid H_p)}{P_1(G \mid H_d) + P_0(G \mid H_d)} = \frac{0.0049}{0.0014 + 0.0035} = 1.0007.$$
For the same scenario, Gill et al. (2006) considered uniform probabilities of 0.02 for the observed alleles. Using our model, this implies an LR of 9.6111.
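The three probabilities can be evaluated directly from the entries of Table 6.5. The Python sketch below makes three assumptions that are not spelled out in the text: genotype probabilities such as $P(cd)$ and $P(acdQ)$ are taken as products of the allele probabilities, the proxy for the suspect's pair $(a, b)$ is taken equal to that of $(a, Q)$ since $b$ is unobserved, and the drop-out probabilities are the rounded table values. With rounded inputs the results agree with the reported likelihood ratios only approximately. All names are hypothetical.

```python
# Drop-out probabilities P(D | H) copied from Table 6.5 (rounded values).
P_DROP = {
    "aQ": 5.57e-1, "ac": 7.02e-4, "ad": 2.63e-3,
    "cQ": 5.17e-5, "dQ": 2.27e-4, "cd": 1.02e-4,
    "aa": 5.82e-2, "cc": 2.54e-6, "dd": 1.11e-5,
    "a,a": 6.35e-5, "c,c": 1.18e-5, "d,d": 1.89e-5,
}

def stay(key):
    """Probability that the allele(s) indexed by key do not drop out."""
    return 1.0 - P_DROP[key]

def likelihood_ratio(p_obs):
    """Return (P(G|Hp), P1(G|Hd), P0(G|Hd), LR) for uniform allele probability p_obs."""
    pa = pc = pd = p_obs
    pq = 1.0 - (pa + pc + pd)
    # H_p: suspect (a, b) with b dropped out; proxy of (a, b) assumed equal to (a, Q).
    p_hp = 2 * pc * pd * stay("cd") ** 2 * P_DROP["aQ"] * stay("aQ")
    # H_d with one dropped allele Q.
    p1 = 8 * pa * pc * pd * pq * (
        stay("ac") ** 2 * P_DROP["dQ"] * stay("dQ")
        + stay("ad") ** 2 * P_DROP["cQ"] * stay("cQ")
        + stay("cd") ** 2 * P_DROP["aQ"] * stay("aQ")
    )
    # H_d with no drop-out: one contributor homozygous, or one allele shared.
    p0 = pa * pc * pd * (
        pa * (4 * stay("aa") * stay("cd") ** 2 + 8 * stay("a,a") * stay("ac") * stay("ad"))
        + pc * (4 * stay("cc") * stay("ad") ** 2 + 8 * stay("c,c") * stay("ac") * stay("cd"))
        + pd * (4 * stay("dd") * stay("ac") ** 2 + 8 * stay("d,d") * stay("ad") * stay("cd"))
    )
    return p_hp, p1, p0, p_hp / (p1 + p0)

p_hp_10, p1_10, p0_10, lr_10 = likelihood_ratio(0.1)
_, _, _, lr_02 = likelihood_ratio(0.02)
print(f"allele probability 0.10: LR = {lr_10:.3f}")
print(f"allele probability 0.02: LR = {lr_02:.3f}")
```

The likelihood ratio grows as the observed alleles become rarer because the numerator scales with two allele probabilities while both denominator terms scale with at least three.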
Bibliography
Balding, D. J. and J. S. Buckleton (2009). Interpreting low template DNA profiles. Forensic Science International: Genetics 4(1), 1-10.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1-3.
Gill, P. D. et al. (2006). DNA commission of the International Society of Forensic Genetics: Recommendations on the interpretation of mixtures. Forensic Science International 160(2-3), 90-101.
Gill, P. D. and J. S. Buckleton (2010a). A universal strategy to interpret DNA profiles that does not require a definition of low-copy-number. Forensic Science International: Genetics 4(4), 221-227.
Gill, P. D. and J. S. Buckleton (2010b). Mixture interpretation: defining the relevant features for guidelines for the assessment of mixed DNA profiles in forensic casework. Journal of Forensic Sciences 55(1), 265-268.
Gill, P. D., J. M. Curran, and K. Elliot (2005). A graphical simulation model of the entire DNA process associated with the analysis of short tandem repeat loci. Nucleic Acids Research 33(2), 632-643.
Gill, P. D., J. Whitaker, C. Flaxman, N. Brown, and J. S. Buckleton (2000). An investigation of the rigor of interpretation rules for STRs derived from less than 100 pg of DNA. Forensic Science International 112(1), 17-40.
Harrell Jr., F. E. (2001). Regression Modeling Strategies. Springer.
McCullagh, P. and J. Nelder (1989). Generalized Linear Models. Chapman and Hall.
Petricevic, S. et al. (2009). Validation and development of interpretation guidelines for low copy number (LCN) DNA profiling in New Zealand using the AmpFlSTR SGM Plus(TM) multiplex. Forensic Science International: Genetics In Press, Corrected Proof.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2009). Estimating the probability of allelic drop-out of STR alleles in forensic genetics. Forensic Science International: Genetics 3(4), 222-226.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010). Evaluating the weight of evidence using quantitative STR data in DNA mixtures. Journal of the Royal Statistical Society. Series C, Applied Statistics. In Press.
Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (4 ed.). Springer.
6.5 Supplementary remarks
Several authors and commentators in forensic genetics have already accepted the model above as a means to estimate the probability of allelic drop-out (Balding and Buckleton, 2009; Petricevic et al., 2009; Gill and Buckleton, 2010a,b). However, as with any piece of science and every model, criticism has also been put forward. Balding and Buckleton (2009) argue that the drop-out probability of a homozygous allele, $P(D_2)$, should satisfy the property that $P(D_2) < P(D)^2$, where $P(D)$ is the drop-out probability of a heterozygous allele for the same DNA profile. Their argument is based on the fact that the superposition of two low intensity peaks should have a smaller drop-out probability than when the peaks are considered separately. That is, allelic drop-out may occur due to absence of molecules associated with a particular allele, but may also be due to an insufficient amount of molecules to trigger the observation of an allele. In the latter case, the amount of DNA might imply that heterozygous alleles yield peak height observations close to 50 rfu while homozygous alleles are closer to 80 rfu.
Balding and Buckleton (2009) suggested that $P(D_2) = \alpha P(D)^2$ for some value of $\alpha < 1$, which was chosen since it satisfies their requirement. Based on a survey from some forensic laboratories, Balding and Buckleton (2009) suggest $\alpha = 0.5$. However, there are at least two problems with the $\alpha$-approach. First, there is no solid model behind the suggestion, and second, how does one choose the correct value for $\alpha$? The model fitted to the experimental data in Tvedebrink et al. (2009) has $P(D_2; H) > P(D; H)^2$ for $H > 136$ rfu. However, the differences are in the fourth decimal place and have no practical implications. D. J. Balding (personal communication, 2010) suggested using $\sqrt{H}$ rather than $\log(H)$. This transformation yields a slightly better fit to the data and postpones the issue of $P(D_2; H) > P(D; H)^2$ to $H$-values > 201 rfu.
Gill et al. (2005) demonstrated how to simulate DNA mixtures by mimicking the procedure carried out by a forensic laboratory: DNA extraction, aliquot sampling, PCR efficiency and measurement variability. A similar approach is listed below:
(1) Assume that there are $N$ chromosomes extracted for typing.
(2) Of these, $n^{(0)}$ carry the specific allele of interest, where $n^{(0)} = \mathrm{bin}(N, x/46)$ with $x = 1$ for heterozygous and $x = 2$ for homozygous alleles, respectively.
(3) The PCR process is assumed to be a binomial process: $n^{(c)} = n^{(c-1)} + \mathrm{bin}(n^{(c-1)}, \pi_{\mathrm{PCR}})$, for $c = 1, \dots, C$ cycles, where $\pi_{\mathrm{PCR}}$ is the PCR efficiency for each cycle in the PCR process.
(4) If $n^{(C)}$ measured with noise gives rise to a peak height lower than a given threshold, we declare a drop-out.
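Steps (1)-(4) can be sketched in Python as follows. Several ingredients are assumptions made only for illustration: the conversion from molecule count to peak height (`rfu_per_molecule`), the log-normal measurement noise, and the use of a normal approximation for binomial draws with large counts. None of these is specified in the text, and all names are hypothetical.

```python
import math
import random

def binomial(n, p, rng):
    """Draw from Bin(n, p): exact for small n, normal approximation otherwise."""
    if n <= 0:
        return 0
    if n < 1000:
        return sum(rng.random() < p for _ in range(n))
    draw = round(rng.gauss(n * p, math.sqrt(n * p * (1.0 - p))))
    return min(n, max(0, draw))

def simulate_dropout(n_chrom, x, rng, cycles=28, efficiency=0.85,
                     rfu_per_molecule=1e-7, noise_sd=0.1, threshold=50.0):
    """One run of steps (1)-(4); returns True if the allele drops out."""
    n = binomial(n_chrom, x / 46.0, rng)       # step (2): copies of the allele
    for _ in range(cycles):                    # step (3): binomial PCR growth
        n += binomial(n, efficiency, rng)
    # Step (4): hypothetical measurement model, linear scale with log-normal noise.
    height = n * rfu_per_molecule * math.exp(rng.gauss(0.0, noise_sd))
    return height < threshold

def estimate_dropout(n_chrom, x, n_sim=500, seed=1):
    """Monte Carlo estimate of P(D) for n_chrom extracted chromosomes."""
    rng = random.Random(seed)
    hits = sum(simulate_dropout(n_chrom, x, rng) for _ in range(n_sim))
    return hits / n_sim

for n_chrom in (200, 780, 5000):
    het = estimate_dropout(n_chrom, x=1)
    hom = estimate_dropout(n_chrom, x=2)
    print(f"N = {n_chrom}: P(D) het = {het:.3f}, hom = {hom:.3f}")
```

Repeating the estimate over a grid of $N$ values gives simulated drop-out curves for heterozygous and homozygous alleles. The scaling constant only shifts the curves along the horizontal axis.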
By running (1)-(4) several times with varying initial values $N$ we get a simulated distribution of $P(D)$. In Figure 6.2, heterozygous and homozygous alleles are simulated for varying amounts of DNA. In these simulations $\pi_{\mathrm{PCR}} = 0.85$, $C = 28$ and each point is based on 5,000 simulations. The solid curve is fitted to the heterozygous data points (open points) by $\operatorname{logit} P(D; H) = \beta_0 + \beta_1 \log(H)$ and demonstrates that the model fits the data well over the whole range of the response. Dashed curves show the same regression with $\sqrt{H}$ as covariate, and the probit approach is discussed below. The fitted parameters $(\beta_0, \beta_1)$ were used to draw the curves for the homozygous simulations (closed points) with $\log(2H)$ as covariate. The plot shows good agreement between the simulated homozygous data points and the model predictions. The dotted curve represents the $\alpha P(D; H)^2$-model of Balding and Buckleton (2009) with $\alpha = 0.5$. The impression is quite different from the logistic regression fitted to the data.
[Figure 6.2: horizontal axis: amount of DNA (ticks at 5, 10, 20, 50, 100 and 200); vertical axis: $P(D)$ from 0.0 to 1.0. Legend: $P(D; H)$ with $\log(H)$; $P(D; H)$ with $\sqrt{H}$; $P(D; H)$ using probit; Balding and Buckleton (2009).]
Figure 6.2: Simulations using (1)-(4) for varying amounts of DNA. Open points are heterozygous simulations, and closed points homozygous. The curves are explained by the legend.
Another way to model the probability of allelic drop-out may be derived by taking a slightly different approach than above. Let $X$ denote the number of molecules in an aliquot sampled for PCR. We assume that if $X$ is less than some threshold $M$ the signal will not be sufficiently strong to trigger the CCD camera and thus the signal will be undetected, implying allelic drop-out.
Assume that the aliquot is sampled from a total number of molecules $N$ in the extract. Furthermore, the spatial pattern is Poisson distributed with intensity $\lambda$ (the parameter reflects the concentration of molecules), which implies that the positions of the molecules are independent of each other. The aliquot proportion $p$ of molecules for PCR processing is sampled from this extract, which again is Poisson distributed with intensity $\lambda p$.
Now, $X$ is Poisson distributed with some unknown intensity $\mu = \lambda p$, since $p$ is also unknown. We assume that the average peak height, $H$, is proportional to the number of sampled molecules, $X$, such that $H \approx kX$, $X \sim \mathrm{Poisson}(\mu)$, which implies that $E(X) = k^{-1}H = cH$. Rather than assuming $P(D) = P(X = 0)$, i.e. drop-out only happens when no molecules are sampled, we allow a positive number of molecules to be sampled:
$$P(D; H) = P(X \le M) = P\left( \frac{X - cH}{\sqrt{cH}} \le \frac{M - cH}{\sqrt{cH}} \right) \approx \Phi\left( \frac{M - cH}{\sqrt{cH}} \right),$$
where the approximation of a Poisson distribution by the normal distribution is justified by the large value of $cH$. This implies that $\operatorname{probit}[P(D; H)] = \Phi^{-1}[P(D; H)] = \gamma_1 H^{-1/2} + \gamma_2 H^{1/2}$. Fitting this model using the same data set as used in Tvedebrink et al. (2009) yields a fit similar to that of the original article. Similarly, this approach indicated good agreement with the simulations discussed above (as shown in Figure 6.2). Hence, this method, which is more closely related to the biochemistry than the logistic assumption, adds further support to the logistic regression approach through the similarity in results.
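The quality of the normal approximation behind the probit form can be checked numerically by comparing the exact Poisson probability $P(X \le M)$ with $\Phi\{(M - cH)/\sqrt{cH}\}$. In the Python sketch below the values of $c$ and $M$ are purely hypothetical, chosen only so that $cH$ is moderately large; the function names are likewise illustrative.

```python
import math

def poisson_cdf(m, mean):
    """Exact P(X <= m) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)
    total = term
    for k in range(1, m + 1):
        term *= mean / k
        total += term
    return total

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dropout_probability_probit(h, c, m):
    """Normal approximation to P(X <= m) when E(X) = c * h."""
    mean = c * h
    return normal_cdf((m - mean) / math.sqrt(mean))

# Hypothetical values: one molecule per rfu (c = 1) and a threshold of 80 molecules.
C_SCALE, M_THRESHOLD = 1.0, 80

exact = poisson_cdf(M_THRESHOLD, C_SCALE * 100.0)
approx = dropout_probability_probit(100.0, C_SCALE, M_THRESHOLD)
print(f"exact Poisson: {exact:.4f}, normal approximation: {approx:.4f}")
```

Expanding the argument of $\Phi$ gives $M c^{-1/2} H^{-1/2} - c^{1/2} H^{1/2}$, which is the two-term probit predictor stated above with coefficients determined by $M$ and $c$.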
After the publication of Tvedebrink et al. (2009), the Section of Forensic Genetics, University of Copenhagen, conducted more experiments with dilutions of DNA profiles. These experiments investigated the applicability of the drop-out model to different DNA genotyping kits and varying numbers of cycles in the PCR process (see summary of the results in Table 6.6).
Table 6.6: Summary of the experiments with diluted samples using the SEfiler kit (Applied Biosystems). Samples from identical aliquots were used in order to compare the effect of increasing the number of PCR cycles.
Cycles  Classification  D3   vWA  D16  D2   D8   SE33  D19  TH0  FGA  D21  D18
28      Observed        151  152  116  139  153  127   134  148  115  130  108
        Drop-outs       17   16   11   29   15   19    15   20   10   16   17
29      Observed        165  156  125  162  165  141   144  160  120  141  122
        Drop-outs       7    16   5    10   7    9     8    12   8    9    6
30      Observed        170  168  127  168  168  148   151  167  126  148  124
        Drop-outs       2    4    3    4    4    2     1    5    2    2    4
The overall properties of the model did not change, only an extra term caused by the cycle-factor was included:
$$\operatorname{logit} P(D; H, C) = (\beta_{0,s} + \beta_{0,C}) + (\beta_1 + \beta_{1,C}) \log(H)$$
Hence, the overall interpretation of the model is the same. It is worth mentioning that $\beta_{0,30} < \beta_{0,29} < \beta_{0,28} = 0$ and $0 = \beta_{1,28} < \beta_{1,29} < \beta_{1,30}$. This implies that for the same locus and fixed $H_0 > 60$ rfu:
$$P(D; \hat{H} = H_0, C = 28) < P(D; \hat{H} = H_0, C = 29) < P(D; \hat{H} = H_0, C = 30).$$
This seems counter-intuitive since more PCR cycles imply higher peaks. However, with $\hat{H} = H_0$ for all three levels of $C$, it is more likely to have drop-out for $C = 30$ than for $C = 28$, since one would expect that peaks with 30 cycles are on average higher than peaks from a 28 cycle PCR process, so a given mean height $H_0$ reflects less template DNA after 30 cycles than after 28. In Figure 6.3 a plot similar to Figure 6.1 summarises the estimated model. Each panel shows a box plot of the drop-out events for the associated $\hat{H}$-estimate with the fitted logistic curves superimposed.
[Figure 6.3: one panel per locus (D3, vWA, D16, D2, D8, SE33, D19, TH0, FGA, D21, D18); horizontal axis $\hat{H}$ with ticks at 20, 50, 150, 400, 1100 and 3000; vertical axis $P(D; \hat{H})$; box-plots for $D$ and $\bar{D}$ in each panel.]
Figure 6.3: Box-plots of the $\hat{H}$-estimates stratified by drop-out event. The boxes are vertically shifted for visual comprehension. Black: 28 cycles, dark gray: 29 cycles and light gray: 30 cycles.
For low amounts of DNA there might be a potential bias when estimating $H$. This is due to the fact that for a profile with many drop-outs, the peaks with heights above the detection threshold are outliers with respect to the peak height distribution. Hence, estimates based on these observations tend to systematically overestimate the amount of DNA through biased estimates of $H$. However, for a moderate number of drop-outs the bias is of minor concern.
6.5.1 Mixture separation allowing for allelic drop-out
The models for DNA mixtures presented in Chapters 4 and 5 do not allow for allelic drop-out in their original formulation. However, they are both extendable to handle this sort of issue, in particular the mixture separating model, which is discussed in detail below.
For two-person mixtures, the possibility of allelic drop-out implies that for loci with three and two observed alleles, the possible list of contributing DNA profiles should be extended with "wild cards". That is, observing alleles $a$, $b$ and $c$, genotypes involving an additional allele different from the three need to be considered. In practice, this implies that $J_2$ and $J_3$ from Table 5.3 should be extended as shown in Table 6.7 (denoted $\tilde{J}_2$ and $\tilde{J}_3$). Note that the columns in $\tilde{J}_2$ assuming allelic drop-out are identical to those of $J_3$ and $J_4$ with the first row(s) (lowest peak heights/smallest peak areas) removed. Similarly, the additional column in $\tilde{J}_3$ relative to $J_3$ refers to the case where the lowest peak intensity of the minor contributor has dropped out (first row of $J_4$). The number of expected alleles, $N_s$, is given over each block, where the number of drop-outs equals $N_s - n_s$.
Table 6.7: Extension of $J_2$ and $J_3$ of Table 5.3 allowing for drop-outs. The number of drop-outs is equal to $N_s - n_s$, where $N_s$ is the expected number of alleles.
$\tilde{J}_2$:
Expected number of alleles:  N_s = 2                     | N_s = 3                     | N_s = 4
Number of drop-outs:         0                           | 1                           | 2
                             P1 P2  P1 P2  P1 P2  P1 P2  | P1 P2  P1 P2  P1 P2  P1 P2  | P1 P2
A_{s,(1)}                    1  1   2  0   1  0   0  1   | 0  1   1  0   0  1   0  1   | 0  1
A_{s,(2)}                    1  1   0  2   1  2   2  1   | 0  1   0  2   1  1   2  0   | 0  1
$\tilde{J}_3$:
Expected number of alleles:  N_s = 3                     | N_s = 4
Number of drop-outs:         0                           | 1
                             P1 P2  P1 P2  P1 P2  P1 P2  | P1 P2
A_{s,(1)}                    2  0   1  0   1  0   0  1   | 1  0
A_{s,(2)}                    0  1   1  0   0  1   0  1   | 0  1
A_{s,(3)}                    0  1   0  2   1  1   2  0   | 0  1
The statistical formulation of mean and variance is identical, whether drop-out is allowed or not. However, due to the missing data problem induced by the assumption of allelic drop-out we must impute the missing data. An ad-hoc way to do this has been implemented and showed reasonably good results:
$N_s = 4$ and $n_s = 3$: The missing data is imputed by repeating the $A_{s,(1)}$-data row.
$N_s = 4$ and $n_s = 2$: The algorithm will always choose the leftmost configuration of $\tilde{J}_2$ since $D_s$ will be similar to that of the rightmost configuration, i.e. the difference between the observed and expected peak areas is similar when conditioned on the locus sum, $A_{s,+}$. However, the ratio of the likelihood values will approximately be $P(D_s \mid H^{(1)})^2$ due to the two drop-outs for $N_s = 4$.
$N_s = 3$ and $n_s = 2$: There are four different configurations that need to be considered (numbers refer to the order in $\tilde{J}_2$ where $N_s = 3$):
(1) If a homozygous allele drops out the situation is the same as above for $N_s = 4$ and $n_s = 2$.
(2) The missing data is imputed by repeating the $A_{s,(1)}$-row.
(3) The missing data is imputed as the difference of the $A_{s,(3)}$- and $A_{s,(2)}$-row.
(4) The missing data is imputed by repeating the $A_{s,(1)}$-row.
A more rigorous approach would be to use the EM-algorithm. However, for practical purposes it is believed that there would be no substantial difference.
Furthermore, the likelihood now includes an extra term $P(D; H)$, where $H$ is calculated as described above. Assume that only alleles of the minor contributor have dropped out. Then allowing for drop-outs in mixture separation implies that the selection criterion needs to evaluate $P(D; \hat{H} = H^{(1)})^{n_{D_1}} P(D; \hat{H} = 2H^{(1)})^{n_{D_2}} N$, where $n_{D_1}$ and $n_{D_2}$ respectively are the number of heterozygous and homozygous drop-outs.
Example
Table 6.8 lists the STR data of a two-person mixture where several allelic drop-outs have occurred. In fact only three of the minor profile's alleles not shared by the major profile had peak heights above the 50 rfu limit of detection.
Table 6.8: Observed STR data of a two-person DNA mixture. The estimated $H$-values were respectively 60.3 rfu (minor profile) and 765.8 rfu (major profile). The profiles denoted by squares and triangles are respectively identified without drop-out allowed and taking drop-out into consideration.
Locus  Allele  Height  Area
D3     15      884     7787
D3     16      816     7140
D3     19      -       -
vWA    15      519     5067
vWA    17      530     4928
D16    10      -       -
D16    11      1373    14302
D16    12      -       -
D2     19      565     6635
D2     23      -       -
D2     25      518     6120
D8     8       993     8720
D8     12      807     7320
D8     13      78      891
D21    28      -       -
D21    29      773     6867
D21    30      -       -
D21    31      637     5867
D18    12      -       -
D18    15      762     8449
D18    16      52      663
D18    17      644     7316
D19    13      1163    9550
D19    15      51      631
TH0    6       553     4691
TH0    7       -       -
TH0    8       572     4936
FGA    20      -       -
FGA    23      403     4024
FGA    24      363     3651
The drop-out probabilities of one of the minor profile's alleles for the various loci are listed in Table 6.9. From this table we see that it is likely that one or more of the minor profile's alleles have dropped out.
Table 6.9: Drop-out probabilities for the minor contributor (see Table 6.8). The probabilities were computed using $(\beta_{0,s}, \beta_1)$ from Tvedebrink et al. (2009).
Locus                         D3    vWA   D16   D2    D8    D21   D18   D19   TH0   FGA
$P(D \mid H^{(1)} = 60.3)$    0.61  0.65  0.72  0.62  0.61  0.41  0.56  0.83  0.83  0.80
We may analyse the data of Table 6.8 using the mixture separating algorithm. We first analyse the data assuming no allelic drop-out and then allowing for allelic drop-out. The likelihood value when not allowing for allelic drop-out is $1.908 \times 10^{-10}$, with a corresponding parameter estimate of 7.65. The profiles identified when drop-outs are neglected are denoted by square symbols in Table 6.8. Similarly, when allowing for drop-outs, the parameter estimate is 5.82 and the likelihood value is $1.178 \times 10^{-9}$. Since the best matching profiles when allowing for drop-outs have three drop-outs (see the identified profiles in Table 6.8 marked by triangles), the likelihood value is computed as $N \, P(D_{\mathrm{D3}} \mid \hat{H}^{(1)}) \, P(D_{\mathrm{D2}} \mid \hat{H}^{(1)}) \, P(D_{\mathrm{FGA}} \mid \hat{H}^{(1)})$, with the locus specific drop-out probabilities listed in Table 6.9.
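As a small arithmetic illustration of the drop-out factor in this likelihood, the sketch below multiplies the Table 6.9 probabilities for the three loci named in the text. The variable names are illustrative and not part of the original analysis; only the drop-out factor is computed, not the peak height likelihood itself.

```python
import math

# Drop-out probabilities P(D | H^(1) = 60.3) copied from Table 6.9.
dropout_prob = {
    "D3": 0.61, "vWA": 0.65, "D16": 0.72, "D2": 0.62, "D8": 0.61,
    "D21": 0.41, "D18": 0.56, "D19": 0.83, "TH0": 0.83, "FGA": 0.80,
}

# Loci in which the best matching profiles contain a drop-out (from the text).
dropout_loci = ["D3", "D2", "FGA"]

# Product of the locus specific drop-out probabilities.
dropout_factor = math.prod(dropout_prob[locus] for locus in dropout_loci)
print(f"Drop-out factor: {dropout_factor:.4f}")
```

The factor is roughly 0.30, so the three assumed drop-outs reduce the likelihood value by about a factor of three.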
Even though the algorithm allowing for drop-out correctly identified three allelic drop-outs, the number of correctly identified loci is almost the same for the two methods. This is due to the peak height imbalances of the peaks of the major profile. Often the shared allele of the two true contributors is smaller than the one where the major profile is the only contributor (e.g. loci D3, D2 and TH0 in Table 6.8). Hence, for D3 and D2 the larger of the two observed alleles is assumed to be a shared allele.
In addition to the example above, we also simulated data mimicking a two-person DNA mixture. The minor contributor had a fixed mean peak height of 60 rfu, while the major component had peak heights of 2000, 1500, 1000, 750, 500, 250, 150, 100 and 75 rfu. The reason for decreasing the peak height of the major contributor is that, since the variance is proportional to the mean, the contribution from the minor component is masked by the variability for large peak intensities. That is, it is not possible to detect whether the minor component has dropped out or whether it shares alleles with the major profile. Hence, the smaller the peak intensities of the major profile, the easier it should become to detect allelic drop-out. For each mean value of the major peak height, the standard deviation takes integer values from 0 to 10. For loci where one allele of a four-allele locus has dropped out, the method detects most of the drop-outs for small values of the standard deviation and of the mean peak height of the major profile. For three-allele loci the method is less successful, even for moderate values of the standard deviation and low major peak heights.
Both the experimental data and the simulations indicate that it is difficult to identify allelic drop-out of a contributor to a DNA mixture. The problem with allelic drop-out is not only due to a limited amount of DNA in the sample. Lowering the limit of detection naturally decreases the number of drop-outs. However, this might come at the cost of an increased number of drop-in peaks. In the following paper we discuss how the background noise can be used to determine a limit of detection using the sample itself as reference.
CHAPTER 7
Sample and investigation specific filtering
of quantitative data from STR DNA analysis
Publication details
Co-authors: Poul Svante Eriksen
Figure 7.2: Fluorescent dye bands: Blue (dashed/semi-gray), green (solid/dark-gray), yellow (dot-dashed/light-gray) and red (dotted/white). The shaded areas under each curve indicate the amount of spectral overlap between the various dyes. The vertical axis shows the fluorescent intensity (%). Reproduced from Applied Biosystems (2000).
as an increase in the intensities of both the background noise and true peaks. The shaded areas in Figure 7.2 represent the amount of overlapping light frequencies of the four different colours (blue, green, yellow and red) used in the SGM Plus kit. The increases caused by pull-up effects are pictured in Figure 7.1 as "Pull-up effect".
In our model, pull-up effects cannot cause stutters, whereas stutters may induce pull-up effects on other dye bands.
7.2 Materials and methods
7.2.3 Determination of floating threshold
For each STR locus and amelogenin of a sample, we wanted to obtain a set of negative data in order to model the negative data elements and develop a threshold for discrimination of positive and negative signals based on the distribution of the negative signals. We used the training data set for the development of the mathematical model. The peak height observations (intensities of fluorescent signals) below 5 rfu were removed at the first step of analysis with the Genescan software because the software was unable to handle the large amount of data elements below 5 rfu. The remaining signals comprised both background noise and more systematic components. We removed (1) all peaks on the allelic ladder that primarily represent true alleles and (2) all off-ladder signals in pull-up positions. This ensured that the remaining data points represented true noise. The peaks designated "Noise" in Figure 7.1 illustrate the data used for the determination of the threshold.
Inspection of the data indicated that the noise followed a right-skewed distribution. In order to obtain a normal distribution, the peak heights were transformed by $\log_e(\text{peak height} - 4.5)$. The distribution of the noise data fitted the log-normal distribution, and the fit was not better with distributions like the exponential, Fisher-Tippett, Pareto, Rayleigh, or Weibull distributions. Figure 7.3 shows the distribution of the observed peak heights of the noise for each locus of a sample after the data had been transformed by $\log_e(\text{peak height in rfu} - 4.5)$ against a standard normal distribution in a QQ-plot. Note that the outliers in the upper tail of the distribution are in fact the true positive signal. The plots demonstrated that the noise (shifted by 4.5) followed a log-normal distribution with individual parameters $\mu_s$ and $\sigma_s$ for each locus, $s$. These parameters determined the intercept and slope of the superimposed QQ-line and were estimated by
$$\hat\sigma_s = \frac{x_{s(q_1)} - x_{s(q_0)}}{z_{(q_1)} - z_{(q_0)}} \quad\text{and}\quad \hat\mu_s = x_{s(q_0)} - \hat\sigma_s z_{(q_0)},$$
where $x_{s(q)}$ and $z_{(q)}$ are the empirical and standard normal $q$-quantiles, respectively. We used these quantile estimators rather than the ordinary maximum likelihood estimators in order to increase the robustness of the method.
Figure 7.3 shows that the fit to normality was better for the higher values of $\log_e(\text{peak height in rfu} - 4.5)$ than for the lower ones. The observations in the upper part of the peak heights of the noise are those of main interest for the estimation of the threshold. Thus, we chose to use the $(q_0, q_1) = (50\%, 90\%)$-interval for the estimation of the threshold. The threshold was determined by the mean plus 3.29 times the standard deviation. The locus specific threshold can be written as:
$$\text{Threshold for locus } s = \exp(3.29\,\hat\sigma_s + \hat\mu_s) + 4.5.$$
Approximately 99.95% of the noise will be below the threshold and, thus, will be categorized as noise. However, 0.05% of the true negative results will be above the threshold and, thus, will be categorized as positive signals. For practical purposes, the majority of such false positive assignments will be in off-ladder positions rather than in allele positions.
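The threshold computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear-interpolation rule for the empirical quantiles and the example noise values are all hypothetical, while the shift of 4.5, the (50%, 90%) quantile pair and the factor 3.29 are taken from the text.

```python
import math
from statistics import NormalDist


def quantile(sorted_x, q):
    """Empirical q-quantile by linear interpolation (an assumed convention)."""
    pos = q * (len(sorted_x) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def floating_threshold(noise_rfu, q0=0.50, q1=0.90, shift=4.5, k=3.29):
    """Return (threshold, mu_hat, sigma_hat) for the noise peaks of one locus."""
    # Transform the noise peak heights: log_e(peak height - 4.5).
    x = sorted(math.log(h - shift) for h in noise_rfu)
    z0 = NormalDist().inv_cdf(q0)
    z1 = NormalDist().inv_cdf(q1)
    # Slope and intercept of the QQ-line through the two chosen quantiles.
    sigma_hat = (quantile(x, q1) - quantile(x, q0)) / (z1 - z0)
    mu_hat = quantile(x, q0) - sigma_hat * z0
    # Back-transform: mean plus k standard deviations on the log scale.
    return math.exp(k * sigma_hat + mu_hat) + shift, mu_hat, sigma_hat


# Hypothetical noise peak heights (rfu) for one locus; all exceed 5 rfu.
example_noise = [5.2, 6.1, 7.4, 8.0, 9.3, 11.5, 12.2, 14.8, 17.0, 21.5]
threshold, mu_hat, sigma_hat = floating_threshold(example_noise)
print(f"threshold = {threshold:.1f} rfu")
```

Because the standard normal 50% quantile is zero, the intercept estimate reduces to the median of the transformed noise, which is what makes the estimator insensitive to the true alleles in the upper tail.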
Figure 7.3 shows that only a few noise data elements were recorded for amelogenin. This is due to the fact that the interval in which noise could be recorded around the X- and Y-windows of amelogenin is small. However, amelogenin and D8 are marked with the same fluorochrome, and D8 alleles are only slightly longer than the DNA fragments of X- and Y-amelogenin. The distributions of noise in amelogenin and D8 were rather similar to each other and, therefore, the threshold of D8 was also used for amelogenin.

[Figure 7.3: one QQ-plot panel per locus; horizontal axis: standard normal quantiles; vertical axis: peak height (rfu). Legend: true allele, stutter, pull-up (on-ladder), pull-up (off-ladder), on-ladder peak, off-ladder peak, locus specific threshold, fixed 50 rfu threshold, QQ-line. Locus specific thresholds (rfu): D3 36.4, vWA 37.89, D16 27.29, D2 27.38, AME 24.73, D8 24.73, D21 37.12, D18 34.82, D19 37.54, TH0 32.61, FGA 29.7.]
Figure 7.3: QQ-plots of the observed peaks. Note the different thresholds computed using the data of the locus itself as reference. For this particular sample, the fixed 50 rfu threshold causes five drop-outs (in loci D3, D2, D19, D21 and FGA), whereas the locus specific threshold causes two drop-outs (in loci D19 and D21; one true peak in locus D21 has a peak height of 22 rfu and is embedded in the noise).
7.2.4 Stutter correction
Figure 7.4 shows three different situations involving stutters. The background noise (grey peaks) is the same in all three scenarios, but the parental peaks and, thus, the stutter peaks (black peaks) differ in size.
We used the training data set to develop the mathematical model based on a regression model on the peak intensities of the parental peaks. Assuming additivity of the noise and stutter product, we take into account that peaks in stutter positions in front of small peaks mainly consist of noise, as pictured in Figure 7.4.

Figure 7.4: Stutter peaks caused by different parental peaks (in black). The grey peaks picture the noise and the dashed line the median of the noise.

The model of the expected stutter height, $h_{\mathrm{Stutter}}$, is given by
$$h_{\mathrm{Stutter}} = h_{\mathrm{Noise},s} + (\alpha_s + \beta_s\,\Delta\mathrm{bp}_s)\,h_{\mathrm{Parent}}, \qquad (7.1)$$
where $h_{\mathrm{Noise},s}$ is the known/determined median of the off-ladder peaks not in pull-up position on locus $s$ (see Section 7.2.3) and $h_{\mathrm{Parent}}$ is the parental peak's height. The parameters were estimated by a weighted least squares fit with weights $1/h_{\mathrm{Parent}}$, due to the proportionality of the mean and variance of the peak heights (Tvedebrink et al., 2010). In the latter term, $\Delta\mathrm{bp}_s$ is the base pair deviation from the mean base pair, $\overline{\mathrm{bp}}_s$, on locus $s$, $\Delta\mathrm{bp}_s = \mathrm{bp}_s - \overline{\mathrm{bp}}_s$. The parameter $\alpha_s$ is the average stutter effect at a given locus, $s$. By including the base pairs in the model, we are able to have different stutter fractions for various alleles within a locus, if necessary.
Table 7.1: Estimates of $\alpha_s$ and $\beta_s$ in the stutter model (7.1).

Locus                               D3     vWA    D16    D2     D8     D21    D18    D19    TH0    FGA
$\hat\alpha_s \times 10^{2}$        7.209  6.714  6.101  7.712  4.996  6.359  7.031  7.252  2.031  6.508
SE($\hat\alpha_s$) $\times 10^{3}$  0.930  0.883  0.874  0.848  0.645  0.836  0.776  1.348  1.310  1.221
$\hat\beta_s \times 10^{2}$         0.104  0.203  0.212  0.091  0.096  0.075  0.172  0.215  0.069  0.123
SE($\hat\beta_s$) $\times 10^{3}$   0.112  0.118  0.137  0.064  0.069  0.140  0.099  0.215  0.184  0.139
The previously observed increase in stutter percentage as a function of allele number (Applied Biosystems, 2006) was reproduced by the positive estimates of $\beta_s$ in Table 7.1. The STR locus specific $\alpha_s$ parameters in Table 7.1 are in accordance with the picture in the manufacturer's kit documentation (Applied Biosystems, 2006, Figures 9-5, 9-6 and 9-7), where e.g. the average stutter effect, $\alpha_{\mathrm{TH0}}$, in TH0 is the weakest.
Figure 7.5 shows the stutter peak heights predicted by the model compared to the observed stutter peak heights. The plot demonstrates that the model in (7.1) is sufficient to describe the stutters. For adjacent heterozygous alleles, the base pairs typically differ by only a limited number of bp, which minimizes the effect of the length of the DNA fragment estimated by $\beta_s$.

[Figure 7.5: one panel per locus (D3, vWA, D16, D2, D8, D21, D18, D19, TH0 and FGA); horizontal axis: observed stutter height; vertical axis: predicted stutter height.]
Figure 7.5: Predicted stutter peak heights plotted against observed stutter peak heights with the identity line superimposed. The scale of the plot is the variance stabilising square-root transformation.

Additional examinations of the data also made it clear that back-stutters were present, typically in the position 4 bp larger than the parental peak. The model for back-stutters and their correction is based on the same idea as that for conventional stutters, with a noise level and an additional effect from the parental peak, i.e.
$$h_{\mathrm{Backstutter}} = h_{\mathrm{Noise},s} + \gamma_s\,h_{\mathrm{Parent}}. \qquad (7.2)$$
Table 7.2 shows the parameter estimates. The lack of homozygous alleles in some of the loci in the actual data set implied that the estimates of $\gamma_s$ were insignificant for these loci. This is due to the fact that the parental peak needs to reach a certain height (typically well above 1,000 rfu) for the back-stutter to exceed the noise level. For the same reason, base pairs were not included in the back-stutter model, as only a few base pair lengths were represented in the back-stutter data.
Double stutters originating from two adjacent alleles separated by 4 bp in a heterozygous individual behave slightly differently from single stutters. In Appendix 7.A, we evaluate the ratio of the stutter peak to the mean of the two parental peaks. In situations where the heterozygous alleles are not adjacent (separated by more than 4 bp) or when a stutter originates from a homozygous allele, for practical purposes, we only need to consider the ratio of the stutter peak to the parental peak.
Table 7.2: Parameter estimates of the back-stutter model (7.2). Loci with insignificant $\gamma_s$ estimates due to lack of homozygous alleles were removed.

Locus                               D3     vWA    D16    D2     D8     D21    D18    FGA
$\hat\gamma_s \times 10^{2}$        0.233  0.211  0.428  0.187  0.293  0.615  0.560  0.638
SE($\hat\gamma_s$) $\times 10^{3}$  0.643  0.738  0.628  0.627  0.551  0.625  0.561  0.747
7.2.5 Pull-up correction
We defined pull-ups as peaks on different dye bands within 0.5 bp of the parental bp lengths. Only the peaks not being true alleles or possible stutters on a different dye band than the parental peak were included in the data analyses. Figure 7.1 shows an increase of the noise level (rightmost on the upper band) and a true peak (heterozygote imbalance, leftmost on the upper band). We used the training data set to develop the mathematical model based on regression.
Figure 7.6 shows examples of pull-up values as a function of the values of the parental peaks for the various colours. The magnitudes of the observed pull-up effects were in accordance with the spectral overlap in Figure 7.2, i.e. the effects of green signals in the yellow spectrum and of green signals in the blue spectrum were the two largest, and yellow signals had the smallest effect in the blue spectrum.
For predictive purposes, we fitted a linear model to the observed data in Figure 7.6. Of the included data points, only a limited subset comprised detectable pull-up peaks, while the remaining observations were background noise in pull-up positions. Our model takes this into account by having a noise dependent intercept, $h_{\mathrm{Noise},s}$, for locus $s$ (median of the noise data described in Section 7.2.3). This approach is similar to the one used in the model for correction of stutter effects. In the formulation of the model, the notation $D \to d$ reflects that the parental peak is in fluorescent dye band $D$ and the pull-up peak is located in the fluorescent dye band $d$,
$$h_{\text{Pull-up}} = h_{\mathrm{Noise},s} + \delta_{D \to d}\,h_{\mathrm{Parent}}, \qquad (7.3)$$
where the parameters were estimated by a weighted least squares fit. Table 7.3 shows the parameter estimates of $\delta_{D \to d}$. The superimposed lines in Figure 7.6 were based on the parameter estimates of Table 7.3. Thus, the superimposed lines are in accordance with the spectral overlaps in Figure 7.2 except for $\delta_{G \to B}$, which is smaller than expected. This may be due to the particular alleles included in our data set.
Table 7.3: Parameter estimates of the various overlapping fluorescent dyes.

Dye → dye                                  B→G    B→Y    G→B    G→Y    Y→B    Y→G
$\hat\delta_{D \to d} \times 10^{2}$       1.039  0.449  0.342  0.978  0.322  0.597
SE($\hat\delta_{D \to d}$) $\times 10^{3}$ 0.405  0.357  0.411  0.341  0.560  0.482
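To make the pull-up model (7.3) concrete, the sketch below evaluates the expected pull-up height from the Table 7.3 estimates. It assumes that the tabulated values are the coefficients multiplied by 100, so that, for example, the blue-to-green coefficient is 0.01039; the dictionary layout, function name and the example peak heights are illustrative only.

```python
# Pull-up coefficients keyed by (parent dye, pull-up dye), from Table 7.3,
# assuming the table lists the estimates multiplied by 100.
PULL_UP = {
    ("B", "G"): 1.039e-2, ("B", "Y"): 0.449e-2,
    ("G", "B"): 0.342e-2, ("G", "Y"): 0.978e-2,
    ("Y", "B"): 0.322e-2, ("Y", "G"): 0.597e-2,
}


def expected_pull_up(h_parent, parent_dye, pull_up_dye, h_noise):
    """Expected pull-up height: noise median plus a fraction of the parent."""
    return h_noise + PULL_UP[(parent_dye, pull_up_dye)] * h_parent


# Hypothetical example: a 4000 rfu blue peak and a noise median of 10 rfu
# in the green band at the same base pair position.
green_pull_up = expected_pull_up(4000.0, "B", "G", 10.0)
print(f"{green_pull_up:.2f} rfu")
```

Under these assumptions a 4000 rfu blue peak contributes roughly 42 rfu to the green band, which is of the same order as the floating thresholds reported for this kit and illustrates why pull-up correction precedes allele assignment.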
[Figure 7.6: one panel per dye pair (blue → green, blue → yellow, green → blue, green → yellow, yellow → blue, yellow → green); horizontal axis: parental peak height; vertical axis: pull-up peak height.]
Figure 7.6: Pull-up effects stratified by overlapping fluorescent dyes. The superimposed lines indicate the estimated model. The scale of the plot is the variance stabilising square-root transformation.
7.3 Results
The parameters for correction of pull-up and stutter effects were estimated using the dilutions of non-mixture samples, whereas the overall performance of the filter was based on analysis of all possible combinations of pairwise two-person mixtures of four profiles in mixture ratios ranging from 1:16 to 1:1.
The sequence of steps was as follows:
(1) Determination of the floating threshold: Determine the threshold and detect potential stutters, pull-up effects and true peaks, i.e. alleles with peak heights above the threshold.
(2) Pull-up correction: Correct for pull-up effects caused by peaks above the threshold determined in (1).
(3) Stutter correction: Correct for stutter effects caused by peaks above the threshold determined in (1).
(4) Allele assignment: Assign alleles according to the floating threshold determined in (1) and the allelic ladder.
Note that the corrections for pull-up effects were made before the stutter correction was applied, because stutters may cause pull-ups while pull-ups cannot cause stutters.
7.3.1 DNA mixtures from controlled experiments
We used our floating threshold, stutter and pull-up correction method on 107 two-person mixtures. In Table 7.4, we summarise the performance of the overall filter. It is worth emphasising that 263 of the true alleles dropped out and that the stutter filter let 6 stutters and no back-stutters slip through. In addition to the stutter peaks, another 25 on-ladder peaks (21 drop-ins and 4 pull-ups) were classified as proper peaks of the samples.
Table 7.4: Filtered and passed peaks classified by type.

Classification    Assigned negative result   Assigned positive result
True allele                           263                      3,308
Stutter                             2,167                          6
Backstutter                         1,260                          0
Noise                              62,669                        324
  On-ladder                        11,619                         21
  Off-ladder                       51,050                        303
Pull-up                             3,825                         14
  On-ladder                           982                          4
  Off-ladder                        2,843                         10
The remaining peaks passing the filter were all in off-ladder positions and were removed from the analysis afterwards. The data were also analysed following the standard protocol of The Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health Sciences, University of Copenhagen. Using the technique recommended by the manufacturer, 312 drop-outs were observed together with 27 stutters and 7 pull-up peaks. Thus, the number of drop-outs of true alleles was 16% lower with locus specific filtering than with a fixed 50 rfu threshold.
The classification tables for the two methods are listed in Table 7.5. In the classification tables each observation is categorised by its classification and actual class. In Table 7.5 the diagonals are the correctly classified observations, while the off-diagonals are the misclassified ones. The lower the counts in the off-diagonal cells, the better the classification methodology.
To summarize the classification table in a single value we suggest the misclassification rate, which is the ratio of the number of misclassified observations to the number of correctly classified observations. From the misclassification rates (bottom lines in Table 7.5) we see that the floating threshold method yields a better classification than the fixed 50 rfu threshold.
Table 7.5: Classification tables for the two methods: Fixed 50 rfu and floating threshold.

Floating threshold
               Expected +   Expected −
Observed +          3,308           31
Observed −            263       69,921
Misclassification rate: 0.401%

Fixed 50 rfu threshold
               Expected +   Expected −
Observed +          3,259           34
Observed −            312       69,918
Misclassification rate: 0.473%
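The misclassification rates in Table 7.5 can be reproduced directly from the four cell counts. The short sketch below follows the definition given in the text (misclassified relative to correctly classified observations); the function and argument names are illustrative.

```python
def misclassification_rate(tp, fp, fn, tn):
    """Misclassified observations relative to correctly classified ones."""
    return (fp + fn) / (tp + tn)


# Cell counts from Table 7.5 (tp: observed + / expected +, fp: observed + /
# expected -, fn: observed - / expected +, tn: observed - / expected -).
floating = misclassification_rate(tp=3308, fp=31, fn=263, tn=69921)
fixed = misclassification_rate(tp=3259, fp=34, fn=312, tn=69918)

# Relative reduction in drop-outs of true alleles (312 versus 263).
dropout_reduction = (312 - 263) / 312

print(f"floating: {100 * floating:.3f}%  fixed: {100 * fixed:.3f}%")
print(f"drop-out reduction: {100 * dropout_reduction:.0f}%")
```

Note that this rate is not the usual error proportion (misclassified over all observations); with these counts the two definitions differ only in the third decimal of the percentage.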
7.3.2 Fingernail swabs from crime cases
Data from 98 crime cases were analysed using the approach presented. The Section of Forensic Genetics, University of Copenhagen, supplied the data from the crime cases together with reference samples associated with the crime. These reference samples may explain the observed stain, but since contamination from other biological material and debris may accumulate under the fingernails, the number of random drop-ins may be misleading (Cook and Dixon, 2006).
From Table 7.6 we see that the number of drop-outs decreased by 220 events, from 912 using the standard protocol to 692 using the sample specific setup (a decrease of 24%). However, this reduction in drop-outs comes at the cost of more drop-ins. The standard protocol gave 15 drop-ins versus 90 using our approach. In the experiment conducted by Cook and Dixon (2006), foreign DNA was detected in 13% of the fingernail swabs taken from the participating individuals. Hence, the higher number of drop-ins using our more sensitive methodology may be caused by foreign DNA. In that case, these alleles would be true alleles rather than drop-ins; however, this cannot be concluded from the available data.
Table 7.6: Classification tables for the two methods: Fixed 50 rfu and floating threshold.

Floating threshold
               Expected +   Expected −
Observed +          1,460           90
Observed −            692       82,497
Misclassification rate: 0.931%

Fixed 50 rfu threshold
               Expected +   Expected −
Observed +          1,240           15
Observed −            912       82,572
Misclassification rate: 1.106%
The misclassification rates in Table 7.6 are more than twice the rates of Table 7.5. This is a consequence of the data being from real crime cases with many degraded samples and low amounts of DNA. Hence, the number of drop-outs is larger and so is the number of partial DNA profiles.
In addition, Table 7.7 compares the drop-outs and drop-ins of the two methods. We see that 236 of the drop-outs under the standard protocol were correctly declared as true alleles using the floating threshold method. However, sixteen of the allelic drop-outs from the floating threshold method did not drop out using standard methods. More than half of these new drop-outs were located in locus D3, which tends to have a higher noise level compared to the other loci in this dataset. This may be due to primer residue increasing the background noise for the shorter loci in the electrophoresis.
Table 7.7: Comparisons of the drop-ins and drop-outs produced by the two methods.

                             Dropped out              Dropped in
                             Fixed 50 rfu threshold   Fixed 50 rfu threshold
                             Yes       No             Yes       No
Floating threshold   Yes     676       16             7         83
                     No      236       1,224          8         -
7.4 Discussion
Previously, Gilder et al. (2007) indicated that observations from the negative controls from the same run as the samples could be used to extract information about the noise level. However, their approach did not take variation between the capillaries into account. Our analysis shows significant differences between the capillaries with negative controls within each run, and also significant differences between the same capillaries with negative controls for different runs. This suggests that the noise distribution is constant neither within runs nor for the same capillary in consecutive runs. Hence, our approach, where we use the sample itself to determine the noise distribution, is recommended, as it eliminates the between-run and between-capillary variation. Furthermore, the stratification on loci for determining the threshold clearly improves the noise filtering, as indicated by Figure 7.3.
The fixed 50 rfu threshold yields in many cases the same number of drop-outs as the locus specific floating threshold. In Figure 7.7, the box-plots show the thresholds of the 107 mixture samples. For all loci, the median of the floating threshold is lower than the fixed 50 rfu limit. Note that within each dye band, the threshold median tends to decrease with the base pair length.
[Figure 7.7: box-plots per locus (D3, vWA, D16, D2, D8, D21, D18, D19, TH0, FGA), grouped by blue, green and yellow fluorescent dye; horizontal axis: locus; vertical axis: threshold (rfu), ranging from 20 to 120.]
Figure 7.7: Box-plots of the estimated locus specific floating threshold for the 107 mixture cases.
An advantage of the locus specific threshold is that it enables the case worker to assess the noise level of the sample. Hence, for cases where a peak lies just below 50 rfu, the magnitude of the locus specific threshold indicates whether it is reasonable to include the peak in the signal or not. Furthermore, in cases where the transformed peak heights, $\log_e(\text{peak height} - 4.5)$, deviate substantially from normality, the data indicate that the sample may be subject to extensive noise or contamination of some kind. This may be used as a sample quality diagnostic in order to determine whether a re-analysis is necessary. This deviation may be observed from the QQ-plots and other usual diagnostics used to validate assumptions of normality.
Since the stutter and pull-up corrections are based on a regression model, the parameters have been tuned for this specific data set. In general, the parameters must be determined for each laboratory, kit and DNA sequencer.
However, the trend in parameter magnitudes for the different pull-up directions is expected to hold in general, possibly with an increase in the $\delta_{G \to B}$ parameter estimate. It is also worth emphasising the dependency on the kit used for DNA typing. The data used in our analyses were obtained using the SGM Plus kit from Applied Biosystems, i.e. the parameters of the stutter and pull-up filters are not directly applicable to other kits.
7.5 Conclusion
The methodology of regression and distributional analysis of the noise yielded satisfactory results for deducing a sample and investigation specific filter for STR DNA typing. Comparisons of the results with those based on the recommendations of the manufacturers indicated that the number of drop-outs for the two validation datasets decreased by 16% and 24%, respectively. Studies of different data sets supported this improvement and suggest that the methodology of the threshold determination is adequate for the noise filtering of STR quantitative data.
The filters for pull-up effects and stutters based on regression analysis trained on non-mixture data also showed applicability to mixed DNA samples. As mentioned in Section 7.4, the parameter estimates in the filter were tuned for this specific data set and the alleles of the included profiles. Hence, the estimation of the parameters must be a part of a laboratory's internal quality assessment, where the consistency of the estimates over time is a quality indicator.
Appendix
7.A Double stutters
In Gill et al. (2005), the authors argue that, once a stutter has been formed, it is replicated during subsequent PCR cycles like an ordinary allele. We investigate the behaviour of stutters when we have two adjacent alleles of a heterozygous profile, i.e. what we call a double stutter.
Let $h_i$ denote the expected value of the pre-PCR peak height of allele $i$. Then for two adjacent alleles $n$ and $n+1$ from the same contributor, we have $h_n = h_{n+1} = h$ and $h_{n-1} = 0$ for the stutter position $n-1$. Let $P$ denote the effect of one PCR cycle. Then after $t$ PCR cycles, the expected value of the post-PCR peak heights, $h_i^{(t)}$, is given by
$$\left(h_{n+1}^{(t)},\, h_n^{(t)},\, h_{n-1}^{(t)}\right)^{\top} = P^{t}\,(h, h, 0)^{\top},$$
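where $x^{\top}$ denotes the transpose of the vector $x$. $P$ may be specified in terms of the PCR efficiency in one cycle, $p$, and the one-cycle stutter percentage, $\pi$,
$$\begin{pmatrix} h_{n+1}^{(t)} \\ h_n^{(t)} \\ h_{n-1}^{(t)} \end{pmatrix} = \begin{pmatrix} 1+p & 0 & 0 \\ \pi & 1+p & 0 \\ 0 & \pi & 1+p \end{pmatrix}^{t} \begin{pmatrix} h \\ h \\ 0 \end{pmatrix} = \begin{pmatrix} (1+p)^{t} & 0 & 0 \\ t\pi(1+p)^{t-1} & (1+p)^{t} & 0 \\ \binom{t}{2}\pi^{2}(1+p)^{t-2} & t\pi(1+p)^{t-1} & (1+p)^{t} \end{pmatrix} \begin{pmatrix} h \\ h \\ 0 \end{pmatrix}$$
The second equality can be shown using some linear algebra. Define $\xi = t\pi/(1+p)$ to be the stutter percentage for the entire PCR process comprising $t$ cycles. This definition ensures that the stutter percentage increases with the number of cycles, as noted in the literature (Gill et al., 2000). The expression can then be rewritten as
$$\begin{pmatrix} h_{n+1}^{(t)} \\ h_n^{(t)} \\ h_{n-1}^{(t)} \end{pmatrix} \approx \begin{pmatrix} 1 & 0 & 0 \\ \xi & 1 & 0 \\ \xi^{2}/2 & \xi & 1 \end{pmatrix} \begin{pmatrix} h_0 \\ h_0 \\ 0 \end{pmatrix} = \begin{pmatrix} h_0 \\ h_0(1+\xi) \\ h_0\left(\xi + \xi^{2}/2\right) \end{pmatrix} \qquad (7.4)$$
where $h_0 = (1+p)^{t} h$ and the $\approx$ is due to $t(t-1)/2 \approx t^{2}/2$ from the binomial coefficient. The error induced by this approximation is negligible for $t \geq 28$ cycles.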
The peak height, $h_0$, can be interpreted as the actual peak height after the PCR process. In Gill et al. (2005), the authors use $p = 0.8$ as the efficiency of a PCR cycle, hence indicating that the theoretical doubling effect (which requires that $p = 1$) from each cycle is not met in practice. Note that our approach differs from the work of Gill et al. (2005), where they model the PCR process at the nucleic level. Our approach is in terms of quantitative measures of peak heights.
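Often it is assumed that the peak at position $n+1$, $\tilde{h}_{n+1}^{(t)}$, equals some true height, $\tilde{h}$, after $t$ PCR cycles. Due to stuttering, the peak at position $n$ equals $\tilde{h}$ plus an additional fraction, $\xi$, from the $n+1$-position peak, $\tilde{h}_n^{(t)} = (1+\xi)\tilde{h}_{n+1}^{(t)} = (1+\xi)\tilde{h}$. Likewise, the stutter peak at position $n-1$ is taken to be the fraction $\xi$ of the peak at position $n$, $\tilde{h}_{n-1}^{(t)} = \xi\,\tilde{h}_n^{(t)} = (\xi + \xi^{2})\tilde{h}$. In matrix form,
$$\begin{pmatrix} \tilde{h}_{n+1}^{(t)} \\ \tilde{h}_n^{(t)} \\ \tilde{h}_{n-1}^{(t)} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ \xi & 1 & 0 \\ \xi^{2} & \xi & 1 \end{pmatrix} \begin{pmatrix} \tilde{h} \\ \tilde{h} \\ 0 \end{pmatrix} \qquad (7.5)$$
where the difference ($\xi^{2}/2$ versus $\xi^{2}$) between the matrices in (7.4) and (7.5) is induced by the delay of one cycle in the stutter product from the $n+1$-position peak to the stutter peak in position $n$. Hence, the relative contribution from the $n+1$-peak is smaller than modelled in (7.5) since, once formed, the stutter peak is amplified as a regular peak (Gill et al., 2005), which is captured using (7.4).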
When referring to the stutter percentage, $\xi$, we define it as the percentage of the parental peak that is transferred to the stutter peak, $\xi = h_{n-1}^{(t)}/h_n^{(t)}$. However, having two true alleles located at positions $n$ and $n+1$, we find
$$\frac{h_{n-1}^{(t)}}{h_n^{(t)}} = \frac{h_0\,\xi\left(1 + \frac{\xi}{2}\right)}{h_0(1+\xi)}.$$
In this situation, the ratio of the stutter peak to the mean of the two parental peaks yields the stutter percentage,
$$\frac{h_{n-1}^{(t)}}{\frac{1}{2}\left(h_n^{(t)} + h_{n+1}^{(t)}\right)} = \frac{h_0\,\xi\left(1 + \frac{\xi}{2}\right)}{\frac{1}{2}\left(h_0(1+\xi) + h_0\right)} = \frac{h_0\,\xi\left(1 + \frac{\xi}{2}\right)}{h_0\left(1 + \frac{\xi}{2}\right)} = \xi.$$
In situations where the heterozygous alleles are not adjacent (separated by more than 4 base pairs) or when the stutter originates from a homozygous allele, we need for practical purposes only to consider the direct ratio $h_{n-1}^{(t)}/h_n^{(t)}$ in order to estimate $\xi$.
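The double-stutter argument can be checked numerically by applying the one-cycle matrix $P$ repeatedly. The sketch below is an illustration under assumptions, not part of the original analysis: the function names are hypothetical, $p = 0.8$ and $t = 28$ are taken from the text, and the one-cycle stutter percentage is chosen so that the overall stutter percentage is 7%, a value picked only for the example.

```python
def mat_mul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def pcr_heights(h, p, stutter_one_cycle, t):
    """Expected heights (h_{n+1}, h_n, h_{n-1}) after t cycles from (h, h, 0)."""
    P = [[1 + p, 0.0, 0.0],
         [stutter_one_cycle, 1 + p, 0.0],
         [0.0, stutter_one_cycle, 1 + p]]
    M = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for _ in range(t):
        M = mat_mul(P, M)  # accumulate P^t
    start = (h, h, 0.0)
    return tuple(sum(M[i][j] * start[j] for j in range(3)) for i in range(3))


p, t, xi = 0.8, 28, 0.07                 # xi = 7% is an assumed example value
stutter_one_cycle = xi * (1 + p) / t     # inverts xi = t * pi / (1 + p)
h_np1, h_n, h_nm1 = pcr_heights(1.0, p, stutter_one_cycle, t)

direct_ratio = h_nm1 / h_n                    # stutter over adjacent parent
mean_ratio = h_nm1 / (0.5 * (h_n + h_np1))    # stutter over mean of parents
print(f"direct: {direct_ratio:.4f}  mean-based: {mean_ratio:.4f}  xi: {xi}")
```

With these values the mean-based ratio stays very close to the overall stutter percentage, while the direct ratio to the adjacent parent falls noticeably below it, in line with the derivation above.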
Bibliography
Applied Biosystems (2000). GeneScan Reference Guide - Chemistry Reference for the ABI PRISM 310 Genetic Analyzer. Applied Biosystems. Figure "Virtual Filter Set F", pp. 4-10.
Applied Biosystems (2006). AmpFlSTR SGM Plus PCR Amplification Kit User's Manual. Applied Biosystems.
Butler, J. M. (2005). Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers (2 ed.). Burlington, MA: Elsevier Academic Press Inc., U.S.
Cook, O. and L. Dixon (2006). The prevalence of mixed DNA profiles in fingernail samples taken from individuals in the general population. Forensic Science International: Genetics 1(1), 62–68.
Gilder, J. R., T. E. Doom, K. Inman, and D. E. Krane (2007). Run-Specific Limits of Detection and Quantitation for STR-based DNA Testing. Journal of Forensic Science 52(1), 97–101.
Gill, P. D., J. M. Curran, and K. Elliot (2005). A graphical simulation model of the entire DNA process associated with the analysis of short tandem repeat loci. Nucleic Acids Research 33(2), 632–643.
Gill, P. D., J. Whitaker, C. Flaxman, N. Brown, and J. S. Buckleton (2000). An investigation of the rigor of interpretation rules for STRs derived from less than 100 pg of DNA. Forensic Science International 112(1), 17–40.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010). Evaluating the weight of evidence using quantitative STR data in DNA mixtures. Journal of the Royal Statistical Society. Series C, Applied Statistics. In press.
7.6 Supplementary remarks
The three-person mixtures discussed in Section 5.10 were also analysed using the floating threshold methodology. As in Section 5.10, we discarded 17 out of the 120 samples due to preparation or run errors. For the remaining cases, the minimum amount of DNA contributed to a true allele was approximately 77.5 pg. Hence, the number of peak heights close to the limit of detection (50 rfu) is expected to be low, since experience shows that with about 50 pg pre-PCR product the average peak heights are close to this limit.
When using the standard protocol with a fixed 50 rfu threshold, 8 drop-outs and 80 extra peaks not assigned to the contributors were observed. The extra peaks were distributed as 47 stutters, 10 pull-ups and 23 drop-ins. For the floating threshold there were 4 drop-outs and 85 extra peaks, which were categorised as 27 stutters, 2 pull-ups and 56 drop-ins. Hence, the performance of the two methods was almost identical with respect to the misclassification rates, which were 0.123% and 0.124%, respectively. This non-significant difference in assignment of alleles indicates that the fixed 50 rfu threshold is very reasonable for standard applications.
However, the methodology may be useful in situations where the amount of DNA contributed by a suspect is limited. Given such circumstances, the peak intensities associated with the suspect's profile may be close to the fixed limit of detection, e.g. with the majority of peak heights in the range 40 rfu to 60 rfu. Peaks below the limit of detection, 50 rfu say, would conventionally be declared as drop-outs. However, if the level of the noise supports a floating threshold limit of 30 rfu, such considerations need not be made, since no alleles would drop out in this case. Often a case worker is able to visually detect peaks belonging to the suspect in the EPG below the limit of detection. However, lowering the limit of detection in order to include the suspect is clearly very erroneous and unfavourable to the defendant since, taken to the extreme, any DNA profile could be included in the crime related stain.
Furthermore, the method of adjusting for the contribution of stutter and pull-up effects is more accurate than just removing the peaks due to so-called masking. Keeping all relevant information in the system is desirable, since having a peak in stutter position that after adjustment has a peak height of 35 rfu, say, is more informative than having an NA observation due to removal of a potential stutter.
CHAPTER 8
Statistical model for degraded DNA samples
and adjusted probabilities for allelic drop-out
Publication details
Co-authors: Poul Svante Eriksen
$10^{-6}$, where we used $\hat{\beta}_{0,\mathrm{D2}} = 18.31$ and $\hat{\beta}_{1} = -4.35$ from Table 2 of Tvedebrink
et al. (2009). This is an extremely low drop-out probability considering that allele
19 in the same locus has a peak height of 77 rfu.
From graphical inspection of the (simplified) EPG in Figure 8.2 it is obvious that the DNA
sample is subject to degradation. In order to take the degradation of the DNA into account we
adjust the estimated $\hat{H}$. The solid line in Figure 8.1 has $(\hat{\gamma}_0, \hat{\gamma}_1) = (10.262, -0.0177)$ with $R^2 = 0.931$,
which together with Figures 8.1 and 8.2 and other graphical diagnostics indicates a good
agreement with the model. Since the fragment length, bp, of allele 24 in locus D2 is $bp^{D2}_{24} = 327.87$
(see Table 8.1), the adjusted $H$-value yields $\hat{H}(bp) = \exp(10.262 - 0.0177 \times 327.87) = 85.25$ rfu.
This estimated peak height is reasonably close to the observed peak height (77 rfu)
for the other allele in the same locus (Table 8.1). The estimated peak height is plugged into
(8.5), which implies that $P[D^{D2}_{24}; \hat{H}(bp) = 85.25] = 0.26$. This drop-out probability is more
reasonable than $P(D; \hat{H})$, which does not take degradation into account.
Note from (8.3) that we may compute $p$ from the estimate of $\gamma_1$: $\hat{p} = \exp(\hat{\gamma}_1) = \exp(-0.0177) = 0.982$.
From experience (see the supplementary material) this sample is moderately degraded.
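The arithmetic of this example can be retraced with a short sketch. It assumes a log-linear decay of the expected peak height in bp (intercept 10.262, slope 0.0177 per bp, taken as negative here) and a logistic drop-out model of the form logit P(D) = b0 + b1 log H with b0 = 18.31 and b1 = -4.35. The signs, the model form and all function and variable names are assumptions made for illustration; they are inferred from the quoted numbers and are not taken from (8.3) or (8.5) directly.

```python
import math

# Coefficients quoted in the example; the negative signs are assumptions
# inferred from the arithmetic, since they are not legible in the text.
GAMMA0, GAMMA1 = 10.262, -0.0177   # intercept and slope of log-height versus bp
BETA0_D2, BETA1 = 18.31, -4.35     # assumed logistic drop-out coefficients


def expected_height(bp, gamma0=GAMMA0, gamma1=GAMMA1):
    """Expected peak height (rfu) at fragment length bp under log-linear decay."""
    return math.exp(gamma0 + gamma1 * bp)


def dropout_probability(height, beta0=BETA0_D2, beta1=BETA1):
    """Assumed logistic model: logit P(D) = beta0 + beta1 * log(height)."""
    eta = beta0 + beta1 * math.log(height)
    return 1.0 / (1.0 + math.exp(-eta))


h_adj = expected_height(327.87)        # degradation-adjusted height for allele 24
p_drop = dropout_probability(h_adj)    # adjusted drop-out probability
p_degr = math.exp(GAMMA1)              # degradation parameter p
print(h_adj, p_drop, p_degr)
```

With the rounded coefficients the sketch gives values close to, but not exactly equal to, the 85.25 rfu and 0.26 quoted above; the small differences are consistent with rounding of the published estimates.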
8.4 Discussion
Since most DNA samples are analysed in replicates (or at least in duplicates), an additional
source of information is the consistency of the estimated degradation parameter across replicates.
For replicates the amount of DNA may vary; however, this affects (in principle) only $\gamma_0$, whereas
$\gamma_1$ should remain constant. For most of the samples analysed in this paper there was no significant
difference between the levels of degradation $p_{R_i}$ and $p_{R_j}$ for different replicates $R_i$ and $R_j$, $i \neq j$.
Similarly, for samples originating from the same body tissue or fluid, the degradation pattern
should be reasonably similar across samples taken from the same source at the crime scene. This
was supported by the data; however, some cases had significant differences between tissue/fluid
samples.
The likelihood ratio is defined as $LR = P(E|H_p)/P(E|H_d)$, where $H_p$ and $H_d$ are two competing
hypotheses that could represent the statements of the prosecutor and defence. Let $G_S$ be the
DNA profile of the suspect, which in the example of Section 8.3 equals the profile in Table 8.1.
In order to evaluate $P(E|H_p)$, where $H_p$ claims that $G_S$ is the donor of the observed stain, an allelic
drop-out needs to have occurred in order to explain the missing allele 24 in locus D2. Hence, the
probability $P(D^{D2}_{24})$ enters in the numerator of the LR. Thus, the smaller this probability, the smaller
the LR. Hence, the prosecutor will claim that degradation is present, since the probability of
allelic drop-out is approximately $10^5$ times larger when assuming degradation, compared to the
non-degraded probability of allelic drop-out.
$P(E|H_d)$ is evaluated by summation over the set of possible unknown profiles with or without
allelic drop-out. Whether or not it is favourable for the defence to consider unknown profiles
with drop-outs depends on the allele probabilities for the homozygous loci. That is, if
\[ P(\bar{D}; 2H)\,P(A_i A_i) < P(D; H)\,P(\bar{D}; H) \sum_{j \neq i}^{k} P(A_i A_j) \]
then $P(E|H_d)$ is increased by allowing for drop-out, which results in a decreased LR. This
consideration applies whether or not the sample is degraded. However, the drop-out probabilities
will only increase when considering degradation, since $H(bp) = c\,p^{bp} H$ and $P(D; H)$ increases
as $H$ decreases. On the other hand, the probability of alleles not dropping out is possibly larger
when correcting for possible degradation, $P(\bar{D}; H(bp)) > P(\bar{D}; H)$, since $H(bp)$ may be larger
than $H$ for short amplicons.
8.5 Conclusion
We presented a model for the decay in the peak intensities of forensic STR loci as a function
of increasing fragment length in base pairs, bp. The model showed satisfactory agreement with the data and is simple and
intuitive. Furthermore, we demonstrated how to incorporate the information on degradation in the
computation of the probability of allelic drop-out for degraded samples.
Bibliography
Alaeddini, R., S. J. Walsh, and A. Abbas (2010). Forensic implications of genetic analyses from
degraded DNA - A review. Forensic Science International: Genetics 4(3), 148-157.
Alonso, A. et al. (2005). Challenges of DNA profiling in mass disaster investigations. Croatian
Medical Journal 46(4), 540-548.
Bender, K., M. J. Farfan, and P. M. Schneider (2004). Preparation of degraded human DNA
under controlled conditions. Forensic Science International 139(2-3), 135-140.
Bill, M. et al. (2005). PENDULUM - a guideline-based approach to the interpretation of STR
mixtures. Forensic Science International 148, 181-189.
Colotte, M., V. Couallier, S. Tuffet, and J. Bonnet (2009). Simultaneous assessment of average
fragment size and amount in minute samples of degraded DNA. Analytical Biochemistry 388(2), 345-347.
Dixon, L. A. et al. (2006). Analysis of artificially degraded DNA using STRs and SNPs - results
of a collaborative European (EDNAP) exercise. Forensic Science International 164(1), 33-44.
Green, R., I. Roinestad, C. Boland, and L. Hennessy (2005). Developmental Validation of the
Quantifiler Real-Time PCR kits for the Quantification of Human Nuclear DNA samples.
Journal of Forensic Science 50(4), 809-825.
Irwin, J. A. et al. (2007). Application of low copy number STR typing to the identification of
aged, degraded skeletal remains. Journal of Forensic Sciences 52(6), 1322-1327.
Prinz, M. et al. (2007). DNA Commission of the International Society for Forensic Genetics
(ISFG): Recommendations regarding the role of forensic genetics for disaster victim
identification (DVI). Forensic Science International: Genetics 1(1), 3-12.
Schneider, P. M. et al. (2004). STR analysis of artificially degraded DNA - results of a
collaborative European exercise. Forensic Science International 139(2-3), 123-134.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2009). Estimating the
probability of allelic drop-out of STR alleles in forensic genetics. Forensic Science International:
Genetics 3(4), 222-226.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010). Evaluating the weight
of evidence using quantitative STR data in DNA mixtures. Journal of the Royal Statistical
Society. Series C, Applied statistics. In Press.
8.6 Supplementary remarks
Degradation affects the mean of the peak heights and areas. Since degradation is very common
in forensic case work, the models developed should be able to handle it. As
with the extension of the mixture separation method to allow for allelic drop-out, the method
can be extended to correct for degradation.
Assume that the biological material contributed by the donors is of similar type, e.g. blood,
tissue, body fluids, etc., and that the material has been exposed to similar conditions over an
approximately identical time span. Based on these assumptions it is reasonable to assume that the
level of degradation is common for the DNA and that the peak intensities may be modelled by
$c_k p^{bp}$, where $c_k$ reflects the amount of DNA contributed by the $k$th individual and $p$ is common
for all $k = 1, \ldots, m$.
In cases of degradation, the effect on a four-peak locus might be such that the highest and lowest
peak heights relate to the major component and the two alleles with intermediate peak heights
belong to the minor contributor of a two-person DNA mixture. This could happen if the highest
and lowest peaks are at either end of the ladder interval and the intermediate peaks lie in between.
Figure 8.3 shows examples of this situation for the base pair interval from 125 bp to 280 bp for the
SGM Plus kit.
The plot in Figure 8.3 exemplifies a two-person DNA mixture with $p = 0.98$ and amounts of
DNA corresponding approximately to a 1:2 mixture. That is, the expected peak heights of
heterozygous loci are given by $h^{(k)}_{s,i} = c_k\,p^{bp^{(k)}_{s,i}}$, which implies that
\[ \sum_{i=1}^{4} \log h_{s,i} = 2\gamma^{(+)}_0 + \gamma_1\,bp_{s,+}, \]
where $\gamma^{(+)}_0 = \gamma^{(1)}_0 + \gamma^{(2)}_0$, $\gamma^{(i)}_0 = \log c_i$ and $\gamma_1 = \log p$.
Hence, a regression of $\sum_{i=1}^{4} \log h_{s,i}$ on $bp_{s,+}$ would give estimates of $(\gamma^{(+)}_0, \gamma_1)$. However,
additivity of peak heights on the natural scale does not transfer to additivity on the log scale.
Homozygous allele peak heights are $\log h^{(k)}_{s,1} = \log 2 + \gamma^{(k)}_0 + \gamma_1\,bp^{(k)}_{s,1}$ and shared alleles have
$\log h_{s,i} = \log(c_1 + c_2) + \gamma_1\,bp^{(\cdot)}_{s,i}$
$10^{-5}$, $4.62 \times 10^{-5}$) for the difference between $p_0$ and $p$.
The relevant observation window for the loci included in the SGM Plus kit (AB) starts around
100 bp. Using this offset, the observed peak intensities may be adjusted for degradation by
compensating for the fitted decay. Given $\hat{\gamma}_1$ and the peak height $h$ it is possible to compute the
degradation corrected peak height $\tilde{h}$. By multiplying the observed peak heights by
$\exp[-\hat{\gamma}_1 (bp - 100)]$ the effect of degradation is inverted, resulting in less imbalance between loci,
\[ \tilde{h}_{s,i} = h_{s,i} \exp[-\hat{\gamma}_1 (bp_{s,i} - 100)]. \]
[Figure 8.3 plot: peak height (axis from 0 to 2500) against base pair (axis from 150 to 250) for the blue, green and yellow fluorescent dyes, with the major and minor components marked.]
Figure 8.3: Degradation of a two-person DNA mixture. The highest and lowest peaks belong to
the major component. The shaded areas below the first axis show the range of the allelic ladder
for the various STR loci in the 125-280 bp window of the SGM-Plus kit (Applied Biosystems,
AB).
In the example of Section 8.3, $\hat{\gamma}_1 = -0.0177$ and by using the approach above we get the peak
heights reported in the Corrected height-column of Table 8.1. There is still evidence of peak
height imbalances within loci. However, the heterozygote balance, Hb, which is the ratio of the
heterozygous peak heights (see e.g. Bill et al., 2005), is improved by the correction: the
range of Hb is (0.54, 0.99) for the observed peak heights and (0.74, 0.98) after the peak height
correction.
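The fit-and-correct procedure described here can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: a single donor, noise-free synthetic peaks generated from c * p**bp with made-up values of c and p, and an ordinary least squares fit of log height on bp. The function names and the synthetic data are hypothetical and do not reproduce Table 8.1.

```python
import math


def fit_log_linear(bps, heights):
    """Ordinary least squares fit of log(height) = g0 + g1 * bp."""
    ys = [math.log(h) for h in heights]
    n = len(bps)
    mean_x = sum(bps) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in bps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bps, ys))
    g1 = sxy / sxx
    g0 = mean_y - g1 * mean_x
    return g0, g1


def correct_heights(bps, heights, g1, offset=100.0):
    """Invert the fitted decay relative to the start of the observation window."""
    return [h * math.exp(-g1 * (bp - offset)) for bp, h in zip(bps, heights)]


# Synthetic, noise-free peaks from c * p**bp (made-up values, not thesis data).
c, p = 3000.0, 0.98
bps = [125.0, 160.0, 205.0, 250.0, 280.0]
heights = [c * p ** bp for bp in bps]

g0, g1 = fit_log_linear(bps, heights)
corrected = correct_heights(bps, heights, g1)
print(math.exp(g1))  # recovered degradation parameter, close to p
print(corrected)     # every peak is mapped to the expected height at 100 bp
```

On real data the corrected heights would still scatter around a common level, which is why some within-locus imbalance remains after the correction.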
In order for the models for DNA mixtures of Chapters 4 and 5 to be valid, the proportionalities
of peak heights and peak areas need to be preserved. However, the correction of peak heights
is also applicable to peak areas, hence $\tilde{a}_{s,i} = a_{s,i} \exp[-\hat{\gamma}_1 (bp_{s,i} - 100)]$, which ensures the same
proportionality as before the correction. Therefore, no changes are needed in the sets $J_i$ for the
mixture separator when the peak intensities are adjusted for degradation.
In the next chapter the model for degraded DNA is combined with the models from the previous
chapters in a unifying likelihood ratio. That is, a likelihood ratio where all the discussed
complications can be included and accounted for when assessing the weight of the DNA evidence in
crime cases.
CHAPTER 9
Epilogue
9.1 Conclusion
In the preceding seven chapters (Chapters 2-8) the core content of this PhD thesis has
been presented. The main focus of the PhD project has been to develop statistical models
applicable to the quantitative part of the STR analysis, in particular for DNA mixtures. However,
since the genetic part (qualitative allelic data) of the evidence constitutes the fundamental input
in evidential weight calculations, it was difficult not to treat this topic. This led to the interest in
IBD and the effect of population structure when computing the evidential weight. As pointed
out by one of the reviewers of the paper in Chapter 2 ("Overdispersion in allelic counts and
$\theta$-correction in forensic genetics", to appear in Theoretical Population Biology), the forensic
databases do not constitute the databases of interest. More general population surveys should be
used when making inference about $\theta$, e.g. random samples taken from well-defined subpopulations
at a high resolution. For the Danish population this could be samples taken from small
villages or islands, since these subpopulations may cause large allelic divergence and thus yield
$\theta$-estimates in the higher end of the plausible range (Balding, 2005). From Figure 2.1 we saw
that this in practice would lead to a conservative evaluation of the evidence. Furthermore, this is
equivalent to the fact that the probability of a random match (a profile match of two unrelated
individuals) increases with $\theta$.
For the quantitative part, the work was initiated by assuming no complications of stutters, pull-up
effects, allelic drop-out or degradation. Under these settings it was possible to derive two models
for DNA mixtures, where the simpler of the two was wrapped into a greedy algorithm which
efficiently separated DNA mixtures. For the particular data used in the paper the algorithm
was at least as successful as three experienced case workers. However, the analysis did also
emphasise that the results should be interpreted with caution. This was especially important
for samples close to a 1:1 mixture proportion and when the interest was in the minor profile.
The analysis of three-person mixtures repeated this picture, where the success rate for the 1:2:4
mixture proportion was rather low for the mid and minor profiles.
Having done this, a natural extension of the models was to handle allelic drop-out and degraded
DNA samples, as these phenomena occur frequently in real crime case work. In
the remarks of Sections 6.5.1 and 8.6 it was demonstrated how the presented models may be
combined in order to handle these complications. In the remarks it was only exemplified how
to modify the statistical model and mixture separating algorithm for two-person DNA mixtures.
However, the cases of more contributors follow along the same lines. The work with allelic
drop-out also made it evident that there were possibilities for refinement of the determination
of the signal-to-noise ratio. The use of a fixed threshold may in some cases discard important
information regarding the distribution of the noise component from the measurement technique.
The model proposed to determine this threshold was based on a simple analysis of quartiles in
order to estimate the parameters of the log-normal distribution. However, for some situations
this approach seemed to be too simple, as a sudden increase of the noise level was detected for
a short bp-interval. This temporary increase in the background noise caused non-linearity in
the QQ-plots and in some cases increased the variance estimate substantially. Loess-curves were
investigated to handle this non-linearity. However, they did not improve the overall performance
significantly. Further work may suggest ways to adjust for this fact, but one has to focus the
attention on newer typing kits, as these should have better signal-to-noise ratios than the
SGM-Plus kit (Applied Biosystems).
9.2 Weight of evidence calculations
In the preceding chapters it has been demonstrated how to incorporate the quantitative part of the
STR typing results in the likelihood ratio approach. The principle was to assign a weight to each
quantitative term of the LR, where the weight should reflect the compliance between the expected
and observed peak intensities. Terms with minor disagreement (e.g. due to measurement errors)
should receive a large weight, whereas profile combinations leading to substantial differences
would be weighted by a quantity close to zero.
This extendability of the LR is one of the many arguments for using this approach rather than the
"Random man not excluded"-approach (often abbreviated RMNE-approach in the literature). I
will not discuss the philosophical differences or many advantages of LR over RMNE, since these
are irrelevant at this point. However, it should be noted that the models discussed above are of no
use when assessing the weight of evidence through RMNE. In line with many others (e.g. Evett
and Weir, 1998; Balding, 2005; Buckleton et al., 2005; Gill et al., 2006; Buckleton and Curran,
2008) I strongly recommend the LR-approach in evidential calculations carried out in forensic
genetics.
The LR is formed by evaluating the evidence (crime scene evidence and identified profiles) under
competing hypotheses, often denoted $H_p$ and $H_d$ for the prosecutor and defence hypotheses.
Since $H_p$ and $H_d$ are only mutually exclusive, and not exhaustive, one needs to recall that there
are several LRs, one for every $(H_p, H_d)$-pair of hypotheses. Hence, the fact that the LR favours
$H_p$ over $H_d$ does not imply that there cannot exist an $H'_d$ for which $LR' = P(E|H_p)/P(E|H'_d) < 1$
(Balding, 2005).
The extensions of the LR derived in Chapters 4 and 5 only considered cases assuming no allelic
drop-outs. However, as previously argued, this assumption often fails together with the "no
degradation"-assumption. Hence, for proper inclusion of the available data and applicability
to most types of crime cases, the LR needs to be extended further. Let $Q = (Q_{mis}, Q_{obs})$ and
$\mathcal{G} = (\mathcal{G}_{mis}, \mathcal{G}_{obs})$, where the subscripts refer to dropped-out and observed alleles. That is, $Q_{obs}$ are
the observed peak intensities, whereas $Q_{mis}$ denotes the event of an allelic drop-out, i.e. the peak
failed to be detected. Similarly, $\mathcal{G}_{obs}$ and $\mathcal{G}_{mis}$ are the associated types of alleles, where the need
for $Q_{mis}$ and $\mathcal{G}_{mis}$ is induced by the hypothesis under consideration.
Given a specific hypothesis the set of plausible profile combinations $C$ is induced. That is, if the
prosecutor's hypothesis $H_p$ claims that the observed crime scene stain originates from a victim,
$V$, and suspect $S$, then $C_p = \{(G_V, G_S)\}$, where $G_V$ and $G_S$ are the profiles of $V$
and $S$, respectively. In connection to this hypothesis the defence states that "The observed crime scene stain
originates from the victim and an unknown profile"; then $C_d = \{G_U : (G_V, G_U) \text{ is consistent with } H_d\}$, with $G_U$
being the profile of the unknown contributor $U$. This definition of $C_d$ does not limit $G_U$ to be
consistent with $(Q_{obs}, \mathcal{G}_{obs})$, hence drop-out of $U$'s alleles is allowed with this formulation. Note
that this definition of $C$ is different from that of Sections 4.3 and 5.5, where the plausible profiles
in $C$ needed to be consistent with the observed alleles, i.e. no allelic drop-outs were allowed.
Additionally, drop-ins, stutters and pull-up peaks possibly cause more alleles to be observed
than those of the true contributors. However, as claimed in Section 5.9, stutters (and pull-up
peaks) are profile independent. Hence, given the peak intensity information in allele position
$n$ it is (in principle) possible to predict and adjust for the stutter contribution to the peak in
position $n - 1$. Similarly, the pull-up contribution can be removed from peaks with overlapping
bp-values. However, not all such peaks were successfully removed, as 6 stutters and 4 pull-ups
were observed above the signal-to-noise threshold (Table 7.4) for the two-person mixtures, and
for the three-person mixtures 27 stutters and 2 pull-ups were detected. Hence, peaks other than
those belonging to the true donors must be incorporated in the unifying model to be consistent
with the observed data.
9.3 Unifying likelihood ratio
The evaluation of the LR consists of computing the probability of the evidence under the two
hypotheses and forming their ratio. Since the $C$-sets are discrete, the probability $P(E|H)$ may be
evaluated using the law of total probability, $P(E|H) = \sum_{G \in C} P(E|G)P(G)$, where $G$ is short for
the profiles involved, e.g. $G = (G_V, G_S)$ under the prosecutor's hypothesis in the example above.
In order to discuss the evidential weight there needs to be at least one identified DNA profile,
namely the suspect's profile $G_S$. For general purposes let $K$ be the known DNA profiles
associated with the case, e.g. $K = (G_V, G_S)$ above. Then the evidence $E$ consists of $E_c$ and $K$,
where $E_c$ is the crime scene evidence including both the quantitative and qualitative parts,
$E_c = (Q, \mathcal{G})$. First we note that the crime scene evidence, $E_c$, and the known profiles, $K$, are
assumed conditionally independent given $(G, \mathcal{G})$. That is, given $G \in C$ and $\mathcal{G}$ the known profiles $K$
have no influence on the crime scene stain. Hence, using the definition of conditional probability
we can for $G \in C$ factorise $P(E|G)$ as:
\[ P(E|G) = P(E_c, K|G) = P(Q, \mathcal{G}, K|G) = P(Q|\mathcal{G}, G)\,P(\mathcal{G}|K, G)\,P(K|G). \tag{9.1} \]
In (9.1) the $P(Q|\mathcal{G}, G)$-term measures the agreement of the observed and expected peak
intensities under some model. If the detected alleles in $\mathcal{G}$ equal those of $G$, neither drop-out nor
drop-in (including stutters and pull-ups) has caused missing or additional alleles to be present
in the signal. Hence, $P(Q|\mathcal{G}, G) = P(Q|G)$ may be evaluated by one of the models presented
in Chapters 4 or 5, and since $\mathcal{G} = (G \cap \mathcal{G}) = G$, i.e. the profiles are consistent, $P(\mathcal{G}|G) = 1$.
However, in cases with stutters, pull-ups or drop-ins present, $Q$ is split into two parts ascribed
respectively to $G = G \cap \mathcal{G}$ and $\bar{G} = \mathcal{G} \setminus (G \cap \mathcal{G})$. The evaluation is done by
$P(Q|\mathcal{G}, G) = P(Q_{\bar{G}}|Q_{G}, \mathcal{G}, G)\,P(Q_{G}|G)$, where in cases of possible stuttering
$P(Q_{\bar{G}}|Q_{G}, \mathcal{G}, G)$ assigns probability to this event. In this thesis such models have not been
discussed; however, a logistic regression (similar to that of the drop-out model) may be derived,
where the explanatory variable for stutters and pull-ups would be the parental peak intensities.
For drop-ins (additional peaks not possible to categorise as stutters or pull-ups), the noise level
of the sample might be an appropriate covariate.
Furthermore, if $G$ implies allelic drop-out, $Q$ can be decomposed into $(Q_{mis}, Q_{obs})$ and the
quantitative term then factorises further: $P(Q|\mathcal{G}, G) = P(Q_{mis}|Q_{obs}, \mathcal{G}, G)\,P(Q_{obs}|\mathcal{G}, G)$. The probability
of an allelic drop-out, $P(Q_{mis}|Q_{obs}, \mathcal{G}, G)$, is computed given the observations and information
about the sample's genotypes. An allelic drop-out is equivalent to the event that the peak height
is less than the limit of detection. Hence, $P(Q_{mis}|\cdot)$ could be evaluated by $\int_0^T P(h|\cdot)\,dh$, where $T$
and $h$ are the limit of detection and peak height, respectively. However, the drop-out model of
Chapter 6 is an approximation to this integral, and since it is easier to compute we use $P(D; H)$
to quantify $P(Q_{mis}|Q_{obs}, \mathcal{G}, G)$.
Thus, combining (9.1) with the extension for drop-outs and additional alleles compared to $G$, the
unifying likelihood ratio can be defined as:
\[ LR = \frac{P(E|H_p)}{P(E|H_d)} = \frac{\sum_{G \in C_p} P(Q_{mis}|Q_{obs}, G)\,P(Q_{obs,\bar{G}}|Q_{obs,G}, \mathcal{G}, G)\,P(Q_{obs,G}|G)\,P(\mathcal{G}|K, G)\,P(K|G)\,P(G)}{\sum_{G' \in C_d} P(Q_{mis}|Q_{obs}, G')\,P(Q_{obs,\bar{G}}|Q_{obs,G}, \mathcal{G}, G')\,P(Q_{obs,G}|G')\,P(\mathcal{G}|K, G')\,P(K|G')\,P(G')} \tag{9.2} \]
This LR is constructed such that it (in principle) is applicable in all possible scenarios arising
from crime cases.
For the example above with $H_p: (G_V, G_S)$ and $H_d: (G_V, G_U)$ the known profiles are thus
$K = (G_V, G_S)$. Assume that $G_S$ has alleles not present in $\mathcal{G}$, implying that allelic drop-out must have
occurred if the suspect is a true contributor to the stain. Furthermore, all alleles in $\mathcal{G}$ are accounted
for by $(G_V, G_S)$. Then the LR is given by:
\[ LR = \frac{P(Q|\mathcal{G}, G_V, G_S)\,P(\mathcal{G}|G_V, G_S)\,P(G_V, G_S)}{\sum_{G_U \in C_d} P(Q|\mathcal{G}, G_V, G_U)\,P(\mathcal{G}|G_V, G_U)\,P(G_S|G_V, G_U)\,P(G_U, G_V)} \]
\[ = \frac{P(Q_{mis}|Q_{obs}, \mathcal{G}, G_V, G_S)\,P(Q_{obs}|\mathcal{G}, G_V, G_S)\,P(\mathcal{G}_{mis}, \mathcal{G}_{obs}|G_V, G_S)}{\sum_{G_U \in C_d} P(Q_{mis}|Q_{obs}, \mathcal{G}, G_V, G_U)\,P(Q_{obs}|\mathcal{G}, G_V, G_U)\,P(\mathcal{G}_{mis}, \mathcal{G}_{obs}|G_V, G_U)\,P(G_U|G_V, G_S)}, \]
where $P(\mathcal{G}_{mis}, \mathcal{G}_{obs}|G_V, G_S) = 1$ since $(G_V, G_S)$ is consistent with $(\mathcal{G}_{mis}, \mathcal{G}_{obs})$. Assume further that
$C_d = \{G_U : (G_V, G_U) \text{ is consistent with } \mathcal{G}_{obs}\}$, i.e. the set of possible unknown profiles is restricted to be consistent with
the observed alleles when combined with $G_V$. Thus, $P(\mathcal{G}_{obs}, \mathcal{G}_{mis}|G_V, G_U) = 1$ and the LR reduces
further:
\[ LR = \frac{P(Q_{mis}|Q_{obs}, \mathcal{G}, G_V, G_S)\,P(Q_{obs}|\mathcal{G}, G_V, G_S)}{\sum_{G_U \in C_d} P(Q_{obs}|G_V, G_U)\,P(G_U|G_V, G_S)} \]
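To make the structure of this reduced LR concrete, the sketch below evaluates it for made-up inputs. The numerator is the drop-out probability for the suspect's missing allele times the quantitative weight of the observed peaks under the prosecution profiles; the denominator sums, over the unknown profiles in the defence set, the quantitative weight times the conditional profile probability. The function name, argument names and all numbers are hypothetical and only illustrate the arithmetic; they are not results from the thesis.

```python
def reduced_lr(p_mis_suspect, w_obs_suspect, unknown_terms):
    """Reduced LR for one suspect with a dropped-out allele.

    p_mis_suspect : P(Q_mis | Q_obs, observed alleles, G_V, G_S)
    w_obs_suspect : P(Q_obs | observed alleles, G_V, G_S)
    unknown_terms : iterable of (P(Q_obs | G_V, G_U), P(G_U | G_V, G_S)),
                    one pair per unknown profile G_U in the defence set
    """
    numerator = p_mis_suspect * w_obs_suspect
    denominator = sum(w_obs * p_profile for w_obs, p_profile in unknown_terms)
    if denominator <= 0.0:
        raise ValueError("no unknown profile explains the observed data")
    return numerator / denominator


# Hypothetical inputs, chosen only to make the arithmetic visible.
unknown_terms = [(0.90, 0.010), (0.50, 0.020)]
lr = reduced_lr(p_mis_suspect=0.26, w_obs_suspect=0.80, unknown_terms=unknown_terms)
print(lr)
```

The sketch makes one property of the formula visible: a smaller drop-out probability for the suspect's missing allele lowers the LR proportionally, while unknown profiles that fit the peak heights well or are common in the population increase the denominator.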
9.4 Future research
9.4.1 Replicates
When a sample is taken from a crime scene the number of molecules may be limited, e.g.
dead hair follicles contain only limited amounts of DNA, and similarly for "touch DNA", which
is biological material transferred by physical contact (Gill and Buckleton, 2010a). Let $N$ be
the number of DNA molecules present after extraction and $n$ be the number of replicates, $R_i$,
made based on the $N$ molecules. For the $n$ replicates to be comparable in terms of drop-outs
(and possibly stutters and contamination) it is desirable for the amount of DNA to be evenly
distributed among $R_1, \ldots, R_n$, e.g. for $n = 3$ one could imagine having approximately 30% in
each replicate, leaving 10% of the extracted DNA in the tube.
Let $\pi_A$ denote the aliquot proportion; then this sampling scheme implies that $R_1 \sim \mathrm{bin}(N, \pi_A)$
and $(R_i|R_1, \ldots, R_{i-1}) \sim \mathrm{bin}\bigl(N - \sum_{j=1}^{i-1} R_j,\ \pi_A/[1 - (i-1)\pi_A]\bigr)$ for $i = 2, \ldots, n$. It is easy to verify
that this construction yields the expected values as desired: $E(R_1) = N\pi_A$ and
\[ E(R_i) = E[E(R_i|R_1, \ldots, R_{i-1})] = \frac{[N - (i-1)N\pi_A]\,\pi_A}{1 - (i-1)\pi_A} = N\pi_A. \]
Furthermore, this implies that $(R, Q) = (R_1, \ldots, R_n, Q) \sim \mathrm{mult}(N, \{\pi_A, \ldots, \pi_A, 1 - n\pi_A\})$, where $Q$ is
the remaining extract. Assume that there need to be $M$ molecules of an allele prior to PCR in
order for it to be detected by the CCD camera in the electrophoresis machine post-PCR. Hence, for
the allele to be detected in each replicate we require that $R_i > M$ for all $i$:
\[ P(R_1 > M, R_2 > M, \ldots, R_n > M) = \sum_{r_1 > M}^{N} \sum_{r_2 > M}^{N - r_1} \cdots \sum_{r_n > M}^{N - r_+} P(R_1 = r_1, R_2 = r_2, \ldots, R_n = r_n), \tag{9.3} \]
where $r_+ = \sum_{i=1}^{n-1} r_i$. This probability depends on several factors, but most importantly on
$\delta = N - nM$. For small $\delta$ the probability that the allele has dropped out in at least one of the replicates
is considerable, and for negative $\delta$ we are sure to have drop-outs. However, when $\delta \gg 0$ the
probability of drop-outs in any of the replicates is minimal, i.e. when the amount of DNA is large
all the replicates should have all alleles present.
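The probability in (9.3) can be evaluated exactly for small molecule counts by following the sequential binomial scheme replicate by replicate. The sketch below is a minimal illustration under the stated idealised model (no measurement error, detection exactly when a replicate holds more than M molecules); the function name and the example numbers are hypothetical, and the direct binomial formula is only intended for moderate N.

```python
from functools import lru_cache
from math import comb


def prob_all_detected(N, M, n, pi_a):
    """P(R_1 > M, ..., R_n > M) under the sequential binomial aliquot scheme.

    N    : number of molecules after extraction
    M    : detection limit in molecules (a replicate shows the allele if R_i > M)
    n    : number of replicates
    pi_a : aliquot proportion per replicate, with n * pi_a <= 1
    """

    @lru_cache(maxsize=None)
    def tail(i, remaining):
        # Probability that replicates i..n all exceed M, given `remaining`
        # molecules are left in the tube before replicate i is drawn.
        if i > n:
            return 1.0
        # Conditional aliquot proportion; clamped to guard against rounding.
        q = min(1.0, pi_a / (1.0 - (i - 1) * pi_a))
        total = 0.0
        for r in range(M + 1, remaining + 1):
            pmf = comb(remaining, r) * q**r * (1.0 - q) ** (remaining - r)
            total += pmf * tail(i + 1, remaining - r)
        return total

    return tail(1, N)


# Hypothetical example: 60 molecules, three replicates of about 30% each.
print(prob_all_detected(N=60, M=10, n=3, pi_a=0.3))
```

When N - nM is negative the sums are empty and the function returns zero, matching the remark that drop-out is then certain in at least one replicate.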
In low template DNA (LT-DNA, formerly known as Low Copy Number DNA, LCN-DNA; Gill
and Buckleton, 2010a) it is common to use "the biological model" to form a so-called consensus
profile (Buckleton et al., 2005, Chapter 8). That is, only alleles present in at least two replicates
are reported in the consensus profile (Gill et al., 2000). However, from the probability in (9.3)
it is for small $N$ very likely that an allele present in some replicates is absent in others. Hence,
the definition of a consensus profile may not be the best approach when it is expected that the
replicates will show different alleles for small amounts of DNA, which is the case for LT-DNA. A
better method would be to model the negative correlation between peak intensities of replicates.
In the left panel of Figure 9.1 the probability that the consensus profile excludes a true
allele is plotted for two and three replicates against the total amount of extracted DNA, i.e.
$2 \cdot P(R_1 < M, R_2 > M)$ and $3 \cdot P(R_1 < M, R_2 < M, R_3 > M)$, where permutation of replicates induces the
multiplication by weights. It is assumed that in order to trigger the observation of an allele using
a 50 rfu threshold it is required to have 50 pg of DNA material prior to PCR. Furthermore, for
the two-replicate case all of the extracted DNA is used in equal amounts. For the three-replicate
situation it is intended to assign 30% of the total DNA to each replicate.
[Figure 9.1 plots. Horizontal axis in all three panels: amount of extracted DNA (pg), from 50 to 250. Vertical axes: probability that exactly one replicate shows the allele (left), probability that at least two replicates show the allele (centre), and pairwise correlation of replicates (right). Curves are shown for two replicates and three replicates.]
Figure 9.1: Left: Probability that the consensus profile excludes a true allele for two and three
replicates. Centre: Probability that the allele will be included in the consensus profile for two
and three replicates. Right: Pairwise correlation between consensus profile inducing replicates.
From the curves in the left panel of Figure 9.1 it is evident that for small and large amounts
of DNA the probabilities are effectively zero. For the small values this is because neither of
the replicates has observed alleles (no allele in the consensus profile due to drop-out in both
replicates), whereas for the large values it is because the allele is observed in all replicates (allele in
the consensus profile). The maxima are at 101 pg and 160 pg, respectively, while the ranges where
the probabilities are larger than $10^{-3}$ are 75-136 pg and 112-201 pg, for two and three replicates.
In the centre panel of Figure 9.1 the probability that the consensus profile will include the allele
for two and three replicates is plotted against the amount of extracted DNA. For an allele to be
present in the consensus profile it must be detected at least twice:
\[ P(R_1 > M, R_2 > M) \quad \text{and} \quad 3 \cdot P(R_1 < M, R_2 > M, R_3 > M) + P(R_1 > M, R_2 > M, R_3 > M). \]
To be 99.9% certain that an allele is present in the consensus profile, the minimum required
amount of extracted DNA is 140 pg for two replicates and 245 pg for three replicates using the
assumed aliquot sampling scheme.
Define the indicator variables $T_i$, which are 1 if $R_i > M$ and 0 otherwise. Hence, $T_i$ indicates
whether replicate $i$ triggers the observation of an allele above the threshold. The consensus
inducing correlations are thus $\mathrm{Cor}(T_1, T_2)$ for two replicates and similarly $\mathrm{Cor}(T_1, T_2|T_3 = 0)$
for three replicates. The latter correlation is naturally subject to permutation of replicates, but
since the amount of DNA for one replicate, here $R_3$, is less than $M$, the two other replicates need
to show the allele for it to be included in the consensus profile. The right panel of Figure 9.1
shows the negative correlations, as expected due to the limited amount of DNA. For two replicates
the pairwise correlation is approximately equal to the negative of the probability in the left panel of
Figure 9.1.
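For the two-replicate case, where all extracted DNA is split evenly, these quantities can be computed exactly from a single binomial variable. The sketch below is an illustration under the same idealised assumptions: counts are in molecules rather than pg, a replicate is taken to show the allele when it holds more than M molecules, and a count of M or fewer is treated as a drop-out (a convention chosen here). The function name is hypothetical.

```python
from math import comb, sqrt


def two_replicate_summary(N, M):
    """Exact summaries when all N molecules are split evenly over two replicates.

    R1 ~ bin(N, 1/2) and R2 = N - R1; replicate i shows the allele if R_i > M.
    Returns (P(exactly one replicate shows it), P(both show it), Cor(T1, T2)).
    """
    p1 = p2 = p_both = 0.0
    for r1 in range(N + 1):
        pmf = comb(N, r1) * 0.5**N
        t1 = r1 > M
        t2 = (N - r1) > M
        p1 += pmf * t1
        p2 += pmf * t2
        p_both += pmf * (t1 and t2)
    p_exactly_one = p1 + p2 - 2.0 * p_both
    var1, var2 = p1 * (1.0 - p1), p2 * (1.0 - p2)
    if var1 <= 0.0 or var2 <= 0.0:
        cor = float("nan")  # correlation is undefined for a constant indicator
    else:
        cor = (p_both - p1 * p2) / sqrt(var1 * var2)
    return p_exactly_one, p_both, cor


# Hypothetical example with 100 molecules and a detection limit of 45.
print(two_replicate_summary(N=100, M=45))
```

The first returned value corresponds to the consensus profile excluding a true allele, the second to the allele being included, and the third to the pairwise correlation of the indicator variables.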
The general picture from the model and analysis of replicates indicates that the concept of the
consensus profile (or biological model) is flawed, due to the disproportion between expected
peak intensities and consensus profile construction. However, it should be added that the figures
above are computed without taking measurement error, PCR efficiency variation, quantification
inaccuracy, etc. into account. A more refined model should include these and other factors to be
applicable to real STR data.
9.4.2 The number of contributors
When evaluating DNA mixtures, the number of contributors is a source of uncertainty. Lauritzen and Mortera (2002) derived an upper bound on the number of unknown contributors worth considering (typically) under $H_d$. That is, the bound $b$ is computed such that if the number of unknown profiles $x$ is larger than $b$, the evidence is less favourable to the defendant than with $x = b$. However, this bound is computed without taking the quantitative part of the evidence into consideration and may therefore be inaccurate for $LR = P(Q, G, K \mid H_p)/P(Q, G, K \mid H_d)$.
9.4.3 Distribution of $\max_G L(Q \mid G)$ - optimisation over a discrete space
In relation to the problem above, it is relevant to be able to quantify the distribution of $L(Q \mid G)$. How does one measure the significance of a change in the $L(Q \mid G)$-value when changing the number of contributors $m$? And how is this related to the mixture proportions $\alpha$? For a fixed combination of profiles, going from $m$ to $m - 1$ contributors is equivalent to setting $\alpha_1 = 0$. However, since the greedy algorithm searches over all possible combinations in the discrete space $G$, it may be inappropriate to rely on asymptotic theory or other common approaches to test $H_0: \alpha_1 = 0$ against $H_1: \alpha_1 > 0$.
9.4.4 Estimation of P(D) using the floating threshold methodology
In the drop-out model of Chapter 6 the limit of detection threshold was fixed at 50 rfu. However, if the STR signal is assigned positive and negative by the floating threshold methodology (Chapter 7), the threshold is not fixed and the previous definition of a drop-out, $D = \{h < 50\}$, does not apply. The definition of the drop-out probability on page 170 as an integral may be used in this setting. That is, the quantitative data is split into two disjoint partitions, where the noise part (off-ladder observations not in pull-up position) is used to determine $T$ and is therefore independent of the quantitative signal in the remaining part. Hence, it would be possible to estimate a mean, $\mu_h$, and standard deviation, $\sigma_h$, for the peak heights and evaluate
$$P(D; \mu_h, \sigma_h) = P(h < T; \mu_h, \sigma_h) = \int_0^T f(h; \mu_h, \sigma_h)\,dh.$$
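As an illustration of how this integral could be evaluated, the Python sketch below assumes, purely for the sake of the example, that the peak height density $f$ is log-normal and moment-matched to $\mu_h$ and $\sigma_h$. The density family is not specified here, and the function names are hypothetical. The closed-form value is compared with a direct numerical integration of $f$ over $(0, T)$.

```python
import math


def lognormal_params(mean, sd):
    """Moment-match a log-normal density to a given mean and standard deviation."""
    s2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - 0.5 * s2, math.sqrt(s2)


def dropout_probability(threshold, mean, sd):
    """P(D) = P(h < T) under the assumed log-normal peak height density."""
    if threshold <= 0:
        return 0.0
    mu_log, s_log = lognormal_params(mean, sd)
    z = (math.log(threshold) - mu_log) / (s_log * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def dropout_probability_numeric(threshold, mean, sd, n=20000):
    """The same quantity by midpoint-rule integration of f(h) over (0, T)."""
    mu_log, s_log = lognormal_params(mean, sd)
    step = threshold / n
    total = 0.0
    for i in range(n):
        h = (i + 0.5) * step
        z = (math.log(h) - mu_log) / s_log
        total += math.exp(-0.5 * z * z) / (h * s_log * math.sqrt(2.0 * math.pi))
    return total * step
```

With a floating threshold, `threshold` would be the value $T$ determined from the noise partition, while `mean` and `sd` would be estimated from the remaining signal. Any other density with a tractable distribution function could be substituted for the log-normal.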
9.4.5 Evaluating the entire signal
As mentioned in Chapter 1, the use of a threshold or limit of detection implies the possibility of drop-out. In that chapter the argument for using a threshold strategy in this thesis was to limit the set of possible combinations that needed to be evaluated for the LR. However, it may be possible to evaluate the entire STR signal by including all observations above a given limit, 5 rfu say. This would lead to more complicated expressions for the LR, but with a gain in conceptual clarity since the assignment of positive/negative alleles is superfluous. Using this methodology, especially $P(E \mid H_d)$ could imply a summation over a huge set, which would be computationally intensive. However, the terms in $P(E \mid H_p)$ and $P(E \mid H_d)$ that would have numerical impact on the LR would be those including the observed alleles with the strongest signals. Often these would be the terms associated with the alleles in $K$. This need not be the case, but searching for a best matching pair of profiles would still be possible. For the evaluation of the LR to be operational, it might be necessary to use importance sampling to evaluate the sum in the denominator, since fewer known profiles are specified by $H_d$ than by $H_p$. Assume that the hypothesis $H_d$ states that the observed crime scene stain is a two-person DNA mixture. Then, correcting for stutters and pull-up effects, it may be possible to determine a best matching pair of profiles $\hat{G}$. This best matching configuration is then applicable as reference profiles for importance sampling, similar to the construction in Section 5.6.
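The idea of centring an importance sampler on a best matching configuration can be sketched on a toy problem. The Python example below is not the construction of Section 5.6. It assumes a hypothetical one-dimensional state space in which the integer `g_hat` plays the role of the best matching configuration, a uniform prior, and a likelihood that decays with the distance from `g_hat`. The proposal mixes a uniform component with a kernel centred at `g_hat`, which keeps the importance weights bounded.

```python
import math
import random


def importance_sum(states, prior, likelihood, proposal, n_draws=50000, seed=1):
    """Estimate sum_G likelihood(G) * prior(G) by drawing G from `proposal`.

    `proposal` may be unnormalised; it is normalised over `states` here.
    """
    rng = random.Random(seed)
    weights = [proposal(g) for g in states]
    total = sum(weights)
    draws = rng.choices(states, weights=weights, k=n_draws)
    return sum(likelihood(g) * prior(g) * total / proposal(g) for g in draws) / n_draws


# Toy setting (all choices below are illustrative assumptions).
states = list(range(1000))
g_hat = 500  # stand-in for the best matching configuration


def prior(g):
    return 1.0 / len(states)


def likelihood(g):
    return math.exp(-abs(g - g_hat) / 5.0)


def proposal(g):
    # Roughly half the mass is uniform, half is concentrated around g_hat.
    return 0.02 + math.exp(-abs(g - g_hat) / 10.0)


exact = sum(likelihood(g) * prior(g) for g in states)
estimate = importance_sum(states, prior, likelihood, proposal)
```

In this toy setting the exact sum is available by enumeration, so `estimate` can be compared with `exact`. In the forensic setting enumeration is precisely what one wishes to avoid, and the quality of the estimate would depend on how well the proposal covers the profiles with non-negligible likelihood.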
Let $E$ denote the signal obtained from the EPG based on a crime related sample, e.g. a sample taken from a scene of crime. When evaluating the sample we are interested in $P(E \mid H_a)$ for some hypothesis $H_a$. $H_a$ induces a discrete set of DNA profiles, which we denote $C_a = \{G : G \text{ is consistent with } H_a\}$. Furthermore, $H_a$ may specify further evidence in terms of DNA profiles of identified individuals. Let $K$ denote the common set of known profiles of the two hypotheses evaluated in the LR. For example, in a two-person DNA mixture $K$ may be the profiles of a victim and the suspect, $K = (G_V, G_S)$. Thus the likelihood ratio is $LR = P(E, K \mid H_p)/P(E, K \mid H_d)$. This LR is evaluated by summing in both numerator and denominator over profiles in $C_p$ and $C_d$, respectively. That is,
$$P(E, K \mid H_a) = \sum_{G \in C_a} P(E, K \mid G)\,P(G).$$
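When the sets $C_p$ and $C_d$ are small enough to enumerate, the sum can be evaluated directly. The Python sketch below uses made-up numbers and hypothetical profile labels. The dictionaries stand in for $P(E, K \mid G)$ and $P(G)$, which in practice would come from the signal model and the population model.

```python
def hypothesis_probability(profiles, lik, prior):
    """P(E, K | H_a) = sum over G in C_a of P(E, K | G) * P(G)."""
    return sum(lik[g] * prior[g] for g in profiles)


def likelihood_ratio(c_p, c_d, lik, prior):
    """LR = P(E, K | H_p) / P(E, K | H_d) by direct enumeration."""
    return hypothesis_probability(c_p, lik, prior) / hypothesis_probability(c_d, lik, prior)


# Illustrative values only: one configuration under H_p, two under H_d.
c_p = ["VS"]
c_d = ["VU1", "VU2"]
lik = {"VS": 0.8, "VU1": 0.1, "VU2": 0.4}        # stand-in for P(E, K | G)
prior = {"VS": 0.01, "VU1": 0.02, "VU2": 0.005}  # stand-in for P(G)

lr = likelihood_ratio(c_p, c_d, lik, prior)
```

Here the numerator is $0.8 \cdot 0.01 = 0.008$ and the denominator is $0.1 \cdot 0.02 + 0.4 \cdot 0.005 = 0.004$, so the toy LR is 2.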
We assume that given $G$ no other profiles affect the observed signal. In particular this is true for the known profiles, $K$. Hence, $E$ and $K$ are conditionally independent given $G$: $P(E, K \mid G) = P(E \mid G)P(K \mid G)$. For each set of profiles $G \in C_a$ a set of stutters and on-ladder pull-up peaks is induced. Let $S_G$ and $P_G$ denote these derivatives, where $S_G$ includes both stutters (first, second, third, etc.) and back-stutters. Furthermore, for each $G$ the allelic ladder, $L$, is known and fixed.
Given $G$ the observed signal, $E$, may be decomposed into five parts that constitute an STR signal:
- Off-ladder noise, $E^{\bar{L}}_n$: all intensity observations in off-ladder position and not in possible pull-up position. $E^{\bar{L}}_n$ is fixed for all $G$ since it only relies on the fixed ladder, $L$.
- The signal due to the proposed profiles in $G$: $E_G$.
- The signal due to stutters induced by profiles in $G$: $E_{S_G}$.
- The signal due to pull-up peaks induced by profiles in $G$ and $S_G$: $E_{P_G}$.
- On-ladder noise, $E^{L}_n$: all on-ladder observations not ascribed to $G$ and its derivatives.
Using this decomposition we have for $G \in C_a$:
$$
\begin{aligned}
P(E \mid G) ={}& P(E^{L}_n \mid E_{P_G}, E_{S_G}, E_G, E^{\bar{L}}_n, G)\, P(E_{P_G} \mid E_{S_G}, E_G, E^{\bar{L}}_n, G) \\
& \times P(E_{S_G} \mid E_G, E^{\bar{L}}_n, G)\, P(E_G \mid E^{\bar{L}}_n, G)\, P(E^{\bar{L}}_n \mid G) \\
={}& P(E^{L}_n \mid E^{\bar{L}}_n, G)\, P(E_{P_G} \mid E_{S_G}, E_G, E^{\bar{L}}_n, G) \\
& \times P(E_{S_G} \mid E_G, E^{\bar{L}}_n, G)\, P(E_G \mid E^{\bar{L}}_n, G)\, P(E^{\bar{L}}_n),
\end{aligned}
\tag{9.4}
$$
where $P(E^{\bar{L}}_n \mid G) = P(E^{\bar{L}}_n)$ since the off-ladder noise is fixed for all profiles $G$; this factor therefore cancels when forming the likelihood ratio. It is likely that some of the terms in (9.4) can be simplified due to conditional independence given $G$. For example, the on-ladder noise, $E^{L}_n$, may be independent of the off-ladder noise, $E^{\bar{L}}_n$, given $G$ once the parameters of $P(E^{\bar{L}}_n)$ are determined, i.e. $P(E^{L}_n \mid E^{\bar{L}}_n, G) = P(E^{L}_n \mid G)$.
The LR is formed by a hypothesis specific ratio of the expression in (9.4):
$$
LR = \frac{\sum_{G \in C_p} P(E^{L}_n \mid E^{\bar{L}}_n, G)\, P(E_{P_G} \mid E_{S_G}, E_G, E^{\bar{L}}_n, G)\, P(E_{S_G} \mid E_G, E^{\bar{L}}_n, G)\, P(E_G \mid E^{\bar{L}}_n, G)\, P(K \mid G)\, P(G)}
{\sum_{G' \in C_d} P(E^{L}_n \mid E^{\bar{L}}_n, G')\, P(E_{P_{G'}} \mid E_{S_{G'}}, E_{G'}, E^{\bar{L}}_n, G')\, P(E_{S_{G'}} \mid E_{G'}, E^{\bar{L}}_n, G')\, P(E_{G'} \mid E^{\bar{L}}_n, G')\, P(K \mid G')\, P(G')}.
$$
As in Section 9.3 we consider a two-person DNA mixture with known victim profile $G_V$ and suspect profile $G_S$, where $H_p: (G_V, G_S)$ and $H_d: (G_V, G_U)$. To save space we define $G_{V,S} = (G_V, G_S)$ and $G_{V,U} = (G_V, G_U)$; then the likelihood ratio is
$$
LR = \frac{P(E^{L}_n \mid E^{\bar{L}}_n, G_{V,S})\, P(E_{P_{G_{V,S}}} \mid E_{S_{G_{V,S}}}, E_{G_{V,S}}, E^{\bar{L}}_n, G_{V,S})\, P(E_{S_{G_{V,S}}} \mid E_{G_{V,S}}, E^{\bar{L}}_n, G_{V,S})\, P(E_{G_{V,S}} \mid E^{\bar{L}}_n, G_{V,S})\, P(K \mid G_{V,S})\, P(G_{V,S})}
{\sum_{G_U :\, G_{V,U} \in C_d} P(E^{L}_n \mid E^{\bar{L}}_n, G_{V,U})\, P(E_{P_{G_{V,U}}} \mid E_{S_{G_{V,U}}}, E_{G_{V,U}}, E^{\bar{L}}_n, G_{V,U})\, P(E_{S_{G_{V,U}}} \mid E_{G_{V,U}}, E^{\bar{L}}_n, G_{V,U})\, P(E_{G_{V,U}} \mid E^{\bar{L}}_n, G_{V,U})\, P(K \mid G_{V,U})\, P(G_{V,U})}.
$$
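Each term in these sums is a product of many probabilities and may be extremely small, so a practical evaluation would probably work with logarithms. The Python sketch below is one possible way of doing so and is not part of the methodology above. It assumes that the caller supplies, for every candidate $G$, the sum of the log-factors, and it combines the terms with the standard log-sum-exp device. The function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood_ratio(log_terms_p, log_terms_d):
    """log LR from per-profile log terms under H_p and H_d.

    Each entry is the sum of the log-factors for one candidate G,
    including log P(K | G) and log P(G).
    """
    return log_sum_exp(log_terms_p) - log_sum_exp(log_terms_d)


# Illustrative log terms, far too small to exponentiate directly.
terms_p = [-1000.0]
terms_d = [-1003.0, -1005.0]
log_lr = log_likelihood_ratio(terms_p, terms_d)

# A factor common to every G, such as the off-ladder noise term, adds the
# same constant to all log terms and leaves the log LR unchanged.
common = -250.0
log_lr_shifted = log_likelihood_ratio(
    [v + common for v in terms_p], [v + common for v in terms_d]
)
```

In this toy example `log_lr` equals $3 - \log(1 + e^{-2})$, and `log_lr_shifted` agrees with it, which mirrors the cancellation of a factor that does not depend on $G$.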
Bibliography
Alaeddini, R., S. J. Walsh, and A. Abbas (2010). Forensic implications of genetic analyses from degraded DNA - A review. Forensic Science International: Genetics 4(3), 148-157.
Alonso, A. et al. (2005). Challenges of DNA profiling in mass disaster investigations. Croatian Medical Journal 46(4), 540-548.
Applied Biosystems (2000). GeneScan Reference Guide - Chemistry Reference for the ABI PRISM 310 Genetic Analyzer. Applied Biosystems. Figure Virtual Filter Set F, pp. 4-10.
Applied Biosystems (2006). AmpFlSTR SGM Plus PCR Amplification Kit User's Manual. Applied Biosystems.
Ayres, K. L. (2000). A two-locus forensic match probability for subdivided populations. Genetica 108, 137-143.
Balding, D. J. (2003). Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology 63, 221-230.
Balding, D. J. (2005). Weight-of-evidence for Forensic DNA Profiles. Chichester, West Sussex: John Wiley & Sons, Ltd.
Balding, D. J. and J. S. Buckleton (2009). Interpreting low template DNA profiles. Forensic Science International: Genetics 4(1), 1-10.
Balding, D. J. and R. A. Nichols (1994). DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands. Forensic Science International 64, 125-140.
Balding, D. J. and R. A. Nichols (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3-12.
Balding, D. J. and R. A. Nichols (1997). Significant genetic correlations among caucasians at forensic DNA loci. Heredity 78(6), 583-589.
Barndorff-Nielsen, O. E. and D. R. Cox (1994). Inference and Asymptotics. Number 52 in Monographs on Statistics and Applied Probability. London: Chapman & Hall.
Bender, K., M. J. Farfan, and P. M. Schneider (2004). Preparation of degraded human DNA under controlled conditions. Forensic Science International 139(2-3), 135-140.
Bill, M. et al. (2005). PENDULUM - a guideline-based approach to the interpretation of STR mixtures. Forensic Science International 148, 181-189.
Box, G. E. P. and N. R. Draper (1987). Empirical model-building and response surfaces. Wiley.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1-3.
Buckleton, J. S. and J. M. Curran (2008). A discussion of the merits of random man not excluded and likelihood ratios. Forensic Science International: Genetics 2, 343-348.
Buckleton, J. S., C. M. Triggs, and S. J. Walsh (2005). Forensic DNA evidence interpretation, pp. 217-274. Boca Raton, FL: CRC Press.
Budowle, B. and T. R. Moretti (1999). Genotype profiles for six population groups at the 13 CODIS short tandem repeat core loci and other PCR-based loci. Forensic Science Communications.
Butler, J. M. (2005). Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers (2 ed.). Burlington, MA: Elsevier Academic Press Inc., U.S.
Clayton, T. M., J. P. Whitaker, R. Sparkes, and P. D. Gill (1998). Analysis and interpretation of mixed forensic stains using DNA STR profiling. Forensic Science International 91, 55-70.
Cockerham, C. C. (1969). Variance of gene frequencies. Evolution 23(1), 72-84.
Cockerham, C. C. (1973). Analysis of gene frequencies. Genetics 74(4), 679-700.
Colotte, M., V. Couallier, S. Tuffet, and J. Bonnet (2009). Simultaneous assessment of average fragment size and amount in minute samples of degraded DNA. Analytical Biochemistry 388(2), 345-347.
Cook, O. and L. Dixon (2006). The prevalence of mixed DNA profiles in fingernail samples taken from individuals in the general population. Forensic Science International: Genetics 1(1), 62-68.
Cowell, R. G. (2009). Validation of an STR peak area model. Forensic Science International: Genetics 3(3), 193-199.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2007a). A gamma model for DNA mixture analyses. Bayesian Analysis 2(2), 333-348.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2007b). Identification and separation of DNA mixtures using peak area information. Forensic Science International 166, 28-34.
Cowell, R. G., S. L. Lauritzen, and J. Mortera (2010). Probabilistic expert systems for handling artifacts in complex DNA mixtures. Forensic Science International: Genetics. In Press.
Cox, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics 29(2), 357-372.
Cox, D. R. and D. V. Hinkley (1974). Theoretical Statistics. Chapman and Hall Ltd.
Curran, J. M. (2008). A MCMC method for resolving two person mixtures. Science & Justice 48, 168-177.
Curran, J. M., J. S. Buckleton, C. M. Triggs, and B. S. Weir (2002). Assessing uncertainty in DNA evidence caused by sampling effects. Science & Justice 42(1), 29-37.
Curran, J. M., C. M. Triggs, J. S. Buckleton, and B. S. Weir (1999). Interpreting DNA mixtures in structured populations. Journal of Forensic Sciences 44(5), 987-995.
Curran, J. M. and T. Tvedebrink (2010a). DNAtools - a R package for forensic DNA database analysis. Journal of Computational Statistics. Manuscript in preparation.
Curran, J. M. and T. Tvedebrink (2010b). DNAtools: Statistical functions for analysing forensic DNA databases. R package version 0.1.
Curran, J. M., S. J. Walsh, and J. S. Buckleton (2007). Empirical testing of estimated DNA frequencies. Forensic Science International: Genetics 1, 267-272.
Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods and their Application. Cambridge University Press.
Dixon, L. A. et al. (2006). Analysis of artificially degraded DNA using STRs and SNPs - results of a collaborative European (EDNAP) exercise. Forensic Science International 164(1), 33-44.
Donnelly, P. (1995a). Match probability calculations for multi-locus DNA profiles. Genetica 96, 55-67.
Donnelly, P. (1995b). Nonindependence of matches at different loci in DNA profiles: quantifying the effect of close relatives on the match probability. Heredity 75, 26-34.
Evett, I. W., P. D. Gill, and J. A. Lambert (1998). Taking account of peak areas when interpreting mixed DNA profiles. Journal of Forensic Sciences 43(1), 62-69.
Evett, I. W. and B. S. Weir (1998). Interpreting DNA Evidence: Statistical Genetics for Forensic Scientists. Sunderland, MA: Sinauer Associates.
Field, C. A. and A. H. Welsh (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society. Series B, Statistical methodology 69(3), 369-390.
Gilder, J. R., T. E. Doom, K. Inman, and D. E. Krane (2007). Run-Specific Limits of Detection and Quantitation for STR-based DNA Testing. Journal of Forensic Sciences 52(1), 97-101.
Gill, P. D. et al. (1998). Interpreting simple STR mixtures using allele peak areas. Forensic Science International 91(1), 41-53.
Gill, P. D. et al. (2006). DNA commission of the International Society of Forensic Genetics: Recommendations on the interpretation of mixtures. Forensic Science International 160(2-3), 90-101.
Gill, P. D. and J. S. Buckleton (2010a). A universal strategy to interpret DNA profiles that does not require a definition of low-copy-number. Forensic Science International: Genetics 4(4), 221-227.
Gill, P. D. and J. S. Buckleton (2010b). Mixture interpretation: defining the relevant features for guidelines for the assessment of mixed DNA profiles in forensic casework. Journal of Forensic Sciences 55(1), 265-268.
Gill, P. D., J. M. Curran, and K. Elliot (2005). A graphical simulation model of the entire DNA process associated with the analysis of short tandem repeat loci. Nucleic Acids Research 33(2), 632-643.
Gill, P. D., J. Whitaker, C. Flaxman, N. Brown, and J. S. Buckleton (2000). An investigation of the rigor of interpretation rules for STRs derived from less than 100 pg of DNA. Forensic Science International 112(1), 17-40.
Green, P. J. and J. Mortera (2009). Sensitivity of inferences in forensic genetics to assumptions about founding genes. Annals of Applied Statistics 3(2), 731-763.
Green, R., I. Roinestad, C. Boland, and L. Hennessy (2005). Developmental Validation of the Quantifiler Real-Time PCR kits for the Quantification of Human Nuclear DNA samples. Journal of Forensic Sciences 50(4), 809-825.
Hardy, G. H. (1908). Mendelian proportions in a mixed population. Science 28(706), 49-50.
Harrell Jr., F. E. (2001). Regression Modeling Strategies. Springer.
Holsinger, K. E. (1999). Analysis of genetic diversity in geographically structured populations: A Bayesian perspective. Hereditas 130, 245-255.
Holsinger, K. E. and B. S. Weir (2009). Genetics in geographically structured populations: defining, estimating and interpreting $F_{ST}$. Nature Reviews. Genetics 10(9), 639-650.
Irwin, J. A. et al. (2007). Application of low copy number STR typing to the identification of aged, degraded skeletal remains. Journal of Forensic Sciences 52(6), 1322-1327.
Johnson, N. L., S. Kotz, and N. Balakrishnan (1997). Discrete Multivariate Distributions. Wiley.
Lange, K. (1993). Match probabilities in racially admixed populations. American Journal of Human Genetics 52, 305-311.
Lange, K. (1995a). Applications of the Dirichlet distribution to forensic match probabilities. Genetica 96, 107-117.
Lange, K. (1995b). Mathematical and Statistical Methods for Genetic Analysis (2 ed.). Springer.
Laurie, C. and B. S. Weir (2003). Dependency effects in multi-locus match probabilities. Theoretical Population Biology 63, 207-219.
Lauritzen, S. L. (1996). Graphical models. Oxford University Press.
Lauritzen, S. L. and J. Mortera (2002). Bounding the number of contributors to mixed DNA stains. Forensic Science International 130(2-3), 125-126.
Little, R. and D. Rubin (2002). Statistical Analysis with missing data (2 ed.). Wiley.
Maimon, G. (2010). A Bayesian approach to the statistical interpretation of DNA evidence. Ph.D. thesis, Department of Mathematics and Statistics, McGill University, Montreal, Canada.
McCullagh, P. and J. Nelder (1989). Generalized Linear Models. Chapman and Hall.
Mosimann, J. E. (1962). On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49(1-2), 65-82.
Mueller, L. D. (2008). Can simple population genetic models reconcile partial match frequencies observed in large forensic databases? Journal of Genetics 87(2), 101-107.
Neerchal, N. K. and J. G. Morel (2005). An improved method for the computation of maximum likelihood estimates for multinomial overdispersion models. Computational Statistics & Data Analysis 49, 33-43.
Nichols, R. A. and D. J. Balding (1991). Effects of population structure on DNA fingerprint analysis in forensic science. Heredity 66, 297-302.
Paul, S. R., U. Balasooriya, and T. Banerjee (2005). Fisher information matrix for the Dirichlet-multinomial distribution. Biometrical Journal 47(2), 230-236.
Perlin, M. W. and B. Szabady (2001). Linear mixture analysis: A mathematical approach to resolving mixed DNA samples. Journal of Forensic Sciences 46(6), 1372-1378.
Petricevic, S. et al. (2009). Validation and development of interpretation guidelines for low copy number (LCN) DNA profiling in New Zealand using the AmpFlSTR SGM Plus(TM) multiplex. Forensic Science International: Genetics In Press, Corrected Proof.
Phillips, C., T. Tvedebrink, et al. (2010). Analysis of global variability in 15 established and 5 new European Standard Set (ESS) STRs using the CEPH human genome diversity panel. Forensic Science International: Genetics. In Press.
Prinz, M. et al. (2007). DNA Commission of the International Society for Forensic Genetics (ISFG): Recommendations regarding the role of forensic genetics for disaster victim identification (DVI). Forensic Science International: Genetics 1(1), 3-12.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Rannala, B. and J. A. Hartigan (1996). Estimating gene flow in island populations. Genetical Research 67, 147-158.
Robert, C. P. and G. Casella (2004). Monte Carlo Statistical Methods (2 ed.). Springer.
Samanta, S., Y.-J. Li, and B. S. Weir (2009). Drawing inferences about the coancestry coefficient. Theoretical Population Biology 75, 312-319.
Schneider, P. M. et al. (2004). STR analysis of artificially degraded DNA - results of a collaborative European exercise. Forensic Science International 139(2-3), 123-134.
Song, Y. S. and M. Slatkin (2007). A graphical approach to multi-locus match probability computation: Revisiting the product rule. Theoretical Population Biology 72, 96-110.
Troyer, K., T. Gilroy, and B. Koeneman (2001). A nine STR locus match between two apparent unrelated individuals using AmpFlSTR Profiler Plus and COfiler. Proceedings of the Promega 12th International Symposium on Human Identification.
Tvedebrink, T. (2009). dirmult: Estimation in Dirichlet-Multinomial distribution. R package version 0.1.3.
Tvedebrink, T. (2010). Overdispersion in allelic counts and θ-correction in forensic genetics. Theoretical Population Biology. In Press.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2008). Amplification of DNA mixtures - Missing data approach. Forensic Science International: Genetics Supplement Series 1, 664-666.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2009). Estimating the probability of allelic drop-out of STR alleles in forensic genetics. Forensic Science International: Genetics 3(4), 222-226.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010a). Evaluating the weight of evidence using quantitative STR data in DNA mixtures. Journal of the Royal Statistical Society. Series C, Applied statistics. In Press.
Tvedebrink, T., P. S. Eriksen, H. S. Mogensen, and N. Morling (2010b). Identifying contributors of DNA mixtures by means of quantitative information of STR typing. Journal of Computational Biology. Accepted for publication.
Ukoumunne, O. C., A. C. Davison, M. C. Gulliford, and S. Chinn (2003). Non-parametric bootstrap confidence intervals for the intraclass correlation coefficient. Statistics in Medicine 22, 3805-3821.
Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (4 ed.). Springer.
Votaw, D. F. (1948). Testing compound symmetry in a normal multivariate distribution. Annals of Mathematical Statistics 19(4), 447-473.
Wang, T., N. Xue, and J. D. Birdwell (2006). Least-square deconvolution: A framework for interpreting short tandem repeat mixtures. Journal of Forensic Sciences 51(6), 1284-1297.
Weinberg, W. (1908). Über den Nachweis der Vererbung beim Menschen. Jahreshefte des Vereins für vaterländische Naturkunde in Württemberg 64, 368-382.
Weir, B. S. (1996). Genetic Data Analysis II. Sinauer Associates, Inc.
Weir, B. S. (2004). Matching and partially-matching DNA profiles. Journal of Forensic Sciences 49(5), 1-6.
Weir, B. S. (2007). The rarity of DNA profiles. The Annals of Applied Statistics 1(2), 358-370.
Weir, B. S. and C. C. Cockerham (1984). Estimating F-statistics for the Analysis of Population Structure. Evolution 38(6), 1358-1370.
Weir, B. S. and W. G. Hill (2002). Estimating F-statistics. Annual Review of Genetics 36, 721-750.
Wright, S. (1951). The genetical structure of populations. Annals of Eugenics 15, 323-354.
Zhou, H. and K. Lange (2010). MM algorithms for some discrete multivariate distributions. Journal of Computational and Graphical Statistics. In Press.