
Genetic Linkage Analysis

The document describes genetic linkage analysis techniques and software. It discusses how genetic linkage analysis uses pedigree data to associate disease genes to their approximate location on chromosomes. It provides an outline of topics including genetics concepts, chromosome structure, inheritance patterns, and relevant probability models. It also compares different software for genetic linkage analysis, describing features of the Superlink software being developed.

Uploaded by

Gayatri Dave
Copyright
© Attribution Non-Commercial (BY-NC)

Dan Geiger

Many slides were prepared by Maayan Fishelson, some are due to Nir Friedman, and some are mine. I have slightly edited many slides.

Genetic Linkage Analysis


A statistical method that uses pedigree data of affected families to associate disease genes with their approximate location on the chromosome.
Main idea: genes and markers that lie close together on the chromosome tend to stick together when passed on to offspring. If a disease is often passed to offspring along with specific markers, then the gene responsible for the disease is probably located close to these markers on the chromosome.
2

Outline
Part I: Reminder about genetics
Part II: Mathematics/algorithms
Part III: Software description
Part IV: Software demonstration online (by Gideon Greenspan)

Genetic Information
Gene: the basic unit of genetic information. Genes determine the inherited characters.
Genome: the collection of genetic information.
Chromosomes: storage units of genes.

Human Genome

Most human cells contain 46 chromosomes:
2 sex chromosomes (X, Y): XY in males, XX in females.
22 pairs of chromosomes named autosomes.

Chromosome Structure
Locus: the location of a gene on the chromosome.
Allele: one variant form (or state) of a gene at a particular locus.

Locus 1. Possible alleles: A1, A2
Locus 2. Possible alleles: B1, B2, B3


6

Alleles

genotype

phenotype

Eb- dominant allele. Ew- recessive allele.


7

Genotypes versus Phenotypes


At each locus (except on the sex chromosomes) there are 2 genes. These constitute the individual's genotype at the locus. The expression of a genotype is termed a phenotype, for example hair color, weight, or the presence or absence of a disease. Measured unordered pairs of genes are also treated as phenotypes (at least mathematically).

Sexual Reproduction
egg

sperm

zygote

gametes

Recombination Phenomenon
I. The exchange of pieces of homologous chromosomes during the formation of gametes.

II. A recombination between 2 genes has occurred if the haplotype of the individual contains 2 alleles that resided in different haplotypes in the individual's parent.
(Haplotype: the alleles at different loci that are received by an individual from one parent.)
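The definition of a recombination between two loci can be checked mechanically. The following is a minimal sketch, assuming haplotypes are written as pairs (allele at locus 1, allele at locus 2); the function name `is_recombinant` and the allele labels are illustrative and do not come from the slides.

```python
def is_recombinant(transmitted, parent_hap1, parent_hap2):
    """Return True if the transmitted two-locus haplotype combines alleles
    that resided in different haplotypes of the parent."""
    # A haplotype identical to one of the parent's haplotypes shows no
    # recombination. This also covers a parent that is homozygous at one
    # locus, where a recombination cannot be detected.
    if transmitted in (parent_hap1, parent_hap2):
        return False
    mixes = {
        (parent_hap1[0], parent_hap2[1]),
        (parent_hap2[0], parent_hap1[1]),
    }
    if transmitted in mixes:
        return True
    raise ValueError("haplotype cannot have come from this parent")


# Parent carries haplotypes A-B and a-b.
print(is_recombinant(("A", "b"), ("A", "B"), ("a", "b")))
print(is_recombinant(("a", "b"), ("A", "B"), ("a", "b")))
```

In practice the parent's haplotypes are often unknown, which is why recombinants cannot always be identified with certainty.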

10

Two Loci Inheritance


[Figure: pedigree of six individuals (numbered 1 to 6) typed at two loci with alleles A/a and B/b; one offspring haplotype is marked as recombinant.]

11

An example - the ABO locus.


The ABO locus determines detectable antigens on the surface of red blood cells. The 3 major alleles (A, B, O) interact to determine the various ABO blood types. O is recessive to A and B. Alleles A and B are codominant.

Phenotype   Genotype
A           A/A, A/O
B           B/B, B/O
AB          A/B
O           O/O

Note that the listed genotypes are unordered (we don't know which allele is from the father and which one is from the mother).
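The genotype-to-phenotype rule for the ABO locus can be written as a small function. This is only an illustrative sketch; the function name `abo_phenotype` is an assumption and not part of the slides.

```python
def abo_phenotype(allele1, allele2):
    """Map an unordered ABO genotype to its blood-type phenotype."""
    alleles = {allele1, allele2}
    if not alleles <= {"A", "B", "O"}:
        raise ValueError("alleles must be A, B, or O")
    if alleles == {"A", "B"}:
        return "AB"          # A and B are codominant
    if "A" in alleles:
        return "A"           # O is recessive to A
    if "B" in alleles:
        return "B"           # O is recessive to B
    return "O"


# The order of the two alleles does not matter (genotypes are unordered).
for genotype in [("A", "A"), ("A", "O"), ("O", "B"), ("A", "B"), ("O", "O")]:
    print(genotype, "->", abo_phenotype(*genotype))
```

Because several genotypes map to the same phenotype (A/A and A/O both give A), the phenotype alone does not determine the genotype.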
12

Example: ABO, AK1 on Chromosome 9

[Figure: pedigree of five individuals typed at the ABO locus (phenotypes A and O) and at the AK1 locus (genotypes A1/A1, A2/A2, and A1/A2); one transmitted haplotype is marked as recombinant.]

Male recombination fraction 0.12 and female 0.2

13

Comments about the example


Often:
Pedigrees are larger and more complex.
Not every individual is typed.
Recombinants cannot always be determined for certain.
There are more markers and they are polymorphic (not di-allelic).


14

Part II: Relevant Mathematics

15

Probabilistic model for inheritance I

Each node represents a random variable that has a finite number of states. The states of L11m, L11f, and L13m are the possible alleles at locus 1. The states of X11 are the possible unordered allele pairs at locus 1. The state of S13m is 0 or 1, depending on whose allele is transmitted to the offspring. Each node is associated with a conditional probability table.
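To make the conditional probability tables concrete, here is a minimal sketch of two of them for a locus with two alleles. The convention that S13m = 0 copies L11m and S13m = 1 copies L11f is an assumption made for illustration, as are the function names and the allele labels.

```python
from itertools import product

ALLELES = ["A1", "A2"]  # illustrative alleles at locus 1


def transmission_cpt(alleles):
    """P(L13m | L11m, L11f, S13m), keyed by (child_allele, l11m, l11f, s)."""
    cpt = {}
    for l11m, l11f, s, child in product(alleles, alleles, (0, 1), alleles):
        # Assumed convention: selector 0 passes on L11m, selector 1 passes on L11f.
        passed = l11m if s == 0 else l11f
        cpt[(child, l11m, l11f, s)] = 1.0 if child == passed else 0.0
    return cpt


def genotype_cpt(alleles):
    """P(X11 | L11m, L11f): the unordered pair is fixed by the two alleles."""
    cpt = {}
    for l11m, l11f in product(alleles, repeat=2):
        pair = tuple(sorted((l11m, l11f)))
        cpt[(pair, l11m, l11f)] = 1.0
    return cpt


cpt = transmission_cpt(ALLELES)
print(cpt[("A1", "A1", "A2", 0)])  # selector 0 passes on A1
print(cpt[("A1", "A1", "A2", 1)])  # selector 1 passes on A2, so A1 has probability 0
```

Both tables are deterministic; the randomness of inheritance enters through the distribution of the selector variable.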
16

Probabilistic model for inheritance II

17

Probabilistic model for inheritance III

18

19

Probabilistic model for Recombination

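\[
P(s_{23t} \mid s_{13t}) =
\begin{pmatrix}
1-\theta_2 & \theta_2 \\
\theta_2 & 1-\theta_2
\end{pmatrix},
\qquad t \in \{m, f\}
\]

Here \theta_2 is the recombination fraction between the two loci; rows correspond to the values of s_{13t} and columns to the values of s_{23t}.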


20

Recombination Fraction
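The recombination fraction between two loci is a monotone, nonlinear function of the physical distance separating the loci. Genetic distance is measured in centimorgans (cM); one centimorgan corresponds to one expected recombination per 100 meioses. The recombination fraction can differ between males and females, so in the previous slide we might want to have:

\[
P(s_{23m} \mid s_{13m}) =
\begin{pmatrix}
1-\theta_{2m} & \theta_{2m} \\
\theta_{2m} & 1-\theta_{2m}
\end{pmatrix},
\qquad
P(s_{23f} \mid s_{13f}) =
\begin{pmatrix}
1-\theta_{2f} & \theta_{2f} \\
\theta_{2f} & 1-\theta_{2f}
\end{pmatrix}
\]

where \theta_{2m} and \theta_{2f} are the sex-specific recombination fractions.

\[
\text{(Linkage)} \quad 0 < P(\text{Recombination}) \le 0.5 \quad \text{(No linkage)}
\]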
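The selector transition table is a 2 x 2 matrix parameterized by the recombination fraction. The sketch below builds it for each value of t; the numeric values of theta are illustrative assumptions, not values taken from the slides.

```python
def selector_transition(theta):
    """Return P(s_next | s_prev) as a 2x2 nested list.

    Entry [i][j] is the probability that the selector at the next locus
    equals j given that the selector at the previous locus equals i.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return [[1.0 - theta, theta],
            [theta, 1.0 - theta]]


# Hypothetical sex-specific recombination fractions, indexed by t.
THETA = {"m": 0.1, "f": 0.2}

for t, theta in THETA.items():
    matrix = selector_transition(theta)
    print(t, matrix)

# With theta = 0.5 the next selector is independent of the previous one,
# which is the "no linkage" case.
print(selector_transition(0.5))
```

Each row sums to 1, and the off-diagonal entry is exactly the probability of a recombination between the two loci.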


21

Having a disease Locus

\[
P(s_{23t} \mid s_{13t}, \theta_2) =
\begin{pmatrix}
1-\theta_2 & \theta_2 \\
\theta_2 & 1-\theta_2
\end{pmatrix}
\]

22

Maximum Likelihood Approach


23

Generalization: Bayesian Network


[Figure: network over the variables v, s, t, l, b, a, x, d, annotated with the local distributions p(v), p(s), p(t|v), p(l|s), p(b|s), p(a|t,l), p(x|a), p(d|a,b).]

\[
P(v, s, t, l, b, a, x, d) = P(v)\,P(s)\,P(t \mid v)\,P(l \mid s)\,P(b \mid s)\,P(a \mid t, l)\,P(x \mid a)\,P(d \mid a, b)
\]

Bayesian network = Directed Acyclic Graph (DAG), annotated with conditional probability distributions.
24

The Visit-to-Asia Example

25

Local distributions

Table for p(A|T,L):

p(A=y | L=n, T=n) = 0.02
p(A=y | L=n, T=y) = 0.60
p(A=y | L=y, T=n) = 0.99
p(A=y | L=y, T=y) = 0.99
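A local distribution such as p(A|T,L) can be stored as a lookup table. The sketch below uses the four values listed above; the complementary entries p(A=n|...) follow because each conditional distribution sums to 1. The names `P_A_YES` and `p_a` are illustrative.

```python
# p(A=y | L, T) from the table, keyed by (L, T).
P_A_YES = {
    ("n", "n"): 0.02,
    ("n", "y"): 0.60,
    ("y", "n"): 0.99,
    ("y", "y"): 0.99,
}


def p_a(a, l, t):
    """Return p(A=a | L=l, T=t) for a, l, t in {'y', 'n'}."""
    p_yes = P_A_YES[(l, t)]
    return p_yes if a == "y" else 1.0 - p_yes


print(p_a("y", "n", "y"))
print(p_a("n", "y", "y"))
```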
26

Exact Inference: Variable Elimination


General idea: Write a query in the form

\[
P(\text{data}) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{m} P(x_i \mid pa_i)
\]

Iteratively:
Move all irrelevant terms outside of the innermost sum.
Perform the innermost sum, getting a new term.
Insert the new term into the product.
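The three iterated steps can be sketched for binary variables as follows. This is a minimal illustration under stated assumptions: factors are stored as (variable names, table) pairs, and the chain x1 -> x2 -> x3 with its numbers is a hypothetical example, not data from the slides.

```python
from itertools import product


def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    variables = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for assign in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, assign))
        table[assign] = (ft[tuple(env[v] for v in fv)]
                         * gt[tuple(env[v] for v in gv)])
    return variables, table


def sum_out(f, var):
    """Sum a factor over one variable, producing a new, smaller factor."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for assign, value in ft.items():
        key = tuple(a for v, a in zip(fv, assign) if v != var)
        table[key] = table.get(key, 0.0) + value
    return keep, table


def eliminate(factors, order):
    factors = list(factors)
    for var in order:
        relevant = [f for f in factors if var in f[0]]
        if not relevant:
            continue
        # Factors that do not mention var stay outside the sum.
        rest = [f for f in factors if var not in f[0]]
        prod = relevant[0]
        for f in relevant[1:]:
            prod = multiply(prod, f)
        # The summed-out result is inserted back into the product.
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result


# Hypothetical chain x1 -> x2 -> x3.
p_x1 = (("x1",), {(0,): 0.6, (1,): 0.4})
p_x2 = (("x1", "x2"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
p_x3 = (("x2", "x3"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5})

variables, table = eliminate([p_x1, p_x2, p_x3], ["x1", "x2"])
print(variables, table)
```

The intermediate factors produced by `sum_out` are functions of the remaining variables and need not be probability distributions.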
27

We want to compute Need to eliminate: Initial factors

28

We want to compute Need to eliminate: Initial factors

Eliminate: Compute:

Note: In general, as we will see, the result of elimination is not necessarily a probability distribution.
29

We want to compute Need to eliminate: Initial factors

Eliminate: Compute:

Summing over the eliminated variable results in a factor with two arguments. In general, the result of elimination may be a function of several variables.

30

We want to compute Need to eliminate: Initial factors


Eliminate: Compute:

Note: for all values of


31

We want to compute Need to eliminate: Initial factors


Eliminate: Compute:

32

We want to compute Need to eliminate: Initial factors


Eliminate: Compute:

=
33

We want to compute Need to eliminate: a, Initial factors


Eliminate: Compute:

34

Comments on Variable Elimination


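The actual computation is done in the elimination step. The cost of the computation depends on the order of elimination, just as the cost of computing a product of several matrices depends on the order in which the matrices are multiplied.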
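The matrix analogy can be made concrete by counting scalar multiplications for the two ways of computing a product of three matrices. The dimensions below are an illustrative assumption.

```python
def mult_cost(p, q, r):
    """Scalar multiplications needed for a (p x q) times (q x r) product."""
    return p * q * r


# A is 10 x 100, B is 100 x 5, C is 5 x 50 (hypothetical sizes).
left_first = mult_cost(10, 100, 5) + mult_cost(10, 5, 50)      # (A B) C
right_first = mult_cost(100, 5, 50) + mult_cost(10, 100, 50)   # A (B C)

print("(AB)C:", left_first)
print("A(BC):", right_first)
```

Both orders give the same matrix, but the second needs ten times as many multiplications. In the same way, every elimination order yields the same likelihood, while the size of the intermediate factors can differ greatly.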

35

Summation order in GeneHunter

36

GeneHunter summation order defines a Hidden Markov Model (HMM)

37

Hidden Markov Model (HMM)

38

Part III: Software for Genetic Analysis


Fastlink v4.1 (Our students have contributed)
Vitesse v1, v2
GeneHunter
Scores of other packages
SuperLink (Temporary name, being developed here)

See http://linkage.rockefeller.edu/soft/list.html

39

Existing Programs for Genetic Linkage Analysis

40

The future: SUPERLINK


Stage 1: each pedigree is translated into a Bayesian network.
Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).
Stage 3: an elimination order for the variables is determined, according to some heuristic.
Stage 4: the likelihood of the pedigrees given the theta values is calculated. This is done by performing variable elimination according to the elimination order determined in stage 3.
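The slides do not say which heuristic is used in stage 3. As one common possibility, the sketch below uses a greedy minimum-degree rule: repeatedly eliminate the variable with the fewest neighbors in the interaction graph. The graph is built from the eight-variable network factorization given earlier (two variables are neighbors if they appear together in a local distribution); the function names are illustrative and this is not claimed to be Superlink's method.

```python
def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def min_degree_order(graph):
    """Greedy elimination order: always remove a node of minimum degree."""
    graph = {v: set(ns) for v, ns in graph.items()}  # work on a copy
    order = []
    while graph:
        # Ties are broken alphabetically to keep the result deterministic.
        v = min(graph, key=lambda u: (len(graph[u]), u))
        neighbors = graph.pop(v)
        for a in neighbors:
            graph[a].discard(v)
            # Eliminating v creates a factor over all of its neighbors,
            # so they become pairwise connected.
            graph[a] |= neighbors - {a}
        order.append(v)
    return order


# Variables that share a local distribution in
# P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b).
EDGES = [("v", "t"), ("s", "l"), ("s", "b"), ("t", "a"), ("l", "a"),
         ("t", "l"), ("a", "x"), ("a", "d"), ("b", "d"), ("a", "b")]

order = min_degree_order(build_graph(EDGES))
print(order)
```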

41

Special Features

42

Value Elimination & Allele Exclusion

A preprocessing step that reduces the range of feasible values for the variables of the Bayesian network given the data.

Results in major savings in the time and memory requirements of the likelihood calculations.
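A much simplified illustration of the idea: an untyped parent must share at least one allele with every typed child, so genotypes that violate this can be discarded before any likelihood computation. The sketch below applies only this single rule; it is an assumption-laden toy, not the preprocessing procedure actually used by the software, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement


def feasible_parent_genotypes(alleles, child_genotypes):
    """Unordered genotypes of a parent that are compatible with every child.

    A genotype is kept only if it shares at least one allele with each
    child's genotype.
    """
    candidates = combinations_with_replacement(sorted(alleles), 2)
    return [g for g in candidates
            if all(set(g) & set(child) for child in child_genotypes)]


alleles = ["B1", "B2", "B3"]
children = [("B1", "B1"), ("B1", "B3")]
print(feasible_parent_genotypes(alleles, children))
```

With three alleles there are six unordered genotypes; the child with genotype B1/B1 forces the parent to carry B1, which leaves three candidates.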
43

Time-Space Tradeoff
For many data sets, the use of variable elimination alone isn't enough, due to the large memory overhead.

Superlink combines variable elimination with conditioning to achieve the best time-space tradeoff given the available memory. Conditioning is performed only after some steps of variable elimination, when the memory requirements are about to exceed the limitations.
44

Probability Table Representation

All probability tables have a flexible size, which depends on the viable variable combinations. Each table is represented by a one-dimensional array of double-precision numbers. In addition, the number of possible values for each variable is stored. A special indexing method allows quick access to the entries of the table.
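The slides do not describe the indexing method. One plausible scheme, shown here purely as an assumption, is row-major (mixed-radix) indexing: each variable gets a stride equal to the product of the value counts of the variables after it. The class name `FlatTable` is illustrative.

```python
from array import array


class FlatTable:
    """Multi-variable table stored in a one-dimensional array of doubles."""

    def __init__(self, cardinalities):
        self.cards = list(cardinalities)   # number of values per variable
        self.strides = []
        size = 1
        for card in reversed(self.cards):
            self.strides.insert(0, size)
            size *= card
        self.data = array("d", [0.0] * size)

    def index(self, assignment):
        """Flat position of an assignment (one value index per variable)."""
        return sum(a * s for a, s in zip(assignment, self.strides))

    def get(self, assignment):
        return self.data[self.index(assignment)]

    def set(self, assignment, value):
        self.data[self.index(assignment)] = value


table = FlatTable([2, 3, 2])   # three variables with 2, 3, and 2 values
table.set((1, 2, 1), 0.25)
print(len(table.data), table.strides, table.index((1, 2, 1)))
print(table.get((1, 2, 1)))
```

Storing only the cardinalities and strides keeps lookups to a few multiplications and additions, with no nested containers.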
45

Experiment A
Same topology (57 people, no loops).
Increasing number of loci (each one with 4-5 alleles).
Run time is in seconds.

Files  No. of Loci  Superlink  Fastlink  Vitesse  Genehunter
A0     2            0.03       0.12      0.27     -
A1     5            0.1        3.77      0.31     -
A2     6            0.14       79.32     0.39     -
A3     7            0.42       -         0.69     -
A4     8            0.36       -         2.81     -
A5     10           1.19       -         84.66    -
A6     12           4.65       -         -        -
A7     14           3.01       -         -        -
A8     18           20.98      -         -        -
A9     37           8510.15    -         -        -
A10    38           10446.27   -         -        -
A11    40           -          -         -        -

A dash marks a run with no reported time. The slide's legend for these cases: over 100 hours; out-of-memory; pedigree size too big for Genehunter.

46

Experiment B
Same topology (100 people, with loops).
Increasing number of loci (each one with 5-10 alleles).
Run time is in seconds.

Files  No. of Loci  Superlink  Fastlink  Vitesse  Genehunter
B0     5            2.56       3933.7    -        -
B1     6            2.63       -         -        -
B2     10           82.56      -         -        -
B3     12           437.55     -         -        -
B4     13           17.29      -         -        -
B5     14           278.8      -         -        -
B6     15           935.86     -         -        -
B7     16           902.8      -         -        -
B8     17           288.2      -         -        -
B9     18           113.96     -         -        -
B10    19           2901.25    -         -        -
B11    20           143640.2   -         -        -

A dash marks a run with no reported time. The slide's legend for these cases: out-of-memory; Vitesse doesn't handle looped pedigrees; pedigree size too big for Genehunter.

47

Experiment C
Same topology (5 people, no loops).
Increasing number of loci (each one with 3-6 alleles).
Run time is in seconds.

Files  No. of Loci  Superlink        Fastlink  Vitesse  Genehunter
D0     100          0.16 (2 l.e.)    -         -        0.41 (99 l.e.)
D1     110          0.2 (2 l.e.)     -         -        0.45 (109 l.e.)
D2     120          0.21 (2 l.e.)    -         -        0.48 (119 l.e.)
D3     130          0.22 (2 l.e.)    -         -        0.49 (129 l.e.)
D4     140          0.24 (2 l.e.)    -         -        0.51 (139 l.e.)
D5     150          0.25 (2 l.e.)    -         -        0.53 (149 l.e.)
D6     160          0.27 (2 l.e.)    -         -        0.54 (159 l.e.)
D7     170          0.3 (2 l.e.)     -         -        0.6 (169 l.e.)
D8     180          0.3 (2 l.e.)     -         -        0.59 (179 l.e.)
D9     190          0.32 (2 l.e.)    -         -        0.61 (189 l.e.)
D10    200          0.34 (2 l.e.)    -         -        0.66 (199 l.e.)
D11    210          0.37 (2 l.e.)    -         -        0.67 (209 l.e.)

A dash marks a run with no reported time. The slide's legend for these cases: bus error; out-of-memory.

48

