
Lecture 5:

Algorithm design and time/space 
complexity analysis
Torgeir R. Hvidsten

Professor
Norwegian University of Life Sciences

Guest lecturer
Umeå Plant Science Centre
Computational Life Science Cluster (CLiC)

1
This lecture
• Basic algorithm design: exhaustive search, greedy
algorithms, dynamic programming and randomized
algorithms
• Correct versus incorrect algorithms
• Time/space complexity analysis
• Go through Lab 3

2
Algorithm
• Algorithm: a sequence of instructions that one must
perform in order to solve a well-formulated problem
• Correct algorithm: translates every input instance into the correct output
• Incorrect algorithm: there is at least one input instance for which the algorithm does not produce the correct output
• Many successful algorithms in bioinformatics are not
“correct” (optimal)

3
Search space

4
Sequence alignment as a search problem

[Figure: alignment grid (edit graph) for v = TGCATAC (rows 0-7) and w = ATCTGATC (columns 0-8); the edges of the grid correspond to matches/mismatches (diagonal), insertions and deletions. A path through the grid corresponds to an alignment, for example:]

-TGCAT-A-C
AT-C-TGATC

5
Algorithm design (I)
• Exhaustive algorithms (brute force): examine every possible alternative to find the solution
• Branch-and-bound algorithms: avoid searching through a large number of alternatives by pruning branches of the search space that cannot contain an optimal solution
• Greedy algorithms: find the solution by always choosing the currently "best" alternative
• Dynamic programming: use the solutions to subproblems of the original problem to construct the solution

6
Algorithm design (II)
• Divide-and-conquer algorithms: split the problem into subproblems and solve them independently
• Randomized algorithms: find the solution based on randomized choices

• Machine learning: induce models from previously labeled observations (examples)

7
Algorithm complexity
• The Big-O notation:
– the running time of an algorithm as a function of the size of
its input
– worst case estimate
– asymptotic behavior
• O(n²) means that the running time of the algorithm on an input of size n is bounded above by a quadratic function of n

8
Big‐O Notation
• A function f(x) is O(g(x)) if there are positive real
constants c and x0 such that f(x) ≤ cg(x) for all values of
x ≥ x0.
Time complexity
• Genome assembly: piece together a genome from short reads (~200 bp)
– Aspen: 300M reads
– Spruce: 3000M reads

• Pair-wise all-against-all alignment for Aspen takes 3 weeks on 16 processors

• What about spruce?
– Biologist (assuming linear scaling): Spruce: 30 weeks
– Bioinformatician (time complexity O(n²)): Spruce: 300 weeks

[Figure: running time in weeks versus millions of reads, showing the quadratic growth from 3 weeks at 300M reads to 300 weeks at 3000M reads.]
11
Sorting algorithm
Sorting problem: Sort a list of n integers a = (a1, a2,
…, an)

SelectionSort(a, n)
1 for i ← 1 to n – 1
2   j ← index of the smallest element among ai, ai+1, …, an
3   Swap elements ai and aj
4 return a

12
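Below is a direct Python translation of the pseudocode (a minimal sketch, not part of the original slides; note that Python lists are 0-indexed, whereas the pseudocode is 1-indexed):

def selection_sort(a):
    # Sort the list a in place by repeatedly selecting the smallest remaining element
    n = len(a)
    for i in range(n - 1):
        # index of the smallest element among a[i], a[i+1], ..., a[n-1]
        j = min(range(i, n), key=lambda k: a[k])
        # swap elements a[i] and a[j]
        a[i], a[j] = a[j], a[i]
    return a

print(selection_sort([7, 92, 87, 1, 4, 3, 2, 6]))   # [1, 2, 3, 4, 6, 7, 87, 92]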
Example run (the list is shown as it looks at the start of iteration i)
i = 1: (7,92,87,1,4,3,2,6)
i = 2: (1,92,87,7,4,3,2,6)
i = 3: (1,2,87,7,4,3,92,6)
i = 4: (1,2,3,7,4,87,92,6)
i = 5: (1,2,3,4,7,87,92,6)
i = 6: (1,2,3,4,6,87,92,7)
i = 7: (1,2,3,4,6,7,92,87)
(1,2,3,4,6,7,87,92)

13
Complexity of SelectionSort
• Makes n – 1 iterations in the for loop
• Analyzes n – i +1 elements ai, ai+1, …, an in iteration i
• Approximate number of operations:
– n + (n – 1) + (n – 2) + … + 2 + 1 = n(n + 1)/2
– plus the swapping (3 operations per iteration): n(n + 1)/2 + 3n = 1/2 n² + 7/2 n

• Thus the algorithm is O(n²)

14
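To connect this to the Big-O definition above, the bound can be verified explicitly (a worked inequality, not from the slides):

\[
f(n) = \tfrac{1}{2}n^{2} + \tfrac{7}{2}n \;\le\; \tfrac{1}{2}n^{2} + \tfrac{7}{2}n^{2} \;=\; 4n^{2} \quad\text{for all } n \ge 1,
\]

so the definition is satisfied with c = 4 and x0 = 1, and SelectionSort is indeed O(n²).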
Tractable versus intractable problems
• Some problems can be solved in polynomial time
– e.g. sorting a list of integers
– these are called tractable problems
• Some problems require exponential time
– e.g. listing every subset of a list
– these are called intractable problems
• Some problems lie in between
– e.g. the traveling salesman problem
– these are called NP-complete problems
– no one has proved whether a polynomial-time algorithm exists for these problems

15
Traveling salesman problem

16
Exhaustive search:
Finding regulatory motifs in 
DNA sequences

17
Random sample

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

18
Implanting motif AAAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

19
Where is the implanted motif? 
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

20
Implanting motif AAAAAAAAGGGGGGG
with four random mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

21
Where is the motif? 

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

22
Why finding the motif is difficult

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
23
Parameters

l = 8 (motif length)
t = 5 (number of sequences)
n = 69 (sequence length)

DNA:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

s = (s1 = 26, s2 = 21, s3 = 3, s4 = 56, s5 = 60)
24
Motifs: Profiles and consensus

• Line up the patterns by their start indexes s = (s1, s2, …, st)

Alignment
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________

• Construct the profile matrix with the frequency of each nucleotide in each column

Profile
A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1
T 0 0 0 5 1 0 1 4
_________________

• The consensus nucleotide in each position is the one with the highest count in its column

Consensus A C G T A C G T
25
Scoring motifs: consensus score

a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________

A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1
T 0 0 0 5 1 0 1 4
_________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4 = 30

26
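A small Python sketch of this computation (not from the slides; it assumes the five aligned l-mers have already been extracted from the sequences):

from collections import Counter

def consensus_score(lmers):
    # Sum, over all columns, the count of the most frequent nucleotide in that column
    score = 0
    for column in zip(*lmers):
        counts = Counter(base.lower() for base in column)
        score += max(counts.values())
    return score

lmers = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
print(consensus_score(lmers))   # 30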
BruteForceMotifSearch

BruteForceMotifSearch(DNA, t, n, l)
1 bestScore ← 0
2 for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n – l + 1, …, n – l + 1)
3   if Score(s, DNA) > bestScore
4     bestScore ← Score(s, DNA)
5     bestMotif ← (s1, s2, …, st)
6 return bestMotif

27
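A brute-force sketch in Python (illustrative, not from the slides; starting positions are 0-based and score is the consensus score of the l-mers starting at those positions):

from itertools import product
from collections import Counter

def score(s, dna, l):
    # Consensus score of the l-mers starting at positions s (one position per sequence)
    lmers = [seq[p:p + l] for seq, p in zip(dna, s)]
    return sum(max(Counter(col).values()) for col in zip(*lmers))

def brute_force_motif_search(dna, l):
    # Try every combination of starting positions and keep the best-scoring one
    n = len(dna[0])
    best_score, best_motif = 0, None
    for s in product(range(n - l + 1), repeat=len(dna)):
        current = score(s, dna, l)
        if current > best_score:
            best_score, best_motif = current, s
    return best_motif, best_score

Even for the small example with t = 5, n = 69 and l = 8 this already loops over 62^5 ≈ 9·10^8 position combinations, which is why the running-time analysis on the next slide matters.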
Running Time of BruteForceMotifSearch
• Varying (n – l + 1) positions in each of t sequences, we're looking at (n – l + 1)^t sets of starting positions

• For each set of starting positions, the scoring function makes l operations, so the complexity is
l(n – l + 1)^t = O(l·n^t)

• That means that for t = 8, n = 1000 and l = 10 we must perform on the order of 10^24 computations – it will take billions of years!

28
Greedy search:
Finding regulatory motifs in 
DNA sequences

29
Approximation algorithms

• These algorithms find approximate solutions rather than optimal solutions
• The approximation ratio of an algorithm A on input π is:
A(π) / OPT(π)
where
A(π) – the solution produced by algorithm A
OPT(π) – the optimal solution of the problem

30
Performance guarantee
• The performance guarantee of algorithm A is the maximal approximation ratio over all inputs of size n
• For an algorithm A that minimizes the objective function (a minimization algorithm):
max_{|π| = n} A(π) / OPT(π)
• For maximization algorithms:
min_{|π| = n} A(π) / OPT(π)

31
Parameters

l = 8 (motif length)
t = 5 (number of sequences)
n = 69 (sequence length)

DNA:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

s = (s1 = 26, s2 = 21, s3 = 3, s4 = 56, s5 = 60)
32
Scoring motifs: consensus score

a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________

A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1
T 0 0 0 5 1 0 1 4
_________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4 = 30

33
Greedy motif finding
• Partial score: Score(s, i, DNA)
– the consensus score computed over the first i sequences only
• Algorithm:
– find the optimal motif for the first two sequences
– scan each of the remaining sequences only once, choosing the starting position that contributes most to the partial score

34
Greedy motif finding
GreedyMotifSearch(DNA, t, n, l)
1  s ← (1, 1, …, 1)
2  bestMotif ← s
3  for s1 ← 1 to n – l + 1
4    for s2 ← 1 to n – l + 1
5      if Score(s, 2, DNA) > Score(bestMotif, 2, DNA)
6        bestMotif1 ← s1
7        bestMotif2 ← s2
8  s1 ← bestMotif1
9  s2 ← bestMotif2
10 for i ← 3 to t
11   for si ← 1 to n – l + 1
12     if Score(s, i, DNA) > Score(bestMotif, i, DNA)
13       bestMotifi ← si
14   si ← bestMotifi
15 return bestMotif

35
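A rough Python sketch of the same greedy strategy (not from the slides; positions are 0-based and partial_score plays the role of Score(s, i, DNA)):

from collections import Counter

def partial_score(s, i, dna, l):
    # Consensus score of the l-mers from the first i sequences only
    lmers = [seq[p:p + l] for seq, p in zip(dna[:i], s[:i])]
    return sum(max(Counter(col).values()) for col in zip(*lmers))

def greedy_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    best = [0] * t
    # exhaustively choose the best pair of starting positions in the first two sequences
    for s1 in range(n - l + 1):
        for s2 in range(n - l + 1):
            if partial_score([s1, s2], 2, dna, l) > partial_score(best, 2, dna, l):
                best[0], best[1] = s1, s2
    # then scan each remaining sequence once, keeping the earlier choices fixed
    for i in range(2, t):
        for si in range(n - l + 1):
            if partial_score(best[:i] + [si], i + 1, dna, l) > partial_score(best, i + 1, dna, l):
                best[i] = si
    return best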
Running time
• Optimal motif for the first two sequences
– l(n – l + 1)² operations
• The remaining t – 2 sequences
– (t – 2)·l·(n – l + 1) operations
• Running time
– O(l·n² + t·l·n), or O(l·n²) if n >> t
• Vastly better than
– BruteForceMotifSearch: O(l·n^t)

36
Dynamic programming:
Sequence alignment
Lecture 6

37
Randomized algorithms: 
Finding regulatory motifs in 
DNA sequences 

38
Randomized algorithms
• Randomized algorithms make random rather than
deterministic decisions
• The main advantage is that no input can reliably
produce worst-case results because the algorithm runs
differently each time
• These algorithms are commonly used in situations where no correct polynomial-time algorithm is known

39
Two types of randomized algorithms

• Las Vegas algorithms – always produce the correct solution

• Monte Carlo algorithms – do not always return the correct solution

• Las Vegas algorithms are always preferred, but they are often hard to come by

40
Profiles
• Let s=(s1,...,st) be the set of starting positions for l-mers
in our t sequences
• The substrings corresponding to these starting positions form:
– a t × l alignment, and
– a 4 × l profile P

41
Scoring strings with a profile
Given a profile: P =
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8

The probability of the consensus string:
Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646
Probability of a different string:
Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = .000229

42
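In Python this probability is a simple product over the positions of the l-mer (a sketch, not from the slides; the profile is stored as per-nucleotide lists of column probabilities):

def profile_probability(lmer, profile):
    # Probability that profile P generates the given l-mer
    p = 1.0
    for i, nucleotide in enumerate(lmer):
        p *= profile[nucleotide][i]
    return p

profile = {
    'a': [1/2, 7/8, 3/8, 0, 1/8, 0],
    'c': [1/8, 0, 1/2, 5/8, 3/8, 0],
    't': [1/8, 1/8, 0, 0, 1/4, 7/8],
    'g': [1/4, 0, 1/8, 3/8, 1/4, 1/8],
}
print(profile_probability("aaacct", profile))   # ~0.0336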
P‐most probable l‐mer
Define the P-most probable l-mer from a sequence as an
l-mer in that sequence which has the highest probability
of being created from the profile P

P =
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
Given a sequence = ctataaaccttacatc, find the P‐
most probable l‐mer
43
P‐most probable l‐mer
The P‐most probable 6‐mer in the sequence is aaacct (starting position 5):

Starting position   Calculations                          Prob(a|P)
1                   1/8 x 1/8 x 3/8 x 0 x 1/8 x 0         0
2                   1/2 x 7/8 x 0 x 0 x 1/8 x 0           0
3                   1/2 x 1/8 x 3/8 x 0 x 1/8 x 0         0
4                   1/8 x 7/8 x 3/8 x 0 x 3/8 x 0         0
5                   1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8     .0336
6                   1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8     .0299
7                   1/2 x 0 x 1/2 x 0 x 1/4 x 0           0
8                   1/8 x 0 x 0 x 0 x 1/8 x 0             0
9                   1/8 x 1/8 x 0 x 0 x 3/8 x 0           0
10                  1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8     .0004

44
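Finding the P-most probable l-mer is then a single scan over all starting positions (a sketch, not from the slides; ties are resolved in favour of the earliest position):

def most_probable_lmer(sequence, profile, l):
    # Return the l-mer in the sequence with the highest probability under the profile
    best_p, best_lmer = -1.0, None
    for start in range(len(sequence) - l + 1):
        lmer = sequence[start:start + l]
        p = 1.0
        for i, nucleotide in enumerate(lmer):
            p *= profile[nucleotide][i]
        if p > best_p:
            best_p, best_lmer = p, lmer
    return best_lmer

# With the profile from the previous slides, most_probable_lmer("ctataaaccttacat", profile, 6)
# returns "aaacct".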
Gibbs sampling
1) Randomly choose starting positions
s = (s1,...,st) and form the set of l-mers associated
with these starting positions
2) Randomly choose one of the t sequences
3) Create a profile P from the other t -1 sequences
4) For each position in the removed sequence, calculate
the probability that the l-mer starting at that position
was generated by P
5) Choose a new starting position for the removed
sequence at random based on the probabilities
calculated in step 4
6) Repeat steps 2-5 until there is no improvement
45
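The steps above translate into a short Python sketch (not from the slides; this simplified version uses raw counts for the profile, no pseudocounts, and runs a fixed number of iterations instead of testing for convergence):

import random
from collections import Counter

def gibbs_sampler(dna, l, iterations=1000):
    t, n = len(dna), len(dna[0])
    # 1) randomly choose starting positions
    s = [random.randrange(n - l + 1) for _ in range(t)]
    for _ in range(iterations):
        # 2) randomly choose one of the t sequences
        i = random.randrange(t)
        # 3) build a profile P from the other t - 1 sequences
        lmers = [dna[j][s[j]:s[j] + l] for j in range(t) if j != i]
        profile = [Counter(column) for column in zip(*lmers)]
        # 4) probability of the l-mer at each position of the removed sequence under P
        probs = []
        for start in range(n - l + 1):
            p = 1.0
            for pos, nucleotide in enumerate(dna[i][start:start + l]):
                p *= profile[pos][nucleotide] / (t - 1)
            probs.append(p)
        # 5) choose a new starting position at random, weighted by these probabilities
        if sum(probs) > 0:
            s[i] = random.choices(range(n - l + 1), weights=probs)[0]
    return s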
Gibbs sampling: an example
Input:
t = 5 sequences, motif length l = 8

1. GTAAACAATATTTATAGC
2. AAAATTTACCTTAGAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA

46
Gibbs sampling: an example
1) Randomly choose starting positions,
s=(s1,s2,s3,s4,s5) in the 5 sequences:

s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA

47
Gibbs sampling: an example

2) Choose one of the sequences at random:


Sequence 2: AAAATTTACCTTAGAAGG

s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA

48
Gibbs sampling: an example
3) Create profile P from l-mers in the remaining 4 sequences:

1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 3/4 0
Consensus string: T A A A T C G A

49
Gibbs Sampling: an Example
4) Calculate prob(a|P) for every possible 8-mer in the removed sequence 2 (AAAATTTACCTTAGAAGG):

Starting position   8-mer      prob(a|P)
1                   AAAATTTA   .000732
2                   AAATTTAC   .000122
3                   AATTTACC   0
4                   ATTTACCT   0
5                   TTTACCTT   0
6                   TTACCTTA   0
7                   TACCTTAG   0
8                   ACCTTAGA   .000183
9                   CCTTAGAA   0
10                  CTTAGAAG   0
11                  TTAGAAGG   0

50
Gibbs Sampling: an Example
5) Create a distribution of probabilities of l‐mers
prob(a|P), and randomly select a new starting 
position based on this distribution
To create a proper distribution, divide each probability prob(a|P) by the sum of the probabilities over all positions:
Probability (Selecting Starting Position 1) = 0.706
Probability (Selecting Starting Position 2) = 0.118
...
Probability (Selecting Starting Position 8) = 0.176

51
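For the three non-zero probabilities from step 4 (starting positions 1, 2 and 8), the normalization works out as:

\[
\frac{0.000732}{0.000732 + 0.000122 + 0.000183} \approx 0.706,
\qquad
\frac{0.000122}{0.001037} \approx 0.118,
\qquad
\frac{0.000183}{0.001037} \approx 0.176
\]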
Gibbs sampling: an example
Assume we select the substring with the highest
probability – then we are left with the following new
substrings and starting positions

s1=7 GTAAACAATATTTATAGC
s2=1 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=5 TGAGTAATCGACGTCCCA
s5=1 TACTTCACACCCTGTCAA

52
Gibbs sampling: an example
6) We iterate the procedure again with the above starting
positions until we cannot improve the score any more

53
Gibbs sampler in practice
• Gibbs sampling needs to be modified when applied to
samples with unequal distributions of nucleotides
(relative entropy approach)
• Gibbs sampling often converges to locally optimal
motifs rather than globally optimal motifs
• Needs to be run with many randomly chosen seeds to
achieve good results

54
