Lecture 5: Algorithm Design and Time/space Complexity Analysis
Algorithm design and time/space
complexity analysis
Torgeir R. Hvidsten
Professor
Norwegian University of Life Sciences
Guest lecturer
Umeå Plant Science Centre
Computational Life Science Cluster (CLiC)
1
This lecture
• Basic algorithm design: exhaustive search, greedy
algorithms, dynamic programming and randomized
algorithms
• Correct versus incorrect algorithms
• Time/space complexity analysis
• Go through Lab 3
2
Algorithm
• Algorithm: a sequence of instructions that one must
perform in order to solve a well-formulated problem
• Correct algorithm: translates every input instance into
the correct output
• Incorrect algorithm: there is at least one input instance
for which the algorithm does not produce the correct
output
• Many successful algorithms in bioinformatics are not
“correct” (optimal)
3
Search space
4
Sequence alignment as a search problem
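[Figure: alignment grid with w = ATCTGATC along the columns (j = 0, …, 8) and v = TGCATAC along the rows (i = 0, …, 7). One path through the grid is marked with its deletions, matches and insertions; it corresponds to the alignment below.]
v: -TGCAT-A-C
w: AT-C-TGATC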
5
Algorithm design (I)
• Exhaustive algorithms (brute force): examine every
possible alternative to find the solution
• Branch-and-bound algorithms: omit searching through
a large number of alternatives by branch-and-bound or
pruning
• Greedy algorithms: find the solution by always
choosing the currently "best" alternative
• Dynamic programming: uses the solutions of the
subproblems of the original problem to construct the
solution
6
Algorithm design (II)
• Divide-and-conquer algorithms: split the problem into
subproblems and solve the subproblems independently
• Randomized algorithms: find the solution based on
randomized choices
7
Algorithm complexity
• The Big-O notation:
– the running time of an algorithm as a function of the size of
its input
– worst case estimate
– asymptotic behavior
• O(n^2) means that the running time of the algorithm on
an input of size n is bounded above by a quadratic
function of n
8
Big‐O Notation
• A function f(x) is O(g(x)) if there are positive real
constants c and x0 such that f(x) ≤ cg(x) for all values of
x ≥ x0.
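As a concrete instance of the definition, take f(x) = 3x^2 + 10x and g(x) = x^2. These functions and the constants c = 4 and x0 = 10 are illustrative choices, not from the lecture; they work because 10x ≤ x^2 whenever x ≥ 10. The sketch below checks the inequality over a finite range, which illustrates the definition but is not a proof.

```python
def f(x):
    return 3 * x**2 + 10 * x

def g(x):
    return x**2

def witnesses_big_o(f, g, c, x0, x_max=10_000):
    """Check f(x) <= c * g(x) for every integer x in [x0, x_max]."""
    return all(f(x) <= c * g(x) for x in range(x0, x_max + 1))

print(witnesses_big_o(f, g, c=4, x0=10))  # True: the bound holds from x0 = 10
print(witnesses_big_o(f, g, c=4, x0=1))   # False: it fails for small x
```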
Time complexity
• Genome assembly: piece together a genome from short reads (~200 bp)
– Aspen: 300M reads
– Spruce: 3000M reads
[Figure: assembly time (weeks) versus number of reads (million reads) for an algorithm with time complexity O(n^2). Biologist's estimate for spruce: 30 weeks. Bioinformatician's estimate for spruce: 300 weeks.]
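The gap between the two spruce estimates follows from how running time scales with input size. The sketch below assumes the aspen assembly takes 3 weeks; that value is not stated in the lecture and is only inferred from the 30-week linear estimate. The function name is hypothetical.

```python
def extrapolate_time(base_time, base_size, new_size, exponent):
    """Scale a measured running time, assuming time grows as size**exponent."""
    return base_time * (new_size / base_size) ** exponent

ASPEN_READS = 300    # million reads
SPRUCE_READS = 3000  # million reads
ASPEN_WEEKS = 3      # assumed, not stated in the lecture

linear = extrapolate_time(ASPEN_WEEKS, ASPEN_READS, SPRUCE_READS, exponent=1)
quadratic = extrapolate_time(ASPEN_WEEKS, ASPEN_READS, SPRUCE_READS, exponent=2)
print(linear, quadratic)  # 30.0 300.0
```

Ten times as many reads means ten times the work for a linear algorithm, but a hundred times the work for a quadratic one.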
11
Sorting algorithm
Sorting problem: Sort a list of n integers a = (a_1, a_2, …, a_n)
SelectionSort(a, n)
  for i ← 1 to n-1
    j ← index of the smallest element among a_i, a_(i+1), …, a_n
    swap elements a_i and a_j
  return a
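A minimal Python sketch of the pseudocode above. It uses 0-based indexing where the pseudocode is 1-based, and it returns a sorted copy rather than sorting in place; both are choices made for this example.

```python
def selection_sort(a):
    """Return a sorted copy of a, using selection sort."""
    a = list(a)
    n = len(a)
    for i in range(n - 1):
        # Index of the smallest element among a[i], ..., a[n-1].
        j = min(range(i, n), key=a.__getitem__)
        a[i], a[j] = a[j], a[i]
    return a

print(selection_sort([7, 92, 87, 1, 4, 3, 2, 6]))  # [1, 2, 3, 4, 6, 7, 87, 92]
```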
12
Example run
i = 1: (7,92,87,1,4,3,2,6)
i = 2: (1,92,87,7,4,3,2,6)
i = 3: (1,2,87,7,4,3,92,6)
i = 4: (1,2,3,7,4,87,92,6)
i = 5: (1,2,3,4,7,87,92,6)
i = 6: (1,2,3,4,6,87,92,7)
i = 7: (1,2,3,4,6,7,92,87)
(1,2,3,4,6,7,87,92)
13
Complexity of SelectionSort
• Makes n – 1 iterations in the for loop
• Analyzes n – i + 1 elements a_i, a_(i+1), …, a_n in iteration i
• Approximate number of operations:
– n + (n-1) + (n-2) + … + 2 + 1 = n(n+1)/2
– plus the swapping: n(n+1)/2 + 3n = 1/2 n^2 + 7/2 n, which is O(n^2)
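The count of analyzed elements can be checked directly. The sketch below sums n – i + 1 over the n – 1 iterations; the exact total is one less than n(n+1)/2 because the loop stops at i = n – 1, a difference that does not affect the quadratic growth. The function name is hypothetical.

```python
def count_examined(n):
    """Count the elements examined by SelectionSort on a list of length n."""
    examined = 0
    for i in range(1, n):      # iterations i = 1, ..., n-1
        examined += n - i + 1  # elements a_i, ..., a_n
    return examined

for n in (8, 100, 1000):
    print(n, count_examined(n), n * (n + 1) // 2)
```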
14
Tractable versus intractable problems
• Some problems can be solved in polynomial time
– e.g. sorting a list of integers
– called tractable problems
• Some problems require exponential time
– e.g. listing every subset of a list
– called intractable problems
• Some problems lie in between
– e.g. the traveling salesman problem
– called NP-complete problems
– nobody has proved whether a polynomial time algorithm
exists for these problems
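Listing every subset takes exponential time simply because a list of n items has 2^n subsets to write down. A small sketch using the standard library (the helper name is hypothetical):

```python
from itertools import combinations

def all_subsets(items):
    """Return every subset of items as a list of tuples."""
    items = list(items)
    return [subset
            for size in range(len(items) + 1)
            for subset in combinations(items, size)]

subsets = all_subsets(["a", "b", "c"])
print(len(subsets))                 # 8 = 2**3
print(len(all_subsets(range(10))))  # 1024 = 2**10
```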
15
Traveling salesman problem
16
Exhaustive search:
Finding regulatory motifs in
DNA sequences
17
Random sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
18
Implanting motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
19
Where is the implanted motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
20
Implanting motif AAAAAAAAGGGGGGG
with four random mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
21
Where is the motif?
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
22
Why finding the motif is difficult
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
23
Parameters
• l = 8: length of the motif
• t = 5: number of sequences
• n = 69: length of each sequence
• s = (s1, s2, s3, s4, s5) = (26, 21, 3, 56, 60): starting positions of the motif instances
DNA:
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
24
Motifs: Profiles and consensus
• Line up the patterns by their start indexes s = (s1, s2, …, st)
• Construct the profile matrix with the frequencies of each nucleotide in each column
• The consensus nucleotide in each position has the highest score in its column
Alignment
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________
Profile
A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1
T 0 0 0 5 1 0 1 4
_________________
Consensus A C G T A C G T
25
Scoring motifs: consensus score
Alignment (t = 5 rows, l = 8 columns):
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________
A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1
T 0 0 0 5 1 0 1 4
_________________
Consensus a c g t a c g t
Score 3+4+4+5+3+4+3+4=30
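The profile, consensus and score for this alignment can be reproduced with a few lines of Python. This is a sketch with hypothetical function names; it ignores letter case, since upper and lower case in the alignment denote the same nucleotides.

```python
from collections import Counter

def profile(alignment):
    """Count each nucleotide in every column of the alignment."""
    columns = zip(*(row.upper() for row in alignment))
    return [Counter(col) for col in columns]

def consensus(alignment):
    """Most frequent nucleotide in each column."""
    return "".join(c.most_common(1)[0][0] for c in profile(alignment))

def consensus_score(alignment):
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(c.values()) for c in profile(alignment))

alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
print(consensus(alignment), consensus_score(alignment))  # ACGTACGT 30
```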
26
BruteForceMotifSearch
BruteForceMotifSearch(DNA, t, n, l)
  bestScore ← 0
  for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n-l+1, …, n-l+1)
    if Score(s, DNA) > bestScore
      bestScore ← Score(s, DNA)
      bestMotif ← (s1, s2, …, st)
  return bestMotif
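A runnable sketch of the exhaustive search, assuming all sequences have the same length and using 0-based starting positions. The toy input is hypothetical and kept tiny on purpose, because the number of candidates grows exponentially with the number of sequences.

```python
from collections import Counter
from itertools import product

def score(s, dna, l):
    """Consensus score of the l-mers starting at (0-based) positions s."""
    lmers = [seq[start:start + l].upper() for seq, start in zip(dna, s)]
    return sum(max(Counter(col).values()) for col in zip(*lmers))

def brute_force_motif_search(dna, l):
    """Try every combination of starting positions; return the best one."""
    n = len(dna[0])  # assumes all sequences have the same length n
    best_score, best_motif = 0, None
    for s in product(range(n - l + 1), repeat=len(dna)):
        current = score(s, dna, l)
        if current > best_score:
            best_score, best_motif = current, s
    return best_motif, best_score

toy_dna = ["ttACGTtt", "ACGTtttt", "ttttACGT"]  # hypothetical toy input
print(brute_force_motif_search(toy_dna, l=4))  # ((2, 0, 4), 12)
```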
27
Running Time of BruteForceMotifSearch
• Varying (n – l + 1) positions in each of t sequences, we’re
looking at (n – l + 1)^t sets of starting positions
• With about l operations to score each set, the running time
is l(n – l + 1)^t = O(l n^t)
28
Greedy search:
Finding regulatory motifs in
DNA sequences
29
Approximation algorithms
30
Performance guarantee
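• The approximation ratio of algorithm A on input π is
A(π) / OPT(π), where OPT(π) is the value of the optimal solution
• Performance guarantee of algorithm A is the worst-case
approximation ratio over all inputs of size n
• For algorithm A that minimizes the objective function
(minimization algorithm):
– max_{|π| = n} A(π) / OPT(π)
• For maximization algorithms
– min_{|π| = n} A(π) / OPT(π)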
31
33
Greedy motif finding
• Partial score: Score(s, i, DNA)
– The consensus score for the first i sequences
• Algorithm:
– Find the optimal motif for the first two sequences
– Scan the remaining sequences only once, and in each choose
the l-mer with the best contribution to the partial score
34
Greedy motif finding
GreedyMotifSearch(DNA, t, n, l)
  s ← (1, 1, …, 1)
  bestMotif ← s
  for s_1 ← 1 to n – l + 1
    for s_2 ← 1 to n – l + 1
      if Score(s, 2, DNA) > Score(bestMotif, 2, DNA)
        bestMotif_1 ← s_1
        bestMotif_2 ← s_2
  s_1 ← bestMotif_1
  s_2 ← bestMotif_2
  for i ← 3 to t
    for s_i ← 1 to n – l + 1
      if Score(s, i, DNA) > Score(bestMotif, i, DNA)
        bestMotif_i ← s_i
    s_i ← bestMotif_i
  return bestMotif
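The greedy strategy can be sketched in Python as follows. The sketch assumes equal-length sequences and 0-based positions, and it breaks ties by keeping the first best candidate, which matches the strict ">" comparison in the pseudocode. The toy input is hypothetical.

```python
from collections import Counter
from itertools import product

def partial_score(lmers):
    """Consensus score of the l-mers chosen so far."""
    return sum(max(Counter(col).values()) for col in zip(*lmers))

def greedy_motif_search(dna, l):
    """Best pair in the first two sequences, then one pass over the rest."""
    dna = [seq.upper() for seq in dna]
    positions = range(len(dna[0]) - l + 1)  # assumes equal-length sequences

    def lmer(i, start):
        return dna[i][start:start + l]

    # Exhaustive search over the first two sequences only.
    best_pair = max(product(positions, repeat=2),
                    key=lambda s: partial_score([lmer(0, s[0]), lmer(1, s[1])]))
    motif = list(best_pair)

    # Keep earlier choices fixed; pick the best start in each remaining sequence.
    for i in range(2, len(dna)):
        chosen = [lmer(k, motif[k]) for k in range(i)]
        motif.append(max(positions,
                         key=lambda start: partial_score(chosen + [lmer(i, start)])))
    return motif

toy_dna = ["ttACGTtt", "ACGTtttt", "ttttACGT"]  # hypothetical toy input
print(greedy_motif_search(toy_dna, l=4))  # [2, 0, 4]
```

On this toy input the greedy choice happens to find the optimum. In general it may not, because the pair chosen in the first two sequences is never revisited.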
35
Running time
• Optimal motif for the first two sequences
– l(n – l + 1)^2 operations
• The remaining t – 2 sequences
– (t – 2) l (n – l + 1) operations
• Running time
– O(l n^2 + t l n), or O(l n^2) if n >> t
• Vastly better than
– BruteForceMotifSearch: O(l n^t)
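To get a feeling for the difference, the two operation counts can be evaluated for the example parameters l = 8, t = 5, n = 69 used in this lecture. The function names are hypothetical; the formulas are the ones from the running-time analysis.

```python
def brute_force_ops(l, t, n):
    """l operations for each of the (n - l + 1)**t sets of starting positions."""
    return l * (n - l + 1) ** t

def greedy_ops(l, t, n):
    """Exhaustive search on two sequences, then one scan per remaining sequence."""
    return l * (n - l + 1) ** 2 + (t - 2) * l * (n - l + 1)

l, t, n = 8, 5, 69
print(greedy_ops(l, t, n))       # 32240
print(brute_force_ops(l, t, n))  # 7329062656
```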
36
Dynamic programming:
Sequence alignment
Lecture 6
37
Randomized algorithms:
Finding regulatory motifs in
DNA sequences
38
Randomized algorithms
• Randomized algorithms make random rather than
deterministic decisions
• The main advantage is that no input can reliably
produce worst-case results because the algorithm runs
differently each time
• These algorithms are commonly used in situations
where no correct polynomial algorithm is known
39
Two types of randomized algorithms
40
Profiles
• Let s=(s1,...,st) be the set of starting positions for l-mers
in our t sequences
• The substrings corresponding to these starting
positions will form:
- t x l alignment and
- 4 x l profile P
41
Scoring strings with a profile
Given a profile: P =
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
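The probability of a string close to the consensus:
Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646
(The consensus itself is aaccct, since C has the highest frequency, 1/2, at position 3:
Prob(aaccct|P) = 1/2 x 7/8 x 1/2 x 5/8 x 3/8 x 7/8 = .044861)
Probability of a different string:
Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = .000229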
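The products can be checked with exact arithmetic. The sketch below stores the profile P as a dictionary from nucleotide to its row of column probabilities (a representation chosen for this example) and multiplies the entries selected by each letter of the string.

```python
from fractions import Fraction as F

# Profile P from the slide: one row of column probabilities per nucleotide.
P = {
    "a": [F(1, 2), F(7, 8), F(3, 8), F(0),    F(1, 8), F(0)],
    "c": [F(1, 8), F(0),    F(1, 2), F(5, 8), F(3, 8), F(0)],
    "t": [F(1, 8), F(1, 8), F(0),    F(0),    F(1, 4), F(7, 8)],
    "g": [F(1, 4), F(0),    F(1, 8), F(3, 8), F(1, 4), F(1, 8)],
}

def prob(lmer, profile):
    """Probability that the profile generates lmer (product of column entries)."""
    result = F(1)
    for column, nucleotide in enumerate(lmer.lower()):
        result *= profile[nucleotide][column]
    return result

for lmer in ("aaacct", "atacag"):
    p = prob(lmer, P)
    print(lmer, p, f"{float(p):.6f}")
# aaacct 2205/65536 0.033646
# atacag 15/65536 0.000229
```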
42
P‐most probable l‐mer
Define the P-most probable l-mer from a sequence as an
l-mer in that sequence which has the highest probability
of being created from the profile P
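A sketch of this definition in Python: slide a window of length l over the sequence, score each l-mer with the profile, and keep the best. It uses the profile P from the "Scoring strings with a profile" slide; the example sequence is hypothetical and the returned start position is 0-based.

```python
def prob(lmer, profile):
    """Probability that the profile generates lmer."""
    p = 1.0
    for column, nucleotide in enumerate(lmer.lower()):
        p *= profile[nucleotide][column]
    return p

def most_probable_lmer(sequence, profile):
    """Return (0-based start, l-mer) with the highest probability under profile."""
    l = len(profile["a"])
    starts = range(len(sequence) - l + 1)
    best = max(starts, key=lambda s: prob(sequence[s:s + l], profile))
    return best, sequence[best:best + l]

P = {
    "a": [1/2, 7/8, 3/8, 0,   1/8, 0],
    "c": [1/8, 0,   1/2, 5/8, 3/8, 0],
    "t": [1/8, 1/8, 0,   0,   1/4, 7/8],
    "g": [1/4, 0,   1/8, 3/8, 1/4, 1/8],
}
sequence = "ctataaaccttacatc"  # hypothetical example sequence
print(most_probable_lmer(sequence, P))  # (4, 'aaacct')
```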
44
Gibbs sampling
1) Randomly choose starting positions
s = (s1,...,st) and form the set of l-mers associated
with these starting positions
2) Randomly choose one of the t sequences
3) Create a profile P from the other t -1 sequences
4) For each position in the removed sequence, calculate
the probability that the l-mer starting at that position
was generated by P
5) Choose a new starting position for the removed
sequence at random based on the probabilities
calculated in step 4
6) Repeat steps 2-5 until there is no improvement
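The six steps can be sketched as follows. Two details are assumptions of this sketch rather than part of the lecture: it runs for a fixed number of iterations and remembers the best-scoring positions seen, instead of stopping when there is no improvement, and it falls back to a uniform choice when every l-mer in the removed sequence has probability 0 under the profile. Positions are 0-based and the input is assumed to contain only A, C, G and T.

```python
import random
from collections import Counter

NUCLEOTIDES = "ACGT"

def make_profile(lmers):
    """Column-wise nucleotide frequencies of the given l-mers."""
    profile = []
    for col in zip(*lmers):
        counts = Counter(col)
        profile.append({x: counts[x] / len(lmers) for x in NUCLEOTIDES})
    return profile

def prob(lmer, profile):
    p = 1.0
    for column, nucleotide in zip(profile, lmer):
        p *= column[nucleotide]
    return p

def score(lmers):
    """Consensus score of a set of l-mers."""
    return sum(max(Counter(col).values()) for col in zip(*lmers))

def gibbs_sampler(dna, l, iterations=1000, seed=None):
    rng = random.Random(seed)
    dna = [seq.upper() for seq in dna]
    n_starts = [len(seq) - l + 1 for seq in dna]
    s = [rng.randrange(n) for n in n_starts]            # step 1

    def current_lmers():
        return [seq[p:p + l] for seq, p in zip(dna, s)]

    best_s, best_score = list(s), score(current_lmers())
    for _ in range(iterations):                          # step 6 (fixed count)
        i = rng.randrange(len(dna))                      # step 2
        others = [m for k, m in enumerate(current_lmers()) if k != i]
        profile = make_profile(others)                   # step 3
        weights = [prob(dna[i][p:p + l], profile)        # step 4
                   for p in range(n_starts[i])]
        if sum(weights) == 0:
            weights = [1.0] * n_starts[i]                # assumed uniform fallback
        s[i] = rng.choices(range(n_starts[i]), weights=weights)[0]  # step 5
        current = score(current_lmers())
        if current > best_score:
            best_s, best_score = list(s), current
    return best_s, best_score

dna = ["GTAAACAATATTTATAGC", "AAAATTTACCTTAGAAGG", "CCGTACTGTCAAGCGTGG",
       "TGAGTAAACGACGTCCCA", "TACTTAACACCCTGTCAA"]
print(gibbs_sampler(dna, l=8, iterations=200, seed=0))
```

Because the sampler is randomized, different seeds can return different motifs.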
45
Gibbs sampling: an example
Input:
t = 5 sequences, motif length l = 8
1. GTAAACAATATTTATAGC
2. AAAATTTACCTTAGAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA
46
Gibbs sampling: an example
1) Randomly choose starting positions,
s=(s1,s2,s3,s4,s5) in the 5 sequences:
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
47
Gibbs sampling: an example
2) Randomly choose one of the 5 sequences, here sequence 2, and remove it:
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG (removed)
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
48
Gibbs sampling: an example
3) Create profile P from l-mers in the remaining 4 sequences:
1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 2/4 0
Consensus string: T A A A T C G A
49
Gibbs sampling: an example
4) Calculate prob(a|P) for every possible 8-mer a in the
removed sequence 2 (AAAATTTACCTTAGAAGG):
Start  8-mer     prob(a|P)
1      AAAATTTA  .000732
2      AAATTTAC  .000122
3      AATTTACC  0
4      ATTTACCT  0
5      TTTACCTT  0
6      TTACCTTA  0
7      TACCTTAG  0
8      ACCTTAGA  .000122
9      CCTTAGAA  0
10     CTTAGAAG  0
11     TTAGAAGG  0
50
Gibbs sampling: an example
5) Create a distribution of probabilities of l-mers
prob(a|P), and randomly select a new starting
position based on this distribution
To create a proper distribution, divide each
probability prob(a|P) by the sum of probabilities
over all positions:
Probability (Selecting Starting Position 1) = 0.750
Probability (Selecting Starting Position 2) = 0.125
Probability (Selecting Starting Position 8) = 0.125
All other starting positions have probability 0
51
Gibbs sampling: an example
Assume we select the substring with the highest
probability. Only s2 changes, and we are left with the
following substrings and starting positions:
s1=7 GTAAACAATATTTATAGC
s2=1 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
52
Gibbs sampling: an example
6) We iterate the procedure again with the above starting
positions until we cannot improve the score any more
53
Gibbs sampler in practice
• Gibbs sampling needs to be modified when applied to
samples with unequal distributions of nucleotides
(relative entropy approach)
• Gibbs sampling often converges to locally optimal
motifs rather than globally optimal motifs
• Needs to be run with many randomly chosen seeds to
achieve good results
54