
Tutorial Note 7 Midterm Exam Review (Again!)

The document provides tutorial notes for a midterm exam review in an algorithms for bioinformatics course. It includes the agenda for the review, sample solutions to an assignment on suffix trees and suffix arrays, and an explanation of building a Burrows-Wheeler Transform and using it for pattern matching. Key concepts covered are building a suffix trie and compressing it to a suffix tree, obtaining the suffix array through different methods, constructing the BWT directly and from the suffix array, and using the BWT for single pattern matching with the FM index.


Tutorial Note 7

Midterm Exam Review (Again!)


The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics

TA: Zhenghao Zhang


30/10/2018
Agenda
• Suggested Solutions for Assignment 2
• Key Points Wrap-up
• Q&A

CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Fall 2018 2


Assignment 2. Q1. a)
• Given DNA sequence s=GTAACTGTAGTG$, build Suffix Trie.
[Figure: suffix trie of s; the root has five children, $, A, C, G and T, ordered left to right.]

• Suffix Trie <--> Trie of Suffixes


• Insert every suffix (automatically, or carefully by hand).
• Do not forget ‘$’.
• Order children left to right from lexicographically smaller to larger, with ‘$’ < ‘A’-‘Z’.
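
The insertion procedure above can be sketched in a few lines of Python. This is only an illustration, assuming the trie is stored as nested dictionaries; the names `build_suffix_trie` and `count_leaves` are made up for this sketch and do not come from the assignment.

```python
def build_suffix_trie(s):
    """Insert every suffix of s (which should end in '$') into a trie of nested dicts."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            # Reuse the child for ch if it already exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A leaf is a node without children; each suffix ends in its own leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_suffix_trie("GTAACTGTAGTG$")
print(sorted(trie))        # children of the root; '$' sorts before 'A'-'Z' in ASCII
print(count_leaves(trie))  # one leaf per suffix
```

Because ‘$’ occurs only at the end of s, no suffix is a prefix of another, so the number of leaves equals the number of suffixes.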



Assignment 2. Q1. b)
• Compress Suffix Trie to Suffix Tree, with position labelling.
[Figure: suffix tree of s with each edge labelled by its start-end positions in s; the edges out of the root are $ (13-13), A (3-3), C (5-13), G (1-1) and T (2-2).]

• Compress “caterpillars”, i.e. chains of nodes with a single child (both “dangling” ones and those “inside” the tree).
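
A minimal sketch of the compression step, assuming the nested-dictionary trie representation used here for illustration. For readability the edges are labelled with the substring itself rather than with start-end positions as in the figure; the function names are hypothetical.

```python
def build_suffix_trie(s):
    """Trie of all suffixes of s, stored as nested dicts keyed by single characters."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one edge labelled by a substring."""
    tree = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:  # only one way to continue: part of a "caterpillar"
            next_ch, child = next(iter(child.items()))
            label += next_ch
        tree[label] = compress(child)
    return tree


tree = compress(build_suffix_trie("GTAACTGTAGTG$"))
print(sorted(tree))       # edge labels leaving the root
print(sorted(tree["A"]))  # edge labels below the "A" edge
```

A position label can be recovered from a string label by recording where in s the edge was first inserted; that bookkeeping is left out of this sketch.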



Assignment 2. Q1. c)
• Perform DFS on Suffix Tree.
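[Figure: suffix tree with nodes labelled A to U. Root A has children B, C, G, H, O; C has children D, E, F; H has children I, J; J has children K, N; K has children L, M; O has children P, S; P has children Q, R; S has children T, U.]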

• ABACDCECFCAGAHIHJKLKMKJNJHAOPQPRPOSTSUSOA
• Upon arriving at a leaf, output the starting position of the corresponding suffix
(you can track the current “depth” as the suffix length).
• SA = [13,3,4,9,5,12,1,7,10,2,8,11,6]
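
The DFS can be sketched as follows. As a simplifying assumption, this version walks the uncompressed suffix trie (nested dictionaries) instead of the suffix tree; the leaves are visited in the same order, and the depth of a leaf equals the length of its suffix. The function names are illustrative only.

```python
def build_suffix_trie(s):
    """Trie of all suffixes of s, stored as nested dicts keyed by single characters."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def suffix_array_by_dfs(s):
    """Visit children in lexicographic order and emit a 1-based position at each leaf."""
    n = len(s)
    sa = []

    def dfs(node, depth):
        if not node:                  # leaf: the root-to-leaf path spells one suffix
            sa.append(n - depth + 1)  # suffix length = depth, so it starts here
            return
        for ch in sorted(node):       # '$' sorts before 'A'-'Z'
            dfs(node[ch], depth + 1)

    dfs(build_suffix_trie(s), 0)
    return sa


print(suffix_array_by_dfs("GTAACTGTAGTG$"))
```

The recursion depth equals the length of s, which is fine for an example of this size but would need an explicit stack for long sequences.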



Assignment 2. Q1. d)
• Obtain the same Suffix Array in an alternative way.
• Recall the naïve way: sort all suffixes.
Suffix          Position    Sorted Suffix   Position
GTAACTGTAGTG$   1           $               13
TAACTGTAGTG$    2           AACTGTAGTG$     3
AACTGTAGTG$     3           ACTGTAGTG$      4
ACTGTAGTG$      4           AGTG$           9
CTGTAGTG$       5           CTGTAGTG$       5
TGTAGTG$        6           G$              12
GTAGTG$         7           GTAACTGTAGTG$   1
TAGTG$          8           GTAGTG$         7
AGTG$           9           GTG$            10
GTG$            10          TAACTGTAGTG$    2
TG$             11          TAGTG$          8
G$              12          TG$             11
$               13          TGTAGTG$        6
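
The naïve method in the table amounts to a single sort. A short sketch, with a hypothetical function name and 1-based positions to match the table:

```python
def suffix_array_naive(s):
    """Sort the 1-based starting positions by the suffix that begins at each one."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array_naive("GTAACTGTAGTG$"))
```

Python compares strings by code point, and ‘$’ has a smaller code point than any uppercase letter, so the required order ‘$’ < ‘A’-‘Z’ holds without extra work. Each comparison can take time proportional to the suffix length, which is why this is only the naïve approach.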



Assignment 2. Q1. e)
• Build BWT from Suffix Array (SA → BWT).
i    Sorted Suffix in BWT Rotations   SA[i]   BWT[i] = s[?]
1    $GTAACTGTAGTG                    13      12
2    AACTGTAGTG$GT                    3       2
3    ACTGTAGTG$GTA                    4       3
4    AGTG$GTAACTGT                    9       8
5    CTGTAGTG$GTAA                    5       4
6    G$GTAACTGTAGT                    12      11
7    GTAACTGTAGTG$                    1       13
8    GTAGTG$GTAACT                    7       6
9    GTG$GTAACTGTA                    10      9
10   TAACTGTAGTG$G                    2       1
11   TAGTG$GTAACTG                    8       7
12   TG$GTAACTGTAG                    11      10
13   TGTAGTG$GTAAC                    6       5
• BWT[i] = s[SA[i] – 1], with the index wrapping around: for the 1-based positions in this table, SA[i] = 1 gives s[0], which is read as the last character s[13] = ‘$’.
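
The rule BWT[i] = s[SA[i] – 1] translates almost directly into Python. The sketch below assumes the suffix array holds 1-based positions as in the table, while Python strings are 0-based; `bwt_from_sa` is an illustrative name.

```python
def bwt_from_sa(s, sa):
    """Build the BWT from a suffix array of 1-based positions.

    The character before 1-based position p is s[p - 2] in 0-based Python.
    For p == 1 this is s[-1], which Python reads as the last character,
    so the wrap-around needs no special case.
    """
    return "".join(s[p - 2] for p in sa)


sa = [13, 3, 4, 9, 5, 12, 1, 7, 10, 2, 8, 11, 6]
print(bwt_from_sa("GTAACTGTAGTG$", sa))
```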



Assignment 2. Q1. f)
• Build BWT directly.
• The original “Rotation” way (table from Wikipedia):
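
A small sketch of the rotation method, assuming the input already ends with a unique ‘$’. The function name is illustrative.

```python
def bwt_by_rotations(s):
    """Sort all cyclic rotations of s and read off the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt_by_rotations("GTAACTGTAGTG$"))
```

This builds all rotations explicitly, so it uses memory quadratic in the sequence length; it is meant for checking small examples by hand, not for real genomes.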



Assignment 2. Q1. g)
• FM Index (Single pattern matching using BWT).
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 1: Build “O Table” (Cumulative count of each
character in BWT):
i     1  2  3  4  5  6  7  8  9 10 11 12 13
BWT   G  T  A  T  A  T  $  T  A  G  G  G  C
x=A   0  0  1  1  2  2  2  2  3  3  3  3  3
x=C   0  0  0  0  0  0  0  0  0  0  0  0  1
x=G   1  1  1  1  1  1  1  1  1  2  3  4  4
x=T   0  1  1  2  2  3  3  4  4  4  4  4  4
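
The O table can be filled in one pass over the BWT. In this sketch, which uses hypothetical names, each row gets an extra entry at index 0 (always 0) so that the 1-based column numbers of the table can be used directly as list indices.

```python
def build_o_table(bwt, alphabet="ACGT"):
    """table[x][i] = number of occurrences of x in the first i characters of bwt."""
    table = {x: [0] for x in alphabet}
    for ch in bwt:
        for x in alphabet:
            # Carry the previous count forward, adding 1 if this character is x.
            table[x].append(table[x][-1] + (ch == x))
    return table


o_table = build_o_table("GTATAT$TAGGGC")
for x in "ACGT":
    print(x, o_table[x][1:])  # skip the index-0 entry to match the table
```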



Assignment 2. Q1. g)
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 2: Build “F Table” (starting position of each character
in the first column of the BWT rotation matrix):

x      A   C   G   T
F(x)   2   5   6   10

1  $GTAACTGTAGTG
2  AACTGTAGTG$GT
3  ACTGTAGTG$GTA
4  AGTG$GTAACTGT
5  CTGTAGTG$GTAA
6  G$GTAACTGTAGT
7  GTAACTGTAGTG$
8  GTAGTG$GTAACT
9  GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC
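
The F table does not require building the rotation matrix: the first column is just the characters of the BWT in sorted order, so character counts are enough. A minimal sketch with an illustrative function name:

```python
from collections import Counter


def build_f_table(bwt, alphabet="ACGT"):
    """table[x] = 1-based row of the first rotation that starts with x."""
    counts = Counter(bwt)
    table = {}
    row = 1 + counts["$"]  # rows starting with '$' come first
    for x in sorted(alphabet):
        table[x] = row
        row += counts[x]   # the next character starts after all rows for x
    return table


print(build_f_table("GTATAT$TAGGGC"))
```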



Assignment 2. Q1. g)
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 3: Matching Backwardly. (Use “O” and “F” to
implement the following matching process)
1  $GTAACTGTAGTG
2  AACTGTAGTG$GT
3  ACTGTAGTG$GTA
4  AGTG$GTAACTGT
5  CTGTAGTG$GTAA
6  G$GTAACTGTAGT
7  GTAACTGTAGTG$
8  GTAGTG$GTAACT
9  GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC
• Reading q=“GTA” from right to left: the rows starting with “A” are 2-4, the rows starting with “TA” are 10-11, and the rows starting with “GTA” are 7-8.
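
One common way to implement the backward matching with “O” and “F” is the range-update rule sketched below. The update formulas are the standard FM-index ones and are an assumption about what the course intends; the function and variable names are illustrative, and the pattern is assumed to contain only A, C, G and T.

```python
from collections import Counter


def backward_search(bwt, q, alphabet="ACGT"):
    """Return the 1-based inclusive range of rows that start with q, or None."""
    # occ[x][i]: occurrences of x in the first i characters of bwt (the "O" table).
    occ = {x: [0] for x in alphabet}
    for ch in bwt:
        for x in alphabet:
            occ[x].append(occ[x][-1] + (ch == x))
    # first[x]: first row whose rotation starts with x (the "F" table).
    counts = Counter(bwt)
    first, row = {}, 1 + counts["$"]
    for x in sorted(alphabet):
        first[x] = row
        row += counts[x]

    lo, hi = 1, len(bwt)    # start with every row
    for x in reversed(q):   # consume the pattern from right to left
        lo = first[x] + occ[x][lo - 1]
        hi = first[x] + occ[x][hi] - 1
        if lo > hi:         # empty range: q does not occur
            return None
    return lo, hi


print(backward_search("GTATAT$TAGGGC", "GTA"))
```

For q=“GTA” the range goes from all rows to rows 2-4 after “A”, to rows 10-11 after “TA”, and to rows 7-8 after “GTA”.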



Assignment 2. Q1. g)
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 4: Get the positions where “GTA” appears by using the
Suffix Array.
i    Sorted Suffix in BWT Rotations   SA[i]   BWT[i] = s[?]
1    $GTAACTGTAGTG                    13      12
2    AACTGTAGTG$GT                    3       2
3    ACTGTAGTG$GTA                    4       3
4    AGTG$GTAACTGT                    9       8
5    CTGTAGTG$GTAA                    5       4
6    G$GTAACTGTAGT                    12      11
7    GTAACTGTAGTG$                    1       13
8    GTAGTG$GTAACT                    7       6
9    GTG$GTAACTGTA                    10      9
10   TAACTGTAGTG$G                    2       1
11   TAGTG$GTAACTG                    8       7
12   TG$GTAACTGTAG                    11      10
13   TGTAGTG$GTAAC                    6       5
• Rows 7 and 8 start with q=“GTA”, and SA[7] = 1, SA[8] = 7, so “GTA” occurs at positions 1 and 7 of s.
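
Once the matching rows are known, looking up the positions is a slice of the suffix array. A short sketch using the 1-based conventions of the table; `locate` is a hypothetical helper name.

```python
def locate(sa, row_range):
    """Map a 1-based inclusive row range to sorted 1-based text positions."""
    lo, hi = row_range
    return sorted(sa[i - 1] for i in range(lo, hi + 1))


sa = [13, 3, 4, 9, 5, 12, 1, 7, 10, 2, 8, 11, 6]
print(locate(sa, (7, 8)))  # [1, 7]
```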



Assignment 2. Q1. h)
• Suppose the given DNA sequence is randomly generated. If we
append ‘$’ and perform BWT on it, will ‘$’ have equal
probability of appearing in each of the n+1 positions of the BWT
output?
• Consider the appearance of the BWT rotation matrix:

1  $GTAACTGTAGTG
2  AACTGTAGTG$GT
3  ACTGTAGTG$GTA
4  AGTG$GTAACTGT
5  CTGTAGTG$GTAA
6  G$GTAACTGTAGT
7  GTAACTGTAGTG$
8  GTAGTG$GTAACT
9  GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC

• Observation: ‘$’ must lie in the top-left corner of the BWT rotation matrix.
• Row 1 therefore ends with the last character of the DNA sequence, so ‘$’ can never be BWT[1] (unless the DNA sequence is empty), and the n+1 positions cannot be equally likely.
Assignment 2. Q1. h)
• Suppose the given DNA sequence is randomly generated. If we
append ‘$’ and perform BWT on it, will ‘$’ have equal
probability of appearing in each of the n+1 positions of the BWT
output?
• Or, consider how you calculate probabilities:

• Each of the 4^n equally likely sequences of length n puts ‘$’ at exactly one position. If 4^n cannot be divided by n+1, then it is impossible to have equal probabilities.
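
The question can also be checked by brute force for small n. The sketch below assumes that "randomly generated" means every one of the 4^n sequences over A, C, G, T is equally likely; the function name is illustrative.

```python
from itertools import product


def dollar_position_counts(n):
    """Count, over all DNA sequences of length n, where '$' lands in the BWT.

    counts[k] is the number of sequences whose BWT has '$' at 1-based position k + 1.
    """
    counts = [0] * (n + 1)
    for letters in product("ACGT", repeat=n):
        s = "".join(letters) + "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        bwt = "".join(row[-1] for row in rotations)
        counts[bwt.index("$")] += 1
    return counts


for n in (1, 2, 3):
    print(n, dollar_position_counts(n))
```

The first entry is 0 for every n ≥ 1, which matches the observation that ‘$’ can never be BWT[1].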



Assignment 2. Q2
• Implement an Eulerian path finder.
• Key points:
– How to determine Eulerian path existence.
– How to determine starting point(s).
– How to build, store and use Graph structure.
– How to get Eulerian path (one path and all paths).
• Minor Issues:
– Python and Java users should handle end of input (EOFError in Python, EOFException in Java).
– Allocate sufficient array space (estimate the upper bound of
the number of nodes and edges), or use dynamic arrays (at some cost
in performance) to avoid a Runtime Error.
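
The notes do not say which algorithm the assignment expects. One common choice for finding a single Eulerian path is Hierholzer's algorithm, sketched below for a directed graph given as a list of (u, v) edges. The sketch assumes an Eulerian path exists and does not enumerate all paths; the function name is illustrative.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return one Eulerian path as a list of vertices, assuming one exists."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge if there is one,
    # otherwise (circuit case) at any vertex that has an edge.
    start = next((v for v in balance if balance[v] == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # no unused edges left at u
    return path[::-1]


edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
print(eulerian_path(edges))
```

Adjacency lists built with `defaultdict(list)` grow as needed, which sidesteps the array-size issue at the cost mentioned above.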



Recap: Eulerian Path
• A path visiting each edge exactly once
• How to determine the existence of an Eulerian path:
– Undirected Graph:
• All vertices are connected
• Scheme 1: Every vertex has even degree
→ An Eulerian circuit exists; every vertex is a possible starting point
• Scheme 2: Exactly two vertices have odd degree,
and every remaining vertex has even degree
→ An Eulerian path exists; the two odd-degree vertices are the possible starting points
– Directed Graph:
• Ignoring edge direction, all vertices are connected
• Scheme 1: For every vertex, in-degree == out-degree
→ An Eulerian circuit exists; every vertex is a possible starting point
• Scheme 2: Exactly one vertex s has in-degree+1 == out-degree,
exactly one vertex t has out-degree+1 == in-degree,
and for every remaining vertex, in-degree == out-degree
→ An Eulerian path exists; it starts at s and ends at t
• Note:
– A circuit is a special case of a path
– Zero-degree vertices have no impact on Eulerian path existence
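
The directed-graph conditions above can be checked with a degree count plus one connectivity traversal. This is a sketch under the assumption that the graph is given as a list of (u, v) edges, so zero-degree vertices never appear; the function name and the returned strings are illustrative. The undirected case is analogous, with degree parity in place of the in/out comparison.

```python
from collections import defaultdict


def eulerian_kind(edges):
    """Classify a directed graph as 'circuit', 'path', or None (neither)."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    neighbours = defaultdict(set)  # adjacency with edge direction ignored
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        neighbours[u].add(v)
        neighbours[v].add(u)
    vertices = set(neighbours)
    if not vertices:
        return None                # no edges at all: treated as "neither" here
    # Connectivity check, ignoring edge direction.
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(neighbours[u] - seen)
    if seen != vertices:
        return None
    diffs = sorted(out_deg[v] - in_deg[v] for v in vertices)
    if all(d == 0 for d in diffs):
        return "circuit"           # Scheme 1
    if diffs[0] == -1 and diffs[-1] == 1 and all(d == 0 for d in diffs[1:-1]):
        return "path"              # Scheme 2: one start s and one end t
    return None


print(eulerian_kind([("A", "B"), ("B", "C"), ("C", "A")]))
print(eulerian_kind([("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]))
```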
CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Prof. Kevin YIP, Mr. Chenyang Hong, Mr. Zhenghao Zhang| Fall 2018 16
